Welcome to the third project of the Machine Learning Engineer Nanodegree! In this notebook, some template code has already been provided for you, and it will be your job to implement the additional functionality necessary to successfully complete this project. Sections that begin with **'Implementation'** in the header indicate that the following block of code will require additional functionality which you must provide. Instructions will be provided for each section, and the specifics of the implementation are marked in the code block with a `'TODO'` statement. Please be sure to read the instructions carefully!

In addition to implementing code, there will be questions that you must answer which relate to the project and your implementation. Each section where you will answer a question is preceded by a **'Question X'** header. Carefully read each question and provide thorough answers in the following text boxes that begin with **'Answer:'**. Your project submission will be evaluated based on your answers to each of the questions and the implementation you provide.

**Note:** Code and Markdown cells can be executed using the **Shift + Enter** keyboard shortcut. In addition, Markdown cells can be edited by double-clicking the cell to enter edit mode.

In this project, you will analyze a dataset containing data on various customers' annual spending amounts (reported in *monetary units*) across diverse product categories, with the aim of uncovering internal structure. One goal of this project is to best describe the variation in the different types of customers that a wholesale distributor interacts with. Doing so would equip the distributor with insight into how to best structure their delivery service to meet the needs of each customer.

The dataset for this project can be found on the UCI Machine Learning Repository. For the purposes of this project, the features `'Channel'` and `'Region'` will be excluded from the analysis, with focus instead on the six product categories recorded for customers.

Run the code block below to load the wholesale customers dataset, along with a few of the necessary Python libraries required for this project. You will know the dataset loaded successfully if the size of the dataset is reported.

In [267]:

```
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames

# Import supplementary visualizations code visuals.py
import visuals as vs

# Pretty display for notebooks
%matplotlib inline

# Load the wholesale customers dataset
try:
    data = pd.read_csv("customers.csv")
    data.drop(['Region', 'Channel'], axis = 1, inplace = True)
    print "Wholesale customers dataset has {} samples with {} features each.".format(*data.shape)
except:
    print "Dataset could not be loaded. Is the dataset missing?"
```

In this section, you will begin exploring the data through visualizations and code to understand how each feature is related to the others. You will observe a statistical description of the dataset, consider the relevance of each feature, and select a few sample data points from the dataset which you will track through the course of this project.

Run the code block below to observe a statistical description of the dataset. Note that the dataset is composed of six important product categories: **'Fresh'**, **'Milk'**, **'Grocery'**, **'Frozen'**, **'Detergents_Paper'**, and **'Delicatessen'**. Consider what each category represents in terms of products you could purchase.

In [268]:

```
# Display a description of the dataset
display(data.describe())
```

To get a better understanding of the customers and how their data will transform through the analysis, it would be best to select a few sample data points and explore them in more detail. In the code block below, add **three** indices of your choice to the `indices` list which will represent the customers to track. It is suggested to try different sets of samples until you obtain customers that vary significantly from one another.

In [269]:

```
# TODO: Select three indices of your choice you wish to sample from the dataset
indices = [95, 176, 200]
# Create a DataFrame of the chosen samples
samples = pd.DataFrame(data.loc[indices], columns = data.keys()).reset_index(drop = True)
print "Chosen samples of wholesale customers dataset:"
display(samples)
```

Consider the total purchase cost of each product category and the statistical description of the dataset above for your sample customers.

*What kind of establishment (customer) could each of the three samples you've chosen represent?*

**Hint:** Examples of establishments include places like markets, cafes, and retailers, among many others. Avoid using names for establishments, such as saying *"McDonalds"* when describing a sample customer as a restaurant.

**Answer:**

- I think the first customer is likely to be a small corner shop, given that they order very few fresh ingredients (3 vs a mean of 12000.3) while ordering a proportionately higher amount of Milk (2920 vs a mean of 5796.27) and Grocery-type foods (6252 vs a mean of 7951.28) than other types of goods. Corner shops typically have a small frozen section (440 vs a mean of 3071.93) and household items area (223 vs a mean of 2881.49), but can sometimes have a large delicatessen area (709 vs a mean of 1524.87) depending on the demographic makeup of the area the shop is in. Looking at the heatmap of normalized expenditures below, this sample purchases far fewer goods overall than other customers, but seems to purchase proportionately more Milk, Grocery, and Delicatessen goods than other types of goods.
- The second customer is more likely to be a big market-like shop, with a huge fresh produce section (likely organic produce given the cost, with an expenditure of 45640 compared to a mean of 12000.3) and smaller, roughly equal areas for milk (6958 vs a mean of 5796.27), normal groceries (6536 vs a mean of 7951.28), and frozen foods (7368 vs a mean of 3071.93). I would assume a place like this would also have a bigger delicatessen area, though that appears not to be the case (230 vs a mean of 1524.87). Looking at the heatmap of normalized expenditures below, this sample purchases significantly more fresh produce and substantially more frozen goods than other customers, but is closer to the mean for both Milk and Grocery products, and has below-average consumption of Detergents_Paper and Delicatessen products.
- The final customer strikes me as a typical supermarket with a modest fresh produce area (3067 vs a mean of 12000.3), but a shop where you're more likely to go for typical grocery items like bread, cereals, tins of food, etc. (23127 vs a mean of 7951.28). These kinds of stores typically have a substantial dairy selection (13240 vs a mean of 5796.27) and frozen area (3941 vs a mean of 3071.93), and also have aisles dedicated solely to household items like detergent and paper towels (9959 vs a mean of 2881.49). You're less likely to find a delicatessen area in a store like this (731 vs a mean of 1524.87), though there would appear to be a small part of the store that caters to customers interested in those types of goods. Looking at the heatmap of normalized expenditures below, this sample purchases significantly more Milk, Grocery, and Detergents_Paper products, slightly more frozen goods and slightly fewer delicatessen products, and a below-average amount of fresh produce compared with other customers.

In [270]:

```
import seaborn as sns
sns.heatmap((samples-data.mean())/data.std(ddof=0), annot=True, cbar=False, square=True)
```

Out[270]: *(heatmap showing each sample's spending per category, expressed in standard deviations from the feature means)*

One interesting thought to consider is if one (or more) of the six product categories is actually relevant for understanding customer purchasing. That is to say, is it possible to determine whether customers purchasing some amount of one category of products will necessarily purchase some proportional amount of another category of products? We can make this determination quite easily by training a supervised regression learner on a subset of the data with one feature removed, and then score how well that model can predict the removed feature.

In the code block below, you will need to implement the following:

- Assign `new_data` a copy of the data by removing a feature of your choice using the `DataFrame.drop` function.
- Use `sklearn.cross_validation.train_test_split` to split the dataset into training and testing sets.
  - Use the removed feature as your target label. Set a `test_size` of `0.25` and set a `random_state`.
- Import a decision tree regressor, set a `random_state`, and fit the learner to the training data.
- Report the prediction score of the testing set using the regressor's `score` function.


In [272]:

```
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
# TODO: Make a copy of the DataFrame, using the 'drop' function to drop the given feature
new_data = data.copy()
new_data.drop(['Grocery'], axis=1, inplace = True)
# TODO: Split the data into training and testing sets using the given feature as the target
X_train, X_test, y_train, y_test = train_test_split(new_data, data['Grocery'], test_size = 0.25, random_state = 1)
# TODO: Create a decision tree regressor and fit it to the training set
regressor = DecisionTreeRegressor(random_state = 1)
regressor.fit(X_train, y_train)
# TODO: Report the score of the prediction using the testing set
score = regressor.score(X_test, y_test)
print score
```

*Which feature did you attempt to predict? What was the reported prediction score? Is this feature necessary for identifying customers' spending habits?*

**Hint:** The coefficient of determination, `R^2`, is scored between 0 and 1, with 1 being a perfect fit. A negative `R^2` implies the model fails to fit the data.
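
To make the hint concrete, here is a small standalone sketch of how `R^2` behaves at its boundary cases, using scikit-learn's `r2_score` (this is an illustration only, not part of the project code):

```python
from sklearn.metrics import r2_score

y_true = [3.0, 5.0, 7.0, 9.0]

# A perfect fit scores R^2 = 1
perfect = r2_score(y_true, [3.0, 5.0, 7.0, 9.0])
print(perfect)

# Always predicting the mean (6.0 here) scores R^2 = 0
mean_only = r2_score(y_true, [6.0, 6.0, 6.0, 6.0])
print(mean_only)

# Predictions worse than the mean score a negative R^2
worse = r2_score(y_true, [9.0, 3.0, 9.0, 3.0])
print(worse)
```

A score of 0.7958, as reported below, therefore sits well above the "no better than the mean" baseline.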

**Answer:**

I attempted to predict the 'Grocery' feature. The reported prediction score was 0.7958, which means that the model was able to predict its value reasonably well and could mean that it's not a necessary feature for identifying customers' spending habits. Other features could be used to predict customers' purchasing behaviour of groceries with a reasonable degree of accuracy.

In [273]:

```
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def calculate_r_2_for_feature(data, feature):
    new_data = data.drop(feature, axis=1)
    X_train, X_test, y_train, y_test = train_test_split(new_data, data[feature], test_size=0.25)
    regressor = DecisionTreeRegressor()
    regressor.fit(X_train, y_train)
    return regressor.score(X_test, y_test)

def r_2_mean(data, feature, runs=200):
    # Average the R^2 score over many random train/test splits
    return np.array([calculate_r_2_for_feature(data, feature)
                     for _ in range(runs)]).mean().round(4)

print "{0:17} {1}".format("Fresh: ", r_2_mean(data, 'Fresh'))
print "{0:17} {1}".format("Milk: ", r_2_mean(data, 'Milk'))
print "{0:17} {1}".format("Grocery: ", r_2_mean(data, 'Grocery'))
print "{0:17} {1}".format("Frozen: ", r_2_mean(data, 'Frozen'))
print "{0:17} {1}".format("Detergents_Paper: ", r_2_mean(data, 'Detergents_Paper'))
print "{0:17} {1}".format("Delicatessen: ", r_2_mean(data, 'Delicatessen'))
```

To get a better understanding of the dataset, we can construct a scatter matrix of each of the six product features present in the data. If you found that the feature you attempted to predict above is relevant for identifying a specific customer, then the scatter matrix below may not show any correlation between that feature and the others. Conversely, if you believe that feature is not relevant for identifying a specific customer, the scatter matrix might show a correlation between that feature and another feature in the data. Run the code block below to produce a scatter matrix.

In [274]:

```
# Produce a scatter matrix for each pair of features in the data
pd.scatter_matrix(data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');
```

In [275]:

```
corr = data.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask, 1)] = True
with sns.axes_style("white"):
    ax = sns.heatmap(corr, mask=mask, square=True, annot=True,
                     cmap='RdBu', fmt='+.3f')
```

*Are there any pairs of features which exhibit some degree of correlation? Does this confirm or deny your suspicions about the relevance of the feature you attempted to predict? How is the data for those features distributed?*

**Hint:** Is the data normally distributed? Where do most of the data points lie?

**Answer:**

It appears that Grocery and Detergents_Paper have the strongest correlation of any pair. There also looks to be some correlation between Detergents_Paper and Milk, and between Grocery and Milk. This confirms my suspicion above that Grocery is correlated with other features, which allows its value to be predicted with some degree of accuracy. All of the distributions appear to be skewed to the right, with most points lying close to the origin and a long tail of larger values extending to the right. The shapes of the distributions of Detergents_Paper, Grocery, and Milk are all quite similar.

In this section, you will preprocess the data to create a better representation of customers by performing a scaling on the data and detecting (and optionally removing) outliers. Preprocessing data is often a critical step in assuring that the results you obtain from your analysis are significant and meaningful.

If data is not normally distributed, especially if the mean and median vary significantly (indicating a large skew), it is most often appropriate to apply a non-linear scaling — particularly for financial data. One way to achieve this scaling is by using a Box-Cox transformation, which calculates the best power transformation of the data that reduces skewness. A simpler approach which can work in most cases would be applying the natural logarithm.
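
As an aside, `scipy.stats.boxcox` implements this power transformation. A minimal standalone sketch (scipy is an assumption here; it is not otherwise used in this project):

```python
import numpy as np
from scipy import stats

# Box-Cox requires strictly positive data
x = np.array([1.0, 2.0, 5.0, 20.0, 100.0])

# With no lambda given, boxcox also returns the power that best reduces skewness
transformed, best_lambda = stats.boxcox(x)
print(best_lambda)

# With lmbda=0, the Box-Cox transform reduces to the natural logarithm,
# which is the simpler approach this project takes
print(np.allclose(stats.boxcox(x, lmbda=0.0), np.log(x)))
```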

In the code block below, you will need to implement the following:

- Assign a copy of the data to `log_data` after applying logarithmic scaling. Use the `np.log` function for this.
- Assign a copy of the sample data to `log_samples` after applying logarithmic scaling. Again, use `np.log`.

In [276]:

```
# TODO: Scale the data using the natural logarithm
log_data = np.log(data.copy())
# TODO: Scale the sample data using the natural logarithm
log_samples = np.log(samples)
# Produce a scatter matrix for each pair of newly-transformed features
pd.scatter_matrix(log_data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');
```

After applying a natural logarithm scaling to the data, the distribution of each feature should appear much more normal. For any pairs of features you may have identified earlier as being correlated, observe here whether that correlation is still present (and whether it is now stronger or weaker than before).

Run the code below to see how the sample data has changed after having the natural logarithm applied to it.

In [277]:

```
# Display the log-transformed sample data
display(log_samples)
```

Detecting outliers in the data is extremely important in the data preprocessing step of any analysis. The presence of outliers can often skew results which take these data points into consideration. There are many "rules of thumb" for what constitutes an outlier in a dataset. Here, we will use Tukey's Method for identifying outliers: an *outlier step* is calculated as 1.5 times the interquartile range (IQR). A data point with a feature that is beyond an outlier step outside of the IQR for that feature is considered abnormal.
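
As a quick numeric illustration of Tukey's method on a toy array (a standalone sketch, separate from the project code):

```python
import numpy as np

x = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 40], dtype=float)

Q1 = np.percentile(x, 25)   # 25th percentile: 2.25
Q3 = np.percentile(x, 75)   # 75th percentile: 4.0
step = 1.5 * (Q3 - Q1)      # outlier step: 2.625

# Points beyond an outlier step outside the IQR are flagged as abnormal
outliers = x[(x < Q1 - step) | (x > Q3 + step)]
print(outliers)  # [40.]
```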

In the code block below, you will need to implement the following:

- Assign the value of the 25th percentile for the given feature to `Q1`. Use `np.percentile` for this.
- Assign the value of the 75th percentile for the given feature to `Q3`. Again, use `np.percentile`.
- Assign the calculation of an outlier step for the given feature to `step`.
- Optionally remove data points from the dataset by adding indices to the `outliers` list.

**NOTE:** If you choose to remove any outliers, ensure that the sample data does not contain any of these points!

Once you have performed this implementation, the dataset will be stored in the variable `good_data`.

In [278]:

```
# For each feature find the data points with extreme high or low values
for feature in log_data.keys():
    # TODO: Calculate Q1 (25th percentile of the data) for the given feature
    Q1 = np.percentile(log_data[feature], 25)
    # TODO: Calculate Q3 (75th percentile of the data) for the given feature
    Q3 = np.percentile(log_data[feature], 75)
    # TODO: Use the interquartile range to calculate an outlier step (1.5 times the interquartile range)
    step = (Q3 - Q1) * 1.5
    # Display the outliers
    print "Data points considered outliers for the feature '{}':".format(feature)
    display(log_data[~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))])

# OPTIONAL: Select the indices for data points you wish to remove
# (the indices flagged as outliers for more than one feature)
outliers = [65, 66, 75, 128, 154]

# Remove the outliers, if any were specified
good_data = log_data.drop(log_data.index[outliers]).reset_index(drop = True)
```
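
To answer the question below, one way to find the indices flagged for more than one feature is to tally all flagged indices with `collections.Counter`. A sketch using a hypothetical stand-in list (in the actual notebook these indices would be collected inside the loop above):

```python
from collections import Counter

# Hypothetical stand-in for the indices printed by the loop above;
# each index appears once per feature that flagged it
flagged = [65, 65, 66, 66, 75, 75, 95, 128, 128, 154, 154, 154]

counts = Counter(flagged)
multi_feature = sorted(i for i, c in counts.items() if c > 1)
print(multi_feature)  # [65, 66, 75, 128, 154]
```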

*Are there any data points considered outliers for more than one feature based on the definition above? Should these data points be removed from the dataset? If any data points were added to the outliers list to be removed, explain why.*

**Answer:**

Yes: index 75 is considered an outlier for both the Grocery and Detergents_Paper features; 154 for the Milk, Grocery, and Delicatessen features; 65 for Fresh and Frozen; 66 for Delicatessen and Fresh; and 128 for Delicatessen and Fresh. I think these data points should be removed from the dataset because they are outliers for more than one feature, and may therefore reduce the predictive capability of our model if it is trained on these noisy data points. Interestingly, one of the data points I selected for the sample, 95, turned out to be an outlier due to the amount of Fresh produce the customer purchased.

In this section you will use principal component analysis (PCA) to draw conclusions about the underlying structure of the wholesale customer data. Since using PCA on a dataset calculates the dimensions which best maximize variance, we will find which compound combinations of features best describe customers.

Now that the data has been scaled to a more normal distribution and has had any necessary outliers removed, we can apply PCA to `good_data` to discover which dimensions about the data best maximize the variance of the features involved. In addition to finding these dimensions, PCA will also report the *explained variance ratio* of each dimension, that is, how much variance within the data is explained by that dimension alone. Note that a component (dimension) from PCA can be considered a new "feature" of the space; however, it is a composition of the original features present in the data.

In the code block below, you will need to implement the following:

- Import `sklearn.decomposition.PCA` and assign the results of fitting PCA in six dimensions with `good_data` to `pca`.
- Apply a PCA transformation of `log_samples` using `pca.transform`, and assign the results to `pca_samples`.

In [279]:

```
from sklearn.decomposition import PCA
# TODO: Apply PCA by fitting the good data with the same number of dimensions as features
pca = PCA(n_components=6)
pca.fit(good_data)
# TODO: Transform log_samples using the PCA fit above
pca_samples = pca.transform(log_samples)
# Generate PCA results plot
pca_results = vs.pca_results(good_data, pca)
```
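
After fitting, the explained variance ratios can be accumulated to judge how many dimensions are worth keeping. A standalone sketch on synthetic data (since `good_data` only exists inside the notebook, a random stand-in matrix is used here):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Synthetic stand-in for good_data: 100 samples, 6 features,
# where the first four features are strongly correlated in pairs
base = rng.normal(size=(100, 2))
X = np.hstack([base,
               base + 0.1 * rng.normal(size=(100, 2)),
               rng.normal(size=(100, 2))])

pca = PCA(n_components=6).fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative.round(3))  # non-decreasing, reaching 1.0 at the last component
```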