**Outlier Detection:**

Detecting outliers is an important part of the data preprocessing step of any analysis, because their presence can skew any results computed from the data. There are many “rules of thumb” for what constitutes an outlier in a dataset. Here, we will use Tukey’s Method for identifying outliers: an *outlier step* is calculated as 1.5 times the interquartile range (IQR). A data point with a feature value more than an outlier step outside of the IQR for that feature is considered abnormal.
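As a quick illustration of Tukey’s rule before the project code below, here is a minimal, self-contained sketch (on a made-up toy array, not the customer data) that computes the quartiles, the outlier step, and the resulting outlier mask with NumPy:

```python
import numpy as np

# Hypothetical toy data with one extreme point (illustration only)
values = np.array([10, 12, 11, 13, 12, 14, 11, 95])

q1, q3 = np.percentile(values, [25, 75])
step = 1.5 * (q3 - q1)                              # the "outlier step"
mask = (values < q1 - step) | (values > q3 + step)  # True where a point is abnormal
print(values[mask])                                 # -> [95]
```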

```python
import itertools

# Select the indices for data points you wish to remove
outliers_lst = []

# For each feature, find the data points with extreme high or low values
for feature in log_data.columns:

    # Calculate Q1 (25th percentile of the data) for the given feature
    Q1 = np.percentile(log_data.loc[:, feature], 25)

    # Calculate Q3 (75th percentile of the data) for the given feature
    Q3 = np.percentile(log_data.loc[:, feature], 75)

    # Use the interquartile range to calculate an outlier step (1.5 times the IQR)
    step = 1.5 * (Q3 - Q1)

    # Display the outliers
    print("Data points considered outliers for the feature '{}':".format(feature))

    # The tilde sign ~ means not:
    # here we select any points outside of [Q1 - step, Q3 + step]
    outliers_rows = log_data.loc[~((log_data[feature] >= Q1 - step) &
                                   (log_data[feature] <= Q3 + step)), :]
    # display(outliers_rows)

    outliers_lst.append(list(outliers_rows.index))

# Flatten the per-feature lists into one list of indices
outliers = list(itertools.chain.from_iterable(outliers_lst))

# List of unique outliers
# We use set(): sets contain no duplicate entries
uniq_outliers = list(set(outliers))

# List of outliers that appear for more than one feature
dup_outliers = list(set([x for x in outliers if outliers.count(x) > 1]))

print("Outliers list:\n{}".format(uniq_outliers))
print("Length of outliers list:\n{}".format(len(uniq_outliers)))
print("Duplicate list:\n{}".format(dup_outliers))
print("Length of duplicates list:\n{}".format(len(dup_outliers)))

# Remove only the duplicate outliers (the 5 points flagged for more than one feature)
good_data = log_data.drop(log_data.index[dup_outliers]).reset_index(drop=True)

# Original data
print("Original shape of data:\n{}".format(data.shape))

# Processed data
print("New shape of data:\n{}".format(good_data.shape))
```

```
Data points considered outliers for the feature 'Fresh':
Data points considered outliers for the feature 'Milk':
Data points considered outliers for the feature 'Grocery':
Data points considered outliers for the feature 'Frozen':
Data points considered outliers for the feature 'Detergents_Paper':
Data points considered outliers for the feature 'Delicatessen':
Outliers list:
[128, 193, 264, 137, 142, 145, 154, 412, 285, 161, 420, 38, 171, 429, 175, 304, 305, 439, 184, 57, 187, 65, 66, 203, 325, 289, 75, 81, 338, 86, 343, 218, 95, 96, 353, 98, 355, 356, 357, 233, 109, 183]
Length of outliers list:
42
Duplicate list:
[128, 65, 66, 75, 154]
Length of duplicates list:
5
Original shape of data:
(440, 6)
New shape of data:
(435, 6)
```

Five data points are flagged as outliers for more than one feature. These are the points we remove, since they are abnormal in more than one category.
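As an optional cross-check (a sketch, assuming the `outliers` list built above), `collections.Counter` recovers the same five duplicated indices without the quadratic `count()` calls:

```python
from collections import Counter

counts = Counter(outliers)                                   # index -> number of features flagging it
dup_check = sorted(idx for idx, n in counts.items() if n > 1)
print(dup_check)                                             # -> [65, 66, 75, 128, 154]
```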

**PCA**

Now that the data has been scaled to a more normal distribution and the necessary outliers have been removed, we can apply PCA to `good_data` to discover which dimensions best maximize the variance of the features involved. In addition to finding these dimensions, PCA also reports the *explained variance ratio* of each dimension, i.e. how much of the variance within the data is explained by that dimension alone. Note that a component (dimension) from PCA can be considered a new “feature” of the space; however, it is a composition of the original features present in the data.
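To make the “composition of the original features” point concrete, here is a small self-contained sketch on synthetic data (not the customer data): each row of `components_` holds the weights of the original features in one dimension, and `explained_variance_ratio_` holds each dimension’s share of the total variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)    # make the third feature nearly redundant

toy_pca = PCA(n_components=3).fit(X)
print(toy_pca.explained_variance_ratio_)          # per-dimension variance shares, summing to 1.0
print(toy_pca.components_[0])                     # weights of the original features in dimension 1
```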

```python
# Apply PCA by fitting the good data with the same number of dimensions as features
from sklearn.decomposition import PCA

pca = PCA(n_components=6)
pca.fit(good_data)

# Transform log_samples using the PCA fit above
pca_samples = pca.transform(log_samples)

# Generate PCA results plot
pca_results = vs.pca_results(good_data, pca)
display(pca_results)
```

| | Explained Variance | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
|---|---|---|---|---|---|---|---|
| Dimension 1 | 0.4430 | -0.1675 | 0.4014 | 0.4381 | -0.1782 | 0.7514 | 0.1499 |
| Dimension 2 | 0.2638 | 0.6859 | 0.1672 | 0.0707 | 0.5005 | 0.0424 | 0.4941 |
| Dimension 3 | 0.1231 | -0.6774 | 0.0402 | -0.0195 | 0.3150 | -0.2117 | 0.6286 |
| Dimension 4 | 0.1012 | -0.2043 | 0.0128 | 0.0557 | 0.7854 | 0.2096 | -0.5423 |
| Dimension 5 | 0.0485 | 0.0026 | -0.7192 | -0.3554 | 0.0331 | 0.5582 | 0.2092 |
| Dimension 6 | 0.0204 | -0.0292 | 0.5402 | -0.8205 | -0.0205 | 0.1824 | -0.0197 |

In total, the first two principal components explain about 70.7% of the variation in the data (0.4430 + 0.2638 = 0.7068), and the first four principal components explain about 93.1% (0.4430 + 0.2638 + 0.1231 + 0.1012 = 0.9311).
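The same cumulative figures can be read straight off the fitted PCA object; a minimal sketch, assuming the `pca` object fit on `good_data` above:

```python
import numpy as np

cum_var = np.cumsum(pca.explained_variance_ratio_)
print(cum_var[1])   # variance explained by the first two dimensions (~0.71)
print(cum_var[3])   # variance explained by the first four dimensions (~0.93)
```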

The first dimension is the first principal component (PC1), since it explains the most variation. “Milk”, “Grocery” and “Detergents_Paper” best represent PC1.

The second dimension is the second principal component (PC2). “Fresh”, “Frozen” and “Delicatessen” best represent PC2.

The third dimension is the third principal component (PC3). “Fresh” and “Delicatessen” best represent PC3.

The fourth dimension is the fourth principal component (PC4). “Frozen” and “Delicatessen” best represent PC4.
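These feature weights can also be inspected without the `vs.pca_results` helper; a sketch, assuming the `pca` and `good_data` objects from above:

```python
import pandas as pd

dims = ['Dimension {}'.format(i + 1) for i in range(pca.n_components_)]
weights = pd.DataFrame(pca.components_, index=dims, columns=good_data.columns)
display(weights.round(4))   # rows match the feature-weight columns of the table above
```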

```python
# Display sample log-data after having a PCA transformation applied
display(pd.DataFrame(np.round(pca_samples, 4), columns=pca_results.index.values))
```

| | Dimension 1 | Dimension 2 | Dimension 3 | Dimension 4 | Dimension 5 | Dimension 6 |
|---|---|---|---|---|---|---|
| 0 | 3.1072 | -2.7017 | -0.6386 | 1.8708 | -0.6452 | -0.1333 |
| 1 | 2.2406 | 1.2419 | -1.0729 | -1.9589 | -0.2160 | 0.1782 |
| 2 | -2.3404 | 1.6911 | 0.7155 | 0.5932 | -0.4606 | -0.4074 |