Project: Creating Customer Segments - Part 3

Outlier Detection:

Detecting outliers is an extremely important part of the data preprocessing step of any analysis, since the presence of outliers can skew any results that take these data points into account. There are many “rules of thumb” for what constitutes an outlier in a dataset. Here, we will use Tukey’s Method for identifying outliers: an outlier step is calculated as 1.5 times the interquartile range (IQR), and a data point with a feature value that is more than an outlier step outside of the IQR for that feature is considered abnormal.
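As a quick illustration of the rule before applying it to the project data (a minimal sketch using made-up numbers, not the customer dataset):

import numpy as np

values = np.array([1.8, 2.3, 2.9, 3.4, 4.1, 9.5])   # toy data for illustration only
q1, q3 = np.percentile(values, [25, 75])
step = 1.5 * (q3 - q1)

# Any value below Q1 - step or above Q3 + step is flagged
flagged = values[(values < q1 - step) | (values > q3 + step)]
print(flagged)   # only 9.5 lies beyond Q3 + step, so it is flagged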

import itertools
import numpy as np

# Collect the indices of data points flagged as outliers, per feature
outliers_lst = []

# For each feature, find the data points with extreme high or low values
for feature in log_data.columns:
    # Calculate Q1 (25th percentile of the data) for the given feature
    Q1 = np.percentile(log_data.loc[:, feature], 25)

    # Calculate Q3 (75th percentile of the data) for the given feature
    Q3 = np.percentile(log_data.loc[:, feature], 75)

    # Use the interquartile range to calculate an outlier step (1.5 times the IQR)
    step = 1.5 * (Q3 - Q1)

    # Display the outliers
    print("Data points considered outliers for the feature '{}':".format(feature))

    # The tilde sign ~ means "not"
    # So here, we're finding any points outside of [Q1 - step, Q3 + step]
    outliers_rows = log_data.loc[~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step)), :]
    # display(outliers_rows)

    outliers_lst.append(list(outliers_rows.index))

outliers = list(itertools.chain.from_iterable(outliers_lst))

# List of unique outliers
# We use set()
# Sets are collections with no duplicate entries
uniq_outliers = list(set(outliers))

# List of points flagged as outliers for more than one feature
dup_outliers = list(set([x for x in outliers if outliers.count(x) > 1]))

print("Outliers list:\n", uniq_outliers)
print("Length of outliers list:\n", len(uniq_outliers))

print("Duplicate list:\n", dup_outliers)
print("Length of duplicates list:\n", len(dup_outliers))

# Remove the duplicate outliers
# Only 5 data points are dropped
good_data = log_data.drop(log_data.index[dup_outliers]).reset_index(drop=True)

# Original Data
print("Original shape of data:\n", data.shape)
# Processed Data
print("New shape of data:\n", good_data.shape)

Data points considered outliers for the feature 'Fresh':
Data points considered outliers for the feature 'Milk':
Data points considered outliers for the feature 'Grocery':
Data points considered outliers for the feature 'Frozen':
Data points considered outliers for the feature 'Detergents_Paper':
Data points considered outliers for the feature 'Delicatessen':
Outliers list:
[128, 193, 264, 137, 142, 145, 154, 412, 285, 161, 420, 38, 171, 429, 175, 304, 305, 439, 184, 57, 187, 65, 66, 203, 325, 289, 75, 81, 338, 86, 343, 218, 95, 96, 353, 98, 355, 356, 357, 233, 109, 183]
Length of outliers list:
42
Duplicate list:
[128, 65, 66, 75, 154]
Length of duplicates list:
5
Original shape of data:
(440, 6)
New shape of data:
(435, 6)

There are five data points that are flagged as outliers for more than one feature. These are the points removed from the dataset, since being extreme in several features at once makes them the most likely to skew the results.
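An equivalent way to find the points flagged for more than one feature (a minimal sketch; outliers_lst is the per-feature list built in the loop above) is to count how many feature lists each index appears in:

import itertools
from collections import Counter

counts = Counter(itertools.chain.from_iterable(outliers_lst))
dup_outliers = sorted(idx for idx, n in counts.items() if n > 1)
print(dup_outliers)   # expected: [65, 66, 75, 128, 154]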

PCA

Now that the data has been scaled to a more normal distribution and the necessary outliers have been removed, we can apply PCA to good_data to discover which dimensions of the data best maximize the variance of the features involved. In addition to finding these dimensions, PCA will also report the explained variance ratio of each dimension — how much of the variance within the data is explained by that dimension alone. Note that a component (dimension) from PCA can be considered a new “feature” of the space; however, it is a composition of the original features present in the data.

# Apply PCA by fitting the good data with the same number of dimensions as features
from sklearn.decomposition import PCA

pca = PCA(n_components=6)
pca.fit(good_data)

# Transform log_samples using the PCA fit above
pca_samples = pca.transform(log_samples)

# Generate PCA results plot
pca_results = vs.pca_results(good_data, pca)

[PCA results plot: explained variance and feature weights for each of the six dimensions]

display(pca_results)

              Explained Variance   Fresh     Milk      Grocery   Frozen    Detergents_Paper   Delicatessen
Dimension 1   0.4430              -0.1675    0.4014    0.4381   -0.1782    0.7514             0.1499
Dimension 2   0.2638               0.6859    0.1672    0.0707    0.5005    0.0424             0.4941
Dimension 3   0.1231              -0.6774    0.0402   -0.0195    0.3150   -0.2117             0.6286
Dimension 4   0.1012              -0.2043    0.0128    0.0557    0.7854    0.2096            -0.5423
Dimension 5   0.0485               0.0026   -0.7192   -0.3554    0.0331    0.5582             0.2092
Dimension 6   0.0204              -0.0292    0.5402   -0.8205   -0.0205    0.1824            -0.0197

In total, the first and second principal components explain 70.7% ((0.4430 + 0.2638) × 100) of the variation in the data. The first four principal components explain 93.1% ((0.4430 + 0.2638 + 0.1231 + 0.1012) × 100) of the variation in the data.
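These totals can also be read straight off the fitted model (a minimal sketch, assuming pca has been fitted on good_data as above; the printed values should match the figures quoted here):

import numpy as np

cum_var = np.cumsum(pca.explained_variance_ratio_)
print("First two dimensions:  {:.1%}".format(cum_var[1]))
print("First four dimensions: {:.1%}".format(cum_var[3]))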

The first dimension is the first principal component (PC1), since it explains the most variation. “Milk”, “Grocery” and “Detergents_Paper” best represent PC1.

The second dimension is the second principal component (PC2). “Fresh”, “Frozen” and “Delicatessen” best represent PC2.

The third dimension is the third principal component (PC3). “Fresh” and “Delicatessen” best represent PC3.

The fourth dimension is the fourth principal component (PC4). “Frozen” and “Delicatessen” best represent PC4.
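This reading can be automated by looking at the largest absolute weights in each row of pca.components_ (a minimal sketch, assuming pca and good_data from above; the choice of three features per dimension is arbitrary):

import pandas as pd

weights = pd.DataFrame(pca.components_, columns=good_data.columns)
for i, row in weights.iterrows():
    # Rank features by the magnitude of their weight in this component
    top = row.abs().sort_values(ascending=False).head(3).index.tolist()
    print("Dimension {}: {}".format(i + 1, ", ".join(top)))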

# Display sample log-data after having a PCA transformation applied
display(pd.DataFrame(np.round(pca_samples, 4), columns = pca_results.index.values))

Dimension 1 Dimension 2 Dimension 3 Dimension 4 Dimension 5 Dimension 6
0 3.1072 -2.7017 -0.6386 1.8708 -0.6452 -0.1333
1 2.2406 1.2419 -1.0729 -1.9589 -0.2160 0.1782
2 -2.3404 1.6911 0.7155 0.5932 -0.4606 -0.4074

 
