Business Problem

Business Problem

A mall store utilizes membership cards to gather basic information about engaged customers. The store has data on customers that includes identification numbers, age, gender, annual income and a Spending Score. The Spending Score is a value assigned to the customer based on pre-defined parameters like customer behavior and purchasing data.

The store wants to better understand its membership data to guide a new marketing effort and increase sales.

Tab Legend

  • Business Problem
    • Business Problem
    • Tab Legend
    • Citations
  • Recommendation
    • Recommendation
    • Segmentation Analysis and Highlights
    • Cluster Descriptions
    • Methodology and Discussion
  • Data Audit
    • Variable summary statistics
    • Variable density plots
  • Correlations
    • Correlation table (Pearson)
    • Correlation matrix
    • Scatterplots for each variable against Spending Score
  • k-means clustering
    • Determine optimal k
    • Cluster plots for k=2 through k=7
    • Clusters silhouette plot
    • Variable comparison by cluster
  • Cluster 1 Demographics
    • Summary statistics
    • Density plots
  • Cluster 2 Demographics
    • Summary statistics
    • Density plots
  • Cluster 3 Demographics
    • Summary statistics
    • Density plots
  • Cluster 4 Demographics
    • Summary statistics
    • Density plots

Citations

Business Problem adapted from: https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python

Data source: https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python/download

Recommendation

Recommendation

A k-means clustering analysis generated four distinct membership clusters, and two of those clusters should be prioritized for membership marketing. The first group is defined as wealthy professionals between the ages of 30-36 who make between $74K-$94K annually. The second group is defined as young adults between the ages of 21-31 who make between $24K-$57K annually. These two groups show higher than average Spending Scores.

The other two cluster groups reviewed in this study contained older store members with age ranges between 34-46 and 48-63. These older members, regardless of income level, have average to below average Spending Scores. Marketing resources should not be allocated to these two groups at this time. The mild negative correlation between Spending Score and age further supports prioritizing the two younger membership segments.

Segmentation Analysis and Highlights

The store's membership is best segmented into four groups: The Thrifty Elders, Middle Aged and Cheap, Wealthy Professional Spenders, and Young Adult Spenders. Together, these groups explain 65.8% of the variance in the membership data (between-cluster sum of squares divided by total sum of squares).

  • Membership Demographics at a Glance
    • There are 200 members in the database.
    • The average Spending Score is 50 with a range between 1-99. The interquartile range is 34-73.
    • Membership average income is $60,560 with a range between $15K-$137K. The interquartile range is $41K-$78K.
    • Membership is 56% female.
    • Membership age ranges from 18-70 with an average age of 38. The median age is 36. The interquartile age range is 28-49.
    • There is a slight to moderate negative correlation between age and Spending Score (r = -0.33): Spending Score tends to decrease as age increases.
  • Cluster 1: The Thrifty Elders
    • 65 members.
    • Higher than average age, oldest of the clusters (48-63 years old).
    • Average spending score (32-51).
    • Lower Income ($39K-$60K).
    • Ranked No. 1 of the four clusters for cluster definition (highest average silhouette width, 0.10).
  • Cluster 2: Middle Aged and Cheap
    • 38 members.
    • Higher than average age, second oldest of clusters (34-46 years old).
    • Low spending score, lowest of cluster group (10-27).
    • High income ($72K-$96K), highest of the cluster group.
    • Ranked No. 4 of the four clusters for cluster definition (lowest average silhouette width, -0.06).
  • Cluster 3: Wealthy Professional Spenders
    • 40 members.
    • Lower than average age, second youngest of clusters (30-36 years old).
    • Highest spending score of the cluster group (74-90).
    • Higher than average income ($74K-$94K), second highest of the cluster group.
    • Ranked No. 2 of the four clusters for cluster definition (average silhouette width 0.00).
  • Cluster 4: Young Adult Spenders
    • 57 members.
    • Lower than average age, youngest of clusters (21-31 years old).
    • Higher than average spending score, second highest of clusters (48-73).
    • Lower than average income, lowest of clusters ($24K-$57K).
    • Ranked No. 3 of the four clusters for cluster definition (average silhouette width -0.05).

Methodology and Discussion

The k-means clustering algorithm was used to partition members into a defined number of clusters. The k=4 model was selected because it explains 65.8% of the variance in the customer data (between-cluster sum of squares divided by total sum of squares) and assigns members to clusters slightly more cleanly than the other candidate models.
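The percentage quoted here is the ratio that `kmeans()` prints as `between_SS / total_SS`. A minimal sketch of that calculation follows. The `toy` data frame and its values are simulated stand-ins, not the membership file, so the resulting percentage will differ from the one reported.

```r
# Simulated stand-in for the membership data (hypothetical values).
set.seed(42)
toy <- data.frame(
  age            = c(rnorm(50, 25, 4), rnorm(50, 55, 6)),
  spending_score = c(rnorm(50, 70, 10), rnorm(50, 35, 10)),
  income         = c(rnorm(50, 40000, 8000), rnorm(50, 60000, 9000))
)

# Standardize so each variable contributes on the same scale.
toy_scaled <- scale(toy)

fit <- kmeans(toy_scaled, centers = 4, nstart = 25)

# Share of the total variation that lies between cluster centers.
explained <- fit$betweenss / fit$totss

# The same quantity from its parts: total = between + within.
explained_check <- 1 - fit$tot.withinss / fit$totss

# On standardized data the total sum of squares is (n - 1) * p.
total_expected <- (nrow(toy_scaled) - 1) * ncol(toy_scaled)

round(100 * explained, 1)
```

Because this ratio can only rise as k grows, it is best read alongside a separation measure such as the silhouette width rather than on its own.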

The NbClust R package identified k=4 as optimal by majority rule across its 30 indices: 8 indices proposed k=4, more than for any other value of k. Models from k=2 to k=7 were then evaluated. The k=5 model explains 72% of the variance but shows more cluster overlap and a higher risk of cluster misclassification than k=4 (see the cluster plots and silhouette plots); the clusters are more clearly separated at k=4. The slightly higher average silhouette width for k=4 (.01) than for k=5 (.00) was another factor in selecting k=4. The k=3 model was discarded because it appears to generalize too much for the business objective. If a five-cluster solution is desirable, its fit may be improved in a future analysis by collecting additional member measurements to provide more segmentation criteria.
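The overall silhouette widths quoted above can be recovered from the per-cluster tables in the silhouette output by weighting each cluster's average width by its size. A short check, using only the figures printed in this report:

```r
# Per-cluster sizes and average silhouette widths as printed for k = 4 and k = 5.
k4 <- data.frame(size  = c(65, 38, 40, 57),
                 width = c(0.10, -0.06, 0.00, -0.05))
k5 <- data.frame(size  = c(54, 47, 40, 39, 20),
                 width = c(-0.51, 0.41, -0.01, -0.07, 0.53))

# Overall average width = size-weighted mean of the cluster averages.
overall_k4 <- weighted.mean(k4$width, k4$size)
overall_k5 <- weighted.mean(k5$width, k5$size)

round(c(k4 = overall_k4, k5 = overall_k5), 3)
```

Since the inputs are already rounded to two decimals, this gives roughly 0.007 for k=4 and -0.004 for k=5, consistent with the .01 and .00 quoted. The gap between the two models is small, so this criterion is a tiebreaker rather than decisive evidence.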

The cluster ranges in the age, spending score and income descriptions are the 25th and 75th percentiles of each variable within the cluster. For this marketing problem, the interquartile range is a more representative cluster descriptor than the full range because it omits the minimum and maximum values and any potential outliers.
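As an illustration of how such descriptors can be derived, the sketch below computes the 25th and 75th percentiles within each cluster and compares them with the full range. The `toy` data and the helper name `iqr_by_cluster` are hypothetical and are not part of the original analysis.

```r
# Simulated stand-in: two clusters with hypothetical ages and incomes.
set.seed(1)
toy <- data.frame(
  cluster = factor(rep(1:2, each = 40)),
  age     = c(rnorm(40, 55, 6), rnorm(40, 26, 4)),
  income  = c(rnorm(40, 50000, 9000), rnorm(40, 40000, 12000))
)

# 25th and 75th percentiles of a variable within each cluster.
iqr_by_cluster <- function(values, cluster) {
  out <- t(sapply(split(values, cluster), quantile, probs = c(0.25, 0.75)))
  colnames(out) <- c("q25", "q75")
  out
}

age_range    <- iqr_by_cluster(toy$age, toy$cluster)
income_range <- iqr_by_cluster(toy$income, toy$cluster)

# Full min-max range for comparison; it is always at least as wide.
age_full <- t(sapply(split(toy$age, toy$cluster), range))

round(age_range)
```

Each row of `age_range` is one cluster, so a row can be read directly as a descriptor such as "48-63 years old".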

Data Audit

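#Load packages: janitor for clean_names(), dplyr for %>% and rename(), DT for datatable()
library(janitor)
library(dplyr)
library(DT)

#Read in data and clean column names with the janitor library
mc_data <- read.csv("D:/data_projects/mall_customers/mall_customers_raw_data.csv") %>% clean_names()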

#Rename columns
mc_data <- mc_data %>% 
  rename(
    annual_income = annual_income_k,
    spending_score = spending_score_1_100
    )

#Convert annual income from thousands of dollars to dollars
mc_data$income <- mc_data$annual_income*1000

#Generate flag variables for gender
mc_data$gender_men <- ifelse(mc_data$gender == 'Male', 1, 0)
mc_data$gender_women <- ifelse(mc_data$gender == 'Female', 1, 0)

#Subset to exclude variables not in analysis
mc_data <- subset(mc_data, select=-c(customer_id,
                                     annual_income))
#Show data summary stats
datatable(audit_numeric_summary(mc_data), 
          class = 'cell-border stripe compact hover',
          caption = "Summary Statistics")
## [1] "age"
## [1] "spending_score"
## [1] "income"
## [1] "gender_men"
## [1] "gender_women"
#Loop through variables and generate density charts
audit_numeric_viz(mc_data)
## [1] "age"

## [1] "spending_score"

## [1] "income"

## [1] "gender_men"

## [1] "gender_women"

## [1] "Done Processing"
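The `audit_numeric_summary()` helper used above is not defined in this report. As a rough idea of what a per-variable summary table of this kind might contain, here is a base-R sketch. The function name `numeric_summary` and the `toy` values are hypothetical, and this is not the helper's actual implementation.

```r
# Hypothetical stand-in for the cleaned membership table.
toy <- data.frame(
  age            = c(19, 21, 35, 47, 64, 30),
  spending_score = c(39, 81, 6, 77, 40, 73),
  income         = c(15000, 16000, 54000, 78000, 19000, 137000)
)

# One row per numeric variable: count, mean, median, quartiles, min and max.
numeric_summary <- function(df) {
  num <- df[sapply(df, is.numeric)]
  stats <- t(sapply(num, function(x) {
    c(n      = sum(!is.na(x)),
      mean   = mean(x, na.rm = TRUE),
      median = median(x, na.rm = TRUE),
      q25    = unname(quantile(x, 0.25, na.rm = TRUE)),
      q75    = unname(quantile(x, 0.75, na.rm = TRUE)),
      min    = min(x, na.rm = TRUE),
      max    = max(x, na.rm = TRUE))
  }))
  data.frame(variable = rownames(stats), stats, row.names = NULL)
}

summary_tbl <- numeric_summary(toy)
summary_tbl
```

The resulting data frame could be passed to `datatable()` in the same way as the helper's output.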

Correlations

datatable(audit_corr_summary(mc_data), 
          class = 'cell-border stripe compact hover',
          caption = "Correlation Summary")
audit_corr_matrix_viz(mc_data)

audit_scatter_viz(mc_data, "spending_score")
## [1] "age"

## [1] "spending_score"

## [1] "income"

## [1] "gender_men"

## [1] "gender_women"

## [1] "Done Processing"
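The `audit_corr_*` helpers are likewise not defined in this report. Assuming they wrap ordinary Pearson correlations, as the tab legend suggests, a base-R sketch of the same kind of summary could look like this. The `toy` values are hypothetical and chosen only so that age and spending score move in opposite directions.

```r
# Hypothetical stand-in values, not the membership data.
toy <- data.frame(
  age            = c(19, 23, 31, 40, 52, 60, 67),
  spending_score = c(80, 75, 60, 55, 40, 42, 20),
  income         = c(20000, 35000, 60000, 80000, 55000, 45000, 30000)
)

# Pearson correlation matrix for all numeric columns.
corr <- cor(toy, method = "pearson")

# Correlation of every other variable with spending_score, strongest first.
vs_score <- corr[rownames(corr) != "spending_score", "spending_score"]
vs_score <- vs_score[order(abs(vs_score), decreasing = TRUE)]

# Test whether the age correlation differs from zero.
age_test <- cor.test(toy$age, toy$spending_score, method = "pearson")

round(vs_score, 2)
```

A negative coefficient for age here has the same reading as in the report: older members tend to have lower spending scores.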

k-means clustering

# Adapted from: https://www.datanovia.com/en/blog/types-of-clustering-methods-overview-and-quick-start-r-code/

library("cluster")
library("factoextra")
library("magrittr")

mc_data_clus <- subset(mc_data, select=-c(gender, gender_men, gender_women))


my_data <- mc_data_clus %>%
  na.omit() %>%          # Remove missing values (NA)
  scale()                # Scale variables
#Determining the optimal number of clusters
#There are different methods for determining the optimal number of clusters.
#In the R code below, we'll use the NbClust R package, which provides 30 indices for determining the best number of clusters.

# Compute
library("NbClust")
res.nbclust <- mc_data_clus %>%
  scale() %>%
  NbClust(distance = "euclidean",
          min.nc = 2, max.nc = 10, 
          method = "complete", index ="all") 

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 2 proposed 2 as the best number of clusters 
## * 2 proposed 3 as the best number of clusters 
## * 8 proposed 4 as the best number of clusters 
## * 5 proposed 5 as the best number of clusters 
## * 1 proposed 7 as the best number of clusters 
## * 4 proposed 8 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  4 
##  
##  
## *******************************************************************
# Visualize
fviz_nbclust(res.nbclust, ggtheme = theme_minimal())
## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 1 proposed  1 as the best number of clusters
## * 2 proposed  2 as the best number of clusters
## * 2 proposed  3 as the best number of clusters
## * 8 proposed  4 as the best number of clusters
## * 5 proposed  5 as the best number of clusters
## * 1 proposed  7 as the best number of clusters
## * 4 proposed  8 as the best number of clusters
## * 1 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  4 .

#Gap stat suggests 4
fviz_nbclust(my_data, kmeans, method = "gap_stat")
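For readers who want the gap statistic as numbers rather than a plot, the sketch below calls `cluster::clusGap()` directly, which `fviz_nbclust()` appears to use for `method = "gap_stat"`. The data are simulated, `B` is kept small for speed, and the "firstSEmax" selection rule is an assumption about a sensible default, not a setting confirmed by this report.

```r
library(cluster)  # clusGap() and maxSE()

# Simulated stand-in: three separated groups in two standardized variables.
set.seed(123)
toy <- scale(rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 6), ncol = 2),
  cbind(rnorm(30, mean = 0), rnorm(30, mean = 6))
))

# Gap statistic for k = 1..6; B bootstrap reference sets per k.
gap <- clusGap(toy, FUNcluster = kmeans, nstart = 25, K.max = 6, B = 50)

# Choose k with the "first SE max" rule on the gap curve.
best_k <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")

round(gap$Tab, 3)
best_k
```

Because the reference sets are random, the chosen k can shift between runs unless a seed is set, which is worth remembering when a plot is annotated with a fixed conclusion.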

km.res <- kmeans(my_data, 2, nstart = 25)
# Visualize
fviz_cluster(km.res, data = my_data,
             geom = "point",
             ellipse.type = "convex",
             palette = "jco",
             ggtheme = theme_bw())

km.res <- kmeans(my_data, 3, nstart = 25)
# Visualize
fviz_cluster(km.res, data = my_data,
             geom = "point",
             ellipse.type = "convex",
             palette = "jco",
             ggtheme = theme_bw())

km.res <- kmeans(my_data, 4, nstart = 25)
# Visualize
fviz_cluster(km.res, data = my_data,
             geom = "point",
             ellipse.type = "convex",
             palette = "jco",
             ggtheme = theme_bw())

km.res <- kmeans(my_data, 5, nstart = 25)
# Visualize
fviz_cluster(km.res, data = my_data,
             geom = "point",
             ellipse.type = "convex",
             palette = "jco",
             ggtheme = theme_bw())

km.res <- kmeans(my_data, 6, nstart = 25)
# Visualize
fviz_cluster(km.res, data = my_data,
             geom = "point",
             ellipse.type = "convex",
             palette = "jco",
             ggtheme = theme_bw())

km.res <- kmeans(my_data, 7, nstart = 25)
# Visualize
fviz_cluster(km.res, data = my_data,
             geom = "point",
             ellipse.type = "convex",
             palette = "jco",
             ggtheme = theme_bw())
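The six fits above differ only in k. A sketch of the same fits written as a loop follows, with a seed set before each fit so that cluster numbering is repeatable between runs. The data are simulated, and the comparison columns are illustrative choices rather than part of the original analysis.

```r
# Simulated stand-in for the scaled membership matrix (hypothetical values).
set.seed(2024)
toy <- scale(data.frame(
  age            = runif(120, 18, 70),
  spending_score = runif(120, 1, 99),
  income         = runif(120, 15000, 137000)
))

ks <- 2:7

# Fit one model per k; resetting the seed makes each fit reproducible.
fits <- lapply(ks, function(k) {
  set.seed(123)
  kmeans(toy, centers = k, nstart = 25)
})
names(fits) <- paste0("k", ks)

# Compare models by share of variance between clusters and smallest cluster.
comparison <- data.frame(
  k             = ks,
  pct_between   = sapply(fits, function(f) round(100 * f$betweenss / f$totss, 1)),
  smallest_size = sapply(fits, function(f) min(f$size)),
  row.names     = NULL
)
comparison
```

Each element of `fits` can then be passed to `fviz_cluster()` as in the chunks above, for example `fits[["k4"]]`.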

# Adapted from: https://towardsdatascience.com/clustering-analysis-in-r-using-k-means-73eca4fb7967

#' Plots the within-group sum of squares for each execution of the kmeans algorithm.
#' In each execution the number of initial groups increases by one, up to the maximum number of centers passed as argument.
#'
#' @param data The dataframe to perform the kmeans
#' @param nc The maximum number of initial centers
#' @param seed The random seed set before each kmeans run
#'
wssplot <- function(data, nc = 15, seed = 123) {
  # One group: total sum of squares around the overall mean
  wss <- (nrow(data) - 1) * sum(apply(data, 2, var))
  for (i in 2:nc) {
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:nc, wss, type = "b", xlab = "Number of groups",
       ylab = "Sum of squares within a group")
}

wssplot(mc_data_clus, nc = 20)

#Interpretation: https://towardsdatascience.com/clustering-analysis-in-r-using-k-means-73eca4fb7967

km.res_2 <- kmeans(my_data, 2, nstart = 25)
km.res_3 <- kmeans(my_data, 3, nstart = 25)
km.res_4 <- kmeans(my_data, 4, nstart = 25)
km.res_5 <- kmeans(my_data, 5, nstart = 25)
km.res_6 <- kmeans(my_data, 6, nstart = 25)


km.res_2
## K-means clustering with 2 clusters of sizes 103, 97
## 
## Cluster means:
##          age spending_score       income
## 1  0.7071480     -0.6976405 -0.002469258
## 2 -0.7508891      0.7407935  0.002621995
## 
## Clustering vector:
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   2   2   1   2   2   2   1   2   1   2   1   2   1   2   1   2   1   2   1   2 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
##   1   2   1   2   1   2   1   2   1   2   1   2   1   2   1   2   1   2   1   2 
##  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
##   1   2   1   2   1   2   1   2   2   2   1   2   2   1   1   1   1   1   2   1 
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
##   1   2   1   1   1   2   1   1   2   2   1   1   1   1   1   2   1   1   2   1 
##  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
##   1   2   1   1   2   1   1   2   2   1   1   2   1   1   2   2   1   2   1   2 
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
##   2   1   1   2   1   2   1   1   1   1   1   2   1   2   2   2   1   1   1   1 
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 
##   2   1   2   2   2   2   1   2   1   2   1   2   2   2   1   2   1   2   1   2 
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 
##   1   2   2   2   1   2   1   2   1   2   1   2   1   2   1   2   1   2   1   2 
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 
##   1   2   1   2   1   2   1   2   1   2   1   2   1   2   1   2   1   2   1   2 
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 
##   1   2   1   2   1   2   1   2   1   2   1   2   1   2   1   2   1   2   1   2 
## 
## Within cluster sum of squares by cluster:
## [1] 217.7489 169.6903
##  (between_SS / total_SS =  35.1 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
km.res_3
## K-means clustering with 3 clusters of sizes 91, 41, 68
## 
## Cluster means:
##          age spending_score     income
## 1  0.8894494     -0.6192497  0.0472953
## 2 -0.4292604      1.1530422  1.0196744
## 3 -0.9314738      0.1334852 -0.6780959
## 
## Clustering vector:
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   3   3   3   3   3   3   3   3   1   3   1   3   1   3   3   3   3   3   1   3 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
##   3   3   1   3   1   3   1   3   3   3   1   3   1   3   1   3   1   3   3   3 
##  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
##   1   3   1   3   1   3   1   3   3   3   1   3   3   1   1   1   1   1   3   1 
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
##   1   3   1   1   1   3   1   1   3   3   1   1   1   1   1   3   1   1   3   1 
##  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
##   1   3   1   1   3   1   1   3   3   1   1   3   1   1   3   3   1   3   1   3 
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
##   3   1   1   3   1   3   1   1   1   1   1   3   1   3   3   3   1   1   1   1 
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 
##   3   1   2   2   3   2   1   2   1   2   1   2   3   2   3   2   1   2   3   2 
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 
##   1   2   3   2   3   2   1   2   1   2   1   2   1   2   1   2   1   2   1   2 
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 
##   1   2   3   2   1   2   1   2   1   2   1   2   1   2   1   2   1   2   1   2 
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 
##   1   2   1   2   1   2   1   2   1   2   1   2   1   2   1   2   1   2   2   2 
## 
## Within cluster sum of squares by cluster:
## [1] 156.4471  33.4872 103.8019
##  (between_SS / total_SS =  50.8 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
km.res_4
## K-means clustering with 4 clusters of sizes 65, 38, 40, 57
## 
## Cluster means:
##           age spending_score     income
## 1  1.08344244     -0.3961802 -0.4893373
## 2  0.03711223     -1.1857814  0.9876366
## 3 -0.42773261      1.2130414  0.9724070
## 4 -0.96008279      0.3910484 -0.7827991
## 
## Clustering vector:
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   4   4   4   4   4   4   1   4   1   4   1   4   1   4   1   4   4   4   1   4 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
##   4   4   1   4   1   4   1   4   1   4   1   4   1   4   1   4   1   4   1   4 
##  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
##   1   4   1   4   1   4   1   4   4   4   1   4   4   1   1   1   1   1   4   1 
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
##   1   4   1   1   1   4   1   1   4   4   1   1   1   1   1   4   1   1   4   1 
##  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
##   1   4   1   1   4   1   1   4   4   1   1   4   1   1   4   4   1   4   1   4 
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
##   4   1   1   4   1   4   1   1   1   1   1   4   2   4   4   4   1   1   1   1 
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 
##   4   2   3   3   2   3   2   3   1   3   2   3   2   3   2   3   2   3   2   3 
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 
##   2   3   2   3   2   3   2   3   2   3   2   3   2   3   2   3   2   3   2   3 
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 
##   1   3   2   3   2   3   2   3   2   3   2   3   2   3   2   3   2   3   2   3 
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 
##   2   3   2   3   2   3   2   3   2   3   2   3   2   3   2   3   2   3   2   3 
## 
## Within cluster sum of squares by cluster:
## [1] 74.83280 44.01863 23.91544 61.43215
##  (between_SS / total_SS =  65.8 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
km.res_5
## K-means clustering with 5 clusters of sizes 54, 47, 40, 39, 20
## 
## Cluster means:
##           age spending_score     income
## 1 -0.97822376     0.46627028 -0.7411999
## 2  1.20182469    -0.05223672 -0.2351832
## 3 -0.42773261     1.21304137  0.9724070
## 4  0.07314728    -1.19429976  0.9725047
## 5  0.52974416    -1.23337167 -1.2872781
## 
## Clustering vector:
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   1   1   5   1   1   1   5   1   5   1   5   1   5   1   5   1   5   1   5   1 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
##   5   1   5   1   5   1   5   1   5   1   5   1   5   1   5   1   5   1   5   1 
##  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
##   2   1   5   1   5   1   2   1   1   1   2   1   1   2   2   2   2   2   1   2 
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
##   2   1   2   2   2   1   2   2   1   1   2   2   2   2   2   1   2   2   1   2 
##  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
##   2   1   2   2   1   2   2   1   1   2   2   1   2   2   1   1   2   1   2   1 
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
##   1   2   2   1   2   1   2   2   2   2   2   1   4   1   1   1   2   2   2   2 
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 
##   1   4   3   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3 
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 
##   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3 
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 
##   2   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3 
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 
##   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3 
## 
## Within cluster sum of squares by cluster:
## [1] 51.85673 26.65665 23.91544 46.38992 18.58760
##  (between_SS / total_SS =  72.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
km.res_6
## K-means clustering with 6 clusters of sizes 38, 33, 21, 24, 45, 39
## 
## Cluster means:
##          age spending_score     income
## 1 -0.8709130    -0.09334615 -0.1135003
## 2  0.2211606    -1.28682305  1.0805138
## 3  0.4777583    -1.19344867 -1.3049552
## 4 -0.9735839     1.03458649 -1.3221791
## 5  1.2515802    -0.04388764 -0.2396117
## 6 -0.4408110     1.23640011  0.9891010
## 
## Clustering vector:
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   4   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
##   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4 
##  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
##   5   4   3   4   3   4   5   1   1   1   5   1   1   5   5   5   5   5   1   5 
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
##   5   1   5   5   5   1   5   5   1   1   5   5   5   5   5   1   5   1   1   5 
##  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
##   5   1   5   5   1   5   5   1   1   5   5   1   5   1   1   1   5   1   5   1 
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
##   1   5   5   1   5   1   5   5   5   5   5   1   1   1   1   1   5   5   5   5 
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 
##   1   1   1   6   1   6   2   6   2   6   2   6   1   6   2   6   2   6   1   6 
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 
##   2   6   1   6   2   6   2   6   2   6   2   6   2   6   2   6   2   6   2   6 
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 
##   5   6   2   6   2   6   2   6   2   6   2   6   2   6   2   6   2   6   2   6 
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 
##   2   6   2   6   2   6   2   6   2   6   2   6   2   6   2   6   2   6   2   6 
## 
## Within cluster sum of squares by cluster:
## [1] 20.20990 34.51630 20.52332 11.71664 23.87015 22.36267
##  (between_SS / total_SS =  77.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
sil <- silhouette(km.res_2$cluster, dist(mc_data_clus))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1  103          0.05
## 2       2   97         -0.06

sil <- silhouette(km.res_3$cluster, dist(mc_data_clus))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1   91         -0.22
## 2       2   41          0.39
## 3       3   68          0.13

sil <- silhouette(km.res_4$cluster, dist(mc_data_clus))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1   65          0.10
## 2       2   38         -0.06
## 3       3   40          0.00
## 4       4   57         -0.05

sil <- silhouette(km.res_5$cluster, dist(mc_data_clus))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1   54         -0.51
## 2       2   47          0.41
## 3       3   40         -0.01
## 4       4   39         -0.07
## 5       5   20          0.53

sil <- silhouette(km.res_6$cluster, dist(mc_data_clus))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1   38         -0.05
## 2       2   33         -0.06
## 3       3   21          0.00
## 4       4   24         -0.08
## 5       5   45          0.03
## 6       6   39         -0.03
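In the silhouette calls above, the distance matrix is built from `mc_data_clus` (raw units, with income in dollars), while the clusters were fit on the scaled `my_data`. That mismatch may partly explain why the widths sit so close to zero. The sketch below, on simulated data with hypothetical values, shows how much the choice of distance space can matter for the same cluster labels.

```r
library(cluster)  # silhouette()

# Simulated stand-in: two groups that differ in age and spending score but
# share one wide income range measured in dollars (hypothetical values).
set.seed(7)
toy <- data.frame(
  age            = c(rnorm(40, 25, 3), rnorm(40, 60, 3)),
  spending_score = c(rnorm(40, 75, 5), rnorm(40, 30, 5)),
  income         = runif(80, 20000, 120000)
)
toy_scaled <- scale(toy)
fit <- kmeans(toy_scaled, centers = 2, nstart = 25)

# Silhouette measured in the same standardized space used for clustering.
sil_scaled <- silhouette(fit$cluster, dist(toy_scaled))

# Silhouette measured on raw units, where income differences dominate distance.
sil_raw <- silhouette(fit$cluster, dist(toy))

avg_scaled <- mean(sil_scaled[, "sil_width"])
avg_raw    <- mean(sil_raw[, "sil_width"])
round(c(scaled = avg_scaled, raw = avg_raw), 2)
```

In the report's own objects the equivalent change would be `dist(my_data)` in place of `dist(mc_data_clus)`. The widths printed above would change if it were applied, so they should not be compared directly with a rerun.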

library(plotly)
library(GGally)   # provides ggparcoord()
mc_data_clus$cluster <- as.factor(km.res_4$cluster)

#Parallel coordinates plot of the three variables, colored by cluster
p <- ggparcoord(data = mc_data_clus, columns = 1:3, groupColumn = "cluster", scale = "std") +
  labs(x = "variable", y = "value (in standard-deviation units)", title = "Clustering") +
  theme_bw()
ggplotly(p)

Cluster 1 Demographics

mc_data_clus_1 <- subset(mc_data_clus, cluster==1)

datatable(audit_numeric_summary(mc_data_clus_1), 
          class = 'cell-border stripe compact hover',
          caption = "Summary Statistics")
## [1] "age"
## [1] "spending_score"
## [1] "income"
audit_numeric_viz(mc_data_clus_1)
## [1] "age"

## [1] "spending_score"

## [1] "income"

## [1] "Done Processing"

Cluster 2 Demographics

mc_data_clus_2 <- subset(mc_data_clus, cluster==2)

datatable(audit_numeric_summary(mc_data_clus_2), 
          class = 'cell-border stripe compact hover',
          caption = "Summary Statistics")
## [1] "age"
## [1] "spending_score"
## [1] "income"
audit_numeric_viz(mc_data_clus_2)
## [1] "age"

## [1] "spending_score"

## [1] "income"

## [1] "Done Processing"

Cluster 3 Demographics

mc_data_clus_3 <- subset(mc_data_clus, cluster==3)

datatable(audit_numeric_summary(mc_data_clus_3), 
          class = 'cell-border stripe compact hover',
          caption = "Summary Statistics")
## [1] "age"
## [1] "spending_score"
## [1] "income"
audit_numeric_viz(mc_data_clus_3)
## [1] "age"

## [1] "spending_score"

## [1] "income"

## [1] "Done Processing"

Cluster 4 Demographics

mc_data_clus_4 <- subset(mc_data_clus, cluster==4)

datatable(audit_numeric_summary(mc_data_clus_4), 
          class = 'cell-border stripe compact hover',
          caption = "Summary Statistics")
## [1] "age"
## [1] "spending_score"
## [1] "income"
audit_numeric_viz(mc_data_clus_4)
## [1] "age"

## [1] "spending_score"

## [1] "income"

## [1] "Done Processing"
end.time <- Sys.time()
elapsed.time <- round((end.time - start.time), 3)

paste0("Elapsed Time is : ", elapsed.time )
## [1] "Elapsed Time is : 23.61"