Describing unsupervised learning clusters
As data scientists and analysts, besides doing cool modelling stuff, we’re often asked to churn out descriptive statistics. Yes, we know. It’s part of the process.
I chanced upon this really nifty concept at work for describing the clusters derived from unsupervised learning. Here’s how it goes:
- Say it’s a nominal or ordinal variable. First, I find the proportion of each feature value within each of the X clusters.
- Second, for each feature value, I rank these X proportions against one another using percentile ranks.
- The cluster with the highest percentile earns the right to be represented by that feature value.
- And if it’s a scale variable, find the mean of the feature for each cluster instead, then rank the X means in the same way.
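To make the steps concrete, here’s a minimal sketch of the categorical case on a made-up data frame. The data, the column names (`cluster`, `channel`) and the object names are all hypothetical, and it assumes dplyr is installed; it is only an illustration of the idea, not the full function.

```r
library(dplyr)

# Hypothetical toy data: 3 clusters and one categorical feature, "channel"
toy <- data.frame(
  cluster = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  channel = c("web", "web", "web", "app",
              "web", "app", "app", "app",
              "web", "web", "app", "app")
)

ranked <- toy %>%
  count(cluster, channel, name = "counts") %>%               # count by cluster and feature value
  group_by(cluster) %>%
  mutate(prop_within_cluster = counts / sum(counts)) %>%     # step 1: proportion within each cluster
  group_by(channel) %>%
  mutate(percentile = percent_rank(prop_within_cluster)) %>% # step 2: rank across clusters
  ungroup()

# Step 3: keep the cluster with the highest percentile for each feature value
top <- ranked %>% filter(percentile == 1)
print(top)
```

In this toy data, cluster 1 ends up represented by "web" and cluster 2 by "app". Cluster 3 is an even split, so it never reaches the top percentile and is not described by either value.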
Below is a nifty function to carry out the above steps. You can compile it into a package and start using it in your data science work!
Enjoy!
# This function takes in a data-frame
#' @title Finding feature importance associated with each cluster
#' @param dp_data data frame for clustering
#' @param sp_resp_key_name primary key in dataset
#' @param sp_feature Feature name
#' @param sp_cluster_name column name for cluster
#' @param sp_scale_categorical "scale" or "categorical"
#' @param sp_IndivYr_aggregate IndivYr or aggregate; both branches below only run when this is left as NULL
#' @param sp_weight_name column name for weights
#' @param np_feature_imp_threshold minimum percentile rank (between 0 and 1) to keep when sp_all_filter is "filter"
#' @param sp_all_filter "all" (no filtering, the default) or "filter"
#' @return data frame of proportions (categorical) or means (scale) per cluster, with their percentile ranks
#' @export
#' @importFrom dplyr %>%
l_feature_importance_cluster = function(dp_data, sp_resp_key_name,
                                        sp_feature, sp_cluster_name, sp_scale_categorical,
                                        sp_IndivYr_aggregate = NULL, sp_weight_name,
                                        np_feature_imp_threshold = 1, sp_all_filter = "all"){
  # Default result, so return() still works if no branch runs or an error is caught
  dTabulated_data = NULL
  tryCatch({
    if((sp_scale_categorical == "categorical") && base::is.null(sp_IndivYr_aggregate)){
      #Tabulate
      dTabulated_data = dp_data %>%
        dplyr::select(dplyr::all_of(c(sp_resp_key_name, sp_feature, sp_weight_name, sp_cluster_name))) %>% #select columns
        dplyr::group_by(!! rlang::sym(sp_cluster_name), !! rlang::sym(sp_feature)) %>% #Count by cluster and feature
        dplyr::summarize(counts = dplyr::n(),
                         wt_counts = sum(!! rlang::sym(sp_weight_name))) %>%
        dplyr::group_by(!! rlang::sym(sp_cluster_name)) %>% #Find proportion of feature in cluster group
        dplyr::mutate(counts_cluster = sum(counts, na.rm = TRUE),
                      prop_feature_within_cluster = counts/counts_cluster,
                      counts_cluster_wt = sum(wt_counts, na.rm = TRUE),
                      prop_feature_within_cluster_wt = wt_counts/counts_cluster_wt) %>%
        dplyr::group_by(!! rlang::sym(sp_feature)) %>% #Find percentile of proportions
        dplyr::mutate(percentile_feature = dplyr::percent_rank(prop_feature_within_cluster),
                      percentile_feature_wt = dplyr::percent_rank(prop_feature_within_cluster_wt),
                      counts_feature = sum(counts, na.rm = TRUE),
                      prop_cluster_within_features = counts/counts_feature) %>% #Find proportion of cluster in feature
        dplyr::group_by(!! rlang::sym(sp_cluster_name), !! rlang::sym(sp_feature)) %>%
        dplyr::mutate(feature_name = sp_feature) %>%
        dplyr::rename(feature_value = dplyr::all_of(sp_feature)) #Rename the feature column to a generic name
      #Filter to keep only the 'meaningful' features
      if(sp_all_filter == "filter"){
        dTabulated_data = dTabulated_data %>%
          dplyr::filter(percentile_feature >= np_feature_imp_threshold)
      }
    } else if((sp_scale_categorical == "scale") && base::is.null(sp_IndivYr_aggregate)){
      #Average the feature by cluster, unweighted and weighted
      dTabulated_data = dp_data %>%
        dplyr::select(dplyr::all_of(c(sp_resp_key_name, sp_feature, sp_cluster_name, sp_weight_name))) %>%
        dplyr::group_by(!! rlang::sym(sp_cluster_name)) %>%
        dplyr::summarize(avg_by_cluster = mean(!! rlang::sym(sp_feature), na.rm = TRUE),
                         avg_by_cluster_wt = weighted.mean(!! rlang::sym(sp_feature), !! rlang::sym(sp_weight_name), na.rm = TRUE)) %>%
        dplyr::mutate(feature_name = sp_feature) %>%
        dplyr::mutate(percentile_feature = dplyr::percent_rank(avg_by_cluster), #Find percentile of cluster means
                      percentile_feature_wt = dplyr::percent_rank(avg_by_cluster_wt))
      #Filter to keep only the 'meaningful' features
      if(sp_all_filter == "filter"){
        dTabulated_data = dTabulated_data %>%
          dplyr::filter(percentile_feature >= np_feature_imp_threshold)
      }
    }
  }, error = function(e){
    #Report which feature failed, along with the underlying error message
    message("Error with ", sp_feature, ": ", conditionMessage(e))
  })
  return(dTabulated_data)
}
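If you want to sanity-check the "scale" branch by hand, here’s a small sketch that repeats its core steps (cluster mean, weighted mean, then percentile rank) on made-up data. The data frame and its column names (`cluster`, `spend`, `weight`) are hypothetical, and it assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data: 3 clusters, one scale feature and a weight column
toy <- data.frame(
  cluster = c("A", "A", "B", "B", "C", "C"),
  spend   = c(10, 20, 30, 50, 5, 15),
  weight  = c(1, 3, 1, 1, 2, 2)
)

by_cluster <- toy %>%
  group_by(cluster) %>%
  summarize(
    avg_by_cluster    = mean(spend, na.rm = TRUE),                  # plain mean per cluster
    avg_by_cluster_wt = weighted.mean(spend, weight, na.rm = TRUE)  # weighted mean per cluster
  ) %>%
  mutate(
    percentile_feature    = percent_rank(avg_by_cluster),           # rank the cluster means
    percentile_feature_wt = percent_rank(avg_by_cluster_wt)
  )

print(by_cluster)
```

Here cluster B has the highest mean spend, so it gets a percentile of 1 and would be the cluster "described" by this feature. One thing to watch: `na.rm = TRUE` in `weighted.mean` only drops missing feature values, not missing weights, so the weight column should be free of NAs.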