Function to describe clusters derived from unsupervised learning

Describing unsupervised learning clusters

As data scientists and analysts, besides doing cool modelling stuff, we’re often asked to churn out descriptive statistics. Yes, we know. It’s part of the process.

I chanced upon this really nifty concept at work for describing the clusters derived from unsupervised learning. Here’s how it goes:

  • Say the feature is a nominal or ordinal variable. First, I find the proportion of each feature value within each of the X clusters
  • Second, I rank these proportions as percentiles across the X clusters
  • The cluster with the highest percentile earns the right to be represented by that feature value
  • And if it’s a scale variable, you can take the mean of the feature for each cluster instead and repeat the ranking steps.
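
To make the steps concrete, here is a minimal sketch on a made-up data frame. The data and the column names (cluster, gender, income) are purely hypothetical, and the sketch assumes dplyr is installed. It only illustrates the idea and is not the full function.

```r
library(dplyr)

# Hypothetical toy data: 3 clusters, one categorical and one scale feature
toy <- data.frame(
  cluster = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  gender  = c("F", "F", "F", "M", "F", "F", "M", "M", "F", "M", "M", "M"),
  income  = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120)
)

# Categorical feature
cat_summary <- toy %>%
  count(cluster, gender, name = "counts") %>%                # count by cluster and value
  group_by(cluster) %>%
  mutate(prop_within_cluster = counts / sum(counts)) %>%     # step 1: proportion within cluster
  group_by(gender) %>%
  mutate(percentile = percent_rank(prop_within_cluster)) %>% # step 2: rank across clusters
  ungroup()

# Step 3: the cluster with the highest percentile represents each value
top_clusters <- cat_summary %>% filter(percentile == 1)
print(top_clusters)   # cluster 1 for "F", cluster 3 for "M"

# Scale feature: use the cluster mean instead of a proportion
scale_summary <- toy %>%
  group_by(cluster) %>%
  summarize(avg_income = mean(income)) %>%
  mutate(percentile = percent_rank(avg_income))
print(scale_summary)  # cluster 3 has the highest mean, so percentile 1
```

Note that percent_rank() gives tied clusters the same percentile, so with ties more than one cluster (or none at exactly 1) may come out on top.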

Below is a function that carries out the above steps. You can build it into a package and start using it in your data science work!


#  This function takes in a data-frame
#' @title Finding feature importance associated with each cluster
#' @param dp_data data frame for clustering
#' @param sp_resp_key_name primary key in dataset
#' @param sp_feature Feature name
#' @param sp_cluster_name column name for cluster 
#' @param sp_scale_categorical scale or categorical
#' @param sp_IndivYr_aggregate IndivYr or aggregate
#' @param sp_weight_name Column names for weights
#' @param np_feature_imp_threshold threshold for percentile ranking
#' @param sp_all_filter filter or no filter
#' @return list of proportion, mean or percentile for 6 cluster importance
#' @importFrom dplyr %>%
#' @export

l_feature_importance_cluster = function(dp_data, sp_resp_key_name,
                                        sp_feature, sp_cluster_name, sp_scale_categorical,
                                        sp_IndivYr_aggregate = NULL, sp_weight_name,
                                        np_feature_imp_threshold = 1, sp_all_filter = "all"){

  #Report which feature failed if anything below throws an error
  tryCatch({

    if((sp_scale_categorical == "categorical") && base::is.null(sp_IndivYr_aggregate)){

      dTabulated_data = dp_data %>%
        dplyr::select(dplyr::all_of(c(sp_resp_key_name, sp_feature, sp_weight_name, sp_cluster_name))) %>%  #Select columns
        dplyr::group_by(!! rlang::sym(sp_cluster_name), !! rlang::sym(sp_feature)) %>%     #Count by cluster and feature
        dplyr::summarize(counts = dplyr::n(),
                         wt_counts = sum(!! rlang::sym(sp_weight_name))) %>%
        dplyr::group_by(!! rlang::sym(sp_cluster_name)) %>%                                #Find proportion of feature in cluster group
        dplyr::mutate(counts_cluster = sum(counts, na.rm = TRUE),
                      prop_feature_within_cluster = counts/counts_cluster,
                      counts_cluster_wt = sum(wt_counts, na.rm = TRUE),
                      prop_feature_within_cluster_wt = wt_counts/counts_cluster_wt) %>%
        dplyr::group_by(!! rlang::sym(sp_feature)) %>%                                     #Find percentile of proportions
        dplyr::mutate(percentile_feature = dplyr::percent_rank(prop_feature_within_cluster),
                      percentile_feature_wt = dplyr::percent_rank(prop_feature_within_cluster_wt),
                      counts_feature = sum(counts, na.rm = TRUE),
                      prop_cluster_within_features = counts/counts_feature) %>%            #Find proportion of cluster in feature
        dplyr::group_by(!! rlang::sym(sp_cluster_name), !! rlang::sym(sp_feature)) %>%
        dplyr::mutate(feature_name = sp_feature) %>%
        dplyr::rename(feature_value = dplyr::all_of(sp_feature))

      #Filter to keep only the 'meaningful' features
      if(sp_all_filter == "filter"){
        dTabulated_data = dTabulated_data %>%
          dplyr::filter(percentile_feature >= np_feature_imp_threshold)
      }

    } else if((sp_scale_categorical == "scale") && base::is.null(sp_IndivYr_aggregate)){

      dTabulated_data = dp_data %>%
        dplyr::select(dplyr::all_of(c(sp_resp_key_name, sp_feature, sp_cluster_name, sp_weight_name))) %>%
        dplyr::group_by(!! rlang::sym(sp_cluster_name)) %>%                                #Find mean of feature in each cluster
        dplyr::summarize(avg_by_cluster = mean(!! rlang::sym(sp_feature), na.rm = TRUE),
                         avg_by_cluster_wt = stats::weighted.mean(!! rlang::sym(sp_feature), !! rlang::sym(sp_weight_name), na.rm = TRUE)) %>%
        dplyr::mutate(feature_name = sp_feature) %>%
        dplyr::mutate(percentile_feature = dplyr::percent_rank(avg_by_cluster),            #Find percentile of means
                      percentile_feature_wt = dplyr::percent_rank(avg_by_cluster_wt))

      #Filter to keep only the 'meaningful' features
      if(sp_all_filter == "filter"){
        dTabulated_data = dTabulated_data %>%
          dplyr::filter(percentile_feature >= np_feature_imp_threshold)
      }
    }

    dTabulated_data

  }, error = function(e){
    print(paste0("Error with ", sp_feature))
  })
}


