Loading excel data with correct variable types

Loading data with data types

When reading static files into R or Python, most of the times we are lazy as we load the data with no regard to the data types.

But in mission critical ETL jobs or Data analytics workflow, data types are quintessential and there’s a fine line between life and death. Ok, I’m exaggerating here.

What I’ve written below is a swiss army knife function to read an excel file: 1st tab is data and 2nd tab is the variable types (e.g. database variable types mapped to R variable types)

Note: If I read or write to database, I would have to modify my function below. Oh well.

The steps in the function are pretty simple,

  • First I read in the data type
  • Second I read in the column name of the dataset
  • Third I left join the column names to its data type from database (or .sav or .dta or .sas files)
  • Fourth I left join the (DB variable types) to (R data types) translation into the column names
  • Lastly, I read in the dataset through read_excel functions with col_types as the parameter

Here you go! Hope this is useful.

# Reading excel data together with data types
#' @title Reading data with data types
#' @param spName_file Path of the data file
#' @param spTab_name_data Tab name of data
#' @param spTab_name_dataType Tab name of data type
#' @param sp_dataType_col_name Column name of dataType's (variables column)
#' @param sp_dataType_datatype_name Column name of dataType's (data type column)
#' @param dp_r_hana_type data frame of R to hana variable types conversion
#' @param sp_r_hana_type_DBType Column name of r_hana_type (DB_TYPE column)
#' @param sp_r_hana_type_RType Column name of r_hana_type (R_TYPE column)
#' @return A data frame with corresponding data types
#' @export
df_read_data_with_types = function(spName_file, spTab_name_data,
                                   spTab_name_dataType,
                                   sp_dataType_col_name, sp_dataType_datatype_name,
                                   dp_r_hana_type, sp_r_hana_type_DBType, sp_r_hana_type_RType){

  #Read in data types
  data_type = df_read_tab(spName_file, spTab_name_dataType)

  #Read in just first row of dataset
  col_name = read_excel(spName_file, spTab_name_data, n_max = 1)
  col_name = data.frame(col_name = names(col_name)); col_name$col_name = as.character(col_name$col_name)
  col_name$column_order = 1:nrow(col_name)

  #Left join data type to name
  col_name = col_name %>%
    dplyr::left_join(data_type, by = c("col_name" = sp_dataType_col_name))

  #Split this from above because of errors. weird
  col_name = base::merge(col_name, dp_r_hana_type,
                   by.x = sp_dataType_datatype_name, by.y = sp_r_hana_type_DBType,
                   all.x = T)

  col_name = arrange(col_name, column_order)

  #Read in full dataset with assignment of classes (look at readxl.tidyverse.org)
  data = readxl::read_excel(spName_file, spTab_name_data, col_types = col_name[, sp_r_hana_type_RType])

  return(data)
}

# spName_file = './data/input/TB_OVSS_FNB_FACT.xlsx'
# spTab_name_data = 'data'
# spTab_name_dataType = 'data_type'
# sp_dataType_col_name = 'COLUMN_NAME' 
# sp_dataType_datatype_name = 'DATA_TYPE_NAME'
# dp_r_hana_type = r_hana_type 
# sp_r_hana_type_DBType = 'DB_TYPE' 
# sp_r_hana_type_RType = 'R_TYPE'
#
# a = df_read_data_with_types (spName_file, spTab_name_data,
#                          spTab_name_dataType,
#                          sp_dataType_col_name, sp_dataType_datatype_name,
#                          dp_r_hana_type, sp_r_hana_type_DBType, sp_r_hana_type_RType)

Related

comments powered by Disqus