Convert NAs to Obscure Number in Data Frame to Aid in Recoding/ Feature Engineering

Converting NAs to obscure numbers to prevent the data from messing up the recoding.

1 issue that I encounter while I data-munge is that NAs in data seem to mess up my recoding. Here’s a neat swiss army knife utility function I developed recently.

suppressMessages(library(dplyr))

# Converting NA to obscure number to prevent awkward recoding situations that require & !is.na(<variable>)
# Doesn't work for factors
#' @title Convert NA to obscure number
#' @param dp_dataframe Dataframe in consideration
#' @param np_obscure_num Numeric - Obscure number
#' @param bp_na_to_num Boolean if TRUE, convert NA to num. If FALSE, convert num to NA
#' @return A data frame with converted NAs
#' @export

df_convertNAs_to_obscureNo = function(dp_dataframe, np_obscure_num, bp_na_to_num){

  if(bp_na_to_num == T){

    dConverted_data = dp_dataframe %>%
      mutate_if(is.integer, ~ replace(., is.na(.), np_obscure_num)) %>%
      mutate_if(is.numeric, ~ replace(., is.na(.), np_obscure_num)) %>%
      mutate_if(is.character, ~ replace(., is.na(.), as.character(np_obscure_num)))  

  }else if(bp_na_to_num == F){
    
    bf_is_obscure_num = function(np_num){
      return(np_num == np_obscure_num)
    }

    bf_is_obscure_num_string = function(np_num){
      return(as.character(np_num) == as.character(np_obscure_num))
    }

    dConverted_data = dp_dataframe %>%
      mutate_if(is.integer, ~ replace(., bf_is_obscure_num(.), NA)) %>%
      mutate_if(is.numeric, ~ replace(., bf_is_obscure_num(.), NA)) %>%
      mutate_if(is.character, ~ replace(., bf_is_obscure_num_string(.), NA))
  }
  
  return(dConverted_data)
}

To illustrate how we could use the function, first, I load in the car dataset.

Second, I insert NA values into 1 of the cells.

data(cars)
data = head(cars)
data$dist2 = data$dist
str(data)
## 'data.frame':    6 obs. of  3 variables:
##  $ speed: num  4 4 7 7 8 9
##  $ dist : num  2 10 4 22 16 10
##  $ dist2: num  2 10 4 22 16 10
data$dist[1] = NA
print(data)
##   speed dist dist2
## 1     4   NA     2
## 2     4   10    10
## 3     7    4     4
## 4     7   22    22
## 5     8   16    16
## 6     9   10    10

Third, I implement a recoding-logic as follows,

data$is_true = ifelse(data$speed == 4 & data$dist != 0 & data$dist2 == 2, 1, 0)
print(data)
##   speed dist dist2 is_true
## 1     4   NA     2      NA
## 2     4   10    10       0
## 3     7    4     4       0
## 4     7   22    22       0
## 5     8   16    16       0
## 6     9   10    10       0

Notice in the first row that just because dist is NA, is_true returns NA. This holds true for even longer recoding rules. A mere NA would render the return value to be NA even if you would expect the result to be 1!

Here comes the highlight of the post.

By applying the df_convertNAs_to_obscureNo function I wrote above to the data frame with the following parameters,

  • Data frame name;
  • An obscure number that will never be used in recoding (e.g. -123456);
  • And a boolean flag (will explain this later),

You would be able to skirt the issue I highlighted above.

data(cars)
data = head(cars)
data$dist2 = data$dist
str(data)
## 'data.frame':    6 obs. of  3 variables:
##  $ speed: num  4 4 7 7 8 9
##  $ dist : num  2 10 4 22 16 10
##  $ dist2: num  2 10 4 22 16 10
data$dist[1] = NA

data = df_convertNAs_to_obscureNo(data, -1234567, T)
print(data)
##   speed     dist dist2
## 1     4 -1234567     2
## 2     4       10    10
## 3     7        4     4
## 4     7       22    22
## 5     8       16    16
## 6     9       10    10
#Recoding results
data$is_true = ifelse(data$speed == 4 & data$dist != 0 & data$dist2 == 2, 1, 0)
print(data)
##   speed     dist dist2 is_true
## 1     4 -1234567     2       1
## 2     4       10    10       0
## 3     7        4     4       0
## 4     7       22    22       0
## 5     8       16    16       0
## 6     9       10    10       0

So instead of obtaining a NA value, it would return 1 instead!

And you may ask me - how do I change -123457 back into NA?

You could simply reuse the same function with a FALSE flag. And presto the obscure value is converted back into NA.

data = df_convertNAs_to_obscureNo(data, -1234567, F)
print(data)
##   speed dist dist2 is_true
## 1     4   NA     2       1
## 2     4   10    10       0
## 3     7    4     4       0
## 4     7   22    22       0
## 5     8   16    16       0
## 6     9   10    10       0

Warning!!!

Developer/ Data Scientist/ Data Analysis would need to be absolutely sure that this is what he/ she wants. If NA value is expected instead of 1 in the above use case, please do not use the function!

Related

comments powered by Disqus