Posts | Jirong's sandbox

Convert NAs to Obscure Number in Data Frame to Aid in Recoding/ Feature Engineering

Jun 7, 2019 4 min read data_cleaning, programming, R

Converting NAs to obscure numbers to prevent the data from messing up the recoding
One issue I encounter while data-munging is that NAs in the data mess up my recoding. Here’s a neat Swiss Army knife utility function I developed recently.
suppressMessages(library(dplyr))
# Converting NA to obscure number to prevent awkward recoding situations that require & !is.na(<variable>)
# Doesn't work for factors
#' @title Convert NA to obscure number
#' @param dp_dataframe Dataframe in consideration
#' @param np_obscure_num Numeric - Obscure number
#' @param bp_na_to_num Boolean if TRUE, convert NA to num.

Loading excel data with correct variable types

Jun 1, 2019 3 min read data_analytics, programming

Loading data with data types
When reading static files into R or Python, most of the time we are lazy and load the data with no regard for the data types. But in mission-critical ETL jobs or data analytics workflows, data types are essential, and there’s a fine line between life and death. OK, I’m exaggerating here. What I’ve written below is a Swiss Army knife function to read an Excel file: the 1st tab is the data and the 2nd tab is the variable types (e.

Function to describe clusters derived from unsupervised learning

May 24, 2019 3 min read programming

Describing unsupervised learning clusters
As data scientists/analysts, besides doing cool modelling stuff, we’re often asked to churn out descriptive statistics. Yes, we know. It’s part of the process. I chanced upon this really nifty concept at work for describing the clusters derived from unsupervised learning. Here’s how it goes. Say it’s a nominal or ordinal variable. First, I find the proportion of the feature across the X clusters. Second, I rank this proportion through percentiles across these X values. The cluster with the highest percentile earns the right to be represented by the feature. And if it’s a scale variable, you may find the mean of the feature for each cluster and repeat the steps.
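The three steps for a nominal variable can be sketched in plain Python. This is only an illustration, not the function from the post: the name `describe_clusters`, the `(cluster, category)` input shape, and the tie-breaking rule are all assumptions.

```python
from collections import Counter, defaultdict

def describe_clusters(rows):
    """rows: iterable of (cluster_label, category) pairs (hypothetical input shape)."""
    counts = defaultdict(Counter)
    for cluster, category in rows:
        counts[cluster][category] += 1
    clusters = sorted(counts)
    categories = sorted({cat for c in counts.values() for cat in c})

    representative = {}
    for cat in categories:
        # Step 1: proportion of this category within each cluster.
        props = {c: counts[c][cat] / sum(counts[c].values()) for c in clusters}
        # Step 2: percentile rank of each cluster's proportion among the X clusters.
        pct = {c: sum(props[o] <= props[c] for o in clusters) / len(clusters)
               for c in clusters}
        # Step 3: the cluster with the highest percentile is represented by the category.
        # Assumption: ties fall back to the raw proportion, then to cluster order.
        representative[cat] = max(clusters, key=lambda c: (pct[c], props[c]))
    return representative

rows = [("A", "red"), ("A", "red"), ("A", "blue"),
        ("B", "blue"), ("B", "blue"), ("B", "red"),
        ("C", "green"), ("C", "green"), ("C", "red")]
print(describe_clusters(rows))
```

For a scale variable, the same ranking would presumably be applied to the per-cluster means instead of the proportions.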

Playing with Google Place API

May 14, 2019 1 min read programming

Google Place API
I was playing around with the API to obtain lat-long for my geo analytics work. I entered my credit card info, but it seems that I’m not charged even with 9000+ API calls. I’m unsure if it’s because I have 400+ dollars of free cloud credit. Anyway, what I did here was make API calls and store the data in my local database. If you’re interested, you may visit this stackoverflow link (https://stackoverflow.

Using exponential distribution to estimate frequency of occurrence

May 5, 2019 4 min read statistics, firstPrinciples

Simulating product failures
I’m inspired by this post here (http://www.programmingr.com/examples/neat-tricks/sample-r-function/rexp/) and decided to expand on the example. Say you are the owner of a computer store and you would like to estimate the frequency of warranty repairs and the ensuing costs. Here’s the scenario with the accompanying assumptions:
Each computer is expected to last an average of 7 years
You sell only 1000 computers at the start of each year
You sell computers from 2019 to 2025
First, I simulate an exponential distribution of 1000 points for each of the 7 years, and place a time index of 2019 to 2025
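A rough Python sketch of this first step, using only the standard library, might look like the following. The post itself builds on R's `rexp`, so the function name, the seed, and the choice to bucket failures by calendar year are assumptions made here for illustration. No warranty period or repair cost is modelled, since the excerpt does not state them.

```python
import random
from collections import Counter

MEAN_LIFE_YEARS = 7             # each computer lasts 7 years on average
UNITS_PER_YEAR = 1000           # computers sold at the start of each year
SALE_YEARS = range(2019, 2026)  # sales from 2019 to 2025 inclusive

def simulate_failures(seed=0):
    """Return a Counter mapping calendar year -> number of simulated failures."""
    rng = random.Random(seed)
    failures = Counter()
    for sale_year in SALE_YEARS:
        for _ in range(UNITS_PER_YEAR):
            # expovariate takes the rate, i.e. 1 / mean lifetime.
            lifetime = rng.expovariate(1 / MEAN_LIFE_YEARS)
            # Sold at the start of sale_year, so the failure falls in this year.
            failures[int(sale_year + lifetime)] += 1
    return failures

failures = simulate_failures()
for year in range(2019, 2026):
    print(year, failures[year])
```

Each sale cohort contributes failures to its own year and every later year, so the yearly counts grow while sales continue.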

Some thoughts on Reinforcement Learning - Q Learning

Apr 8, 2019 3 min read programming, machine_learning

Q learning
I just completed a Reinforcement Learning assignment, in particular on Q-learning. According to Wikipedia here, it’s a model-free RL algorithm. The goal of the algorithm is to learn a policy, which tells an agent what action to take under different circumstances. Here’s my confession. What I’m doing in this post is summarising what I’ve just learnt so that I may come back to it at any point in the future.
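For reference, the standard tabular Q-learning update is Q(s, a) ← Q(s, a) + α (r + γ max Q(s', a') − Q(s, a)). A minimal Python sketch follows. The function names, the dictionary-based table, and the default `alpha` and `gamma` values are illustrative assumptions, not taken from the assignment.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update in place and return the new Q(state, action)."""
    # Value of the best action available from the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction alpha towards the target.
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]

def greedy_action(q, state, actions):
    """The learnt policy: pick the action with the highest Q-value in this state."""
    return max(actions, key=lambda a: q[(state, a)])

q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
actions = ["left", "right"]
q_update(q, "s0", "right", reward=1.0, next_state="s1", actions=actions)
print(greedy_action(q, "s0", actions))
```

The update never consults a transition model, only observed `(state, action, reward, next_state)` tuples, which is what "model-free" refers to.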

What're the returns (XIRR) for my CPFIS Portfolio

Mar 16, 2019 5 min read investment

What’re the returns (XIRR) for my CPFIS Portfolio?
Every employee in Singapore is bound by the same set of CPF rules. As an ex-economist/data geek who doesn’t shy away from having skin in the game, I asked myself this question back in 2015 when I was still a starry-eyed young man 2 years into the workforce - how do I set out to optimize my returns in my CPF OA with this given set of constraints,

Hosting a Flask App on Heroku

Feb 28, 2019 1 min read programming, api, python

Following the steps here (https://realpython.com/flask-by-example-part-1-project-setup/), I managed to deploy my Python Flask app on Heroku.
from flask import Flask

app = Flask(__name__)

# Root URL returns a fixed greeting
@app.route('/')
def hello():
    return "Hello World!"

# Any other path segment is treated as a name to greet
@app.route('/<name>')
def hello_name(name):
    return "Hello {}!".format(name)

if __name__ == '__main__':
    app.run()
You may visit the following link (https://jirong-stage.herokuapp.com/) and add a suffix to it. For example, https://jirong-stage.herokuapp.com/jirong will return Hello jirong! Possibilities are immense! I can easily create APIs or host dashboards here.

Sampling With Replacement Through First Principles

Feb 27, 2019 3 min read programming, statistics, firstPrinciples

Sampling with replacement
Hello! It’s me once again attempting to explain things from first principles - a term popularized by Elon Musk. I will use some pseudo code - on sampling with replacement for weights - to aid my explanation. Earlier in the week, I attempted to write a simple function from scratch but I gave up after realising that it would take me more than 15 mins! The difficulty lies in the multiple switch statements needed to define the intervals.
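One way to avoid hand-written switch statements is to build the intervals from cumulative weights and find the right one with a binary search. The Python sketch below assumes that approach, and the function name is hypothetical. It is not the pseudo code from the post.

```python
import bisect
import itertools
import random

def weighted_sample_with_replacement(items, weights, k, rng=random):
    """Draw k items with replacement, each with probability proportional to its weight."""
    # Cumulative weights mark the right edge of each item's interval,
    # e.g. weights [0.2, 0.8] -> intervals [0, 0.2) and [0.2, 1.0).
    cumulative = list(itertools.accumulate(weights))
    total = cumulative[-1]
    draws = []
    for _ in range(k):
        u = rng.random() * total  # uniform point on [0, total)
        # Find the interval containing u; hi caps the index at the last item.
        idx = bisect.bisect_right(cumulative, u, 0, len(cumulative) - 1)
        draws.append(items[idx])
    return draws

rng = random.Random(42)
sample = weighted_sample_with_replacement(["a", "b", "c"], [0.2, 0.0, 0.8], 1000, rng)
print(sample.count("a"), sample.count("b"), sample.count("c"))
```

Because every draw is independent of the previous ones, the same item can appear many times. An item with zero weight gets an empty interval and is never drawn.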

Building a decision tree algorithm from scratch

Feb 15, 2019 2 min read programming

Building a decision tree from scratch
Sometimes, to truly understand and internalise an algorithm, it’s useful to build it from scratch rather than relying on a module or library written by someone else. I’m fortunate to have been given the chance to do so in one of my decision tree assignments. For this exercise, I had to rely on my knowledge of recursion, binary trees (in-order traversal) and object-oriented programming.
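To show how those three ingredients fit together, here is a compact Python sketch of a classification tree. It is not the assignment's code. The Gini impurity criterion, the `max_depth` stopping rule, and all names are assumptions chosen for illustration.

```python
from collections import Counter

class Node:
    """Binary tree node: either an internal split or a leaf holding a label."""
    def __init__(self, feature=None, threshold=None, left=None, right=None, label=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right, self.label = left, right, label

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build(rows, labels, depth=0, max_depth=3):
    """Recursively split on the (feature, threshold) with the lowest weighted Gini."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or depth == max_depth:
        return Node(label=majority)
    best = None  # (score, feature index, threshold)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must send rows to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:
        return Node(label=majority)
    _, f, t = best
    lo = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    hi = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return Node(f, t,
                build([r for r, _ in lo], [y for _, y in lo], depth + 1, max_depth),
                build([r for r, _ in hi], [y for _, y in hi], depth + 1, max_depth))

def predict(node, row):
    """Walk down the tree recursively until a leaf is reached."""
    if node.label is not None:
        return node.label
    child = node.left if row[node.feature] <= node.threshold else node.right
    return predict(child, row)

def in_order(node):
    """In-order traversal: left subtree, then this split, then right subtree."""
    if node.label is not None:
        return [f"leaf:{node.label}"]
    return in_order(node.left) + [f"x{node.feature}<={node.threshold}"] + in_order(node.right)

tree = build([[1.0], [2.0], [3.0], [4.0]], ["a", "a", "b", "b"])
print(in_order(tree))
```

The in-order listing is a quick way to inspect the learnt splits without drawing the tree.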