Kaggle Titanic Competition Part VII – Random Forests and Feature Importance
In the last post we took a look at how to reduce noisy variables from our data set using PCA, and today we'll actually start modelling! Random Forests are one of the easiest models to run, and highly effective as well. A great combination for sure. If you're just starting out with a new problem, this is a great model for quickly building a reference model. There aren't a whole lot of parameters to tune, which makes it very user-friendly. The primary parameters include how many decision trees to include in the forest, how much data to include in each [...]
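As a rough sketch of what such a reference model might look like in scikit-learn: the data below is synthetic (generated with make_classification, not the Titanic set), and the parameter values are placeholder assumptions rather than the settings used in the post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 200 rows, 8 numeric features.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)

# n_estimators sets the number of trees in the forest;
# max_features limits how many features each split may consider.
forest = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", random_state=0
)
forest.fit(X, y)

# Impurity-based importances, one value per feature, summing to 1.
importances = forest.feature_importances_

# Rank feature indices from most to least important.
ranked = sorted(enumerate(importances), key=lambda pair: pair[1], reverse=True)
for index, score in ranked:
    print(f"feature {index}: {score:.3f}")
```

The ranking only reflects how useful each feature was to this particular forest, so it is best read as a guide rather than a definitive ordering.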
Kaggle Titanic Competition Part VI – Dimensionality Reduction
In the last post, we looked at how to use an automated process to generate a large number of non-correlated variables. Now we're going to look at a very common way to reduce the number of features that we use in modelling. You may be wondering why we'd remove variables we just took the time to create. The answer is pretty simple - sometimes it helps. If you think about a predictive model in terms of finding a "signal" or "pattern" in the data, it makes sense that you want to remove noise in the data that hides the signal. [...]
Kaggle Titanic Competition Part V – Interaction Variables
In the last post we covered some ways to derive variables from string fields using intuition and insight. This time we'll cover derived variables that are a lot easier to generate. Interaction variables capture effects of the relationship between variables. They are constructed by performing mathematical operations on sets of features. The simple approach that we use in this example is to perform basic operators (add, subtract, multiply, divide) on each pair of numerical features. We could also get much more involved and include more than 2 features in each calculation, and/or use other operators (sqrt, ln, trig functions, etc). [...]
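The pairwise approach described in this excerpt can be sketched in a few lines of plain Python. The helper name add_interactions, the generated feature names, and the sample passenger values are all illustrative assumptions, not the code from the post.

```python
from itertools import combinations

def add_interactions(row, numeric_keys):
    """Return a copy of `row` with pairwise interaction features added."""
    out = dict(row)
    for a, b in combinations(numeric_keys, 2):
        x, y = row[a], row[b]
        out[f"{a}+{b}"] = x + y
        out[f"{a}-{b}"] = x - y
        out[f"{a}*{b}"] = x * y
        # Guard against division by zero; None marks an undefined ratio.
        out[f"{a}/{b}"] = x / y if y != 0 else None
    return out

# Illustrative values only (assumption).
passenger = {"Age": 22.0, "Fare": 7.25, "SibSp": 1}
features = add_interactions(passenger, ["Age", "Fare", "SibSp"])
print(len(features))  # 3 original features + 4 operators * 3 pairs
```

With n numeric features this produces 4 * n * (n - 1) / 2 new columns, which is why the feature count grows so quickly.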
Kaggle Titanic Competition Part IV – Derived Variables
In the previous post, we began taking a look at how to convert the raw data into features that can be used by the Random Forest model. Any variable that is generated from one or more existing variables is called a "derived" variable. We've discussed basic transformations that result in useful derived variables, and in this post we'll look at some more interesting derived variables that aren't simple transformations. An important aspect of feature engineering is using insight and creativity to find new features to feed the model. You'll read this over and over again, and it really can't be [...]
Kaggle Titanic Competition Part III – Variable Transformations
In the last two posts, we've covered reading in the data set and handling missing values. Now we can start working on transforming the variable values into formatted features that our model can use. Different implementations of the Random Forest algorithm can accept different types of data. Scikit-learn requires everything to be numeric so we'll have to do some work to transform the raw data. All possible data can be generally considered as one of two types: Quantitative and Qualitative. Quantitative variables are those whose values can be meaningfully sorted in a manner that indicates an underlying order. In the [...]
Kaggle Titanic Competition Part II – Missing Values
There will be missing/incorrect data in nearly every non-trivial data set a data scientist ever encounters. It is as certain as death and taxes. This is especially true with big data and applies to data generated by humans in a social context or by computer systems/sensors. Some predictive models inherently are able to deal with missing data (neural networks come to mind) and others require that the missing values be dealt with separately. The RandomForestClassifier model in scikit-learn is not able to handle missing values, so we'll need to use some different approaches to assign values before training the model. [...]
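One simple way to assign values before training is to fill each gap with the median of the known values. The excerpt does not say which approaches the post uses, so treat this as a minimal sketch of one common option; the function name and the sample ages are made up for illustration.

```python
from statistics import median

def fill_missing_with_median(values):
    """Replace None entries with the median of the non-missing entries."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]

# Illustrative ages with two missing entries (assumption).
ages = [22.0, None, 38.0, 26.0, None, 35.0]
filled = fill_missing_with_median(ages)
print(filled)
```

The median is less sensitive to outliers than the mean, which is one reason it is often chosen for skewed columns.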