There will be missing or incorrect data in nearly every non-trivial data set a data scientist ever encounters. It is as certain as death and taxes. This is especially true with big data, and it applies to data generated by humans in a social context as well as by computer systems and sensors. Some predictive models are inherently able to deal with missing data (neural networks come to mind), while others require that the missing values be dealt with separately. The RandomForestClassifier model in scikit-learn is not able to handle missing values, so we’ll need to use some different approaches to assign values before training the model. The following is a partial list of ways missing values can be dealt with:
1) Throw out any data with missing values – I don’t particularly like this approach, but if you’ve got a lot of data that isn’t missing any values, it is certainly the quickest and easiest way to handle it.
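As a quick sketch, pandas makes this a one-liner (here, as in the other snippets, df is the Titanic DataFrame):
# Drop any row that has at least one missing value
df_complete = df.dropna()
# Or only drop rows that are missing values in specific columns
df_complete = df.dropna(subset=['Age', 'Embarked'])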
2) Assign a value that indicates a missing value – This is particularly appropriate for categorical variables (more on this in the next post). I really like using this approach when possible, because the fact that the value is missing can be useful information in and of itself. Perhaps a value being missing for a particular variable has some underlying cause that makes it correlate more highly with another value. Unfortunately, there isn’t really a great way to do this with continuous variables. One interesting trick I learned is that you can do this with binary variables (again, discussed more in the next post) by setting the false value as -1, the true value as 1, and missing values as 0.
# Replace missing values with "U0"
df.loc[df.Cabin.isnull(), 'Cabin'] = 'U0'
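And a minimal sketch of the -1/1/0 trick for binary variables, using a hypothetical True/False column ('HasSpouse') just for illustration:
# 'HasSpouse' is a hypothetical binary column with some missing values:
# encode True as 1, False as -1, and leave missing as 0 so that
# "missing" carries its own signal
df['HasSpouse'] = df['HasSpouse'].map({True: 1, False: -1}).fillna(0)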
3) Assign the average value – This is a very common approach because it is simple, and for variables that aren’t extremely important it may well be good enough. You can also incorporate other variables to create subsets and assign the average within each group. For categorical variables, the most common value (the mode) can be used rather than the mean.
# Take the median of all non-null Fares and use that for all missing values
df.loc[df.Fare.isnull(), 'Fare'] = df['Fare'].median()
– or –
# Replace missing values with the most common port
df.loc[df.Embarked.isnull(), 'Embarked'] = df.Embarked.dropna().mode().values[0]
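To do the within-group version mentioned above, a groupby/transform works nicely — a minimal sketch that fills missing Fares with the median Fare of each passenger class:
# Fill missing Fares with the median Fare within each Pclass
df['Fare'] = df.groupby('Pclass')['Fare'].transform(lambda s: s.fillna(s.median()))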
4) Use a regression or another simple model to predict the values of missing variables – This is the approach I used for the Age variable in the Titanic set, because age seemed to be one of the more important variables and I thought this would provide better estimates than using mean values. The general approach is to take whatever other features are available (and populated), build a model using the examples that do have values for the variable in question, and then predict the values for the others. I used the following code to populate the missing Age variable using a RandomForestRegressor model, but a simple linear regression probably would have been fine:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

### Populate missing ages using a RandomForestRegressor
def setMissingAges(df):
    # Grab all the features that can be included in a Random Forest Regressor
    age_df = df[['Age', 'Embarked', 'Fare', 'Parch', 'SibSp',
                 'Title_id', 'Pclass', 'Names', 'CabinLetter']]

    # Split into sets with known and unknown Age values
    knownAge = age_df.loc[(df.Age.notnull())]
    unknownAge = age_df.loc[(df.Age.isnull())]

    # All age values are stored in a target array
    y = knownAge.values[:, 0]

    # All the other values are stored in the feature array
    X = knownAge.values[:, 1:]

    # Create and fit a model
    rtr = RandomForestRegressor(n_estimators=2000, n_jobs=-1)
    rtr.fit(X, y)

    # Use the fitted model to predict the missing values
    predictedAges = rtr.predict(unknownAge.values[:, 1:])

    # Assign those predictions to the full data set
    df.loc[(df.Age.isnull()), 'Age'] = predictedAges

    return df
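Since the function returns the DataFrame, calling it on the full data set is then a one-liner:
df = setMissingAges(df)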
Kaggle Titanic Tutorial in Scikit-learn
Part I – Intro
Part II – Missing Values
Part III – Feature Engineering: Variable Transformations
Part IV – Feature Engineering: Derived Variables
Part V – Feature Engineering: Interaction Variables and Correlation
Part VI – Feature Engineering: Dimensionality Reduction w/ PCA
Part VII – Modeling: Random Forests and Feature Importance
Part VIII – Modeling: Hyperparameter Optimization
Part IX – Bias, Variance, and Learning Curves
Part X – Validation: ROC Curves
Part XI – Summary