In the last post we covered some ways to derive variables from string fields using intuition and insight. This time we’ll cover derived variables that are a lot easier to generate.
Interaction variables capture the effects of relationships between variables. They are constructed by performing mathematical operations on sets of features. The simple approach we use in this example is to apply the basic arithmetic operators (add, subtract, multiply, divide) to each pair of numerical features. We could also get much more involved and include more than two features in each calculation, and/or use other operators (sqrt, ln, trig functions, etc.); a rough sketch of that extension follows the code below.
numerics = df.loc[:, ['Age_scaled', 'Fare_scaled', 'Pclass_scaled', 'Parch_scaled', 'SibSp_scaled',
                      'Names_scaled', 'CabinNumber_scaled', 'Age_bin_id_scaled', 'Fare_bin_id_scaled']]

# for each pair of variables, determine which mathematical operators to use based on redundancy
for i in range(0, numerics.columns.size-1):
    for j in range(0, numerics.columns.size-1):
        col1 = str(numerics.columns.values[i])
        col2 = str(numerics.columns.values[j])

        # multiply fields together (we allow values to be squared)
        if i <= j:
            name = col1 + "*" + col2
            df = pd.concat([df, pd.Series(numerics.iloc[:,i] * numerics.iloc[:,j], name=name)], axis=1)

        # add fields together
        if i < j:
            name = col1 + "+" + col2
            df = pd.concat([df, pd.Series(numerics.iloc[:,i] + numerics.iloc[:,j], name=name)], axis=1)

        # divide and subtract fields from each other
        if not i == j:
            name = col1 + "/" + col2
            df = pd.concat([df, pd.Series(numerics.iloc[:,i] / numerics.iloc[:,j], name=name)], axis=1)
            name = col1 + "-" + col2
            df = pd.concat([df, pd.Series(numerics.iloc[:,i] - numerics.iloc[:,j], name=name)], axis=1)
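As an aside, here is a minimal sketch (not part of the tutorial's pipeline) of how the same idea could be extended with the unary operators mentioned above, such as sqrt and ln. It assumes the same numerics and df DataFrames as the loop above, with numpy and pandas imported as np and pd as in earlier parts; the absolute value and log1p are used because the scaled features can be zero or negative.

# sketch: add unary transforms (sqrt, ln) of each numeric feature
# abs() and log1p are used because the scaled features can be zero or negative
for col in numerics.columns:
    df = pd.concat([df, pd.Series(np.sqrt(numerics[col].abs()), name=col + "_sqrt")], axis=1)
    df = pd.concat([df, pd.Series(np.log1p(numerics[col].abs()), name=col + "_ln")], axis=1)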
This process of automated feature generation can quickly produce a LOT of new variables. In our case, we use 9 features to generate 176 new interaction features. In a larger data set with dozens or hundreds of numeric features, this process can produce an overwhelming number of new interactions. Some types of models are very good at handling huge feature sets (I’ve heard of thousands to millions), and that capability becomes necessary in such a case.
It’s very likely that some of the new interaction variables will be highly correlated with one of their original variables, or with other interactions, which can be a problem, especially for linear models. Highly correlated variables can cause an issue called “multicollinearity”. There is a lot of information out there about how to identify, deal with, and safely ignore multicollinearity in a data set, so I’ll avoid an explanation here, but I’ve included some great links at the bottom of this post if you’re interested.
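If you’d like a quick numeric check before digging into those links, one common diagnostic (not used anywhere in this tutorial) is the variance inflation factor from statsmodels. Below is a minimal sketch, run against the base numeric features only as a demonstration; it assumes the numerics DataFrame from above, no remaining missing values, and that statsmodels is installed.

# sketch: variance inflation factor (VIF) for each base numeric feature
# (assumes statsmodels is available; values well above ~10 are a common
# warning sign of multicollinearity)
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = numerics.values
for i, col in enumerate(numerics.columns):
    print col, variance_inflation_factor(X, i)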
In our solution for the Titanic challenge, I don’t believe multicollinearity is a problem, since Random Forests are not linear models. Removing highly correlated features is a good idea anyway, though, if for no other reason than to improve performance. We’ll identify and remove highly correlated features using Spearman’s rank correlation coefficient, but you could certainly experiment with other measures, such as the Pearson product-moment correlation coefficient.
# calculate the correlation matrix (ignore survived and passenger id fields)
df_corr = df.drop(['Survived', 'PassengerId'], axis=1).corr(method='spearman')

# create a mask to ignore self-correlation (the diagonal of the matrix)
mask = np.ones(df_corr.columns.size) - np.eye(df_corr.columns.size)
df_corr = mask * df_corr

drops = []
# loop through each variable
for col in df_corr.columns.values:
    # if we've already determined to drop the current variable, continue
    if np.in1d([col], drops):
        continue
    # find all the variables that are highly correlated with the current variable
    # and add them to the drop list
    corr = df_corr[abs(df_corr[col]) > 0.98].index
    drops = np.union1d(drops, corr)

print "\nDropping", drops.shape[0], "highly correlated features...\n", drops
df.drop(drops, axis=1, inplace=True)
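As an optional sanity check, here is a short snippet (an addition to the code above, under the same assumptions) that confirms no pair of the remaining features still exceeds the 0.98 threshold.

# sanity check: the largest remaining off-diagonal correlation should now be below 0.98
df_corr_check = df.drop(['Survived', 'PassengerId'], axis=1).corr(method='spearman')
mask = np.ones(df_corr_check.columns.size) - np.eye(df_corr_check.columns.size)
print "Max remaining correlation:", np.nanmax(np.abs((mask * df_corr_check).values))

Swapping method='spearman' for method='pearson' in either snippet is all it takes to try the Pearson alternative mentioned earlier.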
If you’re interested in learning more about multicollinearity, these are some excellent posts worth checking out:
- When Can You Safely Ignore Multicollinearity?
- What Are the Effects of Multicollinearity and When Can I Ignore Them?
- Enough Is Enough! Handling Multicollinearity in Regression Analysis
In the next post, we’ll take a look at dimensionality reduction using principal component analysis (PCA).
Kaggle Titanic Tutorial in Scikit-learn
Part I – Intro
Part II – Missing Values
Part III – Feature Engineering: Variable Transformations
Part IV – Feature Engineering: Derived Variables
Part V – Feature Engineering: Interaction Variables and Correlation
Part VI – Feature Engineering: Dimensionality Reduction w/ PCA
Part VII – Modeling: Random Forests and Feature Importance
Part VIII – Modeling: Hyperparameter Optimization
Part IX – Bias, Variance, and Learning Curves
Part X – Validation: ROC Curves
Part XI – Summary