In the previous post, we began taking a look at how to convert the raw data into features that can be used by the Random Forest model. Any variable that is generated from one or more existing variables is called a “derived” variable. We’ve discussed basic transformations that result in useful derived variables, and in this post we’ll look at some more interesting derived variables that aren’t simple transformations.
An important aspect of feature engineering is using insight and creativity to find new features to feed the model. You’ll read this over and over again, and it really can’t be emphasized enough – feature engineering is a hugely important part of the data science pipeline and is where you should spend the most time and effort. The basic transformations and interaction variables that we can automate (more on that later) don’t take too much time, so that leaves us with efforts to creatively find new variables from the raw data.
Very basic examples of useful derived variables might be pulling the country code and/or area code out of telephone numbers, or extracting country/state/city from GPS coordinates. Any time a qualitative variable represents an object in the world that we know something about, there is an opportunity to derive variables from it. A data set that represents a time series or other historical behavioral information can also provide a great opportunity for uncovering derived variables.
The Titanic data set is very simple and doesn’t really have a LOT to work with, but there are some text fields that provide us a few opportunities.
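As a rough illustration of the telephone example, here is a minimal sketch that pulls a country code and area code out of a phone number. It assumes numbers are written in a hypothetical `+<country> (<area>) <local>` format; the column names, the `split_phone` helper, and the sample values are all made up for illustration, and real-world numbers would need far more careful parsing.

```python
import re

import pandas as pd

# Made-up sample data; numbers are assumed to look like "+<country> (<area>) <local>"
phones = pd.DataFrame({'Phone': ['+1 (415) 555-0100', '+44 (20) 7946-0000', '555-0199']})

PHONE_PATTERN = re.compile(r'^\+(\d+) \((\d+)\)')

def split_phone(number):
    """Return (country_code, area_code), or ('U', 'U') if the format is not recognized."""
    match = PHONE_PATTERN.search(number)
    if match:
        return match.group(1), match.group(2)
    return 'U', 'U'

# Derive two new categorical variables from the single raw text field
parts = phones['Phone'].map(split_phone)
phones['CountryCode'] = parts.map(lambda p: p[0])
phones['AreaCode'] = parts.map(lambda p: p[1])
```

Using 'U' for unrecognized values mirrors the "unknown" placeholder used for missing cabins and ticket prefixes later in this post.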
Name
The Name variable is useless on its own, but provides us the most to work with. Two obvious opportunities are:
Names – perhaps having more (or fewer) names indicates something about your status, which would affect your ability to get on a lifeboat?
import re

# How many different names do they have?
df['Names'] = df['Name'].map(lambda x: len(re.split(' ', x)))
Title – How you are addressed can definitely indicate status (and gender), which had some influence on getting on a lifeboat.
# What is each person's title? (the text between ", " and the first ".")
df['Title'] = df['Name'].map(lambda x: re.compile(r", (.*?)\.").findall(x)[0])

# Group low-occurring, related titles together
df.loc[df.Title == 'Jonkheer', 'Title'] = 'Master'
df.loc[df.Title.isin(['Ms', 'Mlle']), 'Title'] = 'Miss'
df.loc[df.Title == 'Mme', 'Title'] = 'Mrs'
df.loc[df.Title.isin(['Capt', 'Don', 'Major', 'Col', 'Sir']), 'Title'] = 'Sir'
df.loc[df.Title.isin(['Dona', 'Lady', 'the Countess']), 'Title'] = 'Lady'

# Build binary features
df = pd.concat([df, pd.get_dummies(df['Title']).rename(columns=lambda x: 'Title_' + str(x))], axis=1)
FamilyID – A great example of using creativity to tie together several variables: Trevor Stephens created a really interesting derived variable by identifying family members from last name and total family size. It’s in R and I decided not to duplicate it here, but it’s definitely worth a look.
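For readers who want to stay in Python, the general idea (surname plus family size) might look roughly like the sketch below. This is not a port of the R code: the `add_family_id` helper, the `Small` bucket, and its cutoff of 2 are assumptions made for illustration, and family size is taken to be SibSp + Parch + 1. The sample rows are made up.

```python
import pandas as pd

# Made-up rows using the Titanic column names (Name, SibSp, Parch)
sample = pd.DataFrame({
    'Name': ['Smith, Mr. John', 'Smith, Mrs. John (Mary Jones)',
             'Smith, Miss. Anna', 'Brown, Mr. Peter'],
    'SibSp': [1, 1, 0, 0],
    'Parch': [1, 1, 2, 0],
})

def add_family_id(frame, small_family_max=2):
    """Hypothetical helper: label each passenger with '<family size>_<surname>'."""
    out = frame.copy()
    # The surname is everything before the first comma in the Name field
    out['Surname'] = out['Name'].map(lambda x: x.split(',')[0].strip())
    # Siblings/spouses + parents/children + the passenger themselves
    out['FamilySize'] = out['SibSp'] + out['Parch'] + 1
    out['FamilyID'] = out['FamilySize'].astype(str) + '_' + out['Surname']
    # Lump small groups together so common surnames don't create fake families
    out.loc[out['FamilySize'] <= small_family_max, 'FamilyID'] = 'Small'
    return out

result = add_family_id(sample)
```

Like the other categorical variables in this post, the resulting FamilyID could then be factorized or turned into binary features before being fed to the model.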
Cabin
Not a lot to do here, but a little research into the deck plans (or a little common sense) indicates that the letter in the cabin variable is the deck, and the number is the room number. The room numbers increased towards the back of the boat, so perhaps that provides some useful measure of location. Additionally, different decks also provide some information on location as well as socioeconomic status, again valuable in determining who gets on the lifeboats.
# Replace missing values with "U0"
df.loc[df.Cabin.isnull(), 'Cabin'] = 'U0'

# Create a feature for the deck
df['Deck'] = df['Cabin'].map(lambda x: re.compile("([a-zA-Z]+)").search(x).group())
df['Deck'] = pd.factorize(df['Deck'])[0]

# Create binary features for each deck
decks = pd.get_dummies(df['Deck']).rename(columns=lambda x: 'Deck_' + str(x))
df = pd.concat([df, decks], axis=1)

# Create feature for the room number; cabins without a number
# (a bare deck letter) fall back to "0" instead of raising an error
def getRoomNumber(cabin):
    match = re.compile("([0-9]+)").search(cabin)
    return match.group() if match else '0'

df['Room'] = df['Cabin'].map(getRoomNumber).astype(int) + 1
Ticket
This variable is clearly ripe for extracting information, but it’s not immediately clear what the values mean. Some quick googling didn’t turn up any information on decoding the values, so we’ll have to make some guesses. After sorting all the values and examining them, a few things give us some clues:
- About a quarter of the tickets have an alphanumeric prefix while the rest consist only of a number
- There are 45 distinct prefixes initially. If we remove ‘.’ and ‘/’ characters (which appear to be superfluous) and make a few other adjustments that number drops to 29.
- The number part of the value seems to have some loose correlations – numbers starting with 1 are usually first class tickets, 2 usually second, and 3 third. I say usually because it holds for a majority of examples but not all. There are also ticket numbers starting with 4-9, and those are rare and almost exclusively third class.
- I can’t seem to notice any pattern to whether the ticket number is a 4, 5, or 6-digit number, but that may provide some amount of information as well.
- Several people can share a ticket number. This could be used to create another feature very similar to the FamilyID, except this would also cover situations like nannies or close friends, who would probably act like a family unit but aren’t captured by the FamilyID.
Here’s the code:
def processTicket():
    global df

    # extract and massage the ticket prefix
    df['TicketPrefix'] = df['Ticket'].map(lambda x: getTicketPrefix(x.upper()))
    df['TicketPrefix'] = df['TicketPrefix'].map(lambda x: re.sub('[./]', '', x))
    df['TicketPrefix'] = df['TicketPrefix'].map(lambda x: re.sub('STON', 'SOTON', x))

    # create binary features for each prefix
    prefixes = pd.get_dummies(df['TicketPrefix']).rename(columns=lambda x: 'TicketPrefix_' + str(x))
    df = pd.concat([df, prefixes], axis=1)

    # factorize the prefix to create a numerical categorical variable
    df['TicketPrefixId'] = pd.factorize(df['TicketPrefix'])[0]

    # extract the ticket number
    df['TicketNumber'] = df['Ticket'].map(lambda x: getTicketNumber(x))

    # create a feature for the number of digits in the ticket number
    df['TicketNumberDigits'] = df['TicketNumber'].map(lambda x: len(x)).astype(int)

    # create a feature for the starting number of the ticket number
    df['TicketNumberStart'] = df['TicketNumber'].map(lambda x: x[0:1]).astype(int)

    # The prefix and (probably) number themselves aren't useful
    df.drop(['TicketPrefix', 'TicketNumber'], axis=1, inplace=True)


def getTicketPrefix(ticket):
    # leading run of letters, '.' and '/'; 'U' (unknown) if the ticket is number-only
    match = re.compile("([a-zA-Z./]+)").search(ticket)
    if match:
        return match.group()
    else:
        return 'U'


def getTicketNumber(ticket):
    # trailing run of digits; '0' if the ticket has no number part
    match = re.compile(r"(\d+$)").search(ticket)
    if match:
        return match.group()
    else:
        return '0'
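The shared-ticket observation from the list above isn’t covered by that code, but it could be turned into a feature along these lines. This is only a sketch on made-up rows; `TicketGroupSize` and `TicketGroupId` are names I’m introducing here for illustration.

```python
import pandas as pd

# Made-up rows; only the Ticket column matters for this sketch
sample = pd.DataFrame({'Ticket': ['PC 17599', '113803', '113803', 'A/5 21171', '113803']})

# How many passengers travel on the same ticket?
sample['TicketGroupSize'] = sample.groupby('Ticket')['Ticket'].transform('count')

# A numerical ID per distinct ticket, analogous to a FamilyID
sample['TicketGroupId'] = pd.factorize(sample['Ticket'])[0]
```

A group size of 1 means the passenger holds the only copy of that ticket, so it may be worth lumping those together much like small families.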
In the next post, we’ll take a look at automatically generating interaction variables and then testing them to remove redundant values.
Kaggle Titanic Tutorial in Scikit-learn
Part I – Intro
Part II – Missing Values
Part III – Feature Engineering: Variable Transformations
Part IV – Feature Engineering: Derived Variables
Part V – Feature Engineering: Interaction Variables and Correlation
Part VI – Feature Engineering: Dimensionality Reduction w/ PCA
Part VII – Modeling: Feature Importance
Part VIII – Modeling: Hyperparameter Optimization
Part IX – Bias, Variance, and Learning Curves
Part X – Validation: ROC Curves
Part XI – Summary