In this notebook, I'm giving the classic Titanic Kaggle competition a shot. Note that some of my work here is inspired by the notebook made by Manav Sehgal: https://www.kaggle.com/startupsci/titanic-data-science-solutions/notebook.
For this project, I am adopting Aurelien Geron's end-to-end workflow from the book "Hands-On Machine Learning with Scikit-Learn and TensorFlow." The workflow is as follows:
1. Frame the problem and look at the big picture.
2. Get the data.
3. Explore the data to gain insights.
4. Prepare the data for machine learning algorithms.
5. Explore different models and shortlist the best ones.
6. Fine-tune the models and combine them.
7. Present the solution.
8. Launch, monitor, and maintain the system.
Of course, this project will mainly focus on steps 1-7, but if I have time, I will try to work on step 8, creating a live product of my model.
The goal of this project is to predict whether each passenger in the test dataset survived the Titanic disaster, using the features provided by the competition.
The main questions to answer are:
In this project, we will focus on predictive power rather than inference.
First, let us load the necessary libraries. In this attempt, I will use pandas/numpy for data manipulation, matplotlib/seaborn for data visualization, and sklearn to assist in data cleaning and model fitting.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from future_encoders import OneHotEncoder # future_encoders.py: standalone backport of the sklearn 0.20 encoders (shipped with Geron's handson-ml notebooks)
from sklearn.preprocessing import Imputer # deprecated in later sklearn versions in favor of SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
Let's import the data.
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
Time for some EDA! Let's take a first look at this data set...
train.head()
print(train.columns.values)
The label in this case is the column 'Survived', which takes the values 0 and 1: 0 for not surviving the disaster and 1 for surviving it. Note that this column is not included in the test dataframe.
Continuous: Age, Fare. Discrete: SibSp, Parch.
Categorical: Sex, Survived, Cabin, Embarked. Ordinal: Pclass. Other: Name.
From the results below, we can see that Age, Cabin, and Embarked have NaN values, which we will need to impute.
train.isna().any()
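To quantify this, we can also count the missing values per column:
train.isna().sum()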
Test and train have the same features (the test set only lacks the Survived label).
print(train.info())
print('Sex datatype:', type(train['Sex'][0]))
test.info()
Let's check the sizes of both sets and what fraction of the full Titanic population (~2,400 people) the training set represents.
print('Number of rows in Test:', len(test))
print('Number of rows in Train:', len(train))
print('Sample proportion from population (~2400):', len(train)/2400)
train.describe()
print('Fraction of elderly on board (65+): ', len(train[train['Age'] >= 65])/len(train))
print('Fraction who paid >= $100: ', len(train[train['Fare'] >= 100])/len(train))
print('Fraction in Pclass = 3: ', len(train[train['Pclass'] == 3])/len(train))
print('Fraction with siblings/spouses: ', len(train[train['SibSp'] > 0])/len(train))
print('Fraction with parents/children: ', len(train[train['Parch'] > 0])/len(train))
(Only training set)
train.describe(include=['O'])
Action Items.
In this case, we can create another categorical feature, 'Deck', indicating which deck of the ship a passenger's cabin was on (A, B, C, D, E, F), derived from Cabin. Note that if Cabin is NaN, we will also make Deck NaN; we will do the imputing later. A quick first pass is sketched below.
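Here is a minimal sketch of the idea, assuming the deck is simply the letter prefix of the cabin code (e.g. 'C85' -> deck 'C'), done on a copy so we don't alter train yet:
deck_preview = train.copy()
deck_preview['Deck'] = deck_preview['Cabin'].str[0] # .str[0] keeps NaN as NaN
deck_preview[['Cabin', 'Deck']].head()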
Now that we have done some high-level analysis of our features, we move on to see how these features correlate with each other, especially with the label/target feature.
First, let's make a correlation matrix to have a firsthand look at how each feature is related to the other.
train.corr()
Let us visualize this in a heat map to have a better picture of the correlations.
plt.matshow(train.corr())
In the heatmap above (with matplotlib's default colormap), the brighter the square, the more positively correlated the two features are; darker squares indicate more negative correlation.
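For a more readable view, here is a sketch using seaborn's heatmap, which labels the axes with the feature names and prints the coefficients:
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(train.corr(), annot=True, fmt='.2f', cmap='coolwarm', ax=ax) # diverging colormap: red positive, blue negative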
Let us also create a scatter matrix for good measure. We exclude Survived and PassengerId, as they are not very helpful here.
from pandas.plotting import scatter_matrix
attributes = ['Age','SibSp','Parch','Fare']
x = scatter_matrix(train[attributes], figsize=(12,8))
The insights are summarized below:
Action Items.
With the basic facts above and common intuition, we can start making some initial hypotheses of how we can predict survivability. These hypotheses will give us a better direction of what to explore and validate, focusing on more relevant features and insights.
We can hypothesize that these types of passengers have a higher probability of survival: children, women, wealthier passengers (higher class / higher fare), and mothers.
Let's now make some visualizations to explore our hypotheses.
Do more children survive?
First, let us see whether age strongly impacts survival. From the results below, we can see that more infants (< 5 years old) survived.
sns.set(style = 'darkgrid')
g = sns.FacetGrid(train, col = 'Survived')
g.map(plt.hist, 'Age', bins=20)
Do more women survive?
We can see that a larger proportion of females survived compared to males.
len(train[train['Sex'] == 'male'])
print('Proportion of males that survived: ', float(len(train[(train['Sex'] == 'male') & (train['Survived'] == 1)]))/len(train[train['Sex'] == 'male']))
print('Proportion of females that survived: ', float(len(train[(train['Sex'] == 'female') & (train['Survived'] == 1)]))/len(train[train['Sex'] == 'female']))
Do richer/higher-fare passengers survive more?
There are a few indicators of wealth in this dataset: Title, Pclass, and Fare. Let us explore these variables in our visualizations, starting with a quick look below.
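As a first sketch, the survival rate by passenger class (seaborn's barplot shows the mean of Survived per class, with bootstrapped confidence intervals):
sns.barplot(x='Pclass', y='Survived', data=train)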
Do mothers survive?
To determine whether mothers survive, we first need to derive a Title feature from the Name column. Let us quickly do so.
def create_title(df):
    # Extract the title from names formatted as 'Lastname, Title. Firstname'
    df['Title'] = [name.split(',')[1].split('.')[0].replace(' ', '') for name in df['Name']]
create_title(train)
train.head()
Let's look at top-level information about the Title feature.
train['Title'].describe()
train['Title'].unique() # unique values
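To see which titles are common and which are rare, we can also count them:
train['Title'].value_counts()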
Summarized from action items:
Now let us prepare the data for model fitting.
Here are the steps we will take in preprocessing the data:
1. Create new features: Family (Parch + SibSp) and Title (from Name).
2. Drop columns that carry no additional signal: Name, Ticket, PassengerId, Cabin.
3. Impute missing values (median for numerical columns, a constant for Embarked).
4. Scale the numerical columns.
5. One-hot encode the categorical columns.
Let's separate the data into y_train, X_train, and X_test.
y_train = train['Survived']
X_train = train.drop(['Survived'], axis = 1)
X_test = test
from sklearn.base import BaseEstimator, TransformerMixin #for creating custom classes
from sklearn.pipeline import Pipeline #for creating pipelines
from future_encoders import ColumnTransformer #to combine all pipelines
Then, we combine the train and test data so that our imputation, scaling, and encoding are uniform and will not cause any trouble when predicting.
full_df = pd.concat([X_train,X_test])
full_df.head()
def create_by_adding_two_features(df, new_feat, first_feat, second_feat):
    # Add a new feature as the sum of two existing features
    df[new_feat] = df[first_feat] + df[second_feat]
    return df
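For example, on a copy of the data (the full pipeline below applies this for real):
demo = create_by_adding_two_features(full_df.copy(), 'Family', 'Parch', 'SibSp')
demo[['SibSp', 'Parch', 'Family']].head()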
Next, let's restate the function that creates a Title feature from Name.
def create_title(df):
    # Extract the title from names formatted as 'Lastname, Title. Firstname'
    df['Title'] = [name.split(',')[1].split('.')[0].replace(' ', '') for name in df['Name']]
Finally, let's create Deck from Cabin name.
import re

def create_deck(df):
    # TODO: improve NaN handling
    df['Cabin'] = df['Cabin'].fillna('NA')
    deck = list()
    for cabin in df['Cabin']:
        if cabin != 'NA':
            # Strip the digits, keep the letter prefix (e.g. 'C85' -> 'C')
            deck.append(re.sub('[0-9]*', '', cabin).split(' ')[0])
        else:
            deck.append('NA')
    df['Deck'] = deck
    df = df.drop(['Cabin'], axis = 1)
    return df
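A quick look at what it produces, again on a copy, since Deck is left out of the pipeline for now:
create_deck(full_df.copy())['Deck'].value_counts()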
Next, we drop Name, Ticket, and PassengerId, as they do not give any additional information.
def drop_features(df, list_col):
    # Drop the given list of columns
    df = df.drop(list_col, axis = 1)
    return df
from sklearn.preprocessing import Imputer

# Impute numerical columns with the given strategy (e.g. median)
def impute_num(df, strategy, columns_to_impute):
    imputer = Imputer(strategy = strategy)
    for column in columns_to_impute:
        df[[column]] = imputer.fit_transform(df[[column]])
    return df
For now, we will use a simple custom imputer that fills Embarked with a constant value. (A better imputer could be used here; one option is sketched after the function.)
def impute_embarked(df, column, custom_value):
    # Fill NaNs in the given column with a constant value
    df[[column]] = df[[column]].fillna(custom_value)
    return df
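One possible improvement, sketched here, is to fill with the most frequent value computed from the data instead of a hardcoded constant:
def impute_mode(df, column):
    # Fill NaNs with the column's most frequent value (its mode)
    df[[column]] = df[[column]].fillna(df[column].mode()[0])
    return df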
We use sklearn's RobustScaler to scale the numerical data; it centers on the median and scales by the interquartile range, so it is less sensitive to outliers such as the very large fares.
from sklearn.preprocessing import RobustScaler

def scale_robust(df, columns_to_scale):
    # Scale each column by its median and interquartile range
    robust = RobustScaler()
    for column in columns_to_scale:
        df[[column]] = robust.fit_transform(df[[column]])
    return df
Next, we use pandas.get_dummies to one-hot encode these categorical variables.
def encode_dummies(df, columns_to_encode):
    # One-hot encode the given columns, prefixing new columns with the original name
    return pd.get_dummies(df, prefix = columns_to_encode, columns = columns_to_encode)
Now let's combine these functions into one full pipeline.
def full_pipeline(X):
    df = X.copy()
    num_cols = ['Age', 'Fare', 'Parch', 'SibSp', 'Family']
    cat_cols = ['Pclass', 'Sex', 'Embarked', 'Title']
    # Feature creation
    create_by_adding_two_features(df, 'Family', 'Parch', 'SibSp')
    create_title(df)
    # create_deck(df) -> future development
    # Drop (for now, drop Cabin as well)
    df = drop_features(df, ['Name', 'Ticket', 'PassengerId', 'Cabin'])
    # Impute
    impute_num(df, 'median', num_cols)
    impute_embarked(df, 'Embarked', 'S')
    # Scale
    scale_robust(df, num_cols)
    # Encode
    df = encode_dummies(df, cat_cols)
    return df
Now we can easily preprocess our data with a single function.
X_clean = full_pipeline(full_df)
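As a quick sanity check, we can inspect the shape and first rows of the preprocessed data:
print(X_clean.shape)
X_clean.head()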
Then we split the train and test data back to X_train and X_test.
X_train_clean = X_clean.iloc[:len(train)]
X_test_clean = X_clean.iloc[len(train):]
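And verify that the row counts still line up with the original sets:
print(len(X_train_clean) == len(train), len(X_test_clean) == len(test))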
Let us now try out a classification model. (Cross-validating and comparing different models is noted as future work below.)
# Let's try out a Stochastic Gradient Descent (SGD) classifier
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train_clean, y_train)
predicted = sgd_clf.predict(X_test_clean)
FUTURE WORK: cross-validation, trying other models, and ensembling. A sketch of the cross-validation step follows.
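As a sketch of that first item, sklearn's cross_val_score gives k-fold accuracy on the training set:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(sgd_clf, X_train_clean, y_train, cv=3, scoring='accuracy')
print('CV accuracy per fold:', scores)
print('Mean CV accuracy:', scores.mean())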
Finally, we export the results as a csv to submit to the Kaggle site. Currently, accuracy on the test set is about 74%, which isn't too bad, but not good enough.
passenger_id = test['PassengerId'] # equivalently, np.array(range(892, 1310))
pd.DataFrame({'PassengerId': passenger_id, 'Survived': predicted}).to_csv('Prediction.csv', index=False) # index=False so the file has exactly the two columns Kaggle expects