Introduction

Neither the Titanic dataset nor scikit-learn is new to any data scientist, but scikit-learn offers some important features that make model preprocessing and tuning much easier. Specifically, this notebook will cover the following concepts:

  • ColumnTransformer
  • Pipeline
  • SimpleImputer
  • StandardScaler
  • OneHotEncoder
  • OrdinalEncoder
  • GridSearchCV

Note: this tutorial is a solution to the famous Kaggle competition Titanic - Machine Learning from Disaster.

Listing Input Files

import os
# Walk the Kaggle input directory and print every available data file
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.
/kaggle/input/titanic/gender_submission.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/train.csv

Import Packages

# Pandas for data handling
import pandas as pd
# Numpy for Numerical operations
import numpy as np
# Import ColumnTransformer
from sklearn.compose import ColumnTransformer
# Import Pipeline
from sklearn.pipeline import Pipeline
# Import SimpleImputer
from sklearn.impute import SimpleImputer
# Import StandardScaler, OneHotEncoder and OrdinalEncoder
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
# Import Random Forest for Classification
from sklearn.ensemble import RandomForestClassifier
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

Reading Data

In the following cells, we will read the train and test data and check for NaNs.

train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
# See some info
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

It's obvious that we have to deal with NaNs: the Age, Cabin, and Embarked columns all contain missing values.
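To see exactly how many values are missing in each column, here is a quick, purely optional check:

# Count missing values per column in the training data
print(train_data.isnull().sum())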

test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

Splitting Data

X_train = train_data.drop(['Survived', 'Name'], axis = 1)
X_test = test_data.drop(['Name'], axis = 1)
y_train = train_data['Survived']
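As an optional sanity check, we can confirm the shapes of the resulting frames:

# Quick sanity check on the resulting shapes
print(X_train.shape, y_train.shape, X_test.shape)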

Numerical features handling

It's clear that some numerical features have missing values that need to be imputed, and they also have to be brought onto the same scale.

In the following cell, we will handle the numerical features separately, i.e. "Age" and "Fare".

# Define a list with the numeric features
numeric_features = ['Age', 'Fare']
# Define a pipeline for numeric features
numeric_features_pipeline = Pipeline(steps= [
    ('imputer', SimpleImputer(strategy = 'median')), # Impute with median value for missing
    ('scaler', StandardScaler())                     # Conduct a scaling step
])
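As a standalone illustration (not part of the final workflow), we can fit this pipeline on just the numeric columns and peek at the imputed, scaled output:

# Illustration only: impute and scale the numeric columns, then inspect the first rows
numeric_demo = numeric_features_pipeline.fit_transform(X_train[numeric_features])
print(numeric_demo[:5])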

Categorical features handling

It's clear that some categorical features have missing values that need to be imputed, and they have to be encoded using one-hot encoding.

In the following cell, we will handle the categorical features separately, i.e. "Embarked" and "Sex".

Note: I chose SimpleImputer with a constant strategy to fill missing cells with the word 'missing'. The aim is to gather all missing cells into one category for further encoding.

# Define a list with the categorical features
categorical_features = ['Embarked', 'Sex']
# Define a pipeline for categorical features
categorical_features_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value = 'missing')), # Impute with the word 'missing' for missing values
    ('onehot', OneHotEncoder(handle_unknown = 'ignore'))     # Convert all categorical variables to one hot encoding
])
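Again purely as an illustration, fitting this pipeline on the two categorical columns shows the resulting one-hot matrix (returned as a sparse matrix by default):

# Illustration only: impute then one-hot encode the categorical columns
categorical_demo = categorical_features_pipeline.fit_transform(X_train[categorical_features])
print(categorical_demo.shape)  # rows x number of one-hot columns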

Ordinal features handling

Passenger class, or 'Pclass' for short, is an ordinal feature: its values have a meaningful order (1st class is above 2nd, which is above 3rd), so it should be encoded in a way that preserves that order rather than one-hot encoded.

# Define a list with the ordinal features
ordinal_features = ['Pclass']
# Define a pipeline for ordinal features
ordinal_features_pipeline = Pipeline(steps=[
    ('ordinal', OrdinalEncoder(categories= [[1, 2, 3]]))
])
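To see what the encoder does, here is a tiny standalone example (values chosen only for illustration): with categories=[[1, 2, 3]], Pclass values 1, 2 and 3 are mapped to 0.0, 1.0 and 2.0 respectively.

# Illustration only: OrdinalEncoder maps 1 -> 0.0, 2 -> 1.0, 3 -> 2.0
demo = OrdinalEncoder(categories=[[1, 2, 3]]).fit_transform(pd.DataFrame({'Pclass': [3, 1, 2]}))
print(demo.ravel())  # [2. 0. 1.]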

Construct a comprehensive preprocessor

Now, we will create a preprocessor that can handle all columns in our dataset using ColumnTransformer. Any column not listed in one of the transformers (e.g. Ticket and Cabin) is dropped by default (remainder='drop').

preprocessor = ColumnTransformer(transformers= [
    # transformer with name 'num' that will apply
    # 'numeric_features_pipeline' to numeric_features
    ('num', numeric_features_pipeline, numeric_features),
    # transformer with name 'cat' that will apply 
    # 'categorical_features_pipeline' to categorical_features
    ('cat', categorical_features_pipeline, categorical_features),
    # transformer with name 'ord' that will apply 
    # 'ordinal_features_pipeline' to ordinal_features
    ('ord', ordinal_features_pipeline, ordinal_features) 
    ])
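A quick, optional way to verify the preprocessor is wired correctly is to fit-transform the training data and look at the output shape: 2 scaled numeric columns, the one-hot columns for Embarked and Sex, and 1 ordinal column.

# Optional check: inspect the shape of the fully transformed training data
print(preprocessor.fit_transform(X_train).shape)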

Prediction Pipeline

Now, we will create a full prediction pipeline that applies our preprocessor and then passes the transformed data to our classifier of choice, a Random Forest.

clf = Pipeline(steps=[('preprocessor', preprocessor),
                     ('classifier', RandomForestClassifier(n_estimators = 120, max_leaf_nodes = 100))])

Pipeline Training

Let's train our pipeline now

clf.fit(X_train, y_train)
Pipeline(memory=None,
         steps=[('preprocessor',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('num',
                                                  Pipeline(memory=None,
                                                           steps=[('imputer',
                                                                   SimpleImputer(add_indicator=False,
                                                                                 copy=True,
                                                                                 fill_value=None,
                                                                                 missing_values=nan,
                                                                                 strategy='median',
                                                                                 verbose=0)),
                                                                  ('scaler',
                                                                   StandardScaler(copy=True,
                                                                                  with_mean...
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight=None, criterion='gini',
                                        max_depth=None, max_features='auto',
                                        max_leaf_nodes=100, max_samples=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        n_estimators=120, n_jobs=None,
                                        oob_score=False, random_state=None,
                                        verbose=0, warm_start=False))],
         verbose=False)
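Before tuning, it can be useful to estimate the baseline performance of this pipeline with cross-validation (a quick optional check; the exact numbers will vary between runs since no random_state is fixed):

from sklearn.model_selection import cross_val_score
# Baseline: 10-fold cross-validated accuracy of the untuned pipeline
scores = cross_val_score(clf, X_train, y_train, cv=10)
print('CV accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))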

Pipeline Tuning

The question now is: can we push it a little bit further? In other words, can we tune every single part of our Pipeline?

Here, I will use GridSearchCV to decide three things:

  • the SimpleImputer strategy (mean or median)
  • n_estimators of the Random Forest
  • max_leaf_nodes of the Random Forest

Note, you can access any parameter of any step by chaining the step names from the outer level to the inner one, separated by double underscores.

For example, to access the strategy of the SimpleImputer inside the numeric pipeline, the parameter name is preprocessor__num__imputer__strategy
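If you are unsure of the exact names, you can list every tunable parameter the pipeline exposes:

# List all parameter names of the full pipeline (useful when building a grid)
print(sorted(clf.get_params().keys()))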

Let's see this in action.

param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__n_estimators': [100, 120, 150, 170, 200],
    'classifier__max_leaf_nodes' : [100, 120, 150, 170, 200]
}

grid_search = GridSearchCV(clf, param_grid, cv=10)
grid_search.fit(X_train, y_train)
print(("best random forest from grid search: %.3f"
       % grid_search.score(X_train, y_train)))
print('The best parameters found by the grid search are:')
print(grid_search.best_params_)
best random forest from grid search: 0.944
The best parameters found by the grid search are:
{'classifier__max_leaf_nodes': 100, 'classifier__n_estimators': 150, 'preprocessor__num__imputer__strategy': 'median'}
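Note that the 0.944 printed above is accuracy measured on the training data the best model was refit on; the cross-validated score of the best parameter combination is available separately:

# Mean cross-validated accuracy of the best parameter combination
print('Best CV score: %.3f' % grid_search.best_score_)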

Generate Predictions

Let's generate predictions now using our grid search model and submit the results

predictions = grid_search.predict(X_test)
# Generate results dataframe
results_df = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
# Save to csv file
results_df.to_csv('submission.csv', index = False)
print('Submission CSV has been saved!')
Submission CSV has been saved!