Titanic | The Power of Sklearn
Scikit-learn is one of the most powerful ML libraries out there, but do you really use it to the fullest? In this notebook, we will investigate deeper concepts such as ColumnTransformers, Pipelines, and much more.
- Introduction
- Mounting Filesystem
- Import Packages
- Reading Data
- Splitting Data
- Continuous and Numerical features handling
- Categorical features handling
- Ordinal features handling
- Construct a comprehensive preprocessor
- Prediction Pipeline
- Pipeline Training
- Pipeline Tuning
- Generate Predictions
Introduction
Neither the Titanic dataset nor sklearn is new to any data scientist, but there are some important features in scikit-learn that make model preprocessing and tuning much easier. Specifically, this notebook will cover the following concepts:
- ColumnTransformer
- Pipeline
- SimpleImputer
- StandardScaler
- OneHotEncoder
- OrdinalEncoder
- GridSearchCV
Note: this tutorial is a solution to the famous Kaggle competition Titanic - Machine Learning from Disaster.
Mounting Filesystem
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# Any results you write to the current directory are saved as output.
Import Packages
# Pandas for data manipulation
import pandas as pd
# Numpy for Numerical operations
import numpy as np
# Import ColumnTransformer
from sklearn.compose import ColumnTransformer
# Import Pipeline
from sklearn.pipeline import Pipeline
# Import SimpleImputer
from sklearn.impute import SimpleImputer
# Import StandardScaler, OneHotEncoder and OrdinalEncoder
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
# Import Random Forest for Classification
from sklearn.ensemble import RandomForestClassifier
# Import GridSearchCV for hyperparameter tuning
from sklearn.model_selection import GridSearchCV
Reading Data
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
# Inspect dtypes and missing-value counts
train_data.info()
The output shows that we have to deal with NaNs: Age, Cabin, and Embarked all have missing values.
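For a quick look at where the gaps are, we can count the missing values per column (a small check, not part of the original pipeline):
# Count missing values in each column of the training set
print(train_data.isnull().sum())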
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
test_data.info()
Splitting Data
X_train = train_data.drop(['Survived', 'Name'], axis = 1)
X_test = test_data.drop(['Name'], axis = 1)
y_train = train_data['Survived']
Continuous and Numerical features handling
# Define a list with the numeric features
numeric_features = ['Age', 'Fare']
# Define a pipeline for numeric features
numeric_features_pipeline = Pipeline(steps= [
('imputer', SimpleImputer(strategy = 'median')), # Impute with median value for missing
('scaler', StandardScaler()) # Conduct a scaling step
])
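As a quick sanity check (not part of the original notebook), we can fit this pipeline on the numeric columns alone and confirm the output is imputed and standardized:
# Fit the numeric pipeline on the two numeric columns and transform them
age_fare_scaled = numeric_features_pipeline.fit_transform(X_train[numeric_features])
# After imputation and scaling, each column should have mean ~0 and std ~1
print(age_fare_scaled.mean(axis=0).round(2), age_fare_scaled.std(axis=0).round(2))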
Categorical features handling
It's clear that we have some categorical features with missing values to be imputed, and they have to be encoded using one-hot encoding.
In the following cell, we will handle the categorical features separately, i.e. "Embarked" and "Sex".
Note: I chose SimpleImputer with a constant fill value of 'missing'. My aim was to gather all missing cells into one category for further encoding.
# Define a list with the categorical features
categorical_features = ['Embarked', 'Sex']
# Define a pipeline for categorical features
categorical_features_pipeline = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value = 'missing')), # Impute with the word 'missing' for missing values
('onehot', OneHotEncoder(handle_unknown = 'ignore')) # Convert all categorical variables to one hot encoding
])
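To see exactly which one-hot columns this produces, we can fit the pipeline and list the generated feature names (a sketch assuming scikit-learn >= 1.0, where get_feature_names_out is available):
# Fit on the categorical columns and list the generated one-hot columns
categorical_features_pipeline.fit(X_train[categorical_features])
print(categorical_features_pipeline.named_steps['onehot']
      .get_feature_names_out(categorical_features))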
Ordinal features handling
"Pclass" is ordinal rather than nominal: first class ranks above second, which ranks above third, so we encode it with OrdinalEncoder instead of one-hot encoding.
# Define a list with the ordinal features
ordinal_features = ['Pclass']
# Define a pipeline for ordinal features
ordinal_features_pipeline = Pipeline(steps=[
('ordinal', OrdinalEncoder(categories= [[1, 2, 3]]))
])
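To make the mapping concrete, here is a tiny standalone example (illustrative only): passing categories=[[1, 2, 3]] fixes the category order, so class 1 maps to 0.0, class 2 to 1.0, and class 3 to 2.0.
# Illustrative example: OrdinalEncoder maps the fixed category order to 0, 1, 2
demo = pd.DataFrame({'Pclass': [3, 1, 2]})
print(ordinal_features_pipeline.fit_transform(demo))  # [[2.], [0.], [1.]]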
Construct a comprehensive preprocessor
Now we combine the three pipelines into a single ColumnTransformer that applies each one to its corresponding columns.
preprocessor = ColumnTransformer(transformers= [
# transformer with name 'num' that will apply
# 'numeric_features_pipeline' to numeric_features
('num', numeric_features_pipeline, numeric_features),
# transformer with name 'cat' that will apply
# 'categorical_features_pipeline' to categorical_features
('cat', categorical_features_pipeline, categorical_features),
# transformer with name 'ord' that will apply
# 'ordinal_features_pipeline' to ordinal_features
('ord', ordinal_features_pipeline, ordinal_features)
])
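A quick check (not in the original notebook) that the ColumnTransformer stitches everything together: the transformed matrix should contain the 2 scaled numeric columns, the one-hot columns for Embarked and Sex, and 1 ordinal column.
# Fit the full preprocessor and inspect the resulting feature matrix
X_train_prepared = preprocessor.fit_transform(X_train)
print(X_train_prepared.shape)  # (891, n_features); the column count depends on the categories found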
Prediction Pipeline
Finally, we chain the preprocessor and a RandomForestClassifier into a single end-to-end pipeline: calling fit on it preprocesses the data and trains the model in one step.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier(n_estimators = 120, max_leaf_nodes = 100))])
Pipeline Training
clf.fit(X_train, y_train)
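Before tuning, it can be useful to get a cross-validated estimate of the untuned pipeline's accuracy (a sketch using cross_val_score, not part of the original notebook):
from sklearn.model_selection import cross_val_score
# 10-fold cross-validated accuracy of the untuned pipeline
scores = cross_val_score(clf, X_train, y_train, cv=10)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))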
Pipeline Tuning
The question now: can we push it a little bit further? i.e., can we tune every single part of our Pipeline?
Here, I will use GridSearchCV to decide three things:
- the SimpleImputer strategy: mean or median
- n_estimators of the Random Forest
- max_leaf_nodes of the Random Forest
Note: you can access any parameter by chaining the step names from the outer level down to the inner one, joined by double underscores.
For example, to access the strategy of the SimpleImputer you write preprocessor__num__imputer__strategy.
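If you are unsure of a parameter's exact path, the pipeline itself can list every tunable name (a small helper snippet, not in the original):
# List the parameter paths the grid search can reach, e.g.
# 'preprocessor__num__imputer__strategy' or 'classifier__n_estimators'
for name in clf.get_params():
    if name.endswith('strategy') or name.startswith('classifier__'):
        print(name)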
Let's see this in action:
param_grid = {
'preprocessor__num__imputer__strategy': ['mean', 'median'],
'classifier__n_estimators': [100, 120, 150, 170, 200],
'classifier__max_leaf_nodes' : [100, 120, 150, 170, 200]
}
grid_search = GridSearchCV(clf, param_grid, cv=10)
grid_search.fit(X_train, y_train)
print(("best random forest from grid search: %.3f"
% grid_search.score(X_train, y_train)))
print('The best parameters found by the grid search are:')
print(grid_search.best_params_)
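Note that grid_search.score(X_train, y_train) above is training accuracy, which is optimistic for a random forest; the mean cross-validated score of the best parameter combination is the more honest number:
# Mean 10-fold CV accuracy achieved by the best parameter combination
print("best cross-validated accuracy: %.3f" % grid_search.best_score_)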
Generate Predictions
predictions = grid_search.predict(X_test)
# Generate results dataframe
results_df = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
# Save to csv file
results_df.to_csv('submission.csv', index = False)
print('Submission CSV has been saved!')