Post

캐글 타이타닉으로 어디까지 갈 수 있을까 - 기본편

2024/03/26: 초안 작성일; 머리말, 요약, 본문 코드, 이미지 완성(상세 설명 제외)
2024/04/17: 이미지 오류 수정
2024/04/21: 이미지 오류 수정 완료


목차

머리말

  • 이 글은 캐글에서 가장 유명, 입문용인 Titanic - Machine Learning from Disaster 제출한 것을 정리한 글
  • 캐글 타이타닉으로 어디까지 깊게 공부할 수 있을지 끝까지 가볼 예정
  • 입문용이라지만 깊게 들어갈수록 공부할 내용은 많음.
  • 데이터셋에 대한 이해를 바탕으로 공부한 내용을 최대한 확장하고 심화할 것
  • 사용하고 있는 노트북의 성능 한계로 Web based JupyterLab을 활용함.

이번 공부의 목적

  • 단순 필사가 아닌, 문제가 생기면 스스로 해결
  • 라이브러리들의 Documentation를 하나씩 확인하며 파라미터들을 이해하고, 사용법을 정확히 익히기.
  • EDA, 전처리, 모델링의 아이디어, 논리, 기법 이해 및 숙달
  • 아무 것도 참고하지 않고 스스로 작성할 수 있을 정도로 반복

향후 정리하고 확장할 사항

  • Kaggle - Titanic (심화)에선 유명한 Tutorials 등을 참고하여 Null 값 처리, String → Numerical, Binning, Modeling 등 세부 방법들을 비교, List화하고, 정리
  • 모델링에 사용한 sklearn의 RandomForestClassifier 소스코드(위치 확인 완료)를 분석하여 어떤 로직으로 구현하는지 분석
  • 각 메서드의 kwargs들을 익히기 위해 라이브러리들의 기본이 되는 메서드, 파라미터들을 익히고 정리
  • ipynb 파일을 효율적으로 Markdown이나 HTML로 Export, 변환하는 방법, 자동화 등 모색

요약

Feature들의 피어슨 상관계수 Heatmap

png


Feature Importance순으로 Horizontal Bar

png


EDA와 Feature Engineering

  • Feature로 Fare, Age, Sex, Name Pronouns(Initial), Passenger Class (Pclass), Family Size(Family Size), Embaked를 사용함.
  • Fare는 Log를 취하고, Initial와 Embaked String을 Numeric으로, Age는 Binning하여 Numeric으로, Family Size는 Sibling, Spouse, Parent, Child 데이터를 Passenger ID를 기준으로 계산


Random Forest Classifier로 만든 Classification 모델의 성능

Classification Report

 precisionrecallf1-scoresupport
00.850.880.87172
10.780.720.7596
accuracy  0.82268
macro avg0.810.80.81268
weighted avg0.820.820.82268


본문

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
import piplite

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

await piplite.install('seaborn')
import seaborn as sns

await piplite.install('plotly')
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

plt.style.use('seaborn')
sns.set_theme(font_scale=2.5)
import jinja2

await piplite.install('missingno')
import missingno as msno

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
%config Completer.use_jedi = False
1
2
df_train = pd.read_csv('../data/titanic/train.csv')
df_test = pd.read_csv('../data/titanic/test.csv')
1
df_train.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
1
df_train.describe()
PassengerIdSurvivedPclassAgeSibSpParchFare
count891.000000891.000000891.000000714.000000891.000000891.000000891.000000
mean446.0000000.3838382.30864229.6991180.5230080.38159432.204208
std257.3538420.4865920.83607114.5264971.1027430.80605749.693429
min1.0000000.0000001.0000000.4200000.0000000.0000000.000000
25%223.5000000.0000002.00000020.1250000.0000000.0000007.910400
50%446.0000000.0000003.00000028.0000000.0000000.00000014.454200
75%668.5000001.0000003.00000038.0000001.0000000.00000031.000000
max891.0000001.0000003.00000080.0000008.0000006.000000512.329200
1
df_test.describe()
PassengerIdPclassAgeSibSpParchFare
count418.000000418.000000332.000000418.000000418.000000417.000000
mean1100.5000002.26555030.2725900.4473680.39234435.627188
std120.8104580.84183814.1812090.8967600.98142955.907576
min892.0000001.0000000.1700000.0000000.0000000.000000
25%996.2500001.00000021.0000000.0000000.0000007.895800
50%1100.5000003.00000027.0000000.0000000.00000014.454200
75%1204.7500003.00000039.0000001.0000000.00000031.500000
max1309.0000003.00000076.0000008.0000009.000000512.329200
1
2
3
for col in df_train.columns:
    msg = 'column: {:>10}\t Percent of NaN value: {:.2f}%'.format(col, 100 * (df_train[col].isnull().sum() / df_train[col].shape[0]))
    print(msg)
1
2
3
4
5
6
7
8
9
10
11
12
column: PassengerId	 Percent of NaN value: 0.00%
column:   Survived	 Percent of NaN value: 0.00%
column:     Pclass	 Percent of NaN value: 0.00%
column:       Name	 Percent of NaN value: 0.00%
column:        Sex	 Percent of NaN value: 0.00%
column:        Age	 Percent of NaN value: 19.87%
column:      SibSp	 Percent of NaN value: 0.00%
column:      Parch	 Percent of NaN value: 0.00%
column:     Ticket	 Percent of NaN value: 0.00%
column:       Fare	 Percent of NaN value: 0.00%
column:      Cabin	 Percent of NaN value: 77.10%
column:   Embarked	 Percent of NaN value: 0.22%
1
2
3
for col in df_test.columns:
    msg = 'column: {:>10}\t Percent of NaN value: {:.2f}%'.format(col, 100 * (df_test[col].isnull().sum() / df_test[col].shape[0]))
    print(msg)
1
2
3
4
5
6
7
8
9
10
11
column: PassengerId	 Percent of NaN value: 0.00%
column:     Pclass	 Percent of NaN value: 0.00%
column:       Name	 Percent of NaN value: 0.00%
column:        Sex	 Percent of NaN value: 0.00%
column:        Age	 Percent of NaN value: 20.57%
column:      SibSp	 Percent of NaN value: 0.00%
column:      Parch	 Percent of NaN value: 0.00%
column:     Ticket	 Percent of NaN value: 0.00%
column:       Fare	 Percent of NaN value: 0.24%
column:      Cabin	 Percent of NaN value: 78.23%
column:   Embarked	 Percent of NaN value: 0.00%
1
msno.matrix(df=df_train.iloc[:, :], figsize=(8, 8), color=(0.8, 0.5, 0.2))
1
<AxesSubplot:>

png

1
msno.bar(df=df_train.iloc[:, :], figsize=(8, 8), color=(0.8, 0.5, 0.2))
1
<AxesSubplot:>

png

1
msno.bar(df=df_test.iloc[:, :], figsize=(8, 8), color=(0.8, 0.5, 0.2))
1
<AxesSubplot:>

png

1
2
3
4
5
6
7
8
9
fig, ax = plt.subplots(1, 2, figsize=(18, 8))

df_train['Survived'].value_counts().plot.pie(explode=[0, 0.1], autopct='%1.1f%%', ax=ax[0], shadow=True)
ax[0].set_title('Pie Plot - Survived')
ax[0].set_ylabel('')
sns.countplot(x='Survived', data=df_train, ax=ax[1])
ax[1].set_title('Count Plot - Survived')

plt.show()

png

1
df_train[['Pclass', 'Survived']].groupby(['Pclass']).count()
Survived
Pclass
1216
2184
3491
1
df_train[['Pclass', 'Survived']].groupby(['Pclass']).sum()
Survived
Pclass
1136
287
3119
1
pd.crosstab(df_train['Pclass'], df_train['Survived'], margins=True).style.background_gradient(cmap='summer_r')
Survived01All
Pclass   
180136216
29787184
3372119491
All549342891
1
2
3
4
5
6
7
8
df_train[['Pclass', 'Survived']].groupby(['Pclass']).mean().sort_values(by='Survived', ascending=False).plot.bar()
plt.xticks(rotation=0)

Pclass_data = df_train[['Pclass', 'Survived']].groupby(['Pclass']).mean().sort_values(by='Survived', ascending=False)
for index, value in enumerate(Pclass_data['Survived']):
    plt.text(index, value, str(round(value, 2)), ha='center', va='bottom')

plt.show()

png

1
2
3
4
5
6
7
8
9
10
11
12
13
y_position = 1.02
fig, ax = plt.subplots(1, 2, figsize=(18, 8))

df_train['Pclass'].value_counts().plot.bar(color=['#CD7F32', '#FFDF00', '#D3D3D3'], ax=ax[0])
ax[0].set_title('Number of Passengers By Pclass', y=y_position)
ax[0].set_ylabel('Count')
ax[0].set_xticklabels(ax[0].get_xticklabels(), rotation=0)

sns.countplot(x='Pclass', hue='Survived', data=df_train, ax=ax[1])
ax[1].set_title('Pclass: Survived vs Dead')
ax[1].set_ylabel('Count')

plt.show()

png

1
2
3
4
5
6
7
8
9
10
11
12
fig, ax = plt.subplots(1, 2, figsize=(20, 8))

df_train[['Sex', 'Survived']].groupby(['Sex']).mean().sort_values(by='Sex', ascending=False).plot.bar(ax=ax[0])
ax[0].set_title('Sex: Survived Mean', y=1.02)
ax[0].set_ylabel('Mean')
ax[0].set_xticklabels(ax[0].get_xticklabels(), rotation=0)

sns.countplot(x='Sex', hue='Survived', data=df_train, ax=ax[1])
ax[1].set_title('Sex: Survived vs Dead', y=1.02)
ax[1].set_ylabel('Count')

plt.show()

png

1
df_train[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)
SexSurvived
0female0.742038
1male0.188908
1
pd.crosstab(df_train['Sex'], df_train['Survived'], margins=True).style.background_gradient(cmap='summer_r')
Survived01All
Sex   
female81233314
male468109577
All549342891
1
sns.catplot(data=df_train, x='Pclass', y='Survived', hue='Sex', height=6, kind='point', saturation=.5, aspect=1.5)
1
<seaborn.axisgrid.FacetGrid at 0xbc20158>

png

1
sns.catplot(data=df_train, x='Sex', y='Survived', col='Pclass', kind='point', height=6, aspect=1)
1
<seaborn.axisgrid.FacetGrid at 0xba6afb0>

png

1
2
3
print('Oldest Passenger: {:.1f} Years'.format(df_train['Age'].max()))
print('Youngest Passenger: {:.1f} Years'.format(df_train['Age'].min()))
print('Mean Age: {:.1f}'.format(df_train['Age'].mean()))
1
2
3
Oldest Passenger: 80.0 Years
Youngest Passenger: 0.4 Years
Mean Age: 29.7
1
2
3
4
5
fig, ax = plt.subplots(1, 1, figsize=(9, 5))
sns.kdeplot(df_train[df_train['Survived'] == 1]['Age'], ax=ax)
sns.kdeplot(df_train[df_train['Survived'] == 0]['Age'], ax=ax)
plt.legend(['Survived', 'Dead'], fontsize=18)
plt.show()

png

1
2
3
4
5
6
7
8
9
# Age distribution withing classes
plt.figure(figsize=(8, 6))
df_train['Age'][df_train['Pclass'] == 1].plot(kind='kde')
df_train['Age'][df_train['Pclass'] == 2].plot(kind='kde')
df_train['Age'][df_train['Pclass'] == 3].plot(kind='kde')

plt.xlabel('Age')
plt.title('Age Distribution within Pclass', pad=20)
plt.legend(['1st Class', '2nd Class', '3rd Class'], fontsize=18)
1
<matplotlib.legend.Legend at 0x18380ba8>

png

1
2
3
4
5
6
7
8
9
10
cummulate_survival_ratio = []
for i in range(1, 80):
    cummulate_survival_ratio.append(df_train[df_train['Age'] < i]['Survived'].sum() / len(df_train[df_train['Age'] < i]['Survived']))

plt.figure(figsize=(8, 7))
plt.plot(cummulate_survival_ratio)
plt.title('Survival Rate Change by Age', y=1.02)
plt.xlabel('Range of Age (0<=Age<80)')
plt.ylabel('Survival Rate')
plt.show()

png

1
2
3
4
5
6
7
8
9
10
11
12
13
fig, ax = plt.subplots(1, 2, figsize=(22, 10))

sns.violinplot(data=df_train, x='Pclass', y='Age', hue='Survived', density_norm='count', split=True, gap=.1, inner='quart', ax=ax[0])
ax[0].set_title('Age and Survived Distribution by Pclass', fontsize=25, y=1.02)
ax[0].set_yticks(range(0, 110, 10))
sns.move_legend(ax[0], 'upper right', fontsize=20)

sns.violinplot(data=df_train, x='Sex', y='Age', hue='Survived', density_norm='count', split=True, gap=.1, inner='quart', ax=ax[1])
ax[1].set_title('Age and Survived Distribution by Sex', fontsize=25, y=1.02)
ax[1].set_yticks(range(0, 110, 10))
sns.move_legend(ax[1], 'upper right', fontsize=20)

plt.show()

png

1
2
3
4
5
6
7
8
9
10
11
12
fig, ax = plt.subplots(1, 2, figsize=(30, 10))
df_train[['Embarked', 'Survived']].groupby(['Embarked']).mean().sort_values(by='Survived', ascending=False).plot.bar(ax=ax[0])
ax[0].set_title('Survived Mean by Embarkation Point', y=1.02)
ax[0].set_ylabel('Mean')
ax[0].set_xticklabels(ax[0].get_xticklabels(), rotation=0)

sns.countplot(data=df_train, x='Embarked', hue='Survived', ax=ax[1])
plt.xlabel('Embarked')
plt.ylabel('Count')
plt.title('Survival Count by Embarkation Point', y=1.02)

plt.show()

png

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
fig, ax = plt.subplots(2, 2, figsize=(20, 15))

sns.countplot(data=df_train, x='Embarked', ax=ax[0, 0])
ax[0, 0].set_title('(1) Total No. of Passengers')

sns.countplot(data=df_train, x='Embarked', hue='Sex', ax=ax[0, 1])
ax[0, 1].set_title('(2) No. of Passengers by Sex and Embarked')

sns.countplot(data=df_train, x='Embarked', hue='Survived', ax=ax[1, 0])
ax[1, 0].set_title('(3) No. of Survived by Embarked')

sns.countplot(data=df_train, x='Embarked', hue='Pclass', ax=ax[1, 1])
ax[1, 1].set_title('(4) No. of Pclass by Embakred')

plt.subplots_adjust(wspace=0.2, hspace=0.5)
plt.show()

png

1
2
3
4
5
df_train['FamilySize'] = df_train['SibSp'] + df_train['Parch'] + 1
df_test['FamilySize'] = df_test['SibSp'] + df_train['Parch'] + 1

print('Maximum Size of Family: ', df_train['FamilySize'].max())
print('Minimum Size of Family: ', df_train['FamilySize'].min())
1
2
Maximum Size of Family:  11
Minimum Size of Family:  1
1
2
3
4
5
6
7
8
9
10
11
12
13
fig, ax = plt.subplots(1, 3, figsize=(40, 10))

sns.countplot(data=df_train, x='FamilySize', ax=ax[0])
ax[0].set_title('(1) No. of Passengers by Family Size', y=1.02)

sns.countplot(data=df_train, x='FamilySize', hue='Survived', ax=ax[1])
ax[1].set_title('(2) No. of Survived by Family Size', y=1.02)

df_train[['FamilySize', 'Survived']].groupby(['FamilySize']).mean().sort_values(by='Survived', ascending=False).plot.bar(ax=ax[2])
ax[2].set_title('(3) Mean of Survived by Family Size', y=1.02)

plt.subplots_adjust(wspace=0.2, hspace=0.5)
plt.show()

png

1
2
3
4
fig, ax = plt.subplots(1, 1, figsize=(6, 6))

fare_histplot = sns.histplot(df_train['Fare'], label='Skewness: {:.2f}'.format(df_train['Fare'].skew()), kde=True, ax=ax)
fare_histplot = fare_histplot.legend(loc='best', fontsize=22)

png

1
2
3
4
df_test.loc[df_test.Fare.isnull(), 'Fare'] = df_test['Fare'].mean()

df_train['Fare'] = df_train['Fare'].map(lambda x: np.log(x) if x > 0 else 0)
df_test['Fare'] = df_test['Fare'].map(lambda x: np.log(x) if x > 0 else 0)
1
2
3
4
fig, ax = plt.subplots(1, 1, figsize=(8, 8))

fare_loghist = sns.histplot(df_train['Fare'], label='Skeness: {:.2f}'.format(df_train['Fare'].skew()), kde=True, ax=ax)
fare_loghist = fare_loghist.legend(loc='best', fontsize=22)

png

1
df_train.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
1
df_train['Ticket'].value_counts()
1
2
3
4
5
6
7
8
9
10
11
12
347082      7
CA. 2343    7
1601        7
3101295     6
CA 2144     6
           ..
9234        1
19988       1
2693        1
PC 17612    1
370376      1
Name: Ticket, Length: 681, dtype: int64
1
2
3
4
5
6
7
8
9
10
11
12
# 2번째

df_train = pd.read_csv('../data/titanic/train.csv')
df_test = pd.read_csv('../data/titanic/test.csv')

df_train['FamilySize'] = df_train['SibSp'] + df_train['Parch'] + 1
df_test['FamilySize'] = df_test['SibSp'] + df_test['Parch'] + 1

df_test.loc[df_test.Fare.isnull(), 'Fare'] = df_test['Fare'].mean()

df_train['Fare'] = df_train['Fare'].map(lambda x: np.log(x) if x > 0 else 0)
df_test['Fare'] = df_test['Fare'].map(lambda x: np.log(x) if x > 0 else 0)
1
2
df_train['Initial'] = df_train.Name.str.extract('([A-Za-z]+)\.')
df_test['Initial'] = df_test.Name.str.extract('([A-Za-z]+)\.')
1
pd.crosstab(df_train['Initial'], df_train['Sex']).T.style.background_gradient(cmap='summer_r')
InitialCaptColCountessDonDrJonkheerLadyMajorMasterMissMlleMmeMrMrsMsRevSir
Sex                 
female001010100182210125100
male12016102400005170061
1
2
3
4
df_train['Initial'].replace(['Mlle', 'Mme', 'Ms', 'Dr', 'Major', 'Lady', 'Countess', 'Jonkheer', 'Col', 'Rev', 'Capt', 'Sir', 'Don', 'Dona'],
                            ['Miss', 'Miss', 'Miss', 'Mr', 'Mr', 'Mrs', 'Mrs', 'Other', 'Other', 'Other', 'Mr', 'Mr', 'Mr', 'Mr'], inplace=True)
df_test['Initial'].replace(['Mlle', 'Mme', 'Ms', 'Dr', 'Major', 'Lady', 'Countess', 'Jonkheer', 'Col', 'Rev', 'Capt', 'Sir', 'Don', 'Dona'],
                            ['Miss', 'Miss', 'Miss', 'Mr', 'Mr', 'Mrs', 'Mrs', 'Other', 'Other', 'Other', 'Mr', 'Mr', 'Mr', 'Mr'], inplace=True)
1
df_train.groupby('Initial')['Survived'].mean().plot.bar()
1
<AxesSubplot:xlabel='Initial'>

png

1
df_train.groupby('Initial').mean()
PassengerIdSurvivedPclassAgeSibSpParchFareFamilySize
Initial
Master414.9750000.5750002.6250004.5741672.3000001.3750003.3407104.675000
Miss411.7419350.7043012.28494621.8600000.6989250.5376343.1237132.236559
Mr455.8809070.1625712.38185332.7396090.2930060.1512292.6515071.444234
Mrs456.3937010.7952761.98425235.9818180.6929130.8188983.4437512.511811
Other564.4444440.1111111.66666745.8888890.1111110.1111112.6416051.222222
1
2
3
4
5
6
7
8
9
10
11
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Master'), 'Age'] = 5
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Miss'), 'Age'] = 22
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Mr'), 'Age'] = 33
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Mrs'), 'Age'] = 36
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Other'), 'Age'] = 46

df_test.loc[(df_test.Age.isnull())&(df_test.Initial=='Master'), 'Age'] = 5
df_test.loc[(df_test.Age.isnull())&(df_test.Initial=='Miss'), 'Age'] = 22
df_test.loc[(df_test.Age.isnull())&(df_test.Initial=='Mr'), 'Age'] = 33
df_test.loc[(df_test.Age.isnull())&(df_test.Initial=='Mrs'), 'Age'] = 36
df_test.loc[(df_test.Age.isnull())&(df_test.Initial=='Other'), 'Age'] = 46
1
print('Embarked has ', sum(df_train['Embarked'].isnull()), ' Null values')
1
Embarked has  2  Null values
1
df_train['Embarked'].fillna('S', inplace=True)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
df_train['Age_cat'] = 0
df_train.loc[df_train['Age'] < 10, 'Age_cat'] = 0
df_train.loc[(10 <= df_train['Age']) & (df_train['Age'] < 20), 'Age_cat'] = 1
df_train.loc[(20 <= df_train['Age']) & (df_train['Age'] < 30), 'Age_cat'] = 2
df_train.loc[(30 <= df_train['Age']) & (df_train['Age'] < 40), 'Age_cat'] = 3
df_train.loc[(40 <= df_train['Age']) & (df_train['Age'] < 50), 'Age_cat'] = 4
df_train.loc[(50 <= df_train['Age']) & (df_train['Age'] < 60), 'Age_cat'] = 5
df_train.loc[(60 <= df_train['Age']) & (df_train['Age'] < 70), 'Age_cat'] = 6
df_train.loc[70 <= df_train['Age'], 'Age_cat'] = 7

df_test['Age_cat'] = 0
df_test.loc[df_test['Age'] < 10, 'Age_cat'] = 0
df_test.loc[(10 <= df_test['Age']) & (df_test['Age'] < 20), 'Age_cat'] = 1
df_test.loc[(20 <= df_test['Age']) & (df_test['Age'] < 30), 'Age_cat'] = 2
df_test.loc[(30 <= df_test['Age']) & (df_test['Age'] < 40), 'Age_cat'] = 3
df_test.loc[(40 <= df_test['Age']) & (df_test['Age'] < 50), 'Age_cat'] = 4
df_test.loc[(50 <= df_test['Age']) & (df_test['Age'] < 60), 'Age_cat'] = 5
df_test.loc[(60 <= df_test['Age']) & (df_test['Age'] < 70), 'Age_cat'] = 6
df_test.loc[70 <= df_test['Age'], 'Age_cat'] = 7
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
def category_age(i):
    if i < 10:
        return 0
    elif i < 20:
        return 1
    elif i < 30:
        return 2
    elif i < 40:
        return 3
    elif i < 50:
        return 4
    elif i < 60:
        return 5
    elif i < 70:
        return 6
    else:
        return 7

df_train['Age_cat_2'] = df_train['Age'].apply(category_age)
1
print('Do Age_cat and Age_cat result same? ', (df_train['Age_cat'] == df_train['Age_cat_2']).all())
1
Do Age_cat and Age_cat result same?  True
1
2
df_train.drop(['Age', 'Age_cat_2'], axis=1, inplace=True)
df_test.drop(['Age'], axis=1, inplace=True)
1
2
df_train['Initial'] = df_train['Initial'].map({'Master': 0, 'Miss': 1, 'Mr': 2, 'Mrs': 3, 'Other': 4})
df_test['Initial'] = df_test['Initial'].map({'Master': 0, 'Miss': 1, 'Mr': 2, 'Mrs': 3, 'Other': 4})
1
df_train['Embarked'].unique()
1
array(['S', 'C', 'Q'], dtype=object)
1
df_train['Embarked'].value_counts()
1
2
3
4
S    646
C    168
Q     77
Name: Embarked, dtype: int64
1
2
df_train['Embarked'] = df_train['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})
df_test['Embarked'] = df_test['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})
1
df_train['Embarked'].isnull().any()
1
False
1
2
df_train['Sex'] = df_train['Sex'].map({'female': 0, 'male': 1})
df_test['Sex'] = df_test['Sex'].map({'female': 0, 'male': 1})
1
2
3
4
5
6
7
8
9
10
heatmap_data = df_train[['Survived', 'Pclass', 'Sex', 'Fare', 'Embarked', 'FamilySize', 'Initial', 'Age_cat']]

colormap = plt.cm.RdBu

plt.figure(figsize=(14, 12))
plt.title('Pearson Correlation Coefficient of Fetures', y=1.02, size=20)

sns.heatmap(heatmap_data.astype(float).corr(), vmax=1.0, cmap=colormap, annot=True, annot_kws={'size': 16}, linewidths=0.1, linecolor='white', square=True)

del heatmap_data

png

1
2
df_train = pd.get_dummies(df_train, prefix='Initial', columns=['Initial'])
df_test = pd.get_dummies(df_test, prefix='Initial', columns=['Initial'])
1
df_train.head()
PassengerIdSurvivedPclassNameSexSibSpParchTicketFareCabinEmbarkedFamilySizeAge_catInitial_0Initial_1Initial_2Initial_3Initial_4
0103Braund, Mr. Owen Harris110A/5 211711.981001NaN22200100
1211Cumings, Mrs. John Bradley (Florence Briggs Th...010PC 175994.266662C8502300010
2313Heikkinen, Miss. Laina000STON/O2. 31012822.070022NaN21201000
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)0101138033.972177C12322300010
4503Allen, Mr. William Henry1003734502.085672NaN21300100
1
df_test.head()
PassengerIdPclassNameSexSibSpParchTicketFareCabinEmbarkedFamilySizeAge_catInitial_0Initial_1Initial_2Initial_3Initial_4
08923Kelly, Mr. James1003309112.057860NaN11300100
18933Wilkes, Mrs. James (Ellen Needs)0103632721.945910NaN22400010
28942Myles, Mr. Thomas Francis1002402762.270836NaN11600100
38953Wirz, Mr. Albert1003151542.159003NaN21200100
48963Hirvonen, Mrs. Alexander (Helga E Lindqvist)01131012982.508582NaN23200010
1
2
df_train = pd.get_dummies(df_train, prefix='Embarked', columns=['Embarked'])
df_test = pd.get_dummies(df_test, prefix='Embarked', columns=['Embarked'])
1
df_train.head()
PassengerIdSurvivedPclassNameSexSibSpParchTicketFareCabinFamilySizeAge_catInitial_0Initial_1Initial_2Initial_3Initial_4Embarked_0Embarked_1Embarked_2
0103Braund, Mr. Owen Harris110A/5 211711.981001NaN2200100001
1211Cumings, Mrs. John Bradley (Florence Briggs Th...010PC 175994.266662C852300010100
2313Heikkinen, Miss. Laina000STON/O2. 31012822.070022NaN1201000001
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)0101138033.972177C1232300010001
4503Allen, Mr. William Henry1003734502.085672NaN1300100001
1
df_test.head()
PassengerIdPclassNameSexSibSpParchTicketFareCabinFamilySizeAge_catInitial_0Initial_1Initial_2Initial_3Initial_4Embarked_0Embarked_1Embarked_2
08923Kelly, Mr. James1003309112.057860NaN1300100010
18933Wilkes, Mrs. James (Ellen Needs)0103632721.945910NaN2400010001
28942Myles, Mr. Thomas Francis1002402762.270836NaN1600100010
38953Wirz, Mr. Albert1003151542.159003NaN1200100001
48963Hirvonen, Mrs. Alexander (Helga E Lindqvist)01131012982.508582NaN3200010001
1
2
df_train.drop(['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin'], axis=1, inplace=True)
df_test.drop(['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin'], axis=1, inplace=True)
1
df_train.head()
SurvivedPclassSexFareFamilySizeAge_catInitial_0Initial_1Initial_2Initial_3Initial_4Embarked_0Embarked_1Embarked_2
00311.9810012200100001
11104.2666622300010100
21302.0700221201000001
31103.9721772300010001
40312.0856721300100001
1
df_test.head()
PclassSexFareFamilySizeAge_catInitial_0Initial_1Initial_2Initial_3Initial_4Embarked_0Embarked_1Embarked_2
0312.0578601300100010
1301.9459102400010001
2212.2708361600100010
3312.1590031200100001
4302.5085823200010001
1
2
3
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split
1
2
3
X_train = df_train.drop('Survived', axis=1).values
target_label = df_train['Survived'].values
X_test = df_test.values
1
X_tr, X_vld, y_tr, y_vld = train_test_split(X_train, target_label, test_size=0.3, random_state=2018)
1
X_tr
1
2
3
4
5
6
7
8
9
10
11
12
13
array([[1.        , 1.        , 4.11373861, ..., 0.        , 0.        ,
        1.        ],
       [3.        , 1.        , 2.05412373, ..., 0.        , 0.        ,
        1.        ],
       [1.        , 1.        , 5.354225  , ..., 1.        , 0.        ,
        0.        ],
       ...,
       [2.        , 1.        , 2.35137526, ..., 0.        , 0.        ,
        1.        ],
       [3.        , 1.        , 2.08567209, ..., 0.        , 0.        ,
        1.        ],
       [3.        , 1.        , 1.98100147, ..., 0.        , 0.        ,
        1.        ]])
1
X_tr.shape[0]
1
623
1
X_vld
1
2
3
4
5
6
7
8
9
10
11
12
13
array([[1.        , 1.        , 5.57215403, ..., 0.        , 0.        ,
        1.        ],
       [3.        , 1.        , 2.08567209, ..., 0.        , 0.        ,
        1.        ],
       [1.        , 1.        , 3.41772668, ..., 0.        , 0.        ,
        1.        ],
       ...,
       [1.        , 1.        , 3.25809654, ..., 0.        , 0.        ,
        1.        ],
       [2.        , 1.        , 2.35137526, ..., 0.        , 0.        ,
        1.        ],
       [2.        , 0.        , 3.66356165, ..., 0.        , 0.        ,
        1.        ]])
1
X_vld.shape[0]
1
268
1
y_tr
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
array([0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1,
       0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1,
       1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,
       1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1,
       0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1,
       0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1,
       1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1,
       1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,
       0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0,
       1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0,
       1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 0, 1, 1, 1, 0, 0], dtype=int64)
1
y_tr.shape[0]
1
623
1
y_vld
1
2
3
4
5
6
7
8
9
10
11
12
13
array([0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1,
       1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1,
       1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1,
       1, 0, 0, 1], dtype=int64)
1
y_vld.shape[0]
1
268
1
2
3
model = RandomForestClassifier()
model.fit(X_tr, y_tr)
prediction = model.predict(X_vld)
1
print(prediction)
1
2
3
4
5
6
7
8
[0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0
 0 0 0 0 0 0 1 0 1 1 1 0 0 1 0 1 1 1 1 0 1 0 1 0 1 0 0 1 0 0 0 0 0 1 1 1 0
 0 0 0 0 0 0 0 0 1 1 0 1 1 1 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 0 1 1
 1 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 1 0 1 1 0 1 0 1 0 1 0 0
 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 1 0 1 0 1 1 1 1 0 0 0 0 0 1
 1 1 0 0 0 0 1 0 0 1 0 0 0 1 1 0 1 0 1 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 1 1 0
 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 0 1 0 1 1 0 0 0 0 0 0 1 0 1 1
 0 1 0 1 1 0 0 0 1]
1
print('Survived Predection with {:.2f}% accuracy out of total passengers {}'.format(100 * metrics.accuracy_score(y_vld, prediction), y_vld.shape[0]))
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

Cell In[97], line 1
----> 1 print('Survived Predection with {:.2f}% accuracy out of total passengers {}'.format(100 * metrics.accuracy_score(prediction, y_vld), y_vld.shape[0]))


File /lib/python3.11/site-packages/sklearn/utils/_param_validation.py:211, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    205 try:
    206     with config_context(
    207         skip_parameter_validation=(
    208             prefer_skip_nested_validation or global_skip_validation
    209         )
    210     ):
--> 211         return func(*args, **kwargs)
    212 except InvalidParameterError as e:
    213     # When the function is just a wrapper around an estimator, we allow
    214     # the function to delegate validation to the estimator, but we replace
    215     # the name of the estimator by the name of the function in the error
    216     # message to avoid confusion.
    217     msg = re.sub(
    218         r"parameter of \w+ must be",
    219         f"parameter of {func.__qualname__} must be",
    220         str(e),
    221     )


File /lib/python3.11/site-packages/sklearn/metrics/_classification.py:220, in accuracy_score(y_true, y_pred, normalize, sample_weight)
    154 """Accuracy classification score.
    155
    156 In multilabel classification, this function computes subset accuracy:
   (...)
    216 0.5
    217 """
    219 # Compute accuracy for each possible representation
--> 220 y_type, y_true, y_pred = _check_targets(y_true, y_pred)
    221 check_consistent_length(y_true, y_pred, sample_weight)
    222 if y_type.startswith("multilabel"):


File /lib/python3.11/site-packages/sklearn/metrics/_classification.py:84, in _check_targets(y_true, y_pred)
     57 def _check_targets(y_true, y_pred):
     58     """Check that y_true and y_pred belong to the same classification task.
     59
     60     This converts multiclass or binary types to a common shape, and raises a
   (...)
     82     y_pred : array or indicator matrix
     83     """
---> 84     check_consistent_length(y_true, y_pred)
     85     type_true = type_of_target(y_true, input_name="y_true")
     86     type_pred = type_of_target(y_pred, input_name="y_pred")


File /lib/python3.11/site-packages/sklearn/utils/validation.py:407, in check_consistent_length(*arrays)
    405 uniques = np.unique(lengths)
    406 if len(uniques) > 1:
--> 407     raise ValueError(
    408         "Found input variables with inconsistent numbers of samples: %r"
    409         % [int(l) for l in lengths]
    410     )


ValueError: Found input variables with inconsistent numbers of samples: [418, 268]
1
print(model.feature_importances_)
1
2
3
[0.09869027 0.11627062 0.33726262 0.09358139 0.11944675 0.01194506
 0.03424261 0.11596645 0.02474876 0.00479071 0.01296574 0.0122562
 0.01783281]
1
print(model.decision_path(X_vld))
1
2
3
4
5
6
7
8
9
10
11
12
13
(<268x26804 sparse matrix of type '<class 'numpy.intc'>'
	with 262443 stored elements in Compressed Sparse Row format>, array([    0,   255,   516,   773,  1038,  1287,  1530,  1797,  2052,
        2321,  2582,  2843,  3126,  3389,  3664,  3925,  4228,  4467,
        4722,  4997,  5270,  5523,  5778,  6031,  6312,  6579,  6852,
        7125,  7398,  7627,  7892,  8165,  8446,  8693,  8972,  9261,
        9528,  9825, 10094, 10359, 10600, 10875, 11162, 11433, 11688,
       11967, 12212, 12481, 12760, 13025, 13304, 13567, 13828, 14067,
       14318, 14595, 14880, 15163, 15466, 15757, 16016, 16315, 16582,
       16855, 17130, 17387, 17654, 17931, 18228, 18477, 18730, 18989,
       19274, 19539, 19794, 20053, 20326, 20603, 20886, 21149, 21400,
       21649, 21926, 22203, 22442, 22701, 22980, 23251, 23544, 23861,
       24120, 24385, 24654, 24913, 25178, 25451, 25700, 25957, 26242,
       26533, 26804]))
1
print(model.predict_proba(X_vld)[:10])
1
2
3
4
5
6
7
8
9
10
[[0.76       0.24      ]
 [0.38471429 0.61528571]
 [0.702      0.298     ]
 [0.99       0.01      ]
 [0.01       0.99      ]
 [1.         0.        ]
 [0.94416667 0.05583333]
 [0.98       0.02      ]
 [1.         0.        ]
 [0.74880952 0.25119048]]
1
print(model.get_params())
1
{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}
1
2
3
4
5
from pandas import Series

feature_importance = model.feature_importances_
Series_feat_imp = Series(feature_importance, index=df_test.columns)
print(Series_feat_imp)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
Pclass        0.098690
Sex           0.116271
Fare          0.337263
FamilySize    0.093581
Age_cat       0.119447
Initial_0     0.011945
Initial_1     0.034243
Initial_2     0.115966
Initial_3     0.024749
Initial_4     0.004791
Embarked_0    0.012966
Embarked_1    0.012256
Embarked_2    0.017833
dtype: float64
1
2
3
4
5
6
7
8
num_features = len(Series_feat_imp)
random_colors = np.random.rand(num_features, 3)

plt.figure(figsize=(8, 8))
Series_feat_imp.sort_values(ascending=True).plot.barh(color=random_colors)
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.show()

png

1
2
cm = metrics.confusion_matrix(y_vld, prediction)
print('Confusion Matrix\n', cm)
1
2
3
Confusion Matrix
 [[152  20]
 [ 27  69]]
1
2
cr = metrics.classification_report(y_vld, prediction)
print('Classification Report\n', cr)
1
2
3
4
5
6
7
8
9
Classification Report
               precision    recall  f1-score   support

           0       0.85      0.88      0.87       172
           1       0.78      0.72      0.75        96

    accuracy                           0.82       268
   macro avg       0.81      0.80      0.81       268
weighted avg       0.82      0.82      0.82       268
1
submission = pd.read_csv('../data/titanic/gender_submission.csv')
1
submission.head()
PassengerIdSurvived
08920
18931
28940
38950
48961
1
2
prediction = model.predict(X_test)
submission['Survived'] = prediction
1
print(submission)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
     PassengerId  Survived
0            892         0
1            893         0
2            894         0
3            895         0
4            896         0
..           ...       ...
413         1305         0
414         1306         1
415         1307         0
416         1308         0
417         1309         1

[418 rows x 2 columns]
1
submission.to_csv('../data/titanic/my_first_submission.csv', index=False)
1
2
3
submission_true = pd.read_csv('../data/titanic/gender_submission.csv')
submission_true = np.array(submission_true['Survived'])
print(submission_true)
1
2
3
4
5
6
7
8
9
10
11
12
[0 1 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 1 1 0 0 1 0 1 0 1 0 0 0 0 0 1 1 0 0 1
 1 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 1 1 0 1 0
 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0
 1 1 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0
 0 0 1 0 0 1 0 0 1 1 0 1 1 0 1 0 0 1 0 0 1 1 0 0 0 0 0 1 1 0 1 1 0 0 1 0 1
 0 1 0 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 1 0 0 1 0 1 0 0 0 0 1 1 0 1 0 1 0 1 0
 1 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0
 1 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0
 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 1 1 0 1 1 0
 0 1 0 0 1 1 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0
 0 1 1 1 1 1 0 1 0 0 0]
1
2
submission_predict = np.array(submission['Survived'])
print(submission_predict)
1
2
3
4
5
6
7
8
9
10
11
12
[0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 1 1 0 1 1 1 1 0 1 0 0 0 0 0 1 1 1 0 0
 0 0 1 0 1 0 1 1 0 1 0 1 1 1 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 1 0 0 1 1 0 1 1
 1 0 0 1 0 1 1 0 0 0 0 0 1 0 1 1 1 0 1 0 1 0 1 0 0 0 1 0 0 0 1 0 1 0 0 0 0
 1 1 1 1 0 0 1 0 1 1 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0
 1 0 1 0 0 1 0 0 1 0 1 0 1 1 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 1 1 0 1 1 0 1
 0 1 0 0 0 0 0 1 0 1 1 1 0 0 0 1 1 1 1 0 0 1 0 1 0 0 0 0 1 1 0 1 0 1 0 1 0
 1 0 1 0 0 1 0 0 0 1 0 0 1 0 0 0 1 1 1 1 0 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 1
 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 1 1 0 1 0 0 1 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 1 0 0
 1 0 0 0 0 0 1 0 0 0 1 1 1 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 1 1 0 0 1 1
 0 1 0 0 1 1 0 0 0 1 0 1 1 0 0 1 0 0 1 0 0 1 1 0 0 1 0 1 0 0 1 0 1 0 0 1 0
 0 1 1 1 1 1 0 1 0 0 1]
1
print('Survived Prediction with {:.2f}% accuracy out of total number {}'.format(100 * metrics.accuracy_score(submission_true, submission_predict), submission_true.shape[0]))
1
Survived Prediction with 83.01% accuracy out of total number 418
This post is licensed under CC-BY-NC-ND-4.0 by the author.