캐글 타이타닉으로 어디까지 갈 수 있을까 - 기본편

Posted Mar 26, 2024

By bramilch 37 min read

2024/03/26: 초안 작성일; 머리말, 요약, 본문 코드, 이미지 완성(상세 설명 제외)
2024/04/17: 이미지 오류 수정
2024/04/21: 이미지 오류 수정 완료

머리말
- 이번 공부의 목적
- 향후 정리하고 확장할 사항
요약
본문

머리말

이 글은 캐글에서 가장 유명, 입문용인 Titanic - Machine Learning from Disaster 제출한 것을 정리한 글
캐글 타이타닉으로 어디까지 깊게 공부할 수 있을지 끝까지 가볼 예정
입문용이라지만 깊게 들어갈수록 공부할 내용은 많음.
데이터셋에 대한 이해를 바탕으로 공부한 내용을 최대한 확장하고 심화할 것
사용하고 있는 노트북의 성능 한계로 Web based JupyterLab을 활용함.

이번 공부의 목적

단순 필사가 아닌, 문제가 생기면 스스로 해결
라이브러리들의 Documentation를 하나씩 확인하며 파라미터들을 이해하고, 사용법을 정확히 익히기.
EDA, 전처리, 모델링의 아이디어, 논리, 기법 이해 및 숙달
아무 것도 참고하지 않고 스스로 작성할 수 있을 정도로 반복

향후 정리하고 확장할 사항

Kaggle - Titanic (심화)에선 유명한 Tutorials 등을 참고하여 Null 값 처리, String → Numerical, Binning, Modeling 등 세부 방법들을 비교, List화하고, 정리
모델링에 사용한 sklearn의 RandomForestClassifier 소스코드(위치 확인 완료)를 분석하여 어떤 로직으로 구현하는지 분석
각 메서드의 kwargs들을 익히기 위해 라이브러리들의 기본이 되는 메서드, 파라미터들을 익히고 정리
ipynb 파일을 효율적으로 Markdown이나 HTML로 Export, 변환하는 방법, 자동화 등 모색

요약

Feature들의 피어슨 상관계수 Heatmap

Feature Importance순으로 Horizontal Bar

EDA와 Feature Engineering
Feature로 Fare, Age, Sex, Name Pronouns(Initial), Passenger Class (Pclass), Family Size(Family Size), Embaked를 사용함.
Fare는 Log를 취하고, Initial와 Embaked String을 Numeric으로, Age는 Binning하여 Numeric으로, Family Size는 Sibling, Spouse, Parent, Child 데이터를 Passenger ID를 기준으로 계산

Random Forest Classifier로 만든 Classification 모델의 성능
Classification Report
precision recall f1-score support
0 0.85 0.88 0.87 172
1 0.78 0.72 0.75 96
accuracy 0.82 268
macro avg 0.81 0.8 0.81 268
weighted avg 0.82 0.82 0.82 268

	precision	recall	f1-score	support
0	0.85	0.88	0.87	172
1	0.78	0.72	0.75	96
accuracy			0.82	268
macro avg	0.81	0.8	0.81	268
weighted avg	0.82	0.82	0.82	268

본문

  
import piplite

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

await piplite.install('seaborn')
import seaborn as sns

await piplite.install('plotly')
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

plt.style.use('seaborn')
sns.set_theme(font_scale=2.5)
import jinja2

await piplite.install('missingno')
import missingno as msno

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
%config Completer.use_jedi = False

  
df_train = pd.read_csv('../data/titanic/train.csv')
df_test = pd.read_csv('../data/titanic/test.csv')

  
df_train.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

  
df_train.describe()

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

  
df_test.describe()

	PassengerId	Pclass	Age	SibSp	Parch	Fare
count	418.000000	418.000000	332.000000	418.000000	418.000000	417.000000
mean	1100.500000	2.265550	30.272590	0.447368	0.392344	35.627188
std	120.810458	0.841838	14.181209	0.896760	0.981429	55.907576
min	892.000000	1.000000	0.170000	0.000000	0.000000	0.000000
25%	996.250000	1.000000	21.000000	0.000000	0.000000	7.895800
50%	1100.500000	3.000000	27.000000	0.000000	0.000000	14.454200
75%	1204.750000	3.000000	39.000000	1.000000	0.000000	31.500000
max	1309.000000	3.000000	76.000000	8.000000	9.000000	512.329200

  
for col in df_train.columns:
    msg = 'column: {:>10}\t Percent of NaN value: {:.2f}%'.format(col, 100 * (df_train[col].isnull().sum() / df_train[col].shape[0]))
    print(msg)

column: PassengerId	 Percent of NaN value: 0.00%
column:   Survived	 Percent of NaN value: 0.00%
column:     Pclass	 Percent of NaN value: 0.00%
column:       Name	 Percent of NaN value: 0.00%
column:        Sex	 Percent of NaN value: 0.00%
column:        Age	 Percent of NaN value: 19.87%
column:      SibSp	 Percent of NaN value: 0.00%
column:      Parch	 Percent of NaN value: 0.00%
column:     Ticket	 Percent of NaN value: 0.00%
column:       Fare	 Percent of NaN value: 0.00%
column:      Cabin	 Percent of NaN value: 77.10%
column:   Embarked	 Percent of NaN value: 0.22%

  
for col in df_test.columns:
    msg = 'column: {:>10}\t Percent of NaN value: {:.2f}%'.format(col, 100 * (df_test[col].isnull().sum() / df_test[col].shape[0]))
    print(msg)

column: PassengerId	 Percent of NaN value: 0.00%
column:     Pclass	 Percent of NaN value: 0.00%
column:       Name	 Percent of NaN value: 0.00%
column:        Sex	 Percent of NaN value: 0.00%
column:        Age	 Percent of NaN value: 20.57%
column:      SibSp	 Percent of NaN value: 0.00%
column:      Parch	 Percent of NaN value: 0.00%
column:     Ticket	 Percent of NaN value: 0.00%
column:       Fare	 Percent of NaN value: 0.24%
column:      Cabin	 Percent of NaN value: 78.23%
column:   Embarked	 Percent of NaN value: 0.00%

  
msno.matrix(df=df_train.iloc[:, :], figsize=(8, 8), color=(0.8, 0.5, 0.2))

<AxesSubplot:>

  
msno.bar(df=df_train.iloc[:, :], figsize=(8, 8), color=(0.8, 0.5, 0.2))

<AxesSubplot:>

  
msno.bar(df=df_test.iloc[:, :], figsize=(8, 8), color=(0.8, 0.5, 0.2))

<AxesSubplot:>

  
fig, ax = plt.subplots(1, 2, figsize=(18, 8))

df_train['Survived'].value_counts().plot.pie(explode=[0, 0.1], autopct='%1.1f%%', ax=ax[0], shadow=True)
ax[0].set_title('Pie Plot - Survived')
ax[0].set_ylabel('')
sns.countplot(x='Survived', data=df_train, ax=ax[1])
ax[1].set_title('Count Plot - Survived')

plt.show()

  
df_train[['Pclass', 'Survived']].groupby(['Pclass']).count()

	Survived
Pclass
1	216
2	184
3	491

  
df_train[['Pclass', 'Survived']].groupby(['Pclass']).sum()

	Survived
Pclass
1	136
2	87
3	119

  
pd.crosstab(df_train['Pclass'], df_train['Survived'], margins=True).style.background_gradient(cmap='summer_r')

Survived	0	1	All
Pclass
1	80	136	216
2	97	87	184
3	372	119	491
All	549	342	891

  
df_train[['Pclass', 'Survived']].groupby(['Pclass']).mean().sort_values(by='Survived', ascending=False).plot.bar()
plt.xticks(rotation=0)

Pclass_data = df_train[['Pclass', 'Survived']].groupby(['Pclass']).mean().sort_values(by='Survived', ascending=False)
for index, value in enumerate(Pclass_data['Survived']):
    plt.text(index, value, str(round(value, 2)), ha='center', va='bottom')

plt.show()

  
y_position = 1.02
fig, ax = plt.subplots(1, 2, figsize=(18, 8))

df_train['Pclass'].value_counts().plot.bar(color=['#CD7F32', '#FFDF00', '#D3D3D3'], ax=ax[0])
ax[0].set_title('Number of Passengers By Pclass', y=y_position)
ax[0].set_ylabel('Count')
ax[0].set_xticklabels(ax[0].get_xticklabels(), rotation=0)

sns.countplot(x='Pclass', hue='Survived', data=df_train, ax=ax[1])
ax[1].set_title('Pclass: Survived vs Dead')
ax[1].set_ylabel('Count')

plt.show()

  
fig, ax = plt.subplots(1, 2, figsize=(20, 8))

df_train[['Sex', 'Survived']].groupby(['Sex']).mean().sort_values(by='Sex', ascending=False).plot.bar(ax=ax[0])
ax[0].set_title('Sex: Survived Mean', y=1.02)
ax[0].set_ylabel('Mean')
ax[0].set_xticklabels(ax[0].get_xticklabels(), rotation=0)

sns.countplot(x='Sex', hue='Survived', data=df_train, ax=ax[1])
ax[1].set_title('Sex: Survived vs Dead', y=1.02)
ax[1].set_ylabel('Count')

plt.show()

  
df_train[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)

	Sex	Survived
0	female	0.742038
1	male	0.188908

  
pd.crosstab(df_train['Sex'], df_train['Survived'], margins=True).style.background_gradient(cmap='summer_r')

Survived	0	1	All
Sex
female	81	233	314
male	468	109	577
All	549	342	891

  
sns.catplot(data=df_train, x='Pclass', y='Survived', hue='Sex', height=6, kind='point', saturation=.5, aspect=1.5)

<seaborn.axisgrid.FacetGrid at 0xbc20158>

  
sns.catplot(data=df_train, x='Sex', y='Survived', col='Pclass', kind='point', height=6, aspect=1)

<seaborn.axisgrid.FacetGrid at 0xba6afb0>

  
print('Oldest Passenger: {:.1f} Years'.format(df_train['Age'].max()))
print('Youngest Passenger: {:.1f} Years'.format(df_train['Age'].min()))
print('Mean Age: {:.1f}'.format(df_train['Age'].mean()))

Oldest Passenger: 80.0 Years
Youngest Passenger: 0.4 Years
Mean Age: 29.7

  
fig, ax = plt.subplots(1, 1, figsize=(9, 5))
sns.kdeplot(df_train[df_train['Survived'] == 1]['Age'], ax=ax)
sns.kdeplot(df_train[df_train['Survived'] == 0]['Age'], ax=ax)
plt.legend(['Survived', 'Dead'], fontsize=18)
plt.show()

  
# Age distribution withing classes
plt.figure(figsize=(8, 6))
df_train['Age'][df_train['Pclass'] == 1].plot(kind='kde')
df_train['Age'][df_train['Pclass'] == 2].plot(kind='kde')
df_train['Age'][df_train['Pclass'] == 3].plot(kind='kde')

plt.xlabel('Age')
plt.title('Age Distribution within Pclass', pad=20)
plt.legend(['1st Class', '2nd Class', '3rd Class'], fontsize=18)

<matplotlib.legend.Legend at 0x18380ba8>

  
cummulate_survival_ratio = []
for i in range(1, 80):
    cummulate_survival_ratio.append(df_train[df_train['Age'] < i]['Survived'].sum() / len(df_train[df_train['Age'] < i]['Survived']))

plt.figure(figsize=(8, 7))
plt.plot(cummulate_survival_ratio)
plt.title('Survival Rate Change by Age', y=1.02)
plt.xlabel('Range of Age (0<=Age<80)')
plt.ylabel('Survival Rate')
plt.show()

  
fig, ax = plt.subplots(1, 2, figsize=(22, 10))

sns.violinplot(data=df_train, x='Pclass', y='Age', hue='Survived', density_norm='count', split=True, gap=.1, inner='quart', ax=ax[0])
ax[0].set_title('Age and Survived Distribution by Pclass', fontsize=25, y=1.02)
ax[0].set_yticks(range(0, 110, 10))
sns.move_legend(ax[0], 'upper right', fontsize=20)

sns.violinplot(data=df_train, x='Sex', y='Age', hue='Survived', density_norm='count', split=True, gap=.1, inner='quart', ax=ax[1])
ax[1].set_title('Age and Survived Distribution by Sex', fontsize=25, y=1.02)
ax[1].set_yticks(range(0, 110, 10))
sns.move_legend(ax[1], 'upper right', fontsize=20)

plt.show()

  
fig, ax = plt.subplots(1, 2, figsize=(30, 10))
df_train[['Embarked', 'Survived']].groupby(['Embarked']).mean().sort_values(by='Survived', ascending=False).plot.bar(ax=ax[0])
ax[0].set_title('Survived Mean by Embarkation Point', y=1.02)
ax[0].set_ylabel('Mean')
ax[0].set_xticklabels(ax[0].get_xticklabels(), rotation=0)

sns.countplot(data=df_train, x='Embarked', hue='Survived', ax=ax[1])
plt.xlabel('Embarked')
plt.ylabel('Count')
plt.title('Survival Count by Embarkation Point', y=1.02)

plt.show()

  
fig, ax = plt.subplots(2, 2, figsize=(20, 15))

sns.countplot(data=df_train, x='Embarked', ax=ax[0, 0])
ax[0, 0].set_title('(1) Total No. of Passengers')

sns.countplot(data=df_train, x='Embarked', hue='Sex', ax=ax[0, 1])
ax[0, 1].set_title('(2) No. of Passengers by Sex and Embarked')

sns.countplot(data=df_train, x='Embarked', hue='Survived', ax=ax[1, 0])
ax[1, 0].set_title('(3) No. of Survived by Embarked')

sns.countplot(data=df_train, x='Embarked', hue='Pclass', ax=ax[1, 1])
ax[1, 1].set_title('(4) No. of Pclass by Embakred')

plt.subplots_adjust(wspace=0.2, hspace=0.5)
plt.show()

  
df_train['FamilySize'] = df_train['SibSp'] + df_train['Parch'] + 1
df_test['FamilySize'] = df_test['SibSp'] + df_train['Parch'] + 1

print('Maximum Size of Family: ', df_train['FamilySize'].max())
print('Minimum Size of Family: ', df_train['FamilySize'].min())

Maximum Size of Family:  11
Minimum Size of Family:  1

  
fig, ax = plt.subplots(1, 3, figsize=(40, 10))

sns.countplot(data=df_train, x='FamilySize', ax=ax[0])
ax[0].set_title('(1) No. of Passengers by Family Size', y=1.02)

sns.countplot(data=df_train, x='FamilySize', hue='Survived', ax=ax[1])
ax[1].set_title('(2) No. of Survived by Family Size', y=1.02)

df_train[['FamilySize', 'Survived']].groupby(['FamilySize']).mean().sort_values(by='Survived', ascending=False).plot.bar(ax=ax[2])
ax[2].set_title('(3) Mean of Survived by Family Size', y=1.02)

plt.subplots_adjust(wspace=0.2, hspace=0.5)
plt.show()

  
fig, ax = plt.subplots(1, 1, figsize=(6, 6))

fare_histplot = sns.histplot(df_train['Fare'], label='Skewness: {:.2f}'.format(df_train['Fare'].skew()), kde=True, ax=ax)
fare_histplot = fare_histplot.legend(loc='best', fontsize=22)

  
df_test.loc[df_test.Fare.isnull(), 'Fare'] = df_test['Fare'].mean()

df_train['Fare'] = df_train['Fare'].map(lambda x: np.log(x) if x > 0 else 0)
df_test['Fare'] = df_test['Fare'].map(lambda x: np.log(x) if x > 0 else 0)

  
fig, ax = plt.subplots(1, 1, figsize=(8, 8))

fare_loghist = sns.histplot(df_train['Fare'], label='Skeness: {:.2f}'.format(df_train['Fare'].skew()), kde=True, ax=ax)
fare_loghist = fare_loghist.legend(loc='best', fontsize=22)

  
df_train.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

  
df_train['Ticket'].value_counts()

347082      7
CA. 2343    7
1601        7
3101295     6
CA 2144     6
           ..
9234        1
19988       1
2693        1
PC 17612    1
370376      1
Name: Ticket, Length: 681, dtype: int64

  
# 2번째

df_train = pd.read_csv('../data/titanic/train.csv')
df_test = pd.read_csv('../data/titanic/test.csv')

df_train['FamilySize'] = df_train['SibSp'] + df_train['Parch'] + 1
df_test['FamilySize'] = df_test['SibSp'] + df_test['Parch'] + 1

df_test.loc[df_test.Fare.isnull(), 'Fare'] = df_test['Fare'].mean()

df_train['Fare'] = df_train['Fare'].map(lambda x: np.log(x) if x > 0 else 0)
df_test['Fare'] = df_test['Fare'].map(lambda x: np.log(x) if x > 0 else 0)

  
df_train['Initial'] = df_train.Name.str.extract('([A-Za-z]+)\.')
df_test['Initial'] = df_test.Name.str.extract('([A-Za-z]+)\.')

  
pd.crosstab(df_train['Initial'], df_train['Sex']).T.style.background_gradient(cmap='summer_r')

Initial	Capt	Col	Countess	Don	Dr	Jonkheer	Lady	Major	Master	Miss	Mlle	Mme	Mr	Mrs	Ms	Rev	Sir
Sex
female	0	0	1	0	1	0	1	0	0	182	2	1	0	125	1	0	0
male	1	2	0	1	6	1	0	2	40	0	0	0	517	0	0	6	1

  
df_train['Initial'].replace(['Mlle', 'Mme', 'Ms', 'Dr', 'Major', 'Lady', 'Countess', 'Jonkheer', 'Col', 'Rev', 'Capt', 'Sir', 'Don', 'Dona'],
                            ['Miss', 'Miss', 'Miss', 'Mr', 'Mr', 'Mrs', 'Mrs', 'Other', 'Other', 'Other', 'Mr', 'Mr', 'Mr', 'Mr'], inplace=True)
df_test['Initial'].replace(['Mlle', 'Mme', 'Ms', 'Dr', 'Major', 'Lady', 'Countess', 'Jonkheer', 'Col', 'Rev', 'Capt', 'Sir', 'Don', 'Dona'],
                            ['Miss', 'Miss', 'Miss', 'Mr', 'Mr', 'Mrs', 'Mrs', 'Other', 'Other', 'Other', 'Mr', 'Mr', 'Mr', 'Mr'], inplace=True)

  
df_train.groupby('Initial')['Survived'].mean().plot.bar()

<AxesSubplot:xlabel='Initial'>

  
df_train.groupby('Initial').mean()

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare	FamilySize
Initial
Master	414.975000	0.575000	2.625000	4.574167	2.300000	1.375000	3.340710	4.675000
Miss	411.741935	0.704301	2.284946	21.860000	0.698925	0.537634	3.123713	2.236559
Mr	455.880907	0.162571	2.381853	32.739609	0.293006	0.151229	2.651507	1.444234
Mrs	456.393701	0.795276	1.984252	35.981818	0.692913	0.818898	3.443751	2.511811
Other	564.444444	0.111111	1.666667	45.888889	0.111111	0.111111	2.641605	1.222222

  
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Master'), 'Age'] = 5
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Miss'), 'Age'] = 22
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Mr'), 'Age'] = 33
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Mrs'), 'Age'] = 36
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Other'), 'Age'] = 46

df_test.loc[(df_test.Age.isnull())&(df_test.Initial=='Master'), 'Age'] = 5
df_test.loc[(df_test.Age.isnull())&(df_test.Initial=='Miss'), 'Age'] = 22
df_test.loc[(df_test.Age.isnull())&(df_test.Initial=='Mr'), 'Age'] = 33
df_test.loc[(df_test.Age.isnull())&(df_test.Initial=='Mrs'), 'Age'] = 36
df_test.loc[(df_test.Age.isnull())&(df_test.Initial=='Other'), 'Age'] = 46

  
print('Embarked has ', sum(df_train['Embarked'].isnull()), ' Null values')

Embarked has  2  Null values

  
df_train['Embarked'].fillna('S', inplace=True)

  
df_train['Age_cat'] = 0
df_train.loc[df_train['Age'] < 10, 'Age_cat'] = 0
df_train.loc[(10 <= df_train['Age']) & (df_train['Age'] < 20), 'Age_cat'] = 1
df_train.loc[(20 <= df_train['Age']) & (df_train['Age'] < 30), 'Age_cat'] = 2
df_train.loc[(30 <= df_train['Age']) & (df_train['Age'] < 40), 'Age_cat'] = 3
df_train.loc[(40 <= df_train['Age']) & (df_train['Age'] < 50), 'Age_cat'] = 4
df_train.loc[(50 <= df_train['Age']) & (df_train['Age'] < 60), 'Age_cat'] = 5
df_train.loc[(60 <= df_train['Age']) & (df_train['Age'] < 70), 'Age_cat'] = 6
df_train.loc[70 <= df_train['Age'], 'Age_cat'] = 7

df_test['Age_cat'] = 0
df_test.loc[df_test['Age'] < 10, 'Age_cat'] = 0
df_test.loc[(10 <= df_test['Age']) & (df_test['Age'] < 20), 'Age_cat'] = 1
df_test.loc[(20 <= df_test['Age']) & (df_test['Age'] < 30), 'Age_cat'] = 2
df_test.loc[(30 <= df_test['Age']) & (df_test['Age'] < 40), 'Age_cat'] = 3
df_test.loc[(40 <= df_test['Age']) & (df_test['Age'] < 50), 'Age_cat'] = 4
df_test.loc[(50 <= df_test['Age']) & (df_test['Age'] < 60), 'Age_cat'] = 5
df_test.loc[(60 <= df_test['Age']) & (df_test['Age'] < 70), 'Age_cat'] = 6
df_test.loc[70 <= df_test['Age'], 'Age_cat'] = 7

  
def category_age(i):
    if i < 10:
        return 0
    elif i < 20:
        return 1
    elif i < 30:
        return 2
    elif i < 40:
        return 3
    elif i < 50:
        return 4
    elif i < 60:
        return 5
    elif i < 70:
        return 6
    else:
        return 7

df_train['Age_cat_2'] = df_train['Age'].apply(category_age)

  
print('Do Age_cat and Age_cat result same? ', (df_train['Age_cat'] == df_train['Age_cat_2']).all())

Do Age_cat and Age_cat result same?  True

  
df_train.drop(['Age', 'Age_cat_2'], axis=1, inplace=True)
df_test.drop(['Age'], axis=1, inplace=True)

  
df_train['Initial'] = df_train['Initial'].map({'Master': 0, 'Miss': 1, 'Mr': 2, 'Mrs': 3, 'Other': 4})
df_test['Initial'] = df_test['Initial'].map({'Master': 0, 'Miss': 1, 'Mr': 2, 'Mrs': 3, 'Other': 4})

  
df_train['Embarked'].unique()

array(['S', 'C', 'Q'], dtype=object)

  
df_train['Embarked'].value_counts()

S    646
C    168
Q     77
Name: Embarked, dtype: int64

  
df_train['Embarked'] = df_train['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})
df_test['Embarked'] = df_test['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})

  
df_train['Embarked'].isnull().any()

False

  
df_train['Sex'] = df_train['Sex'].map({'female': 0, 'male': 1})
df_test['Sex'] = df_test['Sex'].map({'female': 0, 'male': 1})

  
heatmap_data = df_train[['Survived', 'Pclass', 'Sex', 'Fare', 'Embarked', 'FamilySize', 'Initial', 'Age_cat']]

colormap = plt.cm.RdBu

plt.figure(figsize=(14, 12))
plt.title('Pearson Correlation Coefficient of Fetures', y=1.02, size=20)

sns.heatmap(heatmap_data.astype(float).corr(), vmax=1.0, cmap=colormap, annot=True, annot_kws={'size': 16}, linewidths=0.1, linecolor='white', square=True)

del heatmap_data

  
df_train = pd.get_dummies(df_train, prefix='Initial', columns=['Initial'])
df_test = pd.get_dummies(df_test, prefix='Initial', columns=['Initial'])

  
df_train.head()

	PassengerId	Survived	Pclass	Name	Sex	SibSp	Ticket	Fare	Cabin	Embarked	FamilySize	Age_cat	Initial_1	Initial_2	Initial_3
0	1	0	3	Braund, Mr. Owen Harris	1	1	A/5 21171	1.981001	NaN	2	2	2	0	1	0
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	0	1	PC 17599	4.266662	C85	0	2	3	0	0	1
2	3	1	3	Heikkinen, Miss. Laina	0	0	STON/O2. 3101282	2.070022	NaN	2	1	2	1	0	0
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	0	1	113803	3.972177	C123	2	2	3	0	0	1
4	5	0	3	Allen, Mr. William Henry	1	0	373450	2.085672	NaN	2	1	3	0	1	0

  
df_test.head()

	PassengerId	Pclass	Name	Sex	SibSp	Parch	Ticket	Fare	Cabin	Embarked	FamilySize	Age_cat	Initial_2	Initial_3
0	892	3	Kelly, Mr. James	1	0	0	330911	2.057860	NaN	1	1	3	1	0
1	893	3	Wilkes, Mrs. James (Ellen Needs)	0	1	0	363272	1.945910	NaN	2	2	4	0	1
2	894	2	Myles, Mr. Thomas Francis	1	0	0	240276	2.270836	NaN	1	1	6	1	0
3	895	3	Wirz, Mr. Albert	1	0	0	315154	2.159003	NaN	2	1	2	1	0
4	896	3	Hirvonen, Mrs. Alexander (Helga E Lindqvist)	0	1	1	3101298	2.508582	NaN	2	3	2	0	1

  
df_train = pd.get_dummies(df_train, prefix='Embarked', columns=['Embarked'])
df_test = pd.get_dummies(df_test, prefix='Embarked', columns=['Embarked'])

  
df_train.head()

	PassengerId	Survived	Pclass	Name	Sex	SibSp	Ticket	Fare	Cabin	FamilySize	Age_cat	Initial_1	Initial_2	Initial_3	Embarked_0	Embarked_2
0	1	0	3	Braund, Mr. Owen Harris	1	1	A/5 21171	1.981001	NaN	2	2	0	1	0	0	1
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	0	1	PC 17599	4.266662	C85	2	3	0	0	1	1	0
2	3	1	3	Heikkinen, Miss. Laina	0	0	STON/O2. 3101282	2.070022	NaN	1	2	1	0	0	0	1
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	0	1	113803	3.972177	C123	2	3	0	0	1	0	1
4	5	0	3	Allen, Mr. William Henry	1	0	373450	2.085672	NaN	1	3	0	1	0	0	1

  
df_test.head()

	PassengerId	Pclass	Name	Sex	SibSp	Parch	Ticket	Fare	Cabin	FamilySize	Age_cat	Initial_2	Initial_3	Embarked_1	Embarked_2
0	892	3	Kelly, Mr. James	1	0	0	330911	2.057860	NaN	1	3	1	0	1	0
1	893	3	Wilkes, Mrs. James (Ellen Needs)	0	1	0	363272	1.945910	NaN	2	4	0	1	0	1
2	894	2	Myles, Mr. Thomas Francis	1	0	0	240276	2.270836	NaN	1	6	1	0	1	0
3	895	3	Wirz, Mr. Albert	1	0	0	315154	2.159003	NaN	1	2	1	0	0	1
4	896	3	Hirvonen, Mrs. Alexander (Helga E Lindqvist)	0	1	1	3101298	2.508582	NaN	3	2	0	1	0	1

  
df_train.drop(['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin'], axis=1, inplace=True)
df_test.drop(['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin'], axis=1, inplace=True)

  
df_train.head()

	Survived	Pclass	Sex	Fare	FamilySize	Age_cat	Initial_1	Initial_2	Initial_3	Embarked_0	Embarked_2
0	0	3	1	1.981001	2	2	0	1	0	0	1
1	1	1	0	4.266662	2	3	0	0	1	1	0
2	1	3	0	2.070022	1	2	1	0	0	0	1
3	1	1	0	3.972177	2	3	0	0	1	0	1
4	0	3	1	2.085672	1	3	0	1	0	0	1

  
df_test.head()

	Pclass	Sex	Fare	FamilySize	Age_cat	Initial_2	Initial_3	Embarked_1	Embarked_2
0	3	1	2.057860	1	3	1	0	1	0
1	3	0	1.945910	2	4	0	1	0	1
2	2	1	2.270836	1	6	1	0	1	0
3	3	1	2.159003	1	2	1	0	0	1
4	3	0	2.508582	3	2	0	1	0	1

  
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split

  
X_train = df_train.drop('Survived', axis=1).values
target_label = df_train['Survived'].values
X_test = df_test.values

  
X_tr, X_vld, y_tr, y_vld = train_test_split(X_train, target_label, test_size=0.3, random_state=2018)

X_tr

array([[1.        , 1.        , 4.11373861, ..., 0.        , 0.        ,
        1.        ],
       [3.        , 1.        , 2.05412373, ..., 0.        , 0.        ,
        1.        ],
       [1.        , 1.        , 5.354225  , ..., 1.        , 0.        ,
        0.        ],
       ...,
       [2.        , 1.        , 2.35137526, ..., 0.        , 0.        ,
        1.        ],
       [3.        , 1.        , 2.08567209, ..., 0.        , 0.        ,
        1.        ],
       [3.        , 1.        , 1.98100147, ..., 0.        , 0.        ,
        1.        ]])

  
X_tr.shape[0]

623

X_vld

array([[1.        , 1.        , 5.57215403, ..., 0.        , 0.        ,
        1.        ],
       [3.        , 1.        , 2.08567209, ..., 0.        , 0.        ,
        1.        ],
       [1.        , 1.        , 3.41772668, ..., 0.        , 0.        ,
        1.        ],
       ...,
       [1.        , 1.        , 3.25809654, ..., 0.        , 0.        ,
        1.        ],
       [2.        , 1.        , 2.35137526, ..., 0.        , 0.        ,
        1.        ],
       [2.        , 0.        , 3.66356165, ..., 0.        , 0.        ,
        1.        ]])

  
X_vld.shape[0]

268

y_tr

array([0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1,
       0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1,
       1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,
       1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1,
       0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1,
       0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1,
       1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1,
       1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,
       0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0,
       1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0,
       1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 0, 1, 1, 1, 0, 0], dtype=int64)

  
y_tr.shape[0]

623

y_vld

array([0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1,
       1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1,
       1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1,
       1, 0, 0, 1], dtype=int64)

  
y_vld.shape[0]

268

  
model = RandomForestClassifier()
model.fit(X_tr, y_tr)
prediction = model.predict(X_vld)

  
print(prediction)

[0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0
0 0 0 0 0 1 0 1 1 1 0 0 1 0 1 1 1 1 0 1 0 1 0 1 0 0 1 0 0 0 0 0 1 1 1 0
0 0 0 0 0 0 0 1 1 0 1 1 1 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 0 1 1
0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 1 0 1 1 0 1 0 1 0 1 0 0
0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 1 0 1 0 1 1 1 1 0 0 0 0 0 1
1 0 0 0 0 1 0 0 1 0 0 0 1 1 0 1 0 1 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 1 1 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 0 1 0 1 1 0 0 0 0 0 0 1 0 1 1
1 0 1 1 0 0 0 1]

  
print('Survived Predection with {:.2f}% accuracy out of total passengers {}'.format(100 * metrics.accuracy_score(y_vld, prediction), y_vld.shape[0]))

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

Cell In[97], line 1
----> 1 print('Survived Predection with {:.2f}% accuracy out of total passengers {}'.format(100 * metrics.accuracy_score(prediction, y_vld), y_vld.shape[0]))


File /lib/python3.11/site-packages/sklearn/utils/_param_validation.py:211, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    205 try:
    206     with config_context(
    207         skip_parameter_validation=(
    208             prefer_skip_nested_validation or global_skip_validation
    209         )
    210     ):
--> 211         return func(*args, **kwargs)
    212 except InvalidParameterError as e:
    213     # When the function is just a wrapper around an estimator, we allow
    214     # the function to delegate validation to the estimator, but we replace
    215     # the name of the estimator by the name of the function in the error
    216     # message to avoid confusion.
    217     msg = re.sub(
    218         r"parameter of \w+ must be",
    219         f"parameter of {func.__qualname__} must be",
    220         str(e),
    221     )


File /lib/python3.11/site-packages/sklearn/metrics/_classification.py:220, in accuracy_score(y_true, y_pred, normalize, sample_weight)
    154 """Accuracy classification score.
    155
    156 In multilabel classification, this function computes subset accuracy:
   (...)
    216 0.5
    217 """
    219 # Compute accuracy for each possible representation
--> 220 y_type, y_true, y_pred = _check_targets(y_true, y_pred)
    221 check_consistent_length(y_true, y_pred, sample_weight)
    222 if y_type.startswith("multilabel"):


File /lib/python3.11/site-packages/sklearn/metrics/_classification.py:84, in _check_targets(y_true, y_pred)
     57 def _check_targets(y_true, y_pred):
     58     """Check that y_true and y_pred belong to the same classification task.
     59
     60     This converts multiclass or binary types to a common shape, and raises a
   (...)
     82     y_pred : array or indicator matrix
     83     """
---> 84     check_consistent_length(y_true, y_pred)
     85     type_true = type_of_target(y_true, input_name="y_true")
     86     type_pred = type_of_target(y_pred, input_name="y_pred")


File /lib/python3.11/site-packages/sklearn/utils/validation.py:407, in check_consistent_length(*arrays)
    405 uniques = np.unique(lengths)
    406 if len(uniques) > 1:
--> 407     raise ValueError(
    408         "Found input variables with inconsistent numbers of samples: %r"
    409         % [int(l) for l in lengths]
    410     )


ValueError: Found input variables with inconsistent numbers of samples: [418, 268]

  
print(model.feature_importances_)

[0.09869027 0.11627062 0.33726262 0.09358139 0.11944675 0.01194506
 0.03424261 0.11596645 0.02474876 0.00479071 0.01296574 0.0122562
 0.01783281]

  
print(model.decision_path(X_vld))

(<268x26804 sparse matrix of type '<class 'numpy.intc'>'
	with 262443 stored elements in Compressed Sparse Row format>, array([    0,   255,   516,   773,  1038,  1287,  1530,  1797,  2052,
        2321,  2582,  2843,  3126,  3389,  3664,  3925,  4228,  4467,
        4722,  4997,  5270,  5523,  5778,  6031,  6312,  6579,  6852,
        7125,  7398,  7627,  7892,  8165,  8446,  8693,  8972,  9261,
        9528,  9825, 10094, 10359, 10600, 10875, 11162, 11433, 11688,
       11967, 12212, 12481, 12760, 13025, 13304, 13567, 13828, 14067,
       14318, 14595, 14880, 15163, 15466, 15757, 16016, 16315, 16582,
       16855, 17130, 17387, 17654, 17931, 18228, 18477, 18730, 18989,
       19274, 19539, 19794, 20053, 20326, 20603, 20886, 21149, 21400,
       21649, 21926, 22203, 22442, 22701, 22980, 23251, 23544, 23861,
       24120, 24385, 24654, 24913, 25178, 25451, 25700, 25957, 26242,
       26533, 26804]))

  
print(model.predict_proba(X_vld)[:10])

[[0.76       0.24      ]
 [0.38471429 0.61528571]
 [0.702      0.298     ]
 [0.99       0.01      ]
 [0.01       0.99      ]
 [1.         0.        ]
 [0.94416667 0.05583333]
 [0.98       0.02      ]
 [1.         0.        ]
 [0.74880952 0.25119048]]

  
print(model.get_params())

{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}

  
from pandas import Series

feature_importance = model.feature_importances_
Series_feat_imp = Series(feature_importance, index=df_test.columns)
print(Series_feat_imp)

Pclass        0.098690
Sex           0.116271
Fare          0.337263
FamilySize    0.093581
Age_cat       0.119447
Initial_0     0.011945
Initial_1     0.034243
Initial_2     0.115966
Initial_3     0.024749
Initial_4     0.004791
Embarked_0    0.012966
Embarked_1    0.012256
Embarked_2    0.017833
dtype: float64

  
num_features = len(Series_feat_imp)
random_colors = np.random.rand(num_features, 3)

plt.figure(figsize=(8, 8))
Series_feat_imp.sort_values(ascending=True).plot.barh(color=random_colors)
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.show()

  
cm = metrics.confusion_matrix(y_vld, prediction)
print('Confusion Matrix\n', cm)

Confusion Matrix
 [[152  20]
 [ 27  69]]

  
cr = metrics.classification_report(y_vld, prediction)
print('Classification Report\n', cr)

Classification Report
               precision    recall  f1-score   support

           0       0.85      0.88      0.87       172
           1       0.78      0.72      0.75        96

    accuracy                           0.82       268
   macro avg       0.81      0.80      0.81       268
weighted avg       0.82      0.82      0.82       268

  
submission = pd.read_csv('../data/titanic/gender_submission.csv')

  
submission.head()

	PassengerId	Survived
0	892	0
1	893	1
2	894	0
3	895	0
4	896	1

  
prediction = model.predict(X_test)
submission['Survived'] = prediction

  
print(submission)

     PassengerId  Survived
          892         0
          893         0
          894         0
          895         0
          896         0
..           ...       ...
       1305         0
       1306         1
       1307         0
       1308         0
       1309         1

[418 rows x 2 columns]

  
submission.to_csv('../data/titanic/my_first_submission.csv', index=False)

  
submission_true = pd.read_csv('../data/titanic/gender_submission.csv')
submission_true = np.array(submission_true['Survived'])
print(submission_true)

[0 1 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 1 1 0 0 1 0 1 0 1 0 0 0 0 0 1 1 0 0 1
0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 1 1 0 1 0
0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0
1 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0
0 1 0 0 1 0 0 1 1 0 1 1 0 1 0 0 1 0 0 1 1 0 0 0 0 0 1 1 0 1 1 0 0 1 0 1
1 0 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 1 0 0 1 0 1 0 0 0 0 1 1 0 1 0 1 0 1 0
0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 1
0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 1 1 0 1 1 0
1 0 0 1 1 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0
1 1 1 1 1 0 1 0 0 0]

  
submission_predict = np.array(submission['Survived'])
print(submission_predict)

[0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 1 1 0 1 1 1 1 0 1 0 0 0 0 0 1 1 1 0 0
0 1 0 1 0 1 1 0 1 0 1 1 1 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 1 0 0 1 1 0 1 1
0 0 1 0 1 1 0 0 0 0 0 1 0 1 1 1 0 1 0 1 0 1 0 0 0 1 0 0 0 1 0 1 0 0 0 0
1 1 1 0 0 1 0 1 1 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0
0 1 0 0 1 0 0 1 0 1 0 1 1 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 1 1 0 1 1 0 1
1 0 0 0 0 0 1 0 1 1 1 0 0 0 1 1 1 1 0 0 1 0 1 0 0 0 0 1 1 0 1 0 1 0 1 0
0 1 0 0 1 0 0 0 1 0 0 1 0 0 0 1 1 1 1 0 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 1
0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 1 1 0 1 0 0 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 1 0 0
0 0 0 0 0 1 0 0 0 1 1 1 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 1 1 0 0 1 1
1 0 0 1 1 0 0 0 1 0 1 1 0 0 1 0 0 1 0 0 1 1 0 0 1 0 1 0 0 1 0 1 0 0 1 0
1 1 1 1 1 0 1 0 0 1]

  
print('Survived Prediction with {:.2f}% accuracy out of total number {}'.format(100 * metrics.accuracy_score(submission_true, submission_predict), submission_true.shape[0]))

Survived Prediction with 83.01% accuracy out of total number 418

Data Science, Kaggle

Data Science Kaggle

This post is licensed under CC-BY-NC-ND-4.0 by the author.

머리말

이번 공부의 목적

향후 정리하고 확장할 사항

요약

본문

Trending Tags