Deep Learning: Building an MLP Neural Network for Titanic Passenger Survival Analysis

带我去滑雪 2024-06-17 10:48:58

Hi everyone, I'm 带我去滑雪!

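This post uses the Titanic dataset. The response variable is whether a passenger survived (survived: 1 = survived, 0 = died). The feature variables are passenger class (pclass), passenger name (name), sex (sex: male or female), age (age), number of siblings or spouses aboard (sibsp), number of parents or children aboard (parch), ticket number (ticket), fare (fare), cabin number (cabin), and port of embarkation (embarked: one of the three ports C, Q, and S).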

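An earlier post, "Building an MLP Neural Network for Classification with Keras: Diabetes Prediction", already covered how to build an MLP neural network. The reason for practising again here is that the dataset in that post was already cleaned and could be used directly. In practice, data often arrives without any processing at all, so the main focus of this post is how to preprocess such data and then build an MLP neural network on top of it.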

1. Data preprocessing

(1) Inspect the data

import numpy as np
import pandas as pd
from tensorflow.keras.models import Sequential  # import the Sequential model
from tensorflow.keras.layers import Dense  # import the Dense (fully connected) layer
df = pd.read_csv("titanic_data.csv")  # adjust the path to wherever the CSV file is stored
df

Output:

      pclass  survived  name                                            sex     age      sibsp  parch  ticket  fare      cabin    embarked
0     1       1         Allen Miss. Elisabeth Walton                    female  29.0000  0      0      24160   211.3375  B5       S
1     1       1         Allison Master. Hudson Trevor                   male    0.9167   1      2      113781  151.5500  C22 C26  S
2     1       0         Allison Miss. Helen Loraine                     female  2.0000   1      2      113781  151.5500  C22 C26  S
3     1       0         Allison Mr. Hudson Joshua Creighton             male    30.0000  1      2      113781  151.5500  C22 C26  S
4     1       0         Allison Mrs. Hudson J C (Bessie Waldo Daniels)  female  25.0000  1      2      113781  151.5500  C22 C26  S
...   ...     ...       ...                                             ...     ...      ...    ...    ...     ...       ...      ...
1304  3       0         Zabour Miss. Hileni                             female  14.5000  1      0      2665    14.4542   NaN      C
1305  3       0         Zabour Miss. Thamine                            female  NaN      1      0      2665    14.4542   NaN      C
1306  3       0         Zakarian Mr. Mapriededer                        male    26.5000  0      0      2656    7.2250    NaN      C
1307  3       0         Zakarian Mr. Ortin                              male    27.0000  0      0      2670    7.2250    NaN      C
1308  3       0         Zimmerman Mr. Leo                               male    29.0000  0      0      315082  7.8750    NaN      S

1309 rows × 11 columns

The describe() function gives descriptive statistics for the numeric columns:

print(df.describe())

Output:

       pclass     survived          age        sibsp        parch  
count  1309.000000  1309.000000  1046.000000  1309.000000  1309.000000   
mean      2.294882     0.381971    29.881135     0.498854     0.385027   
std       0.837836     0.486055    14.413500     1.041658     0.865560   
min       1.000000     0.000000     0.166700     0.000000     0.000000   
25%       2.000000     0.000000    21.000000     0.000000     0.000000   
50%       3.000000     0.000000    28.000000     0.000000     0.000000   
75%       3.000000     1.000000    39.000000     1.000000     0.000000   
max       3.000000     1.000000    80.000000     8.000000     9.000000   

              fare  
count  1308.000000  
mean     33.295479  
std      51.758668  
min       0.000000  
25%       7.895800  
50%      14.454200  
75%      31.275000  
max     512.329200

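Comparing the counts shows that age and fare have fewer than 1309 values, which suggests that they contain missing values.

(2) Find missing values

# Look for missing values

print(df.info())

Output: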

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   pclass    1309 non-null   int64  
 1   survived  1309 non-null   int64  
 2   name      1309 non-null   object 
 3   sex       1309 non-null   object 
 4   age       1046 non-null   float64
 5   sibsp     1309 non-null   int64  
 6   parch     1309 non-null   int64  
 7   ticket    1309 non-null   object 
 8   fare      1308 non-null   float64
 9   cabin     295 non-null    object 
 10  embarked  1307 non-null   object 
dtypes: float64(2), int64(4), object(5)
memory usage: 112.6+ KB
None

# Show the number of missing values in each column
print(df.isnull().sum())

Output:

pclass         0
survived       0
name           0
sex            0
age          263
sibsp          0
parch          0
ticket         0
fare           1
cabin       1014
embarked       2
dtype: int64

These checks show that age, fare, cabin, and embarked all contain missing values: 263 for age, 1 for fare, 1014 for cabin, and 2 for embarked.
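To put these counts in perspective, it can help to express them as a share of the 1309 rows. The short sketch below only reuses the counts printed above; the variable names are illustrative and not part of the original code.

```python
# Missing-value counts copied from the isnull().sum() output above.
n_rows = 1309
missing = {"age": 263, "fare": 1, "cabin": 1014, "embarked": 2}

# Share of rows with a missing value, per column, as a percentage.
missing_pct = {col: round(100 * n / n_rows, 1) for col, n in missing.items()}

for col, pct in missing_pct.items():
    print(f"{col}: {pct}%")
```

About a fifth of the age values and roughly three quarters of the cabin values are missing, while fare and embarked are almost complete. On the DataFrame itself, df.isnull().mean() gives the same shares as fractions.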

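(3) Drop unneeded columns

The name, ticket, and cabin columns are text fields that cannot be fed to the neural network directly and are not used in what follows, so all three are dropped.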

df = df.drop(["name", "ticket", "cabin"], axis=1)

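(4) Fill in missing values

Missing values in age and fare are filled with the column mean. For embarked, the occurrences of C, Q, and S are counted, and the missing values are filled with the most frequent value, S.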

# Fill in the missing data
df["age"] = df["age"].fillna(df["age"].mean())
df["fare"] = df["fare"].fillna(df["fare"].mean())
df["embarked"] = df["embarked"].fillna(df["embarked"].value_counts().idxmax())
print(df[["age"]])
print(df[["fare"]])
print(df["embarked"].value_counts())
print(df["embarked"].value_counts().idxmax())

Output:
         age
0     29.000000
1      0.916700
2      2.000000
3     30.000000
4     25.000000
...         ...
1304  14.500000
1305  29.881135
1306  26.500000
1307  27.000000
1308  29.000000

[1309 rows x 1 columns]
          fare
0     211.3375
1     151.5500
2     151.5500
3     151.5500
4     151.5500
...        ...
1304   14.4542
1305   14.4542
1306    7.2250
1307    7.2250
1308    7.8750

[1309 rows x 1 columns]
S    916
C    270
Q    123
Name: embarked, dtype: int64
S

(5) Convert categorical data

The feature sex is converted from the categories female and male to the numeric values 1 and 0 using the map function.

df["sex"] = df["sex"].map( {"female": 1, "male": 0} ).astype(int)
print(df["sex"])

Output:
0       1
1       0
2       1
3       0
4       1
       ..
1304    1
1305    1
1306    0
1307    0
1308    0
Name: sex, Length: 1309, dtype: int32

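(6) One-hot encode the categorical embarked column

The embarked column takes three values: S, C, and Q. It can either be mapped to numbers with the map function or split into three one-hot encoded columns; the second option is used here:

# Use pd.get_dummies to split the column into embarked_C, embarked_Q, and embarked_S;
# dtype=int keeps the 0/1 values shown below (newer pandas versions default to True/False)
embarked_one_hot = pd.get_dummies(df["embarked"], prefix="embarked", dtype=int)
df = df.drop("embarked", axis=1)  # drop the original embarked column
df = df.join(embarked_one_hot)  # join the three one-hot encoded columns
df

Output: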

      pclass  survived  sex  age        sibsp  parch  fare      embarked_C  embarked_Q  embarked_S
0     1       1         1    29.000000  0      0      211.3375  0           0           1
1     1       1         0    0.916700   1      2      151.5500  0           0           1
2     1       0         1    2.000000   1      2      151.5500  0           0           1
3     1       0         0    30.000000  1      2      151.5500  0           0           1
4     1       0         1    25.000000  1      2      151.5500  0           0           1
...   ...     ...       ...  ...        ...    ...    ...       ...         ...         ...
1304  3       0         1    14.500000  1      0      14.4542   1           0           0
1305  3       0         1    29.881135  1      0      14.4542   1           0           0
1306  3       0         0    26.500000  0      0      7.2250    1           0           0
1307  3       0         0    27.000000  0      0      7.2250    1           0           0
1308  3       0         0    29.000000  0      0      7.8750    0           0           1

1309 rows × 10 columns
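The two options for embarked, integer codes via map or one-hot columns, can be compared on a toy column. This is only a small sketch on made-up data: the toy frame and the 0/1/2 code assignment are assumptions for illustration, not part of the Titanic file.

```python
import pandas as pd

# Toy data for illustration only (not taken from the Titanic file).
toy = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})

# Option 1: integer codes via map; the 0/1/2 assignment is an arbitrary choice.
codes = toy["embarked"].map({"S": 0, "C": 1, "Q": 2})

# Option 2: one-hot encoding; dtype=int keeps 0/1 instead of True/False.
one_hot = pd.get_dummies(toy["embarked"], prefix="embarked", dtype=int)

print(codes.tolist())
print(one_hot)
```

Integer codes suggest an ordering (here S < C < Q) that the ports do not actually have, which is one common reason to prefer one-hot columns for a feature like this.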

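(7) Move the response variable to the last column of the DataFrame

To make it easier to separate the features from the labels later on, the response variable survived is moved to the last column.

# Move the label column survived to the end
df_survived = df.pop("survived")
df["survived"] = df_survived
df

Output: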

      pclass  sex  age        sibsp  parch  fare      embarked_C  embarked_Q  embarked_S  survived
0     1       1    29.000000  0      0      211.3375  0           0           1           1
1     1       0    0.916700   1      2      151.5500  0           0           1           1
2     1       1    2.000000   1      2      151.5500  0           0           1           0
3     1       0    30.000000  1      2      151.5500  0           0           1           0
4     1       1    25.000000  1      2      151.5500  0           0           1           0
...   ...     ...  ...        ...    ...    ...       ...         ...         ...         ...
1304  3       1    14.500000  1      0      14.4542   1           0           0           0
1305  3       1    29.881135  1      0      14.4542   1           0           0           0
1306  3       0    26.500000  0      0      7.2250    1           0           0           0
1307  3       0    27.000000  0      0      7.2250    1           0           0           0
1308  3       0    29.000000  0      0      7.8750    0           0           1           0

1309 rows × 10 columns

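Data preprocessing is now complete! Next, the dataset is split and the neural network is built and trained.

2. Split into training and test sets, and save the split datasets

The dataset is randomly split into a training set (about 80%) and a test set (about 20%).

# Split into training and test sets

mask = np.random.rand(len(df)) < 0.8
df_train = df[mask]
df_test = df[~mask]
print("Train:", df_train.shape)
print("Test:", df_test.shape)

Output: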
Train: (1051, 10)
Test: (258, 10)
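Because the mask is drawn at random without a seed, the split sizes (and every later metric) change from run to run, and the 80/20 ratio only holds approximately. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable; the seed value 42 and the helper name random_split_mask are illustrative choices, not from the original code:

```python
import numpy as np

def random_split_mask(n_rows, train_fraction=0.8, seed=42):
    """Return a boolean mask: True marks training rows, False marks test rows."""
    rng = np.random.default_rng(seed)  # seeded generator, same mask on every run
    return rng.random(n_rows) < train_fraction

mask = random_split_mask(1309)
print("Train:", int(mask.sum()), "Test:", int((~mask).sum()))
```

The mask can then be used exactly as above, with df[mask] as the training set and df[~mask] as the test set.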

# Save the processed data
df_train.to_csv("titanic_train.csv", index=False)
df_test.to_csv("titanic_test.csv", index=False)

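3. Build the MLP neural network model

First, the feature data and label data are separated for both the training set and the test set, and the features of each are standardized. The deep neural network is then defined: the input layer takes 9 features, each of the two hidden layers has 11 neurons, and because this is a binary classification problem the output layer has a single neuron. Both hidden layers use the ReLU activation function, and the output layer uses the Sigmoid activation function.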

dataset_train = df_train.values  # get the underlying array of each dataset
dataset_test = df_test.values

# Separate the feature data from the label data
X_train = dataset_train[:, 0:9]
Y_train = dataset_train[:, 9]
X_test = dataset_test[:, 0:9]
Y_test = dataset_test[:, 9]
# Standardize the features
X_train -= X_train.mean(axis=0)
X_train /= X_train.std(axis=0)
X_test -= X_test.mean(axis=0)
X_test /= X_test.std(axis=0)
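Note that the snippet above standardizes the test set with its own mean and standard deviation. A common alternative is to compute the statistics on the training set only and reuse them for the test set, so that both sets go through exactly the same transformation. A minimal sketch on made-up arrays (the helper name standardize_with_train_stats and the toy values are assumptions for illustration):

```python
import numpy as np

def standardize_with_train_stats(X_train, X_test):
    """Scale both arrays using the mean and std of the training data only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant columns
    return (X_train - mean) / std, (X_test - mean) / std

# Toy data: 4 training rows and 2 test rows with 2 features each.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_test = np.array([[2.5, 25.0], [5.0, 50.0]])

X_train_s, X_test_s = standardize_with_train_stats(X_train, X_test)
print(X_train_s.mean(axis=0))  # close to 0 for every feature
print(X_test_s)
```

Switching to this approach would change the reported test metrics somewhat, so the numbers in this post correspond to the original per-set standardization.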

# Define the model
model = Sequential()
model.add(Dense(11, input_dim=X_train.shape[1], activation="relu"))
model.add(Dense(11, activation="relu"))
model.add(Dense(1, activation="sigmoid"))
model.summary()  # show the model information

Output:
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 11)                110       
                                                                 
 dense_1 (Dense)             (None, 11)                132       
                                                                 
 dense_2 (Dense)             (None, 1)                 12        
                                                                 
=================================================================
Total params: 254
Trainable params: 254
Non-trainable params: 0

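Parameter counts per layer (weights plus biases): the first hidden layer has 9*11+11=110, the second hidden layer has 11*11+11=132, and the output layer has 11*1+1=12, for a total of 254 parameters.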
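The same arithmetic can be reproduced with plain NumPy arrays that have the shapes of the three Dense layers. The sketch below is not the Keras model itself: the weights are random values chosen for illustration, and the forward function is a hand-written stand-in that only mirrors the layer sizes and activations described above.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [9, 11, 11, 1]  # input, hidden 1, hidden 2, output

# One (weights, biases) pair per Dense layer, randomly initialised for the demo.
params = [(rng.normal(size=(n_in, n_out)), np.zeros(n_out))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

# Each layer has n_in * n_out weights plus n_out biases.
per_layer = [W.size + b.size for W, b in params]
print(per_layer, sum(per_layer))  # [110, 132, 12] 254

def forward(X):
    """ReLU on the hidden layers, sigmoid on the output layer."""
    h = X
    for W, b in params[:-1]:
        h = np.maximum(0.0, h @ W + b)
    W, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(h @ W + b)))

probs = forward(rng.normal(size=(4, 9)))  # 4 made-up passengers, 9 features
print(probs.shape)  # (4, 1)
```

Each row of the result is a single value between 0 and 1, which is why one output neuron with a sigmoid is enough for a two-class problem.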

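4. Compile, train, and evaluate the model

When compiling the model, the loss function is binary_crossentropy, the optimizer is adam, and the evaluation metric is accuracy. When training the model, 20% of the training set is held out as the validation set, the number of epochs is 100, and the batch size is 10.

# Compile the model
model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])
# Train the model
print("Training ...")
history = model.fit(X_train, Y_train, validation_split=0.2,
                    epochs=100, batch_size=10)

Output: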

...
Epoch 94/100
84/84 [==============================] - 0s 2ms/step - loss: 0.3762 - accuracy: 0.8476 - val_loss: 0.4197 - val_accuracy: 0.8294
Epoch 95/100
84/84 [==============================] - 0s 2ms/step - loss: 0.3760 - accuracy: 0.8429 - val_loss: 0.4163 - val_accuracy: 0.8294
Epoch 96/100
84/84 [==============================] - 0s 2ms/step - loss: 0.3754 - accuracy: 0.8429 - val_loss: 0.4185 - val_accuracy: 0.8341
Epoch 97/100
84/84 [==============================] - 0s 2ms/step - loss: 0.3751 - accuracy: 0.8452 - val_loss: 0.4136 - val_accuracy: 0.8436
Epoch 98/100
84/84 [==============================] - 0s 2ms/step - loss: 0.3757 - accuracy: 0.8512 - val_loss: 0.4284 - val_accuracy: 0.8246
Epoch 99/100
84/84 [==============================] - 0s 2ms/step - loss: 0.3744 - accuracy: 0.8488 - val_loss: 0.4212 - val_accuracy: 0.8294
Epoch 100/100
84/84 [==============================] - 0s 2ms/step - loss: 0.3740 - accuracy: 0.8440 - val_loss: 0.4212 - val_accuracy: 0.8341

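Each epoch in the output reports the loss and accuracy on the training set, followed by the loss and accuracy on the validation set.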
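The 84/84 step counter in the log can be checked by hand from the training set size, the validation split, and the batch size. The sketch below assumes that Keras keeps floor(n × 0.8) rows for fitting and holds out the remaining rows from the end of the array for validation; the variable names are illustrative.

```python
import math

n_train_rows = 1051      # rows in df_train, from the split output above
validation_split = 0.2
batch_size = 10

# Assumption: floor(n * (1 - validation_split)) rows are used for fitting,
# and the remaining rows (taken from the end) are used for validation.
n_fit = math.floor(n_train_rows * (1 - validation_split))
n_val = n_train_rows - n_fit
steps_per_epoch = math.ceil(n_fit / batch_size)

print(n_fit, n_val, steps_per_epoch)
```

With 840 rows for fitting and a batch size of 10, each epoch takes 84 steps, which matches the log. The validation metrics are computed on the remaining 211 rows, so a single extra correct prediction moves val_accuracy by roughly 0.005.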

# Evaluate the model
print("Evaluating the model, please wait ...")
loss, accuracy = model.evaluate(X_train, Y_train)
print("Training set accuracy = {:.2f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, Y_test)
print("Test set accuracy = {:.2f}".format(accuracy))

Output:
Evaluating the model, please wait ...
33/33 [==============================] - 0s 1ms/step - loss: 0.3804 - accuracy: 0.8478
Training set accuracy = 0.85
9/9 [==============================] - 0s 2ms/step - loss: 0.5483 - accuracy: 0.7519
Test set accuracy = 0.75
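The loss reported next to the accuracy is the binary cross-entropy averaged over the samples. To make the two numbers easier to read side by side, here is a minimal NumPy sketch of both metrics on made-up labels and probabilities; the function names and the eps clipping (used to avoid log(0)) are assumptions for illustration, not the Keras implementation.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(y_prob)
                           + (1 - y_true) * np.log(1 - y_prob))))

def accuracy(y_true, y_prob, threshold=0.5):
    """Share of samples whose thresholded prediction equals the label."""
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    return float(np.mean(y_pred == np.asarray(y_true)))

y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]   # made-up model outputs
print(binary_cross_entropy(y_true, y_prob), accuracy(y_true, y_prob))
```

Accuracy only looks at which side of 0.5 a prediction falls on, whereas the loss also penalizes confident mistakes. That is one way the test loss (0.5483) can sit well above the training loss (0.3804) while the accuracy gap looks more modest.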

5. Plot the loss and accuracy curves for the training and validation sets

import matplotlib.pyplot as plt
# Plot the training and validation loss curves
loss = history.history["loss"]
epochs = range(1, len(loss)+1)
val_loss = history.history["val_loss"]
plt.plot(epochs, loss, "b-", label="Training Loss")
plt.plot(epochs, val_loss, "r--", label="Validation Loss")
plt.title("Training and Validation Loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.savefig("squares1.png",
            bbox_inches="tight",
            pad_inches=1,
            transparent=True,
            facecolor="w",
            edgecolor="w",
            dpi=300,
            orientation="landscape")

Output: [Figure: "Training and Validation Loss", with Epochs on the x-axis and Loss on the y-axis]

# Plot the training and validation accuracy curves
plt.figure()  # start a new figure so the curves are not drawn over the loss plot
acc = history.history["accuracy"]
epochs = range(1, len(acc)+1)
val_acc = history.history["val_accuracy"]
plt.plot(epochs, acc, "b-", label="Training Acc")
plt.plot(epochs, val_acc, "r--", label="Validation Acc")
plt.title("Training and Validation Accuracy")
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend()
plt.savefig("squares2.png",
            bbox_inches="tight",
            pad_inches=1,
            transparent=True,
            facecolor="w",
            edgecolor="w",
            dpi=300,
            orientation="landscape")

Output: [Figure: "Training and Validation Accuracy", with Epochs on the x-axis and Accuracy on the y-axis]

Judging from the curves, the model performs best at roughly 18 training epochs; training for longer leads to overfitting.
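Reading the best epoch off a plot is approximate. One way to cross-check it is to look for the epoch with the lowest validation loss. The helper below is a small sketch on made-up numbers; best_epoch and toy_val_loss are hypothetical names, and the losses are not taken from the training run above.

```python
import numpy as np

def best_epoch(val_loss):
    """Return the 1-based epoch at which the validation loss is lowest."""
    return int(np.argmin(val_loss)) + 1

# Made-up validation losses for illustration: falling, then rising again.
toy_val_loss = [0.62, 0.51, 0.46, 0.44, 0.45, 0.47, 0.50]
print(best_epoch(toy_val_loss))  # 4
```

On the real run this would be called as best_epoch(history.history["val_loss"]). Since the validation loss fluctuates from epoch to epoch, the single minimum is best treated as a rough guide rather than an exact answer.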

6. Retrain with the best number of epochs, and save the model structure and weights

model = Sequential()
model.add(Dense(11, input_dim=X_train.shape[1], activation="relu"))
model.add(Dense(11, activation="relu"))
model.add(Dense(1, activation="sigmoid"))
# Compile the model
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
# Train the model
print("Training the model, please wait ...")
model.fit(X_train, Y_train, epochs=18, batch_size=10, verbose=0)
# Evaluate the model
print("Evaluating the model, please wait ......")
loss, accuracy = model.evaluate(X_train, Y_train)
print("Training set accuracy = {:.2f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, Y_test)
print("Test set accuracy = {:.2f}".format(accuracy))

Output:
Training the model, please wait ...

Evaluating the model, please wait ......
33/33 [==============================] - 0s 1ms/step - loss: 0.3960 - accuracy: 0.8402
Training set accuracy = 0.84
9/9 [==============================] - 0s 1ms/step - loss: 0.5574 - accuracy: 0.7364
Test set accuracy = 0.74

# Save the model structure and weights
model.save("titanic.h5")
print("Model saved to titanic.h5")

7. Use the trained model for prediction

from tensorflow.keras.models import load_model
# Load the network structure and weights saved earlier
model = load_model("titanic.h5")
# Compile the model
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
# Evaluate the model
loss, accuracy = model.evaluate(X_test, Y_test)
print("Test set accuracy = {:.2f}".format(accuracy))

Output:
9/9 [==============================] - 0s 1ms/step - loss: 0.5574 - accuracy: 0.7364
Test set accuracy = 0.74

predict = model.predict(X_test)
# The sigmoid output is a probability, so threshold it at 0.5 to get class labels
Y_pred = (predict > 0.5).astype(int)
y_pred = np.squeeze(Y_pred)
print(y_pred[:5])

# Show the confusion matrix
tb = pd.crosstab(Y_test.astype(int), y_pred,
                 rownames=["label"], colnames=["predict"])
print(tb)

Output:
9/9 [==============================] - 0s 1ms/step
[1 1 1 0 0]
predict    0   1
label           
0        126  21
1         47  64
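The confusion matrix carries more information than the single accuracy figure. The sketch below only reuses the four counts printed above; the names tn, fp, fn, and tp are the usual shorthand for true/false negatives and positives and are not part of the original code.

```python
# Counts copied from the confusion matrix above (rows: label, columns: predict).
tn, fp = 126, 21
fn, tp = 47, 64

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of predicted survivors, how many survived
recall = tp / (tp + fn)      # of actual survivors, how many were found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The accuracy of 0.7364 matches the value reported by model.evaluate. The recall of about 0.58 shows that the model misses a fair share of the passengers who actually survived, which the accuracy alone does not reveal.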
