Kaggle Titanic Optimization Notes

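I recently tried the well-known Kaggle, and working through its problems is genuinely fun. These are my notes on tuning a Kaggle Titanic submission, kept for future reference.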

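Using a random forest model directly on just four features, "Pclass", "Sex", "SibSp", and "Parch", reaches an accuracy of 0.77511. The random forest is the scikit-learn implementation, with these parameters: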

(n_estimators=200, max_depth=6, random_state=1)

(n_estimators=300, max_depth=6, random_state=1)
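
For reference, the baseline might look roughly like the sketch below. It assumes the standard Kaggle train.csv / test.csv column names. The helper names (encode_features, fit_baseline, predict_baseline) and the one-hot encoding of Sex are illustrative assumptions, not necessarily the code that produced the score above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["Pclass", "Sex", "SibSp", "Parch"]


def encode_features(df, features=FEATURES):
    """One-hot encode string columns (Sex is text in the Kaggle CSV)."""
    return pd.get_dummies(df[features])


def fit_baseline(train_df):
    """Fit the four-feature random forest; return the model and its column order."""
    X = encode_features(train_df)
    y = train_df["Survived"]
    model = RandomForestClassifier(n_estimators=200, max_depth=6, random_state=1)
    model.fit(X, y)
    return model, X.columns


def predict_baseline(model, columns, test_df):
    """Predict on new rows, aligning their columns with the training matrix."""
    X_test = encode_features(test_df).reindex(columns=columns, fill_value=0)
    return model.predict(X_test)
```

With the Kaggle files loaded through pd.read_csv, fit_baseline(train) returns the model together with the training column order, and predict_baseline reuses that order so the test matrix lines up.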

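Adding Age as a feature and replacing NaN values with a constant 30 gives an accuracy of 0.77033.

Next I grouped passengers by Sex and Pclass, computed the mean age of each group, and filled each missing Age with the mean of its group. The model is still a random forest, and the accuracy is 0.77751, slightly better than the four-feature baseline (0.77511).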
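
The two Age strategies can be sketched as follows. This is a minimal illustration assuming the Kaggle column names. The function names are hypothetical, and computing the group means within the same DataFrame (rather than, say, from the training set only) is an assumption, since the notes do not say which was done.

```python
import pandas as pd


def fill_age_constant(df, value=30):
    """Replace missing Age values with a fixed number."""
    out = df.copy()
    out["Age"] = out["Age"].fillna(value)
    return out


def fill_age_by_group(df, keys=("Sex", "Pclass")):
    """Replace missing Age values with the mean Age of the row's (Sex, Pclass) group."""
    out = df.copy()
    # transform("mean") ignores NaN and returns one group mean per row
    group_mean = out.groupby(list(keys))["Age"].transform("mean")
    out["Age"] = out["Age"].fillna(group_mean)
    return out
```

A group whose ages are all missing would stay NaN under fill_age_by_group, so a fallback such as the overall mean may still be needed.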

Both of the ideas above come from: How to Solve the missing values in Age Column ?
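
I then added the Fare field for training and prediction, filling missing Fare values with 31 (an arbitrarily chosen number). This gives an accuracy of 0.78708.

Building on the previous step, I filled missing Fare values with the mean Fare of the passenger's Pclass instead. The result is still 0.78708, probably because so few Fare values are missing that this imputation makes little difference.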
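
A sketch of the Pclass-based Fare fill is below. The function name is hypothetical, and taking the class means from the training set and applying them to the test set is an assumption about how the step was done. As far as I know, the standard Kaggle files have only a single test row with a missing Fare, which would be consistent with the unchanged score.

```python
import pandas as pd


def fill_fare_by_pclass(train_df, test_df):
    """Fill missing Fare in test_df with the mean Fare of the same Pclass in train_df."""
    fare_by_class = train_df.groupby("Pclass")["Fare"].mean()
    out = test_df.copy()
    # map each row's Pclass to its class mean, then use it only where Fare is missing
    out["Fare"] = out["Fare"].fillna(out["Pclass"].map(fare_by_class))
    return out
```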

Next I switched to an SVM, using scikit-learn's SVC with gamma='auto' and every other parameter left at its default:

model = SVC(gamma='auto')

The resulting accuracy is 0.60765, far worse than the random forest.
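
One thing that was not tried here: an RBF-kernel SVC is sensitive to feature scale, and Fare and Age span much larger ranges than Pclass or SibSp. The sketch below builds either the plain SVC used above or a variant with StandardScaler in front. The make_svc helper is hypothetical, and whether scaling would actually improve the leaderboard score is untested.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def make_svc(scale=False):
    """Return SVC(gamma='auto'), optionally preceded by feature standardization."""
    svc = SVC(gamma="auto")
    if scale:
        # standardize each feature to zero mean and unit variance before the SVC
        return make_pipeline(StandardScaler(), svc)
    return svc
```

Both variants expose the same fit / predict interface, so they can be swapped into the rest of the workflow without other changes.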

Going back to the random forest, I tried a different set of parameters:

RandomForestClassifier(n_estimators=300, max_depth=10, random_state=10)

This gives an accuracy of 0.76794. Trying another set of parameters:

RandomForestClassifier(n_estimators=300, max_depth=20, random_state=1)

This gives an accuracy of 0.76076. Neither set performs as well as the earliest parameters.
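
The scores in these notes come from Kaggle submissions. As a complementary check, the same parameter sets could be compared locally with cross-validation, which is a different technique from the leaderboard comparison used above. The sketch below assumes X and y are an already-encoded feature matrix and the Survived labels; compare_param_sets is a hypothetical helper.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# The four parameter sets that appear in these notes.
PARAM_SETS = [
    {"n_estimators": 200, "max_depth": 6, "random_state": 1},
    {"n_estimators": 300, "max_depth": 6, "random_state": 1},
    {"n_estimators": 300, "max_depth": 10, "random_state": 10},
    {"n_estimators": 300, "max_depth": 20, "random_state": 1},
]


def compare_param_sets(X, y, param_sets=PARAM_SETS, cv=5):
    """Return (params, mean cross-validated accuracy) for each parameter set."""
    results = []
    for params in param_sets:
        model = RandomForestClassifier(**params)
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        results.append((params, float(scores.mean())))
    return results
```

Local cross-validation scores will not match the leaderboard numbers exactly, but they allow comparing settings without spending a submission on each one.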
