My first thought was to start from the input:
Figure out which features actually matter. The features provided in the Kaggle starter template are features = ["Pclass", "Sex", "SibSp", "Parch"]; the method I use to find the important ones is permutation importance.
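As a reminder of what permutation importance measures: shuffle one feature column of a validation set at a time and record how much the model's score drops. Below is a minimal illustrative sketch, not the notebook's code; fitted_model, X_val and y_val are assumed names for an already-trained model and a held-out validation split (the actual sklearn call appears later).

import numpy as np

def manual_permutation_importance(fitted_model, X_val, y_val, n_repeats=10, seed=0):
    # score on the untouched validation data as the baseline
    rng = np.random.default_rng(seed)
    baseline = fitted_model.score(X_val, y_val)
    importances = {}
    for col in X_val.columns:
        drops = []
        for _ in range(n_repeats):
            X_shuffled = X_val.copy()
            # shuffling one column breaks its relationship with the target
            X_shuffled[col] = rng.permutation(X_shuffled[col].values)
            drops.append(baseline - fitted_model.score(X_shuffled, y_val))
        # average score drop over the repeats = importance of this feature
        importances[col] = float(np.mean(drops))
    return importances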
Step 1: before running permutation importance, the data needs to be preprocessed. There are two main parts: handling null values and handling categorical values.
Initialization:
import eli5
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import os
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will
# list all files under the input directory:
# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))
from sklearn.ensemble import RandomForestClassifier

train_data = pd.read_csv("./input/train.csv")

### permutation importance:
from sklearn.model_selection import train_test_split

y_permut = train_data["Survived"]  # target column (already encoded as 0/1)
basefeatures = ['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
                'Ticket', 'Fare', 'Cabin', 'Embarked']
X_permut = train_data[basefeatures].copy()
# print(X_permut.head(5))

#### preprocess
Handling categorical values:
## for categories
# from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import LabelEncoder

class_le = LabelEncoder()
X_permut['Name'] = class_le.fit_transform(X_permut['Name'].values)
X_permut['Sex'] = class_le.fit_transform(X_permut['Sex'].values)
X_permut['Cabin'] = class_le.fit_transform(X_permut['Cabin'].astype(str).values)
# astype(str) so NaN entries encode as the string 'nan' instead of raising a
# TypeError when LabelEncoder tries to sort mixed float/str values
X_permut['Embarked'] = class_le.fit_transform(X_permut['Embarked'].astype(str).values)
X_permut['Ticket'] = class_le.fit_transform(X_permut['Ticket'].values)
# print(X_permut.head(10))
####
Handling null values:
## for null
from sklearn.impute import SimpleImputer

cols_with_missing_X_permut = [col for col in X_permut.columns
                              if X_permut[col].isnull().any()]
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
for col in cols_with_missing_X_permut:
    imputer = imputer.fit(X_permut[[col]])
    X_permut[[col]] = imputer.transform(X_permut[[col]])
# print(X_permut.head(10))
###
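As a design note, the two preprocessing steps above (encode categoricals, impute nulls) can also be bundled into a single sklearn ColumnTransformer, so the same transformation can later be reapplied to new data with one call. This is only a sketch, not the notebook's approach; the categorical/numeric column split below is an assumption.

# Sketch only: the same preprocessing expressed as one reusable transformer.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

categorical_cols = ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']  # assumed split
numeric_cols = ['PassengerId', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

preprocess = ColumnTransformer([
    # categoricals: fill missing values with the most frequent one, then integer-encode
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('encode', OrdinalEncoder())]), categorical_cols),
    # numerics: fill missing values with the column mean
    ('num', SimpleImputer(strategy='mean'), numeric_cols),
])

X_permut_alt = preprocess.fit_transform(train_data[basefeatures])  # returns a numpy array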
Use permutation importance to pick out the important features:
train_X_permut, val_X_permut, train_y_permut, val_y_permut = train_test_split(
    X_permut, y_permut, random_state=1)
permut_model = RandomForestClassifier(n_estimators=50, random_state=1).fit(
    train_X_permut, train_y_permut)

from sklearn.inspection import permutation_importance

perm = permutation_importance(permut_model, val_X_permut, val_y_permut,
                              n_repeats=30, random_state=0)
important_features = []
for i in perm.importances_mean.argsort()[::-1]:
    if perm.importances_mean[i] - 2 * perm.importances_std[i] > 0:
        # index into X_permut's columns, not train_data's: the importances were
        # computed on X_permut, which does not contain the 'Survived' column,
        # so train_data.columns[i] would label the features wrongly
        important_features.append(val_X_permut.columns[i])
        print(f"{val_X_permut.columns[i]:<8}"
              f"{perm.importances_mean[i]:.3f}"
              f" +/- {perm.importances_std[i]:.3f}")
Step 2: train on the important features. The following code raises an error:
Number of features of the model must match the input. Model n_features is 1578 and input n_features is 787
The reason is that pd.get_dummies produces a different set of dummy columns for the training and test data, because the distinct values in each dataset differ: with high-cardinality string columns, every distinct value becomes its own dummy column, so the two frames end up with different feature counts. A possible fix is discussed here:
https://stackoverflow.com/questions/44026832/valueerror-number-of-features-of-the-model-must-match-the-input
But I haven't tried it yet, so for now I don't know whether it actually works.
print(important_features)
# With the corrected column indexing above, 'Survived' should no longer appear
# in important_features; guard the removal just in case.
if 'Survived' in important_features:
    important_features.remove('Survived')

test_data = pd.read_csv("./input/test.csv")
y = train_data['Survived']
# features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[important_features])
X_test = pd.get_dummies(test_data[important_features])
# print(train_data.columns)
print(X.count)
print(X_test.count)
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)

## for null
cols_with_missing_X_test = [col for col in X_test.columns
                            if X_test[col].isnull().any()]
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
for col in cols_with_missing_X_test:
    imputer = imputer.fit(X_test[[col]])
    X_test[[col]] = imputer.transform(X_test[[col]])
# print(X_permut.head(10))
###
print(X_test.count)

predictions = model.predict(X_test)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('my_submission.csv', index=False)
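A common way to resolve this mismatch (along the lines of the linked thread, though I have not verified it here) is to force the test-set dummies to use exactly the training set's columns, e.g. with DataFrame.reindex. A minimal sketch, untested in this notebook:

# Sketch of the fix: align the test dummies to the training dummy columns so
# the model sees the same feature set at predict time.
X = pd.get_dummies(train_data[important_features])
X_test = pd.get_dummies(test_data[important_features])

# Add any dummy columns the test set is missing (filled with 0) and drop
# columns that only appear in the test set.
X_test = X_test.reindex(columns=X.columns, fill_value=0)

assert list(X.columns) == list(X_test.columns)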