Data Preprocessing and Feature Engineering, Part 6: Data Preprocessing and Feature Engineering in Kaggle House Price Prediction
Published: 2019-03-01


This post walks through data preprocessing and feature engineering on the Kaggle House Prices dataset.

1. Loading the Data

Load the training and test sets.

    import pandas as pd

    # Load the data, using the first column (Id) as the index
    train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv', header=0, index_col=0)
    test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv', header=0, index_col=0)

View summary statistics for the numeric features.

    # Summary statistics, transposed so that each feature is a row
    train.describe().T

Check the data types.

    # Count the columns of each data type
    train.dtypes.value_counts()

2. Inspecting Missing Values

Use missingno to visualize the missing values.

    import missingno as mg

    # Missing-value matrix
    mg.matrix(train)

Compute the share of missing values per column.

    def missing_percentage(df):
        # Missing count per column, largest first, keeping only columns with gaps
        total = df.isnull().sum().sort_values(ascending=False)
        total = total[total != 0]
        # The same counts expressed as a percentage of all rows
        percentage = round(total * 100 / len(df), 2)
        return pd.concat([total, percentage], axis=1, keys=['Total', 'Percentage'])

    missing_percentage(train)
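As a quick sanity check of the idea, here is a minimal sketch on a made-up three-column frame. The frame `toy` and its values are invented for illustration, and `isnull().mean()` is used as a shorter route to the same percentages; it is not the function from the post.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are made up).
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "GrLivArea": [1710, 1262, 1786, 1717],
})

# isnull().mean() gives the fraction of missing rows per column.
missing_share = (toy.isnull().mean() * 100).round(2)

# Keep only columns that actually have gaps, largest share first.
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)
print(missing_share)
```

Columns without gaps, such as `GrLivArea` here, drop out of the result, which keeps the table short on a wide dataset.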

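3. Exploring the Data

3.1 Distribution of the Target Variable

Examine the distribution of the target variable SalePrice.

    import matplotlib.pyplot as plt
    import seaborn as sns
    from matplotlib import style
    from scipy import stats

    def plot_1(df, feature):
        style.use('fivethirtyeight')
        fig, axes = plt.subplots(3, 1, constrained_layout=True, figsize=(10, 24))
        # Histogram with a kernel density estimate
        # (histplot stands in for the deprecated sns.distplot)
        sns.histplot(df[feature], kde=True, stat='density', ax=axes[0])
        # Normal probability (Q-Q) plot
        stats.probplot(df[feature], plot=axes[1])
        # Box plot
        sns.boxplot(x=df[feature], ax=axes[2])

    plot_1(train, 'SalePrice')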

Check the skewness and kurtosis.

    # Skewness and kurtosis of the target
    train.SalePrice.skew(), train.SalePrice.kurtosis()
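To make the two numbers easier to read, the sketch below compares a symmetric sample with a right-skewed one. Both series are made up. pandas reports sample skewness and excess kurtosis (Fisher's definition, where a normal distribution scores 0).

```python
import pandas as pd

# Made-up samples: one symmetric, one with a long right tail.
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
right_skewed = pd.Series([1.0, 1.0, 2.0, 2.0, 3.0, 20.0])

# Skewness near 0 means a symmetric sample; a positive value means a right tail.
print(symmetric.skew(), symmetric.kurtosis())
print(right_skewed.skew(), right_skewed.kurtosis())
```

A clearly positive skewness for SalePrice would point to the same kind of long right tail, which is what motivates a log transform of the target.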

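3.2 Relationship Between the Target and Categorical Features

Examine the relationship between SalePrice and OverallQual.

    def customized_cat_boxplot(y, x):
        style.use('fivethirtyeight')
        plt.subplots(figsize=(12, 8))
        # One box of y per category of x
        sns.boxplot(y=y, x=x)

    customized_cat_boxplot(train.SalePrice, train.OverallQual)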

3.3 Relationship Between the Target and Numeric Features

Examine the relationship between SalePrice and GrLivArea.

    def customized_num_scatterplot(y, x):
        style.use('fivethirtyeight')
        plt.subplots(figsize=(12, 8))
        # Scatter plot of y against x
        sns.scatterplot(y=y, x=x)

    customized_num_scatterplot(train.SalePrice, train.GrLivArea)

Remove the outliers.

    # Drop the outliers with a very large living area
    train = train[train.GrLivArea < 4500]
    train.reset_index(drop=True, inplace=True)

Plot the residuals.

    plt.subplots(figsize=(12, 8))
    # Recent seaborn versions require x and y as keyword arguments
    sns.residplot(x=train.GrLivArea, y=train.SalePrice)

3.4 Transforming the Target

Apply a log transform.

    import numpy as np

    # Log transform: log(1 + SalePrice)
    train['SalePrice'] = np.log1p(train['SalePrice'])

Check the distribution again.

    plot_1(train, 'SalePrice')
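The effect of the transform can be sketched on a handful of made-up prices. The values below are invented for illustration. The sketch also shows `np.expm1`, the inverse of `np.log1p`, which would be needed to turn predictions on the log scale back into prices.

```python
import numpy as np
import pandas as pd

# Made-up prices with one expensive house pulling the right tail.
prices = pd.Series([120000.0, 150000.0, 180000.0, 210000.0, 755000.0])

# log1p computes log(1 + x); expm1 undoes it.
log_prices = np.log1p(prices)
restored = np.expm1(log_prices)

# The log scale compresses the right tail, so skewness shrinks.
print(prices.skew(), log_prices.skew())
```

On this toy sample the skewness drops but does not vanish; how far it drops on the real data depends on the data.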

4. Handling Missing Values

Fill in the missing values feature by feature.

    # all_data is not defined in this excerpt; it is assumed to hold the
    # combined train and test features

    # Categorical features where a missing value means the house lacks the feature
    missing_val_col = ["Alley", "PoolQC", "MiscFeature", "Fence", "FireplaceQu",
                       "GarageType", "GarageFinish", "GarageQual", "GarageCond",
                       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
                       'BsmtFinType2', 'MasVnrType']
    for i in missing_val_col:
        all_data[i] = all_data[i].fillna('None')

    # Numeric features where a missing value is filled with 0
    missing_val_col2 = ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF',
                        'BsmtFullBath', 'BsmtHalfBath', 'GarageYrBlt', 'GarageArea',
                        'GarageCars', 'MasVnrArea']
    for i in missing_val_col2:
        all_data[i] = all_data[i].fillna(0)

    # Fill LotFrontage with the mean of its neighborhood
    all_data['LotFrontage'] = all_data.groupby('Neighborhood')['LotFrontage'].transform(lambda x: x.fillna(x.mean()))
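The code above uses `all_data` without showing where it comes from. A minimal sketch of one common way to build such a frame follows, assuming it is the training features (without the target) stacked on top of the test features. The frames `train_toy`, `test_toy`, and `all_toy` and their values are made up, and this is not necessarily how the original post built `all_data`.

```python
import numpy as np
import pandas as pd

# Tiny stand-ins for the train and test sets (made-up values).
train_toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "B"],
    "LotFrontage": [60.0, np.nan, 80.0],
    "SalePrice": [200000, 210000, 300000],
})
test_toy = pd.DataFrame({
    "Neighborhood": ["A", "B"],
    "LotFrontage": [70.0, np.nan],
})

# Assumption: stack the feature columns of both sets, leaving out the target.
all_toy = pd.concat(
    [train_toy.drop(columns="SalePrice"), test_toy], ignore_index=True
)

# Fill each missing LotFrontage with the mean of its own neighborhood.
all_toy["LotFrontage"] = all_toy.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.mean())
)
print(all_toy)
```

Filling on the combined frame keeps train and test consistent, at the cost of letting test rows influence the fill values.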

Feature Engineering

1. Combining Features

Build combined features.

    # Total floor area: basement plus first and second floors
    all_data['TotalFS'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']
    # Sum of the construction year and the remodel year
    all_data['YrBltAndRemod'] = all_data['YearBuilt'] + all_data['YearRemodAdd']
    # Finished basement area plus first and second floors
    all_data['Total_sqr_footage'] = all_data['BsmtFinSF1'] + all_data['BsmtFinSF2'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']
    # Bathroom count, with each half bath counted as 0.5
    all_data['Total_Bathrooms'] = all_data['FullBath'] + (0.5 * all_data['HalfBath']) + all_data['BsmtFullBath'] + (0.5 * all_data['BsmtHalfBath'])
    # Total porch and deck area
    all_data['Total_porch_sf'] = all_data['OpenPorchSF'] + all_data['3SsnPorch'] + all_data['EnclosedPorch'] + all_data['ScreenPorch'] + all_data['WoodDeckSF']

2. Indicator Features

Create binary indicator features.

    # 1 if the house has the feature (area or count above zero), else 0
    all_data['haspool'] = (all_data['PoolArea'] > 0).astype(int)
    all_data['has2ndfloor'] = (all_data['2ndFlrSF'] > 0).astype(int)
    all_data['hasgarage'] = (all_data['GarageArea'] > 0).astype(int)
    all_data['hasbsmt'] = (all_data['TotalBsmtSF'] > 0).astype(int)
    all_data['hasfireplace'] = (all_data['Fireplaces'] > 0).astype(int)

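3. Removing Features

Drop features dominated by a single value.

    def overfit_reducer(df):
        # Collect columns where one value accounts for more than 99.94% of rows
        overfit = []
        for i in df.columns:
            counts = df[i].value_counts()
            zeros = counts.iloc[0]  # count of the most frequent value
            if zeros / len(df) * 100 > 99.94:
                overfit.append(i)
        return overfit

    # X_train and X_test are the prepared feature matrices;
    # their construction is not shown in this excerpt
    overfitted_features = overfit_reducer(X_train)
    X_train = X_train.drop(overfitted_features, axis=1)
    X_test = X_test.drop(overfitted_features, axis=1)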
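The same filter can be tried on a toy frame with a looser threshold. In the sketch below, the helper `dominated_columns`, the 0.9 threshold, and the data are all made up for illustration. It uses `value_counts(normalize=True, dropna=False)`, so missing values count as a value of their own, which differs slightly from the function above.

```python
import pandas as pd

def dominated_columns(df, threshold=0.9):
    """Return columns whose most frequent value covers more than `threshold` of rows."""
    flagged = []
    for col in df.columns:
        # value_counts sorts by frequency, so the first entry is the top share.
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share > threshold:
            flagged.append(col)
    return flagged

# Made-up data: one column is 95% zeros, the other is evenly split.
toy = pd.DataFrame({
    "almost_constant": [0] * 19 + [1],
    "balanced": [0, 1] * 10,
})
print(dominated_columns(toy))
```

A column that is nearly constant carries very little signal, and a model can latch onto its few deviating rows, which is presumably why the post calls these features overfitted.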

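4. Encoding

Use one-hot encoding. Before encoding, the code below casts numeric category codes to strings so that they are treated as categories, and fills the remaining gaps in the categorical features.

    # Treat MSSubClass, YrSold and MoSold as categories rather than numbers
    all_data['MSSubClass'] = all_data['MSSubClass'].astype(str)
    # Fill MSZoning with the most common zoning of the same MSSubClass
    all_data['MSZoning'] = all_data.groupby('MSSubClass')['MSZoning'].transform(lambda x: x.fillna(x.mode()[0]))
    all_data['YrSold'] = all_data['YrSold'].astype(str)
    all_data['MoSold'] = all_data['MoSold'].astype(str)

    # Fill the remaining categorical gaps with a fixed value or the column mode
    all_data['Functional'] = all_data['Functional'].fillna('Typ')
    all_data['Utilities'] = all_data['Utilities'].fillna('AllPub')
    all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
    all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])
    all_data['KitchenQual'] = all_data['KitchenQual'].fillna("TA")
    all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])
    all_data['Electrical'] = all_data['Electrical'].fillna("SBrkr")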
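The one-hot step itself is not shown in the code above. A minimal sketch of how it might look with `pd.get_dummies` follows, on a made-up two-column frame. The frame `toy` and its values are assumptions for illustration, not the post's actual encoding code.

```python
import pandas as pd

# Made-up frame: MSSubClass is a numeric code that is really a category.
toy = pd.DataFrame({
    "MSSubClass": [20, 60, 20],
    "GrLivArea": [1710, 1262, 1786],
})

# Cast the code to str so get_dummies treats it as categorical.
toy["MSSubClass"] = toy["MSSubClass"].astype(str)

# Numeric columns pass through; each category becomes its own 0/1 column.
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```

Without the cast to `str`, `get_dummies` would leave `MSSubClass` as a single numeric column, which is why the type conversion comes first.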


Reposted from: http://xusv.baihongyu.com/
