Machine Learning: Regression and Classification 4-3 (the AdaBoost Algorithm)
Predicting Black Friday Spending with AdaBoost
Main workflow:
- 1. Import packages
- 2. Import the dataset
- 3. Data preprocessing
- 3.1 Detect and handle missing values
- 3.2 Drop useless columns
- 3.3 Check categorical variables
- 3.4 Label encoding & one-hot encoding
- 3.5 Get the independent and dependent variables
- 3.6 Split into training and test sets
- 3.7 Feature scaling
- 4. Build AdaBoost regression models with different parameters
- 4.1 Model 1: build an AdaBoost regression model
- 4.1.1 Build the model
- 4.1.2 Predict on the test set
- 4.1.3 Evaluate model performance
- 4.2 Model 2: build an AdaBoost regression model
Dataset link:
1. Import packages
In [2]:# Import packages
import numpy as np
import pandas as pd
2. Import the dataset
In [3]:# Import the dataset
data = pd.read_csv('BlackFriday.csv')
data
Out[3]:
(index) | User_ID | Product_ID | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category_1 | Product_Category_2 | Product_Category_3 | Purchase |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1000001 | P00069042 | F | 0-17 | 10 | A | 2 | 0 | 3 | NaN | NaN | 8370 |
1 | 1000001 | P00248942 | F | 0-17 | 10 | A | 2 | 0 | 1 | 6.0 | 14.0 | 15200 |
2 | 1000001 | P00087842 | F | 0-17 | 10 | A | 2 | 0 | 12 | NaN | NaN | 1422 |
3 | 1000001 | P00085442 | F | 0-17 | 10 | A | 2 | 0 | 12 | 14.0 | NaN | 1057 |
4 | 1000002 | P00285442 | M | 55+ | 16 | C | 4+ | 0 | 8 | NaN | NaN | 7969 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 1001649 | P00102642 | M | 18-25 | 19 | C | 2 | 1 | 4 | 8.0 | 9.0 | 1374 |
49996 | 1001649 | P00035842 | M | 18-25 | 19 | C | 2 | 1 | 5 | 6.0 | 9.0 | 5372 |
49997 | 1001649 | P00052842 | M | 18-25 | 19 | C | 2 | 1 | 10 | 15.0 | NaN | 18879 |
49998 | 1001649 | P00183142 | M | 18-25 | 19 | C | 2 | 1 | 15 | NaN | NaN | 17029 |
49999 | 1001650 | P00155642 | M | 26-35 | 19 | C | 1 | 0 | 8 | NaN | NaN | 6093 |
50000 rows × 12 columns
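Before preprocessing, it can help to glance at column types and summary statistics. A minimal optional sketch (not part of the original notebook run), assuming `data` is the DataFrame loaded above:
# Quick inspection of the raw data
data.info()                          # dtypes and non-null counts per column
print(data.describe())               # summary statistics for the numeric columns
print(data['Purchase'].describe())   # distribution of the target variable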
3. Data preprocessing
3.1 Detect and handle missing values
In [4]:# Check for missing values
null_df = data.isnull().sum()
null_df
Out[4]:
User_ID 0
Product_ID 0
Gender 0
Age 0
Occupation 0
City_Category 0
Stay_In_Current_City_Years 0
Marital_Status 0
Product_Category_1 0
Product_Category_2 15721
Product_Category_3 34817
Purchase 0
dtype: int64
In [5]:
# Drop the columns with missing values
data = data.drop(['Product_Category_2', 'Product_Category_3'], axis = 1)
In [6]:
# Check for missing values again
null_df = data.isnull().sum()
null_df
Out[6]:
User_ID 0
Product_ID 0
Gender 0
Age 0
Occupation 0
City_Category 0
Stay_In_Current_City_Years 0
Marital_Status 0
Product_Category_1 0
Purchase 0
dtype: int64
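Dropping Product_Category_2 and Product_Category_3 is the simplest option here, since roughly 31% and 70% of their values are missing. If one wanted to keep them instead, the missing categories could be filled with a sentinel value before encoding. A hedged sketch (this is not what the notebook does; `raw` is a hypothetical fresh copy of the CSV):
# Alternative: keep the sparse columns and fill missing categories with 0
raw = pd.read_csv('BlackFriday.csv')
raw['Product_Category_2'] = raw['Product_Category_2'].fillna(0).astype('int64')
raw['Product_Category_3'] = raw['Product_Category_3'].fillna(0).astype('int64')
print(raw[['Product_Category_2', 'Product_Category_3']].isnull().sum())  # both should be 0 now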
3.2 Drop useless columns
In [7]:# Drop useless columns
data = data.drop(['User_ID', 'Product_ID'], axis = 1)
3.3 Check categorical variables
In [8]:# Check categorical variables
print(data.dtypes)
Gender object
Age object
Occupation int64
City_Category object
Stay_In_Current_City_Years object
Marital_Status int64
Product_Category_1 int64
Purchase int64
dtype: object
In [9]:
# Convert variable types
data['Stay_In_Current_City_Years'].replace('4+', 4, inplace = True)
data['Stay_In_Current_City_Years'] = data['Stay_In_Current_City_Years'].astype('int64')
data['Product_Category_1'] = data['Product_Category_1'].astype('object')
data['Occupation'] = data['Occupation'].astype('object')
data['Marital_Status'] = data['Marital_Status'].astype('object')
In [10]:
# Check categorical variables again
print(data.dtypes)
Gender object
Age object
Occupation object
City_Category object
Stay_In_Current_City_Years int64
Marital_Status object
Product_Category_1 object
Purchase int64
dtype: object
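To confirm that the '4+' replacement and the type conversions worked as intended, the actual values can be inspected as well. An optional sketch:
# Sanity checks on the converted columns
print(data['Stay_In_Current_City_Years'].unique())   # integers only, no '4+' left
print(data['Age'].value_counts())                    # Age remains a set of range labels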
3.4 Label encoding & one-hot encoding
In [11]:# Label encoding & one-hot encoding
data = pd.get_dummies(data, drop_first = True)
data
Out[11]:
(index) | Stay_In_Current_City_Years | Purchase | Gender_M | Age_18-25 | Age_26-35 | Age_36-45 | Age_46-50 | Age_51-55 | Age_55+ | Occupation_1 | ... | Product_Category_1_9 | Product_Category_1_10 | Product_Category_1_11 | Product_Category_1_12 | Product_Category_1_13 | Product_Category_1_14 | Product_Category_1_15 | Product_Category_1_16 | Product_Category_1_17 | Product_Category_1_18 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 8370 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 2 | 15200 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 2 | 1422 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 2 | 1057 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 4 | 7969 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2 | 1374 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
49996 | 2 | 5372 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
49997 | 2 | 18879 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
49998 | 2 | 17029 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
49999 | 1 | 6093 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
50000 rows × 49 columns
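pd.get_dummies one-hot encodes every object-typed column; drop_first=True drops the first level of each categorical variable so the resulting dummy columns are not perfectly collinear (the dummy variable trap). An equivalent, more pipeline-friendly sketch using scikit-learn's OneHotEncoder is shown below; it assumes the un-encoded DataFrame (here called `raw_features`, a hypothetical name) is still available, which is no longer the case at this point in the notebook:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

cat_cols = ['Gender', 'Age', 'Occupation', 'City_Category',
            'Marital_Status', 'Product_Category_1']
encoder = ColumnTransformer(
    [('onehot', OneHotEncoder(drop='first'), cat_cols)],
    remainder='passthrough')        # Stay_In_Current_City_Years passes through unchanged
# encoded = encoder.fit_transform(raw_features)   # raw_features: DataFrame before get_dummies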
3.5 Get the independent and dependent variables
In [12]:# Get the independent and dependent variables
y = data['Purchase'].values
data = data.drop(['Purchase'], axis = 1)
x = data.values
3.6 Split into training and test sets
In [13]:# Split into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 1)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
(35000, 48)
(15000, 48)
(35000,)
(15000,)
3.7 Feature scaling
In [14]:# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
sc_y = StandardScaler()
y_train = np.ravel(sc_y.fit_transform(y_train.reshape(-1, 1)))
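Note that AdaBoost with tree-based weak learners is, in principle, insensitive to feature scaling, so standardizing x and y here mainly keeps the workflow uniform; the cost is that predictions must later be mapped back with sc_y.inverse_transform. A hedged alternative sketch using TransformedTargetRegressor, which scales and un-scales the target automatically (it assumes an unscaled target, here called `y_train_unscaled`, so it is not a drop-in replacement for the cells below):
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.preprocessing import StandardScaler

# Wrap the regressor so the target is standardized for fitting and
# automatically inverse-transformed at prediction time.
model = TransformedTargetRegressor(
    regressor=AdaBoostRegressor(n_estimators=50, learning_rate=1, random_state=0),
    transformer=StandardScaler())
# model.fit(x_train, y_train_unscaled)   # y_train_unscaled: target before sc_y scaling
# model.predict(x_test)                  # predictions already in original Purchase units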
4. Build AdaBoost regression models with different parameters
4.1 Model 1: build an AdaBoost regression model
4.1.1 Build the model
In [15]:# Build AdaBoost regression models with different parameters
# Model 1: build an AdaBoost regression model (base_estimator=None, n_estimators=50, learning_rate=1)
from sklearn.ensemble import AdaBoostRegressor
regressor = AdaBoostRegressor(n_estimators=50, learning_rate=1, loss='linear', random_state=0)
regressor.fit(x_train, y_train)
Out[15]:
AdaBoostRegressor(learning_rate=1, random_state=0)
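With base_estimator left at its default (None), AdaBoostRegressor uses a shallow decision tree, DecisionTreeRegressor(max_depth=3), as the weak learner. The explicit equivalent would look roughly like the sketch below (in recent scikit-learn releases the parameter has been renamed from base_estimator to estimator):
from sklearn.tree import DecisionTreeRegressor

# Explicit form of the defaults used above (sketch only)
regressor_explicit = AdaBoostRegressor(
    base_estimator=DecisionTreeRegressor(max_depth=3),
    n_estimators=50, learning_rate=1, loss='linear', random_state=0)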
4.1.2 Predict on the test set
In [16]:# Predict on the test set
y_pred = regressor.predict(x_test)
y_pred[:5]
Out[16]:
array([ 0.65004612, 0.18805047, 0.55246384, 0.55246384, -0.01299079])
In [17]:
# Transform y_pred back to the original (pre-scaling) scale
y_pred = sc_y.inverse_transform(y_pred)
y_pred[:5]
Out[17]:
array([12501.03102111, 10206.94045722, 12016.4753753 , 12016.4753753 ,
9208.64781524])
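One caveat: in recent scikit-learn versions StandardScaler.inverse_transform expects a 2-D array, so depending on the installed version the call above may need an explicit reshape, roughly:
# Version-dependent variant of the inverse transform (sketch)
# y_pred = sc_y.inverse_transform(y_pred.reshape(-1, 1)).ravel()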
4.1.3 Evaluate model performance
In [18]:# Evaluate model performance
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print("R2 Score:", r2)
R2 Score: 0.3075923325444474
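R2 alone does not say how large the errors are in purchase units. As a complement (not run in the original notebook), mean absolute error and root mean squared error could be reported as well:
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(y_test, y_pred)            # average error in Purchase units
rmse = np.sqrt(mean_squared_error(y_test, y_pred))   # penalizes large errors more heavily
print("MAE:", mae)
print("RMSE:", rmse)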
4.2 Model 2: build an AdaBoost regression model
In [29]:# Model 2: build an AdaBoost regression model (base_estimator=DecisionTreeRegressor, n_estimators=1000, learning_rate=0.2)
from sklearn.tree import DecisionTreeRegressor
regressor = AdaBoostRegressor(base_estimator = DecisionTreeRegressor(min_samples_split=100, max_depth=10, min_samples_leaf=10),
n_estimators=1000, learning_rate=0.2, loss='linear', random_state=0)
regressor.fit(x_train, y_train)
Out[29]:
AdaBoostRegressor(base_estimator=DecisionTreeRegressor(max_depth=10,
min_samples_leaf=10,
min_samples_split=100),
learning_rate=0.2, n_estimators=1000, random_state=0)
In [30]:
# Predict on the test set
y_pred = regressor.predict(x_test)
y_pred = sc_y.inverse_transform(y_pred) # Transform y_pred back to the original (pre-scaling) scale
In [31]:
# Evaluate model performance
r2 = r2_score(y_test, y_pred)
print("R2 Score:", r2)
R2 Score: 0.6070474824774648
Conclusion: as the two models above show, the choice of hyperparameters has a substantial effect on model performance; switching from the default weak learner with 50 estimators to deeper trees with 1000 estimators and learning_rate=0.2 raises the R2 score from roughly 0.31 to 0.61.
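Given that sensitivity, a natural next step would be a systematic hyperparameter search rather than comparing two hand-picked settings. A hedged sketch with GridSearchCV (the grid values are illustrative assumptions only, and the fit can be slow on 35,000 rows):
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Illustrative grid over the three hyperparameters varied above
param_grid = {
    'n_estimators': [200, 500, 1000],
    'learning_rate': [0.05, 0.1, 0.2],
    'base_estimator__max_depth': [6, 10, 14],
}
search = GridSearchCV(
    AdaBoostRegressor(
        base_estimator=DecisionTreeRegressor(min_samples_split=100, min_samples_leaf=10),
        loss='linear', random_state=0),
    param_grid, scoring='r2', cv=3, n_jobs=-1)
# search.fit(x_train, y_train)
# print(search.best_params_, search.best_score_)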