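Machine Learning: Classification 3-3 (Logistic Regression)

Predicting whether a customer will buy a new car model with logistic regression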

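Main steps:

1. Import packages
2. Load the dataset
3. Preprocess the data
- 3.1 Check for missing values
- 3.2 Build the independent and dependent variables
- 3.3 Check whether the classes are balanced
- 3.4 Split the data into training and test sets
- 3.5 Feature scaling
4. Build logistic regression models with different parameters
- 4.1 Model 1: build and train a logistic regression model
  - 4.1.1 Build and train the logistic regression model
  - 4.1.2 Predict on the test set
  - 4.1.3 Get the logistic regression coefficients and intercept
  - 4.1.4 Generate the confusion matrix
  - 4.1.5 Visualize the test set predictions
  - 4.1.6 Evaluate model performance
- 4.2 Model 2: build and train a logistic regression model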

Dataset link:

1. Import packages

In [1]:

# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

2. Load the dataset

In [2]:

# Load the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
dataset

Out[2]:

	User ID	Gender	Age	EstimatedSalary	Purchased
0	15624510	Male	19	19000	0
1	15810944	Male	35	20000	0
2	15668575	Female	26	43000	0
3	15603246	Female	27	57000	0
4	15804002	Male	19	76000	0
...	...	...	...	...	...
395	15691863	Female	46	41000	1
396	15706071	Male	51	23000	1
397	15654296	Female	50	20000	1
398	15755018	Male	36	33000	0
399	15594041	Female	49	36000	1

400 rows × 5 columns

3. Preprocess the data

3.1 Check for missing values

In [3]:

# Count missing values per column
null_df = dataset.isnull().sum()
null_df

Out[3]:

User ID            0
Gender             0
Age                0
EstimatedSalary    0
Purchased          0
dtype: int64

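3.2 Build the independent and dependent variables

To make the classification result easy to visualize, only two fields, Age and EstimatedSalary, are used as independent variables.

In [4]:

# Build the feature matrix (Age, EstimatedSalary) and the target vector (Purchased)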
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
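
Selecting columns by position works, but it silently picks the wrong fields if the column order ever changes. A minimal sketch of name-based selection, assuming the column names shown in Out[2]; the small `demo` frame is a stand-in for the CSV built from the first three rows of that output, and `X_demo` / `y_demo` are illustrative names:

```python
import pandas as pd

# Stand-in for Social_Network_Ads.csv (first three rows shown in Out[2])
demo = pd.DataFrame({
    'User ID': [15624510, 15810944, 15668575],
    'Gender': ['Male', 'Male', 'Female'],
    'Age': [19, 35, 26],
    'EstimatedSalary': [19000, 20000, 43000],
    'Purchased': [0, 0, 0],
})

# Select features and target by column name instead of position
X_demo = demo[['Age', 'EstimatedSalary']].values
y_demo = demo['Purchased'].values

# Same result as the positional selection iloc[:, [2, 3]]
assert (X_demo == demo.iloc[:, [2, 3]].values).all()
print(X_demo.shape, y_demo.shape)
```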

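3.3 Check whether the classes are balanced

In [5]:

# Check whether the classes are balanced
sample_0 = (dataset['Purchased'] == 0).sum()
sample_1 = (dataset['Purchased'] == 1).sum()
print('Share of non-buyers in the sample: %.2f' % (sample_0 / (sample_0 + sample_1)))

Share of non-buyers in the sample: 0.64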
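
The same check can be done for every class at once with `value_counts(normalize=True)`. A small sketch on made-up labels (the `purchased` series below is hypothetical toy data, not the real dataset):

```python
import pandas as pd

# Hypothetical toy labels, not the real dataset
purchased = pd.Series([0, 0, 0, 1, 0, 1, 0, 1], name='Purchased')

# Fraction of samples in each class, sorted by frequency
proportions = purchased.value_counts(normalize=True)
print(proportions)
```

On the notebook's data this would be called as `dataset['Purchased'].value_counts(normalize=True)`.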

3.4 Split the data into training and test sets

In [6]:

# Split the data into training and test sets (75% / 25%)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(300, 2)
(100, 2)
(300,)
(100,)
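
The split above is purely random, so the class ratio in the two subsets can drift from the overall ratio. An optional variation is a stratified split via the `stratify` argument of `train_test_split`. The sketch below uses hypothetical synthetic data (`X_toy`, `y_toy`) with a 64/36 class ratio; the notebook itself does not stratify, and doing so would change its results:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data with a 64/36 class ratio
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 2))
y_toy = np.array([0] * 256 + [1] * 144)

# stratify=y_toy keeps the class ratio the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy)

# Share of class 1 in each subset
print(y_tr.mean(), y_te.mean())
```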

3.5 Feature scaling

In [7]:

# Feature scaling: fit the scaler on the training set only, then apply it to the test set
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
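
To make the scaling step concrete, here is a minimal sketch showing that `StandardScaler` subtracts the training mean and divides by the training standard deviation, and that the test set reuses those training statistics. The few rows below are taken from Out[2] purely for illustration; the names `train`, `test`, `scaler` and `manual` are not part of the notebook:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A few (Age, EstimatedSalary) rows from Out[2], used only for illustration
train = np.array([[19, 19000], [35, 20000], [26, 43000], [27, 57000]], dtype=float)
test = np.array([[46, 41000]], dtype=float)

scaler = StandardScaler().fit(train)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also the default of np.std
manual = (test - train.mean(axis=0)) / train.std(axis=0)

assert np.allclose(scaler.transform(test), manual)
print(manual)
```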

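4. Build logistic regression models with different parameters

4.1 Model 1: build and train a logistic regression model

4.1.1 Build and train the logistic regression model

In [8]:

# Build logistic regression models with different parameters
# Model 1: penalty='l2', C=1, class_weight='balanced'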
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(penalty='l2', C=1, class_weight='balanced', random_state = 0)
classifier.fit(X_train, y_train)

Out[8]:

LogisticRegression(C=1, class_weight='balanced', random_state=0)
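
With `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * np.bincount(y))`, so the minority class counts for more in the loss. A short sketch of those weights, assuming a hypothetical label vector `y_toy` with a 64/36 split (the exact ratio in the notebook's training set may differ):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels with a 64/36 split
y_toy = np.array([0] * 64 + [1] * 36)

weights = compute_class_weight(
    class_weight='balanced', classes=np.array([0, 1]), y=y_toy)

# Same formula written out: n_samples / (n_classes * count_per_class)
manual = len(y_toy) / (2 * np.bincount(y_toy))

assert np.allclose(weights, manual)
print(weights)
```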

4.1.2 Predict on the test set

In [9]:

# Predict on the test set
y_pred = classifier.predict(X_test)

In [10]:

y_pred[:5]

Out[10]:

array([0, 0, 0, 0, 0], dtype=int64)
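
For a binary problem, `predict` returns class 1 exactly when the predicted probability of class 1 exceeds 0.5. A minimal sketch on hypothetical one-feature toy data (`X_toy`, `y_toy` and `clf` are illustrative names):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, classes separated around 0
X_toy = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)

proba = clf.predict_proba(X_toy)[:, 1]   # P(class 1) for each row
labels = (proba > 0.5).astype(int)       # threshold the probability at 0.5

assert np.array_equal(labels, clf.predict(X_toy))
print(proba.round(3))
```

Working with `predict_proba` directly makes it possible to pick a threshold other than 0.5 when false positives and false negatives have different costs.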

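4.1.3 Get the logistic regression coefficients and intercept

In [11]:

# Get the coefficients and intercept of the fitted logistic regression model
print('Logistic regression coefficients: ' + str(classifier.coef_))
print('Logistic regression intercept: ' + str(classifier.intercept_))

Logistic regression coefficients: [[2.22813781 1.21242255]]
Logistic regression intercept: [-0.47862396]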

In [12]:

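# The features were standardized in 3.5, so Age and EstimatedSalary below are the scaled values
print('Decision boundary of the logistic regression model:')
print('Age * %.2f + EstimatedSalary * %.2f + (%.2f) = 0' % (classifier.coef_[0][0], classifier.coef_[0][1], classifier.intercept_[0]))

Decision boundary of the logistic regression model:
Age * 2.23 + EstimatedSalary * 1.21 + (-0.48) = 0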
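
Because the model was trained on standardized features, the boundary printed above is expressed in units of standard deviations, not years or currency. A sketch of converting such a boundary back to raw units using the scaler's `mean_` and `scale_` attributes; the data here is hypothetical synthetic data, and `w_raw` / `b_raw` are illustrative names:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data in (Age, EstimatedSalary) form
rng = np.random.default_rng(0)
X_raw = np.column_stack([rng.uniform(18, 60, 200),
                         rng.uniform(15000, 150000, 200)])
y_toy = (X_raw[:, 0] / 60 + X_raw[:, 1] / 150000 > 1.0).astype(int)

scaler = StandardScaler().fit(X_raw)
clf = LogisticRegression().fit(scaler.transform(X_raw), y_toy)

# w . (x - mean) / scale + b  ==  (w / scale) . x + (b - sum(w * mean / scale))
w_raw = clf.coef_[0] / scaler.scale_
b_raw = clf.intercept_[0] - np.sum(clf.coef_[0] * scaler.mean_ / scaler.scale_)

# The raw-unit score matches the model's decision function on scaled inputs
scores = X_raw @ w_raw + b_raw
assert np.allclose(scores, clf.decision_function(scaler.transform(X_raw)))
print('Age * %.4f + EstimatedSalary * %.6f + (%.2f) = 0' % (w_raw[0], w_raw[1], b_raw))
```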

4.1.4 Generate the confusion matrix

In [13]:

# Generate the confusion matrix (rows: true labels, columns: predicted labels)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[61  7]
 [ 4 28]]
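
In scikit-learn's `confusion_matrix`, rows are the true labels and columns are the predicted labels. Wrapping the matrix in a labeled DataFrame makes that layout explicit; the row and column labels below are illustrative, and the numbers are the ones printed above:

```python
import pandas as pd

# Confusion matrix printed above
cm_values = [[61, 7],
             [4, 28]]

cm_df = pd.DataFrame(cm_values,
                     index=['actual 0', 'actual 1'],
                     columns=['predicted 0', 'predicted 1'])
print(cm_df)

# 7 non-buyers were predicted as buyers; 4 buyers were missed
false_positives = cm_df.loc['actual 0', 'predicted 1']
false_negatives = cm_df.loc['actual 1', 'predicted 0']
```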

4.1.5 Visualize the test set predictions

In [14]:

# Visualize the test set predictions
from matplotlib.colors import ListedColormap
plt.figure()
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('pink', 'limegreen')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('Age (standardized)')
plt.ylabel('Estimated Salary (standardized)')
plt.legend()
plt.show()
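
The colored background in this plot comes from predicting a class for every point of a dense grid. A small sketch of how `np.meshgrid` plus `ravel` turns two axes into a list of grid points, using a coarse step for readability (the notebook uses `step = 0.01`; `g1`, `g2` and `grid_points` are illustrative names):

```python
import numpy as np

# Coarse grid for illustration: 3 values on the first axis, 2 on the second
g1, g2 = np.meshgrid(np.arange(0, 3, 1.0), np.arange(0, 2, 1.0))

# One (x1, x2) row per grid cell, the shape expected by classifier.predict
grid_points = np.array([g1.ravel(), g2.ravel()]).T

print(g1.shape, grid_points.shape)  # (2, 3) (6, 2)
```

The predictions for `grid_points` are then reshaped back to `g1.shape` so that `plt.contourf` can color each cell.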

4.1.6 Evaluate model performance

In [15]:

# Evaluate model performance
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.89

In [16]:

# Same accuracy computed by hand: correct predictions (diagonal) / all predictions
(cm[0][0] + cm[1][1])/(cm[0][0] + cm[0][1] + cm[1][0] + cm[1][1])

Out[16]:

0.89

4.2 Model 2: build and train a logistic regression model

In [31]:

# Model 2: solver='liblinear', penalty='l1', C=0.25, class_weight=None
classifier = LogisticRegression(solver='liblinear', penalty='l1', C=0.25, class_weight=None, random_state = 0)
classifier.fit(X_train, y_train)

Out[31]:

LogisticRegression(C=0.25, penalty='l1', random_state=0, solver='liblinear')
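
In scikit-learn, `C` is the inverse of the regularization strength: a smaller `C` means a stronger penalty and therefore smaller coefficients, and an L1 penalty can push some of them to exactly zero. A rough sketch of that effect on hypothetical synthetic data (`X_toy`, `y_toy` and the list of `C` values are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: two features, the first one more informative
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 2))
y_toy = (2.0 * X_toy[:, 0] + 1.0 * X_toy[:, 1]
         + rng.normal(size=300) > 0).astype(int)

norms = []
for C in (0.01, 0.25, 2.0):
    clf = LogisticRegression(solver='liblinear', penalty='l1', C=C, random_state=0)
    clf.fit(X_toy, y_toy)
    norms.append(np.abs(clf.coef_).sum())  # L1 norm of the coefficients
    print(C, clf.coef_[0])
```

The coefficient magnitudes should grow as `C` increases, which is one reason the two models in this notebook end up with different decision boundaries.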

In [32]:

# Predict on the test set
y_pred = classifier.predict(X_test)

In [33]:

y_pred[:5]

Out[33]:

array([0, 0, 0, 0, 0], dtype=int64)

In [34]:

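# The features were standardized in 3.5, so Age and EstimatedSalary below are the scaled values
print('Decision boundary of the logistic regression model:')
print('Age * %.2f + EstimatedSalary * %.2f + (%.2f) = 0' % (classifier.coef_[0][0], classifier.coef_[0][1], classifier.intercept_[0]))

Decision boundary of the logistic regression model:
Age * 1.84 + EstimatedSalary * 0.94 + (-0.79) = 0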

In [35]:

# Generate the confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[65  3]
 [ 6 26]]

In [36]:

# Visualize the test set predictions
plt.figure()
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('pink', 'limegreen')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('Age (standardized)')
plt.ylabel('Estimated Salary (standardized)')
plt.legend()
plt.show()

In [37]:

# Evaluate model performance
print(accuracy_score(y_test, y_pred))

0.91

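Conclusion: the two models show that hyperparameter choices change model performance. On the same test set, model 1 reaches an accuracy of 0.89 and model 2 reaches 0.91, and their confusion matrices differ.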
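
Accuracy alone hides part of the difference between the two models. The sketch below derives precision and recall for the buyer class from the two confusion matrices printed in 4.1.4 and 4.2; the `matrices` and `summary` dictionaries are illustrative names, and the numbers are the ones from the notebook output:

```python
import numpy as np

# Confusion matrices printed in sections 4.1.4 and 4.2
matrices = {
    'model 1': np.array([[61, 7], [4, 28]]),
    'model 2': np.array([[65, 3], [6, 26]]),
}

summary = {}
for name, cm in matrices.items():
    tn, fp, fn, tp = cm.ravel()
    summary[name] = {
        'accuracy': (tp + tn) / cm.sum(),
        'precision': tp / (tp + fp),   # of predicted buyers, how many bought
        'recall': tp / (tp + fn),      # of actual buyers, how many were found
    }
    print(name, {k: round(float(v), 3) for k, v in summary[name].items()})
```

Model 1 finds more of the actual buyers (28 of 32 versus 26 of 32) but raises more false alarms (7 versus 3). Since the two models differ in several hyperparameters at once, this comparison does not isolate the effect of any single one.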
