Machine Learning: Oversampling (The Most Detailed Guide on the Web)


Contents

Background
Oversampling in Practice
Summary

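Oversampling in Practice

Here we use a credit card loan problem to demonstrate oversampling in practice, with code for oversampling, model building, the confusion matrix, data standardization, and cross-validation.

1. Import the required packages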

import matplotlib.pylab as plt
import matplotlib as mpl  # used later to set rcParams
import numpy as np
import pandas as pd


def cm_plot(y, yp):
    """Plot the confusion matrix of true labels y against predictions yp."""
    from sklearn.metrics import confusion_matrix
    import matplotlib.pyplot as plt

    cm = confusion_matrix(y, yp)
    plt.matshow(cm, cmap=plt.cm.Blues)
    plt.colorbar()
    # Write each count into its cell; rows are true labels, columns are predictions.
    for i in range(len(cm)):
        for j in range(len(cm)):
            plt.annotate(str(cm[i, j]), xy=(j, i),
                         horizontalalignment='center',
                         verticalalignment='center')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    return plt

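Here we import the required packages and define cm_plot, a helper that draws a confusion matrix. The packages do the following:

pandas: data processing and analysis.
numpy: high-performance multidimensional arrays and the operations on them.
matplotlib.pylab: plotting; here we use plots to show relationships in the data.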

2. Data preprocessing

data = pd.read_csv(r"./creditcard.csv")
data.head()  # first 5 rows by default (only displayed in a notebook or REPL)

# Standardization: z-score
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
a = data[['Amount']]  # double brackets return a DataFrame rather than a Series
data['Amount'] = scaler.fit_transform(data[['Amount']])
data = data.drop(['Time'], axis=1)  # drop the unused column

# Split off the test set; the test set keeps the original (non-oversampled) data
from sklearn.model_selection import train_test_split

X_whole = data.drop('Class', axis=1)
y_whole = data.Class
x_train_w, x_test_w, y_train_w, y_test_w = \
    train_test_split(X_whole, y_whole, test_size=0.3, random_state=0)

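Reading the data: load the dataset with pandas' read_csv.
Standardization: using sklearn, we apply z-score standardization to the Amount column, rescaling it to zero mean and unit variance so that the raw scale of Amount does not dominate the other features. Note that this does not confine the values to a fixed range such as (-1, 1); outliers can still map to large values. We also drop the unused Time column and assign the result back to data.
Splitting the dataset: using sklearn, all columns of data except Class go into X_whole and the Class column goes into y_whole. train_test_split then splits X_whole and y_whole into training and test sets, with random seed 0 and 30% of the data held out for testing.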
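To make the standardization step concrete, here is a minimal standard-library sketch of the z-score formula. The helper name z_score and the sample amounts are made up for illustration; StandardScaler is the tool the article actually uses, and this sketch only mirrors its formula (subtract the mean, divide by the population standard deviation).

```python
import statistics


def z_score(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std (ddof=0)
    return [(v - mean) / std for v in values]


# Hypothetical transaction amounts with one large outlier
amounts = [1.0, 2.0, 3.0, 4.0, 90.0]
scaled = z_score(amounts)

print([round(s, 3) for s in scaled])
print(max(scaled) > 1)  # the outlier lands outside (-1, 1)
```

The result has mean 0 and standard deviation 1, but individual values are not bounded.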

3. Oversampling

from imblearn.over_sampling import SMOTE

oversampler = SMOTE(random_state=0)
os_x_train, os_y_train = oversampler.fit_resample(x_train_w, y_train_w)

# Font settings from the original post; only needed when plot labels contain Chinese text
mpl.rcParams['font.sans-serif'] = ['Microsoft YaHei']
mpl.rcParams['axes.unicode_minus'] = False

labels_count = os_y_train.value_counts()  # number of samples in class 0 and in class 1
plt.title("Positive and negative sample counts")
plt.xlabel("Class")
plt.ylabel("Count")
labels_count.plot(kind='bar')
plt.show()

# Split the oversampled data again for an internal check
os_x_train_w, os_x_test_w, os_y_train_w, os_y_test_w = \
    train_test_split(os_x_train, os_y_train, test_size=0.3, random_state=0)

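Oversampling: import the SMOTE class and create an instance. fit_resample fits SMOTE on the training features x_train_w and labels y_train_w and returns the oversampled features os_x_train and labels os_y_train.
Plotting: draw a bar chart of the oversampled labels to check how many samples classes 0 and 1 each have. With SMOTE's default settings, the two classes end up with the same count.
Splitting again: split the oversampled data once more, with 30% held out, so the model can first be checked internally on oversampled data. Because the synthetic samples are interpolated from training samples, scores on this internal split tend to be optimistic; the original test set x_test_w remains the real benchmark.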
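The core idea behind SMOTE can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbors, and create a new point somewhere on the segment between them. The sketch below is a simplified standard-library illustration, not imblearn's implementation; the function smote_like, its parameters, and the sample points are all hypothetical.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbors of base (excluding base itself)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        lam = rng.random()  # position on the segment from base to neighbor
        synthetic.append(tuple(b + lam * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical minority-class points; the majority class is assumed to have 10 samples
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority_count = 10
new_points = smote_like(minority, n_new=majority_count - len(minority))

print(len(minority) + len(new_points))  # minority class now matches the majority count
```

Every synthetic point lies between two real minority samples, which is why SMOTE fills in the minority region rather than simply duplicating rows.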

4. Cross-validation

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score  # cross-validation helper

# Use cross-validation to pick a good penalty factor C
scores = []
c_param_range = [0.01, 0.1, 1, 10, 100]  # candidate values
for i in c_param_range:  # first iteration: C=0.01
    lr = LogisticRegression(C=i, penalty='l2', solver='lbfgs', max_iter=1000)
    score = cross_val_score(lr, os_x_train_w, os_y_train_w, cv=8, scoring='recall')
    score_mean = sum(score) / len(score)  # mean recall over the 8 folds
    scores.append(score_mean)  # mean recall for every candidate C
    print(score_mean)

best_c = c_param_range[np.argmax(scores)]  # C with the highest mean recall
print("Best C: {}".format(best_c))

# Fit the final model with the selected C
lr = LogisticRegression(C=best_c, penalty='l2', solver='lbfgs', max_iter=1000)
lr.fit(os_x_train_w, os_y_train_w)

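Setting parameters: C is the inverse of the regularization coefficient λ; it must be a positive float and defaults to 1.0. As in SVMs, smaller values mean stronger regularization. penalty selects the regularization type; l1 and l2 are the common choices, and we use l2 here. solver selects the optimization algorithm; the default was liblinear in older scikit-learn releases and has been lbfgs since version 0.22, and we set lbfgs explicitly. max_iter is the maximum number of iterations, set to 1000 here.
Cross-validation: K-fold cross-validation is used to choose the penalty factor and guard against overfitting, with K set to 8. For each candidate C, the mean recall over the 8 folds is appended to scores.
Finding the best regularization strength: np.argmax returns the position of the highest mean recall, which gives the corresponding C.
Building the model: take the best C and refit the model with it on the oversampled training split.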
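The two mechanical pieces of this step, cutting the data into K folds and picking the candidate with the highest mean score, can be sketched without any libraries. The helpers k_fold_indices and pick_best are hypothetical names, and the mean recalls are made-up numbers rather than results from this dataset. This is plain, unshuffled K-fold; for classifiers, scikit-learn's cross_val_score with an integer cv uses stratified folds instead.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # The first n_samples % k folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def pick_best(candidates, mean_scores):
    """Return the candidate whose mean score is highest (first one on ties)."""
    best_pos = max(range(len(mean_scores)), key=mean_scores.__getitem__)
    return candidates[best_pos]


folds = list(k_fold_indices(n_samples=20, k=8))
print([len(val) for _, val in folds])  # validation fold sizes

# Hypothetical mean recalls, one per candidate C (not results from the article)
c_param_range = [0.01, 0.1, 1, 10, 100]
mean_recalls = [0.90, 0.93, 0.95, 0.95, 0.94]
print(pick_best(c_param_range, mean_recalls))
```

Each sample is used for validation exactly once, and on ties the first maximum wins, which matches how np.argmax behaves.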

5. Plotting the confusion matrix

from sklearn import metrics

# Oversampled training split
os_train_predicted = lr.predict(os_x_train_w)
print(metrics.classification_report(os_y_train_w, os_train_predicted))
cm_plot(os_y_train_w, os_train_predicted).show()

# Oversampled test split (small internal test)
os_test_predicted = lr.predict(os_x_test_w)
print(metrics.classification_report(os_y_test_w, os_test_predicted))
cm_plot(os_y_test_w, os_test_predicted).show()

# Original training set
train_predicted = lr.predict(x_train_w)
print(metrics.classification_report(y_train_w, train_predicted))
cm_plot(y_train_w, train_predicted).show()

# Original test set
test_predicted = lr.predict(x_test_w)
print(metrics.classification_report(y_test_w, test_predicted))
cm_plot(y_test_w, test_predicted).show()

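Confusion matrices: print a classification report and plot the confusion matrix for each of the four sets (oversampled training split, oversampled test split, original training set, original test set) so the results can be compared.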
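To read these plots, it helps to see how the four cells relate to recall and precision. The sketch below counts the cells by hand on made-up labels; confusion_counts is a hypothetical helper, and the matrix is laid out the same way as in cm_plot (rows are true labels, columns are predicted labels).

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels 0/1."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1 and p == 0:
            fn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Made-up labels: six negatives followed by four positives
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
matrix = [[tn, fp], [fn, tp]]  # rows: true 0/1, columns: predicted 0/1
recall = tp / (tp + fn)        # share of true positives that were found
precision = tp / (tp + fp)     # share of predicted positives that are correct

print(matrix)     # [[4, 2], [1, 3]]
print(recall)     # 0.75
print(precision)  # 0.6
```

In an imbalanced problem like this one, recall on class 1 is usually the number to watch, which is why the article scores cross-validation by recall.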

6. Model evaluation and testing

from sklearn import metrics

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
recalls = []
# Predicted probability of class 1 for each test sample (the same for every threshold)
proba_1 = lr.predict_proba(x_test_w)[:, 1]
for i in thresholds:
    # Label 1 when the predicted probability exceeds i, otherwise label 0
    y_pred = (proba_1 > i).astype(int)
    recall = metrics.recall_score(y_test_w, y_pred)
    recalls.append(recall)
    print(metrics.classification_report(y_test_w, y_pred))
    print("{} Recall metric in the testing dataset: {:.3f}".format(i, recall))

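Model prediction: vary the decision threshold from 0.1 to 0.9, compute the recall on the original test set at each threshold, and print the classification report and the recall.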
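The effect of the threshold is easy to see on a toy example. In the sketch below, recall_at is a hypothetical helper and the labels and probabilities are made up; it applies the same rule as the article's loop (label 1 when the probability is greater than the threshold).

```python
def recall_at(y_true, proba, threshold):
    """Recall of class 1 when predicting 1 for probabilities above threshold."""
    y_pred = [1 if p > threshold else 0 for p in proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


# Made-up true labels and predicted probabilities of class 1
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
proba = [0.05, 0.2, 0.35, 0.4, 0.55, 0.65, 0.8, 0.95]

thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
recalls = [recall_at(y_true, proba, t) for t in thresholds]
for t, r in zip(thresholds, recalls):
    print("{} Recall: {:.2f}".format(t, r))
```

Raising the threshold can only keep recall the same or lower it, since fewer samples are labeled 1; the cost of a low threshold is more false positives, which shows up in the precision column of the classification report.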

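Summary

Oversampling is an effective way to handle imbalanced datasets when training a logistic regression model. By increasing the number of minority-class samples, it balances the dataset and improves the model's ability to recognize the minority class. When choosing an oversampling method, however, weigh its potential drawbacks, such as overfitting to duplicated or synthetic minority samples, and pick the method that best fits the problem at hand.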

Source: https://haidsoft.com/118670.html
