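Solutions for Imbalanced Data

Overview

Using fundraising data (positive:negative ratio below 1:20) as an example, the author ran a series of experiments comparing several ways of handling imbalanced data.

With no special handling, a random forest reaches 97% accuracy, but it still produces many false positives and false negatives: its balanced accuracy is only about 77%.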

    confusionMatrix(
      # the original model predicted for leadership levels, too, which I don't care in terms of accuracy
      fct_collapse(predictedoutcomes$rf, donor = c('donor', 'leadership')),
      fct_collapse(predictedoutcomes$actual, donor = c('donor', 'leadership'))
    )
    ## Confusion Matrix and Statistics
    ## not the exact real-life numbers
    ##
    ##           Reference
    ## Prediction donor no gift
    ##    donor     300     250
    ##    no gift   250   19500
    ##
    ##                Accuracy : 0.9744
    ##                  95% CI : (0.9721, 0.9765)
    ##     No Information Rate : 0.9691
    ##     P-Value [Acc > NIR] : 4.31e-06
    ##
    ##                   Kappa : 0.5615
    ##  Mcnemar's Test P-Value : 0.1732
    ##
    ##             Sensitivity : 0.56000
    ##             Specificity : 0.98761
    ##          Pos Pred Value : 0.59022
    ##          Neg Pred Value : 0.98600
    ##              Prevalence : 0.03089
    ##          Detection Rate : 0.01730
    ##    Detection Prevalence : 0.02931
    ##       Balanced Accuracy : 0.77380
    ##
    ##        'Positive' Class : donor
    ##
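The gap between the 97% accuracy and the 77% balanced accuracy follows from how balanced accuracy is defined: it is the mean of sensitivity and specificity, so the weak recall on donors pulls it down. Plugging in the figures reported in the output above:

```latex
\[
\text{Balanced Accuracy}
  = \frac{\text{Sensitivity} + \text{Specificity}}{2}
  = \frac{0.56000 + 0.98761}{2}
  \approx 0.7738
\]
```

For comparison, the No Information Rate of 0.9691 in the same output means that predicting "no gift" for every record would already score about 97% accuracy.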

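Two approaches:

1. Weighting. The article mainly penalizes the majority class; one could instead up-weight the minority class.

2. Sampling. The article only downsamples the majority class; the counterpart is to upsample the minority class.

Downsampling

The author ran 18 sampling experiments, where $n_x$ and $n_y$ are the number of positive and negative examples sampled, respectively.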

    possiblesizes
    ## # A tibble: 18 × 2
    ##      n_x   n_y
    ## 1     50    50
    ## 2     50   500
    ## 3     50  1000
    ## 4     50  5000
    ## 5     50 20000
    ## 6     50 60000
    ## 7    500    50
    ## 8    500   500
    ## 9    500  1000
    ## 10   500  5000
    ## 11   500 20000
    ## 12   500 60000
    ## 13  1000    50
    ## 14  1000   500
    ## 15  1000  1000
    ## 16  1000  5000
    ## 17  1000 20000
    ## 18  1000 60000

    # plot the possible sizes for clarity
    possiblesizes %>%
      ggplot(aes(x = n_x, y = n_y)) +
      geom_jitter(size = 3, width = 50) +
      ggtitle("Possible Sample Sizes")
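The summary does not show how each sample was drawn, so the following is only a minimal base-R sketch of one way to do it. The helper `downsample()`, its arguments, and the toy data frame are hypothetical names introduced for illustration; only the 18 size combinations come from the table above.

```r
# Hypothetical helper (not from the original article): draw n_x rows of the
# positive class and n_y rows of the negative class, without replacement.
downsample <- function(df, outcome, positive, n_x, n_y) {
  is_pos  <- df[[outcome]] == positive
  pos_idx <- which(is_pos)
  neg_idx <- which(!is_pos)
  # Cap each request at the number of rows actually available in that class.
  take_pos <- pos_idx[sample.int(length(pos_idx), min(n_x, length(pos_idx)))]
  take_neg <- neg_idx[sample.int(length(neg_idx), min(n_y, length(neg_idx)))]
  df[c(take_pos, take_neg), , drop = FALSE]
}

# Toy data standing in for the donor data (3% positives); values are made up.
set.seed(42)
toy <- data.frame(
  outcome = factor(c(rep("donor", 300), rep("no gift", 9700))),
  score   = rnorm(10000)
)

# The 18 size combinations from the table above: 3 values of n_x times 6 of n_y.
# n_y is listed first so that it varies fastest, matching the table's row order.
possiblesizes <- expand.grid(
  n_y = c(50, 500, 1000, 5000, 20000, 60000),
  n_x = c(50, 500, 1000)
)[, c("n_x", "n_y")]

# One draw, using the second row of the grid (n_x = 50, n_y = 500).
small <- downsample(toy, "outcome", "donor", n_x = 50, n_y = 500)
table(small$outcome)
```

Capping at the available class size is a design choice of this sketch: with real data, a request such as `n_y = 60000` only makes sense if the majority class has that many rows.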

Weighted Models

Similarly, the author ran 25 experiments with different class weights.

    possibleweights
    ## # A tibble: 25 × 2
    ##      p_x   p_y
    ## 1    0.1   0.1
    ## 2    0.1   0.3
    ## 3    0.1   0.5
    ## 4    0.1   0.7
    ## 5    0.1   0.9
    ## 6    0.3   0.1
    ## 7    0.3   0.3
    ## 8    0.3   0.5
    ## 9    0.3   0.7
    ## 10   0.3   0.9
    ## # ... with 15 more rows

    # plot the possible weights for clarity
    possibleweights %>%
      ggplot(aes(x = p_x, y = p_y)) +
      geom_point(size = 3) +
      ggtitle("Possible Class Weights")
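The summary does not say how the weights were passed to the model. One plausible route in R is the `classwt` argument of `randomForest::randomForest()`, which takes one weight per class. The sketch below assumes that route; the wrapper `fit_weighted_rf()` and the toy data are hypothetical and are not the author's code.

```r
# Hypothetical wrapper (an assumption, not the author's code): fit a random
# forest with class weights p_x (positive class) and p_y (negative class).
fit_weighted_rf <- function(df, p_x, p_y, ntree = 100) {
  # classwt holds one weight per class, given here in the order of
  # levels(df$outcome): positive class first, negative class second.
  randomForest::randomForest(
    outcome ~ ., data = df,
    ntree = ntree,
    classwt = c(p_x, p_y)
  )
}

# Toy data (made up): "donor" is the first factor level, "nogift" the second.
set.seed(1)
n <- 2000
toy <- data.frame(
  outcome = factor(ifelse(runif(n) < 0.05, "donor", "nogift"),
                   levels = c("donor", "nogift")),
  x1 = rnorm(n),
  x2 = rnorm(n)
)

# Only fit when the randomForest package is installed.
if (requireNamespace("randomForest", quietly = TRUE)) {
  fit <- fit_weighted_rf(toy, p_x = 0.9, p_y = 0.1)
  print(fit$confusion)
}
```

Looping this wrapper over the rows of `possibleweights` would give one model per weight pair, which is roughly the shape of the 25-experiment grid described above.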

Comparing Models

    # plot all the ROCs
    plot(FY16sampledrocs[[1]], main = "ROC")
    foo

The figure below compares the AUC across all of the experiments above:

AUC Dot Plot

The AUCs are then drawn as a dot plot with the following code:

    FY16allaucs %>%
      ggplot(aes(x = rownum, y = auc, color = modeltype)) +
      geom_point() +
      ylim(.75, 1) +
      # 0.8507755 for bog standard RF model
      geom_hline(aes(yintercept = rfreferenceauc), color = 'gray') +
      # 0.8410452 for caret model's AUC which is what we actually used
      geom_hline(aes(yintercept = caretreference), color = 'orange')

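The result is shown in the figure below. The gray line is the bog-standard random forest model (AUC 0.85) and the orange line is the caret model (AUC 0.84).

The figure shows that the sampled models perform much better than the weighted models.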
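Since every comparison in this section rests on AUC, it may help to see what the number measures. The base-R sketch below computes AUC with the rank-sum (Mann-Whitney) identity rather than with whatever ROC package the author used. The function name `auc_rank()` and the five scores are made up for illustration.

```r
# AUC via the rank-sum (Mann-Whitney) identity: the probability that a randomly
# chosen positive case gets a higher score than a randomly chosen negative one.
auc_rank <- function(scores, is_positive) {
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  r <- rank(scores)  # average ranks, so a tied pair counts as half
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Tiny hand-checkable example (made-up scores, not the article's data).
scores      <- c(0.9, 0.8, 0.4, 0.35, 0.1)
is_positive <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
auc_rank(scores, is_positive)
```

In the toy example, 5 of the 6 positive-negative pairs are ordered correctly, so the result is 5/6. Because AUC depends only on the ranking of scores, it is not inflated by a large majority class the way raw accuracy is.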

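The Best Models

Interestingly, the best models all used 50 minority-class samples; the top three used 500, 50, and 1000 majority-class samples, respectively.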

    roundedauc  sampleratio  n_x  n_y
    ----------  -----------  ---  -----
    0.910       10           50   500
    0.907       1            50   50
    0.900       20           50   1000

Oddly, two of the three worst sampled models also used 50 minority-class samples, but with far more majority-class samples:

    roundedauc  sampleratio  n_x  n_y
    ----------  -----------  ---  ------
    0.836       120          500  60000
    0.831       400          50   20000
    0.795       1200         50   60000
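The `sampleratio` column is not defined in this summary, but dividing `n_y` by `n_x` reproduces all six values in the two tables above. A short R check, with the rows copied from those tables and the data frame name `results` chosen only for illustration:

```r
# Best and worst three sampled models, copied from the two tables above.
results <- data.frame(
  roundedauc = c(0.910, 0.907, 0.900, 0.836, 0.831, 0.795),
  n_x        = c(50, 50, 50, 500, 50, 50),
  n_y        = c(500, 50, 1000, 60000, 20000, 60000)
)

# Assumed definition: negatives sampled per positive sampled.
results$sampleratio <- results$n_y / results$n_x

# Sorted by ratio, the best models sit at low ratios and the worst at high ones.
results[order(results$sampleratio), ]
```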

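The sample ratio vs. AUC plot shows that AUC falls as the sample ratio rises; for ratios above 25, AUC declines roughly logarithmically:

For ratios below 25 the results are much noisier, as shown below:

Summary and Outlook

In short, sampling looks like the winner over weighting in this article, but the author did not try combining the two, for example using downsampling to find a good sample ratio and then applying weights to penalize the majority class.

Some useful links: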

Great paper on strategies for imbalanced data

Super detailed answer on how to model with downsampling

More info from Stack Exchange about weighted random forests

Editor's Note from 我爱机器学习 (52ml.net)

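This article is a simple experiment on imbalanced data and is not exhaustive; for example, it leaves out the combined approach the author mentions. The data is also a single specific case, so treat the conclusions with caution.

Three further related articles are recommended:

不均衡数据问题 (The Imbalanced Data Problem)

解决真实世界问题：如何在不平衡类上使用机器学习？ (Solving Real-World Problems: How to Use Machine Learning on Imbalanced Classes?)

[导读]Learning from Imbalanced Classes

Author: 我爱机器学习 (52ml.net)

Original author: jaket

Original article: Solutions for Modeling Imbalanced Data

Sections of the original article:

What to do when modeling really imbalanced data?

AF16 Model

Dealing with Rare Cases

Comparing Models

The Best Models

Summary and General Ending Thoughts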

