NYC buses: C5.0 classification with R; more than 20 minute delay?

(This article was first published on R Programming DataScience+ , and kindly contributed toR-bloggers)

Tags

Data Management Data Visualisation R Programming

We are continuing on with our NYC bus breakdown problem . When we left off, we had constructed a rule-based Cubist regression model with our expanded pool of predictors ; but we were still only managing to explain 37% of the data's variance with our model. Given how 'dirty' the target variable 'time_delayed' is (because it is human reported and of dubious precision), we decided that perhaps we should rephrase the question in order to get a more sensible answer. When a bus breakdown is called into operations, perhaps the question to ask is: “Will this delay exceed twenty minutes?”

Classification more than 20 minutes?

We could choose some other time, but for simplicity, twenty minutes is probably the breaking point of human patience be they, passengers or providers. This division also breaks the dataset approximately in half, so we don't have to deal with imbalance. As a side note, I have also run the problem for a thirty minute or more delay, breaking the data into and ; and the solution only improves.

We begin by setting up the data that we need, and the caret control objects:

in_csv <- "../output/intermediate/ii_times.csv" ii_times <- read_csv(in_csv) %>% filter(reported_before_resolved == 1) %>% select(-reported_before_resolved) %>% mutate( time_delayed = cut(time_delayed, breaks = c(0, 22, 10000), labels = FALSE) ) %>% mutate(time_delayed = factor(time_delayed, labels = c("t0.22","t22."))) halfFold <- createFolds(ii_times$Busbreakdown_ID, k = 5, list=TRUE) ii_small <- ii_times[halfFold[[2]],] ii_test <-ii_times[halfFold[[3]],] ii_train <- ii_times %>% anti_join(ii_test, by = "Busbreakdown_ID") rctrl_manual <- trainControl(method = "none", returnResamp = "all", classProbs = TRUE, summaryFunction = twoClassSummary) rctrl_repcv <- trainControl(method = "repeatedcv", number = 2, repeats = 5, returnResamp = "all", classProbs = TRUE, summaryFunction = twoClassSummary)

Then we can use cross-fold validation on a reduced data set to quickly get a sense of which parameters we should use. Note that we deliberately choose 'rules', not 'tree' here, because we want a human-readable set of if/then conditions. Unfortunately, we will end up using 15 trials, which means our rule set is large. I have commented its output here, but it can be read this way, or with 'summary(model)'.

<code>#---cubist c50_grd<-expand.grid( .winnow = FALSE, .trials=c(15,30,50), .model="rules" ) set.seed(849) c50_cv <- train(time_delayed ~ . -Busbreakdown_ID, data = ii_small, method = "C5.0", metric="ROC", na.action = na.pass, trControl = rctrl_repcv, tuneGrid = c50_grd ) c50_cv</code> ## C5.0 ## ## 40665 samples ## 46 predictor ## 2 classes: 't0.22', 't22.' ## ## No pre-processing ## Resampling: Cross-Validated (2 fold, repeated 5 times) ## Summary of sample sizes: 20332, 20333, 20332, 20333, 20332, 20333, ... ## Resampling results across tuning parameters: ## ## trials ROC Sens Spec ## 15 0.8261069 0.8028816 0.6830990 ## 30 0.8258429 0.8030490 0.6855583 ## 50 0.8258429 0.8030490 0.6855583 ## ## Tuning parameter 'model' was held constant at a value of rules ## ## Tuning parameter 'winnow' was held constant at a value of FALSE ## ROC was used to select the optimal model using the largest value. ## The final values used for the model were trials = 15, model = rules ## and winnow = FALSE. #writeLines(c50_cv$finalModel$rules) #We can output the rule set as a block of text like this, but we have suppressed this output because with 50 trials, it is a very, very long list.

The system is relatively insensitive to the number of trials, so we can use 15. We run on our training and test sets to derive our true test ROC, and have a look at how our predicted probabilities correlate with the predictors, compared to the actual values.

<code>c50_grd<-expand.grid( .winnow = FALSE, .trials=15, .model="rules" ) c50_man <- train(time_delayed ~ . -Busbreakdown_ID, data = ii_train, method = "C5.0", metric="ROC", na.action = na.pass, trControl = rctrl_manual, tuneGrid = c50_grd ) predictions <- predict(c50_man, newdata = ii_test, type = "prob") ii_withpred <- ii_test %>% cbind(predictions %>% tbl_df()) colAUC(ii_withpred[["t0.22"]], (ii_withpred[["time_delayed"]] =="t0.22") * 1L, plotROC=TRUE) </code> ## [,1] ## 0 vs. 1 0.8486122 
NYC buses: C5.0 classification with R; more than 20 minute delay?

NYC buses: C5.0 classification with R; more than 20 minute delay?

<code>#cubist: 0.847 corrr_analysis <- ii_withpred %>% select(-Busbreakdown_ID, -t0.22, -t22.) %>% mutate( time_delayed = (time_delayed == "t22.")*1L ) %>% correlate() %>% focus(time_delayed) %>% rename(feature = rowname) %>% arrange(desc(abs(time_delayed))) %>% mutate(feature = as_factor(feature)) corrr_analysis %>% print(n=61)</code> ## # A tibble: 45 x 2 ## feature time_delayed ## <fct> <dbl> ## 1 vehicle_total_with_attendants 0.176 ## 2 Has_Contractor_Notified_Parents 0.171 ## 3 Boro_Bronx -0.165 ## 4 drivers_total_attendant 0.163 ## 5 vehicle_total_max_riders -0.162 ## 6 vehicle_total_reg_seats -0.160 ## 7 Boro_Manhattan 0.151 ## 8 service_type_d2d 0.149 ## 9 Number_Of_Students_On_The_Bus -0.149 ## 10 Boro_StatenIsland -0.140 ## 11 vehicle_total_ambulatory_seats -0.0972 ## 12 rush_min_from_peak 0.0954 ## 13 drivers_staff_servSchool 0.0792 ## 14 Reason_MechanicalProblem 0.0729 ## 15 Reason_HeavyTraffic -0.0698 ## 16 Boro_Queens 0.0667 ## 17 vehicle_total_with_lifts 0.0654 ## 18 rush_within -0.0612 ## 19 time_am -0.0545 ## 20 rush_between 0.0538 ## 21 Reason_DelayedbySchool -0.0471 ## 22 Reason_FlatTire 0.0461 ## 23 Have_You_Alerted_OPT -0.0457 ## 24 Reason_Accident 0.0427 ## 25 Boro_Brooklyn 0.0421 ## 26 School_Age 0.0416 ## 27 drivers_numServ_prek -0.0394 ## 28 Boro_NassauCounty 0.0389 ## 29 drivers_numServ_school 0.0374 ## 30 Reason_WontStart 0.0371 ## 31 Has_Contractor_Notified_Schools 0.0346 ## 32 drivers_num_servPreK -0.0315 ## 33 drivers_staff_servPreK -0.0305 ## 34 Reason_We

NYC buses: C5.0 classification with R; more than 20 minute delay?

Trending Articles

《沈冰自述——我和周永康的故事》全本

Moog - Subsequent 25

出售: 林憶蓮•回來愛的身邊 (東芝1A1頭版)

筆記 - 使用 PowerShell 清除停用 AD 帳號與 OU

df-dferh-01 中国区 Android 安装 Google Play Store 后报错的解决办法

「一棒接一棒、棒棒強棒」108學年度家長會長交接典禮

吸烟与MBTI类型判断捷径 (豆瓣 INFJ的奇幻之旅小组)

acermark龍璿國際展出多款包裝設備

枋寮北勢寮隆山宮睽違12年再辦迎王祭典

日本女优有村千佳COS集锦：狂三&黑白岩&亚丝娜&绫波丽

有遇到过这个问题么。/jsb-videoplayer.js not found, possible missing file.

MAS v2.8 magicgenius 汉化版 - 11.11更新

出售: Monster Cable Interlink Reference 2

福建佛教人士望云和尚(林斌)的九仙禅寺被强行收走，望云妈妈被赶出寺庙

R 语言中的OpenBLAS*和英特尔® 数学核心函数库的性能比较

[转载]煞貢、直星、人專吉日\金神七煞歌

HAKERS哈克士戶外 12月8~14日廠拍

OBS Studio 23.2.1 免安裝中文版 - 免費網路實況廣播軟體實況主必備軟體取代Fraps

<請教>行駛中安卓機會重新開機

Udp2raw-tunnel 及其一键安装脚本