Easy time-series prediction with R: a tutorial with air traffic data from Lux Ai ...

(This article was first published on Econometrics and Free Software , and kindly contributed toR-bloggers)

In this blog post, I will show you how you can quickly and easily forecast a univariate time series.

I am going to use data from the EU Open Data Portal on air passenger transport. You can find the

data here . I downloaded

the data in the TSV format for Luxembourg Airport, but you could repeat the analysis for any airport.

Once you have the data, load some of the package we are going to need:

library(tidyverse) library(lubridate) library(forecast) library(tsibble) library(brotools)

and define the following function:

ihs <- function(x){ log(x + sqrt(x**2 + 1)) }

This function, the inverse hyperbolic sine, is useful to transform data in a manner that is very

close to logging it, but that allows for 0’s. The data from Eurostat is not complete for some reason,

so there are some 0 sometimes. To avoid having to log 0, which in R yields -Inf , I use this

transformation.

Now, let’s load the data:

avia <- read_tsv("avia_par_lu.tsv") ## Parsed with column specification: ## cols( ## .default = col_character() ## ) ## See spec(...) for full column specifications.

Let’s take a look at the data:

head(avia) ## # A tibble: 6 x 238 ## `unit,tra_meas,… `2018Q1` `2018M03` `2018M02` `2018M01` `2017Q4` `2017Q3` ## <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 FLIGHT,CAF_PAS,… 511 172 161 178 502 475 ## 2 FLIGHT,CAF_PAS,… : : : : : : ## 3 FLIGHT,CAF_PAS,… : : : : 399 306 ## 4 FLIGHT,CAF_PAS,… 485 167 151 167 493 497 ## 5 FLIGHT,CAF_PAS,… 834 293 267 274 790 728 ## 6 FLIGHT,CAF_PAS,… : : : : : : ## # ... with 231 more variables: `2017Q2` <chr> , `2017Q1` <chr> , ## # `2017M12` <chr> , `2017M11` <chr> , `2017M10` <chr> , `2017M09` <chr> , ## # `2017M08` <chr> , `2017M07` <chr> , `2017M06` <chr> , `2017M05` <chr> , ## # `2017M04` <chr> , `2017M03` <chr> , `2017M02` <chr> , `2017M01` <chr> , ## # `2017` <chr> , `2016Q4` <chr> , `2016Q3` <chr> , `2016Q2` <chr> , ## # `2016Q1` <chr> , `2016M12` <chr> , `2016M11` <chr> , `2016M10` <chr> , ## # `2016M09` <chr> , `2016M08` <chr> , `2016M07` <chr> , `2016M06` <chr> , ## # `2016M05` <chr> , `2016M04` <chr> , `2016M03` <chr> , `2016M02` <chr> , ## # `2016M01` <chr> , `2016` <chr> , `2015Q4` <chr> , `2015Q3` <chr> , ## # `2015Q2` <chr> , `2015Q1` <chr> , `2015M12` <chr> , `2015M11` <chr> , ## # `2015M10` <chr> , `2015M09` <chr> , `2015M08` <chr> , `2015M07` <chr> , ## # `2015M06` <chr> , `2015M05` <chr> , `2015M04` <chr> , `2015M03` <chr> , ## # `2015M02` <chr> , `2015M01` <chr> , `2015` <chr> , `2014Q4` <chr> , ## # `2014Q3` <chr> , `2014Q2` <chr> , `2014Q1` <chr> , `2014M12` <chr> , ## # `2014M11` <chr> , `2014M10` <chr> , `2014M09` <chr> , `2014M08` <chr> , ## # `2014M07` <chr> , `2014M06` <chr> , `2014M05` <chr> , `2014M04` <chr> , ## # `2014M03` <chr> , `2014M02` <chr> , `2014M01` <chr> , `2014` <chr> , ## # `2013Q4` <chr> , `2013Q3` <chr> , `2013Q2` <chr> , `2013Q1` <chr> , ## # `2013M12` <chr> , `2013M11` <chr> , `2013M10` <chr> , `2013M09` <chr> , ## # `2013M08` <chr> , `2013M07` <chr> , `2013M06` <chr> , `2013M05` <chr> , ## # `2013M04` <chr> , `2013M03` <chr> , `2013M02` <chr> , `2013M01` <chr> , ## # `2013` <chr> , `2012Q4` <chr> , `2012Q3` <chr> , `2012Q2` <chr> , ## # `2012Q1` <chr> , `2012M12` <chr> , `2012M11` <chr> , `2012M10` <chr> , ## # `2012M09` <chr> , `2012M08` <chr> , `2012M07` <chr> , `2012M06` <chr> , ## # `2012M05` <chr> , `2012M04` <chr> , `2012M03` <chr> , `2012M02` <chr> , ## # `2012M01` <chr> , `2012` <chr> , … </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr> </chr>

So yeah, useless in that state. The first column actually is composed of 3 columns, merged together,

and instead of having one column with the date, and another with the value, we have one column

per date. Some cleaning is necessary before using this data.

Let’s start with going from a wide to a long data set:

avia %>% select("unit,tra_meas,airp_pr\\time", contains("20")) %>% gather(date, passengers, -`unit,tra_meas,airp_pr\\time`)

The first line makes it possible to only select the columns that contain the string “20”, so

selecting columns from 2000 onward. Then, using gather, I go from long to wide. The data looks

like this now:

## # A tibble: 117,070 x 3 ## `unit,tra_meas,airp_pr\\time` date passengers ## <chr> <chr> <chr> ## 1 FLIGHT,CAF_PAS,LU_ELLX_AT_LOWW 2018Q1 511 ## 2 FLIGHT,CAF_PAS,LU_ELLX_BE_EBBR 2018Q1 : ## 3 FLIGHT,CAF_PAS,LU_ELLX_CH_LSGG 2018Q1 : ## 4 FLIGHT,CAF_PAS,LU_ELLX_CH_LSZH 2018Q1 485 ## 5 FLIGHT,CAF_PAS,LU_ELLX_DE_EDDF 2018Q1 834 ## 6 FLIGHT,CAF_PAS,LU_ELLX_DE_EDDI 2018Q1 : ## 7 FLIGHT,CAF_PAS,LU_ELLX_DE_EDDM 2018Q1 1095 ## 8 FLIGHT,CAF_PAS,LU_ELLX_DE_EDDR 2018Q1 : ## 9 FLIGHT,CAF_PAS,LU_ELLX_DE_EDDT 2018Q1 : ## 10 FLIGHT,CAF_PAS,LU_ELLX_DK_EKCH 2018Q1 : ## # ... with 117,060 more rows </chr> </chr> </chr>

Now, let’s separate the first column into 3 columns:

avia %>% select("unit,tra_meas,airp_pr\\time", contains("20")) %>% gather(date, passengers, -`unit,tra_meas,airp_pr\\time`) %>% separate(col = `unit,tra_meas,airp_pr\\time`, into = c("unit", "tra_meas", "air_pr\\time"), sep = ",")

This separates the first column into 3 new columns, “unit”, “tra_meas” and “air_pr\time”. This step

is not necessary for the rest of the analysis, but might as well do it. The data looks like this now:

## # A tibble: 117,070 x 5 ## unit tra_meas `air_pr\\time` date passengers ## <chr> <chr> <chr> <chr> <chr> ## 1 FLIGHT CAF_PAS LU_ELLX_AT_LOWW 2018Q1 511 ## 2 FLIGHT CAF_PAS LU_ELLX_BE_EBBR 2018Q1 : ## 3 FLIGHT CAF_PAS LU_ELLX_CH_LSGG 2018Q1 : ## 4 FLIGHT CAF_PAS LU_ELLX_CH_LSZH 2018Q1 485 ## 5 FLIGHT CAF_PAS LU_ELLX_DE_EDDF 2018Q1 834 ## 6 FLIGHT CAF_PAS LU_ELLX_DE_EDDI 2018Q1 : ## 7 FLIGHT CAF_PAS LU_ELLX_DE_EDDM 2018Q1 1095 ## 8 FLIGHT CAF_PAS LU_ELLX_DE_EDDR 2018Q1 : ## 9 FLIGHT CAF_PAS LU_ELLX_DE_EDDT 2018Q1 : ## 10 FLIGHT CAF_PAS LU_ELLX_DK_EKCH 2018Q1 : ## # ... with 117,060 more rows </chr> </chr> </chr> </chr> </chr>

The next steps are simple renamings. I have copy-pasted the information from the Eurostat page

where you can view the data online .

If you click here:

you will be able to select the variables you want displayed in the table, as well as the dictionary

of the variables. I simply copy pasted it and recoded the variables. You can take a look at the

whole cleaning workflow by clicking “Click to expand” below:

Click here to take a look at the whole cleaning workflow

avia_clean <- avia %>% select("unit,tra_meas,airp_pr\\time", contains("20")) %>% gather(date, passengers, -`unit,tra_meas,airp_pr\\time`) %>% separate(col = `unit,tra_meas,airp_pr\\time`, into = c("unit", "tra_meas", "air_pr\\time"), sep = ",") %>% mutate(tra_meas = fct_recode(tra_meas, `Passengers on board` = "PAS_BRD", `Passengers on board (arrivals)` = "PAS_BRD_ARR", `Passengers on board (departures)` = "PAS_BRD_DEP", `Passengers carried` = "PAS_CRD", `Passengers carried (arrival)` = "PAS_CRD_ARR", `Passengers carried (departures)` = "PAS_CRD_DEP", `Passengers seats available` = "ST_PAS", `Passengers seats available (arrivals)` = "ST_PAS_ARR", `Passengers seats available (departures)` = "ST_PAS_DEP", `Commercial passenger air flights` = "CAF_PAS", `Commercial passenger air flights (arrivals)` = "CAF_PAS_ARR", `Commercial passenger air flights (departures)` = "CAF_PAS_DEP")) %>% mutate(unit = fct_recode(unit, Passenger = "PAS", Flight = "FLIGHT", `Seats and berths` = "SEAT")) %>% mutate(destination = fct_recode(`air_pr\\time`, `WIEN-SCHWECHAT` = "LU_ELLX_AT_LOWW", `BRUSSELS` = "LU_ELLX_BE_EBBR", `GENEVA` = "LU_ELLX_CH_LSGG", `ZURICH` = "LU_ELLX_CH_LSZH", `FRANKFURT/MAIN` = "LU_ELLX_DE_EDDF", `HAMBURG` = "LU_ELLX_DE_EDDH", `BERLIN-TEMPELHOF` = "LU_ELLX_DE_EDDI", `MUENCHEN` = "LU_ELLX_DE_EDDM", `SAARBRUECKEN` = "LU_ELLX_DE_EDDR", `BERLIN-TEGEL` = "LU_ELLX_DE_EDDT", `KOBENHAVN/KASTRUP` = "LU_ELLX_DK_EKCH", `HURGHADA / INTL` = "LU_ELLX_EG_HEGN", `IRAKLION/NIKOS KAZANTZAKIS` = "LU_ELLX_EL_LGIR", `FUERTEVENTURA` = "LU_ELLX_ES_GCFV", `GRAN CANARIA` = "LU_ELLX_ES_GCLP", `LANZAROTE` = "LU_ELLX_ES_GCRR", `TENERIFE SUR/REINA SOFIA` = "LU_ELLX_ES_GCTS", `BARCELONA/EL PRAT` = "LU_ELLX_ES_LEBL", `ADOLFO SUAREZ MADRID-BARAJAS` = "LU_ELLX_ES_LEMD", `MALAGA/COSTA DEL SOL` = "LU_ELLX_ES_LEMG", `PALMA DE MALLORCA` = "LU_ELLX_ES_LEPA", `SYSTEM - PARIS` = "LU_ELLX_FR_LF90", `NICE-COTE D'AZUR` = "LU_ELLX_FR_LFMN", `PARIS-CHARLES DE GAULLE` = "LU_ELLX_FR_LFPG", `STRASBOURG-ENTZHEIM` = "LU_ELLX_FR_LFST", `KEFLAVIK` = "LU_ELLX_IS_BIKF", `MILANO/MALPENSA` = "LU_ELLX_IT_LIMC", `BERGAMO/ORIO AL SERIO` = "LU_ELLX_IT_LIME", `ROMA/FIUMICINO` = "LU_ELLX_IT_LIRF", `AGADIR/AL MASSIRA` = "LU_ELLX_MA_GMAD", `AMSTERDAM/SCHIPHOL` = "LU_ELLX_NL_EHAM", `WARSZAWA/CHOPINA` = "LU_ELLX_PL_EPWA", `PORTO` = "LU_ELLX_PT_LPPR", `LISBOA` = "LU_ELLX_PT_LPPT", `STOCKHOLM/ARLANDA` = "LU_ELLX_SE_ESSA", `MONASTIR/HABIB BOURGUIBA` = "LU_ELLX_TN_DTMB", `ENFIDHA-HAMMAMET INTERNATIONAL` = "LU_ELLX_TN_DTNH", `ENFIDHA ZINE EL ABIDINE BEN ALI` = "LU_ELLX_TN_DTNZ", `DJERBA/ZARZIS` = "LU_ELLX_TN_DTTJ", `ANTALYA (MIL-CIV)` = "LU_ELLX_TR_LTAI", `ISTANBUL/ATATURK` = "LU_ELLX_TR_LTBA", `SYSTEM - LONDON` = "LU_ELLX_UK_EG90", `MANCHESTER` = "LU_ELLX_UK_EGCC", `LONDON GATWICK` = "LU_ELLX_UK_EGKK", `LONDON/CITY` = "LU_ELLX_UK_EGLC", `LONDON HEATHROW` = "LU_ELLX_UK_EGLL", `LONDON STANSTED` = "LU_ELLX_UK_EGSS", `NEWARK LIBERTY INTERNATIONAL, NJ.` = "LU_ELLX_US_KEWR", `O.R TAMBO INTERNATIONAL` = "LU_ELLX_ZA_FAJS")) %>% mutate(passengers = as.numeric(passengers)) %>% select(unit, tra_meas, destination, date, passengers)

## Warning in evalq(as.numeric(passengers), <environment> ): NAs introduced by ## coercion </environment>

There is quarterly data and monthly data. Let’s separate the two:

avia_clean_quarterly <- avia_clean %>% filter(tra_meas == "Passengers on board (arrivals)", !is.na(passengers)) %>% filter(str_detect(date, "Q")) %>% mutate(date = yq(date))

In the “date” column, I detect the observations with “Q” in their name, indicating that it is quarterly data.

I do the same for monthly data, but I have to add the string “01” to the dates. This transforms

a date that looks like this “2018M1” to this “2018M101”. “2018M101” can then be converted into a

date by using the ymd() function from lubridate. yq() was used for the quarterly data.

avia_clean_monthly <- avia_clean %>% filter(tra_meas == "Passengers on board (arrivals)", !is.na(passengers)) %>% filter(str_detect(date, "M")) %>% mutate(date = paste0(date, "01")) %>% mutate(date = ymd(date)) %>% select(destination, date, passengers)

Time for some plots. Let’s start with the raw data:

avia_clean_monthly %>% group_by(date) %>% summarise(total = sum(passengers)) %>% ggplot() + ggtitle("Raw data") + geom_line(aes(y = total, x = date), colour = "#82518c") + scale_x_date(date_breaks = "1 year", date_labels = "%m-%Y") + theme_blog()
Easy time-series prediction with R: a tutorial with air traffic data from Lux Ai ...

And now with the logged data (or rather, the data transformed using the inverted hyperbolic sine

transformation):

We clearly see a seasonal pattern in the data. There is also an upward trend. We will have to deal

with these two problems if we want to do some forecasting. For this, let’s limit ourselves to data

from before 2015, and convert the “passengers” column from the data to a time series object, using

the ts() function:

avia_clean_train <- avia_clean_monthly %>% select(date, passengers) %>% filter(year(date) < 2015) %>% group_by(date) %>% summarise(total_passengers = sum(passengers)) %>% pull(total_passengers) %>% ts(., frequency = 12, start = c(2005, 1))

We will try to pseudo -forecast the data from 2015 to the last point available, March 2018.

First, let’s tranform the data:

logged_data <- ihs(avia_clean_train)

Taking the log, or ihs of the data deals with stabilizing the variance of the time series.

There might also be a need to difference the data. Computing the differences between consecutive

observations makes the time-series stationary. This will be taken care of by the auto.arima()

function, if needed. The auto.arima() function returns the best ARIMA model according to different

statistical criterions, sach us the AIC, AICc or BIC.

(model_fit <- auto.arima(logged_data)) ## Series: logged_data ## ARIMA(2,1,1)(2,1,0)[12] ## ## Coefficients: ## ar1 ar2 ma1 sar1 sar2 ## -0.4061 -0.2431 -0.3562 -0.5590 -0.3282 ## s.e. 0.2003 0.1432 0.1994 0.0911 0.0871 ## ## sigma^2 estimated as 0.004503: log likelihood=137.11 ## AIC=-262.21 AICc=-261.37 BIC=-246.17

auto.arima() found that the best model would be an $ARIMA(2, 1, 1)(2, 1, 0)_{12}$ . This is an

seasonal autoregressive model, with p = 2, d = 1, q = 1, P = 2 and D = 1.

model_forecast <- forecast(model_fit, h = 39)

I can now forecast the model for the next 39 months (which correspond to the data available).

To plot the forecast, one could do a simple call to the plot function. But the resulting plot

is not very aesthetic. To plot my own, I have to grab the data that was forecast, and do some

munging again:

point_estimate <- model_forecast$mean %>% as_tsibble() %>% rename(point_estimate = value, date = index) upper <- model_forecast$upper %>% as_tsibble() %>% spread(key, value) %>% rename(date = index, upper80 = `80%`, upper95 = `95%`) lower <- model_forecast$lower %>% as_tsibble() %>% spread(key, value) %>% rename(date = index, lower80 = `80%`, lower95 = `95%`) estimated_data <- reduce(list(point_estimate, upper, lower), full_join, by = "date")

as_tibble() is a function from the {tsibble} package that converts objects that are time-series aware

to time-aware tibbles. If you are not familiar with ts_tibble() , I urge you to run the above lines

one by one, and especially to compare as_tsibble() with the standard as_tibble() from the {tibble}

package.

This is how estimated_data looks:

head(estimated_data) ## # A tsibble: 6 x 6 [1M] ## date point_estimate upper80 upper95 lower80 lower95 ## <mth> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2015 Jan 11.9 12.0 12.1 11.8 11.8 ## 2 2015 Feb 11.9 12.0 12.0 11.8 11.7 ## 3 2015 Mar 12.1 12.2 12.3 12.0 12.0 ## 4 2015 Apr 12.2 12.3 12.4 12.1 12.1 ## 5 2015 May 12.3 12.4 12.4 12.2 12.1 ## 6 2015 Jun 12.3 12.4 12.5 12.2 12.1 </dbl> </dbl> </dbl> </dbl> </dbl> </mth>

We can now plot the data, with the forecast, and with the 95% confidence interval:

avia_clean_monthly %>% group_by(date) %>% summarise(total = sum(passengers)) %>% mutate(total_ihs = ihs(total)) %>% ggplot() + ggtitle("Logged data") + geom_line(aes(y = total_ihs, x = date), colour = "#82518c") + scale_x_date(date_breaks = "1 year", date_labels = "%m-%Y") + geom_ribbon(data = estimated_data, aes(x = date, ymin = lower95, ymax = upper95), fill = "#666018", alpha = 0.2) + geom_line(data = estimated_data, aes(x = date, y = point_estimate), linetype = 2, colour = "#8e9d98") + theme_blog()
Easy time-series prediction with R: a tutorial with air traffic data from Lux Ai ...

The pseudo-forecast (the dashed line) is not very far from the truth, only overestimating the

seasonal peaks, but the true line is within the 95% confidence interval, which is good!

If you found this blog post useful, you might want to follow me on twitter for blog post updates.

Easy time-series prediction with R: a tutorial with air traffic data from Lux Ai ...

Trending Articles

《沈冰自述——我和周永康的故事》全本

Moog - Subsequent 25

出售: 林憶蓮•回來愛的身邊 (東芝1A1頭版)

筆記 - 使用 PowerShell 清除停用 AD 帳號與 OU

df-dferh-01 中国区 Android 安装 Google Play Store 后报错的解决办法

「一棒接一棒、棒棒強棒」108學年度家長會長交接典禮

吸烟与MBTI类型判断捷径 (豆瓣 INFJ的奇幻之旅小组)

acermark龍璿國際展出多款包裝設備

枋寮北勢寮隆山宮睽違12年再辦迎王祭典

日本女优有村千佳COS集锦：狂三&黑白岩&亚丝娜&绫波丽

有遇到过这个问题么。/jsb-videoplayer.js not found, possible missing file.

MAS v2.8 magicgenius 汉化版 - 11.11更新

出售: Monster Cable Interlink Reference 2

福建佛教人士望云和尚(林斌)的九仙禅寺被强行收走，望云妈妈被赶出寺庙

R 语言中的OpenBLAS*和英特尔® 数学核心函数库的性能比较

[转载]煞貢、直星、人專吉日\金神七煞歌

HAKERS哈克士戶外 12月8~14日廠拍

OBS Studio 23.2.1 免安裝中文版 - 免費網路實況廣播軟體實況主必備軟體取代Fraps

<請教>行駛中安卓機會重新開機

Udp2raw-tunnel 及其一键安装脚本