FiveThirtyEight's polling data for the US Presidential election

3,000+ voting intention surveys

Like many others around the world, I have been watching with interest the democratic process in the United States of America. One of the most influential and watched websites is FiveThirtyEight.com , headed by Nate Silver, author of the excellent popularisation of the craft of statistical time series forecasting The Signal and the Noise . I have no intention of adding substantive commentary to the crowded space of analysis of US voting behaviour (some of which, and not just from FiveThirtyEight, is extremely high quality BTW), but I wanted at least poke around the data. In particular I’m interested in who is doing all these polls, and what does FiveThirtyEight do to the numbers before putting them into their forecasting models.

In pale grey text at the bottom of the election prediction page, FiveThirytyEight have a link to “Download CSV of polls”.

There are more than 9,000 (at the time of writing) rows in this spreadsheet, but a quick look shows that there are only 3,067 unique values of the poll_id field. Each poll is listed three times in this dataset: once each for “polls-only”, “polls-plus” and “now-cast”. Here’s R code for downloading the data and a first quick look at it:

library(dplyr) library(tidyr) library(ggplot2) library(viridis) library(ggthemes) # this next code will need to be adapted if you don't have a ../data/ folder... # alternatively, if that link stops working, there's a static copy at http://ellisp.github.io/data/polls.csv www <- "http://projects.fivethirtyeight.com/general-model/president_general_polls_2016.csv" download.file(www, destfile = "../data/polls.csv") # download data polls_orig <- read.csv("../data/polls.csv", stringsAsFactors = FALSE) table(table(polls_orig$poll_id)) ## 3 ## 3067 table(polls_orig$type) ## now-cast polls-only polls-plus ## 3067 3067 3067

Poll weight?

The polling data has a a poll_wt field. At first I thought this would estimate the value of the poll based on the grade of its provider’s quality (they are all rated from D to A+) and sample size, but it turns out to be more complicated than that, and to be based largely on how long before the election the data was collected. The values for polls-only and polls-plus (both of which forecast the election on its actual date) are identical; the now-cast weights are slightly higher. Sample size and pollster grade also contribute to the determination of poll_wt . Many of the older surveys have a poll_wt of pretty much zero, meaning they are not being used at all in estimating current predictions. All this can be seen in the graphic below.

That graphic was created with this code:

grades = c("A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D", "") # parse the dates as proper dates, and add structure to the grade polls <- polls_orig %>% mutate( startdate = as.Date(startdate, "%m/%d/%Y"), enddate = as.Date(enddate, "%m/%d/%Y"), grade = ordered(grade, levels = grades) ) polls %>% ggplot(aes(x = enddate, y = poll_wt, colour = type, size = samplesize)) + geom_point() + scale_y_sqrt() + facet_wrap(~grade)

For the rest of this post I’ll be focusing on the “polls-only” data.

Raw versus adjusted?

Each poll has two sets of percentages for each of four presidential candidates (Clinton, Trump, McMullin and Johnson - but McMullin is often missing). The two sets are labelled “raw” and “adj”. My understanding of the FiveThirtyEight approach is that the “adj” value reflects calibration for historical statistical bias of the individual polls (ie if they consistently have overestimated party X in the past, their reported proportion for party X is adjusted downwards). This is perfectly sound and standard, and it’s good that both the adjusted and unadjusted data are made available. Neither the raw nor adjusted series add up to 100, presumably because of undecided or otherwise non-informative respondents.

Here is a comparison of the raw and adjusted intended vote for Clinton:

FiveThirtyEight's polling data for the US Presidential election

The diagonal line shows parity - points on the line represent surveys where the adjusted value is exactly the same as raw. We see that much of the adjustment for Clinton is upwards. Before any conclusions are drawn from that, let’s check in on the same graphic for Trump:

We see that Trump’s vote also gets a share of upwards adjustment. Before leaving this, let’s look at the distribution of adjustment factors for them both:

For both candidates, the most common adjustment is actually very slightly downwards (ie just below 1). Clinton’s estimated intended votes do get adjusted up a little more than do Trump’s. With adjustments based on historical relationships between polls and results, this is the sort of thing to be uncertain about in this far-from-usual election year; but FiveThirtyEight only have history to go on. Certainly in the methodological pieces on their website like this one , they pay all due respect to this aspect of uncertainty.

pollsonly <- filter(polls, type == "polls-only") # compare the raw values with the adjusted values for Clinton pollsonly %>% ggplot(aes(x = rawpoll_clinton, y = adjpoll_clinton)) + geom_abline(intercept = 0, slope = 1, colour = "grey75") + geom_point(alpha = 0.4, aes(size = samplesize, colour = grade)) + scale_color_viridis(discrete = TRUE, option = "magma") + guides(colour = guide_legend(override.aes = list(alpha = 1, size = 2))) + coord_equal() # For Trump: pollsonly %>% ggplot(aes(x = rawpoll_trump, y = adjpoll_trump)) + geom_abline(intercept = 0, slope = 1, colour = "grey75") + geom_point(alpha = 0.4, aes(size = samplesize, colour = grade)) + scale_color_viridis(discrete = TRUE, option = "magma") + guides(colour = guide_legend(override.aes = list(alpha = 1, size = 2))) + coord_equal() # Distribution of adjustment factors: pollsonly %>% mutate(Trump = adjpoll_trump / rawpoll_trump, Clinton = adjpoll_clinton / rawpoll_clinton) %>% select(Trump, Clinton) %>% gather(candidate, adj_ratio) %>% ggplot(aes(x = adj_ratio, colour = candidate, fill = candidate)) + geom_density(alpha = 0.3) + ggtitle("Density of adjustment ratios for two main candidates")

Trends over time

Note that in the graphics above I’m ignoring the regional element of the data - some of the surveys’ target population is the whole USA, some are for specific locations, from 14 for the District of Colombia to 109 for the key state of Florida. This is the main reason for the large variance in estimated intended vote for Senator Clinton when looking at all the polls together.

More interesting is seeing trends in voting intention over time. In the media, these are usually shown as lovely clean time series charts with a single, simple line for each candidate. The reality of the data is much messier of course. Here’s the estimated intended vote for both Clinton and Trump, from just the surveys aiming to estimate voting behaviour of the whole U.S.A., illustrating the raw versus adjusted data:

I like this graphic because it illustrates several things:

the big range of individual survey results. All these surveys are trying to measure the same thing - intended vote in the same election, from the same population. Nothing better illustrates the folly of paying attention to only individual polls than looking at this spread. Pooling them - which is implicitly done by the smoothing blue line - is the way to go. one result of the adjustment process is to reduce the variance in survey results. For both Clinton and Trump, the adjusted poll results are tighter than the raw data. the basic story - Clinton is doing better than Trump - isn’t changed by the adjustment process a cluster of recent large sample size, “B” grade surveys have particularly low raw values for both Clinton and Trump and these are both adjusted upwards. Checking the data, these turn out to be “Google Consumer Surveys”, which are apparently judged by FiveThirtyEight to materially understate the probable vote for major party candidates and overstate the chances of Johnson (Johnson’s percentages in these surveys are typically adjusted down about three percentage points by FiveThirtyEight) It’s useful to think of each pollster’s data as a group of longitudinal results. This suggests a variant of our last graphic, the standard longitudinal spaghetti plot:
FiveThirtyEight's polling data for the US Presidential election

# show probability of voting Clinton and Trump over time, just the national polls: pollsonly %>% filter(state == "U.S.") %>% select(enddate, rawpoll_clinton, adjpoll_clinton, rawpoll_trump, adjpoll_trump, samplesize, grade) %>% gather(variable, value, -enddate, -samplesize, -grade) %>% ggplot(aes(x = enddate, y = value / 100)) + facet_wrap(~variable) + geom_point(alpha = 0.4, aes(colour = grade, size = samplesize)) + geom_smooth(aes(weight = samplesize), span = 0.5) + scale_color_viridis("Pollster grade", discrete = TRUE, option = "magma") + guides(colour = guide_legend(override.aes = list(alpha = 1, size = 2))) + scale_y_continuous("Percentage of intended vote\n", label = percent) + scale_size_area("Sample size", label = comma) + ggtitle("Intended vote in the US Presidential election", subtitle = "National surveys only") + labs(x = "End date of survey", caption = "Data compiled by FiveThirtyEight, analysis by http://ellisp.github.io") # longitudinal version, with lines connecting pollsters: pollsonly %>% filter(state == "U.S.") %>% select(enddate, rawpoll_clinton, adjpoll_clinton, rawpoll_trump, adjpoll_trump, samplesize, grade, pollster) %>% gather(variable, value, -enddate, -samplesize, -grade, -pollster) %>% ggplot(aes(x = enddate, y = value / 100, colour = grade)) + facet_wrap(~variable) + geom_line(aes(group = pollster)) + scale_color_viridis("Pollster grade", discrete = TRUE, option = "magma") + guides(colour = guide_legend(override.aes = list(alpha = 1, size = 2))) + scale_y_continuous("Percentage of intended vote\n", label = percent) + ggtitle("Intended vote in the US Presidential election", subtitle = "National surveys only, grouped by pollster") + labs(x = "End date of survey", caption = "Data compiled by FiveThirtyEight, analysis by http://ellisp.github.io")

Who are all these pollsters?

Finally, I wanted a graphic that showed the relative influence of the various pollsters on FiveThirtyEight’s polling forecasts. This takes me back to the poll_wt variable.

Unfortunately visualising all 180 pollsters is a non-trivial problem. After trialling various ways of showing pollster grade, sample size, and total weight simultaneously I fell back on a straightforward wordcloud:

It could definitely be improved on with some imagination.

Survey Monkey polls are the most influential in the data set despite their relatively low grade (C-) because of the sheer number of them - at least three surveys in every state so far plus 30 national surveys. Ipsos, Google and Survey Monkey between them have surveyed 1.75 million people so far in this campaign.

Final code snippet - creating the wordcloud:

# create summary pollster info. Most pollsters always have the same grade, a few # sometimes have different grades and sometimes have their grade missing palette <- magma(length(grades))[length(grades):1] pollsters <- pollsonly %>% group_by(pollster) %>% mutate(graden = ifelse(grade == "", NA, as.numeric(grade))) %>% summarise(grade = round(mean(graden, na.rm = TRUE)), totalweight = sum(poll_wt), totalsample = sum(samplesize)) %>% mutate(grade = ifelse(is.na(grade), 11, grade), grade = factor(grade, levels = 1:11, labels = grades), grade_colours = palette[as.numeric(grade)]) %>% ungroup() par(bg = "grey40") set.seed(223) wordcloud(words = pollsters$pollster, freq = pollsters$totalweight * 10000, colors = pollsters$grade_colours, random.order = FALSE, ordered.colors = TRUE, family = "myfont") grid.text(0.55, 0.03, hjust = 0, label = "Size is mapped to total combined weight as at 29/10/2016;\nColour is mapped to pollster grade (lighter is better).", gp = gpar(col = "grey90", fontfamily = "myfont", cex = 0.8)) grid.text(0.03, 0.97, hjust = 0, label = "The pollster data used by FiveThirtyEight", gp = gpar(col = "grey90", fontfamily = "myfont", cex = 1.5)) legend("bottomleft", title = "", bg = "grey45", box.lty = 0, legend = c(levels(pollsters$grade)[1:10], "None"), text.col = palette)

FiveThirtyEight's polling data for the US Presidential election

Trending Articles

《沈冰自述——我和周永康的故事》全本

Moog - Subsequent 25

出售: 林憶蓮•回來愛的身邊 (東芝1A1頭版)

筆記 - 使用 PowerShell 清除停用 AD 帳號與 OU

df-dferh-01 中国区 Android 安装 Google Play Store 后报错的解决办法

「一棒接一棒、棒棒強棒」108學年度家長會長交接典禮

吸烟与MBTI类型判断捷径 (豆瓣 INFJ的奇幻之旅小组)

acermark龍璿國際展出多款包裝設備

枋寮北勢寮隆山宮睽違12年再辦迎王祭典

日本女优有村千佳COS集锦：狂三&黑白岩&亚丝娜&绫波丽

有遇到过这个问题么。/jsb-videoplayer.js not found, possible missing file.

MAS v2.8 magicgenius 汉化版 - 11.11更新

出售: Monster Cable Interlink Reference 2

福建佛教人士望云和尚(林斌)的九仙禅寺被强行收走，望云妈妈被赶出寺庙

R 语言中的OpenBLAS*和英特尔® 数学核心函数库的性能比较

[转载]煞貢、直星、人專吉日\金神七煞歌

HAKERS哈克士戶外 12月8~14日廠拍

OBS Studio 23.2.1 免安裝中文版 - 免費網路實況廣播軟體實況主必備軟體取代Fraps

<請教>行駛中安卓機會重新開機

Udp2raw-tunnel 及其一键安装脚本