Like many others around the world, I have been watching with interest the democratic process in the United States of America. One of the most influential and watched websites is FiveThirtyEight.com , headed by Nate Silver, author of the excellent popularisation of the craft of statistical time series forecasting The Signal and the Noise . I have no intention of adding substantive commentary to the crowded space of analysis of US voting behaviour (some of which, and not just from FiveThirtyEight, is extremely high quality BTW), but I wanted at least poke around the data. In particular I’m interested in who is doing all these polls, and what does FiveThirtyEight do to the numbers before putting them into their forecasting models.
In pale grey text at the bottom of the election prediction page, FiveThirytyEight have a link to “Download CSV of polls”.
There are more than 9,000 (at the time of writing) rows in this spreadsheet, but a quick look shows that there are only 3,067 unique values of the poll_id field. Each poll is listed three times in this dataset: once each for “polls-only”, “polls-plus” and “now-cast”. Here’s R code for downloading the data and a first quick look at it:
library(dplyr) library(tidyr) library(ggplot2) library(viridis) library(ggthemes) # this next code will need to be adapted if you don't have a ../data/ folder... # alternatively, if that link stops working, there's a static copy at http://ellisp.github.io/data/polls.csv www <- "http://projects.fivethirtyeight.com/general-model/president_general_polls_2016.csv" download.file(www, destfile = "../data/polls.csv") # download data polls_orig <- read.csv("../data/polls.csv", stringsAsFactors = FALSE) table(table(polls_orig$poll_id)) ## 3 ## 3067 table(polls_orig$type) ## now-cast polls-only polls-plus ## 3067 3067 3067
Poll weight?The polling data has a a poll_wt field. At first I thought this would estimate the value of the poll based on the grade of its provider’s quality (they are all rated from D to A+) and sample size, but it turns out to be more complicated than that, and to be based largely on how long before the election the data was collected. The values for polls-only and polls-plus (both of which forecast the election on its actual date) are identical; the now-cast weights are slightly higher. Sample size and pollster grade also contribute to the determination of poll_wt . Many of the older surveys have a poll_wt of pretty much zero, meaning they are not being used at all in estimating current predictions. All this can be seen in the graphic below.
That graphic was created with this code:
grades = c("A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D", "") # parse the dates as proper dates, and add structure to the grade polls <- polls_orig %>% mutate( startdate = as.Date(startdate, "%m/%d/%Y"), enddate = as.Date(enddate, "%m/%d/%Y"), grade = ordered(grade, levels = grades) ) polls %>% ggplot(aes(x = enddate, y = poll_wt, colour = type, size = samplesize)) + geom_point() + scale_y_sqrt() + facet_wrap(~grade)
For the rest of this post I’ll be focusing on the “polls-only” data.
Raw versus adjusted?Each poll has two sets of percentages for each of four presidential candidates (Clinton, Trump, McMullin and Johnson - but McMullin is often missing). The two sets are labelled “raw” and “adj”. My understanding of the FiveThirtyEight approach is that the “adj” value reflects calibration for historical statistical bias of the individual polls (ie if they consistently have overestimated party X in the past, their reported proportion for party X is adjusted downwards). This is perfectly sound and standard, and it’s good that both the adjusted and unadjusted data are made available. Neither the raw nor adjusted series add up to 100, presumably because of undecided or otherwise non-informative respondents.
Here is a comparison of the raw and adjusted intended vote for Clinton:
The diagonal line shows parity - points on the line represent surveys where the adjusted value is exactly the same as raw. We see that much of the adjustment for Clinton is upwards. Before any conclusions are drawn from that, let’s check in on the same graphic for Trump:
We see that Trump’s vote also gets a share of upwards adjustment. Before leaving this, let’s look at the distribution of adjustment factors for them both:
For both candidates, the most common adjustment is actually very slightly downwards (ie just below 1). Clinton’s estimated intended votes do get adjusted up a little more than do Trump’s. With adjustments based on historical relationships between polls and results, this is the sort of thing to be uncertain about in this far-from-usual election year; but FiveThirtyEight only have history to go on. Certainly in the methodological pieces on their website like this one , they pay all due respect to this aspect of uncertainty.
pollsonly <- filter(polls, type == "polls-only") # compare the raw values with the adjusted values for Clinton pollsonly %>% ggplot(aes(x = rawpoll_clinton, y = adjpoll_clinton)) + geom_abline(intercept = 0, slope = 1, colour = "grey75") + geom_point(alpha = 0.4, aes(size = samplesize, colour = grade)) + scale_color_viridis(discrete = TRUE, option = "magma") + guides(colour = guide_legend(override.aes = list(alpha = 1, size = 2))) + coord_equal() # For Trump: pollsonly %>% ggplot(aes(x = rawpoll_trump, y = adjpoll_trump)) + geom_abline(intercept = 0, slope = 1, colour = "grey75") + geom_point(alpha = 0.4, aes(size = samplesize, colour = grade)) + scale_color_viridis(discrete = TRUE, option = "magma") + guides(colour = guide_legend(override.aes = list(alpha = 1, size = 2))) + coord_equal() # Distribution of adjustment factors: pollsonly %>% mutate(Trump = adjpoll_trump / rawpoll_trump, Clinton = adjpoll_clinton / rawpoll_clinton) %>% select(Trump, Clinton) %>% gather(candidate, adj_ratio) %>% ggplot(aes(x = adj_ratio, colour = candidate, fill = candidate)) + geom_density(alpha = 0.3) + ggtitle("Density of adjustment ratios for two main candidates")
Trends over timeNote that in the graphics above I’m ignoring the regional element of the data - some of the surveys’ target population is the whole USA, some are for specific locations, from 14 for the District of Colombia to 109 for the key state of Florida. This is the main reason for the large variance in estimated intended vote for Senator Clinton when looking at all the polls together.
More interesting is seeing trends in voting intention over time. In the media, these are usually shown as lovely clean time series charts with a single, simple line for each candidate. The reality of the data is much messier of course. Here’s the estimated intended vote for both Clinton and Trump, from just the surveys aiming to estimate voting behaviour of the whole U.S.A., illustrating the raw versus adjusted data:
I like this graphic because it illustrates several things:
the big range of individual survey results. All these surveys are trying to measure the same thing - intended vote in the same election, from the same population. Nothing better illustrates the folly of paying attention to only individual polls than looking at this spread. Pooling them - which is implicitly done by the smoothing blue line - is the way to go. one result of the adjustment process is to reduce the variance in survey results. For both Clinton and Trump, the adjusted poll results are tighter than the raw data. the basic story - Clinton is doing better than Trump - isn’t changed by the adjustment process a cluster of recent large sample size, “B” grade surveys have particularly low raw values for both Clinton and Trump and these are both adjusted upwards. Checking the data, these turn out to be “Google Consumer Surveys”, which are apparently judged by FiveThirtyEight to materially understate the probable vote for major party candidates and overstate the chances of Johnson (Johnson’s percentages in these surveys are typically adjusted down about three percentage points by FiveThirtyEight) It’s useful to think of each pollster’s data as a group of longitudinal results. This suggests a variant of our last graphic, the standard longitudinal spaghetti plot:# show probability of voting Clinton and Trump over time, just the national polls: pollsonly %>% filter(state == "U.S.") %>% select(enddate, rawpoll_clinton, adjpoll_clinton, rawpoll_trump, adjpoll_trump, samplesize, grade) %>% gather(variable, value, -enddate, -samplesize, -grade) %>% ggplot(aes(x = enddate, y = value / 100)) + facet_wrap(~variable) + geom_point(alpha = 0.4, aes(colour = grade, size = samplesize)) + geom_smooth(aes(weight = samplesize), span = 0.5) + scale_color_viridis("Pollster grade", discrete = TRUE, option = "magma") + guides(colour = guide_legend(override.aes = list(alpha = 1, size = 2))) + scale_y_continuous("Percentage of intended vote\n", label = percent) + scale_size_area("Sample size", label = comma) + ggtitle("Intended vote in the US Presidential election", subtitle = "National surveys only") + labs(x = "End date of survey", caption = "Data compiled by FiveThirtyEight, analysis by http://ellisp.github.io") # longitudinal version, with lines connecting pollsters: pollsonly %>% filter(state == "U.S.") %>% select(enddate, rawpoll_clinton, adjpoll_clinton, rawpoll_trump, adjpoll_trump, samplesize, grade, pollster) %>% gather(variable, value, -enddate, -samplesize, -grade, -pollster) %>% ggplot(aes(x = enddate, y = value / 100, colour = grade)) + facet_wrap(~variable) + geom_line(aes(group = pollster)) + scale_color_viridis("Pollster grade", discrete = TRUE, option = "magma") + guides(colour = guide_legend(override.aes = list(alpha = 1, size = 2))) + scale_y_continuous("Percentage of intended vote\n", label = percent) + ggtitle("Intended vote in the US Presidential election", subtitle = "National surveys only, grouped by pollster") + labs(x = "End date of survey", caption = "Data compiled by FiveThirtyEight, analysis by http://ellisp.github.io")
Who are all these pollsters?Finally, I wanted a graphic that showed the relative influence of the various pollsters on FiveThirtyEight’s polling forecasts. This takes me back to the poll_wt variable.
Unfortunately visualising all 180 pollsters is a non-trivial problem. After trialling various ways of showing pollster grade, sample size, and total weight simultaneously I fell back on a straightforward wordcloud:
It could definitely be improved on with some imagination.
Survey Monkey polls are the most influential in the data set despite their relatively low grade (C-) because of the sheer number of them - at least three surveys in every state so far plus 30 national surveys. Ipsos, Google and Survey Monkey between them have surveyed 1.75 million people so far in this campaign.
Final code snippet - creating the wordcloud:
# create summary pollster info. Most pollsters always have the same grade, a few # sometimes have different grades and sometimes have their grade missing palette <- magma(length(grades))[length(grades):1] pollsters <- pollsonly %>% group_by(pollster) %>% mutate(graden = ifelse(grade == "", NA, as.numeric(grade))) %>% summarise(grade = round(mean(graden, na.rm = TRUE)), totalweight = sum(poll_wt), totalsample = sum(samplesize)) %>% mutate(grade = ifelse(is.na(grade), 11, grade), grade = factor(grade, levels = 1:11, labels = grades), grade_colours = palette[as.numeric(grade)]) %>% ungroup() par(bg = "grey40") set.seed(223) wordcloud(words = pollsters$pollster, freq = pollsters$totalweight * 10000, colors = pollsters$grade_colours, random.order = FALSE, ordered.colors = TRUE, family = "myfont") grid.text(0.55, 0.03, hjust = 0, label = "Size is mapped to total combined weight as at 29/10/2016;\nColour is mapped to pollster grade (lighter is better).", gp = gpar(col = "grey90", fontfamily = "myfont", cex = 0.8)) grid.text(0.03, 0.97, hjust = 0, label = "The pollster data used by FiveThirtyEight", gp = gpar(col = "grey90", fontfamily = "myfont", cex = 1.5)) legend("bottomleft", title = "", bg = "grey45", box.lty = 0, legend = c(levels(pollsters$grade)[1:10], "None"), text.col = palette)