Channel: CodeSection,代码区,网络安全 - CodeSec
Viewing all articles
Browse latest Browse all 12749

Animating regression models in R using broom and ggplot2


Animating regression models in R using broom and ggplot2
My first article on Towards Data Science was the result of a little exercise I set myself to keep the little grey cells ticking over. This is something of a similar exercise, albeit a bit more relevant to a project I’ve been working on. As I spend my time working in a marketing department, I have to get used to wearing [at least] two hats.

Often, these hats are mutually exclusive, and sometimes they disagree with each other. In this case, the disagreement is in the form of another piece of animated data visualisation. As with the animated Scottish rugby champions graph, this example doesn’t really benefit from adding the animation as another dimension to the plot.

The graph is simply to show the trends for some metrics to do with UK university fundraising over time. I only really need x and y to represent the value and the year, but where’s the fun in that? That’s the sort of thing we can plot ridiculously easily using ggplot2 :

ggplot(fund_tidy, aes(x = year, y = value, colour = kpi)) + geom_line()

Why not use this as a bit more of a learning exercise? I’ve played about with the gganimate package before, but never really spent any quality time with it. This seemed like a good opportunity.

The datavizconflict

And that brings us on to the butting of hats. I don’t think that an animated plot is the best way to represent these data. I don’t know if it technically counts as non-data ink, but you get the idea: it’s just not necessary. If x and y were already taken and I wanted to show how those two values changed over time, animation presents those changes in a way that’s easy to understand. In this case, it’s redundant.

For further adventures where marketing meets data science, follow Chris onTwitter .

However, a lot of graphs are made not to represent the data as simply and accurately as possible, but to get attention. In many cases, particularly in the world of the marketing agency, there is a tendency to turn what could be presented as a clear, straightforward bar chart, into a full-on novelty infographic. Tourist footfall over time represented as a cartoon foot with the size of the toe representing the value for each year anyone? But that’s a story for another day.

The truth is, animation catches the eye, and it can increase the dwell time, allowing the reader time to take in the title, axes labelling, legends and the message. Possibly. As well as increasing exposure to any branding. I do have some principles though, so I wouldn’t ever intentionally set out to make a graph that was misleading. Playing with the colour schemes and layout to make it look a bit sleeker? Absolutely, but the data has to come first.

A CASE oftrends

I had been doing some university fundraising work looking at historic Ross-CASE reports , and thought it would be interesting to look at how some of the key performance indicators had changed over time. I’d looked at some of the main ones before, but hadn’t looked at a few others, and thought it might be interesting to look at them together. And it would be some good ggplot2 and gganimate practice. So let us begin.

n.b. As the aim of this exercise was to compare underlying trends and spend more time with gganimate, not to produce a publication-quality figure, hence a somewhat ‘cavalier’ attitude to y axis labelling!

No onion skinninghere

As ever, importing my pre-made dataset and having a quick look was first on the agenda:

# import yearly data (total, summed values, not means or medians) # dataset compiled from historical Ross-CASE reports library(readr) fund_df <- read_csv("year_sum.csv") # quick look at data library(dplyr) glimpse(fund_df) Observations: 12 Variables: 6 $ year <int> 2005, 2006, 2007, 2008, 2009, 2... $ new_funds_raised <int> 452, 548, 682, 532, 600, 693, 7... $ cash_received <int> 324, 413, 438, 511, 506, 560, 5... $ fundraising_staff <int> 660, 734, 851, 913, 1043, 1079,... $ contactable_alumni <dbl> 5.7, 6.2, 6.9, 7.7, 8.3, 8.0, 8... $ contact_alum_x100 <dbl> 570, 620, 690, 770, 830, 800, 8... library(ggplot2) ggplot(fund_df, aes(x = year, y = new_funds_raised)) + geom_line()
Animating regression models in R using broom and ggplot2

Okay, we have a dataset that seems to look how I would expect it to from previous work, so hopefully I’ve not screwed things up at the first hurdle. Onward!

As the values for contactable_alumni were a couple of orders of magnitude away from the rest of the values, I created a new column where those were multiplied by 100 to put them on the same scale. I then gathered the data into a tidy, ‘long’, format:

# create contactable alumni x100 variable to place values on equivalent scale fund_df <- fund_df %>% mutate(contact_alum_x100 = contactable_alumni * 100) # create tidy dataframe library(tidyr) fund_tidy <- fund_df %>% gather(kpi, value, - year) %>% mutate(kpi = as.factor(kpi)) glimpse(fund_tidy) Observations: 60 Variables: 3 $ year <int> 2005, 2006, 2007, 2008, 2009, 2010, 2011, 20... $ kpi <fct> new_funds_raised, new_funds_raised, new_fund... $ value <dbl> 452, 548, 682, 532, 600, 693, 774, 681, 807,...

With the data transformed, we were ready to create our first animated plot, remembering to start by filtering out out original contactable_alumni variable:

# create animated plot library(gganimate) library(transformr) first_animate <- fund_tidy %>% filter(kpi != "contactable_alumni") %>% ggplot(aes(x = year, y = value, colour = kpi)) + geom_line() + # this next line is where the magic happens: transition_reveal(kpi, year) + labs(title = "Trends in University Fundraising KPIs Over Time", subtitle = "Data from Ross-CASE reports", x = "Year", y = 'Value', caption = "y axis labelling omitted due to differences in scale between KPIs", colour = "KPI") + scale_colour_discrete(labels = c("Cash received", "Contactable alumni", "Fundraising staff", "New funds raised")) + scale_y_discrete(labels = NULL) + theme_chris()
Animating regression models in R using broom and ggplot2

And we’re off. But is that as good as it could be? I don’t think so. The main thing for me is that, as we’re interested in trends, we should have trendlines on there as well. How to go about that…?

To do that in a non-animated way, we’d simply add a geom_smooth() to our plotting code:

# create non-animated plot with trendlines fund_tidy %>% f

Viewing all articles
Browse latest Browse all 12749