In previous posts, I’ve examined the benefits of the tidy data framework in cleaning, visualizing, and modeling in exploratory data analysis on a molecular biology experiment. We’re using Brauer et al 2008 as our case study, which explores the effects of nutrient starvation on gene expression in yeast.
From the posts so far, one might get the impression that I think data must be tidy at every stage of an analysis. Not true! That would be an absurd and unnecessary constraint. Lots of mathematical operations are faster on matrices, such as singular value decomposition or linear regression. Jeff Leek rightfully points this out as an issue with my previous modeling gene expression post , where he remarks that the limma package is both faster and takes more statistical considerations (pooling variances among genes) into account.
Isn’t it contradictory to do these kinds of operations in a tidy analysis? Not at all. My general recommendation is laid out as follows:
As long as you’re in that Models “cloud”, you can store your data in whatever way is computationally and practically easiest. However:Before you model, you should use tidy tools to clean, process and wrangle your data (as shown in previous posts) After you’ve performed your modeling, you should turn the model into a tidy output for interpretation, visualization, and communication
This requires a new and important tool in our series on tidy bioinformatics analysis: the biobroom package , written and maintained by my former colleagues, particularly Andy Bass and John Storey. In this post I’ll show how to use the limma and biobroom packages in combination to continue a tidy analysis, and consider when and how to use non-tidy data in an analysis.Setup
Here’s the code to catch up with our previous posts:
library(readr) library(dplyr) library(tidyr) library(ggplot2) url <- "http://varianceexplained.org/files/Brauer2008_DataSet1.tds" nutrient_names <- c(G = "Glucose", L = "Leucine", P = "Phosphate", S = "Sulfate", N = "Ammonia", U = "Uracil") cleaned_data <- read_delim(url, delim = "\t") %>% separate(NAME, c("name", "BP", "MF", "systematic_name", "number"), sep = "\\|\\|") %>% mutate_each(funs(trimws), name:systematic_name) %>% select(-number, -GID, -YORF, -GWEIGHT) %>% gather(sample, expression, G0.05:U0.3) %>% separate(sample, c("nutrient", "rate"), sep = 1, convert = TRUE) %>% mutate(nutrient = plyr::revalue(nutrient, nutrient_names)) %>% filter(!is.na(expression), systematic_name != "") %>% group_by(systematic_name, nutrient) %>% filter(n() == 6) %>% ungroup()
In our last analysis, we performed linear models using broom and the built-in lm function:
library(broom) linear_models <- cleaned_data %>% group_by(name, systematic_name, nutrient) %>% do(tidy(lm(expression ~ rate, .)))
The above approach is useful and flexible. But as Jeff notes , it’s not how a computational biologist would typically run a gene expression analysis, for two reasons.Performing thousands of linear regressions with separate lm calls is slow . It takes about a minute on my computer. There are computational shortcuts we can take when all of our data is in the form of a gene-by-sample matrix. We’re not taking statistical advantage of the shared information. Modern bioinformatics approaches often “share power” across genes, by pooling variance estimates. The approach in the limma package is one notable example for microarray data, and RNA-Seq tools like edgeR and DESeq2 take a similar approach in their negative binomial models.
We’d like to take advantage of the sophisticated biological modeling tools in Bioconductor. We’re thus going to convert our data into a non-tidy format (a gene by sample matrix), and run it through limma to create a linear model for each gene. Then when we want to visualize, compare, or otherwise manipulate our models, we’ll tidy the model output using biobroom.Limma
Most gene expression packages in Bioconductor expect data to be in a matrix with one row per gene and one column per sample. Inthe last post we fit one model for each gene and nutrient combination. So let’s set it up that way using reshape2’s acast() .
library(reshape2) exprs <- acast(cleaned_data, systematic_name + nutrient ~ rate, value.var = "expression") head(exprs)
## 0.05 0.1 0.15 0.2 0.25 0.3 ## Q0017_Ammonia 0.18 0.73 0.05 -0.14 -0.06 0.24 ## Q0017_Glucose -0.21 -0.26 -0.17 0.00 0.10 0.10 ## Q0017_Leucine 0.46 0.05 0.42 0.35 0.36 0.09 ## Q0017_Phosphate -0.24 -0.35 -0.45 -0.24 -0.35 0.11 ## Q0017_Sulfate 2.44 0.64 0.21 0.40 -0.07 0.31 ## Q0017_Uracil 1.88 0.03 0.15 0.04 0.14 0.32
We then need to extract the experiment design, which in this case is just the growth rate:
rate <- as.numeric(colnames(exprs)) rate##  0.05 0.10 0.15 0.20 0.25 0.30
limma (“linear modeling of microarrays”) is one of the most popular Bioconductor packages for performing linear-model based differential expression analyses on microarray data. With the data in this matrix form, we’re ready to use it:
library(limma) fit <- lmFit(exprs, model.matrix(~rate)) eb <- eBayes(fit)
This performs a linear regression for each gene. This operation is both faster and more statistically sophisticated than our original use of lm for each gene.
So now we’ve performed our regression. What output do we get?
class(eb)##  "MArrayLM" ## attr(,"package") ##  "limma"
## Length Class Mode ## coefficients 65074 -none- numeric ## rank 1 -none- numeric ## assign 2 -none- numeric ## qr 5 qr list ## df.residual 32537 -none- numeric ## sigma 32537 -none- numeric ## cov.coefficients 4 -none- numeric ## stdev.unscaled 65074 -none- numeric ## pivot 2 -none- numeric ## Amean 32537 -none- numeric ## method 1 -none- character ## design 12 -none- numeric ## df.prior 1 -none- numeric ## s2.prior 1 -none- numeric ## var.prior 2 -none- numeric ## proportion 1 -none- numeric ## s2.post 32537 -none- numeric ## t 65074 -none- numeric ## df.total 32537 -none- numeric ## p.value 65074 -none- numeric ## lods 65074 -none- numeric ## F 32537 -none- numeric ## F.p.value 32537 -none- numeric
That’s a lot of outputs, and many of them are matrices of varying shapes. If you want to work with this using tidy tools (and if you’ve been listening, you hopefully do), we need to tidy it:
library(biobroom) tidy(eb, intercept = TRUE)
## # A tibble: 65,074 x 6 ## gene term estimate statistic p.value lod ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Q0017_Ammonia (Intercept) 0.3926667 1.591294 0.155459427 -6.3498508 ## 2 Q0017_Glucose (Intercept) -0.3533333 -3.172498 0.015599597 -4.0295194 ## 3 Q0017_Leucine (Intercept) 0.3873333 2.339366 0.051800996 -5.2780422 ## 4 Q0017_Phosphate (Intercept) -0.4493333 -2.732529 0.029158254 -4.6871409 ## 5 Q0017_Sulfate (Intercept) 1.9140000 3.937061 0.005595522 -2.9298317 ## 6 Q0017_Uracil (Intercept) 1.1846667 2.476906 0.042315927 -5.0718352 ## 7 Q0045_Ammonia (Intercept) -1.5060000 -5.842285 0.000629533 -0.5396668 ## 8 Q0045_Glucose (Intercept) -0.8513333 -5.235045 0.001195959 -1.2455443 ## 9 Q0045_Leucine (Intercept) -0.4440000 -1.807071 0.113591371 -6.0543398 ## 10 Q0045_Phosphate (Intercept) -0.7546667 -4.488406 0.002819261 -2.1846537 ## # ... with 65,064 more rows
Notice that this is now in one-row-per-coefficient-per-gene form, much like the output of broom’s tidy() on linear models.
Like broom, biobroom always returns a tablewithout rownames that we can feed into standard tidy tools like dplyr and ggplot2. (Note that unlike broom, biobroom requires an intercept = TRUE argument to leave the intercept term, simply because in many genomic datasets- though not ours- the intercept term is almost meaningless). biobroom can also tidy model objects from other tools like edgeR or DESeq2, always giving a consistent format similar to this one.
Now all we’ve got to do split the systematic name and nutrient back up. tidyr’s separate() can do this:
td <- tidy(eb, intercept = TRUE) %>% separate(gene, c("systematic_name", "nutrient"), sep = "_") td
## # A tibble: 65,074 x 7 ## systematic_name nutrient term estimate statistic p.value ## * <chr> <chr> <chr> <dbl> <dbl> <dbl> ## 1 Q0017 Ammonia (Intercept) 0.3926667 1.591294 0.155459427 ## 2 Q0017 Glucose (Intercept) -0.3533333 -3.172498 0.015599597 ## 3 Q0017 Leucine (Intercept) 0.3873333 2.339366 0.051800996 ## 4 Q0017 Phosphate (Intercept) -0.4493333 -2.732529 0.029158254 ## 5 Q0017 Sulfate (Intercept) 1.9140000 3.937061 0.005595522 ## 6 Q0017 Uracil (Intercept) 1.1846667 2.476906 0.042315927 ## 7 Q0045 Ammonia (Intercept) -1.5060000 -5.842285 0.000629533 ## 8 Q0045 Glucose (Intercept) -0.8513333 -5.235045 0.001195959 ## 9 Q0045 Leucine (Intercept) -0.4440000 -1.807071 0.113591371 ## 10 Q0045 Phosphate (Intercept) -0.7546667 -4.488406 0.002819261 ## # ... with 65,064 more rows, and 1 more variables: lod <dbl>Analyzing a tidied model
We can now use the tidy approaches to visualization and interpretation that were explored in previous posts. We could create ap-value histogram
ggplot(td, aes(p.value)) + geom_histogram() + facet_grid(term ~ nutrient, scales = "free_y")
Or make a volcano plot, comparing statistical significance to effect size (here let’s say just on the slope terms):
td %>% filter(term == "rate") %>% ggplot(aes(estimate, p.value)) + geom_point() + facet_wrap(~ nutrient, scales = "free") + scale_y_log10()
We could easily perform for multiple hypothesis testing within each group, and filter for significant (say, FDR < 1%) changes:
td_filtered <- td %>% group_by(term, nutrient) %>% mutate(fdr = p.adjust(p.value, method = "fdr")) %>% ungroup() %>% filter(fdr < .01)
Or finding the top few significant changes in each group using dplyr’s top_n :
top_3 <- td_filtered %>% filter(term == "rate") %>% group_by(nutrient) %>% top_n(3, abs(estimate)) top_3## Source: local data frame [18 x 8] ## Groups: nutrient  ## ## systematic_name nutrient term estimate statistic p.value ## <chr> <chr> <chr> <dbl> <dbl> <dbl> ## 1 YAL061W Sulfate rate -13.40000 -15.374369 1.159000e-06 ## 2 YBR054W Phosphate rate -23.80571 -12.675600 4.306351e-06 ## 3 YBR072W Uracil rate -27.01143 -11.870314 6.702423e-06 ## 4 YBR116C Uracil rate -25.12000 -9.931703 2.199911e-05 ## 5 YBR294W Phosphate rate 24.56571 9.953214 2.168703e-05 ## 6 YFL014W Leucine rate -18.92000 -6.884237 2.318928e-04 ## 7 YHR096C Ammonia rate -23.71429 -20.700901 1.496359e-07 ## 8 YHR096C Glucose rate -20.31429 -7.890506 9.816693e-05 ## 9 YHR096C Leucine rate -23.18286 -6.324890 3.904599e-04 ## 10 YHR096C Phosphate rate -26.12000 -25.405139 3.616210e-08 ## 11 YHR137W Glucose rate -21.45143 -9.798727 2.404524e-05 ## 12 YIL160C Sulfate rate -13.50286 -16.838241 6.214232e-07 ## 13 YIL169C Ammonia rate -24.99429 -8.439007 6.375875e-05 ## 14 YLR327C Leucine rate -18.43429 -17.547502 4.679809e-07 ## 15 YLR327C Uracil rate -28.13714 -7.434840 1.432175e-04 ## 16 YMR303C Glucose rate -22.60000 -6.101140 4.856028e-04 ## 17 YOL155C Ammonia rate -28.13714 -11.530132 8.147212e-06 ## 18 YPL223C Sulfate rate -18.42857 -12.883494 3.857832e-06 ## # ... with 2 more variables: lod <dbl>, fdr <dbl>
We could join this with our original data, which would let us visualize the trends for only the most significant genes:
top_3 %>% rename(significant_nutrient = nutrient) %>% inner_join(cleaned_data, by = "systematic_name") %>% mutate(highlight = nutrient == significant_nutrient) %>% ggplot(aes(rate, expression, color = nutrient)) + geom_point() + geom_smooth(aes(lty = !highlight), method = "lm", se = FALSE, show.legend = FALSE) + facet_wrap(significant_nutrient ~ systematic_name, ncol = 3, scales = "free_y")
In short, you can once again use the suite of “tidy tools” that we’ve found powerful in genomic analyses.Conclusion: Data is wrangled for you, not you for the data
There’s a classic proverb of computer science from Abelman & Sussman : “Programs must be written for people to read, and only incidentally for machines to execute.” I’d say this is even more true for data it is for code. Data scientists need to be very comfortable engaging with their data, not fighting with the representation.
I agree with a lot in Jeff’s “Non-tidy data” post , but there’s one specific statement I disagree with:
…you might not use tidy data because many functions require a different, also very clean and useful data format, and you don’t want to have to constantly be switching back and forth.
I’d counter that switching is a small cost, because switching can be automated. Note that in the above analysis, reshaping the data required only two lines of code and two functions ( acast() and tidy() ). In contrast, there’s no way to automate critical thinking. Any challenge in plotting, filtering, or merging your data will get directly in the way of answering scientific questions.
@jtleek @mark_scheuerell @hadleywickham solution is put energy into tidying first. Otherwise, you will pay off energy 10x in plot/manip― David Robinson (@drob) February 12, 2016
It’s thus the job of tool developers to make these “switches” as easy as possible. broom and biobroom play a role in this, as do reshape2 and tidyr. Jeff lists natural language as a type of data that’s best left un-tidy, but since that post Julia Silge and have I developed the tidytext package, and we’ve found it useful for performing text analyses using ggplot2 and dplyr ( see here for examples ).
Other examples of operations that are better performed on matrices include correlations and distance calculations, and for those purposes I’m currently working on the widyr package, which wraps these operations to allow a tidy input and tidy output (for example, see this application of the pairwise_cor function ).
Next time: gene set enrichment analysis
( Programming note : this wasoriginally my plan for this post, but I decided to preempt it for biobroom!)
These per-gene models can still be difficult to interpret biologically if you’re not familiar with the functions of specific genes. What we really want is a way to summarize the results into “genes involved in this biological process changed their expression.” This is where annotations of gene sets become useful.
gene_sets <- distinct(cleaned_data, systematic_name, BP, MF) td %>% inner_join(gene_sets) %>% filter(BP == "leucine biosynthesis", term == "(Intercept)") %>% mutate(nutrient = reorder(nutrient, estimate, median)) %>% ggplot(aes(nutrient, estimate)) + geom_boxplot() + geom_point() + geom_text(aes(label = systematic_name), vjust = 1, hjust = 1) + xlab("Limiting nutrient") + ylab("Intercept (expression at low concentration)") + ggtitle("Genes involved in leucine biosynthesis")
Notice how clear it is that these genes respond to leucine starvation in particular. This can be applied to gene sets containing dozens or even hundreds of genes while still making the general trend apparent. Furthermore, we could use these summaries to look at many gene sets at once, and even use statistical tests to discover new gene sets that respond to starvation.
Thus, in my next post in this series, we’ll apply our “tidy modeling” approach to a new problem. Instead of testing whether each gene responds to starvation in an interesting way, we’ll test functional groups of genes (a general method called gene set analysis ) in order to develop higher-level biological insights. And we’ll do it with these same set of tidy tools we’ve been using so far.