Data Science for Life Scientists

Biostatistics & R Programming

A curated, opinionated collection of books, interactive courses, and workflows for mastering statistical analysis and data visualization — selected specifically for researchers in biology, medicine, and neuroscience.

Curator's note: As a neuroscientist who relies daily on R for transcriptomic analysis, survival modeling, and figure preparation, I've assembled the resources I genuinely use and recommend. This list prioritizes open-access, high-quality materials with biological and medical context — not generic programming tutorials. Resources are organized by learning level and topic.

Suggested Learning Path

Where to Start

1. Beginner: R basics, data manipulation with tidyverse, and fundamental statistical concepts for biologists. No prior coding experience needed.

2. Intermediate: Statistical testing, regression models, publication-quality visualization with ggplot2, and reproducible research with R Markdown.

3. Advanced: RNA-seq analysis, survival modeling, mixed-effects models, machine learning in R, and high-throughput bioinformatics pipelines.

Foundational Books & Guides

The essential reading list — start here if you are new to R or biostatistics

R for Data Science, 2nd Edition — Hadley Wickham, Mine Çetinkaya-Rundel & Garrett Grolemund
The gold-standard introduction to data science in R. Covers the full tidyverse workflow: import, tidy, transform, visualize, and model. Exceptionally well-written with real datasets. Free online.
Beginner · tidyverse · ggplot2 · dplyr
Read Online
Basic Statistics for Biologists — Erik Kusch (GitHub)
A series of lectures and seminars specifically designed for B.Sc. biology students — covering fundamental biostatistics using R. Includes hypothesis testing, ANOVA, correlation, and regression with biological examples. Highly practical.
Beginner · Biostatistics · ANOVA · Hypothesis Testing
GitHub Repo
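
To give a feel for the topics listed above, here is a minimal sketch of the four analyses in base R. It is not taken from the course: the datasets (`PlantGrowth`, `iris`) ship with R and were chosen here only for illustration.

```r
# Illustrative sketch only: the datasets are bundled with base R and are
# an assumption of this example, not material from the course.

# One-way ANOVA: does dried plant weight differ between treatment groups?
anova_fit <- aov(weight ~ group, data = PlantGrowth)
anova_p <- summary(anova_fit)[[1]][["Pr(>F)"]][1]

# Two-sample t-test (Welch's test by default): control vs. treatment 1
two_groups <- droplevels(subset(PlantGrowth, group %in% c("ctrl", "trt1")))
t_res <- t.test(weight ~ group, data = two_groups)

# Pearson correlation between two continuous measurements
cor_res <- cor.test(iris$Sepal.Length, iris$Petal.Length)

# Simple linear regression: petal length as a function of sepal length
lm_fit <- lm(Petal.Length ~ Sepal.Length, data = iris)
slope <- coef(lm_fit)[["Sepal.Length"]]

print(anova_p)
print(t_res$p.value)
print(cor_res$estimate)
print(slope)
```

Each call returns an object that can be inspected further with `summary()`, which is usually the next step before reporting a result.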
Modern Statistics for Modern Biology — Susan Holmes & Wolfgang Huber
A rigorous, beautifully illustrated guide to statistical methods tailored for modern biological data — including count data, high-throughput assays, clustering, and network analysis. Essential for anyone working with -omics data. Free online, published by Cambridge University Press.
Intermediate · -omics · Count Data · Clustering
Read Online
Biostatistics for the Health Sciences — Textbook (Petrie & Sabin)
Comprehensive biostatistics reference for clinical and health sciences researchers. Covers study design, sample size calculation, survival analysis, logistic regression, and meta-analysis in accessible clinical language. Widely used in medical research training.
Intermediate · Clinical Research · Sample Size · Survival Analysis
Reference

Data Visualization & Publication Figures

Make your data speak — create publication-quality figures with R

ggplot2: Elegant Graphics for Data Analysis — Hadley Wickham
The definitive reference for ggplot2, the most widely used R visualization package. Covers the grammar of graphics, themes, scales, coordinate systems, and faceting. Indispensable for generating Nature/Science-quality figures. Free online.
Beginner+ · ggplot2 · Visualization · Publication Figures
Read Online
ggplot2 Cheat Sheet — Posit/RStudio
The essential 2-page quick-reference for all major ggplot2 functions, aesthetics, geoms, themes, and coordinate systems. Print it out and keep it on your desk. Updated for ggplot2 3.x.
All Levels · Cheat Sheet · ggplot2
Download PDF
ggpubr — Publication-ready Plots for Biomedical Research
An R package built on top of ggplot2 that simplifies the creation of publication-ready boxplots, violin plots, bar charts, and scatter plots with statistical annotations. Ideal for comparing experimental groups with automatic significance brackets and p-value labels.
Beginner · ggpubr · Statistical Tests · Boxplots
Documentation

Bioinformatics & RNA-seq Analysis

From raw sequencing reads to biological insight — workflows for transcriptomic data

Bioconductor Workflow: RNA-seq Data Analysis with DESeq2
The official, community-standard workflow for differential expression analysis using DESeq2. Covers read alignment and counting (STAR, featureCounts), normalization by size factor estimation, dispersion modeling, hypothesis testing, and results visualization. Widely cited in the methods sections of peer-reviewed papers.
Intermediate · DESeq2 · RNA-seq · Bioconductor
Vignette
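
Size factor estimation is the step that tends to confuse newcomers, so here is a minimal base-R sketch of the median-of-ratios idea behind it. The toy count matrix and all variable names are made up for illustration; in a real analysis DESeq2's `estimateSizeFactors()` handles this step, and this sketch is not its implementation.

```r
# Toy count matrix (hypothetical values): sample2 and sample3 were
# "sequenced" 2x and 3x as deeply as sample1, with no real expression change.
counts <- matrix(c(10,  20,  30,
                   100, 200, 300,
                   5,   10,  15,
                   50,  100, 150),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3)))

# Per-gene log geometric mean across samples (a pseudo-reference sample)
log_geo_means <- rowMeans(log(counts))

# Genes with a zero count in any sample give -Inf here and are excluded
usable <- is.finite(log_geo_means)

# Size factor = median over genes of the ratio sample / pseudo-reference
size_factors <- apply(counts, 2, function(sample_counts) {
  ok <- usable & sample_counts > 0
  exp(median(log(sample_counts[ok]) - log_geo_means[ok]))
})

# Dividing each column by its size factor puts samples on a common scale
normalized <- sweep(counts, 2, size_factors, "/")

print(size_factors)
print(normalized)
```

Because the toy samples differ only in depth, the normalized columns come out identical, which is the behavior the method is designed to produce for genes that are not differentially expressed.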
clusterProfiler — Gene Ontology & KEGG Pathway Enrichment Analysis
A powerful R/Bioconductor package for gene set enrichment analysis (GSEA), GO enrichment, and KEGG pathway analysis. Essential for interpreting differential expression results in biological context. Supports human, mouse, and custom organism databases.
Intermediate · GSEA · GO Enrichment · KEGG
Read Book
StatQuest with Josh Starmer — YouTube Channel
The best free educational resource for understanding statistical concepts and bioinformatics algorithms. Crystal-clear, step-by-step explanations of PCA, DESeq2, logistic regression, ROC curves, p-values, and much more, without skipping the reasoning behind the math. Highly recommended for all levels.
All Levels · Video · Machine Learning · Statistics
YouTube

Advanced Statistical Modeling

Mixed models, survival analysis, and robust methods for complex biomedical data

Linear Mixed Models in R (lme4) — Douglas Bates
Essential for analyzing longitudinal or repeated-measures data (e.g., motor scores across Parkinson's disease stages, or longitudinal biomarker changes). Covers fixed vs. random effects, model specification, convergence, and diagnostics with the lme4 package.
Advanced · lme4 · Longitudinal Data · Mixed Models
PDF Vignette
Survival Analysis in R — survival & survminer Packages
Complete workflow for Kaplan-Meier estimation, log-rank tests, Cox proportional hazards models, and time-varying covariates in R. Covers the survival and survminer packages with beautiful ggplot2-based KM curve visualization.
Advanced · Kaplan-Meier · Cox Model · Clinical Data
Documentation
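
The core of the workflow described above can be sketched in a few lines with the `survival` package, which ships with standard R installations. The choice of the bundled `lung` dataset and of `sex` and `age` as covariates is an assumption made here for illustration, not something prescribed by the documentation.

```r
library(survival)  # recommended package bundled with standard R installs

# Kaplan-Meier survival estimates, stratified by sex, on the bundled
# `lung` dataset (dataset and covariates chosen only for illustration)
km_fit <- survfit(Surv(time, status) ~ sex, data = lung)

# Log-rank test: do the two survival curves differ?
logrank <- survdiff(Surv(time, status) ~ sex, data = lung)

# Cox proportional hazards model, adjusting for age
cox_fit <- coxph(Surv(time, status) ~ sex + age, data = lung)
hazard_ratios <- exp(coef(cox_fit))

# Test of the proportional hazards assumption (scaled Schoenfeld residuals)
ph_check <- cox.zph(cox_fit)

print(km_fit)
print(logrank)
print(hazard_ratios)
print(ph_check)
```

For plotting, `km_fit` can be passed to base `plot()` or, with survminer installed, to `ggsurvplot(km_fit, data = lung)` for a ggplot2-based Kaplan-Meier figure.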
Multiple Testing Correction — A Practical Guide for Biologists
Critical reading for any high-throughput analysis. Covers Bonferroni correction, Benjamini-Hochberg control of the false discovery rate (FDR), q-values, and when to apply each method. Also discusses the practical distinction between controlling the family-wise error rate (FWER) and controlling the FDR in -omics studies.
Intermediate · FDR · Bonferroni · Multiple Testing
Nature Methods
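
The difference between the two corrections is easiest to see on a small example. The sketch below uses base R's `p.adjust()` on ten made-up p-values; the values and the 0.05 threshold are assumptions for illustration only.

```r
# Ten hypothetical raw p-values, e.g. from ten gene-level tests
p_values <- c(0.001, 0.008, 0.039, 0.041, 0.042,
              0.060, 0.074, 0.205, 0.212, 0.216)

# Bonferroni controls the family-wise error rate (FWER):
# each p-value is multiplied by the number of tests (capped at 1)
p_bonferroni <- p.adjust(p_values, method = "bonferroni")

# Benjamini-Hochberg controls the false discovery rate (FDR)
p_bh <- p.adjust(p_values, method = "BH")

# Number of tests called significant at 0.05 under each correction
n_sig_bonferroni <- sum(p_bonferroni < 0.05)
n_sig_bh <- sum(p_bh < 0.05)

print(data.frame(raw = p_values, bonferroni = p_bonferroni, BH = p_bh))
print(c(bonferroni = n_sig_bonferroni, BH = n_sig_bh))
```

Benjamini-Hochberg adjusted values are never larger than the Bonferroni ones, which is why FDR control is the usual choice when thousands of genes are tested and a few false positives are tolerable.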

Reproducible Research & Reporting

Make your analyses transparent, shareable, and fully reproducible

R Markdown: The Definitive Guide — Yihui Xie, J. J. Allaire & Garrett Grolemund
Master R Markdown to create fully reproducible research documents, reports, presentations, and even websites that integrate your R code, outputs, and narrative. The standard for open, reproducible science. Free online, published by CRC Press.
Intermediate · R Markdown · Reproducibility · Reporting
Read Online
Happy Git with R — Jenny Bryan
A practical, friendly guide to using Git and GitHub from within RStudio. Covers setting up version control, making commits, branching, collaboration workflows, and resolving merge conflicts — all explained for data scientists, not software engineers.
Beginner+ · Git · GitHub · Version Control
Read Online