CFAES Bioinformatics Core, OSU
2026-02-05
Unlike an organism’s genome content, the expression of genes is highly dynamic – it varies:
Across time and space
Qualitatively (which genes are expressed) and especially quantitatively (how much of each gene is expressed)
Considering:
That protein production tells us about the activity of biological functions,
and the molecular mechanisms underlying those functions
That it is easier to measure transcript (mRNA) than protein abundance
The central dogma

… gene expression can be used as a proxy for protein expression to make functional inferences
Caveat: the correlation between mRNA and protein levels is rather imperfect –
see e.g. Ponomarenko et al. (2023).
For example, gene expression can be quantified to associate genes and genetic pathways with:
So, studying gene expression can identify candidate genes underlying phenotypes –
these can next be functionally validated using experiments like knock-outs.
And further down the line, understanding the molecular basis of phenotypes has all sorts of uses!
E.g., in an agricultural context, to manipulate phenotypes such as resistance, pathogenicity, and yield.
To estimate gene expression levels genome-wide, RNA-Seq takes a brute-force approach by randomly sequencing millions of RNA fragments per sample
The resulting reads can be assigned a gene of origin,
and the core idea is that a gene’s read count reflects that gene’s expression level

RNA-Seq is a very widely used technique —
it constitutes the most common usage of high-throughput sequencing
We’ll focus on the most common type of RNA-Seq, which:
When I refer to RNA-Seq from now on, this specific type of RNA-Seq is implied.

From BioRender
RNA-Seq is the most common data type I help analyze in my role.
For a taste of what it’s used for in CFAES and beyond —
the following projects aimed to identify genes & pathways differing between:
Soybean cultivars in response to Phytophthora sojae inoculation (Dorrance lab, PlantPath)
Response of maize to infection with Pantoea with and without an effector gene (Mackey lab, HCS)
Mated and unmated mosquitos (Sirot lab, College of Wooster)
Tissues of the ambrosia beetle and its symbiotic fungus (Ranger lab, USDA)
Diapause-inducing conditions for two pest stink bug species (Michel lab, Entomology)
Pig coronaviruses with vs. without an experimental insertion (Wang lab, CFAH)
Human carcinoma cell lines with vs. without a manipulated gene (Cruz lab, CCC)
And to improve the annotation of a nematode genome assembly (Taylor lab, PlantPath)
RNA-Seq typically compares groups of samples defined by differences in:
Treatments — e.g. different host plants, temperature, diet, mated/unmated
Organismal variants — e.g. ages/developmental stages, sexes, subspecies
Tissues
Garrigós et al. (2025)
Culex pipiens mosquitos infected with malaria-causing Plasmodium protozoans:
To make statistically supported conclusions about expression differences,
we need biological replication (at least 3-5 samples per group):


We want mRNAs but these often make up only a few percent of RNAs!
The two main ways to select for mRNAs are poly-A selection and ribo-depletion.
As mentioned in the previous lecture:
Library preparation is typically done by sequencing facilities / companies
Many samples can be “multiplexed” into a single (RNA-Seq) library

Modified after https://sydney-informatics-hub.github.io
The analysis of RNA-Seq data can be divided into two main parts:
The count table has one row for each gene and one column for each sample,
with the entries being the number of reads mapping to each gene in each sample:
| Gene | Sample 1 | Sample 2 | Sample 3 |
|---|---|---|---|
| Gene A | 1500 | 2300 | 1800 |
| Gene B | 0 | 5 | 2 |
| Gene C | 300 | 250 | 400 |
Actual count tables have thousands of genes (rows) and usually dozens of samples (columns)
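To make the layout of a count table concrete, here is a minimal Python sketch. It assumes every read has already been assigned to a gene; the `read_assignments` data and the `build_count_table` helper are made up for illustration and do not come from any RNA-Seq tool.

```python
from collections import Counter

# Hypothetical input: for each sample, the gene that each read was assigned to
read_assignments = {
    "Sample 1": ["Gene A", "Gene A", "Gene C", "Gene A"],
    "Sample 2": ["Gene A", "Gene B", "Gene C"],
    "Sample 3": ["Gene C", "Gene A"],
}


def build_count_table(assignments):
    """Return {gene: [count in each sample]}, with samples in input order."""
    samples = list(assignments)
    # One Counter per sample; a Counter returns 0 for genes it has not seen
    per_sample = {sample: Counter(genes) for sample, genes in assignments.items()}
    all_genes = sorted({gene for counts in per_sample.values() for gene in counts})
    return {gene: [per_sample[sample][gene] for sample in samples] for gene in all_genes}


count_table = build_count_table(read_assignments)
for gene, counts in count_table.items():
    print(gene, counts)
```

Each key is a row (gene) and each list position is a column (sample), mirroring the table above.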
How do you get to such a count table?
I thought I’d get some generative AI help with the diagram on the previous slide –
this is what Adobe Firefly came up with: 😵‍💫
It specifically involves, at a minimum:
Read preprocessing
Aligning reads to a reference genome
Quantifying expression levels
This part is “bioinformatics”-heavy, with large files, high computing needs, and command-line tools run in the Unix shell.
If this is a problem: the process is fairly standardized and suitable to be outsourced.
Read pre-processing includes:
FastQC)
From BioRender

From BioRender
So, the alignment of reads to a reference genome needs to be “splice-aware”:

Van den Berge et al. (2019)
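Why alignment to the genome has to be splice-aware can be sketched in a few lines of Python. The exon coordinates and the `transcript_to_genome_blocks` helper below are hypothetical, and a real aligner works in the opposite direction (from read sequence to genome position). The sketch only shows that a read taken from a mature mRNA can correspond to several separate genomic blocks.

```python
def transcript_to_genome_blocks(exons, tx_start, read_length):
    """Map a read from spliced-transcript coordinates to genomic blocks.

    exons: sorted list of (start, end) genomic coordinates, 1-based inclusive.
    tx_start: 1-based start of the read within the mature (spliced) transcript.
    Returns a list of (start, end) genomic blocks covered by the read.
    """
    blocks = []
    remaining = read_length
    offset = tx_start - 1  # transcript bases to skip before the read starts
    for exon_start, exon_end in exons:
        exon_len = exon_end - exon_start + 1
        if offset >= exon_len:
            offset -= exon_len  # the read starts in a later exon
            continue
        block_start = exon_start + offset
        block_end = min(exon_end, block_start + remaining - 1)
        blocks.append((block_start, block_end))
        remaining -= block_end - block_start + 1
        offset = 0
        if remaining == 0:
            break
    return blocks


# Hypothetical gene with two exons separated by a 100-bp intron
exons = [(101, 200), (301, 400)]
within_exon = transcript_to_genome_blocks(exons, tx_start=1, read_length=50)
across_junction = transcript_to_genome_blocks(exons, tx_start=81, read_length=50)
print(within_exon)      # one block: the read fits inside the first exon
print(across_junction)  # two blocks: the read spans the exon-exon junction
```

A read that yields more than one block can only be placed on the genome by an aligner that allows a long gap at the intron.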
Alternatively, you can align to the transcriptome (i.e., all mature transcripts):

Van den Berge et al. (2019)
These represent different transcripts originating from the same gene due to alternative splicing. They can encode different protein variants; both the transcripts and the resulting proteins are referred to as isoforms.
Most short-read RNA-Seq studies do not attempt to distinguish between isoforms,
but rather quantify expression at the gene level.
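Gene-level quantification can be thought of as summing over the isoforms of each gene. The sketch below uses made-up transcript IDs, counts, and a hypothetical `summarize_to_gene_level` helper; it illustrates the idea and is not how any particular tool implements it.

```python
from collections import defaultdict

# Hypothetical transcript-to-gene mapping and per-transcript read counts
transcript_to_gene = {"tx1": "Gene A", "tx2": "Gene A", "tx3": "Gene B"}
transcript_counts = {"tx1": 900, "tx2": 600, "tx3": 5}


def summarize_to_gene_level(counts, tx2gene):
    """Sum the counts of all transcripts (isoforms) that belong to the same gene."""
    gene_counts = defaultdict(int)
    for transcript, count in counts.items():
        gene_counts[tx2gene[transcript]] += count
    return dict(gene_counts)


gene_counts = summarize_to_gene_level(transcript_counts, transcript_to_gene)
print(gene_counts)
```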

Van den Berge et al. (2019)
In essence, a simple counting exercise once you have the alignments in hand:
for each sample, how many reads map to each gene?
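As a rough illustration of this counting step, the Python sketch below assigns each aligned read to the gene whose coordinates it overlaps. The gene coordinates, alignments, and the `count_reads_per_gene` helper are hypothetical. Skipping reads that overlap zero or several genes is just one simple way to handle ambiguity, chosen here as an assumption.

```python
# Hypothetical annotation: gene -> (chromosome, start, end), 1-based inclusive
genes = {
    "Gene A": ("chr1", 100, 500),
    "Gene B": ("chr1", 800, 1200),
}

# Hypothetical alignments for one sample: (chromosome, start, end)
alignments = [
    ("chr1", 120, 219),
    ("chr1", 450, 549),  # partly overlaps the end of Gene A
    ("chr1", 600, 699),  # between genes: not counted
    ("chr1", 900, 999),
    ("chr2", 900, 999),  # other chromosome: not counted
]


def count_reads_per_gene(alignments, genes):
    """Count reads that overlap exactly one gene."""
    counts = {gene: 0 for gene in genes}
    for chrom, start, end in alignments:
        hits = [
            gene
            for gene, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start <= g_end and end >= g_start
        ]
        if len(hits) == 1:  # skip unassigned and ambiguous reads
            counts[hits[0]] += 1
    return counts


read_counts = count_reads_per_gene(alignments, genes)
print(read_counts)
```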
Though in practice, a bit more complicated than this, due to e.g.:
The “nf-core” initiative (https://nf-co.re, Ewels et al. (2020)) aims to produce best-practice and automated bioinformatics pipelines, like for RNA-Seq (https://nf-co.re/rnaseq):

The second part of RNA-Seq data analysis involves analyzing the count table.
In contrast to the first part, this can be done on a laptop and instead is heavier on
statistics, data visualization and biological interpretation.
It is typically done with the R language, and common aspects include:
Principal Component Analysis (PCA)
Assessing overall sample “clustering” (similarity) patterns
Differential Expression (DE) analysis
Finding genes that differ in expression level between sample groups (DEGs)
Functional enrichment analysis
See whether certain gene function categories are overrepresented among DEGs
PCA examines overall patterns of dissimilarity among samples,
such as whether groups of interest form distinct clusters:

Fig. 1 from Garrigós et al. (2025)
We’ll talk more about the interpretation of this PCA plot in tomorrow’s lab
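For intuition about what a PCA computes, here is a small pure-Python sketch that finds only the first principal component, using power iteration. The log-scale expression values are made up, and `first_principal_component` is a hypothetical helper. In practice you would use an existing PCA function in R rather than code like this.

```python
import math
import random


def first_principal_component(samples, n_iter=100, seed=0):
    """samples: one list per sample with (log-scale) expression values per gene.

    Returns (axis, scores): the PC1 direction across genes and each sample's
    coordinate along it. The sign of the axis is arbitrary.
    """
    n_samples, n_genes = len(samples), len(samples[0])
    gene_means = [sum(s[g] for s in samples) / n_samples for g in range(n_genes)]
    centered = [[s[g] - gene_means[g] for g in range(n_genes)] for s in samples]

    rng = random.Random(seed)
    axis = [rng.uniform(-1, 1) for _ in range(n_genes)]
    for _ in range(n_iter):
        # Multiply by X^T X without forming it: project samples, then back-project
        scores = [sum(x * a for x, a in zip(row, axis)) for row in centered]
        axis = [sum(scores[i] * centered[i][g] for i in range(n_samples))
                for g in range(n_genes)]
        norm = math.sqrt(sum(a * a for a in axis))
        axis = [a / norm for a in axis]
    scores = [sum(x * a for x, a in zip(row, axis)) for row in centered]
    return axis, scores


# Hypothetical log2 expression: 3 control samples, then 3 infected samples, 4 genes
log_counts = [
    [10.0, 5.0, 8.0, 3.0],
    [10.2, 5.1, 7.9, 3.2],
    [9.9, 4.8, 8.1, 2.9],
    [10.1, 8.0, 8.0, 6.1],
    [10.0, 8.2, 7.8, 5.9],
    [9.8, 7.9, 8.2, 6.0],
]
axis, scores = first_principal_component(log_counts)
print([round(s, 2) for s in scores])
```

In this toy dataset, the two groups end up on opposite sides of zero along PC1, which is the kind of clustering pattern a PCA plot shows.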
PCA is a very useful technique, not just for RNA-Seq data.
To learn more about how it works, see this video (short overview) and this video (more detailed).
Differential Expression (DE) analysis allows you to test, separately for every expressed gene in your dataset, whether it significantly differs in expression level between groups.
Typically, this is done with pairwise comparisons between groups:


We will talk about these during tomorrow’s lab.
While not necessarily easy conceptually, this is all fairly straightforward in practice: specialized R packages like DESeq2 take care of the details.
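To give a feel for what a pairwise comparison measures, the Python sketch below computes a log2 fold change for a single gene. This is only the effect-size part: it is not DESeq2's statistical model and it does not test for significance. The counts, the pseudocount of 1, and the `log2_fold_change` helper are assumptions made for illustration.

```python
import math
import statistics


def log2_fold_change(control, treated, pseudocount=1):
    """log2 ratio of mean counts (treated / control).

    The pseudocount avoids division by zero for unexpressed genes.
    """
    mean_control = statistics.mean(control) + pseudocount
    mean_treated = statistics.mean(treated) + pseudocount
    return math.log2(mean_treated / mean_control)


# Hypothetical counts for two genes, three replicates per group
lfc_up = log2_fold_change(control=[100, 120, 80], treated=[400, 380, 420])
lfc_flat = log2_fold_change(control=[0, 5, 2], treated=[0, 3, 4])
print(round(lfc_up, 2), round(lfc_flat, 2))
```

A log2 fold change near 2 means roughly four-fold higher expression in the treated group; whether such a difference is statistically supported depends on the variation among replicates, which is what the DE test evaluates.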
Lists of DEGs can be long and not always easy to make biological sense of.
Functional enrichment analysis helps with this, asking whether certain functional categories of genes are statistically overrepresented among DEGs.
Several databases group genes into such functional categories —
the two main ones used for enrichment analysis are:
A GO term called “photosynthesis” is associated with a set of genes involved in photosynthesis.
Say that 300 genes out of 30,000 genes in the genome have been annotated to this term = 1%
Say that 50 out of your 500 DEGs are annotated to this term = 10%
The larger the difference between these percentages, the stronger the indication that the function in question –here, photosynthesis– is overrepresented among your DEGs.
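The numbers from this example can be turned into a fold enrichment and a p-value. The Python sketch below uses a one-sided hypergeometric test, which is a common choice for overrepresentation analysis but an assumption here, since the test is not specified above. The function names are hypothetical.

```python
from math import comb


def fold_enrichment(n_genome, n_term, n_degs, n_degs_in_term):
    """Ratio of the term's frequency among DEGs to its frequency in the genome."""
    return (n_degs_in_term * n_genome) / (n_degs * n_term)


def hypergeom_upper_tail(n_genome, n_term, n_degs, n_degs_in_term):
    """P(at least n_degs_in_term annotated genes) when drawing n_degs genes at random."""
    total = comb(n_genome, n_degs)
    max_hits = min(n_term, n_degs)
    favorable = sum(
        comb(n_term, k) * comb(n_genome - n_term, n_degs - k)
        for k in range(n_degs_in_term, max_hits + 1)
    )
    return favorable / total  # exact integers, converted to a float at the end


# 300 of 30,000 genes annotated to "photosynthesis"; 50 of 500 DEGs annotated
enrichment = fold_enrichment(30_000, 300, 500, 50)
p_value = hypergeom_upper_tail(30_000, 300, 500, 50)
print(enrichment, p_value)
```

Here the term is 10 times more frequent among the DEGs than in the genome (10% versus 1%), and the p-value is extremely small, since only about 5 annotated genes would be expected among 500 randomly drawn genes.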

Fig. 4 from Garrigós et al. (2025)

Rodriguez et al. (2020)
KEGG representation of up-regulated genes related to jasmonic acid (JA) signal transduction pathways (ko04075) in banana cv. Calcutta 4 after inoculation with Pseudocercospora fijiensis. Genes or chemicals up-regulated at any time point were highlighted in green.