CFAES Bioinformatics Core, Ohio State University
2026-01-29
Here, we’ll only focus on DNA sequencing, keeping in mind that:
Protein sequencing involves completely different technologies
RNA can be, and usually is, sequenced as DNA as well
How is that done and why?
RNA is usually reverse transcribed to DNA (cDNA) prior to sequencing.
While it is becoming more feasible to directly sequence RNA molecules,
RNA is an unstable molecule that is easily degraded and harder to sequence.
High-throughput sequencing (HTS)
Sequences 10^5 to 10^9, often randomly selected, DNA fragments at a time. There are two types:
Short-read HTS: More accurate, shorter reads (since 2005)
Long-read HTS: Less accurate, longer reads (since 2011)
These sequenced fragments of DNA are usually called reads

Therefore, to design primers, you must know something about the target sequence in advance — this can be highly limiting
The sequenced fragment can be up to about 800-1,000 bp
Sequencing itself is performed by synthesizing a new DNA strand with fluorescently-labeled nucleotides, using a different color for each base (A, C, G, T)
The final result is a chromatogram that can be “base-called”:

The entire human genome was sequenced with Sanger technology!
How many basepairs is that? Want to guess how much it cost to do this?
Because HTS has much higher throughput and is much cheaper per base,
Sanger sequencing is now less widely used
But it is not obsolete, in part because high throughput isn’t always needed
Image generated by Adobe Firefly
Some present-day uses of Sanger sequencing include…
Taxonomic identification of samples
Examining genetic variation among individuals/populations
…where using one or few marker loci or candidate genes can be sufficient
Let’s start with the big picture – HTS data underlies several of these main “omics” approaches:
Copyright ThermoFisher
| Omics type | Molecule type | Data mainly produced by |
|---|---|---|
| Genomics | DNA | High-throughput sequencing (HTS) |
| Epigenomics | DNA modifications | High-throughput sequencing (HTS) |
| Transcriptomics | RNA | High-throughput sequencing (HTS) |
| Proteomics | Proteins | Mass Spectrometry |
| Metabolomics | Metabolites | Mass Spectrometry |
| | Short-read HTS | Long-read HTS |
|---|---|---|
| Usage | More | Less (but increasing) |
| Main companies | Illumina | Oxford Nanopore Technologies (ONT) & Pacific Biosciences (PacBio) |
| Timeline | Since 2005; technology fairly stable | Since 2011; still rapid development |
| Read lengths | 50-300 bp | 10-100+ kbp |
| Error rates | Mostly <0.1% | 1-10% (ONT) / <0.1-10% (PacBio) |
| Throughput | Higher | Lower |
| Cost per base | Lower | Higher |
For example:
For these, genomic locations (variant analysis) or gene identities (RNA-Seq) can be reliably inferred from as little as 25 bp
A read’s sequence may differ from the actual DNA sequence it originated from:
When you receive HTS reads, base calls have typically been made already.
Every base call is accompanied by a quality score, representing the estimated error probability.
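As a rough illustration of how a quality score relates to an error probability, the sketch below assumes the widely used Phred scale (Q = -10 · log10(P)) and the ASCII offset of 33 that is typical for current Illumina FASTQ files. Neither convention is stated above, so treat both as assumptions; the function names are made up for this example.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to an estimated error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an estimated error probability to a Phred quality score."""
    return -10 * math.log10(p)


def decode_quality_string(qual: str, offset: int = 33) -> list[int]:
    """Decode an ASCII-encoded quality string (assumed offset: 33) into scores."""
    return [ord(char) - offset for char in qual]


# Hypothetical quality string for a 3-base read
scores = decode_quality_string("I5!")
error_probs = [phred_to_error_prob(q) for q in scores]
```

Under these assumptions, a score of 30 corresponds to an estimated error probability of 1 in 1,000, and a score of 20 to 1 in 100.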
To overcome sequencing errors, every base can be sequenced multiple times –
i.e., obtaining a “depth of coverage” greater than 1:
Typical depths of coverage: ~50-100x for genome assembly & 10-30x for variant typing (!)
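The arithmetic behind depth of coverage can be sketched as follows, assuming the simple approximation that mean depth equals the total number of sequenced bases divided by the genome size. The function names and the example numbers (a 50 Mbp genome, 150 bp reads) are hypothetical.

```python
import math


def mean_depth(n_reads: int, read_length: int, genome_size: int) -> float:
    """Approximate mean depth of coverage: total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_depth: float, read_length: int, genome_size: int) -> int:
    """Approximate number of reads required to reach a target mean depth."""
    return math.ceil(target_depth * genome_size / read_length)


# Hypothetical example: 10 million 150-bp reads for a 50 Mbp genome
depth = mean_depth(n_reads=10_000_000, read_length=150, genome_size=50_000_000)
n_needed = reads_needed(target_depth=30, read_length=150, genome_size=50_000_000)
```

This is a mean only: actual depth varies along the genome, so some sites will be covered by fewer reads than the average suggests.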

Multiplexing!
Adapters can include “indices” or “barcodes” to identify individual samples, so many samples can be combined (multiplexed) into a single library
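To make the idea concrete, here is a minimal sketch of how reads from a multiplexed library could be assigned back to their samples ("demultiplexing"). It assumes each read comes with its index sequence and that indices must match exactly. The barcodes, sample names, and function name are all hypothetical; sequencing facilities normally perform this step with dedicated software.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Group read sequences by sample, based on each read's index (barcode).

    reads: iterable of (barcode, sequence) pairs
    barcode_to_sample: dict mapping barcode sequence -> sample name
    Reads with an unknown barcode are placed under "undetermined".
    """
    by_sample = defaultdict(list)
    for barcode, sequence in reads:
        sample = barcode_to_sample.get(barcode, "undetermined")
        by_sample[sample].append(sequence)
    return dict(by_sample)


# Hypothetical barcodes and reads
barcodes = {"ACGT": "sample_A", "TTAG": "sample_B"}
reads = [("ACGT", "GATTACA"), ("TTAG", "CCGGTTA"), ("ACGT", "TTGACCA"), ("NNNN", "AAAAAAA")]
assigned = demultiplex(reads, barcodes)
```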

“Adapter read-through”: the final bases in the resulting reads will consist of adapter sequence (these should be removed before downstream analysis)
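Removing adapter sequence from the end of a read can be sketched as below. This is a simplified illustration that only handles exact matches; dedicated trimming tools also tolerate sequencing errors in the adapter. The adapter sequence, the `min_overlap` parameter, and the function name are assumptions chosen for the example.

```python
def trim_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove adapter sequence from the 3' end of a read (exact matches only)."""
    # Case 1: the complete adapter occurs somewhere in the read
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Case 2: the read ends partway through the adapter;
    # try the longest possible adapter prefix first
    longest = min(len(adapter), len(read)) - 1
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    # No adapter detected: return the read unchanged
    return read


# Hypothetical adapter and reads
adapter = "AGATCGGAAGAGC"
full = trim_adapter("ACGTACGTAC" + adapter, adapter)
partial = trim_adapter("ACGTACGTAC" + "AGATCG", adapter)
untouched = trim_adapter("ACGTACGTAC", adapter)
```

The `min_overlap` threshold is a trade-off: a lower value removes more true adapter remnants but also trims more reads whose final bases match the adapter by chance.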
Overlapping reads (this can be useful!):
We start with fragments that have been attached to a flow cell —
this image shows just two, but millions are present simultaneously

Sequencing is performed by synthesizing a new strand using fluorescently-labeled bases,
and taking a picture each time a new nucleotide is incorporated:
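Conceptually, turning these pictures into a sequence means deciding, for each cycle, which base's color is brightest at a given position on the flow cell. The sketch below is a heavily simplified illustration: it assumes one intensity value per base per cycle and ignores the signal corrections that real base-calling software applies. The intensity values and the function name are made up.

```python
BASES = "ACGT"


def call_bases(intensities):
    """Call one base per cycle by picking the channel with the strongest signal.

    intensities: list of (A, C, G, T) signal tuples, one tuple per cycle,
    all measured at the same position on the flow cell.
    """
    called = []
    for cycle in intensities:
        # Index of the brightest channel in this cycle
        brightest = max(range(len(BASES)), key=lambda i: cycle[i])
        called.append(BASES[brightest])
    return "".join(called)


# Hypothetical signal intensities for three sequencing cycles
signals = [
    (0.9, 0.1, 0.0, 0.1),  # strongest signal: A
    (0.1, 0.2, 0.8, 0.0),  # strongest signal: G
    (0.0, 0.1, 0.1, 0.7),  # strongest signal: T
]
read = call_bases(signals)
```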






Many HTS applications either require a “reference genome” or involve its production:
What exactly does reference genome refer to? It usually includes:
An assembly
A representation of most or all of the genome DNA sequence: the genome assembly
An annotation
Provides e.g. locations of genes and other genomic “features” in the corresponding genome assembly, and functional information for these features
Taxonomic identity
Reference genomes are typically applicable at the species level. For example, if you work with maize, you want a Zea mays reference genome. But:

Konkel and Slot (2023)
How is this data stored?
Both genome assemblies and annotations are typically saved in a single text file each — we’ll explore these in tomorrow’s lab
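As a preview of what working with such a text file can look like, the sketch below reads sequences in FASTA format, which is the format genome assemblies are commonly stored in. The format is not named above, so treat it as an assumption; the sequence names and the function name are hypothetical.

```python
def parse_fasta(lines):
    """Parse FASTA-formatted lines into a dict of sequence name -> sequence."""
    parts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the name is the first word after ">"
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            # Sequence line: may be one of several lines for this entry
            parts[name].append(line)
    return {seq_name: "".join(chunks) for seq_name, chunks in parts.items()}


# Hypothetical two-sequence assembly
fasta_text = """>chr1 example chromosome
ACGTACGT
ACGT
>chr2
GGCC
"""
assembly = parse_fasta(fasta_text.splitlines())
lengths = {seq_name: len(seq) for seq_name, seq in assembly.items()}
```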
You’ve learned…
That high-throughput sequencing (HTS) enables DNA sequencing at much larger scales than was previously possible
That HTS underlies several major “omics” fields
How short-read and long-read HTS have different strengths and weaknesses
About libraries and the technology underlying short-read sequencing
The labs this and next week are organized around the data set from Garrigós et al. (2025):
This paper uses paired-end Illumina RNA-Seq data to study gene expression in Culex pipiens mosquitos infected with two different malaria-causing Plasmodium protozoans.
We talked about these HTS applications:
In each of these applications:
| Appl. | What | Who | How |
|---|---|---|---|
| Whole-genome assembly | Whole genome 😃 | A single individual | “Overlap” reads into larger fragments |
| Variant typing | Whole genome or specific (e.g. exome) or random (GBS/RAD) regions | Multiple individuals | For each site & individual, determine the variant state (allele) |
| 16S Metabarcoding | A specific locus (e.g. 16S) | Multi-species samples (soil, gut contents, etc.) | Assign reads a taxonomic identity and count |
| RNA-Seq | Whole transcriptome | Multiple individuals (& multiple conditions) | Assign reads a gene identity and count |
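The "assign and count" step for RNA-Seq can be sketched as follows. This simplified illustration assumes each read has already been placed at a single position in a reference genome and that genes are given as non-overlapping intervals. The gene names, coordinates, and function name are hypothetical; real workflows use dedicated alignment and counting software.

```python
def count_reads_per_gene(read_positions, genes):
    """Count how many reads fall within each gene.

    read_positions: iterable of (chromosome, position) pairs, one per read
    genes: dict mapping gene name -> (chromosome, start, end), ends inclusive
    """
    counts = {gene: 0 for gene in genes}
    for chrom, pos in read_positions:
        for gene, (gene_chrom, start, end) in genes.items():
            if chrom == gene_chrom and start <= pos <= end:
                counts[gene] += 1
                break  # genes are assumed not to overlap
    return counts


# Hypothetical annotation and read positions
genes = {"geneA": ("chr1", 100, 500), "geneB": ("chr1", 800, 1200), "geneC": ("chr2", 50, 300)}
read_positions = [("chr1", 150), ("chr1", 499), ("chr1", 900), ("chr1", 650), ("chr2", 10)]
counts = count_reads_per_gene(read_positions, genes)
```

Reads that fall outside every gene, like the last two in this example, are simply not counted. The resulting per-gene counts are the starting point for comparing expression among samples or conditions.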