2 · Preprocessing
This is how the raw FASTQ files became the quant.sf you’ll analyze. We walk through it but don’t run it live — building a human transcriptome index and quantifying full FASTQ takes too long for a one-hour session. The scripts are in scripts/, ready to run on your own time.
The committed quant.sf in this repo are real salmon output. To keep the repo small and the download fast, they were generated from a subset of reads per sample (MODE=SUBSET). The pipeline below is identical for the full data.
The pipeline at a glance
FASTQ (GEO/SRA) ──► salmon index (GENCODE v45) ──► salmon quant ──► quant.sf
script 01 script 02 script 03
Hardware requirements & timing
You don’t need a cluster. Everything here was built on a laptop.
Minimum / recommended hardware
| Resource | Minimum | Recommended | Notes |
|---|---|---|---|
| CPU | 2 cores | 4–8 cores | salmon scales with threads (-p) |
| RAM | 8 GB | 16 GB | the human transcriptome index peaks at a few GB |
| Free disk | ~5 GB (subset) | ~30 GB (full data) | index ≈ 0.4 GB; FASTQ dominates |
| OS | macOS / Linux | — | salmon + sra-tools via conda |
Time actually measured building this repo’s data on an Apple Silicon Mac (M-series), 18 GB RAM, salmon run with 6 threads, using the SUBSET mode (5 million reads per sample):
| Step | What | Time | Output size |
|---|---|---|---|
| Download references | GENCODE v45 transcripts (79 MB) + GTF (47 MB) | ~1–2 min | 126 MB |
02 build salmon index |
salmon index (GENCODE v45) |
~1.5 min | 364 MB |
01 download FASTQ (subset) |
4 × 5M single-end reads, streamed from ENA | ~10–20 min† | ~0.9 GB |
03 salmon quant |
4 samples, ~1 min each | ~4 min | ~44 MB (4 × quant.sf) |
make_tx2gene.R |
parse GTF → tx2gene + gene map | ~2–3 min | ~10 MB |
† network-bound — the dominant and least predictable step. The compute itself (index + quant) is under ~6 minutes total.
The real runs have 22–30 M reads each (not 5 M). Expect FASTQ downloads of several GB per sample and salmon quant of ~3–5 min per sample. The index build and RAM needs are unchanged. This is exactly why we don’t run preprocessing live — and why the repo ships finished quant.sf.
Step 1 — Download FASTQ
The blog this workshop is based on uses fastq-dl to pull FASTQ from ENA by SRA experiment accession:
fastq-dl --accession SRX14311105 --group-by-experiment--group-by-experiment merges multiple runs of the same sample into one file. Our scripts/01_download_fastq.sh loops over all four accessions and also offers a fast SUBSET mode (via sra-tools fastq-dump -X) used to build this repo’s data.
Try this prompt
Walk me through scripts/01_download_fastq.sh. What’s the difference between FULL and SUBSET mode, and why would I use each?
Step 2 — Build the salmon index
Salmon maps reads against a transcriptome, so we index the GENCODE v45 transcript FASTA:
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/gencode.v45.transcripts.fa.gz
salmon index -t gencode.v45.transcripts.fa.gz -i gencode.v45_human_index -k 31 --gencode-k 31— k-mer size; 31 is standard for reads ≥ 75 bp.--gencode— strips the|-delimited junk from GENCODE FASTA headers so transcript IDs stay clean.
This is the RAM-heaviest step (the index peaks at a few GB), but it’s quick: ~1.5 min on 6 threads here, a few minutes on a 2-core laptop. The finished index is ~364 MB. See scripts/02_build_salmon_index.sh and the timing table.
Step 3 — Quantify
For single-end reads, feed the FASTQ with -r (paired-end would use -1/-2):
salmon quant -i gencode.v45_human_index -l A \
-r fastq/Hypoxia_sgCTRL_1.fastq.gz \
-o data/salmon/Hypoxia_sgCTRL_1 \
--validateMappings --gcBias --seqBias-l A— auto-detect the library type.--validateMappings— selective alignment, more accurate.--gcBias/--seqBias— correct common technical biases.
Each run writes a quant.sf (plus logs) in ~1 minute per sample (5M reads, 6 threads). Our subset runs hit 90–92% mapping rates — right in line with the 91–93% the original study reported. See scripts/03_salmon_quant.sh.
Step 4 — Build the transcript→gene map
scripts/make_tx2gene.R parses the GENCODE GTF with rtracklayer and writes data/tx2gene.csv and data/gene_name_map.csv:
gtf <- rtracklayer::import("reference/gencode.v45.annotation.gtf.gz")
tx <- subset(as.data.frame(gtf), type == "transcript")
# transcript_id -> gene_id and gene_id -> gene_nameconda env create -f environment.yml, then run scripts/01 → 02 → 03 → make_tx2gene.R. Everything you need is in the repo.
→ Now the fun part: load the data and find differentially expressed genes — 3 · Import & DESeq2.