2 · Preprocessing

This is how the raw FASTQ files became the quant.sf you’ll analyze. We walk through it but don’t run it live — building a human transcriptome index and quantifying full FASTQ takes too long for a one-hour session. The scripts are in scripts/, ready to run on your own time.

Note

The committed quant.sf in this repo are real salmon output. To keep the repo small and the download fast, they were generated from a subset of reads per sample (MODE=SUBSET). The pipeline below is identical for the full data.

The pipeline at a glance

FASTQ (GEO/SRA)  ──►  salmon index (GENCODE v45)  ──►  salmon quant  ──►  quant.sf
   script 01              script 02                      script 03

Hardware requirements & timing

You don’t need a cluster. Everything here was built on a laptop.

Minimum / recommended hardware

Resource	Minimum	Recommended	Notes
CPU	2 cores	4–8 cores	salmon scales with threads (`-p`)
RAM	8 GB	16 GB	the human transcriptome index peaks at a few GB
Free disk	~5 GB (subset)	~30 GB (full data)	index ≈ 0.4 GB; FASTQ dominates
OS	macOS / Linux	—	salmon + sra-tools via conda

Time actually measured building this repo’s data on an Apple Silicon Mac (M-series), 18 GB RAM, salmon run with 6 threads, using the SUBSET mode (5 million reads per sample):

Step	What	Time	Output size
Download references	GENCODE v45 transcripts (79 MB) + GTF (47 MB)	~1–2 min	126 MB
`02` build salmon index	`salmon index` (GENCODE v45)	~1.5 min	364 MB
`01` download FASTQ (subset)	4 × 5M single-end reads, streamed from ENA	~10–20 min†	~0.9 GB
`03` salmon quant	4 samples, ~1 min each	~4 min	~44 MB (4 × `quant.sf`)
`make_tx2gene.R`	parse GTF → tx2gene + gene map	~2–3 min	~10 MB

† network-bound — the dominant and least predictable step. The compute itself (index + quant) is under ~6 minutes total.

Full-depth data costs more

The real runs have 22–30 M reads each (not 5 M). Expect FASTQ downloads of several GB per sample and salmon quant of ~3–5 min per sample. The index build and RAM needs are unchanged. This is exactly why we don’t run preprocessing live — and why the repo ships finished quant.sf.

Step 1 — Download FASTQ

The blog this workshop is based on uses fastq-dl to pull FASTQ from ENA by SRA experiment accession:

fastq-dl --accession SRX14311105 --group-by-experiment

--group-by-experiment merges multiple runs of the same sample into one file. Our scripts/01_download_fastq.sh loops over all four accessions and also offers a fast SUBSET mode (via sra-tools fastq-dump -X) used to build this repo’s data.

Try this prompt

Walk me through scripts/01_download_fastq.sh. What’s the difference between FULL and SUBSET mode, and why would I use each?

Step 2 — Build the salmon index

Salmon maps reads against a transcriptome, so we index the GENCODE v45 transcript FASTA:

wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/gencode.v45.transcripts.fa.gz
salmon index -t gencode.v45.transcripts.fa.gz -i gencode.v45_human_index -k 31 --gencode

-k 31 — k-mer size; 31 is standard for reads ≥ 75 bp.
--gencode — strips the |-delimited junk from GENCODE FASTA headers so transcript IDs stay clean.

This is the RAM-heaviest step (the index peaks at a few GB), but it’s quick: ~1.5 min on 6 threads here, a few minutes on a 2-core laptop. The finished index is ~364 MB. See scripts/02_build_salmon_index.sh and the timing table.

Step 3 — Quantify

For single-end reads, feed the FASTQ with -r (paired-end would use -1/-2):

salmon quant -i gencode.v45_human_index -l A \
  -r fastq/Hypoxia_sgCTRL_1.fastq.gz \
  -o data/salmon/Hypoxia_sgCTRL_1 \
  --validateMappings --gcBias --seqBias

-l A — auto-detect the library type.
--validateMappings — selective alignment, more accurate.
--gcBias / --seqBias — correct common technical biases.

Each run writes a quant.sf (plus logs) in ~1 minute per sample (5M reads, 6 threads). Our subset runs hit 90–92% mapping rates — right in line with the 91–93% the original study reported. See scripts/03_salmon_quant.sh.

Step 4 — Build the transcript→gene map

scripts/make_tx2gene.R parses the GENCODE GTF with rtracklayer and writes data/tx2gene.csv and data/gene_name_map.csv:

gtf <- rtracklayer::import("reference/gencode.v45.annotation.gtf.gz")
tx  <- subset(as.data.frame(gtf), type == "transcript")
# transcript_id -> gene_id  and  gene_id -> gene_name

Reproduce the whole thing later

conda env create -f environment.yml, then run scripts/01 → 02 → 03 → make_tx2gene.R. Everything you need is in the repo.

→ Now the fun part: load the data and find differentially expressed genes — 3 · Import & DESeq2.