1 · The data

Before any analysis, understand what you’re working with. This is exactly where an AI assistant shines: let Claude Code read and summarize the files for you.

The experiment

GSE197576 is a human bulk RNA-seq study. We use four single-end samples, two per condition:

Sample	Condition	SRA experiment
Normoxia_sgCTRL_1	normoxia	SRX14311105
Normoxia_sgCTRL_2	normoxia	SRX14311106
Hypoxia_sgCTRL_1	hypoxia	SRX14311111
Hypoxia_sgCTRL_2	hypoxia	SRX14311112

Hypoxia = low oxygen. Cells respond by stabilizing the HIF transcription factors, which switch on a well-known program of genes (VEGFA, CA9, BNIP3, SLC2A1/GLUT1, PGK1…). If our analysis is right, those should be among the top up-regulated genes — a free biological positive control.

Explore it with Claude Code

Try this prompt

What is in data/salmon/Normoxia_sgCTRL_1/quant.sf? Show me the first few rows and explain each column.

A Salmon quant.sf file is a plain tab-separated table with one row per transcript:

Name                Length  EffectiveLength  TPM       NumReads
ENST00000456328.2   1657    1465.83          0.412331  12.000
ENST00000450305.2   632     441.30           0.000000  0.000
...

Name — the transcript ID (versioned GENCODE ENST).
Length / EffectiveLength — transcript length and the length adjusted for the fragment-length distribution.
TPM — transcripts per million (normalized abundance).
NumReads — estimated reads assigned to the transcript (this is what DESeq2 ultimately uses, after summing to gene level).
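
To make the column meanings concrete, here is a minimal Python sketch that parses the two excerpt rows and recomputes TPM from NumReads and EffectiveLength. The function names `read_quant_sf` and `tpm_from_reads` are illustrative, not part of Salmon. The recomputed TPM only matches the file's TPM column when every transcript in the file is included, so on a two-row excerpt it is a toy calculation.

```python
import csv
import io

# Two rows copied from the excerpt above, tab-separated as in a real quant.sf.
QUANT_SF_EXCERPT = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "ENST00000456328.2\t1657\t1465.83\t0.412331\t12.000\n"
    "ENST00000450305.2\t632\t441.30\t0.000000\t0.000\n"
)


def read_quant_sf(handle):
    """Parse quant.sf rows into dicts, converting the numeric columns."""
    rows = []
    for rec in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "Name": rec["Name"],
            "Length": int(rec["Length"]),
            "EffectiveLength": float(rec["EffectiveLength"]),
            "TPM": float(rec["TPM"]),
            "NumReads": float(rec["NumReads"]),
        })
    return rows


def tpm_from_reads(rows):
    """Recompute TPM over the given rows only.

    Each transcript's rate is NumReads / EffectiveLength; TPM rescales
    the rates so that they sum to one million.
    """
    rates = [r["NumReads"] / r["EffectiveLength"] for r in rows]
    total = sum(rates)
    return [1e6 * rate / total if total > 0 else 0.0 for rate in rates]


rows = read_quant_sf(io.StringIO(QUANT_SF_EXCERPT))
expressed = [r["Name"] for r in rows if r["NumReads"] > 0]
print(len(rows), "transcripts,", len(expressed), "with reads")
```

For a real file, pass `open("data/salmon/Normoxia_sgCTRL_1/quant.sf", newline="")` instead of the in-memory excerpt.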

Try this prompt

Read data/samples.csv and confirm how many samples are in each condition. Also tell me how many transcripts are quantified in each quant.sf file.
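
If you want to cross-check the answer by hand, a short Python sketch like the one below does the same two counts. It reads the sample table shown earlier from an in-memory string; the column name `Condition` and the tab delimiter are taken from that table and are assumptions about data/samples.csv, whose actual layout may differ. The helper names are illustrative.

```python
import csv
import io
from collections import Counter

# The sample table from "The experiment", held in memory for the example.
# Assumption: data/samples.csv may use different column names or a comma delimiter.
SAMPLE_TABLE = (
    "Sample\tCondition\tSRA experiment\n"
    "Normoxia_sgCTRL_1\tnormoxia\tSRX14311105\n"
    "Normoxia_sgCTRL_2\tnormoxia\tSRX14311106\n"
    "Hypoxia_sgCTRL_1\thypoxia\tSRX14311111\n"
    "Hypoxia_sgCTRL_2\thypoxia\tSRX14311112\n"
)


def samples_per_condition(handle, delimiter="\t", column="Condition"):
    """Count how many rows of a sample sheet fall in each condition."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return Counter(row[column] for row in reader)


def count_transcripts(handle):
    """Count the non-empty data rows of a quant.sf-style file (header skipped)."""
    next(handle, None)  # skip the header line, if there is one
    return sum(1 for line in handle if line.strip())


counts = samples_per_condition(io.StringIO(SAMPLE_TABLE))
print(dict(counts))
```

For the table above this reports two samples per condition. `count_transcripts` takes any open quant.sf file handle; all four files should report the same number if they were quantified against the same transcriptome.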

Why we need `tx2gene`

Salmon quantifies transcripts, but differential expression is usually done per gene. The map from transcript to gene comes from the GENCODE annotation and lives in data/tx2gene.csv:

transcript_id        gene_id
ENST00000456328.2    ENSG00000290825.2
ENST00000450305.2    ENSG00000223972.6
...

tximport uses this to sum transcript-level estimates into gene-level counts in the next step. The companion data/gene_name_map.csv turns the cryptic ENSG… IDs into readable symbols at the end.
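
The core idea of the gene-level summary can be sketched in a few lines of Python. This is an illustration of the summing step only, not tximport itself, which is an R package and does more than add up reads. The two mapped transcripts come from the excerpts above; the third transcript ID and its read count are made up so that one gene has two transcripts to sum.

```python
from collections import defaultdict

# Transcript-to-gene pairs from the tx2gene excerpt above, plus one
# hypothetical transcript (ENST_HYPOTHETICAL.1) added for illustration.
tx2gene = {
    "ENST00000456328.2": "ENSG00000290825.2",
    "ENST00000450305.2": "ENSG00000223972.6",
    "ENST_HYPOTHETICAL.1": "ENSG00000290825.2",
}

# NumReads per transcript: the first two values are from the quant.sf excerpt,
# the last two are invented for the example.
num_reads = {
    "ENST00000456328.2": 12.0,
    "ENST00000450305.2": 0.0,
    "ENST_HYPOTHETICAL.1": 3.5,
    "ENST_NOT_IN_MAP.1": 7.0,
}


def sum_to_genes(num_reads, tx2gene):
    """Sum transcript-level reads per gene; report transcripts with no mapping."""
    gene_counts = defaultdict(float)
    unmapped = []
    for transcript, reads in num_reads.items():
        gene = tx2gene.get(transcript)
        if gene is None:
            unmapped.append(transcript)
            continue
        gene_counts[gene] += reads
    return dict(gene_counts), unmapped


gene_counts, unmapped = sum_to_genes(num_reads, tx2gene)
print(gene_counts)
print("unmapped transcripts:", unmapped)
```

Reporting the unmapped transcripts separately is a deliberate choice here: a transcript ID that is missing from the map, for example because of a version-suffix mismatch, would otherwise drop out of the gene counts silently.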

The Claude Code habit

Notice the pattern: ask it to read and explain before you ask it to compute. You catch misunderstandings about the data early — when they’re cheap to fix.

→ Next: how those quant.sf files were produced — 2 · Preprocessing.
