3 · Import & DESeq2

This is the heart of the workshop: from salmon files to a ranked list of differentially expressed genes. Drive each step by prompting Claude Code — the code below is what it should produce, and the output is real, run on this repo’s data.

Load the salmon output with tximport

Try this prompt

Read data/samples.csv. Then load the four salmon quant.sf files with tximport, summarizing transcripts to genes using data/tx2gene.csv. Show me the dimensions of the gene-level count matrix.

library(tximport)
library(readr)
library(dplyr)

samples <- read_csv("data/samples.csv", show_col_types = FALSE)
samples$condition <- factor(samples$condition, levels = c("normoxia", "hypoxia"))

files <- file.path("data", "salmon", samples$sample, "quant.sf")
names(files) <- samples$sample

tx2gene <- read_csv("data/tx2gene.csv", show_col_types = FALSE)
txi <- tximport(files, type = "salmon", tx2gene = tx2gene)

dim(txi$counts)            # genes x samples
## [1] 62749     4
head(round(txi$counts), 3)
##                    Normoxia_sgCTRL_1 Normoxia_sgCTRL_2 Hypoxia_sgCTRL_1
## ENSG00000000003.16               116               137              133
## ENSG00000000005.6                  0                 0                0
## ENSG00000000419.14               475               505              422
##                    Hypoxia_sgCTRL_2
## ENSG00000000003.16              147
## ENSG00000000005.6                 0
## ENSG00000000419.14              410

tximport returned three matrices — counts, abundance (TPM) and length. DESeq2 uses all three: the per-gene average transcript length becomes an offset that corrects for differential transcript usage between samples.

Build the DESeq2 dataset

Try this prompt

Create a DESeq2 dataset from the tximport object with design ~ condition, and set normoxia as the reference level so fold changes read as hypoxia-vs-normoxia.

library(DESeq2)

dds <- DESeqDataSetFromTximport(txi, colData = samples, design = ~ condition)
dds$condition <- relevel(dds$condition, ref = "normoxia")
dds
## class: DESeqDataSet 
## dim: 62749 4 
## metadata(1): version
## assays(2): counts avgTxLength
## rownames(62749): ENSG00000000003.16 ENSG00000000005.6 ...
##   ENSG00000293559.1 ENSG00000293560.1
## rowData names(0):
## colnames(4): Normoxia_sgCTRL_1 Normoxia_sgCTRL_2 Hypoxia_sgCTRL_1
##   Hypoxia_sgCTRL_2
## colData names(4): sample condition srx replicate

Filter low-count genes

Genes with almost no reads add noise and drag down multiple-testing power. A light filter is standard:

Try this prompt

Remove genes whose total count across all samples is 10 or fewer, and tell me how many genes remain.

keep <- rowSums(counts(dds)) > 10
dds <- dds[keep, ]
nrow(dds)
## [1] 16654

Run differential expression

Try this prompt

Run DESeq2 and extract results for the contrast hypoxia vs normoxia. Summarize how many genes are up and down at padj < 0.05.

dds <- DESeq(dds)
res <- results(dds, contrast = c("condition", "hypoxia", "normoxia"))
summary(res)
## 
## out of 16654 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 1215, 7.3%
## LFC < 0 (down)     : 1362, 8.2%
## outliers [1]       : 0, 0%
## low counts [2]     : 3552, 21%
## (mean count < 12)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

DESeq() does three things in one call: estimates size factors (library-size normalization), estimates per-gene dispersion, and fits a negative-binomial GLM with a Wald test.

Annotate with gene symbols

The results are indexed by ENSG… IDs. Join the symbol map so the table is human-readable:

Try this prompt

Convert the results to a data frame, add gene symbols from data/gene_name_map.csv, sort by adjusted p-value, and show the top 15 genes.

library(tibble)

gene_name_map <- read_csv("data/gene_name_map.csv", show_col_types = FALSE)

res_df <- res |>
  as.data.frame() |>
  rownames_to_column("gene_id") |>
  left_join(gene_name_map, by = "gene_id") |>
  arrange(padj)

res_df |>
  select(gene_name, gene_id, baseMean, log2FoldChange, padj) |>
  head(15)

# Biological sanity check: are canonical hypoxia genes up-regulated?
hif <- c("VEGFA", "CA9", "BNIP3", "SLC2A1", "PGK1", "LDHA", "PDK1")
res_df |>
  filter(gene_name %in% hif) |>
  select(gene_name, log2FoldChange, padj) |>
  arrange(padj)

If the hypoxia program is up, the analysis is behaving. 🎯

→ Let’s make this visual — 4 · Visualization.