library(tximport)
library(readr)
library(dplyr)
samples <- read_csv("data/samples.csv", show_col_types = FALSE)
samples$condition <- factor(samples$condition, levels = c("normoxia", "hypoxia"))
files <- file.path("data", "salmon", samples$sample, "quant.sf")
names(files) <- samples$sample
tx2gene <- read_csv("data/tx2gene.csv", show_col_types = FALSE)
txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
dim(txi$counts) # genes x samples
## [1] 62749 4
head(round(txi$counts), 3)
## Normoxia_sgCTRL_1 Normoxia_sgCTRL_2 Hypoxia_sgCTRL_1
## ENSG00000000003.16 116 137 133
## ENSG00000000005.6 0 0 0
## ENSG00000000419.14 475 505 422
## Hypoxia_sgCTRL_2
## ENSG00000000003.16 147
## ENSG00000000005.6 0
## ENSG00000000419.14 4103 · Import & DESeq2
This is the heart of the workshop: from salmon files to a ranked list of differentially expressed genes. Drive each step by prompting Claude Code — the code below is what it should produce, and the output is real, run on this repo’s data.
Load the salmon output with tximport
Try this prompt
Read data/samples.csv. Then load the four salmon quant.sf files with tximport, summarizing transcripts to genes using data/tx2gene.csv. Show me the dimensions of the gene-level count matrix.
tximport returned three matrices — counts, abundance (TPM) and length. DESeq2 uses all three: the per-gene average transcript length becomes an offset that corrects for differential transcript usage between samples.
Build the DESeq2 dataset
Try this prompt
Create a DESeq2 dataset from the tximport object with design ~ condition, and set normoxia as the reference level so fold changes read as hypoxia-vs-normoxia.
library(DESeq2)
dds <- DESeqDataSetFromTximport(txi, colData = samples, design = ~ condition)
dds$condition <- relevel(dds$condition, ref = "normoxia")
dds
## class: DESeqDataSet
## dim: 62749 4
## metadata(1): version
## assays(2): counts avgTxLength
## rownames(62749): ENSG00000000003.16 ENSG00000000005.6 ...
## ENSG00000293559.1 ENSG00000293560.1
## rowData names(0):
## colnames(4): Normoxia_sgCTRL_1 Normoxia_sgCTRL_2 Hypoxia_sgCTRL_1
## Hypoxia_sgCTRL_2
## colData names(4): sample condition srx replicateFilter low-count genes
Genes with almost no reads add noise and drag down multiple-testing power. A light filter is standard:
Try this prompt
Remove genes whose total count across all samples is 10 or fewer, and tell me how many genes remain.
keep <- rowSums(counts(dds)) > 10
dds <- dds[keep, ]
nrow(dds)
## [1] 16654Run differential expression
Try this prompt
Run DESeq2 and extract results for the contrast hypoxia vs normoxia. Summarize how many genes are up and down at padj < 0.05.
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition", "hypoxia", "normoxia"))
summary(res)
##
## out of 16654 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 1215, 7.3%
## LFC < 0 (down) : 1362, 8.2%
## outliers [1] : 0, 0%
## low counts [2] : 3552, 21%
## (mean count < 12)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?resultsDESeq() does three things in one call: estimates size factors (library-size normalization), estimates per-gene dispersion, and fits a negative-binomial GLM with a Wald test.
Annotate with gene symbols
The results are indexed by ENSG… IDs. Join the symbol map so the table is human-readable:
Try this prompt
Convert the results to a data frame, add gene symbols from data/gene_name_map.csv, sort by adjusted p-value, and show the top 15 genes.
library(tibble)
gene_name_map <- read_csv("data/gene_name_map.csv", show_col_types = FALSE)
res_df <- res |>
as.data.frame() |>
rownames_to_column("gene_id") |>
left_join(gene_name_map, by = "gene_id") |>
arrange(padj)
res_df |>
select(gene_name, gene_id, baseMean, log2FoldChange, padj) |>
head(15)# Biological sanity check: are canonical hypoxia genes up-regulated?
hif <- c("VEGFA", "CA9", "BNIP3", "SLC2A1", "PGK1", "LDHA", "PDK1")
res_df |>
filter(gene_name %in% hif) |>
select(gene_name, log2FoldChange, padj) |>
arrange(padj)If the hypoxia program is up, the analysis is behaving. 🎯
→ Let’s make this visual — 4 · Visualization.