Alignment to the genome

Last updated: 2025-01-04

Checks: 6 1

Knit directory: reproduce_genomics_paper_figures/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20241226)

The command set.seed(20241226) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 883cb56

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 883cb56. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    analysis/.DS_Store
    Ignored:    data/fastq/
    Ignored:    data/public_data/
    Ignored:    data/reference/

Unstaged changes:
    Modified:   .gitignore
    Modified:   analysis/01_download_fastq_from_GEO.Rmd
    Modified:   analysis/02_align_to_hg38.Rmd
    Modified:   analysis/08_homework.Rmd
    Modified:   analysis/_site.yml
    Modified:   analysis/about.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/02_align_to_hg38.Rmd) and HTML (docs/02_align_to_hg38.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	c9a4ca2	crazyhottommy	2024-12-31	version 0.1
html	c9a4ca2	crazyhottommy	2024-12-31	version 0.1
Rmd	fb4fce6	crazyhottommy	2024-12-27	preprocessing done
html	fb4fce6	crazyhottommy	2024-12-27	preprocessing done

The next step is to align the fastq files to the genome. We will use hg38 reference human genome for the alignment.

Where to download the reference?

watch my YouTube video on this topic:

One can go to: 1. UCSC https://hgdownload.soe.ucsc.edu/downloads.html

to download the reference genome manually or use refgenie

Choose the aligner

It is single-end read with 50 base pairs.

zless TAZ.fatq.gz | head -2 |tail -1
ATAGGCTTTAAGCTGTCTTTGGTTTGAAGGTGGGATTTTACCGGGGACCC

zless TAZ.fatq.gz | head -2 |tail -1 | wc -L
      50

The ChIP-seq signal can happen to anywhere in the genome, so we need an aligner to align the reads to the genome.

For RNAseq data, we need to align the reads to a transcriptome.

STAR is a popular aligner for RNAseq.

BWA is very popular for DNA-seq (WGS, WES) alignment.

For long reads (pacbio or nanopore), use minimap2.

BWA and minimap2 are both developed by Heng Li, not surprisingly, the mapping God.

Let’s use bowtie2 https://bowtie-bio.sourceforge.net/bowtie2/index.shtml for our ChIP-seq data.

Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes.

In the early days, ChIP-seq data were 36bp long, so Bowtie 1 was used.

install bowtie2

conda install bowtie2 -c bioconda
Channels:
 - bioconda
 - defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: failed

LibMambaUnsatisfiableError: Encountered problems while solving:
  - nothing provides libcxx >=16 needed by bowtie2-2.5.4-hdcd892d_2

Could not solve for environment specs
The following packages are incompatible
└─ bowtie2 is not installable because there are no viable options
   ├─ bowtie2 2.5.4 would require
   │  └─ libcxx >=16 , which does not exist (perhaps a missing channel);
   └─ bowtie2 2.5.4 would require
      └─ libcxx >=18 , which does not exist (perhaps a missing channel).

Ask ChatGPT, and it is lacking the libcxx. let’s install it first.

conda install libcxx -c conda-forge

conda install bowtie2 -c bioconda

Now it works, and after it finishes installing you can

bowtie2

download index

The aligners usually needs an index file that is created using the genome reference files (fasta) we downloaded from UCSC, NCBI, ENSEMBL or GENCODE.

For bowtie2, there is a pre-built index file for us to download.

Scroll down the page https://bowtie-bio.sourceforge.net/bowtie2/index.shtml and on the left you will see the index to download.

We will download the GRCH38 (aka, hg38 without decoy). click it, it will download a GRCh38_noalt_as.zip file of 3.5G. unzip it and place the folder to the data/reference folder.

Which reference genome to use? Bonus, read this post by HengLi.

If there is no pre-built index, take a look at this nf-core/references.

# make sure you are in the data folder
cd data/

bowtie2 -x reference/GRCh38_noalt_as/GRCh38_noalt_as -U fastq/YAP.fastq.gz -S fastq/YAP1.sam --threads 8  --no-mixed --no-discordant -k 1

-x reference/GRCh38_noalt_as/GRCh38_noalt_as: The path to the Bowtie2 reference genome index (built with bowtie2-build reference.fa reference_index).

-U input.fastq: Input FASTQ file for single-end reads.

-S output.sam: Output alignment file in SAM format.

--threads 8: Use 8 threads to speed up alignment.

--no-mixed: Ensures that only properly paired reads are reported (not relevant for single-end data but good for ChIP-seq).

--no-discordant: Prevents reporting discordant alignments (not relevant for single-end data but ensures consistent results for paired-end).

-k 1: Reports only the best alignment for each read (important for ChIP-seq to avoid multi-mapping).

Below is the output. And it is super-fast!! for 24 million reads, it only took me 1.5 mins with 8 CPUs.

24549590 reads; of these:
  24549590 (100.00%) were unpaired; of these:
    895629 (3.65%) aligned 0 times
    23653961 (96.35%) aligned exactly 1 time
    0 (0.00%) aligned >1 times
96.35% overall alignment rate
       93.73 real       733.08 user        21.44 sys

The output is the sam file which is a format to store the alignment. It is just a text file and you can use less -S YAP1.sam to see the content.

After we made one sample work, we can loop over and do the same for other fastq files.

rm YAP.sam

for fq in fastq/*fastq.gz
do
  bowtie2 -x reference/GRCh38_noalt_as/GRCh38_noalt_as -U $fq -S ${fq/fastq.gz/sam} --threads 8  --no-mixed --no-discordant -k 1
done

here we used bash string manipulation ${fq/fastq.gz/sam} to replace the fastq.gz to sam as the output.

Bonus: read my blog post on it https://crazyhottommy.blogspot.com/2015/07/string-manipulation-in-bash.html

Make sure you have all the sam files:

ls fastq/*sam
fastq/IgG.sam   fastq/TAZ.sam   fastq/TEAD4.sam fastq/YAP.sam

convert sam to bam

Most of the tools work with the binary form of the sam file which is a bam file. The size of the bam file is smaller samtools is the key toolkit to deal with it.

conda install samtools -c bioconda

for sam in fastq/*sam
do
  samtools view -bS $sam > ${sam/sam/bam}
done

# check the size of the sam and bam files
ls -shlt fastq/*am

 2590616 -rw-r--r--@ 1 tommytang  staff   1.2G Dec 26 16:02 fastq/YAP.bam
 3310856 -rw-r--r--@ 1 tommytang  staff   1.6G Dec 26 16:01 fastq/TEAD4.bam
 2885488 -rw-r--r--@ 1 tommytang  staff   1.4G Dec 26 15:59 fastq/TAZ.bam
 3277792 -rw-r--r--@ 1 tommytang  staff   1.5G Dec 26 15:58 fastq/IgG.bam
10158320 -rw-r--r--@ 1 tommytang  staff   4.8G Dec 26 15:54 fastq/YAP.sam
14483624 -rw-r--r--@ 1 tommytang  staff   6.9G Dec 26 15:52 fastq/TEAD4.sam
11437112 -rw-r--r--@ 1 tommytang  staff   5.4G Dec 26 15:50 fastq/TAZ.sam
12947640 -rw-r--r--@ 1 tommytang  staff   6.2G Dec 26 15:48 fastq/IgG.sam

# remove the sam files to save space
rm fastq/*sam

sort the reads in the bam file by coordinates.

for bam in fastq/*bam
do
  samtools sort -@ 4 -o ${bam/bam/sorted.bam} $bam
done

ls fastq/*bam
fastq/IgG.bam          fastq/TAZ.bam          fastq/TEAD4.bam        fastq/YAP.bam
fastq/IgG.sorted.bam   fastq/TAZ.sorted.bam   fastq/TEAD4.sorted.bam fastq/YAP.sorted.bam

# remove all the unsorted bam 
ls fastq/*bam | grep -v sorted.bam | xargs rm

Always test the command for a single sample and then apply it to all samples after you confirm it works.

Why Sort a BAM File?

Efficient Retrieval by Genomic Coordinates:

Sorting arranges the reads in the BAM file by their genomic coordinates (e.g., by chromosome and position). This is crucial for downstream tools (e.g., variant calling, visualization) that rely on efficient access to reads aligned to specific regions. Compatibility with Analysis Tools:

Many tools require sorted BAM files as input because they process data regionally (e.g., calling peaks or calculating coverage).

index the bam file


for bam in fastq/*sorted.bam
do 
  samtools index $bam
done

Why Index a BAM File?

Random Access to Reads:

The BAM index (.bai or .csi file) allows tools to quickly locate and access reads aligned to specific genomic regions without scanning the entire file. This dramatically improves performance for tasks like visualization (e.g., in IGV) or focused analyses of particular loci.

Faster Processing:

Many tools (e.g., samtools, deepTools, GATK) use the index to load only relevant portions of the BAM file, reducing memory and computation time.

Requirement for Visualization:

Genome browsers like IGV or UCSC Genome Browser require indexed BAM files to display aligned reads efficiently.

sessionInfo()

R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] vembedr_0.1.5   workflowr_1.7.1

loaded via a namespace (and not attached):
 [1] jsonlite_1.8.8    compiler_4.4.1    promises_1.3.0    Rcpp_1.0.13      
 [5] stringr_1.5.1     git2r_0.35.0      assertthat_0.2.1  callr_3.7.6      
 [9] later_1.3.2       jquerylib_0.1.4   yaml_2.3.10       fastmap_1.2.0    
[13] R6_2.5.1          curl_5.2.1        knitr_1.48        tibble_3.2.1     
[17] rprojroot_2.0.4   bslib_0.8.0       pillar_1.9.0      rlang_1.1.4      
[21] utf8_1.2.4        cachem_1.1.0      stringi_1.8.4     httpuv_1.6.15    
[25] xfun_0.46         getPass_0.2-4     fs_1.6.4          sass_0.4.9       
[29] cli_3.6.3         magrittr_2.0.3    ps_1.7.7          digest_0.6.36    
[33] processx_3.8.4    rstudioapi_0.16.0 lifecycle_1.0.4   vctrs_0.6.5      
[37] evaluate_0.24.0   glue_1.7.0        whisker_0.4.1     fansi_1.0.6      
[41] rmarkdown_2.27    httr_1.4.7        tools_4.4.1       pkgconfig_2.0.3  
[45] htmltools_0.5.8.1