Last updated: 2025-01-04
Checks: 6 1
Knit directory:
reproduce_genomics_paper_figures/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The R Markdown file has unstaged changes. To know which version of
the R Markdown file created these results, you’ll want to first commit
it to the Git repo. If you’re still working on the analysis, you can
ignore this warning. When you’re finished, you can run
wflow_publish
to commit the R Markdown file and build the
HTML.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20241226)
was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 883cb56. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish
or
wflow_git_commit
). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rproj.user/
Ignored: analysis/.DS_Store
Ignored: data/fastq/
Ignored: data/public_data/
Ignored: data/reference/
Unstaged changes:
Modified: .gitignore
Modified: analysis/01_download_fastq_from_GEO.Rmd
Modified: analysis/02_align_to_hg38.Rmd
Modified: analysis/08_homework.Rmd
Modified: analysis/_site.yml
Modified: analysis/about.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/02_align_to_hg38.Rmd
) and
HTML (docs/02_align_to_hg38.html
) files. If you’ve
configured a remote Git repository (see ?wflow_git_remote
),
click on the hyperlinks in the table below to view the files as they
were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | c9a4ca2 | crazyhottommy | 2024-12-31 | version 0.1 |
html | c9a4ca2 | crazyhottommy | 2024-12-31 | version 0.1 |
Rmd | fb4fce6 | crazyhottommy | 2024-12-27 | preprocessing done |
html | fb4fce6 | crazyhottommy | 2024-12-27 | preprocessing done |
The next step is to align the fastq files to the genome. We will use hg38 reference human genome for the alignment.
watch my YouTube video on this topic:
One can go to: 1. UCSC https://hgdownload.soe.ucsc.edu/downloads.html
NCBI https://www.ncbi.nlm.nih.gov/datasets/docs/v2/policies-annotation/genomeftp/
GENCODE https://www.gencodegenes.org/
to download the reference genome manually or use refgenie
It is single-end read with 50 base pairs.
zless TAZ.fatq.gz | head -2 |tail -1
ATAGGCTTTAAGCTGTCTTTGGTTTGAAGGTGGGATTTTACCGGGGACCC
zless TAZ.fatq.gz | head -2 |tail -1 | wc -L
50
The ChIP-seq signal can happen to anywhere in the genome, so we need an aligner to align the reads to the genome.
For RNAseq data, we need to align the reads to a
transcriptome
.
STAR
is
a popular aligner for RNAseq.
BWA
is very
popular for DNA-seq (WGS, WES) alignment.
For long reads (pacbio or nanopore), use minimap2
.
BWA and minimap2 are both developed by Heng Li, not surprisingly, the mapping God.
Let’s use bowtie2
https://bowtie-bio.sourceforge.net/bowtie2/index.shtml
for our ChIP-seq data.
Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes.
In the early days, ChIP-seq data were 36bp long, so Bowtie 1 was used.
conda install bowtie2 -c bioconda
Channels:
- bioconda
- defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: failed
LibMambaUnsatisfiableError: Encountered problems while solving:
- nothing provides libcxx >=16 needed by bowtie2-2.5.4-hdcd892d_2
Could not solve for environment specs
The following packages are incompatible
└─ bowtie2 is not installable because there are no viable options
├─ bowtie2 2.5.4 would require
│ └─ libcxx >=16 , which does not exist (perhaps a missing channel);
└─ bowtie2 2.5.4 would require
└─ libcxx >=18 , which does not exist (perhaps a missing channel).
Ask ChatGPT, and it is lacking the libcxx. let’s install it first.
conda install libcxx -c conda-forge
conda install bowtie2 -c bioconda
Now it works, and after it finishes installing you can
bowtie2
The aligners usually needs an index file that is created using the genome reference files (fasta) we downloaded from UCSC, NCBI, ENSEMBL or GENCODE.
For bowtie2, there is a pre-built index file for us to download.
Scroll down the page https://bowtie-bio.sourceforge.net/bowtie2/index.shtml and on the left you will see the index to download.
We will download the GRCH38 (aka, hg38 without decoy). click it, it
will download a GRCh38_noalt_as.zip
file of 3.5G. unzip it
and place the folder to the data/reference
folder.
Which reference genome to use? Bonus, read this post by HengLi.
If there is no pre-built index, take a look at this nf-core/references.
# make sure you are in the data folder
cd data/
bowtie2 -x reference/GRCh38_noalt_as/GRCh38_noalt_as -U fastq/YAP.fastq.gz -S fastq/YAP1.sam --threads 8 --no-mixed --no-discordant -k 1
-x reference/GRCh38_noalt_as/GRCh38_noalt_as
: The path
to the Bowtie2 reference genome index (built with bowtie2-build
reference.fa reference_index).
-U input.fastq
: Input FASTQ file for single-end
reads.
-S output.sam
: Output alignment file in SAM format.
--threads 8
: Use 8 threads to speed up alignment.
--no-mixed
: Ensures that only properly paired reads are
reported (not relevant for single-end data but good for ChIP-seq).
--no-discordant
: Prevents reporting discordant
alignments (not relevant for single-end data but ensures consistent
results for paired-end).
-k 1
: Reports only the best alignment for each read
(important for ChIP-seq to avoid multi-mapping).
Below is the output. And it is super-fast!! for 24 million reads, it only took me 1.5 mins with 8 CPUs.
24549590 reads; of these:
24549590 (100.00%) were unpaired; of these:
895629 (3.65%) aligned 0 times
23653961 (96.35%) aligned exactly 1 time
0 (0.00%) aligned >1 times
96.35% overall alignment rate
93.73 real 733.08 user 21.44 sys
The output is the sam
file which is a format to store
the alignment. It is just a text file and you can use
less -S YAP1.sam
to see the content.
After we made one sample work, we can loop over and do the same for other fastq files.
rm YAP.sam
for fq in fastq/*fastq.gz
do
bowtie2 -x reference/GRCh38_noalt_as/GRCh38_noalt_as -U $fq -S ${fq/fastq.gz/sam} --threads 8 --no-mixed --no-discordant -k 1
done
here we used bash string manipulation ${fq/fastq.gz/sam}
to replace the fastq.gz
to sam
as the
output.
Bonus: read my blog post on it https://crazyhottommy.blogspot.com/2015/07/string-manipulation-in-bash.html
Make sure you have all the sam files:
ls fastq/*sam
fastq/IgG.sam fastq/TAZ.sam fastq/TEAD4.sam fastq/YAP.sam
Most of the tools work with the binary form of the sam file which is
a bam file. The size of the bam file is smaller samtools
is
the key toolkit to deal with it.
conda install samtools -c bioconda
for sam in fastq/*sam
do
samtools view -bS $sam > ${sam/sam/bam}
done
# check the size of the sam and bam files
ls -shlt fastq/*am
2590616 -rw-r--r--@ 1 tommytang staff 1.2G Dec 26 16:02 fastq/YAP.bam
3310856 -rw-r--r--@ 1 tommytang staff 1.6G Dec 26 16:01 fastq/TEAD4.bam
2885488 -rw-r--r--@ 1 tommytang staff 1.4G Dec 26 15:59 fastq/TAZ.bam
3277792 -rw-r--r--@ 1 tommytang staff 1.5G Dec 26 15:58 fastq/IgG.bam
10158320 -rw-r--r--@ 1 tommytang staff 4.8G Dec 26 15:54 fastq/YAP.sam
14483624 -rw-r--r--@ 1 tommytang staff 6.9G Dec 26 15:52 fastq/TEAD4.sam
11437112 -rw-r--r--@ 1 tommytang staff 5.4G Dec 26 15:50 fastq/TAZ.sam
12947640 -rw-r--r--@ 1 tommytang staff 6.2G Dec 26 15:48 fastq/IgG.sam
# remove the sam files to save space
rm fastq/*sam
sort the reads in the bam file by coordinates.
for bam in fastq/*bam
do
samtools sort -@ 4 -o ${bam/bam/sorted.bam} $bam
done
ls fastq/*bam
fastq/IgG.bam fastq/TAZ.bam fastq/TEAD4.bam fastq/YAP.bam
fastq/IgG.sorted.bam fastq/TAZ.sorted.bam fastq/TEAD4.sorted.bam fastq/YAP.sorted.bam
# remove all the unsorted bam
ls fastq/*bam | grep -v sorted.bam | xargs rm
Always test the command for a single sample and then apply it to all samples after you confirm it works.
Why Sort a BAM File?
Sorting arranges the reads in the BAM file by their genomic coordinates (e.g., by chromosome and position). This is crucial for downstream tools (e.g., variant calling, visualization) that rely on efficient access to reads aligned to specific regions. Compatibility with Analysis Tools:
for bam in fastq/*sorted.bam
do
samtools index $bam
done
Why Index a BAM File?
The BAM index (.bai or .csi file) allows tools to quickly locate and access reads aligned to specific genomic regions without scanning the entire file. This dramatically improves performance for tasks like visualization (e.g., in IGV) or focused analyses of particular loci.
Many tools (e.g., samtools, deepTools, GATK) use the index to load only relevant portions of the BAM file, reducing memory and computation time.
Genome browsers like IGV or UCSC Genome Browser require indexed BAM files to display aligned reads efficiently.
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] vembedr_0.1.5 workflowr_1.7.1
loaded via a namespace (and not attached):
[1] jsonlite_1.8.8 compiler_4.4.1 promises_1.3.0 Rcpp_1.0.13
[5] stringr_1.5.1 git2r_0.35.0 assertthat_0.2.1 callr_3.7.6
[9] later_1.3.2 jquerylib_0.1.4 yaml_2.3.10 fastmap_1.2.0
[13] R6_2.5.1 curl_5.2.1 knitr_1.48 tibble_3.2.1
[17] rprojroot_2.0.4 bslib_0.8.0 pillar_1.9.0 rlang_1.1.4
[21] utf8_1.2.4 cachem_1.1.0 stringi_1.8.4 httpuv_1.6.15
[25] xfun_0.46 getPass_0.2-4 fs_1.6.4 sass_0.4.9
[29] cli_3.6.3 magrittr_2.0.3 ps_1.7.7 digest_0.6.36
[33] processx_3.8.4 rstudioapi_0.16.0 lifecycle_1.0.4 vctrs_0.6.5
[37] evaluate_0.24.0 glue_1.7.0 whisker_0.4.1 fansi_1.0.6
[41] rmarkdown_2.27 httr_1.4.7 tools_4.4.1 pkgconfig_2.0.3
[45] htmltools_0.5.8.1