Last updated: 2025-01-04

Checks: 6 1

Knit directory: reproduce_genomics_paper_figures/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20241226) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 883cb56. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    analysis/.DS_Store
    Ignored:    data/fastq/
    Ignored:    data/public_data/
    Ignored:    data/reference/

Unstaged changes:
    Modified:   .gitignore
    Modified:   analysis/01_download_fastq_from_GEO.Rmd
    Modified:   analysis/02_align_to_hg38.Rmd
    Modified:   analysis/03_generate_bigwig.Rmd
    Modified:   analysis/08_homework.Rmd
    Modified:   analysis/_site.yml
    Modified:   analysis/about.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/03_generate_bigwig.Rmd) and HTML (docs/03_generate_bigwig.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd c9a4ca2 crazyhottommy 2024-12-31 version 0.1
html c9a4ca2 crazyhottommy 2024-12-31 version 0.1
Rmd fb4fce6 crazyhottommy 2024-12-27 preprocessing done
html fb4fce6 crazyhottommy 2024-12-27 preprocessing done

generate bigwig files

Now that we have the sorted bam file, we can generate a genome wide signal view as a bigwig file.

A BigWig file is a binary file format commonly used in bioinformatics to efficiently store and visualize continuous data over genomic coordinates, such as read coverage or signal intensity. It is derived from the Wiggle (WIG) format but optimized for faster access and reduced file size, making it suitable for large-scale genomic data.

UCSC has a detailed page explaining what is it https://genome.ucsc.edu/goldenpath/help/bigWig.html

Bonus: Bioinformatics is nortorious about different file formats. read https://genome.ucsc.edu/FAQ/FAQformat.html for various formats definitions

For this purpose, we will use deeptools https://deeptools.readthedocs.io/en/develop/content/example_usage.html

deeptools is very versatile and it has many sub-commands.

We will use bamCoverage to generate bigwig files.

# install it via conda if you do not have it yet
conda install -c conda-forge -c bioconda deeptools
# or 
pip install deeptools

I really like the demonstration of how coverage files are computed by the deeptools authors.

  • RPKM: reads per kilobase per million reads The formula is: RPKM (per bin) = number of reads per bin /(number of mapped reads (in millions) * bin length (kp))

  • RPGC: reads per genomic content used to normalize reads to 1x depth of coverage sequencing depth is defined as: (total number of mapped reads * fragment length) / effective genome size

cd data/fastq
bamCoverage --bam YAP.sorted.bam --normalizeUsing RPKM --extendReads 200 -o YAP1.bw

Read here to understand why you need to extend the reads to 200 bp. We only sequenced 50bp of the DNA after antibody pull-down, but the real DNA is about 200 bp (the size of the DNA after sonication/fragmentation)

We used default bin size 50 bp.

Generate bigwig files for all samples

for bam in *sorted.bam
do
  bamCoverage --bam $bam --normalizeUsing RPKM --extendReads 200 -o ${bam/sorted.bam/bw}
done

peak calling

MACS is the most popular peak caller for ChIPseq. It is maintained by Tao Liu. I had the pleasure working with him on some single-cell ATACseq stuff when I was in Shirley Liu’s lab.

MACS is now MACS3! https://macs3-project.github.io/MACS/docs/INSTALL.html

pip install --upgrade macs3

The MACS3 callpeak subcommands have many paramters and you want to read https://macs3-project.github.io/MACS/docs/callpeak.html

macs3 callpeak -t YAP.sorted.bam -c IgG.sorted.bam -f BAM -n YAP -g hs --outdir YAP_peak

It takes a couple of minutes. output:

ls YAP_peak/
YAP1_model.r          YAP1_peaks.narrowPeak YAP1_peaks.xls        YAP1_summits.bed

Do it for all samples.

for bam in *sorted.bam
do
  if [[ "$bam" != "IgG.sorted.bam" ]]; then
    echo macs3 callpeak -t $bam -c IgG.sorted.bam -f BAM -n ${bam%.sorted.bam} -g hs --outdir ${bam/.sorted.bam/_peak}
  fi
done

macs3 callpeak -t TAZ.sorted.bam -c IgG.sorted.bam -f BAM -n TAZ -g hs --outdir TAZ_peak
macs3 callpeak -t TEAD4.sorted.bam -c IgG.sorted.bam -f BAM -n TEAD4 -g hs --outdir TEAD4_peak
macs3 callpeak -t YAP.sorted.bam -c IgG.sorted.bam -f BAM -n YAP -g hs --outdir YAP_peak

We learned something new here!

if [[ "$bam" != "IgG.bam" ]]; then: Checks if the current file is not IgG.bam.

remove the “echo” and run it:

for bam in *sorted.bam
do
  if [[ "$bam" != "IgG.sorted.bam" ]]; then
    macs3 callpeak -t $bam -c IgG.sorted.bam -f BAM -n ${bam%.sorted.bam} -g hs --outdir ${bam/.sorted.bam/_peak}
  fi
done

How many peaks we get for each transcription factor?

find . -name "*Peak"  | xargs wc -l
   11512 ./TEAD4_peak/TEAD4_peaks.narrowPeak
   10719 ./TAZ_peak/TAZ_peaks.narrowPeak
    9807 ./YAP_peak/YAP_peaks.narrowPeak
   32038 total

According to the manual page:

NAME_peaks.narrowPeak is BED6+4 format file which contains the peak locations together with peak summit, p-value, and q-value. If you plan to load it to the UCSC genome browser, please make sure that you turn on –trackline option. Definition of some specific columns are:

5th: integer score for display. It’s calculated as int(-10log10pvalue) or int(-10log10qvalue) depending on whether -p (pvalue) or -q (qvalue) is used as score cutoff. Please note that currently this value might be out of the [0-1000] range defined in UCSC ENCODE narrowPeak format. You can let the value saturated at 1000 (i.e. p/q-value = 10^-100) by using the following 1-liner awk: awk -v OFS=“ ‘{$5=$5>1000?1000:$5} {print}’ NAME_peaks.narrowPeak

7th: fold-change at peak summit

8th: -log10pvalue at peak summit

9th: -log10qvalue at peak summit

10th: relative summit position to peak start

What is Model Building in MACS?

Model building in MACS (Model-Based Analysis of ChIP-Seq) is a step that attempts to determine the average fragment size of DNA fragments from sequencing data. It uses the shifted positions of the tags (forward and reverse reads) to estimate the peak signal. The estimated fragment size is then used to build a model of the peak shape, which is important for identifying and refining peak regions.

Model Building: Single-End vs. Paired-End

  • Single-End Data:

Model building is used because the fragment size is not directly available from single-end reads. MACS shifts the reads by half the estimated fragment size to align them to the putative binding sites.

  • Paired-End Data:

Model building is usually not needed because the fragment size is directly available from the paired-end read alignment. In paired-end mode, MACS calculates the actual insert size between read pairs, bypassing the need for model building.

When to Use –no-model and –extend-size

For single-end data, you might use –no-model and –extend-size in specific scenarios where you want to skip model building and manually specify the fragment size:

–no-model:

Disables the model building step. Use this option if you already know the approximate fragment size from your library preparation.

–extend-size:

Extends reads to the specified fragment size (e.g., 200 bp). This mimics the actual fragment length in the absence of paired-end information.

Best Practices for Single-End Data

  • With Model Building (Default Behavior):

Let MACS build the model unless there are specific reasons to disable it.

macs2 callpeak -t treatment.bam -c control.bam -f BED -g hs -n sample –outdir results

  • Without Model Building (–no-model):

Use when: You know the fragment size (e.g., from library preparation or empirical testing). The library is not suitable for accurate model building (e.g., very low read depth or noisy data). Specify –extend-size to set the fragment size.

macs2 callpeak -t treatment.bam -c control.bam -f BED -g hs -n sample --outdir results --no-model --extend-size 200

Why Use –no-model and –extend-size for Single-End?

Consistency: You can ensure the fragment size is consistently applied across all datasets. Noise Reduction: Avoid errors from noisy or low-quality data affecting model estimation. Speed: Skipping model building can make peak calling faster.

Model building is relevant for single-end data and used by default unless –no-model is specified. For paired-end data, model building is typically unnecessary since the fragment size is directly calculated. Use –no-model –extend-size for single-end data when you want to bypass model building and directly set the fragment size.

Challenges

The paper uses a previous generated ChIP-seq dataset https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE49651 for H3K4me1, H3K4me3 and H3K27ac. The authors used H3K4me1 to define enhancers and H3K4me3 to define promoters. H3K4me1/H3K27ac to define active enhancers and H3K4me3/H3K27ac to define active promoters.

Can you process the data yourself?


sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] workflowr_1.7.1

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5       httr_1.4.7        cli_3.6.3         knitr_1.48       
 [5] rlang_1.1.4       xfun_0.46         stringi_1.8.4     processx_3.8.4   
 [9] promises_1.3.0    jsonlite_1.8.8    glue_1.7.0        rprojroot_2.0.4  
[13] git2r_0.35.0      htmltools_0.5.8.1 httpuv_1.6.15     ps_1.7.7         
[17] sass_0.4.9        fansi_1.0.6       rmarkdown_2.27    jquerylib_0.1.4  
[21] tibble_3.2.1      evaluate_0.24.0   fastmap_1.2.0     yaml_2.3.10      
[25] lifecycle_1.0.4   whisker_0.4.1     stringr_1.5.1     compiler_4.4.1   
[29] fs_1.6.4          pkgconfig_2.0.3   Rcpp_1.0.13       rstudioapi_0.16.0
[33] later_1.3.2       digest_0.6.36     R6_2.5.1          utf8_1.2.4       
[37] pillar_1.9.0      callr_3.7.6       magrittr_2.0.3    bslib_0.8.0      
[41] tools_4.4.1       cachem_1.1.0      getPass_0.2-4