Abstract
The advent of next-generation sequencing (NGS) technologies such as ChIP-seq and RNA-seq has helped to reveal many functional properties of DNA and RNA, respectively. However, due to the repeat content of higher eukaryotic genomes, identifying the genomic origin of some sequencing reads is often a challenge. Standard analysis pipelines typically neglect ambiguously mapping reads. Thus, families of genetic elements with very similar members remain underexplored.
Here, we investigated the systematic bias that this practice introduces in downstream analysis. Specifically, we analysed six publicly available single-end ChIP-seq and paired-end RNA-seq ENCODE libraries from human and mouse, weighting the number of reads (or read pairs) mapping to each of the genomic features of interest based on the ambiguity of the corresponding mappings. As expected, a substantial fraction of the ambiguously mapping reads (43–79%) mapped to transposons, which comprise about half of the human and mouse genomes. Notably, those reads mapped predominantly to evolutionarily young transposons such as AluY and L1HS. Thus, discarding ambiguously mapping reads tends to result in the specific underrepresentation of recently active transposons. Moreover, this common strategy also leads to the underrepresentation of genes with particular functions, including cytochrome-c oxidase activity and MHC class I and II protein binding.
This study is a proof of principle that raises awareness of potential systematic distortions caused by the common practice of discarding multimappers in NGS data analysis, and it encourages the development of strategies that explicitly consider genomic repetitive sequences.
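The abstract does not specify the exact weighting scheme, so the following is only a minimal sketch of one common choice, assuming that a read with n alignments contributes 1/n to the feature hit by each alignment. The function names, the input layout (read ID mapped to one feature name per alignment, with `None` for alignments outside any feature), and the toy data are all hypothetical and are not taken from the study.

```python
from collections import defaultdict


def weighted_feature_counts(read_to_features):
    """Fractional counting (assumed scheme): each of a read's n alignments
    adds 1/n to the feature it overlaps; None marks an alignment that
    overlaps no annotated feature."""
    counts = defaultdict(float)
    for features in read_to_features.values():
        if not features:
            continue  # unmapped read
        weight = 1.0 / len(features)
        for feature in features:
            if feature is not None:
                counts[feature] += weight
    return dict(counts)


def unique_only_counts(read_to_features):
    """The common practice: keep only reads with exactly one alignment."""
    counts = defaultdict(float)
    for features in read_to_features.values():
        if len(features) == 1 and features[0] is not None:
            counts[features[0]] += 1.0
    return dict(counts)


# Toy input (hypothetical): one entry per read, one feature per alignment.
alignments = {
    "read1": ["GAPDH"],                      # uniquely mapping
    "read2": ["AluY", "AluY", "AluSx", None],  # multimapper, 4 alignments
    "read3": ["L1HS", "L1HS"],               # multimapper, 2 copies of L1HS
}

weighted = weighted_feature_counts(alignments)
unique = unique_only_counts(alignments)
```

In this toy input, the unique-only count retains GAPDH alone, whereas the weighted count still attributes signal to AluY, AluSx, and L1HS. This mirrors, in miniature, the kind of underrepresentation of young transposon families described above.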
Original language | English |
---|---|
Publication status | Published - 17 Apr 2023 |
Event | 27th Annual International Conference on Research in Computational Molecular Biology: RECOMB 2023 - Istanbul Marriott Hotel Sisli, Istanbul, Turkey Duration: 16 Apr 2023 → 19 Apr 2023 http://recomb2023.bilkent.edu.tr/index.html |
Conference
Conference | 27th Annual International Conference on Research in Computational Molecular Biology |
---|---|
Abbreviated title | RECOMB 2023 |
Country/Territory | Turkey |
City | Istanbul |
Period | 16/04/23 → 19/04/23 |
Internet address | http://recomb2023.bilkent.edu.tr/index.html |
Keywords
- repeats
- multimappers
- bias
- functional analysis