Standard high-throughput functional analysis pipelines omit specific repetitive genetic groups

Michelle Almeida da Paz; Sarah Warger; Leila Taher

Institute of Biomedical Informatics (7200)

Research output: Contribution to conference › Poster › peer-review

Abstract

The advent of next-generation sequencing (NGS) technologies such as ChIP-seq and RNA-seq has helped to reveal many functional properties of DNA and RNA, respectively. However, due to the repeat content of higher eukaryotic genomes, identifying the genomic origin of some sequencing reads is often a challenge. Standard analysis pipelines typically discard ambiguously mapping reads (multimappers). Thus, families of genetic elements with very similar members remain underexplored.
Here, we investigated the systematic bias that this practice introduces into downstream analysis. Specifically, we analysed six publicly available single-end ChIP-seq and paired-end RNA-seq ENCODE libraries from human and mouse, weighting the number of reads (or read pairs) mapping to each of the genomic features of interest based on the ambiguity of the corresponding mappings. As expected, a substantial fraction of the ambiguously mapping reads (43–79%) mapped to transposons, which comprise about half of the human and mouse genomes. Notably, those reads mapped predominantly to evolutionarily young transposons such as AluY and L1HS. Thus, discarding ambiguously mapping reads tends to result in the specific underrepresentation of recently active transposons. Moreover, this common strategy also leads to the underrepresentation of genes with particular functions, including cytochrome-c oxidase activity and MHC class I and II protein binding.
This study is a proof of principle that aims to raise awareness of potential systematic distortions caused by the common practice of discarding multimappers in NGS data analysis, and it encourages the development of strategies that explicitly consider repetitive genomic sequences.
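
The abstract does not state the exact weighting scheme, so the following is only a minimal sketch of one common choice: each read contributes a weight of 1/n to every one of its n reported mapping locations (fractional counting). The function names, the input layout (pairs of read ID and overlapping feature), and the toy reads are all assumptions made for illustration and are not taken from the poster.

```python
from collections import defaultdict


def group_by_read(alignments):
    """Collect the features hit by each read, one entry per mapping location.

    `alignments` is an iterable of (read_id, feature) pairs; `feature` is
    None when a location overlaps no annotated feature (hypothetical layout).
    """
    locations = defaultdict(list)
    for read_id, feature in alignments:
        locations[read_id].append(feature)
    return locations


def weighted_feature_counts(alignments):
    """Fractional counting: a read with n locations adds 1/n per location."""
    counts = defaultdict(float)
    for features in group_by_read(alignments).values():
        weight = 1.0 / len(features)
        for feature in features:
            if feature is not None:
                counts[feature] += weight
    return dict(counts)


def unique_only_counts(alignments):
    """Conventional counting: keep only reads with exactly one location."""
    counts = defaultdict(float)
    for features in group_by_read(alignments).values():
        if len(features) == 1 and features[0] is not None:
            counts[features[0]] += 1.0
    return dict(counts)


# Toy input: r1 maps uniquely, r2 maps to four locations, r3 to two.
example = [
    ("r1", "GAPDH"),
    ("r2", "AluY"), ("r2", "AluY"), ("r2", "L1HS"), ("r2", None),
    ("r3", "L1HS"), ("r3", "L1HS"),
]

print(weighted_feature_counts(example))
print(unique_only_counts(example))
```

In this toy input, the unique-only counts contain no transposon signal at all, whereas the fractional counts retain contributions for AluY and L1HS. This mirrors, on a very small scale, the kind of underrepresentation the abstract describes.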

Original language	English
Publication status	Published - 17 Apr 2023
Event	27th Annual International Conference on Research in Computational Molecular Biology: RECOMB 2023 - Istanbul Marriott Hotel Sisli, Istanbul, Turkey. Duration: 16 Apr 2023 → 19 Apr 2023. http://recomb2023.bilkent.edu.tr/index.html

Conference

Conference	27th Annual International Conference on Research in Computational Molecular Biology
Abbreviated title	RECOMB 2023
Country/Territory	Turkey
City	Istanbul
Period	16/04/23 → 19/04/23
Internet address	http://recomb2023.bilkent.edu.tr/index.html

Keywords

repeats
multimappers
bias
Functional Analysis

Cite this

@conference{d71968dc38e94af69980c19a2a67918d,

title = "Standard high-throughput functional analysis pipelines omit specific repetitive genetic groups",

keywords = "repeats, multimappers, bias, Functional Analysis",

author = "{Almeida da Paz}, Michelle and Sarah Warger and Leila Taher",

year = "2023",

month = apr,

day = "17",

language = "English",

note = "27th Annual International Conference on Research in Computational Molecular Biology : RECOMB 2023, RECOMB 2023 ; Conference date: 16-04-2023 Through 19-04-2023",

url = "http://recomb2023.bilkent.edu.tr/index.html",

}

TY - CONF

T1 - Standard high-throughput functional analysis pipelines omit specific repetitive genetic groups

AU - Almeida da Paz, Michelle

AU - Warger, Sarah

AU - Taher, Leila

PY - 2023/4/17

Y1 - 2023/4/17

KW - repeats

KW - multimappers

KW - bias

KW - Functional Analysis

M3 - Poster

T2 - 27th Annual International Conference on Research in Computational Molecular Biology

Y2 - 16 April 2023 through 19 April 2023

ER -
