提取基因组中的内含子、外显子以及基因间区

大家好，欢迎来到IT知识分享网。

文章主要内容来自：https://www.jieandze1314.com/post/cnposts/165/ ：仅供本人学习参考时间仓促有不少内容没有细细整理大家可以点链接看原文

前言

DaveTang的这篇博客更新于2014年，那时转录组测序很火热。RNA-Seq往往是把reads比对回基因组或转录组，然后就是对reads进行注释，看看它们落在什么位置，其实定量也是其中一步。这里，作者就带我们看看他是如何获得外显子（exonic）、内含子（intronic）、基因间区（intergenic regions）的坐标的。

首先下载参考基因组注释

#hg38版本 wget -c ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_41/gencode.v41.annotation.gz zcat gencode.v41.annotation.gtf.gz | head -n6 description: evidence-based annotation of the human genome (GRCh38), version 41 (Ensembl 107) provider: GENCODE contact:  format: gtf date: 2022-05-12 chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2";

其中的第三列说明了feature的类型，其中会有exon、CDS、UTR等等

zcat gencode.v44.annotation.gtf.gz| grep -v "^#" | cut -f3 | sort | uniq -c | sort -k1rn  exon  CDS  UTR  transcript 97009 start_codon 90749 stop_codon 61852 gene 119 Selenocysteine

它们之间的关系，可以看看：http://blog.sciencenet.cn/blog-1271266-792469.html 以及 http://www.dxy.cn/bbs/topic/36728037
UTR（UntranslatedRegions)即非翻译区，是信使RNA（mRNA）分子两端的非编码片段，UTR在DNA序列中算是外显子;5’-UTR从mRNA起点的甲基化鸟嘌呤核苷酸帽延伸至AUG起始密码子，3’-UTR从编码区末端的终止密码子延伸至多聚A尾巴（Poly-A）的末端；
基因间区是每个基因之间的间隔序列，不属于外显子也不属于内含子，它是non coding；

【图片中描述的过程：基因经过转录形成 pre-mRNA，这里面包含着内含子和外显子（5端是一个外显子，但是这段外显子不仅包含CDS，还包含5’ UTR；3端也是以外显子结束，但它也是不仅包含CDS，还包含3’ UTR），经过剪接形成成熟mRNA,内含子已减掉。如果抛开后来加上去的cap和poly A的话，这时全是外显子，但是不全是CDS，因为只有中间的那部分以起始密码子开始、以终止密码子结束的片段才是CDS，只有这部分才会被翻译成蛋白质】

获得外显子、内含子以及基因间区

外显子的获得 => mergeBed

从GTF文件中其实已经能找到exon的踪迹，但是同一个基因下的各个exon可能会有交集，于是这里作者在统计时，将存在重叠的外显子融合在一起（这里只是采用最简单的方法进行探索，但不是最准确的）

使用bedtools的mergeBed功能，但前提是先使用sortBed将坐标进行排序

# 首先挑出exon的起始和终止位点 zcat gencode.v33.annotation.gtf.gz | awk 'BEGIN{OFS="\t";} $3=="exon" {print $1,$4-1,$5}' | head -n5 chr1 11868 12227 chr1 12612 12721 chr1 13220 14409 chr1 12009 12057 chr1 12178 12227 # 然后进行排序，接着进行merge zcat gencode.v33.annotation.gtf.gz | awk 'BEGIN{OFS="\t";} $3=="exon" {print $1,$4-1,$5}' | sortBed | mergeBed -i - | gzip >gencode.v33.exon.annotation.gtf.gz zcat gencode.v33.exon.annotation.bed.gz | head -n5 chr1 11868 12227 chr1 12612 12721 chr1 12974 13052 chr1 13220 14501 chr1 15004 15038

小tip：上面👆的awk命令中出现了一个{print $1,$4-1,$5}，其中$4是指的GTF的第4列，即起始位点。它要减一是因为要从GTF变为BED，而它们的坐标系统不一致：GTF has 1-based start coordinates, and BED has 0-based start coordinates

内含子的获得 => subtractBed

# 选择GTF中gene的feature，然后排除merged exon zcat gencode.v33.annotation.gtf.gz | awk 'BEGIN{OFS="\t";} $3=="gene" {print $1,$4-1,$5}' | sortBed | subtractBed -a stdin -b gencode.v33.exon.annotation.bed.gz | gzip >gencode.v33.intron.annotation.bed.gz zcat gencode.v33.intron.annotation.bed.gz | head -n5 # 结果可以和上面👆的exon结果对照 chr1 12227 12612 chr1 12721 12974 chr1 13052 13220 chr1 14501 15004 chr1 15038 15795

基因间区的获得 => complementBed

需要先获得整个基因组各个染色体的长度，然后从中排除掉gene

# 各个染色体的长度(在bedtools complementBed的帮助文档中) mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg38.chromInfo" >hg38.genome # 去掉hg38.genome第一行注释：chrom size sed -i '1d' hg38.genome # 排除gene zcat gencode.v33.annotation.gtf.gz | awk 'BEGIN{OFS="\t";} $3=="gene" {print $1,$4-1,$5}' | sortBed | complementBed -i stdin -g hg38.genome | gzip >gencode.v33.intergenic.annotation.bed.gz

但这里报了个错：Error: Sorted input specified, but the file stdin has the following record with a different sort order than the genomeFile hg38.genome
实际上问题就是因为我们的sort bed后的chrxx不是按照数字排序的（http://asearchforsolutions.blogspot.com/2018/11/error-sorted-input-specified-but-file.html

zcat gencode.v33.annotation.gtf.gz | awk 'BEGIN{OFS="\t";} $3=="gene" {print $1,$4-1,$5}' | sortBed | cut -f1 | uniq chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 # 而这里的hg38.genome却是拍好顺序的 cat hg38.genome | cut -f1 | uniq | head chr1 chr2 chr3 chr4 chr5 chr6 chr7 chrX chr8 chr9 # 因此我们“恃强凌弱”，将hg38.genome改成和sortBed后一致的（相比之下hg38.genome更容易修改），也不按照字母排序 cat hg38.genome| sort -k1.4 >hg38-2.genome head hg38-2.genome chr1 chr10 chr10_GLv1_alt chr10_GLv1_alt chr10_KIv1_alt chr10_KIv1_alt chr10_KNv1_fix chr10_KNv1_fix chr10_KNv1_fix chr10_KNv1_fix

于是，继续进行代码，就没有问题啦：

zcat gencode.v33.annotation.gtf.gz | awk 'BEGIN{OFS="\t";} $3=="gene" {print $1,$4-1,$5}' | sortBed | complementBed -i stdin -g hg38-2.genome | gzip >gencode.v33.intergenic.annotation.bed.gz zcat gencode.v33.intergenic.annotation.bed.gz | head -n5 chr1 0 11868 chr1 31109 34553 chr1 36081 52472 chr1 53312 57597 chr1 64116 65418

到此便成功提取基因组中的内含子、外显子以及基因区间

获得对应的.fa文件

Tool: bedtools getfasta (aka fastaFromBed) Version: v2.30.0 Summary: Extract DNA sequences from a fasta file based on feature coordinates. Usage: bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf> Options: -fi Input FASTA file -fo Output file (opt., default is STDOUT -bed BED/GFF/VCF file of ranges to extract from -fi -name Use the name field and coordinates for the FASTA header -name+ (deprecated) Use the name field and coordinates for the FASTA header -nameOnly Use the name field for the FASTA header -split Given BED12 fmt., extract and concatenate the sequences from the BED "blocks" (e.g., exons) -tab Write output in TAB delimited format. -bedOut Report extract sequences in a tab-delimited BED format instead of in FASTA format. - Default is FASTA format. -s Force strandedness. If the feature occupies the antisense, strand, the sequence will be reverse complemented. - By default, strand information is ignored. -fullHeader Use full fasta header. - By default, only the word before the first space or tab is used. -rna The FASTA is RNA not DNA. Reverse complementation handled accordingly.

在提取基因间区序列过程中出现如下问题

WARNING. chromosome (chr10_GLv1_alt) was not found in the FASTA file. Skipping. WARNING. chromosome (chr10_GLv1_alt) was not found in the FASTA file. Skipping. WARNING. chromosome (chr10_KIv1_alt) was not found in the FASTA file. Skipping. WARNING. chromosome (chr10_KIv1_alt) was not found in the FASTA file. Skipping. WARNING. chromosome (chr10_KNv1_fix) was not found in the FASTA file. Skipping. WARNING. chromosome (chr10_KNv1_fix) was not found in the FASTA file. Skipping. WARNING. chromosome (chr10_KNv1_fix) was not found in the FASTA file. Skipping.

网上找到跟这个序列号相关的文章说是这些是基因碎片，对于提取基因间区、外显子、内含子影响应该不大

免责声明：本站所有文章内容,图片，视频等均是来源于用户投稿和互联网及文摘转载整编而成，不代表本站观点，不承担相关法律责任。其著作权各归其原作者或其出版社所有。如发现本站有涉嫌抄袭侵权/违法违规的内容,侵犯到您的权益，请在线联系站长,一经查实,本站将立刻删除。本文来自网络,若有侵权，请联系删除，如若转载，请注明出处：https://haidsoft.com/146612.html