Using umi_tools

风语者666

Article link: https://blog.csdn.net/u014210048/article/details/108610232

First, two links worth bookmarking:

umi_tools GitHub repository: https://github.com/CGATOxford/UMI-tools

umi_tools official documentation: https://umi-tools.readthedocs.io/en/latest/

1. Extract the UMI into the read name

umi_tools extract --bc-pattern=NNNNNNN --bc-pattern2=NNNNNNN \
        -I ./V300056107_L01_1069_1.fq.gz \
        -S ./test.extract.reconcile.1.pe_1.fq \
        --read2-in=./V300056107_L01_1069_2.fq.gz \
        --read2-out=./test.extract.reconcile.1.pe_2.fq \
        --ignore-read-pair-suffixes

You can also add "--reconcile-pairs", but it is unnecessary when extract is run directly on the raw data.

Error 1.

Read pairs do not match
F300002275L1C001R0030000002/1 != F300002275L1C001R0030000002/2

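This is not a real mismatch: the two names differ only in the /1 and /2 mate suffix. It is a bug in umi_tools.

The --ignore-read-pair-suffixes option tells umi_tools to ignore the "/" and everything after it.

A umi_tools installed from the master branch has the "--ignore-read-pair-suffixes" option.

The problem with my original umi_tools installation turned out to be a mismatch between the tool and the Python version it was running under.
https://github.com/CGATOxford/UMI-tools/issues/431 (the answer from a member of the umi_tools team)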
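To make the suffix handling concrete, here is a minimal Python sketch of the comparison described above: drop the "/" and everything after it, then compare the names. The function names are hypothetical and this is not the umi_tools implementation, only an illustration of why the two names in Error 1 belong to the same pair.

```python
def strip_pair_suffix(read_name: str) -> str:
    """Drop the first '/' and everything after it from a FASTQ read name."""
    return read_name.partition("/")[0]


def same_pair(name1: str, name2: str) -> bool:
    """Return True when two read names refer to the same fragment."""
    return strip_pair_suffix(name1) == strip_pair_suffix(name2)


r1 = "F300002275L1C001R0030000002/1"
r2 = "F300002275L1C001R0030000002/2"
print(r1 == r2)           # the raw names differ only in the mate suffix
print(same_pair(r1, r2))  # they agree once the suffix is ignored
```

The same check applied to the two names in Error 2 still fails, because those names differ before any suffix.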

Error 2. If you see:

Read pairs do not match
F300002275L1C001R0030000005 != F300002275L1C001R0030000013

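This means the reads in fq1 and fq2 are not in one-to-one correspondence, so the read names do not line up.

The output files contain only the processed results for reads that pair up; everything that does not pair is written to standard error.

In this case, add "--reconcile-pairs", which allows the read names in read1 and read2 to differ.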
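Before reaching for "--reconcile-pairs", it can help to confirm where the two files drift apart. The sketch below is a hypothetical diagnostic helper (not part of umi_tools) that walks two FASTQ streams in parallel and reports the first position at which the read names differ. It assumes plain 4-line FASTQ records; the read names in the demo are made up.

```python
import io
from itertools import zip_longest


def read_names(handle):
    """Yield the read name (without '@') from every 4-line FASTQ record."""
    for i, line in enumerate(handle):
        header = line.strip()
        if i % 4 == 0 and header:
            yield header[1:].split()[0]


def first_mismatch(fq1, fq2):
    """Return (index, name1, name2) of the first out-of-sync pair, or None."""
    pairs = zip_longest(read_names(fq1), read_names(fq2))
    for index, (name1, name2) in enumerate(pairs):
        if name1 != name2:
            return index, name1, name2
    return None


# Made-up records: read r2 is missing from the second file.
fq1 = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nACGT\n+\nIIII\n")
fq2 = io.StringIO("@r1\nTTTT\n+\nIIII\n@r3\nTTTT\n+\nIIII\n")
print(first_mismatch(fq1, fq2))
```

For real data, the handles could come from `gzip.open(path, "rt")`. Names with /1 and /2 suffixes would need the suffix removed before comparison.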

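Problem 3: umi_tools runs too slowly:
https://github.com/CGATOxford/UMI-tools/issues/257

The fix is to leave the ".gz" suffix off the output file names.

In my test, the run took about 8 hours with the gz suffix and roughly 40 minutes without it.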
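If compressed files are still wanted on disk, one option is to let umi_tools write plain FASTQ and compress afterwards as a separate step. The helper below is a small sketch using only the Python standard library; `gzip_file` is a hypothetical name, and whether this is faster overall on a given system is an assumption to verify.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(path, keep_original=False):
    """Compress path to path + '.gz' and return the new path."""
    src = Path(path)
    dst = src.with_name(src.name + ".gz")
    # Stream the file through gzip in chunks so large FASTQ files
    # never have to fit in memory.
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    if not keep_original:
        src.unlink()
    return dst
```

It would be called on the uncompressed outputs once extract has finished, for example `gzip_file("./test.extract.reconcile.1.pe_1.fq")`.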

=====================================================================================

I started out with python2, then switched to python3.7, and finally settled on python3.6.

CTTGCAAGTCACTAGAGTGGTGCAGCCTATTTTTTAAAAGTCGTGTGTGTCCTCTTACCCAGTACTTCCTCTTCATATGCACCTTCCGCGCTGCTACAGC
CTTGCATTTACTGCAGGGGAAATAGTTGACATAAAGATGTACTTGCGTATTAGGCACTCCGATTTCAAAGATTTACTCGTATATTGGTCAAAGATATACT

@F300002275L1C001R0030000002/1_CTTGCAACTTGCAT
GTCACTAGAGTGGTGCAGCCTATTTTTTAAAAGTCGTGTGTGTCCTCTTACCCAGTACTTCCTCTTCATATGCACCTTCCGCGCTGCTACAGC

@F300002275L1C001R0030000002/2_CTTGCAACTTGCAT
TTACTGCAGGGGAAATAGTTGACATAAAGATGTACTTGCGTATTAGGCACTCCGATTTCAAAGATTTACTCGTATATTGGTCAAAGATATACT
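The reads above show the effect of the two NNNNNNN patterns: the first 7 bases of each mate are removed from the sequence, concatenated, and appended to both read names. The following Python sketch reproduces that transformation on the first 20 bases of the two raw reads. It is an illustration of the behavior seen in this output, not the umi_tools code; `extract_pair` is a hypothetical name, and the quality strings (which would be trimmed the same way) are left out.

```python
UMI_LEN = 7  # length of the NNNNNNN pattern used for both reads


def extract_pair(name, seq1, seq2, umi_len=UMI_LEN):
    """Move the leading UMI bases of both mates into the read names."""
    umi = seq1[:umi_len] + seq2[:umi_len]
    return (
        (f"{name}/1_{umi}", seq1[umi_len:]),
        (f"{name}/2_{umi}", seq2[umi_len:]),
    )


# First 20 bases of the two raw reads shown above.
read1 = "CTTGCAAGTCACTAGAGTGG"
read2 = "CTTGCATTTACTGCAGGGGA"

for new_name, new_seq in extract_pair("F300002275L1C001R0030000002", read1, read2):
    print(new_name, new_seq)
```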

2. Filtering low-quality and N-containing reads, mapping, sort, index: omitted

3. dedup (similar to Picard's MarkDuplicates)

umi_tools dedup -I ./test.coordinate.sort.bam \
        --output-stats=deduplicated \
        -S ./test.marked_duplicates.directional.bam \
        --method directional --paired

Note: --method accepts 5 values; the authors consider directional the best: https://github.com/CGATOxford/UMI-tools/issues/435

--paired is essential for PE reads; without it, the output contains only one mate of each pair instead of both.
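For intuition about what --method directional does, here is a simplified Python sketch of the grouping rule as I understand it: at one mapping position, UMI a is linked to UMI b when they differ by a single base and count(a) >= 2 * count(b) - 1, and each connected group, seeded from the most abundant UMI, is treated as one original molecule. This is an illustrative reimplementation under those assumptions, not the umi_tools source, and the UMI counts are made up.

```python
from collections import deque


def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def directional_groups(counts):
    """Group UMIs at one mapping position using the directional rule.

    counts maps each UMI sequence to the number of reads carrying it.
    """
    umis = sorted(counts, key=lambda u: (-counts[u], u))
    # Directed edge a -> b: one mismatch apart and a is roughly
    # at least twice as abundant as b.
    edges = {
        a: [b for b in umis
            if a != b and hamming(a, b) == 1
            and counts[a] >= 2 * counts[b] - 1]
        for a in umis
    }
    seen, groups = set(), []
    for root in umis:  # the most abundant UMIs seed the groups first
        if root in seen:
            continue
        seen.add(root)
        group, queue = [root], deque([root])
        while queue:
            for nxt in edges[queue.popleft()]:
                if nxt not in seen:
                    seen.add(nxt)
                    group.append(nxt)
                    queue.append(nxt)
        groups.append(group)
    return groups


# Hypothetical UMI counts for a single mapping position.
example = {"ATAT": 10, "ATAA": 2, "GTAA": 1, "CCCC": 5}
print(directional_groups(example))
```

In this toy input the low-count UMIs ATAA and GTAA are absorbed into the ATAT group as likely sequencing errors, so two groups remain.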

Fix for slow dedup: https://github.com/CGATOxford/UMI-tools/issues/340

"The biggest thing you can do here to improve things is not generate the stats" (in other words, drop the --output-stats=deduplicated option).

Comparison of results:

coordinate.sort.bam 38078276 #before_dedup
marked_duplicates.adjacency.bam 35220427 #adjacency
marked_duplicates.cluster.bam 35219515 #cluster
marked_duplicates.directional.bam 35219616 #directional, recommended by a member of the author team
marked_duplicates.percentile.bam 35483003 #percentile
marked_duplicates.unique.bam 35483001 #unique
marked_duplicates.picard.bam 34350858 #picard
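The counts are easier to compare as fractions removed relative to the input BAM. The short Python sketch below does that arithmetic using only the numbers listed above; the variable and function names are illustrative.

```python
before = 38078276  # coordinate.sort.bam, before dedup

after = {
    "adjacency": 35220427,
    "cluster": 35219515,
    "directional": 35219616,
    "percentile": 35483003,
    "unique": 35483001,
    "picard": 34350858,
}


def removed_fraction(kept, total=before):
    """Fraction of the input that the method removed as duplicates."""
    return (total - kept) / total


# List the methods from most to least aggressive.
for method, kept in sorted(after.items(), key=lambda kv: kv[1]):
    print(f"{method:<12} kept={kept:>9}  removed={removed_fraction(kept):.2%}")
```

On these numbers, picard removes the most and percentile and unique remove the least, with the three network-based methods (adjacency, cluster, directional) in between and nearly identical to each other.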

UMI pattern settings:

In the default basic mode we specify the location of barcodes and UMIs in a read using a simple string. Ns specify the location of bases to be treated as UMIs, Cs as bases to be treated as cell barcode and Xs as bases that are neither and that should be retained on the read. By default this pattern is applied to the 5' end of the read, but we can tell extract to look on the 3' end of the read using --3-prime.

https://umi-tools.readthedocs.io/en/latest/regex.html
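The N/C/X convention quoted above can be illustrated with a few lines of Python. This is a sketch of the basic (string) mode as described in that paragraph, not the umi_tools implementation; `apply_pattern` and its `three_prime` argument are hypothetical names, and the example read is made up.

```python
def apply_pattern(pattern, seq, three_prime=False):
    """Split seq into (umi, cell_barcode, remaining_read) using an N/C/X pattern."""
    if not pattern:
        return "", "", seq
    n = len(pattern)
    if len(seq) < n:
        raise ValueError("read is shorter than the barcode pattern")
    # The pattern covers the first n bases, or the last n with three_prime.
    if three_prime:
        window, rest = seq[-n:], seq[:-n]
    else:
        window, rest = seq[:n], seq[n:]
    umi = cell = kept = ""
    for symbol, base in zip(pattern, window):
        if symbol == "N":
            umi += base    # UMI base
        elif symbol == "C":
            cell += base   # cell barcode base
        elif symbol == "X":
            kept += base   # neither; stays on the read
        else:
            raise ValueError(f"unexpected pattern character: {symbol!r}")
    remaining = rest + kept if three_prime else kept + rest
    return umi, cell, remaining


print(apply_pattern("NNNXXCC", "ACGTTGCAAA"))
print(apply_pattern("NNN", "ACGTTGCAAA", three_prime=True))
```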