Bioinformatics Data Skills by O'Reilly: Study Notes 6

This chapter covers retrieving bioinformatics data using wget, curl, rsync, and scp. It also discusses data integrity with SHA and MD5 checksums, comparing data differences, and managing compressed data with gzip. The text emphasizes the importance of data integrity and provides examples for working with compressed files.

Chapter 6 Bioinformatics Data

Retrieving Bioinformatics Data
Downloading Data with wget and curl

Two common command-line programs for downloading data from the Web are wget and curl. Depending on your system, these may not already be installed; you'll have to install them with a package manager (e.g., Homebrew or apt-get).
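A minimal install sketch, assuming Homebrew on OS X or apt-get on Ubuntu/Debian (package names are the standard ones):

$ brew install wget                # OS X with Homebrew; curl ships with the system
$ sudo apt-get install wget curl   # Ubuntu/Debian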
1. wget
wget is useful for quickly downloading a file from the command line—for example, human chromosome 22 from the GRCh37 (also known as hg19) assembly version:

$ wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/chr22.fa.gz
--2013-06-30 00:15:45-- http://[...]/goldenPath/hg19/chromosomes/chr22.fa.gz
Resolving hgdownload.soe.ucsc.edu... 128.114.119.163
Connecting to hgdownload.soe.ucsc.edu|128.114.119.163|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11327826 (11M) [application/x-gzip]
Saving to: ‘chr22.fa.gz’
17% [======> ] 1,989,172 234KB/s eta 66s

wget can also handle FTP links (which start with "ftp," short for File Transfer Protocol). Another of wget's strengths is recursive downloading, which retrieves all files linked from a page. For example:

$ wget --accept "*.gtf" --no-directories --recursive --no-parent \
http://genomics.someuniversity.edu/labsite/annotation.html

But beware! wget's recursive downloading can be quite aggressive. If not constrained, wget will download everything it can reach within the maximum depth set by --level. In the preceding example, we limited wget in two ways: with --no-parent to prevent wget from downloading pages higher in the directory structure, and with --accept "*.gtf", which only allows wget to download filenames matching this pattern.
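A minimal sketch (same hypothetical lab URL as above) that also caps the recursion depth explicitly with wget's --level option:

$ wget --accept "*.gtf" --no-directories --recursive --no-parent \
    --level 1 http://genomics.someuniversity.edu/labsite/annotation.html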
2. curl
curl behaves similarly, although by default it writes the file to standard output:

$ curl http://[...]/goldenPath/hg19/chromosomes/chr22.fa.gz > chr22.fa.gz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
14 10.8M 14 1593k 0 0 531k 0 0:00:20 0:00:02 0:00:18 646k

curl has the advantage that it can transfer files using more protocols than wget, including SFTP (secure FTP) and SCP (secure copy). One especially nice feature of curl is that it can follow page redirects if the -L/--location option is enabled.
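For example, a quick sketch that follows redirects (-L) and saves the file under its remote name (-O); both are standard curl options:

$ curl -L -O http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/chr22.fa.gz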

Rsync and Secure Copy (scp)

rsync and scp are useful for synchronizing entire directories across a network.
1. rsync
Syntax: rsync source destination
The most common combination of rsync options used to copy an entire directory is -avz. The option -a enables rsync's archive mode, -z enables file transfer compression, and -v makes rsync's progress more verbose so you can see what's being transferred. Because we'll be connecting to the remote host through SSH, we also need to use -e ssh. Our directory copying command would look as follows:

$ rsync -avz -e ssh zea_mays/data/ vinceb@[...]:/home/deborah/zea_mays/data
building file list ... done
zmaysA_R1.fastq
zmaysA_R2.fastq
zmaysB_R1.fastq
zmaysB_R2.fastq
zmaysC_R1.fastq
zmaysC_R2.fastq
sent 2861400 bytes received 42 bytes 107978.94 bytes/sec
total size is 8806085 speedup is 3.08

Note the trailing "/" in zea_mays/data/: it tells rsync to copy the contents of this directory into the destination directory, rather than the directory itself.
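Conversely, omitting the trailing slash copies the directory itself into the destination. A sketch of the distinction (same hosts as above); this variant creates /home/deborah/zea_mays/data on the remote side:

$ rsync -avz -e ssh zea_mays/data vinceb@[...]:/home/deborah/zea_mays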
2. scp (Secure Copy)
scp copies files over an SSH connection:

$ scp Zea_mays.AGPv3.20.gtf 192.168.237.42:/home/deborah/zea_mays/data/
Zea_mays.AGPv3.20.gtf 100% 55 0.1KB/s 00:00
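scp can also copy entire directories if given the -r (recursive) flag; a quick sketch with the same destination:

$ scp -r zea_mays/data 192.168.237.42:/home/deborah/zea_mays/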
Data Integrity

1. SHA and MD5 Checksums

Checksums are small summaries computed from data: even a single-character difference in the input yields a completely different checksum, as the two shasum calls below demonstrate.

$ echo "bioinformatics is fun" | shasum
f9b70d0d1b0a55263f1b012adab6abf572e3030b -
$ echo "bioinformatic is fun" | shasum
e7f33eedcfdc9aef8a9b4fec07e58f0cf292aa67 -
$ shasum Csyrichta_TAGGACT_L008_R1_001.fastq
fea7d7a582cdfb64915d486ca39da9ebf7ef1d83 Csyrichta_TAGGACT_L008_R1_001.fastq
$ shasum data/*fastq > fastq_checksums.sha
$ cat fastq_checksums.sha
524d9a057c51b1[...]d8b1cbe2eaf92c96a9 data/Csyrichta_TAGGACT_L008_R1_001.fastq
d2940f444f00c7[...]4f9c9314ab7e1a1b16 data/Csyrichta_TAGGACT_L008_R1_002.fastq
623a4ca571d572[...]1ec51b9ecd53d3aef6 data/Csyrichta_TAGGACT_L008_R1_003.fastq
f0b3a4302daf7a[...]7bf1628dfcb07535bb data/Csyrichta_TAGGACT_L008_R1_004.fastq
53e2410863c36a[...]4c4c219966dd9a2fe5 data/Csyrichta_TAGGACT_L008_R1_005.fastq
e4d0ccf541e90c[...]5db75a3bef8c88ede7 data/Csyrichta_TAGGACT_L008_R1_006.fastq
$ shasum -c fastq_checksums.sha
data/Csyrichta_TAGGACT_L008_R1_001.fastq: OK
data/Csyrichta_TAGGACT_L008_R1_002.fastq: OK
data/Csyrichta_TAGGACT_L008_R1_003.fastq: OK
data/Csyrichta_TAGGACT_L008_R1_004.fastq: OK
data/Csyrichta_TAGGACT_L008_R1_005.fastq: OK
data/Csyrichta_TAGGACT_L008_R1_006.fastq: FAILED
shasum: WARNING: 1 computed checksum did NOT match
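The MD5 analogs work the same way; a minimal sketch using GNU md5sum (on OS X, the command is md5), assuming the same data/ directory:

$ md5sum data/*fastq > fastq_checksums.md5
$ md5sum -c fastq_checksums.md5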
Looking at Differences Between Data

Unix's diff works line by line, and outputs blocks (called hunks) that differ between files (resembling Git's git diff command we saw in Chapter 5).

$ diff -u gene-1.bed gene-2.bed

The option -u tells diff to output in unified diff format.
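Since the original screenshots are unavailable here, a minimal reconstruction with made-up BED lines shows what unified diff output looks like (timestamps after the filenames omitted):

$ printf 'chr1\t100\t200\nchr1\t300\t400\n' > gene-1.bed
$ printf 'chr1\t100\t200\nchr1\t300\t500\n' > gene-2.bed
$ diff -u gene-1.bed gene-2.bed
--- gene-1.bed
+++ gene-2.bed
@@ -1,2 +1,2 @@
 chr1	100	200
-chr1	300	400
+chr1	300	500

Lines prefixed with a space are shared context; lines prefixed with - and + are the old and new versions of a changed line.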
Be cautious when running diff on large datasets: comparing large files line by line can be very slow.

Compressing Data and Working with Compressed Data

The two most common compression systems used on Unix are gzip and bzip2. Both have their advantages: gzip compresses and decompresses data faster than bzip2, but bzip2 has a higher compression ratio (a large FASTQ file discussed earlier in the book is only about 16 GB when compressed with bzip2). Generally, gzip is used in bioinformatics to compress most sizable files, while bzip2 is more common for long-term data archiving. We'll focus primarily on gzip, but bzip2's tools behave very similarly to gzip.

gzip

Suppose we have an imaginary program called trimmer that removes low-quality bases from FASTQ files. trimmer can handle gzipped input files natively, but writes uncompressed trimmed FASTQ results to standard output. Using gzip, we can compress trimmer's output in the pipeline, before it is written to disk:

$ trimmer in.fastq.gz | gzip > out.fastq.gz

Here gzip reads from standard input because it was given no file arguments. gzip and gunzip can also compress and decompress files on disk directly:

$ ls
in.fastq
$ gzip in.fastq
$ ls
in.fastq.gz
$ gunzip in.fastq.gz
$ ls
in.fastq

When run without options, gzip and gunzip both replace the source file in place, as the listings above show. With -c, the result is written to standard output instead, leaving the original file untouched:

$ gzip -c in.fastq > in.fastq.gz
$ gunzip -c in.fastq.gz > duplicate_in.fastq

Compressed output can also be appended to (or, with a single >, overwrite) an existing file:

$ ls
in.fastq.gz in2.fastq
$ gzip -c in2.fastq >> in.fastq.gz

Also, note that gzip does not keep the compressed files separate: files compressed into the same file are simply concatenated, so decompressing in.fastq.gz above yields one file containing the contents of both inputs. If you need to compress multiple separate files into a single archive, use the tar utility (see the examples section of man tar for details).
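A quick tar sketch (filenames are illustrative) that bundles multiple files into one gzip-compressed archive, keeping them separate inside it:

$ tar -czvf fastq_files.tar.gz in.fastq in2.fastq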

Working with Gzipped Compressed Files

For example, we can search compressed files using grep’s analog for gzipped files, zgrep. Likewise, cat has zcat (on some systems like OS X, this is gzcat), diff has zdiff, and less has zless. If programs cannot handle compressed input, you can use zcat and pipe output directly to the standard input of another program.
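For instance, head has no gzip analog, but piping zcat's output into it lets us peek at the first FASTQ record of a compressed file:

$ zcat in.fastq.gz | head -n 4

And zgrep works directly on the compressed file: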

$ zgrep --color -i -n "AGATAGAT" Csyrichta_TAGGACT_L008_R1_001.fastq.gz
2706: ACTTCGGAGAGCCCATATATACACACTAAGATAGATAGCGTTAGCTAATGTAGATAGATT

There can be a slight performance cost in working with gzipped files, as your CPU must decompress input first.

An exercise:
For this example, we'll download the GRCm38 mouse reference genome and its accompanying annotation. The "GRC" prefix refers to the Genome Reference Consortium. We can download GRCm38 from Ensembl (a member of the consortium) using wget:

$ wget  ftp://ftp.ensembl.org/pub/current_fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.toplevel.fa.gz
--2019-08-26 18:56:25--  ftp://ftp.ensembl.org/pub/current_fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.toplevel.fa.gz
          => 'GRCm38_to_NCBIM37.chain.gz'
Resolving ftp.ensembl.org... 193.62.193.8
Connecting to ftp.ensembl.org|193.62.193.8|:21... failed: Connection timed out.
Retrying.

The connection to ftp.ensembl.org timed out. A Baidu search suggested this may be a firewall problem; see this post for reference: https://yq.aliyun.com/articles/475066/

Take a quick peek at all the sequence headers:

$ zgrep "^>" Mus_musculus.GRCm38.74.dna.toplevel.fa.gz | less

Checksum

Ensembl provides a CHECKSUMS file for each directory, computed with the older Unix sum command. Download it and run sum on the downloaded genome:

$ wget ftp://ftp.ensembl.org/pub/release-74/fasta/mus_musculus/dna/CHECKSUMS
$ sum Mus_musculus.GRCm38.74.dna.toplevel.fa.gz
53504 793314
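To compare against the expected value, pull the matching line out of CHECKSUMS (assuming the usual Ensembl layout, where each line pairs sum output with a filename):

$ grep "toplevel" CHECKSUMS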