fastq相关

最新推荐文章于 2023-08-20 19:31:58 发布

weixin_34077371

最新推荐文章于 2023-08-20 19:31:58 发布

阅读量1.4k

点赞数

文章标签：数据库 python ruby

原文链接：https://my.oschina.net/u/3732258/blog/1843795

版权

2019独角兽企业重金招聘Python工程师标准>>>

Sequence data formats

1. Common sequence data formats including GenBank, FASTA, FASTQ formats. GenBank and FASTA format often represent curated sequencing information. FASTQ often represent experimentally obtained data.

(1) GenBank file format

GenBank is part of the International Nucleotide Sequence Database Collaboration , which comprises the DNA DataBank of Japan (DDBJ), the European Nucleotide Archive (ENA), and GenBank at NCBI. These three organizations exchange data on a daily basis.
More information on GenBank format can be found here

When do we use the GenBank format?

GenBank format can represent variety of information while keeping this information human-readable. It is not suitable for data-analysis.

(2) FASTA format

在生物信息学中，FASTA格式是一种用于记录核酸序列或肽序列的文本格式，其中的核酸或氨基酸均以单个字母编码呈现。该格式同时还允许在序列之前定义名称和编写注释。这一格式最初由FASTA软件包定义，但现今已是生物信息学领域的一项标准。
FASTA简明的格式降低了序列操纵和分析的难度，令序列可被文本处理工具和诸如Python、Ruby和Perl等脚本语言处理。
FASTA is a DNA sequence format for specifying or representing DNA sequences. It does not contain sequence quality information.
Reference: Wikipedia FASTA格式

(3) FASTQ file format

FASTQ is extended FASTA file format with sequencing quality score (phred score).
Please refer to the following references:

fasta与fastq格式文件解读
Wikipedia FASTQ格式 (Simplified Chinese) or FASTQ format (English)
FASTQ文件中，一个序列通常由四行组成：
第一行以@开头，之后为序列的标识符以及描述信息（与FASTA格式的描述行类似）
第二行为序列信息
第三行以+开头，之后可以再次加上序列的标识及描述信息（可选）
第四行为质量得分信息，与第二行的序列相对应，长度必须与第二行相同
The character '!' represents the lowest quality while '~' is the highest. Here are the quality value characters in left-to-right increasing order of quality (ASCII):

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

Further reading:
Differences between FASTA, FASTQ and SAM formats

2. Databases that contain gene sequencing data

NCBI GEO: can search datasets (sequencing data from a series of participants)
NCBI SRA: can search sequencing data from individual participant
ArrayExpress: Experiments are submitted directly to ArrayExpress or are imported from the NCBI Gene Expression Omnibus database. For high-throughput sequencing based experiments the raw data is brokered to the European Nucleotide Archive, while the experiment descriptions and processed data are archived in ArrayExpress.
European Nucleotide Archive: Learn more about how to use ENA by reading ENA: Guidelines and Tips.

I prefer NCBI GEO and SRA because I can use Aspera to download SRA files, which is super fast. It's best to keep Aspera connect software up-to-date.

Install Aspera connect on Ubuntu Linux

mkdir -p ~/biosoft/ascp && cd ~/biosoft/ascp
wget https://download.asperasoft.com/download/sw/connect/3.7.4/aspera-connect-3.7.4.147727-linux-64.tar.gz
tar -zxvf aspera-connect-3.7.4.147727-linux-64.tar.gz
bash aspera-connect-3.7.4.147727-linux-64.sh
# Installing Aspera Connect
# Deploying Aspera Connect (/home/jshi/.aspera/connect) for the current user only.
# Unable to update desktop database, Aspera Connect may not be able to auto-launch
# Restart firefox manually to load the Aspera Connect plug-in
# Install complete.
# construct soft link
sudo ln -s /home/jshi/.aspera/connect/bin/ascp /usr/bin/ascp
ascp -h # help
ascp -A # version

If you have older version, you need to uninstall before you install newer version of Aspera. Actually, you need to delete related files in the following folder:

# ~/.mozilla/plugins/libnpasperaweb.so
# ~/.aspera/connect
rm ~/.mozilla/plugins/libnpasperaweb_{connect build #}.so
yes|rm -rf ~/.aspera/connect

3. How to download SRA files from NCBI SRA database?

According to SRA group, they recommand Prefetch program provided in SRAtoolkit. More detail can be found in Download Guide.

1. Download SRA files by using prefetch

I don't recommand install SRAtoolkit by using sudo apt-get install sratoolkit because the version might be older. I personally prefer to install the latest softwares.
SRA files will be deposited in the default file folder ~/ncbi/public/sra.

# Install SRAtoolkit
mkdir -p ~/biosoft/sratools && cd ~/biosoft/sratools
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.8.2-1/sratoolkit.2.8.2-1-ubuntu64.tar.gz
tar -zxvf sratoolkit.2.8.2-1-ubuntu64.tar.gz
# You
echo 'export PATH=$PATH:/home/jshi/biosoft/sratools/sratoolkit.2.8.2-1-ubuntu64/bin' >>
~/.bashrc 
source ~/.bashrc

Prefetch can use several different way to download SAR files, the default one is Aspera, if you want prefetch to use only Aspera to download, you can use the following code.

mkdir -p ~/data/project/GSE48240 && cd ~/data/project/GSE48240
# manually generate SRA file list
touch GSE48240.txt
for i in $(seq -w 1 3); do echo "SRR92222""$i" >>GSE48240.txt;done
# Using efetch to generate SRA file list
esearch -db sra -query PRJNA209632 | efetch -format runinfo | cut -f 1 -d ',' |grep SRR >> GSE48240.txt
prefetch -t ascp -a "/usr/bin/ascp|/home/jshi/.aspera/connect/etc/asperaweb_id_dsa.openssh" --option-file GSE48240.txt

Alternatively, you can use curl, wget or ftp to download from generated download links, but will be as slow as snail.

2. Convert SRA files to FASTQ files on the fly

This is a better way if you don't have too much space to save the SRA files. fastq-dump will covert SRA files to fastq files on the fly.

cat GSE48240.txt | xargs -n 1 echo fastq-dump --split-files $1

other

R中修改个别变量名（reshape包）使用names（）函数

names(leadership)

names(leadership)[2] <- “testDate”

names(leadership)[6:10] <-c(“item1”, “item2”, “item3”, “item4”, “item5”)

How do I remove part of a string?
https://stackoverflow.com/questions/9704213/r-remove-part-of-string
gsub
sub_str

# install bioawk
apt-get install bison
cd ~/biosoft
git clone https://github.com/lh3/bioawk
cd bioawk
make
sudo cp bioawk /usr/local/bin

# Download and unzip the file on the fly. 
curl http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz | gunzip -c > chr22.fa

# Look at the file
cat chr22.fa | head -4

# Count how many "N" are in chr22 sequence
cat chr22.fa | grep -o N  | wc -l

# Count how many bases are in Chr22?
cat chr22.fa | bioawk -c fastx '{ print length($seq) }'

-------------------------------------------------------------------------------------------------------

各基因组的对应关系

首先是NCBI对应UCSC，对应ENSEMBL数据库：

GRCh36 (hg18): ENSEMBL release_52.

GRCh37 (hg19): ENSEMBL release_59/61/64/68/69/75.

GRCh38 (hg38): ENSEMBL release_76/77/78/80/81/82.

可以看到ENSEMBL的版本特别复杂！！！很容易搞混！

但是UCSC的版本就简单了，就hg18,19,38, 常用的是hg19，但是我推荐大家都转为hg38

看起来NCBI也是很简单，就GRCh36,37,38，但是里面水也很深！

Feb 13 2014 00:00    Directory April_14_2003
Apr 06 2006 00:00    Directory BUILD.33
Apr 06 2006 00:00    Directory BUILD.34.1
Apr 06 2006 00:00    Directory BUILD.34.2
Apr 06 2006 00:00    Directory BUILD.34.3
Apr 06 2006 00:00    Directory BUILD.35.1
Aug 03 2009 00:00    Directory BUILD.36.1
Aug 03 2009 00:00    Directory BUILD.36.2
Sep 04 2012 00:00    Directory BUILD.36.3
Jun 30 2011 00:00    Directory BUILD.37.1
Sep 07 2011 00:00    Directory BUILD.37.2
Dec 12 2012 00:00    Directory BUILD.37.3

可以看到，有37.1, 37.2， 37.3 等等，不过这种版本一般指的是注释在更新，基因组序列一般不会更新！！！

反正你记住hg19基因组大小是3G，压缩后八九百兆即可！！！

如果要下载GTF注释文件，基因组版本尤为重要！！！

对NCBI：ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/GFF/ ##最新版（hg38）

ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/ARCHIVE/ ## 其它版本

对于ensembl：

ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz

变幻中间的release就可以拿到所有版本信息：ftp://ftp.ensembl.org/pub/

对于UCSC，那就有点麻烦了：

需要选择一系列参数：

http://genome.ucsc.edu/cgi-bin/hgTables

1. Navigate to http://genome.ucsc.edu/cgi-bin/hgTables

2. Select the following options:
clade: Mammal
genome: Human
assembly: Feb. 2009 (GRCh37/hg19)
group: Genes and Gene Predictions
track: UCSC Genes
table: knownGene
region: Select "genome" for the entire genome.
output format: GTF - gene transfer format
output file: enter a file name to save your results to a file, or leave blank to display results in the browser

3. Click 'get output'.

现在重点来了，搞清楚版本关系了，就要下载呀！

UCSC里面下载非常方便，只需要根据基因组简称来拼接url即可：

http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz

http://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/chromFa.tar.gz

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz

http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/chromFa.tar.gz

或者用shell脚本指定下载的染色体号：

for i in $(seq 1 22) X Y M;
do echo $i;
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr${i}.fa.gz;

## 这里也可以用NCBI的：ftp://ftp.ncbi.nih.gov/genomes/M_musculus/ARCHIVE/MGSCv3_Release3/Assembled_Chromosomes/chr前缀
done
gunzip *.gz
for i in $(seq 1 22) X Y M;
do cat chr${i}.fa >> hg19.fasta;
done
rm -fr chr*.fasta

---------------------------------------------------------------------------------------------------------

Tool: fastq-dump

Usage:

fastq-dump [options] <path/file> [<path/file> ...]

fastq-dump [options] <accession>

Frequently Used Options:

General:
-h	\|	--help	Displays ALL options, general usage, and version information.
-V	\|	--version	Display the version of the program.
Data formatting:
		--split-files	Dump each read into separate file. Files will receive suffix corresponding to read number.
		--split-spot	Split spots into individual reads.
		--fasta <[line width]>	FASTA only, no qualities. Optional line wrap width (set to zero for no wrapping).
-I	\|	--readids	Append read id after spot id as 'accession.spot.readid' on defline.
-F	\|	--origfmt	Defline contains only original sequence name.
-C	\|	--dumpcs <[cskey]>	Formats sequence using color space (default for SOLiD). "cskey" may be specified for translation.
-B	\|	--dumpbase	Formats sequence using base space (default for other than SOLiD).
-Q	\|	--offset <integer>	Offset to use for ASCII quality scores. Default is 33 ("!").
Filtering:
-N	\|	--minSpotId <rowid>	Minimum spot id to be dumped. Use with "X" to dump a range.
-X	\|	--maxSpotId <rowid>	Maximum spot id to be dumped. Use with "N" to dump a range.
-M	\|	--minReadLen <len>	Filter by sequence length >= <len>
		--skip-technical	Dump only biological reads.
		--aligned	Dump only aligned sequences. Aligned datasets only; see sra-stat.
		--unaligned	Dump only unaligned sequences. Will dump all for unaligned datasets.
Workflow and piping:
-O	\|	--outdir <path>	Output directory, default is current working directory ('.').
-Z	\|	--stdout	Output to stdout, all split data become joined into single stream.
		--gzip	Compress output using gzip.
		--bzip2	Compress output using bzip2.

Use examples:

fastq-dump -X 5 -Z SRR390728

Prints the first five spots (-X 5) to standard out (-Z). This is a useful starting point for verifying other formatting options before dumping a whole file.

fastq-dump -I --split-files SRR390728

Produces two fastq files (--split-files) containing ".1" and ".2" read suffices (-I) for paired-end data.

fastq-dump --split-files --fasta 60 SRR390728

Produces two (--split-files) fasta files (--fasta) with 60 bases per line ("60" included after --fasta).

fastq-dump --split-files --aligned -Q 64 SRR390728

Produces two fastq files (--split-files) that contain only aligned reads (--aligned; Note: only for files submitted as aligned data), with a quality offset of 64 (-Q 64) Please see the documentation on vdb-dump if you wish to produce fasta/qual data.Possible errors and their solution:

fastq-dump.2.x err: item not found while constructing within virtual database module - the path '<path/SRR*.sra>' cannot be opened as database or table

This error indicates that the .sra file cannot be found. Confirm that the path to the file is correct.

fastq-dump.2.x err: name not found while resolving tree within virtual file system module - failed SRR*.sra

The data are likely reference compressed and the toolkit is unable to acquire the reference sequence(s) needed to extract the .sra file. Please confirm that you have tested and validated the configuration of the toolkit. If you have elected to prevent the toolkit from contacting NCBI, you will need to manually acquire the reference(s) here

--------------------------------------------------------------------------------------------------------

下载流程：

1：wget -i ftp://ftp.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP000/SRP000001/SRR000001/SRR000001.sra

从NCBI官网下载sra数据文件

2：

使用fastq-dump工具将sra转换成双端fastq

转载于:https://my.oschina.net/u/3732258/blog/1843795