Extract lowercase masked FASTA from a BLAST database with masking information

If a BLAST database contains masking information, this can be extracted using the blastdbcmd options –db_mask and –mask_sequence as follows:

$ blastdbcmd -info -db mask-data-db
Database: Mask data test
        10 sequences; 12,609 total residues

Date: Feb 17, 2009  5:10 PM     Longest sequence: 1,694 residues

Available filtering algorithms applied to database sequences:

Algorithm ID  Algorithm name      Algorithm options                       
    20        seg                 default options used                    
    40        repeat              -species Desmodus_rotundus              

Volumes:
        mask-data-db
$ blastdbcmd -db mask-data-db -mask_sequence_with 20 -entry 71022837
>gi|71022837|ref|XP_761648.1| hypothetical protein UM05501.1 [Ustilago maydis 521]
MPPSARHSAHPSHHPHAGGRDLHHAAGGPPPQGGPGMPPGPGNGPMHHPHSSYAQSMPPPPGLPPHAMNGINGPPPSTHG
GPPPRMVMADGPGGAGGPPPPPPPHIPRSSSAQSRIMEAaggpagpppagppastspavQklslANEaawvsIGsaaetm
EdydralsayeaalrhnpysvpalsaiagvhrtldnfekavdyfqrvlnivpengdTWGSMGHCYLMMDDLQRAYTAYQQ
ALYHLPNPKEPKLWYGIGILYDRYGSLEHAEEAFASVVRMDPNYEKANEIYFRLGIIYKQQNKFPASLECFRYILDNPPR
PLTEIDIWFQIGHVYEQQKEFNAAKEAYERVLAENPNHAKVLQQLGWLYHLSNAGFNNQERAIQFLTKSLESDPNDAQSW
YLLGRAYMAGQNYNKAYEAYQQAVYRDGKNPTFWCSIGVLYYQINQYRDALDAYSRAIRLNPYISEVWFDLGSLYEACNN
QISDAIHAYERAADLDPDNPQIQQRLQLLRNAEAKGGELPEAPVPQDVHPTAYANNNGMAPGPPTQIGGGPGPSYPPPLV
GPQLAGNGGGRGDLSDRDLPGPGHLGSSHSPPPFRGPPGTDDRGARGPPHGALAPMVGGPGGPEPLGRGGFSHSRGPSPG
PPRMDPYGRRLGSPPRRSPPPPLRSDVHDGHGAPPHVHGQGHGQGHGQGHGQGHGQGHGQSHGHSHGGEFRGPPPLAAAG
PGGPPPPLDHYGRPMGGPMSEREREMEWEREREREREREQAARGYPASGRITPKNEPGYARSQHGGSNAPSPAFGRPPVY
GRDEGRDYYNNSHPGSGPGGPRGGYERGPGAPHAPAPGMRHDERGPPPAPFEHERGPPPPHQAGDLRYDSYSDGRDGPFR
GPPPGLGRPTPDWERTRAGEYGPPSLHDGAEGRNAGGSASKSRRGPKAKDELEAAPAPPSPVPSSAGKKGKTTSSRAGSP
WSAKGGVAAPGKNGKASTPFGTGVGAPVAAAGVGGGVGSKKGAAISLRPQEDQPDSRPGSPQSRRDASPASSDGSNEPLA
ARAPSSRMVDEDYDEGAADALMGLAGAASASSASVATAAPAPVSPVATSDRASSAEKRAESSLGKRPYAEEERAVDEPED
SYKRAKSGSAAEIEADATSGGRLNGVSVSAKPEATAAEGTEQPKETRTETPPLAVAQATSPEAINGKAESESAVQPMDVD
GREPSKAPSESATAMKDSPSTANPVVAAKASEPSPTAAPPATSMATSEAQPAKADSCEKNNNDEDEREEEEGQIHEDPID
APAKRADEDGAK
$ 

Extract all human sequences from the nr database

Although one cannot select GIs by taxonomy from a database, a combination of unix command line tools will accomplish this:

$ blastdbcmd -db nr -entry all -outfmt "%g %T" | \
   awk ' { if ($2 == 9606) { print $1 } } ' | \
   blastdbcmd -db nr -entry_batch - -out human_sequences.txt

The first blastdbcmd invocation produces 2 entries per sequence (GI and taxonomy ID), the awk command selects from the output of that command those sequences which have a taxonomy ID of 9606 (human) and prints its GIs, and finally the second blastdbcmd invocation uses those GIs to print the sequence data for the human sequences in the nr database.

Custom data extraction and formatting from a BLAST database

The following examples show how to extract selected information from a BLAST database and how to format it:

Extract the accession, sequence length, 
and masked locations for GI 71022837:
$ blastdbcmd -entry 71022837 -db Test/mask-data-db  -outfmt "%a %l %m"
XP_761648.1 1292 119-139;140-144;147-152;154-160;161-216;

Extract different sequence ranges from the BLAST databases

The command below will extract two different sequences: bases 40-80 in human chromosome Y (GI 13626247) with the masked regions in lowercase characters (notice argument 30, the masking algorithm ID which is available in this BLAST database) and bases 1-10 in the minus strand of human chromosome 20 (GI 14772189).

$ printf "%s %s %s %s\n%s %s %s\n" 13626247 40-80 plus 30 14772189 1-10 minus \
| blastdbcmd -db GPIPE/9606/current/all_contig -entry_batch -
>gi|13626247|ref|NT_025975.2|:40-80 Homo sapiens chromosome Y genomic contig, GRCh37.p10 Primary Assembly
tgcattccattctattctcttctACTGCATACAatttcact
>gi|14772189|ref|NT_025215.4|:c10-1 Homo sapiens chromosome 20 genomic contig, GRCh37.p10 Primary Assembly
GCTCTAGATC
$

Display the locations where BLAST will search for BLAST databases

This is accomplished by using the -show_blastdb_search_path option in blastdbcmd:

$ blastdbcmd -show_blastdb_search_path
:/net/nabl000/vol/blast/db/blast1:/net/nabl000/vol/blast/db/blast2:
$

Display the available BLAST databases at a given directory

This is accomplished by using the -list option in blastdbcmd:

$ blastdbcmd -list repeat  -recursive
repeat/repeat_3055 Nucleotide
repeat/repeat_31032 Nucleotide
repeat/repeat_35128 Nucleotide
repeat/repeat_3702 Nucleotide
repeat/repeat_40674 Nucleotide
repeat/repeat_4530 Nucleotide
repeat/repeat_4751 Nucleotide
repeat/repeat_6238 Nucleotide
repeat/repeat_6239 Nucleotide
repeat/repeat_7165 Nucleotide
repeat/repeat_7227 Nucleotide
repeat/repeat_7719 Nucleotide
repeat/repeat_7955 Nucleotide
repeat/repeat_9606 Nucleotide
repeat/repeat_9989 Nucleotide
$ 

The first column of the default output is the file name of the BLAST database (usually provided as the –db argument to other BLAST+ applications), the second column represents the molecule type of the BLAST database. This output is configurable via the list_outfmt command line option.

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值