matlab提取DNA序列

班里一同学做毕业设计,要将从网下载的DNA序列文件进行提取,把里面的外显子提取并保存成单独文件,下载下来的文件就是一个网页的文本内容,她竟然就按上面的标示,一个一个查找,复制,新建,粘贴,重命名。女生真是有耐心,据说连着做了三四天才把六十多个文件提取完成,最后实在做不下去就找我帮忙,本着英雄救美的原则,鄙人就花了大半天时间写了个小程序,实现了手工操作的步骤。

先说下要做什么,从网上下载下来的网页内容如下:

LOCUS       NC_001133             230218 bp    DNA     linear   PLN 14-JUL-2011
DEFINITION  Saccharomyces cerevisiae S288c chromosome I, complete sequence.
ACCESSION   NC_001133
VERSION     NC_001133.9  GI:330443391
DBLINK      Project: 128
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae S288c
  ORGANISM  Saccharomyces cerevisiae S288c
            Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina;
            Saccharomycetes; Saccharomycetales; Saccharomycetaceae;
            Saccharomyces.
REFERENCE   1  (bases 1 to 230218)
  AUTHORS   Goffeau,A., Barrell,B.G., Bussey,H., Davis,R.W., Dujon,B.,
            Feldmann,H., Galibert,F., Hoheisel,J.D., Jacq,C., Johnston,M.,
            Louis,E.J., Mewes,H.W., Murakami,Y., Philippsen,P., Tettelin,H. and
            Oliver,S.G.
  TITLE     Life with 6000 genes
  JOURNAL   Science 274 (5287), 546 (1996)
   PUBMED   8849441
REFERENCE   2  (bases 1 to 230218)
  AUTHORS   Bussey,H., Kaback,D.B., Zhong,W., Vo,D.T., Clark,M.W., Fortin,N.,
            Hall,J., Ouellette,B.F., Keng,T., Barton,A.B. et al.
  TITLE     The nucleotide sequence of chromosome I from Saccharomyces
            cerevisiae
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 92 (9), 3809-3813 (1995)
   PUBMED   7731988
REFERENCE   3  (bases 1 to 230218)
  CONSRTM   NCBI Genome Project
  TITLE     Direct Submission
  JOURNAL   Submitted (25-APR-2011) National Center for Biotechnology
            Information, NIH, Bethesda, MD 20894, USA
REFERENCE   4  (bases 1 to 230218)
  CONSRTM   Saccharomyces Genome Database
  TITLE     Direct Submission
  JOURNAL   Submitted (31-MAR-2011) Department of Genetics, Stanford
            University, Stanford, CA 94305-5120, USA
  REMARK    Sequence update by submitter
REFERENCE   5  (bases 1 to 230218)
  CONSRTM   Saccharomyces Genome Database
  TITLE     Direct Submission
  JOURNAL   Submitted (11-DEC-2009) Department of Genetics, Stanford
            University, Stanford, CA 94305-5120, USA
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence is identical to BK006935.
            On Apr 26, 2011 this sequence version replaced gi:269970293.
            COMPLETENESS: full length.
PRIMARY     REFSEQ_SPAN         PRIMARY_IDENTIFIER PRIMARY_SPAN        COMP
            1-2669              U73805.1           1-2669
            2592-106271         U12980.3           1-103681
            106140-135553       L05146.3           1-29415
            133553-175545       L22015.2           1-41988
            175399-230218       L28920.2           1-54812
FEATURES             Location/Qualifiers
     source          1..230218
                     /organism="Saccharomyces cerevisiae S288c"
                     /mol_type="genomic DNA"
                     /strain="S288c"
                     /db_xref="taxon:559292"
                     /chromosome="I"
     repeat_region   complement(1..801)
                     /note="TEL01L; Telomeric region on the left arm of
                     Chromosome I; composed of an X element core sequence, X
                     element combinatorial repeats, and a short terminal
                     stretch of telomeric repeats;
                     Telomeric region on the left arm of Chromosome I; composed
                     of an X element core sequence, X element combinatorial
                     repeats, and a short terminal stretch of telomeric
                     repeats"
                     /rpt_family="Telomeric Region"
                     /db_xref="SGD:S000028862"
     repeat_region   complement(1..62)
                     /note="TEL01L-TR; Terminal stretch of telomeric repeats on
                     the left arm of Chromosome I;
                     Terminal stretch of telomeric repeats on the left arm of
                     Chromosome I"
                     /rpt_family="Telomeric Repeat"
                     /db_xref="SGD:S000028864"
     repeat_region   complement(63..336)
                     /note="TEL01L-XR; Telomeric X element combinatorial Repeat
                     region on the left arm of Chromosome I; contains repeats
                     of the D, C, B and A types, as well as Tbf1p binding
                     sites; formerly called SubTelomeric Repeats;
                     Telomeric X element combinatorial Repeat region on the
                     left arm of Chromosome I; contains repeats of the D, C, B
                     and A types, as well as Tbf1p binding sites; formerly
                     called SubTelomeric Repeats"
                     /rpt_family="X element Combinatorial Repeats"
                     /db_xref="SGD:S000028866"
     repeat_region   complement(337..801)
                     /note="TEL01L-XC; Telomeric X element Core sequence on the
                     left arm of Chromosome I; contains an ARS consensus
                     sequence, an Abf1p binding site consensus sequence and two
                     small overlapping ORFs (YAL068W-A and YAL069W);
                     Telomeric X element Core sequence on the left arm of
                     Chromosome I; …………

这是开头的文件内容,在最后面是DNA序列,在这些E文中有个CDS字段表示的是DNA中外显子的位置,要把文件中所有标示的CDS段中的序列提取出来,当然还要把所有的DNA序列全部提取到一个单独的文件,CDS段中的数字标示也要全部提取到一个文件中。

任务已明确,手工操作确实难度不大,但却很繁琐,我想先实现提取CDS段再说后面的。

提取CDS段要看字符串是不是“CDS”,若是则提取后面的一串“数字”,后面的数字还是字符串形式,而且有的数字前面还有一串我不认识的英语,先做简单的。

首先读取文件了。

fid=fopen(ORIGINAL_FILE_NAME,'r');          
file_content=fscanf(fid,'%s');      
fclose(fid);
file_length=length(file_content);

将文件读取到file_content中,这里用的fscanf函数,只能读取字符,几乎所有不可见的字符都会忽略,回车换行也读不出来的。

接下来要新建个文件来保存CDS,如下

fid_out=fopen(ALL_CDS,'w');

然后就要开始遍历file_content的内容了,一个字符一个字符的遍历,好麻烦啊,不过在程序里就方便了,

cds_count=0;
xulie=[];
for i=1:file_length
    switch file_content(i)
        case {'O'}
            if(file_content(i:i+5)=='ORIGIN')
                xulie=file_content(i+6:file_length-1);
            end
        case {'C'}
            if(file_content(i:i+2)=='CDS')
                cds_count=cds_count+1;

变量xulie保存全部DNA序列,cds_count记录找到的CDS的个数,找到了CDS就要提取后面对应的数字了。

               tmp1=i+3;
                tmp2=double(file_content(tmp1));
                while(tmp2 ~= 47 && tmp2 ~= 41 )
                    if(isnumber(file_content(tmp1))==1)
                        fprintf(fid_out,'%c',file_content(tmp1));
                    end
                    if((file_content(tmp1))==',')
                        cds_count=cds_count+1;
                        fprintf(fid_out,'+');
                    end
                    tmp1=tmp1+1;
                    tmp2=double(file_content(tmp1));
                end
                fprintf(fid_out,'+');

其中fid_out就是所有cds对应的数保存的文件夹,

提取后的CDS文件内容如下图



这里用“+”表示两段的分割,看着不美观,不过主要不是要它。

提取完要关闭文件。

fclose(fid_out);
fprintf('共有%d个外显子。\n',cds_count);
fprintf('任务1已完成。\n');

接下来要处理提取的所有DNA序列,因为源文件为了便于查找对dna进行了序列标注,这里要将里面的数字删除,并将小写的actg转换成大写,这个功能在上学期写过一个,当时做的就像上面那个,用个循环,一个一个的处理,这次我查了点资料,几句话就搞定了。

string_out=char(xulie);
for tmp1=0:9
    string_out=strrep(string_out,num2str(tmp1),'');
end
string_out=upper(string_out);
string_out=char(string_out);
fid_out=fopen(ALL_XULIE,'w');
string_out=char(string_out);
fprintf(fid_out,'%s',string_out);
fclose(fid_out);
fprintf('任务2已完成。\n');

这里就是将0~9这10个数字全部替换成空,再用upper函数转换大写,几乎瞬间完成。

接下来就是重头戏了,要将cds对应的序列单独提取并保存到文件,这里要用到前面生成的两个文件。

fid=fopen(ALL_CDS,'r');          
file_content=fscanf(fid,'%s');      
fclose(fid);
file_length=length(file_content);

fid=fopen(ALL_XULIE,'r');          
xulie_content=fscanf(fid,'%s');      
fclose(fid);
xulie_length=length(file_content);

然后我又笨笨的一个大的for循环,一个一个的提取创建吧。

for tmp1=1 : file_length
    if(file_content(tmp1)~='.' && file_content(tmp1)~='+')
        tmp2=[tmp2(:)',file_content(tmp1)'];
    else
        if(file_content(tmp1)=='.' && file_content(tmp1-1)=='.')
            bg_num=str2num(tmp2);
            tmp2='';
        elseif(file_content(tmp1)=='+')
            ed_num=str2num(tmp2);
            tmp2='';
            cds_count=cds_count+1;
            
            tmp_str2=strcat(PRE_FIX_NAME,num2str(cds_count));
            tmp_str2=strcat(tmp_str2,'.txt');
            tmp_str1=xulie_content(bg_num:ed_num);
            fid_out=fopen(tmp_str2,'w');
            fprintf(fid_out,'%s',tmp_str1);
            fclose(fid_out);
        end
    end
end

这里面挺麻烦的,因为开始和结束处之间有两个分割符,所以前面那个就忽略掉了。

新建的文件名要按顺序起名,并要有共同的前缀。这功能就在上面的程序中实现。

最后做出来的效果如下:

功能实现了。可以放松下了。

全部程序:




ORIGINAL_FILE_NAME='dat/dat.txt';
ALL_XULIE='dat/new_dat.txt';
ALL_CDS='dat/CDS.txt';
PRE_FIX_NAME='dat/exon';

clc;
fprintf('任务已启动。\n');
%读取文件内容
fid=fopen(ORIGINAL_FILE_NAME,'r');          
file_content=fscanf(fid,'%s');      
fclose(fid);
file_length=length(file_content);
tmp1=0;
tmp2=0;

fid_out=fopen(ALL_CDS,'w');
cds_count=0;
xulie=[];
for i=1:file_length
    switch file_content(i)
        case {'O'}
            if(file_content(i:i+5)=='ORIGIN')
                xulie=file_content(i+6:file_length-1);
            end
        case {'C'}
            if(file_content(i:i+2)=='CDS')
                cds_count=cds_count+1;
                tmp1=i+3;
                tmp2=double(file_content(tmp1));
                while(tmp2 ~= 47 && tmp2 ~= 41 )
                    if(isnumber(file_content(tmp1))==1)
                        fprintf(fid_out,'%c',file_content(tmp1));
                    end
                    if((file_content(tmp1))==',')
                        cds_count=cds_count+1;
                        fprintf(fid_out,'+');
                    end
                    tmp1=tmp1+1;
                    tmp2=double(file_content(tmp1));
                end
                fprintf(fid_out,'+');
            end
    end
%     if(bitand(i,65535)==65535)
%         fprintf('任务1完成%f%%\n',i/file_length*100);
%     end
end
fclose(fid_out);
fprintf('共有%d个外显子。\n',cds_count);
fprintf('任务1已完成。\n');

string_out=char(xulie);
for tmp1=0:9
    string_out=strrep(string_out,num2str(tmp1),'');
end
string_out=upper(string_out);
string_out=char(string_out);
fid_out=fopen(ALL_XULIE,'w');
string_out=char(string_out);
fprintf(fid_out,'%s',string_out);
fclose(fid_out);
fprintf('任务2已完成。\n');



fid=fopen(ALL_CDS,'r');          
file_content=fscanf(fid,'%s');      
fclose(fid);
file_length=length(file_content);

fid=fopen(ALL_XULIE,'r');          
xulie_content=fscanf(fid,'%s');      
fclose(fid);
xulie_length=length(file_content);

tmp2='';
bg_num=0;
ed_num=0;
cds_count=0;
tmp_str1='';
tmp_str2='';
for tmp1=1 : file_length
    if(file_content(tmp1)~='.' && file_content(tmp1)~='+')
        tmp2=[tmp2(:)',file_content(tmp1)'];
    else
        if(file_content(tmp1)=='.' && file_content(tmp1-1)=='.')
            bg_num=str2num(tmp2);
            tmp2='';
        elseif(file_content(tmp1)=='+')
            ed_num=str2num(tmp2);
            tmp2='';
            cds_count=cds_count+1;
            
            tmp_str2=strcat(PRE_FIX_NAME,num2str(cds_count));
            tmp_str2=strcat(tmp_str2,'.txt');
            tmp_str1=xulie_content(bg_num:ed_num);
            fid_out=fopen(tmp_str2,'w');
            fprintf(fid_out,'%s',tmp_str1);
            fclose(fid_out);
        end
    end
end
fprintf('任务3完成。\n');
fprintf('共有%d个外显子。\n',cds_count);
fprintf('任务全部完成。\n\n');

还有个小程序

function jg=isnumber(canshu)
if((double(canshu)<=58 && double(canshu)>47) ...
        || double(canshu)==46)
    jg=1;
else
    jg=0;
end

完了。






评论 2
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值