matlab一串DNA序列,matlab提取DNA序列小程序

最新推荐文章于 2022-04-02 16:29:15 发布

呐傷.已黯淡

最新推荐文章于 2022-04-02 16:29:15 发布

阅读量478

点赞数

文章标签： matlab一串DNA序列

从网上下载下来的网页内容如下：

LOCUS NC_001133 230218 bp DNA linear PLN 14-JUL-2011

DEFINItiON Saccharomyces cerevisiae S288c chromosome I, complete sequence.

ACCESSION NC_001133

VERSION NC_001133.9 GI:330443391

DBLINK Project: 128

KEYWORDS .

SOURCE Saccharomyces cerevisiae S288c

ORGANISM Saccharomyces cerevisiae S288c

Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina;

Saccharomycetes; Saccharomycetales; Saccharomycetaceae;

Saccharomyces.

REFERENCE 1 (bases 1 to 230218)

AUTHORS Goffeau,A., Barrell,B.G., Bussey,H., Davis,R.W., Dujon,B.,

Feldmann,H., Galibert,F., Hoheisel,J.D., Jacq,C., Johnston,M.,

Louis,E.J., Mewes,H.W., Murakami,Y., Philippsen,P., Tettelin,H. and

Oliver,S.G.

TITLE Life with 6000 genes

JOURNAL Science 274 (5287), 546 (1996)

PUBMED 8849441

REFERENCE 2 (bases 1 to 230218)

AUTHORS Bussey,H., Kaback,D.B., Zhong,W., Vo,D.T., Clark,M.W., Fortin,N.,

Hall,J., Ouellette,B.F., Keng,T., Barton,A.B. et al.

TITLE The nucleotide sequence of chromosome I from Saccharomyces

cerevisiae

JOURNAL Proc. Natl. Acad. Sci. U.S.A. 92 (9), 3809-3813 (1995)

PUBMED 7731988

REFERENCE 3 (bases 1 to 230218)

CONSRTM NCBI Genome Project

TITLE Direct Submission

JOURNAL Submitted (25-APR-2011) National Center for Biotechnology

Information, NIH, Bethesda, MD 20894, USA

REFERENCE 4 (bases 1 to 230218)

CONSRTM Saccharomyces Genome Database

TITLE Direct Submission

JOURNAL Submitted (31-MAR-2011) Department of Genetics, Stanford

University, Stanford, CA 94305-5120, USA

REMARK Sequence update by submitter

REFERENCE 5 (bases 1 to 230218)

CONSRTM Saccharomyces Genome Database

TITLE Direct Submission

JOURNAL Submitted (11-DEC-2009) Department of Genetics, Stanford

University, Stanford, CA 94305-5120, USA

COMMENT PROVISIONAL REFSEQ: This record has not yet been subject to final

NCBI review. The reference sequence is identical to BK006935.

On Apr 26, 2011 this sequence version replaced gi:269970293.

COMPLETENESS: full length.

PRIMARY REFSEQ_SPAN PRIMARY_IDENTIFIER PRIMARY_SPAN COMP

1-2669 U73805.1 1-2669

2592-106271 U12980.3 1-103681

106140-135553 L05146.3 1-29415

133553-175545 L22015.2 1-41988

175399-230218 L28920.2 1-54812

FEATURES Location/Qualifiers

source 1..230218

/organism="Saccharomyces cerevisiae S288c"

/mol_type="genomic DNA"

/strain="S288c"

/db_xref="taxon:559292"

/chromosome="I"

repeat_region complement(1..801)

/note="TEL01L; Telomeric region on the left arm of

Chromosome I; composed of an X element core sequence, X

element combinatorial repeats, and a short terminal

stretch of telomeric repeats;

Telomeric region on the left arm of Chromosome I; composed

of an X element core sequence, X element combinatorial

repeats, and a short terminal stretch of telomeric

repeats"

/rpt_family="Telomeric Region"

/db_xref="SGD:S000028862"

repeat_region complement(1..62)

/note="TEL01L-TR; Terminal stretch of telomeric repeats on

the left arm of Chromosome I;

Terminal stretch of telomeric repeats on the left arm of

Chromosome I"

/rpt_family="Telomeric Repeat"

/db_xref="SGD:S000028864"

repeat_region complement(63..336)

/note="TEL01L-XR; Telomeric X element combinatorial Repeat

region on the left arm of Chromosome I; contains repeats

of the D, C, B and A types, as well as Tbf1p binding

sites; formerly calLED SubTelomeric Repeats;

Telomeric X element combinatorial Repeat region on the

left arm of Chromosome I; contains repeats of the D, C, B

and A types, as well as Tbf1p binding sites; formerly

called SubTelomeric Repeats"

/rpt_family="X element Combinatorial Repeats"

/db_xref="SGD:S000028866"

repeat_region complement(337..801)

/note="TEL01L-XC; Telomeric X element Core sequence on the

left arm of Chromosome I; contains an ARS consensus

sequence, an Abf1p binding site consensus sequence and two

small overlapping ORFs (YAL068W-A and YAL069W);

Telomeric X element Core sequence on the left arm of

Chromosome I; …………

这是开头的文件内容，在最后面是DNA序列，在这些E文中有个CDS字段表示的是DNA中外显子的位置，要把文件中所有标示的CDS段中的序列提取出来，当然还要把所有的DNA序列全部提取到一个单独的文件，CDS段中的数字标示也要全部提取到一个文件中。

任务已明确，手工操作确实难度不大，但却很繁琐，我想先实现提取CDS段再说后面的。

提取CDS段要看字符串是不是“CDS”，若是则提取后面的一串“数字”，后面的数字还是字符串形式，而且有的数字前面还有一串我不认识的英语，先做简单的。

首先读取文件了。

fid=fopen(ORIGINAL_FILE_NAME,'r');

file_content=fscanf(fid,'%s');

fclose(fid);

file_length=length(file_content);

将文件读取到file_content中，这里用的fscanf函数，只能读取字符，几乎所有不可见的字符都会忽略，回车换行也读不出来的。

接下来要新建个文件来保存CDS，如下

fid_out=fopen(ALL_CDS,'w');

然后就要开始遍历file_content的内容了，一个字符一个字符的遍历，好麻烦啊，不过在程序里就方便了，

cds_count=0;

xulie=[];

for i=1:file_length

switch file_content(i)

case {'O'}

if(file_content(i:i+5)=='ORIGIN')

xulie=file_content(i+6:file_length-1);

end

case {'C'}

if(file_content(i:i+2)=='CDS')

cds_count=cds_count+1;

变量xulie保存全部DNA序列，cds_count记录找到的CDS的个数，找到了CDS就要提取后面对应的数字了。

tmp1=i+3;

tmp2=double(file_content(tmp1));

while(tmp2 ~= 47 && tmp2 ~= 41 )

if(isnumber(file_content(tmp1))==1)

fprintf(fid_out,'%c',file_content(tmp1));

end

if((file_content(tmp1))==',')

cds_count=cds_count+1;

fprintf(fid_out,'+');

end

tmp1=tmp1+1;

tmp2=double(file_content(tmp1));

end

fprintf(fid_out,'+');

其中fid_out就是所有cds对应的数保存的文件夹，

提取完要关闭文件。

fclose(fid_out);

fprintf('共有%d个外显子。\n',cds_count);

fprintf('任务1已完成。\n');

接下来要处理提取的所有DNA序列，因为源文件为了便于查找对dna进行了序列标注，这里要将里面的数字删除，并将小写的actg转换成大写，这个功能在上学期写过一个，当时做的就像上面那个，用个循环，一个一个的处理，这次我查了点资料，几句话就搞定了。

string_out=char(xulie);

for tmp1=0:9

string_out=strrep(string_out,num2str(tmp1),'');

end

string_out=upper(string_out);

string_out=char(string_out);

fid_out=fopen(ALL_XULIE,'w');

string_out=char(string_out);

fprintf(fid_out,'%s',string_out);

fclose(fid_out);

fprintf('任务2已完成。\n');

这里就是将0~9这10个数字全部替换成空，再用upper函数转换大写，几乎瞬间完成。

接下来就是重头戏了，要将cds对应的序列单独提取并保存到文件，这里要用到前面生成的两个文件。

fid=fopen(ALL_CDS,'r');

file_content=fscanf(fid,'%s');

fclose(fid);

file_length=length(file_content);

fid=fopen(ALL_XULIE,'r');

xulie_content=fscanf(fid,'%s');

fclose(fid);

xulie_length=length(file_content);

然后我又笨笨的一个大的for循环，一个一个的提取创建吧。

for tmp1=1 : file_length

if(file_content(tmp1)~='.' && file_content(tmp1)~='+')

tmp2=[tmp2(:)',file_content(tmp1)'];

else

if(file_content(tmp1)=='.' && file_content(tmp1-1)=='.')

bg_num=str2num(tmp2);

tmp2='';

elseif(file_content(tmp1)=='+')

ed_num=str2num(tmp2);

tmp2='';

cds_count=cds_count+1;

tmp_str2=strcat(PRE_FIX_NAME,num2str(cds_count));

tmp_str2=strcat(tmp_str2,'.txt');

tmp_str1=xulie_content(bg_num:ed_num);

fid_out=fopen(tmp_str2,'w');

fprintf(fid_out,'%s',tmp_str1);

fclose(fid_out);

end

这里面挺麻烦的，因为开始和结束处之间有两个分割符，所以前面那个就忽略掉了。

新建的文件名要按顺序起名，并要有共同的前缀。这功能就在上面的程序中实现。

功能实现了。可以放松下了。

全部程序：

ORIGINAL_FILE_NAME='dat/dat.txt';

ALL_XULIE='dat/new_dat.txt';

ALL_CDS='dat/CDS.txt';

PRE_FIX_NAME='dat/exon';

clc;

fprintf('任务已启动。\n');

%读取文件内容

fid=fopen(ORIGINAL_FILE_NAME,'r');

file_content=fscanf(fid,'%s');

fclose(fid);

file_length=length(file_content);

tmp1=0;

tmp2=0;

fid_out=fopen(ALL_CDS,'w');

cds_count=0;

xulie=[];

for i=1:file_length

switch file_content(i)

case {'O'}

if(file_content(i:i+5)=='ORIGIN')

xulie=file_content(i+6:file_length-1);

end

case {'C'}

if(file_content(i:i+2)=='CDS')

cds_count=cds_count+1;

tmp1=i+3;

tmp2=double(file_content(tmp1));

while(tmp2 ~= 47 && tmp2 ~= 41 )

if(isnumber(file_content(tmp1))==1)

fprintf(fid_out,'%c',file_content(tmp1));

end

if((file_content(tmp1))==',')

cds_count=cds_count+1;

fprintf(fid_out,'+');

end

tmp1=tmp1+1;

tmp2=double(file_content(tmp1));

end

fprintf(fid_out,'+');

end

% if(bitand(i,65535)==65535)

% fprintf('任务1完成%f%%\n',i/file_length*100);

% end

end

fclose(fid_out);

fprintf('共有%d个外显子。\n',cds_count);

fprintf('任务1已完成。\n');

string_out=char(xulie);

for tmp1=0:9

string_out=strrep(string_out,num2str(tmp1),'');

end

string_out=upper(string_out);

string_out=char(string_out);

fid_out=fopen(ALL_XULIE,'w');

string_out=char(string_out);

fprintf(fid_out,'%s',string_out);

fclose(fid_out);

fprintf('任务2已完成。\n');

fid=fopen(ALL_CDS,'r');

file_content=fscanf(fid,'%s');

fclose(fid);

file_length=length(file_content);

fid=fopen(ALL_XULIE,'r');

xulie_content=fscanf(fid,'%s');

fclose(fid);

xulie_length=length(file_content);

tmp2='';

bg_num=0;

ed_num=0;

cds_count=0;

tmp_str1='';

tmp_str2='';

for tmp1=1 : file_length

if(file_content(tmp1)~='.' && file_content(tmp1)~='+')

tmp2=[tmp2(:)',file_content(tmp1)'];

else

if(file_content(tmp1)=='.' && file_content(tmp1-1)=='.')

bg_num=str2num(tmp2);

tmp2='';

elseif(file_content(tmp1)=='+')

ed_num=str2num(tmp2);

tmp2='';

cds_count=cds_count+1;

tmp_str2=strcat(PRE_FIX_NAME,num2str(cds_count));

tmp_str2=strcat(tmp_str2,'.txt');

tmp_str1=xulie_content(bg_num:ed_num);

fid_out=fopen(tmp_str2,'w');

fprintf(fid_out,'%s',tmp_str1);

fclose(fid_out);

end

fprintf('任务3完成。\n');

fprintf('共有%d个外显子。\n',cds_count);

fprintf('任务全部完成。\n\n');

还有个小程序

function jg=isnumber(canshu)

if((double(canshu)<=58 && double(canshu)>47) ...

|| double(canshu)==46)

jg=1;

else

jg=0;

end

呐傷.已黯淡

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
matlab一串DNA序列,matlab提取DNA序列小程序

从网上下载下来的网页内容如下：LOCUS NC_001133 230218 bp DNA linear PLN 14-JUL-2011DEFINItiONSaccharomyces cerevisiae S288c chromosome I, complete sequence.ACCESSION NC_001133VERSION ...
复制链接

扫一扫