Python: How to encode DNA sequences using binary values?

I would like to convert a file that contains a few DNA sequences into binary values, using the following mapping:

A=1000

C=0100

G=0010

T=0001

FileA.txt

CCGAT

GCTTA

Desired output

01000100001010000001

00100100000100011000

I tried the code below, but the resulting .bin file does not contain the output I want. Can anyone help me?

Code

import sys

if len(sys.argv) != 2:
    sys.stderr.write('Usage: {} \n'.format(sys.argv[0]))
    sys.exit()

# assumes the file only contains dna and newlines
sequence = ''
for line in open(sys.argv[1]):
    sequence += line.strip().upper()

sequence = sequence.replace('A', chr(0b1000))
sequence = sequence.replace('C', chr(0b0100))
sequence = sequence.replace('G', chr(0b0010))
sequence = sequence.replace('T', chr(0b0001))

outfile = open(sys.argv[1] + '.bin', 'wb')
outfile.write(bytearray(sequence, encoding = 'utf-8'))

Solution

Do you want ASCII output or binary? The code below gives you what you show in your post, though on a single line; it needs to be modified to keep the newlines.

import sys

if len(sys.argv) != 2:
    sys.stderr.write('Usage: {} \n'.format(sys.argv[0]))
    sys.exit()

# assumes the file only contains dna and newlines
sequence = ''
for line in open(sys.argv[1]):
    sequence += line.strip().upper()

# replace each nucleotide with its four-character text code
sequence = sequence.replace('A', '1000')
sequence = sequence.replace('C', '0100')
sequence = sequence.replace('G', '0010')
sequence = sequence.replace('T', '0001')

# text mode, because sequence is a string of '0' and '1' characters
with open(sys.argv[1] + '.bin', 'w') as outfile:
    outfile.write(sequence)
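To keep one output line per input sequence, each line can be encoded separately instead of concatenating everything first. The following is a minimal sketch under that assumption; the names `CODES`, `encode_line`, and `encode_text` are hypothetical and not part of the original answer, and any character other than A, C, G, or T raises a `KeyError`.

```python
# Mapping taken from the question; helper names are illustrative only.
CODES = {'A': '1000', 'C': '0100', 'G': '0010', 'T': '0001'}

def encode_line(line):
    """Return the four-character text code for every nucleotide in one line."""
    return ''.join(CODES[base] for base in line.strip().upper())

def encode_text(text):
    """Encode each non-empty line separately, one output line per sequence."""
    return '\n'.join(encode_line(line)
                     for line in text.splitlines() if line.strip())

# The two sequences from FileA.txt in the question.
print(encode_text('CCGAT\nGCTTA\n'))
```

With the sample input this prints the two lines shown under "Desired output" in the question. Blank lines in the input are skipped rather than encoded.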

EDIT: This creates a binary file in which each nucleotide is stored as one byte and the newlines are preserved as the byte 0b1010.

import sys

if len(sys.argv) != 2:
    sys.stderr.write('Usage: {} \n'.format(sys.argv[0]))
    sys.exit()

# assumes the file only contains dna and newlines
newbytearray = bytearray()
# maps each input character to its one-byte code
codes = {'A': 0b1000, 'C': 0b0100, 'G': 0b0010, 'T': 0b0001, '\n': 0b1010}

with open(sys.argv[1]) as infile:
    while True:
        char = infile.read(1)
        if not char:
            break  # end of file; the with block closes the file
        newbytearray.append(codes[char])

outfile = open(sys.argv[1] + '.bin', 'wb')
outfile.write(newbytearray)
outfile.close()

# Converts the binary file back to text and prints the resulting sequence.
decode = {0b1000: '1000', 0b0100: '0100', 0b0010: '0010',
          0b0001: '0001', 0b1010: '\n'}

testBin = open('fileA.txt.bin', 'rb')
data = bytearray(testBin.read())  # iterating a bytearray yields integers
testBin.close()

sequence = ''.join(decode[b] for b in data)

#outputVerify = open('outputVerify.txt', 'w')
#outputVerify.write(sequence)
#outputVerify.close()

print(sequence)

# Shows the contents of the binary file. Note that byte 6 is the newline character 0b1010.
testBin = open('fileA.txt.bin', 'rb')
data = bytearray(testBin.read())  # iterating a bytearray yields integers
testBin.close()

for i, b in enumerate(data, 1):
    print('byte: ' + str(i) + ' is ' + '{0:04b}'.format(b) +
          ' and has decimal representation: ' + str(b))
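The one-byte-per-nucleotide format can also be mapped back to the original letters, which gives a quick round-trip check. This is a sketch only, assuming the byte values from the answer above; `DECODE` and `decode_bytes` are hypothetical names introduced for illustration.

```python
# Inverse of the byte mapping used when writing the .bin file.
DECODE = {0b1000: 'A', 0b0100: 'C', 0b0010: 'G', 0b0001: 'T', 0b1010: '\n'}

def decode_bytes(data):
    """Map each byte of the .bin content back to a nucleotide or a newline."""
    # bytearray() makes iteration yield integers for any bytes-like input.
    return ''.join(DECODE[b] for b in bytearray(data))

# Bytes that the encoder would produce for the line "CCGAT" plus its newline.
encoded = bytearray([0b0100, 0b0100, 0b0010, 0b1000, 0b0001, 0b1010])
print(decode_bytes(encoded))
```

Passing the contents of the .bin file, read in 'rb' mode, to `decode_bytes` should reproduce the input file if it contained only upper-case A, C, G, T and newlines.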
