Using regular expressions to convert data into a dictionary in Python

I have a dataset of FASTA-formatted sequences, basically like this:

    >pc284
    ATCGCGACTCGAC
    >pc293
    ACCCGACCTCAGC

I want to use each tag as a key in a dictionary and store the gene as its value.

This is the code I have, but it isn't really doing anything:

    import re

    fileData = open('d.fasta', 'r')
    myDict = dict()

    for line in fileData:
        match = re.search('(\>)(\w+)(\r)(\w+)', line)
        if match:
            gene = match.group(3)
            myDict[gene[0]] = gene[1]

    print myDict

Solution

\r matches only a literal carriage return; it is not a whitespace character class. I think you meant to use \s instead. You can also drop any capturing groups you don't use.

But most of all, you need to extract your groups correctly:

    match = re.search(r'>(\w+)\s+(\w+)', line)
    if match:
        tag, gene = match.groups()
        myDict[tag] = gene

By creating only two capturing groups, we can more simply extract those two with .groups() and directly assign them to two variables, tag and gene.
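
As a quick check of that unpacking, here is a minimal sketch applied to a single string in which the tag and the gene are separated by whitespace. The sample string and the variable names are made up for illustration.

```python
import re

# Hypothetical one-line sample: tag and gene separated by whitespace.
sample = ">pc284 ATCGCGACTCGAC"

my_dict = {}
match = re.search(r">(\w+)\s+(\w+)", sample)
if match:
    # .groups() returns one string per capturing group, in order.
    tag, gene = match.groups()
    my_dict[tag] = gene

print(my_dict)
```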

However, reading up on the FASTA format indicates that it is a multi-line format, with the tag on one line and the gene data on one or more lines after it. In that case your \r was presumably meant to match the newline. That won't work when you read the file one line at a time, because the tag and the gene data never appear in the same line.
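
If you do want to stay with a regular expression, one option is to match against the whole file contents instead of individual lines. The sketch below assumes the data is small enough to hold in memory as one string, and it takes the tag to be the first whitespace-delimited word of the header line. The pattern and the name parse_fasta_text are illustrative only.

```python
import re

# '>' then the tag (first run of non-whitespace), the rest of the header line,
# and then everything up to the next '>' as the raw sequence body.
RECORD = re.compile(r">(\S+)[^\n]*\n([^>]*)")

def parse_fasta_text(text):
    """Return {tag: sequence} for FASTA-formatted text held in one string."""
    result = {}
    for tag, body in RECORD.findall(text):
        # Join the sequence lines by dropping all whitespace, newlines included.
        result[tag] = "".join(body.split())
    return result

# Made-up sample; the second record spans two lines.
text = ">pc284\nATCGCGACTCGAC\n>pc293\nACCCGAC\nCTCAGC\n"
print(parse_fasta_text(text))
```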

It would be much simpler to read that format without regular expressions like so:

    myDict = {}

    with open('d.fasta', 'rU') as fileData:
        tag = None
        for line in fileData:
            line = line.strip()
            if not line:
                continue
            if line[0] == '>':
                tag = line[1:]
                myDict[tag] = ''
            else:
                assert tag is not None, 'Invalid format, found gene without tag'
                myDict[tag] += line

    print myDict

This reads the file line by line, detecting tags based on the starting > character, then reads multiple lines of gene information collecting it into your dictionary under the most-recently read tag.

Note the rU mode: it opens the file in Python's universal newlines mode, so it handles whichever newline convention was used to create the file. In Python 3, text mode does this by default, and the U flag was removed in Python 3.11.
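
For Python 3, the same loop could be written roughly as follows. This is a sketch, not the original code: it wraps the logic in a function, read_fasta, whose name and signature are made up here, and it raises ValueError where the original uses assert.

```python
import io

def read_fasta(lines):
    """Build {tag: sequence} from an iterable of FASTA lines."""
    records = {}
    tag = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: everything after '>' becomes the key.
            tag = line[1:]
            records[tag] = ""
        else:
            if tag is None:
                raise ValueError("Invalid format, found gene without tag")
            # Sequence line: append to the most recently read tag.
            records[tag] += line
    return records

# Any iterable of lines works; io.StringIO stands in for an open file here.
sample = io.StringIO(">pc284\nATCGCGACTCGAC\n\n>pc293\nACCCGAC\nCTCAGC\n")
print(read_fasta(sample))
```

With a real file, the call would be read_fasta(fileData) inside a with open('d.fasta') as fileData: block.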

Last but not least, take a look at the Biopython project; its Bio.SeqIO module handles FASTA plus many other formats.
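
A minimal sketch of that route, assuming Biopython is installed (for example with pip install biopython). The function name fasta_to_dict is made up for illustration; SeqIO.parse and the record attributes id and seq are Biopython's own.

```python
def fasta_to_dict(path):
    """Return {record id: sequence string} for the FASTA file at path."""
    # Imported inside the function so that Biopython is only required
    # when the function is actually called.
    from Bio import SeqIO

    # SeqIO.parse yields one SeqRecord per FASTA entry; record.id is the
    # header text up to the first whitespace, record.seq is the sequence.
    return {record.id: str(record.seq) for record in SeqIO.parse(path, "fasta")}
```

Calling fasta_to_dict('d.fasta') should then give the same kind of dictionary as the hand-written loop.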
