Building a numpy array from kmer count results

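Once the kmer counts have been generated, the counts of all samples are combined into a single numpy array that can be used to train a machine learning model.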

Input files

  1. One folder of kmer data files per sample class; here there are two classes, chrom and plas
  2. The full kmer list
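
As a rough illustration of how these inputs fit together, the sketch below assumes that the full kmer list has one kmer per line (first tab-separated column) and that each sample file has `kmer<TAB>count` lines. The helper names `read_kmer_index` and `count_vector` are hypothetical and exist only for this example; the in-memory strings stand in for real files.

```python
import io
import numpy as np

def read_kmer_index(handle):
    """Map each kmer (first tab-separated column) to a column position."""
    index = {}
    for line in handle:
        kmer = line.strip().split("\t")[0]
        if kmer:
            # setdefault keeps the first position if a kmer is listed twice
            index.setdefault(kmer, len(index))
    return index

def count_vector(handle, index):
    """Turn one sample's kmer<TAB>count lines into a fixed-length count vector."""
    vec = np.zeros(len(index), dtype="int32")
    for line in handle:
        fields = line.strip().split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        vec[index[fields[0]]] = int(fields[1])
    return vec

# toy data standing in for the real files (assumed format)
total = io.StringIO("AAA\nAAC\nAAG\n")
sample = io.StringIO("AAC\t3\nAAA\t1\n")

index = read_kmer_index(total)
vec = count_vector(sample, index)
print(index)
print(vec)
```

Kmers that a sample does not contain stay at 0, so every sample ends up with a vector of the same length and column order.
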
import numpy as np
import glob
import argparse
import pandas as pd

parser = argparse.ArgumentParser(description='Build a kmer count matrix from the per-sample kmer files in two directories')
parser.add_argument("totalkmer", help="file listing all kmers, used as the features")  # the full kmer list
parser.add_argument("chrommers", help="folder containing the chrom kmer files")  # folder name of the chrom sample kmers
parser.add_argument("plasmers", help="folder containing the plas kmer files")  # folder name of the plas sample kmers
args = parser.parse_args()

# read total kmer list
kmerlist = {}
count = 0
with open(args.totalkmer) as f:
    for line in f:
        i = line.strip().split("\t")
        kmerlist[i[0]] = count
        count += 1

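# read chrom kmers: one row of counts per sample file
mat = []
chromlist = sorted(glob.glob(args.chrommers.rstrip("/") + "/*"))  # every file in the chrom folder
for i in chromlist:
    arr = [0]*len(kmerlist)
    with open(i) as f:
        for line in f:
            j = line.strip().split("\t")
            site = kmerlist[j[0]]  # column position of this kmer
            arr[site] = int(j[1])
    mat.append(arr)
# read plasmid kmers: same procedure, rows appended after the chrom rows
plaslist = sorted(glob.glob(args.plasmers.rstrip("/") + "/*"))  # every file in the plas folder
for i in plaslist:
    arr = [0]*len(kmerlist)
    with open(i) as f:
        for line in f:
            j = line.strip().split("\t")
            site = kmerlist[j[0]]  # column position of this kmer
            arr[site] = int(j[1])
    mat.append(arr)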

allmatrix = np.array(mat, dtype="int32") # numpy array

# make target
target = np.hstack((np.zeros(len(chromlist)),np.ones(len(plaslist))))  # produce label

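# keep only samples whose kmer counts sum to the full total:
# 1995 for 2k fragments and 4995 for 5k fragments (these equal L - k + 1 with k = 6)
# for 5k frag
print((allmatrix.sum(axis=1)!=4995).sum())  # number of samples that will be dropped
idx=allmatrix.sum(axis=1)==4995
print(allmatrix.shape)
print(target.shape)

allmatrix_com=allmatrix[idx]
target_com=target[idx]

print(allmatrix_com.shape)
print(target_com.shape)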

# save matrix and target
pd.DataFrame(allmatrix_com).to_csv('allmatrix_com.csv')
np.savetxt("target_com",target_com)

# save kmerlist/index as a JSON file
import json
with open('kmerlist_index.json','w') as f:
    json.dump(kmerlist, f)

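# save the sample ids of the rows kept in the matrix
sampleid=chromlist+plaslist
sampleid=[x.rsplit("/", 1)[-1] for x in sampleid]  # drop the folder prefix
sampleid_com=np.array(sampleid)[idx]  # apply the same filter as allmatrix_com
with open('matrix_sample_id.json','w') as f:
    json.dump(sampleid_com.tolist(),f)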

# read the kmerlist back
with open('kmerlist_index.json','r') as f:
    kmerlist = json.load(f)
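
The filtering step in the script is easier to follow on a toy example. The sketch below is only an illustration under assumed data: `filter_by_total` is a hypothetical helper, and the matrix, labels, and sample ids are made up. It shows the one point that matters, namely that the same boolean mask has to be applied to the matrix, the labels, and the sample ids so that the rows stay aligned.

```python
import numpy as np

def filter_by_total(matrix, target, sample_ids, expected_total):
    """Keep only rows whose counts sum to expected_total, with matching labels and ids."""
    keep = matrix.sum(axis=1) == expected_total  # boolean mask, one entry per sample
    kept_ids = [s for s, k in zip(sample_ids, keep) if k]
    return matrix[keep], target[keep], kept_ids

# toy data: the row sums are 3, 2 and 3
matrix = np.array([[2, 1, 0],
                   [1, 1, 0],
                   [0, 0, 3]], dtype="int32")
target = np.array([0.0, 0.0, 1.0])          # 0 = chrom, 1 = plas, as in the script
ids = ["chrom_a", "chrom_b", "plas_a"]      # made-up sample names

m, t, s = filter_by_total(matrix, target, ids, expected_total=3)
print(m.shape, t, s)
```

In the script above the expected total would be 4995 for 5k fragments or 1995 for 2k fragments.
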