两文件关联过滤生成新文件

【问题】
I am trying to make a comparison between two large files, tab delimited. I have been trying to use awk & bash (Ubuntu 15.10), python (v3.5) and powershell (windows 10). My only background is Java but my field tends to stick with the scripting languages.

I am trying to see

File 1 A[ ]

1 gramene gene 4854 9652 - ID=gene:GRMZM2G059865;biotype=protein_coding;description=Uncharacterized protein \[Source:UniProtKB/TrEMBL%3BAcc:C0P8I2\];gene_id=GRMZM2G059865;logic_name=genebuilder;version=1
1 gramene gene 9882 10387 - ID=gene:GRMZM5G888250;biotype=protein_coding;gene_id=GRMZM5G888250;logic_name=genebuilder;version=1
1 gramene gene 109519 111769 - ID=gene:GRMZM2G093344;biotype=protein_coding;gene_id=GRMZM2G093344;logic_name=genebuilder;version=1
1 gramene gene 136307 138929 + ID=gene:GRMZM2G093399;biotype=protein_coding;gene_id=GRMZM2G093399;logic_name=genebuilder;version=1

File 2 B [ ]

S1_6370 T/C 1 6370 +
S1_8210 T 1 8210 +
S1_8376 A 1 8376 +
S1_9889 A 1 9889 +

Output

1 ID=gene:GRMZM2G059865 4857 9652 - S1_6370 T/C 6370 + 
1 ID=gene:GRMZM2G059865 4857 9652 -S1_8210 T 8210 +
1 ID=gene:GRMZM2G059865 4857 9652 -S1_8376 A 8376 +
1 ID=gene:GRMZM5G888250 9882 10387 -S1_9889 A 9889 +

My general logic

loop (until  end of A\[  \]  and B\[  \])``if``B\[$4\]>A\[$4\]  && B\[$4\]<A\[$5\] #if the value in B column 4 is in between the values in A columns 4 & 5.``then``-F”\\t”  print  {A\[1\], A\[9(filtered)\], A\[$4FS$5\], B\[$1\], B\[$2\], B\[$3\], B\[$4\], B\[$5\]} #hopefully reflects awk column calls if the two files were able to have their columns defined that way.``movea++  \# to see if the next set of B column 4 values is in between the values in A columns 4 & 5 ``else``moveb++  #to see if the next set of A columns 4&5 values contain the current vales of B column 4 in them.

I know this logic doesn’t follow any language that I am aware of but is similar in parts. It seems like NR and FNR are two built in running values in awk. Awk helped me split up File 2 that had 10 values in B[$1] into 10 files quite easily and also cut helped with cutting out the few hundred columns (~255+) beyond the 5 you see here. Now I am working File 2 sizes around a couple MB instead of 1 file of 1.6 GB. Other than cutting down loading times, I wanted to simplify the loops. I haven’t backtracked to my previous attempts of python or powershell since I trimmed the file sizes down. I convinced myself they just weren’t going to read my files with their built in libraries or cmdlets. Which I’ll try sometime soon if I am unable to figure out an awk solution.

comparing multiple files and columns using awk #referenced Awk greater than less than but within a set range #referenced efficiently splitting one file into several files by value of column #the one thing that worked Using awk to get a specific string in line #might be able to filter column 9 How to check value of a column lies between values of two columns in other file and print corresponding value from column in Unix? #this seemed the closest but without all the printing out in a third file I wanted, still not able to figure out the syntax completely

python bash powershell

别人解答:

awk '
BEGIN{
x=getline s <"B"
split(s,b,"\\t")
}
!x {
exit
}
{
sub(/;.*/,"",$9)
while (x && $4<b\[4\] && b\[4\]<$5) {
print $1,$9,$4,$5,$7,b\[1\],b\[2\],b\[4\],b\[5\]
x=getline s <"B"; split(s,b,"\\t")
}
}
' OFS='\\t' A

【回答】
File2 较小,可读入内存加快查询速度,file1 较大,可用游标批量读取。按行读入后使用 for 循环,按列号查询并将结果追加到文件中。

JAVA 直接编码会相当繁琐。Awk 的代码很简单,但效率较差,不适合处理太大的文件。这类问题还可以用 SPL 实现,代码简单,性能也很好,且可用命令行执行,也可集成 JAVA:

AB
1=file("file2.txt").import()
2=file("file1.txt").cursor()
3for A2=A1.select(#4>A3.#4 && #4> A3.#5)
4=B3.new(A3.#1,A3.#7.split(";")(1),A3.#4,A3.#5,"-",#1,#2,#4,#5)
5=file("D:/result.txt").export@a(B4)

A3:循环 A2 中的游标记录

B3:查询 file1 和 file2 中满足条件的记录

B4:按列表获取字段值,生成新的序表

B5:将查询结果追加导出到文件中

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值