两文件关联过滤生成新文件

润乾软件

于 2022-07-19 11:31:25 发布

阅读量362

点赞数

分类专栏： JAVA计算文章标签：文本大数据关联

本文链接：https://blog.csdn.net/raqsoft/article/details/125868573

版权

JAVA计算专栏收录该内容

363 篇文章 2 订阅

订阅专栏

【问题】
I am trying to make a comparison between two large files, tab delimited. I have been trying to use awk & bash (Ubuntu 15.10), python (v3.5) and powershell (windows 10). My only background is Java but my field tends to stick with the scripting languages.

I am trying to see

File 1 A[ ]

1 gramene gene 4854 9652 - ID=gene:GRMZM2G059865;biotype=protein_coding;description=Uncharacterized protein \[Source:UniProtKB/TrEMBL%3BAcc:C0P8I2\];gene_id=GRMZM2G059865;logic_name=genebuilder;version=1
1 gramene gene 9882 10387 - ID=gene:GRMZM5G888250;biotype=protein_coding;gene_id=GRMZM5G888250;logic_name=genebuilder;version=1
1 gramene gene 109519 111769 - ID=gene:GRMZM2G093344;biotype=protein_coding;gene_id=GRMZM2G093344;logic_name=genebuilder;version=1
1 gramene gene 136307 138929 + ID=gene:GRMZM2G093399;biotype=protein_coding;gene_id=GRMZM2G093399;logic_name=genebuilder;version=1

File 2 B [ ]

S1_6370 T/C 1 6370 +
S1_8210 T 1 8210 +
S1_8376 A 1 8376 +
S1_9889 A 1 9889 +

Output

1 ID=gene:GRMZM2G059865 4857 9652 - S1_6370 T/C 6370 + 
1 ID=gene:GRMZM2G059865 4857 9652 -S1_8210 T 8210 +
1 ID=gene:GRMZM2G059865 4857 9652 -S1_8376 A 8376 +
1 ID=gene:GRMZM5G888250 9882 10387 -S1_9889 A 9889 +

My general logic

loop (until  end of A\[  \]  and B\[  \])``if``B\[$4\]>A\[$4\]  && B\[$4\]<A\[$5\] #if the value in B column 4 is in between the values in A columns 4 & 5.``then``-F”\\t”  print  {A\[1\], A\[9(filtered)\], A\[$4FS$5\], B\[$1\], B\[$2\], B\[$3\], B\[$4\], B\[$5\]} #hopefully reflects awk column calls if the two files were able to have their columns defined that way.``movea++  \# to see if the next set of B column 4 values is in between the values in A columns 4 & 5 ``else``moveb++  #to see if the next set of A columns 4&5 values contain the current vales of B column 4 in them.

I know this logic doesn’t follow any language that I am aware of but is similar in parts. It seems like NR and FNR are two built in running values in awk. Awk helped me split up File 2 that had 10 values in B[$1] into 10 files quite easily and also cut helped with cutting out the few hundred columns (~255+) beyond the 5 you see here. Now I am working File 2 sizes around a couple MB instead of 1 file of 1.6 GB. Other than cutting down loading times, I wanted to simplify the loops. I haven’t backtracked to my previous attempts of python or powershell since I trimmed the file sizes down. I convinced myself they just weren’t going to read my files with their built in libraries or cmdlets. Which I’ll try sometime soon if I am unable to figure out an awk solution.

comparing multiple files and columns using awk #referenced Awk greater than less than but within a set range #referenced efficiently splitting one file into several files by value of column #the one thing that worked Using awk to get a specific string in line #might be able to filter column 9 How to check value of a column lies between values of two columns in other file and print corresponding value from column in Unix? #this seemed the closest but without all the printing out in a third file I wanted, still not able to figure out the syntax completely

python bash powershell

别人解答：

awk '
BEGIN{
x=getline s <"B"
split(s,b,"\\t")
}
!x {
exit
}
{
sub(/;.*/,"",$9)
while (x && $4<b\[4\] && b\[4\]<$5) {
print $1,$9,$4,$5,$7,b\[1\],b\[2\],b\[4\],b\[5\]
x=getline s <"B"; split(s,b,"\\t")
}
}
' OFS='\\t' A

【回答】
File2 较小，可读入内存加快查询速度，file1 较大，可用游标批量读取。按行读入后使用 for 循环，按列号查询并将结果追加到文件中。

JAVA 直接编码会相当繁琐。Awk 的代码很简单，但效率较差，不适合处理太大的文件。这类问题还可以用 SPL 实现，代码简单，性能也很好，且可用命令行执行，也可集成 JAVA：

	A	B
1	=file("file2.txt").import()
2	=file("file1.txt").cursor()
3	for A2	=A1.select(#4>A3.#4 && #4> A3.#5)
4		=B3.new(A3.#1,A3.#7.split(";")(1),A3.#4,A3.#5,"-",#1,#2,#4,#5)
5		=file("D:/result.txt").export@a(B4)