Fastest way to compare two huge CSV files of phone numbers in Python (numpy)

I am trying to find the intersection subset between two pretty big CSV files of phone numbers (one has 600k rows, the other 300 million). I am currently using pandas to open both files, converting the needed columns into 1-D numpy arrays, and then using numpy's intersect to get the intersection. Is there a better way of doing this, either with Python or any other method? Thanks for any help.

import pandas as pd
import numpy as np

df_dnc = pd.read_csv('dncTest.csv', names=['phone'])
df_test = pd.read_csv('phoneTest.csv', names=['phone'])

dnc_phone = df_dnc['phone']
test_phone = df_test['phone']

np.intersect1d(dnc_phone, test_phone)

Solution

I will give you a general solution with some Python pseudo code. What you are trying to solve here is a classical problem from the book "Programming Pearls" by Jon Bentley.

This is solved very efficiently with just a simple bit array, hence my comment asking how long the phone numbers are (i.e. how many digits they have).

Let's say the phone number is at most 10 digits long; then the maximum phone number you can have is 9 999 999 999 (spaces are used for better readability). Here we can use 1 bit per number to identify whether the number is in the set or not (bit set or not set, respectively), so we are going to use 9 999 999 999 bits, one per possible number, i.e.:

bits[0] identifies the number 0 000 000 000

bits[193] identifies the number 0 000 000 193

a number such as 659 234-4567 would be addressed by bits[6592344567]

Doing so, we'd need to pre-allocate 9 999 999 999 bits initially set to 0, which is 9 999 999 999 / 8 / 1024 / 1024 ≈ 1.2 GB of memory.
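As an aside, the same bit-set idea can be sketched with nothing but the standard library, using a bytearray and addressing bit number & 7 inside byte number >> 3 (the helper names below are illustrative, not from the original answer):

# minimal bit-set sketch: one bit per possible 10-digit number
max_number = 9999999999
bits = bytearray((max_number >> 3) + 1)  # ~1.2 GB of zero bits

def set_bit(number):
    # mark number as present: set bit (number mod 8) of byte (number // 8)
    bits[number >> 3] |= 1 << (number & 7)

def has_bit(number):
    # constant-time membership test
    return bits[number >> 3] & (1 << (number & 7)) != 0

set_bit(6592344567)         # mark 659 234-4567 as seen
print(has_bit(6592344567))  # True
print(has_bit(193))         # False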

I think that holding the intersection of numbers at the end will use far less space than the bits representation: at most 600k ints will be stored, i.e. 64 bit * 600k ≈ 4.6 MB (actually a Python int is not stored that efficiently and may use much more; if these are strings you'll probably end up with even higher memory requirements).
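For a sense of that per-int overhead, a quick check in CPython (the exact figure is implementation- and version-specific):

import sys

# one 10-digit phone number as a Python int object (64-bit CPython)
print(sys.getsizeof(9999999999))  # typically 32 bytes, vs. 8 for a raw 64-bit int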

Parsing a phone-number string from the CSV file (line by line or with a buffered file reader), converting it to a number, and then doing a constant-time memory lookup will IMO be faster than dealing with strings and merging them. Unfortunately, I don't have these phone number files to test with, but I would be interested to hear your findings.

from bitstring import BitArray

max_number = 9999999999
found_phone_numbers = BitArray(length=max_number + 1)

# replace this function with one that opens the file and yields
# the next phone number found in it
def number_from_file_iterator(dummy_data):
    for number in dummy_data:
        yield number

def calculate_intersect():
    # should open file1 and get a generator of the numbers in it;
    # we use dummy data here
    for number in number_from_file_iterator([1, 25, 77, 224322323, 8292, 1232422]):
        found_phone_numbers[number] = True

    # open the second file and check whether each number is there
    for number in number_from_file_iterator([4, 24, 224322323, 1232422, max_number]):
        if found_phone_numbers[number]:
            yield number

number_intersection = set(calculate_intersect())
print(number_intersection)

I used BitArray from the bitstring pip package, and it needed around 2 seconds to initialize the entire bit string. Afterwards, scanning the files uses constant memory. At the end, I used a set to store the items.
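To plug the real files in, the stubbed-out generator above could be replaced with a line-by-line CSV reader along these lines (a sketch, assuming one plain-digit phone number per row in the first column, as in the question's files):

import csv

def numbers_from_csv(path, column=0):
    # lazily yields one phone number per row as an int
    with open(path, newline='') as f:
        for row in csv.reader(f):
            yield int(row[column])

# e.g. numbers_from_csv('dncTest.csv') and numbers_from_csv('phoneTest.csv')
# can feed the two loops in calculate_intersect()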

Note 1: This algorithm can be modified to use just a list instead of a set. In that case, in the second loop, as soon as a number's bit matches, that bit must be reset, so that duplicates do not match again (see the sketch below).
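A minimal sketch of that list variant, reusing found_phone_numbers from above (the function name and argument shape are illustrative, not from the original answer):

def calculate_intersect_as_list(first_numbers, second_numbers):
    result = []
    for number in first_numbers:
        found_phone_numbers[number] = True
    for number in second_numbers:
        if found_phone_numbers[number]:
            result.append(number)
            # reset the bit so a duplicate in the second file
            # is not appended a second time
            found_phone_numbers[number] = False
    return result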

Note 2: Storing into the set/list happens lazily, because we use a generator in the second for loop. Runtime complexity is linear, i.e. O(N).
