Fastest way to compare two huge CSV files of phone numbers in Python (numpy)

I am trying to find the intersection subset between two pretty big CSV files of phone numbers (one has 600k rows, the other 300 million). I am currently using pandas to open both files, converting the needed columns into 1-D numpy arrays, and then using numpy's intersect to get the intersection. Is there a better way of doing this, either with Python or any other method? Thanks for any help.

import pandas as pd
import numpy as np

df_dnc = pd.read_csv('dncTest.csv', names=['phone'])
df_test = pd.read_csv('phoneTest.csv', names=['phone'])

dnc_phone = df_dnc['phone']
test_phone = df_test['phone']

np.intersect1d(dnc_phone, test_phone)

Solution

I will give you a general solution with some Python pseudo code. What you are trying to solve here is a classical problem from the book "Programming Pearls" by Jon Bentley.

This is solved very efficiently with just a simple bit array, hence my comment asking how long the phone numbers are (i.e. how many digits they have).

Let's say the phone number is at most 10 digits long; then the maximum phone number you can have is 9 999 999 999 (spaces are used for better readability). Here we can use 1 bit per number to identify whether the number is in the set or not (bit set or not set, respectively), so we are going to use 9 999 999 999 bits, one per possible number, i.e.:

bits[0] identifies the number 0 000 000 000

bits[193] identifies the number 0 000 000 193

a number such as 659 234-4567 would be addressed by bits[6592344567]

Doing so, we'd need to pre-allocate 9 999 999 999 bits initially set to 0, which is 9 999 999 999 / 8 / 1024 / 1024 ≈ 1.2 GB of memory.
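As an aside, the same bit-set idea can be sketched with nothing but the standard library, using a bytearray and addressing bit number & 7 inside byte number >> 3 (the helper names below are illustrative, not from the original answer):

# minimal bit-set sketch: one bit per possible 10-digit number
max_number = 9999999999
bits = bytearray((max_number >> 3) + 1)  # ~1.2 GB of zero bits

def set_bit(number):
    # mark number as present: set bit (number mod 8) of byte (number // 8)
    bits[number >> 3] |= 1 << (number & 7)

def has_bit(number):
    # constant-time membership test
    return bits[number >> 3] & (1 << (number & 7)) != 0

set_bit(6592344567)         # mark 659 234-4567 as seen
print(has_bit(6592344567))  # True
print(has_bit(193))         # False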

I think that holding the intersection of numbers at the end will use far less space than the bits representation: at most 600k ints will be stored, i.e. 64 bit * 600k ≈ 4.6 MB (actually a Python int is not stored that efficiently and may use much more; if these are strings you'll probably end up with even higher memory requirements).
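For a sense of that per-int overhead, a quick check in CPython (the exact figure is implementation- and version-specific):

import sys

# one 10-digit phone number as a Python int object (64-bit CPython)
print(sys.getsizeof(9999999999))  # typically 32 bytes, vs. 8 for a raw 64-bit int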

Parsing a phone-number string from the CSV file (line by line or with a buffered file reader), converting it to a number, and then doing a constant-time memory lookup will IMO be faster than dealing with strings and merging them. Unfortunately, I don't have these phone number files to test with, but I would be interested to hear your findings.

from bitstring import BitArray

max_number = 9999999999
found_phone_numbers = BitArray(length=max_number + 1)

# replace this function with one that opens the file and yields
# the next phone number found in it
def number_from_file_iterator(dummy_data):
    for number in dummy_data:
        yield number

def calculate_intersect():
    # should open file1 and get a generator of the numbers in it;
    # we use dummy data here
    for number in number_from_file_iterator([1, 25, 77, 224322323, 8292, 1232422]):
        found_phone_numbers[number] = True

    # open the second file and check whether each number is there
    for number in number_from_file_iterator([4, 24, 224322323, 1232422, max_number]):
        if found_phone_numbers[number]:
            yield number

number_intersection = set(calculate_intersect())
print(number_intersection)

I used BitArray from the bitstring pip package, and it needed around 2 seconds to initialize the entire bit string. Afterwards, scanning the files uses constant memory. At the end, I used a set to store the items.
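To plug the real files in, the stubbed-out generator above could be replaced with a line-by-line CSV reader along these lines (a sketch, assuming one plain-digit phone number per row in the first column, as in the question's files):

import csv

def numbers_from_csv(path, column=0):
    # lazily yields one phone number per row as an int
    with open(path, newline='') as f:
        for row in csv.reader(f):
            yield int(row[column])

# e.g. numbers_from_csv('dncTest.csv') and numbers_from_csv('phoneTest.csv')
# can feed the two loops in calculate_intersect()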

Note 1: This algorithm can be modified to use just a list instead of a set. In that case, in the second loop, as soon as a number's bit matches, that bit must be reset, so that duplicates do not match again (see the sketch below).
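A minimal sketch of that list variant, reusing found_phone_numbers from above (the function name and argument shape are illustrative, not from the original answer):

def calculate_intersect_as_list(first_numbers, second_numbers):
    result = []
    for number in first_numbers:
        found_phone_numbers[number] = True
    for number in second_numbers:
        if found_phone_numbers[number]:
            result.append(number)
            # reset the bit so a duplicate in the second file
            # is not appended a second time
            found_phone_numbers[number] = False
    return result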

Note 2: Storing into the set/list happens lazily, because we use a generator in the second for loop. Runtime complexity is linear, i.e. O(N).
