How do I parse files larger than 100 GB in Python?

I have text files of about 100 GB in the format below (with duplicate lines, IPs, and domains):

domain|ip
yahoo.com|89.45.3.5
bbc.com|45.67.33.2
yahoo.com|89.45.3.5
myname.com|45.67.33.2
etc.

I am trying to parse them using the following Python code, but I still get a MemoryError. Does anybody know a more optimal way of parsing such files? (Time is an important factor for me.)

from glob import glob

d = {}
files = glob(path)
for filename in files:
    print(filename)
    with open(filename) as f:
        for line in f:
            try:
                domain = line.split('|')[0]
                ip = line.split('|')[1].strip('\n')
                if ip in d:
                    d[ip].add(domain)
                else:
                    d[ip] = set([domain])
            except IndexError:  # malformed line without a '|'
                print(line)
    print("this file is finished")

for ip, domains in d.items():
    for domain in domains:
        print("%s|%s" % (ip, domain), file=output)

Solution

Python objects take considerably more memory than the same values do on disk: every object carries a reference count, and sets additionally cache a hash value per element.
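As a rough illustration (exact sizes vary by Python version and platform, so treat the numbers as indicative only), sys.getsizeof makes that overhead visible; a 9-byte IP address string and a one-element set each cost far more than the raw bytes on disk:

import sys

ip = '89.45.3.5'                      # 9 bytes in the file (plus a newline)
print(sys.getsizeof(ip))              # str object: typically around 60 bytes
print(sys.getsizeof({'yahoo.com'}))   # one-element set: a few hundred bytes

Multiply that by hundreds of millions of lines and the dict-of-sets approach simply cannot fit in RAM.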

Don't read all those objects into (Python) memory; use a database instead. Python comes with the sqlite3 library for the SQLite database; use that to convert your files to a database. You can then build your output file from it:

import csv
import sqlite3
from itertools import islice

conn = sqlite3.connect('/tmp/ipaddresses.db')
conn.execute('CREATE TABLE IF NOT EXISTS ipaddress (domain, ip)')
conn.execute('''\
    CREATE UNIQUE INDEX IF NOT EXISTS domain_ip_idx
    ON ipaddress(domain, ip)''')

for filename in files:
    print(filename)
    # text mode with newline='' is what the csv module expects in Python 3
    with open(filename, newline='') as f:
        reader = csv.reader(f, delimiter='|')
        cursor = conn.cursor()
        while True:
            with conn:  # commit each batch as a single transaction
                batch = list(islice(reader, 10000))
                if not batch:
                    break
                cursor.executemany(
                    'INSERT OR IGNORE INTO ipaddress VALUES(?, ?)',
                    batch)

conn.execute('CREATE INDEX IF NOT EXISTS ip_idx ON ipaddress(ip)')

with open(outputfile, 'w', newline='') as outfh:
    writer = csv.writer(outfh, delimiter='|')
    cursor = conn.cursor()
    cursor.execute('SELECT ip, domain FROM ipaddress ORDER BY ip')
    writer.writerows(cursor)

This handles your input data in batches of 10,000 rows and creates the index to sort on only after inserting. Building that index takes some time, but the work fits in your available memory.
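If load time matters more than crash safety during the import, SQLite's journaling and synchronisation can also be relaxed for the bulk load. This is a common tuning sketch, not part of the answer above; if the load is interrupted you would simply delete the database file and start over:

# optional: speed up the bulk load at the cost of durability
conn.execute('PRAGMA journal_mode = OFF')  # no rollback journal
conn.execute('PRAGMA synchronous = OFF')   # don't fsync on every commit

Run these right after sqlite3.connect(), before the insert loop.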

The UNIQUE index created at the start ensures that only unique (domain, ip) pairs are inserted, so only unique domains per IP address are tracked; the INSERT OR IGNORE statement skips any pair that is already present in the database.

Short demo with just the sample input you gave:

>>> import sqlite3
>>> import csv
>>> import sys
>>> from itertools import islice
>>> conn = sqlite3.connect('/tmp/ipaddresses.db')
>>> conn.execute('CREATE TABLE IF NOT EXISTS ipaddress (domain, ip)')
>>> conn.execute('''\
...     CREATE UNIQUE INDEX IF NOT EXISTS domain_ip_idx
...     ON ipaddress(domain, ip)''')
>>> reader = csv.reader('''\
... yahoo.com|89.45.3.5
... bbc.com|45.67.33.2
... yahoo.com|89.45.3.5
... myname.com|45.67.33.2
... '''.splitlines(), delimiter='|')
>>> cursor = conn.cursor()
>>> while True:
...     with conn:
...         batch = list(islice(reader, 10000))
...         if not batch:
...             break
...         cursor.executemany(
...             'INSERT OR IGNORE INTO ipaddress VALUES(?, ?)',
...             batch)
...
>>> conn.execute('CREATE INDEX IF NOT EXISTS ip_idx ON ipaddress(ip)')
>>> writer = csv.writer(sys.stdout, delimiter='|')
>>> cursor = conn.cursor()
>>> cursor.execute('SELECT ip, domain from ipaddress order by ip')
>>> writer.writerows(cursor)
45.67.33.2|bbc.com
45.67.33.2|myname.com
89.45.3.5|yahoo.com
