I have text files totalling about 100 GB in the format below (with duplicate lines, IPs and domains):
domain|ip
yahoo.com|89.45.3.5
bbc.com|45.67.33.2
yahoo.com|89.45.3.5
myname.com|45.67.33.2
etc.
I am trying to parse them with the Python code below, but I still get a MemoryError. Does anybody know a more optimal way to parse such files? (Time is an important factor for me.)
from glob import glob

d = {}  # maps each ip to the set of domains seen for it
files = glob(path)
for filename in files:
    print(filename)
    with open(filename) as f:
        for line in f:
            try:
                domain = line.split('|')[0]
                ip = line.split('|')[1].strip('\n')
                if ip in d:
                    d[ip].add(domain)
                else:
                    d[ip] = set([domain])
            except IndexError:
                print(line)
    print("this file is finished")

for ip, domains in d.items():
    for domain in domains:
        print("%s|%s" % (ip, domain), file=output)
Solution
Python objects take quite a bit more memory than the same values do on disk; every object carries a reference count, and in sets there is also a cached hash value per element to consider.
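A quick way to see that overhead is sys.getsizeof (a minimal sketch; the byte counts shown are from 64-bit CPython 3 and vary by version and platform):

import sys

ip = '89.45.3.5'
print(len(ip))              # 9 bytes on disk (plus a newline)
print(sys.getsizeof(ip))    # ~58 bytes as a Python str object
print(sys.getsizeof({ip}))  # ~216 bytes for even a one-element set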
Don't read all those objects into (Python) memory; use a database instead. Python ships with the sqlite3 module for the SQLite database; use it to convert your files to a database, and you can then build your output file from that:
import csv
import sqlite3
from itertools import islice

conn = sqlite3.connect('/tmp/ipaddresses.db')
conn.execute('CREATE TABLE IF NOT EXISTS ipaddress (domain, ip)')
conn.execute('''\
    CREATE UNIQUE INDEX IF NOT EXISTS domain_ip_idx
    ON ipaddress(domain, ip)''')

for filename in files:  # files = glob(path), as in your code
    print(filename)
    with open(filename, newline='') as f:
        reader = csv.reader(f, delimiter='|')
        cursor = conn.cursor()
        while True:
            with conn:  # commits each batch as one transaction
                batch = list(islice(reader, 10000))
                if not batch:
                    break
                cursor.executemany(
                    'INSERT OR IGNORE INTO ipaddress VALUES(?, ?)',
                    batch)

conn.execute('CREATE INDEX IF NOT EXISTS ip_idx ON ipaddress(ip)')

with open(outputfile, 'w', newline='') as outfh:
    writer = csv.writer(outfh, delimiter='|')
    cursor = conn.cursor()
    cursor.execute('SELECT ip, domain FROM ipaddress ORDER BY ip')
    writer.writerows(cursor)
This handles your input data in batches of 10,000 rows, and only builds the index you sort on after all the inserts are done. Producing that index will take some time, but the whole process fits in your available memory.

The UNIQUE index created at the start ensures that only unique domain-ip pairs are inserted (so only unique domains per IP address are tracked); the INSERT OR IGNORE statement skips any pair that is already present in the database.
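If the load itself turns out to be the bottleneck, SQLite's PRAGMA settings can speed up bulk inserts; a minimal sketch (this trades crash-safety for speed, which is acceptable here because the database can simply be rebuilt from the source files):

# run once, right after sqlite3.connect(), before inserting
conn.execute('PRAGMA journal_mode = WAL')    # write-ahead log; cheaper than the default rollback journal
conn.execute('PRAGMA synchronous = NORMAL')  # fewer fsync calls per transaction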
Short demo with just the sample input you gave:
>>> import sqlite3
>>> import csv
>>> import sys
>>> from itertools import islice
>>> conn = sqlite3.connect('/tmp/ipaddresses.db')
>>> conn.execute('CREATE TABLE IF NOT EXISTS ipaddress (domain, ip)')
>>> conn.execute('''\
...     CREATE UNIQUE INDEX IF NOT EXISTS domain_ip_idx
...     ON ipaddress(domain, ip)''')
>>> reader = csv.reader('''\
... yahoo.com|89.45.3.5
... bbc.com|45.67.33.2
... yahoo.com|89.45.3.5
... myname.com|45.67.33.2
... '''.splitlines(), delimiter='|')
>>> cursor = conn.cursor()
>>> while True:
...     with conn:
...         batch = list(islice(reader, 10000))
...         if not batch:
...             break
...         cursor.executemany(
...             'INSERT OR IGNORE INTO ipaddress VALUES(?, ?)',
...             batch)
...
>>> conn.execute('CREATE INDEX IF NOT EXISTS ip_idx ON ipaddress(ip)')
>>> writer = csv.writer(sys.stdout, delimiter='|')
>>> cursor = conn.cursor()
>>> cursor.execute('SELECT ip, domain FROM ipaddress ORDER BY ip')
>>> writer.writerows(cursor)
45.67.33.2|bbc.com
45.67.33.2|myname.com
89.45.3.5|yahoo.com