I have text files totalling about 100 GB in the format below (with duplicate lines, IPs and domains):
domain|ip
yahoo.com|89.45.3.5
bbc.com|45.67.33.2
yahoo.com|89.45.3.5
myname.com|45.67.33.2
etc.
I am trying to parse them with the Python code below, but I still get a MemoryError. Does anybody know a more optimal way to parse such files? (Time is an important factor for me.)
from glob import glob

d = {}  # maps each ip to the set of domains seen for it
files = glob(path)
for filename in files:
    print(filename)
    with open(filename) as f:
        for line in f:
            try:
                domain = line.split('|')[0]
                ip = line.split('|')[1].strip('\n')
                if ip in d:
                    d[ip].add(domain)
                else:
                    d[ip] = set([domain])
            except IndexError:
                print(line)
    print("this file is finished")

for ip, domains in d.items():
    for domain in domains:
        print("%s|%s" % (ip, domain), file=output)
Solution
Python objects take quite a bit more memory than the same values do on disk; every object carries a reference count, and in sets there is also a cached hash value per element to consider.
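A quick way to see that overhead is sys.getsizeof (a minimal sketch; the byte counts shown are from 64-bit CPython 3 and vary by version and platform):

import sys

ip = '89.45.3.5'
print(len(ip))              # 9 bytes on disk (plus a newline)
print(sys.getsizeof(ip))    # ~58 bytes as a Python str object
print(sys.getsizeof({ip}))  # ~216 bytes for even a one-element set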
Don't read all those objects into (Python) memory; use a database instead. Python ships with the sqlite3 module for the SQLite database; use it to convert your files to a database, and you can then build your output file from that:
import csv
import sqlite3
from itertools import islice

conn = sqlite3.connect('/tmp/ipaddresses.db')
conn.execute('CREATE TABLE IF NOT EXISTS ipaddress (domain, ip)')
conn.execute('''\
    CREATE UNIQUE INDEX IF NOT EXISTS domain_ip_idx
    ON ipaddress(domain, ip)''')

for filename in files:  # files = glob(path), as in your code
    print(filename)
    with open(filename, newline='') as f:
        reader = csv.reader(f, delimiter='|')
        cursor = conn.cursor()
        while True:
            with conn:  # commits each batch as one transaction
                batch = list(islice(reader, 10000))
                if not batch:
                    break
                cursor.executemany(
                    'INSERT OR IGNORE INTO ipaddress VALUES(?, ?)',
                    batch)

conn.execute('CREATE INDEX IF NOT EXISTS ip_idx ON ipaddress(ip)')

with open(outputfile, 'w', newline='') as outfh:
    writer = csv.writer(outfh, delimiter='|')
    cursor = conn.cursor()
    cursor.execute('SELECT ip, domain FROM ipaddress ORDER BY ip')
    writer.writerows(cursor)
This handles your input data in batches of 10,000 rows, and only builds the index you sort on after all the inserts are done. Producing that index will take some time, but the whole process fits in your available memory.

The UNIQUE index created at the start ensures that only unique domain-ip pairs are inserted (so only unique domains per IP address are tracked); the INSERT OR IGNORE statement skips any pair that is already present in the database.
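If the load itself turns out to be the bottleneck, SQLite's PRAGMA settings can speed up bulk inserts; a minimal sketch (this trades crash-safety for speed, which is acceptable here because the database can simply be rebuilt from the source files):

# run once, right after sqlite3.connect(), before inserting
conn.execute('PRAGMA journal_mode = WAL')    # write-ahead log; cheaper than the default rollback journal
conn.execute('PRAGMA synchronous = NORMAL')  # fewer fsync calls per transaction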
Short demo with just the sample input you gave:
>>> import sqlite3
>>> import csv
>>> import sys
>>> from itertools import islice
>>> conn = sqlite3.connect('/tmp/ipaddresses.db')
>>> conn.execute('CREATE TABLE IF NOT EXISTS ipaddress (domain, ip)')
>>> conn.execute('''\
...     CREATE UNIQUE INDEX IF NOT EXISTS domain_ip_idx
...     ON ipaddress(domain, ip)''')
>>> reader = csv.reader('''\
... yahoo.com|89.45.3.5
... bbc.com|45.67.33.2
... yahoo.com|89.45.3.5
... myname.com|45.67.33.2
... '''.splitlines(), delimiter='|')
>>> cursor = conn.cursor()
>>> while True:
...     with conn:
...         batch = list(islice(reader, 10000))
...         if not batch:
...             break
...         cursor.executemany(
...             'INSERT OR IGNORE INTO ipaddress VALUES(?, ?)',
...             batch)
...
>>> conn.execute('CREATE INDEX IF NOT EXISTS ip_idx ON ipaddress(ip)')
>>> writer = csv.writer(sys.stdout, delimiter='|')
>>> cursor = conn.cursor()
>>> cursor.execute('SELECT ip, domain FROM ipaddress ORDER BY ip')
>>> writer.writerows(cursor)
45.67.33.2|bbc.com
45.67.33.2|myname.com
89.45.3.5|yahoo.com