I'm not very experienced with complicated large-scale parsing in Python, do you guys have any tips or guides on how to easily parse multiple text files with different formats, and combining them into a single .csv file and ultimately entering them into a database?
An example of the text files is as follows:
general.txt (each record starts with: Name -- Department (DEPT) Room # [Age: #])
John Doe -- Management (MANG) 205 [Age: 40]
Equipment: Laptop, Desktop, Printer, Stapler
Experience: Python, Java, HTML
Description: Hardworking, awesome
Mary Smith -- Public Relations (PR) 605 [Age: 24]
Equipment: Mac, PC
Experience: Social Skills
Description: fun to be around
Scott Lee -- Programmer (PG) 403 [Age: 25]
Equipment: Personal Computer
Experience: HTML, CSS, JS
Description: super-hacker
Susan Kim -- Programmer (PG) 504 [Age: 21]
Equipment: Desktop
Experience: Social Skills
Descriptions: fun to be around
Bob Simon -- Programmer (PG) 101 [Age: 29]
Equipment: Pure Brain Power
Experience: C++, C, Java
Description: never comes out of his room
cars.txt (a list of people who own cars by their department/room #)
Programmer: PG 403, PG 101
Management: MANG 205
house.txt
Programmer: PG 504
The final csv should preferably tabulate to something like:
Name       | Division     | Division Abbreviation | Equipment | Room | Age | Car? | House?
Scott Lee  | Programming  | PG                    | PC        | 403  | 25  | YES  | NO
Mary Smith | Public Rel.  | PR                    | Mac, PC   | 605  | 24  | NO   | NO
The ultimate goal is to have a database where searching "PR" would return every row where a person's Department is "PR", etc. There are maybe 30 text files total, each representing one or more columns in the database. Some columns are short paragraphs, which include commas. There are around 10,000 rows total. I know Python has a built-in csv module, but I'm not sure where to start, or how to end up with just one csv. Any help?
Solution
It looks like you're looking for someone who will solve a whole problem for you. Here I am :)
The general idea is to parse the general info into a dict (using regular expressions), then append the additional fields to it, and finally write it out to CSV. Here's a Python 3.x solution (I think Python 2.7+ should also work):
import csv
import re


def read_general(fname):
    # Read general info to a dict with 'PG 403'-like keys
    # Regexp that will split a row into a ready-to-use dict
    re_name = re.compile(r'''
        (?P<Name>.+)
        \ --\                       # Space + separator + space
        (?P<Division>.+)
        \                           # Space
        \(
        (?P<Division_Abbreviation>.*)
        \)
        \                           # Space
        (?P<Id>\d+)
        \                           # Space
        \[Age:\                     # Space at the end
        (?P<Age>\d+)
        \]
        ''', re.X)
    general = {}
    man = None
    with open(fname, 'rt') as f:
        for line in f:
            line = line.strip()
            m = re_name.match(line)
            if m:
                # Name line, start a new man
                man = m.groupdict()
                key = '%s %s' % (m.group('Division_Abbreviation'), m.group('Id'))
                general[key] = man
            elif line and man is not None:
                # Non-empty 'Key: value' lines belong to the current man
                key, value = line.split(': ', 1)
                man[key] = value
    return general


def add_bool_criteria(fname, field, general):
    # Append a field with a YES/NO value
    with open(fname, 'rt') as f:
        yes_keys = set()
        # Phase one: gather all keys
        for line in f:
            line = line.strip()
            if not line:
                continue
            _, keys = line.split(': ', 1)
            yes_keys.update(keys.split(', '))
    # Phase two: fill data
    for key, man in general.items():  # iteritems() will be faster in Python 2.x
        man[field] = 'YES' if key in yes_keys else 'NO'


def save_csv(fname, general):
    # newline='' prevents blank rows on Windows when using the csv module
    with open(fname, 'wt', newline='') as f:
        # Gather field names
        all_fields = set()
        for value in general.values():
            all_fields.update(value.keys())
        # Write to CSV; sort the fields for a deterministic column order
        w = csv.DictWriter(f, sorted(all_fields))
        w.writeheader()
        w.writerows(general.values())


def main():
    general = read_general('general.txt')
    add_bool_criteria('cars.txt', 'Car?', general)
    add_bool_criteria('house.txt', 'House?', general)
    from pprint import pprint
    pprint(general)
    save_csv('result.csv', general)


if __name__ == '__main__':
    main()
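For the database step you mention, here's a minimal sketch using the stdlib sqlite3 module. The sample rows and the `people` table name are illustrative stand-ins for the dicts the script above builds; the point is that named placeholders let you insert the dicts directly, and a plain WHERE clause gives you the "search by department" query:

```python
import sqlite3

# Hypothetical sample rows standing in for the parsed 'general' dicts above
rows = [
    {'Name': 'Scott Lee', 'Division_Abbreviation': 'PG', 'Id': '403'},
    {'Name': 'Mary Smith', 'Division_Abbreviation': 'PR', 'Id': '605'},
]

conn = sqlite3.connect(':memory:')  # use a file path for a persistent database
conn.execute('CREATE TABLE people (name TEXT, dept TEXT, room TEXT)')
# Named placeholders (:Name, ...) are filled from each dict's keys
conn.executemany(
    'INSERT INTO people VALUES (:Name, :Division_Abbreviation, :Id)', rows)

# Searching "PR" returns every row whose department is PR
found = conn.execute('SELECT name FROM people WHERE dept = ?', ('PR',)).fetchall()
print(found)  # [('Mary Smith',)]
```

With a file-backed database you can skip the intermediate CSV entirely and insert straight from the parsed dicts.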
I wish you lots of $$$ for this ;)
Side note
CSV is history; you could use JSON for storage and further use, since it's simpler to work with, more flexible, and human-readable.
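For example, the same kind of dict could be dumped and reloaded with the stdlib json module (the record shown is an illustrative stand-in for the parsed data):

```python
import json

# Hypothetical record in the same shape as the 'general' dict above
general = {'PG 403': {'Name': 'Scott Lee', 'Age': '25', 'Car?': 'YES'}}

with open('result.json', 'wt') as f:
    json.dump(general, f, indent=2)  # indent=2 keeps the file human-readable

with open('result.json', 'rt') as f:
    restored = json.load(f)
print(restored['PG 403']['Name'])  # Scott Lee
```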