I want to convert plain structured text files to the CSV format using Python.
The input looks like this:
[-------- 1 -------]
Version: 2
Stream: 5
Account: A
[...]
[------- 2 --------]
Version: 3
Stream: 6
Account: B
[...]
The output is supposed to look like this:
Version; Stream; Account; [...]
2; 5; A; [...]
3; 6; B; [...]
I.e. the input consists of structured text records delimited by [--------] lines and containing key: value pairs, and the output should be CSV with one record per line.
I am able to extract the key: value pairs with
colonseperated = re.compile(r' *(.+) *: *(.+) *')
fixedfields = re.compile(r'(\d{3} \w{7}) +(.*)')
However, I have trouble recognizing the beginning and end of the structured text records and rewriting them as one CSV line per record. Furthermore, I would like to be able to separate different types of records, i.e. distinguish between, say, Version: 2 and Version: 3 records.
Solution
Reading the records is not that hard:
def read_records(iterable):
    record = {}
    for line in iterable:
        if line.startswith('[------'):
            # new record, yield previous
            if record:
                yield record
                record = {}
            continue
        if ':' not in line:
            # blank line or no key: value pair, skip it
            continue
        key, value = line.strip().split(':', 1)
        record[key.strip()] = value.strip()
    # file done, yield last record
    if record:
        yield record
This produces dictionaries from your input file.
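As a quick check, here is a minimal sketch that runs the generator over the sample input from the question (with the elided "[...]" lines left out) and then groups the resulting dictionaries by their Version value, which is one possible way to keep different record types apart. The names sample and by_version are only illustrative, and the grouping step is an assumption about what "separating record types" should mean:

```python
import io
from collections import defaultdict


def read_records(iterable):
    """Yield one dict per record delimited by a '[------' line."""
    record = {}
    for line in iterable:
        if line.startswith('[------'):
            # new record, yield previous
            if record:
                yield record
                record = {}
            continue
        if ':' not in line:
            # blank line or no key: value pair, skip it
            continue
        key, value = line.split(':', 1)
        record[key.strip()] = value.strip()
    # input done, yield last record
    if record:
        yield record


# Sample input from the question; io.StringIO stands in for an open file.
sample = io.StringIO(
    "[-------- 1 -------]\n"
    "Version: 2\n"
    "Stream: 5\n"
    "Account: A\n"
    "[------- 2 --------]\n"
    "Version: 3\n"
    "Stream: 6\n"
    "Account: B\n"
)

records = list(read_records(sample))
print(records)

# Group the records by their 'Version' field (illustrative approach).
by_version = defaultdict(list)
for rec in records:
    by_version[rec.get('Version')].append(rec)

print(sorted(by_version))  # ['2', '3']
```

Each list in by_version could then be written to its own CSV file, with its own set of headers if the record types carry different fields.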
From this you can produce CSV output using the csv module, specifically the csv.DictWriter() class:
import csv

# List *all* possible keys, in the order the output file should list them
headers = ('Version', 'Stream', 'Account', ...)

# Python 3: open the output in text mode with newline='' (on Python 2, use 'wb' instead)
with open(inputfile) as infile, open(outputfile, 'w', newline='') as outfile:
    records = read_records(infile)
    writer = csv.DictWriter(outfile, headers, delimiter=';')
    writer.writeheader()
    # and write
    writer.writerows(records)
Any header missing from a record leaves that column empty for that record. Any record key that is not listed in headers raises a ValueError; either add those keys to the headers tuple, or set the extrasaction keyword argument of the DictWriter() constructor to 'ignore'.
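To illustrate both cases, here is a small sketch that writes to an in-memory buffer instead of a file. The two records and the extra 'Note' key are made up for the demonstration; only the three header names come from the question:

```python
import csv
import io

headers = ('Version', 'Stream', 'Account')

# Hypothetical records: the first lacks 'Account', the second has an extra key.
records = [
    {'Version': '2', 'Stream': '5'},
    {'Version': '3', 'Stream': '6', 'Account': 'B', 'Note': 'x'},
]

# extrasaction='ignore' silently drops keys that are not in headers.
out = io.StringIO()
writer = csv.DictWriter(out, headers, delimiter=';', extrasaction='ignore')
writer.writeheader()
writer.writerows(records)
csv_text = out.getvalue()
print(csv_text)

# With the default extrasaction='raise', the extra key is an error.
strict = csv.DictWriter(io.StringIO(), headers, delimiter=';')
try:
    strict.writerow(records[1])
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```

The first data row comes out as 2;5; with an empty Account column. Note that the writer emits the delimiter without a trailing space, so the output is 2;5;A-style rather than the "2; 5; A" layout shown in the question.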