Reading CSV-format text in Python: parsing a plain text file into a CSV file with Python


I have a series of HTML files that are parsed into a single text file using Beautiful Soup. The HTML files are formatted such that their output is always three lines within the text file, so the output will look something like:

Hello!

How are you?

Well, Bye!

But it could just as easily be

83957

And I ain't coming back!

hgu39hgd

In other words, the contents of the HTML files are not really standard across each of them, but they do always produce three lines.

So, I was wondering where I should start if I want to then take the text file that is produced from Beautiful Soup and parse that into a CSV file with columns such as (using the above examples):

Title    Intro                       Tagline
Hello!   How are you?                Well, Bye!
83957    And I ain't coming back!    hgu39hgd

The Python code for stripping the HTML from the text files is this:

import os
import glob
import codecs
import csv
from bs4 import BeautifulSoup

path = "c:\\users\\me\\downloads\\"

for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(codecs.open(markup, "r", "utf-8").read())
    with open("extracted.txt", "a") as myfile:
        myfile.write(soup.get_text())

And I gather I can use this to set up the columns in my CSV file:

csv.put_HasColumnNames(True)
csv.SetColumnName(0,"title")
csv.SetColumnName(1,"intro")
csv.SetColumnName(2,"tagline")

Where I'm drawing a blank is how to iterate through the text file (extracted.txt) one line at a time and, as I get to each new line, put it in the correct cell of the CSV file. The first several lines of the file are blank, and there are many blank lines between each grouping of text. So, first I would need to open the file and read it:

file = open("extracted.txt")
for line in file:
    pass # csv.SetCell(0,0 X) (obviously, I don't know what to put in X)

Also, I don't know how to tell Python to just keep reading the file and adding to the CSV file until it's finished. In other words, there's no way to know in advance how many total lines the HTML files will produce, so I can't just go from csv.SetCell(0,0) to csv.SetCell(999,999).

Solution

I'm not entirely sure what CSV library you're using, but it doesn't look like Python's built-in one. Anyway, here's how I'd do it:

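import csv

with open('extracted.txt', 'r') as in_file:
    # Strip leading and trailing whitespace, then drop lines that are empty
    stripped = (line.strip() for line in in_file)
    lines = (line for line in stripped if line)
    # Group into threes. Python 3's built-in zip is lazy; on Python 2 use itertools.izip
    grouped = zip(*[lines] * 3)
    # newline='' lets the csv module control line endings (Python 3)
    with open('extracted.csv', 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(('title', 'intro', 'tagline'))
        writer.writerows(grouped)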

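This forms a pipeline of sorts. It reads lines from the file, strips the leading and trailing whitespace from each line, drops any lines that are empty after stripping, collects the remaining lines into groups of three, and then (after writing the CSV header) writes those groups to the CSV file. Each stage is lazy, so nothing is read until the rows are written, which is why the output file is opened while the input file is still open.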
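The grouping step is the least obvious part of that pipeline, so here is a minimal sketch of it on in-memory data. The sample lines and the helper name group_in_threes are made up for illustration, and it assumes Python 3. It also shows one thing to watch for: plain zip silently drops a trailing group with fewer than three lines, while itertools.zip_longest keeps it and pads it.

```python
from itertools import zip_longest

# Hypothetical sample standing in for the contents of extracted.txt.
raw_lines = [
    "\n",
    "  Hello!\n",
    "\n",
    "How are you?\n",
    "Well, Bye!\n",
    "\n",
    "\n",
    "83957\n",
    "And I ain't coming back!\n",
    "hgu39hgd\n",
    "\n",
    "orphan line\n",
]


def group_in_threes(lines, pad=False):
    """Strip lines, drop blank ones, and group the rest into tuples of three."""
    stripped = (line.strip() for line in lines)
    non_blank = (line for line in stripped if line)
    # The same iterator object repeated three times, so each output
    # tuple pulls three consecutive items from it.
    iterators = [non_blank] * 3
    if pad:
        # Keep a trailing incomplete group, padded with empty strings.
        return zip_longest(*iterators, fillvalue="")
    # Plain zip stops at the first incomplete group and discards it.
    return zip(*iterators)


rows = list(group_in_threes(raw_lines))
padded_rows = list(group_in_threes(raw_lines, pad=True))

for row in rows:
    print(row)
```

If every HTML file really does yield exactly three lines, both variants give the same result. The padded variant is only useful as a safety net for a malformed input.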

To combine the last two columns as you mentioned in the comments, change the header passed to writerow so that it has two column names instead of three, and change the writerows call to:

writer.writerows((title, intro + tagline) for title, intro, tagline in grouped)
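Note that intro + tagline joins the two strings with nothing between them. Below is a small sketch of the two-column variant with a space as the separator, assuming Python 3. The header name intro_tagline, the separator, and the sample rows are assumptions for illustration, and an in-memory buffer stands in for the real output file.

```python
import csv
import io

# Hypothetical rows, shaped like the three-item groups from the text file.
grouped = [
    ("Hello!", "How are you?", "Well, Bye!"),
    ("83957", "And I ain't coming back!", "hgu39hgd"),
]

# In-memory buffer standing in for open('extracted.csv', 'w', newline='').
out_file = io.StringIO()
writer = csv.writer(out_file)

# Two-column header; the second column name is an assumed choice.
writer.writerow(("title", "intro_tagline"))

# Join intro and tagline with a space so the two texts do not run together.
writer.writerows(
    (title, intro + " " + tagline) for title, intro, tagline in grouped
)

csv_text = out_file.getvalue()
print(csv_text)
```

The csv module quotes the combined field whenever it contains a comma, so a tagline such as "Well, Bye!" still ends up in a single cell.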
