Using Beautiful Soup

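This article shows how to use the BeautifulSoup library to extract the data you need from HTML files (the movie title, director, key grip, and "moose" field), turning messy HTML content into structured data.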

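Beautiful Soup is a popular companion tool that helps Python programs make sense of the messy "mostly HTML" content found on web sites. It is an HTML/XML parser written in Python that copes well with malformed markup and builds a parse tree from it.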
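
As a small illustration of the point about malformed markup, the sketch below feeds a deliberately broken fragment to the parser and walks the resulting tree. It assumes the newer `bs4` package (Beautiful Soup 4) under Python 3 rather than the older `BeautifulSoup` module used in the listing in this article; the sample markup is made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical sloppy fragment: the <p> and <b> tags are never closed
sloppy = "<div class='director'><p>Directed by <b>Jane Doe</div>"

# "html.parser" is the builder that ships with Python's standard library
soup = BeautifulSoup(sloppy, "html.parser")

# The dangling tags are closed for us, so the tree can be navigated normally
div = soup.find("div", {"class": "director"})
bold_text = div.find("b").get_text()
all_text = " ".join(div.get_text().split())  # normalize whitespace

print(soup.prettify())  # show the repaired parse tree
print(bold_text)
print(all_text)
```

Different builders (for example `lxml` or `html5lib`, if installed) may repair the same fragment slightly differently, so the exact tree shape should be treated as builder-dependent.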

Using Beautiful Soup to generate tidy data from disorderly content

from glob import glob
from BeautifulSoup import BeautifulSoup

def process():
    print "!MOVIE,DIRECTOR,KEY_GRIP,THE_MOOSE"
    for fname in glob('result_*'):
        # Put that sloppy HTML into the soup
        soup = BeautifulSoup(open(fname))

        # Try to find the fields we want, but default to unknown values
        try:
            movie = soup.findAll('span', {'class':'movie_title'})[1].contents[0]
        except IndexError:
            movie = "UNKNOWN"

        try:
            director = soup.findAll('div', {'class':'director'})[1].contents[0]
        except IndexError:
            director = "UNKNOWN"

        try:
            # Maybe multiple grips listed, key one should be in there
            grips = soup.findAll('p', {'id':'grip'})[0]
            # Join the tag's text nodes, then normalize extra spaces
            grips = " ".join("".join(grips.findAll(text=True)).split())
        except IndexError:
            grips = "UNKNOWN"

        try:
            # Hide some stuff in the HTML <meta> tags
            moose = soup.findAll('meta', {'name':'shibboleth'})[0]['content']
        except IndexError:
            moose = "UNKNOWN"

        print '"%s","%s","%s","%s"' % (movie, director, grips, moose)
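
For readers on Python 3, the same idea can be sketched with the `bs4` package and the standard `csv` module. This is not the original listing, only an adaptation under stated assumptions: the helper `first_text`, the function `extract_row`, and the inline sample HTML are hypothetical, and the sketch takes the first match for each field.

```python
import csv
import io
from bs4 import BeautifulSoup

def first_text(soup, name, attrs, default="UNKNOWN"):
    """Return the whitespace-normalized text of the first matching tag."""
    tag = soup.find(name, attrs)
    if tag is None:
        return default
    return " ".join(tag.get_text().split())

def extract_row(html):
    """Pull the four fields out of one HTML document (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    meta = soup.find("meta", {"name": "shibboleth"})
    moose = meta.get("content", "UNKNOWN") if meta is not None else "UNKNOWN"
    return [
        first_text(soup, "span", {"class": "movie_title"}),
        first_text(soup, "div", {"class": "director"}),
        first_text(soup, "p", {"id": "grip"}),
        moose,
    ]

# Made-up sample input; note that the director <div> is missing on purpose
sample = """
<html><head><meta name="shibboleth" content="Moose bites"></head>
<body><span class="movie_title">Some Film</span>
<p id="grip">Key   grip:
   A. Person</p></body></html>
"""

out = io.StringIO()
writer = csv.writer(out, quoting=csv.QUOTE_ALL)  # quote every field
writer.writerow(["MOVIE", "DIRECTOR", "KEY_GRIP", "THE_MOOSE"])
row = extract_row(sample)
writer.writerow(row)
print(out.getvalue())
```

Letting `csv.writer` do the quoting avoids broken rows when a field itself contains a double quote or a comma, which the hand-built format string does not guard against.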

For details, see: http://www.crummy.com/software/BeautifulSoup/documentation.zh.html

A similar library is PyQuery; see its site at http://packages.python.org/pyquery/