Beautiful Soup 是用Python写的一个HTML/XML的解析器,它可以很好的处理不规范标记并生成剖析树(parse tree)。它提供简单又常用的导航(navigating),搜索以及修改剖析树的操作。它可以大大节省你的编程时间。
使用python开发网页分析功能时,可以借用该库的网页解析功能,时分方便,比自己写正则方便很多,使用时需要引入模块,如下:
在程序中中导入 Beautiful Soup库:
from BeautifulSoup import BeautifulSoup # For processing HTML from BeautifulSoup import BeautifulStoneSoup # For processing XML import BeautifulSoup # To get everything
Beautiful Soup对html处理比较好,对xml处理不是特别完美,如下:
#! /usr/bin/python
#coding:utf-8
from BeautifulSoup import BeautifulSoup
import re
doc = ['<html><head><title>Page title</title></head>',
'<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
'<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
'</html>']
soup = BeautifulSoup(''.join(doc))
print soup.prettify()
输出如下:
# <html>
# <head>
# <title>
# Page title
# </title>
# </head>
# <body>
# <p id="firstpara" align="center">
# This is paragraph
# <b>
# one
# </b>
# .
# </p>
# <p id="secondpara" align="blah">
# This is paragraph
# <b>
# two
# </b>
# .
# </p>
# </body>
# </html>
当然它的功能很强大,下面是一个从网页中提取title的例子,如下:
#!/usr/bin/env python
#coding:utf-8
import Queue
import threading
import urllib2
import time
from BeautifulSoup import BeautifulSoup
hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com","http://ibm.com"]
queue = Queue.Queue()
out_queue = Queue.Queue()
class ThreadUrl(threading.Thread):
"""Threaded Url Grab"""
def __init__(self, queue, out_queue):
threading.Thread.__init__(self)
self.queue = queue
self.out_queue = out_queue
def run(self):
while True:
#grabs host from queue
host = self.queue.get()
#grabs urls of hosts and then grabs chunk of webpage
url = urllib2.urlopen(host)
chunk = url.read()
#place chunk into out queue
self.out_queue.put(chunk)
#signals to queue job is done
self.queue.task_done()
class DatamineThread(threading.Thread):
"""Threaded Url Grab"""
def __init__(self, out_queue):
threading.Thread.__init__(self)
self.out_queue = out_queue
def run(self):
while True:
#grabs host from queue
chunk = self.out_queue.get()
#parse the chunk
soup = BeautifulSoup(chunk)
print soup.findAll(['title'])
#signals to queue job is done
self.out_queue.task_done()
start = time.time()
def main():
#spawn a pool of threads, and pass them queue instance
for i in range(5):
t = ThreadUrl(queue, out_queue)
t.setDaemon(True)
t.start()
#populate queue with data
for host in hosts:
queue.put(host)
for i in range(5):
dt = DatamineThread(out_queue)
dt.setDaemon(True)
dt.start()
#wait on the queue until everything has been processed
queue.join()
out_queue.join()
main()
print "Elapsed Time: %s" % (time.time() - start)
该例子用到了多线程和队列,队列可以简化多线程开发,即分而治之的思想,一个线程只有一个独立的功能,通过队列共享数据,简化程序逻辑,输出结果如下:
[<title>IBM - United States</title>]
[<title>Google</title>]
[<title>Yahoo!</title>]
[<title>Amazon.com: Online Shopping for Electronics, Apparel, Computers, Books, DVDs & more</title>]
Elapsed Time: 12.5929999352
中文文档:http://www.crummy.com/software/BeautifulSoup/documentation.zh.html
官方地址:http://www.crummy.com/software/BeautifulSoup/#Download/