Parsing Web Pages with Python Beautiful Soup

lxzo123

Article link: https://blog.csdn.net/lxzo123/article/details/6727593

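Beautiful Soup is an HTML/XML parser written in Python. It copes well with malformed markup and builds a parse tree from it. It provides simple, commonly used operations for navigating, searching, and modifying the parse tree, which can save a good deal of programming time.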

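When building web-page analysis features in Python, you can rely on this library for the parsing. It is very convenient and much easier than writing your own regular expressions.

To use it, import the Beautiful Soup library in your program: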

from BeautifulSoup import BeautifulSoup          # For processing HTML
from BeautifulSoup import BeautifulStoneSoup     # For processing XML
import BeautifulSoup                             # To get everything

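Beautiful Soup handles HTML well; its XML handling is less polished. The following example parses an HTML fragment whose <p> and <body> tags are never closed: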

#! /usr/bin/python
#coding:utf-8

from BeautifulSoup import BeautifulSoup

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

print soup.prettify()

The output is:

# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>
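
The navigating, searching, and modifying operations mentioned at the start can be sketched as follows. This is a minimal sketch under assumptions that are not in the original article: it uses Python 3 and the newer `bs4` package (Beautiful Soup 4) with the standard-library `html.parser` backend, rather than the `BeautifulSoup` 3 module imported above, and it closes the `<p>` tags explicitly so the result does not depend on how a given parser repairs the markup.

```python
# Assumption: Python 3 with the bs4 package installed (pip install beautifulsoup4).
from bs4 import BeautifulSoup

html = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# Navigating: attribute access walks down to the first matching tag.
title_text = soup.title.string

# Searching: find() returns the first match, find_all() returns every match.
first = soup.find("p", id="firstpara")
first_align = first["align"]
para_ids = [p["id"] for p in soup.find_all("p")]
bold_words = [b.get_text() for b in soup.find_all("b")]

# Modifying: tag attributes can be assigned like dictionary entries.
first["align"] = "left"

print(title_text, first_align, para_ids, bold_words)
```

In Beautiful Soup 3, the version used in the rest of this article, the search method is spelled `findAll` instead of `find_all`.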

The library can do much more than this. The following example extracts the title from several web pages:

#!/usr/bin/env python
#coding:utf-8
import Queue
import threading
import urllib2
import time
from BeautifulSoup import BeautifulSoup

hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com","http://ibm.com"]
queue = Queue.Queue()
out_queue = Queue.Queue()

class ThreadUrl(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue

    def run(self):
        while True:
            #grabs host from queue
            host = self.queue.get()
            #grabs urls of hosts and then grabs chunk of webpage
            url = urllib2.urlopen(host)
            chunk = url.read()
            #place chunk into out queue
            self.out_queue.put(chunk)
            #signals to queue job is done
            self.queue.task_done()

class DatamineThread(threading.Thread):
    """Threaded title extraction"""
    def __init__(self, out_queue):
        threading.Thread.__init__(self)
        self.out_queue = out_queue

    def run(self):
        while True:
            #grabs chunk from out queue
            chunk = self.out_queue.get()
            #parse the chunk
            soup = BeautifulSoup(chunk)
            print soup.findAll(['title'])
            #signals to queue job is done
            self.out_queue.task_done()

start = time.time()
def main():
    #spawn a pool of threads, and pass them queue instance
    for i in range(5):
        t = ThreadUrl(queue, out_queue)
        t.setDaemon(True)
        t.start()
    #populate queue with data
    for host in hosts:
        queue.put(host)

    for i in range(5):
        dt = DatamineThread(out_queue)
        dt.setDaemon(True)
        dt.start()


    #wait on the queue until everything has been processed
    queue.join()
    out_queue.join()

main()
print "Elapsed Time: %s" % (time.time() - start)

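This example uses multiple threads and queues. Queues simplify multithreaded development by applying divide and conquer: each thread does one independent job and shares data with the others through a queue, which keeps the program logic simple. The output is: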

[<title>IBM - United States</title>]
[<title>Google</title>]
[<title>Yahoo!</title>]
[<title>Amazon.com: Online Shopping for Electronics, Apparel, Computers, Books, DVDs & more</title>]
Elapsed Time: 12.5929999352
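
The same division of labor can be tried without network access. The sketch below is not the article's code: it assumes Python 3 (where the module is named `queue`), and `run_pipeline`, `fake_fetch`, and `extract_title` are hypothetical names introduced only for illustration. One group of threads "fetches" pages from an in-memory dictionary, a second group extracts the title with plain string slicing, and the two groups communicate only through queues.

```python
import queue
import threading

def run_pipeline(items, fetch, parse, workers=2):
    """Push items through a fetch stage and a parse stage joined by queues."""
    in_queue = queue.Queue()
    out_queue = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def fetch_worker():
        while True:
            item = in_queue.get()
            try:
                # Hand the fetched chunk to the next stage.
                out_queue.put(fetch(item))
            finally:
                in_queue.task_done()

    def parse_worker():
        while True:
            chunk = out_queue.get()
            try:
                parsed = parse(chunk)
                with results_lock:
                    results.append(parsed)
            finally:
                out_queue.task_done()

    # Daemon threads stay blocked on get() and end with the program.
    for target in (fetch_worker, parse_worker):
        for _ in range(workers):
            threading.Thread(target=target, daemon=True).start()

    for item in items:
        in_queue.put(item)

    # Every chunk is queued before in_queue.join() returns,
    # so out_queue.join() then waits for all parsing to finish.
    in_queue.join()
    out_queue.join()
    return results

# Hypothetical stand-ins for urllib2.urlopen and BeautifulSoup.
pages = {"a": "<title>Alpha</title>", "b": "<title>Beta</title>"}

def fake_fetch(name):
    return pages[name]

def extract_title(html):
    start = html.index("<title>") + len("<title>")
    return html[start:html.index("</title>")]

# Worker scheduling decides the arrival order, so sort before printing.
titles = sorted(run_pipeline(["a", "b"], fake_fetch, extract_title))
print(titles)
```

Because each stage only reads from one queue and writes to another, either stage can be replaced (for example with a real HTTP fetch) without touching the other.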

Chinese documentation: http://www.crummy.com/software/BeautifulSoup/documentation.zh.html

Official site: http://www.crummy.com/software/BeautifulSoup/#Download/
