Crawl by BeautifulSoup

daodao18

Column: 学学依一. Tags: python, BeautifulSoup, Crawl

Article link: https://blog.csdn.net/shanlovelong/article/details/86592522

Basics

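HTML (HyperText Markup Language) marks up content with paired tags such as <a> and </a>, which the browser then renders into a page according to its implementation of the standard. HTML alone does not make an attractive page, of course: markup is static and only produces static results, so a polished page also needs CSS for styling and JavaScript for dynamic behavior.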

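DOM (Document Object Model) is the standard programming interface recommended by the W3C for working with XML (and HTML) documents. The objects that make up a page (or document) are organized in a tree, and the standard model used to represent those objects is called the DOM.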

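HTTP (HyperText Transfer Protocol) is a stateless protocol built on top of TCP. The basic workflow: the client sends an HTTP request stating which resource it wants and which action to perform; the server receives the request, processes it, accesses its own resources as the action requires, and finally returns the result to the client in an HTTP response. The span from the start of a request to the end of its response is called a transaction, and when a transaction finishes the server also appends an entry to its log.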

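The browser mainly talks to the server using the following request methods:

GET: requests a resource from the server; the parameters are sent in plain view and can usually be seen in the URL;
POST: submits data such as a web form; the parameters travel in the request body rather than in the URL;
There are a few less common ones as well (PUT, DELETE, HEAD and so on).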
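To make the GET/POST difference concrete, here is a minimal sketch that only builds the two requests with the standard library and never sends them. The endpoint `http://example.com/search` and the parameters are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and parameters, for illustration only.
base_url = 'http://example.com/search'
params = {'q': 'beautiful soup', 'page': 1}

# GET: the parameters are appended to the URL as a query string.
get_req = Request(base_url + '?' + urlencode(params))

# POST: the same parameters travel in the request body as bytes.
post_req = Request(base_url, data=urlencode(params).encode('ascii'))

# Nothing is sent over the network; we only inspect the request objects.
print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

With Requests the same idea would be `requests.get(url, params=...)` versus `requests.post(url, data=...)`.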

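Crawler helper tools

BeautifulSoup: an excellent HTML/XML parsing library. It is well suited to crawling, mostly spares you from thinking about encodings, and its documentation is even available in Chinese, Japanese and Korean, which says something about how active its community is. [Note] It needs a parser to do the actual parsing; the documentation lists the options, and lxml is the recommended one.
Requests: a pleasant HTTP library. Python does ship with urllib (and urllib2 on Python 2), but they are nowhere near as comfortable to use, haha.
Fiddler: an HTTP packet-capture tool that can intercept all HTTP traffic. If a crawler refuses to run, this is the place to look for answers. The official link may be unreachable; in that case you can simply search Baidu for a download.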

A demo

from bs4 import BeautifulSoup
import requests

url = 'https://movie.douban.com/top250'
# GET the page to obtain its HTML
# (if the site refuses the request, try passing headers={'User-Agent': ...})
f = requests.get(url)
# parse the page content with the lxml parser
soup = BeautifulSoup(f.content, 'lxml')
# my understanding so far
# f - a Response object
# f.content - the raw response body as bytes (the HTML you see under F12)
# f.content.decode() - the same body decoded into a str, easier to read
# soup - the parsed document tree

# print(f.content.decode())
# print(soup)
for item in soup.find_all('div', class_='item'):
    # the first <span> after the start of each item holds the movie title
    s = item.find_next('span')
    print(s.string)

stdout:

C:\Users\winter\Anaconda2\envs\tensorflow\python.exe E:/PycharmProjects/CONSOLE/wget.py
肖申克的救赎
霸王别姬
这个杀手不太冷
阿甘正传
美丽人生
泰坦尼克号
千与千寻
辛德勒的名单
盗梦空间
机器人总动员
忠犬八公的故事
三傻大闹宝莱坞
海上钢琴师
放牛班的春天
大话西游之大圣娶亲
楚门的世界
龙猫
星际穿越
教父
熔炉
无间道
当幸福来敲门
疯狂动物城
触不可及
怦然心动

Process finished with exit code 0
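The parsing step of the demo can be tried offline by separating it from the download. The sketch below assumes a made-up HTML fragment that only imitates the structure the demo relies on (one div.item per movie with the title in the first span); `extract_titles` and `SAMPLE` are hypothetical names, and html.parser is used so that nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating the structure the demo relies on:
# one <div class="item"> per movie, with the title in the first <span>.
SAMPLE = """
<ol>
  <li><div class="item"><div class="hd"><a href="#">
    <span class="title">Movie A</span><span class="other">alias</span>
  </a></div></div></li>
  <li><div class="item"><div class="hd"><a href="#">
    <span class="title">Movie B</span>
  </a></div></div></li>
</ol>
"""

def extract_titles(markup):
    """Return the text of the first <span> inside every div.item."""
    soup = BeautifulSoup(markup, 'html.parser')
    titles = []
    for item in soup.find_all('div', class_='item'):
        span = item.find('span')  # searches inside this item only
        if span is not None and span.string:
            titles.append(span.string.strip())
    return titles

print(extract_titles(SAMPLE))
```

Note that `find()` stays inside the item, whereas `find_next()` in the demo keeps looking through the rest of the document if the item itself contains no span.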

Digging a little deeper

tag.name
tag.attrs

Beautiful Soup

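Beautiful Soup provides a handful of simple, Pythonic functions for navigating, searching and modifying the parse tree. It is a toolbox that parses a document and hands you the data you want to scrape; because it is simple, a complete application takes very little code.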

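Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it by itself; in that case you only have to state the original encoding.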
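A minimal sketch of stating the original encoding, assuming the input is GBK-encoded bytes without any charset declaration. The sample string is made up; `from_encoding` is the constructor argument bs4 offers for this.

```python
from bs4 import BeautifulSoup

# GBK bytes with no <meta charset> declaration (sample data for illustration).
raw = '<p>你好，世界</p>'.encode('gbk')

# State the original encoding explicitly via from_encoding.
soup = BeautifulSoup(raw, 'html.parser', from_encoding='gbk')

text = soup.p.string          # already a Unicode string
utf8_bytes = soup.p.encode()  # encode() defaults to UTF-8

print(text)
print(utf8_bytes)
```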

Beautiful Soup has become a Python parsing tool as capable as lxml and html5lib, letting users choose between different parsing strategies or raw speed.

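Parsers

I'm still not entirely sure how Beautiful Soup relates to the HTML parsers. As far as I can tell, the parser turns the markup into a tree and Beautiful Soup provides the interface for working with that tree, so BS itself is not the parser, haha. I will go with it.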

Let’s see the dry ones.

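Parser	Usage	Pros	Cons
Python standard library	`BeautifulSoup(ml, 'html.parser')`	built into Python; moderate speed; lenient with malformed documents	poor leniency in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	`BeautifulSoup(ml, 'lxml')`	fast; lenient with malformed documents	requires the external C library
lxml XML parser	`BeautifulSoup(ml, 'xml')`	fast; the only parser that supports XML	requires the external C library
html5lib	`BeautifulSoup(ml, 'html5lib')`	most lenient; parses pages the way a browser does; produces valid HTML5	slow; requires an external Python package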
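To see how the parsers in the table differ, one can feed the same malformed snippet to each of them. This is only a sketch: lxml and html5lib are optional installs, so a missing parser is reported instead of crashing, and the exact repaired markup depends on the installed versions.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = '<a></p>'  # deliberately malformed markup

results = {}
for parser in ('html.parser', 'lxml', 'html5lib'):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # raised by bs4 when the requested parser is not installed
        results[parser] = None

for parser, output in results.items():
    print(parser, '->', output if output is not None else 'not installed')
```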

The four kinds of objects

Beautiful Soup turns a complex HTML document into a complex tree in which every node is a Python object. All the objects fall into four kinds:

Tag
NavigableString
BeautifulSoup
Comment

Tag

What is a Tag? Put simply, it is one of the tags in the HTML, for example:

<title>The Dormouse's story</title>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

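With soup plus a tag name we can easily get at these tags; doesn't that feel much more convenient than regular expressions? One caveat: it only returns the first matching tag in the whole document.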

HTML = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(HTML, 'lxml')
# <class 'bs4.BeautifulSoup'>
print(type(soup),'\n')
# pretty-print the contents of the soup object
# print(soup.prettify())

print(soup.head)
print(soup.head.title)
print(soup.title)
print(soup.p)
for p in soup.find_all('p', class_='story'):
    print(p)

stdout:

C:\Users\winter\Anaconda2\envs\tensorflow\python.exe E:/PycharmProjects/CONSOLE/Wget.py
<class 'bs4.BeautifulSoup'> 

<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
<title>The Dormouse's story</title>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>

Process finished with exit code 0

The two most important attributes of a Tag are name and attrs, as mentioned above.
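A short sketch of name and attrs on a single tag, using one of the sample links and html.parser so it runs without lxml:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>',
    'html.parser',
)
tag = soup.a

print(tag.name)          # the tag's name
print(tag.attrs)         # all attributes as a dict
print(tag['href'])       # a single attribute, dict-style
print(tag['class'])      # class is multi-valued, so it comes back as a list
print(tag.get('title'))  # .get() returns None for a missing attribute
```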

NavigableString

Once you have a tag, use .string to get the text inside it.

print(soup.p.string)
print(type(soup.p.string))

C:\Users\winter\Anaconda2\envs\tensorflow\python.exe E:/PycharmProjects/CONSOLE/Wget.py
The Dormouse's story
<class 'bs4.element.NavigableString'>

Process finished with exit code 0
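A NavigableString behaves like an ordinary Python string. The sketch below checks that and converts it with str(), which is the usual way to keep the text without holding on to the parse tree:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<p><b>The Dormouse's story</b></p>", 'html.parser')
text = soup.b.string

is_nav = isinstance(text, NavigableString)
is_str = isinstance(text, str)  # NavigableString is a subclass of str
plain = str(text)               # plain text, detached from the tree

print(is_nav, is_str)
print(plain.upper())            # normal string methods work
```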

BeautifulSoup

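A BeautifulSoup object represents the whole content of a document. Most of the time you can treat it as a Tag object, a special kind of Tag; we can look at the type of its name, its name and its attributes to get a feel for it.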

print(type(soup.name))
print(soup.name)
print(soup.attrs)

C:\Users\winter\Anaconda2\envs\tensorflow\python.exe E:/PycharmProjects/CONSOLE/Wget.py
<class 'str'>
[document]
{}

Process finished with exit code 0
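The "special Tag" claim can be checked directly. A small sketch on a made-up one-paragraph document:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup('<html><body><p>hi</p></body></html>', 'html.parser')

is_tag = isinstance(soup, Tag)                     # BeautifulSoup subclasses Tag
top_level = [child.name for child in soup.children]
first_p = soup.find('p').string                    # Tag-style searching works on it

print(is_tag, soup.name, top_level, first_p)
```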

Comment

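A Comment object is a special type of NavigableString. When printed, its content does not include the comment markers, so if it is not handled carefully it can cause unexpected trouble in text processing.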

print(soup.a)
print(soup.a.string)
print(soup.a.Comment)  # looks for a child tag named <Comment>; there is none, so this prints None
print(type(soup.a.string))

C:\Users\winter\Anaconda2\envs\tensorflow\python.exe E:/PycharmProjects/CONSOLE/Wget.py
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
None
<class 'bs4.element.Comment'>

Process finished with exit code 0
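One way to avoid that trouble is to check the type of .string before using it. A minimal sketch with two made-up links, one of which contains only a comment:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = ('<a id="link1"><!-- Elsie --></a>'
        '<a id="link2">Lacie</a>')
soup = BeautifulSoup(html, 'html.parser')

names = []
for a in soup.find_all('a'):
    if isinstance(a.string, Comment):
        continue  # skip text that is really an HTML comment
    names.append(str(a.string))

print(names)
```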

A little more complicated one

# -*- coding=utf-8 -*-
import json
import os

from bs4 import BeautifulSoup
import requests
import wget  # third-party package (pip install wget)

url = 'https://movie.douban.com/tag/#/?sort=U&range=0,10&tags=2018'
# h = requests.get(url)
# print(h.text)
# print(h.content.decode())
# parse a copy of the page saved locally as demo.html
with open('demo.html', 'r', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
# soup = BeautifulSoup(requests.get(url).content, 'lxml')
wrapper = soup.find('div', id='wrapper')
# print(type(wrapper))
# print(wrapper)
# article = wrapper.div.div.div.div.div.div
list_wp = wrapper.find('div', class_='list-wp')
os.makedirs('douban', exist_ok=True)
posters = {}  # movie title -> poster URL (renamed from `dict`, which shadowed the builtin)
for item in list_wp.find_all('a'):
    # print(item.img.attrs)
    attrs = item.img.attrs
    src, alt = attrs['src'], attrs['alt']
    path = os.path.join('douban', '{}.jpg'.format(alt))
    # download each poster only once
    if not os.path.exists(path):
        wget.download(src, path)
    posters[alt] = src
    print(alt + ' ' + src)
# write UTF-8 explicitly, since ensure_ascii=False keeps the Chinese titles as-is
with open('demo.json', 'w', encoding='utf-8') as fp:
    json.dump(posters, fp, ensure_ascii=False, indent='\t')

demo.json

{
	"大黄蜂": "https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2541662397.jpg",
	"无名之辈": "https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2539661066.jpg",
	"毒液：致命守护者": "https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2537158013.jpg",
	"海王": "https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2541280047.jpg",
	"无双": "https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2535096871.jpg",
	"蜘蛛侠：平行宇宙": "https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2542867516.jpg",
	"来电狂响": "https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2542268337.jpg",
	"我不是药神": "https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2519070834.jpg",
	"红海行动": "https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2514119443.jpg",
	"头号玩家": "https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2516578307.jpg",
	"大江大河": "https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2541093820.jpg",
	"西虹市首富": "https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2529206747.jpg",
	"碟中谍6：全面瓦解": "https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2529365085.jpg",
	"一出好戏": "https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2529571873.jpg",
	"江湖儿女": "https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2533283770.jpg",
	"唐人街探案2": "https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2511355624.jpg",
	"复仇者联盟3：无限战争": "https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2517753454.jpg",
	"无问西东": "https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2507572275.jpg",
	"知否知否应是绿肥红瘦": "https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2537131688.jpg",
	"邪不压正": "https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2526297221.jpg"
}

Traversing the document tree

I've grown fond of summarizing things in tables.

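What	How	Notes
Direct children	.contents - the tag's children as a list; .children - the same thing as an iterator	-
All descendants	.descendants - iterates recursively over all of the tag's descendants (an iterator)	-
Node content	.string	If a tag has a single NavigableString child, .string returns that child. If a tag has a single child that is itself a tag, .string also works and gives the same result as that child's .string. If a tag has more than one child, .string is None.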
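The rows of the table can be tried on a tiny made-up document. This sketch uses html.parser and filters descendants down to Tag objects, because .descendants also yields the text nodes:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<div><p>one <b>two</b></p><p>three</p></div>'
soup = BeautifulSoup(html, 'html.parser')
div = soup.div

direct = [child.name for child in div.contents]   # direct children, as a list
lazy = [child.name for child in div.children]     # same nodes, via an iterator
nested = [n.name for n in div.descendants if isinstance(n, Tag)]  # tags at any depth

first_p, second_p = div.contents
# a tag whose only child is another tag: .string passes through to that child
wrapped = BeautifulSoup('<p><b>x</b></p>', 'html.parser').p.string

print(direct, lazy, nested)
print(second_p.string)  # single text child
print(wrapped)
print(first_p.string)   # two children, so .string is None
```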

//TODO other ways of reaching nodes

Searching the document tree

find_all(name, attrs, recursive, text, …)

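# name
string - matched exactly against the tag name
regular expression - matched against the tag name with the regex
list - returns anything that matches any item in the list
True - matches every tag, but not text strings
custom function - a tag is returned when the function returns True for it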

# regular expression
import re
for tag in soup.find_all(re.compile('^b')):
    print(tag.name)

# custom function
def class_without_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(class_without_id)
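The five kinds of name filter can be compared side by side on one small document. The markup below is a cut-down, made-up variant of the three-sisters sample, and `names` is a hypothetical helper that only collects tag names.

```python
import re
from bs4 import BeautifulSoup

html = """<html><head><title>T</title></head>
<body><p class="title"><b>Bold</b></p>
<p class="story"><a class="sister" id="link1">Elsie</a>
<a class="sister" id="link2">Lacie</a></p>
</body></html>"""
soup = BeautifulSoup(html, 'html.parser')

def class_without_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

def names(results):
    """Hypothetical helper: reduce a result list to tag names."""
    return [tag.name for tag in results]

by_string = names(soup.find_all('b'))                 # exact tag name
by_regex = names(soup.find_all(re.compile('^b')))     # names starting with "b"
by_list = names(soup.find_all(['a', 'b']))            # any name in the list
by_true = names(soup.find_all(True))                  # every tag, no text nodes
by_function = names(soup.find_all(class_without_id))  # tags the function accepts

for label, value in [('string', by_string), ('regex', by_regex), ('list', by_list),
                     ('True', by_true), ('function', by_function)]:
    print(label, value)
```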

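# keyword arguments
Any keyword argument that is not one of find_all()'s own parameter names is treated as a filter on the tag attribute of that name. For example, if you pass an argument named id, Beautiful Soup filters on each tag's "id" attribute.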

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass href, Beautiful Soup filters on each tag's "href" attribute.
	
soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Passing several keyword arguments filters on several attributes at once.
soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Filtering with class_.
soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

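Some attributes cannot be used as keyword arguments, for example HTML5 data-* attributes, because their names are not valid Python identifiers.
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'lxml')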
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression

You can still search for such attributes by passing a dictionary as the attrs argument of find_all().
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
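The keyword-argument searches above can be run together on one made-up snippet. This is a sketch with html.parser; the sample markup uses plain text inside the first link, as in the outputs quoted above.

```python
import re
from bs4 import BeautifulSoup

html = """<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<div data-foo="value">foo!</div>"""
soup = BeautifulSoup(html, 'html.parser')

by_id = soup.find_all(id='link2')
by_href = soup.find_all(href=re.compile('elsie'))
by_both = soup.find_all(href=re.compile('elsie'), id='link1')
by_class = soup.find_all('a', class_='sister')
# data-foo is not a valid keyword name, so it goes through attrs
by_data = soup.find_all(attrs={'data-foo': 'value'})

for label, found in [('id', by_id), ('href', by_href), ('both', by_both),
                     ('class_', by_class), ('attrs', by_data)]:
    print(label, [tag.get_text() for tag in found])
```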