Python Web Crawlers

Latest recommended article published 2022-04-15 15:45:48

weixin_48357536

Category: Python. Tags: python, crawler

Article link: https://blog.csdn.net/weixin_48357536/article/details/120146920

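1. Crawling Rules

Getting Started with the Requests Library

The get() method

Syntax

response=requests.get(url,params=None,**kwargs)

response: the Response object holding the resource returned by the server

request: the Request object that get() builds internally to ask the server for the resource

Parameter	Meaning
url	URL of the page to fetch
params	Extra parameters appended to the URL, as a dict or byte stream; optional
**kwargs	12 further parameters that control the request

A Response object contains everything the server returned, together with the Request that produced it.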

>>> import requests
>>> r=requests.get("http://www.baidu.com")
>>> print(r.status_code)
200
>>> type(r)
<class 'requests.models.Response'>
>>> r.headers
{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Mon, 05 Jul 2021 12:37:27 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:29 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}

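Response object attributes

Attribute	Description
r.status_code	HTTP status of the response; 200 means success
r.text	Response body as a string
r.encoding	Encoding of the response body as declared in the HTTP header
r.apparent_encoding	Encoding of the response body as inferred from the content itself
r.content	Response body in binary form

r.encoding: if the header carries no charset, the encoding is assumed to be ISO-8859-1; r.text decodes the page according to r.encoding.

r.apparent_encoding: the encoding inferred by analyzing the page content; use it as a fallback when r.encoding yields garbled text.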
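
The effect of a wrong r.encoding can be reproduced offline. The sketch below uses only the standard library; `body` is a hypothetical response body (not taken from a real site), and the two decode calls stand in for what r.text would return under each encoding.

```python
# Hypothetical response body: UTF-8 bytes served without a charset in the header.
body = "中文网页".encode("utf-8")

# What r.text would look like under the ISO-8859-1 fallback.
garbled = body.decode("iso-8859-1")

# What r.text looks like once the encoding is corrected to UTF-8.
readable = body.decode("utf-8")

print(garbled == "中文网页")  # the fallback does not recover the text
print(readable)
```

This is why the examples in this article assign r.encoding=r.apparent_encoding before reading r.text.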

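Requests library exceptions

Exception	Description
requests.ConnectionError	Network connection error, such as a failed DNS lookup or a refused connection
requests.HTTPError	HTTP error
requests.URLRequired	A URL is missing
requests.TooManyRedirects	The maximum number of redirects was exceeded
requests.ConnectTimeout	Timed out while connecting to the remote server
requests.Timeout	The request to the URL timed out (covers both connect and read timeouts)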

A general code framework for fetching a page

import requests

def getHTMLText(url):
    try:
        r=requests.get(url,timeout=30)
        r.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
        r.encoding=r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "An exception occurred"

if __name__=="__main__":
    url="http://www.baidu.com"
    print(getHTMLText(url))

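Requests library methods and the HTTP protocol

Requests library methods

Method	Description
requests.request()	Builds a request; the basis for all the methods below
requests.get()	Main method for fetching an HTML page (HTTP GET)
requests.head()	Fetches the header information of an HTML page (HTTP HEAD)
requests.post()	Submits a POST request to an HTML page
requests.put()	Submits a PUT request to an HTML page
requests.patch()	Submits a partial-modification (PATCH) request to an HTML page
requests.delete()	Submits a DELETE request to an HTML page

The HTTP protocol

HTTP stands for Hypertext Transfer Protocol.

HTTP is a stateless application-layer protocol based on a request/response model.

HTTP uses URLs to locate network resources, in the following format:

http://host[:port][path]

host: a valid Internet host name or IP address
port: port number; defaults to 80
path: path of the requested resource

For example:

http://www.bit.edu.cn
http://220.181.111.188/duty

A URL is the Internet path for accessing a resource over HTTP; each URL corresponds to one data resource.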
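
The host/port/path breakdown can be checked with the standard library. This is a minimal sketch; `describe` is a hypothetical helper introduced for illustration, and the fallback to port 80 and path "/" mirrors the defaults described above.

```python
from urllib.parse import urlsplit

def describe(url):
    """Return (host, port, path) for an http URL, filling in the defaults."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else 80  # default HTTP port
    path = parts.path or "/"                              # empty path means the root
    return parts.hostname, port, path

print(describe("http://www.bit.edu.cn"))
print(describe("http://220.181.111.188/duty"))
print(describe("http://example.com:8080/a/b"))
```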

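HTTP operations on resources

Method	Description
GET	Requests the resource at the URL
HEAD	Requests only the response headers for the resource at the URL
POST	Requests that new data be appended to the resource at the URL
PUT	Requests that a resource be stored at the URL, overwriting whatever is there
PATCH	Requests a partial update of the resource at the URL
DELETE	Requests deletion of the resource stored at the URL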

Examples:

The head() method

>>> r=requests.head('http://httpbin.org/get')
>>> r.headers
{'Date': 'Mon, 05 Jul 2021 13:00:56 GMT', 'Content-Type': 'application/json', 'Content-Length': '305', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}
>>> r.text
''

The post() method

>>> payload={'key1':'value1','key2':'value2'}
>>> r=requests.post('http://httpbin.org/post',data=payload)
>>> print(r.text)
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "23", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.25.1", 
    "X-Amzn-Trace-Id": "Root=1-60e3032e-19c3e31f439eff19729d08cf"
  }, 
  "json": null, 
  "origin": "219.142.99.9", 
  "url": "http://httpbin.org/post"
}

>>> r=requests.post('http://httpbin.org/post',data='ABC')
>>> print(r.text)
{
  "args": {}, 
  "data": "ABC", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "3", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.25.1", 
    "X-Amzn-Trace-Id": "Root=1-60e30365-2c7c766e12c061ed10545839"
  }, 
  "json": null, 
  "origin": "219.142.99.9", 
  "url": "http://httpbin.org/post"
}

The put() method

>>> payload={'key1':'value1','key2':'value2'}
>>> r=requests.put('http://httpbin.org/put',data=payload)
>>> print(r.text)
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "23", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.25.1", 
    "X-Amzn-Trace-Id": "Root=1-60e303d2-73835b2f43e8154f01e6eebd"
  }, 
  "json": null, 
  "origin": "219.142.99.9", 
  "url": "http://httpbin.org/put"
}

The main Requests methods in detail

requests.request()

requests.request(method,url,**kwargs)

method

The request method; there are seven:

r=requests.request('GET',url,**kwargs)

r=requests.request('HEAD',url,**kwargs)

r=requests.request('POST',url,**kwargs)

r=requests.request('PUT',url,**kwargs)

r=requests.request('PATCH',url,**kwargs)

r=requests.request('DELETE',url,**kwargs)

r=requests.request('OPTIONS',url,**kwargs)

url

URL of the page to fetch

**kwargs

Parameters that control the request; there are 13:

params: a dict or byte sequence, appended to the URL as query parameters

kv={'key1':'value1','key2':'value2'}
r=requests.request('GET','http://python123.io/ws',params=kv)
print(r.url)
# Output: https://python123.io/ws?key1=value1&key2=value2
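
How a params dict ends up in the URL can be approximated offline. This sketch uses urllib.parse.urlencode from the standard library as a stand-in; it is not Requests' own implementation, only an illustration of the resulting query string.

```python
from urllib.parse import urlencode

base = "http://python123.io/ws"
kv = {"key1": "value1", "key2": "value2"}

# Encode the dict as key=value pairs joined by '&' and append it after '?'.
full_url = base + "?" + urlencode(kv)
print(full_url)
```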

data: a dict, byte sequence, or file object, sent as the body of the Request

kv={'key1':'value1','key2':'value2'}
r=requests.request('POST','http://python123.io/ws',data=kv)
body='主体内容'
r=requests.request('POST','http://python123.io/ws',data=body)

json: data in JSON format, sent as the body of the Request

kv={'key1':'value1'}
r=requests.request('POST','http://python123.io/ws',json=kv)

headers: a dict of custom HTTP headers

hd={'user-agent':'Chrome/10'}
r=requests.request('POST','http://python123.io/ws',headers=hd)

cookies: a dict or CookieJar, the cookies sent with the Request

auth: a tuple, enables HTTP authentication

files: a dict, used to upload files

fs={'file':open('data.xls','rb')}
r=requests.request('POST','http://python123.io/ws',files=fs)

timeout: timeout in seconds

r=requests.request('GET','http://www.baidu.com',timeout=10)

proxies: a dict that sets proxy servers for the request; login credentials can be included

pxs={'http':'http://user:pass@10.10.10.1:1234',
     'https':'https://10.10.10.1:4321'}
r=requests.request('GET','http://www.baidu.com',proxies=pxs)

allow_redirects: True/False, default True; switch for following redirects

stream: True/False, default False; when False the response body is downloaded immediately, when True the download is deferred until the content is accessed

verify: True/False, default True; switch for verifying the SSL certificate

cert: path to a local SSL certificate

get/head/post/put/patch/delete

requests.get(url,params=None,**kwargs)
requests.head(url,**kwargs)
requests.post(url,data=None,json=None,**kwargs)
requests.put(url,data=None,**kwargs)
requests.patch(url,data=None,**kwargs)
requests.delete(url,**kwargs)

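The Robots protocol

Scale	Data volume	Crawl speed	Implementation	Purpose
Small	Small	Not critical	Requests library	Crawling web pages
Medium	Fairly large	Matters	Scrapy library	Crawling a site or a family of sites
Large	Search-engine scale	Critical	Custom development	Crawling the whole web

Problems caused by crawlers: performance harassment (extra load on servers), legal risk, and privacy leaks

How sites restrict crawlers:

Source review: restrict by checking User-Agent, responding only to browsers or friendly crawlers
Published notice: the Robots protocol tells every crawler the site's crawling policy and asks crawlers to comply

The policy lives in robots.txt in the site's root directory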
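
A crawler can check a robots.txt policy before fetching a page. The sketch below uses urllib.robotparser from the standard library; the rules, the crawler names, and the example.com URLs are all hypothetical, and the rules are parsed from a string so that nothing is downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
rules = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())  # parse from memory instead of fetching a URL

ok_public = parser.can_fetch("MyCrawler", "http://example.com/index.html")
ok_private = parser.can_fetch("MyCrawler", "http://example.com/private/a.html")
ok_badbot = parser.can_fetch("BadBot", "http://example.com/index.html")
print(ok_public, ok_private, ok_badbot)
```

For a live site, the same parser can be pointed at the site's robots.txt with set_url() followed by read().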

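Guidelines for compliance

Very low traffic: compliance is optional

Higher traffic: compliance is recommended

Non-commercial and occasional: compliance is recommended

Commercial use: compliance is mandatory

Human-like access: compliance is not required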

Hands-on crawling with the Requests library

Taobao product page

import requests
url="https://detail.tmall.com/item.htm?id=644921704475&ali_refid=a3_430673_1006:1151058490:N:366HymmnBJsFRQSPHPLglw0x8tfCeIwXVIODU6D0C0Q=:e3f86a3c79ee2a3ea5423fe0a31391d6&ali_trackid=1_e3f86a3c79ee2a3ea5423fe0a31391d6&spm=a2e0b.20350158.31919782.2"
try:
    r=requests.get(url)
    r.raise_for_status()
    r.encoding=r.apparent_encoding
    print(r.text)
except requests.RequestException:
    print("Failed")

Amazon product page

import requests
url="https://www.amazon.cn/dp/B08LGHY8Y2/?_encoding=UTF8&pd_rd_w=VJkWI&pf_rd_p=f5621b50-8106-4635-9197-31638310081d&pf_rd_r=6WJTTANEYVKKGGG0609Q&pd_rd_r=d0890f31-f4c8-4797-801e-44e3610bfc74&pd_rd_wg=Nr1Vm&ref_=pd_gw_unk"
kv={'user-agent':'Mozilla/5.0'}
try:
    r=requests.get(url,headers=kv)
    r.raise_for_status()
    r.encoding=r.apparent_encoding
    print(r.text)
except requests.RequestException:
    print("Failed")

Submitting search keywords to Baidu and 360

import requests
keyword='Python'
url="http://www.baidu.com/s"
kv={'wd':keyword}  # Baidu takes the keyword in the wd parameter
try:
    r=requests.get(url,params=kv)
    r.raise_for_status()
    r.encoding=r.apparent_encoding
    print(len(r.text))
except requests.RequestException:
    print("Failed")

import requests
keyword='Python'
url="http://www.so.com/s"
kv={'q':keyword}  # 360 search takes the keyword in the q parameter
try:
    r=requests.get(url,params=kv)
    r.raise_for_status()
    r.encoding=r.apparent_encoding
    print(len(r.text))
except requests.RequestException:
    print("Failed")

Fetching and saving an image from the web

import requests
import os
url="https://images-cn.ssl-images-amazon.cn/images/I/51Fsi+75pdL.jpg"
root="D://pics//"
path=root+url.split('/')[-1]  # reuse the file name from the URL

try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r=requests.get(url)
        r.raise_for_status()  # do not save an error page as an image
        with open(path,'wb') as f:  # the with block closes the file automatically
            f.write(r.content)
        print("File saved")
    else:
        print("File already exists")
except (requests.RequestException, OSError):
    print("Failed")

Automatic lookup of an IP address's location

import requests
url="https://m.ip138.com/iplookup.asp?ip="
kv={'user-agent':'Mozilla/5.0'}
try:
    r=requests.get(url+'172.22.187.161',headers=kv)
    r.raise_for_status()
    r.encoding=r.apparent_encoding
    print(r.text)
except requests.RequestException:
    print("Failed")

2. Extracting Information

Getting Started with the Beautiful Soup Library

A quick Beautiful Soup test

Formatted HTML output

import requests
from bs4 import BeautifulSoup

r=requests.get("http://python123.io/ws/demo.html")
print(r.text)
demo=r.text
soup=BeautifulSoup(demo,"html.parser")
print(soup.prettify())

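Basic elements of Beautiful Soup

Beautiful Soup is a library for parsing, traversing, and maintaining a "tag tree".

Tag: <p>…</p>

Name: p; tags come in pairs

Attributes: class="title"; a tag may have zero or more

Importing the Beautiful Soup library

from bs4 import BeautifulSoup
import bs4

An HTML file, a tag tree, and a BeautifulSoup object are equivalent views of the same thing.

One BeautifulSoup object corresponds to the entire content of one HTML/XML document.

soup=BeautifulSoup("<html>data</html>","html.parser")
soup2=BeautifulSoup(open("D://demo.html"),"html.parser")

Beautiful Soup parsers

Parser	Usage	Requirement
bs4's HTML parser	BeautifulSoup(mk,'html.parser')	Install the bs4 library
lxml's HTML parser	BeautifulSoup(mk,'lxml')	pip install lxml
lxml's XML parser	BeautifulSoup(mk,'xml')	pip install lxml
html5lib's parser	BeautifulSoup(mk,'html5lib')	pip install html5lib

Basic elements of the BeautifulSoup class

Element	Description
Tag	A tag, the basic unit of information; opened with <> and closed with </>
Name	The tag's name; the name of <p>…</p> is 'p'; accessed as <tag>.name
Attributes	The tag's attributes, organized as a dict; accessed as <tag>.attrs
NavigableString	The non-attribute string inside a tag, i.e. the text between <> and </>; accessed as <tag>.string
Comment	The comment part of a string inside a tag; a special subclass of NavigableString

Demonstration of Tag, Name, and Attributes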

>>> import requests
>>> r=requests.get("http://python123.io/ws/demo.html")
>>> demo=r.text
>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,"html.parser")
>>> soup.title
<title>This is a python demo page</title>
>>> tag=soup.a  # returns the first <a> tag
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.name
'a'
>>> soup.a.parent.name
'p'
>>> soup.a.parent.parent.name
'body'
>>> tag=soup.a
>>> tag.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> tag.attrs['class']
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
>>> type(tag.attrs)
<class 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>

NavigableString demonstration

>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.string
'Basic Python'
>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string
'The demo python introduces several python courses.'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>

Comment demonstration

>>> newsoup=BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>","html.parser")
>>> newsoup.b.string
'This is a comment'
>>> type(newsoup.b.string)
<class 'bs4.element.Comment'>
>>> newsoup.p.string
'This is not a comment'
>>> type(newsoup.p.string)
<class 'bs4.element.NavigableString'>

HTML traversal with the bs4 library

<html>
	<head>
		<title>This is a python demo page</title>
	</head>
	<body>
		<p class="title">
			<b>The demo python introduces several python courses.</b>
		</p>
		<p class="course">Python is a wonderful general-purpose programming language.
You can learn Python from novice to professional by tracking the following courses:
			<a href="http://www.icourse163.org/course/BIT-268001" class="py1"
id="link1">Basic Python</a> and
			<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2"
id="link2">Advanced Python</a>.
		</p>
	</body>
</html>

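Traversing down the tag tree

Attribute	Description
.contents	List of child nodes; all direct children of the tag are stored in a list
.children	Iterator over child nodes; like .contents, used to loop over direct children
.descendants	Iterator over all descendant nodes, used to loop over the whole subtree

.contents demonstration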

>>> soup=BeautifulSoup(demo,"html.parser")
>>> soup.head
<head><title>This is a python demo page</title></head>
>>> soup.head.contents
[<title>This is a python demo page</title>]
>>> soup.body.contents
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
>>> len(soup.body.contents)
5

.children and .descendants demonstration

for child in soup.body.children:
    print(child)  # iterate over direct children

for child in soup.body.descendants:
    print(child)  # iterate over all descendants

The difference between the two

Original HTML

html ="""
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

.children code

from bs4 import BeautifulSoup
soup1 = BeautifulSoup(html, 'lxml')
print(soup1.p.children)
for i, child in enumerate(soup1.p.children):
    print(i, child)

.children output

<list_iterator object at 0x000000DD59609898>
0             Once upon a time there were three little sisters; and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1"><span>Elsie</span></a>
2
3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4              and
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6             and they lived at the bottom of a well.

.descendants code

from bs4 import BeautifulSoup
soup2 = BeautifulSoup(html, 'lxml')
print(soup2.p.children)
for i, desc in enumerate(soup2.p.descendants):
    print(i, desc)

.descendants output

<list_iterator object at 0x000000DD595897B8>
0             Once upon a time there were three little sisters; and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1"><span>Elsie</span></a>
2
3 <span>Elsie</span>
4 Elsie
5
6
7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9              and
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12             and they lived at the bottom of a well.

Put figuratively: suppose a residential compound is surveying occupancy.
.children finds each household, records whether anyone lives there, and stops.
.descendants goes further. After recording each household, it also records every room,
who lives in each one, and their interests, occupation, hobbies, registered residence, and so on.
Traversing up the tag tree

Attribute	Description
.parent	The node's parent tag
.parents	Iterator over the node's ancestor tags, used to loop over ancestors

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r=requests.get("http://python123.io/ws/demo.html")
>>> demo=r.text
>>> soup=BeautifulSoup(demo,"html.parser")
>>> soup.title.parent
<head><title>This is a python demo page</title></head>
>>> soup.html.parent
<html><head><title>This is a python demo page</title></head><body><p class="title"><b>The demo python introduces several python courses.</b></p><p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p></body></html>
>>> soup.parent
>>> for parent in soup.a.parents:
	if parent is None:
		print(parent)
	else:
		print(parent.name)

p
body
html
[document]

The traversal covers every ancestor, including soup itself, so check for that case.

Traversing sideways across the tag tree

Attribute	Description
.next_sibling	The next sibling node in HTML text order
.previous_sibling	The previous sibling node in HTML text order
.next_siblings	Iterator over all following sibling nodes in HTML text order
.previous_siblings	Iterator over all preceding sibling nodes in HTML text order

Note: sideways traversal happens only among nodes that share the same parent.

.next_sibling and .previous_sibling demonstration

>>> soup=BeautifulSoup(demo,"html.parser")
>>> soup.a.next_sibling
' and '
>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
>>> soup.a.previous_sibling
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
>>> soup.a.previous_sibling.previous_sibling
>>> soup.a.parent
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>

.next_siblings and .previous_siblings demonstration

for sibling in soup.a.next_siblings:
    print(sibling)  # iterate over following siblings
for sibling in soup.a.previous_siblings:
    print(sibling)  # iterate over preceding siblings

Summary

Formatted HTML output with the bs4 library

Before

>>> r=requests.get("http://python123.io/ws/demo.html")
>>> demo=r.text
>>> demo
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'

After

>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,"html.parser")
>>> soup.prettify()
'<html>\n <head>\n  <title>\n   This is a python demo page\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The demo python introduces several python courses.\n   </b>\n  </p>\n  <p class="course">\n   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n    Basic Python\n   </a>\n   and\n   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n    Advanced Python\n   </a>\n   .\n  </p>\n </body>\n</html>'

>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

.prettify() also works on a single tag

>>> print(soup.a.prettify())
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>

The bs4 library converts any HTML input to UTF-8 encoding

>>> soup=BeautifulSoup("<p>中文</p>","html.parser")
>>> soup.p.string
'中文'
>>> print(soup.p.prettify())
<p>
 中文
</p>

Information organization and extraction

Three forms of information markup

XML

eXtensible Markup Language

Example

<person>
	<firstName>Tian</firstName>
	<lastName>Song</lastName>
	<address>
		<streetAddr>中关村南大街5号</streetAddr>
		<city>北京市</city>
		<zipcode>100081</zipcode>
	</address>
	<prof>Computer System</prof><prof>Security</prof>
</person>

For example:

<name>...</name>
<img src="china.jpg" size="10">...</img>

Name: img

Attributes: src="china.jpg" size="10"

Tag: the whole element

Abbreviated form for empty elements

<name />
<img src="china.jpg" size="10"/>

Comment syntax

<!-- -->
<!-- This is a comment, very useful -->
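
The XML example in this section can be read with the standard library. This is a minimal sketch using xml.etree.ElementTree; the variable names are introduced here for illustration.

```python
import xml.etree.ElementTree as ET

doc = """<person>
  <firstName>Tian</firstName>
  <lastName>Song</lastName>
  <address>
    <streetAddr>中关村南大街5号</streetAddr>
    <city>北京市</city>
    <zipcode>100081</zipcode>
  </address>
  <prof>Computer System</prof><prof>Security</prof>
</person>"""

root = ET.fromstring(doc)
first_name = root.find("firstName").text          # text of a direct child
city = root.find("address/city").text             # nested element, addressed by path
profs = [p.text for p in root.findall("prof")]    # repeated tags become a list
print(first_name, city, profs)
```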

JSON

JavaScript Object Notation

Typed key-value pairs

Example

{
	"firstName" : "Tian" ,
	"lastName" : "Song" ,
	"address" : {
		"streetAddr" : "中关村南大街5号" ,
		"city" : "北京市" ,
		"zipcode" : "100081"
	} ,
	"prof" : [ "Computer System" , "Security" ]
}
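
Because JSON values are typed, a parser maps them straight onto native data structures. The sketch below loads the example with Python's json module; nested objects become dicts and arrays become lists.

```python
import json

text = """
{
  "firstName": "Tian",
  "lastName": "Song",
  "address": {
    "streetAddr": "中关村南大街5号",
    "city": "北京市",
    "zipcode": "100081"
  },
  "prof": ["Computer System", "Security"]
}
"""

person = json.loads(text)
address_type = type(person["address"]).__name__   # nested object
prof_type = type(person["prof"]).__name__         # array of values
print(address_type, prof_type, person["address"]["city"])
```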

For example

"key":"value"
"name":"北京师范大学"

Type: indicated by the double quotes "" (a quoted value is a string)

Key: "name"

Value: "北京师范大学"

Multiple values

"key":["value1","value2"]
"name":["北京师范大学","北京大学"]

Nested key-value pairs

"key":{"subkey":"subvalue"}
"name":{
    "newName":"北京师范大学",
    "oldName":"京师大学堂师范馆"
}

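YAML

YAML Ain't Markup Language

Untyped key-value pairs

Example

firstName: Tian
lastName: Song
address:
  streetAddr: 中关村南大街5号
  city: 北京市
  zipcode: 100081
prof:
  - Computer System
  - Security

For example

key: value
name: 北京师范大学

Untyped: values are written as plain text, without type markers

Key: name

Value: 北京师范大学

Indentation expresses ownership

key:
  subkey: subvalue
name:
  newName: 北京师范大学
  oldName: 京师大学堂师范馆

- expresses a list of parallel items

key: #comments
  - value1
  - value2
name:
  - 北京师范大学
  - 京师大学堂师范馆

| marks a block of text, # marks a comment

text: | #学校介绍
  学校的前身是1902年创立的京师大学堂师范馆，1908年改称京师优级师范学堂，独立设校，1912年改名为北京高等师范学校。1923年学校更名为北京师范大学。1931年、1952年北平女子师范大学、辅仁大学先后并入北京师范大学。1959年，被中央确定为首批全国重点大学。2017年，学校进入国家“世界一流大学”建设A类名单。

Markup	Pros and cons	Typical use
XML	The earliest general-purpose markup language; highly extensible but verbose	Information exchange and transfer on the Internet
JSON	Typed information, well suited to program processing (JavaScript); more concise than XML but has no comments	Communication between mobile apps, cloud services, and nodes
YAML	Untyped information, the highest proportion of plain text, very readable	Configuration files for all kinds of systems; supports comments and is easy to read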

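General methods of information extraction

Information extraction: pulling the content of interest out of marked-up information

Method 1

Fully parse the markup, then extract the key information

Requires a markup parser, for example tag-tree traversal in the bs4 library

Advantage: the information is parsed accurately

Disadvantage: extraction is tedious and slow

Method 2

Ignore the markup and search for the key information directly

Advantage: extraction is simple and fairly fast

Disadvantage: the accuracy of the result depends on the content being searched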
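
Method 2 can be sketched with a plain text search. The example below pulls every href value out of a fragment of the demo page using only the re module; the regular expression is an assumption chosen for this fragment and would also pick up href-like text in places that are not links, which is exactly the accuracy risk noted above.

```python
import re

# A fragment of the demo page used throughout this article.
demo = ('<p class="course">... '
        '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
        '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>')

# Search the raw text directly, without building a tag tree.
links = re.findall(r'href="(.*?)"', demo)
print(links)
```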

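Combined method

Combine markup parsing with searching to extract the key information

Requires a markup parser and text search functions

Example:

Extract every URL link in an HTML document

Approach:

Find all <a> tags
Parse each <a> tag and extract the link that follows href

>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,"html.parser")
>>> for link in soup.find_all('a'):
	print(link.get('href'))

http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001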

HTML content search with the bs4 library

<>.find_all(name,attrs,recursive,string,**kwargs)

Returns a list holding the search results

name: search string for tag names

Examples

>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all(['a','b'])
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> for tag in soup.find_all(True):
	print(tag.name)

html
head
title
body
p
b
p
a
a
>>> import re
>>> for tag in soup.find_all(re.compile('b')):
	print(tag.name)

body
b

attrs: search string for tag attribute values; a specific attribute can be named

>>> soup.find_all('p','course')
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>> soup.find_all(id='link1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> soup.find_all(id='link')
[]
>>> soup.find_all(id=re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

recursive: whether to search all descendants; defaults to True

>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a',recursive=False)
[]

string: search string for the text between <> and </>

>>> soup
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>

>>> soup.find_all(string="Basic Python")
['Basic Python']

>>> soup.find_all(string=re.compile("python"))
['This is a python demo page', 'The demo python introduces several python courses.']

Shorthand

<tag>(..) is equivalent to <tag>.find_all(..)
soup(..) is equivalent to soup.find_all(..)

Method	Description
<>.find()	Searches and returns only one result
<>.find_parents()	Searches the ancestor nodes; returns a list
<>.find_parent()	Returns one result from the ancestor nodes
<>.find_next_siblings()	Searches the following sibling nodes; returns a list
<>.find_next_sibling()	Returns one result from the following sibling nodes
<>.find_previous_siblings()	Searches the preceding sibling nodes; returns a list
<>.find_previous_sibling()	Returns one result from the preceding sibling nodes

Example: a world university ranking crawler

import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    try:
        r=requests.get(url,timeout=30)
        r.raise_for_status()
        r.encoding=r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

def fillUnivList(ulist,html):
    soup=BeautifulSoup(html,"html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr,bs4.element.Tag):  # skip the whitespace strings between rows
            tds=tr('td')  # shorthand for tr.find_all('td')
            paiming=tr.find('span')  # the rank sits in a <span>
            if paiming:
                paiming=paiming.string.strip()
            else:
                paiming="?"
            print(paiming)
            ulist.append([paiming,tds[1].string.replace("\n","").strip(),tds[2].string.replace("\n","").strip()])

def printUnivList(ulist,num):
    tplt="{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("Rank","University","Country",chr(12288)))  # pad with the full-width (Chinese) space
    for i in range(1,num+1):
        u=ulist[i]
        print(tplt.format(str(u[0]),u[1],u[2],chr(12288)))

def main():
    uinfo=[]
    url="https://www.liuxue86.com/yuanxiaopaiming/"
    html=getHTMLText(url)
    fillUnivList(uinfo,html)
    printUnivList(uinfo,20)

main()
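
The chr(12288) argument in the format template deserves a closer look: it is the full-width space, passed in as the fill character so that a column of Chinese text is padded with characters of the same width. The sketch below isolates that trick; the column widths and the sample row are hypothetical.

```python
# {3} inside the format spec of field 1 is replaced by the fourth argument,
# so the middle column is padded with chr(12288) instead of the ASCII space.
tplt = "{0:^6}|{1:{3}^8}|{2:^6}"

header = tplt.format("排名", "学校名称", "国家", chr(12288))
row = tplt.format("1", "清华大学", "中国", chr(12288))  # hypothetical sample row

print(header)
print(row)
```

Only the column that names the fill argument is affected; the other columns still use the default ASCII space.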

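3. Crawling in Practice

Getting Started with the Re Library

Basic concepts of regular expressions

Regular expression: also written regex or RE

A regular expression is an expression that concisely describes a set of strings

A regular expression is a general framework for describing strings

A regular expression is a tool that captures the "concise" form and the "features" of a set of strings

A regular expression can be used to decide whether a string belongs to a set with given features

Regular expressions can be used to:

Express the features of a type of text (viruses, intrusions, and so on)
Find or replace a group of strings at once
Match all or part of a string

Compilation: converting a string that follows regular-expression syntax into a regular-expression object

# Given the following regular expression
regex='P(Y|YT|YTH|YTHO)?N'
# compile it
p=re.compile(regex)
# The resulting object p matches exactly 'PN', 'PYN', 'PYTN', 'PYTHN', 'PYTHON'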
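
The claim about which strings the compiled pattern describes can be checked directly. This sketch uses fullmatch so that the whole candidate string has to fit the pattern; the list of candidates is made up for the test.

```python
import re

p = re.compile(r'P(Y|YT|YTH|YTHO)?N')

candidates = ['PN', 'PYN', 'PYTN', 'PYTHN', 'PYTHON', 'PYTHONN', 'PTN']

# Keep only the strings that the pattern describes in full.
matched = [s for s in candidates if p.fullmatch(s)]
print(matched)
```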

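Regular expression syntax

Operator	Description	Example
.	Any single character (except a newline by default)
[]	Character set; gives the allowed values for a single character	[abc] means a, b, or c; [a-z] means a single character from a to z
[^]	Negated character set; gives the excluded values for a single character	[^abc] means a single character that is not a, b, or c
*	Zero or more repetitions of the preceding character	abc* means ab, abc, abcc, abccc, and so on
+	One or more repetitions of the preceding character	abc+ means abc, abcc, abccc, and so on
?	Zero or one occurrence of the preceding character	abc? means ab or abc
|	Either the left or the right expression	abc|def means abc or def
{m}	Exactly m repetitions of the preceding character	ab{2}c means abbc
{m,n}	Between m and n repetitions of the preceding character (inclusive)	ab{1,2}c means abc or abbc
^	Matches the start of the string	^abc means abc at the start of a string
$	Matches the end of the string	abc$ means abc at the end of a string
()	Grouping; | can be used inside to list alternatives	(abc) means abc; (abc|def) means abc or def
\d	A digit; equivalent to [0-9] for ASCII text
\w	A word character; equivalent to [A-Za-z0-9_] for ASCII text

Examples:

Regular expression	Matching strings
P(Y|YT|YTH|YTHO)?N	'PN' 'PYN' 'PYTN' 'PYTHN' 'PYTHON'
PYTHON+	'PYTHON' 'PYTHONN' 'PYTHONNN' …
PY[TH]ON	'PYTON' 'PYHON'
PY[^TH]?ON	'PYON' 'PYaON' 'PYbON' 'PYcON' …
PY{0,3}N	'PN' 'PYN' 'PYYN' 'PYYYN'

Classic regular expressions

Regular expression	Strings it describes
^[A-Za-z]+$	Strings made up of the 26 letters
^[A-Za-z0-9]+$	Strings made up of the 26 letters and digits
^-?\d+$	Integers
^[0-9]*[1-9][0-9]*$	Positive integers
[1-9]\d{5}	Postal codes in mainland China, 6 digits
[\u4e00-\u9fa5]	Chinese characters
\d{3}-\d{8}|\d{4}-\d{7}	Domestic phone numbers, e.g. 010-68913536

Matching IP addresses

Rough match

\d+\.\d+\.\d+\.\d+

\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

Exact match

Number	Regular expression
0-99	[1-9]?\d
100-199	1\d{2}
200-249	2[0-4]\d
250-255	25[0-5]

(([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])\.){3}([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])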
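
The exact-match pattern for IP addresses can be assembled from the four ranges in the table and tested. In this sketch the dots are escaped as \. so they match a literal dot, and `is_ipv4` is a hypothetical helper name introduced for illustration.

```python
import re

# One octet, 0-255, built from the four ranges in the table.
octet = r'(?:[1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])'

# Three octets each followed by a literal dot, then a final octet.
ip_pattern = re.compile(r'(?:' + octet + r'\.){3}' + octet)

def is_ipv4(text):
    """Return True if the whole string is a dotted-quad IPv4 address."""
    return ip_pattern.fullmatch(text) is not None

print(is_ipv4('172.22.187.161'))
print(is_ipv4('256.1.1.1'))
```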

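Basic use of the Re library

Ways to write a regular expression

raw string type

r'text'
r'[1-9]\d{5}'
r'\d{3}-\d{8}|\d{4}-\d{7}'

A raw string is a string in which backslashes are not treated as escape characters, so they do not need to be escaped again

string type (more cumbersome)

'[1-9]\\d{5}'
'\\d{3}-\\d{8}|\\d{4}-\\d{7}'

Main functions of the Re library

Function	Description
re.search()	Searches a string for the first position that matches the regular expression; returns a match object
re.match()	Matches the regular expression starting at the beginning of a string; returns a match object
re.findall()	Searches the string and returns all matching substrings as a list
re.split()	Splits a string at the matches of the regular expression; returns a list
re.finditer()	Searches the string and returns an iterator of match results; each element is a match object
re.sub()	Replaces every substring that matches the regular expression; returns the resulting string

re.search()

re.search(pattern,string,flags=0)

Searches a string for the first position that matches the regular expression; returns a match object, or None if nothing matches

pattern: the regular expression, as a string or raw string

string: the string to be matched

flags: control flags applied when the regular expression is used

Common flag	Description
re.I re.IGNORECASE	Ignores case, so [A-Z] also matches lowercase characters
re.M re.MULTILINE	Lets the ^ operator match at the start of every line of the given string
re.S re.DOTALL	Lets the . operator match every character; by default it matches every character except a newline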
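
The three flags can be compared side by side on one short string. This is a minimal sketch; the sample text is made up for the demonstration.

```python
import re

text = "Python\npython 3"

# re.I: the lowercase pattern also matches the capitalized word.
ignore_case = re.findall(r'python', text, flags=re.I)

# re.M: ^ matches at the start of each line, not only at the start of the string.
line_starts = re.findall(r'^p\w+', text, flags=re.I | re.M)

# re.S: without it '.' cannot cross the newline between the two words.
without_dotall = re.search(r'Python.python', text)
with_dotall = re.search(r'Python.python', text, flags=re.S)

print(ignore_case, line_starts, without_dotall, with_dotall is not None)
```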

Demonstration

>>> import re
>>> match=re.search(r'[1-9]\d{5}','BIT 100081')
>>> if match:
	print(match.group(0))

100081

re.match()

re.match(pattern,string,flags=0)

Matches the regular expression starting at the beginning of a string; returns a match object, or None if the start of the string does not match

Demonstration

>>> import re
>>> match=re.match(r'[1-9]\d{5}','BIT 100081')
>>> match.group(0)
Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    match.group(0)
AttributeError: 'NoneType' object has no attribute 'group'
>>> match=re.match(r'[1-9]\d{5}','100081 BIT')
>>> if match:
	match.group(0)

'100081'

re.findall()

re.findall(pattern,string,flags=0)

Searches the string and returns all matching substrings as a list

Demonstration

>>> import re
>>> ls=re.findall(r'[1-9]\d{5}','BIT100081 TSU100084')
>>> ls
['100081', '100084']

re.split()

re.split(pattern,string,maxsplit=0,flags=0)

Splits a string at the matches of the regular expression; returns a list

maxsplit: maximum number of splits; the remainder is returned as the last element

Demonstration

>>> import re
>>> re.split(r'[1-9]\d{5}','BIT100081 TSU100084')
['BIT', ' TSU', '']
>>> re.split(r'[1-9]\d{5}','BIT100081 TSU100084',maxsplit=1)
['BIT', ' TSU100084']

re.finditer()

re.finditer(pattern,string,flags=0)

Searches the string and returns an iterator of match results; each element is a match object

Demonstration

>>> import re
>>> for m in re.finditer(r'[1-9]\d{5}','BIT100081 TSU100084'):
	if m:
		print(m.group(0))

100081
100084

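re.sub()

re.sub(pattern,repl,string,count=0,flags=0)

repl: the string that replaces each match

count: maximum number of replacements

Demonstration

>>> import re
>>> re.sub(r'[1-9]\d{5}','zipcode','BIT100081 TSU100084')
'BITzipcode TSUzipcode'

Object-oriented use of the Re library

# Functional style
rst=re.search(r'[1-9]\d{5}','BIT 100081')
# Object-oriented style (compile the regular-expression string into a regular-expression object)
pat=re.compile(r'[1-9]\d{5}')
rst=pat.search('BIT 100081')

# The function used
regex=re.compile(pattern,flags=0)

Methods of a regular-expression object

Method	Description
regex.search()	Searches a string for the first position that matches the regular expression; returns a match object
regex.match()	Matches the regular expression starting at the beginning of a string; returns a match object
regex.findall()	Searches the string and returns all matching substrings as a list
regex.split()	Splits a string at the matches of the regular expression; returns a list
regex.finditer()	Searches the string and returns an iterator of match results; each element is a match object
regex.sub()	Replaces every substring that matches the regular expression; returns the resulting string

The match object of the Re library

Introduction to the match object

A match object is the result of one match and carries a lot of information about it; its type is <class 're.Match'> (<class '_sre.SRE_Match'> in Python 3.6 and earlier)

Attributes of the match object

Attribute	Description
.string	The text that was searched
.re	The pattern object used for the match (that is, the regular expression)
.pos	Position in the text where the search started
.endpos	Position in the text where the search ended

Methods of the match object

Method	Description
.group(0)	Returns the matched string
.start()	Start position of the match in the original string
.end()	End position of the match in the original string
.span()	Returns (.start(),.end())

Demonstration

>>> import re
>>> m=re.search(r'[1-9]\d{5}','BIT100081 TSU100084')
>>> m.string
'BIT100081 TSU100084'
>>> m.re
re.compile('[1-9]\\d{5}')
>>> m.pos
0
>>> m.endpos
19
>>> m.group(0)
'100081'
>>> m.start()
3
>>> m.end()
9
>>> m.span
<built-in method span of re.Match object at 0x00000182760B7420>
>>> m.span()
(3, 9)

Greedy and minimal matching in the Re library

# The Re library uses greedy matching by default and returns the longest matching substring
>>> match=re.search(r'PY.*N','PYANBNCNDN')
>>> match.group(0)
'PYANBNCNDN'
# Adding a question mark gives the minimal match
>>> match=re.search(r'PY.*?N','PYANBNCNDN')
>>> match.group(0)
'PYAN'

Minimal-match operators

Operator	Description
*?	Zero or more repetitions of the preceding character, minimal match
+?	One or more repetitions of the preceding character, minimal match
??	Zero or one occurrence of the preceding character, minimal match
{m,n}?	Between m and n repetitions of the preceding character, minimal match

Any operator whose match can vary in length can be turned into a minimal match by adding a question mark after it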
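
The same contrast holds for the other quantifiers. The sketch below compares greedy and minimal forms of + and {m,n} on two made-up strings.

```python
import re

html = '<b>bold</b><i>italic</i>'

# Greedy: .+ runs to the last '>' in the string.
greedy = re.search(r'<.+>', html).group(0)
# Minimal: .+? stops at the first '>' it can.
minimal = re.search(r'<.+?>', html).group(0)

# {2,4} takes as many digits as allowed; {2,4}? takes as few as allowed.
digits_greedy = re.search(r'\d{2,4}', '123456').group(0)
digits_minimal = re.search(r'\d{2,4}?', '123456').group(0)

print(greedy, minimal, digits_greedy, digits_minimal)
```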

Example: a targeted crawler for comparing Taobao product prices

#CrowTaobaoPrice.py
import requests
import re

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""
    
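def parsePage(ilt, html):  # note how the regular expressions are used here
    try:
        plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"',html)
        tlt = re.findall(r'\"raw_title\"\:\".*?\"',html)
        for i in range(len(plt)):
            # Drop everything up to the first ':' and strip the surrounding quotes.
            # This replaces eval(), which should not be called on text taken from a web page.
            price = plt[i].split(':',1)[1].strip('"')
            title = tlt[i].split(':',1)[1].strip('"')
            ilt.append([price , title])
    except IndexError:
        print("")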

def printGoodsList(ilt):
    tplt = "{:4}\t{:8}\t{:16}"  # a commonly used template for formatted output
    print(tplt.format("No.", "Price", "Product name"))
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))
        
def main():
    goods = '书包'
    depth = 3
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44*i)  # offset of 44 per page, found by comparing the URLs of successive result pages
            html = getHTMLText(url)
            parsePage(infoList, html)
        except:
            continue
    printGoodsList(infoList)
    
main()
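
The parsing step can be tried offline without contacting Taobao. In the sketch below, `sample` is a hypothetical fragment written in the shape the two patterns expect; it is not real page data. Capture groups pull out the values directly, so no quote stripping is needed afterwards.

```python
import re

# Hypothetical fragment shaped like the fields the crawler looks for.
sample = ('{"raw_title":"双肩书包 学生","view_price":"59.00"},'
          '{"raw_title":"旅行背包","view_price":"128.50"}')

# The parentheses capture only the value between the quotes.
prices = re.findall(r'"view_price":"([\d.]*)"', sample)
titles = re.findall(r'"raw_title":"(.*?)"', sample)

# Pair each price with its title and convert the price to a number.
goods = [[float(p), t] for p, t in zip(prices, titles)]
print(goods)
```

The minimal match .*? matters here: a greedy .* would run from the first title to the last quote in the string.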

Example: a targeted crawler for stock data

#CrawBaiduStocksA.py
import requests
from bs4 import BeautifulSoup
import traceback
import re

def getHTMLText(url):
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def getStockList(lst, stockURL):
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, 'html.parser') 
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except:
            continue

def getStockInfo(lst, stockURL, fpath):
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html=="":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div',attrs={'class':'stock-bets'})

            name = stockInfo.find_all(attrs={'class':'bets-name'})[0]
            infoDict.update({'股票名称': name.text.split()[0]})  # key means "stock name"
            
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
            
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write( str(infoDict) + '\n' )
        except:
            traceback.print_exc()
            continue

def main():
    stock_list_url = 'https://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist=[]
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

main()

Optimization: set a fixed encoding for the focused crawler and add a progress indicator

#CrawBaiduStocksB.py
import requests
from bs4 import BeautifulSoup
import traceback
import re

def getHTMLText(url, code="utf-8"):
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = code
        return r.text
    except:
        return ""

def getStockList(lst, stockURL):
    html = getHTMLText(stockURL, "GB2312")
    soup = BeautifulSoup(html, 'html.parser') 
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except:
            continue

def getStockInfo(lst, stockURL, fpath):
    count = 0
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html=="":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div',attrs={'class':'stock-bets'})

            name = stockInfo.find_all(attrs={'class':'bets-name'})[0]
            infoDict.update({'股票名称': name.text.split()[0]})  # key means "stock name"
            
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
            
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write( str(infoDict) + '\n' )
                count = count + 1
                print("\rProgress: {:.2f}%".format(count*100/len(lst)),end="")
        except:
            count = count + 1
            print("\rProgress: {:.2f}%".format(count*100/len(lst)),end="")
            continue

def main():
    stock_list_url = 'https://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist=[]
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

main()
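
The progress indicator relies on the carriage return: printing a line that starts with `\r` and ends without a newline overwrites the previous message in place. A minimal sketch of the pattern, separated from the crawler, might look as follows; `progress_text` and `process_all` are hypothetical helper names.

```python
def progress_text(done, total):
    """Format a one-line progress message that starts with a carriage return."""
    return "\rProgress: {:.2f}%".format(done * 100 / total)

def process_all(tasks):
    """Run every task and count failures, updating the progress line each time."""
    failures = 0
    for done, task in enumerate(tasks, start=1):
        try:
            task()
        except Exception:
            failures += 1
        finally:
            # Runs for successes and failures alike, so the line reaches 100%.
            print(progress_text(done, len(tasks)), end="")
    print()  # finish the progress line with a newline
    return failures

failed = process_all([lambda: None, lambda: 1 / 0, lambda: None])
print("failed tasks:", failed)
```

Placing the update in `finally` avoids repeating the same print call in both the success and the error branch.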

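4. Crawler Frameworks

The Scrapy crawler framework

Framework

Scrapy is not a function library; it is a crawler framework.

A crawler framework is a software structure plus a collection of functional components that together implement crawling.

A crawler framework is a half-finished product that helps users build professional web crawlers.

Data flow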


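The Engine obtains crawl requests (Request) from the Spider
The Engine forwards the requests to the Scheduler for scheduling
The Engine obtains the next request to crawl from the Scheduler
The Engine sends the request to the Downloader through the middleware
After fetching the page, the Downloader builds a Response and sends it to the Engine through the middleware
The Engine sends the received response to the Spider through the middleware for processing
After processing the response, the Spider produces scraped Items and new Requests and hands them to the Engine
The Engine sends the scraped Items to the Item Pipeline, the framework's exit
The Engine sends the new requests to the Scheduler

Entry and exit points of the data flow

The Engine controls the data flow among all modules and keeps pulling requests from the Scheduler until none remain
Framework entry: the Spider's initial crawl requests
Framework exit: the Item Pipeline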
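
The loop described above can be imitated in a few lines of plain Python. This is a toy model only, not Scrapy's implementation: `run_engine`, the in-memory `SITE` dictionary, and the fake download and parse functions are all invented for illustration, and middleware is left out.

```python
from collections import deque

def run_engine(start_requests, download, parse, pipeline):
    """Toy engine loop; names and signatures are illustrative only."""
    scheduler = deque(start_requests)      # initial requests enter the Scheduler
    while scheduler:                       # continue until no requests remain
        request = scheduler.popleft()      # next request comes from the Scheduler
        response = download(request)       # Downloader turns it into a response
        for result in parse(response):     # Spider yields items and new requests
            if isinstance(result, dict):
                pipeline(result)           # items leave through the Item Pipeline
            else:
                scheduler.append(result)   # new requests return to the Scheduler

# Hypothetical in-memory "site": each page lists the pages it links to.
SITE = {"/": ["/a", "/b"], "/a": [], "/b": ["/c"], "/c": []}

def fake_download(url):
    return {"url": url, "links": SITE[url]}

def fake_parse(response):
    yield {"page": response["url"]}        # a scraped item
    for link in response["links"]:
        yield link                         # a follow-up request

collected = []
run_engine(["/"], fake_download, fake_parse, collected.append)
print(collected)
```

The only entry point is the list of start requests and the only exit is the pipeline callable, which mirrors the entry and exit points listed above.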

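The modules in detail

Engine (no user changes needed)

Controls the data flow among all modules

Triggers events when conditions are met
Downloader (no user changes needed)

Downloads web pages according to requests
Scheduler (no user changes needed)

Schedules and manages all crawl requests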
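Downloader Middleware

Purpose: user-configurable control over the traffic between the Engine, Scheduler, and Downloader

Function: modify, drop, or add requests or responses

Users may write configuration code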
Spider

Parses the Response returned by the Downloader

Produces scraped items

Produces additional crawl requests (Request)

Requires user-written configuration code
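Item Pipeline

Processes the items produced by the Spider in pipeline fashion

Consists of a sequence of operations, like an assembly line; each operation is an Item Pipeline type

Possible operations: cleaning, validating, and de-duplicating the HTML data in scraped items, and storing the data in a database

Requires user-written configuration code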

Spider Middleware

Purpose: reprocess requests and scraped items

Function: modify, drop, or add requests or scraped items

Users may write configuration code
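
As one illustration of the user-configurable modules, a downloader middleware in Scrapy is an ordinary class with a `process_request(self, request, spider)` method. The sketch below is hypothetical: the class name and the User-Agent value are assumptions, and a stand-in object replaces the real request so that the method can be tried outside Scrapy.

```python
from types import SimpleNamespace

class DefaultUserAgentMiddleware:
    """Hypothetical downloader middleware that fills in a User-Agent header."""

    def __init__(self, user_agent="example-crawler/0.1"):
        self.user_agent = user_agent  # assumed value, chosen for illustration

    def process_request(self, request, spider):
        # Modify the outgoing request in place. Returning None tells Scrapy
        # to keep handling the request normally.
        request.headers.setdefault("User-Agent", self.user_agent)
        return None

# Stand-in for a request object, just enough to exercise the method.
fake_request = SimpleNamespace(headers={})
result = DefaultUserAgentMiddleware().process_request(fake_request, spider=None)
print(fake_request.headers)
```

In a real project such a class would be enabled through the `DOWNLOADER_MIDDLEWARES` setting in settings.py.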

Scrapy command line

Format

>scrapy <command> [options] [args]

Common commands

Command	Description	Format
startproject	Create a new project	scrapy startproject <name> [dir]
genspider	Create a spider	scrapy genspider [options] <name> <domain>
settings	Show the crawler's settings	scrapy settings [options]
crawl	Run a spider	scrapy crawl <spider>
list	List all spiders in the project	scrapy list
shell	Start an interactive shell for debugging a URL	scrapy shell [url]

Basic use of Scrapy

A first Scrapy example

Step 1

Create a Scrapy project

D:\py_pro>scrapy startproject python123demo

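Directory structure overview

python123demo/                 outer project directory
    scrapy.cfg                 deployment configuration file
    python123demo/             the project's Python package
        __init__.py            package initialization file
        items.py               Item definitions
        middlewares.py         middleware definitions
        pipelines.py           pipeline definitions
        settings.py            project settings
        spiders/               directory holding the spider code
            __init__.py        package initialization file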

Step 2

Generate a Scrapy spider inside the project

D:\py_pro\python123demo>scrapy genspider demo python123.io

import scrapy

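class DemoSpider(scrapy.Spider):#the class is named DemoSpider because the spider is named demo
    name = 'demo'#spider name
    allowed_domains = ['python123.io']#only links under this domain may be crawled
    start_urls = ['http://python123.io/']#initial pages for Scrapy to crawl

    def parse(self, response):#handles the response: parses the content into dicts and discovers new URLs to request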
        pass

Step 3

Configure the generated spider

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    #allowed_domains = ['python123.io']
    start_urls = ['http://python123.io/ws/demo.html']

    def parse(self, response):
        fname = response.url.split('/')[-1]
        with open(fname, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s.' % fname)

Step 4

Run the spider to fetch the page

D:\py_pro\python123demo>scrapy crawl demo

Two equivalent versions of demo.py

Version 1

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['http://python123.io/ws/demo.html']

Version 2

class DemoSpider(scrapy.Spider):
    name = 'demo'

    def start_requests(self):
        urls = [
            'http://python123.io/ws/demo.html'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

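The yield keyword

A function that contains a yield statement is a generator

A generator produces one value at a time; the function is then frozen, and it produces the next value when it is resumed

A generator is a function that keeps producing values

Example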

>>> def gen(n):
...     for i in range(n):
...         yield i**2
...
>>> for i in gen(5):
...     print(i," ",end="")
...
0  1  4  9  16
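
The freezing and resuming behaviour becomes visible when the generator is advanced by hand with `next()`. The sketch below is a variation on the `gen` function above, with a print call added purely to show when the body actually runs.

```python
def gen(n):
    for i in range(n):
        print("computing", i)   # runs only when a value is requested
        yield i ** 2

squares = gen(3)          # nothing is computed yet
first = next(squares)     # runs the body up to the first yield
second = next(squares)    # resumes after the yield, then pauses again
rest = list(squares)      # drains whatever is left
print(first, second, rest)
```

Because values are produced on demand, a generator never holds the whole sequence in memory, which matters when a spider yields a large number of requests.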

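Basic use of a Scrapy spider

Basic steps

Create the project and the spider template
Write the spider
Write the Item Pipeline
Tune the configuration

Scrapy data types

Request class
Response class
Item class

The Request class

class scrapy.http.Request()

A Request object represents an HTTP request; it is generated by the Spider and executed by the Downloader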

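Attribute or method	Description
.url	The URL the Request targets
.method	The request method: 'GET', 'POST', etc.
.headers	Request headers, in a dictionary-like form
.body	The request body, as bytes
.meta	Extra information added by the user, used to pass data between Scrapy's internal modules
.copy()	Copies the request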

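The Response class

class scrapy.http.Response()

A Response object represents an HTTP response; it is generated by the Downloader and processed by the Spider

Attribute or method	Description
.url	The URL of the Response
.status	The HTTP status code, 200 by default
.headers	The headers of the Response
.body	The content of the Response, as bytes
.flags	A set of flags
.request	The Request object that produced this Response
.copy()	Copies the response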

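The Item class

class scrapy.item.Item()

An Item object represents information extracted from an HTML page; it is generated by the Spider and processed by the Item Pipeline

An Item behaves like a dictionary and can be manipulated like one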

Ways to extract information in a Scrapy spider

Beautiful Soup
lxml
re
XPath Selector
CSS Selector

<HTML>.css('a::attr(href)').extract()

Example: a Scrapy crawler for stock data

Step 1

Create the project and the Spider template

>scrapy startproject BaiduStocks
>cd BaiduStocks
>scrapy genspider stocks baidu.com

Step 2

Write the Spider

Configure the stocks.py file

Change how returned pages are handled

Change how requests for newly discovered URLs are issued

# -*- coding: utf-8 -*-
import scrapy
import re


class StocksSpider(scrapy.Spider):
    name = "stocks"
    start_urls = ['https://quote.eastmoney.com/stocklist.html']

    def parse(self, response):
        for href in response.css('a::attr(href)').extract():
            try:
                stock = re.findall(r"[s][hz]\d{6}", href)[0]
                url = 'https://gupiao.baidu.com/stock/' + stock + '.html'
                yield scrapy.Request(url, callback=self.parse_stock)
            except:
                continue

    def parse_stock(self, response):
        infoDict = {}
        stockInfo = response.css('.stock-bets')
        name = stockInfo.css('.bets-name').extract()[0]
        keyList = stockInfo.css('dt').extract()
        valueList = stockInfo.css('dd').extract()
        for i in range(len(keyList)):
            key = re.findall(r'>.*</dt>', keyList[i])[0][1:-5]
            try:
                val = re.findall(r'\d+\.?.*</dd>', valueList[i])[0][0:-5]
            except:
                val = '--'
            infoDict[key]=val

        infoDict.update(
            {'股票名称': re.findall(r'\s.*\(',name)[0].split()[0] + \
             re.findall(r'\>.*\<', name)[0][1:-1]})  # key means "stock name"
        yield infoDict
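
The first stage of the spider, turning hrefs into detail-page URLs, can be tried in isolation. In this sketch the sample hrefs are made up and `stock_urls` is a hypothetical helper; it checks the match object instead of catching the IndexError, which is a swapped-in alternative to the try/except used in `parse`.

```python
import re

STOCK_CODE = re.compile(r"s[hz]\d{6}")   # sh or sz followed by six digits

def stock_urls(hrefs, base="https://gupiao.baidu.com/stock/"):
    """Yield a detail-page URL for every href that contains a stock code."""
    for href in hrefs:
        match = STOCK_CODE.search(href)
        if match:  # links without a stock code are skipped
            yield base + match.group(0) + ".html"

# Hypothetical hrefs of the kind found on a stock list page.
sample = [
    "http://quote.eastmoney.com/sh600000.html",
    "/about.html",
    "http://quote.eastmoney.com/sz000001.html",
]
urls = list(stock_urls(sample))
print(urls)
```

Inside a spider, each of these URLs would be wrapped in a `scrapy.Request` with `callback=self.parse_stock`, as in the code above.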

Step 3

Write the Item Pipelines

Configure the pipelines.py file

Define a class that processes the scraped items

Configure the ITEM_PIPELINES option

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class BaidustocksPipeline(object):
    def process_item(self, item, spider):
        return item

class BaidustocksInfoPipeline(object):
    def open_spider(self, spider):
        self.f = open('BaiduStockInfo.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        try:
            line = str(dict(item)) + '\n'
            self.f.write(line)
        except:
            pass
        return item

settings.py

# Configure item pipelines
# See https://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'BaiduStocks.pipelines.BaidustocksInfoPipeline': 300,
}
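
A pipeline is a plain class with `open_spider`, `close_spider`, and `process_item` methods, so variations are easy to write. The sketch below is one possible alternative, not part of the original project: it writes each item as a JSON line instead of `str(dict(item))`, and the class name and default file name are assumptions.

```python
import json

class JsonLinesPipeline:
    """Hypothetical pipeline: one JSON object per line, UTF-8 encoded."""

    def __init__(self, path="BaiduStockInfo.jsonl"):
        self.path = path  # assumed output file name

    def open_spider(self, spider):
        self.f = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese field names readable in the file.
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # pass the item on to any later pipeline
```

To use such a class, its dotted path would be listed in `ITEM_PIPELINES` in the same way as `BaidustocksInfoPipeline` above. JSON lines can be read back with `json.loads`, which is harder to do reliably with the output of `str()`.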

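Configuring concurrency options

Option	Description
CONCURRENT_REQUESTS	Maximum number of concurrent requests the Downloader performs, default 16
CONCURRENT_ITEMS	Maximum number of items processed concurrently in the Item Pipeline, default 100
CONCURRENT_REQUESTS_PER_DOMAIN	Maximum number of concurrent requests per target domain, default 8
CONCURRENT_REQUESTS_PER_IP	Maximum number of concurrent requests per target IP, default 0; takes effect only when non-zero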
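
These options are ordinary assignments in settings.py. The fragment below is only a sketch: the numbers are illustrative assumptions rather than recommendations, and `DOWNLOAD_DELAY` is an additional Scrapy setting not listed in the table.

```python
# settings.py (fragment); all values here are illustrative assumptions
CONCURRENT_REQUESTS = 16             # overall cap on simultaneous downloads
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # be gentler with any single site
CONCURRENT_ITEMS = 100               # items processed in parallel per response
DOWNLOAD_DELAY = 0.5                 # seconds to wait between requests to the same site
```

Keeping the per-domain limit at or below the overall limit is the sensible arrangement, since a larger per-domain value could never be reached.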

