Notes on "Python Web Crawling and Information Extraction" (《Python网络爬虫与信息提取》), Part 1

Article link: https://blog.csdn.net/qq_58647543/article/details/128951439

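I. Crawler Rules: the Requests Library

1. The request() method

2. Other methods

3. A general code framework for fetching web pages

4. Examples

5. Crawler etiquette ("even thieves have a code")

II. Crawler Extraction

1. The Beautiful Soup library

2. Information organization and extraction methods

3. Example: a targeted crawler for Chinese university rankings

4. Introduction to regular expressions

5. Example: a targeted price-comparison crawler for Dangdang

Example: a targeted crawler for stock data

III. The Scrapy Crawler Framework

Introduction to the Scrapy framework

Example: using Scrapy to get the names and trading information of all stocks on the Shanghai and Shenzhen stock exchanges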

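I. Crawler Rules: the Requests Library

Requests is a popular third-party library for fetching web pages, with a simple and concise API. More information: http://www.python-requests.org

Installation: Anaconda already includes the library. To install it yourself, run: pip install requests

The 7 main functions of requests:

| Function | Description | HTTP method |
|---|---|---|
| requests.request() | Constructs a request; the base function that all the functions below rely on | |
| requests.get() | The main function for fetching an HTML page | GET |
| requests.head() | Fetches the header information of an HTML page | HEAD |
| requests.post() | Submits a POST request to an HTML page | POST |
| requests.put() | Submits a PUT request to an HTML page | PUT |
| requests.patch() | Submits a partial-modification request to an HTML page | PATCH |
| requests.delete() | Submits a delete request to an HTML page | DELETE |

HTTP operations on resources:

| Method | Description |
|---|---|
| GET | Request the resource at the URL |
| HEAD | Request the response headers for the resource at the URL, i.e. only its header information |
| POST | Append new data to the resource at the URL |
| PUT | Store a resource at the URL, overwriting the resource already there |
| PATCH | Partially update the resource at the URL, changing only part of its content |
| DELETE | Delete the resource stored at the URL |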

1. The request() method

def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.

    :param method: request method, one of GET/HEAD/POST/PUT/PATCH/DELETE/OPTIONS. OPTIONS retrieves server parameters and is rarely used.
    :param url: the URL to access.
    :**kwargs: parameters that control the request; all are optional.
    :param params: dict or bytes, added to the URL as query parameters. The server returns resources according to these key-value pairs.
        kv = {'key1':'value1', 'key2':'value2'}
        r = requests.request('GET', "http://www.python123.io/ws", params=kv)
        print(r.url) #https://www.python123.io/ws?key1=value1&key2=value2
    :param data: dict, bytes, or file object, sent as the body of the request.
        r = requests.request('POST', "http://www.python123.io/ws", data=kv)
        body = '主体内容'  # "body content"
        r = requests.request('POST', "http://www.python123.io/ws", data=body)
    :param json: data in JSON format, sent to the server as the body of the request.
        r = requests.request('POST', "http://www.python123.io/ws", json=kv)
    :param headers: dict of custom HTTP headers, e.g. to imitate a browser.
        hd = {'user-agent' : 'Chrome/10'}
        r = requests.request('POST', "http://www.python123.io/ws", headers = hd)
    :param cookies: dict or CookieJar, the cookies sent with the request.
    :param files: dict, used to upload files.
        fs = {'file' : open('data.xls', 'rb')}
        r = requests.request('POST', "http://www.python123.io/ws", files=fs)
    :param auth: tuple, enables HTTP authentication.
    :param timeout: timeout in seconds; a Timeout exception is raised when it expires.
    :param allow_redirects: bool, redirect switch, True by default.
    :type allow_redirects: bool
    :param proxies: dict, sets proxy servers for the request; login credentials can be included.
        proxy = {'http': 'http://127.0.0.1:1080',
                 'https': 'https://127.0.0.1:1080'}
        r = requests.request('POST', "http://www.python123.io/ws", proxies = proxy)
    :param verify: bool, True by default, switch for SSL certificate verification.
    :param stream: bool, False by default; when False the response content is downloaded immediately.
    :param cert: path to a local SSL certificate.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

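The function returns a Response object with the following attributes:

| Attribute | Description |
|---|---|
| r.status_code | HTTP status of the response; 200 means success, 404 means the resource was not found |
| r.text | The response content as a string, i.e. the page content at the URL |
| r.encoding | The content encoding guessed from the HTTP header. If the header has no charset, ISO-8859-1 is assumed, which cannot decode Chinese |
| r.apparent_encoding | The content encoding inferred from the content itself (the fallback encoding). It is usually more accurate; use it when encoding does not decode the page correctly |
| r.content | The response content in binary form |

Workflow: get the Response object -> check the status code -> read the content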

2. Other methods

get() and the other functions are thin wrappers around requests.request() and can always be replaced by it.

def get(url, params=None, **kwargs):
    """Sends a GET request.

    url: URL of the page to fetch.
    params: extra URL parameters, dict or bytes, optional.
    **kwargs: the 12 parameters that control the request.
    """
    return request('get', url, params=params, **kwargs)

def head(url, **kwargs):
    r"""Sends a HEAD request.

    url: URL of the page to fetch.
    **kwargs: the 13 parameters that control the request.
    """
    return request('head', url, **kwargs)

def post(url, data=None, json=None, **kwargs):
    r"""Sends a POST request.

    url/data/json, **kwargs: the 11 parameters that control the request.
    """
    return request('post', url, data=data, json=json, **kwargs)

def put(url, data=None, **kwargs):
    r"""Sends a PUT request.

    url/data, **kwargs: the 12 parameters that control the request.
    """
    return request('put', url, data=data, **kwargs)

def patch(url, data=None, **kwargs):
    r"""Sends a PATCH request.

    url/data, **kwargs: the 12 parameters that control the request.
    """
    return request('patch', url, data=data, **kwargs)

def delete(url, **kwargs):
    r"""Sends a DELETE request.

    url, **kwargs: the 13 parameters that control the request.
    """
    return request('delete', url, **kwargs)

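3. A general code framework for fetching web pages

Network connections are unreliable, so exception handling matters.

| Exception | Description |
|---|---|
| requests.ConnectionError | Network connection error, e.g. DNS lookup failure or connection refused |
| requests.HTTPError | HTTP error |
| requests.URLRequired | The URL is missing |
| requests.TooManyRedirects | The maximum number of redirects was exceeded |
| requests.ConnectTimeout | Timed out while connecting to the remote server |
| requests.Timeout | The request to the URL timed out |

General framework:

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:  # base class of the exceptions listed above
        return "An exception occurred"

if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))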

4. Examples

Example 1: fetching a JD.com product page

import requests

url = "https://item.jd.com/6685410.html"
try:
    r = requests.get(url, timeout=30)
    r.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
    r.encoding = r.apparent_encoding
    print(r.text[0:1000])
except requests.RequestException:
    print("Fetch failed")

Example 2: fetching an Amazon product page. The headers parameter lets the code imitate a browser when it submits the HTTP request.

import requests

url = "https://www.amazon.cn/dp/B07DBZZPQL/ref=cngwdyfloorv2_recs_0?pf_rd_p=4940946c-0b2b-498c-9e03-31cf7dae70ec&pf_rd_s=desktop-2&pf_rd_t=36701&pf_rd_i=desktop&pf_rd_m=A1AJ19PSB66TGU&pf_rd_r=YENXHWZT81QNMXW27C8B&pf_rd_r=YENXHWZT81QNMXW27C8B&pf_rd_p=4940946c-0b2b-498c-9e03-31cf7dae70ec"
try:
    kv = {'user-agent' : 'Mozilla/5.0'}  # present the request as coming from a browser
    r = requests.get(url, headers=kv)
    r.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
    r.encoding = r.apparent_encoding
    print(r.text[1000:2000])
except requests.RequestException:
    print("Fetch failed")

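Example 3: submitting search keywords to Baidu and 360

Baidu keyword interface: http://www.baidu.com/s?wd=keyword

360 keyword interface: http://www.so.com/s?q=keyword

import requests

keyword = 'python'
url = "http://www.baidu.com/s"
try:
    kv = {'wd' : keyword}  # for 360, use the key 'q' and the so.com URL
    r = requests.get(url, params=kv)
    r.raise_for_status()
    print(r.request.url)  # the URL that was actually requested
    print(len(r.text))
except requests.RequestException:
    print("Fetch failed")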

Example 4: fetching and saving an image

Image URLs have the form http://www.example.com/picture.jpg, and the image is retrieved as binary data.

import os
import requests

url = "http://image.ngchina.com.cn/2018/1127/20181127013714400.jpg"
root = "D://pics//"
path = root + url.split('/')[-1]  # reuse the file name from the URL
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        with open(path, 'wb') as f:
            f.write(r.content)  # the with block closes the file automatically
        print("File saved")
    else:
        print("File already exists")
except (requests.RequestException, OSError):
    print("Fetch failed")

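Example 5: looking up the location of an IP address automatically

URL format for the lookup: http://www.ip138.com/ips138.asp?ip=ipaddress

import requests

url = "http://www.ip138.com/ips138.asp?ip="
try:
    r = requests.get(url + '202.204.80.112')
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-2500:-1500])  # the slice of the page that holds the result
except requests.RequestException:
    print("Fetch failed")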

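5. Crawler etiquette ("even thieves have a code")

1. Problems caused by web crawlers

Crawler scale:

| Scale | Characteristics | Tool | Typical use |
|---|---|---|---|
| Small | Small data volume, crawl speed not critical | Requests library | Crawling individual pages, experimenting with pages |
| Medium | Larger data volume, crawl speed matters | Scrapy library | Crawling a site or a series of sites |
| Large | Search engines, crawl speed is critical | Custom development | Crawling the whole web |

(1) Harassment: depending on the author's skill and intent, a crawler can impose a heavy resource load on a web server.

(2) Legal risk: data on a server has an owner, and profiting from crawled data carries legal risk.

(3) Privacy leaks: a crawler may be able to get past simple access controls, obtain protected data, and leak personal information.

2. Restrictions on web crawlers

(1) Source review: restrict access by User-Agent

The server checks the User-Agent field of the incoming HTTP header and responds only to browsers or friendly crawlers.

(2) Public notice: Robots

Robots protocol:

Robots Exclusion Standard

Purpose: tell all crawlers the site's crawling policy and ask them to follow it.

Form: a robots.txt file in the root directory of the site.

Use: read robots.txt automatically or manually before crawling. The protocol is not binding, but ignoring it may carry legal risk.

According to the course, human-like access need not follow it, for example a small program that visits a server a few times a day.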
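As a rough illustration of the "read robots.txt automatically" step, the sketch below uses the standard library's urllib.robotparser on a rule set held in memory. The rules, the crawler names (MyCrawler, BadBot) and the example.com URLs are invented for the example; a real crawler would load the site's own robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""

def build_parser(text):
    """Parse robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

rp = build_parser(SAMPLE_RULES)
# Ask whether a given crawler name may fetch a given URL.
allowed_public = rp.can_fetch("MyCrawler", "http://www.example.com/index.html")
allowed_private = rp.can_fetch("MyCrawler", "http://www.example.com/private/data.html")
allowed_badbot = rp.can_fetch("BadBot", "http://www.example.com/index.html")
print(allowed_public, allowed_private, allowed_badbot)
```

Checking can_fetch() before each request is one simple way to keep a crawler inside the published policy.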

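II. Crawler Extraction

1. The Beautiful Soup library

Beautiful Soup is a library for parsing, traversing, and maintaining the "tag tree".

Available parsers: html.parser, lxml, xml, html5lib

Pretty printing: soup.prettify() adds line breaks between tags. bs4 converts every file or string it reads to Unicode and outputs UTF-8 by default.

Basic elements of the BeautifulSoup class

| Element | Description |
|---|---|
| Tag | A tag, the basic unit of information, opened with <> and closed with </> |
| Name | The name of a tag; the name of <p></p> is 'p'. Format: <tag>.name |
| Attributes | The attributes of a tag, organized as a dict. Format: <tag>.attrs |
| NavigableString | The non-attribute string inside a tag, i.e. the string in <>...</>. Format: <tag>.string |
| Comment | The comment part of a string inside a tag; a special kind of NavigableString |

Traversing the tag tree

| Direction | Attribute | Description |
|---|---|---|
| Downward | .contents | List of child nodes; all children of <tag> are stored in a list |
| Downward | .children | Iterator over child nodes, similar to .contents, used to loop over children |
| Downward | .descendants | Iterator over all descendant nodes, used to loop over them |
| Upward | .parent | The parent tag of a node |
| Upward | .parents | Iterator over the ancestor tags of a node, used to loop over ancestors |
| Sideways | .next_sibling | The next sibling node in HTML text order |
| Sideways | .previous_sibling | The previous sibling node in HTML text order |
| Sideways | .next_siblings | Iterator over all following sibling nodes in HTML text order |
| Sideways | .previous_siblings | Iterator over all preceding sibling nodes in HTML text order |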

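2. Information organization and extraction methods

Forms of information markup (XML, JSON, YAML), with a comparison:

| Format | Characteristics | Typical use |
|---|---|---|
| XML | The earliest general-purpose markup language; very extensible but verbose | Exchanging and transferring information on the Internet |
| JSON | Information carries types; well suited to processing by programs; more concise than XML | Communication between mobile apps, the cloud, and nodes; no comments |
| YAML | Information carries no types; highest proportion of plain text; very readable | Configuration files of all kinds; supports comments and is easy to read |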
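The remark that JSON "carries types" while other markup is plain text can be seen with a small standard-library sketch. The two sample records below are invented for illustration, and YAML is left out only because parsing it needs a third-party package.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample records describing the same data in two formats.
json_text = '{"name": "BIT", "founded": 1940, "tags": ["engineering", "science"]}'
xml_text = "<school><name>BIT</name><founded>1940</founded></school>"

record = json.loads(json_text)
root = ET.fromstring(xml_text)

json_founded = record["founded"]          # parsed directly as an int
xml_founded = root.find("founded").text   # element text is a str until converted

print(type(json_founded).__name__, type(xml_founded).__name__)
```

With XML the program has to know which fields are numeric and convert them itself, which is part of why JSON is described as better suited to program processing.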

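General methods of information extraction:

(1) Parse the markup form completely (XML, JSON, YAML), then extract the key information.

This needs a markup parser, such as the tag-tree traversal in bs4. Advantage: the information is parsed accurately. Disadvantage: the extraction process is tedious.

(2) Ignore the markup and search for the key information directly.

A text search function on the information is enough. Advantage: extraction is simple and fast. Disadvantage: accuracy depends directly on the content.

Combined method: parse the form and search as well to extract the key information. This needs both a markup parser and a text search function.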

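Example: extracting all URL links from an HTML page

Approach: 1) find all <a> tags,

2) parse each <a> tag and extract the link that follows href

import requests
from bs4 import BeautifulSoup

url = "http://python123.io/ws/demo.html"
r = requests.get(url)
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))  # the value of the href attribute, or None if absent

--------------out---------------

http://www.icourse163.org/course/BIT-268001

http://www.icourse163.org/course/BIT-1001870001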

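Searching HTML content with bs4:

<>.find_all(name=None, attrs={}, recursive=True, string=None, limit=None, **kwargs)

# returns a list holding the search results

name: search string for tag names

attrs: search string for tag attribute values; a specific attribute can be named

recursive: whether to search all descendants, True by default

string: search string for the text area in <>...</> (called text in older versions of bs4)

limit: maximum number of results to return

soup.find_all('a')

soup.find_all(['a','b'])

soup.find_all(True) # returns all tags

soup.find_all('p', 'course') # all p tags whose class attribute is course

soup.find_all(id='link1') # tags with the attribute id='link1'

Shorthand:

<tag>(..) is equivalent to <tag>.find_all(..)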

3. Example: a targeted crawler for Chinese university rankings

Function: crawl the university ranking at http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html and print the rank, university, and total score.

Steps: 1) fetch the ranking page from the network: getHTMLText()

2) extract the information from the page into a suitable data structure (the key step; a two-dimensional list): fillUnivList()

3) use the data structure to display the results: printUnivList()

import requests
import bs4
from bs4 import BeautifulSoup

def getHTMLText(url):
    try:
        r = requests.get(url, timeout = 30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, 'html.parser')
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):  # skip the strings between <tr> tags
            tds = tr.find_all('td')
            ulist.append([tds[0].string,tds[1].string,tds[3].string])

def printUnivList(ulist, num):
    # {3} is the fill character: chr(12288), the fullwidth space, keeps Chinese text aligned
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("排名", "学校名称", "总分", chr(12288)))  # rank, university name, total score
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], chr(12288)))

def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 3)

if __name__ == "__main__":
    main()

'''out

排名　　　学校名称　　　总分

1 　　　清华大学　　　 95.9

2 　　　北京大学　　　 82.6

3 　　　浙江大学　　　 80

'''
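The alignment trick in the template, `{1:{3}^10}`, passes the fill character as a nested format field. A minimal sketch of the same idea follows; the helper name center_cjk is made up for this illustration.

```python
FULLWIDTH_SPACE = chr(12288)  # U+3000, the fullwidth (ideographic) space

def center_cjk(text, width):
    """Center text in a field of the given width, padding with fullwidth spaces."""
    # The fill character and the width are both supplied as nested fields.
    return "{0:{1}^{2}}".format(text, FULLWIDTH_SPACE, width)

padded = center_cjk("清华大学", 10)
print(repr(padded))
```

Padding with an ordinary ASCII space would make each pad half as wide as a Chinese character, which is why columns of Chinese text drift out of line without the fullwidth fill.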

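4. Introduction to regular expressions

Regular expression: an expression that concisely describes a set of strings.

Compiling: converting a string that follows regular expression syntax into a pattern object: p = re.compile(regex). The pattern object represents a set of strings.

Common operators

Operator   Description                                                  Example
.          Any single character
[ ]        Character set: the allowed values for a single character     [abc] matches a, b or c; [a-z] matches one character from a to z
[^ ]       Negated set: the excluded values for a single character      [^abc] matches one character that is not a, b or c
*          Previous character repeated 0 or more times                  abc* matches ab, abc, abccccc, etc.
+          Previous character repeated 1 or more times                  abc+ matches abc, abcc, abccc, etc.
?          Previous character repeated 0 or 1 times                     abc? matches ab, abc
|          Either the left or the right expression                      abc|def matches abc or def
{m}        Previous character repeated m times                          ab{2}c matches abbc
{m,n}      Previous character repeated m to n times (inclusive)         ab{1,2}c matches abc, abbc
^          Start of the string                                          ^abc matches abc at the start of a string
$          End of the string                                            abc$ matches abc at the end of a string
()         Grouping; a | inside applies only within the group           (abc) matches abc; (abc|def) matches abc or def
\d         A digit, equivalent to [0-9] for ASCII text
\w         A word character, equivalent to [A-Za-z0-9_] for ASCII text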

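Classic regular expressions:

Regular expression           Meaning
^[A-Za-z]+$                  A string made up of the 26 letters
^[A-Za-z0-9]+$               A string made up of the 26 letters and digits
^-?\d+$                      A string in integer form
^[0-9]*[1-9][0-9]*$          A string in positive-integer form
[1-9]\d{5}                   Postal code in mainland China
[\u4e00-\u9fa5]              A Chinese character
\d{3}-\d{8}|\d{4}-\d{7}      Domestic phone number, e.g. 010-68913536
(([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])\.){3}([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])   IP address (the dot is escaped as \. so that it matches only a literal dot)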
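To see why the dot in the IP address pattern should be written as `\.`, the sketch below compares the escaped pattern with an unescaped variant. The helper is_ip and the test strings other than 202.204.80.112 are invented for this illustration.

```python
import re

# One octet, 0 to 255, exactly as in the table entry above.
OCTET = r"([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])"
IP_ESCAPED = re.compile(r"(" + OCTET + r"\.){3}" + OCTET)
IP_UNESCAPED = re.compile(r"(" + OCTET + r".){3}" + OCTET)  # '.' matches any character

def is_ip(text, pattern=IP_ESCAPED):
    """Return True when the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None

valid = is_ip("202.204.80.112")
out_of_range = is_ip("256.1.1.1")
letters_strict = is_ip("1a2b3c4")
letters_loose = is_ip("1a2b3c4", IP_UNESCAPED)
print(valid, out_of_range, letters_strict, letters_loose)
```

Using fullmatch() here plays the role of the ^ and $ anchors, so trailing garbage after a valid address is rejected as well.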

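How to write a regular expression in Python

raw string type (a string in which backslashes are not treated as escapes): r'text', e.g. r'\d{3}-\d{8}|\d{4}-\d{7}'

string type, where \ is treated as an escape character and the pattern is more awkward to write: e.g. '\\d{3}-\\d{8}|\\d{4}-\\d{7}'

When a regular expression contains backslashes, use the raw string type.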
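A quick check that the two spellings describe the same pattern, using the phone-number expression from the notes (the sample text 'Tel: 010-68913536' is made up around the number given earlier):

```python
import re

raw_pattern = r'\d{3}-\d{8}|\d{4}-\d{7}'          # raw string: backslashes kept as written
plain_pattern = '\\d{3}-\\d{8}|\\d{4}-\\d{7}'    # ordinary string: each backslash doubled

same_text = raw_pattern == plain_pattern
found = re.findall(raw_pattern, 'Tel: 010-68913536')
print(same_text, found)
```

Both literals produce the identical string, so the choice only affects how readable the source code is.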

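Main functions of the re library:

| Function | Description |
|---|---|
| re.search(pattern,string,flags=0) | Searches a string for the first position that matches the regular expression; returns a match object |
| re.match(pattern,string,flags=0) | Matches the regular expression from the start of a string; returns a match object |
| re.findall(pattern,string,flags=0) | Searches a string and returns all matching substrings as a list |
| re.split(pattern,string,maxsplit=0,flags=0) | Splits a string at the matches of the regular expression; returns a list |
| re.finditer(pattern,string,flags=0) | Searches a string and returns an iterator of results; each element is a match object |
| re.sub(pattern,repl,string,count=0,flags=0) | Replaces every matching substring in a string; returns the resulting string |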
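The six functions can be compared side by side on one input. This is a small sketch using the postal-code pattern from the notes; the sample text and the replacement string ':zipcode' are chosen just for illustration.

```python
import re

ZIP = r'[1-9]\d{5}'
text = 'BIT100081 TSU100084'

first = re.search(ZIP, text).group(0)         # first match anywhere in the string
at_start = re.match(ZIP, text)                # only tries position 0, so no match here
all_codes = re.findall(ZIP, text)             # every match, as a list of strings
pieces = re.split(ZIP, text)                  # the text between the matches
pieces_once = re.split(ZIP, text, maxsplit=1) # stop after the first split
from_iter = [m.group(0) for m in re.finditer(ZIP, text)]
masked = re.sub(ZIP, ':zipcode', text)        # replace every match

print(first, at_start, all_codes, pieces, pieces_once, masked)
```

Note that re.match() returns None rather than raising, so its result should be checked before calling .group().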

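An equivalent way to use the re library

import re

# Functional style: a one-off operation
rst = re.search(r'[1-9]\d{5}', 'BIT 100081')

# Object-oriented style: compile once, then reuse for many operations
pat = re.compile(r'[1-9]\d{5}')
rst = pat.search('BIT 100081')

# General form: re.compile turns the string form of a pattern into a regular
# expression object; only the compiled object is the "real" regular expression.
# regex = re.compile(pattern, flags=0)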

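The match object of the re library

| Attribute | Description |
|---|---|
| .string | The text being matched |
| .re | The pattern object (regular expression) used for the match |
| .pos | The position where the search of the text starts |
| .endpos | The position where the search of the text ends |

| Method | Description |
|---|---|
| .group(0) | The matched string |
| .start() | The start position of the match in the original string |
| .end() | The end position of the match in the original string |
| .span() | Returns (.start(), .end()) |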
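The attributes and methods of a match object can be inspected directly. A short sketch on the sample string 'BIT100081 TSU100084' (the string itself is only an illustration):

```python
import re

m = re.search(r'[1-9]\d{5}', 'BIT100081 TSU100084')

info = {
    'string': m.string,       # the text that was searched
    'pattern': m.re.pattern,  # source text of the compiled pattern
    'pos': m.pos,             # where the search started
    'endpos': m.endpos,       # where the search was allowed to end
    'group': m.group(0),      # the matched text
    'start': m.start(),
    'end': m.end(),
    'span': m.span(),
}
for key, value in info.items():
    print(key, value)
```

The span is a half-open interval, so m.string[m.start():m.end()] reproduces m.group(0).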

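Greedy matching and minimal matching:

The re library is greedy by default, i.e. it returns the longest matching substring.

import re

# Greedy match
match = re.search(r'PY.*N', 'PYANBNCNDN')

# Minimal match
match = re.search(r'PY.*?N', 'PYANBNCNDN')

Minimal-match operators

| Operator | Description |
|---|---|
| *? | Previous character repeated 0 or more times, minimal match |
| +? | Previous character repeated 1 or more times, minimal match |
| ?? | Previous character repeated 0 or 1 times, minimal match |
| {m,n}? | Previous character repeated m to n times (inclusive), minimal match |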
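The difference between the greedy and minimal operators shows up as soon as the results are printed. A small sketch on the same sample string, with a bounded {1,5} variant added for illustration:

```python
import re

text = 'PYANBNCNDN'

greedy = re.search(r'PY.*N', text).group(0)              # runs to the last N
minimal = re.search(r'PY.*?N', text).group(0)            # stops at the first N
greedy_bounded = re.search(r'PY.{1,5}N', text).group(0)  # at most 5 characters in between
minimal_bounded = re.search(r'PY.{1,5}?N', text).group(0)

print(greedy, minimal, greedy_bounded, minimal_bounded)
```

Adding ? after a quantifier changes only which of the possible matches is returned, not whether the pattern matches.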

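5. Example: a targeted price-comparison crawler for Dangdang

Goal: fetch Dangdang search result pages and extract the product names and prices.

Difficulty: working out Dangdang's search interface (the search URL and its paging parameter).

Technical route: requests + BeautifulSoup

Program structure: 1) submit the product search request and loop over the result pages

2) for each page, extract the product names and prices

3) print the results to the screen and save them to a CSV file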

import csv
import requests
from bs4 import BeautifulSoup

def getHTMLText(url):
    try:
        r = requests.get(url, timeout = 30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

def parsePage(ilt, html):
    try:
        soup = BeautifulSoup(html, 'html.parser')
        # the attribute values below are taken from Dangdang's page markup
        div_tag = soup.find(name='div', attrs={'dd_name':"普通商品区域"})
        li_tag = div_tag.find_all(name='li')
        for each_goods_li in li_tag:
            # [1:] drops the leading currency symbol
            price = each_goods_li.find(name = 'span', attrs={'class':"price_n"}).string[1:]
            name = each_goods_li.find(name='a', attrs={'dd_name': r"单品标题"}).attrs['title']
            ilt.append([price,name])
    except Exception:
        # a page with unexpected structure is skipped
        print("")

def printGoodsList(ilt):
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("No.", "Price", "Product name"))
    count = 0
    for g in ilt:
        count += 1
        print(tplt.format(count, g[0], g[1]))

def saveGoods(ilt):
    if len(ilt) != 0:
        headers = ["No.", "Price", "Product name"]
        # newline='' stops the csv module from writing blank rows on Windows
        with open('goods.csv', 'w', encoding='utf-8', newline='') as f:
            f_csv = csv.writer(f)
            f_csv.writerow(headers)
            for i in range(len(ilt)):
                row = [i+1,ilt[i][0],ilt[i][1]]
                f_csv.writerow(row)

def main():
    goods = '书包'  # search keyword: "schoolbag"
    depth = 3       # number of result pages to fetch
    start_url = 'http://search.dangdang.com/?key=' + goods + '&page_index='
    infoList = []
    for i in range(depth):
        try:
            url = start_url + str(i+1)
            html = getHTMLText(url)
            parsePage(infoList, html)
        except Exception:
            continue
    printGoodsList(infoList)
    saveGoods(infoList)

if __name__ == "__main__":
    main()

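Example: a targeted crawler for stock data

Goal: get the names and trading information of all stocks on the Shanghai and Shenzhen stock exchanges

Output: saved to a file

Candidate sites: 1) Sina Stock: http://finance.sina.com.cn/stock/ (probably generated by JavaScript, so not a good fit)

2) Baidu Stock: https://gupiao.baidu.com/stock/

Selection principle: the stock information is present in the HTML page, is not generated by JavaScript code, and is not restricted by the Robots protocol.

Program structure: 1) get the stock list from Eastmoney

2) for each stock in the list, get its details from Baidu Stock

3) store the results in a file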

import requests
from bs4 import BeautifulSoup
import traceback
import re

def getHTMLText(url, code='utf-8'):
    try:
        r = requests.get(url, timeout = 30)
        r.raise_for_status()
        r.encoding = code  # set the encoding directly instead of guessing it per page
        return r.text
    except requests.RequestException:
        return ""

def getStockList(lst, stockURL):
    html = getHTMLText(stockURL, 'GB2312')
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            # stock codes look like sh600000 or sz002439
            lst.append(re.findall(r'[s][hz]\d{6}', href)[0])
        except (KeyError, IndexError):
            # the <a> tag has no href, or the href holds no stock code
            continue

def getStockInfo(lst, stockURL, fpath):
    count = 0
    for stock in lst:
        url = stockURL + stock + '.html'
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={ 'class':"stock-bets"})
            name = stockInfo.find_all(attrs={ 'class':"bets-name"})[0]
            infoDict.update({'股票名称':name.text.split()[0]})  # key means "stock name"
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                value = valueList[i].text
                infoDict[key] = value
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
                count += 1
                # \r rewrites the same console line to show progress
                print('\rProgress: {:.2f}%'.format(count*100/len(lst)), end='')
        except Exception:
            #traceback.print_exc()
            count += 1
            print('\rProgress: {:.2f}%'.format(count * 100 / len(lst)), end='')
            continue

def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D://BaiduStockInfo.txt'
    slist = []
    getStockList(slist,stock_list_url)
    getStockInfo(slist,stock_info_url,output_file)

if __name__ == '__main__':
    main()
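The stock-list step relies on one regular expression, `[s][hz]\d{6}`, to pick stock codes out of link targets. The sketch below isolates that step; the function name extract_codes and the example.com links are invented for this illustration and are not the real Eastmoney URLs.

```python
import re

STOCK_CODE = re.compile(r'[s][hz]\d{6}')  # sh or sz followed by six digits

def extract_codes(hrefs):
    """Return the first stock code found in each href, skipping links without one."""
    codes = []
    for href in hrefs:
        found = STOCK_CODE.findall(href)
        if found:  # an explicit check instead of catching IndexError
            codes.append(found[0])
    return codes

# Hypothetical link targets, made up for the example.
sample_hrefs = [
    'http://quote.example.com/sh600000.html',
    'http://quote.example.com/sz002439.html',
    'http://quote.example.com/about.html',
]
codes = extract_codes(sample_hrefs)
print(codes)
```

Testing the result of findall() before indexing is one alternative to the try/except used in the crawler; either approach skips links that hold no stock code.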

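III. The Scrapy Crawler Framework

Introduction to the Scrapy framework

Scrapy is not a simple function library but a crawler framework with a "5+2" structure: five modules (Engine, Downloader, Scheduler, Spider, Item Pipelines) plus two middleware layers.

Framework overview:

| Module | User changes | Function |
|---|---|---|
| Engine | Not needed | The core of the framework; controls the data flow between all modules and triggers events based on conditions |
| Downloader | Not needed | Downloads web pages according to requests |
| Scheduler | Not needed | Schedules and manages all crawl requests |
| Downloader Middleware | Needed | User-configurable control between the Engine, Scheduler, and Downloader; modifies, drops, or adds requests or responses |
| Spider | Needed | Parses the responses (Response) returned by the Downloader; produces scraped items and additional crawl requests |
| Item Pipelines | Needed | Processes the items produced by the Spider as a pipeline; operations include cleaning, validating, de-duplicating, and storing data |
| Spider Middleware | Needed | Reprocesses requests and scraped items |

Requests vs Scrapy

Similarities:

1) Both can request and crawl pages; they are the two main technical routes for Python crawlers.

2) Both are easy to use, well documented, and simple to get started with.

3) Neither handles JavaScript, form submission, or CAPTCHAs out of the box (both can be extended).

Differences:

| Requests | Scrapy |
|---|---|
| Page-level crawler | Site-level crawler |
| Function library | Framework |
| Little attention to concurrency, lower performance | Good concurrency, higher performance |
| Focus on downloading pages | Focus on crawler structure |
| Flexible customization | Flexible for ordinary customization, hard to customize deeply |
| Very easy to pick up | Slightly harder to get started |

Choosing between them:

1) Very small needs: the requests library

2) Needs that are not so small: Scrapy, for continuous or periodic crawling that builds up your own data store

3) Highly customized needs: build your own framework; requests > Scrapy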

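Common commands:

| Command | Description | Format |
|---|---|---|
| startproject | Create a new project | scrapy startproject <name> [dir] |
| genspider | Create a spider | scrapy genspider [options] <name> <domain> |
| settings | Get the crawler configuration | scrapy settings [options] |
| crawl | Run a spider | scrapy crawl <spider> |
| list | List all spiders in the project | scrapy list |
| shell | Start the URL debugging shell | scrapy shell [url] |

Steps:

1) Create a Scrapy project: scrapy startproject python123demo

2) Generate a Scrapy spider in the project: scrapy genspider demo python123.io

3) Configure the generated spider, demo.py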

Simplified version:

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    #allowed_domains = ['python123.io']
    start_urls = ['http://python123.io/ws/demo.html']

    def parse(self, response):
        # save the page under the last segment of its URL
        fname = response.url.split('/')[-1]
        with open(fname, 'wb') as f:
            f.write(response.body)
        self.log("Save file %s." % fname)

Full version:

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'

    def start_requests(self):
        urls = [
            'http://python123.io/ws/demo.html'
        ]
        # yield one request per URL instead of relying on start_urls
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        fname = response.url.split('/')[-1]
        with open(fname, 'wb') as f:
            f.write(response.body)
        self.log("Save file %s." % fname)

4) Run the spider to fetch the page: scrapy crawl demo

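Usage steps:

1) Create a project and a Spider template

2) Write the Spider

3) Write the Item Pipeline

4) Tune the configuration

Data types: the Request class, the Response class, and the Item class

1) The Request class

class scrapy.http.Request(): represents an HTTP request; generated by the Spider and executed by the Downloader

| Attribute or method | Description |
|---|---|
| .url | The URL of the request |
| .method | The request method, 'GET', 'POST', etc. |
| .headers | The request headers, dict-like |
| .body | The body of the request (a string in the course; bytes in current Scrapy releases) |
| .meta | Extra information added by the user, used to pass data between Scrapy's internal modules |
| .copy() | Copies the request |

2) The Response class

class scrapy.http.Response(): represents an HTTP response; generated by the Downloader and processed by the Spider

| Attribute or method | Description |
|---|---|
| .url | The URL of the response |
| .status | The HTTP status code, 200 by default |
| .headers | The headers of the response |
| .body | The content of the response (a string in the course; bytes in current Scrapy releases) |
| .flags | A set of flags |
| .request | The Request object that produced this Response |
| .copy() | Copies the response |

3) The Item class

class scrapy.item.Item(): an Item object represents a piece of information extracted from an HTML page; generated by the Spider and processed by the Item Pipeline. It behaves like a dict and can be used like one.

Scrapy supports several ways of extracting information from HTML: Beautiful Soup, lxml, re, XPath Selector, CSS Selector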

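Example: using Scrapy to get the names and trading information of all stocks on the Shanghai and Shenzhen stock exchanges

Baidu Stock: https://gupiao.baidu.com/stock/

A single stock: https://gupiao.baidu.com/stock/sz002439

Eastmoney: http://quote.eastmoney.com/stocklist.html

1) Create the project and the spider template

>scrapy startproject BaiduStocks

>cd BaiduStocks

>scrapy genspider stocks baidu.com

>edit the file spiders/stocks.py

2) Write the spider

>configure stocks.py

>change how returned pages are handled

>change how requests for newly found URLs are handled

3) Write the pipelines

>configure pipelines.py

>define the class that processes scraped items

>set the ITEM_PIPELINES option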

stocks.py

# -*- coding: utf-8 -*-
import scrapy
import re

class StocksSpider(scrapy.Spider):
    name = 'stocks'
    start_urls = ['http://quote.eastmoney.com/stocklist.html']

    def parse(self, response):
        # follow every link whose href contains a stock code
        for href in response.css('a::attr(href)').extract():
            try:
                stock = re.findall(r"[s][hz]\d{6}", href)[0]
                url = "https://gupiao.baidu.com/stock/"+ stock + '.html'
                yield scrapy.Request(url, callback=self.parse_stock)
            except IndexError:
                continue

    def parse_stock(self, response):
        infoDict = {}
        stockInfo = response.css('.stock-bets')
        name = stockInfo.css('.bets-name').extract()[0]
        keyList = stockInfo.css('dt').extract()
        valueList = stockInfo.css('dd').extract()
        for i in range(len(keyList)):
            # the patterns end with the full closing tag (5 characters),
            # so the [..:-5] slices remove exactly that tag
            key = re.findall(r'>.*</dt>', keyList[i])[0][1:-5]
            try:
                val = re.findall(r'\d+\.?.*</dd>', valueList[i])[0][0:-5]
            except IndexError:
                val = '--'
            infoDict[key] = val
        # key means "stock name"
        infoDict.update({'股票名称': re.findall(r'\s.*\(',name)[0].split()[0] +
                         re.findall(r'\>.*\<',name)[0][1:-1]})
        yield infoDict
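The slices in parse_stock only give clean results when the pattern ends with the complete closing tag, because `[1:-5]` and `[0:-5]` strip five characters (`</dt>` or `</dd>`). The sketch below checks this on two made-up HTML fragments; the value 12.34 is invented.

```python
import re

# Hypothetical fragments, shaped like the strings returned by .extract().
dt_html = '<dt>今开</dt>'
dd_html = '<dd>12.34</dd>'

# The pattern includes the closing '>', so the match ends with 5 characters of tag.
key = re.findall(r'>.*</dt>', dt_html)[0][1:-5]
val = re.findall(r'\d+\.?.*</dd>', dd_html)[0][0:-5]

# If the pattern stops at '</dt' (4 characters), the same slice eats one character too many.
key_short = re.findall(r'>.*</dt', dt_html)[0][1:-5]

print(key, val, key_short)
```

A less fragile option would be a capturing group such as r'>(.*)</dt>', which avoids the manual slicing altogether.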

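pipelines.py (also edit settings.py so that ITEM_PIPELINES references BaidustocksInfoPipline)

class BaidustocksInfoPipline(object):
    def open_spider(self, spider):
        # called once when the spider starts
        self.f = open('BaiduStockInfo.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # called once when the spider finishes; Scrapy looks for this exact name
        self.f.close()

    def process_item(self, item, spider):
        try:
            line = str(dict(item)) + '\n'
            self.f.write(line)
        except Exception:
            pass
        return item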