网络爬虫 | 基础

最新推荐文章于 2024-04-26 03:23:26 发布

至此@

最新推荐文章于 2024-04-26 03:23:26 发布

阅读量662

点赞数 2

分类专栏： Python 文章标签： python

本文链接：https://blog.csdn.net/qq_42876577/article/details/104771656

版权

Python 专栏收录该内容

3 篇文章 1 订阅

订阅专栏

1、爬虫工作原理

首先，爬虫可以模拟浏览器去向服务器发出请求；
其次，等服务器响应后，爬虫程序还可以代替浏览器帮我们解析数据；
接着，爬虫可以根据我们设定的规则批量提取相关数据，而不需要我们去手动提取；
最后，爬虫可以批量地把数据存储到本地。

爬虫的工作分为四步：

第1步：获取数据。爬虫程序会根据我们提供的网址，向服务器发起请求，然后返回数据。
第2步：解析数据。爬虫程序会把服务器返回的数据解析成我们能读懂的格式。
第3步：提取数据。爬虫程序再从中提取出我们需要的数据。
第4步：储存数据。爬虫程序把这些有用的数据保存起来，便于你日后的使用和分析。

2、 requests

requests库可以帮我们下载网页源代码、文本、图片，甚至是音频。其实，“下载”本质上是向服务器发送请求并得到响应。

2.1 requests.get()

import requests
#引入requests库
res = requests.get('URL')
#requests.get是在调用requests库中的get()方法，它向服务器发送了一个请求，括号里的参数是你需要的数据所在的网址，然后服务器对请求作出了响应。
#我们把这个响应返回的结果赋值在变量res上。

import requests：引入requests库
res = requests.get(‘URL’)：requests.get()发送了请求，然后得到了服务器的响应。服务器返回的结果是个Response对象，现在存储到了我们定义的变量res中。

2.2 Response对象的常用属性

res是一个对象，属于requests.models.Response类。

import requests 
res = requests.get('URL') 
print(type(res))
#打印变量res的数据类型

结果：

<class 'requests.models.Response'>

Response对象常用的四个属性：

属性	作用
response.status code	检查请求是否成功
response.content	把reponse对象转换为二进制数据
response.text	把reponse对象转换为字符串数据
response.encoding	定义response对象的编码

response.content 例子：

下载图片

import requests
res = requests.get('xxx.png')
#发出请求，并把返回的结果放在变量res中
pic=res.content
#把Reponse对象的内容以二进制数据的形式返回
photo = open('ppt.jpg','wb')
#新建了一个文件ppt.jpg，这里的文件没加路径，它会被保存在程序运行的当前目录下。
#图片内容需要以二进制wb读写。你在学习open()函数时接触过它。
photo.write(pic) 
#获取pic的二进制内容
photo.close()
#关闭文件

response.text 例子：

下载小说

import requests
#引用requests库
res = requests.get('xxx.md')
#下载小说，我们得到一个对象，它被命名为res
novel=res.text
#把Response对象的内容以字符串的形式返回
print(novel[:800])
#现在，可以打印小说了，但考虑到整章太长，只输出800字看看就好。在关于列表的知识那里，你学过[:800]的用法。

res.encoding 例子：

下载小说

import requests
#引用requests库
res = requests.get('xxx.md')
#下载小说，我们得到一个对象，它被命名为res
res.encoding='utf-8'
#定义Reponse对象的编码为utf-8。
novel=res.text
#把Response对象的内容以字符串的形式返回
print(novel[:800])
#打印小说的前800个字。

3、爬虫理论

Robots协议是互联网爬虫的一项公认的道德规范，它的全称是==“网络爬虫排除标准”（Robots exclusion protocol）==，这个协议用来告诉爬虫，哪些页面是可以抓取的，哪些不可以。
如何查看网站的robots协议呢，很简单，在网站的域名后加上==/robots.txt==就可以了。

User-agent:  Baiduspider #百度爬虫
Allow:  /article #允许访问 /article.htm
Allow:  /oshtml #允许访问 /oshtml.htm
Allow:  /ershou #允许访问 /ershou.htm
Allow: /$ #允许访问根目录，即淘宝主页
Disallow:  /product/ #禁止访问/product/
Disallow:  / #禁止访问除 Allow 规定页面之外的其他所有页面

User-Agent:  Googlebot #谷歌爬虫
Allow:  /article
Allow:  /oshtml
Allow:  /product #允许访问/product/
Allow:  /spu
Allow:  /dianpu
Allow:  /oversea
Allow:  /list
Allow:  /ershou
Allow: /$
Disallow:  / #禁止访问除 Allow 规定页面之外的其他所有页面

…… # 文件太长，省略了对其它爬虫的规定，想看全文的话，点击上面的链接

User-Agent:  * #其他爬虫
Disallow:  / #禁止访问所有页面

Allow代表可以被访问，Disallow代表禁止被访问。

4、BeautifulSoup

使用BeautifulSoup解析和提取网页中的数据。

【提取数据】是指把我们需要的数据从众多数据中挑选出来。

4.1 解析数据

BeautifulSoup解析数据的用法：

bs对象= BeautifulSoup(要解析的文本，' 解析器')

例子：

import requests
from bs4 import BeautifulSoup
#引入BS库
res = requests.get('xxx.html') 
html = res.text
soup = BeautifulSoup(html,'html.parser') #把网页解析为BeautifulSoup对象

soup的数据类型是<class ‘bs4.BeautifulSoup’>，说明soup是一个BeautifulSoup对象。

注意：
response.text和soup打印出的内容表面上看长得一模一样，却有着不同的内心，它们属于不同的类：<class ‘str’> 与<class ‘bs4.BeautifulSoup’>。前者是字符串，后者是已经被解析过的BeautifulSoup对象。之所以打印出来的是一样的文本，是因为BeautifulSoup对象在直接打印它的时候会调用该对象内的__str__方法，所以直接打印 bs 对象显示字符串是__str__的返回结果。

4.2 提取数据

find()与find_all()
find()与find_all()是BeautifulSoup对象的两个方法，它们可以匹配html的标签和属性，把BeautifulSoup对象里符合要求的数据都提取出来。

方法	作用	用法	示例
find()	提取满足要求的首个数据	BeautifulSoup对象.find (标签，属性)	soup.find(‘div’ ,class_ =‘books’)
find_all()	提取满足要求的所有数据	BeautifulSoup对象.find_all (标签，属性)	soup.find_all(‘div’ ,class_ =‘books’)

例子：

import requests
from bs4 import BeautifulSoup
url = 'https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html'
res = requests.get (url)
print(res.status_code)
soup = BeautifulSoup(res.text,'html.parser')
item = soup.find('div') #使用find()方法提取首个<div>元素，并放到变量item里。
	print(type(item)) #打印item的数据类型
	print(item)       #打印item

运行结果：首个<div>元素。还打印了它的数据类型：<class 'bs4.element.Tag'>，说明这是一个Tag类对象。

import requests
from bs4 import BeautifulSoup
url = 'https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html'
res = requests.get (url)
print(res.status_code)
soup = BeautifulSoup(res.text,'html.parser')
items = soup.find_all('div') #用find_all()把所有符合要求的数据提取出来，并放在变量items里
	print(type(items)) #打印item的数据类型
	print(items)       #打印item

运行结果：三个<div>元素，它们一起组成了一个列表结构。打印items的类型，显示的是<class 'bs4.element.ResultSet'>，是一个ResultSet类的对象。其实是Tag对象以列表结构储存了起来，可以把它当做列表来处理。

Tag对象

属性/方法	作用
Tag.find()和Tag.fnd all()	提取Tag中的Tag
Tag.text	提取Tag中的文字
Tag['属性名]	输入参数:属性名，可以提取Tag中这个属性的值

例子：

爬取书苑不太冷网站中每本书的类型、链接、标题和简介

import requests # 调用requests库
from bs4 import BeautifulSoup # 调用BeautifulSoup库
res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html')
# 返回一个response对象，赋值给res
html = res.text
# 把res的内容以字符串的形式返回
soup = BeautifulSoup( html,'html.parser') 
# 把网页解析为BeautifulSoup对象
items = soup.find_all(class_='books') # 通过定位标签和属性提取我们想要的数据
for item in items:
    kind = item.find('h2') # 在列表中的每个元素里，匹配标签<h2>提取出数据
    title = item.find(class_='title') #在列表中的每个元素里，匹配属性class_='title'提取出数据
    brief = item.find(class_='info') #在列表中的每个元素里，匹配属性class_='info'提取出数据
    print(kind,'\n',title,'\n',brief) # 打印提取出的数据
    print(type(kind),type(title),type(brief)) # 打印提取出的数据类型

用Tag.text提出Tag对象中的文字，用Tag[‘href’]提取出URL

5、例子：

5.1 例一

写一个循环，提取页面的所有菜名、URL、食材，并将它存入列表。

[[菜A,URL_A,食材A],[菜B,URL_B,食材B],[菜C,URL_C,食材C]]

import requests
# 引用requests库
from bs4 import BeautifulSoup
# 引用BeautifulSoup库

res_foods = requests.get('http://www.xiachufang.com/explore/')
# 获取数据
bs_foods = BeautifulSoup(res_foods.text,'html.parser')
# 解析数据
list_foods = bs_foods.find_all('div',class_='info pure-u')
# 查找最小父级标签

list_all = []
# 创建一个空列表，用于存储信息

for food in list_foods:

    tag_a = food.find('a')
    # 提取第0个父级标签中的<a>标签
    name = tag_a.text[17:-13]
    # 菜名，使用[17:-13]切掉了多余的信息
    URL = 'http://www.xiachufang.com'+tag_a['href']
    # 获取URL
    tag_p = food.find('p',class_='ing ellipsis')
    # 提取第0个父级标签中的<p>标签
    ingredients = tag_p.text[1:-1]
    # 食材，使用[1:-1]切掉了多余的信息
    list_all.append([name,URL,ingredients])
    # 将菜名、URL、食材，封装为列表，添加进list_all

print(list_all)
# 打印

# 以下是另外一种解法


list_foods = bs_foods.find_all('div',class_='info pure-u')
# 查找最小父级标签

list_all = []
# 创建一个空列表，用于存储信息

for food in list_foods:

    tag_a = food.find('a')
    # 提取第0个父级标签中的<a>标签
    name = tag_a.text[17:-13]
    # 菜名，使用[17:-13]切掉了多余的信息
    URL = 'http://www.xiachufang.com'+tag_a['href']
    # 获取URL
    tag_p = food.find('p',class_='ing ellipsis')
    # 提取第0个父级标签中的<p>标签
    ingredients = tag_p.text[1:-1]
    # 食材，使用[1:-1]切掉了多余的信息
    list_all.append([name,URL,ingredients])
    # 将菜名、URL、食材，封装为列表，添加进list_all

print(list_all)
# 打印

5.2 例二

爬取歌

import requests
from bs4 import  BeautifulSoup

res_music = requests.get('https://y.qq.com/portal/search.html#page=1&searchid=1&remoteplace=txt.yqq.top&t=song&w=%E5%91%A8%E6%9D%B0%E4%BC%A6')
# 请求html，得到response
bs_music = BeautifulSoup(res_music.text,'html.parser')
# 解析html
list_music = bs_music.find_all('a',class_='js_song')
# 查找class属性值为“js_song”的a标签，得到一个由标签组成的列表
for music in list_music:
# 对查找的结果执行循环
    print(music['title'])
    # 打印出我们想要的音乐名

上面错误

5.2.1 Network

Network的功能是：记录在当前页面上发生的所有请求。
- 第0行的左侧，红色的圆钮是启用Network监控（默认高亮打开），灰色圆圈是清空面板上的信息。
- 右侧勾选框Preserve log，它的作用是“保留请求日志”。如果不点击这个，当发生页面跳转的时候，记录就会被清空。所以，我们在爬取一些会发生跳转的网页时，会点亮它。
- 第3行，是对请求进行分类查看。最常用的是：
  - ALL（查看全部）
  - XHR（一种不借助刷新页面即可传输的对象）
  - Doc（Document，第0个请求一般在这里），有时候也会看看：
  - Img（仅查看图片）/Media（仅查看媒体文件）
  - Other（其他）。

在这里插入图片描述
刚刚写的代码，只是模拟了这52个请求中的一个（准确来说，就是第0个请求），而这个请求里并不包含歌曲清单。

5.2.2 XHR

Ajax技术：应用这种技术，好处是显而易见的——更新网页内容，而不用重新加载整个网页。
Ajax技术在工作的时候创建一个XHR（或是Fetch）对象，然后利用XHR对象来实现，服务器和浏览器之间传输数据。在这里，XHR和Fetch并没有本质区别。

歌单应该在XHR下的client_search中
在这里插入图片描述

Headers：标头（请求信息）
Preview：预览
Response：原始信息
Timing：时间。

在这里插入图片描述
General里的Requests URL（最外层是一个字典）就是应该去访问的链接。

在这里插入图片描述
XHR是一个字典，键data对应的值也是一个字典；在该字典里，键song对应的值也是一个字典；在该字典里，键list对应的值是一个列表；在该列表里，一共有20个元素；每一个元素都是一个字典；在每个字典里，键name的值，对应的是歌曲名。

import requests
# 引用requests库
res = requests.get('https://c.y.qq.com/soso/fcgi-bin/client_search_cp?ct=24&qqmusic_ver=1298&new_json=1&remoteplace=txt.yqq.song&searchid=60997426243444153&t=0&aggr=1&cr=1&catZhida=1&lossless=0&flag_qc=0&p=1&n=20&w=%E5%91%A8%E6%9D%B0%E4%BC%A6&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0')
# 调用get方法，下载这个字典
print(res.text)
# 把它打印出来

res.text取到的，是字符串。不是列表/字典。

5.2.3 JSON

json是一种特殊的字符串，这种字符串特殊在它的写法——它是用列表/字典的语法写成的。
json能够有组织地存储信息。
json()方法，将response对象，转为列表/字典

import requests
# 引用requests库
res_music = requests.get('https://c.y.qq.com/soso/fcgi-bin/client_search_cp?ct=24&qqmusic_ver=1298&new_json=1&remoteplace=txt.yqq.song&searchid=60997426243444153&t=0&aggr=1&cr=1&catZhida=1&lossless=0&flag_qc=0&p=1&n=20&w=%E5%91%A8%E6%9D%B0%E4%BC%A6&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0')
# 调用get方法，下载这个字典
json_music = res_music.json()
# 使用json()方法，将response对象，转为列表/字典
list_music = json_music['data']['song']['list']
# 一层一层地取字典，获取歌单列表
for music in list_music:
# list_music是一个列表，music是它里面的元素
    print(music['name'])
    # 以name为键，查找歌曲名

拿到：歌曲名、所属专辑、播放时长，以及播放链接

import requests
# 引用requests库
res_music = requests.get('https://c.y.qq.com/soso/fcgi-bin/client_search_cp?ct=24&qqmusic_ver=1298&new_json=1&remoteplace=txt.yqq.song&searchid=60997426243444153&t=0&aggr=1&cr=1&catZhida=1&lossless=0&flag_qc=0&p=1&n=20&w=%E5%91%A8%E6%9D%B0%E4%BC%A6&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0')
# 调用get方法，下载这个字典
json_music = res_music.json()
# 使用json()方法，将response对象，转为列表/字典
list_music = json_music['data']['song']['list']
# 一层一层地取字典，获取歌单列表
for music in list_music:
# list_music是一个列表，music是它里面的元素
    print(music['name'])
    # 以name为键，查找歌曲名
    print('所属专辑：'+music['album']['name'])
    # 查找专辑名
    print('播放时长：'+str(music['interval'])+'秒')
    # 查找播放时长
    print('播放链接：https://y.qq.com/n/yqq/song/'+music['mid']+'.html\n\n')
    # 查找播放链接

5.3 例三

在这里插入图片描述
Query String Parametres中文翻译是：查询字符串参数。

爬取多个歌曲
- 点击第2页、第3页进行翻页，此时Network会多加载出2个XHR，它们的Name都是client_search…。
- 只有一个参数变化，说明p代表的应该就是页码。
- 每次循环去更改p的值

import requests
# 引用requests模块
for x in range(5):
    res_music = requests.get('https://c.y.qq.com/soso/fcgi-bin/client_search_cp?ct=24&qqmusic_ver=1298&new_json=1&remoteplace=txt.yqq.song&searchid=60997426243444153&t=0&aggr=1&cr=1&catZhida=1&lossless=0&flag_qc=0&p='+str(x+1)+'&n=20&w=%E5%91%A8%E6%9D%B0%E4%BC%A6&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0')
    # 调用get方法，下载这个字典
    json_music = res_music.json()
    # 使用json()方法，将response对象，转为列表/字典
    list_music = json_music['data']['song']['list']
    # 一层一层地取字典，获取歌单列表
    for music in list_music:
    # list_music是一个列表，music是它里面的元素
        print(music['name'])
        # 以name为键，查找歌曲名
        print('所属专辑：'+music['album']['name'])
        # 查找专辑名
        print('播放时长：'+str(music['interval'])+'秒')
        # 查找播放时长
        print('播放链接：https://y.qq.com/n/yqq/song/'+music['mid']+'.html\n\n')
        # 查找播放链接

5.3.1 params

requests模块里的requests.get()提供了一个参数叫params，可以让我们用字典的形式，把参数传进去。

把Query String Parametres里的内容，直接复制下来，封装为一个字典，传递给params。

import requests
# 引用requests模块
url = 'https://c.y.qq.com/soso/fcgi-bin/client_search_cp'
for x in range(5):
    
    params = {
    'ct':'24',
    'qqmusic_ver': '1298',
    'new_json':'1',
    'remoteplace':'sizer.yqq.song_next',
    'searchid':'64405487069162918',
    't':'0',
    'aggr':'1',
    'cr':'1',
    'catZhida':'1',
    'lossless':'0',
    'flag_qc':'0',
    'p':str(x+1),
    'n':'20',
    'w':'周杰伦',
    'g_tk':'5381',
    'loginUin':'0',
    'hostUin':'0',
    'format':'json',
    'inCharset':'utf8',
    'outCharset':'utf-8',
    'notice':'0',
    'platform':'yqq.json',
    'needNewCode':'0'    
    }
    # 将参数封装为字典
    res_music = requests.get(url,params=params)
    # 调用get方法，下载这个字典
    json_music = res_music.json()
    # 使用json()方法，将response对象，转为列表/字典
    list_music = json_music['data']['song']['list']
    # 一层一层地取字典，获取歌单列表
    for music in list_music:
    # list_music是一个列表，music是它里面的元素
        print(music['name'])
        # 以name为键，查找歌曲名
        print('所属专辑：'+music['album']['name'])
        # 查找专辑名
        print('播放时长：'+str(music['interval'])+'秒')
        # 查找播放时长
        print('播放链接：https://y.qq.com/n/yqq/song/'+music['mid']+'.html\n\n')
        # 查找播放链接

爬取评论
- 找评论：
  先把Network面板清空，再点击一下评论翻页，看看有没有多出来的新XHR，多出来的那一个，就应该是和评论相关的啦。

在这里插入图片描述

反动页数发现：
- XHR有两个参数在不断变化：一个是pagenum，一个lasthotcommentid。
- lasthotcommentid，它的含义是：上一条热评的评论id。

import requests
# 引用requests模块
url = 'https://c.y.qq.com/base/fcgi-bin/fcg_global_comment_h5.fcg'
commentid = ''
# 设置一个初始commentid
for x in range(5):
    
    params = {
    'g_tk':'5381',
    'loginUin':'0',
    'hostUin':'0',
    'format':'json',
    'inCharset':'utf8',
    'outCharset':'GB2312',
    'notice':'0',
    'platform':'yqq.json',
    'needNewCode':'0',
    'cid':'205360772',
    'reqtype':'2',
    'biztype':'1',
    'topid':'102065756',
    'cmd':'8',
    'needcommentcrit':'0',
    'pagenum':str(x),
    'pagesize':'25',
    'lasthotcommentid':commentid,
    'domain':'qq.com',
    'ct':'24',
    'cv':'101010  '
    }
    # 将参数封装为字典，其中pagenum和lastcommentid是特殊的变量
    res_comment = requests.get(url,params=params)
    # 调用get方法，下载评论列表
    json_comment = res_comment.json()
    # 使用json()方法，将response对象，转为列表/字典
    list_comment = json_comment['comment']['commentlist']
    # 一层一层地取字典，获取评论列表
    for comment in list_comment:
    # list_comment是一个列表，comment是它里面的元素
        print(comment['rootcommentcontent'])
        # 输出评论
    commentid = list_comment[24]['commentid']
    # 将最后一个评论的id赋值给comment，准备开始下一次循环

5.3.2 Request Headers

服务器通过什么识别我们是真实浏览器，还是Python爬虫的？
在这里插入图片描述

Requests Headers，请求头。
user-agent会记录你电脑的信息和浏览器版本。
origin和referer则记录了这个请求，最初的起源是来自哪个页面。
区别是referer会比origin携带的信息更多些。
user-agent不修改，那么这里的默认值就会是Python，会被浏览器认出来。

修改origin或referer，一并作为字典写入headers

import requests
url = 'https://c.y.qq.com/base/fcgi-bin/fcg_global_comment_h5.fcg'
# 这是那个，请求歌曲评论的url
headers = {
    'origin':'https://y.qq.com',
    # 请求来源，本案例中其实是不需要加这个参数的，只是为了演示
    'referer':'https://y.qq.com/n/yqq/song/004Z8Ihr0JIu5s.html',
    # 请求来源，携带的信息比“origin”更丰富，本案例中其实是不需要加这个参数的，只是为了演示
    'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    # 标记了请求从什么设备，什么浏览器上发出
    }
params = {
'g_tk':'5381',
'loginUin':'0',
'hostUin':'0',
'format':'json',
'inCharset':'utf8',
'outCharset':'GB2312',
'notice':'0',
'platform':'yqq.json',
'needNewCode':'0',
'cid':'205360772',
'reqtype':'2',
'biztype':'1',
'topid':'102065756',
'cmd':'8',
'needcommentcrit':'0',
'pagenum':0,
'pagesize':'25',
'lasthotcommentid':'',
'domain':'qq.com',
'ct':'24',
'cv':'101010  '
    }

res_music = requests.get(url,headers=headers,params=params)
# 发起请求

6、存储数据

6.1 存储方式

file=open('test.csv','a+')
#创建test.csv文件，以追加的读写模式
file.write('美国队长,钢铁侠,蜘蛛侠')
#写入test.csv文件
file.close()
#关闭文件

在这里插入图片描述

6.2 csv写入与读取

直接运用别人写好的模块，比使用open()函数来读写方便。

import csv
#引用csv模块。
csv_file = open('demo.csv','w',newline='',encoding='utf-8')
#创建csv文件，我们要先调用open()函数，传入参数：文件名“demo.csv”、写入模式“w”、newline=''、encoding='utf-8'。

“w”就是writer，即文件写入模式，它会以覆盖原内容的形式写入新添加的内容。
newline=’ '参数的原因是，可以避免csv文件出现两倍的行距
encoding=‘utf-8’，可以避免编码问题导致的报错或乱码。

csv文件写入

csv.writer()函数来建立一个writer对象，
调用writer对象的writerow()方法。
writerow()函数里，需要放入列表参数

import csv
#引用csv模块。
csv_file = open('demo.csv','w',newline='',encoding='utf-8')
#调用open()函数打开csv文件，传入参数：文件名“demo.csv”、写入模式“w”、newline=''、encoding='utf-8'。
writer = csv.writer(csv_file)
# 用csv.writer()函数创建一个writer对象。
writer.writerow(['电影','豆瓣评分'])
#调用writer对象的writerow()方法，可以在csv文件里写入一行文字 “电影”和“豆瓣评分”。
writer.writerow(['银河护卫队','8.0'])
#在csv文件里写入一行文字 “银河护卫队”和“8.0”。
writer.writerow(['复仇者联盟','8.1'])
#在csv文件里写入一行文字 “复仇者联盟”和“8.1”。
csv_file.close()
#写入完成后，关闭文件。

读csv文件

import csv
csv_file = open('demo.csv','r',newline='',encoding='utf-8')
reader = csv.reader(csv_file)
for row in reader:
    print(row)

csv官方文档

6.3 Excel写入与读取

写入操作

引用openpyxl，然后通过openpyxl.Workbook()函数就可以创建新的工作薄
获取工作表
操作单元格，往单元格里写入内容
工作表里写入一行内容的话，就得用到append函数
保存这个Excel文件

import openpyxl 
#引用openpyxl 。

#写入
wb = openpyxl.Workbook()
#利用openpyxl.Workbook()函数创建新的workbook（工作薄）对象，就是创建新的空的Excel文件。
sheet = wb.active
#wb.active就是获取这个工作薄的活动表，通常就是第一个工作表。
sheet.title = 'new title'
#可以用.title给工作表重命名。现在第一个工作表的名称就会由原来默认的“sheet1”改为"new title"。
sheet['A1'] = '漫威宇宙' 
#把'漫威宇宙'赋值给第一个工作表的A1单元格，就是往A1的单元格中写入了'漫威宇宙'。
rows = [['美国队长','钢铁侠','蜘蛛侠'],['是','漫威','宇宙', '经典','人物']]
#先把要写入的多行内容写成列表，再放进大列表里，赋值给rows。
for i in rows:
    sheet.append(i)
#遍历rows，同时把遍历的内容添加到表格里，这样就实现了多行写入。
print(rows)
#打印rows
wb.save('Marvel.xlsx')
#保存新建的Excel文件，并命名为“Marvel.xlsx”

#读取的代码：
wb = openpyxl.load_workbook('Marvel.xlsx')
#调用openpyxl.load_workbook()函数，打开“Marvel.xlsx”文件。
sheet = wb['new title']
#获取“Marvel.xlsx”工作薄中名为“new title”的工作表。
sheetname = wb.sheetnames
print(sheetname)
#sheetnames是用来获取工作薄所有工作表的名字的。如果你不知道工作薄到底有几个工作表，就可以把工作表的名字都打印出来。
A1_cell = sheet['A1']
A1_value = A1_cell.value
print(A1_value)
#把“new title”工作表中A1单元格赋值给A1_cell，再利用单元格value属性，就能打印出A1单元格的值。

openpyxl官方文档

6.3.1 例子

存储歌单

import requests,openpyxl
wb=openpyxl.Workbook()  
#创建工作薄
sheet=wb.active 
#获取工作薄的活动表
sheet.title='restaurants' 
#工作表重命名

sheet['A1'] ='歌曲名'     #加表头，给A1单元格赋值
sheet['B1'] ='所属专辑'   #加表头，给B1单元格赋值
sheet['C1'] ='播放时长'   #加表头，给C1单元格赋值
sheet['D1'] ='播放链接'   #加表头，给D1单元格赋值

url = 'https://c.y.qq.com/soso/fcgi-bin/client_search_cp'
for x in range(5):
    params = {
        'ct': '24',
        'qqmusic_ver': '1298',
        'new_json': '1',
        'remoteplace': 'sizer.yqq.song_next',
        'searchid': '64405487069162918',
        't': '0',
        'aggr': '1',
        'cr': '1',
        'catZhida': '1',
        'lossless': '0',
        'flag_qc': '0',
        'p': str(x + 1),
        'n': '20',
        'w': '周杰伦',
        'g_tk': '5381',
        'loginUin': '0',
        'hostUin': '0',
        'format': 'json',
        'inCharset': 'utf8',
        'outCharset': 'utf-8',
        'notice': '0',
        'platform': 'yqq.json',
        'needNewCode': '0'
    }

    res_music = requests.get(url, params=params)
    json_music = res_music.json()
    list_music = json_music['data']['song']['list']
    for music in list_music:
        name = music['name']
        # 以name为键，查找歌曲名，把歌曲名赋值给name
        album = music['album']['name']
        # 查找专辑名，把专辑名赋给album
        time = music['interval']
        # 查找播放时长，把时长赋值给time
        link = 'https://y.qq.com/n/yqq/song/' + str(music['file']['media_mid']) + '.html\n\n'
        # 查找播放链接，把链接赋值给link
        sheet.append([name, album, time,url])
        # 把name、album、time和link写成列表，用append函数多行写入Excel
        print('歌曲名：' + name + '\n' + '所属专辑:' + album +'\n' + '播放时长:' + str(time) + '\n' + '播放链接:'+ url)
        
wb.save('Jay.xlsx')            
#最后保存并命名这个Excel文件

7、爬取思想

在这里插入图片描述

8、cookies及session

8.1 cookies及其用法

当你登录一个网站，你都会在登录页面看到一个可勾选的选项“记住我”，如果你勾选了，以后你再打开这个网站就会自动登录，这就是cookie在起作用。
勾选“记住我”，服务器就会生成一个cookies和xxx这个账号绑定。接着，它把这个cookies告诉你的浏览器，让浏览器把cookies存储到你的本地电脑。当下一次，浏览器带着cookies访问博客，服务器会知道你是xxx。

8.1.1 例子

发表博客评论

post带着参数地请求登录
获得登录的cookies
带cookies去请求发表评论

import requests
#引入requests。
url = ' https://wordpress-edu-3autumn.localprod.forc.work/wp-login.php'
#把请求登录的网址赋值给url。
headers = {
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
}
#加请求头，前面有说过加请求头是为了模拟浏览器正常的访问，避免被反爬虫。
data = {
'log': 'spiderman',  #写入账户
'pwd': 'crawler334566',  #写入密码
'wp-submit': '登录',
'redirect_to': 'https://wordpress-edu-3autumn.localprod.forc.work/wp-admin/',
'testcookie': '1'
}
#把有关登录的参数封装成字典，赋值给data。
login_in = requests.post(url,headers=headers,data=data)
#用requests.post发起请求，放入参数：请求登录的网址、请求头和登录参数，然后赋值给login_in。
cookies = login_in.cookies
#提取cookies的方法：调用requests对象（login_in）的cookies属性获得登录的cookies，并赋值给变量cookies。

url_1 = 'https://wordpress-edu-3autumn.localprod.forc.work/wp-comments-post.php'
#我们想要评论的文章网址。
data_1 = {
'comment': input('请输入你想要发表的评论：'),
'submit': '发表评论',
'comment_post_ID': '7',
'comment_parent': '0'
}
#把有关评论的参数封装成字典。
comment = requests.post(url_1,headers=headers,data=data_1,cookies=cookies)
#用requests.post发起发表评论的请求，放入参数：文章网址、headers、评论参数、cookies参数，赋值给comment。
#调用cookies的方法就是在post请求中传入cookies=cookies的参数。
print(comment.status_code)
#打印出comment的状态码，若状态码等于200，则证明我们评论成功。

8.2 session及其用法

session是会话过程中，服务器用来记录特定用户会话的信息。
cookies中存储着session的编码信息，session中又存储了cookies的信息。
当浏览器第一次访问网页时，服务器会返回set cookies的字段给浏览器，而浏览器会把cookies保存到本地。
等浏览器第二次访问这个网页时，就会带着cookies去请求，而因为cookies里带有会话的编码信息，服务器立马就能辨认出这个用户，同时返回和这个用户相关的特定编码的session。

8.2.1 例子

发表博客评论

import requests
#引用requests。
session = requests.session()
#用requests.session()创建session对象，相当于创建了一个特定的会话，帮我们自动保持了cookies。
url = 'https://wordpress-edu-3autumn.localprod.forc.work/wp-login.php'
headers = {
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}
data = {
    'log':input('请输入账号：'), #用input函数填写账号和密码，这样代码更优雅，而不是直接把账号密码填上去。
    'pwd':input('请输入密码：'),
    'wp-submit':'登录',
    'redirect_to':'https://wordpress-edu-3autumn.localprod.forc.work/wp-admin/',
    'testcookie':'1'
}
session.post(url,headers=headers,data=data)
#在创建的session下用post发起登录请求，放入参数：请求登录的网址、请求头和登录参数。

url_1 = 'https://wordpress-edu-3autumn.localprod.forc.work/wp-comments-post.php'
#把我们想要评论的文章网址赋值给url_1。
data_1 = {
'comment': input('请输入你想要发表的评论：'),
'submit': '发表评论',
'comment_post_ID': '7',
'comment_parent': '0'
}
#把有关评论的参数封装成字典。
comment = session.post(url_1,headers=headers,data=data_1)
#在创建的session下用post发起评论请求，放入参数：文章网址，请求头和评论参数，并赋值给comment。
print(comment)
#打印comment

8.3 存储cookies

cookies转成字典，然后再通过json模块转成字符串。

import requests,json
#引入requests和json模块。
session = requests.session()   
url = ' https://wordpress-edu-3autumn.localprod.forc.work/wp-login.php'
headers = {
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}
data = {
'log': input('请输入你的账号:'),
'pwd': input('请输入你的密码:'),
'wp-submit': '登录',
'redirect_to': 'https://wordpress-edu-3autumn.localprod.forc.work/wp-admin/',
'testcookie': '1'
}
session.post(url, headers=headers, data=data)

cookies_dict = requests.utils.dict_from_cookiejar(session.cookies)
#把cookies转化成字典。
print(cookies_dict)
#打印cookies_dict
cookies_str = json.dumps(cookies_dict)
#调用json模块的dumps函数，把cookies从字典再转成字符串。
print(cookies_str)
#打印cookies_str
f = open('cookies.txt', 'w')
#创建名为cookies.txt的文件，以写入模式写入内容。
f.write(cookies_str)
#把已经转成字符串的cookies写入文件。
f.close()
#关闭文件。

8.4 读取cookies

存储cookies时，是把它先转成字典，再转成字符串。
读取cookies时，要先把字符串转成字典，再把字典转成（json模块）cookies本来的格式。

读取代码：

cookies_txt = open('cookies.txt', 'r')
#以reader读取模式，打开名为cookies.txt的文件。
cookies_dict = json.loads(cookies_txt.read())
#调用json模块的loads函数，把字符串转成字典。
cookies = requests.utils.cookiejar_from_dict(cookies_dict)
#把转成字典的cookies再转成cookies本来的格式。
cookies = session.cookies
#获取cookies：就是调用requests对象（session）的cookies属性。

8.4.1 例子

发表博客评论

程序能读取到cookies，就自动登录，发表评论；如果读取不到，就重新输入账号密码登录，再评论。

import requests,json
session = requests.session()
#创建会话。
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
}
#添加请求头，避免被反爬虫。
try:
#如果能读取到cookies文件，执行以下代码，跳过except的代码，不用登录就能发表评论。
    cookies_txt = open('cookies.txt', 'r')
    #以reader读取模式，打开名为cookies.txt的文件。
    cookies_dict = json.loads(cookies_txt.read())
    #调用json模块的loads函数，把字符串转成字典。
    cookies = requests.utils.cookiejar_from_dict(cookies_dict)
    #把转成字典的cookies再转成cookies本来的格式。
    cookies = session.cookies
    #获取cookies：就是调用requests对象（session）的cookies属性。

except FileNotFoundError:
#如果读取不到cookies文件，程序报“FileNotFoundError”（找不到文件）的错，则执行以下代码，重新登录获取cookies，再评论。

    url = ' https://wordpress-edu-3autumn.localprod.forc.work/wp-login.php'
    #登录的网址。
    data = {'log': input('请输入你的账号:'),
            'pwd': input('请输入你的密码:'),
            'wp-submit': '登录',
            'redirect_to': 'https://wordpress-edu-3autumn.localprod.forc.work/wp-admin/',
            'testcookie': '1'}
    #登录的参数。
    session.post(url, headers=headers, data=data)
    #在会话下，用post发起登录请求。

    cookies_dict = requests.utils.dict_from_cookiejar(session.cookies)
    #把cookies转化成字典。
    cookies_str = json.dumps(cookies_dict)
    #调用json模块的dump函数，把cookies从字典再转成字符串。
    f = open('cookies.txt', 'w')
    #创建名为cookies.txt的文件，以写入模式写入内容
    f.write(cookies_str)
    #把已经转成字符串的cookies写入文件
    f.close()
    #关闭文件

url_1 = 'https://wordpress-edu-3autumn.localprod.forc.work/wp-comments-post.php'
#文章的网址。
data_1 = {
'comment': input('请输入你想评论的内容：'),
'submit': '发表评论',
'comment_post_ID': '7',
'comment_parent': '0'
}
#评论的参数。
comment = session.post(url_1,headers=headers,data=data_1)
#在创建的session下用post发起评论请求，放入参数：文章网址，请求头和评论参数，并赋值给comment。
print(comment.status_code)
#打印comment的状态码

解决cookies会过期的问题。代码里加一个条件判断，如果cookies过期，就重新获取新的cookies。

import requests, json
session = requests.session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'}

def cookies_read():
    cookies_txt = open('cookies.txt', 'r')
    cookies_dict = json.loads(cookies_txt.read())
    cookies = requests.utils.cookiejar_from_dict(cookies_dict)
    return (cookies)
    # 以上4行代码，是cookies读取。

def sign_in():
    url = ' https://wordpress-edu-3autumn.localprod.forc.work/wp-login.php'
    data = {'log': input('请输入你的账号'),
            'pwd': input('请输入你的密码'),
            'wp-submit': '登录',
            'redirect_to': 'https://wordpress-edu-3autumn.localprod.forc.work/wp-admin/',
            'testcookie': '1'}
    session.post(url, headers=headers, data=data)
    cookies_dict = requests.utils.dict_from_cookiejar(session.cookies)
    cookies_str = json.dumps(cookies_dict)
    f = open('cookies.txt', 'w')
    f.write(cookies_str)
    f.close()
    # 以上5行代码，是cookies存储。


def write_message():
    url_2 = 'https://wordpress-edu-3autumn.localprod.forc.work/wp-comments-post.php'
    data_2 = {
        'comment': input('请输入你要发表的评论：'),
        'submit': '发表评论',
        'comment_post_ID': '7',
        'comment_parent': '0'
    }
    return (session.post(url_2, headers=headers, data=data_2))
    #以上9行代码，是发表评论。

try:
    session.cookies = cookies_read()
except FileNotFoundError:
    sign_in()
    session.cookies = cookies_read()

num = write_message()
if num.status_code == 200:
    print('成功啦！')
else:
    sign_in()
    session.cookies = cookies_read()
    num = write_message()

9、selenium

官方中文文档

设置浏览器引擎

# 本地Chrome浏览器设置方法
from selenium import webdriver #从selenium库中调用webdriver模块
driver = webdriver.Chrome() # 设置引擎为Chrome，真实地打开一个Chrome浏览器

~~教学系统的浏览器设置方法~~

# 教学系统的浏览器设置方法，你不需要记住它
from selenium.webdriver.chrome.webdriver import RemoteWebDriver # 从selenium库中调用RemoteWebDriver模块
from selenium.webdriver.chrome.options import Options # 从options模块中调用Options类

chrome_options = Options() # 实例化Option对象
chrome_options.add_argument('--headless') # 对浏览器的设置
driver = RemoteWebDriver("http://chromedriver.python-class-fos.svc:4444/wd/hub", chrome_options.to_capabilities()) # 设置浏览器引擎

获取数据

# 本地Chrome浏览器设置方法
from selenium import webdriver #从selenium库中调用webdriver模块
driver = webdriver.Chrome() # 设置引擎为Chrome，真实地打开一个Chrome浏览器

driver.get('https://localprod.pandateacher.com/python-manuscript/hello-spiderman/')
# 打开网页
driver.close() # 关闭浏览器

解析与提取数据
- selenium所解析提取的，是Elements中的所有数据，而BeautifulSoup所解析的则只是Network中第0个请求的响应。
- selenium把网页打开，所有信息就都加载到了Elements。

例子：解析与提取数据网页中，元素的内容。

# 教学系统的浏览器设置方法，你不需要记住它
from selenium.webdriver.chrome.webdriver import RemoteWebDriver # 从selenium库中调用RemoteWebDriver模块
from selenium.webdriver.chrome.options import Options # 从options模块中调用Options类
import time
chrome_options = Options() # 实例化Option对象
chrome_options.add_argument('--headless') # 对浏览器的设置
driver = RemoteWebDriver("http://chromedriver.python-class-fos.svc:4444/wd/hub", chrome_options.to_capabilities()) # 声明浏览器对象


driver.get('https://localprod.pandateacher.com/python-manuscript/hello-spiderman/') # 访问页面
time.sleep(2) # 等待2秒
label = driver.find_element_by_tag_name('label') # 解析网页并提取第一个<lable>标签
print(label.text) # 打印label的文本
driver.close() # 关闭浏览器

selenium提取数据方法
- 提取出的数据属于WebElement类对象
- BeautifulSoup属于Tag对象

方法	作用
find_element_by_tag_name：	通过元素的名称选择
find_element_by_class_name：	通过元素的class属性选择
find_element_by_id：	通过元素的id选择
find_element_by_name：	通过元素的name属性选择
find_element_by_link_text：	通过链接文本获取超链接
find_element_by_link_text：	通过链接文本获取超链接
find_element_by_partial_link_text：	通过链接的部分文本获取超链接

find_element_by_tag_name：通过元素的名称选择
# 如<h1>你好，蜘蛛侠！</h1> 
# 可以使用find_element_by_tag_name('h1')

find_element_by_class_name：通过元素的class属性选择
# 如<h1 class="title">你好，蜘蛛侠！</h1>
# 可以使用find_element_by_class_name('title')

find_element_by_id：通过元素的id选择
# 如<h1 id="title">你好，蜘蛛侠！</h1> 
# 可以使用find_element_by_id('title')

find_element_by_name：通过元素的name属性选择
# 如<h1 name="hello">你好，蜘蛛侠！</h1> 
# 可以使用find_element_by_name('hello')

#以下两个方法可以提取出超链接

find_element_by_link_text：通过链接文本获取超链接
# 如<a href="spidermen.html">你好，蜘蛛侠！</a>
# 可以使用find_element_by_link_text('你好，蜘蛛侠！')

find_element_by_partial_link_text：通过链接的部分文本获取超链接
# 如<a href="https://localprod.pandateacher.com/python-manuscript/hello-spiderman/">你好，蜘蛛侠！</a>
# 可以使用find_element_by_partial_link_text('你好')

例子：获取【你好，蜘蛛侠！】的网页源代码

# 教学系统的浏览器设置方法，你不需要记住它
from selenium.webdriver.chrome.webdriver import RemoteWebDriver # 从selenium库中调用RemoteWebDriver模块
from selenium.webdriver.chrome.options import Options # 从options模块中调用Options类
import time

chrome_options = Options() # 实例化Option对象
chrome_options.add_argument('--headless') # 对浏览器的设置
driver = RemoteWebDriver("http://chromedriver.python-class-fos.svc:4444/wd/hub", chrome_options.to_capabilities()) # 声明浏览器对象

driver.get('https://localprod.pandateacher.com/python-manuscript/hello-spiderman/') # 访问页面
time.sleep(2) # 等待两秒，等浏览器加缓冲载数据

pageSource = driver.page_source # 获取完整渲染的网页源代码
print(type(pageSource)) # 打印pageSource的类型
print(pageSource) # 打印pageSource
driver.close() # 关闭浏览器

自动操作浏览器

.send_keys() # 模拟按键输入，自动填写表单
.click() # 点击元素

# 本地Chrome浏览器设置方法
from selenium import webdriver # 从selenium库中调用webdriver模块
import time # 调用time模块
driver = webdriver.Chrome() # 设置引擎为Chrome，真实地打开一个Chrome浏览器

driver.get('https://localprod.pandateacher.com/python-manuscript/hello-spiderman/') # 访问页面
time.sleep(2) # 暂停两秒，等待浏览器缓冲

teacher = driver.find_element_by_id('teacher') # 找到【请输入你喜欢的老师】下面的输入框位置
teacher.send_keys('必须是吴枫呀') # 输入文字
assistant = driver.find_element_by_name('assistant') # 找到【请输入你喜欢的助教】下面的输入框位置
assistant.send_keys('都喜欢') # 输入文字
button = driver.find_element_by_class_name('sub') # 找到【提交】按钮
button.click() # 点击【提交】按钮
driver.close() # 关闭浏览器

9.1 例子

一、用selenium爬取QQ音乐的歌曲评论，甜甜的。

通过selenium打开浏览器的操作，数据就被加载到elements中了。

# 教学系统的浏览器设置方法，你不需要记住它
from selenium.webdriver.chrome.webdriver import RemoteWebDriver # 从selenium库中调用RemoteWebDriver模块
from selenium.webdriver.chrome.options import Options # 从options模块中调用Options类
import time

chrome_options = Options() # 实例化Option对象
chrome_options.add_argument('--headless') # 对浏览器的设置
driver = RemoteWebDriver("http://chromedriver.python-class-fos.svc:4444/wd/hub", chrome_options.to_capabilities()) # 声明浏览器对象

driver.get('https://y.qq.com/n/yqq/song/000xdZuV2LcQ19.html') # 访问页面
time.sleep(2)

comments = driver.find_element_by_class_name('js_hot_list').find_elements_by_class_name('js_cmt_li') # 使用class_name找到评论
print(len(comments)) # 打印获取到的评论个数
for comment in comments: # 循环
    sweet = comment.find_element_by_tag_name('p') # 找到评论
    print ('评论：%s\n ---\n'%sweet.text) # 打印评论
driver.close() # 关闭浏览器

获取没有展开的评论

# 教学系统的浏览器设置方法，你不需要记住它
from selenium.webdriver.chrome.webdriver import RemoteWebDriver # 从selenium库中调用RemoteWebDriver模块
from selenium.webdriver.chrome.options import Options # 从options模块中调用Options类
from bs4 import BeautifulSoup
import time
chrome_options = Options() # 实例化Option对象
chrome_options.add_argument('--headless') # 对浏览器的设置
driver = RemoteWebDriver("http://chromedriver.python-class-fos.svc:4444/wd/hub", chrome_options.to_capabilities()) # 声明浏览器对象

driver.get('https://y.qq.com/n/yqq/song/000xdZuV2LcQ19.html') # 访问页面
time.sleep(2)

button = driver.find_element_by_class_name('js_get_more_hot') # 根据类名找到【点击加载更多】
button.click() # 点击
time.sleep(2) # 等待两秒
pageSource = driver.page_source # 获取Elements中渲染完成的网页源代码

二、定时与邮件

自动爬取每日的天气，并定时把天气数据和穿衣提示发送到你的邮箱

10、协程

异步：在一个任务未完成时，就可以执行其他多个任务，彼此不受影响。
同步：一个任务结束才能启动下一个。
多协程：一个任务在执行过程中，如果遇到等待，就先去执行其他的任务，当等待结束，再回来继续之前的那个任务。

10.1 gevent库

例子

爬取8个网站

import requests,time
#导入requests和time
start = time.time()
#记录程序开始时间
url_list = ['https://www.baidu.com/',
'https://www.sina.com.cn/',
'http://www.sohu.com/',
'https://www.qq.com/',
'https://www.163.com/',
'http://www.iqiyi.com/',
'https://www.tmall.com/',
'http://www.ifeng.com/']
#把8个网站封装成列表
for url in url_list:
#遍历url_list
    r = requests.get(url)

导入了time模块，记录了程序开始和结束的时间
导入gevent库中monkey模块

from gevent import monkey
#从gevent库里导入monkey模块。
monkey.patch_all()
#monkey.patch_all()能把程序变成协作式运行，就是可以帮助程序实现异步。
import gevent,time,requests
#导入gevent、time、requests。

start = time.time()
#记录程序开始时间。

url_list = ['https://www.baidu.com/',
'https://www.sina.com.cn/',
'http://www.sohu.com/',
'https://www.qq.com/',
'https://www.163.com/',
'http://www.iqiyi.com/',
'https://www.tmall.com/',
'http://www.ifeng.com/']
#把8个网站封装成列表。

def crawler( ):
#定义一个crawler()函数。
    r = requests.get(url)
    #用requests.get()函数爬取网站。
    print(url,time.time()-start,r.status_code)
    #打印网址、请求运行时间、状态码。

tasks_list = [ ]
#创建空的任务列表。

for url in url_list:
#遍历url_list。
    task = gevent.spawn(crawler,url)
    #用gevent.spawn()函数创建任务。
    tasks_list.append(task)
    #往任务列表添加任务。
gevent.joinall(tasks_list)
#执行任务列表里的所有任务，就是让爬虫开始爬取网站。
end = time.time()
#记录程序结束时间。
print(end-start)
#打印程序最终所需时间。

10.2 queue模块

用多协程来爬虫，需要创建大量任务时，我们可以借助queue模块。

第1部分是导入模块。

from gevent import monkey
#从gevent库里导入monkey模块。
monkey.patch_all()
#monkey.patch_all()能把程序变成协作式运行，就是可以帮助程序实现异步。
import gevent,time,requests
#导入gevent、time、requests
from gevent.queue import Queue
#从gevent库里导入queue模块

第2部分，是如何创建队列，以及怎么把任务存储进队列里。

start = time.time()
#记录程序开始时间

url_list = ['https://www.baidu.com/',
'https://www.sina.com.cn/',
'http://www.sohu.com/',
'https://www.qq.com/',
'https://www.163.com/',
'http://www.iqiyi.com/',
'https://www.tmall.com/',
'http://www.ifeng.com/']

work = Queue()
#创建队列对象，并赋值给work。
for url in url_list:
#遍历url_list
    work.put_nowait(url)
    #用put_nowait()函数可以把网址都放进队列里。

第3部分，是定义爬取函数，和如何从队列里提取出刚刚存储进去的网址。

def crawler():
    while not work.empty():
    #当队列不是空的时候，就执行下面的程序。
        url = work.get_nowait()
        #用get_nowait()函数可以把队列里的网址都取出。
        r = requests.get(url)
        #用requests.get()函数抓取网址。
        print(url,work.qsize(),r.status_code)
        #打印网址、队列长度、抓取请求的状态码。

第四部分，让爬虫用多协程执行任务，爬取队列里的8个网站的代码

def crawler():
    while not work.empty():
        url = work.get_nowait()
        r = requests.get(url)
        print(url,work.qsize(),r.status_code)

tasks_list  = [ ]
#创建空的任务列表
for x in range(2):
#相当于创建了2个爬虫
    task = gevent.spawn(crawler)
    #用gevent.spawn()函数创建执行crawler()函数的任务。
    tasks_list.append(task)
    #往任务列表添加任务。
gevent.joinall(tasks_list)
#用gevent.joinall方法，执行任务列表里的所有任务，就是让爬虫开始爬取网站。
end = time.time()
print(end-start)

10.3 例子

用多协程爬到薄荷网的食物热量数据。

用for循环构造出前3个常见食物类别的前3页食物记录的网址和第11个常见食物类别的前3页食物记录的网址，并把这些网址放进队列，打印出来。

#导入所需的库和模块。
import gevent,requests, bs4, csv
from gevent.queue import Queue
from gevent import monkey
monkey.patch_all()

work = Queue()
#创建队列对象，并赋值给work。

#前3个常见食物分类的前3页的食物记录的网址：
url_1 = 'http://www.boohee.com/food/group/{type}?page={page}'
for x in range(1, 4):
    for y in range(1, 4):
        real_url = url_1.format(type=x, page=y)
        work.put_nowait(real_url)
#通过两个for循环，能设置分类的数字和页数的数字。
#然后，把构造好的网址用put_nowait方法添加进队列里。
      
#第11个常见食物分类的前3页的食物记录的网址：
url_2 = 'http://www.boohee.com/food/view_menu?page={page}'
for x in range(1,4):
    real_url = url_2.format(page=x)
    work.put_nowait(real_url)
#通过for循环，能设置第11个常见食物分类的食物的页数。
#然后，把构造好的网址用put_nowait方法添加进队列里。

print(work)
#打印队列

#导入所需的库和模块。
import gevent,requests, bs4, csv
from gevent.queue import Queue
from gevent import monkey
monkey.patch_all()

work = Queue()
#创建队列对象，并赋值给work。

#前3个常见食物分类的前3页的食物记录的网址：
url_1 = 'http://www.boohee.com/food/group/{type}?page={page}'
for x in range(1, 4):
    for y in range(1, 4):
        real_url = url_1.format(type=x, page=y)
        work.put_nowait(real_url)
#通过两个for循环，能设置分类的数字和页数的数字。
#然后，把构造好的网址用put_nowait添加进队列里。
    
#第11个常见食物分类的前3页的食物记录的网址：
url_2 = 'http://www.boohee.com/food/view_menu?page={page}'
for x in range(1,4):
    real_url = url_2.format(page=x)
    work.put_nowait(real_url)
#通过for循环，能设置第11个常见食物分类的食物的页数。
#然后，把构造好的网址用put_nowait添加进队

def crawler():
#定义crawler函数
    headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
    }
    #添加请求头
    while not work.empty():
    #当队列不是空的时候，就执行下面的程序。
        url = work.get_nowait()
        #用get_nowait()方法从队列里把刚刚放入的网址提取出来。
        res = requests.get(url, headers=headers)
        #用requests.get获取网页源代码。
        bs_res = bs4.BeautifulSoup(res.text, 'html.parser')
        #用BeautifulSoup解析网页源代码。
        foods = bs_res.find_all('li', class_='item clearfix')
        #用find_all提取出<li class="item clearfix">标签的内容。
        for food in foods:
        #遍历foods
            food_name = food.find_all('a')[1]['title']
            #用find_all在<li class="item clearfix">标签下，提取出第2个<a>元素title属性的值，也就是食物名称。
            food_url = 'http://www.boohee.com' + food.find_all('a')[1]['href']
            #用find_all在<li class="item clearfix">标签下，提取出第2个<a>元素href属性的值，跟'http://www.boohee.com'组合在一起，就是食物详情页的链接。
            food_calorie = food.find('p').text
            #用find在<li class="item clearfix">标签下，提取<p>元素，再用text方法留下纯文本，就提取出了食物的热量。              
            print(food_name)
            #打印食物的名称。

tasks_list = []
#创建空的任务列表
for x in range(5):
#相当于创建了5个爬虫
    task = gevent.spawn(crawler)
    #用gevent.spawn()函数创建执行crawler()函数的任务。
    tasks_list.append(task)
    #往任务列表添加任务。
gevent.joinall(tasks_list)
#用gevent.joinall方法，启动协程，执行任务列表里的所有任务，让爬虫开始爬取网站。

添加储存

import gevent,requests, bs4, csv
from gevent.queue import Queue
from gevent import monkey
monkey.patch_all()

work = Queue()
url_1 = 'http://www.boohee.com/food/group/{type}?page={page}'
for x in range(1, 4):
    for y in range(1, 4):
        real_url = url_1.format(type=x, page=y)
        work.put_nowait(real_url)

url_2 = 'http://www.boohee.com/food/view_menu?page={page}'
for x in range(1,4):
    real_url = url_2.format(page=x)
    work.put_nowait(real_url)

def crawler():
    headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
    }
    while not work.empty():
        url = work.get_nowait()
        res = requests.get(url, headers=headers)
        bs_res = bs4.BeautifulSoup(res.text, 'html.parser')
        foods = bs_res.find_all('li', class_='item clearfix')
        for food in foods:
            food_name = food.find_all('a')[1]['title']
            food_url = 'http://www.boohee.com' + food.find_all('a')[1]['href']
            food_calorie = food.find('p').text
            writer.writerow([food_name, food_calorie, food_url])
            #借助writerow()函数，把提取到的数据：食物名称、食物热量、食物详情链接，写入csv文件。
            print(food_name)

csv_file= open('boohee.csv', 'w', newline='')
#调用open()函数打开csv文件，传入参数：文件名“boohee.csv”、写入模式“w”、newline=''。
writer = csv.writer(csv_file)
# 用csv.writer()函数创建一个writer对象。
writer.writerow(['食物', '热量', '链接'])
#借助writerow()函数往csv文件里写入文字：食物、热量、链接

tasks_list = []
for x in range(5):
    task = gevent.spawn(crawler)
    tasks_list.append(task)
gevent.joinall(tasks_list)

11、Scrapy

Scrapy Engine(引擎）
Scheduler(调度器)部门主要负责处理引擎发送过来的requests对象（即网页请求的相关信息集合，包括params，data，cookies，request headers…等），会把请求的url以有序的方式排列成队，并等待引擎来提取（功能上类似于gevent库的queue模块）。
Downloader（下载器）部门则是负责处理引擎发送过来的requests，进行网页爬取，并将返回的response（爬取到的内容）交给引擎。它对应的是爬虫流程【获取数据】这一步。
Spiders(爬虫)部门是公司的核心业务部门，主要任务是创建requests对象和接受引擎发送过来的response（Downloader部门爬取到的内容），从中解析并提取出有用的数据。它对应的是爬虫流程【解析数据】和【提取数据】这两步。
Item Pipeline（数据管道）部门则是公司的数据部门，只负责存储和处理Spiders部门提取到的有用数据。这个对应的是爬虫流程【存储数据】这一步。

在这里插入图片描述

未完

至此@

关注

2
点赞
踩
6

收藏

觉得还不错? 一键收藏
0
评论
网络爬虫 | 基础

文章目录1、爬虫工作原理2、 requests2.1 requests.get()2.2 Response对象的常用属性3、爬虫理论4、BeautifulSoup4.1 解析数据4.2 提取数据5、例子：5.1 例一5.2 例二5.2.1 Network5.2.2 XHR5.2.3 JSON5.3 例三5.3.1 params5.3.2 Request Headers6、存储数据6.1 存储方...
复制链接

扫一扫