DAY9 (extended on DAY10): Python web scraping

EdmunDJK

Last modified 2022-09-22 17:02:21

Tags: python, web scraping, programming language

First published 2022-07-20 22:54:42

Article link: https://blog.csdn.net/qq_49433473/article/details/125902454

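Python web scraping (batch crawling techniques)

1. What a crawler is

A crawler automatically fetches valuable information from the internet.

2. Crawler architecture

Scheduler, URL manager, downloader, parser, application

Scheduler      # Comparable to a computer's CPU; coordinates the work of the URL manager, downloader, and parser.
URL manager    # Tracks the URLs waiting to be crawled and the URLs already crawled, so that no URL is fetched twice or in a loop. It is usually implemented in one of three ways: in memory, in a database, or in a cache database.
Downloader     # Takes a URL, downloads the page, and returns it as a string. Options include urllib (Python's standard library module, called urllib2 in Python 2), which can handle logins, proxies, and cookies, and requests (a third-party package).
Parser         # (html.parser, beautifulsoup, lxml) Parses the page string and extracts the useful information as required; parsing can follow the DOM tree.
Application    # An application built from the useful data extracted from the pages.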
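
The interplay of scheduler, URL manager, downloader, and parser can be sketched in a few lines. This is a minimal illustration only: `UrlManager`, `crawl`, and the in-memory `FAKE_SITE` dictionary are hypothetical names, and the downloader is faked so that nothing touches the network.

```python
from collections import deque


class UrlManager:
    """In-memory URL manager: tracks pending URLs and URLs already seen."""

    def __init__(self):
        self.pending = deque()
        self.seen = set()

    def add(self, url):
        # Skip URLs we have already queued or crawled, which prevents loops.
        if url not in self.seen:
            self.seen.add(url)
            self.pending.append(url)

    def has_next(self):
        return bool(self.pending)

    def next(self):
        return self.pending.popleft()


def crawl(start_url, download, parse):
    """Scheduler loop: pull a URL, download it, parse it, queue new links."""
    manager = UrlManager()
    manager.add(start_url)
    results = []
    while manager.has_next():
        url = manager.next()
        page = download(url)
        data, links = parse(page)
        results.append((url, data))
        for link in links:
            manager.add(link)
    return results


# Hypothetical three-page site whose pages link to each other in a cycle.
FAKE_SITE = {
    "a": ("page A", ["b", "c"]),
    "b": ("page B", ["a", "c"]),
    "c": ("page C", ["a"]),
}

results = crawl("a", download=lambda url: FAKE_SITE[url], parse=lambda page: page)
print(results)
```

Even though the pages link back to each other, each one is visited exactly once because the URL manager remembers what it has seen.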

3. The requests library

pip install requests    # run this in a shell, not in Python
import requests

'''List the contents of the library'''
print(dir(requests))

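3.1 Response information:

apparent_encoding        # Encoding guessed from the response content
encoding                 # Encoding used to decode r.text
headers                  # Response headers, as a dictionary-like object
history                  # List of response objects from the request history (redirects)
links                    # Parsed links from the response's Link header
reason                   # Text description of the status, e.g. "OK"
request                  # The request object that produced this response
url                      # URL of the response
status_code              # HTTP status code, e.g. 200 (OK) or 404 (Not Found)
close()                  # Closes the connection to the server
content                  # Response body, in bytes
cookies                  # A CookieJar object holding the cookies sent back by the server
elapsed                  # A timedelta object with the time between sending the request and the response arriving; useful for measuring response speed. Use r.elapsed.total_seconds() for the full duration; r.elapsed.microseconds is only the microsecond component of that timedelta.
is_permanent_redirect    # True if the response is a permanent redirect, otherwise False
is_redirect              # True if the response is a redirect, otherwise False
iter_content()           # Iterates over the response body in chunks
iter_lines()             # Iterates over the response body line by line
json()                   # Returns the body parsed as JSON (raises an error if the body is not valid JSON), e.g. for http://www.baidu.com/ajax/demo.json
next                     # PreparedRequest for the next request in a redirect chain
ok                       # True if status_code is less than 400, otherwise False
raise_for_status()       # Raises an HTTPError if the status code indicates an error (4xx or 5xx)
text                     # Response body as a str (Unicode)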
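
To make `ok`, `raise_for_status()`, and `elapsed` more concrete, here is a small sketch that runs offline. It assumes that building a `requests.Response` by hand is acceptable for illustration (normally you get one back from `requests.get()`); the URL is hypothetical.

```python
from datetime import timedelta

import requests

# Build a Response by hand so no network access is needed (illustration only).
resp = requests.Response()
resp.status_code = 404
resp.reason = "Not Found"
resp.url = "http://example.com/missing"  # hypothetical URL

print(resp.ok)  # status_code is 400 or above, so ok is False

try:
    resp.raise_for_status()  # raises instead of returning an error object
    raised = False
except requests.HTTPError as exc:
    raised = True
    print("HTTPError:", exc)

# elapsed is a timedelta: .microseconds is only one component of it,
# while .total_seconds() gives the whole duration.
elapsed = timedelta(seconds=2, microseconds=500)
print(elapsed.microseconds)
print(elapsed.total_seconds())
```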

Example:

import requests

# Send the request
a = requests.get('http://www.baidu.com')
print(a.text)

# Print the HTTP status code
print(a.status_code)

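3.2 requests request methods

delete(url, args)                # Sends a DELETE request to the given url
get(url, params, args)           # Sends a GET request to the given url
head(url, args)                  # Sends a HEAD request to the given url
patch(url, data, args)           # Sends a PATCH request to the given url

post(url, data, json, args)      # Sends a POST request to the given url
'''post() sends a POST request to the given url; the general form is:'''
requests.post(url, data={key: value}, json={key: value}, args)

put(url, data, args)             # Sends a PUT request to the given url
request(method, url, args)       # Sends a request with the given method to the given url

url is the URL to request.
data is a dictionary, list of tuples, bytes, or file object to send to the given url.
json is a JSON-serializable object to send to the given url. In practice pass either data or json, not both.
args stands for other parameters, such as cookies, headers, and verify.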

Example:

import requests
url = 'http://www.baidu.com/'
a = requests.request('get', url)
a = requests.post('http://www.baidu.com/xxxxx/1.txt')
requests.get(url)
requests.put(url)
requests.delete(url)
requests.head(url)
requests.options(url)

import requests
info = {'frame':'信息'}
# Set the request headers
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
# params accepts a dict or string of query parameters; a dict is URL-encoded automatically, so urlencode() is not needed
response = requests.get("https://www.baidu.com/", params = info, headers = headers)
# Check the response status code
print(response.status_code)
# Check the character encoding taken from the response headers
print(response.encoding)
# Check the full URL
print(response.url)
# Check the response body; response.text returns a str (Unicode)
print(response.text)
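
The automatic URL encoding of `params` can be checked without sending anything. The sketch below assumes `requests.Request(...).prepare()` is a fair stand-in for what `requests.get()` builds internally; it only prepares the request and prints the resulting URL.

```python
from urllib.parse import quote

import requests

# Prepare the request without sending it, so no network access is needed.
req = requests.Request("GET", "https://www.baidu.com/", params={"frame": "信息"})
prepared = req.prepare()

# The non-ASCII value is percent-encoded as UTF-8 in the query string.
print(prepared.url)
print(quote("信息"))
```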

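3.3 Commonly used classes in requests

requests.Request     # The request object; used to prepare a request to send to the server
requests.Response    # The response object; holds the server's response to an HTTP request
requests.Session     # A request session; provides cookie persistence, connection pooling (creating and managing a reusable pool of connections), and shared configuration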
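
A short sketch of how these classes fit together: a `Session` carries shared configuration, and `Session.prepare_request()` merges it into each `Request`. The User-Agent string is a made-up value, and nothing is actually sent.

```python
import requests

session = requests.Session()
# Headers set on the session apply to every request made through it.
session.headers.update({"User-Agent": "my-crawler/0.1"})  # hypothetical UA

first = session.prepare_request(requests.Request("GET", "https://www.baidu.com/"))
second = session.prepare_request(
    requests.Request("GET", "https://www.baidu.com/s", params={"wd": "python"})
)

print(first.headers["User-Agent"])
print(second.headers["User-Agent"])
print(second.url)
session.close()
```

In real use you would call `session.get(...)` directly; preparing the requests by hand here just makes the shared headers visible.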

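3.4 Writing files

Writing to an existing file

To write to an existing file, pass one of these modes to open():

"a" - append - appends to the end of the file
"w" - write - overwrites any existing content

Creating a new file

To create a new file in Python, call open() with one of these modes:

"x" - create - creates the file and raises an error if it already exists
"a" - append - creates the file if it does not exist
"w" - write - creates the file if it does not exist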

Example: (writing a scraped page to a file)

import requests
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
response = requests.get("https://www.baidu.com/",headers = headers)
# The with block closes the file even if writing fails
with open("C:/Users/九泽/Desktop/demo.txt", "w", encoding='utf-8') as f:
    f.write(response.content.decode('utf-8'))
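
The difference between the "x", "a", and "w" modes listed above is easiest to see side by side. This sketch works in a temporary directory (an assumption made so it leaves no files behind); the file name is arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")

    # "x" creates the file and fails if it already exists.
    with open(path, "x", encoding="utf-8") as f:
        f.write("first\n")
    try:
        open(path, "x")
        x_failed = False
    except FileExistsError:
        x_failed = True

    # "a" keeps the existing content and adds to the end.
    with open(path, "a", encoding="utf-8") as f:
        f.write("second\n")
    with open(path, encoding="utf-8") as f:
        after_append = f.read()

    # "w" throws away the existing content.
    with open(path, "w", encoding="utf-8") as f:
        f.write("third\n")
    with open(path, encoding="utf-8") as f:
        after_write = f.read()

print(x_failed, repr(after_append), repr(after_write))
```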

4. Scraping images

4.1 The os library

The os module provides a rich set of functions for working with files and directories.

Runoob tutorial site: https://www.runoob.com/
Reference provided by Runoob:

https://www.runoob.com/python/os-file-methods.html
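
A few of the os functions that matter most for saving downloaded files are sketched below. The image URL and the fake bytes are placeholders, and a temporary directory stands in for a real download folder.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    img_dir = os.path.join(base, "imgs")
    # exist_ok=True means a second call does not raise if the folder exists.
    os.makedirs(img_dir, exist_ok=True)
    os.makedirs(img_dir, exist_ok=True)

    url = "http://example.com/pics/cat.jpg"  # hypothetical image URL
    filename = os.path.basename(url)         # last path segment
    path = os.path.join(img_dir, filename)

    existed_before = os.path.exists(path)
    with open(path, "wb") as f:
        f.write(b"fake image bytes")         # stands in for response.content
    existed_after = os.path.exists(path)
    listing = os.listdir(img_dir)

print(filename, existed_before, existed_after, listing)
```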

4.2 Downloading a single image

Example:

import requests
import os
url="xxxxxxxxxxxxxxxxxxxxxxxxx"
save_dir="C:/Users/九泽/Desktop/imgs/"    # renamed from compile, which shadows a built-in
path=save_dir + url.split('/')[-1]
try:
    if not os.path.exists(save_dir):
        os.mkdir(save_dir)
    if not os.path.exists(path):
        headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
        }
        rr=requests.get(url=url,headers=headers)
        # The with block closes the file automatically
        with open(path,'wb') as f:
            f.write(rr.content)
        print('File saved!')
    else:
        print('File already exists!')
except (requests.RequestException, OSError) as e:
    print("Download failed:", e)

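4.3 The BeautifulSoup library

Used to extract data from XML and HTML.

4.3.1 title

Get the content of the title tag in the page source

title.name             # Tag name
title.string           # Text inside the tag
title.parent.string    # title.parent is the parent tag; .string returns its text when it has a single child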
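
These attributes can be tried on a small inline document. The HTML string below is made up for the demonstration, so nothing is downloaded.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for this demonstration.
html = "<html><head><title>Demo page</title></head><body><p>hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

title = soup.title
print(title.name)           # name of the tag itself
print(title.string)         # text inside <title>
print(title.parent.name)    # the enclosing tag
print(title.parent.string)  # <head> has one child, so its text is the title text
```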

4.3.2 p

**Example:** get the p tags of an HTML page

import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.baidu.com")
n = r.content
m = BeautifulSoup(n,"html.parser")
# Open the file once; reopening it with 'w+' inside the loop would overwrite the earlier tags
with open('C:/Users/九泽/Desktop/imgs/2.txt','w',encoding='utf-8') as f:
    for i in m.find_all("p"):    # find_all() returns every tag with the given name
        f.write(str(i) + '\n')

**Example:** output p tags using a regular expression

import requests
import re
from bs4 import BeautifulSoup
r = requests.get("http://www.baidu.com")
n = r.content
m = BeautifulSoup(n,"html.parser")
with open('C:/Users/九泽/Desktop/imgs/1.txt','w',encoding='utf-8') as f:
    # re.compile("^p") matches every tag whose name starts with "p", not only <p>
    for tag in m.find_all(re.compile("^p")):
        f.write(str(tag) + '\n')
print('Scrape finished!')
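
The difference between passing a string and passing a compiled pattern to `find_all()` shows up on a small made-up snippet: the pattern `^p` matches any tag name beginning with "p".

```python
import re

from bs4 import BeautifulSoup

# Hypothetical snippet containing <p>, <pre>, and <div> tags.
html = "<body><p>one</p><pre>two</pre><div>three</div><p>four</p></body>"
soup = BeautifulSoup(html, "html.parser")

exact = [tag.name for tag in soup.find_all("p")]
prefixed = [tag.name for tag in soup.find_all(re.compile("^p"))]

print(exact)
print(prefixed)
```

If only `<p>` tags are wanted, the plain string form is the safer choice.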

4.3.3 Scraping a novel

#-*-coding:utf-8 -*-
import requests
import re
import time
headers ={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
 }
f=open('C:/Users/九泽/Desktop/imgs/dpcq.txt','a+',encoding='utf-8')

def get_info(url):
    res = requests.get(url,headers=headers)
    if res.status_code==200:
        # re.S lets "." match newlines too, so a paragraph can span several lines
        contents=re.findall('<p>(.*?)</p>',res.content.decode('utf-8'),re.S)
        for content in contents:
            f.write(content + '\n')

if __name__ =='__main__':
    urls=['http://www.doupoxs.com/doupocangqiong/{}.html'.format(str(i)) for i in range(1,200)]
    for url in urls:
        get_info(url)
        time.sleep(1)    # pause between requests to avoid hammering the server
    f.close()
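
The effect of the `re.S` flag is easy to miss, so here is a small offline comparison on a made-up HTML string whose first paragraph contains a line break.

```python
import re

# Hypothetical page fragment: the first paragraph spans two lines.
html = "<p>line one\nline two</p><p>second</p>"

# Without re.S, "." stops at the newline, so the first paragraph is missed.
without_flag = re.findall(r"<p>(.*?)</p>", html)

# With re.S, "." also matches newlines and both paragraphs are captured.
with_flag = re.findall(r"<p>(.*?)</p>", html, re.S)

print(without_flag)
print(with_flag)
```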

4.3.8 Parsing HTML with bs

import requests
from bs4 import BeautifulSoup
r=requests.get("http://www.baidu.com")
m=r.content
n= BeautifulSoup(m,"html.parser")
print(n)

4.3.9 Parsing an HTML document with Beautiful Soup

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import re
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# Create a BeautifulSoup parser object (html_doc is already a str, so from_encoding is not needed)
soup = BeautifulSoup(html_doc,"html.parser")
# Get all links
links = soup.find_all('a')
print("All links")
for link in links:
    print(link.name, link['href'], link.get_text())

print("A specific URL")
link_node = soup.find('a',href="http://example.com/elsie")
print(link_node.name, link_node['href'], link_node['class'], link_node.get_text())

print("Regular expression match")
link_node = soup.find('a',href=re.compile(r"ti"))
print(link_node.name, link_node['href'], link_node['class'], link_node.get_text())

print("Text of a p paragraph")
p_node = soup.find('p',class_='story')
print(p_node.name, p_node['class'], p_node.get_text())

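4. The urllib library

urllib is used to work with URLs and to fetch and process page content.

Syntax:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

url:                  # The URL to open.
data:                 # Additional data to send to the server; defaults to None.
timeout:              # Timeout for the request.
cafile and capath:    # cafile is a CA certificate file and capath is the path to a directory of CA certificates; used for HTTPS. Both are deprecated in favor of context.
cadefault:            # Deprecated.
context:              # An ssl.SSLContext that specifies the SSL settings.

urllib.robotparser    # Parses robots.txt files
urllib.request        # Opens and reads URLs
urllib.parse          # Parses URLs
urllib.error          # Contains the exceptions raised by urllib.request
readline()            # Reads one line from the response object returned by urlopen()
readlines()           # Reads the whole response and returns its lines as a list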
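
Two of these submodules, urllib.parse and urllib.robotparser, can be tried without any network access. In this sketch the robots.txt rules and the example.com URLs are made up, and the rules are fed in directly instead of being downloaded.

```python
from urllib.parse import urlencode, urljoin, urlparse
from urllib.robotparser import RobotFileParser

# urllib.parse: split a URL, encode a query string, resolve a relative link.
parts = urlparse("https://so.gushiwen.cn/shiwenv_4c5705b99143.aspx")
query = urlencode({"wd": "python 爬虫"})
absolute = urljoin("http://www.itcast.cn/channel/teacher.shtml", "../about.html")
print(parts.scheme, parts.netloc, parts.path)
print(query)
print(absolute)

# urllib.robotparser: check whether a path may be crawled.
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])  # hypothetical rules
allowed = rp.can_fetch("*", "http://example.com/index.html")
blocked = rp.can_fetch("*", "http://example.com/private/data.html")
print(allowed, blocked)
```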

Example:

from urllib import request
file = request.urlopen('http://www.baidu.com')
data = file.read()
f= open('C:/Users/九泽/Desktop/2.html','wb')
f.write(data)
f.close()

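5. Scrapy

Scrapy is an application framework written in Python for crawling websites and extracting structured data.

Scrapy is commonly used in programs for data mining, information processing, and archiving historical data.

With Scrapy it is usually straightforward to build a crawler that fetches the content or images of a given website.

Engine                   # Handles communication, signals, and data transfer between the Spider, Item Pipeline, Downloader, and Scheduler
Scheduler                # Accepts the Requests sent by the engine, orders and queues them, and hands them back when the engine asks for them
Downloader               # Downloads every Request sent by the engine and returns the Responses to the engine, which passes them to the Spider
Spider                   # Processes the Responses, extracts the data needed for the Item fields, and submits follow-up URLs to the engine, which sends them back to the scheduler
Item Pipeline            # Post-processes the Items produced by the Spider (analysis, filtering, storage, and so on)
Spider middleware        # A component for customizing and extending the communication between the engine and the Spider
Downloader middleware    # A component for customizing and extending the download functionality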

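Steps for building a Scrapy crawler:
Create a project (scrapy startproject xxx): create a new crawler project
Define the target (write items.py): decide what you want to scrape
Write the spider (spiders/xxspider.py): write the spider and start crawling pages
Store the content (pipelines.py): design a pipeline to store the scraped content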

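1. Create a new project

scrapy startproject mySpider        # Creates a new project named mySpider

2. View the project file structure

The startproject command generates the following layout:

mySpider/
    scrapy.cfg
    mySpider/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

scrapy.cfg:               # Project configuration file.
mySpider/:                # The project's Python module; code is imported from here.
mySpider/items.py:        # Item definitions for the project.
mySpider/pipelines.py:    # Pipeline definitions for the project.
mySpider/settings.py:     # Project settings.
mySpider/spiders/:        # Directory that holds the spider code.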

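3. Scraping website data with Scrapy

3.1 Define the target

Open items.py in the mySpider directory.
An Item defines structured data fields for storing scraped data. It behaves much like a Python dict but adds some protection against mistakes.
To define an Item, create a class that inherits from scrapy.Item and give it class attributes of type scrapy.Field (think of it as something like an ORM mapping).

Next, create an ItcastItem class to build the item model:

import scrapy
class ItcastItem(scrapy.Item):
   name = scrapy.Field()
   title = scrapy.Field()
   info = scrapy.Field()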

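3.2 Writing the spider (fetching the data)

From the project directory, run the following command. It creates a spider named itcast under mySpider/spiders and restricts the crawl to the given domain:

scrapy genspider itcast "itcast.cn"

Open itcast.py in the mySpider/spiders directory; it contains the following generated code:

import scrapy

class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["www.itcast.cn"]
    start_urls = (
        'http://www.itcast.cn/channel/teacher.shtml',
    )
    def parse(self, response):
        pass

Change start_urls to the first URL to crawl:

start_urls = ("http://www.itcast.cn/channel/teacher.shtml",)

Change the parse() method:

    def parse(self, response):
        filename = "teacher.html"
        # response.body is bytes, so open the file in binary mode
        with open(filename, 'wb') as f:
            f.write(response.body)

Then run it from the mySpider directory:

scrapy crawl itcast

Setting the encoding of the saved content (Python 2 only; these calls fail on Python 3, where str is Unicode by default and this step can be skipped):

import sys
reload(sys)
sys.setdefaultencoding("utf-8")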

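3.3 Writing the spider (extracting the data)

With the whole page fetched, the next step is extraction. First look at the page source:

<div class="li_txt">
    <h3>  xxx  </h3>
    <h4> xxxxx </h4>
    <p> xxxxxxxx </p>
</div>

With the xpath() method, an XPath rule is all that is needed to locate the matching HTML nodes:

Examples:
    /html/head/title: selects the <title> element inside the <head> of the HTML document
    /html/head/title/text(): selects the text of that <title> element
    //td: selects all <td> elements
    //div[@class="mine"]: selects all div elements that have the attribute class="mine"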
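
To experiment with rules like these outside Scrapy, the standard library's `xml.etree.ElementTree` supports a limited XPath subset. Note the swap: this is not Scrapy's selector API, it does not support `text()`, and it needs well-formed markup. The teacher names below are invented sample data.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment modelled on the li_txt block.
html = """
<html><body>
  <div class="li_txt"><h3> Teacher A </h3><h4> Lecturer </h4><p> Teaches Python </p></div>
  <div class="li_txt"><h3> Teacher B </h3><h4> Senior lecturer </h4><p> Teaches Java </p></div>
  <div class="other"><h3> ignored </h3></div>
</body></html>
"""

root = ET.fromstring(html.strip())
teachers = []
# ".//div[@class='li_txt']" plays the role of //div[@class="li_txt"].
for div in root.findall(".//div[@class='li_txt']"):
    teachers.append({
        "name": div.findtext("h3").strip(),
        "title": div.findtext("h4").strip(),
        "info": div.findtext("p").strip(),
    })

print(teachers)
```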

Change itcast.py to the following:

# -*- coding: utf-8 -*-
import scrapy

class Opp2Spider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['www.itcast.cn']
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml']

    def parse(self, response):
        # Select the page title node
        context = response.xpath('/html/head/title/text()')

        # Extract the title text
        title = context.extract_first()
        print(title)

Run the following command:

scrapy crawl itcast

mySpider/items.py defines an ItcastItem class. Import it here:

from mySpider.items import ItcastItem

Then wrap the extracted data in an ItcastItem object, which stores each attribute (parse below is the method of the spider class):

import scrapy
from mySpider.items import ItcastItem

def parse(self, response):
    #open("teacher.html","wb").write(response.body).close()

    # Collection that holds the teacher records
    items = []

    for each in response.xpath("//div[@class='li_txt']"):
        # Wrap the extracted data in an `ItcastItem` object
        item = ItcastItem()
        # extract() returns a list of Unicode strings
        name = each.xpath("h3/text()").extract()
        title = each.xpath("h4/text()").extract()
        info = each.xpath("p/text()").extract()

        # Each xpath result here is a list with one element
        item['name'] = name[0]
        item['title'] = title[0]
        item['info'] = info[0]

        items.append(item)

    # Return the final data directly
    return items

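3.4 Saving the data

The simplest way to save scraped data in Scrapy is the -o option, which writes the output in one of four formats. JSON format:

scrapy crawl itcast -o teachers.json

JSON Lines format, Unicode-encoded by default

scrapy crawl itcast -o teachers.jsonl

CSV (comma-separated values), can be opened in Excel

scrapy crawl itcast -o teachers.csv

XML format

scrapy crawl itcast -o teachers.xml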
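
The difference between the JSON and JSON Lines outputs can be illustrated with the standard json module. This sketch does not call Scrapy; the two records are invented sample data, and the strings only imitate the shape of the exported files.

```python
import json

# Hypothetical scraped records.
items = [
    {"name": "Teacher A", "title": "Lecturer"},
    {"name": "Teacher B", "title": "Professor"},
]

# JSON: one array holding every record; the file must be read as a whole.
as_json = json.dumps(items)

# JSON Lines: one independent JSON object per line; can be read line by line.
as_jsonl = "\n".join(json.dumps(item) for item in items)
parsed = [json.loads(line) for line in as_jsonl.splitlines()]

print(as_json)
print(as_jsonl)
```

JSON Lines is convenient for large crawls because records can be appended and processed one at a time.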

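Attributes and methods

name = "": the identifying name of the spider. It must be unique; different spiders must have different names.

allowed_domains = []: the domain scope of the search, that is, the spider's boundary. The spider only crawls pages under these domains, and URLs outside them are ignored.

start_urls = (): a tuple or list of URLs to crawl. The spider starts fetching data here, so the first downloads come from these URLs. Other child URLs are derived from these start URLs.

parse(self, response): the parsing method. It is called after each start URL finishes downloading, with the Response object for that URL as its only argument. Its main jobs are:
    1. Parse the returned page data (response.body) and extract structured data (produce items).
    2. Generate the request for the next page's URL.

(This example did not succeed, so there is no scraped output to show.) Let's switch to another one.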

Crawler example:

1. The main idea

We scrape the title and verses of a poem from this page (https://so.gushiwen.cn/shiwenv_4c5705b99143.aspx) and save them to our folder.

2. Walking through the Scrapy example

2.1 Create a Scrapy project folder named 'poems'

scrapy startproject poems

2.2 Create a spider file named 'verse'

scrapy genspider verse www.xxx.com

2.3 Send a request to the page

Open the spider file 'verse' and change the URL to crawl

import scrapy
class VerseSpider(scrapy.Spider):
    name = 'verse'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']    # start URLs need a scheme such as http://

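2.4 Parse the data

Change the parse method to extract data from the response using XPath. The approach is much the same as parsing a page fetched with requests: first locate the part of the page to extract, then write the matching code. The difference from the requests workflow is that extract() and join() are also needed to get the text.

        title = response.xpath('//div[@class="cont"]/h1/text()').extract()
        content = response.xpath('//div[@id="contson4c5705b99143"]/text()').extract()
        title = ''.join(title)
        content = ''.join(content)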

2.5 Return the data

To save the data, the parse method must return a value. We create an empty list named data, put title and content into a dictionary, and append the dictionary to the list.

import scrapy
class VerseSpider(scrapy.Spider):
    name = 'verse'
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ['so.gushiwen.cn']
    start_urls = ['https://so.gushiwen.cn/shiwenv_4c5705b99143.aspx']

    def parse(self, response):
        data = []
        title = response.xpath('//*[@id="sonsyuanwen"]/div[1]/h1/text()').extract()
        # The id value must be quoted inside the XPath expression
        content = response.xpath('//div[@id="contson4c5705b99143"]/text()').extract()
        title = ''.join(title)
        content=''.join(content)
        dic = {
            'title': title, 'content': content
        }
        data.append(dic)
        return data