Web Crawlers (Part 1)

Latest recommended article published on 2024-10-10 14:17:59

走在分布式的路上

Column: Crawlers. Tags: crawlers

Article link: https://blog.csdn.net/weixin_43170863/article/details/99984040

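Learning Web Crawlers (Part 1)

1. What a crawler is

A crawler is a program that imitates a client (a browser): it sends network requests, receives the responses, and automatically collects information from the internet according to a set of rules.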

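2. The crawling workflow

URL --> send request, get response --> extract data --> store in the database
send request, get response --> extract new URLs, which feed back into the first step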
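
The workflow above can be sketched as a small loop. This is only an illustration under stated assumptions: the `crawl` function, the `fetch` callback, the in-memory `PAGES` site and the `example.test` URLs are all hypothetical, and a dictionary stands in for the database. A real crawler would pass a function that performs an HTTP request.

```python
import re
from collections import deque


def crawl(start_url, fetch, max_pages=10):
    """Fetch pages breadth-first, store one value per page, queue new links."""
    queue = deque([start_url])
    seen = {start_url}
    store = {}  # stands in for the database
    while queue and len(store) < max_pages:
        url = queue.popleft()
        html = fetch(url)  # send request, get response
        title = re.search(r"<title>(.*?)</title>", html)
        store[url] = title.group(1) if title else None  # extract data, store it
        for link in re.findall(r'href="(.*?)"', html):  # extract URLs
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return store


# Hypothetical two-page site kept in memory so the sketch needs no network.
PAGES = {
    "http://example.test/": '<title>Home</title><a href="http://example.test/a">a</a>',
    "http://example.test/a": '<title>Page A</title><a href="http://example.test/">home</a>',
}

result = crawl("http://example.test/", lambda url: PAGES.get(url, ""))
print(result)
```

The `seen` set is what keeps the second branch of the workflow (extracting URLs) from looping forever when pages link back to each other.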

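3. The Robots protocol

Through the Robots protocol (robots.txt), a website tells search engines which pages may be crawled and which may not. Example: https://www.taobao.com/robots.txt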
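
As a rough sketch of how a crawler might honour these rules, Python's standard library ships `urllib.robotparser`. The rules, the `my-crawler` user agent and the `example.test` URLs below are made up for illustration; they are not taken from any real robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory instead of being downloaded.
rules = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

allowed = parser.can_fetch("my-crawler", "https://example.test/index.html")
blocked = parser.can_fetch("my-crawler", "https://example.test/private/data.html")
print(allowed, blocked)
```

For a live site, `set_url()` followed by `read()` would fetch the file instead of `parse()`.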

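Aside:

Start with the figure:
[Figure: the process of a browser sending an HTTP request]

1. HTTP vs. HTTPS: HTTPS adds an encryption and decryption step, so it is more secure but performs somewhat worse

2. A crawler should go by the response for the current URL; what the browser shows under Elements for that URL can differ from the URL's response

3. DNS server: resolves a domain name to an IP address

4. The page rendered by the browser (Elements) is not the same as the page a crawler requests (response)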

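4. Where the data on a page lives

In the response for the current URL
In the responses for other URLs
- for example, in ajax requests
Generated by JS
- part of the data is in the response
- all of it is generated by JS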

5. The form of a URL

Form: scheme://host[:port#]/path/…/[?query-string][#anchor]
- http://localhost:4000/file/part01/1.2.html
- http://item.jd.com/11936238.html#product-detail (the anchor jumps straight to the product details)
Reference: https://blog.csdn.net/amberinheart/article/details/78814540
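
To see these parts concretely, the two example URLs can be split with the standard library's `urllib.parse.urlsplit`. This is a minimal sketch; the variable names are just for illustration.

```python
from urllib.parse import urlsplit

local = urlsplit("http://localhost:4000/file/part01/1.2.html")
item = urlsplit("http://item.jd.com/11936238.html#product-detail")

# scheme, host, port, path and anchor (called "fragment" by urllib)
print(local.scheme, local.hostname, local.port, local.path)
print(item.scheme, item.hostname, item.port, item.path, item.fragment)
```

When the URL has no explicit port, `port` is `None` and the scheme's default applies. The fragment is handled by the browser and is not sent to the server.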

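6. Common HTTP request headers (User-Agent, Cookie)

Sending these makes a crawler look more like a browser:
Host (host name and port)
Connection (connection type)
Upgrade-Insecure-Requests (asks to be upgraded to HTTPS)
User-Agent (browser identification)
Accept (media types the client accepts)
Referer (the page the request was made from)
Accept-Encoding (content encodings, such as compression formats, that the client accepts)
Cookie (the cookies)
- Compared with a session: a cookie is stored locally in the browser, is relatively insecure, and has a size limit, whereas a session is stored on the server and is limited only by the server's storage
x-requested-with: XMLHttpRequest (marks an Ajax asynchronous request)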

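7. Common response status codes

200: success
302: temporarily moved to a new URL
307: temporarily moved to a new URL (the request method is preserved on redirect)
404: Not Found
500: internal server error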
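
The standard reason phrases for the codes listed above can be looked up in the standard library's `http.HTTPStatus` enum. A small sketch, where the `phrases` dictionary is only a name chosen for this example:

```python
from http import HTTPStatus

codes = (200, 302, 307, 404, 500)
phrases = {code: HTTPStatus(code).phrase for code in codes}

for code, phrase in phrases.items():
    # 3xx codes are redirects; 4xx and 5xx codes signal errors.
    kind = "redirect" if 300 <= code < 400 else "error" if code >= 400 else "ok"
    print(code, phrase, kind)
```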

8. Common request methods

GET
POST
Differences: https://blog.csdn.net/qq_37932082/article/details/79452475

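9. Kinds of strings

bytes type
- binary data
- data on the internet is transmitted as binary
str type
- the form in which Unicode is presented
- UTF-8 is one implementation of the Unicode character set; it is a variable-length encoding
Encoding and decoding must use the same codec, otherwise the text comes out garbled
- b = a.encode() converts str to bytes
- a = b.decode("utf-8") converts bytes to str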
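
A short sketch of the round trip, and of what goes wrong when the codecs do not match. The sample string and variable names are chosen here for illustration.

```python
a = "中国"

b = a.encode()  # str -> bytes, UTF-8 by default
print(b)
print(b.decode("utf-8") == a)  # bytes -> str with the same codec

# Encoding with GBK and decoding with UTF-8 does not round-trip.
gbk_bytes = a.encode("gbk")
try:
    gbk_bytes.decode("utf-8")
    mismatch_failed = False
except UnicodeDecodeError:
    mismatch_failed = True

# With errors="replace" the decode succeeds but yields replacement characters.
garbled = gbk_bytes.decode("utf-8", errors="replace")
print(mismatch_failed, garbled != a)
```

Each of the two characters takes three bytes in UTF-8 but two in GBK, which is why the byte sequences are not interchangeable.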

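10. Getting started with requests

Why learn requests rather than urllib?
- 1. requests is built on top of urllib3; whatever urllib can do, requests can do too, and more simply
- 2. requests exposes the same methods on Python 2 and Python 3 (recent releases support Python 3 only)
- 3. requests is simple and easy to use
- 4. requests automatically decompresses page content for us (gzip and similar)
Purpose: send network requests and return the response data
Chinese API documentation: https://2.python-requests.org//zh_CN/latest/index.html
Ways to handle decoding in requests
- response.content.decode() decodes as utf-8 by default (recommended)
- response.content.decode("gbk")
- response.text relies on the requests module to guess the encoding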

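Sending a network request with requests

response = requests.get(url)
Commonly used response attributes
- response.text
- response.content
- response.status_code
- response.request.headers
- response.headers
Check whether the request succeeded

    assert response.status_code == 200

Inspect response.url (after a redirect, response.url and response.request.url may differ)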

    In [10]: response.url
    Out[10]: 'https://www.baidu.com/'

Inspect response.request.url

    In [11]: response.request.url
    Out[11]: 'https://www.baidu.com/'

Inspect response.headers

    In [7]: response.headers
    Out[7]: {'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'Keep-Alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Wed, 21 Aug 2019 09:51:37 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:23:51 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}

Inspect the request headers

    In [9]: response.request.headers
    Out[9]: {'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

Sending a request with headers
- Problem: because of 'User-Agent': 'python-requests/2.22.0', Baidu's server can tell that the request does not come from a browser, so we need to send the request with headers.
- Goal: imitate a browser and fool the server into returning the same content a browser gets
- Form of the headers: a dictionary
  - {'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
- Usage: requests.get(url, headers=headers)
```
    In [16]: headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"}
    In [17]: response = requests.get("http://www.baidu.com", headers=headers)
    In [19]: response.content.decode()
    Out[19]: (output omitted: it is too long; it is the same as the response for Baidu seen in the browser)
```
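Sending a request with parameters
- In https://www.baidu.com/s?wd=python&rsv_spt=1&rsv_iqid=0x85e9b6aa00036fc0&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf8&tn=baiduhome_pg&rsv_enter=1&rsv_dl=ib&rsv_sug3=6&rsv_sug1=4&rsv_sug7=100, everything after the ? in key=value form is a parameter, with parameters separated by &; only wd=python matters here
- Request parameters:
  - https://www.baidu.com/s?wd=python&c=b
- Form of the parameters: a dictionary
- How to add params:
```
    import requests
    url_temp = "https://www.baidu.com/s"  # base URL without the query string
    p = {"wd": "python"}
    response = requests.get(url_temp, headers=headers, params=p)
```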
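
To check what requests does with the headers and params dictionaries without touching the network, the request can be built and prepared but not sent. This is a sketch that assumes the requests package is installed. `url_temp`, `p` and `prepared` are names picked for this example; the User-Agent string is the one from the headers example above.

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"
}
url_temp = "https://www.baidu.com/s"  # base URL without the query string
p = {"wd": "python"}

# Build the request object and prepare it; nothing is sent.
prepared = requests.Request("GET", url_temp, headers=headers, params=p).prepare()

print(prepared.url)                    # the params are appended after "?"
print(prepared.headers["User-Agent"])  # the custom header replaces the default
```

Passing the same `headers` and `params` to `requests.get` sends a request with this URL and User-Agent.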
URL encoding
- https://www.baidu.com/?wd=%E4%B8%AD%E5%9B%BD
- https://www.baidu.com/?wd=中国 (the same URL before encoding)
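
The conversion between the two forms can be reproduced with the standard library's `urllib.parse.quote` and `unquote`, which percent-encode the UTF-8 bytes of the text. A minimal sketch:

```python
from urllib.parse import quote, unquote

encoded = quote("中国")                   # text -> percent-encoded UTF-8 bytes
decoded = unquote("%E4%B8%AD%E5%9B%BD")  # percent-encoded -> text

print(encoded)
print(decoded)
print("https://www.baidu.com/?wd=" + encoded)
```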