[Python] Web Scraping: The requests Library

HopeTurbo

Last modified 2024-08-03 18:21:18

First published 2024-08-03 18:03:26

Article link: https://blog.csdn.net/2301_79740539/article/details/140890785

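Preface

Most readers have probably heard of web crawlers. This post explains what a crawler is and how to write one with the requests library (other approaches will be covered in later posts).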

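What is a crawler

A web crawler (also called a web spider or web robot) is a program or script that automatically fetches information from the web according to a set of rules. Less common names include ant, automatic indexer, emulator, and worm.

Put simply, a crawler imitates the interaction between a client and a server in order to retrieve the information you want.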

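Is it illegal

Some readers worry that using crawlers may break the law. Crawling is just a technique, and it exists for good reasons; as long as we use it responsibly, we stay within the relevant laws and regulations. What does responsible use look like? There is a convention called the robots protocol: append /robots.txt to a site's address to see which content the site owner allows crawlers to fetch and which is explicitly forbidden. For example: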

Appending robots.txt to the address of Taobao's site shows its robots protocol:

https://www.taobao.com/robots.txt

This is Pinduoduo's:

https://mobile.yangkeduo.com/robots.txt

And of course we cannot leave out CSDN's:

https://www.csdn.net/robots.txt

The basic syntax of robots.txt is:

User-agent: <user-agent_name>
Disallow: <restricted_URL>
Allow: <allowed_URL>
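
These rules can also be checked from Python rather than read by eye. The following is a minimal sketch using the standard library's urllib.robotparser. The rule text, the example.com URLs, the "my-crawler" agent name, and the is_allowed helper are all made up for illustration and are not taken from any of the sites above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written in the syntax shown above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /public/
"""


def is_allowed(robots_text, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network access is needed here.
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "my-crawler", "https://example.com/public/page.html"))
print(is_allowed(ROBOTS_TXT, "my-crawler", "https://example.com/private/data.html"))
```

For a real site, the same parser can load the live file with set_url() followed by read() instead of parse().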

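How to crawl with the requests library

The simplest way to see a request in action needs no code at all: right-click the page, choose Inspect, open the Network tab, and refresh the page. The browser has just performed the most basic GET request.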

In code

First, open cmd and run pip install requests to install the requests library.

Then write the code:

import requests

# Target URL
url = 'https://www.csdn.net/?spm=1038.2274.3001.4476'
# Send a GET request
response = requests.get(url=url)
# response.text is the response body decoded into a string
page_text = response.text
print(page_text)

The output is the same response data we just saw in the browser.

Building a simple search tool

Open a search engine and type in a query.

Here I searched Bing for the term Eniac:

Eniac - Search (bing.com)
https://cn.bing.com/search?q=Eniac&qs=n&form=QBRE&sp=-1&lq=0&pq=eniac&sc=10-5&sk=&cvid=2C2D7FA81FFD4250AEA8FD725E8DC710&ghsh=0&ghacc=0&ghpl=

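import requests

# UA spoofing: send a browser-like User-Agent header
header = {
    'user-agent': ''  # fill in your own browser's User-Agent string
}
# Target URL (the query string is supplied through params)
url = 'https://cn.bing.com/search'
# Query parameters: q is the search keyword
param = {
    'q': 'Eniac'
}
# Send the request with both the parameters and the headers
response = requests.get(url=url, params=param, headers=header)
# Get the response body as text
page_text = response.text
# Save the result to a file
filename = 'Eniac' + '.html'
with open(filename, 'w', encoding='utf-8') as fp:
    fp.write(page_text)
print(filename, 'done')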

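A word on UA spoofing. A crawler imitates the interaction between a client and a server, but servers do not want us to do that, so they deploy anti-crawling mechanisms. To get past them, we send the User-Agent string of our own browser so that the request looks like it comes from an ordinary client.

It is the value shown in the area that is blurred out in the screenshot.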
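
To see what UA spoofing changes, it can help to build the request without sending it. The sketch below assumes requests is installed and uses requests.Request(...).prepare() to inspect the final URL and headers offline. BROWSER_UA is a made-up placeholder, not a real browser string, and build_search_request is a hypothetical helper name.

```python
import requests

# Hypothetical placeholder: replace with the User-Agent string of your own browser.
BROWSER_UA = "Mozilla/5.0 (placeholder; copy the real value from your browser)"


def build_search_request(keyword, user_agent=BROWSER_UA):
    """Prepare (but do not send) a Bing search request for `keyword`."""
    request = requests.Request(
        "GET",
        "https://cn.bing.com/search",
        params={"q": keyword},               # becomes the ?q=... query string
        headers={"User-Agent": user_agent},  # replaces the library's default UA
    )
    return request.prepare()


prepared = build_search_request("Eniac")
# What requests announces itself as when no User-Agent header is supplied:
print(requests.utils.default_user_agent())
# What the server will see with UA spoofing in place:
print(prepared.headers["User-Agent"])
print(prepared.url)
```

Without the header, the default User-Agent starts with python-requests/, which makes it easy for a server to recognize a script.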

The script creates an HTML file named Eniac. Let's open it.

It is the HTML of Bing's results page for Eniac.

How do we make the script general? We only need to turn Eniac into a variable.

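import requests

headers = {
    'user-agent': 'XXXX'  # placeholder: use your own browser's User-Agent string
}
url = 'https://cn.bing.com/search'
kw = input('enter a word:')
param = {
    'q': kw
}
# Pass headers as well, otherwise the UA spoofing has no effect
response = requests.get(url=url, params=param, headers=headers)
page_text = response.text
filename = kw + '.html'
with open(filename, 'w', encoding='utf-8') as fp:
    fp.write(page_text)
print(filename, 'done')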
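
One caveat once the keyword comes from input(): a keyword containing characters such as / or ? cannot be used directly as a file name. A possible way to handle this is sketched below. The safe_filename helper and its replacement rules are my own assumption, not part of the original script.

```python
import re


def safe_filename(keyword, extension=".html"):
    """Turn a search keyword into a file name that common file systems accept."""
    # Replace path separators, characters Windows forbids, and whitespace runs.
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", keyword).strip("_.")
    # Fall back to a fixed name if nothing usable is left.
    return (cleaned or "result") + extension


print(safe_filename("Eniac"))
print(safe_filename("a/b c"))
```

In the script above, filename = safe_filename(kw) would then take the place of kw + '.html'.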

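A simple Baidu Translate client

Here we call the request that Baidu Translate uses to update its page dynamically, and build a simple translator on top of it.

Open the Baidu Translate page, open the developer tools, select XHR in the Network tab, and type hello.

Baidu Translate - Your Super Translation Partner (Text and Document Translation) (baidu.com)
https://fanyi.baidu.com/mtpe-individual/multimodal#/

Here we can see that the request named sug contains our translation result.

Looking at the type in its headers, the content is JSON, so we store the result as a JSON file.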

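import requests
import json

header = {
    'user-agent': 'xxxx'  # placeholder: use your own browser's User-Agent string
}
post_url = 'https://fanyi.baidu.com/sug'
word = input('enter a word:')
data = {
    'kw': word
}
response = requests.post(url=post_url, data=data, headers=header)
# The response body is JSON; parse it into a Python object
dic_obj = response.json()

filename = word + '.json'
# A with block makes sure the file is flushed and closed
with open(filename, 'w', encoding='utf-8') as fp:
    json.dump(dic_obj, fp=fp, ensure_ascii=False)
print('done')
print(dic_obj)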

Let's test it: enter hello and look at the translation result.
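
The raw dictionary is hard to read, so a small formatting step may be useful. The sketch below assumes the response has a "data" list whose entries carry the word under "k" and its meaning under "v". That shape and the sample values are assumptions for illustration, so check them against the dic_obj you actually receive. format_suggestions is a hypothetical helper.

```python
# Hypothetical sample shaped like the assumed response; real fields may differ.
sample = {
    "errno": 0,
    "data": [
        {"k": "hello", "v": "int. 你好"},
        {"k": "hello world", "v": "你好，世界"},
    ],
}


def format_suggestions(dic_obj):
    """Return one 'word: meaning' line per entry in the assumed 'data' list."""
    lines = []
    for entry in dic_obj.get("data", []):
        # .get() keeps the loop from failing if a field is missing.
        lines.append(f"{entry.get('k', '')}: {entry.get('v', '')}")
    return lines


for line in format_suggestions(sample):
    print(line)
```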

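Fetching Douban movie data

Explore movies (douban.com)
https://movie.douban.com/explore

Open the Network tab, select XHR, and click "Load more" (shown on the left of the screenshot).

Get the URL.

Check the payload.

Write the code.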

import requests
import json

# Target URL (the query string seen in the browser is supplied through params,
# so it is not repeated here)
url = 'https://m.douban.com/rexxar/api/v2/movie/recommend'
# Query parameters, copied from the Payload tab
param = {
    'refresh': '0',
    'start': '180',
    'count': '20',
    'selected_categories': '{"类型":"喜剧"}',
    'uncollect': 'false',
    'tags': '喜剧',
}
# UA spoofing
header = {
    'User-Agent': 'MXXX'  # placeholder: use your own browser's User-Agent string
}
# Send the request
response = requests.get(url=url, params=param, headers=header)
# The response is JSON (check the type in the response headers)
list_data = response.json()
# Save to a file
with open('./douban.json', 'w', encoding='utf-8') as fp:
    json.dump(list_data, fp=fp, ensure_ascii=False)
# Completion message
print('over')
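
The parameters shown in the Payload tab are simply the decoded query string of the captured URL. As a quick cross-check, the standard library can split that URL back into a base address and a parameter dictionary. The variable names below are my own; the URL is the one captured in the browser.

```python
from urllib.parse import parse_qs, urlsplit

# The full URL as captured in the browser's Network tab.
captured_url = (
    'https://m.douban.com/rexxar/api/v2/movie/recommend'
    '?refresh=0&start=180&count=20'
    '&selected_categories=%7B%22%E7%B1%BB%E5%9E%8B%22:%22%E5%96%9C%E5%89%A7%22%7D'
    '&uncollect=false&tags=%E5%96%9C%E5%89%A7'
)

parts = urlsplit(captured_url)
# Everything before the question mark is the address to request.
base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"
# parse_qs percent-decodes the values and returns a list per key;
# each key occurs once here, so keep the first value.
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

print(base_url)
print(params)
```

Changing start in this dictionary (for example 0, 20, 40 with count set to 20) is presumably how further pages are requested, since that value changes each time "Load more" is clicked.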

Result of the run