Day 1: Web Crawler Notes

m0_69534331

Tags: crawler, python, http

Article link: https://blog.csdn.net/m0_69534331/article/details/124672148

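Web Crawlers

1. Crawler basics

1.1. Cookies

1) Purpose: a cookie stores a small piece of login state on the local machine, typically a session identifier that stands in for the user's account and password
2) Property: cookies expire after a limited time
3) Role in a crawler: the cookie is sent along with the local request so that the remote server can match it against the login information it has stored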
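
As a rough illustration of the three points above, the sketch below parses a made-up Set-Cookie header with Python's standard http.cookies module. The cookie name sessionid, its value, and the one-hour lifetime are assumptions chosen for the example, not values from any real site.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header, as a server might send after a login.
raw_header = "sessionid=abc123; Max-Age=3600; Path=/; HttpOnly"

cookie = SimpleCookie()
cookie.load(raw_header)

morsel = cookie["sessionid"]
session_value = morsel.value               # the stored login state
lifetime_seconds = int(morsel["max-age"])  # cookies are time-limited

# The header a crawler would send back on later requests.
cookie_header = f"{morsel.key}={morsel.value}"
print(session_value, lifetime_seconds, cookie_header)
```

With requests, the same cookie could be attached to a request by passing cookies={'sessionid': 'abc123'} to requests.get.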

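1.2. GET and POST

Note: these are two different ways of talking to the server

1) GET: asks the server for a resource; requests sends it with requests.get. Any parameters travel in the URL, so they are visible in the address bar and in logs (GET itself provides no encryption)
2) POST: sends content to the server in the request body, for example the account name and password submitted when registering or logging in to a site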
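
To make the difference concrete, here is a minimal sketch that builds one GET and one POST request with requests without sending either, so it runs offline. The host example.com and the field names q, user, and password are placeholders, not part of the original notes.

```python
import requests

# example.com and the parameter names are placeholders for illustration.
get_req = requests.Request(
    "GET", "http://example.com/search", params={"q": "python"}
).prepare()
post_req = requests.Request(
    "POST", "http://example.com/login",
    data={"user": "alice", "password": "secret"},
).prepare()

print(get_req.url)    # GET: the parameters are appended to the URL
print(get_req.body)   # GET: no request body here
print(post_req.url)   # POST: the URL carries no form fields
print(post_req.body)  # POST: the form fields travel in the body
```

Neither method encrypts anything by itself; whether the data is protected in transit depends on using https rather than http.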

2. Installing modules

2.1. requests

Purpose: an easy-to-use third-party Python module for requesting web pages

2.2. Preparing to install third-party modules

If the Terminal prompt shows only PS (PowerShell) with no virtual environment active, third-party modules are installed directly into the machine's main Python environment.
Activate the virtual environment with:
venv\Scripts\activate
If PowerShell refuses to run the activation script, change the PowerShell execution policy for the current user first:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
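
One way to confirm from Python itself whether a virtual environment is active is sketched below. It assumes an environment created with the standard venv module (or a recent virtualenv); the helper name in_virtual_environment is made up for this example.

```python
import sys

def in_virtual_environment() -> bool:
    # Inside a venv, sys.prefix points at the venv directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix

print("virtual environment active:", in_virtual_environment())
print("interpreter in use:", sys.executable)
```

If the first line prints False, pip install will most likely target the main environment described above.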

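2.3. Changing the mirror source

Mirror source: a repository that stores third-party Python modules; anyone can build and publish a third-party Python module
pip config set global.index-url https://pypi.douban.com/simple
Installing the requests module

Windows:
pip install requests
macOS, Linux:
pip3 install requests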
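
After installing, a quick check like the following can confirm which version of requests, if any, the current interpreter sees. This is a small sketch using the standard importlib.metadata module (Python 3.8+); the helper name installed_version is an assumption for this example.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package: str):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

print("requests:", installed_version("requests"))
```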

3. Using requests

Import: import requests

3.1. Requesting a web page

URL = 'http://www.baidu.com'
# Send the request to the Baidu server and receive the server's response
resp = requests.get(url=URL)
print(resp)  # <Response [200]>
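
A request can also fail before any response arrives, for example when the URL is malformed or the server cannot be reached. The sketch below wraps requests.get with a timeout and catches the library's base exception; the helper name fetch and the 10-second timeout are assumptions for this example.

```python
import requests

def fetch(url, timeout=10):
    """Return the response, or None if the request could not be completed."""
    try:
        return requests.get(url, timeout=timeout)
    except requests.exceptions.RequestException as exc:
        print(f"request failed: {exc}")
        return None

# A URL without the "http://" scheme is rejected before any network traffic.
result = fetch("www.baidu.com")
print(result)
```

Calling fetch('http://www.baidu.com') instead would send the same request as the code above, with the timeout applied.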

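3.2. Checking the status code: it tells us how the server handled the request

200: the request succeeded, so the crawler can proceed
403: the server refused the request (the crawler was rejected)
404: the resource was not found
500: internal server error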

print(resp.status_code)
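
The standard library knows the official name of each status code, which can make log output easier to read. A small sketch, where the helper name describe_status is made up for this example:

```python
from http import HTTPStatus

def describe_status(code: int) -> str:
    """Return the code followed by its standard reason phrase."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # Codes that the standard library does not define
        return f"{code} (unknown status)"

for code in (200, 403, 404, 500):
    print(describe_status(code))
```

In the crawler this could be called as describe_status(resp.status_code).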

3.3. Viewing cookies

print(resp.cookies)

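3.4. What to do if the page source comes out garbled

When the server does not declare a character set for a text response, requests decodes it as ISO-8859-1, which cannot represent Chinese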

resp.encoding = 'utf-8'
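
The effect can be reproduced without any network access. The sketch below takes UTF-8 bytes (standing in for the raw response body) and decodes them both ways; the sample text is an arbitrary choice for this example.

```python
# Bytes as a server would send them for a UTF-8 page
raw = "百度一下".encode("utf-8")

garbled = raw.decode("iso-8859-1")  # wrong codec: one character per byte
fixed = raw.decode("utf-8")         # correct codec

print(repr(garbled))
print(fixed)
```

Setting resp.encoding = 'utf-8' makes resp.text perform the second kind of decoding. requests also exposes resp.apparent_encoding, a guess based on the body bytes, which may help when the right encoding is not known in advance.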

3.5. Viewing the page source (as a string)

resp.encoding = 'utf-8'
print(resp.text)

3.6. Viewing the page content (in binary form) - images, audio, video, etc.

resp.content returns the raw bytes, displayed in the form b'xxxxxxx'
print(resp.content)

URL = 'http://www.zhihu.com'
resp = requests.get(url=URL)
print(resp.status_code)
print(resp.cookies)
print(resp.text)

5. Disguising the crawler

import requests
URL = 'http://www.baidu.com'
# Disguise the crawler
# Purpose of User-Agent: make the crawler look like a browser
Headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36'
}
resp = requests.get(url=URL,headers=Headers)
if resp.status_code == 200:
    resp.encoding = 'utf-8'
    print(resp.text)
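
To see why the header matters, the sketch below prepares two requests without sending them and compares the User-Agent each would carry. It runs offline; the constant name BROWSER_UA is an assumption for this example.

```python
import requests

BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36"
)

session = requests.Session()
# Without a custom header, requests identifies itself as python-requests/x.y.z
plain = session.prepare_request(requests.Request("GET", "http://www.baidu.com"))
# With the header, the request carries the browser string instead
disguised = session.prepare_request(
    requests.Request("GET", "http://www.baidu.com", headers={"User-Agent": BROWSER_UA})
)

print(plain.headers["User-Agent"])
print(disguised.headers["User-Agent"])
```

A server that filters on User-Agent can recognise the first value as a script, which is one common reason for a 403 response.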

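Wrapping the same code in a function

import requests

def requests_url(href):
    # Send a GET request with a browser-like User-Agent header
    Headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36'
    }
    resp = requests.get(url=href, headers=Headers)
    if resp.status_code == 200:
        return resp
    else:
        # Report the failing status code and return None so the caller can check
        print(resp.status_code)
        return None

URL = 'https://pvp.qq.com/web201605/herolist.shtml'
result = requests_url(URL)
if result is not None:
    # Decode the page as GBK
    result.encoding = 'gbk'
    print(result.text)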

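6. TianAPI (天行数据)

import requests
import json
content1 = input('Enter an item of waste: ')
# The API endpoint and its parameters are joined by ?; parameters are passed as key=value and separated from each other by &
URL = f'http://api.tianapi.com/lajifenlei/index?key=a7e9d6a5a9ec9ce1f5ccc5d7f72f7749&word={content1}'
# Most API endpoints have no anti-crawler mechanism
resp = requests.get(url=URL)
print(resp.text,type(resp.text))
# Deserialization: json.loads converts the JSON string into a Python object
data = json.loads(resp.text)
print(data,type(data))

# Extract the useful information from the JSON data
for i in data['newslist']:
    print(i['explain'])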
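
The two steps in this section, building the query string and parsing the JSON reply, can be tried offline. In the sketch below the key value YOUR_API_KEY and the sample reply are made up; only the newslist and explain fields come from the code above, and the other fields are assumptions.

```python
import json
from urllib.parse import urlencode

# Build the part after "?": key=value pairs joined by "&".
# urlencode also percent-encodes non-ASCII characters such as Chinese.
query = urlencode({"key": "YOUR_API_KEY", "word": "电池"})
print(query)

# Hypothetical reply text, shaped like the fields read by the loop above.
sample_text = (
    '{"code": 200, "newslist": '
    '[{"name": "battery", "explain": "sample explanation"}]}'
)
data = json.loads(sample_text)  # JSON string -> Python dict
explanations = [item["explain"] for item in data["newslist"]]
print(explanations)
```

requests can do the encoding step itself when the parameters are passed as a dict through the params argument of requests.get, which avoids assembling the URL by hand.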

7. Reading, writing, and downloading images

with open('1.jpg','rb') as f1:
    result = f1.read()
    print(result)
# Simulate an image download (write the image's binary data to a local file)
with open('2.jpg','wb') as f2:
    f2.write(result)

# How to download online images, video, and audio
import requests
URL = 'https://gimg2.baidu.com/image_search/src=http%3A%2F%2Finews.gtimg.com%2Fnewsapp_bt%2F0%2F13684770255%2F1000.jpg&refer=http%3A%2F%2Finews.gtimg.com&app=2002&size=f9999,10000&q=a80&n=0&g=0n&fmt=auto?sec=1654679299&t=f5e9c6aefb0f234d15a8bf7895afbfcf'
resp = requests.get(url=URL)
print(resp.content)
with open('peng.jpg','wb') as f3:
    f3.write(resp.content)
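
A server sometimes answers an image request with an error page, and writing that to a .jpg file produces a file that will not open. As one possible safeguard, the sketch below checks for the JPEG signature bytes before saving. The helper name save_if_jpeg and the stand-in byte strings are assumptions for this example; in the download code the data would be resp.content.

```python
import os
import tempfile

JPEG_MAGIC = b"\xff\xd8\xff"  # JPEG files start with these bytes

def save_if_jpeg(data: bytes, path: str) -> bool:
    """Write data to path only if it looks like a JPEG; report whether it did."""
    if not data.startswith(JPEG_MAGIC):
        return False
    with open(path, "wb") as f:
        f.write(data)
    return True

# Stand-in bytes for this demo instead of a real download
fake_jpeg = JPEG_MAGIC + b"\xe0" + b"\x00" * 16
error_page = b"<html>403 Forbidden</html>"

with tempfile.TemporaryDirectory() as tmp:
    ok = save_if_jpeg(fake_jpeg, os.path.join(tmp, "ok.jpg"))
    rejected = save_if_jpeg(error_page, os.path.join(tmp, "bad.jpg"))
    bad_exists = os.path.exists(os.path.join(tmp, "bad.jpg"))

print(ok, rejected, bad_exists)
```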