Web Scraping for Beginners Series - Basics - Two Ways to Implement a Crawler

Original link: https://cloud.tencent.com/developer/article/1562566


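The text and images in this article come from the internet and are for learning and exchange only, with no commercial use. Copyright belongs to the original author; if there is any problem, please contact us promptly so it can be handled.

This article comes from Tencent Cloud, by author 小一不二三.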


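The first step of a web crawler is to fetch a page's HTML from its URL. In Python 3, you can use either urllib.request or requests to fetch page data.

The urllib library is part of the Python standard library, so it needs no extra installation; it is available wherever Python is installed.
The requests library is a third-party library and has to be installed separately.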

Crawling with urllib

1. Import the urllib library

# Import the libraries
from urllib import request
import chardet  # third-party package: pip install chardet

2. Fetch the page content

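# Target page (the same URL used in the requests example later on)
url = 'https://www.bxwxorg.com/read/20/'
# Send the request and receive the response
response = request.urlopen(url)
# read() returns the raw response body as bytes; nothing is decoded yet
html = response.read()
# Detect the text encoding of the bytes
html_encoding = chardet.detect(html)
# Decode with the detected encoding (fall back to UTF-8 if detection returns None)
content = html.decode(html_encoding['encoding'] or 'utf-8')
print(content)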
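
chardet's guess can occasionally be wrong or missing. As a minimal sketch of an alternative, the helper below first tries the charset the server declares in its Content-Type header and then a short list of candidates. The names decode_body and fetch are hypothetical (they are not from the original article), and the candidate list ("utf-8", "gbk") is only an assumption that suits many Chinese sites.

```python
from urllib import request


def decode_body(raw, declared=None, candidates=("utf-8", "gbk")):
    """Decode raw bytes, trying the declared charset first, then candidates."""
    for encoding in ([declared] if declared else []) + list(candidates):
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next one
            continue
    # Last resort: replace undecodable bytes with U+FFFD
    return raw.decode("utf-8", errors="replace")


def fetch(url, timeout=10):
    """Fetch a page with urllib and return its decoded text."""
    with request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Charset from the Content-Type header, or None if absent
        declared = response.headers.get_content_charset()
    return decode_body(raw, declared)
```

UTF-8 is tried before GBK on purpose: UTF-8 decoding is strict and fails quickly on GBK bytes, whereas GBK will accept many byte sequences that are not really GBK.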

Crawling with requests
1. Install requests

In cmd, install requests with one of the following commands:

pip install requests   # if only Python 3 is installed
pip3 install requests  # if both Python 2 and Python 3 are installed
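
To confirm that the installation landed in the interpreter you are actually running, a quick check along these lines may help. It is only a sketch; is_installed is a hypothetical helper name.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


# urllib ships with Python; requests is present only after it has been installed
for name in ("urllib", "requests"):
    print(name, "installed" if is_installed(name) else "missing")
```

If requests shows as missing even after pip reports success, pip probably installed it for a different Python interpreter.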

2. Fetch the page content

import requests

url = 'https://www.bxwxorg.com/read/20/'
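# Fetch the page
response = requests.get(url)
# Inspect the response
print(response.headers)      # response headers
print(response.cookies)      # cookies set by the server
print(response.status_code)  # HTTP status code, e.g. 200
print(response.text)         # body decoded to str using response.encoding
print(response.content)      # raw body as bytes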

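If the Chinese text printed by response.text is garbled, requests has guessed the wrong encoding. The code here assumes the page is GBK-encoded; the encoding can then be fixed in either of two ways:

# Method 1: set the encoding on the response
response.encoding = 'gbk'
# response.text now prints readable Chinese
print(response.text)

# Method 2: decode the raw content directly
print(response.content.decode('gbk'))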
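
To see why the wrong encoding produces garbled text, it helps to reproduce the effect offline. When a text response does not declare a charset, requests generally falls back to ISO-8859-1 for response.text. The sketch below imitates that with plain bytes; the sample string is made up for illustration.

```python
original = "小说阅读"
raw = original.encode("gbk")        # the bytes a GBK site sends over the wire

garbled = raw.decode("iso-8859-1")  # wrong codec: never fails, but yields mojibake
fixed = raw.decode("gbk")           # right codec: recovers the original text

print(garbled)
print(fixed)

# A wrong guess can be undone by reversing it, because ISO-8859-1 loses no bytes
repaired = garbled.encode("iso-8859-1").decode("gbk")
```

This is the same thing Method 1 and Method 2 do: both make sure the raw bytes are decoded as GBK rather than with the guessed encoding.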

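Disguising the crawler's headers

As mentioned in the previous installment, when the server does not recognize your crawler as a real browser, it may refuse to return the correct content. In that case the crawler has to disguise itself by sending browser-like request headers.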
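
One reason a server can tell a crawler from a browser is the default User-Agent. As a small illustration using only the standard library, the default urllib opener exposes the headers it adds to every request:

```python
from urllib import request

# A default opener carries the headers urllib adds to every request
opener = request.build_opener()
default_headers = dict(opener.addheaders)
print(default_headers["User-agent"])  # a value of the form Python-urllib/3.x
```

requests behaves similarly and identifies itself with a python-requests User-Agent unless you override it.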

Part of a typical HTTP request header (HEADER) looks like this:

Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.9
Cache-Control: max-age=0
Host: www.baidu.com
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36


Setting headers

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
           'Referer': 'http://www.quanshuwang.com/book/44/44683',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
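           # Note: requests can only decode 'br' (Brotli) responses if a Brotli package is installed
           'Accept-Encoding': 'gzip, deflate, br',
           'Accept-Language': 'zh-CN,zh;q=0.9'
           }
# Send the request with the custom headers
response = requests.get(url=url, headers=headers)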
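
The same disguise works with urllib, the other approach covered in this article. The sketch below builds a Request object with custom headers and inspects it without touching the network. The URL is the one used earlier, and the shortened header set is just an assumption for illustration.

```python
from urllib import request

url = 'https://www.bxwxorg.com/read/20/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}

# Building the Request does not send anything yet
req = request.Request(url, headers=headers)

# urllib normalizes header names with str.capitalize(), hence 'User-agent'
print(req.get_header('User-agent'))
print(req.header_items())

# To actually send it (needs network access):
# with request.urlopen(req, timeout=10) as response:
#     html = response.read()
```

Accept-Encoding is left out here on purpose: urllib does not decompress gzip responses automatically, so asking for compressed content would require decompressing the body yourself.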

Logging in with the crawler

1. Log in with a username and password

login_url = 'https://xxxxx.com/login'
# Set the username and password
form_data = {'username':'*****', 'password':'*****'}  
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
           'Accept-Encoding': 'gzip, deflate, br',
           'Accept-Language': 'zh-CN,zh;q=0.9'
           } 
response = requests.post(login_url, data=form_data, headers=headers)
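
Passing a dict as data makes requests send the fields as a URL-encoded form body. To make that concrete, here is a rough standard-library equivalent that builds the same kind of request without sending it. The credentials are made up, and the login URL is the article's placeholder.

```python
from urllib import parse, request

login_url = 'https://xxxxx.com/login'  # placeholder URL from the article
form_data = {'username': 'alice', 'password': 'secret'}  # made-up credentials

# Form fields are URL-encoded into the request body
body = parse.urlencode(form_data).encode('utf-8')
print(body)

# Supplying data makes urllib issue a POST instead of a GET
req = request.Request(login_url, data=body)
print(req.get_method())
```

The field names a real site expects ('username', 'password' here) vary from site to site; they can be read from the login form's HTML or from the browser's network panel.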

2. Log in with a previously saved cookie

raw_cookies = "k1=v1; k2=v2; k3=v3"
cookies = {}
# Build the cookies dict; strip() removes the space left after each ';'
for line in raw_cookies.split(';'):
    key, value = line.strip().split('=', 1)
    cookies[key] = value
login_url = 'http://xxxxxx.com'
response = requests.post(login_url, cookies=cookies)
print(response.text)
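
As an alternative to splitting the string by hand, the standard library's http.cookies module can parse a raw Cookie header. This is only a sketch under the assumption that the cookie string is well formed; the sample keys and values are the placeholders used above.

```python
from http.cookies import SimpleCookie

raw_cookies = "k1=v1; k2=v2; k3=v3"

# SimpleCookie parses the header into name -> Morsel pairs
jar = SimpleCookie()
jar.load(raw_cookies)

# Keep only the plain values, ready to pass as requests' cookies argument
cookies = {name: morsel.value for name, morsel in jar.items()}
print(cookies)
```

SimpleCookie is stricter than a manual split and may reject or skip values containing characters it considers illegal, so the hand-written loop can be the more forgiving choice for cookies copied straight from a browser.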