Python Web Scraping: A Beginner's Introduction

Latest recommended article published on 2024-09-05 17:17:02

冷淡的蛋黄酱

Category: Web Scraping

Article link: https://blog.csdn.net/weixin_52730784/article/details/111404563

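This article is a beginner's tutorial on Python web scraping. It covers an overview of web scrapers, using the requests library, exception handling, understanding robots.txt, dynamic User-Agents, the Beautiful Soup parsing library, and XPath parsing techniques. Worked examples show how to send requests, handle response data, and parse web page content.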

Contents
01 | Overview of Web Scraping
02 | The requests Library
03 | Exception Handling
04 | robots.txt
05 | Dynamic User-Agent
06 | The Beautiful Soup Parsing Library
07 | Regular Expressions
08 | XPath

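01 | Overview of Web Scraping

1. How a scraper differs from a browser

2. The scraping process
Send a request with the requests library, set a User-Agent header to imitate a browser, then extract the data with BeautifulSoup or regular expressions.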
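
The three stages above can be sketched roughly as follows. This is a minimal illustration, not the author's code: the names `fetch_html`, `extract_title`, and `DEFAULT_HEADERS`, the sample HTML, and the choice of pulling out the `<title>` with a regular expression are all assumptions made for the example. `requests` is imported inside `fetch_html` so that the parsing stage can be tried without network access.

```python
import re

# Browser-like User-Agent; any current browser string would do here.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
}


def fetch_html(url, headers=None, timeout=10):
    """Stages 1 and 2: send a GET request with a browser-like User-Agent."""
    import requests  # third-party; imported lazily so parsing works offline

    response = requests.get(url, headers=headers or DEFAULT_HEADERS, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.content.decode("utf-8", errors="replace")


def extract_title(html):
    """Stage 3: pull the <title> text out of the HTML with a regular expression."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


# Hypothetical sample page, so the extraction step runs without a network call.
sample_html = "<html><head><title> Example page </title></head><body></body></html>"
print(extract_title(sample_html))
```

With network access, `extract_title(fetch_html('https://cn.bing.com/'))` would chain the stages together. A regular expression is enough for a single tag like this; for nested or irregular markup a real parser such as Beautiful Soup is usually the safer choice.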

02 | The requests Library

1. Steps for using requests
- Import the module
- Send a GET request and get the response
- Extract the data from the response
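# Method 1
# coding: utf-8
# 1. Import the module
import requests
# 2. Send the request and get the response
response = requests.get('http://www.baidu.com')
# 3. Read the response data
# Set the encoding to UTF-8 so Chinese text displays correctly;
# read response.encoding beforehand to see the encoding requests detected
response.encoding = 'utf-8'
print(response.text)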

# Method 2
import requests
response = requests.get('https://cn.bing.com/')
# Alternative, as in Method 1:
# response.encoding = 'utf-8'
# print(response.text)
# response.content holds the raw bytes; decode() turns them into a string, using UTF-8 by default
print(response.content.decode())
# Some sites are encoded in GBK; for those, use response.content.decode(encoding='gbk')
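
To make the difference between UTF-8 and GBK decoding concrete, here is a small offline sketch. The helper name `decode_body` and its fallback order are assumptions for illustration, not something the original article defines. It works on plain `bytes`, which is the type `response.content` returns.

```python
def decode_body(raw, encodings=("utf-8", "gbk")):
    """Try each encoding in order and return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this encoding does not fit the bytes; try the next one
    # Last resort: decode with the first encoding and replace bad bytes.
    return raw.decode(encodings[0], errors="replace"), encodings[0]


# The same Chinese text, encoded two different ways.
utf8_bytes = "中文页面".encode("utf-8")
gbk_bytes = "中文页面".encode("gbk")

print(decode_body(utf8_bytes))
print(decode_body(gbk_bytes))
```

UTF-8 is tried first on purpose: GBK-encoded Chinese text is rarely valid UTF-8, so a wrong guess fails loudly, whereas UTF-8 bytes can sometimes be decoded as GBK without an error and silently produce garbled text.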

2. Setting the User-Agent to imitate a browser

import requests
url = 'https://cn.bing.com/'
# Set request headers so the request looks like it comes from a browser
head = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
data = requests.get(url, headers=head)
# data.content holds the raw bytes; decode() turns them into a string, using UTF-8 by default
print(data.content.decode())
# Some sites are encoded in GBK; for those, use data.content.decode(encoding='gbk')
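
A header name has to be spelled exactly, with no spaces, or the server will not recognize it as a User-Agent. The sketch below checks a headers dictionary before it is passed to `requests.get`. The helper names `build_headers` and `invalid_header_names` are hypothetical, and the pattern is a simplified reading of the HTTP rule that field names are tokens without whitespace.

```python
import re

# Characters allowed in an HTTP header name (a "token"); spaces are not among them.
_HEADER_NAME = re.compile(r"^[!#$%&'*+\-.^_`|~0-9A-Za-z]+$")


def build_headers(user_agent):
    """Return a headers dict with a correctly spelled User-Agent key."""
    return {"User-Agent": user_agent}


def invalid_header_names(headers):
    """Return the header names that are not valid HTTP tokens."""
    return [name for name in headers if not _HEADER_NAME.match(name)]


good = build_headers("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
bad = {"User - Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

print(invalid_header_names(good))
print(invalid_header_names(bad))
```

An empty list means every header name is well formed; any name that comes back should be corrected before the request is sent.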

3. Summary of response syntax

response.encoding  # the current encoding
response.encoding = 'utf-8'  # set the encoding to UTF-8
response.content  # the raw bytes of the response
response.content.decode()  # decode the raw bytes into a string (UTF-8 by default)
