Python Web Scraping Basics, Day 03


CSDN时光

Tags: web crawler

Article link: https://blog.csdn.net/qq_42584444/article/details/83409711


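This article is day three of an introductory Python web scraping tutorial. It mainly covers the requests module, including GET and POST requests, Cookie and Session, and proxy settings. It also introduces the Handler classes in urllib.request, including HTTPHandler, ProxyHandler and authentication handling, touches on SSL certificate verification and proxy use, and gives a guide to installing selenium and the Scrapy framework.

Summary generated automatically by CSDN.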

1. The requests module

2. Handler classes in urllib.request

day02

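1. Parsing with regular expressions

Groups: wrap whatever you want to capture in parentheses ()
Regex methods
p = re.compile('...')
r_list = p.findall(html)
Result: [(), (), (), ()], a list with one tuple per match when the pattern has several groups
Greedy match: .*
Non-greedy match: .*?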
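
As a small illustration of grouping and of greedy versus non-greedy matching, the sketch below runs a two-group pattern over an HTML fragment. The markup is made up for the example and does not come from any real site.

```python
import re

# Hypothetical HTML fragment, made up for this example.
html = '''<div class="item"><a href="/a1">Title one</a></div>
<div class="item"><a href="/a2">Title two</a></div>'''

# Two groups -> findall returns a list of 2-tuples.
# re.S lets "." also match newlines, which helps with multi-line HTML.
p = re.compile(r'<a href="(.*?)">(.*?)</a>', re.S)
r_list = p.findall(html)
print(r_list)

# Greedy vs. non-greedy on the same text:
# ".*" runs on to the last ">" in the text, ".*?" stops at the first one.
greedy = re.findall(r'<div.*>', html, re.S)
non_greedy = re.findall(r'<div.*?>', html, re.S)
print(len(greedy), len(non_greedy))
```

The greedy pattern swallows the whole fragment in a single match, while the non-greedy one yields one short match per opening tag, which is why scraping patterns almost always use .*? inside the groups.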

2. Scraping steps

Find the URL
Write the regular expression
Define a class and write the program skeleton
Fill in the code
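
The four steps might translate into a skeleton like the following. The class name, URL template and pattern are placeholders invented for illustration, and the download step uses the standard-library urllib so that the sketch has no third-party dependency.

```python
import re
from urllib import request


class PageSpider:
    """Hypothetical skeleton: URL, regex, class framework, then the details."""

    def __init__(self):
        # Step 1: the URL (placeholder template, not a real site).
        self.base_url = "http://example.com/list?page={}"
        self.headers = {"User-Agent": "Mozilla/5.0"}
        # Step 2: the regular expression, one group per field to capture.
        self.pattern = re.compile(r'<li>(.*?)-(.*?)</li>', re.S)

    def get_page(self, url):
        """Download one page and return it as text."""
        req = request.Request(url, headers=self.headers)
        with request.urlopen(req, timeout=10) as res:
            return res.read().decode("utf-8")

    def parse_page(self, html):
        """Apply the regex; returns a list of tuples."""
        return self.pattern.findall(html)

    def work_on(self, pages):
        """Main loop: fetch and parse each page in turn."""
        for page in range(1, pages + 1):
            html = self.get_page(self.base_url.format(page))
            for item in self.parse_page(html):
                print(item)


# The parsing step can be tried offline on a made-up fragment:
spider = PageSpider()
sample = "<li>apple-3</li><li>pear-5</li>"
print(spider.parse_page(sample))
```

Keeping download and parsing in separate methods makes it possible to test the regular expression without touching the network.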

3. Saving to a CSV file

import csv
with open('xxx.csv', 'a', newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow([..., ..., ...])
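
A slightly fuller sketch of the same pattern, writing a header row and several records and then reading them back. The file name and rows are made up, and utf-8 is assumed as the encoding.

```python
import csv
import os
import tempfile

# Made-up rows; in a crawler these would be the tuples returned by findall().
rows = [("Title one", "/a1"), ("Title two", "/a2")]

# Write into a temporary directory rather than the working directory.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")

# newline="" stops the csv module from producing blank lines on Windows.
with open(path, "a", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "link"])   # header row
    writer.writerows(rows)               # several rows in one call

# Read the file back to check what was written.
with open(path, newline="", encoding="utf-8") as f:
    saved = list(csv.reader(f))
print(saved)
```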

4. Commonly used Fiddler menus

Inspectors: split into a request pane and a response pane
Commonly used tabs
1. Headers
2. WebForms
3. Raw: the request as plain text

5. Cookie and session

cookie: stored on the client
session: stored on the web server

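6. Request methods

GET
POST
Simulating a login with a Cookie
1. Log in successfully once and capture the Cookie with a packet-capture tool
2. Convert the request headers (including the Cookie) into a dict and pass it as a parameter when sending the request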
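
Converting captured headers into a dict is tedious by hand. A helper along these lines can do it from the raw text copied out of the capture tool. The function name and the header values (including the cookie) are invented for illustration.

```python
def headers_to_dict(raw):
    """Convert 'Name: value' lines copied from a packet-capture tool into a dict."""
    headers = {}
    for line in raw.strip().splitlines():
        # Split on the first colon only: values such as URLs contain colons too.
        name, sep, value = line.partition(":")
        name = name.strip()
        # Skip anything that is not a header, e.g. the request line:
        # header names never contain spaces.
        if sep and name and " " not in name:
            headers[name] = value.strip()
    return headers


# Made-up request headers; a real Cookie would come from a logged-in session.
raw = """
GET /profile HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/5.0
Referer: http://www.example.com/login
Cookie: sessionid=abc123; username=demo
"""
headers = headers_to_dict(raw)
print(headers["Cookie"])
```

The resulting dict can then be passed as headers=headers when sending the request.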

7. Installing modules

Anaconda Prompt: conda install module_name
Windows cmd: python -m pip install module_name

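8. The requests module

get(url, params=params, headers=headers)
params: query parameters as a dict; no manual encoding or URL concatenation needed
post(url, data=data, headers=headers)
data: form data as a dict; no manual encoding or conversion to bytes needed
Response object attributes
1. encoding: character encoding of the response, e.g. res.encoding = 'utf-8'
2. text: the body as a string
3. content: the body as bytes
4. status_code: the HTTP status code
5. url: the URL that actually returned the data
Saving unstructured data
html = res.content
with open("XXX", "wb") as f:
    f.write(html)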
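
requests accepts params and data as plain dicts and encodes them for you. To make that concrete, the sketch below does by hand, with the standard library, roughly what requests does internally. The URL and field names are placeholders, and no request is actually sent.

```python
from urllib import parse

base_url = "http://www.example.com/s"      # placeholder URL
params = {"wd": "爬虫", "pn": 10}           # query parameters as a plain dict

# With urllib you would percent-encode and join the query string yourself.
query = parse.urlencode(params)
full_url = base_url + "?" + query
print(full_url)

# Form data for a POST goes through the same encoding, then into bytes.
data = {"user": "demo", "note": "a b"}
body = parse.urlencode(data).encode("utf-8")
print(body)

# text vs. content: content is the raw bytes, text is those bytes
# decoded with the response's encoding.
content = "爬虫".encode("utf-8")
text = content.decode("utf-8")
print(type(content).__name__, type(text).__name__)
```

This is also why binary data is saved from content with the file opened in "wb" mode: decoding it to text first would corrupt anything that is not actually text.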

day03

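1. The requests module

Proxies (parameter name: proxies)

Sites that list proxy IPs
Xici Proxy (西刺代理)
Kuaidaili (快代理)
Quanwang Proxy (全网代理)

Ordinary proxy

proxies = {'protocol': 'protocol://IP address:port'}
proxies = {'http': 'http://203.86.26.9:3128'}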

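'''01_ordinary_proxy_example.py'''
import requests

url = "http://www.baidu.com/"
# The key is the scheme of the target URL; the value is the proxy address
proxies = {"http":"http://183.129.207.82:11597"}
headers = {"User-Agent":"Mozilla/5.0"}

# Send the request through the proxy and print the status code
res = requests.get(url,proxies=proxies,headers=headers)
print(res.status_code)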

Private proxy (requires a username and password)
proxies = {"http":"http://309435365:szayclhp@123.206.119.108:16817"}

'''02_private_proxy_example.py'''
import requests

url = "http://httpbin.org/get"
headers = {"User-Agent":"Mozilla/5.0"}
# The username and password go in front of the proxy address, separated from it by @
proxies = {"http":"http://309435365:szayclhp@123.206.119.108:16817"}

res = requests.get(url,proxies=proxies,headers=headers)
res.encoding = "utf-8"
print(res.text)
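
Credentials that contain characters such as @ or : would break the proxy URL unless they are percent-encoded. A small helper along these lines builds the proxies dict for both schemes. The function name, address and credentials are made up for illustration.

```python
from urllib.parse import quote


def build_proxies(host, port, user=None, password=None):
    """Return a proxies dict in the format requests expects."""
    if user is not None and password is not None:
        # safe="" percent-encodes every reserved character,
        # including "@", ":" and "/".
        auth = "{}:{}@".format(quote(user, safe=""), quote(password, safe=""))
    else:
        auth = ""
    proxy_url = "http://{}{}:{}".format(auth, host, port)
    # Use the same proxy for http and https targets.
    return {"http": proxy_url, "https": proxy_url}


# Made-up address and credentials (192.0.2.0/24 is reserved for documentation).
proxies = build_proxies("192.0.2.10", 8080, user="demo", password="p@ss:word")
print(proxies["http"])
```

The result is passed as proxies=proxies, exactly as in the two proxy examples.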

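pymysql and pymongo review examples:

'''Create a MySQL database spiderdb, create table t1, insert two records'''
import pymysql
import warnings

# Create the database connection object
# (PyMySQL 1.0+ only accepts keyword arguments here)
db = pymysql.connect(host="localhost",user="root",
                     password="123456",charset="utf8")
# Create the cursor object
cursor = db.cursor()
# Suppress warnings, e.g. those raised when the database or table already exists
warnings.filterwarnings("ignore")
# Execute the statements
try:
    cursor.execute("create database if not exists spiderdb")
    cursor.execute("use spiderdb")
    cursor.execute("create table if not exists t1(id int)")
except Warning:
    pass

# Parameterized insert: %s is filled in from the list passed to execute()
ins = "insert into t1 values(%s)"
cursor.execute(ins,[1])
cursor.execute(ins,[2])
# Commit the transaction
db.commit()
# Close the cursor and the connection
cursor.close()
db.close()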

----------------------------------------------------------------------------------------

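'''04_pymongo_review.py'''

import pymongo

# Create the connection object
conn = pymongo.MongoClient("localhost",27017)
# Get the database object; spiderdb is the name of the database
db = conn.spiderdb
# Use the database object to get the collection object
myset = db.t1
# Perform the insert (the source listing is cut off at this point).
# Note: Collection.insert() was removed in PyMongo 4; insert_one() is the current equivalent.
myset.insert({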
