Python Study Notes 31: Web Crawlers (Part 1)

燕-孑

Last modified: 2024-03-14 16:47:34


Category: 猿课笔记_python. Tags: python, learning crawlers

First published: 2018-06-13 22:23:20

Article link: https://blog.csdn.net/u011200965/article/details/80686091


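1. What a crawler is

A web crawler (also called a web spider or web robot, and more often a "web chaser" in the FOAF community) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, automatic indexer, emulator, and worm.

Notice:

A crawler may only fetch public data from public websites. Data that others have encrypted, or that involves personal privacy, must not be scraped casually; if you do so anyway, you bear the consequences.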

Install the module:

pip install requests

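2. The ways HTTP defines for interacting with a server

GET: only retrieves information about a resource; it does not add or modify data.

POST: usually modifies a resource on the server; it is typically sent when a form is submitted.

PUT: adds a resource (or replaces the one at the given URL).

DELETE: deletes a resource.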
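To make the four methods concrete, here is a minimal sketch that builds (but never sends) one request per method with requests, so you can see where the data ends up. The httpbin.org paths other than /post and the sample payloads are assumptions chosen for illustration only.

```python
import requests

# Placeholder endpoints; nothing is sent over the network in this sketch.
BASE = "http://httpbin.org"
payload = {"key1": "hello", "key2": "world"}

# .prepare() builds the exact request that would be sent, without sending it.
get_req = requests.Request("GET", BASE + "/get", params=payload).prepare()
post_req = requests.Request("POST", BASE + "/post", data=payload).prepare()
put_req = requests.Request("PUT", BASE + "/put", json=payload).prepare()
delete_req = requests.Request("DELETE", BASE + "/delete").prepare()

for req in (get_req, post_req, put_req, delete_req):
    # GET carries its data in the URL; POST and PUT carry it in the body.
    print(req.method, req.url, req.body)
```

Sending the same requests for real would use requests.get, requests.post, requests.put, and requests.delete with the same arguments.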

a. GET parameters

import requests

params = {'key1': 'hello', 'key2': 'world'}
url = 'https://www.jd.com'
# requests encodes params into the query string of the URL
r = requests.get(url=url, params=params)
print(r.url)

Result:

https://www.jd.com/?key1=hello&key2=world

You can also write the query string into the URL directly:

requests.get('https://www.jd.com/?key1=hello&key2=world')

b. POST parameters

import requests

params = {'key1': 'hello', 'key2': 'world'}
# data= sends the dict as a form-encoded request body
r = requests.post("http://httpbin.org/post", data=params)
print(r.text)

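c. HTTP request

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2018\6\12 0012 21:59
# @Author : xiexiaolong
# @File : test2.py
import requests

url = "https://www.qiushibaike.com/"
r = requests.get(url)
#print(r.text)
print(r.encoding)    # charset requests uses to decode r.text
print(type(r.text))  # <class 'str'>
print(r.content)     # raw response body as bytes

Analysis: text and content are attributes of the response, not methods. r.text returns the body as str, decoded with r.encoding; r.content returns the raw bytes. Use text to read text, and content to read images, files, and other binary data.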
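The relationship between the two attributes can be sketched with plain Python, with no network involved. The sample string below is made up; `raw` stands in for what r.content would hold and `text` for what r.text would return when r.encoding is correct.

```python
# Bytes as they would arrive from the server (the role r.content plays).
raw = "糗事百科 python".encode("utf-8")

# Decoding with the right charset gives readable text (the role r.text plays).
text = raw.decode("utf-8")

# Decoding with the wrong charset does not fail here, it just yields mojibake.
garbled = raw.decode("iso-8859-1")

print(type(raw), type(text))
print(text)
print(garbled)
```

If r.text looks garbled like the last line, one common remedy is to assign the correct charset to r.encoding before reading r.text.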

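3. requests

requests is a third-party library, so install it first:

pip install requests

Common attributes of the response object r:

print(r.text)  # response body as text
print(r.request)  # <PreparedRequest [GET]>
print(r.headers)  # response headers (the request headers are in r.request.headers)
print(r.cookies)  # cookies set by the response
print(r.cookies['_xsrf'])  # values can be read like a dict
print(r.url)  # the URL that was requested
print(r.status_code)  # HTTP status code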

4. Request headers in requests

Request headers can be customized.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2018\6\12 0012 21:59
# @Author : xiexiaolong
# @File : test2.py
import requests

# Present the script to the server as a desktop Chrome browser
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36'}
r = requests.get('https://www.qiushibaike.com/', headers=header)
#print(r.text)
print(r.headers)  # headers returned by the server

Result:

D:\python\venv\Scripts\python.exe D:/python/0612/test2.py

{'Server': 'openresty', 'Date': 'Wed, 13 Jun 2018 14:05:00 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Content-Length': '17246', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Set-Cookie': '_xsrf=2|4f646747|f3cec628aa22983b51aedef5f158bf66|1528898700; Path=/', 'Vary': 'User-Agent, Accept-Encoding', 'Etag': '"8f86d9aee692a7787a2bc6a9d52925d2a2df37cd"'}

Process finished with exit code 0

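Analysis: the request goes out with the User-Agent we defined, so a script can imitate various browsers. Note that the dictionary printed above is r.headers, the server's response headers; the headers that were actually sent are in r.request.headers.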
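To check which User-Agent would actually be sent, the request can be prepared without sending it. This is a small sketch, not the author's method; it reuses the URL and User-Agent string from the example above and compares them with the default that requests applies.

```python
import requests

url = "https://www.qiushibaike.com/"
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36"
}

session = requests.Session()

# prepare_request merges the session defaults with per-request headers
# and returns the request that would be sent; no network traffic happens.
default_req = session.prepare_request(requests.Request("GET", url))
custom_req = session.prepare_request(requests.Request("GET", url, headers=header))

print(default_req.headers["User-Agent"])  # the library's own identifier
print(custom_req.headers["User-Agent"])   # the browser string defined above
```

Without a custom header the library identifies itself as python-requests, which some sites treat differently from a browser.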

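5. Session objects in requests

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2018\6\12 0012 21:59
# @Author : xiexiaolong
# @File : test2.py
import requests

s = requests.Session()
s.get('http://www.baidu.com')

Analysis: cookies and other state set by responses are kept in s and sent with later requests made through it. The spelling does not depend on the Python version: requests.Session is the class, and lowercase requests.session() is an older helper function that returns a Session. Both work on Python 2 and Python 3; new code normally uses requests.Session().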
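The way a session carries state from one request to the next can be sketched offline. The cookie name and value, the User-Agent string, and the /s search path below are made up for illustration; in real use the cookie would be stored by a response to s.get(...).

```python
import requests

s = requests.Session()

# State stored once on the session...
s.headers.update({"User-Agent": "my-crawler/0.1"})  # hypothetical UA
s.cookies.set("sessionid", "abc123", domain="www.baidu.com", path="/")  # made-up cookie

# ...is attached to every request prepared through it (nothing is sent here).
first = s.prepare_request(requests.Request("GET", "http://www.baidu.com/"))
second = s.prepare_request(
    requests.Request("GET", "http://www.baidu.com/s", params={"wd": "python"})
)

for req in (first, second):
    print(req.url, req.headers["User-Agent"], req.headers["Cookie"])
```

A plain requests.get call has no such memory, which is why logins and other multi-step flows are usually done through one session object.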

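6. Cookies

requests uses a session to hold cookies and send them back on later requests. A cookie has five elements: name, value, domain, path, and expires.

Commonly used cookie attributes:

1. Domain: the domain the cookie belongs to

2. Path: the path the cookie applies to

3. Expires: the expiry time

4. name: the cookie's key

5. value: the value stored under that key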
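These five elements can be read out of a raw Set-Cookie header with the standard library. The sketch below parses the Set-Cookie value that appears in the response headers printed in section 4; the `fields` dictionary is just a name chosen here for display.

```python
from http.cookies import SimpleCookie

# Set-Cookie value copied from the response headers shown in section 4.
set_cookie = "_xsrf=2|4f646747|f3cec628aa22983b51aedef5f158bf66|1528898700; Path=/"

parsed = SimpleCookie()
parsed.load(set_cookie)
morsel = parsed["_xsrf"]

# Attributes the server did not send come back as empty strings.
fields = {
    "name": morsel.key,
    "value": morsel.value,
    "domain": morsel["domain"],
    "path": morsel["path"],
    "expires": morsel["expires"],
}
for key, value in fields.items():
    print(key, "=", repr(value))
```

Here the server sent only name, value, and Path, so the cookie falls back to the request host as its domain and lasts for the browsing session only.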

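The domain of a cookie is the domain it belongs to. By default this is the host of the request: for a request to http://www.server1.com/files/hello, a Set-Cookie header in the response that names no Domain yields a cookie for www.server1.com, and browsers also organize their stored cookies by domain. A response can set the cookie's domain to some other value, but the browser will not save a cookie whose domain is unrelated to the host it came from (a parent domain such as server1.com is accepted).

The path of a cookie narrows access further. With path=/, every request to the domain can access the cookie. With any other value, for example path=/test, only requests under /test can access it.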
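The domain and path rules can be tried out with the standard library's cookie jar, which applies the same kind of matching when it decides which cookies to attach to a request. This is a sketch under stated assumptions: the cookie names and values and the www.server2.com host are invented, and `make_cookie` and `cookie_header_for` are hypothetical helpers defined here, not part of any library.

```python
from http.cookiejar import Cookie, CookieJar
from urllib.request import Request


def make_cookie(name, value, domain, path):
    """Build a session cookie (no expiry) scoped to one domain and path."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path=path, path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def cookie_header_for(jar, url):
    """Return the Cookie header the jar would attach to a request for url."""
    request = Request(url)  # only constructed, never sent
    jar.add_cookie_header(request)
    return request.get_header("Cookie")


jar = CookieJar()
jar.set_cookie(make_cookie("site_wide", "1", "www.server1.com", "/"))
jar.set_cookie(make_cookie("test_only", "2", "www.server1.com", "/test"))

# path=/ matches everything on the domain; path=/test only matches /test/...
root_cookies = cookie_header_for(jar, "http://www.server1.com/files/hello")
test_cookies = cookie_header_for(jar, "http://www.server1.com/test/page")
# A different domain receives no cookies at all.
other_cookies = cookie_header_for(jar, "http://www.server2.com/test/page")

print(root_cookies)
print(test_cookies)
print(other_cookies)
```

The cookie jar inside a requests session is built on this same standard-library machinery.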
