Python Crawler Study Notes

Mr_dark_

Tags: crawler, python, notes

Article link: https://blog.csdn.net/zyj493132456/article/details/79193635

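Introduction to crawlers

What is a crawler?

A program that automatically fetches data from the web.

What can a crawler do?

1. News data, e.g. Toutiao (今日头条)

2. Machine learning, e.g. collecting data for stock analysis

3. Web search engines, e.g. Baidu, Google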

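The HTTP protocol

Request headers (raw headers): sent by the client to the server

GET / HTTP/1.1 (method, path, protocol version)

Host: the host name of the remote server

Connection: keep-alive (the connection mode the client wants; here, keep the connection open)

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

The media types the client can accept; */* means any media type is accepted

User-Agent: identifies the browser (the client software)

Accept-Encoding: gzip, deflate, sdch (the content encodings the client can accept)

Accept-Language: the languages the client can accept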
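As a rough illustration of the request headers above, the sketch below builds a request object with Python 3's urllib.request.Request and inspects it without sending anything. The header values are made-up assumptions rather than values captured from a real browser.

```python
from urllib import request

# Illustrative header values (assumptions, not taken from a real browser).
headers = {
    "User-Agent": "Mozilla/5.0 (compatible; demo-crawler)",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
}
req = request.Request("http://www.baidu.com/", headers=headers)

print(req.get_method())              # GET, because no request body (data) was given
print(req.host)                      # host part of the URL, sent as the Host header
print(req.selector)                  # path part of the request line
print(req.get_header("User-agent"))  # Request stores header names capitalized like this
# request.urlopen(req) would actually send the request; it is left out here
# so that the sketch runs without network access.
```

Accept-Encoding is deliberately not set: urllib does not decompress gzip bodies on its own, so asking for them would mean handling decompression by hand.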

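Response: sent by the server back to the client

HTTP/1.1 200 OK (protocol version, status code (200 means success), reason phrase)

Server: identifies the server software that answered

Content-Type: text/html;charset=utf-8 (the type of data in the response; text/html is web page data)

Last-Modified: when the page was last modified

Content-Encoding: gzip (the encoding applied to the body; compression reduces the amount of data sent over the network)

Content-Length: 4315 (size of the response body in bytes)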

Status code classes

2xx: success

200: OK

206: Partial Content

3xx: redirection

4xx: client error

400: Bad Request (the request is malformed)

404: Not Found (the requested resource does not exist)

5xx: server error
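The classes above can be sketched with the standard library's http.HTTPStatus enum. The describe helper and the CLASSES table are hypothetical names introduced only for this example.

```python
from http import HTTPStatus

# Hypothetical lookup table: first digit of the status code -> class name.
CLASSES = {2: "success", 3: "redirection", 4: "client error", 5: "server error"}


def describe(code):
    """Return (class name, reason phrase) for a numeric HTTP status code."""
    return CLASSES.get(code // 100, "other"), HTTPStatus(code).phrase


for code in (200, 206, 400, 404):
    print(code, *describe(code))
```

A crawler would typically branch on the class (the first digit) rather than on every individual code.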

Introduction to URLs

Components

scheme + host + path + parameters, e.g. http://www.baidu.com/?t=1
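A small sketch of how the example URL breaks down into those components, using urllib.parse.urlsplit from Python 3's standard library:

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.baidu.com/?t=1")

print(parts.scheme)   # http
print(parts.netloc)   # www.baidu.com
print(parts.path)     # /
print(parts.query)    # t=1
```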

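Encoding rules

Apart from English letters, digits, and a few symbols, every character is encoded as a percent sign followed by the hexadecimal value of each of its bytes (UTF-8 bytes in this example).

e.g. searching Baidu for "中文":

https://www.baidu.com/s?wd=%E4%B8%AD%E6%96%87&rsv_spt=1&rsv_iqid=0xe9075950000062e4&issp=1......

wd=%E4%B8%AD (中) %E6%96%87 (文)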
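The encoding rule can be reproduced with urllib.parse.quote and unquote in Python 3, which use UTF-8 by default. This is a minimal sketch of the rule, not a description of how Baidu builds its URLs.

```python
from urllib.parse import quote, unquote

encoded = quote("中文")
print(encoded)                   # %E4%B8%AD%E6%96%87

decoded = unquote(encoded)
print(decoded)                   # 中文

# Letters and digits pass through unchanged.
print(quote("python3"))          # python3
```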

Parameter rules

The parameters start after a question mark; each parameter is a key=value pair, and pairs are joined with &

https://www.baidu.com/s?wd=%E4%B8%AD%E6%96%87&rsv_spt=1&rsv_iqid=0xe9075950000062e4&issp=1....
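To illustrate the parameter rules, the sketch below splits a query string into pairs and builds it back with urllib.parse. It uses only the parameters visible in the URL above; the elided tail of that URL is left out.

```python
from urllib.parse import parse_qsl, urlencode

query = "wd=%E4%B8%AD%E6%96%87&rsv_spt=1&rsv_iqid=0xe9075950000062e4&issp=1"

# parse_qsl splits on & and =, and percent-decodes each key and value.
pairs = parse_qsl(query)
for key, value in pairs:
    print(key, "=", value)

# urlencode does the reverse: it percent-encodes and joins the pairs with &.
rebuilt = urlencode(pairs)
print(rebuilt == query)
```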

Introduction to cookies

Sample cookie data

Set-Cookie: dbcl2="17325****:IyV6D+oymuk"; path=/; domain=.douban.com; httponly

Set-Cookie: as="deleted"; max-age=0; domain=.douban.com; path=/; expires=Thu, 01-Jan-1970 00:00:00 GMT

Cookie: ll="108288"; bid=GEVDR-QjYjU; __guid=236236167.1529919093741274400.1517206363829.634; __utmt=1; ps=y; ck=NGED; monitor_count=2; _pk_id.100001.8cb4=a1e90d002918e14a.1517206365.1.1517206464.1517206365.; _pk_ses.100001.8cb4=*; __utma=30149280.2142443351.1517206367.1517206367.1517206367.1; __utmb=30149280.2.10.1517206367; __utmc=30149280; __utmz=30149280.1517206367.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); dbcl2="17325****:IyV6D+oymuk"

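Cookie format

When the client sends cookies:

Cookie: key1=value1; key2=value2

When the server asks the client to store a cookie:

Set-Cookie: key1=value; path=/; domain=xx

domain and path: define the scope of the cookie. When a domain is specified, the cookie applies to that domain and all of its subdomains.

expires: defines the lifetime of the cookie

HttpOnly: forbids script access (the browser only sends the cookie in HTTP requests, and page scripts such as JavaScript cannot read it, which makes it harder to steal)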
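The two header formats can be explored with http.cookies.SimpleCookie from the standard library. This is only a sketch: the cookie names, values, and the .example.com domain are made up for illustration.

```python
from http.cookies import SimpleCookie

# A Set-Cookie value as a server might send it (illustrative values).
jar = SimpleCookie()
jar.load("key1=value1; Path=/; Domain=.example.com; HttpOnly")
morsel = jar["key1"]
print(morsel.value)              # the cookie value
print(morsel["path"])            # scope: path
print(morsel["domain"])          # scope: domain and its subdomains
print(bool(morsel["httponly"]))  # flag attribute, set when present

# A Cookie request header carries only name=value pairs, without attributes.
sent = SimpleCookie()
sent.load("key1=value1; key2=value2")
cookie_header = "; ".join("%s=%s" % (name, m.value) for name, m in sent.items())
print("Cookie: " + cookie_header)
```

Note that attributes such as path, domain, and HttpOnly only appear in Set-Cookie; the client uses them to decide when to send the cookie but never echoes them back.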

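What cookies are used for

1. Login state: tells the server whether the user is logged in (e.g. the "remember password" feature)

2. Shopping cart: stores the items the user wants to buy

Cookie summary

1. Information the server stores on the client (generated by the server and interpreted by the server).

2. On each request, the client must send its unexpired cookies back to the server.

3. On each response, the server sends any new cookies to the client so that the client carries them on its next visit.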

Introduction to urllib

urllib.urlopen

urlopen usage in Python 3.x

from urllib import request
s = request.urlopen('http://www.baidu.com')  # the "http://" scheme prefix is required
for i in range(10):
    print('line %d: %s' % (i + 1, s.readline()))  # formatted output of the first 10 lines
print(s.getcode())  # get the HTTP status code of the response

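HTTPMessage

# _*_ coding: utf-8 _*_
# Python 2.x code. In Python 3, urlopen lives in urllib.request, and HTTPMessage
# no longer has headers or getheader() (use msg.get() instead).
import urllib
s = urllib.urlopen('http://www.baidu.com')
# print(s.readlines())
msg = s.info()
print(msg.headers)   # ['Date: Mon, 29 Jan 2018 09:27:27 GMT\r\n', 'Content-Type: text/html; ...   raw HTTP header lines
print(msg.items())   # [('bdqid', '0xe19f7a9c000457fd'), ('x-powered-by', 'HPHP'), ...   headers as (name, value) pairs
print(dir(msg))      # ['__contains__', '__delitem__', '__doc__', '__getitem__', ...   dir() lists the object's attributes and methods
print(msg.getheader('Content-Type'))  # text/html; charset=utf-8   value of a single header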
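For comparison, here is a rough Python 3 sketch of the same header inspection. To keep it runnable without internet access it talks to a tiny local server (DemoHandler, a hypothetical stand-in) instead of www.baidu.com, so the headers it prints are those of the stand-in, not Baidu's.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request

BODY = b"<html><body>hello crawler</body></html>"


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in server so the example needs no real site."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: the OS picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_address[1]

# An opener with an empty ProxyHandler ignores proxy environment variables.
opener = request.build_opener(request.ProxyHandler({}))
with opener.open(url, timeout=5) as resp:
    status = resp.getcode()                   # HTTP status code
    msg = resp.info()                         # http.client.HTTPMessage
    content_type = msg.get("Content-Type")    # replaces Python 2's getheader()
    header_names = [name for name, _ in msg.items()]
    first_line = resp.readline()

server.shutdown()
server.server_close()

print(status)
print(content_type)
print(header_names)
print(first_line)
```

Against a real site, only the url line would change; the calls on resp and msg stay the same.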

urllib.urlretrieve

# _*_ coding: utf-8 _*_
# Python 2.x code. In Python 3 the function is urllib.request.urlretrieve.
import urllib
fname, msg = urllib.urlretrieve('http://www.baidu.com', 'index.html')
print(fname)         # index.html   name of the saved file
print(msg.items())   # [('bdqid', '0x97db4c72000002c5'), ('x-powered-by', 'HPHP'), ...   HTTP headers

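reporthook

# _*_ coding: utf-8 _*_
import urllib  # Python 2.x; in Python 3 use urllib.request.urlretrieve
def progress(blk, blk_size, total_size):
    # total_size is -1 when the server sends no Content-Length header
    done = blk * blk_size
    if total_size > 0:
        print('%d/%d - %.03f%%' % (done, total_size, done * 100.0 / total_size))
    else:
        print('%d bytes downloaded, total size unknown' % done)
urllib.urlretrieve('http://www.baidu.com', 'index.html', reporthook=progress)   # prints the download progress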
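A possible Python 3 version of the progress hook is sketched below. So that it runs offline, it downloads from a local file:// URL (a temporary file standing in for a web page, which is an assumption of this example); for a real page, only the URL passed to urlretrieve would change.

```python
import os
import tempfile
from pathlib import Path
from urllib import request

percents = []  # progress values seen by the hook, kept for inspection


def progress(blk, blk_size, total_size):
    """reporthook: blocks transferred so far, block size in bytes, total size (-1 if unknown)."""
    done = blk * blk_size
    if total_size > 0:
        done = min(done, total_size)  # the last block is usually only partly filled
        percent = done * 100.0 / total_size
        percents.append(percent)
        print("%d/%d - %.1f%%" % (done, total_size, percent))
    else:
        print("%d bytes so far (total size unknown)" % done)


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "source.html"   # hypothetical local stand-in for a web page
    src.write_bytes(b"x" * 20000)
    dst = os.path.join(tmp, "index.html")
    fname, msg = request.urlretrieve(src.as_uri(), dst, reporthook=progress)
    copied = Path(fname).read_bytes()

print(fname == dst, len(copied))
```

Capping done at total_size keeps the final report at 100% instead of overshooting, and the total_size check avoids the negative percentages that appear when no Content-Length is available.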