Python Web Crawlers (Liao Xuefeng's Commercial Crawler)

niuyoudao

Article link: https://blog.csdn.net/baidu_41867252/article/details/86821355

day01

HTTP

HyperText Transfer Protocol

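Enter a URL in the address bar
Request methods

(1) GET: convenient. Drawbacks: the parameters travel in the URL in plain view, so account names and passwords leak easily, and the URL length is limited.

Example: https://www.baidu.com/s?wd=http与https图解

(2) POST: the parameters travel in the request body rather than in the URL, there is no practical limit on data size, and files can be uploaded (e.g. Baidu Cloud). The body is only encrypted when the request goes over HTTPS.

Note:
A request may (but need not) carry data to the server, in the request headers, the query string, or the body; the server sends back a response.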

Request Headers

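Accept: the content types the client will accept in the response
Accept-Encoding: the compression encodings the client accepts, e.g. gzip
Connection: persistent or short-lived connection, e.g. keep-alive
Cookie: state stored by the client and sent back to the server, used for authentication
Host: the domain name, e.g. www.baidu.com
Referer: the page the request was linked from
User-Agent: browser and client information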

Response Headers

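1. Cache-Control: the caching policy for the response
2. Date: the time the response was generated
3. Expires: the time after which the cached response is considered stale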
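
The three response headers above can be read with the standard library. The following is a minimal sketch that runs offline: the header values in `sample_headers` are made-up examples, and `parse_cache_control` is a hypothetical helper written for this illustration, not part of any library.

```python
from email.utils import parsedate_to_datetime

# Made-up header values, shaped like what a server might send back.
sample_headers = {
    "Cache-Control": "max-age=3600, public",
    "Date": "Wed, 06 Feb 2019 08:00:00 GMT",
    "Expires": "Wed, 06 Feb 2019 09:00:00 GMT",
}


def parse_cache_control(value):
    """Split a Cache-Control value into a dict of directives (hypothetical helper)."""
    directives = {}
    for part in value.split(","):
        name, _, arg = part.strip().partition("=")
        # Directives without an argument (e.g. "public") are stored as True.
        directives[name.lower()] = arg or True
    return directives


directives = parse_cache_control(sample_headers["Cache-Control"])
date = parsedate_to_datetime(sample_headers["Date"])
expires = parsedate_to_datetime(sample_headers["Expires"])

# Seconds between the response being generated and going stale.
lifetime = (expires - date).total_seconds()
print(directives, lifetime)
```

With a real response from urllib, the same values should be available through response.headers["Date"] and so on.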

Miscellaneous

What crawlers are worth:

Buying and selling data
Data analysis and analysis reports
Traffic

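Legality:
Where no law covers the case, crawling is neither clearly legal nor clearly illegal (the company angle, as presented in the course: if the company tells you to crawl a database or steal secrets, the responsibility lies with the company)

Note: a crawler cannot fetch arbitrary data, only what the user account can already access. On iQiyi (VIP and non-VIP video), for example, an ordinary user can only crawl non-VIP resources.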

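Types of crawler:

General-purpose crawler: used by search engines. Strengths: open and fast. Weakness: no specific target.
Focused crawler: also called a topical crawler. Strengths: a clear target, precise matching of user needs, well-defined results.
Incremental crawler: revisits a site to pick up new or changed pages; the example in the course is paging from the first page to the last.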

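robots.txt:
Declares which content other crawlers may or may not fetch. It is advisory only, and focused crawlers often do not follow it. To view it: www.baidu.com/robots.txt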
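
A crawler that does want to honor robots.txt can check it with the standard library's urllib.robotparser. This is a minimal sketch under stated assumptions: the rules in `sample_rules`, the example.com URLs, and the `my-crawler` agent name are invented for illustration, and the rules are parsed from a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content; a real crawler would call
# parser.set_url(".../robots.txt") followed by parser.read() instead.
sample_rules = """\
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)

# Rules are checked in order, so /private/ is refused before "Allow: /" is reached.
allowed = parser.can_fetch("my-crawler", "http://example.com/index.html")
blocked = parser.can_fetch("my-crawler", "http://example.com/private/data.html")
print(allowed, blocked)
```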

Crawlers versus anti-crawler measures: with equal resources on both sides, the crawler wins

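How a crawler works

Steps:

Decide on the target URL
Send the request from Python code and get the data back (other languages such as Go or Java work too)
Parse the data to extract exactly what is needed (and find new URLs, then go back to step one; repeat until every page has been fetched)
Persist the data, either to local files or to a database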
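
The steps above can be sketched as a small loop. This is a minimal sketch, not the course's code: `LinkParser`, `crawl`, the `fetch` callable, and the `fake_site` pages are all hypothetical names introduced here, and an in-memory dict stands in for real HTTP requests so the example runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag (step 3: parse and find new URLs)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; fetch is any callable mapping a URL to an HTML string."""
    queue = deque([start_url])          # step 1: target URLs still to visit
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)               # step 2: send the request
        pages[url] = html               # step 4: keep the result (here, in memory)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:    # never queue the same page twice
                seen.add(absolute)
                queue.append(absolute)
    return pages


# Invented three-page site used in place of the network.
fake_site = {
    "http://example.com/": '<a href="/a.html">A</a><a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/">home</a>',
    "http://example.com/b.html": "<p>no links</p>",
}
pages = crawl("http://example.com/", fake_site.__getitem__)
print(sorted(pages))
```

Passing `fetch` in as a parameter keeps the loop independent of how pages are downloaded; a function built on urllib.request.urlopen could be swapped in for the dict lookup.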

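Other points:
Python 3's built-in module urllib.request
a. urlopen returns a response object; response.read() reads the data as bytes, or use response.read().decode('utf8') to get a string
b. GET parameters: Chinese characters in the URL raise an error (the request line is encoded as ASCII, which has no Chinese characters, so they must be percent-encoded first)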

demo

Download an image and save it

import requests

url = 'https://ss0.baidu.com/73x1bjeh1BF3odCf/it/u=1855917097,3670624805&fm=85&s=C110C5384B62720D4068C5D7030080A3'
r = requests.get(url, timeout=30)
# raise_for_status() returns None on success and raises HTTPError on a 4xx/5xx response
r.raise_for_status()
# Image data is binary, so write r.content (bytes), not r.text
with open('图片.jpg', 'wb') as fp:
    fp.write(r.content)

Fetch the Baidu home page

import urllib.request

def load_data():
    url = 'https://www.baidu.com/'
    # Send an HTTP GET request
    response = urllib.request.urlopen(url)
    print(response)
    data = response.read()
    # data is a bytes object, so decode it into a str
    str_data = data.decode('utf8')
    print(str_data)

load_data()

Same as above, saving the page to a file

import urllib.request

def load_data():
    url = 'http://www.baidu.com/'
    # Send an HTTP GET request
    response = urllib.request.urlopen(url)
    # print(response)
    data = response.read()
    # data is a bytes object, so decode it into a str
    str_data = data.decode('utf8')
    # print(str_data)
    # The saved page is a local copy, so its search box does not work
    with open('baidu.html', 'w', encoding='utf8') as fp:
        fp.write(str_data)
    # Going the other way: convert a str to bytes
    str_name = 'hello'
    bytes_name = str_name.encode('utf8')
    print(bytes_name)


load_data()

GET request

import urllib.request
import urllib.parse
import string

def get_method():
    # url = 'https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3'
    url = 'http://www.baidu.com/s?wd='
    # The URL contains Chinese characters, which must be percent-encoded or urlopen raises an error
    name = '美女'
    final_url = url + name
    # print(final_url)
    # Percent-encode only the non-ASCII characters; safe=string.printable leaves ':', '/', '?' and '=' alone
    encode_url = urllib.parse.quote(final_url, safe=string.printable)
    print(encode_url)
    # Send the request
    response = urllib.request.urlopen(encode_url)
    print(response)
    # Read the content
    data = response.read().decode('utf8')
    print(data)
    with open('meinv.html', 'w', encoding='utf8') as fp:
        fp.write(data)


get_method()

day02

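1. GET parameters

(1) Chinese characters raise an error: ASCII has no Chinese characters, so those in the URL must be percent-encoded
urllib.parse.quote(url, safe=string.printable)
(2) Passing parameters as a dict
urllib.parse.urlencode(params)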
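
The two calls above differ in what they take: quote works on a whole URL string, urlencode on a dict of parameters. A minimal sketch that runs offline (the variable names are chosen for this illustration):

```python
import string
import urllib.parse

# quote: percent-encode the non-ASCII characters of a full URL.
# safe=string.printable keeps every printable ASCII character unchanged.
raw_url = "http://www.baidu.com/s?wd=美女"
quoted_url = urllib.parse.quote(raw_url, safe=string.printable)
print(quoted_url)   # http://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3

# urlencode: turn a dict into key=value pairs joined by '&',
# percent-encoding the values along the way.
params = {"wd": "中文", "key": "zhang"}
query = urllib.parse.urlencode(params)
print(query)        # wd=%E4%B8%AD%E6%96%87&key=zhang
```

Because urlencode already encodes the values, a URL built from its output does not need a second pass through quote.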

2. POST requests

urllib.request.urlopen(url, data=b"the data the server receives")  # data must be bytes
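
A minimal sketch of preparing a POST with urllib. The endpoint URL and the form fields are hypothetical placeholders, and the line that actually sends the request is commented out so the example runs offline.

```python
import urllib.parse
import urllib.request

url = "http://example.com/login"                 # hypothetical endpoint
form = {"username": "zhang", "password": "san"}  # hypothetical form fields

# urlencode produces a str; urlopen and Request need bytes for data.
body = urllib.parse.urlencode(form).encode("utf8")

# Supplying data switches the request method from GET to POST.
request = urllib.request.Request(url, data=body)
method = request.get_method()
print(method, body)

# To send it for real:
# response = urllib.request.urlopen(request, timeout=10)
# print(response.read().decode("utf8"))
```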

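3. User-Agent:

1) Makes the request look like it comes from a real browser (use case: batch searches on Baidu)
2) Where to get one: the browser's developer tools (Inspect), or search for a list of user-agent strings
3) How to set it: request.add_header (adds the User-Agent dynamically)
4) Response headers: response.headers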
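
One detail worth knowing when reading the header back: urllib stores header names with only the first letter capitalized, so the key becomes "User-agent". A minimal offline sketch (the user-agent string is just an example value):

```python
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36"
)

request = urllib.request.Request("http://www.baidu.com")
request.add_header("User-Agent", USER_AGENT)

# The name is normalized on the way in, so look it up as "User-agent".
stored = request.get_header("User-agent")
print(request.headers)
```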

4. IP proxies

1) Free IPs: short-lived, with a high error rate
2) Paid IPs: these also include some that have expired and no longer work

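5. Proxy types

1) Transparent: the server knows our real IP
2) Anonymous: the server does not know our real IP, but knows a proxy is in use
3) High anonymity: the server knows neither our real IP nor that a proxy is in use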

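6. Handlers: building a custom opener

1) The built-in urlopen() has no parameter for adding a proxy
2) A matching handler has to be created instead
Method: A. create a proxy handler with ProxyHandler(proxy); B. use it to create an opener with build_opener(); C. send the request with opener.open(url)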
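
Steps A to C might look like the sketch below. The proxy address is a hypothetical placeholder, not a working proxy, and the line that sends the request is commented out so the example runs offline.

```python
import urllib.request

# Hypothetical proxy address; replace it with a real one before sending requests.
proxy = {"http": "http://127.0.0.1:8888"}

# A. Create the proxy handler.
proxy_handler = urllib.request.ProxyHandler(proxy)

# B. Build an opener that routes http requests through the proxy.
opener = urllib.request.build_opener(proxy_handler)

# C. Send the request through the opener rather than urlopen():
# response = opener.open("http://www.baidu.com", timeout=10)
# print(response.read().decode("utf8"))

print(proxy_handler.proxies)
```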

demo

import urllib.request
import urllib.parse
import string

def get_params():
    url = 'http://www.baidu.com/s?'
    params = {'wd': '中文', 'key': 'zhang', 'value': 'san'}
    # urlencode turns the dict into key=value pairs joined by '&'
    str_params = urllib.parse.urlencode(params)
    print(str_params)
    final_url = url + str_params
    print(final_url)
    # urlencode() has already percent-encoded the values, so this step can be skipped
    end_url = urllib.parse.quote(final_url, safe=string.printable)
    print(end_url)

    response = urllib.request.urlopen(end_url)
    data = response.read().decode()
    print(data)

get_params()

import urllib.request

def load_baidu():
    url = 'http://www.baidu.com'
    # Create the request object
    request = urllib.request.Request(url)
    # print(request.headers)
    response = urllib.request.urlopen(request)
    data = response.read().decode('utf8')
    # Response headers
    # print(response.headers)
    # Request headers that were set on the Request object
    request_headers = request.headers
    print(request_headers)
    with open('02_baidu.html', 'w', encoding='utf8') as fp:
        fp.write(data)

load_baidu()

import urllib.request

def load_baidu():
    url = 'https://www.baidu.com'
    header = {
        "User-Agent":'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36 QIHU 360SE',
        "haha":"jeje"
    }
    # Create the request object with the headers attached
    request = urllib.request.Request(url,headers=header)
    # Alternative: add the header dynamically
    # request = urllib.request.Request(url)
    # request.add_header("User-Agent",'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36 QIHU 360SE')
    # print(request.headers)
    response = urllib.request.urlopen(request)
    data = response.read().decode('utf8')
    # Get the full URL
    final_url = request.get_full_url()
    print(final_url)
    # Response headers
    # print(response.headers)
    # Get the request headers
    # Print all of the header information
    request_hea
