Python Training Log: Day Three

llwvip

Category: Python training

Article link: https://blog.csdn.net/llwvip/article/details/107213309

Day Three (for learning purposes only)

Crawler frameworks
- requests
- bs4
Case study

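Data mining (crawler) frameworks

Purpose: automate repetitive work with scripts to improve efficiency
Types: web crawlers (public data), worm crawlers (malware)
Workflow:

Access tool (browser: client) - simulated browser (script: may count as malicious access) - fetch the page data
Parse the data - filter the data - store locally (txt, Word, Excel, Redis, MongoDB, MySQL)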

Learning path:

urllib3, re
requests, bs4 (the focus of this session)
Automation: Selenium
Distributed crawling: Scrapy

Installing third-party libraries:

cmd: pip install <library>
cmd: pip uninstall <library>
cmd: pip show <library>
cmd: pip list
(already included in Anaconda)
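
The result of an install can also be checked from Python itself. This is a minimal sketch using only the standard library; the helper name `is_installed` is an illustrative assumption, not something pip provides.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the import system can locate the module."""
    return importlib.util.find_spec(module_name) is not None


# Report on the two libraries used in these notes.
for name in ("requests", "bs4"):
    print(name, "installed" if is_installed(name) else "missing")
```

If a library shows as missing, `pip install` it and run the check again.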

The requests framework

Used to simulate a browser: send requests, submit data, and read the returned data

(Screenshot not preserved.) If the output shows the path of the .py file, the installation is correct

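1. Request functions

Request methods in HTTP:

GET: request through the address bar; the submitted data is appended to the path, so it is visible and limited in length (common)
POST: the data travels in the request body; it does not appear in the URL, and the size limit is set by the server (common)
DELETE
PUT
HEAD
OPTIONS
The Chinese requests documentation (https://requests.readthedocs.io/zh_CN/latest/) explains how to send requests.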

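import requests

response_get = requests.get('http://httpbin.org/get')  # httpbin.org is a public service for testing requests
response_post = requests.post('http://httpbin.org/post')

If this runs without raising an exception, a response was received; check status_code to confirm that the request actually succeeded.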

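2. Submitting parameters

A server endpoint (URL) receives parameters differently for the two request methods
Parameter names must match the names the server expects
response_get = requests.get('http://httpbin.org/get', params={'user': 'Mr.L'})
response_post = requests.post('http://httpbin.org/post', data={'user': 'Mr.L'})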

3. Reading the request URL

Request object (submitted parameters) - server - response object (returned information)

print(response_get.url)
print(response_post.url)

Result:
http://httpbin.org/get?user=Mr.L
http://httpbin.org/post
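
The difference between `params` and `data` can be inspected without sending anything. The sketch below builds the two requests from the notes with `requests.Request(...).prepare()` and prints where the parameter ends up; the variable names `get_req` and `post_req` are illustrative.

```python
import requests

# Build the requests without sending them over the network.
get_req = requests.Request(
    'GET', 'http://httpbin.org/get', params={'user': 'Mr.L'}
).prepare()
post_req = requests.Request(
    'POST', 'http://httpbin.org/post', data={'user': 'Mr.L'}
).prepare()

print(get_req.url)    # GET: the parameter is appended to the URL
print(post_req.url)   # POST: the URL is unchanged
print(post_req.body)  # POST: the parameter travels in the request body
```

This matches the result above: only the GET URL carries `?user=Mr.L`.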

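4. Response status and encoding

Status codes returned by the server:

200: request succeeded
403: access forbidden
500, 505: server-side errors

Page encodings:

utf-8
gbk
ISO-8859-1
…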

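# Send a request to Baidu
response_get = requests.get('https://www.baidu.com')
print('code-', response_get.status_code)
print('encoding-', response_get.encoding)

Result:
code- 200 # the request succeeded
encoding- ISO-8859-1 # the encoding requests will use to decode the page; it falls back to ISO-8859-1 when the response headers declare no charset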
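
A wrong encoding matters because the same bytes decode to different text. The sketch below uses plain byte strings rather than a live response, and the sample text is an assumption chosen for illustration. With requests, assigning `response_get.encoding = 'utf-8'` before reading `response_get.text` has the analogous effect.

```python
# Bytes as a UTF-8 encoded page would deliver them (sample text is illustrative).
raw = '百度一下'.encode('utf-8')

garbled = raw.decode('iso-8859-1')  # wrong codec: unreadable characters
readable = raw.decode('utf-8')      # right codec: the original text

print(garbled)
print(readable)
```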

5. Response content

Types of returned data:

html: parse with bs4
json: use the json module
xml: parse with bs4
text
files: use a download function

Aside: JSON data structure

Structure:

obj: [] array
obj: {} object (dictionary)
{name: value} property
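
A short sketch of how these structures map to Python types with the standard `json` module. The sample document is made up for illustration. For a live response, requests also offers `response.json()`.

```python
import json

# Hypothetical JSON text containing an object, an array, and a nested object.
raw = '{"user": "Mr.L", "langs": ["python", "sql"], "profile": {"age": 20}}'

data = json.loads(raw)         # JSON object -> dict, JSON array -> list
print(type(data).__name__)
print(data['langs'][0])        # index into the array
print(data['profile']['age'])  # follow the nested object

text = json.dumps(data, ensure_ascii=False)  # back to a JSON string
print(text)
```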

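6. Custom headers (disguise)

The server validates the client by checking parameters that the browser submits behind the scenes

In the browser's developer tools, the Network panel captures the URLs being accessed and shows the request details; press F5 to refresh
Select the first entry under Name; General > Status Code shows the status of the request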

response_get = requests.get('http://www.qianlima.com/zb/area_305')
print('code-', response_get.status_code)
This code returns 403, meaning the page cannot be accessed normally

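To disguise the script, find the parameters under Request Headers and copy them into custom headers
Commonly used: User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36 (user agent)

Explanation:

User-Agent: the user agent
Mozilla/5.0: a historical compatibility token that nearly all browsers send (not a Firefox version)
(Macintosh; Intel Mac OS X 10_15_4): system information
AppleWebKit/537.36: rendering engine
Chrome/83.0.4103.116 Safari/537.36: the Chrome version, followed by a Safari compatibility token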

head = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}
response_get = requests.get('http://www.qianlima.com/zb/area_305',headers=head)
print('code-',response_get.status_code)
# With this header the result is 200, which means the disguise worked
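
To see what the server is comparing, the sketch below prints the User-Agent that requests identifies itself with by default, then the one attached through `headers`. Nothing is sent over the network; the variable name `prepared` is illustrative.

```python
import requests

# What requests announces when no custom header is given.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # python-requests/<installed version>

head = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/83.0.4103.116 Safari/537.36'}

# Build the request without sending it and inspect the header that would go out.
prepared = requests.Request(
    'GET', 'http://www.qianlima.com/zb/area_305', headers=head
).prepare()
print(prepared.headers['User-Agent'])
```

A default value starting with `python-requests/` is easy for a server to reject, which is one plausible reason for the 403 seen earlier.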

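7. Cookie

Small pieces of data that the browser stores locally and sends back to the server with later requests

Reading cookies: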

response_get = requests.get(url='http://httpbin.org/cookies')
print('cookie-',response_get.cookies)

Submitting cookies:

cookie = {'name':'jack'}
response_get = requests.get(url='http://httpbin.org/cookies',cookies=cookie)
print('cookie-',response_get.text)
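
Submitting a cookie means adding a `Cookie` header to the outgoing request. A minimal offline sketch, assuming the same URL and cookie as above, prepares the request and prints that header instead of sending it.

```python
import requests

cookie = {'name': 'jack'}

# Prepare the request without sending it, then look at the Cookie header.
prepared = requests.Request(
    'GET', 'http://httpbin.org/cookies', cookies=cookie
).prepare()
cookie_header = prepared.headers.get('Cookie')
print(cookie_header)
```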

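8. Proxy IP

One host reaches another through an intermediary; the proxy carries out the request on its behalf

High-availability proxy IPs: https://ip.jiangxianli.com/

# keys are URL schemes; pass the dict as requests.get(url, proxies=pro)
pro = {
    'http': 'http://37.238.209.227:80',
    'https': 'http://110.232.252.234:8080',
}

Look up a separate proxy IP and port for http and for https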
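
How requests chooses a proxy for a given URL can be checked offline with `select_proxy` from `requests.utils`. Treating that helper as stable public API is an assumption; in normal use the dictionary is simply passed as `requests.get(url, proxies=pro)`. The sketch also shows that keys written as `'http://'` match nothing.

```python
from requests.utils import select_proxy

# Keys are URL schemes; values are proxy URLs (addresses taken from the notes).
pro = {
    'http': 'http://37.238.209.227:80',
    'https': 'http://110.232.252.234:8080',
}

http_proxy = select_proxy('http://httpbin.org/get', pro)
https_proxy = select_proxy('https://httpbin.org/get', pro)
print(http_proxy)
print(https_proxy)

# A key such as 'http://' is neither a scheme nor scheme://host, so it is ignored.
ignored = select_proxy('http://httpbin.org/get', {'http://': '37.238.209.227:80'})
print(ignored)
```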

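9. Connection timeout

requests has no default timeout: without one, a request can wait indefinitely for the server. Pass timeout (in seconds) to raise an exception once the limit is exceeded

timeout: sets the time limit

try:
    # a 0.05 s limit is almost certain to be exceeded
    response_get = requests.get(url='http://httpbin.org/cookies', timeout=0.05)
except Exception as e:  # handle the exception
    print(e)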
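
The timeout behaviour can be reproduced without depending on an external site. The sketch below is one way to do it, under stated assumptions: `SlowHandler` is a hypothetical local server that answers only after 0.5 s, and the request allows 0.1 s. It catches `requests.exceptions.Timeout` rather than the broad `Exception`.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers only after a 0.5 s delay."""

    def do_GET(self):
        time.sleep(0.5)
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'ok')
        except OSError:
            pass  # the client has already given up

    def log_message(self, *args):
        pass  # keep the console quiet


# Port 0 lets the operating system pick a free port.
server = HTTPServer(('127.0.0.1', 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f'http://127.0.0.1:{server.server_port}/'

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for this local test

try:
    session.get(url, timeout=0.1)  # the server needs 0.5 s, so this gives up first
    outcome = 'finished in time'
except requests.exceptions.Timeout:
    outcome = 'timed out'
finally:
    server.shutdown()
    server.server_close()

print(outcome)
```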

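bs4 (Beautiful Soup)

Makes it easy to parse HTML, XML, and similar source, with quick querying and modification, which saves development time
Official documentation:
https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

Import: from bs4 import BeautifulSoup

Sample HTML source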

html_doc = '<html><head><title>是谁送你来到我身边</title></head><body><a id="baidu">百度</a><a class="ali">阿里</a><p>是风</p></body></html>'

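The BeautifulSoup class parses HTML source using a parser

html.parser
html5lib
Make sure the parser you choose is available (html.parser is built into Python; html5lib is installed with pip)
soup = BeautifulSoup(html_doc, 'html5lib')
print(soup.title)  # read the title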

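1. Kinds of objects

Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All objects fall into four kinds: Tag, NavigableString, BeautifulSoup, Comment.

soup.TagName

soup.TagName looks up the first occurrence of that tag in the page and returns only that one tag
If the tag exists the result is a Tag; if it does not, the result is None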

print(soup.head)
print(soup.p)

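soup.TagName.string | soup.TagName.text

.string and .text return the text inside a tag
string: if the tag has a single child node, returns its content; with several child nodes the choice is ambiguous, so it returns None
text: returns the text of all descendant nodes of the tag

# show only the text content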
print(soup.title.string)
print(soup.p.text)
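
The difference between `.string` and `.text` shows up as soon as a tag has more than one child. A minimal sketch, assuming a small made-up document and the built-in `html.parser`:

```python
from bs4 import BeautifulSoup

# Illustrative document: the <p> tag has two children (a text node and a <b> tag).
html = ('<html><head><title>Demo</title></head>'
        '<body><p>It was <b>the wind</b></p></body></html>')
soup = BeautifulSoup(html, 'html.parser')

print(soup.title.string)  # single child: the text itself
print(soup.p.string)      # several children: None
print(soup.p.text)        # all nested text joined together
print(soup.table)         # a tag that does not exist: None
```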

soup.TagName.attrs: reading attributes

.attrs returns a tag's attributes as a dictionary {}; values can be read the way dictionary values are

attrs[key]
get(key)

print(soup.a.attrs)
print(soup.a.attrs['id'])
print(soup.a.get('id'))
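
The two access styles differ when the attribute is missing: indexing raises `KeyError`, while `get` returns `None`. A short sketch, assuming two anchors shaped like those in the sample HTML:

```python
from bs4 import BeautifulSoup

html = '<a id="baidu">Baidu</a><a class="ali">Ali</a>'
soup = BeautifulSoup(html, 'html.parser')
first, second = soup.find_all('a')

print(first.attrs)         # attribute dictionary of the first anchor
print(first['id'])         # dictionary-style access
print(first.get('class'))  # missing attribute: get() returns None
print(second['class'])     # class can hold several values, so it is a list

try:
    first['class']         # missing attribute: indexing raises KeyError
    missing_raises = False
except KeyError:
    missing_raises = True
print(missing_raises)
```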

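Advanced queries

Query functions return one or more tags that meet a condition

find_all(): all matches, returned as a list [Tag, Tag…]
find(): the first match, returned as a Tag
select(): all matches, returned as a list [Tag, Tag…]
select_one(): the first match, returned as a Tag

find parameters

name: tag name to match
attrs: tag attributes to match
limit: maximum number of results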

a_list = soup.find_all(name='a')
print(a_list)
a_list = soup.find_all(name=['a','p'])
print(a_list)
a_list = soup.find_all(name='a',attrs={'class':'ali'})
print(a_list)
a_list = soup.find_all(name='a',limit=1)
print(a_list)

a = soup.find(name='a',attrs={'class':'ali'})
print(a)
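
A compact sketch of what the three parameters return. It assumes a small English stand-in for the sample document, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = '<body><a id="baidu">Baidu</a><a class="ali">Ali</a><p>wind</p></body>'
soup = BeautifulSoup(html, 'html.parser')

# name accepts a single tag name or a list of names; results keep document order.
names = [tag.name for tag in soup.find_all(name=['a', 'p'])]
print(names)

# attrs narrows the match to tags carrying the given attribute value.
ali = soup.find_all(name='a', attrs={'class': 'ali'})
print(ali)

# limit caps the length of the returned list.
first_only = soup.find_all(name='a', limit=1)
print(first_only)

# find() returns a single Tag, or None when nothing matches.
nothing = soup.find(name='a', attrs={'class': 'missing'})
print(nothing)
```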

CSS selector syntax

Tag selector: tagname{}
Class selector: .class{}
ID selector: #id{}
Combined selectors: mix tag, class, and id

a_list = soup.select("a")
print(a_list)
a_list = soup.select("a#baidu")
print(a_list)
a_list = soup.select("a.ali")
print(a_list)
a_list = soup.select("body #baidu")
print(a_list)
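
The same selectors, run against a small stand-in document so the results can be checked. This is a sketch; `select_one` is added to show the single-result form and the `None` it returns for no match.

```python
from bs4 import BeautifulSoup

html = '<body><a id="baidu">Baidu</a><a class="ali">Ali</a><p>wind</p></body>'
soup = BeautifulSoup(html, 'html.parser')

all_links = soup.select('a')          # tag selector
by_id = soup.select('a#baidu')        # tag + id
by_class = soup.select('a.ali')       # tag + class
nested = soup.select('body #baidu')   # descendant: #baidu somewhere inside body

print(len(all_links), by_id[0].text, by_class[0].text, nested[0].text)

first_p = soup.select_one('p')        # first match only
no_div = soup.select_one('div')       # no match: None
print(first_p.text, no_div)
```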

Case study: a crawler for the Feizhu proxy IP pool

https://www.feizhuip.com/?source=baidu&keyword=feizhudailiip

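Approach:
1. Request Page1, find the block of proxy IP links, and extract the URLs of the Page2 pages
2. Request Page2, find the table of proxy IP details, parse the data, and store it locally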

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


class SpiderFeiZhuApp:
    def __init__(self):
        """Set up the start URL, request headers, and result storage."""
        self.url = 'https://www.feizhuip.com/?source=baidu&keyword=feizhudailiip'
        self.head = {'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}
        self.resp = None
        self.href_list = []

    def sendPage1(self):
        """Request page one and collect the date, title, and URL of each entry."""
        # 1 Send the request
        self.resp = requests.get(url=self.url, headers=self.head)
        # 2 Check the response
        print('code-', self.resp.status_code)
        # 3 Parse the page source
        soup = BeautifulSoup(self.resp.text, 'html5lib')
        # 4 Read the page title
        title = soup.title.string
        print('title-', title)
        # 5 Find the blocks: find_all(div class:info); the third block holds the list
        div_list = soup.find_all(name='div', attrs={'class': 'info'})
        if len(div_list) < 3:
            print('unexpected page layout: fewer than 3 div.info blocks')
            return
        span_list = div_list[2].select('p.list span.date')
        a_list = div_list[2].select('p.list a.content')
        print('len-', len(div_list), len(span_list), len(a_list))
        # 6 Read the text on page1 and the URL of page2
        for span, a in zip(span_list, a_list):
            href = a.get('href')
            text = a.string
            date = span.string
            # 7 Skip entries with missing content
            if href is not None and text is not None and date is not None:
                # 8 Strip surrounding whitespace
                href = href.strip()
                text = text.strip()
                date = date.strip()
                # 9 Complete relative paths with the domain
                href = urljoin('https://www.feizhuip.com/', href)
                # 10 Save to the list: [{date:..., title:..., href:...}, ...]
                self.href_list.append({'date': date, 'title': text, 'href': href})
        print('page1-info ', self.href_list)

    def sendPage2(self):
        """Request page two (not implemented yet)."""
        pass

    def saveTxt(self):
        """Store the results locally (not implemented yet)."""
        pass

    def run(self):
        self.sendPage1()


app = SpiderFeiZhuApp()
app.run()
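
The notes leave local storage unimplemented. One possible shape for that step is sketched below; it is not the author's implementation. The function name `save_txt`, the tab-separated layout, the output file name, and the sample record are all assumptions made for illustration. It also shows `urljoin` completing a relative path.

```python
import os
import tempfile
from urllib.parse import urljoin

base = 'https://www.feizhuip.com/'

# Hypothetical record in the same shape as the entries collected from page one.
records = [
    {'date': '2020-07-08', 'title': 'sample title',
     'href': urljoin(base, '/news-1.html')},  # relative path gets the domain added
]


def save_txt(records, path):
    """Write one tab-separated record per line (assumed layout)."""
    with open(path, 'w', encoding='utf-8') as f:
        for r in records:
            f.write('\t'.join([r['date'], r['title'], r['href']]) + '\n')


path = os.path.join(tempfile.gettempdir(), 'feizhu_page1.txt')
save_txt(records, path)

with open(path, encoding='utf-8') as f:
    saved = f.read()
print(saved)
```

A tab-separated text file opens directly in a spreadsheet; the same records could go to a database later without changing the crawler.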