179 - Web Scraping 02 - Introduction to the bs4 module

Pinkman2k

Categories: web scraping, python

Article link: https://blog.csdn.net/qq_40808228/article/details/112958583

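Review of yesterday

1 Basic principles of web scraping
	-Crawler protocol (robots.txt): states which paths may be crawled and which may not (we generally have not paid attention to it)
	-Simulate sending a request (HTTP request) ----(anti-scraping measures)----> get the data ----> parse and clean the data ----> store it in a database
2 Scraping a video site (re), requests --- GET request
	-In the request headers:
    	-browser type (User-Agent)
        -Referer
        -Cookie
    -Replacing the video address
3 Logging in to a site automatically
	-Send a POST request with requests: data, json
    -Get the cookie; later requests that carry the cookie are in the logged-in state
    
4 Using the requests module (requests-html)
	-GET: adding headers, GET query parameters, encoding conversion
    -POST: request headers, cookie, data carried by the POST
    -Attributes of the response object
    -Other usage: using a proxy
    	-paid, free
        -high-anonymity, transparent
    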
    
    
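5 Testing a proxy
    import requests

    # Route plain-http traffic through the proxy
    proxies={
        'http':'139.224.19.30:3128',
    }
    res=requests.get('http://101.133.225.166:8000/test/',proxies=proxies)
    # The test view echoes the address it sees, so this prints the proxy's IP when the proxy is used
    print(res.text)


    # Backend code (Django view)
    from django.http import HttpResponse

    def test(request):
        # Address of the peer that connected to the server
        ip=request.META.get('REMOTE_ADDR')
        print(ip)
        return HttpResponse(ip)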
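
Item 1 of the review mentions the crawler protocol. As a minimal sketch of how a crawler could check it before sending a request, the example below uses the standard-library urllib.robotparser. The robots.txt content, the user-agent name my-crawler and the example.com URLs are assumptions made up for illustration; a real crawler would load the target site's own robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_text):
    """Build a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "my-crawler" is a made-up user-agent name
news_allowed = parser.can_fetch("my-crawler", "https://example.com/news/1/")
private_allowed = parser.can_fetch("my-crawler", "https://example.com/private/data")
print(news_allowed, private_allowed)
```

To check a live site, RobotFileParser also offers set_url() and read(), which download the file before can_fetch() is called.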

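Resume projects

1 App store --> app --> start from registration --> find the API endpoints --> rough table design (which fields each table has) --> celery, redis, docker
2 WeChat mini program (sending gifts)
3 Tutoring-related
4 Automated ops platform for internal company use
5 Ticketing system, OA system
	-https://zhuanlan.zhihu.com/p/38340557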

Today's content

1 Scraping news from Autohome (汽车之家)



# requests module (sends requests) + bs4 (module for parsing HTML)
# Autohome is used as the example


# pip3 install beautifulsoup4
# pip3 install lxml
# pip3 install requests pymysql

import os

import pymysql
import requests
from bs4 import BeautifulSoup

res=requests.get('https://www.autohome.com.cn/news/1/#liststart')
# print(res.text)

# Instantiate the class (first argument: the HTML to parse; second argument: the parser to use)
# html.parser : parser from the Python standard library, no extra install
# lxml        : needs lxml installed separately (fast)
# soup=BeautifulSoup(res.text,'html.parser')
soup=BeautifulSoup(res.text,'lxml')
conn=pymysql.connect(host='127.0.0.1', user='root', password="123",database='qc', port=3306)
cursor=conn.cursor()
# Make sure the directory for downloaded images exists
os.makedirs('imgs',exist_ok=True)
# find returns the first match
# find_all returns all matches
# class is a Python keyword, so the argument is named class_
ul_list=soup.find_all(name='ul',class_='article')
for ul in ul_list:
    li_list=ul.find_all('li')
    for li in li_list:
        h3=li.find('h3')
        if h3:
            # Text content of the h3 tag
            title=h3.text
            desc=li.find(name='p').text
            # Assumes href and src are protocol-relative (//...), so the scheme is prepended
            url='https:'+li.find(name='a')['href']
            photo_url='https:'+li.find(name='img')['src']
            print('''
            Title: %s
            Link: %s
            Image: %s
            Summary: %s
            '''%(title,url,photo_url,desc))

            # Save the image locally
            img_res=requests.get(photo_url)
            name=photo_url.split('_')[-1]
            with open('imgs/%s'%name,'wb') as f:
                for chunk in img_res.iter_content(chunk_size=8192):
                    f.write(chunk)
            # Insert into MySQL
            sql='insert into article (title,url,photo_url,`desc`) values(%s,%s,%s,%s);'
            cursor.execute(sql,args=[title,url,photo_url,desc])


conn.commit()  # commit the transaction
cursor.close()
conn.close()
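
The script above builds absolute URLs by prepending 'https:' and derives the image file name by splitting on '_'. Both steps rely on the page using protocol-relative links and on every image URL containing an underscore. A possible alternative, sketched here with the standard-library urllib.parse, resolves any kind of link against the page URL and takes the last path segment as the file name. The helper names absolute_url and image_filename and the example.com URLs are hypothetical, for illustration only.

```python
import posixpath
from urllib.parse import urljoin, urlparse

PAGE_URL = "https://www.autohome.com.cn/news/1/"


def absolute_url(link, base=PAGE_URL):
    """Resolve a link found on the page against the page URL."""
    return urljoin(base, link)


def image_filename(photo_url):
    """Use the last path segment (without the query string) as the file name."""
    return posixpath.basename(urlparse(photo_url).path)


# Hypothetical attribute values, for illustration only
print(absolute_url("//www.example.com/news/a.html"))        # protocol-relative link
print(absolute_url("https://www.example.com/news/a.html"))  # already absolute
print(absolute_url("/news/2/"))                             # site-relative link
print(image_filename("https://img.example.com/pics/120x90_abc.jpg?v=1"))
```

urljoin leaves an already absolute URL unchanged, so the same helper can be applied to every href and src without first checking its form.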

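2 bs4: traversing the document tree

'''
#Traversing the document tree: select directly by tag name. It is fast, but when several tags share the same name only the first one is returned
#1. Usage
#2. Getting a tag's name
#3. Getting a tag's attributes
#4. Getting a tag's content
#5. Nested selection
#6. Child nodes, descendant nodes
#7. Parent node, ancestor nodes
#8. Sibling nodes
'''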

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" id='id_pp' name='lqz'>asdfasdf<b>asdfas</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup=BeautifulSoup(html_doc,'lxml')

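# Traversing the document tree (fast)

#1. Usage
# head=soup.head
# print(head)
# print(type(head))

# p=soup.body.p
# p=soup.p
# print(p)


#2. Getting a tag's name
# p=soup.p.name  # object.name gives the tag's name
# print(p)
#3. Getting a tag's attributes
# p=soup.p['class']   # class is a list, because a tag can have several classes
# name=soup.p['name']

# attr=soup.p.attrs  # all attributes as a dict
# print(attr)
#4. Getting a tag's content

# t=soup.p.text  # text of p plus the text of all its descendant tags
# print(soup.p.string) # returns the text only when p has a single child; otherwise None
# print(soup.p.strings) # a generator object that yields every text string under p
# print(list(soup.p.strings)) # the generator turned into a list of every text string under p

#5. Nested selection
# b=soup.body.p.b
# print(b)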
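
The outline for this section also lists items 6 to 8 (child and descendant nodes, parent and ancestor nodes, sibling nodes), which the notes do not demonstrate. The following is a minimal sketch of those attributes, not the original lesson code. It uses a shortened, hypothetical document and html.parser so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Small hypothetical document; html.parser is used so no extra parser is needed
html_doc = (
    '<div id="box"><p class="story">Once '
    '<a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and '
    '<a id="link3">Tillie</a>.</p></div>'
)
soup = BeautifulSoup(html_doc, "html.parser")
p = soup.p
first_a = soup.a

#6. Child nodes, descendant nodes
children = list(p.children)        # direct children only: text nodes and <a> tags
descendants = list(p.descendants)  # children plus everything nested inside them
print(len(children), len(descendants))

#7. Parent node, ancestor nodes
print(first_a.parent.name)
ancestor_names = [tag.name for tag in first_a.parents]  # nearest ancestor first
print(ancestor_names)

#8. Sibling nodes
print(repr(first_a.next_sibling))          # the text node right after the first <a>
second_a = first_a.find_next_sibling("a")  # skips text nodes, returns the next <a> tag
print(second_a["id"])
print(repr(soup.find(id="link3").previous_sibling))
```

Note that next_sibling and previous_sibling return whatever node is adjacent, which is often a text node; find_next_sibling and find_previous_sibling are the calls to use when only tags are wanted.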

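Homework

1 Cloud servers
    Address, user name, password (ssl, plaintext password (encrypt the password with symmetric encryption))
    Entry feature (manual entry, Excel import) --- saved to the database
     Build a web app with Django ----> list all of your cloud servers
     -Run commands
    	ls  run immediately / run in 5 minutes / run at a given time
        in celery
        Send an SMS and an email notification whether or not the async task succeeds
        
        
        
2 Automatic code deployment
	git address ---> Python module for working with git --->
    
    
 # excel module
# gitpython module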
