Web Crawlers: The Data Parsing Module

不食人间烟火的阿琨

Categories: pycharm, crawlers. Tags: data parsing for crawlers

Article link: https://blog.csdn.net/qq_43775661/article/details/102488850

Key points

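## note
Focused crawler: crawls specified content from a specified page.
- Coding workflow:
    - Specify the URL
    - Send the request
    - Get the response data
    - Parse the data
    - Persist the result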
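
The five steps above can be sketched as one small function. This is only an illustrative outline: `crawl` and its `fetch`, `parse`, and `store` parameters are hypothetical names, and the stand-in lambdas replace a real network request so the sketch runs offline.

```python
from typing import Callable, List


def crawl(url: str,
          fetch: Callable[[str], str],
          parse: Callable[[str], List[str]],
          store: Callable[[List[str]], None]) -> List[str]:
    """Run the focused-crawler steps in order for one URL."""
    page_text = fetch(url)      # steps 2-3: send the request, get the data
    items = parse(page_text)    # step 4: parse out the wanted content
    store(items)                # step 5: persist the result
    return items


# Stand-ins so the sketch runs offline. A real crawler would typically
# call requests.get(url, headers=headers).text inside fetch.
saved: List[str] = []
result = crawl(
    "http://example.com/",      # step 1: specify the URL (placeholder)
    fetch=lambda u: "<a>first</a><a>second</a>",
    parse=lambda html: [part.split("<")[0] for part in html.split("<a>")[1:]],
    store=saved.extend,
)
print(result)
```

Passing the steps in as functions keeps the order of the workflow visible and lets each step be swapped, for example replacing the parser with bs4 or xpath.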

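Types of data parsing
- Regular expressions
- bs4
- xpath
How data parsing works
- The content to extract is usually stored between the tags of a specific element or in that element's attributes (this covers most cases).
- 1. Locate the target tag.
- 2. Extract (parse) the data stored in the tag or in its attributes.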
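
As a small illustration of the two steps with the first approach in the list (regular expressions), the sketch below locates `<img>` tags and captures their attribute values. The HTML snippet and the pattern are made up for the example, and the pattern assumes `src` always comes directly before `alt`.

```python
import re

# Hypothetical page fragment used only for this example.
page_text = '''
<div class="slist">
  <img src="/uploads/a.jpg" alt="first">
  <img src="/uploads/b.jpg" alt="second">
</div>
'''

# Step 1: locate the tag (<img ...>).
# Step 2: capture the values stored in its src and alt attributes.
pattern = re.compile(r'<img\s+src="(.*?)"\s+alt="(.*?)"')
pairs = pattern.findall(page_text)
print(pairs)
```

A regex like this breaks as soon as the attribute order changes, which is one reason the notes go on to bs4 and xpath.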
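# Data parsing with bs4
- How data parsing works:
    - 1. Locate the tag.
    - 2. Extract the data stored in the tag or in its attributes.
How bs4 parsing works
- Instantiate a BeautifulSoup object and load the page source into it.
- Call the BeautifulSoup object's attributes and methods to locate tags and extract data.
- How to instantiate a BeautifulSoup object (from bs4 import BeautifulSoup):
    - 1. Load the data from a local HTML document into the object:
        fp = open('./text.html', 'r', encoding='utf-8')
        soup = BeautifulSoup(fp, 'lxml')
    - 2. Load page source fetched from the internet into the object:
        soup = BeautifulSoup(page_text, 'lxml')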
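
Both ways of creating the object can be tried without a network connection. In this sketch a temporary file stands in for `./text.html` and a literal string stands in for the `response.text` of a real request. It assumes `bs4` and `lxml` are installed.

```python
import os
import tempfile

from bs4 import BeautifulSoup

html = "<html><body><p class='title'>hello</p></body></html>"

# 1. From a local HTML file (a temporary file stands in for ./text.html).
path = os.path.join(tempfile.mkdtemp(), "text.html")
with open(path, "w", encoding="utf-8") as fp:
    fp.write(html)
with open(path, "r", encoding="utf-8") as fp:
    soup_from_file = BeautifulSoup(fp, "lxml")

# 2. From page source already held in a string, e.g. response.text.
soup_from_text = BeautifulSoup(html, "lxml")

print(soup_from_file.p.text, soup_from_text.p.text)
```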
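- Methods and attributes used for parsing:
    - soup.tagname (e.g. soup.a): returns the first tagname tag in the HTML.
    - soup.find('tagname'): equivalent to soup.tagname, e.g. soup.find('div') returns the same tag as soup.div.
        - Attribute-based lookup: soup.find('div', class_='song')
    - soup.find_all('tagname'): returns a list of all matching tags.
    - select:
        - select('some selector (id, class, or tag)') returns a list.
        - Hierarchy selectors:
            - soup.select('.tang > ul > li > a'): > steps down exactly one level.
            - soup.select('.tang > ul > li a'): a space matches across any number of levels.
    - Getting the text inside a tag:
        - soup.a.text / soup.a.string / soup.a.get_text()
        - text / get_text(): return all the text inside a tag, including text in nested tags.
        - string: returns the text only when the tag has a single child; otherwise it returns None.
    - Getting a tag's attribute:
        - soup.a['href']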
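
The calls listed above can be compared side by side on a small document. The HTML here is invented for the example (only the `tang` class name comes from the notes), and the sketch assumes `bs4` and `lxml` are installed.

```python
from bs4 import BeautifulSoup

html = """
<div class="tang">
  <ul>
    <li><a href="http://example.com/1" title="one">Li Bai</a></li>
    <li><a href="http://example.com/2">Du Fu</a></li>
    <li><p><a href="http://example.com/3">Su Shi</a></p></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "lxml")

first_a = soup.a                              # first <a> in the document
same_a = soup.find("a")                       # same tag as soup.a
titled = soup.find("a", title="one")          # attribute-based lookup
all_a = soup.find_all("a")                    # every <a>, as a list

# '>' steps down one level; a space matches at any depth.
direct = soup.select(".tang > ul > li > a")   # skips the <a> nested in <p>
nested = soup.select(".tang > ul > li a")     # includes it

link_text = soup.a.string                     # text of a single-child tag
div_string = soup.div.string                  # None: <div> has several children
div_text = soup.div.get_text()                # all text under <div>
href = soup.a["href"]                         # attribute value

print(len(all_a), len(direct), len(nested), link_text, href)
```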
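xpath parsing: the most commonly used, and the most concise and efficient, way to parse. It is also general-purpose.
- How xpath parsing works:
    - Instantiate an etree object and load the page source to be parsed into it.
    - Call the etree object's xpath method with an xpath expression to locate tags and extract content.
- Installation:
    - pip install lxml
- How to instantiate an etree object (from lxml import etree):
    - 1. Load a local HTML document into the etree object:
        etree.parse(filename)
        (etree.parse uses an XML parser by default; pass etree.HTMLParser() as the second argument for HTML that is not well-formed XML.)
    - 2. Load page source fetched from the internet into the object:
        etree.HTML(page_text)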
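
A minimal sketch of both loading paths, assuming `lxml` is installed. The temporary file and the HTML string are placeholders for a real local document and a real `response.text`.

```python
import os
import tempfile

from lxml import etree

html = "<html><body><div class='song'><p>one</p><p>two</p></div></body></html>"

# 1. Local file: pass an HTMLParser so that ordinary HTML is accepted
#    (etree.parse would otherwise parse the file as XML).
path = os.path.join(tempfile.mkdtemp(), "page.html")
with open(path, "w", encoding="utf-8") as fp:
    fp.write(html)
tree_from_file = etree.parse(path, etree.HTMLParser())

# 2. Page source held in a string, e.g. response.text.
tree_from_text = etree.HTML(html)

file_texts = tree_from_file.xpath("//p/text()")
text_texts = tree_from_text.xpath("//p/text()")
print(file_texts, text_texts)
```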
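- xpath('xpath expression'):
    - / : at the start of an expression, locates from the root node; between steps, it means one level.
    - // : means any number of levels; at the start of an expression, it locates from anywhere in the document.
    - Attribute-based lookup: //div[@class="song"], in general tag[@attrname="attrvalue"]
    - Index-based lookup: tree.xpath('//div[@class="song"]/p[3]'); indexes start at 1.
    - Getting text:
        - /text() gets the text directly inside the tag.
        - //text() gets all the text under the tag, including text inside nested tags.
    - Getting an attribute:
        - /@attrname, e.g. img/@src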
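
The expressions above can be checked against a small document. The HTML is made up for the example (only the `song` class name comes from the notes), and `lxml` is assumed to be installed.

```python
from lxml import etree

html = """
<html><body>
  <div class="song">
    <p>Li Qingzhao</p>
    <p>Wang Anshi</p>
    <p>Su Shi <b>(Dongpo)</b></p>
    <img src="/img/a.jpg" alt="cover">
  </div>
</body></html>
"""
tree = etree.HTML(html)

root_path = tree.xpath("/html/body/div")            # '/' from the root, one level per step
anywhere = tree.xpath("//div[@class='song']")       # '//' from anywhere, attribute lookup
third_p = tree.xpath("//div[@class='song']/p[3]")[0]    # indexes start at 1

direct_text = tree.xpath("//div[@class='song']/p[3]/text()")   # direct text only
all_text = tree.xpath("//div[@class='song']/p[3]//text()")     # includes text in <b>
src = tree.xpath("//div[@class='song']/img/@src")              # attribute value

print(direct_text, all_text, src)
```

Note that `xpath` always returns a list, which is why the exercise scripts index the result with `[0]`.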

Exercises

# File: parse_image_download.py

from lxml import etree
import requests
import os

if __name__ == "__main__":
    # headers holds your browser's request header information as key-value pairs.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'}
    url = "http://pic.netbian.com/4kmeinv/"
    response = requests.get(url=url, headers=headers)
    # Option: set the response encoding manually to fix garbled text.
    # response.encoding = 'utf-8'
    page_text = response.text
    tree = etree.HTML(page_text)
    print(tree)
    # Parse the src and alt attribute values of each image.
    list_p = tree.xpath('//div[@class="slist"]/ul/li')
    print(list_p)
    if not os.path.exists('./piclibs'):
        os.mkdir('./piclibs')
    for li in list_p:
        list_jpg = "http://pic.netbian.com" + li.xpath('./a/img/@src')[0]
        list_name = li.xpath('./a/img/@alt')[0] + '.jpg'
        # General fix for garbled Chinese text: undo the wrong
        # ISO-8859-1 decode, then decode the bytes as GBK.
        list_name = list_name.encode('iso-8859-1').decode('gbk')
        print(list_jpg, list_name)
        img_data = requests.get(url=list_jpg, headers=headers).content
        jpg_path = 'piclibs/' + list_name
        with open(jpg_path, 'wb') as fp:
            fp.write(img_data)
        print(list_name, 'downloaded successfully!')
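
The `encode('iso-8859-1').decode('gbk')` step in the script can be understood in isolation. The sketch below simulates the problem, where GBK bytes are wrongly decoded as ISO-8859-1, and then reverses it. The sample string is arbitrary.

```python
original = "美女图片"

# Simulate the failure: the server sends GBK bytes, but they are
# decoded as ISO-8859-1, producing garbled text.
garbled = original.encode("gbk").decode("iso-8859-1")

# The repair: turn the garbled text back into the raw bytes,
# then decode those bytes with the correct encoding.
repaired = garbled.encode("iso-8859-1").decode("gbk")

print(repaired == original)
```

This works because ISO-8859-1 maps every byte to exactly one character, so the wrong decode loses no information. An alternative, under the assumption that the whole page is GBK, is to set `response.encoding = 'gbk'` before reading `response.text`, in which case the per-string repair is not needed.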

# File: all_cities.py

import requests
from lxml import etree

if __name__ == "__main__":
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'}
    url = 'https://www.aqistudy.cn/historydata/index.php'
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    all_list_name = []
    # //div[@class="bottom"]/ul/li/a          hierarchy of the <a> tags for hot cities
    # //div[@class="bottom"]/ul/div[2]/li/a   hierarchy of the <a> tags for all cities
    # The | operator combines both expressions into one query.
    all_list = tree.xpath('//div[@class="bottom"]/ul/li/a | //div[@class="bottom"]/ul/div[2]/li/a')
    for a in all_list:
        all_name = a.xpath('./text()')[0]
        all_list_name.append(all_name)
    print(all_list_name, len(all_list_name))




    # Alternative: parse the two groups separately
    # hot_city_list = tree.xpath("//div[@class='bottom']/ul/li")
    # all_city_name = []
    # for li in hot_city_list:
    #     hot_city = li.xpath("./a/text()")[0]
    #     all_city_name.append(hot_city)
    #     # print(hot_city)
    # all_city = tree.xpath("//div[@class='bottom']/ul/div[2]/li")
    # for li in all_city:
    #     name = li.xpath("./a/text()")[0]
    #     all_city_name.append(name)
    # print(all_city_name, len(all_city_name))
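
The `|` operator used in the city script joins the results of two xpath expressions. A small offline sketch shows the effect. The `hot` and `all` class names and the city lists are hypothetical and do not reflect the real page's markup.

```python
from lxml import etree

# Hypothetical markup with two separate groups of city links.
html = """
<div class="hot"><ul>
  <li><a>Beijing</a></li>
  <li><a>Shanghai</a></li>
</ul></div>
<div class="all"><ul>
  <li><a>Anshan</a></li>
  <li><a>Baoding</a></li>
</ul></div>
"""
tree = etree.HTML(html)

# One query returns the nodes matched by either expression.
names = tree.xpath('//div[@class="hot"]//a/text() | //div[@class="all"]//a/text()')
print(names, len(names))
```

Compared with the two-loop version, the union keeps the extraction in a single query, at the cost of no longer knowing which group each name came from.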