Python Web Scraping: Data Extraction

长安白猫

Article link: https://blog.csdn.net/weixin_44074810/article/details/98882760

All of the code here was written inside a virtual machine running Linux. If you paste it directly into a Windows environment, it may produce bugs.

1. Data types
Structured data

JSON, XML. Handling: convert directly into Python types

Unstructured data

HTML. Handling: regular expressions, XPath

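2. json module

  json.loads()  converts a JSON string into a Python object
  json.dumps()  converts a Python object into a JSON string
  json.load()   reads JSON from a file object into a Python object
  json.dump()   writes a Python object to a file object as JSON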

import json
mydict = {
    "store": {
        "book": [
            {"category": "reference",
             "author": "Nigel Rees",
             "title": "Sayings of the Century",
             "price": 8.95
             },
            {"category": "fiction",
             "author": "Evelyn Waugh",
             "title": "Sword of Honour",
             "price": 12.99
             },
        ],
    }
}


# python_obj --> json_str
ret = json.dumps(mydict, ensure_ascii=False, indent=4)
print(ret)

# json_str --> python_obj
ret = json.loads(ret)
print(ret)

# python_obj --> write to a file
with open('mydict.txt', 'w', encoding='utf-8') as f:
    # f.write(json.dumps(mydict))
    json.dump(mydict, f, ensure_ascii=False, indent=4)

print('=')
# read from the file --> python_obj
# (open with the same encoding that was used for writing)
with open('mydict.txt', 'r', encoding='utf-8') as f:
    # json_str = f.read()
    ret = json.load(f)
print(ret)
print(type(ret))
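
The ensure_ascii=False argument used above is easy to overlook, so here is a minimal sketch of what it changes. The sample dictionary is made up for this illustration; only the standard json module is assumed.

```python
import json

# Hypothetical sample data containing non-ASCII text
record = {"name": "李毅", "posts": 3}

# Default behaviour: non-ASCII characters are escaped as \uXXXX sequences
escaped = json.dumps(record)

# ensure_ascii=False keeps the original characters readable in the output
readable = json.dumps(record, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings decode back to the same Python object
assert json.loads(escaped) == json.loads(readable) == record
```

Either form is valid JSON. The readable form is mostly a convenience when inspecting saved files by eye, as long as the file is written and read with the same encoding.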

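3. jsonpath module

Used to parse deeply nested JSON data and extract, in bulk, the values of a given key
  $ root node
  . or [] child node
  .. recursive descent: selects every matching node regardless of its position
  * matches all element nodes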

book_dict = {
  "store": {
    "book": [
      { "category": "reference",
        "author": "Nigel Rees",
        "title": "Sayings of the Century",
        "price": 8.95
      },
      { "category": "fiction",
        "author": "Evelyn Waugh",
        "title": "Sword of Honour",
        "price": 12.99
      },
      { "category": "fiction",
        "author": "Herman Melville",
        "title": "Moby Dick",
        "isbn": "0-553-21311-3",
        "price": 8.99
      },
      { "category": "fiction",
        "author": "J. R. R. Tolkien",
        "title": "The Lord of the Rings",
        "isbn": "0-395-19395-8",
        "price": 22.99
      }
    ],
    "bicycle": {
      "color": "red",
      "price": 19.95
    }
  }
}

from jsonpath import jsonpath

# Extract every value stored under the key "price"
ret = jsonpath(book_dict, '$..price') # a list of matches, or False if nothing matches
print(ret)

# Extract the prices of all books only
print(jsonpath(book_dict, '$..book..price'))

print(jsonpath(book_dict, '$.store.bicycle'))
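
To make the idea behind an expression like '$..price' concrete, here is a minimal standard-library sketch of recursive descent. The function name find_all and the reduced sample data are assumptions made for this illustration. It is not how the jsonpath package is implemented, and unlike jsonpath it returns an empty list rather than False when nothing matches.

```python
def find_all(obj, key):
    """Collect every value stored under `key` at any depth of nested dicts/lists."""
    found = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                found.append(v)
            # Keep descending: the value itself may contain the key again
            found.extend(find_all(v, key))
    elif isinstance(obj, list):
        for item in obj:
            found.extend(find_all(item, key))
    return found


# Reduced, hypothetical version of the book data used above
sample = {
    "store": {
        "book": [
            {"title": "Sayings of the Century", "price": 8.95},
            {"title": "Sword of Honour", "price": 12.99},
        ],
        "bicycle": {"color": "red", "price": 19.95},
    }
}

prices = find_all(sample, "price")
print(prices)  # [8.95, 12.99, 19.95]
```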

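4. re module

  re.match # matches only at the very start of the string and returns one match
  re.search # finds the first match anywhere in the string
  re.findall # finds all matches
  re.sub # replaces matches
  Raw strings (the r prefix)
  a = '\n' # a newline character (printing it outputs a line break)
  b = r'\n' # just the two characters \ and n, no longer a newline (printing it outputs \n)
  The Unicode range for Chinese characters is mainly [\u4e00-\u9fa5]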
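
The four functions and the raw-string rule above can be tried out with a short sketch. The sample strings are made up for this illustration.

```python
import re

text = "price: 8.95, price: 12.99"

# re.match only succeeds when the pattern matches at the start of the string
no_match = re.match(r"\d+", text)            # None: the text starts with "price"
start = re.match(r"price", text).group()     # "price"

# re.search returns the first match anywhere in the string
first = re.search(r"\d+\.\d+", text).group()

# re.findall returns every non-overlapping match as a list of strings
all_prices = re.findall(r"\d+\.\d+", text)

# re.sub replaces every match with the given replacement
masked = re.sub(r"\d+\.\d+", "N/A", text)

# Raw strings: '\n' is a single newline character, r'\n' is a backslash plus 'n'
assert len('\n') == 1 and len(r'\n') == 2

# Extract runs of Chinese characters using the \u4e00-\u9fa5 range
chinese = re.findall(r"[\u4e00-\u9fa5]+", "python 爬虫 data 提取")

print(first, all_prices, masked, chinese)
```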

5. XPath

  Difference between HTML and XML
    XML: Extensible Markup Language, used to transport and store data
    HTML: HyperText Markup Language, used to display data

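6. XPath syntax

Selecting nodes

  / selects from the root node, or marks the step from one element to the next
  // selects matching nodes anywhere below the current node, regardless of their position
  . selects the current node
  .. selects the parent of the current node
  @ selects attributes
  text() selects text

Examples:
1. /bookstore  selects the root element bookstore. Note: a path that starts with a forward slash ( / ) always represents an absolute path to an element
2. bookstore/book  selects all book elements that are children of bookstore
3. //book  selects all book elements, regardless of where they are in the document
4. bookstore//book  selects all book elements that are descendants of bookstore, regardless of where they sit beneath it
5. //book/title/@lang  selects the value of the lang attribute of every title under a book
6. //book/title/text()  selects the text of every title under a book

7. //h1/text()  selects the text of every h1
8. //a/@href  gets the href of every a tag
9. /html/head/title/text()  gets the text of the title under head under html
10. /html/head/link/@href  gets the href of the link tags under head under html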

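Finding specific nodes

1. //title[@lang="eng"]  selects all title elements whose lang attribute equals eng
2. /bookstore/book[1]  selects the first book element that is a child of bookstore
3. /bookstore/book[last()]  selects the last book element that is a child of bookstore
4. /bookstore/book[last()-1]  selects the second-to-last book element that is a child of bookstore
5. /bookstore/book[position()>1]  selects the book elements under bookstore, starting from the second one
6. //book/title[text()='Harry Potter']  selects the title elements under any book whose text is exactly Harry Potter
7. /bookstore/book[price>35.00]/title  selects the title elements of those book elements under bookstore whose price element has a value greater than 35.00

Note: in XPath the first element is at position 1, the last element is at last(), and the second-to-last is at last()-1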
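
The positional and attribute predicates above can be tried with a small sketch. To keep it free of third-party dependencies it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath; that swap, and the sample XML, are assumptions made for this illustration. Because the parsed root already is the bookstore element, the paths are written relative to it. Comparisons such as price>35.00 are not part of that subset, so the last step filters in Python instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document
xml_str = """
<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
  <book><title lang="fr">Le Petit Prince</title><price>15.00</price></book>
</bookstore>
"""

root = ET.fromstring(xml_str)  # root is the <bookstore> element

# Positions start at 1, not 0
first = root.find("book[1]/title").text
last = root.find("book[last()]/title").text
second_to_last = root.find("book[last()-1]/title").text

# Attribute predicate: every title whose lang attribute equals "eng"
eng_titles = [t.text for t in root.findall(".//title[@lang='eng']")]

# ElementTree has no numeric comparison predicate, so filter in Python
expensive = [
    b.find("title").text
    for b in root.findall("book")
    if float(b.find("price").text) > 35.00
]

print(first, last, second_to_last, eng_titles, expensive)
```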

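Selecting unknown nodes

   *  matches any element node
   @*  matches any attribute node
   node()  matches a node of any type

Examples:
/bookstore/*  selects all child elements of the bookstore element
//*  selects all elements in the document
//title[@*]  selects all title elements that have at least one attribute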

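Selecting several paths

  Using the "|" operator in a path expression lets you select several paths at once
  1. //book/title | //book/price   selects all title and price elements of book elements
  2. //title | //price   selects all title and price elements in the document
  3. /bookstore/book/title | //price   selects all title elements of the book elements under bookstore, plus all price elements in the document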

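7. lxml module

  Lets you use XPath syntax from Python
  Group first, then extract
  lxml.etree.HTML(response.content) automatically repairs the markup it is given (for example, by completing missing tags)
  Use lxml.etree.tostring(html) to inspect the content after that repair
  When extracting data, do not rely only on the response content returned for the URL
  When using XPath through lxml, also go by the output of etree.tostring()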

html_str = '''<div> <ul> 
<li class="item-1"><a href="link1.html"></a></li> 
<li class="item-1"><a href="link2.html">second item</a></li> 
<li class="item-inactive"><a href="link3.html">third item</a></li> 
<li class="item-1"><a href="link4.html">fourth item</a></li> 
<li class="item-0"><a href="link5.html">fifth item</a> 
</ul> </div>'''

from lxml import etree

# Create an HTML object that has an xpath method
html = etree.HTML(html_str)

# When the XPath expression targets tags, the result is a list of Element objects, or []
# When it targets an attribute value or a tag's text, the result is a list of strings, or []
# Group first, then extract
# li_list = html.xpath('//li')
# for li in li_list:
#     item = {}
#     item['href'] = li.xpath('./a/@href')[0] # an Element returned by xpath can be queried with xpath again
#     item['text'] = li.xpath('./a/text()')[0] if li.xpath('./a/text()') != [] else ''
#     print(item)

# Alternative: extract all hrefs and all texts, then pair them up by index
# (the first <a> has no text, so the two lists differ in length and the pairs misalign)
# href_list = html.xpath('//li/a/@href')
# print(href_list)
# text_list = html.xpath('//li/a/text()')
# print(text_list)
#
# item_list = html.xpath('//li/a/@href | //li/a/text()')
# for item in item_list:
#     print(item)


# etree.tostring(html) shows the markup after lxml has repaired it
ret = etree.tostring(html) # bytes
print(ret.decode())

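8. Using lxml

  1. Import the etree module from lxml
  from lxml import etree
  2. Use etree.HTML to turn an HTML string (bytes or str) into an Element object. The Element object has an xpath method, which returns a list of results
  html = etree.HTML(text)
  ret_list = html.xpath("XPath expression string")
  3. Convert an Element object back into a string; the result is of type bytes
  etree.tostring(element)

The three kinds of list returned by the xpath method

1. An empty list: the XPath expression did not locate any element
2. A list of strings: the XPath expression matched text content or the value of an attribute
3. A list of Element objects: the XPath expression matched tags; each Element in the list can be queried with xpath again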

# Scrape the web version of Baidu Tieba
import requests
from lxml import etree
import os

class Tieba(object):
	
    def __init__(self,name):
        self.url = 'https://tieba.baidu.com/f?kw={}'.format(name)
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
            # 'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0; DigExt) '
        }

    def get_data(self, url):
        response = requests.get(url, headers=self.headers)
        return response.content

    def parse_list_page(self, data):
        # Strip HTML comment markers so that markup wrapped in comments gets parsed too
        data = data.decode().replace('<!--','').replace('-->','')
        # Create the HTML object
        html = etree.HTML(data)

        # Locate the link node of every thread

        el_list = html.xpath('//li[@class=" j_thread_list clearfix"]/div/div[2]/div[1]/div[1]/a')
        # print(len(el_list))

        data_list = []
        for el in el_list:
            temp = {}

            temp['title'] = el.xpath('./text()')[0]
            temp['link'] = 'https://tieba.baidu.com' + el.xpath('./@href')[0]
            data_list.append(temp)
        # Build the URL of the next page
        try:
            next_url = 'http:' + html.xpath('//*[@id="frs_list_pager"]/a[last()-1]/@href')[0]
        except IndexError:
            # No pager link was found, so this is the last page
            next_url = None

        return data_list,next_url
	
    # Get the image links from a thread's detail page
    def parse_detial_data(self, data):
        # Create the HTML object
        html = etree.HTML(data)

        img_list = html.xpath('//*[contains(@id,"post_content_")]/img/@src')
        return img_list

    # Download the images and save them into a folder
    def download(self, img_list):
        if not os.path.exists('python34'):
            os.makedirs('python34')

        for link in img_list:
            print(link)
            data = self.get_data(link)

            # URLs always use '/' as the separator, whatever the operating system
            filename = os.path.join('python34', link.split('/')[-1])
            with open(filename, "wb") as f:
                f.write(data)

    def run(self):
        # url
        # headers
        next_url = self.url
        while True:
            # Send the request and get the response
            data = self.get_data(next_url)

            # Extract the list of thread titles and links, plus the next-page link
            data_list, next_url = self.parse_list_page(data)

            # Go through the list of thread titles and links
            for data in data_list:
                print(data)
                detail_page = self.get_data(data['link'])
                img_list = self.parse_detial_data(detail_page)
                self.download(img_list)


            # Stop when there is no next page; otherwise move on to it
            if next_url is None:
                break


if __name__ == '__main__':
    tieba = Tieba("李毅")
    tieba.run()
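
The download step names each file after the last segment of the image URL. URLs use / as their separator on every operating system, and some image links carry a query string, so a slightly more defensive way to derive the name is sketched below. The helper name filename_from_url and the example URL are hypothetical and not part of the scraper above.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring any query string."""
    path = urlsplit(url).path          # drops the scheme, host, query and fragment
    return posixpath.basename(path)    # always splits on '/', even on Windows


# Hypothetical image URL, for illustration only
link = "https://example.com/forum/pic/item/abc123.jpg?from=list"
name = filename_from_url(link)
print(name)  # abc123.jpg
```

The result can then be combined with the target folder through os.path.join, which picks the correct separator for the local file system.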

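I am still writing the scraper for the mobile version of Baidu Tieba and will upload it once it is finished. Pagination and the XPath expressions on the mobile site differ slightly from the web version.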

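9. bs4 module

Install: pip install beautifulsoup4
Usage:   from bs4 import BeautifulSoup
         html = BeautifulSoup(html_str, 'lxml')
# Methods for locating tags
    element = html.find()        # the first matching tag, or None
    elements = html.find_all()   # a list of all matching tags
    elements = html.select()     # a list of tags matching a CSS selector
# Extract an attribute value (from a single tag)
    element.get('href')
# Extract the text content (from a single tag)
    element.get_text()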
