Python summary: using regular expressions, XPath, and BeautifulSoup, and how they compare

Latest recommended article published on 2024-04-26 15:14:42

crq_zcbk

Category: Python

Article link: https://blog.csdn.net/crq_zcbk/article/details/81412331

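1. Regular expressions: import re (see the regex post on my blog)

A regular expression can check whether a target string matches a required format, such as a mobile phone number or an ID card number.

The re module offers three commonly used lookup functions: re.match(), re.search(), and re.findall().

re.match() tries to match only at the very start of the string and returns at most one match;

re.search() scans the string from any position, but returns only the first match it finds;

re.findall() scans the whole string and returns every non-overlapping match as a list, which is why it is the function used most often for everyday lookups.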
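The difference between the three functions is easiest to see on one input. The snippet below is a minimal sketch: the sample text and the simplified 11-digit phone pattern are assumptions made for illustration, not a complete validator for real phone numbers.

```python
import re

# Hypothetical sample text; the pattern is a simplified 11-digit mobile number check
text = "tel: 13812345678, backup: 13987654321"
pattern = re.compile(r"1[3-9]\d{9}")

# match() only tries position 0, so it fails here because the text starts with "tel"
m = pattern.match(text)
first_match = m.group() if m else None

# search() scans forward and stops at the first hit
s = pattern.search(text)
first_found = s.group() if s else None

# findall() returns every non-overlapping hit as a list of strings
all_found = pattern.findall(text)

print(first_match)
print(first_found)
print(all_found)
```

Note that match() and search() return a match object (or None), so the result has to be checked before calling group(), while findall() always returns a list.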

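2. XPath

XPath is a query language for matching data inside structured documents.

Target data for XPath: structured data, meaning data defined by a markup language (XML/HTML).

Basic syntax: the loaded web page or XML document is converted into a document tree, and path expressions are then used to match data in parts of that tree.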

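Query operations on index.html:

| Expression | Effect |
| --- | --- |
| html | Selects the html elements that are children of the current context node |
| /html | Selects the html element directly under the document root |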

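How to use XPath in Python

Python has no built-in module with full XPath support (the standard xml.etree.ElementTree handles only a limited subset), but the third-party lxml package supports XPath queries on structured data very well.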
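As a rough sketch of what this looks like with lxml, the snippet below parses a small HTML string and runs a few XPath queries. The HTML fragment is made up for illustration (it only borrows the div.c1 / strong / span layout used in the example later in this post), and lxml has to be installed first, for example with pip install lxml.

```python
from lxml import etree

# Hypothetical HTML fragment, used only to demonstrate the queries
html = """
<html>
  <body>
    <div class="c1"><strong>Dish A</strong><span>100 views</span></div>
    <div class="c1"><strong>Dish B</strong><span>200 views</span></div>
  </body>
</html>
"""

# etree.HTML parses the string and returns the root element of the tree
tree = etree.HTML(html)

# An absolute path starts from the document root
root_tags = [el.tag for el in tree.xpath("/html")]

# // searches at any depth; [@class="c1"] filters on an attribute value
names = tree.xpath('//div[@class="c1"]/strong/text()')

# A path starting with ./ is evaluated relative to the element it is called on
first_div = tree.xpath('//div[@class="c1"]')[0]
first_span = first_div.xpath("./span/text()")

print(root_tags)
print(names)
print(first_span)
```

xpath() always returns a list, even when only one node matches, so single values need to be indexed out explicitly.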

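3. BeautifulSoup

BeautifulSoup (bs4) is a parsing approach best suited to jobs where performance requirements and time constraints are relatively relaxed.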
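A minimal sketch of the object-oriented style bs4 offers is shown below. The HTML fragment is again a made-up sample, and the package is assumed to be installed (pip install beautifulsoup4); the built-in html.parser backend is used so no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, used only to demonstrate the API
html = """
<div class="c1"><strong>Dish A</strong><span>100 views</span></div>
<div class="c1"><strong>Dish B</strong><span>200 views</span></div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns a list of matching tags
names = [tag.get_text(strip=True) for tag in soup.select("div.c1 strong")]

# find() returns the first matching tag; child tags are reachable as attributes
first = soup.find("div", class_="c1")
first_span = first.span.string

print(names)
print(first_span)
```

get_text() is used for the names because .string returns None when a tag contains more than one child node, which would otherwise end up stored as the text "None".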

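Comparison of the three approaches:

| | re | xpath | bs4 |
| --- | --- | --- | --- |
| Installation | Built-in | Third-party | Third-party |
| Syntax | Regular expressions | Path matching | Object-oriented |
| Ease of use | Hard | Fairly hard | Easy |
| Performance | Highest | Medium | Lowest |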

Here is a worked example using the recipe site Meishij (美食杰, meishij.net):

import re
from urllib.request import Request,urlopen
from lxml import etree
import requests
from bs4 import BeautifulSoup
import sqlite3
class DBmanager(object):
    def __init__(self):
        self.cursor=None
        self.connect=None
    def create_data(self):
        self.connect=sqlite3.connect('MYMSJDB')
        self.cursor=self.connect.cursor()
        self.cursor.execute('create table if not exists meishijie(name text)')
        self.connect.commit()
    def insert_data(self,result):
        # Parameterized query: quotes in the scraped text cannot break or inject into the SQL
        self.cursor.execute('insert into meishijie (name) VALUES (?)',(result,))
        self.connect.commit()
    def close_data(self):
        self.cursor.close()
        self.connect.close()
class meishijie(object):
    def __init__(self):
        self.url='https://www.meishij.net/chufang/diy/?&page='
        self.headers={
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0'
        }
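    def guanli(self):
        # Crawl listing pages 1 to 4 and parse each one
        for x in range(1,5):
            print('*******************************************')
            print('Fetching page {}'.format(x))
            code=self.down_data(x)
            if code is None:
                # down_data returns None when the download failed; skip this page
                continue
            self.get_tupian_info(code)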
    def down_data(self,index):
        # Download one listing page; returns the HTML text, or None on failure
        url=self.url+str(index)
        request=Request(url,headers=self.headers)
        try:
            response = urlopen(request)
            code = response.read().decode()
        except Exception as e:
            print('Caught exception:',e)
            return None
        else:
            return code
    def get_tupian_info(self,code):
        # Option 1 (disabled): regular expression
        # pattern=re.compile(r'<div class="c1">.*?<.*?>(.*?)</.*?>.*?<.*?>(.*?)</.*?>.*?<.*?>(.*?)</.*?>',re.S)
        # result=pattern.findall(code)
        # print(result)
        # Option 2 (disabled): XPath via lxml
        # response=requests.get(self.url).content
        # result=etree.HTML(code)
        # cont=result.xpath('//div[@class="c1"]')
        # for x in cont:
        #     y=x.xpath('.//strong/text()')
        #     m = x.xpath('.//span/text()')
        #     z=x.xpath('.//em/text()')
        #     print(y)
        #     print(m)
        #     print(z)
        # Option 3 (active): BeautifulSoup with a CSS selector
        bs=BeautifulSoup(code,'html.parser')
        info=bs.select('div.c1 strong')
        print(info)
        for x in info:
            result='{}'.format(x.string)
            print(result)
            DB.insert_data(result)
DB=DBmanager()
DB.create_data()
m=meishijie()
m.guanli()
DB.close_data()