Python Regular Expressions and a Simple Web Crawler

Latest recommended article published on 2024-11-01 21:11:53

uixjhn

Category: python. Tags: regular expressions, python, web crawler

Article link: https://blog.csdn.net/qq_51536995/article/details/120898394

Python regular expressions

Definition:

A regular expression is a powerful tool for processing strings. It has its own syntax and its own matching engine.

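Purpose:

There are two main uses: validation and searching.
Validation: checking that a string, such as an ID card number or a mobile phone number, follows a fixed format.
Searching: finding the data in a piece of text that matches a particular format.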
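
The two uses can be sketched with the standard re module. This is a minimal illustration, not a production validator: the phone pattern (11 digits starting with 1) is a simplified assumption, and the helper names is_phone and find_phones and the sample text are made up for this example.

```python
import re

# Simplified assumption: a mobile number is 11 digits starting with 1.
PHONE = re.compile(r'1\d{10}')

def is_phone(s):
    """Validation: the whole string must match the pattern."""
    return PHONE.fullmatch(s) is not None

def find_phones(text):
    """Searching: collect every substring that matches the pattern."""
    return PHONE.findall(text)

sample = 'Call 13812345678 or 15900001111 after 6 pm.'
print(is_phone('13812345678'))  # True
print(is_phone('1381234'))      # False: too short
print(find_phones(sample))      # ['13812345678', '15900001111']
```

fullmatch is used for validation so that extra characters before or after the number cause a failure, while findall is used for searching because the matches may sit anywhere in the text.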

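(): grouping. From everything the pattern matches, a group captures only the useful part, for example <p>(.*)</p>
Greedy mode: match as many characters as possible while still satisfying the pattern.
Non-greedy mode: match as few characters as possible, stopping as soon as the pattern is satisfied.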

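Non-greedy mode is switched on by adding a question mark ? after a quantifier, for example \d+?, where + is the quantifier. Note that \d? is different: there the ? is itself the quantifier and means zero or one digit.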
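
A short sketch of the difference, using made-up sample strings. The same pattern is run once with a greedy quantifier and once with its non-greedy form:

```python
import re

html = '<p>first</p><p>second</p>'

# Greedy: .* runs to the last </p> it can reach.
greedy = re.search(r'<p>(.*)</p>', html).group(1)
# Non-greedy: .*? stops at the first </p>.
lazy = re.search(r'<p>(.*?)</p>', html).group(1)
print(greedy)  # first</p><p>second
print(lazy)    # first

# The same contrast with digits.
digits_greedy = re.search(r'\d+', 'abc2021').group()
digits_lazy = re.search(r'\d+?', 'abc2021').group()
print(digits_greedy)  # 2021
print(digits_lazy)    # 2

# Here ? is the quantifier itself: zero or one digit after x.
optional = re.search(r'x\d?', 'x7y').group()
print(optional)  # x7
```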

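.match tries to match only at the very start of the string; if no match begins at the first character, it returns None.
.search returns the first match found anywhere in the string, without match's start-of-string restriction.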
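
A minimal sketch of the difference on a made-up string. Because both functions return None when nothing matches, it is safer to check the result before calling .group():

```python
import re

text = 'abc123'

m1 = re.match(r'\d+', text)     # the string does not start with a digit
m2 = re.search(r'\d+', text)    # digits are found later in the string
m3 = re.match(r'[a-z]+', text)  # the string does start with letters

print(m1)                      # None
print(m2.group(), m2.span())   # 123 (3, 6)
print(m3.group())              # abc

# Guard against None before using the match object.
if m1 is None:
    print('match() found nothing at the start of the string')
```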

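import re
# compile(pattern, flags=0)
# Compiles the regular expression pattern. flags defaults to 0,
# i.e. no special mode. Returns a compiled pattern object (re.Pattern).
text='<p>sax213ddx000</p>'  # not named str, which would shadow the built-in type
de=re.compile(r'\d+')    # greedy: one or more digits
me=re.compile(r'\d+?')   # non-greedy: as few digits as possible
ke=re.compile(r'<p>(.*)</p>')
# de.match(text) would return None here, because text does not start with a digit
res=de.search(text)

print(res.group())              # 213
print(me.search(text).group())  # 2

# Grouping
print(ke.match(text).group(1))  # sax213ddx000 (group 1 only)
print(ke.match(text).group())   # <p>sax213ddx000</p> (the whole match)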

Result:
213
2
sax213ddx000
<p>sax213ddx000</p>

The split function

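# split(pattern, string, maxsplit=0, flags=0)
# maxsplit limits the number of splits (0 means no limit). The function splits
# the string wherever the pattern matches and returns a list of the pieces.
# If nothing matches, it returns a list containing only the original string.
str1='one1two2three3four4five5'
pe=re.compile(r'\d+')
res=pe.split(str1,maxsplit=5)
print(res)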

Result:
['one', 'two', 'three', 'four', 'five', '']
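
To see what maxsplit actually changes, here is a small sketch on the same kind of string with a lower limit. The variable names are made up for this example. It also shows two behaviours of re.split worth knowing: a capturing group keeps the separators in the result, and a string with no match comes back as a one-element list.

```python
import re

s = 'one1two2three3four4five5'

# Only the first two matches are used as split points.
limited = re.split(r'\d+', s, maxsplit=2)
print(limited)  # ['one', 'two', 'three3four4five5']

# With a capturing group, the separators are kept in the list.
with_seps = re.split(r'(\d+)', s, maxsplit=2)
print(with_seps)  # ['one', '1', 'two', '2', 'three3four4five5']

# No match: the original string is returned as the only element.
no_match = re.split(r'\d+', 'no digits here')
print(no_match)  # ['no digits here']
```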

The sub and subn functions

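sub(pattern, repl, string, count=0, flags=0)

The substitution function: it replaces every substring matched by pattern with repl. count sets the maximum number of replacements; the default of 0 replaces all matches.

subn(pattern, repl, string, count=0, flags=0)

Works like sub. The only difference is the return value, which is a tuple: the first element is the string after replacement and the second is the number of replacements made.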
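
A minimal sketch of both functions on a made-up string. The last call uses a function as repl, which the re module also accepts; the doubling logic is only an illustration.

```python
import re

s = 'one1two2three3'

replaced_all = re.sub(r'\d', '-', s)
print(replaced_all)  # one-two-three-

# count=2 stops after the first two replacements.
replaced_two = re.sub(r'\d', '-', s, count=2)
print(replaced_two)  # one-two-three3

# subn returns (new_string, number_of_replacements).
result, n = re.subn(r'\d', '-', s)
print(result, n)  # one-two-three- 3

# repl can be a function that receives the match object.
doubled = re.sub(r'\d', lambda m: str(int(m.group()) * 2), s)
print(doubled)  # one2two4three6
```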


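A simple web crawler

A crawler is a program that fetches data from web pages (search engines are built on crawlers).
The article "爬虫从简单到入门" (Crawlers: From Simple to Getting Started) covers the topic quite thoroughly, so below is just an example I wrote for practice.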

from urllib import request,parse
import re,uuid
import pymysql as sql

class SpiderBaiDu:
    def spiderController(self,selname,pagefirst,pagelast):
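        '''
        Controller for crawling a Baidu Tieba forum.
        Step 1: build the URL and fetch the full HTML with gethtml().
        Step 2: pass the HTML to getpage(), which extracts the useful data with a regular expression.
        Step 3: save the useful data with wirteFile() and writeJDBC().
        :param selname: name of the forum
        :param pagefirst: first page number
        :param pagelast: last page number
        :return: None
        '''
        # Build the URL for each page (pn is the offset (p-1)*50) and fetch its HTML with gethtml()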
        for p in range(pagefirst,pagelast+1):
            url='https://tieba.baidu.com/f?'+parse.urlencode({"kw":selname,"pn":(p-1)*50})
            html=self.gethtml(url)
            titleList=self.getpage(html)
            self.wirteFile(titleList)
            self.writeJDBC(titleList)
            print(titleList)
    def gethtml(self,url):
        head = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36 Edg/95.0.1020.30'}
        req=request.Request(url=url,headers=head)
        repose=request.urlopen(req)
        html=repose.read().decode('utf-8')
        return html

    def getpage(self,html):
        # Parse the HTML with a regular expression and return the thread titles.
        # Sample lines the pattern has to match:
        # <a rel="noreferrer" href="/p/7513223339" title="我该怎么提醒他" target="_blank" class="j_th_tit ">我该怎么提醒他</a>
        # <a rel="noreferrer" href="/p/7585721409" title="有没有老哥跟我一样玩女V就为了上river" target="_blank" class="j_th_tit " clicked="true">有没有老哥跟我一样玩女V就为了上river</a>

        # Raw string so \s reaches the regex engine unchanged; [^>]* tolerates extra attributes such as clicked="true"
        me=re.compile(r'<a rel="noreferrer"\s*href=".*?"\s*title="(.*?)"\s*target="_blank"\s*class="j_th_tit\s*"[^>]*>.*?</a>')
        titleList=me.findall(html)
        return titleList

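    def wirteFile(self,res):
        # Append one title per line; utf-8 so Chinese titles are written correctly on any platform
        with open('D://work/liunx/python/test.txt','a',encoding='utf-8') as file:
            for title in res:
                file.write(title+"\n")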

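    def writeJDBC(self,titleList):
        # Open the database connection
        conn=sql.Connect(
            host='localhost',
            port=3306,
            passwd='root',
            user='root',
            db='python',
        )
        conn.autocommit(True)
        cour=conn.cursor()
        try:
            for t in titleList:
                try:
                    # Parameterized query: the driver quotes and escapes the UUID and the title
                    cour.execute("insert into tieba values(%s,%s)",(str(uuid.uuid4()),t))
                except Exception as e:
                    # Report the failed row instead of skipping it silently
                    print("insert failed:",e)
                    continue
        finally:
            # Always release the cursor and the connection
            cour.close()
            conn.close()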

if __name__ == '__main__':
    selname = input("Enter the name of the forum to crawl: ")
    # Keep asking until the user enters an integer page number
    while True:
        try:
            pagefirst = int(input("Enter the first page number: "))
        except ValueError:
            print("Please enter a valid page number!")
            continue
        else:
            break
    while True:
        try:
            pagelast = int(input("Enter the last page number: "))
        except ValueError:
            print("Please enter a valid page number!")
            continue
        else:
            break
    sp=SpiderBaiDu()
    sp.spiderController(selname,pagefirst,pagelast)
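
The URL building and the title extraction can be tried without touching the network or a database. The sketch below is an offline check under a few assumptions: build_url and TITLE are names made up for this example, the pattern mirrors the one in getpage with [^>]* allowing extra attributes after class, and the sample HTML is just the two lines quoted in the comments of getpage. The real page markup may differ or change.

```python
import re
from urllib import parse

def build_url(name, page):
    """Build the list-page URL; pn is the offset (page-1)*50, as in spiderController."""
    query = parse.urlencode({'kw': name, 'pn': (page - 1) * 50})
    return 'https://tieba.baidu.com/f?' + query

# Title pattern modelled on getpage(); group 1 captures the title attribute.
TITLE = re.compile(
    r'<a rel="noreferrer"\s*href=".*?"\s*title="(.*?)"\s*'
    r'target="_blank"\s*class="j_th_tit\s*"[^>]*>.*?</a>'
)

# Sample markup copied from the comments in getpage().
sample_html = (
    '<a rel="noreferrer" href="/p/7513223339" title="我该怎么提醒他" '
    'target="_blank" class="j_th_tit ">我该怎么提醒他</a>\n'
    '<a rel="noreferrer" href="/p/7585721409" title="有没有老哥跟我一样玩女V就为了上river" '
    'target="_blank" class="j_th_tit " clicked="true">有没有老哥跟我一样玩女V就为了上river</a>'
)

url = build_url('python', 3)
titles = TITLE.findall(sample_html)
print(url)     # https://tieba.baidu.com/f?kw=python&pn=100
print(titles)  # both titles, including the one with clicked="true"
```

Checking the pattern against saved sample markup like this makes it easier to notice when a regex silently matches nothing, which in the crawler would only show up as an empty list.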