Hands-On: Extracting Data with Regular Expressions (re), Scraping Classical Chinese Poetry

Java川

Category: python  Tags: regular expressions

Article link: https://blog.csdn.net/weixin_43919632/article/details/89578263

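Background:

The main re functions you need to know:

match: only matches at the start of the string; if the string does not begin with the pattern, it returns None (it does not raise an error)
search: scans the whole string and returns the first match, wherever it starts (commonly used)
findall: returns all non-overlapping matches of the pattern as a list (commonly used)
compile: when the same regular expression is reused many times, compile it once up front and reuse the pattern object, which can improve efficiency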
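
To make the difference between these four functions concrete, here is a minimal sketch on a made-up sample string. The string and the variable names are assumptions for demonstration only and are not part of the crawler further down.

```python
import re

# Hypothetical sample text, used only for this demonstration
sample = "poem1: Li Bai; poem2: Du Fu"

# match only succeeds if the pattern matches at the very start of the string
at_start = re.match(r"poem\d", sample)   # a match object for "poem1"
not_at_start = re.match(r"Li", sample)   # None, because "Li" is not at position 0

# search scans the whole string and returns the first match found anywhere
anywhere = re.search(r"Li \w+", sample)  # a match object for "Li Bai"

# findall returns every non-overlapping match as a list of strings
labels = re.findall(r"poem\d", sample)

# compile builds a reusable pattern object; with one capture group,
# findall returns only the captured part of each match
name_pattern = re.compile(r"poem\d: (\w+ \w+)")
names = name_pattern.findall(sample)

print(at_start.group(), not_at_start, anywhere.group(), labels, names)
```

Note that a failed match gives None rather than an exception, so calling .group() on the result without checking it first is what actually raises an error.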

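Basic use of regular expressions in crawlers: https://blog.csdn.net/weixin_43919632/article/details/89287145

Library used for data extraction: re (regular expressions)

Installation: re is part of the Python standard library, so nothing needs to be installed for it. Only requests has to be installed: pip install requests

Import the module: import re

Hands-on practice: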

    # -*- coding: utf-8 -*-
    import re
    import requests
    
    def parse_html(url):
        header={"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"}
        res=requests.get(url,headers=header)
        text=res.content.decode("utf-8","ignore")
        return text
    # Extract the data with regular expressions (quite tedious)
    def re_parse(text):
        # re.DOTALL (alias re.S) lets "." match newlines as well; wrap the parts you want to capture in ()
        poe_namelst=re.findall(r'<div\sclass="cont">.*?<b>(.*?)</b>',text,re.DOTALL)
        # Text of the first <a> link inside <p class="source">
        dynasties=re.findall(r'<p class="source">.*?>(.*?)</a>',text)
        # Text of the second <a> link; every quantifier is non-greedy so the match cannot run past it
        authorlst=re.findall(r'<p class="source">.*?<a.*?>.*?<a.*?>(.*?)</a>',text)
        # Grab everything inside each <div class="contson"> tag
        content_tags=re.findall(r'<div class="contson".*?>(.*?)</div>',text,re.S)
        contents=[]
        poems=[]
        # Clean each block down to plain text
        for content in content_tags:
            # Remove tags such as <p>, </p> and <br />; the non-greedy <.*?> strips one tag at a time
            x=re.sub(r"<.*?>","",content)
            contents.append(x.strip())
        for value in zip(poe_namelst,dynasties,authorlst,contents):
            name,dynasty,author,content=value
            poem={"title":name,
                  "author":author,
                  "dynasty":dynasty,
                  "content":content}
            poems.append(poem)
        for poem in poems:
            print(poem)
    
    def main():
        # Pagination: crawl 10 pages (the upper bound of range is exclusive)
        for i in range(1,11):
            url="https://www.gushiwen.org/default.aspx?page={}".format(i)
            text=parse_html(url)
            re_parse(text)
    if __name__=="__main__":
        main()
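
The patterns in this script depend on two details that are easy to get wrong: greedy versus non-greedy quantifiers, and whether "." is allowed to cross line breaks. The sketch below shows both on a hand-written HTML fragment. The markup is a simplified assumption for illustration, not a copy of the real page, and the variable names are hypothetical.

```python
import re

# Simplified, hand-written HTML (an assumption, not the real page source)
html = '''<div class="contson" id="c1">
<p>line one<br />line two</p>
</div>
<div class="contson" id="c2">
<p>line three</p>
</div>'''

# Greedy (.*) with re.S runs on to the LAST </div>, so both blocks collapse into one match
greedy = re.findall(r'<div class="contson".*?>(.*)</div>', html, re.S)

# Non-greedy (.*?) stops at the first </div>, giving one match per block
blocks = re.findall(r'<div class="contson".*?>(.*?)</div>', html, re.S)

# Without re.S, "." cannot cross the newline after the opening tag, so nothing matches
no_dotall = re.findall(r'<div class="contson".*?>(.*?)</div>', html)

# Stripping tags: greedy <.*> deletes from the first "<" to the last ">" on a line,
# which wipes out the text in between; non-greedy <.*?> removes one tag at a time
greedy_strip = re.sub(r"<.*>", "", blocks[0]).strip()
lazy_strip = re.sub(r"<.*?>", "", blocks[0]).strip()

print(len(greedy), len(blocks), no_dotall)
print(repr(greedy_strip), repr(lazy_strip))
```

Because the tags are replaced with an empty string, text on either side of a <br /> is joined without a separator. If the line breaks of a poem should be kept, one option is to replace <br /> with "\n" before removing the remaining tags.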