Web Scraping Practice for Beginners

Latest recommended article published on 2024-11-13 17:26:18

寒林日斜

Article tags: python, Powered by 金山文档

Article link: https://blog.csdn.net/weixin_74700584/article/details/129739518

import re
import os
import requests

url = 'http://www.shiren.org/xlib/lingshidao/gushi/tangdai.htm#001'
# Fetch the page once and set the encoding on that same response before reading .text
response = requests.get(url)
response.encoding = 'UTF-8'
html = response.text

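def get_content(html):
    # Grab the main block between the opening '<p><p>' and the centered paragraph
    content_big = re.findall('<p><p>(.*?)<p align="center">', html, re.S)[0]
    # Per poem: the text after '</a>' up to a '<p>', then the text up to the next '<p>'
    content_little = re.findall('</a>(.*?)<p>(.*?)<p>', content_big, re.S)
    article = ""
    for i in content_little:
        for j in i:
            # Turn '<br>' breaks into spaces and put each captured piece on its own line
            article = article + j.replace("<br>", " ") + '\n'

    return article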


def save(article):
    # Create the '古诗' folder if needed and write the text file inside it
    os.makedirs('古诗', exist_ok=True)
    with open(os.path.join('古诗', '古诗.txt'), 'w', encoding="utf-8") as f:
        f.write(article)
    print('下载成功')  # "Download succeeded"

save(get_content(html))
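
To see what the two regular expressions in get_content actually capture, here is a small sketch that runs the same patterns on a made-up HTML snippet, so no network request is needed. The snippet SAMPLE_HTML and the function name extract_poems are my own assumptions for illustration, and the real page may be laid out differently. Unlike the script above, this version returns an empty string when the outer pattern finds nothing instead of raising an IndexError.

```python
import re

# Made-up sample that imitates the structure the patterns expect (an assumption,
# not a copy of the real page): '<p><p>' opens the block, each poem starts after
# an '</a>' anchor, and '<p align="center">' closes the block.
SAMPLE_HTML = (
    '<p><p>'
    '<a name="001"></a>Poem One<p>line a<br>line b<p>'
    '<a name="002"></a>Poem Two<p>line c<br>line d<p>'
    '<p align="center">footer</p>'
)


def extract_poems(html):
    """Hypothetical helper: same two patterns as get_content, with a no-match guard."""
    outer = re.search(r'<p><p>(.*?)<p align="center">', html, re.S)
    if outer is None:
        return ""  # the page did not have the expected layout
    # Each tuple holds the two captured groups for one poem
    pairs = re.findall(r'</a>(.*?)<p>(.*?)<p>', outer.group(1), re.S)
    lines = []
    for pair in pairs:
        for piece in pair:
            # '<br>' becomes a space, one captured piece per output line
            lines.append(piece.replace("<br>", " "))
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    print(extract_poems(SAMPLE_HTML))
```

The re.S flag matters here: without it, `.` would not match newlines, so a block that spans several lines of HTML would never match.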

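Hehe, I scraped another classical poem page. Looks like I can now comfortably use the requests library and the re regex module to fetch data from web pages!!! ( •̀ ω •́ )y, NICE!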