Saving Specific Scraped Content as an HTML File

Latest recommended article published on 2023-07-25 20:48:45

超市dn

Tags: html, front-end

Article link: https://blog.csdn.net/m0_68157946/article/details/127769421

1. Scraping specific content requires BeautifulSoup from the bs4 package.

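from bs4 import BeautifulSoup

soup = BeautifulSoup(frist_r, 'html.parser')  # frist_r: page source fetched earlier
item_a = soup.find_all('div', class_="ewb-main")  # all matching content blocks
file_name = soup.find('h3', class_="article-title")  # title element
file_name = file_name.text  # title text, used as the file name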

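item_a holds the specific content. It really does capture the whole block, so if you open the saved HTML file and see only the title, wait a moment for the rest to appear.

file_name holds the title text. When you crawl across multiple pages you need string concatenation; + only joins a string with another string, so convert other types (such as a page number) with str() first.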
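The concatenation point above can be sketched as follows. This is only an illustration: the base URL, the `.html` page naming scheme, and the helpers `page_url` and `numbered_name` are hypothetical and do not come from the site scraped in this post.

```python
# Hypothetical base URL, for illustration only.
BASE_URL = "https://example.com/list/"


def page_url(page):
    """Build the URL for a page number (assumed naming scheme)."""
    # `+` only joins str with str, so the int is converted first;
    # BASE_URL + page would raise TypeError.
    return BASE_URL + str(page) + ".html"


def numbered_name(title, page):
    """Append the page number to a title so file names stay distinct."""
    return title + "_" + str(page)
```

Calling `page_url(2)` with these assumptions gives `https://example.com/list/2.html`. An f-string such as `f"{BASE_URL}{page}.html"` does the conversion implicitly and is a common alternative.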

html = \
    '''
        <!DOCTYPE html>
            <html lang="en">
            <head>
                <meta charset="UTF-8">
                <title>Title</title>
            </head>
            <body>
                {}
            </body>
        </html>
    '''.format(''.join(str(tag) for tag in item_a))  # join the tags; formatting the list itself would add [ ] and commas

import os

# create the output folder if it does not exist yet
if not os.path.exists(dir_name):
    os.mkdir(dir_name)

try:
    # os.path.join avoids the invalid '\{' escape and works on any OS
    with open(os.path.join(dir_name, file_name + '.html'), 'w', encoding='utf-8') as f:
        f.write(html)
except OSError as e:
    print('Invalid file name:', e)

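This is the usual pattern for writing an HTML file, and at this point it covers basically every requirement. One thing to watch: you cannot use the page URL as the file name, because it contains a colon (:), which is not allowed in file names on Windows.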
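One possible way to work around the file name restriction is to replace the forbidden characters before saving. The sketch below is not part of the original script: `safe_file_name` and `save_html` are hypothetical helpers, and the character set is the one Windows rejects in file names.

```python
import os
import re

# Characters Windows forbids in file names; ':' and '/' are why a URL fails.
_FORBIDDEN = re.compile(r'[\\/:*?"<>|]')


def safe_file_name(name, replacement="_"):
    """Replace forbidden characters so the string can serve as a file name."""
    cleaned = _FORBIDDEN.sub(replacement, name).strip()
    # Fall back to a placeholder if nothing usable remains.
    return cleaned or "untitled"


def save_html(dir_name, title, html):
    """Write html to <dir_name>/<safe title>.html and return the path."""
    os.makedirs(dir_name, exist_ok=True)  # no error if the folder exists
    path = os.path.join(dir_name, safe_file_name(title) + ".html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return path
```

For example, `safe_file_name("a:b")` returns `"a_b"`, so even a title that happens to contain a colon can be saved without raising an error.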