Python: Scraping Website Data

Latest recommended article published on 2024-04-23 10:40:16

King～Kang

Article tags: python, web scraping, programming languages

Article link: https://blog.csdn.net/weixin_45409533/article/details/128067597

1. Install the dependencies

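Go to the root directory of your Python installation (or any directory, if pip is on your PATH) and run the following commands: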

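# requests is optional: the script below only uses urllib from the standard library
pip install requests
# Beautiful Soup 4 is published on PyPI as beautifulsoup4; the old "BeautifulSoup" package is version 3 and targets Python 2
pip install beautifulsoup4
# The script parses with the lxml parser, which is a separate package
pip install lxml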
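
After running the install commands, it can help to confirm that the packages are actually importable before writing the scraper. The following is a minimal sketch using only the standard library; the helper name `find_missing` and the `REQUIRED_MODULES` list are made up for this illustration. Note that beautifulsoup4 is imported under the name `bs4`.

```python
import importlib.util

# Import names (not pip package names) that the scraper relies on.
# beautifulsoup4 is imported as "bs4". This list is an assumption for illustration.
REQUIRED_MODULES = ["bs4", "lxml"]


def find_missing(modules):
    """Return the top-level module names that the import system cannot find."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are installed.")
```

If a module is reported as missing, rerun the matching pip command with the same interpreter that will run the scraper.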

2. Write the code

import urllib.request

from bs4 import BeautifulSoup


def handle_request(url):
    # Anti-scraping workaround: send a browser-like User-Agent header
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0',
    }
    # Build the request object
    request = urllib.request.Request(url, headers=headers)
    return request
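

def parse_content(content, fp):
    # Build the soup object with the lxml parser; soup now holds the whole page
    soup = BeautifulSoup(content, 'lxml')
    # Analyze the page and use select() with CSS selectors to pick out the data we want
    name_list = soup.select("h3")
    content_list = soup.select("div .des")
    data_list = []
    # Pair each title with its description and format the data
    for x, y in zip(name_list, content_list):
        # Strip surrounding newlines, spaces and slashes
        name = x.get_text().strip('\n /')
        text = y.get_text().strip('\n /')
        # Format the record; the keys mean "book title" and "content"
        record = {"书名": name, "内容": text}
        data_list.append(record)
        # print(name + ":" + text)

    # An empty list means the selectors matched nothing
    if not data_list:
        print("No content to write")
        return
    # Write to disk to persist the data
    fp.write(str(data_list))


def main():
    url = 'https://www.shicimingju.com/hecheng/index.html'
    # Build the request object
    request = handle_request(url)
    # Send the request and decode the response
    content = urllib.request.urlopen(request).read().decode('utf8')
    # Open the output file ("作者合集" means "author collection");
    # the with block closes it even if parsing raises an error
    with open('作者合集.txt', 'w', encoding='utf8') as fp:
        # Parse the content and write the result
        parse_content(content, fp)


if __name__ == '__main__':
    main()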
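
The script above persists the result by writing the string form of a Python list, which is awkward to read back later. One possible alternative, shown here as a sketch rather than as part of the original script, is to store the records as JSON. The helper names `save_records` and `load_records` and the sample data are hypothetical; only the standard library is used.

```python
import json
import os
import tempfile


def save_records(records, path):
    """Write a list of dicts to `path` as UTF-8 JSON (hypothetical helper)."""
    with open(path, "w", encoding="utf8") as fp:
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes
        json.dump(records, fp, ensure_ascii=False, indent=2)


def load_records(path):
    """Read the records back as a list of dicts (hypothetical helper)."""
    with open(path, encoding="utf8") as fp:
        return json.load(fp)


if __name__ == "__main__":
    # Made-up sample in the same shape as the scraper's records
    sample = [{"书名": "example title", "内容": "example description"}]
    with tempfile.TemporaryDirectory() as tmp_dir:
        out_path = os.path.join(tmp_dir, "sample.json")
        save_records(sample, out_path)
        print(load_records(out_path) == sample)
```

In the scraper, a call such as `save_records(data_list, 'output.json')` could stand in for `fp.write(str(...))`, assuming the list of dicts is available at that point.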