Saving data after scraping with Python 3 (using the pandas library)

Harold_96_lxw

Article link: https://blog.csdn.net/weixin_38168694/article/details/81276227

1. The usual first step, environment setup:

pip install pandas

pandas relies on openpyxl to write .xlsx files, and openpyxl is not installed along with pandas, so also run:

pip install openpyxl
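
The script in the next section also imports requests and bs4 (published on PyPI as beautifulsoup4), which the two commands above do not install. Here is a minimal sketch for checking which modules are still missing before running it. The helper name `missing_packages` is made up for illustration.

```python
from importlib.util import find_spec

def missing_packages(import_names):
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in import_names if find_spec(name) is None]

if __name__ == '__main__':
    # 'bs4' is the import name of the beautifulsoup4 package.
    needed = ['requests', 'bs4', 'pandas', 'openpyxl']
    missing = missing_packages(needed)
    if missing:
        print('Missing modules:', ', '.join(missing))
    else:
        print('All modules are available.')
```

Anything it reports can be installed with pip in the same way as above, for example `pip install requests beautifulsoup4`.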

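2. A brief overview of the scraping process:
This time I scraped the 7 news items on the first page of the news section of the Taiyuan University of Technology website.
URL: http://www2017.tyut.edu.cn/xyxw/lgyw.htm
After a quick look at the page structure, the code is as follows: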

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import pandas

def getMaininfo(url):
    # Scrape one news item and return a list:
    # [title, publication info (date, source, author), body text]
    result=[]
    res=requests.get(url)
    res.encoding='utf-8'
    soup=BeautifulSoup(res.text,'html.parser')
    result.append(soup.select('h2')[0].text)
    # Child nodes of the publication-info block, without the last one
    fb=soup.select('.xxxx')[0].contents[:-1]
    # Turn the node list into plain text, one field per line
    result.append(str(fb).lstrip('[').rstrip(']').replace('<span>','').replace('</span>','').replace('    , ','').replace(', ','\n'))
    result.append(soup.select('.v_news_content')[0].text)
    return result

def geturls(url):
    # Collect the news links on one list page, call getMaininfo on each,
    # and return the results as a dict of columns
    a=[]
    b=[]
    c=[]
    res = requests.get(url)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    for link in soup.select('.tit a'):
        # Resolve relative hrefs such as '../info/...' against the list page URL
        news_url = urljoin(url, link['href'])
        # Fetch each article once instead of once per field
        info = getMaininfo(news_url)
        a.append(info[0])
        b.append(info[1])
        c.append(info[2])
    result = {
        'title': a,
        'fubiao': b,
        'context': c
    }
    return result



if __name__ == '__main__':
    url='http://www2017.tyut.edu.cn/xyxw/lgyw.htm'
    news = geturls(url)  # scrape once and reuse the result
    print(news)
    df=pandas.DataFrame(news)  # build a DataFrame from the dict of lists
    df.to_excel('123.xlsx')  # write the scraped news to a spreadsheet in the project directory
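
Because the selectors `h2`, `.xxxx` and `.v_news_content` are indexed with `[0]`, the script raises an IndexError as soon as a page lacks one of those elements. Below is a rough sketch of a more defensive parser, run on an inline HTML string rather than the live site. The sample markup and the names `text_or_empty` and `parse_news` are assumptions for illustration, and the real page structure may differ. Unlike the original, this sketch keeps every child of the `.xxxx` block instead of dropping the last one.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only mimics the class names used above.
SAMPLE_HTML = """
<html><body>
  <h2>Sample headline</h2>
  <div class="xxxx"><span>2018-07-29</span><span>News Center</span><span>Editor</span></div>
  <div class="v_news_content"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""

def text_or_empty(soup, selector, separator=''):
    """Return the text of the first match for selector, or '' if nothing matches."""
    node = soup.select_one(selector)
    if node is None:
        return ''
    # strip=True trims each text fragment and skips whitespace-only ones
    return node.get_text(separator, strip=True)

def parse_news(html):
    """Extract title, publication info and body text from one article page."""
    soup = BeautifulSoup(html, 'html.parser')
    return {
        'title': text_or_empty(soup, 'h2'),
        'fubiao': text_or_empty(soup, '.xxxx', '\n'),
        'context': text_or_empty(soup, '.v_news_content', '\n'),
    }

if __name__ == '__main__':
    print(parse_news(SAMPLE_HTML))
```

Returning one dict per article also means a list of these dicts can be passed straight to `pandas.DataFrame`, which avoids keeping three parallel lists in sync.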

3. The result is as follows:
[Image: screenshot of the result]
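
By default `to_excel` also writes the DataFrame's integer index as an extra first column. Here is a small sketch, assuming that column is not wanted, which passes `index=False` and picks the writer from the file extension. The helper name `save_news` and the sample rows are invented for illustration and are not scraped data.

```python
import os
import tempfile

import pandas

def save_news(columns, path):
    """Write a dict of equal-length lists to .xlsx or .csv, without the index column."""
    df = pandas.DataFrame(columns)
    if path.lower().endswith('.xlsx'):
        df.to_excel(path, index=False)  # requires openpyxl to be installed
    else:
        # utf-8-sig adds a BOM, which helps Excel detect UTF-8 in CSV files
        df.to_csv(path, index=False, encoding='utf-8-sig')
    return df

if __name__ == '__main__':
    # Made-up rows standing in for the scraped result
    sample = {
        'title': ['News A', 'News B'],
        'fubiao': ['2018-07-29', '2018-07-30'],
        'context': ['Body of A', 'Body of B'],
    }
    out_path = os.path.join(tempfile.mkdtemp(), 'news.csv')
    save_news(sample, out_path)
    # Read the file back to confirm what was written
    print(pandas.read_csv(out_path, encoding='utf-8-sig'))
```

The CSV branch is handy when openpyxl is unavailable, and passing a path ending in `.xlsx` gives the same kind of spreadsheet as the script above, minus the index column.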