Scraping multi-page news listings from People's Daily Online (人民网) and exporting the results to xlsx

Latest recommended article published 2022-06-19 16:34:09

三猪

Category: Web scraping. Tags: web scraping, python

Article link: https://blog.csdn.net/weixin_39739342/article/details/79847794

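This works for the finance channel of People's Daily Online (人民网) or for any other channel; to switch channels, change the URL template.

Finance channel URL: http://finance.people.com.cn/index1.html#fy01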
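
Listing pages are addressed by filling a page number into a URL template, so switching channels comes down to swapping the template. A minimal sketch of that step: `page_urls` is a hypothetical helper name, and whether other channels follow the same `index{}.html` pattern is an assumption to check for each channel.

```python
def page_urls(template, first_page, last_page):
    """Return the listing-page URLs for first_page..last_page (inclusive)."""
    return [template.format(n) for n in range(first_page, last_page + 1)]

# Template taken from the finance channel URL above
finance_template = 'http://finance.people.com.cn/index{}.html#fy01'
urls = page_urls(finance_template, 1, 4)
print(urls[0])
```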

The script extracts each article's date, source, and title, and saves them to an .xlsx file with one row per article.
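
The title, source, and date all come from the `<head>` of each article page: the `<title>` tag plus the `<meta>` tags named `source` and `publishdate`. The sketch below shows the same extraction using only the standard library's `html.parser` in place of BeautifulSoup, so it runs without extra packages. `HeadInfoParser` is a hypothetical class name, and the sample HTML is made up for illustration, not real site output.

```python
from html.parser import HTMLParser

class HeadInfoParser(HTMLParser):
    """Collect the <title> text and the content of selected <meta name=...> tags."""

    def __init__(self, wanted=('source', 'publishdate')):
        super().__init__()
        self.wanted = wanted
        self.info = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'title':
            self._in_title = True
        elif tag == 'meta' and attrs.get('name') in self.wanted:
            # Read the attribute value directly; no string splitting needed
            self.info[attrs['name']] = attrs.get('content') or ''

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.info['title'] = self.info.get('title', '') + data.strip()

# Made-up sample page for illustration
sample = '''<html><head>
<title>Sample headline</title>
<meta name="source" content="Sample Source">
<meta name="publishdate" content="2018-04-08">
</head><body></body></html>'''

parser = HeadInfoParser()
parser.feed(sample)
info = parser.info
print(info)
```

Reading the `content` attribute as a parsed value keeps working when the value contains spaces or when the attribute order inside the tag changes.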

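Key point: each article's info dict is round-tripped through JSON (json.dumps, then json.loads) and added to a per-page list with .append; that list is then merged into the final list with .extend. The round trip only normalizes the dict to plain JSON types, and the DataFrame can be built from the plain dicts just as well.

list.append(object) adds the single object to the end of the list.

list.extend(sequence) adds every element of the sequence to the list.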
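
The difference between the two calls is easiest to see side by side. A small illustration with made-up dicts standing in for article records:

```python
page_items = []
page_items.append({'title': 'A'})   # the dict itself becomes one element
page_items.append({'title': 'B'})

all_items = [{'title': 'earlier'}]
all_items.extend(page_items)        # each element of page_items is added

nested = [{'title': 'earlier'}]
nested.append(page_items)           # the whole list is added as a single element

print(len(all_items), len(nested))  # prints: 3 2
```

Using `.append` with a list produces a nested list, which would give pandas one cell holding a list instead of one row per article.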

import requests
from bs4 import BeautifulSoup
import json
import pandas
import openpyxl  # not called directly; pandas needs it installed to write .xlsx

def url(newurl):  # fetch one article page and return its title, source and date as a dict
	news={}
	res=requests.get(newurl, timeout=10)
	res.encoding='GB2312'
	soup=BeautifulSoup(res.text,'lxml')
	head=soup.select_one('head')
	if head is None:
		return news
	title=head.select_one('title')
	if title is not None:
		news['title']=title.text.strip()
	# Read the content attribute directly. Splitting str(tag) on spaces breaks when a
	# value contains spaces, and lstrip/rstrip strip character sets, not prefixes.
	for lin in head.select('meta'):
		name=lin.get('name')
		if name=='source':
			news['source']=lin.get('content','')
		if name=='publishdate':
			news['date']=lin.get('content','')
	return news
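url_mu='http://finance.people.com.cn/index{}.html#fy01'  # listing-page template; swap it to scrape another channel
url_mu2='http://finance.people.com.cn{}'  # prefix for site-relative article links
copy=[]  # final list: one dict per article
for i in range(1,5):  # pages 1 to 4; change the range to scrape a different number of pages
	cop=[]  # articles found on the current page
	a=url_mu.format(i)
	res=requests.get(a, timeout=10)
	res.encoding='GB2312'
	soup=BeautifulSoup(res.text,'lxml')
	for link in soup.select('.left.w655'):
		for lin in link.select('h5'):
			anchors=lin.select('a[href]')
			if not anchors:  # skip headings that contain no link
				continue
			href=anchors[0]['href']
			if href.startswith('http'):
				b=href
			else:
				b=url_mu2.format(href)
			new1=url(b)
			# json.dumps() takes no encoding argument in Python 3, so it is dropped here
			jd=json.loads(json.dumps(new1, ensure_ascii=False))
			cop.append(jd)
	copy.extend(cop)  # merge this page's articles into the final list
df=pandas.DataFrame(copy)
df.to_excel('news3.xlsx')  # write the result as .xlsx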