Python keyword extraction with TextRank: a Python crawler that extracts news keywords with the TR (TextRank) algorithm

Latest recommended article published on 2024-06-27 10:27:35

weixin_39890629

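Article tags: Python keyword extraction

This post uses Python to crawl a web page, scrape news headlines from a specified URL, and save them to a file. It then applies the TextRank algorithm in jieba to the file contents to extract the keywords of the day. The process involves network requests, HTML parsing, and text processing.

Summary generated by CSDN using AI technology.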

# -*- coding:utf-8 -*-
# Author: MercuryYe
import urllib.request

import jieba.analyse
from bs4 import BeautifulSoup

### Crawler section ###
url = "http://www.bishijie.com/kuaixun"
print("Please wait, crawling...")

# Send a browser-like User-Agent header with the request
headers = ('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36')
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)
data = urllib.request.urlopen(url).read()
data = data.decode('utf-8')

# Parse the fetched HTML with BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
result = soup.find_all('a', target='_blank')

# Collect the title attributes, skipping links that have none and
# dropping duplicates while keeping the order they appear on the page
titles = list(dict.fromkeys(
    str(link.get('title')) for link in result if link.get('title')
))

# Append one title per line to the output file
with open('vaule.txt', 'a', encoding='utf-8') as filewrite:
    for title in titles:
        filewrite.write(title + '\n')

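### Keyword extraction section ###
# Read a text file, trying several encodings until one decodes cleanly
def read_from_file(directions):
    # The single-byte ISO-8859-2 fallback is tried last
    decode_set = ['utf-8', 'gb18030', 'gb2312', 'gbk', 'ISO-8859-2']
    for k in decode_set:
        try:
            with open(directions, "r", encoding=k) as file:
                return file.read()
        except UnicodeDecodeError:
            continue
    raise ValueError("Could not decode " + directions)

# Read the file
file_data = str(read_from_file('vaule.txt'))
print("Please wait, extracting keywords...\n")

# Extract keywords with the TextRank algorithm
textrank = jieba.analyse.textrank
keywords_TR = textrank(file_data)

# Print the list as returned, which keeps the ranking order
print("Today's keywords:", keywords_TR)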
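
The script above delegates the ranking to `jieba.analyse.textrank`. To give a rough idea of what a TextRank-style ranking does, here is a minimal pure-Python sketch. It is not jieba's implementation, which also segments the text and filters words by part of speech. The function name `textrank_keywords` and its parameters (`window`, `damping`, `iterations`, `top_k`) are hypothetical and chosen only for illustration. The input is assumed to be a list of already segmented words.

```python
from collections import defaultdict


def textrank_keywords(words, window=5, damping=0.85, iterations=10, top_k=5):
    """Rank words with a simplified TextRank over a co-occurrence graph.

    Hypothetical helper for illustration; not jieba's implementation.
    """
    # Build an undirected weighted graph: two different words that appear
    # within `window` positions of each other are linked, and every further
    # co-occurrence increases the edge weight.
    graph = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            other = words[j]
            if other == word:
                continue
            graph[word][other] += 1.0
            graph[other][word] += 1.0

    if not graph:
        return []

    # Start from a uniform score, then repeat the PageRank-style update:
    # each word receives a share of its neighbours' scores, proportional
    # to the weight of the connecting edge.
    scores = {node: 1.0 / len(graph) for node in graph}
    out_weight = {node: sum(nbrs.values()) for node, nbrs in graph.items()}
    for _ in range(iterations):
        updated = {}
        for node in graph:
            incoming = sum(
                weight / out_weight[neighbor] * scores[neighbor]
                for neighbor, weight in graph[node].items()
            )
            updated[node] = (1 - damping) + damping * incoming
        scores = updated

    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


if __name__ == "__main__":
    tokens = ["bitcoin", "price", "bitcoin", "exchange",
              "bitcoin", "market", "bitcoin", "wallet"]
    for word, score in textrank_keywords(tokens, window=2):
        print(word, round(score, 3))
```

In practice the word list would come from a tokenizer such as `jieba.lcut`. The sketch only shows why words that co-occur with many other words, like "bitcoin" in the toy input, end up at the top of the ranking.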
