The Python Newspaper Scraping Library

m0_38074612

Category: Python. Tags: python, web scraping

Article link: https://blog.csdn.net/m0_38074612/article/details/124803263

pip3 install newspaper3k

1. Extract the news list (titles, URLs, etc.)

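import newspaper

# Index page of the site section to scan
url = 'https://sthj.sh.gov.cn/hbzhywpt1013/hbzhywpt1041/index.html'
# memoize_articles=False turns off caching, so every run lists all articles found
paper = newspaper.build(url, language="zh", memoize_articles=False)
for article in paper.articles:
    print(article.title, article.url)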

Output:

中华人民共和国噪声污染防治法 https://sthj.sh.gov.cn/hbzhywpt1013/hbzhywpt1044/20211229/57ad41586f2e4b3d95cc6fcabfb5fc54.html
中华人民共和国湿地保护法 https://sthj.sh.gov.cn/hbzhywpt1013/hbzhywpt1044/20211229/89a89da3c9ba4e6da3a56468e1dc50b5.html
企业环境信息依法披露管理办法 https://sthj.sh.gov.cn/hbzhywpt1013/hbzhywpt1041/20211222/c30ba2d93f084e8d8c2b4e4073fe9c2c.html
危险废物转移管理办法 https://sthj.sh.gov.cn/hbzhywpt1013/hbzhywpt1041/20211222/7bc56782b62149ae9408ef02500faa4d.html
关于修改部分部门规章的决定（2021年） https://sthj.sh.gov.cn/hbzhywpt1013/hbzhywpt1041/20211222/93ffcda185a7403ebb98e1f1f36048b1.html
关于废止固体废物进口相关规章和规范性文件的决定 https://sthj.sh.gov.cn/hbzhywpt1013/hbzhywpt1041/20210122/811a0d6a365c4a37b3d5dbef1f2f7361.html
放射性物品运输安全许可管理办法 https://sthj.sh.gov.cn/hbzhywpt1013/hbzhywpt1041/20201019/9a4b18a6c3434f118b86d8d7b1332c65.html
放射性同位素与射线装置安全许可管理办法 https://sthj.sh.gov.cn/hbzhywpt1013/hbzhywpt1041/20201019/ccf956fb2522442296faa9c57322ea35.html
关于废止、修改部分生态环境规章和规范性文件的决定 https://sthj.sh.gov.cn/hbzhywpt1013/hbzhywpt1041/20210111/a011899956414f948da73d057f6850a3.html
碳排放权交易管理办法（试行） https://sthj.sh.gov.cn/hbzhywpt1013/hbzhywpt1041/20210106/1d7cd8449ac94a20841bbb4a57d70ce4.html
生态环境标准管理办法 https://sthj.sh.gov.cn/hbzhywpt1013/hbzhywpt1041/20201231/c6bd784ac55e4b998fe781ecc69ccd7d.html
建设项目环境影响评价分类管理名录（2021年版） https://sthj.sh.gov.cn/hbzhywpt1013/hbzhywpt1041/20201230/87e10258568d4f3281f84f8572104232.html
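
The output above mixes links from two site sections (`hbzhywpt1041` and `hbzhywpt1044`). As a minimal sketch, assuming the section is identified by the first two path segments of each URL (as it is in this output), the links can be grouped with the standard library alone. `group_by_section` is a hypothetical helper for illustration and is not part of newspaper.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_section(urls, depth=2):
    """Group URLs by their first `depth` path segments.

    Hypothetical helper for illustration; not part of newspaper.
    """
    groups = defaultdict(list)
    for url in urls:
        # Split the path and drop the empty strings produced by leading slashes
        segments = [s for s in urlparse(url).path.split("/") if s]
        key = "/".join(segments[:depth])
        groups[key].append(url)
    return dict(groups)


# Three of the URLs printed above
urls = [
    "https://sthj.sh.gov.cn/hbzhywpt1013/hbzhywpt1044/20211229/57ad41586f2e4b3d95cc6fcabfb5fc54.html",
    "https://sthj.sh.gov.cn/hbzhywpt1013/hbzhywpt1041/20211222/c30ba2d93f084e8d8c2b4e4073fe9c2c.html",
    "https://sthj.sh.gov.cn/hbzhywpt1013/hbzhywpt1041/20211222/7bc56782b62149ae9408ef02500faa4d.html",
]
for section, links in group_by_section(urls).items():
    print(section, len(links))
```

With a built source, the same helper could be fed `[a.url for a in paper.articles]`.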

2. Extract news categories

for category in paper.category_urls():
    print(category)

3. Extract news content: Article

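from newspaper import Article

url = 'https://sthj.sh.gov.cn/hbzhywpt1013/hbzhywpt1041/index.html'
news = Article(url, language='zh')
news.download()  # fetch the page HTML
news.parse()     # extract the title, text, images, and so on
news.nlp()       # keywords and summary stay empty without this; needs NLTK's punkt data

print(news.url)
# news.url is the URL of the page
print(news.text)
# news.text is the extracted body text of the page
print(news.title)
# news.title is the title of the page
print(news.html)
# news.html is the raw HTML source of the page
print(news.authors)
print(news.top_image)
print(news.movies)
print(news.keywords)
print(news.summary)
print(news.images)
print(news.imgs)
# news.imgs is an alias of news.images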
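
Printing each attribute is fine for exploration, but for storage it can help to collect the fields into one record. The following is a minimal sketch under stated assumptions: `article_to_record` is a hypothetical helper (not part of newspaper), and the stand-in object with made-up values only mimics the attribute names used above so that the sketch runs without network access.

```python
import json
from types import SimpleNamespace


def article_to_record(article):
    """Copy a few fields of a parsed article into a JSON-friendly dict.

    Hypothetical helper; it only reads attributes, so any object that
    exposes these attribute names works.
    """
    return {
        "url": article.url,
        "title": article.title,
        "text": article.text,
        "authors": list(article.authors),
        "top_image": article.top_image,
        # images may be a set; sort it so the list order is stable
        "images": sorted(article.images),
    }


# Stand-in with made-up values. With newspaper, pass a real Article
# after download() and parse() instead.
fake = SimpleNamespace(
    url="https://example.com/a.html",
    title="Example title",
    text="Example body text.",
    authors=[],
    top_image="",
    images={"https://example.com/b.png", "https://example.com/a.png"},
)
record = article_to_record(fake)
# ensure_ascii=False keeps Chinese text readable in the JSON output
print(json.dumps(record, ensure_ascii=False, indent=2))
```

In the example above, the call would be `article_to_record(news)` once `news.parse()` has run.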

The above covers only basic usage. For more, see: "News scraping library: Newspaper" (新闻类爬虫库：Newspaper)
