A Look at Several Major Web Crawler Frameworks

Latest recommended article published on 2024-04-05 09:45:00

weixin_30800987

Tags: web crawler, artificial intelligence, python

Original link: http://www.cnblogs.com/jiabotao/p/10432316.html


1. pyspider demo (standard usage)

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    # Run on_start once every 24 hours.
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    # Treat a fetched page as fresh for 10 days before crawling it again.
    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Follow every link whose href starts with "http".
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # The returned dict is the result stored for this page.
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
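To make the handler's parsing steps concrete without a running pyspider instance, here is a minimal sketch using only the Python standard library. It is not pyspider's implementation. The names `LinkAndTitleParser`, `parse_page`, and `SAMPLE_HTML` are hypothetical and exist only for illustration. It applies the same filter as the `a[href^="http"]` selector (keep links whose `href` starts with `http`) and reads the page title, as `index_page` and `detail_page` do.

```python
from html.parser import HTMLParser


class LinkAndTitleParser(HTMLParser):
    """Collect links whose href starts with "http" and the <title> text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            # Same filter as the CSS selector a[href^="http"].
            if href and href.startswith("http"):
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def parse_page(url, html):
    """Return the url, title, and absolute links found in an HTML string."""
    parser = LinkAndTitleParser()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "title": "".join(parser.title_parts).strip(),
        "links": parser.links,
    }


# Hypothetical in-memory page, so the sketch needs no network access.
SAMPLE_HTML = """
<html><head><title>Example page</title></head>
<body>
  <a href="http://example.com/a">absolute</a>
  <a href="/relative">relative</a>
  <a href="https://example.com/b">secure</a>
</body></html>
"""

result = parse_page("http://example.com/", SAMPLE_HTML)
print(result)
```

The relative link is skipped because its `href` does not start with `http`, which is why the handler only follows absolute URLs.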

2. newspaper

Mainly used for text and document analysis; it is commonly used to extract article text and related metadata.

>>> from newspaper import Article

>>> url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
>>> article = Article(url)

>>> article.download()

>>> article.html
'<!DOCTYPE HTML><html itemscope itemtype="http://...'

>>> article.parse()

>>> article.authors
['Leigh Ann Caldwell', 'John Honway']

>>> article.publish_date
datetime.datetime(2013, 12, 30, 0, 0)

>>> article.text
"Washington (CNN) -- Not everyone subscribes to a New Year's resolution..."

>>> article.top_image
'http://someCDN.com/blah/blah/blah/file.png'

>>> article.movies
['http://youtube.com/path/to/link.com', ...]

>>> article.nlp()

>>> article.keywords
['New Years', 'resolution', ...]

>>> article.summary
'The study shows that 93% of people ...'
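The session above always follows the same order: `download()`, then `parse()`, then optionally `nlp()`. As a rough sketch, that sequence can be wrapped in a small helper. The function name `extract_article_fields` and its `run_nlp` flag are hypothetical and not part of newspaper. The helper only assumes that the object passed in exposes the methods and attributes shown in the session above.

```python
def extract_article_fields(article, run_nlp=True):
    """Run download -> parse (-> nlp) and return the extracted fields.

    `article` is assumed to behave like the Article object in the session
    above: it has download(), parse(), nlp() and the attributes read below.
    """
    article.download()
    article.parse()
    fields = {
        "authors": article.authors,
        "publish_date": article.publish_date,
        "text": article.text,
        "top_image": article.top_image,
    }
    if run_nlp:
        # keywords and summary are read only after nlp() has been called.
        article.nlp()
        fields["keywords"] = article.keywords
        fields["summary"] = article.summary
    return fields
```

With newspaper installed, a call would look like `extract_article_fields(Article(url))`. Passing `run_nlp=False` skips the keyword and summary step when only the raw text is needed.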

>>> import newspaper

>>> cnn_paper = newspaper.build('http://cnn.com')

>>> for article in cnn_paper.articles:
>>>     print(article.url)
http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/
http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html
...

>>> for category in cnn_paper.category_urls():
>>>     print(category)
http://lifestyle.cnn.com
http://cnn.com/world
http://tech.cnn.com
...

>>> cnn_article = cnn_paper.articles[0]
>>> cnn_article.download()
>>> cnn_article.parse()
>>> cnn_article.nlp()
...

>>> from newspaper import fulltext

>>> import requests
>>> html = requests.get(...).text
>>> text = fulltext(html)

Newspaper can extract and detect languages seamlessly. If no language is specified, Newspaper will attempt to auto detect a language.

>>> from newspaper import Article
>>> url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
>>> a = Article(url, language='zh') # Chinese
>>> a.download()
>>> a.parse()
>>> print(a.text[:150])
香港行政长官梁振英在各方压力下就其大宅的违章建
筑（僭建）问题到立法会接受质询，并向香港民众道歉。
梁振英在星期二（12月10日）的答问大会开始之际
在其演说中道歉，但强调他在违章建筑问题上没有隐瞒的
意图和动机。
一些亲北京阵营议员欢迎梁振英道歉，
且认为应能获得香港民众接受，但这些议员也质问梁振英有
>>> print(a.title)
港特首梁振英就住宅违建事件道歉

If you are certain that an entire news source is in one language, go ahead and use the same api :)

>>> import newspaper
>>> sina_paper = newspaper.build('http://www.sina.com.cn/', language='zh')
>>> for category in sina_paper.category_urls():
>>>     print(category)
http://health.sina.com.cn
http://eladies.sina.com.cn
http://english.sina.com
...

>>> article = sina_paper.articles[0]
>>> article.download()
>>> article.parse()
>>> print(article.text)
新浪武汉汽车综合
随着汽车市场的日趋成熟，
传统的“集全家之力抱得爱车归”的全额购车模式已然过时，
另一种轻松的新兴
车模式――金融购车正逐步成为时下消费者购
买爱车最为时尚的消费理念，他们认为，这种新颖的购车
模式既能在短期内
...
>>> print(article.title)
两年双免0手续0利率 科鲁兹掀背金融轻松购_武汉车市_武汉汽
车网_新浪汽车_新浪网

Reposted from: https://www.cnblogs.com/jiabotao/p/10432316.html
