Crawling Search Results from Mainstream Search Engines by Keyword

Latest recommended article published on 2024-03-07 20:59:02

Chris的算法之旅

Category: Crawlers. Article tags: Scrapy, search engine

This article was first published on my blog: http://gongyanli.com
Code: https://github.com/Gladysgong/seCrawler
Jianshu: https://www.jianshu.com/p/4e244563849a
CSDN: https://blog.csdn.net/u012052168/article/details/79762586

seCrawler(Search Engine Crawler)

A Scrapy project that crawls search results from Google, Bing, and Baidu.

reference

Adapted from https://github.com/xtt129/seCrawler and rewritten to also capture the title and abstract of each result.

prerequisite

Python 3.5 and Scrapy are required.

commands

Run a single command to fetch 50 pages of results from a search engine for a given keyword. The results are saved to "urls.txt" in the current directory.

Bing

scrapy crawl keywordSpider -a keyword=Spider-Man -a se=bing -a pages=50

Baidu

scrapy crawl keywordSpider -a keyword=Spider-Man -a se=baidu -a pages=50

Google

scrapy crawl keywordSpider -a keyword=Spider-Man -a se=google -a pages=50
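Scrapy passes each -a option to the spider as a string keyword argument, so keyword, se, and pages arrive as text. The sketch below shows one way those three values could be turned into a list of result-page URLs. It is not the project's actual code: the URL templates, the page size of 10, and the helper name build_search_urls are assumptions made for illustration.

```python
from urllib.parse import quote_plus

# Assumed URL templates; the real project may use different ones.
SEARCH_URLS = {
    "bing": "https://www.bing.com/search?q={query}&first={offset}",
    "baidu": "https://www.baidu.com/s?wd={query}&pn={offset}",
    "google": "https://www.google.com/search?q={query}&start={offset}",
}
RESULTS_PER_PAGE = 10  # assumed number of results per page


def build_search_urls(keyword, se, pages):
    """Return one result-page URL for each requested page (hypothetical helper)."""
    engine = se.lower()
    template = SEARCH_URLS[engine]  # raises KeyError for an unknown engine
    query = quote_plus(keyword)     # URL-encode the keyword
    urls = []
    for page in range(int(pages)):  # -a values arrive as strings
        offset = page * RESULTS_PER_PAGE
        if engine == "bing":
            offset += 1  # assumed: Bing's `first` parameter is 1-based
        urls.append(template.format(query=query, offset=offset))
    return urls


if __name__ == "__main__":
    for url in build_search_urls("Spider-Man", "baidu", "3"):
        print(url)
```

In a spider, a list like this would typically be assigned to start_urls inside __init__, using the keyword, se, and pages arguments received from the command line.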

results

The URL, title, and abstract of each result are stored in urls.txt.
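As a rough illustration of how the three fields might end up in urls.txt, here is a minimal item pipeline sketch. Scrapy item pipelines are plain Python classes, so no Scrapy import is needed. The class name UrlsFilePipeline, the item keys url, title, and abstract, and the tab-separated one-line-per-result layout are assumptions; the project's real output format may differ.

```python
class UrlsFilePipeline:
    """Hypothetical pipeline that writes one tab-separated line per result."""

    def __init__(self, path="urls.txt"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the crawl starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        fields = (
            item.get("url", ""),
            item.get("title", ""),
            item.get("abstract", ""),
        )
        # Collapse internal whitespace so each result stays on one line.
        line = "\t".join(" ".join(str(field).split()) for field in fields)
        self.file.write(line + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the crawl finishes.
        self.file.close()
```

A pipeline like this would be enabled through the ITEM_PIPELINES setting in settings.py, keyed by the class's import path, which depends on where the class lives in the project.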

limitation

The project doesn't provide any workaround for anti-spider measures such as CAPTCHAs or IP bans.

To reduce the chance of triggering these measures, we recommend setting DOWNLOAD_DELAY=10 in settings.py to add a delay (in seconds) between consecutive page requests. See the Scrapy settings documentation for details.
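For reference, the relevant part of settings.py might look like the fragment below. DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY are standard Scrapy settings; stating the second one explicitly is optional, since it is enabled by default.

```python
# settings.py (fragment)

# Wait about 10 seconds between consecutive requests to the same site.
DOWNLOAD_DELAY = 10

# Enabled by default in Scrapy: the actual wait is a random value between
# 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY, which makes the request
# pattern less regular.
RANDOMIZE_DOWNLOAD_DELAY = True
```

The same value can also be set for a single run with Scrapy's -s option, for example by adding -s DOWNLOAD_DELAY=10 to the scrapy crawl command.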