Python3 Large-Scale Web Crawler in Practice 002 --- Creating a Scrapy Project and Its Spiders --- Example: Scraping the Baidu Title and CSDN Blog Posts

Latest recommended article published on 2024-07-08 21:12:19

AoboSir

Categories: Scrapy, Large-scale crawler projects, Python3, Crawlers, Windows. Tags: Scrapy, python3, csdn, baidu, web crawler

Article link: https://blog.csdn.net/github_35160620/article/details/53353747

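This post shows how to create a crawler project with the Scrapy framework and use it to scrape the Baidu home page title and CSDN blog posts. It walks through creating the Scrapy project, setting up the spiders, parsing the page source, and extracting information, and it gives the commands for running the spiders.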

Summary generated by CSDN using intelligent technology

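Development environment

Third-party Python libraries: lxml, Twisted, pywin32, scrapy
Python version: python-3.5.0-amd64
PyCharm version: pycharm-professional-2016.1.4
Operating system: Windows 10, 64-bit

If you have not set up the development environment yet, see this blog post.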

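1 Key concepts: creating a Scrapy project and creating a spider

1.1 Creating a Scrapy project

Next we create a Scrapy crawler project, and then create a Scrapy spider file inside that project.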

scrapy startproject <projectname>
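
To make the result of this command concrete, here is a minimal sketch of the directory layout it is expected to produce. The file list is an assumption based on Scrapy's default project template (newer Scrapy versions also add a middlewares.py), and the project name "demo" and the helper expected_layout are illustrative only, not part of Scrapy.

```python
from pathlib import PurePosixPath
from typing import List


def expected_layout(project_name: str) -> List[str]:
    """Return the files we assume `scrapy startproject` creates, as relative paths."""
    root = PurePosixPath(project_name)   # outer folder that holds scrapy.cfg
    package = root / project_name        # inner Python package with the same name
    paths = [
        root / "scrapy.cfg",                  # deployment / project configuration
        package / "__init__.py",
        package / "items.py",                 # item definitions
        package / "pipelines.py",             # item pipelines
        package / "settings.py",              # project settings
        package / "spiders" / "__init__.py",  # spider files are created in this folder
    ]
    return [str(p) for p in paths]


# "demo" is a placeholder project name used only for this illustration.
for path in expected_layout("demo"):
    print(path)
```

The point to notice is the nesting: the outer folder and the inner package share the project name, and spider files end up under the inner spiders folder.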

1.2 Creating a Scrapy spider file

cd <projectname>
scrapy genspider -t basic <filename> <domain>

For more on the other Scrapy commands, see this blog post.

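2 Example: scraping the Baidu title and CSDN blog posts

We create one crawler project, create a spider file in it to scrape Baidu, and then create a second spider file to scrape CSDN blog posts.

First, create a Scrapy project: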

scrapy startproject firstdemo

Output:

D:\WorkSpace\python_ws\python-large-web-crawler>scrapy startproject firstdemo
New Scrapy project 'firstdemo', using template directory 'c:\\users\\aobo\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\scrapy\\templates\\project', created in:
    D:\WorkSpace\python_ws\python-large-web-crawler\firstdemo

You can start your first spider with:
    cd firstdemo
    scrapy genspider example example.com

D:\WorkSpace\python_ws\python-large-web-crawler>

2-1.1 Scraping the Baidu title with a Scrapy spider

Create a spider file to scrape Baidu:

cd firstdemo
scrapy genspider -t basic baidu baidu.com

Output:

D:\WorkSpace\python_ws\python-large-web-crawler>cd firstdemo

D:\WorkSpace\python_ws\python-large-web-crawler\firstdemo>scrapy genspider -t basic baidu baidu.com
Created spider 'baidu' using template 'basic' in module:
  firstdemo.spiders.baidu

D:\WorkSpace\python_ws\python-large-web