Scrapy Basics
I. Installation
- Windows
pip install scrapy
- Linux
sudo apt-get update && sudo apt-get install python3-scrapy
After a successful installation, run scrapy in a terminal.
It should print output like the following:
scrapy
Scrapy 2.2.0 - no active project
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test commands
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
[ more ] More commands available when run from project directory
Use "scrapy <command> -h" to see more info about a command
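The same check can also be made from Python. This is a minimal sketch, assuming Python 3.8 or later (for importlib.metadata); the helper name installed_version is hypothetical and not part of Scrapy:

```python
from importlib import metadata


def installed_version(dist_name="Scrapy"):
    """Return the installed version string of a distribution, or None if it is not found.

    installed_version is an illustrative helper, not a Scrapy API.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("Scrapy is not installed in this environment")
    else:
        print("Scrapy", version)
```

It returns None when no distribution named Scrapy is visible to the current interpreter, for example when the package was installed into a different virtual environment.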
II. Basic Operations
Run scrapy startproject tutorial in a terminal. Scrapy creates a tutorial directory under the current directory:
New Scrapy project 'tutorial', using template directory 'c:\users\zhujz\envs\spider\lib\site-packages\scrapy\templates\project', created in:
D:\VScodeProject\scrapytest\tutorial
You can start your first spider with:
cd tutorial
scrapy genspider example example.com
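The project contains the following files:
- scrapy.cfg: the project's configuration file.
- tutorial/: the project's Python module. You will add your code here later.
- tutorial/items.py: the project's items file.
- tutorial/pipelines.py: the project's pipelines file.
- tutorial/settings.py: the project's settings file.
- tutorial/spiders/: the directory that holds the spider code.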
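To confirm that the project was generated as described, the layout can be checked with a few lines of standard-library Python. This is only a sketch: the EXPECTED list simply repeats the paths from the list above, and missing_project_files is a hypothetical helper name, not something Scrapy provides.

```python
from pathlib import Path

# Paths from the list above, relative to the project root created by startproject.
EXPECTED = [
    "scrapy.cfg",
    "tutorial/items.py",
    "tutorial/pipelines.py",
    "tutorial/settings.py",
    "tutorial/spiders",
]


def missing_project_files(root):
    """Return the expected paths that do not exist under root (illustrative helper)."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    # Assumes the script is run from the directory that contains scrapy.cfg.
    missing = missing_project_files(".")
    print("missing:", missing if missing else "nothing")
```

An empty result means every file and directory named above is present.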
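1. Define the scraping target
Rewrite tutorial/items.py.
For example, if the goal is to scrape the name, job title, and info fields from the itcast website, create a new scrapy.Item subclass to build the item model: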
import scrapy
class TutorialItem(scrapy.Item):