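When to use Scrapy:
1. There are many target URLs to crawl
2. The job covers the full pipeline: data collection, parsing, cleaning, and storage
3. Easier long-term operation and maintenance is wanted (modular, highly integrated)
Install Scrapy in PyCharm: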
pip install lxml
pip install Twisted
pip install Scrapy
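To confirm that the three installs above succeeded, a quick check from Python can help. This is a minimal sketch using only the standard library; the helper name check_installed is made up for illustration, and the lowercase names are the import names of the packages, not the pip names.

```python
import importlib.util


def check_installed(packages):
    """Return {import_name: True/False} depending on whether each package can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


# Import names differ in case from the pip names (Twisted -> twisted, Scrapy -> scrapy)
status = check_installed(["lxml", "twisted", "scrapy"])
for name, ok in status.items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```

Run it with the same interpreter that PyCharm uses for the project; a "missing" entry usually means pip installed into a different environment.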
Workflow:
1. Create the project
Run this in the Command Prompt (cmd).
# scrapy startproject <project name> <directory to create the project in>
scrapy startproject tipdmScrapy C:\Users\wangx\Desktop\tipdmSpider1
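The command above should leave a scrapy.cfg file in the target directory, next to a tipdmScrapy package holding items.py, pipelines.py, settings.py and a spiders folder. The sketch below checks for that layout; the helper missing_project_files and the EXPECTED list are written for illustration and assume the default project template.

```python
from pathlib import Path

# Files the default Scrapy project template is expected to create (assumption: default template)
EXPECTED = [
    "scrapy.cfg",
    "tipdmScrapy/__init__.py",
    "tipdmScrapy/items.py",
    "tipdmScrapy/middlewares.py",
    "tipdmScrapy/pipelines.py",
    "tipdmScrapy/settings.py",
    "tipdmScrapy/spiders/__init__.py",
]


def missing_project_files(root):
    """Return the expected files that do not exist under the project directory `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]


# An empty list means the project skeleton is complete
print(missing_project_files(r"C:\Users\wangx\Desktop\tipdmSpider1"))
```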
2. Define the fields: open items.py
Original file:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class TipdmscrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
After editing:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class TipdmscrapyItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
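3. Write the spider script: parse the page
Open the Terminal, change to the directory two levels above the project's .py files (the project root, which contains scrapy.cfg), and run the command below to generate a new script. The generated file is placed in the spiders folder.
# scrapy genspider <spider name> <site domain>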
scrapy genspider spider_title www.tipdm.com
Automatically generated code:
import scrapy
class SpiderTitleSpider(scrapy.Spider):
    name = "spider_title"
    allowed_domains = ["www.tipdm.com"]
    start_urls = ["https://www.tipdm.com"]
    def parse(self, response):
        pass
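The URL in start_urls can be edited by hand.
Complete code:
import scrapy
# Import the item class defined in items.py. This path reflects the author's folder layout;
# when running from the project root it is typically tipdmScrapy.items.
from scrapy爬虫.tipdmScrapy.items import TipdmscrapyItem
class SpiderTitleSpider(scrapy.Spider):
    name = "spider_title"
    # Domains the spider is allowed to crawl
    allowed_domains = ["www.tipdm.com"]
    # Page to start crawling from
    start_urls = ["http://www.tipdm.com/xwzx/index.jhtml"]
    def parse(self, response):
        # Create the item object
        item = TipdmscrapyItem()
        # XPath of the content to extract; extract() returns the text of each matched node
        titles = [each.extract() for each in response.xpath('//*[@id="t505"]/div[1]/div[3]/h1/a/text()')]
        item['title'] = titles
        return item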
4. Save the data
Open the Terminal.
# scrapy crawl <spider name> -o <output file name>
scrapy crawl spider_title -o title.csv
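The -o option handles the export itself. If the titles need cleaning before they are written, one option is an item pipeline in the project's pipelines.py. The class below is a hypothetical sketch, not part of the generated project: TitleCleanPipeline is an invented name, and it assumes the title field holds a list of strings as in the spider above.

```python
class TitleCleanPipeline:
    """Hypothetical pipeline: strips whitespace from titles and drops empty ones."""

    def process_item(self, item, spider):
        raw = item.get("title") or []
        # The spider stores a list of strings; accept a single string as well
        if isinstance(raw, str):
            raw = [raw]
        item["title"] = [t.strip() for t in raw if t and t.strip()]
        return item


# Offline check with a plain dict standing in for the scraped item
cleaned = TitleCleanPipeline().process_item(
    {"title": ["  News A \n", "", "News B"]}, spider=None
)
print(cleaned["title"])
```

To have Scrapy call it, the class would be registered in settings.py through the ITEM_PIPELINES setting, for example with the key "tipdmScrapy.pipelines.TitleCleanPipeline" and a priority such as 300.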