This post is based on: https://www.urlteam.org/2016/06/scrapy-%E5%85%A5%E9%97%A8%E9%A1%B9%E7%9B%AE-%E7%88%AC%E8%99%AB%E6%8A%93%E5%8F%96w3c%E7%BD%91%E7%AB%99/
The original author used Python 2, so I ran into quite a few pitfalls when trying it with Python 3.6.
1. Create the project
$ scrapy startproject w3school
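Once created, the project directory contains the generated files, including the items.py, pipelines.py and settings.py used below, plus a spiders folder for the spider code.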
2. Define the Item container
The scraped data is stored under these fields:

    from scrapy.item import Item, Field


    class W3schoolItem(Item):
        title = Field()
        link = Field()
        desc = Field()
3. pipelines.py
The pipeline is where the scraped data gets processed: deduplicated, dropped, or stored.
    # -*- coding:utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

    import json
    import codecs


    class W3SchoolPipeline(object):
        def __init__(self):
            # 'wb' writes bytes, so encoding='utf-8' is required:
            # it turns each str into UTF-8 bytes on write.
            self.file = codecs.open('w3school_data_utf8.json', 'wb', encoding='utf-8')

        def process_item(self, item, spider):
            # ensure_ascii=False keeps Chinese text readable in the generated JSON
            line = json.dumps(dict(item), ensure_ascii=False) + '\n'
            # Write the record to the JSON file, one object per line
            self.file.write(line)
            return item

        def close_spider(self, spider):
            # Close the output file when the crawl finishes
            self.file.close()
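To see what the ensure_ascii=False flag in the pipeline changes, here is a small standalone sketch using only the standard library. The record below is made-up sample data, not real crawl output.

```python
import json

# Hypothetical record shaped like one scraped item (values are lists,
# because extract() in the spider returns a list of matches).
record = {"title": ["XML 简介"], "link": ["/xml/xml_intro.asp"]}

# Default behaviour: every non-ASCII character is written as a \uXXXX escape.
escaped = json.dumps(record)

# With ensure_ascii=False the Chinese characters are kept as-is.
readable = json.dumps(record, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings decode back to the same data; only the on-disk form differs.
assert json.loads(escaped) == json.loads(readable) == record
```

Either form is valid JSON. The readable one is simply easier to inspect in a text editor, as long as the file is written as UTF-8.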
To enable the pipeline, add the following to settings.py:

    ITEM_PIPELINES = {
        'w3school.pipelines.W3SchoolPipeline': 300,
    }

4. Spider code
The spider code is fairly easy to follow: use XPath to locate the data you want to scrape, extract it, and put it into the item container.

    #!/usr/bin/python
    # -*- coding:utf-8 -*-

    from scrapy.spiders import Spider  # the original Python 2 tutorial used scrapy.spider
    from scrapy.selector import Selector

    from w3school.items import W3schoolItem


    class W3schoolSpider(Spider):
        # This is the name used to run the spider later
        name = "w3school"
        allowed_domains = ["w3school.com.cn"]
        start_urls = [
            "http://www.w3school.com.cn/xml/xml_syntax.asp"
        ]

        def parse(self, response):
            sel = Selector(response)
            sites = sel.xpath('//div[@id="navsecond"]/div[@id="course"]/ul[1]/li')
            items = []
            for site in sites:
                item = W3schoolItem()
                title = site.xpath('a/text()').extract()
                link = site.xpath('a/@href').extract()
                desc = site.xpath('a/@title').extract()
                # Store the values directly; no manual encoding needed as in Python 2
                item['title'] = title
                item['link'] = link
                item['desc'] = desc
                items.append(item)
            return items
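To get a feel for what the XPath in parse() selects, without running a crawl, here is a rough sketch. It swaps Scrapy's Selector for the standard library's xml.etree.ElementTree, which supports only a subset of XPath, and runs against a hypothetical, simplified snippet standing in for the page's navigation markup. The real page will differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the navigation block on the page.
SAMPLE = """
<html><body>
  <div id="navsecond"><div id="course">
    <ul>
      <li><a href="/xml/xml_intro.asp" title="XML intro">XML Intro</a></li>
      <li><a href="/xml/xml_syntax.asp" title="XML syntax">XML Syntax</a></li>
    </ul>
    <ul>
      <li><a href="/other.asp" title="Other">Other</a></li>
    </ul>
  </div></div>
</body></html>
"""


def extract_links(markup):
    """Return title/link/desc for each <li> in the first <ul> of div#course."""
    root = ET.fromstring(markup.strip())
    rows = []
    # Same shape as the spider's XPath: ul[1] is the first <ul> child only.
    for li in root.findall(".//div[@id='navsecond']/div[@id='course']/ul[1]/li"):
        a = li.find("a")
        rows.append({
            "title": a.text,         # corresponds to a/text()
            "link": a.get("href"),   # corresponds to a/@href
            "desc": a.get("title"),  # corresponds to a/@title
        })
    return rows


rows = extract_links(SAMPLE)
for row in rows:
    print(row)
```

One difference from the spider: extract() in Scrapy returns a list of matches, so each field in the scraped items is a list rather than a single string.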
5. Run the spider
From your project directory, run:
$ scrapy crawl w3school
Then open the w3school_data_utf8.json file created in the project folder with Notepad to see the results.
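Because the pipeline writes one JSON object per line, the file is not a single JSON document, so json.load() on the whole file would fail. A minimal sketch for reading it back in Python follows; load_items is a hypothetical helper name, and the demo writes a small made-up file to a temporary folder instead of relying on real crawl output.

```python
import json
import os
import tempfile


def load_items(path):
    """Parse a file holding one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up sample lines in the same shape the pipeline writes.
sample_lines = [
    '{"title": ["XML 简介"], "link": ["/xml/xml_intro.asp"], "desc": ["XML 简介"]}',
    '{"title": ["XML 语法"], "link": ["/xml/xml_syntax.asp"], "desc": ["XML 语法"]}',
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "w3school_data_utf8.json")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(sample_lines) + "\n")
    items = load_items(path)

for item in items:
    print(item["title"], item["link"])
```

Against a real crawl, you would pass the path of the w3school_data_utf8.json file in the project folder to load_items instead.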
Done! d=====( ̄▽ ̄*)b