Prerequisite reading: Scrapy basics (see the linked article).
一、Creating the project
Run in cmd:

```shell
scrapy startproject douban                         # scrapy startproject <project_name>
cd douban/douban/spiders
scrapy genspider douban_spider movie.douban.com    # scrapy genspider <spider_name> <domain>
```

(The commands above show a `douban` example of the general form; the project used in the rest of this post is named `G`, with a spider called `gis`.)
二、The items file
```python
import scrapy


class GItem(scrapy.Item):
    # define the fields for your item here, like:
    name = scrapy.Field()
```
三、The spider file
```python
# File: gis.py
# -*- coding: utf-8 -*-
import scrapy

from G.items import GItem


def e(i):
    """Extract a selector's text and strip surrounding whitespace."""
    return str(i.extract().strip())


class GisSpider(scrapy.Spider):
    name = 'gis'
    allowed_domains = ['sz.5i5j.com']
    start_urls = ['https://sz.5i5j.com/ershoufang/']

    def parse(self, response):
        # Text fragments of each listing's description paragraph
        fr = response.xpath("//ul[@class='pList']//li//p[2]/text()|//p[2]/a[1]//text()")
        strs = ""
        for i in range(len(fr)):
            E = e(fr[i])
            strs = strs + E
            # A fragment whose second-to-last character is 'm' (the area
            # field, e.g. '89.5m²') marks the end of one listing's record
            if len(E) > 2:
                if E[-2] == "m":
                    gitem = GItem()
                    # encode to avoid UnicodeEncodeError under the gbk codec
                    strs = strs.encode("utf-8")
                    gitem['name'] = strs
                    strs = ""
                    yield gitem
        # Follow the "next page" link, if present
        next_link = response.xpath("//a[@class='cPage']/@href").extract()
        if next_link:
            next_link = next_link[0]
            yield scrapy.Request("https://sz.5i5j.com" + next_link, callback=self.parse)
```
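The record-splitting rule in `parse()` can be checked without running Scrapy at all. The sketch below is a pure-Python rewrite of the accumulation loop, assuming (as the original code implies) that each listing's final text fragment is the area field ending in something like `m²`; the sample fragments are hypothetical, not scraped data.

```python
def split_listings(fragments):
    """Accumulate text fragments; emit one record whenever a fragment's
    second-to-last character is 'm' (e.g. '89.5m²', the area field)."""
    records = []
    strs = ""
    for frag in fragments:
        E = frag.strip()
        strs += E
        if len(E) > 2 and E[-2] == "m":
            records.append(strs)
            strs = ""
    return records


fragments = ["3室2厅", "南北通透", "89.5m²", "2室1厅", "精装修", "60.0m²"]
print(split_listings(fragments))
# → ['3室2厅南北通透89.5m²', '2室1厅精装修60.0m²']
```

Note that this boundary test is fragile: any non-area fragment that happens to have `m` as its second-to-last character would also end a record.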
四、Problems encountered while crawling
To fix `UnicodeEncodeError: 'gbk' codec can't encode character '\xbb' in position ...`, encode the string as UTF-8 before storing it in the item:

```python
strs = strs.encode("utf-8")
```
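The error can be reproduced in isolation: `'\xbb'` in the traceback is the character `»` (U+00BB), which the gbk codec has no mapping for, while UTF-8 can represent any Unicode text. The string below is a made-up example for illustration:

```python
# Minimal reproduction of the gbk error and the UTF-8 workaround.
text = "price » 89.5"   # contains '»' (U+00BB), the '\xbb' from the traceback

try:
    text.encode("gbk")          # gbk has no mapping for '»'
except UnicodeEncodeError as err:
    print(err)

data = text.encode("utf-8")     # utf-8 can encode any Unicode text
print(type(data))               # <class 'bytes'>
```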
五、Running and output
```shell
scrapy crawl gis -o gis.csv
```
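The CSV produced by `-o` is UTF-8 by default, which Excel on Windows may display as garbled Chinese. If that happens, Scrapy's `FEED_EXPORT_ENCODING` setting can write a BOM so Excel detects the encoding; this is an optional tweak to `settings.py`, not part of the original post:

```python
# settings.py — optional: write a UTF-8 BOM so Excel on Windows
# detects the encoding of the exported CSV correctly
FEED_EXPORT_ENCODING = "utf-8-sig"
```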
六、Crawling second-hand housing prices in Shanghai and other cities
Scraping second-hand house prices in Suzhou, Shanghai, and other areas (see the linked article).