1. Preliminary setup
1.1 Create the Scrapy project
Run scrapy startproject tengxun (tengxun is the project name); after it succeeds, the following appears:
1.2 Create tencent.py under the spiders directory
Run scrapy genspider tencent tencent.com; after it succeeds, the file looks like this:
2. With the setup done, write the code in items.py, pipelines.py, tencent.py, and so on.
2.1 Code in items.py:
import scrapy

class TengxunItem(scrapy.Item):
    # Fields for one job posting.
    position_name = scrapy.Field()     # job title
    position_list = scrapy.Field()     # link to the detail page
    positionType = scrapy.Field()      # job category
    positionNumber = scrapy.Field()    # number of openings
    positionAddress = scrapy.Field()   # work location
2.2 Code in pipelines.py:
import json

class TengxunPipeline(object):
    def __init__(self):
        # The output is one JSON object per line (JSON Lines), not CSV,
        # so name the file accordingly; binary mode because we write bytes.
        self.filename = open("tencents.json", "wb")

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable instead of \u-escaped.
        text = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.filename.write(text.encode("utf-8"))
        return item

    def close_spider(self, spider):
        self.filename.close()
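To see exactly what the pipeline writes, here is a minimal standard-library sketch of the same serialization step, using a made-up item dict (field names from items.py; the values are invented for illustration):

```python
import json

# A sample item as a plain dict (values are made up for illustration).
item = {
    "position_name": "后台开发工程师",
    "position_list": "position_detail.php?id=1",
    "positionType": "技术类",
    "positionNumber": "2",
    "positionAddress": "深圳",
}

# Same serialization the pipeline performs: one JSON object per line,
# with ensure_ascii=False so Chinese text is written as-is.
line = json.dumps(item, ensure_ascii=False) + "\n"
print(line, end="")
```

Each crawled item becomes one such line in the output file, which is easy to load back with json.loads per line.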
2.3 Code in tencent.py:
import scrapy
from tengxun.items import TengxunItem

class TencentSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['tencent.com']
    url = 'https://hr.tencent.com/position.php?&start='
    offset = 0
    start_urls = [url + str(offset)]

    def parse(self, response):
        # Each job posting is a table row with class "even" or "odd".
        job_list = response.xpath('//tr[@class="even"]|//tr[@class="odd"]')
        for each in job_list:
            item = TengxunItem()
            item['position_name'] = each.xpath('./td[1]/a/text()').extract_first()
            item['position_list'] = each.xpath('./td[1]/a/@href').extract_first()
            item['positionType'] = each.xpath('./td[2]/text()').extract_first()
            item['positionNumber'] = each.xpath('./td[3]/text()').extract_first()
            item['positionAddress'] = each.xpath('./td[4]/text()').extract_first()
            yield item
        # Follow the next page until offset 1680. Note: the original
        # raise "结束工作" raises a string, which is a TypeError in Python 3;
        # simply stop yielding requests instead.
        if self.offset < 1680:
            self.offset += 10
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
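The row-and-cell XPath logic in parse() can be tried offline with the standard library's ElementTree, which supports the same positional and attribute predicates (Scrapy itself uses parsel; the HTML fragment below is made up to mirror the hr.tencent.com table layout):

```python
import xml.etree.ElementTree as ET

# Made-up fragment mirroring the job table: alternating even/odd rows.
html = (
    '<table>'
    '<tr class="even"><td><a href="position_detail.php?id=1">后台开发</a></td>'
    '<td>技术类</td><td>2</td><td>深圳</td></tr>'
    '<tr class="odd"><td><a href="position_detail.php?id=2">产品策划</a></td>'
    '<td>产品类</td><td>1</td><td>北京</td></tr>'
    '</table>'
)

root = ET.fromstring(html)
# Same selection idea as //tr[@class="even"]|//tr[@class="odd"].
rows = root.findall('.//tr[@class="even"]') + root.findall('.//tr[@class="odd"]')
items = []
for tr in rows:
    link = tr.find('./td[1]/a')
    items.append({
        'position_name': link.text,                # ./td[1]/a/text()
        'position_list': link.get('href'),         # ./td[1]/a/@href
        'positionType': tr.find('./td[2]').text,   # ./td[2]/text()
        'positionNumber': tr.find('./td[3]').text, # ./td[3]/text()
        'positionAddress': tr.find('./td[4]').text # ./td[4]/text()
    })
print(items)
```

This is only a sketch for checking the expressions; in the spider, response.xpath() returns selectors and extract_first() pulls the first match (returning None when a cell is missing).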
Modify settings.py to suit your needs.
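At a minimum, the pipeline must be registered in settings.py or it will never run. A sketch of the relevant fragment (the priority 300 is the conventional default; the extra throttling settings are optional suggestions, not from the original):

```python
# settings.py -- register the item pipeline (class path matches pipelines.py).
ITEM_PIPELINES = {
    'tengxun.pipelines.TengxunPipeline': 300,
}

# Optional: be polite to the target site (assumed settings, adjust as needed).
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1
```

Then start the crawl from the project root with scrapy crawl tencent.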
Running the spider produces: