Create the project
scrapy startproject tencent
cd tencent  # enter the project directory
scrapy genspider hr tencent.com  # hr is the name of the spider file; tencent.com is the domain the spider is allowed to crawl
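- Set the initial URL.
- Open the page source and use XPath to locate the information you need.
- Use text() to get a tag's text and @ to get an attribute value. For example, to get the link address of an a tag:
response.xpath("//a[@id='next']/@href").extract_first()
- Append extract_first() to each XPath expression so that you get the first match as a string.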
import scrapy


class HrSpider(scrapy.Spider):
    name = 'hr'
    allowed_domains = ['tencent.com']
    start_urls = ['https://hr.tencent.com/position.php']

    def parse(self, response):
        # [1:-1] skips the first and last rows of the table
        tr_list = response.xpath("//table[@class='tablelist']/tr")[1:-1]
        for tr in tr_list:
            item = {}
            item["title"] = tr