1. Case study: Tencent Careers job postings and their corresponding responsibilities
1.1 Page analysis:
- Static or dynamic? → dynamic (Ajax)
- Pagination in Scrapy: 1) list out a few URLs and fill them in with .format(), or 2) find the next-page URL directly on the page; either way, finish with yield scrapy.Request(url, callback=...)
- Implementation (the spider, items.py, pipelines.py)
Note: when you need to click through to another URL, first check DevTools for a ready-made request and look for a pattern in the URLs.
Places to check: DevTools, the Response and Preview tabs, the page source, and any JSON responses.
1.2 Details:
1.2.1 Pagination:
Job-list URLs:
https://careers.tencent.com/search.html?index=1   (page 1)
https://careers.tencent.com/search.html?index=2   (page 2)
https://careers.tencent.com/search.html?index=3   (page 3)
Detail URL for each job's responsibilities (some of this data may also appear in the list pages, but probably not all of it):
https://careers.tencent.com/jobdesc.html?postId=1357242769973714944
https://careers.tencent.com/jobdesc.html?postId=1357242775422115840
[only the postId differs]
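Before writing the spider, it is worth confirming in a console that the Ajax endpoint found in DevTools really returns the job list as JSON. A quick sketch outside Scrapy (the API URL and the Data/Posts keys are taken from the spider code in 1.2.2 below):

import requests

# job-list API captured from DevTools (Network -> XHR); pageIndex drives pagination
api = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1622635924738&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn'

data = requests.get(api.format(1)).json()
for post in data['Data']['Posts']:
    # each post carries the job name and the PostId used by the detail API
    print(post['RecruitPostName'], post['PostId'])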
1.2.2 Code:
import scrapy
import json
from recruit.items import RecruitItem


class PositionSpider(scrapy.Spider):
    name = 'position'
    allowed_domains = ['tencent.com']
    # JSON API behind the job-list page; pageIndex is filled in with .format()
    first_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1622635924738&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn'
    # JSON API behind the detail page; postId is filled in with .format()
    second_url = 'https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1622637678428&postId={}&language=zh-cn'
    start_urls = [first_url.format(1)]

    def parse(self, response):
        # The response body is JSON, so parse it with json.loads instead of re/xpath/bs4
        data = json.loads(response.text)
        for job in data['Data']['Posts']:
            item = RecruitItem()
            item['job_name'] = job['RecruitPostName']
            post_id = job['PostId']
            # Build the detail-page URL for this job
            detail_url = self.second_url.format(post_id)
            # Request the detail page to get the job responsibilities
            yield scrapy.Request(
                url=detail_url,
                callback=self.detail_content,
                meta={'item': item}
            )
        # Pagination: range(2, 3) only adds page 2 here; widen the range for more pages
        for page in range(2, 3):
            url = self.first_url.format(page)
            yield scrapy.Request(url=url)  # callback defaults to self.parse

    def detail_content(self, response):
        # Retrieve the partially filled item passed along via meta
        item = response.meta.get('item')   # equivalent to response.meta['item']
        data = json.loads(response.text)
        item['job_duty'] = data['Data']['Responsibility']
        print(item)
        yield item  # yield the finished item so the pipelines receive it
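The spider imports RecruitItem from recruit.items, which is not shown in these notes; a minimal items.py consistent with the two fields used above (an assumption, not the original file) could be:

import scrapy

class RecruitItem(scrapy.Item):
    job_name = scrapy.Field()   # RecruitPostName from the list API
    job_duty = scrapy.Field()   # Responsibility from the detail API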
Summary:
Which approach should you use to crawl data from now on? Get the functionality working first, then optimize the program.
Choose according to which techniques you are most comfortable with.
Supplement:
An extension of last lesson's classical-poetry (gushiwen) spider: additionally crawl each poem's translation and notes.
Code (poem.py):
import scrapy
import re
import requests
from lxml import etree
from poems.items import PoemsItem


class PoemSpider(scrapy.Spider):
    name = 'poem'
    allowed_domains = ['gushiwen.org', 'gushiwen.cn']
    start_urls = ['https://www.gushiwen.org/default_1.aspx']

    def parse(self, response):
        gsw_divs = response.xpath('//div[@class="sons"]/div[@class="cont"]')
        # Skip non-poem blocks (only entries with a "yizhu" div are real poems)
        # and collect the link to each poem's detail page
        if gsw_divs:
            for gsw_div in gsw_divs:
                if gsw_div.xpath('./div[@class="yizhu"]'):
                    href = gsw_div.xpath('./p/a/@href').get()
                    href_url = response.urljoin(href)
                    yield scrapy.Request(url=href_url, callback=self.parse_1)
        # Pagination:
        # next_href = response.xpath('//a[@id="amore"]/@href').get()
        # if next_href:
        #     next_url = response.urljoin(next_href)  # urljoin() completes a relative URL
        #     yield scrapy.Request(
        #         url=next_url,
        #         callback=self.parse  # can be omitted, since parse is the default callback
        #     )

    # Here the response is the poem's detail page
    def parse_1(self, response):
        # Extract the four basic fields: title, author, dynasty, content
        html_text = response.xpath('//div[@id="sonsyuanwen"][1]/div[@class="cont"]')
        title = html_text.xpath('./h1/text()').get()
        author = html_text.xpath('./p[@class="source"]/a[1]/text()').get()
        dynasty = html_text.xpath('./p[@class="source"]/a[2]/text()').get()
        content_list = html_text.xpath('./div[@class="contson"]//text()').getall()
        content = ''.join(content_list).strip()
        # Look for the "展开阅读全文" (expand full text) link: if it exists, fetch the
        # translation and notes from the Ajax page; otherwise extract them from this page
        pd = response.xpath('//a[@style="text-decoration:none;"]/@href').get()
        # alternative xpath: '//div[@style="text-align:center; margin-top:-5px;"]/a/@href'
        if pd:
            # The href looks like javascript:...,'<id>'; pull out the id
            ID = re.match(r"javascript:.*?,'(.*?)'", str(pd)).group(1)
            base_url = 'https://so.gushiwen.cn/nocdn/ajaxfanyi.aspx?id={}'
            parse_html = etree.HTML(requests.get(base_url.format(ID)).text)
            html_fanyi = parse_html.xpath('//div[@class="contyishang"]//text()')
        else:
            html_fanyi = response.xpath(
                '//div[@class="left"]/div[@class="sons"][2]/div[@class="contyishang"]//text()').getall()
        # Join the text, then split it into translation and notes on the '|' marker
        html_fanyi = ''.join(html_fanyi).replace('\n', '').replace('译文及注释', '').replace('译文', '').replace('注释', '|')
        if html_fanyi:
            fanyi = html_fanyi.split('|')
            translation = fanyi[0]
            notes = fanyi[1] if len(fanyi) > 1 else ''
        else:
            translation = ''
            notes = ''
        item = PoemsItem(title=title,
                         author=author,
                         dynasty=dynasty,
                         content=content,
                         translation=translation,
                         notes=notes)
        yield item
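As with the previous case, the PoemsItem class imported from poems.items is not shown; a minimal version matching the six fields used above (an assumption, not the original file) could be:

import scrapy

class PoemsItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    dynasty = scrapy.Field()
    content = scrapy.Field()
    translation = scrapy.Field()
    notes = scrapy.Field()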
Summary:
- Make sure the URL is correct. Scrapy output is normally not garbled; if it is, add FEED_EXPORT_ENCODING = 'utf-8' to settings.py (see the snippet after this list).
- If you keep getting empty lists, the likely causes are: a wrong XPath (i.e. the extraction itself), incorrect parsing, anti-crawling measures, or a required login.
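For reference, the encoding fix is a single line in settings.py:

# settings.py
FEED_EXPORT_ENCODING = 'utf-8'   # keep exported feeds (JSON/CSV) readable as UTF-8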
2. scrapy shell
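scrapy shell lets you test selectors interactively before putting them into a spider. A minimal sketch, reusing the poem list URL and the XPath from the spider above:

$ scrapy shell 'https://www.gushiwen.org/default_1.aspx'
# inside the shell, a response object for that URL is already available:
>>> response.status
>>> response.xpath('//div[@class="sons"]/div[@class="cont"]')
>>> fetch('https://careers.tencent.com/search.html?index=1')   # load another URL in the same session
>>> response.url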
3. More on the settings.py file
settings.py holds the project's configuration items; variables written in UPPERCASE denote shared, public configuration.
It is the bridge between external resources and Scrapy; when you are unsure where to put some simple logic or constants, settings.py is a reasonable place.
Example:
Reference MYSQL_HOST = '127.0.0.1' from settings.py in both the spider file and the pipeline.
1. In the spider file:
First way:
from mySpider.settings import MYSQL_HOST
item['db_host'] = MYSQL_HOST
Second way:
item['db_host'] = self.settings.get('MYSQL_HOST')
2. In pipelines.py:
The first way is the same as above.
Second way:
item['pip_host'] = spider.settings.get('MYSQL_HOST')
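Putting the patterns above together, a minimal sketch (the project name mySpider, MYSQL_HOST, and the db_host / pip_host keys follow the notes; the spider and pipeline class names are illustrative):

# settings.py
MYSQL_HOST = '127.0.0.1'

# spider file (e.g. spiders/example.py)
import scrapy
from mySpider.settings import MYSQL_HOST        # first way: import the constant directly

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com/']

    def parse(self, response):
        item = {}
        item['db_host'] = MYSQL_HOST                        # first way
        item['db_host'] = self.settings.get('MYSQL_HOST')   # second way: via the crawler settings
        yield item

# pipelines.py
class ExamplePipeline:
    def process_item(self, item, spider):
        # second way in a pipeline: the spider object carries the settings
        item['pip_host'] = spider.settings.get('MYSQL_HOST')
        return item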