Python Web Scraping: Capturing Ajax Dynamically Loaded Data (Douban Movies / Tencent Careers)

Ryan_yan1

Article link: https://blog.csdn.net/weixin_44706011/article/details/103549857

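Scraping Dynamically Loaded Data (Ajax)

Characteristics

1. Right-click -> View Page Source shows none of the actual data
2. The data loads only when you scroll the mouse wheel or perform some other action on the page

How to capture it

1. Press F12 to open the developer tools, then trigger the page action and capture the network packets it produces
2. Find the URL of the JSON response
# XHR in the Network panel: packets loaded asynchronously
# XHR -> Query String Parameters: the query parameters sent with the request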

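Case Study: Scraping Douban Movie Data

Goal

1. Page: Douban Movies - Rankings - Drama
2. Target fields: movie title, movie rating

Capturing with F12 (XHR)

1. Request URL (base URL): https://movie.douban.com/j/chart/top_list?
2. Query String (query parameters)
# The captured query parameters:
type: 13
interval_id: 100:90
action: ''
start: 0
limit: number of movies entered by the user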
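To see how the base URL and the captured query parameters fit together, the sketch below assembles the full request URL with the standard library. The helper name build_url and the limit of 20 are illustrative assumptions, not part of the original code; passing params= to requests.get performs the same encoding for you.

```python
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/j/chart/top_list?'

def build_url(limit, start=0):
    """Join the base URL with the query parameters captured in the Network panel."""
    params = {
        'type': '13',
        'interval_id': '100:90',
        'action': '',
        'start': str(start),
        'limit': str(limit),
    }
    # urlencode percent-encodes reserved characters, so ':' becomes '%3A'
    return BASE_URL + urlencode(params)

print(build_url(20))
```

Pasting the printed URL into a browser is a quick way to confirm that the endpoint returns JSON before writing the spider.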

Using the json module

1. json.loads(JSON-formatted string): converts a JSON-formatted string into the corresponding Python data type
# Example
html = json.loads(res.text)
print(type(html))
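The example above needs a live response object, so here is a self-contained sketch of the same call. The JSON string is made-up sample data shaped like a list of movie records; it is not a real response.

```python
import json

# Hypothetical response body: a JSON array of objects
text = '[{"title": "Movie A", "score": "9.1"}, {"title": "Movie B", "score": "8.7"}]'

data = json.loads(text)

print(type(data))        # a JSON array becomes a Python list
print(type(data[0]))     # a JSON object becomes a Python dict
print(data[0]['title'])  # values are then read like any dict
```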

Code implementation

import requests

class DoubanSpider(object):
    def __init__(self):
        self.url = 'https://movie.douban.com/j/chart/top_list?'
        self.headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'}

    # Fetch the page
    def get_page(self, params):
        res = requests.get(
            url=self.url,
            params=params,
            headers=self.headers,
            verify=True
        )
        res.encoding = 'utf-8'
        # res.json() does the same job as json.loads(res.text): JSON text -> Python objects
        html = res.json()
        self.parse_page(html)

    # Parse the data (printed here; replace the print with your own storage step)
    def parse_page(self, html):
        # html is one big list: [{movie 1 info}, {}, {}]
        for h in html:
            # Title
            name = h['title'].strip()
            # Rating
            score = float(h['score'].strip())
            # Print to check the result
            print([name, score])

    # Entry point
    def main(self):
        limit = input('Number of movies to fetch: ')
        params = {
            # Same value as captured in the Query String above
            'type' : '13',
            'interval_id' : '100:90',
            'action' : '',
            'start' : '0',
            'limit' : limit
        }
        # Pass the query parameters to get_page
        self.get_page(params)

if __name__ == '__main__':
    spider = DoubanSpider()
    spider.main()
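The spider above only prints each record. As a minimal sketch of a save step, the block below parses a decoded list and writes it out as CSV. The function names parse_rows and write_csv and the sample records are assumptions made for illustration; an in-memory buffer stands in for a file so the example runs without network access.

```python
import csv
import io

def parse_rows(movies):
    """Turn the decoded JSON list into (name, score) tuples."""
    rows = []
    for movie in movies:
        name = movie['title'].strip()
        # str() guards against the score arriving as a number instead of text
        score = float(str(movie['score']).strip())
        rows.append((name, score))
    return rows

def write_csv(rows, fileobj):
    """Write a header line followed by one line per movie."""
    writer = csv.writer(fileobj)
    writer.writerow(['name', 'score'])
    writer.writerows(rows)

# Hypothetical sample data in the same shape the spider iterates over
sample = [{'title': ' Movie A ', 'score': '9.1'}, {'title': 'Movie B', 'score': '8.7'}]

rows = parse_rows(sample)
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

To write a real file, pass open('douban.csv', 'w', newline='', encoding='utf-8') in place of the buffer; the file name is only an example.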

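Case Study: Tencent Careers

URL and goal

1. URL: search Baidu for "腾讯招聘" (Tencent Careers) and open the job listings page
2. Target fields: job title, responsibilities, requirements

Capturing with F12

First-level (listing) page JSON URL (pageIndex changes; timestamp was not checked)

https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1563912271089&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn

Second-level (detail) page URL (postId changes; it can be read from the first-level response)

https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1563912374645&postId={}&language=zh-cn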
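The two URLs differ only in a few fields, so they can be produced from templates. The sketch below does that, under the assumption that timestamp is a Unix time in milliseconds (the captured values are 13 digits long, which is consistent with that reading). The helper names now_ms, list_url, and detail_url are hypothetical.

```python
import time

LIST_URL = ('https://careers.tencent.com/tencentcareer/api/post/Query?'
            'timestamp={timestamp}&countryId=&cityId=&bgIds=&productId='
            '&categoryId=&parentCategoryId=&attrId=&keyword='
            '&pageIndex={page_index}&pageSize=10&language=zh-cn&area=cn')
DETAIL_URL = ('https://careers.tencent.com/tencentcareer/api/post/ByPostId?'
              'timestamp={timestamp}&postId={post_id}&language=zh-cn')

def now_ms():
    """Current Unix time in milliseconds (assumed format of the timestamp field)."""
    return int(time.time() * 1000)

def list_url(page_index, timestamp=None):
    """URL of one first-level (listing) page."""
    if timestamp is None:
        timestamp = now_ms()
    return LIST_URL.format(timestamp=timestamp, page_index=page_index)

def detail_url(post_id, timestamp=None):
    """URL of the second-level (detail) page for one PostId."""
    if timestamp is None:
        timestamp = now_ms()
    return DETAIL_URL.format(timestamp=timestamp, post_id=post_id)

print(list_url(1))
print(detail_url('123'))  # '123' is a placeholder, not a real PostId
```

Reusing the captured timestamp, as the original URLs do, is the simpler option when it works; generating a fresh value is a fallback worth trying if requests start failing.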

Full implementation

import requests
import json
import time
import random

class TencentSpider(object):
    def __init__(self):
        self.headers = {'User-Agent': 'Mozilla/5.0'}
        self.one_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1563912271089&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn'

    def get_page(self, url):
        res = requests.get(url, headers=self.headers)
        res.encoding = 'utf-8'
        # json.loads() converts the JSON-formatted string into Python data types
        html = json.loads(res.text)
        return html

    def parse_one_page(self, html):
        for job in html['Data']['Posts']:
            # Build a fresh dict per job so records never share state
            job_info = {}
            job_info['job_name'] = job['RecruitPostName']
            job_info['job_address'] = job['LocationName']
            # PostId is needed to build the second-level page URL
            post_id = job['PostId']
            # Responsibilities and requirements live on the second-level page
            two_url = 'https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1563912374645&postId={}&language=zh-cn'.format(post_id)
            # Request and parse the second-level page
            job_info['job_duty'], job_info['job_requirement'] = self.parse_two_page(two_url)
            print(job_info)

    def parse_two_page(self, two_url):
        html = self.get_page(two_url)
        # Responsibilities
        job_duty = html['Data']['Responsibility']
        # Requirements
        job_requirement = html['Data']['Requirement']

        return job_duty, job_requirement

    def main(self):
        # Crawl listing pages 1 through 10
        for index in range(1, 11):
            url = self.one_url.format(index)
            one_html = self.get_page(url)
            self.parse_one_page(one_html)
            # Random pause between listing pages to avoid hammering the server
            time.sleep(random.uniform(0.5, 1.5))

if __name__ == '__main__':
    spider = TencentSpider()
    spider.main()

Git repository: https://github.com/RyanLove1/spider_code

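Supplement:

Capturing packets with the developer tools

Opening the tools and common options

1. Open the browser, press F12 to open the developer tools, and find the Network tab
2. Commonly used panels
   1. Network: captures network packets
        1. All: every network packet
        2. XHR: asynchronously loaded packets
        3. JS: all JavaScript files
   2. Sources: pretty-prints JavaScript and lets you set breakpoints to debug it, which helps when working out how a request parameter is generated
   3. Console: interactive mode for testing JavaScript code
3. After capturing a specific packet
   1. Click the packet in the list on the left to open its details on the right
   2. On the right:
       1. Headers: the full request information
            General, Response Headers, Request Headers, Query String, Form Data
       2. Preview: a preview of the response content
       3. Response: the response content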
Using a regex to format headers and form data (in PyCharm)

1. In PyCharm: press Ctrl + R and enable Regex
2. Convert the pasted headers or form data
  Find:    (.*): (.*)
  Replace: "$1": "$2",
3. Click Replace All
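The same conversion can be done in Python instead of the editor. This is a minimal sketch: the header lines are made-up sample input, and the first group is made non-greedy (.*?), a small deviation from the pattern above, so that a value that itself contains ": " is not split in the wrong place.

```python
import re

# Hypothetical headers as copied from the Headers tab, one "Name: value" per line
raw = """Accept: application/json
User-Agent: Mozilla/5.0
Referer: https://movie.douban.com/"""

# Build a dict directly: each match yields a (name, value) pair
pattern = re.compile(r'^(.*?): (.*)$', re.MULTILINE)
headers = dict(pattern.findall(raw))
print(headers)

# Or reproduce the editor's output: quoted "name": "value", lines
quoted = re.sub(r'(.*?): (.*)', r'"\1": "\2",', raw)
print(quoted)
```

The resulting headers dict can be passed straight to requests.get(url, headers=headers).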
