Target page for this article: https://spa1.scrape.center/
Scraping workflow:
1. Inspect the page:
Check the page source to see whether the data sits in the HTML itself or is loaded through an API.
Right-click and view the page source: none of the page's content data appears in the HTML.
So the page must be calling an API. To find it:
- Press F12 to open DevTools, switch to the Network tab, then filter by Fetch/XHR.
- If nothing is listed once the page has finished loading, that's fine; reload the page and the requests will show up.
- Going through the requests one by one reveals the API data. The first request returns
  Status Code: 301 Moved Permanently
  and redirects to the second request (301 means a redirect).
- After clicking "Next page" on the site, the offset parameter changes to 10.
The scraping approach is now clear.
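Based on the offset behaviour observed above, each page simply shifts offset by the page size. A minimal sketch of how a page number could map to a request URL (the page_url helper is our own, not from the original; limit=10 matches what the site uses):

# Hypothetical helper: build the API URL for a 1-based page number
def page_url(page, limit=10):
    offset = (page - 1) * limit
    return f'https://spa1.scrape.center/api/movie?limit={limit}&offset={offset}'

print(page_url(1))  # ...limit=10&offset=0
print(page_url(2))  # ...limit=10&offset=10, matching what DevTools shows after clicking "Next page"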
2. Fetch the data
Python library: requests
Request method used by the page: GET
Calling the API requires request headers.
The ones that usually matter are Cookies (optional), Host, Referer, and User-Agent,
so we can copy everything from Cookies onward in DevTools into our headers.
Headers formatting tool: http://www.spidertools.cn/#/formatHeader
It turns the copied headers into dictionary form: paste them in and the formatted result is generated automatically.
The code is as follows:
import requests

# Headers copied from DevTools and formatted into a dict
headers = {
    "Host": "spa1.scrape.center",
    "Pragma": "no-cache",
    "Referer": "https://spa1.scrape.center/",
    "sec-ch-ua": "\" Not A;Brand\";v=\"99\", \"Chromium\";v=\"100\", \"Google Chrome\";v=\"100\"",
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "\"Windows\"",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "same-origin",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36"
}

# Request the first page (limit=10, offset=0) and print the raw response body
source = requests.get('https://spa1.scrape.center/api/movie?limit=10&offset=0', headers=headers).text
print(source)
Result:
{"count":100,"results":[{"id":1,"name":"霸王别姬","alias":"Farewell My Concubine","cover":"https://p0.meituan.net/movie/ce4da3e03e655b5b88ed31b5cd7896cf62472.jpg@464w_644h_1e_1c","categories":["剧情","爱情"],"published_at":"1993-07-26","minute":171,"score":9.5,"regions":["中国内地","中国香港"]},{"id":2,"name":"这个杀手不太冷","alias":"Léon","cover":"https://p1.meituan.net/movie/6bea9af4524dfbd0b668eaa7e187c3df767253.jpg@464w_644h_1e_1c","categories":["剧情","动作","犯罪"],"published_at":"1994-09-14","minute":110,"score":9.5,"regions":["法国"]},{"id":3,"name":"肖申克的救赎","alias":"The Shawshank Redemption","cover":"https://p0.meituan.net/movie/283292171619cdfd5b240c8fd093f1eb255670.jpg@464w_644h_1e_1c","categories":["剧情","犯罪"],"published_at":"1994-09-10","minute":142,"score":9.5,"regions":["美国"]},{"id":4,"name":"泰坦尼克号","alias":"Titanic","cover":"https://p1.meituan.net/movie/b607fba7513e7f15eab170aac1e1400d878112.jpg@464w_644h_1e_1c","categories":["剧情","爱情","灾难"],"published_at":"1998-04-03","minute":194,"score":9.5,"regions":["美国"]},{"id":5,"name":"罗马假日","alias":"Roman Holiday","cover":"https://p0.meituan.net/movie/289f98ceaa8a0ae737d3dc01cd05ab052213631.jpg@464w_644h_1e_1c","categories":["剧情","喜剧","爱情"],"published_at":"1953-08-20","minute":118,"score":9.5,"regions":["美国"]},{"id":6,"name":"唐伯虎点秋香","alias":"Flirting Scholar","cover":"https://p0.meituan.net/movie/da64660f82b98cdc1b8a3804e69609e041108.jpg@464w_644h_1e_1c","categories":["喜剧","爱情","古装"],"published_at":"1993-07-01","minute":102,"score":9.5,"regions":["中国香港"]},{"id":7,"name":"乱世佳人","alias":"Gone with the Wind","cover":"https://p0.meituan.net/movie/223c3e186db3ab4ea3bb14508c709400427933.jpg@464w_644h_1e_1c","categories":["剧情","爱情","历史","战争"],"published_at":"1939-12-15","minute":238,"score":9.5,"regions":["美国"]},{"id":8,"name":"喜剧之王","alias":"The King of Comedy","cover":"https://p0.meituan.net/movie/1f0d671f6a37f9d7b015e4682b8b113e174332.jpg@464w_644h_1e_1c","categories":["剧情","喜剧","爱情"],"published_at":"1999-02-13","minute":85,"score":9.5,"regions":["中国香港"]},{"id":9,"name":"楚门的世界","alias":"The Truman Show","cover":"https://p0.meituan.net/movie/8959888ee0c399b0fe53a714bc8a5a17460048.jpg@464w_644h_1e_1c","categories":["剧情","科幻"],"published_at":null,"minute":103,"score":9.0,"regions":["美国"]},{"id":10,"name":"狮子王","alias":"The Lion King","cover":"https://p0.meituan.net/movie/27b76fe6cf3903f3d74963f70786001e1438406.jpg@464w_644h_1e_1c","categories":["动画","歌舞","冒险"],"published_at":"1995-07-15","minute":89,"score":9.0,"regions":["美国"]}]}
The result is JSON, so we parse it as JSON data.
How to recognize JSON data:
1. A JSON object is written inside curly braces { }.
2. An object can hold multiple key:value pairs.
3. A JSON array is written inside square brackets [ ].
4. JSON arrays and JSON objects can be nested inside each other.
——————————————————
An example:
——————
myObj = {
    "name": "网站",
    "num": 3,
    "sites": [
        { "name": "Google", "info": [ "Android", "Google 搜索", "Google 翻译" ] },
        { "name": "Runoob", "info": [ "菜鸟教程", "菜鸟工具", "菜鸟微信" ] },
        { "name": "Taobao", "info": [ "淘宝", "网购" ] }
    ]
}
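To make those rules concrete, here is a small sketch (standard-library json only; the raw string and variable names are ours) that parses an object shaped like the example above and walks its nested array:

import json

# A trimmed-down version of the example object, as a JSON string
raw = '{"name": "网站", "num": 3, "sites": [{"name": "Google", "info": ["Android"]}]}'

obj = json.loads(raw)           # JSON object -> Python dict
print(obj['name'], obj['num'])  # top-level key:value pairs
for site in obj['sites']:       # JSON array -> Python list, nested inside the object
    print(site['name'], site['info'])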
Now we change the .text in the code to .json():
source = requests.get('https://spa1.scrape.center/api/movie?limit=10&offset=0', headers=headers).json()
This gives us a dictionary whose data matches what the page shows; iterating over the results key is enough to pull out the key data.
The demo.py code is as follows:
import requests

# Headers copied from DevTools and formatted into a dict
headers = {
    "Host": "spa1.scrape.center",
    "Pragma": "no-cache",
    "Referer": "https://spa1.scrape.center/",
    "sec-ch-ua": "\" Not A;Brand\";v=\"99\", \"Chromium\";v=\"100\", \"Google Chrome\";v=\"100\"",
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "\"Windows\"",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "same-origin",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36"
}

# Parse the response as JSON and print the key fields of each movie
source = requests.get('https://spa1.scrape.center/api/movie?limit=10&offset=0', headers=headers).json()
for i in source['results']:
    name = i['name']
    categories = i['categories']
    score = i['score']
    print(name, categories, score)
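Since the first response reports count: 100 and each request returns 10 items, the full list can be fetched by stepping offset the same way the site does. A minimal sketch reusing the headers dict above (the one-second pause is our own addition):

import time

for offset in range(0, 100, 10):  # count is 100, 10 movies per request
    url = f'https://spa1.scrape.center/api/movie?limit=10&offset={offset}'
    data = requests.get(url, headers=headers).json()
    for movie in data['results']:
        print(movie['name'], movie['categories'], movie['score'])
    time.sleep(1)  # small pause between requests to stay polite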
With the scraping approach worked out, the next step is to move over to Scrapy for the crawl.
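As a preview of that step, here is a minimal sketch of what the same API crawl might look like as a Scrapy spider (spider name and item fields are our own assumptions; response.json() requires Scrapy 2.2+):

import scrapy

class MovieSpider(scrapy.Spider):
    name = 'movie'  # hypothetical spider name
    # One start URL per page, stepping offset exactly as above
    start_urls = [
        f'https://spa1.scrape.center/api/movie?limit=10&offset={o}' for o in range(0, 100, 10)
    ]

    def parse(self, response):
        for movie in response.json()['results']:
            yield {
                'name': movie['name'],
                'categories': movie['categories'],
                'score': movie['score'],
            }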