Target page for this article: https://spa1.scrape.center/
Scraping workflow:
1. Inspect the page:
Check the page source to see whether the data sits in the HTML itself or is loaded through an API.
Right-click and view the page source: none of the page's content data appears in the HTML.
So the page must be calling an API. To find it:
- Press F12 to open DevTools, switch to the Network tab, then filter by Fetch/XHR.
- If nothing is listed once the page has finished loading, that's fine; reload the page and the requests will show up.
- Going through the requests one by one reveals the API data. The first request returns
  Status Code: 301 Moved Permanently
  and redirects to the second request (301 means a redirect).
- After clicking "Next page" on the site, the offset parameter changes to 10.
The scraping approach is now clear.
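Based on the offset behaviour observed above, each page simply shifts offset by the page size. A minimal sketch of how a page number could map to a request URL (the page_url helper is our own, not from the original; limit=10 matches what the site uses):

# Hypothetical helper: build the API URL for a 1-based page number
def page_url(page, limit=10):
    offset = (page - 1) * limit
    return f'https://spa1.scrape.center/api/movie?limit={limit}&offset={offset}'

print(page_url(1))  # ...limit=10&offset=0
print(page_url(2))  # ...limit=10&offset=10, matching what DevTools shows after clicking "Next page"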
2. Fetch the data
Python library: requests
Request method used by the page: GET
Calling the API requires request headers.
The ones that usually matter are Cookies (optional), Host, Referer, and User-Agent,
so we can copy everything from Cookies onward in DevTools into our headers.
Headers formatting tool: http://www.spidertools.cn/#/formatHeader
It turns the copied headers into dictionary form: paste them in and the formatted result is generated automatically.
The code is as follows:
import requests

# Headers copied from DevTools and formatted into a dict
headers = {
    "Host": "spa1.scrape.center",
    "Pragma": "no-cache",
    "Referer": "https://spa1.scrape.center/",
    "sec-ch-ua": "\" Not A;Brand\";v=\"99\", \"Chromium\";v=\"100\", \"Google Chrome\";v=\"100\"",
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "\"Windows\"",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "same-origin",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36"
}

# Request the first page (limit=10, offset=0) and print the raw response body
source = requests.get('https://spa1.scrape.center/api/movie?limit=10&offset=0', headers=headers).text
print(source)
Result:
{"count":100,"results":[{"id":1,"name":"霸王别姬","alias":"Farewell My Concubine","cover":"https://p0.meituan.net/movie/ce4da3e03e655b5b88ed31b5cd7896cf62472.jpg@464w_644h_1e_1c","categories":["剧情","爱情"],"published_at":"1993-07-26","minute":171,"score":9.5,"regions":["中国内地","中国香港"]},{"id":2,"name":"这个杀手不太冷","alias":"Léon","cover":"https://p1.meituan.net/movie/6bea9af4524dfbd0b668eaa7e187c3df767253.jpg@464w_644h_1e_1c","categories":["剧情","动作","犯罪"],"published_at":"1994-09-14","minute":110,"score":9.5,"regions":["法国"]},{"id":3,"name":"肖申克的救赎","alias":"The Shawshank Redemption","cover":"https://p0.meituan.net/movie/283292171619cdfd5b240c8fd093f1eb255670.jpg@464w_644h_1e_1c","categories":["剧情","犯罪"],"published_at":"1994-09-10","minute":142,"score":9.5,"regions":["美国"]},{"id":4,"name":"泰坦尼克号","alias":"Titanic","cover":"https://p1.meituan.net/movie/b607fba7513e7f15eab170aac1e1400d878112.jpg@464w_644h_1e_1c","categories":["剧情","爱情","灾难"],"published_at":"1998-04-03","minute":194,"score":9.5,"regions":["美国"]},{"id":5,"name":"罗马假日","alias":"Roman Holiday","cover":"https://p0.meituan.net/movie/289f98ceaa8a0ae737d3dc01cd05ab052213631.jpg@464w_644h_1e_1c","categories":["剧情","喜剧","爱情"],"published_at":"1953-08-20","minute":118,"score":9.5,"regions":["美国"]},{"id":6,"name":"唐伯虎点秋香","alias":"Flirting Scholar","cover":"https://p0.meituan.net/movie/da64660f82b98cdc1b8a3804e69609e041108.jpg@464w_644h_1e_1c","categories":["喜剧","爱情","古装"],"published_at":"1993-07-01","minute":102,"score":9.5,"regions":["中国香港"]},{"id":7,"name":"乱世佳人","alias":"Gone with the Wind","cover":"https://p0.meituan.net/movie/223c3e186db3ab4ea3bb14508c709400427933.jpg@464w_644h_1e_1c","categories":["剧情","爱情","历史","战争"],"published_at":"1939-12-15","minute":238,"score":9.5,"regions":["美国"]},{"id":8,"name":"喜剧之王","alias":"The King of Comedy","cover":"https://p0.meituan.net/movie/1f0d671f6a37f9d7b015e4682b8b113e174332.jpg@464w_644h_1e_1c","categories":["剧情","喜剧","爱情"],"published_at":"1999-02-13","minute":85,"score":9.5,"regions":["中国香港"]},{"id":9,"name":"楚门的世界","alias":"The Truman Show","cover":"https://p0.meituan.net/movie/8959888ee0c399b0fe53a714bc8a5a17460048.jpg@464w_644h_1e_1c","categories":["剧情","科幻"],"published_at":null,"minute":103,"score":9.0,"regions":["美国"]},{"id":10,"name":"狮子王","alias":"The Lion King","cover":"https://p0.meituan.net/movie/27b76fe6cf3903f3d74963f70786001e1438406.jpg@464w_644h_1e_1c","categories":["动画","歌舞","冒险"],"published_at":"1995-07-15","minute":89,"score":9.0,"regions":["美国"]}]}
The result is JSON, so we parse it as JSON data.
How to recognize JSON data:
1. A JSON object is written inside curly braces { }.
2. An object can hold multiple key:value pairs.
3. A JSON array is written inside square brackets [ ].
4. JSON arrays and JSON objects can be nested inside each other.
——————————————————
An example:
——————
myObj = {
    "name": "网站",
    "num": 3,
    "sites": [
        { "name": "Google", "info": [ "Android", "Google 搜索", "Google 翻译" ] },
        { "name": "Runoob", "info": [ "菜鸟教程", "菜鸟工具", "菜鸟微信" ] },
        { "name": "Taobao", "info": [ "淘宝", "网购" ] }
    ]
}
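To make those rules concrete, here is a small sketch (standard-library json only; the raw string and variable names are ours) that parses an object shaped like the example above and walks its nested array:

import json

# A trimmed-down version of the example object, as a JSON string
raw = '{"name": "网站", "num": 3, "sites": [{"name": "Google", "info": ["Android"]}]}'

obj = json.loads(raw)           # JSON object -> Python dict
print(obj['name'], obj['num'])  # top-level key:value pairs
for site in obj['sites']:       # JSON array -> Python list, nested inside the object
    print(site['name'], site['info'])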
Now we change the .text in the code to .json():
source = requests.get('https://spa1.scrape.center/api/movie?limit=10&offset=0', headers=headers).json()
This gives us a dictionary whose data matches what the page shows; iterating over the results key is enough to pull out the key data.
The demo.py code is as follows:
import requests

# Headers copied from DevTools and formatted into a dict
headers = {
    "Host": "spa1.scrape.center",
    "Pragma": "no-cache",
    "Referer": "https://spa1.scrape.center/",
    "sec-ch-ua": "\" Not A;Brand\";v=\"99\", \"Chromium\";v=\"100\", \"Google Chrome\";v=\"100\"",
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "\"Windows\"",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "same-origin",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36"
}

# Parse the response as JSON and print the key fields of each movie
source = requests.get('https://spa1.scrape.center/api/movie?limit=10&offset=0', headers=headers).json()
for i in source['results']:
    name = i['name']
    categories = i['categories']
    score = i['score']
    print(name, categories, score)
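Since the first response reports count: 100 and each request returns 10 items, the full list can be fetched by stepping offset the same way the site does. A minimal sketch reusing the headers dict above (the one-second pause is our own addition):

import time

for offset in range(0, 100, 10):  # count is 100, 10 movies per request
    url = f'https://spa1.scrape.center/api/movie?limit=10&offset={offset}'
    data = requests.get(url, headers=headers).json()
    for movie in data['results']:
        print(movie['name'], movie['categories'], movie['score'])
    time.sleep(1)  # small pause between requests to stay polite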
With the scraping approach worked out, the next step is to move over to Scrapy for the crawl.
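As a preview of that step, here is a minimal sketch of what the same API crawl might look like as a Scrapy spider (spider name and item fields are our own assumptions; response.json() requires Scrapy 2.2+):

import scrapy

class MovieSpider(scrapy.Spider):
    name = 'movie'  # hypothetical spider name
    # One start URL per page, stepping offset exactly as above
    start_urls = [
        f'https://spa1.scrape.center/api/movie?limit=10&offset={o}' for o in range(0, 100, 10)
    ]

    def parse(self, response):
        for movie in response.json()['results']:
            yield {
                'name': movie['name'],
                'categories': movie['categories'],
                'score': movie['score'],
            }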