Scrapy Crawling, Part 1 — Preparing to Scrape API Data

Target page for this article: https://spa1.scrape.center/

Crawling workflow:

1. Inspect the page:

Check the page source to determine whether the data lives in the HTML itself or is loaded through an API.
Right-click and view the page source: none of the page's content data appears there.
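A quick way to confirm this programmatically is to fetch the raw HTML and search it for a title you can see in the browser. A minimal sketch (霸王别姬 is simply the first movie visible on the rendered page):

import requests

# Fetch the raw HTML the server returns, before any JavaScript runs
html = requests.get('https://spa1.scrape.center/').text

# A title that is visible in the browser will be missing from the raw HTML
# if the page loads its data through an API instead of embedding it.
print('霸王别姬' in html)  # expected: False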
So the page must be calling an API. The steps to locate it are as follows:

  • Press F12 to open DevTools, click the Network tab, then filter by Fetch/XHR.
  • If the page has already finished loading, nothing shows up at first; that's fine, reload the page and the requests appear.
  • Inspecting the requests one by one reveals the API data. The first request returns Status Code: 301 Moved Permanently, a redirect, which leads to the second request that actually carries the data.
  • After clicking Next Page on the site, the offset parameter changes to 10 (see the sketch after this list).
    The crawling approach is now clear.
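Since limit stays at 10 and offset grows by 10 for each page, the API URL for any page can be constructed directly. A small sketch (the page_to_url helper is just for illustration):

BASE_API = 'https://spa1.scrape.center/api/movie'

def page_to_url(page, limit=10):
    # Build the API URL for a 1-based page number: page 1 -> offset 0, page 2 -> offset 10, ...
    offset = (page - 1) * limit
    return f'{BASE_API}?limit={limit}&offset={offset}'

print(page_to_url(1))  # https://spa1.scrape.center/api/movie?limit=10&offset=0
print(page_to_url(2))  # https://spa1.scrape.center/api/movie?limit=10&offset=10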

2. Scraping the data

Python library: requests
Request method used by the page (shown in DevTools): GET
Calling the API requires request headers.
The ones that usually matter are Cookies (often optional), Host, Referer and User-Agent,
so we can copy everything from Cookie downward in the request headers panel into our headers dict.

Header formatting tool: http://www.spidertools.cn/#/formatHeader
It turns the headers we copied into a properly formatted dict.
Usage: paste the raw headers and the formatted dict is generated automatically.
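If you would rather not depend on that website, the same conversion is a few lines of Python. A minimal sketch, assuming the copied headers are plain "Name: value" lines:

def format_headers(raw):
    # Turn copied "Name: value" lines into a dict usable by requests
    headers = {}
    for line in raw.strip().splitlines():
        name, _, value = line.partition(':')
        headers[name.strip()] = value.strip()
    return headers

raw_headers = '''Host: spa1.scrape.center
Referer: https://spa1.scrape.center/
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36'''

print(format_headers(raw_headers))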
The code is as follows:

import requests

headers = {
    "Host": "spa1.scrape.center",
    "Pragma": "no-cache",
    "Referer": "https://spa1.scrape.center/",
    "sec-ch-ua": "\" Not A;Brand\";v=\"99\", \"Chromium\";v=\"100\", \"Google Chrome\";v=\"100\"",
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "\"Windows\"",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "same-origin",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36"
}

source = requests.get('https://spa1.scrape.center/api/movie?limit=10&offset=0', headers=headers).text
print(source)

Result:

{"count":100,"results":[{"id":1,"name":"霸王别姬","alias":"Farewell My Concubine","cover":"https://p0.meituan.net/movie/ce4da3e03e655b5b88ed31b5cd7896cf62472.jpg@464w_644h_1e_1c","categories":["剧情","爱情"],"published_at":"1993-07-26","minute":171,"score":9.5,"regions":["中国内地","中国香港"]},{"id":2,"name":"这个杀手不太冷","alias":"Léon","cover":"https://p1.meituan.net/movie/6bea9af4524dfbd0b668eaa7e187c3df767253.jpg@464w_644h_1e_1c","categories":["剧情","动作","犯罪"],"published_at":"1994-09-14","minute":110,"score":9.5,"regions":["法国"]},{"id":3,"name":"肖申克的救赎","alias":"The Shawshank Redemption","cover":"https://p0.meituan.net/movie/283292171619cdfd5b240c8fd093f1eb255670.jpg@464w_644h_1e_1c","categories":["剧情","犯罪"],"published_at":"1994-09-10","minute":142,"score":9.5,"regions":["美国"]},{"id":4,"name":"泰坦尼克号","alias":"Titanic","cover":"https://p1.meituan.net/movie/b607fba7513e7f15eab170aac1e1400d878112.jpg@464w_644h_1e_1c","categories":["剧情","爱情","灾难"],"published_at":"1998-04-03","minute":194,"score":9.5,"regions":["美国"]},{"id":5,"name":"罗马假日","alias":"Roman Holiday","cover":"https://p0.meituan.net/movie/289f98ceaa8a0ae737d3dc01cd05ab052213631.jpg@464w_644h_1e_1c","categories":["剧情","喜剧","爱情"],"published_at":"1953-08-20","minute":118,"score":9.5,"regions":["美国"]},{"id":6,"name":"唐伯虎点秋香","alias":"Flirting Scholar","cover":"https://p0.meituan.net/movie/da64660f82b98cdc1b8a3804e69609e041108.jpg@464w_644h_1e_1c","categories":["喜剧","爱情","古装"],"published_at":"1993-07-01","minute":102,"score":9.5,"regions":["中国香港"]},{"id":7,"name":"乱世佳人","alias":"Gone with the Wind","cover":"https://p0.meituan.net/movie/223c3e186db3ab4ea3bb14508c709400427933.jpg@464w_644h_1e_1c","categories":["剧情","爱情","历史","战争"],"published_at":"1939-12-15","minute":238,"score":9.5,"regions":["美国"]},{"id":8,"name":"喜剧之王","alias":"The King of Comedy","cover":"https://p0.meituan.net/movie/1f0d671f6a37f9d7b015e4682b8b113e174332.jpg@464w_644h_1e_1c","categories":["剧情","喜剧","爱情"],"published_at":"1999-02-13","minute":85,"score":9.5,"regions":["中国香港"]},{"id":9,"name":"楚门的世界","alias":"The Truman Show","cover":"https://p0.meituan.net/movie/8959888ee0c399b0fe53a714bc8a5a17460048.jpg@464w_644h_1e_1c","categories":["剧情","科幻"],"published_at":null,"minute":103,"score":9.0,"regions":["美国"]},{"id":10,"name":"狮子王","alias":"The Lion King","cover":"https://p0.meituan.net/movie/27b76fe6cf3903f3d74963f70786001e1438406.jpg@464w_644h_1e_1c","categories":["动画","歌舞","冒险"],"published_at":"1995-07-15","minute":89,"score":9.0,"regions":["美国"]}]}

The result is JSON data, so we parse it as JSON.

How to recognize JSON:
1. A JSON object is written inside curly braces { }
2. An object can contain multiple key:value pairs
3. A JSON array is written inside square brackets [ ]
4. JSON arrays and JSON objects can be nested inside each other
Example:

myObj = {
  "name": "网站",
  "num": 3,
  "sites": [
    { "name": "Google", "info": [ "Android", "Google 搜索", "Google 翻译" ] },
    { "name": "Runoob", "info": [ "菜鸟教程", "菜鸟工具", "菜鸟微信" ] },
    { "name": "Taobao", "info": [ "淘宝", "网购" ] }
  ]
}
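To see how such nested JSON maps onto Python structures once parsed, here is a small sketch using the standard json module (the string below only mimics the shape of the API response):

import json

raw = '{"count": 100, "results": [{"name": "霸王别姬", "score": 9.5}]}'

data = json.loads(raw)              # JSON object -> dict, JSON array -> list
print(data['count'])                # 100
print(data['results'][0]['name'])   # 霸王别姬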

Now we change .text in the code to .json():

source = requests.get('https://spa1.scrape.center/api/movie?limit=10&offset=0', headers=headers).json()

This gives us a dict whose contents match what the page displays; iterating over the results key is all we need to get the page's key data.

demo.py is as follows:

import requests

headers = {
    "Host": "spa1.scrape.center",
    "Pragma": "no-cache",
    "Referer": "https://spa1.scrape.center/",
    "sec-ch-ua": "\" Not A;Brand\";v=\"99\", \"Chromium\";v=\"100\", \"Google Chrome\";v=\"100\"",
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "\"Windows\"",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "same-origin",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36"
}

source = requests.get('https://spa1.scrape.center/api/movie?limit=10&offset=0', headers=headers).json()
for i in source['results']:
    name = i['name']
    categories = i['categories']
    score = i['score']
    print(name, categories, score)
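Since the response reports count: 100 and each request returns 10 items, the same code extends to every page by stepping offset by 10. A rough sketch, reusing the headers dict defined above:

all_movies = []
for offset in range(0, 100, 10):  # count is 100, 10 items per page
    url = f'https://spa1.scrape.center/api/movie?limit=10&offset={offset}'
    data = requests.get(url, headers=headers).json()
    all_movies.extend(data['results'])

print(len(all_movies))  # expected: 100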

The crawling approach is now complete; next we can move on to Scrapy.
