解析Json格式的饿了么数据(二)

推荐使用在线格式化工具:http://tool.oschina.net/codeformat/json

将获取的json数据格式化


找出我们需要的字段信息,在java中写出实体类,注意属性的类型。

写出实体类

商家类:

public class Shop {

	private Integer id;	//店铺ID
	private String name;	//店铺名称
	private Float rating;	//店铺评分
	private Integer recent_order_num;	//店铺总售量
	private Integer order_lead_time;	//最迟送到时间
	
}

信息类:

public class Info {

	private String name;	//分类名称
	private Food[] foods;	//菜品
	private long id;	//分类ID
	
}

菜品类:

public class Food {
	
	private Float rating;	//菜品等级
	private long restaurant_id;	//所属餐厅ID
	private String description;	//描述
	private Integer month_sales;	//月售量
	private Integer rating_count;	
	private String image_path;	//图片路径
	private String name;	//菜品名称
	private Integer satisfy_count;	//满意的顾客数
	private Integer satisfy_rate;	//满意度
	private Specfood[] specfoods;	//菜品详情
	
}

菜品详情类:(这个是我要的数据类)

public class Specfood {
	private String name;	//菜品名
	private Integer restaurant_id;	//所属餐厅ID
	private Integer food_id;	//菜品ID
	private Float recent_rating;	//最近评价
	private Float price;	//价格
	private Float recent_popularity;	//最近销售量

}

上面的类省去了set get 方法、构造方法以及toString( )。

利用gson进行解析

gson的maven坐标

<dependency>
	<groupId>com.google.code.gson</groupId>
	<artifactId>gson</artifactId>
	<version>2.8.1</version>
</dependency>

下面以爬取北京市海淀区北京大学医学部附近的商铺为例:

因为商铺很多,饿了么是通过滚动条滚到底部,加载一次数据的方式加载店铺


Chrome 监控数据


每次请求只有offset变化,并且是从0~624,每次以24的偏移量递增





package com.yc.elm.utils;

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.util.List;

import org.jsoup.Connection;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;

import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.reflect.TypeToken;
import com.yc.elm.entity.Food;
import com.yc.elm.entity.Info;
import com.yc.elm.entity.Shop;
import com.yc.elm.entity.Specfood;

public class DataUtil {

	public static void main(String[] args) throws Exception {

		for (int i = 0; i < 27; i++) {
			String url = "https://www.ele.me/restapi/shopping/restaurants?extras%5B%5D=activities&geohash=wx4erybdnx4&latitude=39.98375&limit=24&longitude=116.35717"
					+ "&offset=" + i * 24 + "&terminal=web";
			Connection con = Jsoup.connect(url).ignoreContentType(true);
			Response response = con.execute();
			String str = response.body();
			System.out.println(i);
			Gson gson = new Gson();
			List<Shop> shops = gson.fromJson(str, new TypeToken<List<Shop>>() {
			}.getType());
			parseFoods(shops);
		}

	}

	public static void parseFoods(List<Shop> shops) throws Exception {

		FileWriter fw = new FileWriter("data.txt", true);
		BufferedWriter bw = new BufferedWriter(fw);

		for (Shop shop : shops) {

			String url = "https://www.ele.me/restapi/shopping/v2/menu?restaurant_id=" + shop.getId();
			Connection con = Jsoup.connect(url).ignoreContentType(true);
			Response response = con.execute();

			String str = response.body();
			Gson gson = new GsonBuilder().create();

			if (shop.getId() == 271908 || shop.getId() == 156329101 || shop.getId() == 156277299
					|| shop.getId() == 682323 || shop.getId() == 157034079 || shop.getId() == 154898695
					|| shop.getId() == 1001894 || shop.getId() == 2142009 || shop.getId() == 476592
					|| shop.getId() == 305155 || shop.getId() == 156447071) {
				break;
			}

			List<Info> infos = null;
			try {
				infos = gson.fromJson(str, new TypeToken<List<Info>>() {
				}.getType());
			} catch (Exception e) {
				throw new RuntimeException(shop.getId() + "网页有问题!!!", e);
			}

			for (Info info : infos) {
				Food[] foods = info.getFoods();
				for (Food food : foods) {
					Specfood[] specfoods = food.getSpecfoods();
					for (Specfood specfood : specfoods) {
						Integer food_id = specfood.getFood_id();
						String name = specfood.getName();
						Integer restaurant_id = specfood.getRestaurant_id();
						Float price = specfood.getPrice();
						Float recent_rating = specfood.getRecent_rating();
						Float recent_popularity = specfood.getRecent_popularity();
						if (food_id > 0 && name != null && restaurant_id != 0 && price != 0.0) {
							String data = food_id + "\t" + name + "\t" + restaurant_id + "\t" + shop.getName() + "\t"
									+ price + "\t" + recent_rating + "\t" + recent_popularity + "\r\n";
							bw.append(data);
							bw.flush();
						}

					}
				}
			}
		}
		bw.close();
	}
}

其中有一些店铺的数据结构比较特殊,遇到他们我们都会 break 掉。

还有一个问题就是,当你多次请求该页面,该页面会把你当成网络攻击器,拒绝你访问。并报出429 430错误。

Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=429, URL=https://www.ele.me/restapi/shopping/v2/menu?restaurant_id=157108780
	at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:679)
	at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:628)
	at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:260)
	at com.yc.elm.utils.DataUtil.parseFoods(DataUtil.java:51)
	at com.yc.elm.utils.DataUtil.main(DataUtil.java:33)

		
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=430, URL=https://www.ele.me/restapi/shopping/restaurants?extras%255B%255D=activities&geohash=wx4erybdnx4&latitude=39.98375&limit=24&longitude=116.35717&offset=0&terminal=web
	at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:679)
	at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:628)
	at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:260)
	at com.yc.elm.utils.DataUtil.main(DataUtil.java:27)

429 Too Many Requests (太多请求)

当你需要限制客户端请求某个服务数量时,该状态码就很有用,也就是请求速度限制。

在此之前,有一些类似的状态码,例如 '509 Bandwidth Limit Exceeded'. Twitter 使用 420 (这不是HTTP定义的状态码)

如果你希望限制客户端对服务的请求数,可使用 429 状态码,同时包含一个 Retry-After 响应头用于告诉客户端多长时间后可以再次请求服务。

报429后 隔段时间再次访问还是可以,一旦报430,就必须更换IP地址,这里没有对伪装IP和配置代理进行下一步研究,这个问题暂时没有解决。

亲测,这里可以用代码模拟下拉三个页面。

结果:


数据量还是太小,但这是一个好的开始。


评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值