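Series contents
Predicting League of Legends matches (1): feature analysis
Predicting League of Legends matches (2): analysis of the data-acquisition API
Predicting League of Legends matches (3): data acquisition implemented in Java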
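Disclaimer
The crawling approach described in this article is for study and research only and must not be used for commercial purposes.
Preface
The previous articles covered the factors that influence the outcome of LoL matches and the public API that Riot Games provides. This article shows how to build on both to implement an automated crawler.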
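I. Basic framework
The project is built on Spring Boot and uses Apache HttpClient, Apache Commons IO, and Google Guava as supporting libraries. JSON is parsed with Alibaba fastjson, and Lombok generates the boilerplate accessors (IntelliJ IDEA needs the Lombok plugin).
II. Basic configuration
1. Maven dependencies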
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>2.4</version>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.13</version>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<version>1.18.16</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>30.1-jre</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.75</version>
</dependency>
2. properties configuration
# The API key issued by Riot Games: https://developer.riotgames.com/
spider.riotgames.apikey=RGAPI-74967532-db62-4129-a292-a3f0d5713f88
spider.riotgames.lol.base-url=https://developer.riotgames.com
# The URLs below start with br1 because the data region is br1; other regions such as kr and na1 also exist.
spider.riotgames.lol.summoner.byname.url=https://br1.api.riotgames.com/tft/summoner/v1/summoners/by-name/%s
# The URLs below fetch LoL match lists, match details, and match timelines.
spider.riotgames.lol.matchlist.byaccount.url=https://br1.api.riotgames.com/lol/match/v4/matchlists/by-account/%s?beginIndex=%d
spider.riotgames.lol.match.detail.url=https://br1.api.riotgames.com/lol/match/v4/matches/%s
spider.riotgames.lol.match.timelines.url=https://br1.api.riotgames.com/lol/match/v4/timelines/by-match/%s
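III. Module details
The project is split into six parts:
- client
A customized wrapper around HttpClient
- config
Configuration for the Riot Games API
- entity
Entity classes used to read and store data
- pipeline
Two in-memory queues that play the role of a message queue
- scheduler
The scheduled job
- service
The main crawler logic
In this version the data is not written to a database; it is saved locally as JSON files.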
1. client
The client module contains two classes.
- RiotGamesClient
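import com.google.common.util.concurrent.RateLimiter;
import org.apache.http.Header;
import org.apache.http.HttpEntityEnclosingRequest;
import org.apache.http.HttpHeaders;
import org.apache.http.HttpRequest;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.message.BasicHeader;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import javax.net.ssl.SSLException;
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

/**
 * @date 2021/1/16 20:34
 * @auth jixiang.ma
 * @copyright copyright © 2021 jixiang.ma all right reserved.
 **/
@Component
public class RiotGamesClient {
    // Riot's limits: 20 requests per second and 100 requests per two minutes.
    // RateLimiter.create(double) takes permits per second, so the two-minute limit is 100 / 120.
    private static final RateLimiter rateLimiterMin = RateLimiter.create(100.0 / 120.0);
    private static final RateLimiter rateLimiterSec = RateLimiter.create(20.0);
    private static final String UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36";
    private static final String AP = "*/*";
    private static final String AL = "zh-CN,zh;q=0.9";

    @Autowired
    private RiotGamesConfig riotGamesConfig;

    // Built on first use and then reused for every request.
    private HttpClient httpClient;

    private synchronized HttpClient getClient() {
        if (httpClient == null) {
            RequestConfig requestConfig = RequestConfig.custom()
                    .setConnectionRequestTimeout(5000)
                    .setConnectTimeout(5000)
                    .build();
            List<Header> headerList = new ArrayList<>(5);
            headerList.add(new BasicHeader(HttpHeaders.USER_AGENT, UA));
            headerList.add(new BasicHeader(HttpHeaders.ACCEPT, AP));
            headerList.add(new BasicHeader(HttpHeaders.ACCEPT_LANGUAGE, AL));
            headerList.add(new BasicHeader("Origin", riotGamesConfig.getLolBaseUrl()));
            headerList.add(new BasicHeader("X-Riot-Token", riotGamesConfig.getApiKey()));
            httpClient = HttpClientBuilder.create()
                    .setDefaultRequestConfig(requestConfig)
                    .setDefaultHeaders(headerList)
                    .setRetryHandler(retryHandler())
                    .build();
        }
        return httpClient;
    }

    public HttpResponse execute(HttpUriRequest httpRequest) throws IOException {
        // Block until both limiters grant a permit.
        rateLimiterMin.acquire(1);
        rateLimiterSec.acquire(1);
        return getClient().execute(httpRequest);
    }

    private static HttpRequestRetryHandler retryHandler() {
        return (exception, executionCount, context) -> {
            System.out.println("riotGames api retry request: " + executionCount);
            if (executionCount >= 5) {
                // Do not retry if over max retry count
                return false;
            }
            if (exception instanceof SocketTimeoutException) {
                // Read timeout: retry. This check has to come before the
                // InterruptedIOException check, because SocketTimeoutException is a subclass of it.
                return true;
            }
            if (exception instanceof InterruptedIOException) {
                // Other timeouts, e.g. connect timeout
                return false;
            }
            if (exception instanceof UnknownHostException) {
                // Unknown host
                return false;
            }
            if (exception instanceof SSLException) {
                // SSL handshake exception
                return false;
            }
            HttpClientContext clientContext = HttpClientContext.adapt(context);
            HttpRequest request = clientContext.getRequest();
            // Requests without a body (such as GET) are treated as idempotent and retried.
            return !(request instanceof HttpEntityEnclosingRequest);
        };
    }
}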
- RiotGamesRetryHandler
/**
* @date 2021/1/17 17:48
* @auth jixiang.ma
* @copyright copyright © 2021 jixiang.ma all right reserved.
**/
public class RiotGamesRetryHandler implements HttpRequestRetryHandler {
@Override
public boolean retryRequest(IOException e, int i, HttpContext httpContext) {
return false;
}
}
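RiotGamesClient defines two RateLimiter instances because Riot's API is rate limited: at most 20 requests per second and at most 100 requests every two minutes. Every HTTP request therefore has to be throttled before it is sent, and Guava's RateLimiter is used for that.
As an aside, Google Guava is a very capable complement to the JDK and well worth learning.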
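To make the two limits concrete, here is a minimal sketch of a sliding-window limiter that enforces both at once. It is not Guava's RateLimiter (which hands out permits at a smoothed, steady rate) and it is not the code used in this project; `WindowLimiter` and `RiotLimits` are hypothetical names that rely only on the JDK. The current time is passed in explicitly so the behavior is easy to check.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sliding-window limiter: at most maxRequests per windowMillis. */
class WindowLimiter {
    private final int maxRequests;
    private final long windowMillis;
    // Timestamps of the requests granted inside the current window, oldest first.
    private final Deque<Long> granted = new ArrayDeque<>();

    WindowLimiter(int maxRequests, long windowMillis) {
        this.maxRequests = maxRequests;
        this.windowMillis = windowMillis;
    }

    /** Returns how many ms a request arriving at nowMillis must wait; 0 means it may go now. */
    synchronized long waitTime(long nowMillis) {
        evict(nowMillis);
        if (granted.size() < maxRequests) {
            return 0L;
        }
        // The oldest granted request leaves the window at oldest + windowMillis.
        return granted.peekFirst() + windowMillis - nowMillis;
    }

    /** Records a granted request. */
    synchronized void record(long nowMillis) {
        evict(nowMillis);
        granted.addLast(nowMillis);
    }

    private void evict(long nowMillis) {
        while (!granted.isEmpty() && nowMillis - granted.peekFirst() >= windowMillis) {
            granted.removeFirst();
        }
    }
}

/** Combines the two limits quoted above: 20 per second and 100 per two minutes. */
class RiotLimits {
    private final WindowLimiter perSecond = new WindowLimiter(20, 1_000L);
    private final WindowLimiter perTwoMinutes = new WindowLimiter(100, 120_000L);

    /** Grants the request only if both windows have room; otherwise returns false. */
    synchronized boolean tryAcquire(long nowMillis) {
        if (perSecond.waitTime(nowMillis) > 0 || perTwoMinutes.waitTime(nowMillis) > 0) {
            return false;
        }
        perSecond.record(nowMillis);
        perTwoMinutes.record(nowMillis);
        return true;
    }
}
```

In a burst, the per-second window is the one that fills first; over a longer run the two-minute window becomes the real ceiling, which is why both limits have to be respected.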
2. config
The config module contains a single configuration class that supplies parameters to the other modules.
/**
* @date 2021/1/17 1:30
* @auth jixiang.ma@transwarp.io
* @copyright copyright © 2021 www.jixiang.ma all right reserved.
**/
@Configuration
@Getter
public class RiotGamesConfig {
@Value("${spider.riotgames.apikey}")
private String apiKey;
@Value("${spider.riotgames.lol.base-url}")
private String lolBaseUrl;
@Value("${spider.riotgames.lol.summoner.byname.url}")
private String lolSummonerUrl;
@Value("${spider.riotgames.lol.matchlist.byaccount.url}")
private String lolMatchListByAccountUrl;
@Value("${spider.riotgames.lol.match.detail.url}")
private String lolMatchDetailUrl;
@Value("${spider.riotgames.lol.match.timelines.url}")
private String lolMatchTimeLinesUrl;
}
3. entity
This module is the foundation of the crawler: all data is exchanged through the entity classes defined here.
- Ban
/**
* @date 2021/1/17 13:38
* @auth jixiang.ma
* @copyright copyright © 2021 jixiang.ma all right reserved.
**/
@Getter
@Setter
public class Ban {
private Integer championId;
private Integer pickTurn;
}
- Frame
/**
* @date 2021/1/17 14:59
* @auth jixiang.ma
* @copyright copyright © 2021 jixiang.ma all right reserved.
**/
@Getter
@Setter
@ToString
public class Frame {
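// In the match-v4 timeline JSON, participantFrames is an object keyed by participant ID, not an array.
private Map<String, ParticipantFrame> participantFrames;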
private Long timestamp;
private List<Event> events;
}
- Match
/**
* @date 2021/1/17 13:11
* @auth jixiang.ma
* @copyright copyright © 2021 jixiang.ma all right reserved.
**/
@Getter
@Setter
public class Match {
private String platformId;
private Long gameId;
private Integer champion;
private Integer queue;
private Integer season;
private Long timestamp;
private String role;
private String lane;
}
- MatchDetail
/**
* @date 2021/1/17 13:26
* @auth jixiang.ma
* @copyright copyright © 2021 jixiang.ma all right reserved.
**/
@Getter
@Setter
@ToString
public class MatchDetail {
private Long gameId;
private String platformId;
private Long gameCreation;
private Long gameDuration;
private Integer queueId;
private Integer mapId;
private Integer seasonId;
private String gameVersion;
private String gameMode;
private String gameType;
private List<Team> teams;
private List<Participants> participants;
private List<ParticipantIdentities> participantIdentities;
private List<Ban> bans;
}
- MatchList
/**
* @date 2021/1/17 14:48
* @auth jixiang.ma
* @copyright copyright © 2021 jixiang.ma all right reserved.
**/
@Data
public class MatchList {
private List<Match> matches;
private Integer startIndex;
private Integer endIndex;
private Integer totalGames;
}
- ParticipantFrame
/**
* @date 2021/1/17 14:27
* @auth jixiang.ma
* @copyright copyright © 2021 jixiang.ma all right reserved.
**/
@Setter
@Getter
@ToString
public class ParticipantFrame {
private Integer participantId;
private Integer currentGold;
private Integer totalGold;
private Integer level;
private Integer xp;
private Integer minionsKilled;
private Integer jungleMinionsKilled;
private Integer dominionScore;
private Integer teamScore;
}
- ParticipantIdentities
/**
* @date 2021/1/17 13:31
* @auth jixiang.ma
* @copyright copyright © 2021 jixiang.ma all right reserved.
**/
@Getter
@Setter
@ToString
public class ParticipantIdentities {
private Integer participantId;
private Player player;
}
- Participants
/**
* @date 2021/1/17 13:30
* @auth jixiang.ma
* @copyright copyright © 2021 jixiang.ma all right reserved.
**/
@Getter
@Setter
public class Participants {
private Integer participantId;
private Integer teamId;
private Integer championId;
private Integer spell1Id;
private Integer spell2Id;
}
- Player
/**
* @date 2021/1/17 15:56
* @auth jixiang.ma
* @copyright copyright © 2021 jixiang.ma all right reserved.
**/
@Getter
@Setter
@ToString
public class Player {
private String platformId;
private String accountId;
private String summonerName;
private String summonerId;
private String currentAccountId;
private String currentPlatformId;
private String matchHistoryUri;
}
- Stats
/**
* @date 2021/1/17 13:42
* @auth jixiang.ma
* @copyright copyright © 2021 jixiang.ma all right reserved.
**/
@Getter
@Setter
public class Stats {
private Integer participantId;
private Boolean win;
private Integer item0;
private Integer item1;
private Integer item2;
private Integer item3;
private Integer item4;
private Integer item5;
private Integer item6;
private Integer kills;
private Integer deaths;
private Integer assists;
private Integer largestKillingSpree;
private Integer largestMultiKill;
private Integer killingSprees;
private Integer longestTimeSpentLiving;
private Integer doubleKills;
private Integer tripleKills;
private Integer quadraKills;
private Integer pentaKills;
private Integer unrealKills;
private Integer totalDamageDealt;
private Integer magicDamageDealt;
private Integer physicalDamageDealt;
private Integer trueDamageDealt;
// Largest single critical-strike damage
private Integer largestCriticalStrike;
private Integer totalDamageDealtToChampions;
private Integer magicDamageDealtToChampions;
private Integer physicalDamageDealtToChampions;
private Integer trueDamageDealtToChampions;
private Integer totalHeal;
private Integer totalUnitsHealed;
private Integer damageSelfMitigated;
private Integer damageDealtToObjectives;
private Integer damageDealtToTurrets;
private Integer visionScore;
private Integer timeCCingOthers;
private Integer totalDamageTaken;
private Integer magicalDamageTaken;
private Integer physicalDamageTaken;
private Integer trueDamageTaken;
private Integer goldEarned;
private Integer goldSpent;
private Integer turretKills;
private Integer inhibitorKills;
private Integer totalMinionsKilled;
private Integer neutralMinionsKilled;
private Integer neutralMinionsKilledTeamJungle;
private Integer neutralMinionsKilledEnemyJungle;
private Integer totalTimeCrowdControlDealt;
private Integer champLevel;
private Integer visionWardsBoughtInGame;
private Integer sightWardsBoughtInGame;
private Integer wardsPlaced;
private Integer wardsKilled;
private Boolean firstBloodKill;
private Boolean firstBloodAssist;
private Boolean firstTowerKill;
private Boolean firstTowerAssist;
private Boolean firstInhibitorKill;
private Boolean firstInhibitorAssist;
}
- Summoner
/**
* @date 2021/1/16 20:22
* @auth jixiang.ma
* @copyright copyright © 2021 jixiang.ma all right reserved.
**/
@Getter
@Setter
@ToString
public class Summoner {
private String id;
private String accountId;
private String puuid;
private String name;
private Integer profileIconId;
private Long revisionDate;
private Integer summonerLevel;
}
- Team
/**
* @date 2021/1/17 13:30
* @auth jixiang.ma
* @copyright copyright © 2021 jixiang.ma all right reserved.
**/
@Getter
@Setter
public class Team {
private Integer teamId;
private Win win;
private Boolean firstBlood;
private Boolean firstTower;
private Boolean firstInhibitor;
private Boolean firstBaron;
private Boolean firstDragon;
private Boolean firstRiftHerald;
private Integer towerKills;
private Integer inhibitorKills;
private Integer baronKills;
private Integer dragonKills;
private Integer vilemawKills;
private Integer riftHeraldKills;
private Integer dominionVictoryScore;
}
- TimeLine
/**
* @date 2021/1/17 14:39
* @auth jixiang.ma
* @copyright copyright © 2021 jixiang.ma all right reserved.
**/
@Getter
@Setter
@ToString
public class TimeLine {
// Each timeline entry is a Frame, which in turn holds the per-participant frames.
private List<Frame> frames;
private Long frameInterval;
}
- Win
/**
* @date 2021/1/17 13:33
* @auth jixiang.ma
* @copyright copyright © 2021 jixiang.ma all right reserved.
**/
public enum Win {
Fail, Win
}
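4. pipeline
This module implements two queue-like structures; every summoner accountId and gameId to be processed is taken from here. To avoid duplicate accountIds and gameIds, each value is checked against a Bloom filter first. A Bloom filter never reports a stored value as absent, so an ID that has already been queued is never queued again. The trade-off is that a new ID is occasionally mistaken for a duplicate and skipped; the chance of that is roughly the configured false-positive rate (0.00001 here) as long as the filter holds no more than its expected 10,000,000 entries.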
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import org.springframework.stereotype.Component;

import java.nio.charset.StandardCharsets;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

/**
 * @date 2021/1/17 16:02
 * @auth jixiang.ma
 * @copyright copyright © 2021 jixiang.ma all right reserved.
 **/
@Component
public class RiotGamesPipeLine {
    // FIFO queues of IDs that still have to be crawled.
    private final Queue<String> summoners = new ConcurrentLinkedQueue<>();
    private final Queue<String> gameIds = new ConcurrentLinkedQueue<>();
    // Remember every ID ever queued so that it is not queued twice.
    private final BloomFilter<String> summonerBloomFilter =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10000000, 0.00001);
    private final BloomFilter<String> gameIdBloomFilter =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10000000, 0.00001);

    public void addSummoner(String accountId) {
        if (accountId != null && !summonerBloomFilter.mightContain(accountId)) {
            summonerBloomFilter.put(accountId);
            summoners.add(accountId);
        }
    }

    /** Returns the next accountId to crawl, or null if the queue is empty. */
    public String getSummoner() {
        return summoners.poll();
    }

    public void addGameId(String gameId) {
        if (gameId != null && !gameIdBloomFilter.mightContain(gameId)) {
            gameIdBloomFilter.put(gameId);
            gameIds.add(gameId);
        }
    }

    /** Returns the next gameId to crawl, or null if the queue is empty. */
    public String getGameId() {
        return gameIds.poll();
    }
}
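The property the pipeline relies on is that a Bloom filter has no false negatives: once an ID has been put in, `mightContain` always returns true for it. The sketch below illustrates this with a deliberately tiny, hand-written filter. `TinyBloomFilter` and `DedupQueue` are hypothetical names, the hashing scheme is an assumption chosen for brevity, and Guava's BloomFilter is considerably more refined.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.BitSet;
import java.util.Queue;

/** Hypothetical, simplified Bloom filter: k bit positions per value in a fixed-size bit set. */
class TinyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    TinyBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Seeded FNV-1a style hash over the UTF-8 bytes of the value.
    private static int hash(String value, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 16777619;
        }
        return h;
    }

    // Double hashing: the i-th position is derived from two base hashes.
    private int index(String value, int i) {
        return Math.floorMod(hash(value, 0) + i * hash(value, 0x9E3779B9), size);
    }

    boolean mightContain(String value) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(value, i))) {
                return false; // at least one bit is unset: definitely never added
            }
        }
        return true; // all bits set: added before, or a false positive
    }

    void put(String value) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(value, i));
        }
    }
}

/** Queue that accepts each ID at most once, mirroring addGameId/getGameId. */
class DedupQueue {
    private final TinyBloomFilter seen = new TinyBloomFilter(1 << 16, 4);
    private final Queue<String> pending = new ArrayDeque<>();

    boolean offer(String id) {
        if (seen.mightContain(id)) {
            return false;
        }
        seen.put(id);
        pending.add(id);
        return true;
    }

    String poll() {
        return pending.poll();
    }

    int size() {
        return pending.size();
    }
}
```

Note that an ID stays "seen" even after it has been polled, so a game that was already crawled is not fetched again when it shows up in another player's match list.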
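5. scheduler
This module handles job scheduling. The cron expression triggers the job once a minute, but if the previous run has not finished, the new run is skipped.
The job uses Spring Boot's built-in scheduling support, so the application's main class must be annotated with @EnableScheduling.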
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

/**
 * @date 2021/1/16 20:13
 * @auth jixiang.ma
 * @copyright copyright © 2020 jixiang.ma all right reserved.
 **/
@Slf4j
@Component
public class WebCrawlerJob {
    @Autowired
    private LolService lolService;
    @Autowired
    private RiotGamesPipeLine pipeLine;

    private static final Lock lock = new ReentrantLock();

    @Scheduled(cron = "0 0/1 * * * ?")
    public void crawlerJob() {
        if (lock.tryLock()) {
            try {
                doExecute();
            } finally {
                lock.unlock();
            }
        } else {
            log.warn("the previous job has not finished, so this job is canceled");
        }
    }

    private void doExecute() {
        // Seed the crawl with a known summoner name.
        lolService.getSummonerDetails("abc");
        String summonerAccountId = pipeLine.getSummoner();
        while (summonerAccountId != null) {
            // The first page reports totalGames, which tells us how many more pages to request.
            MatchList matchList = lolService.getMatchList(summonerAccountId, 0);
            if (matchList != null && matchList.getEndIndex() != null && matchList.getTotalGames() != null) {
                for (int i = matchList.getEndIndex(); i < matchList.getTotalGames(); i += 100) {
                    lolService.getMatchList(summonerAccountId, i);
                }
            }
            // Drain every gameId queued so far before moving on to the next summoner.
            String gameId = pipeLine.getGameId();
            while (gameId != null) {
                log.info("gameId = {}", gameId);
                lolService.getMatchDetail(gameId);
                lolService.getGameTimeLine(gameId);
                gameId = pipeLine.getGameId();
            }
            summonerAccountId = pipeLine.getSummoner();
        }
    }
}
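The "skip this run if the previous one is still going" behavior comes entirely from `tryLock`, which returns immediately instead of waiting. The following sketch isolates that pattern without Spring so it can be run on its own; `SkipIfRunning` and its `demo` method are hypothetical names introduced only for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical guard that runs a job only when no other run is in progress. */
class SkipIfRunning {
    private final Lock lock = new ReentrantLock();
    final AtomicInteger executed = new AtomicInteger();
    final AtomicInteger skipped = new AtomicInteger();

    void trigger(Runnable job) {
        if (lock.tryLock()) {            // does not block: false if another thread holds the lock
            try {
                job.run();
                executed.incrementAndGet();
            } finally {
                lock.unlock();
            }
        } else {
            skipped.incrementAndGet();   // previous run still in progress
        }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Starts one slow run on a worker thread and fires a second trigger while it is running. */
    static int[] demo() {
        SkipIfRunning guard = new SkipIfRunning();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread worker = new Thread(() -> guard.trigger(() -> {
            started.countDown();
            awaitQuietly(release);       // simulate a long-running crawl
        }));
        worker.start();
        awaitQuietly(started);           // the first run now holds the lock
        guard.trigger(() -> { });        // overlapping trigger from another thread
        release.countDown();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new int[] {guard.executed.get(), guard.skipped.get()};
    }
}
```

Because ReentrantLock is reentrant, the guard only rejects triggers that arrive on a different thread from the one currently running the job, which is the situation a scheduler with more than one worker thread would create.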
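6. service
This module is the core of the project and holds all the business logic. The method names describe what each method does, so they are not explained one by one.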
import com.alibaba.fastjson.JSONObject;
import lombok.AllArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.apache.commons.io.FileUtils;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.util.EntityUtils;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/**
 * @date 2021/1/16 20:21
 * @auth jixiang.ma
 * @copyright copyright © 2021 jixiang.ma all right reserved.
 **/
@Slf4j
@Service
@AllArgsConstructor(onConstructor_ = {@Autowired})
public class LolService {
    // Directory where the raw JSON responses are stored.
    private static final String DATA_DIR = "E:\\riotgames\\data\\";

    private RiotGamesClient riotGamesClient;
    private RiotGamesConfig riotGamesConfig;
    private RiotGamesPipeLine riotGamesPipeLine;

    public Summoner getSummonerDetails(String summonerName) {
        String result = fetch(String.format(riotGamesConfig.getLolSummonerUrl(), summonerName));
        if (result == null) {
            return null;
        }
        Summoner summoner = JSONObject.parseObject(result, Summoner.class);
        if (summoner != null && summoner.getAccountId() != null) {
            riotGamesPipeLine.addSummoner(summoner.getAccountId());
        }
        return summoner;
    }

    public MatchList getMatchList(String accountId, Integer beginIndex) {
        String result = fetch(String.format(riotGamesConfig.getLolMatchListByAccountUrl(), accountId, beginIndex));
        if (result == null) {
            return null;
        }
        MatchList matches = JSONObject.parseObject(result, MatchList.class);
        if (matches != null && matches.getMatches() != null) {
            matches.getMatches().forEach(match -> riotGamesPipeLine.addGameId(String.valueOf(match.getGameId())));
        }
        return matches;
    }

    public MatchDetail getMatchDetail(String matchId) {
        String result = fetch(String.format(riotGamesConfig.getLolMatchDetailUrl(), matchId));
        if (result == null) {
            return null;
        }
        MatchDetail matchDetail = JSONObject.parseObject(result, MatchDetail.class);
        if (matchDetail != null && matchDetail.getParticipantIdentities() != null) {
            // Queue every participant's accountId so the crawl can move on to other players.
            matchDetail.getParticipantIdentities().forEach(participant -> {
                if (participant.getPlayer() != null && participant.getPlayer().getAccountId() != null) {
                    riotGamesPipeLine.addSummoner(participant.getPlayer().getAccountId());
                }
            });
        }
        saveJsonFile(result, DATA_DIR + matchId + "_detail.json");
        return matchDetail;
    }

    public TimeLine getGameTimeLine(String matchId) {
        String result = fetch(String.format(riotGamesConfig.getLolMatchTimeLinesUrl(), matchId));
        if (result == null) {
            return null;
        }
        TimeLine timeLine = JSONObject.parseObject(result, TimeLine.class);
        saveJsonFile(result, DATA_DIR + matchId + "_timeline.json");
        return timeLine;
    }

    /** Sends a GET request and returns the response body, or null if the request failed. */
    private String fetch(String url) {
        try {
            HttpResponse response = riotGamesClient.execute(new HttpGet(url));
            String body = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
            int status = response.getStatusLine().getStatusCode();
            if (status != HttpStatus.SC_OK) {
                log.error("request to {} returned status {}", url, status);
                return null;
            }
            return body;
        } catch (IOException e) {
            log.error("request to {} failed", url, e);
            return null;
        }
    }

    private void saveJsonFile(String data, String path) {
        if (data == null) {
            return;
        }
        File file = new File(path);
        if (file.exists()) {
            return; // already saved
        }
        try {
            // FileUtils.write creates missing parent directories before writing.
            FileUtils.write(file, data, StandardCharsets.UTF_8, false);
        } catch (IOException e) {
            log.error("write match info to file failed.", e);
        }
    }
}
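When the crawler runs, each game produces two JSON files: one with the match details and one with the timeline.
Summary
This article walked through a complete automated crawler for Riot Games data. It uses RateLimiter for throttling and BloomFilter for de-duplication, and it provides four methods for fetching data.
Open issues
Restarts waste resources
When the program restarts, it starts crawling from abc again, which wastes work already done. To keep the project as lightweight as possible, this version simply assumes the program keeps running.
There are two ways to avoid this:
- Change the name by hand
Update the seed name every time the process is restarted.
- Persist summoner data to a database
Each time the scheduled job starts, read from the database the most recent summoner name or accountId that has not been crawled yet.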
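As a rough sketch of the second option, the pending accountIds can be written out and read back on startup. To stay self-contained, the example below uses a plain text file as a stand-in for the database; `PendingSummonerStore`, its methods, and the file layout (one accountId per line) are all assumptions, not part of the project.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/** Hypothetical store for accountIds that are queued but not crawled yet. */
class PendingSummonerStore {
    private final Path file;

    PendingSummonerStore(Path file) {
        this.file = file;
    }

    /** Overwrites the file with the current pending accountIds, one per line. */
    void save(Collection<String> pendingAccountIds) {
        try {
            Files.write(file, pendingAccountIds, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the accountIds saved by the previous run, or an empty list on first start. */
    List<String> load() {
        if (!Files.exists(file)) {
            return new ArrayList<>();
        }
        try {
            return new ArrayList<>(Files.readAllLines(file, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

On startup the job would call `load()` and only fall back to the hard-coded seed name when the returned list is empty; a real database would additionally let several crawler instances share the same queue.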
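Slow crawling
Riot allows at most 100 requests every two minutes, so roughly 36,000 games can be fetched per day (one detail request and one timeline request per game).
A distributed crawler or several API keys can be used to raise that ceiling.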
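The 36,000 figure follows directly from the limit: a day contains 720 two-minute windows, each window allows 100 requests, and each game costs two requests. The small calculation below spells this out; `CrawlThroughput` is a hypothetical helper, and it ignores the requests spent on summoner lookups and match lists, so the real number is somewhat lower.

```java
/** Hypothetical helper that estimates how many games can be fetched per day. */
class CrawlThroughput {
    static long gamesPerDay(int requestsPerWindow, int windowSeconds, int requestsPerGame, int apiKeys) {
        long windowsPerDay = 24L * 60 * 60 / windowSeconds;      // 86,400 s / 120 s = 720
        long requestsPerDay = windowsPerDay * requestsPerWindow * apiKeys;
        return requestsPerDay / requestsPerGame;
    }

    public static void main(String[] args) {
        // One key: 720 windows * 100 requests / 2 requests per game
        System.out.println(gamesPerDay(100, 120, 2, 1)); // prints 36000
        // Three keys, assuming each key has its own independent limit
        System.out.println(gamesPerDay(100, 120, 2, 3)); // prints 108000
    }
}
```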
API key lifetime
Note: the API key issued by Riot is valid for only 24 hours, so I recommend storing it in a database and updating it every 23 hours.
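One way to act on that advice is to keep the key together with the time it was issued and to flag it for renewal once 23 hours have passed. The sketch below shows only that bookkeeping; `ApiKeyHolder` is a hypothetical class, the key value in the usage note is a placeholder, and where the new key comes from (database, config reload, manual entry) is left open.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical holder that tracks when the current API key should be replaced. */
final class ApiKeyHolder {
    // Refresh one hour before the 24-hour lifetime runs out.
    private static final Duration REFRESH_AFTER = Duration.ofHours(23);

    private String key;
    private Instant issuedAt;

    synchronized void update(String newKey, Instant now) {
        this.key = newKey;
        this.issuedAt = now;
    }

    synchronized String currentKey() {
        return key;
    }

    /** True if no key has been set yet or the current key is at least 23 hours old. */
    synchronized boolean needsRefresh(Instant now) {
        return issuedAt == null || !now.isBefore(issuedAt.plus(REFRESH_AFTER));
    }
}
```

A scheduled task could call `needsRefresh(Instant.now())` before each crawl and load the latest key, for example with `update("RGAPI-placeholder", Instant.now())`, when it returns true.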
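Incomplete data
These endpoints return data for a single region only. The crawler will most likely end up with all the data for that region, but data for other regions, such as KR (home of the LCK) or NA, stays out of reach. Fetching it requires changing the URLs in the configuration file; the URLs for other regions are identical except for the subdomain.
Solution: add an enum that lists all region IDs, and have the service build the request URL for the chosen region.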
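A minimal sketch of such an enum is shown below. It lists only a handful of regions rather than all of them, reuses the URL paths from the properties file above, and the names `Region`, `host`, `matchDetailUrl`, and `matchTimelineUrl` are hypothetical.

```java
import java.util.Locale;

/** Hypothetical enum of Riot platform regions; only a subset is listed here. */
enum Region {
    BR1, KR, NA1, EUW1;

    /** The subdomain is the lower-cased region ID, e.g. kr.api.riotgames.com. */
    String host() {
        return name().toLowerCase(Locale.ROOT) + ".api.riotgames.com";
    }

    String matchDetailUrl(String matchId) {
        return String.format("https://%s/lol/match/v4/matches/%s", host(), matchId);
    }

    String matchTimelineUrl(String matchId) {
        return String.format("https://%s/lol/match/v4/timelines/by-match/%s", host(), matchId);
    }
}
```

With this in place the service could take a `Region` argument, for example `Region.KR.matchDetailUrl("123")`, instead of reading a fixed br1 URL from the configuration.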
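All of the issues above, along with their fixes, will be addressed in the next version, so stay tuned.
Writing this took effort, so please credit the source when reposting. My knowledge is limited and mistakes are inevitable; corrections are very welcome. Likes and comments are appreciated. Thank you!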