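I no longer remember which day it was, some months ago, when I stumbled on the fact that shot location data from the CBS live coverage can be read straight out of the HTML files. I was thrilled at the time; back then I still had no idea how to handle ESPN's XML data...
Looking back now at the CBS shot data I processed, I have to admit it is of limited value.
The processed files comprise a table mapping CBSplayerID to player names, shotdata for the 8 seasons from 03 to 11, and a table explaining each shotType.
The CBS shot totals differ noticeably from the overall season statistics, often by several hundred to a thousand shots in either direction. The totals still match to better than 98%, which should count as decent. For individual games, though, the shotdata has an inaccurate timeline and shots credited to the wrong player (judged mainly against the PBP timelines from the NBA official site and ESPN), a clear weakness compared with the ESPN XML shot data I obtained later.
One more point worth mentioning, though: CBS and the NBA official site describe shot types in rich detail, while ESPN's categories are somewhat coarser.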
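I once happened to come across an unusual putback dunk.
I was originally curious whether the play counted as an assisted alley-oop or as a putback dunk off the offensive rebound of a missed shot, but what I found instead was that only CBS describes it as a dunk, while ESPN and the NBA official site record it as a layup.
So there are presumably other inconsistent shot descriptions, though probably only a few. Given that the timelines do not match, reconciling them would be fairly troublesome, so I have not dealt with this issue yet.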
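A brief record of the basic crawling and processing steps:
1. For each of the 8 seasons from 03 to 11, save the scoreboard file of one arbitrary day and extract all game days of the 8 seasons from them.
For example: http://www.cbssports.com/nba/scoreboard/20110101.
This mainly means matching "<a href=\"/nba/scoreboard/" in the page and adding the 8-digit string that follows to the set of game days.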
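The matching in step 1 can be sketched as follows. This is a minimal illustration, not the code actually used: the class name ScoreboardDates, the method extractDates and the sample HTML are assumptions made for the example.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScoreboardDates {
    // The scoreboard link prefix followed by eight digits (yyyyMMdd), captured as group 1.
    private static final Pattern DATE_LINK =
            Pattern.compile("<a href=\"/nba/scoreboard/([0-9]{8})");

    /** Returns the distinct game days linked from one scoreboard page, in ascending order. */
    static Set<String> extractDates(String html) {
        Set<String> dates = new TreeSet<>();
        Matcher m = DATE_LINK.matcher(html);
        while (m.find()) {
            dates.add(m.group(1));
        }
        return dates;
    }

    public static void main(String[] args) {
        // Made-up page fragment, only to exercise the pattern.
        String sample = "<a href=\"/nba/scoreboard/20110102\">Next</a>"
                + "<a href=\"/nba/scoreboard/20101231\">Prev</a>"
                + "<a href=\"/nba/scoreboard/20110102\">Again</a>";
        System.out.println(extractDates(sample)); // prints [20101231, 20110102]
    }
}
```

A TreeSet is used so that duplicate links collapse and the days come out sorted, which makes it easy to spot where a season starts and ends.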
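2. Using all the game-day links as seeds, configure a Heritrix job to fetch the shotchart pages of every game.
This mainly means matching "NBA_[0-9]+_[A-Z]*@[A-Z]*" and adding the resulting links to the crawl queue.
In principle you do not have to write your own simple Extractor subclass, but then you need to configure link filtering rules in the job, and the default link extraction module pulls out many useless links that all have to be checked, so the crawl takes longer.
Alternatively, you can first fetch the list of game days with a download tool, extract the feature strings of all games with a regular expression (this requires some programming), and then fetch the shotchart pages from the resulting links. The download part is easily handled with Xunlei, and each file is named after the game's feature string.
For example: http://www.cbssports.com/nba/gametracker/shotchart/NBA_20110101_CLE@CHI is saved under the file name "NBA_20110101_CLE@CHI".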
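The regular-expression step of this alternative could look roughly like the sketch below. The pattern and the shotchart URL prefix are the ones given in this post; the class name ShotchartLinks, its methods and the sample HTML are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ShotchartLinks {
    // Feature string of a game, e.g. NBA_20110101_CLE@CHI.
    private static final Pattern GAME_KEY = Pattern.compile("NBA_[0-9]+_[A-Z]*@[A-Z]*");
    private static final String SHOTCHART =
            "http://www.cbssports.com/nba/gametracker/shotchart/";

    /** Collects the distinct game feature strings of a page, in order of first appearance. */
    static Set<String> extractGameKeys(String html) {
        Set<String> keys = new LinkedHashSet<>();
        Matcher m = GAME_KEY.matcher(html);
        while (m.find()) {
            keys.add(m.group());
        }
        return keys;
    }

    /** Builds the shotchart URL for a feature string; the string itself doubles as the file name. */
    static String toShotchartUrl(String gameKey) {
        return SHOTCHART + gameKey;
    }

    public static void main(String[] args) {
        // Made-up page fragment; real scoreboard markup will differ.
        String sample = "<a href=\"/x/NBA_20110101_CLE@CHI\">a</a>"
                + "<a href=\"/y/NBA_20110101_CLE@CHI\">b</a>";
        for (String key : extractGameKeys(sample)) {
            System.out.println(toShotchartUrl(key));
        }
    }
}
```

The printed URLs can then be handed to any download tool as a plain list.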
Still, I went with the programming approach...
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.httpclient.URIException;
import org.archive.crawler.datamodel.CrawlURI;
import org.archive.crawler.extractor.Extractor;
import org.archive.crawler.extractor.Link;
import org.archive.io.ReplayCharSequence;
import org.archive.util.HttpRecorder;
public class CBSScoreboardExtractor extends Extractor {
private static final long serialVersionUID = 5855731422080471017L;
private static Logger logger =
Logger.getLogger(CBSScoreboardExtractor.class.getName());
public CBSScoreboardExtractor(String name) {
this(name, "CBSSport Scoreboard Extractor");
}
public CBSScoreboardExtractor(String name, String description) {
super(name, description);
}
//Extract the feature string of every CBS game from the scoreboard page
private static final String CBS_FEATURE = "NBA_[0-9]+_[A-Z]*@[A-Z]*";
private static final String SHOTCHART = "http://www.cbssports.com/nba/gametracker/shotchart/";
protected void extract(CrawlURI curi){
//The following block fetches the response body of the current URI so that its content can be analyzed
ReplayCharSequence cs = null;
try {
HttpRecorder hr = curi.getHttpRecorder();
if (hr == null) {
throw new IOException("Why is recorder null here?");
}
cs = hr.getReplayCharSequence();
} catch (IOException e) {
curi.addLocalizedError(this.getName(), e,
"Failed get of replay char sequence " + curi.toString()
+ " " + e.getMessage());
logger.log(Level.SEVERE, "Failed get of replay char sequence in "
+ Thread.currentThread().getName(), e);
}
if (cs == null) {
return;
}
// Convert the fetched content to a string
String content = cs.toString();
try {
// Run the regular expression over the page content
// and pull out the feature string of each game
Pattern pattern = Pattern.compile(CBS_FEATURE);
Matcher matcher = pattern.matcher(content);
// For every game found, queue its shotchart page
while (matcher.find()) {
String aShotchartLink = SHOTCHART + matcher.group();
addLinkFromString(curi, aShotchartLink, "", Link.NAVLINK_HOP);
}
curi.linkExtractorFinished();
} catch (Exception e) {
// Log through the class logger instead of printing to stderr
logger.log(Level.WARNING, "Failed to extract shotchart links from " + curi, e);
}
}
// Record the link so that it can be processed later
private void addLinkFromString(CrawlURI curi, String uri,
CharSequence context, char hopType) {
try {
curi.createAndAddLinkRelativeToBase(uri, context.toString(),
hopType);
} catch (URIException e) {
if (getController() != null) {
getController().logUriError(e, curi.getUURI(), uri);
} else {
logger.info("Failed createAndAddLinkRelativeToBase "
+ curi + ", " + uri + ", " + context + ", "
+ hopType + ": " + e);
}
}
}
}
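In total this fetched shotchart data for more than 10,000 games.
3. Manually gather the games of each season into one folder and remove the All-Star games and postponed games. About ten games were not fetched because of a broken link on some page, so I saved those pages by hand.
4. From each individual shotchart page, extract the player information (CBSplayerID and player name) and the shot information, and write them to text files by season.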
package CBS;
import java.io.*;
import java.util.Comparator;
import java.util.Iterator;
import java.util.TreeSet;
/*
* The complete shot data of each season from 2003 to 2011 is saved to its own text file
* 20031028-20040615 1189 + 82
* 20041102-20050623 1230 + 84
* 20051101-20060620 1230 + 89
* 20061031-20070614 1230 + 79
* 20071030-20080617 1230 + 86
* 20081028-20090618 1230 + 85
* 20091027-20100617 1230 + 82
* 20101026-20110612 1230 + 81
* Damon Jones & Dwayne Jones 2007-08 Cavaliers
* James Jones & Jumaine Jones 2006-07 Suns
*
* rescheduled game
*
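* The source data contains incorrect player information:
* the same player under different IDs, e.g. Awvee Scorey; the same ID under different names, e.g. Yao Ming and Ming Yao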
*/
public class CBSShotchartParser {
public static void main(String[] args) throws Exception{
File directory = new File("E:\\NBA\\data\\2003-2011CBSshotchart\\06-07\\");
String[] shotcharts = directory.list();
//FileWriter fr0304 = new FileWriter("E:\\2003-04shotdata.txt");
//FileWriter fr0405 = new FileWriter("E:\\2004-05shotdata.txt");
//FileWriter fr0506 = new FileWriter("E:\\2005-06shotdata.txt");
FileWriter fr0607 = new FileWriter("E:\\2006-07shotdata.txt");
//FileWriter fr0708 = new FileWriter("E:\\2007-08shotdata.txt");
//FileWriter fr0809 = new FileWriter("E:\\2008-09shotdata.txt");
//FileWriter fr0910 = new FileWriter("E:\\2009-10shotdata.txt");
//FileWriter fr1011 = new FileWriter("E:\\2010-11shotdata.txt");
//Rescheduled games, or games whose shot data is empty
FileWriter frReschGames = new FileWriter("E:\\rescheduledGames.txt");
//Games in which a player name contains a special space character
FileWriter frSpecialName = new FileWriter("E:\\SpecialName.txt");
TreeSet<CBSplayerInfo> playerInfoSet = new TreeSet<CBSplayerInfo>();
//FileWriter frID = new FileWriter("E:\\CBSplayerInfo.txt");
for(int i=0; i < shotcharts.length; i++){
String pageFile = "E:\\NBA\\data\\2003-2011CBSshotchart\\06-07\\" + shotcharts[i];
String gameKey = shotcharts[i].substring(4).replaceAll("_|@", "");
String pageContent = "";
BufferedReader br = new BufferedReader(new FileReader(pageFile));
String aLine = br.readLine();
while(aLine != null){
pageContent = pageContent + aLine;
aLine = br.readLine();
}
br.close();
int cur = pageContent.indexOf("currentShotData = new String");
int lcur = pageContent.indexOf("\"", cur);
int rcur = pageContent.indexOf("\"", lcur+1);
String rawShotdata = pageContent.substring(lcur+1, rcur);
if(rawShotdata.equals("")){//Handle possibly rescheduled games (empty shot data)
frReschGames.append(shotcharts[i] + "\r\n");
continue;
}
String shotData = gameKey + "," + pageContent.substring(lcur+1, rcur).replaceAll("~", "\r\n" + gameKey + ",");
//Player index set (only CBSplayerId, first name and last name are kept)
//e.g. (240304:Tony&nbsp;Parker,9,PG,8-20,1-3,0-0,17|) is reduced to (240304,Tony,Parker)
cur = pageContent.indexOf("playerDataHomeString = new String",rcur);
lcur = pageContent.indexOf("\"", cur);
rcur = pageContent.indexOf("\"", lcur+1);
String homePlayers = pageContent.substring(lcur+1,rcur);
cur = pageContent.indexOf("playerDataAwayString = new String",rcur);
lcur = pageContent.indexOf("\"", cur);
rcur = pageContent.indexOf("\"", lcur+1);
String awayPlayers = pageContent.substring(lcur+1,rcur);
String players = homePlayers + "|" + awayPlayers;
for(int j = 0; j < players.length(); j++){
CBSplayerInfo aPlayer = new CBSplayerInfo();
int cur1 = players.indexOf(":",j);
aPlayer.id = players.substring(j,cur1);
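int cur2 = players.indexOf("&nbsp;",cur1); //first and last name are normally separated by the 6-character entity &nbsp; (hence SPACE_LEN = 6)
//Special cases: in 20071103DALSAC the separator is a plain " ";
//in 20071211INDCLE the space is garbled by a charset problem (recorded for now, not handled here), so cur2 is -1.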
int SPACE_LEN = 6;
if(cur2 == -1){
frSpecialName.append(shotcharts[i] + "\r\n");
break;
//cur2 = players.indexOf(" ",cur1);
//SPACE_LEN = 1;
}
aPlayer.firstName = players.substring(cur1 + 1,cur2);
int cur3 = players.indexOf(",",cur2);
aPlayer.lastName = players.substring(cur2 + SPACE_LEN,cur3);
playerInfoSet.add(aPlayer); //Add the player's ID record
j = players.indexOf("|",cur3);
if(j == -1) break;
}
//Save the shotchart data
if(gameKey.compareTo("200407") < 0){
//fr0304.append(shotData + "\r\n");
}else if(gameKey.compareTo("200507") < 0){
//fr0304.close();
//fr0405.append(shotData + "\r\n");
}else if(gameKey.compareTo("200607") < 0){
//fr0405.close();
//fr0506.append(shotData + "\r\n");
}else if(gameKey.compareTo("200707") < 0){
//fr0506.close();
fr0607.append(shotData + "\r\n");
}else if(gameKey.compareTo("200807") < 0){
fr0607.close();
//fr0708.append(shotData + "\r\n");
}else if(gameKey.compareTo("200907") < 0){
//fr0708.close();
//fr0809.append(shotData + "\r\n");
}else if(gameKey.compareTo("201007") < 0){
//fr0809.close();
//fr0910.append(shotData + "\r\n");
}else if(gameKey.compareTo("201107") < 0){
//fr0910.close();
//fr1011.append(shotData + "\r\n");
}
System.out.println(shotcharts[i]);
}
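fr0607.close(); //close the writer of the season being processed so that buffered output is flushed
//fr1011.close();
//Save the player ID data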
Iterator<CBSplayerInfo> it = playerInfoSet.iterator();
while(it.hasNext()){
CBSplayerInfo nextPlayer = it.next();
String playerInfo = nextPlayer.id + "\t" + nextPlayer.firstName + "\t" + nextPlayer.lastName;
//frID.append(playerInfo + "\r\n");
}
frReschGames.close();
frSpecialName.close();
//frID.close();
}
}
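The CBSplayerInfo class used by the code in this post is not listed. Because its instances are stored in a TreeSet, it has to be comparable. A minimal stand-in might look like the sketch below: the fields id, firstName, lastName and the display() method are taken from the way the class is used here, while the ordering, the output format of display() and the of(...) factory are assumptions.

```java
import java.util.TreeSet;

class CBSplayerInfo implements Comparable<CBSplayerInfo> {
    String id;
    String firstName;
    String lastName;

    /** Hypothetical convenience factory, not used by the programs in this post. */
    static CBSplayerInfo of(String id, String firstName, String lastName) {
        CBSplayerInfo p = new CBSplayerInfo();
        p.id = id;
        p.firstName = firstName;
        p.lastName = lastName;
        return p;
    }

    // Order by id, then by name, so that one id with two spellings yields two entries
    // and such inconsistencies in the source data stay visible.
    @Override
    public int compareTo(CBSplayerInfo other) {
        int c = id.compareTo(other.id);
        if (c != 0) return c;
        c = firstName.compareTo(other.firstName);
        if (c != 0) return c;
        return lastName.compareTo(other.lastName);
    }

    String display() {
        return id + "," + firstName + "," + lastName;
    }

    public static void main(String[] args) {
        TreeSet<CBSplayerInfo> set = new TreeSet<>();
        set.add(of("240304", "Tony", "Parker"));
        set.add(of("240304", "Tony", "Parker"));   // exact duplicate, dropped by the set
        set.add(of("240304", "Parker", "Tony"));   // made-up swapped spelling, kept
        for (CBSplayerInfo p : set) {
            System.out.println(p.display());
        }
    }
}
```

Comparing on the id alone would silently hide the "same ID, different name" cases mentioned in the header comment, which is why the name is part of the ordering in this sketch.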
Some pages have encoding problems with inconsistent space characters; these are handled separately.
package CBS;
import java.io.*;
import java.util.Iterator;
import java.util.TreeSet;
public class CBSspecialName {
public static void main(String[] args) throws Exception{
TreeSet<CBSplayerInfo> playerInfoSet = new TreeSet<CBSplayerInfo>();
FileWriter frID = new FileWriter("E:\\CBSplayerInfo.txt");
//Files in which a player name contains a special space character
FileWriter frSpecialName = new FileWriter("E:\\SpecialNameSpace.txt");
BufferedReader br = new BufferedReader(new FileReader("E:\\NBA\\data\\SpecialName.txt"));
String str = br.readLine();
int cnt = 1;
while(str != null){
String page = "E:\\NBA\\data\\2003-2011CBSshotchart\\" + str;
BufferedReader br2 = new BufferedReader(new FileReader(page));
String pageContent = "";
String aLine = br2.readLine();
while(aLine != null){
pageContent = pageContent + aLine;
aLine = br2.readLine();
}
br2.close();
int cur = pageContent.indexOf("playerDataHomeString = new String");
int lcur = pageContent.indexOf("\"", cur);
int rcur = pageContent.indexOf("\"", lcur+1);
String homePlayers = pageContent.substring(lcur+1,rcur);
cur = pageContent.indexOf("playerDataAwayString = new String",rcur);
lcur = pageContent.indexOf("\"", cur);
rcur = pageContent.indexOf("\"", lcur+1);
String awayPlayers = pageContent.substring(lcur+1,rcur);
String players = homePlayers + "|" + awayPlayers;
players = new String(players.getBytes("iso-8859-1"));
for(int j = 0; j < players.length(); j++){
CBSplayerInfo aPlayer = new CBSplayerInfo();
int cur1 = players.indexOf(":",j);
aPlayer.id = players.substring(j,cur1);
int cur2 = players.indexOf(" ",cur1);
int cur2p = players.indexOf("|",cur1);
if(cur2 == -1 || (cur2 > cur2p && cur2p != -1)){
cur2 = players.indexOf("?",cur1); //the space as it appears after the iso-8859-1 round trip
}
aPlayer.firstName = players.substring(cur1 + 1,cur2);
int cur3 = players.indexOf(",",cur2);
aPlayer.lastName = players.substring(cur2 + 1,cur3);
playerInfoSet.add(aPlayer); //Add the player's ID record
System.out.println(str + ":" + aPlayer.display());
j = players.indexOf("|",cur3);
if(j == -1) break;
}
str = br.readLine();
}
frSpecialName.close();
br.close();
//Save the player ID data
Iterator<CBSplayerInfo> it = playerInfoSet.iterator();
while(it.hasNext()){
CBSplayerInfo nextPlayer = it.next();
String playerInfo = nextPlayer.id + ";" + nextPlayer.firstName + ";" + nextPlayer.lastName;
frID.append(playerInfo + "\r\n");
}
frID.close();
}
}
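A different way to deal with the inconsistent spaces (not what the program above does) is to normalize every variant to a plain space before splitting the name. The sketch below assumes that the separator is either the literal entity &nbsp;, a non-breaking space character (U+00A0) or a plain space; the garbled character in pages such as 20071211INDCLE may be something else and would need its own rule. The class and method names are hypothetical.

```java
class PlayerNameSpaces {
    /** Replaces the &nbsp; entity and the U+00A0 character with a plain space. */
    static String normalizeSpaces(String s) {
        return s.replace("&nbsp;", " ").replace('\u00A0', ' ');
    }

    /** Splits one "id:First Last,..." entry into {id, firstName, lastName}. */
    static String[] parseEntry(String entry) {
        String e = normalizeSpaces(entry);
        int colon = e.indexOf(':');
        int comma = e.indexOf(',', colon + 1);
        if (colon < 0 || comma < 0) {
            throw new IllegalArgumentException("Unexpected entry: " + entry);
        }
        String name = e.substring(colon + 1, comma);
        int space = name.indexOf(' ');
        if (space < 0) {
            throw new IllegalArgumentException("No name separator in: " + entry);
        }
        // Everything after the first space is treated as the last name.
        return new String[] {
            e.substring(0, colon), name.substring(0, space), name.substring(space + 1)
        };
    }

    public static void main(String[] args) {
        String[] p = parseEntry("240304:Tony&nbsp;Parker,9,PG,8-20,1-3,0-0,17");
        System.out.println(p[0] + "," + p[1] + "," + p[2]); // prints 240304,Tony,Parker
    }
}
```

Entries that still have no recognizable separator raise an exception, so the affected game can be written to a list for manual inspection, much like SpecialName.txt above.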
5. In the default CBS shotchart data, the fourth quarter and every overtime are all recorded as period 3; fix this programmatically.
package CBS;
/*
* By default, the 4th quarter and the overtime periods in the CBS period data are all 3; this program renumbers them as 4, 5, 6, ...
* 20101026HOULAL,0,5.0,3,1622542,1,0,25,40,25
* 20101026HOULAL,0,11:41,3,1622542,5,1,0,42,0
* When period >= 3 within the same gameID, if the previous shot's time is in seconds (contains ".") and the next one contains minutes (":"), period++
*/
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
public class CBSTime {
public static void main(String args[]) throws Exception{
String directoryPath = "E:\\2006-07shotdata\\";
File directory = new File(directoryPath);
String[] shotdata = directory.list();
for(int i = 0; i < shotdata.length; i++){
BufferedReader br = new BufferedReader(new FileReader(directoryPath + shotdata[i]));
String aLine = br.readLine();
FileWriter fr = new FileWriter(directoryPath + "CBS" + shotdata[i]);
String[] lastShot = new String[]{"","","","","","","","","",""};
while(aLine != null){
String[] newShot = aLine.split(",");
if(lastShot[0].equals(newShot[0]) && lastShot[3].compareTo("3") >= 0 && lastShot[2].contains(".") && newShot[2].contains(":")){
Integer tmp = Integer.parseInt(lastShot[3])+1;
newShot[3] = tmp.toString();
}
if(lastShot[0].equals(newShot[0]) && newShot[3].compareTo(lastShot[3]) < 0)
newShot[3] = lastShot[3];
lastShot = newShot;
String aShot = lastShot[0]+","+lastShot[1]+","+lastShot[2]+","+lastShot[3]+","+lastShot[4]+","+lastShot[5]+","+lastShot[6]+","+lastShot[7]+","+lastShot[8]+","+lastShot[9];
fr.append(aShot+"\r\n");
System.out.println(aShot);
aLine = br.readLine();
}
br.close();
fr.close();
}
}
}
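The renumbering rule can also be isolated into a small function that works on a list of lines, which makes it easier to try out. The sketch below follows the same rule as the program above, except that it compares periods as integers rather than as strings. The names PeriodFixer and fixPeriods are hypothetical, and apart from the two lines quoted in the header comment the sample lines are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PeriodFixer {
    /** Renumbers field 3 (period) of comma-separated shot lines; lines must be in game order. */
    static List<String> fixPeriods(List<String> lines) {
        List<String> out = new ArrayList<>();
        String lastGame = "";
        String lastTime = "";
        int lastPeriod = 0;
        for (String line : lines) {
            String[] f = line.split(",");
            int period = Integer.parseInt(f[3]);
            if (f[0].equals(lastGame)) {
                // The clock jumped from a seconds-only value ("5.0") back to minutes ("11:41").
                boolean rolledOver = lastPeriod >= 3
                        && lastTime.contains(".") && f[2].contains(":");
                if (rolledOver) {
                    period = lastPeriod + 1;
                } else if (period < lastPeriod) {
                    period = lastPeriod; // stay in the already renumbered period
                }
            }
            f[3] = Integer.toString(period);
            out.add(String.join(",", f));
            lastGame = f[0];
            lastTime = f[2];
            lastPeriod = period;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> fixed = fixPeriods(Arrays.asList(
                "20101026HOULAL,0,5.0,3,1622542,1,0,25,40,25",
                "20101026HOULAL,0,11:41,3,1622542,5,1,0,42,0"));
        for (String s : fixed) {
            System.out.println(s);
        }
    }
}
```

Like the original program, this only detects a new period when the previous period's last recorded shot came inside the final minute, so a period whose last shot came earlier would not be recognized.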
6. Once the shotdata text files are imported into a database, simple queries can be run on them.