A short summary of how I collected and cleaned the NBA live game data from CBS Sports...

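I no longer remember which day it was a few months ago, but I happened to discover that the shot-location data behind CBS's live game pages can be read straight out of the HTML file. I was thrilled at the time, since back then I had no idea how to deal with ESPN's XML data...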

Looking back now at the CBS shot data I processed, I have to admit it is of marginal value.


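The processed files comprise a table mapping CBSplayerID to player name, shotdata for the eight seasons from 2003-04 to 2010-11, and a table explaining the shotType codes.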

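The CBS shot totals differ noticeably from the overall season statistics, often by several hundred to over a thousand shots in either direction. They still amount to more than 98% of the season totals, which is decent. At the level of a single game, however, the shotdata turns out to have inaccurate timestamps and wrongly attributed shooters (judged mainly against the PBP timelines from NBA.com and ESPN), a clear weakness compared with the ESPN XML shot data I obtained later.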


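One thing worth mentioning, though: the shot-type descriptions from CBS and NBA.com are quite rich, whereas ESPN's classification is somewhat coarser.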

I once stumbled on an unusual putback dunk.


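I was originally curious whether the play counted as an assisted alley-oop or as a putback dunk off an offensive rebound after a missed shot, but I unexpectedly found that only CBS describes it as a dunk, while ESPN and NBA.com both record it as a layup.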

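So there are presumably other inconsistent shot descriptions, though probably only a few. Given that the timelines do not agree either, reconciling the sources would be fairly laborious, and I have not dealt with this yet.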


A brief record of the basic crawling and processing steps:

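1. For each of the eight seasons from 2003-04 to 2010-11, save the scoreboard page of any one day and extract from it all the game days of that season.

For example: http://www.cbssports.com/nba/scoreboard/20110101

This mainly means matching the text <a href="/nba/scoreboard/ in the page and adding the 8-digit string that follows it to the set of game days.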
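
The matching in step 1 could be sketched roughly as follows. This is a minimal illustration, not the code I actually ran: the class name ScoreboardDateSketch, the method extractGameDays and the sample HTML are all made up for the example, and it assumes the date links appear in the page exactly in the form quoted above.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the 8-digit game days linked from a scoreboard page.
class ScoreboardDateSketch {
    // The link prefix quoted in the text, followed by a captured 8-digit date.
    private static final Pattern DATE_LINK =
            Pattern.compile("<a href=\"/nba/scoreboard/([0-9]{8})");

    static Set<String> extractGameDays(String html) {
        // A TreeSet removes duplicates and keeps the days in chronological order.
        Set<String> days = new TreeSet<>();
        Matcher m = DATE_LINK.matcher(html);
        while (m.find()) {
            days.add(m.group(1));
        }
        return days;
    }

    public static void main(String[] args) {
        // Made-up page fragment, only for demonstration.
        String html = "<a href=\"/nba/scoreboard/20110102\">next</a>"
                + "<a href=\"/nba/scoreboard/20110101\">today</a>"
                + "<a href=\"/nba/scoreboard/20110101\">today again</a>";
        System.out.println(extractGameDays(html)); // prints [20110101, 20110102]
    }
}
```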


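2. Use all the game-day links as seeds and configure a Heritrix job to fetch the shotchart page of every game.

This mainly means matching the regular expression “NBA_[0-9]+_[A-Z]*@[A-Z]*” and adding each resulting link to the crawl queue.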

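In principle you do not have to write your own simple Extractor subclass, but then you need to set up link-filtering rules in the job, and the default link-extraction modules pull out many useless links that each have to be checked, so the crawl takes longer.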

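Alternatively, you can first fetch the game-day pages with a download tool, then extract every game's key string with a regular expression (this part needs some programming), and finally fetch the shotchart pages from the extracted links. The downloading itself is easily done with Xunlei (Thunder), and each file is named after its game key.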

For example, http://www.cbssports.com/nba/gametracker/shotchart/NBA_20110101_CLE@CHI is saved under the file name “NBA_20110101_CLE@CHI”.
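
For the download-tool route, the programming part could look something like the sketch below. It reuses the regular expression and the shotchart URL prefix given in this post. The class name GameKeySketch, its methods and the sample HTML (including the AAA@BBB game) are hypothetical and only illustrate the idea.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns a scoreboard page into a list of shotchart URLs.
class GameKeySketch {
    // Game key pattern and URL prefix as quoted in the post.
    private static final Pattern GAME_KEY = Pattern.compile("NBA_[0-9]+_[A-Z]*@[A-Z]*");
    private static final String SHOTCHART =
            "http://www.cbssports.com/nba/gametracker/shotchart/";

    static Set<String> shotchartUrls(String html) {
        // LinkedHashSet drops repeated keys but keeps the order of first appearance.
        Set<String> urls = new LinkedHashSet<>();
        Matcher m = GAME_KEY.matcher(html);
        while (m.find()) {
            urls.add(SHOTCHART + m.group());
        }
        return urls;
    }

    // The saved file is named after the last path segment, i.e. the game key.
    static String fileNameFor(String url) {
        return url.substring(url.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        // Made-up page fragment; the same game key is often repeated within a page.
        String html = "<a href=\"/x/NBA_20110101_CLE@CHI\">box</a>"
                + "<a href=\"/y/NBA_20110101_CLE@CHI\">recap</a>"
                + "<a href=\"/x/NBA_20110101_AAA@BBB\">box</a>";
        for (String url : shotchartUrls(html)) {
            System.out.println(fileNameFor(url) + " <- " + url);
        }
    }
}
```

The resulting URL list can then be pasted into whatever download tool is at hand.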

Still, I went with the programming approach...

import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.httpclient.URIException;
import org.archive.crawler.datamodel.CrawlURI;
import org.archive.crawler.extractor.Extractor;
import org.archive.crawler.extractor.Link;
import org.archive.io.ReplayCharSequence;
import org.archive.util.HttpRecorder;

public class CBSScoreboardExtractor extends Extractor {
	
	private static final long serialVersionUID = 5855731422080471017L;	
	private static Logger logger =
        Logger.getLogger(CBSScoreboardExtractor.class.getName());	
	public CBSScoreboardExtractor(String name) {
        this(name, "CBSSport Scoreboard Extractor");
	}	
	public CBSScoreboardExtractor(String name, String description) {
        super(name, description);
	}
	// Extract each game's CBS key string from the scoreboard page
	private static final String CBS_FEATURE = "NBA_[0-9]+_[A-Z]*@[A-Z]*";
	private static final String SHOTCHART = "http://www.cbssports.com/nba/gametracker/shotchart/";
	
	protected void extract(CrawlURI curi){
        // Fetch the response body of the current URI so its content can be analysed
        ReplayCharSequence cs = null;
        try {
            HttpRecorder hr = curi.getHttpRecorder();
            if (hr == null) {
                throw new IOException("Why is recorder null here?");
            }
            cs = hr.getReplayCharSequence();
        } catch (IOException e) {
            curi.addLocalizedError(this.getName(), e,
                    "Failed get of replay char sequence " + curi.toString()
                            + " " + e.getMessage());
            logger.log(Level.SEVERE, "Failed get of replay char sequence in "
                    + Thread.currentThread().getName(), e);
        }
        if (cs == null) {
            return;
        }
        // Convert the response content to a string
        String content = cs.toString();	
        
        try {
            // Run the regular expression over the page content
            // and pull out every game key it contains
            Pattern pattern = Pattern.compile(CBS_FEATURE);
            Matcher matcher = pattern.matcher(content);
            // For each game key found, queue the corresponding shotchart URL
            while (matcher.find()) {
                String aShotchartLink = SHOTCHART + matcher.group();
                addLinkFromString(curi, aShotchartLink, "", Link.NAVLINK_HOP);
            }
            curi.linkExtractorFinished();
        } catch (Exception e) {
            logger.log(Level.WARNING, "Failed to extract shotchart links from " + curi, e);
        }
	}
	
    // Record the link on the CrawlURI for later processing
    private void addLinkFromString(CrawlURI curi, String uri,
            CharSequence context, char hopType) {
        try {
            curi.createAndAddLinkRelativeToBase(uri, context.toString(),
                    hopType);
        } catch (URIException e) {
            if (getController() != null) {
                getController().logUriError(e, curi.getUURI(), uri);
            } else {
                logger.info("Failed createAndAddLinkRelativeToBase " 
                + curi + ", " + uri + ", " + context + ", " 
                + hopType + ": " + e);
            }
        }
    }
}
This way I fetched shotchart data for more than 10,000 games in total.


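3. Manually gather each season's games into one folder and remove the All-Star games and the postponed games. About ten games were not fetched because of a broken link on one page, so I saved those pages by hand.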


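4. From each individual shotchart page, extract the player info (CBSplayerID and player name) and the shot records, and write them to a text file per season.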

package CBS;

import java.io.*;
import java.util.Comparator;
import java.util.Iterator;
import java.util.TreeSet;

/*
 * Each season's complete shot data for 2003-11 is saved to its own text file
 * 20031028-20040615 1189 + 82
 * 20041102-20050623 1230 + 84
 * 20051101-20060620 1230 + 89
 * 20061031-20070614 1230 + 79
 * 20071030-20080617 1230 + 86
 * 20081028-20090618 1230 + 85
 * 20091027-20100617 1230 + 82
 * 20101026-20110612 1230 + 81
 * Damon Jones & Dwayne Jones 2007-08 Cavaliers
 * James Jones & Jumaine Jones 2006-07 Suns
 * 
 * rescheduled game
 * 
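 * The source data contains erroneous player info:
 * the same player under different IDs, e.g. Awvee Scorey; the same ID under different names, e.g. Yao Ming / Ming Yao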
*/
public class CBSShotchartParser {
	public static void main(String[] args) throws Exception{
		
		File directory = new File("E:\\NBA\\data\\2003-2011CBSshotchart\\06-07\\");
		
		String[] shotcharts = directory.list();
		//FileWriter fr0304 = new FileWriter("E:\\2003-04shotdata.txt");
		//FileWriter fr0405 = new FileWriter("E:\\2004-05shotdata.txt");
		//FileWriter fr0506 = new FileWriter("E:\\2005-06shotdata.txt");
		FileWriter fr0607 = new FileWriter("E:\\2006-07shotdata.txt");
		//FileWriter fr0708 = new FileWriter("E:\\2007-08shotdata.txt");
		//FileWriter fr0809 = new FileWriter("E:\\2008-09shotdata.txt");
		//FileWriter fr0910 = new FileWriter("E:\\2009-10shotdata.txt");
		//FileWriter fr1011 = new FileWriter("E:\\2010-11shotdata.txt");
		// Rescheduled games, or games whose shot data is empty
		FileWriter frReschGames = new FileWriter("E:\\rescheduledGames.txt");
		// Games whose player names contain an unusual space character
		FileWriter frSpecialName = new FileWriter("E:\\SpecialName.txt");
		
		TreeSet<CBSplayerInfo> playerInfoSet = new TreeSet<CBSplayerInfo>();
		//FileWriter frID = new FileWriter("E:\\CBSplayerInfo.txt");		
		
		for(int i=0; i < shotcharts.length; i++){
			String pageFile = "E:\\NBA\\data\\2003-2011CBSshotchart\\06-07\\" + shotcharts[i];
			String gameKey = shotcharts[i].substring(4).replaceAll("_|@", "");
			String pageContent = "";
			BufferedReader br = new BufferedReader(new FileReader(pageFile));
			String aLine = br.readLine();
			while(aLine != null){
				pageContent = pageContent + aLine;
				aLine = br.readLine();
			}
			br.close();
			
			int cur = pageContent.indexOf("currentShotData = new String");
			int lcur = pageContent.indexOf("\"", cur);
			int rcur = pageContent.indexOf("\"", lcur+1);
			
			String rawShotdata = pageContent.substring(lcur+1, rcur);
			if(rawShotdata.equals("")){// rescheduled games may have empty shot data: log the file and skip it
				frReschGames.append(shotcharts[i] + "\r\n");
				continue;
			}
			String shotData = gameKey + "," + pageContent.substring(lcur+1, rcur).replaceAll("~", "\r\n" + gameKey + ",");

			// Build the player index (keep only CBSplayerId, first name, last name)
			// e.g. from (240304:Tony Parker,9,PG,8-20,1-3,0-0,17|) keep (240304,Tony,Parker)
			cur = pageContent.indexOf("playerDataHomeString = new String",rcur);
			lcur = pageContent.indexOf("\"", cur);
			rcur = pageContent.indexOf("\"", lcur+1);
			String homePlayers = pageContent.substring(lcur+1,rcur);
			
			cur = pageContent.indexOf("playerDataAwayString = new String",rcur);
			lcur = pageContent.indexOf("\"", cur);
			rcur = pageContent.indexOf("\"", lcur+1);
			String awayPlayers = pageContent.substring(lcur+1,rcur);	
			
			String players = homePlayers + "|" + awayPlayers;			
			for(int j = 0; j < players.length(); j++){
				CBSplayerInfo aPlayer = new CBSplayerInfo();
				int cur1 = players.indexOf(":",j);
				aPlayer.id = players.substring(j,cur1);
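				// First and last name are normally separated by the 6-character entity "&nbsp;" (hence SPACE_LEN = 6)
				int cur2 = players.indexOf("&nbsp;",cur1);	
				// Special cases: in 20071103DALSAC the separator is a plain " ";
				// in 20071211INDCLE it is a character garbled by the charset (logged for now, not handled), so cur2 is -1.
				int SPACE_LEN = 6;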
				if(cur2 == -1){
					frSpecialName.append(shotcharts[i] + "\r\n");
					break;
					//cur2 = players.indexOf(" ",cur1);
					//SPACE_LEN = 1;
				}
				aPlayer.firstName = players.substring(cur1 + 1,cur2);
				int cur3 = players.indexOf(",",cur2);
				aPlayer.lastName = players.substring(cur2 + SPACE_LEN,cur3);
				
				playerInfoSet.add(aPlayer);		// record this player's ID info
				j = players.indexOf("|",cur3);
				if(j == -1) break;
			}
			
			// Save the shotchart data to the file of the matching season
			if(gameKey.compareTo("200407") < 0){
				//fr0304.append(shotData + "\r\n");
			}else if(gameKey.compareTo("200507") < 0){
				//fr0304.close();
				//fr0405.append(shotData + "\r\n");
			}else if(gameKey.compareTo("200607") < 0){
				//fr0405.close();
				//fr0506.append(shotData + "\r\n");
			}else if(gameKey.compareTo("200707") < 0){
				//fr0506.close();
				fr0607.append(shotData + "\r\n");
			}else if(gameKey.compareTo("200807") < 0){
				fr0607.close();
				//fr0708.append(shotData + "\r\n");
			}else if(gameKey.compareTo("200907") < 0){
				//fr0708.close();
				//fr0809.append(shotData + "\r\n");
			}else if(gameKey.compareTo("201007") < 0){
				//fr0809.close();
				//fr0910.append(shotData + "\r\n");
			}else if(gameKey.compareTo("201107") < 0){
				//fr0910.close();
				//fr1011.append(shotData + "\r\n");
			}			
			System.out.println(shotcharts[i]);
		}
		//fr1011.close();
		fr0607.close();	// flush and close the season file currently being written

		// Save the player ID data
		Iterator<CBSplayerInfo> it = playerInfoSet.iterator();
		while(it.hasNext()){
			CBSplayerInfo nextPlayer = it.next();
			String playerInfo = nextPlayer.id + "\t" + nextPlayer.firstName + "\t" + nextPlayer.lastName;
			//frID.append(playerInfo + "\r\n");
		}
		frReschGames.close();
		frSpecialName.close();		
		//frID.close();
	}
}
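
The core of the parser above is "find a marker, then take whatever sits between the next pair of double quotes". A slightly more defensive version of that step might look like the sketch below. The names ShotDataSketch, extractQuoted and toRows are hypothetical, and the page fragment in main is made up around the two sample shot records quoted elsewhere in this post. Unlike the listing above, it returns null instead of throwing when the marker or a quote is missing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper mirroring the marker-and-quotes extraction used by the parser.
class ShotDataSketch {
    // Returns the text between the first pair of double quotes after the marker,
    // or null when the marker or either quote cannot be found.
    static String extractQuoted(String page, String marker) {
        int cur = page.indexOf(marker);
        if (cur < 0) return null;
        int open = page.indexOf('"', cur);
        if (open < 0) return null;
        int close = page.indexOf('"', open + 1);
        if (close < 0) return null;
        return page.substring(open + 1, close);
    }

    // Shots are separated by "~"; each output row is prefixed with the game key.
    static List<String> toRows(String gameKey, String rawShotData) {
        List<String> rows = new ArrayList<>();
        if (rawShotData == null || rawShotData.isEmpty()) {
            return rows; // e.g. a rescheduled game with no shot data
        }
        for (String shot : rawShotData.split("~")) {
            rows.add(gameKey + "," + shot);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up page fragment for demonstration.
        String page = "var currentShotData = new String("
                + "\"0,5.0,3,1622542,1,0,25,40,25~0,11:41,3,1622542,5,1,0,42,0\");";
        String raw = extractQuoted(page, "currentShotData = new String");
        for (String row : toRows("20101026HOULAL", raw)) {
            System.out.println(row);
        }
    }
}
```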
Some pages use inconsistent space characters because of encoding problems; those are handled separately.

package CBS;

import java.io.*;
import java.util.Iterator;
import java.util.TreeSet;


public class CBSspecialName {
	public static void main(String[] args) throws Exception{
		
		TreeSet<CBSplayerInfo> playerInfoSet = new TreeSet<CBSplayerInfo>();
		FileWriter frID = new FileWriter("E:\\CBSplayerInfo.txt");	
		// Files whose player names contain an unusual space character
		FileWriter frSpecialName = new FileWriter("E:\\SpecialNameSpace.txt");
		
		BufferedReader br = new BufferedReader(new FileReader("E:\\NBA\\data\\SpecialName.txt"));
		String str = br.readLine();
		int cnt = 1;
		while(str != null){
			String page = "E:\\NBA\\data\\2003-2011CBSshotchart\\" + str;
			BufferedReader br2 = new BufferedReader(new FileReader(page));
			String pageContent = "";
			String aLine = br2.readLine();
			while(aLine != null){
				pageContent = pageContent + aLine;
				aLine = br2.readLine();
			}
			br2.close();
			
			int cur = pageContent.indexOf("playerDataHomeString = new String");
			int lcur = pageContent.indexOf("\"", cur);
			int rcur = pageContent.indexOf("\"", lcur+1);
			String homePlayers = pageContent.substring(lcur+1,rcur);
			
			cur = pageContent.indexOf("playerDataAwayString = new String",rcur);
			lcur = pageContent.indexOf("\"", cur);
			rcur = pageContent.indexOf("\"", lcur+1);
			String awayPlayers = pageContent.substring(lcur+1,rcur);
			
			String players = homePlayers + "|" + awayPlayers;
			players = new String(players.getBytes("iso-8859-1"));
			for(int j = 0; j < players.length(); j++){
				CBSplayerInfo aPlayer = new CBSplayerInfo();
				int cur1 = players.indexOf(":",j);
				aPlayer.id = players.substring(j,cur1);
				int cur2 = players.indexOf(" ",cur1);
				int cur2p = players.indexOf("|",cur1);
				if(cur2 == -1 || (cur2 > cur2p && cur2p != -1)){
					cur2 = players.indexOf("?",cur1);	// the odd space shows up as "?" after the iso-8859-1 round trip
				}
				aPlayer.firstName = players.substring(cur1 + 1,cur2);
				int cur3 = players.indexOf(",",cur2);
				aPlayer.lastName = players.substring(cur2 + 1,cur3);
				
				playerInfoSet.add(aPlayer);		// record this player's ID info
				System.out.println(str + ":" + aPlayer.display());
				j = players.indexOf("|",cur3);
				if(j == -1) break;
			}
			str = br.readLine();
		}
		frSpecialName.close();
		br.close();
		
		// Save the player ID data
		Iterator<CBSplayerInfo> it = playerInfoSet.iterator();
		while(it.hasNext()){
			CBSplayerInfo nextPlayer = it.next();
			String playerInfo = nextPlayer.id + ";" + nextPlayer.firstName + ";" + nextPlayer.lastName;
			frID.append(playerInfo + "\r\n");
		}		
		frID.close();
	}
}
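
One possible way to avoid treating these files separately is to normalise the separator before splitting the player string. The sketch below is only an illustration under assumptions: it assumes the entries look like the 240304:Tony Parker,9,PG,... example in the parser's comments, and that the unusual separators are either the entity &nbsp; or the no-break space U+00A0, which may not cover every garbled page. The class name PlayerStringSketch and the second sample entry are made up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: parses "id:First Last,..." entries separated by "|".
class PlayerStringSketch {
    // Assumption: the odd separators are "&nbsp;" or U+00A0; map both to a plain space.
    static String normalizeSpaces(String s) {
        return s.replace("&nbsp;", " ").replace('\u00A0', ' ');
    }

    // Returns one {id, firstName, lastName} triple per well-formed entry.
    static List<String[]> parsePlayers(String players) {
        List<String[]> result = new ArrayList<>();
        for (String entry : normalizeSpaces(players).split("\\|")) {
            int colon = entry.indexOf(':');
            int comma = entry.indexOf(',', colon + 1);
            if (colon < 0 || comma < 0) continue;   // malformed or empty entry
            String name = entry.substring(colon + 1, comma);
            int space = name.indexOf(' ');
            if (space < 0) continue;                // no recognisable separator
            result.add(new String[] {
                entry.substring(0, colon),
                name.substring(0, space),
                name.substring(space + 1)
            });
        }
        return result;
    }

    public static void main(String[] args) {
        // First entry is the example from the parser comments; the second is made up.
        String players = "240304:Tony&nbsp;Parker,9,PG,8-20,1-3,0-0,17|"
                + "999999:Test\u00A0Player,0,SG,0-0,0-0,0-0,0";
        for (String[] p : parsePlayers(players)) {
            System.out.println(p[0] + ";" + p[1] + ";" + p[2]);
        }
    }
}
```

Everything after the first space is kept as the last name, so multi-word last names stay intact.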

5. In the raw CBS shotchart data, the fourth quarter and all overtime periods are recorded with period 3; a small program corrects this.

package CBS;
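/*
 * By default the fourth quarter and the overtime periods all have period 3 in the CBS data;
 * this program renumbers them to 4, 5, 6, ... in order.
 * 20101026HOULAL,0,5.0,3,1622542,1,0,25,40,25
 * 20101026HOULAL,0,11:41,3,1622542,5,1,0,42,0
 * Rule: with period >= 3 and the same gameID, when the previous shot's clock is in
 * seconds (contains ".") and the next shot's clock contains minutes (":"), period++
 */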
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;

public class CBSTime {
	public static void main(String args[]) throws Exception{
		String directoryPath = "E:\\2006-07shotdata\\";
		File directory = new File(directoryPath);
		String[] shotdata = directory.list();
		for(int i = 0; i < shotdata.length; i++){
			BufferedReader br = new BufferedReader(new FileReader(directoryPath + shotdata[i]));
			String aLine = br.readLine();
			FileWriter fr = new FileWriter(directoryPath + "CBS" + shotdata[i]);
			String[] lastShot = new String[]{"","","","","","","","","",""};
			while(aLine != null){
				String[] newShot = aLine.split(",");
				if(lastShot[0].equals(newShot[0]) && lastShot[3].compareTo("3") >= 0 && lastShot[2].contains(".") && newShot[2].contains(":")){
					Integer tmp = Integer.parseInt(lastShot[3])+1;
					newShot[3] = tmp.toString();
				}
				if(lastShot[0].equals(newShot[0]) && newShot[3].compareTo(lastShot[3]) < 0)
					newShot[3] = lastShot[3];
				lastShot = newShot;
				String aShot = lastShot[0]+","+lastShot[1]+","+lastShot[2]+","+lastShot[3]+","+lastShot[4]+","+lastShot[5]+","+lastShot[6]+","+lastShot[7]+","+lastShot[8]+","+lastShot[9];
				fr.append(aShot+"\r\n");
				System.out.println(aShot);
				aLine = br.readLine();
			}
			br.close();
			fr.close();
		}
	}
}
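
The renumbering rule can also be written as a small function that is easy to check on a few lines. The sketch below follows the same rule as the program above, with two differences that are my own choices for the illustration: it compares periods as integers rather than as strings, and it works on a list in memory instead of files. The class name PeriodFixSketch is hypothetical, and the rows in main are made up in the same 10-field layout as the sample lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: renumbers periods that CBS leaves stuck at 3.
class PeriodFixSketch {
    // Each line: gameKey,?,clock,period,... (field 2 is the clock, field 3 the period).
    static List<String> fixPeriods(List<String> lines) {
        List<String> out = new ArrayList<>();
        String lastGame = "";
        String lastClock = "";
        int lastPeriod = 0;
        for (String line : lines) {
            String[] f = line.split(",");
            int period = Integer.parseInt(f[3]);
            if (f[0].equals(lastGame)) {
                // A clock in seconds ("5.0") followed by one with minutes ("11:41")
                // means a new period has started.
                boolean rolledOver = lastClock.contains(".") && f[2].contains(":");
                if (lastPeriod >= 3 && rolledOver) {
                    period = lastPeriod + 1;
                } else if (period < lastPeriod) {
                    period = lastPeriod; // still inside an already renumbered period
                }
            }
            f[3] = Integer.toString(period);
            out.add(String.join(",", f));
            lastGame = f[0];
            lastClock = f[2];
            lastPeriod = period;
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up rows: end of the 3rd quarter, the 4th quarter, then one overtime.
        List<String> rows = Arrays.asList(
                "G1,0,5.0,3,1,1,0,25,40,25",
                "G1,0,11:41,3,1,5,1,0,42,0",
                "G1,0,2.1,3,1,1,0,25,40,25",
                "G1,0,4:30,3,1,5,1,0,42,0");
        for (String row : fixPeriods(rows)) {
            System.out.println(row);
        }
    }
}
```

Like the original, this is a heuristic: it only notices a new period when the last shot of the previous one came inside the final minute.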

6. Import the shotdata text files into a database, and you can start running some simple queries.
