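For my data mining course, I planned to prepare the data like this: open the URLs listed in a configuration file and save each page. The saved files would then go through content parsing, text extraction, matrix conversion, clustering, and so on.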
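import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CsdnBlogMining {
    public static void main(String[] args) {
        final int THREAD_COUNT = 5;
        Properties p = new Properties();
        // Load config.properties from the classpath; try-with-resources closes the stream.
        try (InputStream inputStream = CsdnBlogMining.class.getClassLoader()
                .getResourceAsStream("config.properties")) {
            if (inputStream == null) {
                System.err.println("config.properties not found on the classpath");
                return;
            }
            p.load(inputStream);
        } catch (IOException e) {
            e.printStackTrace();
            return;
        }
        String baseUrl = p.getProperty("baseUrl");
        String fileDir = p.getProperty("fileDir");
        String searchBlogs = p.getProperty("searchBlogs");
        //String category = new String(p.getProperty("category").getBytes("ISO-8859-1"), "UTF-8");
        // Compare string content with isEmpty() rather than !=, and guard against a missing key.
        if (searchBlogs == null || searchBlogs.trim().isEmpty()) {
            System.err.println("searchBlogs is empty; nothing to download");
            return;
        }
        String[] blogs = searchBlogs.split(";");
        // Download the pages concurrently on a fixed-size thread pool.
        ExecutorService pool = Executors.newFixedThreadPool(THREAD_COUNT);
        for (String s : blogs) {
            pool.submit(new SaveWeb(baseUrl + s, fileDir + "/" + s + ".html"));
        }
        // Stop accepting new tasks; downloads already submitted still run to completion.
        pool.shutdown();
    }
}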
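One thing to keep in mind: `pool.shutdown()` returns immediately, so `main` can finish while downloads are still running. If a later step has to wait until every page is on disk, something like the following sketch could be used. The class and method names (`PoolWaitSketch`, `splitNames`, `runAll`) are made up for illustration, and the tasks are simple stand-ins for `SaveWeb`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class PoolWaitSketch {
    /** Splits a ';'-separated value and drops blank entries (hypothetical helper). */
    static List<String> splitNames(String searchBlogs) {
        List<String> names = new ArrayList<>();
        if (searchBlogs == null) {
            return names;
        }
        for (String s : searchBlogs.split(";")) {
            String t = s.trim();
            if (!t.isEmpty()) {
                names.add(t);
            }
        }
        return names;
    }

    /** Runs all tasks on a fixed pool; returns true if they finished within the timeout. */
    static boolean runAll(List<Runnable> tasks, int threads, long timeoutSeconds) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (Runnable task : tasks) {
            pool.submit(task);
        }
        pool.shutdown(); // no new tasks; submitted ones keep running
        try {
            if (pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS)) {
                return true;
            }
            pool.shutdownNow(); // timed out: interrupt whatever is still running
            return false;
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }

    public static void main(String[] args) {
        List<Runnable> tasks = new ArrayList<>();
        for (final String name : splitNames("blogA;;blogB; ;blogC")) {
            // Stand-in for: new SaveWeb(baseUrl + name, fileDir + "/" + name + ".html")
            tasks.add(() -> System.out.println("would download " + name));
        }
        boolean finished = runAll(tasks, 5, 60);
        System.out.println("all tasks finished: " + finished);
    }
}
```

The helper also skips empty entries, so a trailing or doubled `;` in `searchBlogs` does not turn into a request for `baseUrl` alone.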
The module that opens a web page and saves it:
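import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class SaveWeb implements Runnable {
    private final String url;
    private final String filename;

    public SaveWeb(String url, String filename) {
        this.url = url;
        this.filename = filename;
    }

    @Override
    public void run() {
        HttpClient httpclient = new DefaultHttpClient();
        HttpGet httpGet = new HttpGet(url);
        // Send a browser-like User-Agent header.
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2");
        try {
            HttpResponse response = httpclient.execute(httpGet);
            HttpEntity entity = response.getEntity();
            if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK && entity != null) {
                // Decode the body; UTF-8 is the fallback when the response declares no charset.
                String res = EntityUtils.toString(entity, "UTF-8");
                // Open the file only after a successful response, so a failed request leaves no empty file.
                try (OutputStream outputStream = new BufferedOutputStream(new FileOutputStream(filename))) {
                    outputStream.write(res.getBytes("UTF-8"));
                }
            } else {
                System.err.println("Skipped " + url + ": " + response.getStatusLine());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Release the connections held by this client.
            httpclient.getConnectionManager().shutdown();
        }
    }
}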
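Once the pages are saved, the next step in the original plan was text extraction. A very rough sketch of that step is below. It is not what I ended up using: it assumes regex-based tag stripping is good enough, which a real HTML parser would handle far more reliably, and the class and method names (`HtmlTextSketch`, `extractText`) are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;

class HtmlTextSketch {
    // Whole <script>...</script> and <style>...</style> blocks, case-insensitive, across lines.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Any remaining tag.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Naive text extraction: drop script/style blocks and tags, decode a few entities. */
    static String extractText(String html) {
        String s = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        s = TAG.matcher(s).replaceAll(" ");
        // Decode only a handful of common entities; &amp; goes last to avoid double decoding.
        s = s.replace("&nbsp;", " ")
             .replace("&lt;", "<")
             .replace("&gt;", ">")
             .replace("&amp;", "&");
        return WHITESPACE.matcher(s).replaceAll(" ").trim();
    }

    public static void main(String[] args) throws IOException {
        String html;
        if (args.length > 0) {
            // Read a file written by SaveWeb, which stores pages as UTF-8.
            html = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        } else {
            html = "<html><body><h1>Title</h1><p>Hello &amp; bye</p></body></html>";
        }
        System.out.println(extractText(html));
    }
}
```

This keeps navigation text, sidebars and comments along with the article body, so the output is noisy. Picking out only the post content would need knowledge of the page layout.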
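Follow-up:
The assignment is done, but it ended up having almost nothing to do with the code above. I was going to delete all of it, but on reflection it is not wrong, just unused, so I am leaving it here.
In the end, the Java code only loops over pages concurrently to collect a list of addresses and write them to a file. The mining itself was done in R: fetching the pages, extracting the body text, word segmentation, clustering, and writing out the results. R really saves effort; a few dozen lines of code covered everything. The final grouping was not satisfactory, though. Features computed from the full text do not seem distinctive enough, so the resulting clusters are inaccurate. This still needs improvement.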
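For reference, the "matrix conversion" step from the original plan could look roughly like this in Java. This is only a sketch and not the R code I actually ran: it assumes every document has already been segmented into a list of words, it builds plain term counts with no weighting, and the names (`TermMatrixSketch`, `vocabulary`, `termCounts`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class TermMatrixSketch {
    /** Collects the distinct terms of all documents in sorted order. */
    static List<String> vocabulary(List<List<String>> docs) {
        TreeSet<String> vocab = new TreeSet<>();
        for (List<String> doc : docs) {
            vocab.addAll(doc);
        }
        return new ArrayList<>(vocab);
    }

    /** Builds a document-by-term matrix: m[i][j] = occurrences of vocab[j] in docs[i]. */
    static int[][] termCounts(List<List<String>> docs, List<String> vocab) {
        Map<String, Integer> column = new HashMap<>();
        for (int j = 0; j < vocab.size(); j++) {
            column.put(vocab.get(j), j);
        }
        int[][] m = new int[docs.size()][vocab.size()];
        for (int i = 0; i < docs.size(); i++) {
            for (String term : docs.get(i)) {
                Integer j = column.get(term);
                if (j != null) { // ignore terms that are not in the vocabulary
                    m[i][j]++;
                }
            }
        }
        return m;
    }

    public static void main(String[] args) {
        // Two tiny, already segmented documents (made-up data).
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("java", "thread", "java"),
                Arrays.asList("r", "cluster"));
        List<String> vocab = vocabulary(docs);
        int[][] m = termCounts(docs, vocab);
        System.out.println(vocab);
        for (int[] row : m) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Each row is one document's feature vector, which is what a clustering step would take as input. With raw counts over the full text, common words carry as much weight as topical ones, which may be part of why the clusters came out blurry.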