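The simplest crawler opens an HTTP connection and sends a request, parses the response, matches the content against a pattern (a style selector or some other rule) to extract information, and then decides whether to follow links to other pages and crawl those as well.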
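The steps above can be sketched roughly as follows. This is only a minimal illustration, not the code from my repository: the class name `MiniCrawlerSketch`, the methods `fetch`, `countKeyword` and `extractLinks`, the sample HTML and the `example.com` URL are all made up for the example, and a regular expression stands in for a real HTML parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: request a page, parse the response, extract data, collect links.
public class MiniCrawlerSketch {
    // Captures the href value of <a> tags. Good enough for a sketch;
    // a real crawler would normally use an HTML parser instead.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s+[^>]*href=[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

    // Sample page used when no URL is passed on the command line (made-up content).
    private static final String SAMPLE =
            "<html><body><p>java crawler, java parser</p>"
            + "<a href=\"https://example.com/next\">next</a>"
            + "<a class='x' href='/about'>about</a></body></html>";

    // Step 1: send the request and read the response body (assumes UTF-8).
    public static String fetch(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");
        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return html.toString();
    }

    // Step 2: apply a rule to the response; here, count occurrences of a keyword.
    public static int countKeyword(String html, String keyword) {
        if (keyword.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = 0;
        while ((from = html.indexOf(keyword, from)) != -1) {
            count++;
            from += keyword.length();
        }
        return count;
    }

    // Step 3: collect the links that the crawler could visit next.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) throws IOException {
        // With a URL argument the page is fetched; otherwise the sample is used.
        String html = args.length > 0 ? fetch(args[0]) : SAMPLE;
        System.out.println("keyword hits: " + countKeyword(html, "java"));
        System.out.println("links: " + extractLinks(html));
    }
}
```

Run without arguments, it works on the built-in sample page; pass a URL as the first argument to try it against a live page. Deciding which of the extracted links to actually follow (same site only, depth limit, already visited) is left out to keep the sketch short.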
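I am writing a tool on GitHub that looks up activity by keyword. It is still being optimized, and for now it only supports queries from the command line (CMD).
The repository is at https://github.com/hzm1313/tz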
Basic implementation
// Basic keyword search against the given URL.
public SearchDto keyWordSearchTest(String url, String keyWord) {
    SearchDto seD = new SearchDto();
    BufferedReader in = null;
    OutputStream outputStream = null;
    String reasponseStr = null;
    // Buffer for the response HTML.
    StringBuffer resHtml = new StringBuffer();
    String line;
    try {
        URL realUrl = new URL(url);
        HttpURLConnection urlConnection = (HttpURLConnection) realUrl.openConnection();
        // Set request headers so the request resembles one sent by a browser.
        // Note that the Host value is hard-coded to s.tool.chinaz.com.
        urlConnection.setRequestProperty("Host", "s.tool.chinaz.com");
        urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101