Web crawling in Java with GET requests

物流小哥

Category: Java. Tags: java, programming language, crawler, GET request, regular expression, practice

Article link: https://blog.csdn.net/wwwwse/article/details/50974264

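Here is the crawler code I demoed to our department a while ago. I commented it in detail at the time.

This is a simple demonstration of how to write a crawler in Java, and there is plenty of room to extend it. A crawler mainly fetches pages in one of two ways: with a GET request or with a POST request. For now I will show the GET approach.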

package pachong;

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Calendar;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

/**
 * @author soon
 * 
 */
@SuppressWarnings("deprecation")
public class pachong {

	/**
	 * @param args
	 * @throws IOException
	 * @throws ClientProtocolException
	 */
	public static void main(String[] args) throws ClientProtocolException,
			IOException {
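		HttpClient client = new DefaultHttpClient();// create the client (HttpClient plays the role of a browser)
		HttpResponse httpResponse = null;
		String path ="http://tz.its.csu.edu.cn/Home/Release_TZTG_zd/7C97659E2C724ADCAD6DDBCFB3A3074C";// target URL
		HttpGet httpget = new HttpGet(path);// create a GET request
		httpResponse = client.execute(httpget);// execute the request
		HttpEntity entity = httpResponse.getEntity();// get the response entity
		String html = EntityUtils.toString(entity, "gb2312");// convert to a string (gb2312 is the default if the response declares no charset)
		// System.out.println(html);
		// The regular expression below is only a rough match for the part of the page we want.
		// Every site is different: use this code as a base, read the source of other pages, and try writing your own.
		String patter = "<tr style=\"height:650px;\" valign=\"top\">([\\w\\W]*?)<tr style=\"height:40px;\">";// regular expression
		Pattern p = Pattern.compile(patter);// compile the pattern
		Matcher m = p.matcher(html);// create a matcher for the page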
		if (!m.find()) {
			System.out.println("error");
			return;
		}
		String string1 = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\"><html xmlns=\"http://www.w3.org/1999/xhtml\"><head id=\"Head1\"><title>中南大学校内通告</title><meta content=\"IE=EmulateIE7\" http-equiv=\"X-UA-Compatible\" /><link href=\"http://tz.its.csu.edu.cn/Content/listmanagement.css\" rel=\"stylesheet\" type=\"text/css\" />"
				+ "</head><body style=\"background-color:White;\">"
				+ m.group(0) + "</body></html>";// build the HTML page; group(0) is the whole match, including both marker tags
		// System.out.println(string1);
		try {// file output (needs try/catch)
			Calendar calendar = Calendar.getInstance();// get a Calendar set to the current time
			String fileName = String.valueOf(calendar.getTimeInMillis())// current system time in milliseconds
					+ ".html";
			fileName = "C:\\Users\\admin\\Desktop" + "/" + fileName;// "admin" is your Windows user name; the path is the desktop
			FileOutputStream fileoutputstream = new FileOutputStream(fileName);// create the output stream
			System.out.print("Output file path: ");
			System.out.print(fileName);
			byte bytess[] = string1.getBytes();// convert to bytes (platform default charset)
			fileoutputstream.write(bytess);// write
			fileoutputstream.close();// close the output stream
		} catch (Exception e) {
			System.out.print(e.toString());// print the error
		}
	}
}
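
The extraction step in the listing above can be tried on its own, without any network access. The following is a minimal sketch, not the author's code: the class name `RegexExtractSketch`, the helper `extractBetween`, and the sample page are all made up for illustration. Unlike the listing, which keeps `group(0)` (the whole match, markers included), this helper returns only the text between the two markers.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexExtractSketch {

    // Hypothetical helper (not in the original post): returns the text between
    // startMarker and endMarker, or an empty Optional if they are not found.
    static Optional<String> extractBetween(String html, String startMarker, String endMarker) {
        // Pattern.quote treats the markers as literal text;
        // DOTALL lets '.' match line breaks, like [\w\W] in the listing above.
        Pattern p = Pattern.compile(
                Pattern.quote(startMarker) + "(.*?)" + Pattern.quote(endMarker),
                Pattern.DOTALL);
        Matcher m = p.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample page that imitates the two marker rows used above.
        String html = "<table>\n"
                + "<tr style=\"height:650px;\" valign=\"top\"><td>notice body</td></tr>\n"
                + "<tr style=\"height:40px;\"><td>footer</td></tr>\n"
                + "</table>";
        Optional<String> body = extractBetween(html,
                "<tr style=\"height:650px;\" valign=\"top\">",
                "<tr style=\"height:40px;\">");
        System.out.println(body.orElse("no match"));
    }
}
```

Quoting the markers avoids having to escape regex metacharacters by hand when you adapt the sketch to another site's markup.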

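Required jar packages:

Note: you do not need all of these jars. I simply added every jar related to this topic.

I am always short of download points. If you have plenty, please help by downloading: http://download.csdn.net/detail/wwwwse/9471785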
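
If collecting the jars is a problem, the same GET request can probably be made with no third-party jars at all on Java 11 or later, using the JDK's built-in `java.net.http` client. This is a swapped-in technique, not what the code above uses, and it is only a sketch: the class name `JdkGetSketch`, the helpers `buildGet` and `fetch`, and the 10-second timeout are assumptions for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.Charset;
import java.time.Duration;

class JdkGetSketch {

    // Builds a GET request for the given URL.
    // The 10-second timeout is an arbitrary choice for this sketch.
    static HttpRequest buildGet(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
    }

    // Sends the request and decodes the raw body bytes with the given charset.
    static String fetch(String url, String charsetName)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<byte[]> response =
                client.send(buildGet(url), HttpResponse.BodyHandlers.ofByteArray());
        return new String(response.body(), Charset.forName(charsetName));
    }

    public static void main(String[] args) throws Exception {
        // Pass the page URL as the first argument; nothing is fetched otherwise.
        if (args.length == 0) {
            System.out.println("usage: java JdkGetSketch.java <url>");
            return;
        }
        System.out.println(fetch(args[0], "gb2312"));
    }
}
```

Reading the body as bytes and decoding it explicitly mirrors the `gb2312` handling in the original code; for a page in another encoding, pass that charset name instead.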
