A Simple Java Crawler

Latest recommended article published on 2020-10-18 16:58:10

爱笑的冷眸

Category: java. Tags: java, crawler

Article link: https://blog.csdn.net/weixin_43749410/article/details/101052524

1. Fetching a page's HTML with HttpClient

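	import java.io.IOException;
	import java.nio.charset.StandardCharsets;
	import org.apache.http.HttpEntity;
	import org.apache.http.client.methods.CloseableHttpResponse;
	import org.apache.http.client.methods.HttpGet;
	import org.apache.http.impl.client.CloseableHttpClient;
	import org.apache.http.impl.client.HttpClients;
	import org.apache.http.util.EntityUtils;

	public static String getPageContent(String url) {
		HttpGet request = new HttpGet(url);
		String content = "";
		// try-with-resources closes the client and the response, even on failure
		try (CloseableHttpClient client = HttpClients.custom().build();
				CloseableHttpResponse response = client.execute(request)) {
			HttpEntity entity = response.getEntity();
			if (entity != null) {
				// UTF-8 is used only when the response does not declare a charset
				content = EntityUtils.toString(entity, StandardCharsets.UTF_8);
			}
		} catch (IOException e) {
			// ClientProtocolException is a subclass of IOException
			e.printStackTrace();
		}
		return content;
	}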

Fetching HTML with URL

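import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public static String getContent(String url) {
		StringBuilder result = new StringBuilder();
		try {
			URLConnection conn = new URL(url).openConnection();
			// try-with-resources closes the reader and the underlying stream;
			// the page is assumed to be UTF-8 encoded
			try (BufferedReader in = new BufferedReader(
					new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
				String line;
				while ((line = in.readLine()) != null) {
					// readLine() strips the line terminator, so add it back
					result.append(line).append('\n');
				}
			}
		} catch (IOException e) {
			// MalformedURLException is a subclass of IOException
			e.printStackTrace();
		}
		return result.toString();
	}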
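
The URL-based approach can be made a little more robust by setting timeouts and by decoding the page with the charset the server reports. The following is only a sketch under assumptions: the class name `TimeoutFetcher`, the helper `charsetOf`, the 5000 ms timeouts, and the `User-Agent` value are all illustrative and not part of the original post, and `fetch` assumes an http or https URL.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class TimeoutFetcher {

    // Hypothetical helper: returns the charset named in a Content-Type header,
    // falling back to UTF-8 when it is missing, malformed, or unsupported.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).replace("\"", "").trim();
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Covers both illegal and unsupported charset names
                        return StandardCharsets.UTF_8;
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Assumes an http/https URL, otherwise the cast below fails.
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(5000); // assumed value, in milliseconds
        conn.setReadTimeout(5000);    // assumed value, in milliseconds
        conn.setRequestProperty("User-Agent", "Mozilla/5.0"); // assumed header value
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                conn.getInputStream(), charsetOf(conn.getContentType())))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return sb.toString();
    }
}
```

Calling `TimeoutFetcher.fetch(url)` returns the page text or throws `IOException`, which leaves the decision about retries or logging to the caller instead of silently returning an empty string.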

Output
[Image: screenshot of the run result]
Extracting key content with a regular expression

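import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static void regex(String regex,String content) {
		Pattern p = Pattern.compile(regex);
		Matcher matcher = p.matcher(content);
		// Call find() only once: each call advances to the next match,
		// so a second call would skip the first match
		boolean found = matcher.find();
		System.out.println(found);
		if(found) {
			System.out.println(matcher.group());
		}
	}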
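
A crawler usually needs every match rather than just the first one, and only part of each match. One way to do that is to loop on `find()` and read a capture group. The sketch below is illustrative: the class name `LinkExtractor` and the pattern are assumptions, and the pattern only handles double-quoted `href` values, so it is not a general HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {

    // Assumed pattern: an <a ...> tag with a double-quoted href attribute.
    // Group 1 captures the attribute value.
    private static final Pattern HREF =
            Pattern.compile("<a\\s[^>]*?href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        // Each find() call moves to the next match until none are left
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"https://example.com/a\">A</a> "
                + "<a class=\"x\" href=\"/b\">B</a>";
        System.out.println(extractLinks(html)); // prints [https://example.com/a, /b]
    }
}
```

Compiling the pattern once as a constant avoids recompiling it for every page, which matters when the same extraction runs over many documents.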

Using HtmlCleaner

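import java.io.File;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public static void htmlCleaner() throws Exception {
		HtmlCleaner cleaner = new HtmlCleaner();
		// "xxx.html" is a placeholder for the path to a local HTML file
		TagNode node = cleaner.clean(new File("xxx.html"));
		// Select by tag name; true searches the whole subtree recursively
		TagNode[] elementsName = node.getElementsByName("title", true);
		if(elementsName.length > 0) {
			System.out.println(elementsName[0].getText());
		}
		// Select by XPath: every li under a div whose class is 'd_1'
		Object[] obj = node.evaluateXPath("//div[@class='d_1']//li");
		for(Object on:obj) {
			System.out.println(on);
		}
	}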

To be continued...