Tech stack: HttpClient + Jsoup
Add the dependencies (three of them):
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.6</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.3</version>
</dependency>
HttpClient
demo01.java ---- fetch the entire page content
public static void main(String[] args) throws Exception {
    // create an HttpClient instance
    CloseableHttpClient httpclient = HttpClients.createDefault();
    // create an HttpGet instance for the target URL
    HttpGet httpget = new HttpGet("http://www.csdn.net");
    // execute the HTTP request
    CloseableHttpResponse response = httpclient.execute(httpget);
    // get the response entity
    HttpEntity entity = response.getEntity();
    System.out.println("Page content: " + EntityUtils.toString(entity, "utf-8"));
    // close the response and the client
    response.close();
    httpclient.close();
}
demo02.java
- When the target site has anti-crawler measures (i.e. the server decides the request is not from a real person and rejects it to protect its resources), disguise the request with a browser User-Agent: httpget.setHeader("User-Agent", "a real browser's User-Agent string");
- Reading response metadata: status code --> response.getStatusLine().getStatusCode(); content type --> response.getEntity().getContentType(); and so on.
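The two bullets above can be sketched without hitting the network: build the request, set a browser-style User-Agent, and read the header back. The UA string here is just an example, not a required value.

```java
import org.apache.http.client.methods.HttpGet;

public class UserAgentDemo {
    public static void main(String[] args) {
        HttpGet httpget = new HttpGet("http://www.csdn.net");
        // pretend to be a real browser; any current browser UA string works
        httpget.setHeader("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
        // verify the header was attached to the request
        System.out.println(httpget.getFirstHeader("User-Agent").getValue());
        // after httpclient.execute(httpget), the response side would expose:
        //   response.getStatusLine().getStatusCode()   -> e.g. 200
        //   response.getEntity().getContentType()      -> e.g. text/html; charset=utf-8
    }
}
```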
demo03.java
Crawling images
public static void main(String[] args) throws Exception {
    CloseableHttpClient httpclient = HttpClients.createDefault();
    HttpGet httpget = new HttpGet("image URL");
    httpget.setHeader("User-Agent", "agent");
    CloseableHttpResponse response = httpclient.execute(httpget);
    HttpEntity entity = response.getEntity();
    if (entity != null) {
        System.out.println(entity.getContentType().getValue());
        // read the binary stream directly; do NOT call EntityUtils.toString()
        // first -- that would consume the stream and corrupt the image data
        InputStream input = entity.getContent();
        FileUtils.copyToFile(input, new File("target file path")); // Commons IO
    }
    response.close();
    httpclient.close();
}
demo04.java
1. Some sites defend against crawlers by banning IPs, in which case you need to go through a proxy IP.
Proxies come in four flavors: forward, reverse, transparent, and high-anonymity.
High-anonymity proxies are the best choice for crawling; searching for "proxy IP" turns up plenty of free high-anonymity ones.
After creating the HttpGet instance, add:
HttpHost proxy = new HttpHost("118.190.95.35",9001);
RequestConfig config = RequestConfig.custom().setProxy(proxy).build();
httpget.setConfig(config);
2. Connect-timeout and read-timeout settings
RequestConfig config = RequestConfig.custom()
        .setConnectTimeout(10000) // connect timeout: 10 seconds
        .setSocketTimeout(10000)  // socket (read) timeout: 10 seconds
        .build();
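The proxy and timeout settings above can live in a single RequestConfig, which is verifiable without any network call. The proxy address is the free high-anonymity example from these notes and has likely long since gone dead:

```java
import org.apache.http.HttpHost;
import org.apache.http.client.config.RequestConfig;

public class ConfigDemo {
    public static void main(String[] args) {
        // example free proxy from the notes; substitute a live one in practice
        HttpHost proxy = new HttpHost("118.190.95.35", 9001);
        RequestConfig config = RequestConfig.custom()
                .setProxy(proxy)
                .setConnectTimeout(10000) // connect timeout: 10 s
                .setSocketTimeout(10000)  // read timeout: 10 s
                .build();
        // the built config carries all three settings; attach it with
        // httpget.setConfig(config) before executing the request
        System.out.println(config.getProxy() + " " + config.getConnectTimeout()
                + " " + config.getSocketTimeout());
    }
}
```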
Jsoup
Overview: jsoup is a Java HTML parser that can parse a URL or an HTML string directly. It provides a very convenient API for extracting and manipulating data via DOM methods, CSS selectors, and jQuery-like operations.
Building on the code above, add:
Document doc = Jsoup.parse(content);
Elements ele = doc.getElementsByTag("title"); // look up by tag name
Element e = ele.get(0); // take the first match as an Element
System.out.println(e.html());
doc.getElementsByClass("className");  // by class name (plain name, no leading '.')
doc.getElementById("id");             // by id (plain id, no leading '#')
doc.getElementsByAttribute("key");    // by attribute name, and so on
通过选择器获取
Elements items = doc.select(".class .class tag名 a标签");
for(Element item : items){
System.out.println(item.text());
System.out.println("===============");
}
doc.select("a[href]");//带有href的a标签
doc.select("img[src$=.jpg]");//查找扩展名为jpg的节点
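The Jsoup calls above can be exercised end to end on an inline HTML string (no HTTP fetch needed), which also demonstrates the plain-name arguments to getElementById versus the CSS-selector syntax of select(). The HTML markup and class/id names here are made up for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupDemo {
    public static void main(String[] args) {
        String html = "<html><head><title>Demo Page</title></head>"
                + "<body><div class='news'><ul>"
                + "<li><a href='/a'>First</a></li>"
                + "<li><a href='/b'>Second</a></li>"
                + "</ul></div><p id='intro'>hello</p></body></html>";
        Document doc = Jsoup.parse(html);
        // by tag name
        System.out.println(doc.getElementsByTag("title").get(0).text());
        // by id: plain "intro", not "#intro"
        System.out.println(doc.getElementById("intro").text());
        // CSS selector: descendants of an element with class "news"
        Elements links = doc.select(".news li a");
        for (Element a : links) {
            System.out.println(a.text() + " -> " + a.attr("href"));
        }
    }
}
```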