Tech stack: HttpClient + Jsoup
Add the dependencies (three of them):
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.6</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.3</version>
</dependency>
HttpClient
demo01.java ---- fetch the entire page content
public static void main(String[] args) throws Exception {
    // create an HttpClient instance
    CloseableHttpClient httpclient = HttpClients.createDefault();
    // create an HttpGet instance for the target URL
    HttpGet httpget = new HttpGet("http://www.csdn.net");
    // execute the HTTP request
    CloseableHttpResponse response = httpclient.execute(httpget);
    // get the response entity
    HttpEntity entity = response.getEntity();
    System.out.println("Page content: " + EntityUtils.toString(entity, "utf-8"));
    // close the response and the client
    response.close();
    httpclient.close();
}
demo02.java
- When the target site has anti-crawler measures (i.e. the server decides the request is not from a real person and rejects it to protect its resources), disguise the request with a browser User-Agent: httpget.setHeader("User-Agent", "a real browser's User-Agent string");
- Reading response metadata: status code --> response.getStatusLine().getStatusCode(); content type --> response.getEntity().getContentType(); and so on.
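The two bullets above can be sketched without hitting the network: build the request, set a browser-style User-Agent, and read the header back. The UA string here is just an example, not a required value.

```java
import org.apache.http.client.methods.HttpGet;

public class UserAgentDemo {
    public static void main(String[] args) {
        HttpGet httpget = new HttpGet("http://www.csdn.net");
        // pretend to be a real browser; any current browser UA string works
        httpget.setHeader("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
        // verify the header was attached to the request
        System.out.println(httpget.getFirstHeader("User-Agent").getValue());
        // after httpclient.execute(httpget), the response side would expose:
        //   response.getStatusLine().getStatusCode()   -> e.g. 200
        //   response.getEntity().getContentType()      -> e.g. text/html; charset=utf-8
    }
}
```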
demo03.java
Crawling images
public static void main(String[] args) throws Exception {
    CloseableHttpClient httpclient = HttpClients.createDefault();
    HttpGet httpget = new HttpGet("image URL");
    httpget.setHeader("User-Agent", "agent");
    CloseableHttpResponse response = httpclient.execute(httpget);
    HttpEntity entity = response.getEntity();
    if (entity != null) {
        System.out.println(entity.getContentType().getValue());
        // read the binary stream directly; do NOT call EntityUtils.toString()
        // first -- that would consume the stream and corrupt the image data
        InputStream input = entity.getContent();
        FileUtils.copyToFile(input, new File("target file path")); // Commons IO
    }
    response.close();
    httpclient.close();
}
demo04.java
1. Some sites defend against crawlers by banning IPs, in which case you need to go through a proxy IP.
Proxies come in four flavors: forward, reverse, transparent, and high-anonymity.
High-anonymity proxies are the best choice for crawling; searching for "proxy IP" turns up plenty of free high-anonymity ones.
After creating the HttpGet instance, add:
HttpHost proxy = new HttpHost("118.190.95.35",9001);
RequestConfig config = RequestConfig.custom().setProxy(proxy).build();
httpget.setConfig(config);
2. Connect-timeout and read-timeout settings
RequestConfig config = RequestConfig.custom()
        .setConnectTimeout(10000) // connect timeout: 10 seconds
        .setSocketTimeout(10000)  // socket (read) timeout: 10 seconds
        .build();
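The proxy and timeout settings above can live in a single RequestConfig, which is verifiable without any network call. The proxy address is the free high-anonymity example from these notes and has likely long since gone dead:

```java
import org.apache.http.HttpHost;
import org.apache.http.client.config.RequestConfig;

public class ConfigDemo {
    public static void main(String[] args) {
        // example free proxy from the notes; substitute a live one in practice
        HttpHost proxy = new HttpHost("118.190.95.35", 9001);
        RequestConfig config = RequestConfig.custom()
                .setProxy(proxy)
                .setConnectTimeout(10000) // connect timeout: 10 s
                .setSocketTimeout(10000)  // read timeout: 10 s
                .build();
        // the built config carries all three settings; attach it with
        // httpget.setConfig(config) before executing the request
        System.out.println(config.getProxy() + " " + config.getConnectTimeout()
                + " " + config.getSocketTimeout());
    }
}
```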
Jsoup
Overview: jsoup is a Java HTML parser that can parse a URL or an HTML string directly. It provides a very convenient API for extracting and manipulating data via DOM methods, CSS selectors, and jQuery-like operations.
Building on the code above, add:
Document doc = Jsoup.parse(content);
Elements ele = doc.getElementsByTag("title"); // look up by tag name
Element e = ele.get(0); // take the first match as an Element
System.out.println(e.html());
doc.getElementsByClass("className");  // by class name (plain name, no leading '.')
doc.getElementById("id");             // by id (plain id, no leading '#')
doc.getElementsByAttribute("key");    // by attribute name, and so on
通过选择器获取
Elements items = doc.select(".class .class tag名 a标签");
for(Element item : items){
System.out.println(item.text());
System.out.println("===============");
}
doc.select("a[href]");//带有href的a标签
doc.select("img[src$=.jpg]");//查找扩展名为jpg的节点
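The Jsoup calls above can be exercised end to end on an inline HTML string (no HTTP fetch needed), which also demonstrates the plain-name arguments to getElementById versus the CSS-selector syntax of select(). The HTML markup and class/id names here are made up for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupDemo {
    public static void main(String[] args) {
        String html = "<html><head><title>Demo Page</title></head>"
                + "<body><div class='news'><ul>"
                + "<li><a href='/a'>First</a></li>"
                + "<li><a href='/b'>Second</a></li>"
                + "</ul></div><p id='intro'>hello</p></body></html>";
        Document doc = Jsoup.parse(html);
        // by tag name
        System.out.println(doc.getElementsByTag("title").get(0).text());
        // by id: plain "intro", not "#intro"
        System.out.println(doc.getElementById("intro").text());
        // CSS selector: descendants of an element with class "news"
        Elements links = doc.select(".news li a");
        for (Element a : links) {
            System.out.println(a.text() + " -> " + a.attr("href"));
        }
    }
}
```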