Java HttpClient Study Notes (1): Using the GET Method to Simulate Accessing a Web Page and Scrape Its Data

冷朴承

Category: JAVA. Tags: java, network, http, apache

Article link: https://blog.csdn.net/XiaoYunKuaiFei/article/details/105409296

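I am currently learning Android and building an app similar to 超级课程表 and 今日校园, but I have been stuck on the step of scraping the course timetable. After going through a lot of material without solving it, I decided to study HttpClient systematically. I am starting with a hello world and will keep recording what I learn.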

I. Fetching a web page with the GET method

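Use the GET method of Apache HttpClient (the org.apache.http packages) to simulate accessing a web page and fetch its data. This requires the HttpClient library on the classpath.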

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class HellWord {
    // Direct GET request
    public static void main(String[] a) {
        // Create the HTTP GET request
        HttpGet httpGet = new HttpGet("http://hll520.cn");

        // HttpClients.createDefault() creates a closeable HTTP client (roughly, a browser).
        // try-with-resources closes the response and the client even if the request fails,
        // so response is never dereferenced while null.
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            // Get the response entity, which holds the page source
            HttpEntity httpEntity = response.getEntity();
            if (httpEntity != null) {
                // Specify the encoding to avoid garbled text
                String h = EntityUtils.toString(httpEntity, "UTF-8");
                // println instead of printf: a '%' in the HTML would be read as a format specifier
                System.out.println(h);
            }
        } catch (IOException e) {
            // I/O exception (for example, a network problem)
            e.printStackTrace();
        }
    }
}

Result: the program requests the page as a browser would, then uses getEntity to obtain the page's HTML source and prints it.

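II. Simulating a browser User-Agent and reading the response status

Some sites serve different pages to different browsers or restrict automated scraping. In that case, set the User-Agent (UA) request header so the request looks like it comes from a browser. You can also use getStatusLine to read the response status.

1. Set the User-Agent request header to imitate Firefox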

        httpGet.setHeader("User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0");

2. Get the response status

response.getStatusLine(); // get the status line of the response

To get only the status code (for example, 200):

response.getStatusLine().getStatusCode()
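
The status code is usually checked before the body is read. The following is a minimal plain-JDK sketch of that check, not part of HttpClient: the class `StatusCodeSketch` and the method `describe` are hypothetical names introduced here, and the integer argument stands in for the value returned by `response.getStatusLine().getStatusCode()`.

```java
public class StatusCodeSketch {
    // Hypothetical helper: maps an HTTP status code to its general class.
    // The ranges follow the standard HTTP status code classes (1xx to 5xx).
    public static String describe(int statusCode) {
        if (statusCode >= 100 && statusCode < 200) return "informational";
        if (statusCode >= 200 && statusCode < 300) return "success";
        if (statusCode >= 300 && statusCode < 400) return "redirection";
        if (statusCode >= 400 && statusCode < 500) return "client error";
        if (statusCode >= 500 && statusCode < 600) return "server error";
        return "unknown";
    }

    public static void main(String[] args) {
        // Sample codes standing in for response.getStatusLine().getStatusCode()
        int[] samples = {200, 302, 404, 500};
        for (int code : samples) {
            System.out.println(code + " -> " + describe(code));
        }
    }
}
```

In a scraper, only a "success" result would normally lead on to reading the entity; other classes are better logged or retried.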

3. Get the content type

Determine the type of content the link points to:

entity.getContentType().getValue()
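
The Content-Type value often carries a charset parameter, for example `text/html; charset=UTF-8`, which can guide the encoding passed to `EntityUtils.toString`. The sketch below shows one way to split such a value using only the JDK. The class `ContentTypeSketch` and its methods are hypothetical names for illustration, and the parsing is deliberately simple, so it does not cover every legal header form.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class ContentTypeSketch {
    // Returns the media type part, e.g. "text/html" from "text/html; charset=UTF-8".
    public static String mediaType(String contentType) {
        int semicolon = contentType.indexOf(';');
        String type = semicolon < 0 ? contentType : contentType.substring(0, semicolon);
        return type.trim().toLowerCase(Locale.ROOT);
    }

    // Returns the charset named in the value, or the fallback when the
    // parameter is missing or names a charset this JVM does not support.
    public static Charset charsetOf(String contentType, Charset fallback) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).replace("\"", "").trim();
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Sample value standing in for entity.getContentType().getValue()
        String value = "text/html; charset=ISO-8859-1";
        System.out.println(mediaType(value));
        System.out.println(charsetOf(value, StandardCharsets.UTF_8));
    }
}
```

Note that `entity.getContentType()` returns null when the server sends no Content-Type header, so a null check is needed before calling `getValue()`.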

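III. GET with parameters

Use URIBuilder to construct a URI and set its query parameters. For several parameters, call setParameter once per parameter.

URIBuilder uriBuilder=new URIBuilder("http://baidu.com");
// Set the parameters (several can be set)
uriBuilder.setParameter("key","JAVA");
uriBuilder.setParameter("keys","c#");

Use the build() method to convert the builder into a URI. build() throws URISyntaxException, which must be caught or declared.

httpGet=new HttpGet(uriBuilder.build());// create the request from the built URI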
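
A value such as `c#` cannot be appended to a URL as-is, because `#` would start the fragment part. The sketch below uses only the JDK to show the kind of encoding a query string needs. It is not how URIBuilder is implemented: the class `QueryStringSketch` and the method `withQuery` are hypothetical names, and the code assumes the base URL has no query string yet.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryStringSketch {
    // Appends the parameters to the base URL, percent-encoding names and values.
    // Assumption: base contains no '?' yet.
    public static String withQuery(String base, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(base);
        char separator = '?';
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                sb.append(separator)
                  .append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
                separator = '&';
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is required on every Java platform, so this should not happen
            throw new IllegalStateException(ex);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order
        Map<String, String> params = new LinkedHashMap<>();
        params.put("key", "JAVA");
        params.put("keys", "c#");
        System.out.println(withQuery("http://baidu.com", params));
        // prints http://baidu.com?key=JAVA&keys=c%23
    }
}
```

`URLEncoder` applies form encoding, so a space becomes `+` rather than `%20`. In real code URIBuilder is the better choice, since it handles these details and existing query strings for you.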

Complete code for a GET with parameters

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.utils.URIBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.net.URISyntaxException;


// GET with parameters
public class HelloWordUA {
    public static void main(String[] ars) {
        HttpGet httpGet;
        try {
            URIBuilder uriBuilder = new URIBuilder("http://baidu.com");
            // Set the parameters (several can be set)
            uriBuilder.setParameter("key", "JAVA");
            uriBuilder.setParameter("keys", "c#");
            System.out.println(uriBuilder.build());
            // Create the HTTP GET request from the built URI
            httpGet = new HttpGet(uriBuilder.build());
        } catch (URISyntaxException e) {
            e.printStackTrace();
            return; // without a valid URI there is nothing to request
        }

        // Set the User-Agent request header to imitate Firefox
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0");

        // HttpClients.createDefault() creates a closeable HTTP client (roughly, a browser).
        // try-with-resources closes the response and the client even if the request fails.
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            // response.getStatusLine() returns the full status line;
            // getStatusCode() returns only the numeric status code
            System.out.println("Status:" + response.getStatusLine().getStatusCode());
            // Get the response entity, which holds the page source
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                // getContentType() returns null when the header is absent
                if (entity.getContentType() != null) {
                    System.out.println("ContentType:" + entity.getContentType().getValue());
                }
                System.out.println(EntityUtils.toString(entity, "UTF-8"));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
