Introduction to Web Crawlers
- A crawler system generally consists of three modules: fetching data, parsing data, and storing data.
- By crawl scope, crawlers divide into vertical crawlers and general-purpose crawlers: a vertical crawler targets one specific category of websites, while a general-purpose crawler crawls data from across the entire web.
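The three-module split above can be sketched with a few small Java interfaces. This is a minimal illustration only; the names `Fetcher`, `Parser`, and `Store` and the dummy implementations are hypothetical, not from any real framework:

```java
import java.util.List;

// Hypothetical interfaces for the three crawler modules: fetch, parse, store.
interface Fetcher { String fetch(String url); }       // fetch data over the network
interface Parser  { List<String> parse(String html); } // extract records from a page
interface Store   { void save(List<String> records); } // persist the extracted records

public class CrawlerPipeline {
    public static void main(String[] args) {
        // Dummy implementations showing how the three modules chain together.
        Fetcher fetcher = url -> "<html><a>demo</a></html>";          // pretend download
        Parser parser = html -> List.of(html.replaceAll("<[^>]*>", "")); // strip tags
        Store store = records -> records.forEach(System.out::println);   // print as "storage"

        store.save(parser.parse(fetcher.fetch("http://www.kanxue.ren")));
    }
}
```

In a real crawler each interface would be backed by a network client, an HTML parser, and a database respectively; the point here is only the flow of data between the three modules.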
How Crawlers Work
Fetching Data
Fetching data means issuing a network request, which uses the HTTP protocol.
Fetching data with the JDK API
public void jdkHttpGetData() throws Exception {
    // Open an HTTP connection and issue a GET request
    URL url = new URL("http://www.kanxue.ren");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");

    // Read the response body line by line as UTF-8; try-with-resources closes the stream
    try (BufferedReader bufferedReader = new BufferedReader(
            new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            System.out.println(line);
        }
    }
}
Fetching data with the HttpClient library
* Network requests are issued through a CloseableHttpClient
* GET requests use an HttpGet object; POST requests use an HttpPost object
* EntityUtils.toString() converts a response's input stream into a String
* UrlEncodedFormEntity wraps the form data to be submitted
Add the jar in Maven:
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.5</version>
</dependency>
public static void main(String[] args) throws Exception {
    // Create a default client and issue a GET request
    CloseableHttpClient client = HttpClients.createDefault();
    HttpGet request = new HttpGet("http://www.kanxue.ren");
    try (CloseableHttpResponse response = client.execute(request)) {
        // Only read the body on a 200 OK response
        if (response.getStatusLine().getStatusCode() == 200) {
            HttpEntity entity = response.getEntity();
            // EntityUtils.toString drains the entity's stream into a UTF-8 String
            System.out.println(EntityUtils.toString(entity, StandardCharsets.UTF_8));
        }
    }
}
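The bullets above also mention HttpPost and UrlEncodedFormEntity for submitting form data. A minimal sketch of building such a POST request follows; the target URL and the form field name/value are placeholders, and the request would be executed with the same CloseableHttpClient as in the GET example:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public class HttpPostDemo {
    public static void main(String[] args) throws Exception {
        // Form fields to submit; the field name and value here are placeholders
        List<NameValuePair> params = new ArrayList<>();
        params.add(new BasicNameValuePair("keyword", "crawler"));

        HttpPost request = new HttpPost("http://www.kanxue.ren");
        // UrlEncodedFormEntity encodes the fields as application/x-www-form-urlencoded
        request.setEntity(new UrlEncodedFormEntity(params, StandardCharsets.UTF_8));

        // Inspect the body that would be sent with the request
        System.out.println(EntityUtils.toString(request.getEntity())); // keyword=crawler
    }
}
```

To actually send it, pass the request to client.execute(request) exactly as in the GET example and read the response entity the same way.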