Web Crawler Basics

Latest recommended article published on 2021-11-30 15:45:58

满山的猴子我毛最多

Article link: https://blog.csdn.net/qq_43467548/article/details/103299154

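## Web crawlers
A web crawler is a program or script that automatically fetches information according to a set of rules.

    import java.io.IOException;
    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class ClawlerFirst {
        public static void main(String[] args) throws IOException {
            // "Open the browser": create an HttpClient object
            CloseableHttpClient httpClient = HttpClients.createDefault();
            // "Type the address": create a GET request
            HttpGet httpGet = new HttpGet("http://baihe.com");
            // "Press Enter": send the request with httpClient and receive the response
            CloseableHttpResponse response = httpClient.execute(httpGet);
            // Parse the response and read the data
            if (response.getStatusLine().getStatusCode() == 200) {
                HttpEntity httpEntity = response.getEntity();
                String content = EntityUtils.toString(httpEntity, "utf8");
                System.out.println(content);
            }
            // Release the connection and the client
            response.close();
            httpClient.close();
        }
    }
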
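### A crawler collects information from the Internet
A crawler, also called a web robot, automatically collects and organizes data from the Internet. It is a program or script that fetches web content according to a set of rules, and it can collect the full content of every page it visits in order to extract the relevant data.
* Functionally, a crawler generally has three parts: data collection, processing, and storage.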
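
The three parts can be sketched as three small methods. This is only an illustration under stated assumptions: the class name `CrawlerPipelineSketch`, the stubbed page, and the in-memory store are all hypothetical, and the collection step returns a fixed string instead of making a real HTTP request so that the sketch runs offline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrawlerPipelineSketch {
    // In-memory stand-in for a file or database (assumption for this sketch).
    static final List<String> STORE = new ArrayList<>();

    // Part 1, collection: a real crawler would issue an HTTP request here.
    // This stub ignores the URL and returns a fixed, made-up page.
    static String collect(String url) {
        return "<html><head><title>Example page</title></head><body>hello</body></html>";
    }

    // Part 2, processing: extract the text of the <title> tag from the raw HTML.
    static String process(String html) {
        Matcher m = Pattern.compile("<title>(.*?)</title>").matcher(html);
        return m.find() ? m.group(1) : "";
    }

    // Part 3, storage: keep the extracted record.
    static void store(String record) {
        STORE.add(record);
    }

    public static void main(String[] args) {
        store(process(collect("http://example.com")));
        System.out.println(STORE);
    }
}
```

In a real crawler the regular expression would normally give way to an HTML parser, since regular expressions are fragile on real-world markup.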

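1. GET request with parameters

        // Goal: a GET address with a query string, e.g. http://www.itcast.cn/search?keys=Java
        // Create a URIBuilder (org.apache.http.client.utils.URIBuilder) for the base address
        URIBuilder uriBuilder = new URIBuilder("http://yun.itcastheima.cn/search");
        // Set the query parameter
        uriBuilder.setParameter("keys", "java");
        // Create the HttpGet object from the built address
        // (build() can throw URISyntaxException)
        HttpGet httpGet = new HttpGet(uriBuilder.build());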

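2. Connection pool

    import java.io.IOException;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
    import org.apache.http.util.EntityUtils;

    public class HttpClientPoolTest {
        public static void main(String[] args) {
            // Create the connection pool manager
            PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
            // Maximum number of connections in the whole pool
            cm.setMaxTotal(100);
            // Maximum number of connections per host (route)
            cm.setDefaultMaxPerRoute(10);
            // Send a request through the connection pool manager
            doGet(cm);
            //doPost(cm);
        }

        private static void doGet(PoolingHttpClientConnectionManager cm) {
            // Instead of a client with its own connections, build an HttpClient
            // that borrows its connections from the pool
            CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();

            HttpGet httpGet = new HttpGet("http://www.itcast.cn");
            CloseableHttpResponse response = null;
            try {
                response = httpClient.execute(httpGet);
                if (response.getStatusLine().getStatusCode() == 200) {
                    String content = EntityUtils.toString(response.getEntity(), "utf8");
                    System.out.println(content.length());
                }
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                if (response != null) {
                    try {
                        // Closing the response returns the connection to the pool
                        response.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
                // Do not close the HttpClient here: that would also shut down the shared pool
            }
        }
    }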
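
To make the two limits more concrete, here is a minimal sketch that configures a pool with the same numbers and reads them back without sending any request. The class name `PoolLimitsSketch` and the per-route override of 20 for `www.itcast.cn` are assumptions added for illustration; they are not part of the original example.

```java
import org.apache.http.HttpHost;
import org.apache.http.conn.routing.HttpRoute;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class PoolLimitsSketch {
    // Hypothetical route used only to demonstrate a per-route override.
    static final HttpRoute ROUTE = new HttpRoute(new HttpHost("www.itcast.cn", 80));

    static PoolingHttpClientConnectionManager buildPool() {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(100);           // upper bound for the whole pool
        cm.setDefaultMaxPerRoute(10);  // default upper bound for each host (route)
        cm.setMaxPerRoute(ROUTE, 20);  // assumed override for one specific host
        return cm;
    }

    public static void main(String[] args) {
        PoolingHttpClientConnectionManager cm = buildPool();
        System.out.println("max total: " + cm.getMaxTotal());
        System.out.println("default per route: " + cm.getDefaultMaxPerRoute());
        System.out.println("max for the overridden route: " + cm.getMaxPerRoute(ROUTE));
        // No request has been executed, so no connection is leased yet.
        System.out.println("leased: " + cm.getTotalStats().getLeased());
        cm.shutdown();
    }
}
```

The per-route limit matters for a crawler because most requests tend to go to a small number of hosts, so the per-host cap is usually reached long before the total.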

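#### Customizing timeouts

    import java.io.IOException;
    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class HttpConfigTest {
        public static void main(String[] args) throws IOException {
            // Create the HttpClient object
            CloseableHttpClient httpClient = HttpClients.createDefault();
            // Create the HttpGet object with the URL to visit
            HttpGet httpGet = new HttpGet("http://www.baidu.com");
            // Configure the request (all values are in milliseconds)
            RequestConfig config = RequestConfig.custom()
                    .setConnectTimeout(1000)           // maximum time to establish the connection
                    .setConnectionRequestTimeout(500)  // maximum time to obtain a connection from the connection manager
                    .setSocketTimeout(10 * 1000)       // maximum time to wait for data during transfer
                    .build();
            // Apply the configuration to the request
            httpGet.setConfig(config);

            // Send the request with HttpClient and get the response
            CloseableHttpResponse response = null;
            try {
                response = httpClient.execute(httpGet);
                // Parse the response
                if (response.getStatusLine().getStatusCode() == 200) {
                    String content = EntityUtils.toString(response.getEntity(), "utf8");
                    System.out.println(content.length());
                }
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                // response stays null if execute() failed, so check before closing
                if (response != null) {
                    try {
                        response.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
                httpClient.close();
            }
        }
    }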

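### Introduction to jsoup
jsoup is an HTML parser for Java that can parse HTML directly from a URL or from HTML text.

The main features of jsoup are:
1. Parse HTML from a URL, a file, or a string
2. Find and extract data using DOM traversal or CSS selectors
3. Manipulate HTML elements, attributes, and text

For example:

    public void testUrl() throws Exception {
        // Parse a URL: the first argument is the URL to visit,
        // the second is the request timeout in milliseconds
        Document doc = Jsoup.parse(new URL("http://www.baihe.cn"), 1000);
        // Use the tag lookup to get the content of the title tag
        String title = doc.getElementsByTag("title").first().text();
        // Print it
        System.out.println(title);
    }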

### Parsing a string read with a utility class

    public void testString() throws Exception {
        // Read the file into a string with the FileUtils utility class
        // (backslashes in a Java string literal must be escaped)
        String content = FileUtils.readFileToString(new File("C:\\Users\\liguoliang\\Desktop\\test.html"), "utf8");
        // Parse the string
        Document doc = Jsoup.parse(content);
        String title = doc.getElementsByTag("title").first().text();
        System.out.println(title);
    }

### Parsing a file with jsoup

    public void testFile() throws Exception {
        // Parse the file directly (backslashes in a Java string literal must be escaped)
        Document doc = Jsoup.parse(new File("C:\\Users\\liguoliang\\Desktop\\test.html"), "utf8");
        String title = doc.getElementsByTag("title").text();
        System.out.println(title);
    }

### Traversing the document with the DOM
Ways to get elements:
1. By id: getElementById
2. By tag: getElementsByTag
3. By class: getElementsByClass
4. By attribute: getElementsByAttribute
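
A minimal sketch of these four lookups, run against an inline HTML string so that no network access is needed. The class name `DomLookupSketch` and the sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class DomLookupSketch {
    // Hypothetical sample markup, invented for this sketch.
    static final String HTML =
            "<html><head><title>Demo</title></head><body>"
            + "<div id=\"city\" class=\"box main\" data-code=\"010\">Beijing</div>"
            + "<span class=\"box\">Shanghai</span>"
            + "<span>Guangzhou</span>"
            + "</body></html>";

    static Document sample() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = sample();
        // By id: returns a single Element (or null if nothing matches)
        Element byId = doc.getElementById("city");
        // By tag, class, and attribute name: each returns an Elements collection
        Elements byTag = doc.getElementsByTag("span");
        Elements byClass = doc.getElementsByClass("box");
        Elements byAttr = doc.getElementsByAttribute("data-code");

        System.out.println(byId.text());
        System.out.println(byTag.size());
        System.out.println(byClass.size());
        System.out.println(byAttr.size());
    }
}
```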

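### Getting data from an element
1. Get the id from the element: id
2. Get the class name from the element: className
3. Get the value of one attribute from the element: attr
4. Get all attributes from the element: attributes
5. Get the text content from the element: text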
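
The five accessors can be tried on a single element as follows. This is a sketch under assumptions: the class name `ElementDataSketch` and the one-line HTML fragment are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Element;

public class ElementDataSketch {
    // Parses a made-up fragment and returns its only div element.
    static Element sampleElement() {
        String html = "<div id=\"city\" class=\"box main\" data-code=\"010\">Beijing</div>";
        return Jsoup.parse(html).getElementById("city");
    }

    public static void main(String[] args) {
        Element e = sampleElement();
        System.out.println(e.id());               // value of the id attribute
        System.out.println(e.className());        // the whole class attribute as one string
        System.out.println(e.classNames());       // the individual class names as a set
        System.out.println(e.attr("data-code"));  // value of a single attribute
        // attributes() exposes every attribute as a key/value pair
        for (Attribute a : e.attributes()) {
            System.out.println(a.getKey() + " = " + a.getValue());
        }
        System.out.println(e.text());             // text content without tags
    }
}
```

Note that className returns the raw attribute value, so an element with several classes yields them as one space-separated string; classNames is the variant to use when the individual names are needed.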
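### Using selectors
1. tagname: find elements by tag name
2. #id: find an element by id
3. .class: find elements by class name
4. [attribute]: find elements that have the given attribute
5. [attr=value]: find elements whose attribute has the given value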
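
Each of these selector forms is passed to jsoup's `select` method. The sketch below assumes a small made-up document; the class name `SelectorSketch` and the helper `select` are hypothetical wrappers, not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

public class SelectorSketch {
    // Hypothetical sample markup, invented for this sketch.
    static final String HTML =
            "<div id=\"city\" class=\"box\" data-code=\"010\">Beijing</div>"
            + "<span class=\"box\">Shanghai</span>"
            + "<span data-code=\"020\">Guangzhou</span>";

    // Runs one CSS query against the sample document.
    static Elements select(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery);
    }

    public static void main(String[] args) {
        System.out.println(select("span").size());             // tagname
        System.out.println(select("#city").text());            // #id
        System.out.println(select(".box").size());             // .class
        System.out.println(select("[data-code]").size());      // [attribute]
        System.out.println(select("[data-code=020]").text());  // [attr=value]
    }
}
```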
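### Combining selectors
1. el#id: element + id
2. el.class: element + class
3. el[attr]: element + attribute name

Arbitrary combinations:
1. ancestor child: find descendant elements at any depth under an ancestor element
2. parent > child: find the matching direct children of a parent element
3. parent > *: find all direct children of a parent element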
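
The difference between the descendant form and the direct-child form is easiest to see side by side. This is a sketch on made-up markup; the class name `CombinedSelectorSketch`, the `city_con` class, and the helper `select` are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

public class CombinedSelectorSketch {
    // Hypothetical sample markup: the li elements are grandchildren of the div.
    static final String HTML =
            "<div class=\"city_con\"><h3>Cities</h3>"
            + "<ul><li id=\"bj\" class=\"hot\">Beijing</li><li>Shanghai</li></ul>"
            + "</div>";

    // Runs one CSS query against the sample document.
    static Elements select(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery);
    }

    public static void main(String[] args) {
        System.out.println(select("li#bj").text());               // el#id
        System.out.println(select("li.hot").text());              // el.class
        System.out.println(select("li[id]").size());              // el[attr]
        System.out.println(select(".city_con li").size());        // descendants at any depth
        System.out.println(select(".city_con > li").size());      // direct children only
        System.out.println(select(".city_con > ul > li").size()); // chained direct children
        System.out.println(select(".city_con > *").size());       // every direct child
    }
}
```

Here `.city_con li` matches both list items because they are descendants of the div, while `.city_con > li` matches nothing because the items sit one level deeper, inside the ul.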
