Scraping the Baidu homepage logo with Java

weixin_34092455

Tags: java, development tools

Original link: http://blog.51cto.com/6023891/1557516

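Two methods
- One, getUrlContentString, fetches the page source of a URL; the other extracts the wanted address fragment from that source, using a regular expression to match it.
Fetching the page source:
- The address starts out as a String; convert it to a Java URL object.
- The URL's openConnection method returns a URLConnection.
- The URLConnection's connect method opens the connection.
- Create an InputStreamReader. Its constructor needs an InputStream, and URLConnection's getInputStream method returns one, so the two chain together.
- Wrap the InputStreamReader in a BufferedReader.
- Read the page source line by line from the BufferedReader and append each line to result; when the loop ends, result holds the page source as a single string.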
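The reader chain in the steps above can be tried without a network connection. The following is a minimal sketch, not the article's own code: the class name ReaderChainDemo, the helper readAll, and the two-line sample page are hypothetical, and a ByteArrayInputStream stands in for the stream that URLConnection's getInputStream would return.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReaderChainDemo {
    // Reads every line from the stream and joins the lines without separators,
    // mirroring how the page source is accumulated into one string.
    static String readAll(InputStream input) throws IOException {
        StringBuilder result = new StringBuilder();
        // InputStream -> InputStreamReader (bytes to chars) -> BufferedReader (line access)
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(input, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                result.append(line);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for urlConnection.getInputStream(): a made-up two-line page.
        byte[] fakePage = "<html>\n<body>hi</body></html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(fakePage)));
    }
}
```

Because readLine drops line terminators, the joined result contains no line breaks. That happens to suit a regular expression that uses `.`, which does not match line terminators by default.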
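Matching the logo address
- Pattern pattern = Pattern.compile(patternString);
  - java.util.regex: the Java library package for matching strings against patterns defined by regular expressions.

Its two main classes are Pattern and Matcher.

Pattern: the compiled form of a regular expression.

Matcher: matches a Pattern against an input string.

Pattern's compile method: compiles the given regular expression string into a Pattern.

Matcher matcher = pattern.matcher(contentString);

The matcher's find method looks for the next match in the input, and group(1) returns the text captured by the first parenthesized group of the pattern.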
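To see what Pattern and Matcher do with a capture group, here is a small sketch on a made-up HTML fragment. The class name RegexGroupDemo, the helper firstGroup, and the sample string are hypothetical; only Pattern.compile, matcher, find, and group come from java.util.regex. It also contrasts the lazy quantifier `.+?` with the greedy `.+`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexGroupDemo {
    // Returns the text captured by group 1 of the first match, or null if nothing matches.
    static String firstGroup(String input, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up sample tag, not taken from any real page.
        String html = "<img src=\"/img/logo.png\" alt=\"logo\">";

        // Lazy: stops at the first closing quote.
        System.out.println(firstGroup(html, "src=\"(.+?)\""));  // prints /img/logo.png

        // Greedy: runs on to the last quote in the input.
        System.out.println(firstGroup(html, "src=\"(.+)\""));   // prints /img/logo.png" alt="logo
    }
}
```

The lazy form is the safer choice when all lines of a page are joined into one string, since a greedy match could swallow everything up to the last quote in the document.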

package com.test;

import java.io.*;
import java.net.*;
import java.nio.charset.StandardCharsets;
import java.util.regex.*;

public class baidulogo {
    // Downloads the page at urlString and returns its source as a single string.
    static String getUrlContentString(String urlString) throws Exception {
        StringBuilder result = new StringBuilder();
        URL url = new URL(urlString);
        URLConnection urlConnection = url.openConnection();
        urlConnection.connect();
        // try-with-resources closes the reader and the underlying stream
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                urlConnection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            // readLine() strips line terminators, so the lines are joined directly
            while ((line = in.readLine()) != null) {
                result.append(line);
            }
        }
        return result.toString();
    }

    // Returns capture group 1 of the first match, or null if nothing matches.
    static String getLogoUrl(String contentString, String patternString) {
        String logoUrl = null;
        Pattern pattern = Pattern.compile(patternString);
        Matcher matcher = pattern.matcher(contentString);
        if (matcher.find()) {
            logoUrl = matcher.group(1);
        }
        return logoUrl;
    }

    public static void main(String[] args) throws Exception {
        // The URL to visit
        String urlString = "http://www.baidu.com";
        String result = getUrlContentString(urlString);
        // Lazily capture the value of the first src="..." attribute
        String patternString = "src=\"(.+?)\"";
        String contentString = result;
        String logoUrl = getLogoUrl(contentString, patternString);
        System.out.println(logoUrl);
    }
}
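One caveat about the pattern in main: `src="(.+?)"` returns the first src attribute anywhere in the page, which could belong to a script or another tag rather than the logo image. The sketch below shows one possible way to restrict the match to img tags. The class name ImgSrcDemo, the helper firstImgSrc, the narrower pattern, and the sample markup are all assumptions for illustration; the real structure of the Baidu homepage is not verified here and may require a different pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgSrcDemo {
    // Assumed pattern: an <img ...> tag whose src attribute is double-quoted.
    // The \s before src keeps attributes such as data-src from matching.
    private static final Pattern IMG_SRC =
            Pattern.compile("<img[^>]*?\\ssrc=\"([^\"]+)\"");

    // Returns the src value of the first <img> tag, or null if none is found.
    static String firstImgSrc(String html) {
        Matcher matcher = IMG_SRC.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up markup in which a script tag precedes the image.
        String html = "<script src=\"/js/a.js\"></script>"
                + "<img class=\"x\" src=\"/img/logo.png\">";
        System.out.println(firstImgSrc(html));  // prints /img/logo.png
    }
}
```

On the same sample, the generic `src="(.+?)"` pattern would return /js/a.js instead. For anything beyond a quick experiment, an HTML parser is usually more robust than regular expressions.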

Reposted from: https://blog.51cto.com/6023891/1557516
