Problems encountered when fetching a page's title and description

iteye_17208

Category: Java. Tags: regular expressions, Baidu, Apache, IBM, XML

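A few days ago my company asked me to fill in the title and summary columns of the pages table in our database. A crawler had already populated the table with page URLs, so all I needed to do was take each URL, fetch the page content, and extract the fields from it, which should have been easy. A few problems came up along the way, though.
1. Code for fetching the page:

The classes used below require the following dependencies in pom.xml: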


<dependency>
    <groupId>nekohtml</groupId>
    <artifactId>nekohtml</artifactId>
    <version>0.9.5</version>
</dependency>
<dependency>
    <groupId>commons-httpclient</groupId>
    <artifactId>commons-httpclient</artifactId>
    <version>3.0.1</version>
</dependency>
<dependency>
    <groupId>com.ibm.icu</groupId>
    <artifactId>icu4j</artifactId>
    <version>3.8</version>
</dependency>


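Method for determining the page's character set:
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

/**
 * Determine the page encoding from the binary stream.
 * @param is the stream containing the raw page bytes
 * @return the name of the most likely charset, or null if none could be detected
 */
public String getCharset(InputStream is) {
    CharsetDetector detector = new CharsetDetector();
    // Skip HTML/XML markup so that only the text content is analyzed.
    detector.enableInputFilter(true);
    try {
        // setText needs a stream that supports mark/reset, hence the wrapper.
        detector.setText(new BufferedInputStream(is));
    } catch (IOException e) {
        e.printStackTrace();
        return null;
    }

    // detect() returns the best match, or null when nothing matches.
    CharsetMatch match = detector.detect();
    return match == null ? null : match.getName();
}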
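The CharsetDetector class analyzes the page's bytes and returns the most likely result. For example, Baidu Baike pages are all encoded as "gb2312", but the detector reports GB18030, which is a superset of the former.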
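The superset relationship means the detected name can be used directly for decoding. Below is a minimal sketch of that idea, assuming the JDK in use ships the GB2312 and GB18030 charsets (standard desktop and server JDKs do). The class name Gb18030SupersetDemo and its helper method are hypothetical and not part of the original code.

```java
import java.nio.charset.Charset;

final class Gb18030SupersetDemo {
    /** Encodes text as GB2312, then decodes the same bytes as GB18030. */
    static String encodeGb2312DecodeGb18030(String text) {
        byte[] gb2312Bytes = text.getBytes(Charset.forName("GB2312"));
        return new String(gb2312Bytes, Charset.forName("GB18030"));
    }

    public static void main(String[] args) {
        // "Baidu Baike" in Chinese, written with Unicode escapes so the
        // example does not depend on the source file's encoding.
        String original = "\u767e\u5ea6\u767e\u79d1";
        String decoded = encodeGb2312DecodeGb18030(original);
        // The round trip preserves the text when every character is in GB2312.
        System.out.println(original.equals(decoded));
    }
}
```

So a page declared as gb2312 can be decoded with the GB18030 name that the detector returns.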


Code for fetching the page content:
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;

public static byte[] downloadContent(String url) {
    // Scratch buffer reused for every read; the whole page accumulates in baos.
    byte[] buffer = new byte[1024 * 100];

    HttpClient httpClient = new HttpClient();
    GetMethod getMethod = new GetMethod(url);
    try {
        int rt = httpClient.executeMethod(getMethod);
        if (rt != HttpStatus.SC_OK) {
            return null;
        }
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        InputStream responseBodyAsStream = getMethod.getResponseBodyAsStream();
        int count;
        // Append each chunk; only the first count bytes of buffer are valid.
        while ((count = responseBodyAsStream.read(buffer, 0, buffer.length)) > -1) {
            baos.write(buffer, 0, count);
        }
        return baos.toByteArray();
    } catch (Exception e) {
        // logger is a field of the enclosing class (its declaration is not shown here).
        logger.error("error occurred while downloading the page at: " + url, e);
        return null;
    } finally {
        // Releasing the connection also closes the response stream.
        getMethod.releaseConnection();
    }
}

Now look at the API documentation for ByteArrayOutputStream:


ByteArrayOutputStream
This class implements an output stream in which the data is written into a byte array. The buffer automatically grows as data is written to it. The data can be retrieved using toByteArray() and toString(). 

Closing a ByteArrayOutputStream has no effect. The methods in this class can be called after the stream has been closed without generating an IOException.

[quote]
The buffer automatically grows as data is written to it.
[/quote]
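Because this class's buffer grows automatically, it can hold a page that is larger than its initial buffer size (unless the page is larger than the remaining memory, in which case an OutOfMemoryError is thrown).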

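At first the method simply returned an InputStream, namely getMethod.getResponseBodyAsStream(). But when another class read that stream, it failed with "attempted to read a closed stream". My guess was that by the time the other class used the InputStream, the called method's stack frame had already released its reference to the stream. A more likely cause is that the finally block calls getMethod.releaseConnection(), which closes the response stream before the caller gets to read it. Either way, I changed the method to read the page content itself and return a reference to the buffered bytes.
Returning the buffer led to a problem that puzzled me for a long time. Page sizes vary widely, so the 100 KB buffer sometimes holds a whole page in one read, while other pages take several reads. At first I simply returned buffer. For small pages the result was completely correct, but for pages larger than the buffer, buffer held only the data from the last read and everything read earlier had been overwritten, so the title was often missing. Charset detection with the method above was also sometimes right and sometimes wrong because the data was incomplete. Whenever you read data of unknown size, watch out for this problem.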
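The difference between the two approaches can be reproduced without any network access. The sketch below is only an illustration: the class ReadFullyDemo and its methods are hypothetical, and a ByteArrayInputStream with an 8-byte buffer stands in for the HTTP response and the 100 KB buffer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class ReadFullyDemo {
    /** Buggy variant: returns the scratch buffer, which holds only the last chunk. */
    static byte[] readReturningBuffer(InputStream in, int bufferSize) {
        byte[] buffer = new byte[bufferSize];
        try {
            while (in.read(buffer, 0, buffer.length) > -1) {
                // Every read overwrites the buffer from index 0.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer;
    }

    /** Correct variant: appends every chunk to a growing ByteArrayOutputStream. */
    static byte[] readFully(InputStream in, int bufferSize) {
        byte[] buffer = new byte[bufferSize];
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try {
            int count;
            while ((count = in.read(buffer, 0, buffer.length)) > -1) {
                baos.write(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return baos.toByteArray();
    }

    public static void main(String[] args) {
        byte[] page = "<html><title>demo</title></html>".getBytes(StandardCharsets.US_ASCII);
        byte[] buggy = readReturningBuffer(new ByteArrayInputStream(page), 8);
        byte[] full = readFully(new ByteArrayInputStream(page), 8);
        System.out.println("buggy length: " + buggy.length + ", full length: " + full.length);
        System.out.println(new String(full, StandardCharsets.US_ASCII));
    }
}
```

The buggy variant never returns more than bufferSize bytes, which is why the title disappeared for large pages.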
Later I found a similar post: http://dengyin2000.iteye.com/blog/47417.
I also have a question. My code extracts the description with regular expressions, as follows:


import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Matches a whole <meta ...> tag whose name attribute is "description".
// [^>]* keeps the match inside a single tag.
private static final Pattern META_DESCRIPTION = Pattern.compile(
        "<meta\\b[^>]*\\bname\\s*=\\s*[\"']\\s*description\\s*[\"'][^>]*>",
        Pattern.CASE_INSENSITIVE);

// Captures the value of the content attribute; \\1 requires the same closing quote.
private static final Pattern CONTENT_ATTR = Pattern.compile(
        "\\bcontent\\s*=\\s*([\"'])(.*?)\\1",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

public String getDescription(String page) {
    String description = null;

    Matcher matcher = META_DESCRIPTION.matcher(page);
    // If the page has several description tags, the last one wins.
    while (matcher.find()) {
        Matcher matcherDesc = CONTENT_ATTR.matcher(matcher.group());
        if (matcherDesc.find()) {
            description = matcherDesc.group(2);
        }
    }

    // Returns null when no description is found; the caller may fall back to the title.
    return description;
}

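In HTML, the position and number of attributes within a tag are not fixed. To get the description, the code first looks for a <meta/> string containing name="description", and then searches that match for the content attribute. That takes two regular-expression searches. Does anyone know a better way to do this with a single regular expression?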
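One possible answer is a lookahead: it checks that the tag contains name="description" somewhere before its closing >, without consuming any input, so the content attribute can then be matched wherever it appears in the same tag. The following is a sketch under that assumption and not a tested replacement for the code above. The class MetaDescription and its extract method are hypothetical names. A regular expression is still not an HTML parser, so unusual markup (for example a > inside an attribute value) can defeat it, and a parser such as the nekohtml dependency would be more robust.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaDescription {
    // (?=...) asserts that name="description" occurs inside this tag.
    // ([\"']) captures the opening quote and \\1 requires the same closing quote.
    private static final Pattern DESCRIPTION = Pattern.compile(
            "<meta\\b(?=[^>]*\\bname\\s*=\\s*[\"']\\s*description\\s*[\"'])"
                    + "[^>]*?\\bcontent\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the content of the first description meta tag, or null if none is found. */
    static String extract(String page) {
        Matcher matcher = DESCRIPTION.matcher(page);
        return matcher.find() ? matcher.group(2) : null;
    }

    public static void main(String[] args) {
        String nameFirst = "<meta name=\"description\" content=\"Hello world\" />";
        String contentFirst = "<META content='Hello world' name='description'>";
        System.out.println(extract(nameFirst));
        System.out.println(extract(contentFirst));
    }
}
```

Because the lookahead does not consume input, the attribute order inside the tag no longer matters, which is what made the two-step search necessary.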
