Notes on Chinese Web Data Extraction in Java (part 1)






Note. The code is developed with Eclipse and tested under JDK 1.6. For the code to run correctly, you need to set the project encoding to utf-8 and include the necessary libraries. All the code will be available at

1. Correctly Loading a Chinese Web Page

Correctly loading a Chinese Web page in Java is not a trivial task. Given a target url, you need to read the content from the url and then decode it using the right encoding. Chinese Web pages can be encoded in utf-8, gbk, gb2312, gb18030, big5, etc. If you do not use the right encoding when decoding a page, you will get only meaningless characters mixed with some html tags. This is generally true for any Web page not written in English. However, Java does not handle the encoding issue automatically, so your code is responsible for detecting the encoding of each Web page.
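
To make the problem concrete, here is a minimal sketch of what a wrong encoding does to Chinese text. The class name CharsetMismatchDemo and the helper decode are hypothetical names used only for this illustration. It uses utf-8 and ISO-8859-1 because every Java runtime is guaranteed to ship them. The same effect appears with gbk, big5 and the other encodings listed above.

```java
import java.nio.charset.Charset;

class CharsetMismatchDemo {
    // Hypothetical helper: decode raw bytes with the named charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Two Chinese characters (zhong wen), written as Unicode escapes
        // so the source file compiles under any source encoding.
        String original = "\u4e2d\u6587";
        byte[] raw = original.getBytes(Charset.forName("UTF-8"));

        // Decoding with the encoding the bytes were written in restores the text.
        System.out.println(decode(raw, "UTF-8").equals(original));

        // Decoding with a different encoding yields one meaningless
        // character per byte instead of the original two characters.
        System.out.println(decode(raw, "ISO-8859-1").length());
    }
}
```

Each of the two characters takes three bytes in utf-8, so the wrongly decoded string is six characters of noise rather than two readable ones.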

The first step is to get an HTTP connection to the target url. This can be done using the following code.

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

public static HttpURLConnection getConnection(URL url)
    throws IOException {
    HttpURLConnection httpurlconnection = null;
    try {
        URLConnection urlconnection = url.openConnection();
        // Timeouts (in milliseconds) must be set before connect().
        urlconnection.setConnectTimeout(60000);
        urlconnection.setReadTimeout(60000);
        urlconnection.connect();

        // Only HTTP connections are supported.
        if (!(urlconnection instanceof HttpURLConnection)) {
            return null;
        }

        httpurlconnection = (HttpURLConnection) urlconnection;
        int responsecode = httpurlconnection.getResponseCode();
        switch (responsecode) {
        case HttpURLConnection.HTTP_OK:
        case HttpURLConnection.HTTP_MOVED_PERM:
        case HttpURLConnection.HTTP_MOVED_TEMP:
            break;
        default:
            System.err.println("Invalid response code: " +
                responsecode + " " + url);
            httpurlconnection.disconnect();
            return null;
        }
    } catch (IOException ioexception) {
        System.err.println("unable to connect: " + ioexception);
        // Release the connection before propagating the error.
        if (httpurlconnection != null) {
            httpurlconnection.disconnect();
        }
        throw ioexception;
    }
    return httpurlconnection;
}

The code first gets a URLConnection instance and then sets the timeout parameters (60 seconds each for connecting and for reading). These parameters must be set before calling the connect() method. It then calls the getResponseCode() method to get the response code. If the code is valid, it returns the connection cast to HttpURLConnection.
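
The response-code check can be read as a small predicate. The sketch below assumes, as the case labels in the listing suggest, that 200 (OK), 301 (moved permanently) and 302 (moved temporarily) are the codes treated as valid. The class ResponseCodeCheck and the method isAcceptable are hypothetical names, not part of the original code.

```java
import java.net.HttpURLConnection;

class ResponseCodeCheck {
    // Hypothetical helper: true for the response codes the article's
    // switch statement lists as valid, false for everything else.
    static boolean isAcceptable(int responseCode) {
        switch (responseCode) {
        case HttpURLConnection.HTTP_OK:          // 200
        case HttpURLConnection.HTTP_MOVED_PERM:  // 301
        case HttpURLConnection.HTTP_MOVED_TEMP:  // 302
            return true;
        default:
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isAcceptable(HttpURLConnection.HTTP_OK));
        System.out.println(isAcceptable(HttpURLConnection.HTTP_NOT_FOUND));
    }
}
```

Keeping the check in one place makes it easy to accept or reject further codes later without touching the connection logic.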

The next step is to get an InputStream from the HttpURLConnection. The code tries up to 3 times and returns null if every attempt fails.

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;

public static InputStream getInputStream(HttpURLConnection connection) {
    InputStream inputstream = null;
    // Try up to 3 times; stop at the first success.
    for (int i = 0; i < 3; ++i) {
        try {
            inputstream = connection.getInputStream();
            break;
        } catch (IOException e) {
            System.err.println("error opening connection " + e);
        }
    }
    return inputstream;
}

The third step is the most important one: it reads the content of the connection and detects the encoding at the same time. The code is as follows.

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.util.LinkedList;
import org.mozilla.universalchardet.UniversalDetector;

public static final int STREAM_BUFFER_SIZE = 4096;
public static final String DEFAULT_ENCODING = "utf-8";

// Returns { encoding, decoded content }, or null if no stream is available.
public static String[] getContent(HttpURLConnection connection)
    throws IOException {
    InputStream inputstream = null;
    try {
        LinkedList<byte[]> byteList = new LinkedList<byte[]>();
        LinkedList<Integer> byteLength = new LinkedList<Integer>();
        inputstream = getInputStream(connection);
        if (inputstream == null) {
            return null;
        }
        UniversalDetector detector = new UniversalDetector(null);
        byte[] buf = new byte[STREAM_BUFFER_SIZE];
        int nread = 0, nTotal = 0;
        // Phase 1: read chunks and feed them to the detector until it is done.
        while ((nread = inputstream.read(buf, 0, STREAM_BUFFER_SIZE)) > 0) {
            byteList.add(buf.clone());
            byteLength.add(nread);
            nTotal += nread;
            detector.handleData(buf, 0, nread);
            if (detector.isDone()) {
                break;
            }
        }
        detector.dataEnd();
        String encoding = detector.getDetectedCharset();
        detector.reset();
        if (encoding == null) {
            encoding = DEFAULT_ENCODING;
        }
        // Phase 2: read the remaining bytes without decoding them yet.
        while ((nread = inputstream.read(buf, 0, STREAM_BUFFER_SIZE)) > 0) {
            byteList.add(buf.clone());
            byteLength.add(nread);
            nTotal += nread;
        }
        // Concatenate all chunks, then decode the whole array once.
        byte[] contentByte = new byte[nTotal];
        int offSet = 0, l = byteList.size();
        for (int i = 0; i < l; ++i) {
            byte[] bytes = byteList.get(i);
            int length = byteLength.get(i);
            System.arraycopy(bytes, 0, contentByte, offSet, length);
            offSet += length;
        }
        return new String[] { encoding, new String(contentByte, encoding) };
    } catch (IOException ioe) {
        throw ioe;
    } finally {
        if (inputstream != null) {
            inputstream.close();
        }
    }
}

The encoding detection is achieved using a library called ‘juniversalchardet’. It is a Java implementation of ‘universalchardet’, which is the encoding detector library of Mozilla. To use the library, you need to construct an instance of org.mozilla.universalchardet.UniversalDetector and feed some data to the detector by calling UniversalDetector.handleData(). After notifying the detector of the end of data by calling UniversalDetector.dataEnd(), you can get the detected encoding name by calling UniversalDetector.getDetectedCharset(). Please refer to the juniversalchardet documentation for more details.

The getContent function reads bytes from the input stream and feeds them to the encoding detector, storing each chunk in a list as well. When the encoding detection is done, the function reads the remaining bytes and concatenates all the chunks into one array, which it then decodes using the detected encoding. There is a small trick here. You shouldn’t decode the remaining bytes on their own using the detected encoding, because the encoding detection may stop in the middle of a character (a Chinese character takes two or more bytes, for example two in gbk and three in utf-8). If the detection stops in the middle of a character, the remaining bytes start with the tail of that character followed by complete characters. Decoding these bytes separately will produce unreadable characters.
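
The effect of a chunk boundary falling inside a character can be reproduced with a short sketch. The class ChunkedDecodeDemo and its two methods are hypothetical names for illustration only. The sketch reads from an in-memory array rather than a network stream, uses utf-8 because it is always available, and uses a ByteArrayOutputStream as a simpler stand-in for the list of chunks used in getContent.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

class ChunkedDecodeDemo {
    // Problematic: decode each chunk separately. A character that is
    // split across two chunks is corrupted.
    static String decodePerChunk(byte[] raw, int chunkSize, Charset cs) {
        ByteArrayInputStream in = new ByteArrayInputStream(raw);
        StringBuilder sb = new StringBuilder();
        byte[] buf = new byte[chunkSize];
        int n;
        while ((n = in.read(buf, 0, chunkSize)) > 0) {
            sb.append(new String(buf, 0, n, cs));
        }
        return sb.toString();
    }

    // Safe: collect all bytes first, then decode the whole array once.
    static String decodeAtEnd(byte[] raw, int chunkSize, Charset cs) {
        ByteArrayInputStream in = new ByteArrayInputStream(raw);
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        byte[] buf = new byte[chunkSize];
        int n;
        while ((n = in.read(buf, 0, chunkSize)) > 0) {
            all.write(buf, 0, n);
        }
        return new String(all.toByteArray(), cs);
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        // Two Chinese characters, three bytes each in utf-8 (six bytes total).
        String original = "\u4e2d\u6587";
        byte[] raw = original.getBytes(utf8);
        // A chunk size of 4 splits the second character after its first byte.
        System.out.println(decodeAtEnd(raw, 4, utf8).equals(original));
        System.out.println(decodePerChunk(raw, 4, utf8).equals(original));
    }
}
```

With a chunk size of 4, the first chunk ends one byte into the second character, so per-chunk decoding replaces the broken pieces with the Unicode replacement character U+FFFD, while decoding once at the end restores the text.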

Finally, putting the above three functions together gives the following function, which reads the content from a specific url.

import java.net.HttpURLConnection;
import java.net.URL;

public static String getContent(URL url) {
    HttpURLConnection connection = null;
    try {
        connection = NetUtilities.getConnection(url);
        if (connection != null) {
            String[] resource = NetUtilities.getContent(connection);
            if (resource != null) {
                // resource[0] is the encoding, resource[1] the decoded page.
                return resource[1];
            }
        }
    } catch (Exception e) {
        // Ignored: the caller receives null when loading fails.
    } finally {
        if (connection != null) {
            connection.disconnect();
        }
    }
    return null;
}



