Nutch 2.2.1 Installation and Deployment

Latest recommended article published on 2020-04-02 14:37:46

jcxch

http://www.promenade.me/archives/146

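When creating the webpage table by hand, change varchar(767) to varchar(255) or text.

There is a corresponding blog post, but it covers version 2.1, and the latest version, 2.2.1, has plenty of problems of its own. I strongly recommend reading this whole article before you start, so that you do not follow me down every wrong turn.

What follows is a step-by-step log of the configuration process.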

MySQL configuration:

CREATE DATABASE nutch DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_unicode_ci;
CREATE TABLE `webpage` (
`id` varchar(767) NOT NULL,
`headers` blob,
`text` mediumtext DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(767) DEFAULT NULL,
`content` longblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(767) DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
ROW_FORMAT=COMPRESSED
DEFAULT CHARSET=utf8mb4;

In ivy/ivy.xml, uncomment these two lines so that Gora supports MySQL:

<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>

In conf/gora.properties, fill in the database connection details:

gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=xxxxx
gora.sqlstore.jdbc.password=xxxxx

In conf/gora-sql-mapping.xml, replace the length of both primarykey entries. The id column is now stored as UTF-8, so the same key takes more bytes.
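The reason the key length matters comes down to byte arithmetic. The sketch below assumes the InnoDB limits documented for MySQL 5.5/5.6: an index key prefix of at most 767 bytes for the COMPACT and REDUNDANT row formats, and 3072 bytes for DYNAMIC or COMPRESSED tables when innodb_large_prefix is enabled. The class and method names are made up for illustration.

```java
final class KeyPrefixMath {
    // Assumed InnoDB index key prefix limits (MySQL 5.5/5.6), in bytes.
    static final int DEFAULT_PREFIX_BYTES = 767;
    static final int LARGE_PREFIX_BYTES = 3072;

    // Longest fully indexable varchar for a charset with the given
    // worst-case width per character (utf8 = 3 bytes, utf8mb4 = 4 bytes).
    static int maxIndexableChars(int prefixLimitBytes, int maxBytesPerChar) {
        return prefixLimitBytes / maxBytesPerChar;
    }

    public static void main(String[] args) {
        System.out.println("utf8, default limit:    " + maxIndexableChars(DEFAULT_PREFIX_BYTES, 3));
        System.out.println("utf8mb4, default limit: " + maxIndexableChars(DEFAULT_PREFIX_BYTES, 4));
        System.out.println("utf8mb4, large prefix:  " + maxIndexableChars(LARGE_PREFIX_BYTES, 4));
    }
}
```

Under these assumptions a 3-byte utf8 key tops out at 255 characters, which matches the varchar(255) fallback mentioned at the top of this post, while varchar(767) in utf8mb4 needs up to 3068 bytes and therefore only fits as a primary key with the large-prefix setting and a COMPRESSED (or DYNAMIC) row format.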

The remaining piece concerns crawling: configure conf/nutch-site.xml and add the crawler settings:

<property>
<name>http.agent.name</name>
<value>Ade's spider</value>
</property>

<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the “Accept-Language” request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>

<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….
</description>
</property>

Because Ivy still has to download the SQL connector and gora-sql, run the ant build once more.

Now the crawl can start:

cd ./runtime/local
mkdir -p urls
echo 'http://www.promenade.me' > urls/seed.txt
bin/nutch crawl urls -depth 3 -topN 5

You may run into a problem:

[root@AY131218101252507ad0Z local]# bin/nutch crawl urls -depth 3 -topN 5
InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 0
Exception in thread "main" java.lang.RuntimeException: job failed: name=generate: null, jobid=job_local177967844_0002
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:199)
    at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:152)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)

logs/hadoop.log reports that a null value was passed to the Utf8 class. There is a post online, a roundup of Nutch 2.0 installation and configuration exceptions, that explains this error.
Open nutch/src/java/org/apache/nutch/crawl/GeneratorReducer.java and look at around line 100:

batchId = new Utf8(conf.get(GeneratorJob.BATCH_ID));
// change this to
int randomSeed = Math.abs(new Random().nextInt());
String batchIdStr = (System.currentTimeMillis() / 1000) + "-" + randomSeed;
batchId = new Utf8(batchIdStr);
// don't forget to add this import at the top of the file
import java.util.Random;
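The patch above always replaces the configured batch id with a generated one. A slightly more defensive variant, sketched below under my own assumptions, keeps the configured value when it is present and only falls back to a generated id when it is null or empty. It uses plain String so that it runs without the Avro Utf8 class; inside GeneratorReducer the result would still be wrapped as new Utf8(...). The class and method names are hypothetical and not part of Nutch.

```java
import java.util.concurrent.ThreadLocalRandom;

final class BatchIdFallback {
    // Returns the configured batch id when present; otherwise builds
    // "<epoch seconds>-<non-negative random int>", the same shape as the patch.
    static String resolveBatchId(String configured) {
        if (configured != null && !configured.isEmpty()) {
            return configured;
        }
        // nextInt(bound) is always in [0, bound), so Math.abs is not needed.
        // Math.abs(Integer.MIN_VALUE) is still negative, which the original
        // expression can hit in rare cases.
        int randomSeed = ThreadLocalRandom.current().nextInt(Integer.MAX_VALUE);
        return (System.currentTimeMillis() / 1000) + "-" + randomSeed;
    }

    public static void main(String[] args) {
        System.out.println(resolveBatchId("1387500000-42")); // configured id is kept
        System.out.println(resolveBatchId(null));            // generated fallback
    }
}
```

Keeping the configured id matters if other jobs in the same crawl round look pages up by that id; whether that applies to your setup is something to verify against the Nutch source rather than take from this sketch.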

After that, recompile and run the crawl again. Another exception appears; check hadoop.log:

java.lang.Exception: java.lang.NoSuchMethodError: org.apache.gora.persistency.Persistent.getSchema()Lorg/apache/avro/Schema;
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.NoSuchMethodError: org.apache.gora.persistency.Persistent.getSchema()Lorg/apache/avro/Schema;
    at org.apache.gora.sql.store.SqlStore.put(SqlStore.java:591)
    at org.apache.gora.mapreduce.GoraRecordWriter.write(GoraRecordWriter.java:65)
    at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:638)
    at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
    at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:191)
    at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:88)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:701)

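Then I remembered that ivy/ivy.xml contains a note about exactly this.

So, on the line just above that note, change the gora-core version to 0.2.1. Compile again, run again... and, as expected, there is another problem. This time the error is:

Unknown column 'batchId' in 'field list'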

A quick check of the database shows what is wrong: this batchId should be the same batchId from the Utf8 error earlier, so simply add a column for it to the MySQL table.

CREATE TABLE `webpage` (
`id` varchar(767) NOT NULL,
`headers` blob,
`text` mediumtext DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(767) DEFAULT NULL,
`content` longblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(767) DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
`batchId` varchar(767) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
ROW_FORMAT=COMPRESSED
DEFAULT CHARSET=utf8mb4;

Run it once more and, surprisingly, it actually starts crawling...

=========================

Setting up Nutch 2.1 with MySQL to handle UTF-8

These instructions assume Ubuntu 12.04 and Java 6 or 7 installed and JAVA_HOME configured.

Install MySQL Server and MySQL Client using the Ubuntu software center or sudo apt-get install mysql-server mysql-client at the command line.

As MySQL defaults to latin (are we still in the 1990s?) we need to edit sudo vi /etc/mysql/my.cnf and under [mysqld] add

innodb_file_format=barracuda
innodb_file_per_table=true
innodb_large_prefix=true
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
max_allowed_packet=500M

The innodb options are to help deal with the small primary key size restriction of MySQL. Restart your machine for the changes to take effect. The max_allowed_packet option is so you don’t run into issues as your database and the pages you store in it get larger.

Check to make sure MySQL is running by typing sudo netstat -tap | grep mysql and you should see something like

tcp 0 0 localhost:mysql *:* LISTEN

We need to set up the nutch database manually, as the db schema currently generated by Nutch/Gora/MySQL defaults to latin. Log into MySQL at the command line with the MySQL user and password you set up earlier by typing

mysql -u xxxxx -p

then in the MySQL editor type the following:

CREATE DATABASE nutch DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_unicode_ci;

and enter followed by

use nutch;

and enter and then copy and paste the following altogether:

CREATE TABLE `webpage` ( `id` varchar(767) NOT NULL, `headers` blob, `text` mediumtext DEFAULT NULL, `status` int(11) DEFAULT NULL, `markers` blob, `parseStatus` blob, `modifiedTime` bigint(20) DEFAULT NULL, `score` float DEFAULT NULL, `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL, `baseUrl` varchar(767) DEFAULT NULL, `content` longblob, `title` varchar(2048) DEFAULT NULL, `reprUrl` varchar(767) DEFAULT NULL, `fetchInterval` int(11) DEFAULT NULL, `prevFetchTime` bigint(20) DEFAULT NULL, `inlinks` mediumblob, `prevSignature` blob, `outlinks` mediumblob, `fetchTime` bigint(20) DEFAULT NULL, `retriesSinceFetch` int(11) DEFAULT NULL, `protocolStatus` blob, `signature` blob, `metadata` blob, PRIMARY KEY (`id`) ) ENGINE=InnoDB ROW_FORMAT=COMPRESSED DEFAULT CHARSET=utf8mb4;

Then type enter. You are done setting up the MySQL database for Nutch.

Set up Nutch 2.1 by downloading the latest version from http://www.apache.org/dyn/closer.cgi/nutch/. Untar the contents of the file you just downloaded and going forward we will refer to this folder as ${APACHE_NUTCH_HOME}.

From inside the nutch folder ensure the MySQL dependency for Nutch is available by editing the following in ${APACHE_NUTCH_HOME}/ivy/ivy.xml

<!-- Uncomment this to use MySQL as database with SQL as Gora store. -->
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>

Edit the ${APACHE_NUTCH_HOME}/conf/gora.properties file either deleting or commenting out the Default SqlStore Properties using #. Then add the MySQL properties below replacing xxxxx with the user and password you set up when installing MySQL earlier.

###############################
# MySQL properties #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=xxxxx
gora.sqlstore.jdbc.password=xxxxx
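A typo in one of these keys tends to surface only later, as a connection failure during the crawl. As a quick sanity check, a small sketch like the following parses the properties text with java.util.Properties and reports which of the four keys above are missing or blank. The class name and the choice of required keys are my own, taken from the block above rather than from any Gora API.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

final class GoraPropertiesCheck {
    // The four keys shown in the MySQL properties block above.
    static final String[] REQUIRED = {
        "gora.sqlstore.jdbc.driver",
        "gora.sqlstore.jdbc.url",
        "gora.sqlstore.jdbc.user",
        "gora.sqlstore.jdbc.password"
    };

    // Returns the required keys that are absent or blank in the given text.
    static List<String> missingKeys(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalStateException("could not parse properties text", e);
        }
        List<String> missing = new ArrayList<String>();
        for (String key : REQUIRED) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String sample =
            "gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver\n"
          + "gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true\n"
          + "gora.sqlstore.jdbc.user=xxxxx\n";
        // The sample deliberately omits the password key.
        System.out.println("Missing keys: " + missingKeys(sample));
    }
}
```

This only checks that the keys are present; it does not open a database connection, so a wrong password or host would still need to be caught by logging in with the mysql client.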

Edit the ${APACHE_NUTCH_HOME}/conf/gora-sql-mapping.xml file changing the length of the primarykey from 512 to 767 in both places.
<primarykey column="id" length="767"/>

Configure ${APACHE_NUTCH_HOME}/conf/nutch-site.xml to put a name in the value field under http.agent.name. It can be anything but cannot be left blank. Add additional languages if you want (I have added Japanese ja-jp below) and utf-8 as default as well. You must specify SqlStore.

<property>
<name>http.agent.name</name>
<value>Your Nutch Spider</value>
</property>

<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the “Accept-Language” request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>

<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….
</description>
</property>

Install ant using the Ubuntu software center or sudo apt-get install ant at the command line.

From the command line, cd to your nutch folder and type ant runtime
This may take a few minutes to compile.

Start your first crawl by typing the lines below at the terminal (replace ‘http://nutch.apache.org/’ with whatever site you want to crawl):
cd ${APACHE_NUTCH_HOME}/runtime/local
mkdir -p urls
echo 'http://nutch.apache.org/' > urls/seed.txt
bin/nutch crawl urls -depth 3 -topN 5

You can easily add more urls to search by hand in seed.txt if you want. For the crawl, depth is the number of rounds of generate/fetch/parse/update you want to do (not depth of links as you might think at first) and topN is the max number of links you want to actually parse each round. Note however that Nutch keeps track of all links it encounters in the webpage table; topN only limits how many it actually parses, so don't be surprised to see many more rows in the webpage table than the topN limit would suggest.
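To make the interaction of depth and topN concrete, here is a toy model of the crawl rounds. It is not Nutch code: it assumes every fetched page yields a fixed number of brand-new outlinks (outlinksPerPage, a made-up parameter) and ignores duplicates, filters, and fetch failures.

```java
final class CrawlRounds {
    // Simulates `depth` rounds of generate/fetch/parse/update.
    // Returns {pages fetched, rows known in the webpage table}.
    static int[] simulate(int depth, int topN, int seeds, int outlinksPerPage) {
        int known = seeds;   // every URL ever seen gets a row
        int fetched = 0;     // URLs actually fetched and parsed
        for (int round = 0; round < depth; round++) {
            int unfetched = known - fetched;
            int batch = Math.min(topN, unfetched); // topN caps each round
            fetched += batch;
            known += batch * outlinksPerPage;      // newly discovered links
        }
        return new int[] { fetched, known };
    }

    public static void main(String[] args) {
        int[] result = simulate(3, 5, 1, 10);
        System.out.println("fetched=" + result[0] + ", known rows=" + result[1]);
        // prints "fetched=11, known rows=111"
    }
}
```

With one seed, depth 3 and topN 5, the first round can only fetch the seed itself, and the later rounds fetch 5 pages each, so 11 pages are parsed while the table already holds 111 rows under these assumptions. Real counts depend on the site being crawled.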

Check your crawl results by looking at the webpage table in the nutch database.
mysql -u xxxxx -p
use nutch;
SELECT * FROM nutch.webpage;

You should see the results of your crawl (around 159 rows). It will be hard to read the columns so you may want to install MySQL Workbench via sudo apt-get install mysql-workbench and use that instead for viewing the data. You may also want to run the following SQL command select * from webpage where status = 2; to limit the rows in the webpage table to only urls that were actually parsed.

Set up and index with Solr

If you are using Nutch 2.1 at this time you are on the bleeding edge and probably want the latest version of Solr 4.0 as well. Untar it to $HOME/apache-solr-4.0.0-XXXX. This folder will now be referred to as ${APACHE_SOLR_HOME}.
Download http://nlp.solutions.asia/wp-content/uploads/2012/08/schema.xml and use it to replace ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.

From the terminal start solr:
cd ${APACHE_SOLR_HOME}/example
java -jar start.jar

You can check this is running by opening http://localhost:8983/solr in your web browser.

Leave that terminal running and from a different terminal type the following:
cd ${APACHE_NUTCH_HOME}/runtime/local/
bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex

You can now run queries with Solr against your crawled content. Open http://localhost:8983/solr/#/collection1/query and, assuming you have crawled nutch.apache.org, enter text:nutch in the input box titled “q” to run a search. You should see matching pages in the results.
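The same kind of query can also be sent from the terminal. The sketch below only builds the request URL. The helper name solr_select_url is made up for illustration, and it assumes the stock Solr 4 select handler on the collection1 core with the standard wt and indent parameters.

```shell
#!/bin/sh
# Hypothetical helper: build the select URL for a Solr 4 core.
# No URL-encoding is done, so keep the query to simple field:value terms.
solr_select_url() {
  base="${1%/}"   # strip one trailing slash, if any
  core="$2"
  query="$3"
  echo "${base}/${core}/select?q=${query}&wt=json&indent=true"
}
```

With Solr running, something like curl "$(solr_select_url http://localhost:8983/solr collection1 text:nutch)" should return the matching documents as JSON.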

There remains a lot to configure to get a good web search going but you are at least started.

===========================================

Installing Nutch 2.2 with MySQL to handle UTF-8

Enough has changed from Nutch 2.1 to Nutch 2.2 to warrant an update to the installation instructions. These instructions assume Ubuntu 12.04 and Java 7 installed and JAVA_HOME configured.

Install MySQL Server and MySQL Client using the Ubuntu software center or sudo apt-get install mysql-server mysql-client at the command line.

As MySQL defaults to latin1, we need to edit /etc/mysql/my.cnf (for example with sudo vi /etc/mysql/my.cnf) and add the following under [mysqld]:

innodb_file_format=barracuda
innodb_file_per_table=true
innodb_large_prefix=true
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
max_allowed_packet=500M

The innodb options help work around MySQL's small primary key size restriction. The character set and collation settings are needed to handle Unicode correctly. The max_allowed_packet setting is optional and only necessary when storing very large values. Restart MySQL (or your machine) for the changes to take effect.
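Before restarting, it can help to confirm that none of the required settings were mistyped or left out. The following is a minimal sketch. The function name missing_mycnf_keys is made up for illustration, it skips the optional max_allowed_packet, and it only looks for "key=" at the start of a line without checking which [section] the key sits in.

```shell
#!/bin/sh
# Hypothetical helper: print each required setting that is missing from a
# MySQL option file. Prints nothing when all five are present.
missing_mycnf_keys() {
  cnf_file="$1"
  for key in innodb_file_format innodb_file_per_table innodb_large_prefix \
             character-set-server collation-server; do
    # Match "key=" at the start of a line, allowing surrounding whitespace.
    if ! grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$cnf_file"; then
      echo "$key"
    fi
  done
}
```

Running missing_mycnf_keys /etc/mysql/my.cnf should then print nothing if the file contains all of the lines listed above.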

Check to make sure MySQL is running by typing sudo netstat -tap | grep mysql and you should see something like

tcp 0 0 localhost:mysql *:* LISTEN

Log in to MySQL from the command line:

mysql -u xxxxx -p

Then, at the MySQL prompt, type the following:

CREATE DATABASE nutch DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_unicode_ci;

Press Enter, then type

use nutch;

Press Enter again, then copy and paste the following all together:

CREATE TABLE `webpage` (
  `id` varchar(767) NOT NULL,
  `headers` blob,
  `text` longtext DEFAULT NULL,
  `status` int(11) DEFAULT NULL,
  `markers` blob,
  `parseStatus` blob,
  `modifiedTime` bigint(20) DEFAULT NULL,
  `prevModifiedTime` bigint(20) DEFAULT NULL,
  `score` float DEFAULT NULL,
  `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
  `batchId` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
  `baseUrl` varchar(767) DEFAULT NULL,
  `content` longblob,
  `title` varchar(2048) DEFAULT NULL,
  `reprUrl` varchar(767) DEFAULT NULL,
  `fetchInterval` int(11) DEFAULT NULL,
  `prevFetchTime` bigint(20) DEFAULT NULL,
  `inlinks` mediumblob,
  `prevSignature` blob,
  `outlinks` mediumblob,
  `fetchTime` bigint(20) DEFAULT NULL,
  `retriesSinceFetch` int(11) DEFAULT NULL,
  `protocolStatus` blob,
  `signature` blob,
  `metadata` blob,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED DEFAULT CHARSET=utf8mb4;

Then press Enter. You are done setting up the MySQL database for Nutch.
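The reason a varchar(767) primary key works here comes down to simple arithmetic. utf8mb4 can use up to 4 bytes per character, so the id column can need 767 × 4 = 3068 bytes in the index. As far as I understand the MySQL 5.5/5.6 behaviour, the index key prefix limit is 767 bytes by default and 3072 bytes with innodb_large_prefix on a COMPRESSED or DYNAMIC row format. The helper names below are made up for illustration.

```shell
#!/bin/sh
# Worst-case byte length of an indexed VARCHAR column:
# number of characters times the maximum bytes per character.
index_bytes() {
  chars="$1"
  bytes_per_char="$2"
  echo $((chars * bytes_per_char))
}

# Succeeds when the column fits within the given index prefix limit (bytes).
fits_in_index() {
  [ "$(index_bytes "$1" "$2")" -le "$3" ]
}
```

For example, fits_in_index 767 4 3072 succeeds, while fits_in_index 767 4 767 fails, which matches the need for the innodb settings in my.cnf.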

Set up Nutch 2.2 by downloading the apache-nutch-2.2-src.tar.gz version from http://www.apache.org/dyn/closer.cgi/nutch/. Untar the contents of the file you just downloaded to a folder we will refer to going forward as ${APACHE_NUTCH_HOME}. In my case I prefer to use it with Eclipse, so I untar it in the Eclipse workspace, but this is not necessary.

From inside the nutch folder ensure the MySQL dependency for Nutch is available by editing the following in ${APACHE_NUTCH_HOME}/ivy/ivy.xml

change
<dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/>
to
<dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>

and uncomment the gora-sql dependency
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />

and uncomment the mysql connector
<!-- Uncomment this to use MySQL as database with SQL as Gora store. -->
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>

Edit the ${APACHE_NUTCH_HOME}/conf/gora-sql-mapping.xml file, changing the length of the primary key from 512 to 767 in both places.
<primarykey column="id" length="767"/>

Edit ${APACHE_NUTCH_HOME}/conf/nutch-site.xml and add the following properties:

<property>
<name>http.agent.name</name>
<value>YourNutchSpider</value>
</property>

<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>

Install ant using the Ubuntu software center or sudo apt-get install ant at the command line.

From the command line, cd to your nutch folder.

If you are using Eclipse, type ant eclipse. When that is finished, start up Eclipse and go to File -> Import -> Existing Projects into Workspace -> Browse and add ${APACHE_NUTCH_HOME}. Go to the new project in the Eclipse project explorer and scroll down until you find build.xml. Right click on build.xml and select Run As -> 1 Ant Build. This may take a little while to compile.

If you are not using Eclipse, cd to ${APACHE_NUTCH_HOME} and simply type ant runtime
This may take a few minutes to compile.

Start your first crawl by typing the lines below at the terminal (replace ‘http://nutch.apache.org/’ with whatever site you want to crawl):
Inject a URL into the DB
cd ${APACHE_NUTCH_HOME}/runtime/local
mkdir -p urls
echo 'http://nutch.apache.org/' > urls/seed.txt

Start crawling. You will want to create your own script later, but to see what is happening, type the following commands manually at the command line:
bin/nutch inject urls
bin/nutch generate -topN 20
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb

Repeat the last four commands (generate, fetch, parse and updatedb) again.

For the generate command, topN is the maximum number of links you want to actually parse each time. The first time there is only one URL (the one we injected from seed.txt), but after that there are many more. Note, however, that Nutch keeps track of all links it encounters in the webpage table. It only limits the number it actually parses to topN, so don't be surprised to see many more rows in the webpage table than the topN limit suggests.
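Once the manual steps are familiar, the repeated generate/fetch/parse/updatedb rounds can be wrapped in a small script. The sketch below is one way to do it, not an official Nutch script. The function names and the NUTCH variable are made up for illustration, and it assumes each command exits non-zero on failure.

```shell
#!/bin/sh
# Path to the nutch launcher; override NUTCH to point elsewhere.
NUTCH="${NUTCH:-bin/nutch}"

# One crawl round: the four commands from the text, stopping at the first failure.
run_round() {
  top_n="$1"
  "$NUTCH" generate -topN "$top_n" &&
  "$NUTCH" fetch -all &&
  "$NUTCH" parse -all &&
  "$NUTCH" updatedb
}

# Run the given number of rounds with the given topN.
crawl() {
  rounds="$1"
  top_n="$2"
  i=1
  while [ "$i" -le "$rounds" ]; do
    echo "round $i of $rounds"
    run_round "$top_n" || return 1
    i=$((i + 1))
  done
}
```

After bin/nutch inject urls, calling crawl 2 20 from ${APACHE_NUTCH_HOME}/runtime/local would correspond to running the four commands twice with -topN 20.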

Check your crawl results by looking at the webpage table in the nutch database.
mysql -u xxxxx -p
use nutch;
SELECT * FROM nutch.webpage;

You should see the results of your crawl (around 320 rows). It will be hard to read the columns so you may want to install MySQL Workbench via sudo apt-get install mysql-workbench and use that instead for viewing the data. You may also want to run the following SQL command select * from webpage where status = 2; to limit the rows in the webpage table to only urls that were actually parsed.
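If you would rather stay in the terminal, selecting only a few columns makes the output much easier to read than SELECT *. This is a small sketch. The function name, the column list and the LIMIT are illustrative choices, while status = 2 follows the query mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: build a query that returns a few readable columns
# for parsed pages only. The optional argument caps the number of rows.
parsed_pages_sql() {
  limit="${1:-20}"
  echo "SELECT id, status, title FROM webpage WHERE status = 2 LIMIT ${limit};"
}
```

It can be passed to the mysql client with the -e option, for example mysql -u xxxxx -p nutch -e "$(parsed_pages_sql 10)".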

You can easily add more URLs to crawl by hand in seed.txt and then run bin/nutch inject urls again.
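Adding seeds can also be scripted. The sketch below appends URLs to a seed file, one per line, and skips any that are already listed. The function name add_seeds is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: append URLs to a seed file, one per line,
# skipping any URL that is already present.
add_seeds() {
  seed_file="$1"
  shift
  mkdir -p "$(dirname "$seed_file")"
  touch "$seed_file"
  for url in "$@"; do
    # -x matches the whole line, -F treats the URL as a fixed string
    if ! grep -qxF -- "$url" "$seed_file"; then
      printf '%s\n' "$url" >> "$seed_file"
    fi
  done
}
```

For example, add_seeds urls/seed.txt 'http://nutch.apache.org/' followed by bin/nutch inject urls.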

Set up and index with Solr
If you are using Nutch 2.2 at this time you are on the bleeding edge and probably want the latest version of Solr 4 as well. Untar it to $HOME/apache-solr-4.X.X-XXXX. This folder will now be referred to as ${APACHE_SOLR_HOME}.
Download this link and use it to replace ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.

From the terminal start solr:
cd ${APACHE_SOLR_HOME}/example
java -jar start.jar

You can check this is running by opening http://localhost:8983/solr in your web browser. Select collection1 from the core selector.

Leave that terminal running and from a different terminal type the following:
cd ${APACHE_NUTCH_HOME}/runtime/local/
bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex

There remains a lot to configure to get a good web search going but you are at least started.