[Migrated from WordPress] Notes on Nutch: MySQL support in 2.2 and assorted problems

[2013.12.26]

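An earlier post covered UTF8 support for MySQL in Nutch 2.1; 2.2 now supports MySQL as well.

The earlier problem is still there, though. Someone has published a patch for gora-sql-mapping.xml.

Even after swapping in the patched file, the problem with the 767-byte index length remains: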

[root@AY131218101252507ad0Z local]#  bin/crawl urls dianping http://localhost:8983 2
InjectorJob: starting at 2013-12-26 11:33:20
InjectorJob: Injecting urlDir: urls
InjectorJob: org.apache.gora.util.GoraException: java.io.IOException: java.sql.SQLException: Index column size too large. The maximum column size is 767 bytes.
        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
        at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:75)
        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:221)
        at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)
Caused by: java.io.IOException: java.sql.SQLException: Index column size too large. The maximum column size is 767 bytes.
        at org.apache.gora.sql.store.SqlStore.createSchema(SqlStore.java:226)
        at org.apache.gora.sql.store.SqlStore.initialize(SqlStore.java:172)
        at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
        ... 7 more
Caused by: java.sql.SQLException: Index column size too large. The maximum column size is 767 bytes.
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
        at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
        at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
        at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
        at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
        at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2427)
        at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2345)
        at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2330)
        at org.apache.gora.sql.store.SqlStore.createSchema(SqlStore.java:224)
        ... 10 more

The MySQL documentation on InnoDB limits says the following:

By default, an index key for a single-column index can be up to 767 bytes. The same length limit applies to any index key prefix. See Section 13.1.13, “CREATE INDEX Syntax”. For example, you might hit this limit with a column prefix index of more than 255 characters on a TEXT or VARCHAR column, assuming a UTF-8 character set and the maximum of 3 bytes for each character. When the innodb_large_prefix configuration option is enabled, this length limit is raised to 3072 bytes, for InnoDB tables that use the DYNAMIC and COMPRESSED row formats.

When you attempt to specify an index prefix length longer than allowed, the length is silently reduced to the maximum length. This configuration option changes the error handling for some combinations of row format and prefix length longer than the maximum allowed. See innodb_large_prefix for details.

Enable this option to allow index key prefixes longer than 767 bytes (up to 3072 bytes), for InnoDB tables that use the DYNAMIC and COMPRESSED row formats. (Creating such tables also requires the option values innodb_file_format=barracuda and innodb_file_per_table=true.) See Section 14.3.9.7, “Limits on InnoDB Tables” for the relevant maximums associated with index key prefixes under various settings.

For tables using the REDUNDANT and COMPACT row formats, this option does not affect the allowed key prefix length. It does introduce a new error possibility. When this setting is enabled, attempting to create an index prefix with a key length greater than 3072 for a REDUNDANT or COMPACT table causes an error ER_INDEX_COLUMN_TOO_LONG (1727).
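The arithmetic behind the limits quoted above can be sketched as follows. This is only an illustration: the function names are made up for this post, and it assumes that a key length in the mapping file is counted in characters while InnoDB counts bytes.

```python
# Byte limits quoted from the InnoDB documentation above.
DEFAULT_LIMIT_BYTES = 767
LARGE_PREFIX_LIMIT_BYTES = 3072


def max_indexable_chars(limit_bytes: int, bytes_per_char: int) -> int:
    """Largest number of characters whose worst-case size fits in the limit."""
    return limit_bytes // bytes_per_char


def fits_in_index(column_chars: int, bytes_per_char: int,
                  limit_bytes: int = DEFAULT_LIMIT_BYTES) -> bool:
    """True if a column of column_chars characters fits in the index key limit."""
    return column_chars * bytes_per_char <= limit_bytes


if __name__ == "__main__":
    # UTF-8 with at most 3 bytes per character, as in the quoted text.
    print(max_indexable_chars(DEFAULT_LIMIT_BYTES, 3))
    print(max_indexable_chars(LARGE_PREFIX_LIMIT_BYTES, 3))
    # A 767-character key (hypothetical reading of the mapping file).
    print(fits_in_index(767, 3))
    print(fits_in_index(767, 3, LARGE_PREFIX_LIMIT_BYTES))
```

Under these assumptions a 767-character UTF-8 key needs up to 2301 bytes, so it only fits when the 3072-byte limit applies, that is, with innodb_large_prefix enabled and a DYNAMIC or COMPRESSED row format.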

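Frustratingly, my MySQL configuration meets all of these requirements, so indexes longer than 767 bytes should be allowed, and the primary key length defined in gora-sql-mapping is 767 as well. So why the error? The table I created by hand was created successfully, without any error, so the problem should not be on the MySQL side. Time to look for what is wrong in the Nutch setup.

It turns out the exception is thrown by this command during inject: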

 bin/nutch inject url -crawlId blah

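hadoop.log shows that inject tries to create the schema automatically and fails.
I then found that if I create the webpage table myself and do not set a batchId, everything works. If a batchId is set (passed as -crawlId in the command above), Nutch tries to create a table named batchId_webpage, and that is where the error occurs.

So I created this temporary table for it by hand, and after that the crawl ran normally.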
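The post does not say how the prefixed table was created. One possible way, sketched below, is to clone the structure of the webpage table that already exists using MySQL's CREATE TABLE ... LIKE. The helper names are hypothetical, `blah` is the ID from the inject command above, and whether the clone is accepted by your MySQL setup is an assumption you should verify before relying on it.

```python
import re

BASE_TABLE = "webpage"  # the table name used in this post


def prefixed_table_name(crawl_id: str, base: str = BASE_TABLE) -> str:
    """Build the '<id>_webpage' name described above, rejecting unsafe IDs."""
    if not re.fullmatch(r"[A-Za-z0-9_]+", crawl_id):
        raise ValueError(f"unsafe id for a table name: {crawl_id!r}")
    return f"{crawl_id}_{base}"


def create_like_statement(crawl_id: str, base: str = BASE_TABLE) -> str:
    """Return a MySQL statement that clones the structure of the base table."""
    table = prefixed_table_name(crawl_id, base)
    return f"CREATE TABLE IF NOT EXISTS `{table}` LIKE `{base}`;"


if __name__ == "__main__":
    # Print the statement so it can be reviewed and run in the mysql client.
    print(create_like_statement("blah"))
```

The script only prints the statement rather than executing it, so it can be checked by hand first. Once the table exists, inject should no longer need to create the schema itself.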

