Nutch 2.2.1 + Eclipse + MySQL

Latest recommended article published on 2017-08-22 15:57:15

xzh199308

Article tags: mysql ant nutch svn eclipse

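A note up front: the screenshots for this setup were taken in MyEclipse, but the steps are the same in Eclipse. When compiling with ant, the JDK version should be 1.6.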

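1. Prepare the environment

The first step is of course to set up the development environment, which is not described in detail here.

You need jdk1.7, MyEclipse, SVN, ant, and two MyEclipse plugins, subclipse and IvyDE. Their update sites are http://subclipse.tigris.org/update_1.8.x and http://www.apache.org/dist/ant/ivyde/updatesite.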

2. Check out the project from SVN at https://svn.apache.org/repos/asf/nutch/tags/release-2.2.1

Click Finish to complete the import.

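3. In ivysettings.xml under the ivy directory, change the repository address to http://mirrors.ibiblio.org/maven2/ (this was the only address that worked for me; I could not reach any of the others)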
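A possible shape for this change is sketched below. It assumes the Maven repository root is defined through a property named `repo.maven.org` with an `override` attribute; both are assumptions about the file's layout rather than something shown in this post, so check them against your own copy before editing.

```xml
<!-- ivy/ivysettings.xml (excerpt). Assumption: the Maven repository root
     is set through a property named repo.maven.org; verify in your copy. -->
<property name="repo.maven.org"
          value="http://mirrors.ibiblio.org/maven2/"
          override="false"/>
```

Only the `value` should need to change; anything else in the file that refers to the property can stay as it is.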

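4. Edit ivy.xml under the ivy directory (to add the Java dependencies needed for MySQL access)

Change the gora-core version to 0.2.1, and uncomment the gora-sql and mysql-connector-java dependencies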
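After this edit, the relevant part of the file might look like the following sketch. The `rev` values shown for gora-sql and mysql-connector-java are assumptions; keep whatever revisions your checkout already lists and change only the gora-core revision.

```xml
<!-- ivy/ivy.xml (excerpt), a sketch rather than the exact file contents -->

<!-- gora-core revision changed to 0.2.1 -->
<dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>

<!-- previously commented out; rev is an assumption, keep the one in your file -->
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default"/>

<!-- previously commented out; rev is an assumption, keep the one in your file -->
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
```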

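5. cd into the project directory and run ant eclipse (running the ant build directly inside Eclipse seems to cause problems). Note: the JDK version must be 1.6; the build fails with newer versions

If the build fails because the archive "ant-eclipse-1.0.bin.tar.bz2" cannot be found, download it separately and then change the reference at build.xml line 876

Change <get src="http://downloads.sourceforge.net/project/ant-eclipse/ant-eclipse/1.0/ant-eclipse-1.0.bin.tar.bz2" dest="${build.dir}/ant-eclipse-1.0.bin.tar.bz2" usetimestamp="false" /> to <copy file="D:\ant-eclipse-1.0.bin.tar.bz2" todir="${build.dir}"/>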

6. Go back to the Eclipse project and refresh it; the directory structure will have changed

7. One error remains, and it is an encoding problem. Right-click the project, then Properties -> Resource -> Text file encoding -> UTF-8

8. Right-click the project, then Build Path -> Configure Build Path -> Order and Export; select the conf folder and move it to the top

9. Edit gora.properties in the conf folder to configure MySQL

#Default MySQL properties        #
###############################
gora.datastore.default=org.apache.gora.sql.store.SqlStore
gora.datastore.autocreateschema=true
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true&useUnicode=true&characterEncoding=utf8&autoReconnect=true&zeroDateTimeBehavior=convertToNull
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=123456

10. Create a folder named urls in the project directory, and inside it create a file named url containing the root URL to crawl, for example http://www.qq.com

11. Configure nutch-site.xml in the conf directory

<property>
  <name>http.agent.name</name>
  <value>YourNutchSpider</value>
 </property>
 
 <property>
  <name>http.accept.language</name>
  <value>ja-jp, en-us,en-gb,en,zh-cn,zh-tw;q=0.7,*;q=0.3</value>
  <description>Value of the "Accept-Language" request header field.
  This allows selecting a non-English language as the default one to retrieve.
  It is a useful setting for search engines built for a certain national group.</description>
 </property>
  
 <property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
  <description>The character encoding to fall back to when no other information
  is available</description>
 </property>
  
 <property>
  <name>plugin.folders</name>
  <value>src/plugin</value>
  <description>Directories where nutch plugins are located. Each
  element may be a relative or absolute path. If absolute, it is used
  as is. If relative, it is searched for on the classpath.</description>
 </property>

 <property>
  <name>generate.batch.id</name>
  <value>*</value>
 </property>
  
 <property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.sql.store.SqlStore</value>
  <description>The Gora DataStore class for storing and retrieving data.
  Currently the following stores are available: ….</description>
 </property>
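
For orientation, these `<property>` elements belong inside the file's single `<configuration>` root element, as in other Hadoop-style configuration files. A minimal sketch with two of the properties above (the remaining ones go alongside them in the same way):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- crawler name from the example above; pick your own value -->
  <property>
    <name>http.agent.name</name>
    <value>YourNutchSpider</value>
  </property>

  <!-- use the SQL-backed Gora store configured in gora.properties -->
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.sql.store.SqlStore</value>
  </property>
</configuration>
```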