Nutch 1.x (using Nutch 1.11 as the example)
Crawl web pages and store them locally
bin/crawl urls crawl 2
Build the index
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
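The two Nutch 1.x steps above can be chained in a small wrapper script. This is only a sketch: the variable names (SEED_DIR, CRAWL_DIR, ROUNDS, SOLR_URL) and the DRY_RUN switch are assumptions introduced here for illustration, with defaults taken from the commands above. With DRY_RUN=1 (the default) it only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: run the Nutch 1.x crawl and Solr indexing steps in sequence.
# SEED_DIR, CRAWL_DIR, ROUNDS, SOLR_URL and DRY_RUN are illustrative names.
SEED_DIR="${SEED_DIR:-urls}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"
ROUNDS="${ROUNDS:-2}"
SOLR_URL="${SOLR_URL:-http://127.0.0.1:8983/solr/}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command; execute it only when DRY_RUN is 0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

nutch1_crawl_and_index() {
    # Index only if the crawl step succeeded.
    run bin/crawl "$SEED_DIR" "$CRAWL_DIR" "$ROUNDS" &&
    run bin/nutch solrindex "$SOLR_URL" "$CRAWL_DIR/crawldb" \
        -linkdb "$CRAWL_DIR/linkdb" "$CRAWL_DIR"/segments/*
}
```

Call nutch1_crawl_and_index from the Nutch installation directory to preview the commands, and set DRY_RUN=0 once they look right.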
Nutch 2.x (using Nutch 2.2.1 as the example)
MySQL
Set the character encoding in my.ini or my.cnf:
[mysqld]
character-set-server=utf8
[client]
default-character-set=utf8
[mysql]
default-character-set=utf8
The mapping between Nutch fields and table columns is configured in gora-sql-mapping.xml.
To enable MySQL support in Ivy, add the following to ivy/ivy.xml:
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
<dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />
Configure the Nutch database connection in gora.properties:
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
# Replace xxxx with your MySQL user name and password.
gora.sqlstore.jdbc.user=xxxx
gora.sqlstore.jdbc.password=xxxx
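Before building, it can help to confirm that all four connection settings are actually present. The following is a minimal sketch, not part of Nutch or Gora: check_gora_props is a hypothetical helper name, and it only checks that each key appears on an uncommented line, not that the values are valid.

```shell
#!/bin/sh
# Sketch: report which of the four SqlStore settings are missing from a
# gora.properties file. check_gora_props is a hypothetical helper name.
check_gora_props() {
    props_file="$1"
    missing=0
    for key in driver url user password; do
        # A setting counts as present if a line starts with "<key>=".
        if ! grep -q "^gora\.sqlstore\.jdbc\.$key=" "$props_file"; then
            echo "missing: gora.sqlstore.jdbc.$key"
            missing=1
        fi
    done
    # Exit status 0 means all four keys were found.
    return "$missing"
}
```

For example, check_gora_props conf/gora.properties prints one line per missing key and returns a non-zero status if any are absent.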
Edit nutch-site.xml (save a copy of nutch-default.xml as nutch-site.xml and edit that copy), setting http.agent.name, storage.data.store.class, and so on.
Then add:
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available:.
</description>
</property>
<property>
<name>generate.batch.id</name>
<value>*</value>
</property>
Next, specify the sites to crawl (the seed URLs).
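The seed URLs go in plain text files, one URL per line, inside the directory passed to the crawl command (urls in these notes). A minimal sketch, assuming a file name of seed.txt and a placeholder URL, neither of which comes from the original notes:

```shell
#!/bin/sh
# Sketch: create a seed directory with one URL per line.
# seed.txt and the example URL are placeholders; substitute your own sites.
mkdir -p urls
cat > urls/seed.txt <<'EOF'
http://nutch.apache.org/
EOF
```

Run it from the Nutch installation directory so that the relative path urls matches the one used in the crawl command.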
Run the crawl, which writes the crawled data to the database:
bin/nutch crawl urls -depth 3 -topN 5