Upload apache-nutch-2.2.1-src.zip to the Linux server
Path: /opt/nutch
Modify the configuration
Extract apache-nutch-2.2.1-src.tar.gz
$ wget http://archive.apache.org/dist/nutch/2.2.1/apache-nutch-2.2.1-src.tar.gz
$ tar zxf apache-nutch-2.2.1-src.tar.gz
Edit the database settings in gora.properties
2. Replace nutch-site.xml. Location: /opt/nutch/apache-nutch-2.2.1/conf
3. Replace regex-urlfilter.txt. Location: /opt/nutch/apache-nutch-2.2.1/conf
4. Replace ivy.xml. Location: /opt/nutch/apache-nutch-2.2.1/ivy
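These notes do not show the database settings that go into gora.properties. As a rough sketch only, assuming the Gora SQL store backed by a MySQL database, the entries might look like the ones written below. The database name nutch, the host, the user, and the password are placeholders, not values from these notes, and the OUT path is a hypothetical scratch file rather than the real conf/gora.properties.

```shell
# Sketch only: write example Gora SQL-store settings to a scratch file for review.
# Assumptions: a MySQL database named "nutch" on localhost; user and password are placeholders.
OUT="${OUT:-$(mktemp -d)/gora.properties.example}"

cat > "$OUT" <<'EOF'
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch
gora.sqlstore.jdbc.user=nutch_user
gora.sqlstore.jdbc.password=change_me
EOF

# Print the result so it can be compared with conf/gora.properties before editing that file.
echo "wrote $OUT"
cat "$OUT"
```

Writing to a scratch file first keeps the real configuration untouched until the values have been checked against the actual database.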
ant runtime
chmod a+x bin/nutch
Run bin/nutch crawl urls -depth 1
https://www.instagram.com/p/CHfspR7MBoI/?igshid=1g5a895xvy50k
The URLs to fetch are first read from the database table.
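To see which URLs are currently stored in the table, a query along the following lines may help. This is a sketch under assumptions: the helper name pending_urls_sql is hypothetical, the table name webpage_bigolive and the columns id, status, and fetchTime are taken from the gora-sql-mapping.xml shown later in these notes, and the function only prints the SQL text without connecting to any database.

```shell
# pending_urls_sql TABLE [LIMIT]: print a query listing stored URLs and their fetch state.
# Hypothetical helper; it builds SQL text only and does not touch the database.
pending_urls_sql() {
    table=$1
    limit=${2:-10}
    # Accept only plain identifiers and whole numbers, since both are pasted into SQL text.
    case $table in
        ''|*[!A-Za-z0-9_]*) echo "invalid table name: $table" >&2; return 1 ;;
    esac
    case $limit in
        ''|*[!0-9]*) echo "invalid limit: $limit" >&2; return 1 ;;
    esac
    printf 'SELECT id, status, fetchTime FROM %s ORDER BY fetchTime LIMIT %s;\n' "$table" "$limit"
}

pending_urls_sql webpage_bigolive 5
```

The printed statement could then be passed to a database client, for example the mysql command with its -e option if the store is MySQL.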
1. Copy Nutch to the target directory
cp -r /opt/nutch/apache-nutch-2.2.1 /opt/nutch/apache-nutch-2.2.1-bigolive
2. Change directory
cd /opt/nutch/apache-nutch-2.2.1-bigolive/runtime/local
3. Set execute permission
chmod a+x bin/nutch
4. Replace gora-sql-mapping.xml under /opt/nutch/apache-nutch-2.2.1-bigolive/conf with the following:
<?xml version="1.0" encoding="UTF-8"?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<gora-orm>
    <class name="org.apache.nutch.storage.WebPage" keyClass="java.lang.String" table="webpage_bigolive">
        <primarykey column="id" length="512"/>
        <field name="baseUrl" column="baseUrl" length="512"/>
        <field name="status" column="status"/>
        <field name="prevFetchTime" column="prevFetchTime"/>
        <field name="fetchTime" column="fetchTime"/>
        <field name="fetchInterval" column="fetchInterval"/>
        <field name="retriesSinceFetch" column="retriesSinceFetch"/>
        <field name="reprUrl" column="reprUrl" length="512"/>
        <field name="content" column="content" length="65536"/>
        <field name="contentType" column="typ" length="32"/>
        <field name="protocolStatus" column="protocolStatus"/>
        <field name="modifiedTime" column="modifiedTime"/>
        <field name="prevModifiedTime" column="prevModifiedTime"/>
        <field name="batchId" column="batchId" length="32"/>

        <!-- parse fields -->
        <field name="title" column="title" length="512"/>
        <field name="text" column="text" length="32000"/>
        <field name="parseStatus" column="parseStatus"/>
        <field name="signature" column="signature"/>
        <field name="prevSignature" column="prevSignature"/>

        <!-- score fields -->
        <field name="score" column="score"/>
        <field name="headers" column="headers"/>
        <field name="inlinks" column="inlinks"/>
        <field name="outlinks" column="outlinks"/>
        <field name="metadata" column="metadata"/>
        <field name="markers" column="markers"/>
    </class>

    <class name="org.apache.nutch.storage.Host" keyClass="java.lang.String" table="host">
        <primarykey column="id" length="512"/>
        <field name="metadata" column="metadata"/>
        <field name="inlinks" column="inlinks"/>
        <field name="outlinks" column="outlinks"/>
    </class>
</gora-orm>
<class name="org.apache.nutch.storage.WebPage" keyClass="java.lang.String" table="webpage_instagram">
Change the storage table to webpage_instagram, as in the line above. This table stores the fetched pages.
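The table rename can also be scripted instead of edited by hand. The following is a minimal sketch, assuming the mapping file still contains the table="webpage_bigolive" attribute shown above. The function name retarget_table is hypothetical and not part of Nutch.

```shell
# retarget_table FILE OLD NEW: point the WebPage mapping at a different table.
# Hypothetical helper; OLD and NEW are expected to be plain table names.
retarget_table() {
    file=$1
    old=$2
    new=$3
    # Keep a .bak copy of the original; the -i.bak form works with GNU and BSD sed.
    sed -i.bak "s/table=\"$old\"/table=\"$new\"/" "$file"
    # Show the rewritten line; grep fails if the new table name is not present.
    grep -n "table=\"$new\"" "$file"
}
```

It could be called as retarget_table conf/gora-sql-mapping.xml webpage_bigolive webpage_instagram from the Nutch copy's directory. Only the exact table="webpage_bigolive" attribute is rewritten, so the Host class mapping to the host table is left alone.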
5. Replace nutch-site.xml. Location: /opt/nutch/apache-nutch-2.2.1-bigolive/conf
6. Replace regex-urlfilter.txt. Location: /opt/nutch/apache-nutch-2.2.1-bigolive/conf
7. Build
Run ant runtime in /opt/nutch/apache-nutch-2.2.1-instagram
8. Create the urls/seed.txt file
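Creating the seed file can be sketched as follows. This assumes the file lives in a urls directory relative to where bin/nutch is run, and it uses the Instagram post URL quoted earlier in these notes as the only seed. The SEED_DIR variable is a hypothetical convenience for pointing the sketch at another location.

```shell
# Sketch only: create urls/seed.txt with one seed URL per line.
# SEED_DIR is a hypothetical override; by default the file goes to ./urls/seed.txt.
SEED_DIR="${SEED_DIR:-urls}"
mkdir -p "$SEED_DIR"

# Single quotes keep the ? and = in the URL from being interpreted by the shell.
printf '%s\n' 'https://www.instagram.com/p/CHfspR7MBoI/?igshid=1g5a895xvy50k' > "$SEED_DIR/seed.txt"

# Show what was written.
cat "$SEED_DIR/seed.txt"
```

Further seeds can be appended one per line with >> in place of >.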