How Nutch 2 Writes a WebPage to the Database: A Walkthrough

Version: Nutch 2.2.1

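This article follows InjectorJob to trace the whole life cycle of a WebPage: how it is defined, created, passed along, serialized, and written to the database. The key lines of source code are excerpted, each annotated with its file name and line number.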

1. Defining the schema
The schema is hard-coded in the source:
//file: org/apache/nutch/storage/WebPage.java
//line: 42
public class WebPage extends PersistentBase {
  public static final Schema _SCHEMA = Schema.parse("{\"type\":\"record\",\"name\":\"WebPage\",\"namespace\":\"org.apache.nutch.storage\",\"fields\":[{\"name\":\"baseUrl\",\"type\":\"string\"},{\"name\":\"status\",\"type\":\"int\"},{\"name\":\"fetchTime\",\"type\":\"long\"},{\"name\":\"prevFetchTime\",\"type\":\"long\"},{\"name\":\"fetchInterval\",\"type\":\"int\"},{\"name\":\"retriesSinceFetch\",\"type\":\"int\"},{\"name\":\"modifiedTime\",\"type\":\"long\"},{\"name\":\"prevModifiedTime\",\"type\":\"long\"},{\"name\":\"protocolStatus\",\"type\":{\"type\":\"record\",\"name\":\"ProtocolStatus\",\"fields\":[{\"name\":\"code\",\"type\":\"int\"},{\"name\":\"args\",\"type\":{\"type\":\"array\",\"items\":\"string\"}},{\"name\":\"lastModified\",\"type\":\"long\"}]}},{\"name\":\"content\",\"type\":\"bytes\"},{\"name\":\"contentType\",\"type\":\"string\"},{\"name\":\"prevSignature\",\"type\":\"bytes\"},{\"name\":\"signature\",\"type\":\"bytes\"},{\"name\":\"title\",\"type\":\"string\"},{\"name\":\"text\",\"type\":\"string\"},{\"name\":\"parseStatus\",\"type\":{\"type\":\"record\",\"name\":\"ParseStatus\",\"fields\":[{\"name\":\"majorCode\",\"type\":\"int\"},{\"name\":\"minorCode\",\"type\":\"int\"},{\"name\":\"args\",\"type\":{\"type\":\"array\",\"items\":\"string\"}}]}},{\"name\":\"score\",\"type\":\"float\"},{\"name\":\"reprUrl\",\"type\":\"string\"},{\"name\":\"headers\",\"type\":{\"type\":\"map\",\"values\":\"string\"}},{\"name\":\"outlinks\",\"type\":{\"type\":\"map\",\"values\":\"string\"}},{\"name\":\"inlinks\",\"type\":{\"type\":\"map\",\"values\":\"string\"}},{\"name\":\"markers\",\"type\":{\"type\":\"map\",\"values\":\"string\"}},{\"name\":\"metadata\",\"type\":{\"type\":\"map\",\"values\":\"bytes\"}},{\"name\":\"batchId\",\"type\":\"string\"}]}");
//..
public Schema getSchema() { return _SCHEMA; }
//...
}

The schema is a JSON string, and Avro is responsible for parsing it.

2. Passing the schema
This happens during initialization, before the job is submitted.
//file: org/apache/nutch/crawl/InjectorJob.java
//InjectorJob.run(Map<String,Object>) line: 221   
{ 
    DataStore<String, WebPage> store = StorageUtils.createWebStore(currentJob.getConfiguration(),
      String.class, WebPage.class);
} 


The persistentClass is passed down layer by layer:

//file: gora-core-0.2.1/org/apache/gora/store/DataStoreFactory.java
//DataStoreFactory.createDataStore(Class<D>, Class<K>, Class<T>, Configuration, String) line: 135  
{
    return createDataStore(dataStoreClass, keyClass, persistent, conf, createProps(), schemaName);
} 


Gora calls WebPage.getSchema() to obtain the Schema:
//file: gora-core-0.2.1/org/apache/gora/store/DataStoreBase.java
//SqlStore<K,T>(DataStoreBase<K,T>).initialize(Class<K>, Class<T>, Properties) line: 81 
{   
    schema = this.beanFactory.getCachedPersistent().getSchema();
    fieldMap = AvroUtils.getFieldMap(schema);
} 
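
The initialization step above can be illustrated with a small sketch. It uses only the JDK and does not touch Gora or Avro: SketchPersistent, SketchPage and SketchStore are hypothetical stand-ins for the persistent bean, WebPage and the data store, and the "schema" is reduced to a list of field names. The field map here maps a name to its position, which is a simplification chosen for this sketch.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a persistent bean: it only exposes its field names in schema order.
interface SketchPersistent {
    List<String> fieldNames();
}

// Hypothetical stand-in for WebPage, with a hard-coded "schema" of three fields.
class SketchPage implements SketchPersistent {
    private static final List<String> FIELDS = Arrays.asList("baseUrl", "status", "fetchTime");

    @Override
    public List<String> fieldNames() {
        return FIELDS;
    }
}

// Hypothetical stand-in for the store: it learns the schema from the class it is given.
class SketchStore<T extends SketchPersistent> {
    private List<String> schema;
    private Map<String, Integer> fieldMap;

    void initialize(Class<T> persistentClass) {
        try {
            // Create one instance of the bean only to ask it for its schema.
            T cached = persistentClass.getDeclaredConstructor().newInstance();
            schema = cached.fieldNames();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + persistentClass, e);
        }
        // Index every field name by its position in the schema.
        fieldMap = new LinkedHashMap<>();
        for (int i = 0; i < schema.size(); i++) {
            fieldMap.put(schema.get(i), i);
        }
    }

    Map<String, Integer> getFieldMap() {
        return fieldMap;
    }
}

class SchemaPassingDemo {
    public static void main(String[] args) {
        SketchStore<SketchPage> store = new SketchStore<>();
        store.initialize(SketchPage.class);
        System.out.println(store.getFieldMap());
    }
}
```

The point of the sketch is that only the Class object travels through the factory layers; the schema itself is recovered from the class at the very end.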



3. Passing the data and serializing it
This happens in the map phase.

The map method creates a WebPage (row) and finally writes it to the context:
//file: org/apache/nutch/crawl/InjectorJob.java
//InjectorJob$UrlMapper.map(LongWritable, Text, Mapper<LongWritable,Text,String,Contex>) line: 191 
{   
      context.write(reversedUrl, row);
} 
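
The key written to the context is a reversed URL. The article does not show how that key is built, so the following is only a sketch under an assumed format: the host labels are reversed, followed by the scheme, the optional port, and then the path and query. ReverseUrlSketch and its reverseUrl method are hypothetical names and are not the Nutch utility itself.

```java
import java.net.URI;

class ReverseUrlSketch {
    // Assumed key format: reversed.host:scheme[:port]/path[?query]
    static String reverseUrl(String url) {
        URI uri = URI.create(url);
        String[] labels = uri.getHost().split("\\.");
        StringBuilder buf = new StringBuilder();
        // Append the host labels from last to first, e.g. bar.foo.com -> com.foo.bar
        for (int i = labels.length - 1; i >= 0; i--) {
            buf.append(labels[i]);
            if (i > 0) {
                buf.append('.');
            }
        }
        buf.append(':').append(uri.getScheme());
        if (uri.getPort() != -1) {
            buf.append(':').append(uri.getPort());
        }
        if (uri.getRawPath() != null) {
            buf.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            buf.append('?').append(uri.getRawQuery());
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseUrl("http://bar.foo.com:8983/to/index.html?a=b"));
    }
}
```

Keys of this shape sort pages of the same domain next to each other, which is the usual reason for reversing the host.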


Hadoop core passes the WebPage down layer by layer:
//file: hadoop-src/org/apache/hadoop/mapred/MapTask.java
//MapTask$NewDirectOutputCollector<K,V>.write(K, V) line: 638    
{
      reporter.progress();
      long bytesOutPrev = getOutputBytes(fsStats);
      out.write(key, value);
} 

The out object above is of type GoraRecordWriter:

//file: gora-core-0.2.1/org/apache/gora/mapreduce/GoraRecordWriter.java
//GoraRecordWriter<K,T>.write(K, T) line: 60   
{ 
    store.put(key, (Persistent) value);
}


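The actual type of store is SqlStore, which extends the DataStoreBase class from gora-core and handles reads and writes to MySQL. K is the primary key and T is a WebPage object. The write does not go to the database immediately; it is first placed in a cache.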

//file: gora-sql-0.1.1-incubating/org/apache/gora/sql/store/SqlStore.java
//SqlStore<K,T>.put(K, T) line: 616  
 
  public void put(K key, T persistent)
  {     
      List<Field> fields = schema.getFields();

      for (int i = 0; i < fields.size(); i++) {
        Field field = fields.get(i);
        Column column = mapping.getColumn(field.name());
        insertStatement.setObject(persistent.get(i), field.schema(), column);
      }

      //jdbc already should cache the ps
      PreparedStatement insert = insertStatement.toStatement(connection);
      synchronized (writeCache) {
        writeCache.add(insert);
      }

  }
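
The loop in put() pairs each schema field with the record value at the same index. A minimal sketch of that pairing, using only the JDK: SketchInsertStatement and PutSketch are hypothetical names, the column mapping is assumed to be one column per field with the same name, and the generated SQL is a plain INSERT chosen for illustration rather than the statement gora-sql actually produces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified insert statement: column names plus the values bound to them.
class SketchInsertStatement {
    private final String table;
    private final List<String> columns = new ArrayList<>();
    private final List<Object> values = new ArrayList<>();

    SketchInsertStatement(String table) {
        this.table = table;
    }

    void setObject(String column, Object value) {
        columns.add(column);
        values.add(value);
    }

    // Render parameterized SQL with one '?' per bound column.
    String toSql() {
        StringBuilder placeholders = new StringBuilder();
        for (int i = 0; i < columns.size(); i++) {
            placeholders.append(i == 0 ? "?" : ", ?");
        }
        return "INSERT INTO " + table + " (" + String.join(", ", columns)
                + ") VALUES (" + placeholders + ")";
    }

    List<Object> getValues() {
        return values;
    }
}

class PutSketch {
    // fields holds the names in schema order; record holds the values at the same indexes.
    static SketchInsertStatement bind(String table, List<String> fields, Object[] record) {
        SketchInsertStatement stmt = new SketchInsertStatement(table);
        for (int i = 0; i < fields.size(); i++) {
            stmt.setObject(fields.get(i), record[i]);
        }
        return stmt;
    }

    public static void main(String[] args) {
        SketchInsertStatement stmt = bind("webpage",
                Arrays.asList("baseUrl", "status"),
                new Object[] {"http://example.com/", 0});
        System.out.println(stmt.toSql());
    }
}
```

Because values are looked up by position, the order of fields in the schema and the order of values in the record must agree.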


toStatement() calls setField(). The serialization itself is implemented by Avro and is not examined further here:
//file: gora-sql-0.1.1-incubating/org/apache/gora/sql/store/SqlStore.java
//SqlStore<K,T>.setField(PreparedStatement, Column, Schema, int, Object) line: 718
{
   IOUtils.serialize(os, datumWriter, schema, object);
} 
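
For readers who want a feel for what the serialized bytes look like, here is a JDK-only sketch of two primitives of Avro's binary encoding, assuming that is the encoding in use at this point: a long is zig-zag encoded and written as a variable-length integer, and a string is written as its UTF-8 length followed by the bytes. AvroEncodingSketch and its methods are hypothetical names; real code would use Avro's own encoder.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class AvroEncodingSketch {
    // Zig-zag encode, then emit 7 bits per byte, least significant group first.
    static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80)); // high bit set: more bytes follow
            z >>>= 7;
        }
        out.write((int) z);
    }

    // A string is its UTF-8 byte length (as a long) followed by the UTF-8 bytes.
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, "foo");
        System.out.println(hex(out.toByteArray()));
    }
}
```

The resulting byte array is the kind of value that can then be bound to a binary column of a prepared statement.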



4. The flush operation
//file: hadoop-src/org/apache/hadoop/mapred/MapTask.java
//MapTask.runNewMapper(JobConf, TaskSplitIndex, TaskUmbilicalProtocol, TaskReporter) line: 767
{ 
    output.close(mapperContext);
}

//file: gora-core-0.2.1/org/apache/gora/mapreduce/GoraRecordWriter.java
//GoraRecordWriter<K,T>.close(TaskAttemptContext) line: 55    
{
    store.close();
}

Below is the flush() method, which is called inside SqlStore.close():
//file: gora-sql-0.1.1-incubating/org/apache/gora/sql/store/SqlStore.java
//SqlStore<K,T>.flush() line: 342
{
    connection.commit();
} 

At this point the WebPage has been written to the MySQL database (through JDBC underneath).
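
The overall pattern traced in this article, where put() only queues a write and the data reaches the database when the writer is closed, can be sketched in a few lines. BufferedStoreSketch is a hypothetical class; an in-memory list stands in for the JDBC connection, and moving the queued rows into it stands in for executing the cached statements and committing.

```java
import java.util.ArrayList;
import java.util.List;

class BufferedStoreSketch implements AutoCloseable {
    // Pending writes, guarded by their own monitor as in the put() excerpt.
    private final List<String> writeCache = new ArrayList<>();
    // Stand-in for rows that have been committed to the database.
    private final List<String> committedRows = new ArrayList<>();

    void put(String key, String value) {
        synchronized (writeCache) {
            writeCache.add(key + "=" + value);
        }
    }

    // Push every pending write to the "database" and empty the cache.
    void flush() {
        synchronized (writeCache) {
            committedRows.addAll(writeCache);
            writeCache.clear();
        }
    }

    // Closing the store flushes whatever is still pending.
    @Override
    public void close() {
        flush();
    }

    List<String> getCommittedRows() {
        return committedRows;
    }

    public static void main(String[] args) {
        BufferedStoreSketch store = new BufferedStoreSketch();
        store.put("com.example:http/", "row1");
        System.out.println("before close: " + store.getCommittedRows().size());
        store.close();
        System.out.println("after close: " + store.getCommittedRows().size());
    }
}
```

A consequence of this design is that a task which dies before close() is reached may leave queued rows unwritten.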

