Implementing Web Page Snapshots with Nutch 1.3 and Solr 3.4 (Part 3)

Modify the reduce method as follows:

    public void reduce(Text key, Iterator<NutchWritable> values,
            OutputCollector<Text, NutchDocument> output, Reporter reporter)
            throws IOException {
        Inlinks inlinks = null;
        CrawlDatum dbDatum = null;
        CrawlDatum fetchDatum = null;
        ParseData parseData = null;
        ParseText parseText = null;

        // raw page bytes, added to the document for the snapshot feature
        byte[] cache_content = null;

        while (values.hasNext()) {
            final Writable value = values.next().get(); // unwrap
            if (value instanceof Inlinks) {
                inlinks = (Inlinks) value;
            } else if (value instanceof CrawlDatum) {
                final CrawlDatum datum = (CrawlDatum) value;
                if (CrawlDatum.hasDbStatus(datum))
                    dbDatum = datum;
                else if (CrawlDatum.hasFetchStatus(datum)) {
                    // don't index unmodified (empty) pages
                    if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)
                        fetchDatum = datum;
                } else if (CrawlDatum.STATUS_LINKED == datum.getStatus()
                        || CrawlDatum.STATUS_SIGNATURE == datum.getStatus()
                        || CrawlDatum.STATUS_PARSE_META == datum.getStatus()) {
                    continue;
                } else {
                    throw new RuntimeException("Unexpected status: "
                            + datum.getStatus());
                }
            } else if (value instanceof ParseData) {
                parseData = (ParseData) value;
            } else if (value instanceof ParseText) {
                parseText = (ParseText) value;
            } else if (value instanceof Content) {
                // Content is org.apache.nutch.protocol.Content;
                // import it if the class does not already do so
                cache_content = ((Content) value).getContent();
            } else if (LOG.isWarnEnabled()) {
                LOG.warn("Unrecognized type: " + value.getClass());
            }
        }

        if (fetchDatum == null || dbDatum == null || parseText == null
                || parseData == null) {
            return; // only have inlinks
        }

        if (!parseData.getStatus().isSuccess()
                || fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
            return;
        }

        NutchDocument doc = new NutchDocument();
        final Metadata metadata = parseData.getContentMeta();

        // add segment, used to map from merged index back to segment files
        doc.add("segment", metadata.get(Nutch.SEGMENT_NAME_KEY));

        // add digest, used by dedup
        doc.add("digest", metadata.get(Nutch.SIGNATURE_KEY));

        // add the raw page content collected in the loop above
        doc.add("cache_content", cache_content);

        final Parse parse = new ParseImpl(parseText, parseData);
        try {
            // extract information from dbDatum and pass it to
            // fetchDatum so that indexing filters can use it
            final Text url = (Text) dbDatum.getMetaData().get(
                    Nutch.WRITABLE_REPR_URL_KEY);
            if (url != null) {
                fetchDatum.getMetaData().put(Nutch.WRITABLE_REPR_URL_KEY, url);
            }
            // run indexing filters
            doc = this.filters.filter(doc, parse, key, fetchDatum, inlinks);
        } catch (final IndexingException e) {
            if (LOG.isWarnEnabled()) {
                LOG.warn("Error indexing " + key + ": " + e);
            }
            return;
        }

        // skip documents discarded by indexing filters
        if (doc == null)
            return;

        float boost = 1.0f;
        // run scoring filters
        try {
            boost = this.scfilters.indexerScore(key, doc, dbDatum, fetchDatum,
                    parse, inlinks, boost);
        } catch (final ScoringFilterException e) {
            if (LOG.isWarnEnabled()) {
                LOG.warn("Error calculating score " + key + ": " + e);
            }
            return;
        }
        // apply boost to all indexed fields.
        doc.setWeight(boost);
        // store boost for use by explain and dedup
        doc.add("boost", Float.toString(boost));

        output.collect(key, doc);
    }
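
The type-based dispatch in the loop above can be tried out in isolation. The following is a minimal sketch, not Nutch code: FakeContent and FakeParseText are hypothetical stand-ins for Nutch's Content and ParseText, and collect() mimics only the branching that captures the raw page bytes destined for the cache_content field.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class SnapshotSketch {
    // Hypothetical stand-in for Nutch's Content (not the real class).
    static final class FakeContent {
        private final byte[] bytes;
        FakeContent(byte[] bytes) { this.bytes = bytes; }
        byte[] getContent() { return bytes; }
    }

    // Hypothetical stand-in for Nutch's ParseText (not the real class).
    static final class FakeParseText {
        private final String text;
        FakeParseText(String text) { this.text = text; }
        String getText() { return text; }
    }

    // Holds what the loop collected for one key.
    static final class Collected {
        byte[] cacheContent; // raw page bytes, null if no content value was seen
        String parseText;    // extracted text, null if no parse text was seen
        int unrecognized;    // number of values of any other type
    }

    // Walks the values once and sorts them by runtime type,
    // the same shape as the while loop in reduce().
    static Collected collect(Iterator<Object> values) {
        Collected c = new Collected();
        while (values.hasNext()) {
            Object value = values.next();
            if (value instanceof FakeParseText) {
                c.parseText = ((FakeParseText) value).getText();
            } else if (value instanceof FakeContent) {
                c.cacheContent = ((FakeContent) value).getContent();
            } else {
                c.unrecognized++;
            }
        }
        return c;
    }

    public static void main(String[] args) {
        List<Object> values = Arrays.<Object>asList(
                new FakeParseText("hello"),
                new FakeContent("<html>hello</html>".getBytes(StandardCharsets.UTF_8)),
                Integer.valueOf(42));
        Collected c = collect(values.iterator());
        System.out.println(new String(c.cacheContent, StandardCharsets.UTF_8));
        System.out.println(c.parseText + " / unrecognized=" + c.unrecognized);
    }
}
```

In this sketch cacheContent stays null when no content value arrives for a key. The same holds for cache_content in reduce(), so a null check before adding the field may be worth considering.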

That completes the code changes. Next, the configuration files need to be modified.


This article is reposted from william_xu's 51CTO blog. Original link: http://blog.51cto.com/williamx/722719. To republish it, please contact the original author.
