Heritrix study notes

Fields in the MirrorWriterProcessor class:

/**
     * Key to use asking settings for character map.
     */
    public static final String ATTR_CHAR_MAP = "character-map";


addElementToDefinition(new StringList(ATTR_CHAR_MAP,
            "This list is grouped in pairs. "
            + "The first string in each pair must have a length of one. "
            + "If it occurs in a URI path, "
            + "it is replaced by the second string in the pair. "
            + "For UNIX, no character mapping is normally needed. "
            + "For Macintosh, the recommended value is [: %%3A]. "
            + "For Windows, the recommended value is "
            + "[' ' %%20  &quot; %%22  * %%2A  : %%3A  < %%3C "
            + "\\> %%3E ? %%3F  \\\\ %%5C  ^ %%5E  | %%7C]."));

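As I read the description above, the purpose of this field is: if a URI path contains a special character such as ' ', ", *, : or <, the character is replaced by the second string of its pair (%%20, %%22, %%2A, and so on).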
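The pair-list idea can be sketched as follows. This is only an illustration of the rule in the description, not the Heritrix implementation: the class name `CharMapSketch` and both method names are hypothetical, and the replacement strings use a single `%` as an assumption.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Heritrix.
final class CharMapSketch {
    /** Turns a flat [from, to, from, to, ...] list into a lookup table. */
    static Map<Character, String> toMap(List<String> pairs) {
        if (pairs.size() % 2 != 0) {
            throw new IllegalArgumentException("list must be grouped in pairs");
        }
        Map<Character, String> map = new HashMap<>();
        for (int i = 0; i < pairs.size(); i += 2) {
            String from = pairs.get(i);
            if (from.length() != 1) {
                throw new IllegalArgumentException("first string must have length 1: " + from);
            }
            map.put(from.charAt(0), pairs.get(i + 1));
        }
        return map;
    }

    /** Replaces every mapped character in a path segment; others pass through. */
    static String mapChars(String segment, Map<Character, String> map) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < segment.length(); i++) {
            char c = segment.charAt(i);
            String replacement = map.get(c);
            out.append(replacement != null ? replacement : String.valueOf(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Character, String> map =
            toMap(Arrays.asList(" ", "%20", ":", "%3A", "*", "%2A"));
        System.out.println(mapChars("a b:c*", map)); // prints a%20b%3Ac%2A
    }
}
```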

addElementToDefinition(new StringList(ATTR_CONTENT_TYPE_MAP,
            "This list is grouped in pairs. "
            + "If the content type of a resource begins (case-insensitive) "
            + "with the first string in a pair, the suffix is set to "
            + "the second string in the pair, replacing any suffix that may "
            + "have been in the URI.  For example, to force all HTML files "
            + "to have the same suffix, use [text/html html]."));

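This field: if the content type begins (case-insensitive) with the first string of a pair, the file suffix is replaced by the second string of that pair.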
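A minimal sketch of the prefix match described above, assuming the pair list is scanned in order and the first match wins (the description does not say which pair wins). `ContentTypeSuffixSketch` and `suffixFor` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Heritrix.
final class ContentTypeSuffixSketch {
    /** Returns the mapped suffix, or null when no pair matches. */
    static String suffixFor(String contentType, List<String> pairs) {
        if (contentType == null) {
            return null;
        }
        String ct = contentType.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 1 < pairs.size(); i += 2) {
            // "begins with", not "equals": parameters such as charset may follow.
            if (ct.startsWith(pairs.get(i).toLowerCase(Locale.ROOT))) {
                return pairs.get(i + 1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> pairs = Arrays.asList("text/html", "html");
        System.out.println(suffixFor("Text/HTML; charset=UTF-8", pairs)); // prints html
    }
}
```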

e = addElementToDefinition(new SimpleType(ATTR_DIRECTORY_FILE,
            "Implicitly append this to a URI ending with '/'.",
            "index.html"));

If a URI ends with '/', index.html is appended so that the resource can be stored as a file.

For example, a 58 group-buying (58团购) link: http://t.58.com/xm/68526387987971009/?linkid=xm_liebiao_home_1

The local file looks like this (the screenshot is not preserved here).
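To make the directory-file rule concrete, here is a small sketch that applies it to the path part of the link above. It covers only the "ends with '/'" rule; how the query string ends up in the file name is left out, and `DirectoryFileSketch` is a hypothetical name.

```java
import java.net.URI;

// Hypothetical helper, not part of Heritrix.
final class DirectoryFileSketch {
    /** Appends the directory file name when the path ends with '/'. */
    static String withDirectoryFile(String path, String directoryFile) {
        return path.endsWith("/") ? path + directoryFile : path;
    }

    public static void main(String[] args) {
        URI uri = URI.create(
            "http://t.58.com/xm/68526387987971009/?linkid=xm_liebiao_home_1");
        // getPath() excludes the query, so the path here ends with '/'.
        System.out.println(withDirectoryFile(uri.getPath(), "index.html"));
        // prints /xm/68526387987971009/index.html
    }
}
```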

 e = addElementToDefinition(new SimpleType(ATTR_DOT_BEGIN,
            "If a segment starts with '.', the '.' is replaced by this.",
            DEFAULT_DOT_BEGIN));
       addElementToDefinition(new SimpleType(ATTR_DOT_END,
            "If a directory name ends with '.' it is replaced by this.  "
            + "For all file systems except Windows, '.' is recommended.  "
            + "For Windows, %%2E is recommended.",
            "."));

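These two fields are presumably also used during path conversion: one handles a segment that starts with '.', the other a directory name that ends with '.'.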
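A rough sketch of the two dot rules, under the simplifying assumption that both are applied to the same segment (the description applies the trailing-dot rule to directory names only). `DotSketch` is a hypothetical name, and the `%2E` replacement is just an example value.

```java
// Hypothetical helper, not part of Heritrix.
final class DotSketch {
    /** Replaces a leading '.' by dotBegin and a trailing '.' by dotEnd. */
    static String fixDots(String segment, String dotBegin, String dotEnd) {
        String s = segment;
        if (s.startsWith(".")) {
            s = dotBegin + s.substring(1);
        }
        if (s.endsWith(".")) {
            s = s.substring(0, s.length() - 1) + dotEnd;
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(fixDots(".hidden", "%2E", "%2E")); // prints %2Ehidden
        System.out.println(fixDots("dir.", "%2E", "%2E"));    // prints dir%2E
        System.out.println(fixDots("dir.", "%2E", "."));      // prints dir.
    }
}
```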

addElementToDefinition(new StringList(ATTR_HOST_MAP,
            "This list is grouped in pairs. "
            + "If a host name matches (case-insensitive) the first string "
            + "in a pair, it is replaced by the second string in the pair.  "
            + "This can be used for consistency when several names are used "
            + "for one host, for example "
            + "[12.34.56.78 www42.foo.com]."));

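This is a bit like a DNS alias table: several names for one host are folded into a single name, so the mirror stays consistent.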
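The host map can be pictured as a case-insensitive lookup over the pair list. This is a sketch of the rule in the description only; `HostMapSketch` and `mapHost` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Heritrix.
final class HostMapSketch {
    /** Returns the replacement for a matching host, else the host unchanged. */
    static String mapHost(String host, List<String> pairs) {
        for (int i = 0; i + 1 < pairs.size(); i += 2) {
            if (host.equalsIgnoreCase(pairs.get(i))) {
                return pairs.get(i + 1);
            }
        }
        return host;
    }

    public static void main(String[] args) {
        List<String> pairs = Arrays.asList("12.34.56.78", "www42.foo.com");
        System.out.println(mapHost("12.34.56.78", pairs)); // prints www42.foo.com
        System.out.println(mapHost("other.example", pairs)); // prints other.example
    }
}
```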

addElementToDefinition(new SimpleType(ATTR_PATH,
            "Top-level directory for mirror files.", "mirror"));

This one is obvious: the location where the mirror is stored.

addElementToDefinition(new SimpleType(ATTR_SUFFIX_AT_END,
            "If true, the suffix is placed at the end of the path, "
            + "after the query (if any).  If false, the suffix is placed "
            + "before the query.",
            Boolean.TRUE));

If the URI contains a query, this decides whether the suffix is placed before the query or after it.
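The two placements can be illustrated as below. This sketch assumes the query is already in whatever form it takes in the file name and is simply concatenated; how Heritrix actually encodes the query is not shown in these notes. `SuffixPlacementSketch` is a hypothetical name.

```java
// Hypothetical helper, not part of Heritrix.
final class SuffixPlacementSketch {
    /** Joins path, optional query, and suffix in the order the flag selects. */
    static String place(String path, String query, String suffix, boolean suffixAtEnd) {
        String dotted = "." + suffix;
        if (query == null) {
            return path + dotted; // no query: only one possible position
        }
        return suffixAtEnd ? path + query + dotted : path + dotted + query;
    }

    public static void main(String[] args) {
        System.out.println(place("search", "?q=1", "html", true));  // prints search?q=1.html
        System.out.println(place("search", "?q=1", "html", false)); // prints search.html?q=1
    }
}
```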

e = addElementToDefinition(new SimpleType(ATTR_TOO_LONG_DIRECTORY,
            "If all the directories in the URI would exceed, "
            + "or come close to exceeding, the file system maximum "
            + "path length, then they are all replaced by this.",
            DEFAULT_TOO_LONG_DIRECTORY));

If the directory path is too long, it is replaced automatically.

/** Default value for ATTR_TOO_LONG_DIRECTORY.*/
    private static final String DEFAULT_TOO_LONG_DIRECTORY = "LONG";
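A simplified sketch of the replacement, assuming a plain length comparison against a caller-supplied limit. The parameter `maxPathLength` and the class `TooLongSketch` are hypothetical, and the "come close to exceeding" margin from the description is not modelled.

```java
// Hypothetical helper, not part of Heritrix.
final class TooLongSketch {
    /** Replaces all directories by one placeholder when the path is too long. */
    static String shorten(String dirs, String fileName,
                          int maxPathLength, String tooLongDirectory) {
        String full = dirs + "/" + fileName;
        if (full.length() <= maxPathLength) {
            return full;
        }
        return tooLongDirectory + "/" + fileName;
    }

    public static void main(String[] args) {
        System.out.println(shorten("a/b/c", "index.html", 255, "LONG"));
        // prints a/b/c/index.html
        System.out.println(shorten("aaaaaaaaaa/bbbbbbbbbb", "f.html", 10, "LONG"));
        // prints LONG/f.html
    }
}
```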

 addElementToDefinition(new StringList(ATTR_UNDERSCORE_SET,
            "If a directory name appears (case-insensitive) in this list "
            + "then an underscore is placed before it.  "
            + "For all file systems except Windows, this is not needed.  "
            + "For Windows, the following is recommended: "
            + "[com1 com2 com3 com4 com5 com6 com7 com8 com9 "
            + "lpt1 lpt2 lpt3 lpt4 lpt5 lpt6 lpt7 lpt8 lpt9 "
            + "con nul prn]."));

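I did not fully understand this one at first. The listed names (com1 to com9, lpt1 to lpt9, con, nul, prn) are reserved device names on Windows and cannot be used as directory names, which is why an underscore is placed in front of them.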
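The underscore rule amounts to a case-insensitive set lookup. A minimal sketch, assuming the set is stored in lower case; `UnderscoreSketch` and `guard` are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Heritrix.
final class UnderscoreSketch {
    /** Prefixes '_' when the directory name is in the (lower-case) set. */
    static String guard(String dirName, Set<String> underscoreSet) {
        boolean listed = underscoreSet.contains(dirName.toLowerCase(Locale.ROOT));
        return listed ? "_" + dirName : dirName;
    }

    public static void main(String[] args) {
        Set<String> set = new HashSet<>(Arrays.asList("con", "nul", "prn", "com1", "lpt1"));
        System.out.println(guard("CON", set));  // prints _CON
        System.out.println(guard("docs", set)); // prints docs
    }
}
```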

The innerProcess method:

String scheme = uuri.getScheme();
        if (!"http".equalsIgnoreCase(scheme)
                && !"https".equalsIgnoreCase(scheme)) {
            return;
        }

If the scheme is not http or https, return immediately.

RecordingInputStream recis = curi.getHttpRecorder().getRecordedInput();
        if (0L == recis.getResponseContentLength()) {
            return;
        }

If no content was received for the page the link points to (the response content length is 0), return immediately.

String baseDir = null; // Base directory.
        String baseSeg = null; // ATTR_PATH value.
        try {
            baseSeg = (String) getAttribute(ATTR_PATH, curi);
        } catch (AttributeNotFoundException e) {
            logger.warning(e.getLocalizedMessage());
            return;
        }

By default, the value of baseSeg is "mirror".

// Trim any trailing File.separatorChar characters from baseSeg.
        while ((baseSeg.length() > 1) && baseSeg.endsWith(File.separator)) {
            baseSeg = baseSeg.substring(0, baseSeg.length() - 1);
        }

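Strip trailing file separators. Because of the length check, a baseSeg consisting of a single separator is left alone.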
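The trimming loop can be tried in isolation. The sketch below repeats the same loop with the separator passed in as a parameter, so the behaviour does not depend on the platform; `TrimSketch` is a hypothetical name.

```java
// Hypothetical helper that mirrors the loop above, with an explicit separator.
final class TrimSketch {
    /** Drops trailing separators but never shrinks the string below one char. */
    static String trimTrailing(String baseSeg, String separator) {
        String s = baseSeg;
        while (s.length() > 1 && s.endsWith(separator)) {
            s = s.substring(0, s.length() - 1);
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(trimTrailing("mirror//", "/")); // prints mirror
        System.out.println(trimTrailing("/", "/"));        // prints /
    }
}
```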

if (0 == baseSeg.length()) {
            baseDir = getController().getDisk().getPath();
        } else if ((new File(baseSeg)).isAbsolute()) {
            baseDir = baseSeg;
        } else {
            baseDir = getController().getDisk().getPath() + File.separator
                + baseSeg;
        }

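If the base path is empty, the directory is taken from the controller, that is, from the settings in order.xml. If it is absolute, it is used as is. Otherwise it is appended to the controller's disk path.


By default this ends up under the project path.


The reason is the one given above: the value in the configuration file is empty, so the only thing left is the working directory path.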
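The three branches can be restated as a small function. In this sketch the controller's disk path is passed in as a plain string (`diskPath`, a hypothetical parameter) so it runs without Heritrix; `BaseDirSketch` is a hypothetical name.

```java
import java.io.File;

// Hypothetical helper that restates the three branches above.
final class BaseDirSketch {
    /** Picks the mirror base directory from the disk path and ATTR_PATH value. */
    static String resolveBaseDir(String diskPath, String baseSeg) {
        if (baseSeg.isEmpty()) {
            return diskPath;                 // empty: use the disk path itself
        }
        if (new File(baseSeg).isAbsolute()) {
            return baseSeg;                  // absolute: use as is
        }
        return diskPath + File.separator + baseSeg; // relative: append
    }

    public static void main(String[] args) {
        System.out.println(resolveBaseDir("jobs", ""));
        System.out.println(resolveBaseDir("jobs", "mirror"));
        System.out.println(resolveBaseDir("jobs", new File("mirror").getAbsolutePath()));
    }
}
```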

 // Already have a path for this URI.
        boolean reCrawl = curi.containsKey(A_MIRROR_PATH);

if (reCrawl) {
                mps = curi.getString(A_MIRROR_PATH);
                destFile = new File(baseDir + File.separator + mps);
                File parent = destFile.getParentFile();
                if (null != parent) {
                    IoUtils.ensureWriteableDirectory(parent);
                }
            }

If a path already exists for this URI (a re-crawl), that path is used directly and no conversion is needed.

else {
                URIToFileReturn r = null; // Return from uriToFile().
                try {
                     r = uriToFile(baseDir, curi);
                } catch (AttributeNotFoundException e) {
                    logger.warning(e.getLocalizedMessage());
                    return;
                }
                destFile = r.getFile();
                mps = r.getRelativePath();
            }

Otherwise the URI has to be converted to a file path, which is done by calling uriToFile.

The uriToFile method does all the work of converting a URI to a file path.



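Recall the list from earlier whose pairs act like map entries, with an IP address as the first item and the corresponding host name as the second. This part presumably replaces the IP address with the host name.


This part uses the earlier fields to decide whether the port number is included and to set the suffix. For example, content type text/html gives the suffix html.


Conversion of illegal characters. For example, a Windows path cannot contain a question mark.

After the conversion: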

logger.warning(uuri.toString() + " -> " + destFile.getPath());

This logs a message showing the local path that the URI was converted to.

That is all for now.






