Sally配置文件讲解

一般包含三个部分输入配置,特征配置以及输出配置。

Input configuration

  • input_format:这里的格式会告诉sally需要使用什么方式打开输入文件,不同格式文件对应不同方法;
  • chunck_size:To enable an efficient processing of large data sets, sally processes strings in chunks. This parameter defines the number of strings in one of these chunks. Depending on the lengths of the strings, this parameter can be adjusted to balance loading and processing of data;
  • decode_str:如果这个参数设置为“1”,sally会自动对URI编码元素进行解码。(That is, substrings of the form %XX are replaced with the byte represented by the hexadecimal number XX. This feature comes handy, if binary data is provided using the textual input format “lines”. For example, HTTP requests can be stored in a single line if line-breaks are represented by “%0a%0d”.);
  • reverse_str:这个参数被设定为“1”时,输入字符串将会被颠倒顺序(Such reversing might help in situations where the reading direction of the input strings is unspecified);
  • fasta_regex:在生物信息学中,FASTA格式(又称为Pearson格式),是一种基于文本用于表示核苷酸序列或氨基酸序列的格式。在这种格式中碱基对或氨基酸用单个字母来编码,且允许在序列前添加序列名及注释。如
>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA DIDGDGQVNYEEFVQMMTAK*

Feature configuration

  • ngram_len:指定n的大小,用来划分gram,
  • granularity:ngram_len指定了多少个symbols作为一个特征,而granularity则指定了什么粒度作为一个symbol(包括“bytes”、“tokens”)(这里特别注意一下最后一种token n-grams这种形式是将几个tokens整合一个group中,以这个group最为特征)
=item I<tokens>

The strings are partitioned into substrings (tokens) using a set of
delimiter characters.  Such partitioning is typical for natural language
processing, where the delimiters are usually defined as white-space and
punctuation symbols.  An embedding using tokens is selected by choosing
I<tokens> as granularity (B<granularity>), defining a set of delimiter
characters (B<token_delim>) and setting the n-gram length to 1
(B<ngram_len>).

=item I<byte n-grams>

The strings are characterized by all possible byte sequences of a fixed
length n (byte n-grams).  These features are frequently used if no
information about the structure of strings is available, such as in
bioinformatics or computer security.  An embedding using byte n-grams is
selected by choosing I<bytes> as granularity (B<granularity>) and defining
the n-gram length (B<ngram_len>).

=item I<token n-grams>

The strings are characterized by all possible token sequences of a fixed
length n (token n-grams).  These features require the definition of a set of
delimiters and a length n.  They are often used in natural language
processing as a coarse way for capturing structure of text.  An embedding
using token n-grams is selected by choosing I<tokens> as granularity
(B<granularity>), defining a set of delimiter characters (B<token_delim>) and
choosing an n-gram length (B<ngram_len>).
  • ngram_delim: 进行token划分的分隔符
  • ngram_pos:在计算value的时候是否需要考虑位置,也就是只有位置相同(或者同一位置的某个偏移范围内)的gram才会计数
  • pos_shift:指定偏移范围
The parameter B<ngram_pos> can be used to enable positional n-grams. In
contrast to regular n-grams, these substrings of length n are associated
with a position in the originating string.  Positional n-grams thus only
match if they appear at the same location in a string.  The additional
parameter B<pos_shift> can be used to add a shift to the n-grams.  If the
parameter is set to I<k>, multiple positional n-grams are extracted with a
shift from I<-k> to I<+k>.
  • vect_embed:向量的Embedding方式,包括三种“cnt”,“bin”,“tfidf”。
This parameter specifies how the features are embedded in the vector
space. Supported values are "bin" for associating each dimension with
a binary value, "cnt" for associating each dimension with a count
value and "tfidf" for using a TF-IDF weighting.

Output Configuration

  • output_format:输出格式,支持“text”、“libsvm”、“Json”以及matlab格式
=item I<"text">

The feature vectors of the embedded strings are stored as plain text.
Each feature vector is represented as a list of dimensions, which is written
to I<output> in the following form

    dimension:feature:value,... source

I<dimension> specifies the index of the dimension, I<feature> a textual
representation of the feature and I<value> the value at the dimension.  If
parameter B<explicit_hash> is not enabled in the configuration, the field
I<feature> is empty.

=item I<"stdout">

The feature vectors of the embedded strings are written to standard output
(stdout) as text.  Each feature vector is represented as a list of
dimensions in the following form:

    dimension:feature:value,... source

I<dimension> specifies the index of the dimension, I<feature> a textual
representation of the feature and I<value> the value at the dimension.  If
parameter B<explicit_hash> is not enabled in the configuration, the field
I<feature> is empty.

=item I<"matlab">

The feature vectors of the embedded strings are stored in Matlab
format (v5).  The vectors are stored as a 1 x n struct array with the
fields: data, src, label and feat. The name of the output file is
given as I<output> to B<sally>. Note that great care is required to
efficiently operate with sparse vectors in Matlab. If the sparse
representation is lost during computations, excessive run-time and
memory requirements are likely.

=item I<"cluto">

The feature vectors of the embedded strings are stored as a sparse
matrix suitable for the clustering tool Cluto. The first line of
the file is a header for Cluto, while the remaining lines correspond
to feature vectors. The name of the output file is given as I<output>
to B<sally>. Note that Cluto can not handle arbitrarily large vector
spaces and thus the B<"hash_bits"> should be set to values below 24.

=item I<"json">

The feature vectors of the embedded strings are stored as JSON objects.
Each object contains a list of dimension indices denoted I<dim> and
corresponding values denoted as I<val>.  Depending on the configuration the
source for each object as well as the actual string feature associated with
each dimension are also stored in the JSON object.

(PS:看到这里终于明白输出的.sally文件到底是什么意思了,今天看了一天的源码也算是没有浪费时间,下一步需要想想这么做为什么能够提取出特征来,以及后面可以怎么使用这种方式提取的特征。这里给出一个输出格式是“text”以及“json”的图。

输出格式为“text”:

这里写图片描述

输出格式为“json”:

这里写图片描述

下面是一个sally的配置文件的示例

# Input configuration
input = {
        # Input format. Supported types: "dir", "arc", "lines", "fasta"
        input_format    = "lines";

        # Number of strings to process in each chunk
        chunk_size      = 4096;

        # Regex for extracting labels from FASTA descriptions
        fasta_regex     = " (\\+|-)?[0-9]+";

        # Regex for extracting labels from text lines
        lines_regex     = "^[ ]+";
};

# Feature configuration
features = {
        # Length of n-grams.
        ngram_len       = 1;

    # d = '()<>@,:;\\\"/[]?={}&?\n\r \t'; from urllib import quote; quote(d)
        # Delimiters for n-grams. An empty string triggers byte n-grams.
        # ngram_delim     = "%0a%0d%20%09";

        # Embedding mode for vectors. Supported types "cnt", "bin", "tfidf"
        vect_embed      = "bin";

        # Normalization mode for vectors. Supported types "l1", "l2", "none".
        vect_norm       = "none";

        # Number of hash bits to use. 
        hash_bits       = 32;

        # Explicit hash table instead of hashing features only.
        explicit_hash   = 1;

        # File to store weighting vector for TFIDF embedding.
        tfidf_file      = "tfidf.fv";
};

# Configuration of output
output = {
        # Output format. Supported formats: "libsvm", "text", "matlab"
        output_format   = "text";
};
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 2
    评论
评论 2
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值