Sally配置文件讲解

最新推荐文章于 2020-10-23 16:29:03 发布

Tusaka

最新推荐文章于 2020-10-23 16:29:03 发布

阅读量872

点赞数

分类专栏：预处理文章标签：预处理

本文链接：https://blog.csdn.net/Tusaka/article/details/52862913

版权

预处理专栏收录该内容

1 篇文章 0 订阅

订阅专栏

一般包含三个部分输入配置，特征配置以及输出配置。

Input configuration

input_format:这里的格式会告诉sally需要使用什么方式打开输入文件，不同格式文件对应不同方法；
chunck_size：To enable an efficient processing of large data sets, sally processes strings in chunks. This parameter defines the number of strings in one of these chunks. Depending on the lengths of the strings, this parameter can be adjusted to balance loading and processing of data；
decode_str：如果这个参数设置为“1”，sally会自动对URI编码元素进行解码。（That is, substrings of the form %XX are replaced with the byte represented by the hexadecimal number XX. This feature comes handy, if binary data is provided using the textual input format “lines”. For example, HTTP requests can be stored in a single line if line-breaks are represented by “%0a%0d”.）；
reverse_str：这个参数被设定为“1”时，输入字符串将会被颠倒顺序（Such reversing might help in situations where the reading direction of the input strings is unspecified）；
fasta_regex:在生物信息学中，FASTA格式（又称为Pearson格式），是一种基于文本用于表示核苷酸序列或氨基酸序列的格式。在这种格式中碱基对或氨基酸用单个字母来编码，且允许在序列前添加序列名及注释。如

>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA DIDGDGQVNYEEFVQMMTAK*

Feature configuration

ngram_len:指定n的大小，用来划分gram，
granularity：ngram_len指定了多少个symbols作为一个特征，而granularity则指定了什么粒度作为一个symbol（包括“bytes”、“tokens”）（这里特别注意一下最后一种token n-grams这种形式是将几个tokens整合一个group中，以这个group最为特征）

=item I<tokens>

The strings are partitioned into substrings (tokens) using a set of
delimiter characters.  Such partitioning is typical for natural language
processing, where the delimiters are usually defined as white-space and
punctuation symbols.  An embedding using tokens is selected by choosing
I<tokens> as granularity (B<granularity>), defining a set of delimiter
characters (B<token_delim>) and setting the n-gram length to 1
(B<ngram_len>).

=item I<byte n-grams>

The strings are characterized by all possible byte sequences of a fixed
length n (byte n-grams).  These features are frequently used if no
information about the structure of strings is available, such as in
bioinformatics or computer security.  An embedding using byte n-grams is
selected by choosing I<bytes> as granularity (B<granularity>) and defining
the n-gram length (B<ngram_len>).

=item I<token n-grams>

The strings are characterized by all possible token sequences of a fixed
length n (token n-grams).  These features require the definition of a set of
delimiters and a length n.  They are often used in natural language
processing as a coarse way for capturing structure of text.  An embedding
using token n-grams is selected by choosing I<tokens> as granularity
(B<granularity>), defining a set of delimiter characters (B<token_delim>) and
choosing an n-gram length (B<ngram_len>).

ngram_delim: 进行token划分的分隔符
ngram_pos:在计算value的时候是否需要考虑位置，也就是只有位置相同（或者同一位置的某个偏移范围内）的gram才会计数
pos_shift：指定偏移范围

The parameter B<ngram_pos> can be used to enable positional n-grams. In
contrast to regular n-grams, these substrings of length n are associated
with a position in the originating string.  Positional n-grams thus only
match if they appear at the same location in a string.  The additional
parameter B<pos_shift> can be used to add a shift to the n-grams.  If the
parameter is set to I<k>, multiple positional n-grams are extracted with a
shift from I<-k> to I<+k>.

vect_embed:向量的Embedding方式，包括三种“cnt”，“bin”，“tfidf”。

This parameter specifies how the features are embedded in the vector
space. Supported values are "bin" for associating each dimension with
a binary value, "cnt" for associating each dimension with a count
value and "tfidf" for using a TF-IDF weighting.

Output Configuration

output_format:输出格式，支持“text”、“libsvm”、“Json”以及matlab格式

=item I<"text">

The feature vectors of the embedded strings are stored as plain text.
Each feature vector is represented as a list of dimensions, which is written
to I<output> in the following form

    dimension:feature:value,... source

I<dimension> specifies the index of the dimension, I<feature> a textual
representation of the feature and I<value> the value at the dimension.  If
parameter B<explicit_hash> is not enabled in the configuration, the field
I<feature> is empty.

=item I<"stdout">

The feature vectors of the embedded strings are written to standard output
(stdout) as text.  Each feature vector is represented as a list of
dimensions in the following form:

    dimension:feature:value,... source

I<dimension> specifies the index of the dimension, I<feature> a textual
representation of the feature and I<value> the value at the dimension.  If
parameter B<explicit_hash> is not enabled in the configuration, the field
I<feature> is empty.

=item I<"matlab">

The feature vectors of the embedded strings are stored in Matlab
format (v5).  The vectors are stored as a 1 x n struct array with the
fields: data, src, label and feat. The name of the output file is
given as I<output> to B<sally>. Note that great care is required to
efficiently operate with sparse vectors in Matlab. If the sparse
representation is lost during computations, excessive run-time and
memory requirements are likely.

=item I<"cluto">

The feature vectors of the embedded strings are stored as a sparse
matrix suitable for the clustering tool Cluto. The first line of
the file is a header for Cluto, while the remaining lines correspond
to feature vectors. The name of the output file is given as I<output>
to B<sally>. Note that Cluto can not handle arbitrarily large vector
spaces and thus the B<"hash_bits"> should be set to values below 24.

=item I<"json">

The feature vectors of the embedded strings are stored as JSON objects.
Each object contains a list of dimension indices denoted I<dim> and
corresponding values denoted as I<val>.  Depending on the configuration the
source for each object as well as the actual string feature associated with
each dimension are also stored in the JSON object.

（PS：看到这里终于明白输出的.sally文件到底是什么意思了，今天看了一天的源码也算是没有浪费时间，下一步需要想想这么做为什么能够提取出特征来，以及后面可以怎么使用这种方式提取的特征。这里给出一个输出格式是“text”以及“json”的图。

输出格式为“text”：

这里写图片描述

输出格式为“json”：

这里写图片描述

下面是一个sally的配置文件的示例

# Input configuration
input = {
        # Input format. Supported types: "dir", "arc", "lines", "fasta"
        input_format    = "lines";

        # Number of strings to process in each chunk
        chunk_size      = 4096;

        # Regex for extracting labels from FASTA descriptions
        fasta_regex     = " (\\+|-)?[0-9]+";

        # Regex for extracting labels from text lines
        lines_regex     = "^[ ]+";
};

# Feature configuration
features = {
        # Length of n-grams.
        ngram_len       = 1;

    # d = '()<>@,:;\\\"/[]?={}&?\n\r \t'; from urllib import quote; quote(d)
        # Delimiters for n-grams. An empty string triggers byte n-grams.
        # ngram_delim     = "%0a%0d%20%09";

        # Embedding mode for vectors. Supported types "cnt", "bin", "tfidf"
        vect_embed      = "bin";

        # Normalization mode for vectors. Supported types "l1", "l2", "none".
        vect_norm       = "none";

        # Number of hash bits to use. 
        hash_bits       = 32;

        # Explicit hash table instead of hashing features only.
        explicit_hash   = 1;

        # File to store weighting vector for TFIDF embedding.
        tfidf_file      = "tfidf.fv";
};

# Configuration of output
output = {
        # Output format. Supported formats: "libsvm", "text", "matlab"
        output_format   = "text";
};