Fast text parsing with Ragel on Linux (Part 2)

staticnetwind

Category: linux. Tags: ragel, awkmenu, awk, parser

Article link: https://blog.csdn.net/stayneckwind2/article/details/89290602

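1. Introduction

Part 1, 《Linux使用ragel进行文本快速解析（上）》, gave a first introduction to Ragel along with an atoi example. This post continues with a second example: parsing text line by line.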

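2. Approach

awk is mainly used to parse text with a fixed number of columns, and the awk command processes its input line by line. A Ragel version follows the same idea: write a regular expression whose unit is one line, then read and parse. Unlike the command-line tool, though, Ragel is effectively programmable, so it can also handle text whose lines do not follow a fixed layout.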

3. Source code

The implementation is in ragel-6.10/examples/awkemu.rl

First, the state machine part of the code:

%%{
    machine awkemu;

    action start_word {
        ws[nwords] = fpc;
    }   

    action end_word {
        we[nwords++] = fpc;
    }   

    action start_line {
        nwords = 0;
        ls = fpc;
    }   

    action end_line {
        printf("endline(%i): ", nwords );
        fwrite( ls, 1, p - ls, stdout );
        printf("\n");

        for ( i = 0; i < nwords; i++ ) { 
            printf("  word: ");
            fwrite( ws[i], 1, we[i] - ws[i], stdout );
            printf("\n");
        }   
    }   

    # Words in a line.
    word = ^[ \t\n]+;

    # The whitespace separating words in a line.
    whitespace = [ \t];

    # The components in a line to break up. Either a word or a single char of
    # whitespace. On the word capture characters.
    blineElements = word >start_word %end_word | whitespace;

    # Star the break line elements. Just be careful to decrement the leaving
    # priority as we don't want multiple character identifiers to be treated as
    # multiple single char identifiers.
    line = ( blineElements** '\n' ) >start_line @end_line;

    # Any number of lines.
    main := line*;
}%%

%% write data noerror nofinal;

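As the code shows, a line is made up of multiple elements: line = ( blineElements** '\n' ) >start_line @end_line;

Each element is either a word or a separator: blineElements = word >start_word %end_word | whitespace;

The actions that capture each string do not copy any memory. Instead, they record pointers into the buffer in the arrays ws (word start) and we (word end).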
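To make the pointer-pair idea concrete, here is a minimal hand-written sketch in plain C. It is not what Ragel generates; the names `struct span` and `split_words` are hypothetical, and the bounds check against `max` is an addition of this sketch. It only illustrates how a word can be recorded as a start and end pointer into the input without copying.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type: a word is a [start, end) pair of pointers into the
 * caller's buffer, playing the role of ws[i] and we[i]. */
struct span {
    const char *start;
    const char *end;
};

/* Split the text in [p, pe) into words separated by ' ' or '\t', stopping at
 * the first '\n'. Stores at most max spans and returns how many were stored.
 * Nothing is copied: every span points back into the input buffer. */
static size_t split_words(const char *p, const char *pe,
                          struct span *out, size_t max)
{
    size_t n = 0;
    const char *word = NULL;   /* start of the current word, if inside one */

    assert(p != NULL && pe != NULL && p <= pe);

    for (; p < pe; p++) {
        int is_sep = (*p == ' ' || *p == '\t' || *p == '\n');

        if (!is_sep && word == NULL) {
            word = p;                    /* same role as start_word */
        } else if (is_sep && word != NULL) {
            if (n < max) {               /* same role as end_word */
                out[n].start = word;
                out[n].end = p;
                n++;
            }
            word = NULL;
        }
        if (*p == '\n')
            break;                       /* end of line */
    }
    /* A final word with no separator after it ends where the input ends. */
    if (word != NULL && n < max) {
        out[n].start = word;
        out[n].end = p;
        n++;
    }
    return n;
}
```

Because the spans are not NUL-terminated, they have to be printed with a length, for example `fwrite(w[i].start, 1, w[i].end - w[i].start, stdout)`, which is what the end_line action does with ws and we.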

Next, the entry point, the main function:


#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#define MAXWORDS 256
#define BUFSIZE 4096
char buf[BUFSIZE];

int main()
{
    int i, nwords = 0;
    char *ls = 0;
    char *ws[MAXWORDS];
    char *we[MAXWORDS];

    int cs;
    int have = 0;

    %% write init;

    while ( 1 ) {
        char *p, *pe, *data = buf + have;
        int len, space = BUFSIZE - have;
        /* fprintf( stderr, "space: %i\n", space ); */

        if ( space == 0 ) {
            fprintf(stderr, "buffer out of space\n");
            exit(1);
        }

        len = fread( data, 1, space, stdin );
        /* fprintf( stderr, "len: %i\n", len ); */
        if ( len == 0 )
            break;

        /* Find the last newline by searching backwards. This is where
         * we will stop processing on this iteration. */
        p = buf;
        pe = buf + have + len - 1;
        while ( pe >= buf && *pe != '\n' )
            pe--;
        pe += 1;

        /* fprintf( stderr, "running on: %i\n", pe - p ); */
        %% write exec;
        
        /* How much is still in the buffer. */
        have = data + len - pe;
        if ( have > 0 )
            memmove( buf, pe, have );

        /* fprintf(stderr, "have: %i\n", have ); */

        if ( len < space )
            break;
    }

    if ( have > 0 )
        fprintf(stderr, "input not newline terminated\n");
        
    return 0;
}

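The main function reads the input and hands it to the state machine for parsing through %% write exec;.

At first glance this looks fairly complicated. Why not simply read one line at a time with fgets and process that? So I modified the code to try it.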

int ragel_awkemu(const char *str)
{
    const char *p = str, *pe = str + strlen(str);
    int cs; 
    int i, nwords = 0;  /* start at 0 so the return value is defined even for an empty string */

    const char *ls = NULL;
    const char *ws[MAXWORDS] = {NULL};
    const char *we[MAXWORDS] = {NULL};

    /* Initialize and execute. */
    %% write init;
    %% write exec;

    return nwords;
}

int main(int argc, char *argv[])
{
    char buf[SIZE_LINE_NORMAL];

    while (fgets(buf, sizeof(buf), stdin) != 0) {
        int value = ragel_awkemu(buf);
        printf("num: %d\n", value);
    }   
    return 0;
}

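4. Running it

A quick test first confirmed that both versions parse the text correctly. To measure what the two approaches can really do under load, I generated a log file of 1,000,000 lines with 15 columns per line, for a total size of 300 MB.

Original version, reading by block: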

centos-x64 [ ~ ]# time ./awkemu < /tmp/syslog.dat >/dev/null
real    0m2.581s
user    0m2.484s
sys     0m0.094s

Modified version, reading by line:

centos-x64 [ ~ ]# time ./awkemu2 < /tmp/syslog.dat >/dev/null
real    0m3.488s
user    0m3.372s
sys     0m0.111s

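In terms of performance, throughput is roughly 300,000 to 400,000 lines per second: 1,000,000 lines in 2.581 s is about 387,000 lines/s, and in 3.488 s about 287,000 lines/s. The line-based version is noticeably slower, taking about 35% more wall-clock time. Yet strace shows that both fgets and fread end up calling read underneath, so the difference in cost does not come from the number of underlying read calls.

After reading and comparing the two main functions more carefully, my conclusion is that the difference still comes down to call counts, this time at the function-call level: fread wins because each call handles more data, so fewer calls are needed. Each call reads (copies) one large block into the buffer, and the state machine then parses it character by character. If the last few characters of a block do not form a complete line, they are moved with memmove to the start of the buffer and handled in the next round.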
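The block-reading pattern described above has two small steps that are easy to get wrong: finding where the last complete line ends, and carrying the incomplete tail over to the next read. The following is a minimal sketch of just those two steps; the helper names `complete_prefix` and `carry_over` are hypothetical and do not appear in awkemu.rl.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return how many bytes of buf[0..len) belong to complete lines, that is,
 * the offset just past the last '\n'. Returns 0 if there is no '\n'. */
static size_t complete_prefix(const char *buf, size_t len)
{
    while (len > 0 && buf[len - 1] != '\n')
        len--;
    return len;
}

/* Move the unparsed tail buf[done..len) to the front of the buffer and
 * return its length, so the next read can append right after it. */
static size_t carry_over(char *buf, size_t len, size_t done)
{
    size_t have;

    assert(done <= len);
    have = len - done;
    if (have > 0 && done > 0)
        memmove(buf, buf + done, have);   /* regions may overlap */
    return have;
}
```

In a read loop this would be used roughly as follows: append new data with fread at buf + have, call complete_prefix on the filled length, run the parser over that prefix only, then call carry_over to get the new value of have. Working with lengths instead of a pointer that walks backwards also avoids ever forming a pointer before the start of the buffer.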

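5. Conclusion

Ragel ships with a classic awk-style example, and it also carries a lesson: fetch data in blocks that are as large as possible to cut down on function calls, and avoid memory copies wherever possible. That is how to push performance to its limit.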
