fastCSV - Tiny, Fast, Standard Compliant CSV Reader Writer

Introduction

With the increased prominence of machine learning and the ingestion of large CSV datasets for that purpose, I decided to write a CSV parser which met my requirements of being small, fast and easy to use. Most libraries I looked at didn't really meet my requirements, and so fastCSV was born.

CSV also allows you to load tabular (2-dimensional) data into memory very quickly, as opposed to other serializers like fastJSON.

Features

  • Fully CSV standard compliant

    • Multi-line
    • Quoted columns
    • Keeps spaces between delimiters

  • Really fast reading and writing of CSV files (see performance)

  • Tiny 8kb DLL compiled to net40 or netstandard20

  • Ability to get a typed list of objects from a CSV file

  • Ability to filter a CSV file while loading

  • Ability to specify a custom delimiter

CSV Standard

You can read the CSV RFC here: https://tools.ietf.org/html/rfc4180. As a summary of the standard:

  • a file can be multi-lined if the values in a column contain new lines
  • a column must be quoted if it contains a new line, delimiter or quote character

    • quotes within a quoted column must be escaped by doubling them ("")
  • spaces between the delimiters are considered part of the column

Below is an example of a complex standard-compliant CSV file from the wiki page https://en.wikipedia.org/wiki/Comma-separated_values:

Year,Make,Model,Description,Price
1997,Ford,"E350
F150","ac, abs,
moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air,
"",moon,""
roof, loaded",4799.00
1999,BMW,Z3,"used",14900.00
1999, Toyota,Corolla,,7000.00   

As you can see, some rows are multi-lined and contain quotes and commas, which will give the table below:

Year | Make    | Model                                  | Description                              | Price
---- | ------- | -------------------------------------- | ---------------------------------------- | --------
1997 | Ford    | E350\nF150                             | ac, abs,\nmoon                           | 3000.00
1999 | Chevy   | Venture "Extended Edition"             |                                          | 4900.00
1999 | Chevy   | Venture "Extended Edition, Very Large" |                                          | 5000.00
1996 | Jeep    | Grand Cherokee                         | MUST SELL!\nair,\n",moon,"\nroof, loaded | 4799.00
1999 | BMW     | Z3                                     | used                                     | 14900.00
1999 |  Toyota | Corolla                                |                                          | 7000.00

(Line breaks inside a value are shown as \n; note the leading space preserved in " Toyota".)

As you can see, some columns are multi-lined, and the Make column of the last row (" Toyota") starts with a space.

Performance Benchmarks

Loading the https://www.ncdc.noaa.gov/orders/qclcd/QCLCD201503.zip (585Mb) file, which has 4,496,263 rows, on my machine as a relative comparison to other libraries:

  • fastCSV : 11.20s, 639Mb used

  • NReco.CSV : 6.76s, 800Mb used

  • fastCSV string.Split() : 11.50s, 638Mb used

  • TinyCSVparser : 34s, 992Mb used

As a baseline comparison of what is possible on the same dataset:

  1. File.ReadAllBytes() : 1.5s, 573Mb used

  2. File.ReadAllLines() with no processing : 3.7s, 1633Mb used

  3. File.ReadLines() with no processing : 1.9s

  4. File.ReadLines() + string.Split() with no return list : 7.5s

The difference between 1 and 2 is the overhead of converting the bytes to Unicode strings: 2.2s.

The difference between 2 and 3 is the memory overhead of creating the string[] arrays: 1.8s.

The difference between 4 and fastCSV is the overhead of creating T objects and adding them to a list: 4s.
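For reference, baseline 4 can be approximated in a few lines; the following is a hypothetical reconstruction (not the article's actual benchmark harness), reusing the dataset filename from the examples later in the article:

long fields = 0;
foreach (var line in System.IO.File.ReadLines("201503hourly.txt"))
    fields += line.Split(',').Length; // touch the result so the work isn't optimized away
System.Console.WriteLine(fields);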

Roads not taken

  • Loading chunks in a buffer:

    • While I initially tried this route, it proved too complex and I couldn't get it to work properly. Judging by other libraries which did this, it is slow in comparison to the current implementation anyway.

  • StringBuilder character by character:

    • Using this option proved too slow for parsing columns out of a line.

Using the code

Below are some examples of how to use fastCSV:

public class car
{
    // you can use fields or properties
    public string Year;
    public string Make;
    public string Model;
    public string Description;
    public string Price;
}

// listcars will be a List<car>
var listcars = fastCSV.ReadFile<car>(
    "csvstandard.csv", // filename
    true,              // has header
    ',',               // delimiter
    (o, c) =>          // to object function o : car object, c : columns array read
    {
        o.Year = c[0];
        o.Make = c[1];
        o.Model = c[2];
        o.Description = c[3];
        o.Price = c[4];
        // add to list
        return true;
    });

fastCSV.WriteFile<LocalWeatherData>(
    "filename2.csv",   // filename
    new string[] { "WBAN", "Date", "SkyCondition" }, // headers defined or null
    '|',               // delimiter
    list,              // list of LocalWeatherData to save
    (o, c) =>          // from object function 
    {
        c.Add(o.WBAN);
        c.Add(o.Date.ToString("yyyyMMdd"));
        c.Add(o.SkyCondition);
    });

Helper functions for performance

fastCSV has the following helper functions:

  • int ToInt(string s) creates an int from a string

  • int ToInt(string s, int index, int count) creates an int from a substring

  • DateTime ToDateTimeISO(string value, bool UseUTCDateTime) creates a DateTime from an ISO standard string, i.e. yyyy-MM-ddTHH:mm:ss (with an optional .nnnZ part)

public class LocalWeatherData
{
    public string WBAN;
    public DateTime Date;
    public string SkyCondition;
}

var list = fastCSV.ReadFile<LocalWeatherData>("201503hourly.txt", true, ',', (o, c) =>
    {
        bool add = true;
        o.WBAN = c[0];
        // c[1] data is in "20150301" format
        o.Date = new DateTime(fastCSV.ToInt(c[1], 0, 4), 
                              fastCSV.ToInt(c[1], 4, 2), 
                              fastCSV.ToInt(c[1], 6, 2));
        o.SkyCondition = c[4];
        //if (o.Date.Day % 2 == 0)
        //    add = false;
        return add;
    });
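The ToDateTimeISO helper follows the same pattern; below is a hypothetical variant of the mapper above, assuming a file (the name is made up for illustration) whose second column holds ISO-formatted timestamps:

var list2 = fastCSV.ReadFile<LocalWeatherData>("isodata.txt", true, ',', (o, c) =>
    {
        o.WBAN = c[0];
        // hypothetical: assumes c[1] is in yyyy-MM-ddTHH:mm:ss form
        o.Date = fastCSV.ToDateTimeISO(c[1], false);
        o.SkyCondition = c[4];
        return true;
    });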

Usage Scenarios

  • Filtering CSV while loading

    • In your map function, you can write conditions on the loaded line data and filter out lines you don't want with return false;

  • Reading CSV to import into other systems

    • In your map function, you can send the line data to another system and return false;

    • or you can process the entire file and use the List<T> returned

  • Processing/aggregating data while loading (see the sketch after this list)

    • You can have a List<T> which has no bearing on the columns of the CSV file and sum/min/max/avg/etc. the lines read
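As a minimal sketch of the aggregation scenario (reusing the LocalWeatherData class and file from the examples above; the counting logic is my own illustration, not code from the library):

var counts = new Dictionary<string, int>();
fastCSV.ReadFile<LocalWeatherData>("201503hourly.txt", true, ',', (o, c) =>
    {
        // aggregate per sky condition while loading
        var sky = c[4];
        counts[sky] = counts.TryGetValue(sky, out var n) ? n + 1 : 1;
        return false; // keep nothing: the returned list stays empty
    });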

Inside the code

Essentially, reading is a loop that parses a line, creates a generic element for the list, hands the created object and the columns extracted from the line to the user-defined map function, and adds the object to the list to be returned (if the map function says so):

var c = ParseLine(line, delimiter, cols);
T o = new T();
var b = mapper(o, c);
if (b)
   list.Add(o);

Now, the CSV standard complexity comes from handling multi-lines correctly. This is done by counting the quotes in a line: if the count is odd, the row is multi-lined, and lines are read and appended until the quote count becomes even. This happens in the ReadFile() function. For example, the line 1997,Ford,"E350 contains one quote (an odd count), so the reader keeps accumulating lines, and only when the running quote count is even does it parse the accumulated text as a single row.

The beauty of this approach is that it is simple, does no reflection and is really fast, with the control being in the user's hands.

All the reading code is below:

public static List<T> ReadFile<T>(string filename, bool hasheader, char delimiter, ToOBJ<T> mapper) where T : new()
{
    string[] cols = null;
    List<T> list = new List<T>();
    int linenum = -1;
    StringBuilder sb = new StringBuilder();
    bool insb = false;
    foreach (var line in File.ReadLines(filename))
    {
        try
        {
            linenum++;
            if (linenum == 0)
            {
                if (hasheader)
                {
                    // actual col count
                    int cc = CountOccurence(line, delimiter);
                    if (cc == 0)
                        throw new Exception("File does not have '" + delimiter + "' as a delimiter");
                    cols = new string[cc + 1];
                    continue;
                }
                else
                    cols = new string[_COLCOUNT];
            }
            var qc = CountOccurence(line, '\"');
            bool multiline = qc % 2 == 1 || insb;

            string cline = line;
            // if multiline add line to sb and continue
            if (multiline)
            {
                insb = true;
                sb.Append(line);
                var s = sb.ToString();
                qc = CountOccurence(s, '\"');
                if (qc % 2 == 1)
                {
                    sb.AppendLine();
                    continue;
                }
                cline = s;
                sb.Clear();
                insb = false;
            }

            var c = ParseLine(cline, delimiter, cols);

            T o = new T();
            var b = mapper(o, c);
            if (b)
                list.Add(o);
        }
        catch (Exception ex)
        {
            throw new Exception("error on line " + linenum, ex);
        }
    }

    return list;
}

private unsafe static int CountOccurence(string text, char c)
{
    int count = 0;
    int len = text.Length;
    int index = -1;
    fixed (char* s = text)
    {
        while (index++ < len)
        {
            char ch = *(s + index);
            if (ch == c)
                count++;
        }
    }
    return count;
}

private unsafe static string[] ParseLine(string line, char delimiter, string[] columns)
{
    //return line.Split(delimiter);
    int col = 0;
    int linelen = line.Length;
    int index = 0;

    fixed (char* l = line)
    {
        while (index < linelen)
        {
            if (*(l + index) != '\"')
            {
                // non quoted
                var next = line.IndexOf(delimiter, index);
                if (next < 0)
                {
                    columns[col++] = new string(l, index, linelen - index);
                    break;
                }
                columns[col++] = new string(l, index, next - index);
                index = next + 1;
            }
            else
            {
                // quoted string change "" -> "
                int qc = 1;
                int start = index;
                char c = *(l + ++index);
                // find matching quote until delim or EOL
                while (index++ < linelen)
                {
                    if (c == '\"')
                        qc++;
                    if (c == delimiter && qc % 2 == 0)
                        break;
                    c = *(l + index);
                }
                columns[col++] = new string(l, start + 1, index - start - 3).Replace("\"\"", "\"");
            }
        }
    }

    return columns;
}

ParseLine() is responsible for extracting the columns from a line in an optimized unsafe way.

And the writing code is just:

public static void WriteFile<T>(string filename, string[] headers, char delimiter, List<T> list, FromObj<T> mapper)
{
    using (FileStream f = new FileStream(filename, FileMode.Create, FileAccess.Write))
    {
        using (StreamWriter s = new StreamWriter(f))
        {
            if (headers != null)
                s.WriteLine(string.Join(delimiter.ToString(), headers));

            foreach (var o in list)
            {
                List<object> cols = new List<object>();
                mapper(o, cols);
                for (int i = 0; i < cols.Count; i++)
                {
                    // quote the string if needed -> \" \r \n delim
                    var str = cols[i].ToString();
                    bool quote = false;

                    if (str.IndexOf('\"') >= 0)
                    {
                        quote = true;
                        str = str.Replace("\"", "\"\"");
                    }

                    if (quote == false && str.IndexOf('\n') >= 0)
                        quote = true;

                    if (quote == false && str.IndexOf('\r') >= 0)
                        quote = true;

                    if (quote == false && str.IndexOf(delimiter) >= 0)
                        quote = true;

                    if (quote)
                        s.Write("\"");
                    s.Write(str);
                    if (quote)
                        s.Write("\"");

                    if (i < cols.Count - 1)
                        s.Write(delimiter);
                }
                s.WriteLine();
            }
            s.Flush();
        }
        f.Close();
    }
}

Sample Use cases

Splitting data sets for testing and training

In data science, you generally split your data into training and testing sets; in the example below, every 3rd row is used for testing (you could make the splitting more elaborate):

var testing = new List<LocalWeatherData>();
int line = 0;
var training = fastCSV.ReadFile<LocalWeatherData>("201503hourly.txt", true, ',', (o, c) =>
    {
        bool add = true;
        line++;
        o.Date = new DateTime(fastCSV.ToInt(c[1], 0, 4),
                              fastCSV.ToInt(c[1], 4, 2),
                              fastCSV.ToInt(c[1], 6, 2));
        o.SkyCondition = c[4];
        if (line % 3 == 0)
        {
            add = false;
            testing.Add(o);
        }
        return add;
    });

Appendix v2.0

With apologies to NReco.CSV: I messed up the timings for that library, which was pointed out to me by KPixel on GitHub. This prompted me to go back to the drawing board and redo the internals of fastCSV for more speed.

The new code is even faster than a naïve roll-your-own File.ReadLines() and string.Split(), which can't handle multi-lines.

Performance

The new performance numbers are below; in comparison to the v1 code, it is nearly 2x faster, at the expense of a little more memory usage on the same 4,496,263-row dataset.

  • fastCSV .net4 : 6.27s, 753Mb used

  • fastCSV core : 6.51s, 669Mb used

Interestingly, on .net core the library uses less memory.

Changes Made

  • First up was creating a buffered StreamReader instead of relying on File.ReadLines(); this is handled by the BufReader class. This class also handles multi-lines without resorting to a StringBuilder, as well as the possible case that a line is larger than the buffer being used to read it.

  • While testing, I discovered that using IL to create the generic list objects is faster than using new T() (a sketch of this technique follows the list).

  • A poor man's (.net 4) Span in the form of the MGSpan class, which just passes around the buffer, start and length of the data, and delays the creation of strings until they are actually used in the object-filling delegate.

  • ReadFile<T>() looks cleaner now and does not use StringBuilder for multi-lines.
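The article doesn't show the IL helper itself; a common way to get this effect, sketched here as an assumption rather than the library's actual code, is to compile a new T() expression once per type and reuse the delegate:

using System;
using System.Linq.Expressions;

// hypothetical sketch: a compiled "new T()" factory is typically much faster
// than Activator.CreateInstance or the new() constraint in a hot loop
static class FastCreator<T> where T : new()
{
    public static readonly Func<T> New =
        Expression.Lambda<Func<T>>(Expression.New(typeof(T))).Compile();
}

// usage inside the read loop:  T o = FastCreator<T>.New();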

Some Weird Stuff

The first pass of FillBuffer() used _tr.BaseStream.Seek() to go back in the CSV data file when the end of the buffer was reached and a line was not complete. This was fine at first, but it failed on non-ASCII data and UTF8-encoded files, the reason being that some characters are 2 bytes, not 1, and a char in .net, while looking like a byte, can in fact be 2 bytes. This messes up the offset computation when seeking, which results in reading incorrect lines.

To remedy this, I used Array.Copy() to copy the leftover characters to the start of the buffer and read the rest from the file until the buffer is full; interestingly, you can't use Buffer.BlockCopy() for the same reason as above.
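A minimal sketch of this buffer-shifting idea (the name and signature are hypothetical, not the library's actual FillBuffer()):

// hypothetical sketch: shift the unconsumed tail of the char buffer to the
// front, then top the buffer up from the reader; working in chars means
// multi-byte encodings never enter the offset computation
static int Refill(System.IO.TextReader tr, char[] buf, int tailStart, int tailLen)
{
    System.Array.Copy(buf, tailStart, buf, 0, tailLen);      // move leftovers to index 0
    int read = tr.Read(buf, tailLen, buf.Length - tailLen);  // fill the remainder
    return tailLen + read;                                   // valid chars now in buf
}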

A simple change from class MGSpan to struct MGSpan results in a very distinct speed-up, probably because the values are passed around on the stack and don't stay around long, which is faster than using the heap.
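The fields of MGSpan aren't shown in the article; a plausible shape for such a poor man's span (an assumption for illustration only) would be:

// hypothetical sketch: a struct that points into the shared read buffer and
// only materializes a string when the column is actually used
struct MGSpan
{
    public char[] Buffer; // the shared read buffer
    public int Start;     // start offset of the column
    public int Count;     // length of the column

    public override string ToString() => new string(Buffer, Start, Count);
}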

Previous Versions

Translated from: https://www.codeproject.com/Articles/5255318/fastCSV-Tiny-Fast-Standard-Compliant-CSV-Reader-Wr
