Java: Infer Column Widths of a Fixed-Width Text File

Latest recommended article published on 2021-03-10 11:19:50

cunxiedian8614

Tags: java

Delimited Files and Fixed-Width Files

Plain-text files containing tabular data are usually organized in one of two ways: as delimited files or as fixed-width files. Delimited files use one or more characters in series to separate the columns of the tabular data along each row (and line breaks are almost always used to separate rows). A common delimited file format is the CSV (comma-separated values) format:

287540,Smith,Jones,Accountant,"$55,000"
204878,Ross,Betsy,Senior Accountant,"$66,000"
208417,Arthur,Wilbur,CEO,"$123,000"

In a CSV file, the delimiter can sometimes show up within a value in a row, and when that happens, the value is usually surrounded by double quotes. Quotes can also show up in values, and when they appear, they are escaped by doubling (""). RFC 4180 documents the commonly used CSV format.
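To make the quoting rules concrete, here is a minimal sketch of splitting a single CSV line by hand. The class name `CsvLineSplitter` is made up for illustration, and the sketch assumes a comma delimiter and no line breaks embedded in quoted values; it is not a full RFC 4180 parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only: splits ONE line of CSV text.
class CsvLineSplitter {

  static List<String> split(String line) {
    List<String> fields = new ArrayList<>();
    StringBuilder field = new StringBuilder();
    boolean inQuotes = false; // true while we are inside a double-quoted value

    for (int i = 0; i < line.length(); ++i) {
      char ch = line.charAt(i);
      if (inQuotes) {
        if (ch == '"') {
          if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
            field.append('"'); // doubled quote ("") is a literal quote
            ++i;               // skip the second quote of the pair
          } else {
            inQuotes = false;  // a lone quote closes the quoted value
          }
        } else {
          field.append(ch);    // commas inside quotes are ordinary characters
        }
      } else if (ch == '"') {
        inQuotes = true;       // opening quote
      } else if (ch == ',') {
        fields.add(field.toString()); // delimiter ends the current field
        field.setLength(0);
      } else {
        field.append(ch);
      }
    }
    fields.add(field.toString()); // the last field has no trailing delimiter
    return fields;
  }
}
```

Calling `CsvLineSplitter.split` on the first example row above gives five fields, with the salary kept whole as `$55,000` even though it contains a comma.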

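Fixed-width files, on the other hand, enforce a fixed width for each column (though not all columns need to have the same width) and pad the remaining space on the left or right, usually with spaces: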

287540 Smith  Jones  Accountant         $55,000
204878 Ross   Betsy  Senior Accountant  $66,000
208417 Arthur Wilbur CEO               $123,000

There are advantages and disadvantages to each of these approaches. A delimited file can be easier to parse, unless there are escape characters and delimiters embedded in values. A delimited file also usually takes up less space than a fixed-width file, as it doesn't waste bytes padding the file full of spaces. Parsing CSV files can be simple enough if you have a good RegEx, but parsing a fixed-width file can be difficult. Either the user has to know the column widths in advance and pass them to a parsing method, or the method has to infer the widths of the columns. The second approach, being a bit more automated, is the one that I would tend to prefer, so let's try to do that!
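For comparison, the first option (the caller already knows the widths) might look something like the sketch below. The class name `KnownWidthSplitter` and the widths `{7, 7, 7, 18, 8}` are assumptions made for this example; the sample line is built with `String.format` so that its padding is unambiguous.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only: the widths are supplied by the caller.
class KnownWidthSplitter {

  // Splits one line into trimmed tokens, given the width of each data column.
  static List<String> split(String line, int[] widths) {
    List<String> tokens = new ArrayList<>();
    int start = 0;
    for (int width : widths) {
      if (start >= line.length()) break;                // line is shorter than expected
      int end = Math.min(start + width, line.length()); // clamp to the end of the line
      tokens.add(line.substring(start, end).trim());    // strip the padding
      start = end;
    }
    return tokens;
  }

  public static void main(String[] args) {
    // Left-align the first four columns, right-align the last one
    String line = String.format("%-7s%-7s%-7s%-18s%8s",
        "287540", "Smith", "Jones", "Accountant", "$55,000");
    int[] widths = {7, 7, 7, 18, 8}; // assumed widths, matching the format string
    System.out.println(split(line, widths));
    // prints [287540, Smith, Jones, Accountant, $55,000]
  }
}
```

This works, but it breaks as soon as the file layout changes, which is the motivation for inferring the widths instead.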

Read a Text File into a `List<String>`

The very first thing we want to do is get our fixed-width file into a List<String>. To do this, we wrap a FileReader for the file in a BufferedReader and then call BufferedReader's readLine() method over and over until it returns null (it returns a String whenever it has successfully read a line):

jshell> String fileName = "src/main/resources/example_sql_windows.txt"
fileName ==> "src/main/resources/example_sql_windows.txt"

jshell> BufferedReader reader = new BufferedReader(new FileReader(fileName))
reader ==> java.io.BufferedReader@2353b3e6

jshell> List<String> lines = new ArrayList<>()
lines ==> []

jshell> String line = null // for use in the loop below
line ==> null

jshell> while ((line = reader.readLine()) != null) lines.add(line)

jshell> reader.close() // release the file handle once all lines are read

jshell> int nLines = lines.size() // save this for later
nLines ==> 22

That's it! Easy! Note that we had to instantiate an ArrayList because List is only an interface and can't be instantiated directly. We can also use the diamond operator <> to save some typing. Other than that, I hope the rest of the code above is more or less straightforward. Now we can access lines of our file by their indices in our lines list.
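Outside of jshell, the same loop is usually wrapped in a try-with-resources block so that the reader is closed even if reading fails. The following is a minimal sketch of that idea; the class name `LineReader` is hypothetical, and rethrowing the IOException as an UncheckedIOException is just one possible design choice.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class LineReader {

  // Reads every line from the given Reader and closes it afterwards.
  static List<String> readLines(Reader source) {
    List<String> lines = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(source)) {
      String line;
      while ((line = reader.readLine()) != null) lines.add(line);
    } catch (IOException e) {
      throw new UncheckedIOException(e); // assumption: callers prefer an unchecked exception
    }
    return lines;
  }
}
```

With a file, this could be called as `LineReader.readLines(new FileReader(fileName))`, keeping in mind that the FileReader constructor itself throws a checked FileNotFoundException.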

Count the Number of Non-Whitespace Characters Per Character Column

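Next, we want to count the number of non-whitespace characters in each character column (as opposed to each data column). A "character column" is a column of the file that is a single character wide, while a data column consists of one or more adjacent character columns. Character columns with very few non-whitespace characters are likely to be delimiting columns (which separate the data columns). For clarity, I'll explain the code step by step here.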

First, we want to take each line of our file and determine whether a character is a whitespace character or not. Basically, we want to convert our List<String> to a List<List<Boolean>>, where each element of the inner List is true if the character at that position on that line is not a whitespace character. To do that, we first break the String into a char[] array using String.toCharArray(). (To start, I'll use the first line of lines (lines.get(0)) as a placeholder for later, when we'll use a loop.)

jshell> lines.get(0).toCharArray()
$86 ==> char[771] { 'e', 'x', 'e', 'c', ...

At this point, we could convert this char[] to a Stream<Character> by surrounding the above with a CharBuffer.wrap(), then calling chars() on the resulting CharBuffer, using mapToObj() and so on, but there's a much more performant way of achieving the same thing: a good, old for loop:

jshell> List<List<Boolean>> charsNonWS = new ArrayList<>() // String line => List<Boolean> line
charsNonWS ==> []

jshell> for (int ll = 0; ll < nLines; ++ll) { // loop over lines read from file
   ...>   charsNonWS.add(new ArrayList<Boolean>()); // add new empty array to List
   ...>   List<Boolean> temp = charsNonWS.get(ll); // save reference to use below
   ...>   for (char ch : lines.get(ll).toCharArray()) // loop over chars in this line
   ...>     temp.add(!Character.isWhitespace(ch)); // true if char is non-whitespace
   ...> }

jshell> charsNonWS
charsNonWS ==> [[true, true, true, true, true, ...
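For reference, the Stream-based alternative mentioned above could be sketched as follows. This version calls String.chars() directly rather than going through CharBuffer.wrap(), and the class name `NonWhitespaceFlags` is made up for the example. It makes no claim about performance; it only shows that the two approaches produce the same List<List<Boolean>>.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only.
class NonWhitespaceFlags {

  // One line -> one List<Boolean>; true where the character is NOT whitespace.
  static List<Boolean> flags(String line) {
    return line.chars()                              // IntStream of char values
        .mapToObj(ch -> !Character.isWhitespace(ch)) // Stream<Boolean>
        .collect(Collectors.toList());
  }

  // All lines -> List<List<Boolean>>, the same shape as charsNonWS.
  static List<List<Boolean>> flagsForAll(List<String> lines) {
    return lines.stream()
        .map(NonWhitespaceFlags::flags)
        .collect(Collectors.toList());
  }
}
```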

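Now, we want to count the number of non-whitespace characters per column, rather than per line. So, in a sense, we need to "rotate" the data. To do that, we first find the maximum number of character columns in any line, then create an array with that length. Here, I use a Stream to save typing out another big for loop: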

jshell> int nCharCols = charsNonWS.stream().mapToInt(e -> e.size()).max().orElse(0)
nCharCols ==> 771

charsNonWS.stream() converts charsNonWS from a List<List<Boolean>> to a Stream<List<Boolean>>. In other words, each element of the Stream is one line of the file, where characters have been converted to false/true values based on whether they're whitespace characters or not, respectively. Then, we map each List<Boolean> to a single int value, its length in characters, with mapToInt(e -> e.size()). Finally, we find the maximum value of the Stream (which is now an IntStream) with max(). max() returns an OptionalInt, so we need to extract that value with a getAsInt() or something similar. I opted for an orElse(0), which will return 0 as the maximum line length (in characters) if the Stream is empty (for example, if the file contained no lines).

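So the maximum number of characters in any line of the file is 771. Now, let's create an int[] and count the number of non-whitespace characters in each of those 771 character columns: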

jshell> int[] counts = new int[nCharCols]
counts ==> int[771] { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... , 0, 0, 0, 0, 0, 0, 0, 0 }

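The ints in the array are initialized to 0, so we don't need to clear the array before we start using it. Instead, let's get straight to counting the non-whitespace characters in each column.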

jshell> for (List<Boolean> row : charsNonWS) // loop over each "row" ("line" / inner List<Boolean>)
   ...>   for (int cc = 0; cc < row.size(); ++cc) // loop over each "column" (char) in that "row" (line)
   ...>     if (row.get(cc)) ++counts[cc]; // if the char is non-whitespace (true), increment column

jshell> counts
counts ==> int[771] { 4, 4, 4, 2, 4, 4, 4, 4, 2, ...

So counts now holds the number of non-whitespace characters in each character column of the text file.
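If the intermediate List<List<Boolean>> isn't needed for anything else, the per-column counts can also be computed straight from the lines. The sketch below is one way to do that under that assumption; the class and method names are hypothetical.

```java
import java.util.List;

// Hypothetical helper, for illustration only.
class ColumnCounter {

  // Returns, for each character column, how many lines have a
  // non-whitespace character at that position.
  static int[] countNonWhitespace(List<String> lines) {
    // the longest line determines the number of character columns
    int nCharCols = lines.stream().mapToInt(String::length).max().orElse(0);
    int[] counts = new int[nCharCols]; // all elements start at 0

    for (String line : lines)
      for (int cc = 0; cc < line.length(); ++cc)
        if (!Character.isWhitespace(line.charAt(cc))) ++counts[cc];

    return counts;
  }
}
```

For the three lines "ab c", "a  d" and "x", this returns { 3, 1, 0, 2 }: the third character column is empty in every line.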

Infer "Empty" Columns

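Next, we want an overview of the number of non-whitespace characters per column. In other words, are there any columns with no non-whitespace characters at all? Or is there some minimum number? Essentially, what we want is a histogram of counts. The easiest way to get one is probably with another Stream: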

jshell> Map<Integer, Long> map = Arrays.stream(counts). // convert int[] to Stream of primitive ints
   ...>   mapToObj(i -> (Integer)i). // convert primitive ints to Integers
   ...>   collect(Collectors.groupingBy( // group the Integers according to...
   ...>     Function.identity(), // their identity (value)
   ...>     Collectors.counting() // and then count the number in each group
   ...>   ))
map ==> {16=10, 0=9, 1=549, 17=113, 18=31, 2=39, 19=7, 3=2, 4=11}

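So there are 9 character columns with 0 non-whitespace characters, 549 with 1 non-whitespace character, and so on. It looks like those 9 "empty" character columns separate the data columns. Let's programmatically extract the smallest number of non-whitespace characters in any character column from the map above, and use it to define an "empty" column: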

jshell> int emptyColDef = Collections.min(map.keySet())
emptyColDef ==> 0

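This might seem like overkill for this application, but in general, it's a good idea to automate things like this. It makes your code more robust and easier to reuse in future applications. The code above just compares the keys of the map (the number of non-whitespace characters per character column) and finds the smallest one.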
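As a variation on the histogram above, collecting into a TreeMap keeps the keys sorted, so the smallest count is simply the first key. This is only a sketch of that idea; the class name `CountHistogram` is hypothetical, and boxed() is used here in place of the explicit (Integer) cast.

```java
import java.util.Arrays;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only.
class CountHistogram {

  // Maps each distinct count to the number of character columns having that count.
  static TreeMap<Integer, Long> histogram(int[] counts) {
    return Arrays.stream(counts)
        .boxed() // int -> Integer
        .collect(Collectors.groupingBy(
            Function.identity(),     // group equal values together
            TreeMap::new,            // keep the keys in ascending order
            Collectors.counting())); // count the members of each group
  }

  // The smallest count seen in any character column (0 for an empty array).
  static int emptyColDef(int[] counts) {
    return Arrays.stream(counts).min().orElse(0);
  }
}
```

For counts of { 0, 1, 1, 4, 0, 1 }, the histogram prints as {0=2, 1=3, 4=1}, and both `histogram(counts).firstKey()` and `emptyColDef(counts)` give 0.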

Find Delimiting Columns

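Now, we can find the character columns which define (delimit) the extents of the data columns. These are usually the columns with the fewest non-whitespace characters (when whitespace is used to pad the fixed-width data columns). We want the indices of these character columns, so let's get a Stream of counts and compare those values to our emptyColDef: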

jshell> List<Boolean> emptyCols = Arrays.stream(counts). // convert int[] to Stream of primitive ints
   ...>   mapToObj(n -> n == emptyColDef). // convert primitive ints to Booleans
   ...>   collect(Collectors.toList()) // collect in a List
emptyCols ==> [false, false, false, ...

The empty (delimiting) columns are the ones with true values in emptyCols. To find their indices, we just loop over emptyCols:

jshell> List<Integer> emptyIndices = new ArrayList<>()
emptyIndices ==> []

jshell> for (int cc = 0; cc < nCharCols; ++cc)
   ...>   if (emptyCols.get(cc)) emptyIndices.add(cc)

jshell> emptyIndices
emptyIndices ==> [38, 89, 120, 151, 352, 553, 592, 631, 670]

The for loop above just checks whether the value at index cc in emptyCols is true. If it is, it adds that index to emptyIndices, which now holds the character column indices which delimit the data columns in our fixed-width file! The last thing to do is to insert a 0 at the beginning of the list, because we'll use adjacent values as the "start" and "end" character columns for each data column, and the first data column begins at the 0th character:

jshell> int nDataCols = emptyIndices.size()
nDataCols ==> 9

jshell> emptyIndices.add(0, 0) // add a value 0 at the 0th position in the List

jshell> emptyIndices
emptyIndices ==> [0, 38, 89, 120, 151, 352, 553, 592, 631, 670]
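The two steps above (building emptyCols, then looping to collect the indices) can also be folded into a single Stream over the indices. The sketch below assumes the same counts array and emptyColDef value as above; the class name `DelimiterIndices` is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical helper, for illustration only.
class DelimiterIndices {

  // Returns 0 followed by the index of every "empty" character column.
  static List<Integer> find(int[] counts, int emptyColDef) {
    List<Integer> indices = IntStream.range(0, counts.length) // 0, 1, ..., length - 1
        .filter(cc -> counts[cc] == emptyColDef)              // keep only empty columns
        .boxed()
        .collect(Collectors.toCollection(ArrayList::new));    // mutable, so we can insert
    indices.add(0, 0); // the first data column starts at character 0
    return indices;
  }
}
```

One caveat of this sketch (and of the loop it mirrors): if character column 0 is itself empty, the list will start with two 0s, so input with leading padding would need an extra check.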

Parsing

Finally, we can use emptyIndices to parse our file. We can split each line at the given character indices, then do a String.trim() to remove leading and/or trailing whitespace. Note that some lines may be shorter than the "standard" line length (holding metadata or something similar) so we need to do a bounds check before we split the String line into substrings:

jshell> List<List<String>> tokens = new ArrayList<>(nLines) // pre-allocate space
tokens ==> []

jshell> for (int ll = 0; ll < nLines; ++ll) { // loop over all lines in file
   ...>   tokens.add(new ArrayList<String>()); // add new List<String> parsed tokens for line
   ...>   List<String> tokensList = tokens.get(ll); // get reference to List to use below
   ...>   String line = lines.get(ll); // get line as String
   ...>   int len = line.length(); // get length of line in characters
   ...>   for (int ii = 1; ii <= nDataCols; ++ii) { // loop over data columns
   ...>     if (len < emptyIndices.get(ii)) break; // check if line is long enough to have next token
   ...>     tokensList.add(line.substring(emptyIndices.get(ii-1), emptyIndices.get(ii)).trim()); // get token
   ...>   }
   ...> }

jshell> tokens
tokens ==> [[execBegan, SampleID, ExperimentID, ...

jshell> tokens.get(7) // for example
$142 ==> [2018-11-04 11:07:16.8570000, 0016M978, test, test, SP -> Gilson, Execution Completed, 2018-11-04 11:07:15.0000000, 2018-11-04 11:09:37.5330000, 2018-11-04 11:07:11.7870000]

Beautiful! Now, we have a List<List<String>> containing (in the outer List) the lines of the file broken up into (in the inner Lists) String tokens, with leading and trailing whitespace trimmed. We inferred the column widths of a fixed-width text file and parsed its contents! As a next step, we could attempt to infer the type of data held in each token, maybe using something like my Typifier, which infers the type of data held within Java Strings.
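Putting the steps together, a compact version of the whole approach might look like the sketch below. It is not the class from the Gist; the name `FixedWidthParser` and the sample data are assumptions for illustration. It also deviates from the walkthrough in two labeled ways: it adds the maximum line length as a final boundary so that the last data column is kept, and it clamps each token to the end of short lines instead of skipping it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical end-to-end sketch, for illustration only.
class FixedWidthParser {

  // Infers the character indices at which data columns start and end.
  static List<Integer> inferBoundaries(List<String> lines) {
    int nCharCols = lines.stream().mapToInt(String::length).max().orElse(0);
    int[] counts = new int[nCharCols];
    for (String line : lines)
      for (int cc = 0; cc < line.length(); ++cc)
        if (!Character.isWhitespace(line.charAt(cc))) ++counts[cc];

    int emptyColDef = Arrays.stream(counts).min().orElse(0);

    List<Integer> bounds = new ArrayList<>();
    bounds.add(0);                                  // first data column starts at 0
    for (int cc = 1; cc < nCharCols; ++cc)          // start at 1 to avoid a duplicate 0
      if (counts[cc] == emptyColDef) bounds.add(cc);
    bounds.add(nCharCols);                          // deviation: keep the last data column
    return bounds;
  }

  // Splits every line at the given boundaries and trims each token.
  static List<List<String>> parse(List<String> lines, List<Integer> bounds) {
    List<List<String>> tokens = new ArrayList<>(lines.size());
    for (String line : lines) {
      List<String> row = new ArrayList<>();
      for (int ii = 1; ii < bounds.size(); ++ii) {
        int start = bounds.get(ii - 1);
        if (start >= line.length()) break;                 // line has no more tokens
        int end = Math.min(bounds.get(ii), line.length()); // deviation: clamp short lines
        row.add(line.substring(start, end).trim());
      }
      tokens.add(row);
    }
    return tokens;
  }

  public static void main(String[] args) {
    String fmt = "%-7s%-7s%-7s%-18s%8s"; // made-up layout for the sample rows
    List<String> lines = List.of(
        String.format(fmt, "287540", "Smith", "Jones", "Accountant", "$55,000"),
        String.format(fmt, "204878", "Ross", "Betsy", "Senior Accountant", "$66,000"),
        String.format(fmt, "208417", "Arthur", "Wilbur", "CEO", "$123,000"));
    List<Integer> bounds = inferBoundaries(lines);
    System.out.println(bounds); // prints [0, 6, 13, 20, 38, 47]
    parse(lines, bounds).forEach(System.out::println);
  }
}
```

Note that "Senior Accountant" survives as a single token here only because the space inside it lines up with a non-whitespace character in another row. Inference like this depends on the data, so it is worth checking the inferred boundaries against a few known lines.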

I hope this walkthrough was helpful and/or interesting! If you have any comments or questions, please let me know in the comments below. I've compiled the code above into a class and posted it to Gist, as well. Happy coding!

from: https://dev.to//awwsmm/java-infer-column-widths-of-a-fixed-width-text-file-2hh0