CIT 594 Module 7 Programming Assignment

CSV Slicer

In this assignment you will read files in a format known as “comma separated values” (CSV), interpret the formatting and output the content in the structure represented by the file.

Learning Objectives

In completing this assignment, you will:

  • Implement a method to extract content from an input CSV
  • Read and understand a formal specification for the CSV format
  • Use a “state machine”

Background and Getting Started

Applications have long used delimited text files to store and transmit tables. The simplicity of the format means these files are human readable and editable as well as relatively easy to support in code. The breadth of support from nearly any general purpose application or tabular data loading library continues to make these files particularly popular for publishing data despite their limitations and inefficiencies. You will see more of this in later assignments where you will be required to support reading publicly available datasets.

Commas are among the most common delimiters for separating fields within a row, with line breaks separating the rows of the table. Files in these comma-based formats are commonly called “comma separated values” (CSV) files.

Supporting field values that contain delimiter characters (e.g., a comma that is part of the data rather than a field separator) requires some added complexity. There is no single, universally accepted specification for CSV files, so for this assignment we will focus on one widely accepted format, RFC 4180.

To get started, carefully read sections 1 and 2 of RFC 4180, “Common Format and MIME Type for Comma-Separated Values (CSV) Files”.

That document uses a precise syntax to express an exact formal specification. If you are unsure how to interpret the ABNF grammar, you can refer to RFC 2234.

CSV Format For This Assignment

You will write a method to process CSV files character-by-character. The exact format is a relaxation of RFC 4180.

Adjustments and clarifications of the descriptive rules:

  1. Formatting characters describe structure; they are not part of the content. For example, the comma that separates two fields should not be included in either field. Consider the following line (easy3.csv):

"example of using "" in a field",1

Should result in a single row that is the equivalent of an array constructed with the Java expression:

new String[]{ "example of using \" in a field", "1" }

  2. The same applies for the escape character in an escaped field. Two double quotes in the middle of an escaped field count as a single double quote character in the content.
  3. You may ignore rule number 4 (for this assignment). Instead, follow these rules:
    1. No special treatment should be applied to the first row. You should not treat a header any differently than a record.
    2. You will not need to check if all rows have the same number of fields.
    3. Commas at the end of a line signify empty fields. For example, "a,b,c," results in a row with four fields: ["a", "b", "c", ""] [1]
    4. An empty line terminated by a line break is a valid row if it’s outside an escaped field. Inside an escaped field, it is just part of the content of that field.
  4. Clarification of rule 2: There is no additional record if the file ends at the start of a line. This includes, but is not limited to, the example in rule 1, where the file ends with CRLF followed by EOF.

Adjustments to the grammar:

CRLF = [CR] LF

TEXTDATA = %x00-09 / %x0B-0C / %x0E-21 / %x23-2B / %x2D-7F

The first rule change makes the carriage return (CR) optional everywhere CRLF is used. This is a relaxation seen in many places to adjust for the fact that many systems and applications choose to omit the carriage return and only use a single line feed character for line breaks.

The second rule change expands TEXTDATA [2] to all characters other than comma, double quote, carriage return, and line feed.

Note: these are relaxations, therefore all documents that are valid for the strict interpretation of RFC 4180 are valid for this format.
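As a rough illustration of the relaxed CRLF = [CR] LF rule, the sketch below recognizes a line break at a given position in a String. The class and method names are hypothetical, and it works on a String index purely for demonstration; the assignment itself reads one character at a time from the provided CharacterReader.

```java
// Hypothetical helper, not part of the provided assignment code.
final class LineBreaks {
    /**
     * Returns the length of the line break that starts at index i:
     * 2 for CR LF, 1 for a lone LF, and 0 if no line break starts there.
     * A CR that is not followed by LF is not treated as a line break.
     */
    static int lineBreakLength(String text, int i) {
        if (i >= text.length()) {
            return 0;
        }
        char c = text.charAt(i);
        if (c == '\n') {
            return 1;
        }
        if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
            return 2;
        }
        return 0;
    }
}
```

Both "a\r\nb" and "a\nb" contain exactly one line break under this rule, which is the point of the relaxation.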

For the purposes of the assignment there are five classes of characters for you to consider:

Common Name      RFC Name   RFC Code (hex)   Decimal   Java Character
Line Feed        LF         x0A              10        \n
Carriage Return  CR         x0D              13        \r
Double Quote     DQUOTE     x22              34        "
Comma            COMMA      x2C              44        ,
Anything Else    TEXTDATA                              [^\r\n,"] [3]


[1] The statement in rule 4 of the RFC that "The last field in the record must not be followed by a comma" refers to its rule that each line should contain the same number of fields. A comma at the end of a line would insert an additional, empty field. The statement indicates that an extra comma that would increase the number of fields beyond the expected limit should not be dropped or ignored.

[2] We will not test with characters above %x7F (127), but you are welcome to extend TEXTDATA up to %x7FFFFFFF (Integer.MAX_VALUE) if you wish. That will cover many more special characters and allow you to process Unicode values.

[3] This is just a regular expression that matches any character other than the four listed.
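The five character classes listed above could be modelled in Java along the following lines. This is only a sketch: the enum and method names are assumptions made for illustration and are not part of the provided code.

```java
// Hypothetical classifier for the five character classes.
final class CharClasses {
    enum CharClass { LF, CR, DQUOTE, COMMA, TEXTDATA }

    /**
     * Maps a character code to its class. Anything that is not one of the
     * four formatting characters falls through to TEXTDATA, so the caller
     * must handle end-of-input (for example a negative value) before
     * calling this method.
     */
    static CharClass classify(int ch) {
        switch (ch) {
            case '\n': return CharClass.LF;      // decimal 10
            case '\r': return CharClass.CR;      // decimal 13
            case '"':  return CharClass.DQUOTE;  // decimal 34
            case ',':  return CharClass.COMMA;   // decimal 44
            default:   return CharClass.TEXTDATA;
        }
    }
}
```

Classifying each character once, up front, keeps later decisions about what to do with it separate from the question of what kind of character it is.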

Activity

Implement the readRow method in CSVReader.java.

You may also add supporting fields and helper methods to the CSVReader object as needed. Each call to readRow must return one row of data from reader until the input is exhausted.

If there is a format error in the input, you should raise a CSVFormatException. You are welcome to use the optional fields available in CSVFormatException for your own informative error messages (potentially quite useful for debugging). Any extra values and messages you use will not be evaluated by the grader.

Performance matters. The overall runtime for reading through the entire input should be O(n) where n is the number of characters in the input. As a reminder, this does mean that certain seemingly convenient operations and data structures may not be appropriate for this assignment. Choose your data structures carefully.
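To make the performance concern concrete, here is a small comparison of two ways to accumulate characters. The class and method names are hypothetical. Repeated String concatenation builds a new String on every step, so the total work grows quadratically, while StringBuilder appends in amortized constant time.

```java
// Hypothetical illustration of accumulation cost; not part of the provided code.
final class AppendCost {
    /** O(n^2) overall: each += creates a new String containing everything so far. */
    static String slowJoin(char[] chars) {
        String result = "";
        for (char c : chars) {
            result += c;
        }
        return result;
    }

    /** O(n) overall: StringBuilder appends in amortized constant time. */
    static String fastJoin(char[] chars) {
        StringBuilder sb = new StringBuilder();
        for (char c : chars) {
            sb.append(c);
        }
        return sb.toString();
    }
}
```

Both methods return the same result; only the cost differs. The same reasoning applies to collecting fields: appending to an ArrayList is amortized constant time, whereas copying an array on every new field is not.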

You will process the input one character at a time (hence the provided CharacterReader which is more restricted than the standard java.io.Reader). Along those lines, you may wish to organize your code in a “state machine”. The term “state machine” just implies that a program may respond differently to a given input based on a current “state” or context. A more detailed explanation is included with the assignment as supplemental reading.
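As a toy illustration of the state machine idea (deliberately not a CSV parser), the sketch below counts commas that fall outside double-quoted regions. The same input character, a comma, is handled differently depending on the current state. All names are hypothetical, and the code does no validation and ignores line breaks.

```java
// Toy two-state machine; the names are assumptions made for this example.
final class QuoteAwareCounter {
    private enum State { OUTSIDE_QUOTES, INSIDE_QUOTES }

    /** Counts the commas that appear outside double-quoted regions. */
    static int countCommasOutsideQuotes(String text) {
        State state = State.OUTSIDE_QUOTES;
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (state) {
                case OUTSIDE_QUOTES:
                    if (c == '"') {
                        state = State.INSIDE_QUOTES;   // a quote opens a region
                    } else if (c == ',') {
                        count++;                       // the comma counts here
                    }
                    break;
                case INSIDE_QUOTES:
                    if (c == '"') {
                        state = State.OUTSIDE_QUOTES;  // a quote closes the region
                    }
                    // a comma in this state is ignored
                    break;
            }
        }
        return count;
    }
}
```

For the input a,"b,c",d this returns 2, because the comma between b and c is seen while the machine is in the INSIDE_QUOTES state. A full solution would likely need more states and must report format errors, which this sketch does not attempt.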

Before You Submit

Please be sure that:

  • your classes are in the default package, i.e. there is no “package” declaration at the top of the source code
  • your classes compile and you have not changed the signature of the methods we have provided
  • you have not created any additional .java files
  • you have not overloaded any existing method names
  • you have filled out the required academic integrity signature in the comment block at the top of your submission files

How to Submit

After you have finished implementing the CSVReader class, go to the “Module 7 Programming Assignment submission” item and click the “Open Tool” button to go to the Codio platform.

Once you are logged into Codio, read the submission instructions in the README file. Be sure you upload your code to the “submit” folder.

To test your code before submitting, click the “Run Test Cases” button in the Codio toolbar.

Unlike the other assignments, this one has several different sets of tests you can run in Codio. Most have limits, so use them carefully. Once you use up a test, you will not get to try it again; there will be no exceptions.

The test cases we provide here are “sanity check” tests to make sure that you have the basic functionality working correctly, but it is up to you to ensure that your code satisfies all of the requirements described in this document.

Assessment

Your code will be evaluated against test cases that stress the limits of the specification including tricky but valid inputs and invalid inputs. There is no ambiguity in the specification. If something does not fit the grammar it is not valid.

Grading will be roughly:

  • 0% trivial CSV inputs
  • 20% simple valid CSV inputs
  • 60% tricky valid inputs
  • 20% invalid inputs

After submitting your code for grading, you can go back to this assignment in Codio and view the “results.txt” file, which should be listed in the Filetree on the left. This file will describe any failing test cases.

FAQ and Hints

  • “What about...” RFC 4180.
  • java.lang.StringBuilder
  • java.io.Reader
  • Characters in C and Java are written with single quotes ('a'). Double quotes are for strings ("this is a string").
  • The switch statement is your friend (at least for this assignment). See the provided StateMachineIntro.pdf for an example.

Processing trivial CSV files (e.g., a file with no quotes) can be as simple as:

Files.lines(Path.of(filename)).map(line -> line.split(",")).toArray()

This might help you get started with some initial testing. This is also just the baseline functionality for your implementation. You should expect no credit if your implementation does not correctly handle more sophisticated tests.
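The short sketch below shows two places where a plain split on commas departs from the format described in this document. The class and method names are hypothetical. Java's single-argument String.split discards trailing empty strings, so a trailing comma does not produce the final empty field, and quoted fields are cut at every comma.

```java
// Hypothetical demonstration of where the split-based baseline falls short.
final class SplitBaseline {
    /** The baseline approach: trailing empty strings are dropped. */
    static String[] naiveSplit(String line) {
        return line.split(",");
    }

    /** A negative limit keeps trailing empty strings. */
    static String[] splitKeepingTrailing(String line) {
        return line.split(",", -1);
    }

    public static void main(String[] args) {
        System.out.println(naiveSplit("a,b,c,").length);            // prints 3
        System.out.println(splitKeepingTrailing("a,b,c,").length);  // prints 4
        // A quoted field that contains a comma is cut into two pieces.
        System.out.println(naiveSplit("\"x,y\",1").length);         // prints 3
    }
}
```

Neither variant removes the enclosing quotes or interprets doubled quotes, which is why the baseline is only useful for initial testing.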
