1: Csvkit (Data Cleaning and Exploration Using Csvkit )

最新推荐文章于 2024-05-10 18:31:02 发布

weixin_33835690

最新推荐文章于 2024-05-10 18:31:02 发布

阅读量113

点赞数

文章标签：人工智能 python

原文链接：https://my.oschina.net/Bettyty/blog/747186

版权

2019独角兽企业重金招聘Python工程师标准>>>

So far, we've been using the default command line tools to clean, munge, and explore data. Tools likewc and head are useful tools, but weren't designed specifically for working with datasets and are limited in many ways. These tools lack features specific to working with tabular datasets, like parsing the header row or understanding the row and column layout. Because of this, in theData Munging Using the Command Line challenge, you had to specifically compute the number of lines in each CSV file using the wctool and use that number to select just the non-header rows using the tail tool. You then had to repeat this for each CSV file you were trying to merge into the resulting, single file!

In this mission, we'll learn about the Csvkit library, which supercharges your workflow by adding 13 new command line tools specifically for working with CSV files. We'll focus on these 5 tools from Csvkit:

csvstack: for stacking rows from multiple CSV files.
csvlook: renders CSV in pretty table format.
csvcut: for selecting specific columns from a CSV file.
csvstat: for calculating descriptive statistics for some or all columns.
csvgrep: for filtering tabular data using specific criteria.

We'll be using csvkit version 0.9.1 in this mission and you can read about the installation procedure in the documentation. We'll continue to work with the same 3 datasets on housing affordability:

Hud_2005.csv,
Hud_2007.csv,
Hud_2013.csv.

#####################################################

2: Csvstack

To start, let's circle back to the task of merging 3 CSV files into 1 file. We can use csvstack tool to consolidate the rows from multiple CSV files and redirect the stdout to a new file:

csvstack file1.csv file2.csv file3.csv >

final.csv

As long as the header row for each file in the stdin to csvstack is the same, the first row in the resulting file will match this header row. After the header row, final.csv will contain all of the non-header rows from file1.csv, then all of the non-header rows from file2.csv, then finally the non-header rows from file3.csv. If you don't redirect the stdout of csvstack to a file or a tool like head, the full output will be rendered in the terminal. This can cause your terminal to grind to a halt as it tries to process and display all of the output and you want to be extra careful to avoid doing so.

If you peeked at the documentation, you may have noticed that the behavior of csvstack can be modified using a few different flags. For example,

if you want to be able to trace the file where each row originated from in the merged file, you can use the -g flag to specify a grouping value for each filename. When stacking the rows from a file, csvstack will add the corresponding value in a new column. Lastly, you can use the -n flag to specify the name of this new column. The following code will create a new column namedorigin, containing the values 1, 2, or 3depending on which file that row originated from:

csvstack -n origin -g 1,2,3 file1.csv file2

.csv file3.csv > final.csv

The rows in final.csv that originated fromfile1.csv will contain the value 1 in theorigin column and those from file2.csv will contain the value 2 in the origin column. Let's now use csvstack to combine the 3 datasets on U.S. housing affordability from the last challenge.

Instructions

Merge Hud_2005.csv, Hud_2007.csv, and Hud_2013.csv in that order into one file:
- Name the resulting file Combined_hud.csv.
- Add an extra column named year which contains the year value from the file name for each row. E.g. the rows that originated from Hud_2005.csv should have 2005 as the value in the year column.
Use head to preview the first few rows of Combined_hud.csv.
Use the wc command with the l flag to confirm that the merged file contains 154118 rows

~$ ls -l

~$ csvstack -n year -g 2005,2007,2013 Hud_2005.csv Hud_2007.csv Hud_2013.csv > Combined_hud.csv

~$ head -5 Combined_hud.csv

####################################

3: Csvlook

While head allows you to quickly observe the first few rows in a file, it doesn't attempt to format the rendered output at all. CSV files are tabular and it's incredibly useful to observe this structure and other data tools like Pandas and Microsoft Excel factored that notion in when displaying tabular data. Thankfully, we can use the csvlook tool to display tabular data in the table format we're used to.

The csvlook tool parses CSV formatted data from it's stdin and outputs a pretty formatted table representation of that data to it's stdout:

head -10 final.csv | csvlook

Let's use csvlook to explore the first few rows from the CSV file we created in the last screen.

Instructions

Use csvlook to preview the first 10 rows fromCombined_hud.csv

~$ head -10 Combined_hud.csv | csvlook

/home/dq$ head -10 Combined_hud.csv | csvlook

|-------+------+--------+-----+-----------+-------------+---------|

|-------+------+--------+-----+-----------+-------------+---------|

| 2005 | 43 | 0.513 | 680 | '3 3BR' | '1980-1989' | 20000 |

| 2005 | 44 | 0.223 | 760 | '4 4BR+' | '1980-1989' | 71000 |

| 2005 | 58 | 0.218 | 680 | '3 3BR' | '1980-1989' | 63000 |

| 2005 | 22 | 0.217 | 519 | '1 1BR' | '1980-1989' | 27040 |

| 2005 | 48 | 0.283 | 600 | '1 1BR' | '1980-1989' | 14000 |

| 2005 | 42 | 0.292 | 788 | '3 3BR' | '1980-1989' | 42000 |

| 2005 | -9 | -9.000 | 702 | '2 2BR' | '1980-1989' | -9 |

| 2005 | 23 | 0.145 | 546 | '2 2BR' | '1980-1989' | 48000 |

| 2005 | 51 | 0.296 | 680 | '3 3BR' | '1980-1989' | 58000 |

##############################################################

4: Csvcut

Csvlook returned a table formatted output of the merged CSV file. Let's now explore individual columns using the csvcut tool. Using the csvcut command with just the -n flag parses and displays all the columns in a CSV file along with an unique integer identifier for each column:

csvcut -n Combined_hud.csv

will output:

1: year

2: AGE1

3: BURDEN

4: FMR

5: FMTBEDRMS

6: FMTBUILT

7: TOTSAL

You can use the integer identifier for each column and the -cc flag to select just a specific column:

csvcut -c 1 Combined_hud.csv

will output just the year column. You want to avoid displaying the entire column since it contains 154118 rows and your terminal window will severely come to a halt attempting to display all that information. Instead, you can pipe the column output to head to preview just the firstn rows.

Instructions

Use csvcut to return all of the column names fromCombined_hud.csv.
Use csvcut to display just the first 10 values in the AGE1 column

~$ csvcut -n Combined_hud.csv

/home/dq$ csvcut -n Combined_hud.csv

1: year

2: AGE1

3: BURDEN

4: FMR

5: FMTBEDRMS

6: FMTBUILT

7: TOTSAL

~$ csvcut -c 2 Combined_hud.csv | head -10

/home/dq$ csvcut -c 2 Combined_hud.csv | head -10

AGE1

-9

##############################################################

Now that we know how to select specific columns, we can select a column and pipe it to the csvstat tool to calculate summary statistics for that column:

csvcut -c 4 Combined_hud.csv | csvstat

This calculates a full suite of summary statistics, including:

max,
min,
sum,
mean,
median,
standard deviation.

Depending on the size of the data, the full summary statistics for a column can take a long time and you often just want a specific summary statistic. You can use -- flags to choose specific summary statistics, which will greatly improve the speed:

# Just the max value.

csvcut -c 2 Combined_hud.csv | csvstat --max

# Just the mean value.

csvcut -c 2 Combined_hud.csv | csvstat

--mean

# Just the number of null values.

csvcut -c 2 Combined_hud.csv | csvstat

--nulls

You can see a full list of flags in thedocumentation. If you want to calculate summary statistics over all the columns in a CSV file, you can pass the file to csvstat directly:

csvstat Combined_hud.csv