1: Csvkit (Data Cleaning and Exploration Using Csvkit )

So far, we've been using the default command line tools to clean, munge, and explore data. Tools likewc and head are useful tools, but weren't designed specifically for working with datasets and are limited in many ways. These tools lack features specific to working with tabular datasets, like parsing the header row or understanding the row and column layout. Because of this, in theData Munging Using the Command Line challenge, you had to specifically compute the number of lines in each CSV file using the wctool and use that number to select just the non-header rows using the tail tool. You then had to repeat this for each CSV file you were trying to merge into the resulting, single file!

In this mission, we'll learn about the Csvkit library, which supercharges your workflow by adding 13 new command line tools specifically for working with CSV files. We'll focus on these 5 tools from Csvkit:

  • csvstack: for stacking rows from multiple CSV files.
  • csvlook: renders CSV in pretty table format.
  • csvcut: for selecting specific columns from a CSV file.
  • csvstat: for calculating descriptive statistics for some or all columns.
  • csvgrep: for filtering tabular data using specific criteria.

We'll be using csvkit version 0.9.1 in this mission and you can read about the installation procedure in the documentation. We'll continue to work with the same 3 datasets on housing affordability:

  • Hud_2005.csv,
  • Hud_2007.csv,
  • Hud_2013.csv.

 

#####################################################

2: Csvstack

To start, let's circle back to the task of merging 3 CSV files into 1 file. We can use csvstack tool to consolidate the rows from multiple CSV files and redirect the stdout to a new file:

 

csvstack file1.csv file2.csv file3.csv >

    final.csv

As long as the header row for each file in the stdin to csvstack is the same, the first row in the resulting file will match this header row. After the header row, final.csv will contain all of the non-header rows from file1.csv, then all of the non-header rows from file2.csv, then finally the non-header rows from file3.csv. If you don't redirect the stdout of csvstack to a file or a tool like head, the full output will be rendered in the terminal. This can cause your terminal to grind to a halt as it tries to process and display all of the output and you want to be extra careful to avoid doing so.

If you peeked at the documentation, you may have noticed that the behavior of csvstack can be modified using a few different flags. For example,

if you want to be able to trace the file where each row originated from in the merged file, you can use the -g flag to specify a grouping value for each filename. When stacking the rows from a file, csvstack will add the corresponding value in a new column. Lastly, you can use the -n flag to specify the name of this new column. The following code will create a new column namedorigin, containing the values 12, or 3depending on which file that row originated from:

 

csvstack -n origin -g 1,2,3 file1.csv file2

    .csv file3.csv > final.csv

The rows in final.csv that originated fromfile1.csv will contain the value 1 in theorigin column and those from file2.csv will contain the value 2 in the origin column. Let's now use csvstack to combine the 3 datasets on U.S. housing affordability from the last challenge.

Instructions

  • Merge Hud_2005.csvHud_2007.csv, and Hud_2013.csv in that order into one file:

    • Name the resulting file Combined_hud.csv.
    • Add an extra column named year which contains the year value from the file name for each row. E.g. the rows that originated from Hud_2005.csv should have 2005 as the value in the year column.
  • Use head to preview the first few rows of Combined_hud.csv.

  • Use the wc command with the l flag to confirm that the merged file contains 154118 rows

~$ ls -l

~$ csvstack -n year -g 2005,2007,2013 Hud_2005.csv Hud_2007.csv Hud_2013.csv > Combined_hud.csv

~$ head -5 Combined_hud.csv

 

 

####################################

3: Csvlook

While head allows you to quickly observe the first few rows in a file, it doesn't attempt to format the rendered output at all. CSV files are tabular and it's incredibly useful to observe this structure and other data tools like Pandas and Microsoft Excel factored that notion in when displaying tabular data. Thankfully, we can use the csvlook tool to display tabular data in the table format we're used to.

The csvlook tool parses CSV formatted data from it's stdin and outputs a pretty formatted table representation of that data to it's stdout:

 

head -10 final.csv | csvlook

Let's use csvlook to explore the first few rows from the CSV file we created in the last screen.

Instructions

  • Use csvlook to preview the first 10 rows fromCombined_hud.csv

~$ head -10 Combined_hud.csv | csvlook

/home/dq$ head -10 Combined_hud.csv | csvlook                                   

|-------+------+--------+-----+-----------+-------------+---------|             

|  year | AGE1 | BURDEN | FMR | FMTBEDRMS | FMTBUILT    | TOTSAL  |             

|-------+------+--------+-----+-----------+-------------+---------|             

|  2005 | 43   | 0.513  | 680 | '3 3BR'   | '1980-1989' | 20000   |             

|  2005 | 44   | 0.223  | 760 | '4 4BR+'  | '1980-1989' | 71000   |             

|  2005 | 58   | 0.218  | 680 | '3 3BR'   | '1980-1989' | 63000   |             

|  2005 | 22   | 0.217  | 519 | '1 1BR'   | '1980-1989' | 27040   |             

|  2005 | 48   | 0.283  | 600 | '1 1BR'   | '1980-1989' | 14000   |             

|  2005 | 42   | 0.292  | 788 | '3 3BR'   | '1980-1989' | 42000   |             

|  2005 | -9   | -9.000 | 702 | '2 2BR'   | '1980-1989' | -9      |             

|  2005 | 23   | 0.145  | 546 | '2 2BR'   | '1980-1989' | 48000   |             

|  2005 | 51   | 0.296  | 680 | '3 3BR'   | '1980-1989' | 58000   |

##############################################################

 

4: Csvcut

Csvlook returned a table formatted output of the merged CSV file. Let's now explore individual columns using the csvcut tool. Using the csvcut command with just the -n flag parses and displays all the columns in a CSV file along with an unique integer identifier for each column:

 

csvcut -n Combined_hud.csv

will output:

 

1: year

2: AGE1

3: BURDEN

4: FMR

5: FMTBEDRMS

6: FMTBUILT

7: TOTSAL

You can use the integer identifier for each column and the -cc flag to select just a specific column:

 

csvcut -c 1 Combined_hud.csv

will output just the year column. You want to avoid displaying the entire column since it contains 154118 rows and your terminal window will severely come to a halt attempting to display all that information. Instead, you can pipe the column output to head to preview just the firstn rows.

Instructions

  • Use csvcut to return all of the column names fromCombined_hud.csv.
  • Use csvcut to display just the first 10 values in the AGE1 column

~$ csvcut -n Combined_hud.csv

/home/dq$ csvcut -n Combined_hud.csv                                            

  1: year                                                                       

  2: AGE1                                                                       

  3: BURDEN                                                                     

  4: FMR                                                                        

  5: FMTBEDRMS                                                                  

  6: FMTBUILT                                                                   

  7: TOTSAL       

~$ csvcut -c 2 Combined_hud.csv | head -10

/home/dq$ csvcut -c 2 Combined_hud.csv | head -10                               

AGE1                                                                            

43                                                                              

44                                                                              

58                                                                              

22                                                                              

48                                                                              

42                                                                              

-9                                                                              

23                                                                              

51                                                                              

 

##############################################################

Now that we know how to select specific columns, we can select a column and pipe it to the csvstat tool to calculate summary statistics for that column:

 

csvcut -c 4 Combined_hud.csv | csvstat

This calculates a full suite of summary statistics, including:

  • max,
  • min,
  • sum,
  • mean,
  • median,
  • standard deviation.

Depending on the size of the data, the full summary statistics for a column can take a long time and you often just want a specific summary statistic. You can use -- flags to choose specific summary statistics, which will greatly improve the speed:

 

# Just the max value.

csvcut -c 2 Combined_hud.csv | csvstat --max

# Just the mean value.

csvcut -c 2 Combined_hud.csv | csvstat

    --mean

# Just the number of null values.

csvcut -c 2 Combined_hud.csv | csvstat

    --nulls

You can see a full list of flags in thedocumentation. If you want to calculate summary statistics over all the columns in a CSV file, you can pass the file to csvstat directly:

 

csvstat Combined_hud.csv

Instructions

  • Use csvstat to calculate just the mean for each column inCombined_hud.csv

~$ csvstat --mean Combined_hud.csv

/home/dq$ csvstat --mean Combined_hud.csv                                       

  1. year: 2008.9044232628457                                                   

  2. AGE1: 46.511215505103266                                                   

  3. BURDEN: 5.303764743668771                                                  

  4. FMR: 1037.1186695822005                                                    

  5. FMTBEDRMS: None                                                            

  6. FMTBUILT: None                                                             

  7. TOTSAL: 44041.841931779105                                                 

转载于:https://my.oschina.net/Bettyty/blog/747186

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值