So far, we've been using the default command line tools to clean, munge, and explore data. Tools likewc
and head
are useful tools, but weren't designed specifically for working with datasets and are limited in many ways. These tools lack features specific to working with tabular datasets, like parsing the header row or understanding the row and column layout. Because of this, in theData Munging Using the Command Line challenge, you had to specifically compute the number of lines in each CSV file using the wc
tool and use that number to select just the non-header rows using the tail
tool. You then had to repeat this for each CSV file you were trying to merge into the resulting, single file!
In this mission, we'll learn about the Csvkit library, which supercharges your workflow by adding 13 new command line tools specifically for working with CSV files. We'll focus on these 5 tools from Csvkit:
- csvstack: for stacking rows from multiple CSV files.
- csvlook: renders CSV in pretty table format.
- csvcut: for selecting specific columns from a CSV file.
- csvstat: for calculating descriptive statistics for some or all columns.
- csvgrep: for filtering tabular data using specific criteria.
We'll be using csvkit version 0.9.1 in this mission and you can read about the installation procedure in the documentation. We'll continue to work with the same 3 datasets on housing affordability:
- Hud_2005.csv,
- Hud_2007.csv,
- Hud_2013.csv.
#####################################################
2: Csvstack
To start, let's circle back to the task of merging 3 CSV files into 1 file. We can use csvstack tool to consolidate the rows from multiple CSV files and redirect the stdout to a new file:
csvstack file1.csv file2.csv file3.csv >
final.csv
As long as the header row for each file in the stdin to csvstack is the same, the first row in the resulting file will match this header row. After the header row, final.csv
will contain all of the non-header rows from file1.csv
, then all of the non-header rows from file2.csv
, then finally the non-header rows from file3.csv
. If you don't redirect the stdout of csvstack to a file or a tool like head
, the full output will be rendered in the terminal. This can cause your terminal to grind to a halt as it tries to process and display all of the output and you want to be extra careful to avoid doing so.
If you peeked at the documentation, you may have noticed that the behavior of csvstack can be modified using a few different flags. For example,
if you want to be able to trace the file where each row originated from in the merged file, you can use the -g
flag to specify a grouping value for each filename. When stacking the rows from a file, csvstack will add the corresponding value in a new column. Lastly, you can use the -n
flag to specify the name of this new column. The following code will create a new column namedorigin
, containing the values 1
, 2
, or 3
depending on which file that row originated from:
csvstack -n origin -g 1,2,3 file1.csv file2
.csv file3.csv > final.csv
The rows in final.csv
that originated fromfile1.csv
will contain the value 1
in theorigin
column and those from file2.csv
will contain the value 2
in the origin
column. Let's now use csvstack to combine the 3 datasets on U.S. housing affordability from the last challenge.
Instructions
-
Merge
Hud_2005.csv
,Hud_2007.csv
, andHud_2013.csv
in that order into one file:- Name the resulting file
Combined_hud.csv
. - Add an extra column named
year
which contains the year value from the file name for each row. E.g. the rows that originated fromHud_2005.csv
should have2005
as the value in theyear
column.
- Name the resulting file
-
Use
head
to preview the first few rows ofCombined_hud.csv
. - Use the
wc
command with thel
flag to confirm that the merged file contains154118
rows
~$ ls -l
~$ csvstack -n year -g 2005,2007,2013 Hud_2005.csv Hud_2007.csv Hud_2013.csv > Combined_hud.csv
~$ head -5 Combined_hud.csv
####################################
3: Csvlook
While head
allows you to quickly observe the first few rows in a file, it doesn't attempt to format the rendered output at all. CSV files are tabular and it's incredibly useful to observe this structure and other data tools like Pandas and Microsoft Excel factored that notion in when displaying tabular data. Thankfully, we can use the csvlook tool to display tabular data in the table format we're used to.
The csvlook tool parses CSV formatted data from it's stdin and outputs a pretty formatted table representation of that data to it's stdout:
head -10 final.csv | csvlook
Let's use csvlook to explore the first few rows from the CSV file we created in the last screen.
Instructions
- Use csvlook to preview the first 10 rows from
Combined_hud.csv
~$ head -10 Combined_hud.csv | csvlook
/home/dq$ head -10 Combined_hud.csv | csvlook
|-------+------+--------+-----+-----------+-------------+---------|
| year | AGE1 | BURDEN | FMR | FMTBEDRMS | FMTBUILT | TOTSAL |
|-------+------+--------+-----+-----------+-------------+---------|
| 2005 | 43 | 0.513 | 680 | '3 3BR' | '1980-1989' | 20000 |
| 2005 | 44 | 0.223 | 760 | '4 4BR+' | '1980-1989' | 71000 |
| 2005 | 58 | 0.218 | 680 | '3 3BR' | '1980-1989' | 63000 |
| 2005 | 22 | 0.217 | 519 | '1 1BR' | '1980-1989' | 27040 |
| 2005 | 48 | 0.283 | 600 | '1 1BR' | '1980-1989' | 14000 |
| 2005 | 42 | 0.292 | 788 | '3 3BR' | '1980-1989' | 42000 |
| 2005 | -9 | -9.000 | 702 | '2 2BR' | '1980-1989' | -9 |
| 2005 | 23 | 0.145 | 546 | '2 2BR' | '1980-1989' | 48000 |
| 2005 | 51 | 0.296 | 680 | '3 3BR' | '1980-1989' | 58000 |
##############################################################
4: Csvcut
Csvlook returned a table formatted output of the merged CSV file. Let's now explore individual columns using the csvcut tool. Using the csvcut command with just the -n
flag parses and displays all the columns in a CSV file along with an unique integer identifier for each column:
csvcut -n Combined_hud.csv
will output:
1: year
2: AGE1
3: BURDEN
4: FMR
5: FMTBEDRMS
6: FMTBUILT
7: TOTSAL
You can use the integer identifier for each column and the -cc
flag to select just a specific column:
csvcut -c 1 Combined_hud.csv
will output just the year
column. You want to avoid displaying the entire column since it contains 154118 rows and your terminal window will severely come to a halt attempting to display all that information. Instead, you can pipe the column output to head
to preview just the firstn
rows.
Instructions
- Use csvcut to return all of the column names from
Combined_hud.csv
. - Use csvcut to display just the first 10 values in the
AGE1
column
~$ csvcut -n Combined_hud.csv
/home/dq$ csvcut -n Combined_hud.csv
1: year
2: AGE1
3: BURDEN
4: FMR
5: FMTBEDRMS
6: FMTBUILT
7: TOTSAL
~$ csvcut -c 2 Combined_hud.csv | head -10
/home/dq$ csvcut -c 2 Combined_hud.csv | head -10
AGE1
43
44
58
22
48
42
-9
23
51
##############################################################
Now that we know how to select specific columns, we can select a column and pipe it to the csvstat tool to calculate summary statistics for that column:
csvcut -c 4 Combined_hud.csv | csvstat
This calculates a full suite of summary statistics, including:
- max,
- min,
- sum,
- mean,
- median,
- standard deviation.
Depending on the size of the data, the full summary statistics for a column can take a long time and you often just want a specific summary statistic. You can use --
flags to choose specific summary statistics, which will greatly improve the speed:
# Just the max value.
csvcut -c 2 Combined_hud.csv | csvstat --max
# Just the mean value.
csvcut -c 2 Combined_hud.csv | csvstat
--mean
# Just the number of null values.
csvcut -c 2 Combined_hud.csv | csvstat
--nulls
You can see a full list of flags in thedocumentation. If you want to calculate summary statistics over all the columns in a CSV file, you can pass the file to csvstat directly:
csvstat Combined_hud.csv
Instructions
- Use csvstat to calculate just the mean for each column in
Combined_hud.csv
~$ csvstat --mean Combined_hud.csv
/home/dq$ csvstat --mean Combined_hud.csv
1. year: 2008.9044232628457
2. AGE1: 46.511215505103266
3. BURDEN: 5.303764743668771
4. FMR: 1037.1186695822005
5. FMTBEDRMS: None
6. FMTBUILT: None
7. TOTSAL: 44041.841931779105