Let's use csvcut and csvstat to search for any problematic values in the AGE1
column.
Instructions
- Use csvstat to calculate the full summary statistics for just the AGE1 column.
~$ csvcut -n Combined_hud.csv
~$ csvcut -c 2 Combined_hud.csv | csvstat
/home/dq$ csvcut -c 2 Combined_hud.csv | csvstat
1. AGE1
<class 'int'>
Nulls: False
Min: -9
Max: 93
Sum: 7168169
Mean: 46.511215505103266
Median: 48
Standard Deviation: 23.04901451351246
Unique values: 80
5 most frequent values:
-9: 11553
50: 3208
45: 3056
40: 3040
48: 3006
Row count: 154117
/home/dq$
##################################################
7: Csvgrep
You'll notice that -9 is the most common value in the AGE1 column, which is problematic since age values have to be greater than 0. We can use csvgrep to select all the rows that match a specific pattern and dive a bit deeper. We restrict the search to specific columns using the -c flag (just like with csvcut), and then use the -m flag to specify the pattern:
csvgrep -c 2 -m -9 Combined_hud.csv
This command will return all rows from Combined_hud.csv with -9 as the value for the AGE1 column. The behavior of csvgrep can be customized using flags. For example, you can use the -r flag to pass in a regular expression as the pattern instead. We're now going to combine several of the tools we've talked about so far, so you can see the real power of using the csvkit tools together with other CLI tools.
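As a quick sketch of the -r flag, the snippet below builds a tiny stand-in file (sample.csv is hypothetical, since Combined_hud.csv isn't reproduced here) and uses a regular expression to pull out rows whose AGE1 value is negative. It assumes csvkit is installed.

```shell
# sample.csv is a small, hypothetical stand-in for Combined_hud.csv
printf 'year,AGE1\n2005,-9\n2005,48\n2007,50\n' > sample.csv

# -r takes a regular expression instead of a fixed pattern:
# select rows whose AGE1 value starts with a minus sign
csvgrep -c 2 -r '^-' sample.csv
```

Only the header row and the `2005,-9` row should come back; the rows with positive ages don't match the pattern.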
Instructions
- Display the first 10 rows from Combined_hud.csv where the value for the AGE1 column is -9, in a pretty table format
~$ csvgrep -c 2 -m -9 Combined_hud.csv | head -10 | csvlook
| year | AGE1 | BURDEN | FMR | FMTBEDRMS | FMTBUILT | TOTSAL |
|-------+------+--------+------+-----------+-------------+---------|
| 2005 | -9 | -9.000 | 702 | '2 2BR' | '1980-1989' | -9 |
| 2005 | -9 | -9.000 | 531 | '1 1BR' | '1980-1989' | -9 |
| 2005 | -9 | -9.000 | 1034 | '3 3BR' | '2000-2009' | -9 |
| 2005 | -9 | -9.000 | 631 | '1 1BR' | '1980-1989' | -9 |
| 2005 | -9 | -9.000 | 712 | '4 4BR+' | '1990-1999' | -9 |
| 2005 | -9 | -9.000 | 1006 | '3 3BR' | '2000-2009' | -9 |
| 2005 | -9 | -9.000 | 631 | '1 1BR' | '1980-1989' | -9 |
| 2005 | -9 | -9.000 | 712 | '3 3BR' | '2000-2009' | -9 |
| 2005 | -9 | -9.000 | 1087 | '3 3BR' | '2000-2009' | -9 |
|-------+------+--------+------+-----------+-------------+---------|
(Note: always remember to pipe through head or tail when inspecting the data; otherwise the output scrolls on endlessly and wastes time.)
#################################################################
8: Filtering Out Problematic Rows
Let's now filter all of these problematic rows out of the dataset, since they have data quality issues. Csvkit wasn't developed with a sharp focus on editing existing files, so the easiest way to filter rows is to create a separate file containing just the rows we're interested in. To accomplish this, we can redirect the output of csvgrep to a file. So far, we've only used csvgrep to select rows that match a specific pattern; here we instead need to select the rows that don't match a pattern, which we can specify with the -i flag. You can read more about this flag in the csvkit documentation.
Instructions
- Select all rows where the value for AGE1 isn't -9 and write just those rows to positive_ages_only.csv
~$ csvgrep -c 2 -m -9 -i Combined_hud.csv > positive_ages_only.csv
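A quick way to sanity-check the new file is to run csvstat over it and confirm the minimum age is now positive. The sketch below assumes csvkit is installed and uses a hypothetical sample.csv in place of Combined_hud.csv:

```shell
# sample.csv is a small, hypothetical stand-in for Combined_hud.csv
printf 'year,AGE1\n2005,-9\n2005,48\n2007,50\n' > sample.csv

# -i inverts the match: keep only rows where AGE1 does NOT contain -9
csvgrep -c 2 -m -9 -i sample.csv > positive_ages_only.csv

# csvstat's --min flag reports just the minimum value per column
csvcut -c 2 positive_ages_only.csv | csvstat --min
```

If the filter worked, the reported minimum is a positive age (48 in this toy file) rather than -9.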
##########################################################
9: Next Steps
In this challenge, you learned how to use the csvkit library to explore and clean CSV files. You should use csvkit whenever you need to quickly transform or explore data from the command line, but remember that it has a few limitations:
- Csvkit is not optimized for speed and struggles to run some commands over larger files.
- Csvkit has very limited capabilities for actually editing problematic values in a dataset, since the community behind the library aspired to keep it small and lightweight.