Let's use csvcut and csvstat to search for any problematic values in the AGE1
column.
Instructions
- Use csvstat to calculate the full summary statistics for just the AGE1 column.
~$ csvcut -n Combined_hud.csv
~$ csvcut -c 2 Combined_hud.csv | csvstat
/home/dq$ csvcut -c 2 Combined_hud.csv | csvstat
1. AGE1
<class 'int'>
Nulls: False
Min: -9
Max: 93
Sum: 7168169
Mean: 46.511215505103266
Median: 48
Standard Deviation: 23.04901451351246
Unique values: 80
5 most frequent values:
-9: 11553
50: 3208
45: 3056
40: 3040
48: 3006
Row count: 154117
/home/dq$
##################################################
7: Csvgrep
You'll notice that -9 is the most common value in the AGE1 column, which is problematic since age values have to be greater than 0. We can use csvgrep to select all the rows that match a specific pattern and dive a bit deeper. We restrict the search to specific columns using the -c flag (just like with csvcut), and then use the -m flag to specify the pattern:
csvgrep -c 2 -m -9 Combined_hud.csv
This command will return all rows from Combined_hud.csv with -9 as the value for the AGE1 column. The behavior of csvgrep can be customized using flags. For example, you can use the -r flag to pass in a regular expression as the pattern instead. We're now going to combine several of the tools we've talked about so far, so you can see the real power of using the csvkit tools together with other CLI tools.
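As a quick sketch of the -r flag, the snippet below builds a tiny stand-in file (sample.csv is hypothetical, since Combined_hud.csv isn't reproduced here) and uses a regular expression to pull out rows whose AGE1 value is negative. It assumes csvkit is installed.

```shell
# sample.csv is a small, hypothetical stand-in for Combined_hud.csv
printf 'year,AGE1\n2005,-9\n2005,48\n2007,50\n' > sample.csv

# -r takes a regular expression instead of a fixed pattern:
# select rows whose AGE1 value starts with a minus sign
csvgrep -c 2 -r '^-' sample.csv
```

Only the header row and the `2005,-9` row should come back; the rows with positive ages don't match the pattern.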
Instructions
- Display the first 10 rows from Combined_hud.csv where the value for the AGE1 column is -9, in a pretty table format
~$ csvgrep -c 2 -m -9 Combined_hud.csv | head -10 | csvlook
| year | AGE1 | BURDEN | FMR | FMTBEDRMS | FMTBUILT | TOTSAL |
|-------+------+--------+------+-----------+-------------+---------|
| 2005 | -9 | -9.000 | 702 | '2 2BR' | '1980-1989' | -9 |
| 2005 | -9 | -9.000 | 531 | '1 1BR' | '1980-1989' | -9 |
| 2005 | -9 | -9.000 | 1034 | '3 3BR' | '2000-2009' | -9 |
| 2005 | -9 | -9.000 | 631 | '1 1BR' | '1980-1989' | -9 |
| 2005 | -9 | -9.000 | 712 | '4 4BR+' | '1990-1999' | -9 |
| 2005 | -9 | -9.000 | 1006 | '3 3BR' | '2000-2009' | -9 |
| 2005 | -9 | -9.000 | 631 | '1 1BR' | '1980-1989' | -9 |
| 2005 | -9 | -9.000 | 712 | '3 3BR' | '2000-2009' | -9 |
| 2005 | -9 | -9.000 | 1087 | '3 3BR' | '2000-2009' | -9 |
|-------+------+--------+------+-----------+-------------+---------|
(Note: always remember to pipe through head or tail when inspecting the data; otherwise the output scrolls on endlessly and wastes time.)
#################################################################
8: Filtering Out Problematic Rows
Let's now filter all of these problematic rows out of the dataset, since they have data quality issues. Csvkit wasn't developed with a sharp focus on editing existing files, so the easiest way to filter rows is to create a separate file containing just the rows we're interested in. To accomplish this, we can redirect the output of csvgrep to a file. So far, we've only used csvgrep to select rows that match a specific pattern; here we instead need to select the rows that don't match a pattern, which we can specify with the -i flag. You can read more about this flag in the csvkit documentation.
Instructions
- Select all rows where the value for AGE1 isn't -9 and write just those rows to positive_ages_only.csv
~$ csvgrep -c 2 -m -9 -i Combined_hud.csv > positive_ages_only.csv
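A quick way to sanity-check the new file is to run csvstat over it and confirm the minimum age is now positive. The sketch below assumes csvkit is installed and uses a hypothetical sample.csv in place of Combined_hud.csv:

```shell
# sample.csv is a small, hypothetical stand-in for Combined_hud.csv
printf 'year,AGE1\n2005,-9\n2005,48\n2007,50\n' > sample.csv

# -i inverts the match: keep only rows where AGE1 does NOT contain -9
csvgrep -c 2 -m -9 -i sample.csv > positive_ages_only.csv

# csvstat's --min flag reports just the minimum value per column
csvcut -c 2 positive_ages_only.csv | csvstat --min
```

If the filter worked, the reported minimum is a positive age (48 in this toy file) rather than -9.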
##########################################################
9: Next Steps
In this challenge, you learned how to use the csvkit library to explore and clean CSV files. You should use csvkit whenever you need to quickly transform or explore data from the command line, but remember that it has a few limitations:
- Csvkit is not optimized for speed and struggles to run some commands over larger files.
- Csvkit has very limited capabilities for actually editing problematic values in a dataset, since the community behind the library aspired to keep it small and lightweight.