6: Csvcut | Csvstat( Data Cleaning and Exploration Using Csvkit)

Let's use csvcut and csvstat to search for any problematic values in the AGE1 column.

Instructions

  • Use csvstat to calculate the full summary statistics for just theAGE1 column.

~$ csvcut -n Combined_hud.csv

~$ csvcut -c 2 Combined_hud.csv | csvstat

/home/dq$ csvcut -c 2 Combined_hud.csv | csvstat                                

                                                                                

  1. AGE1                                                                       

        <class 'int'>                                                           

        Nulls: False                                                            

        Min: -9                                                                 

        Max: 93                                                                 

        Sum: 7168169                                                            

        Mean: 46.511215505103266                                                

        Median: 48                                                              

        Standard Deviation: 23.04901451351246  

        Unique values: 80                                                       

        5 most frequent values:                                                 

                -9:     11553                                                   

                50:     3208                                                    

                45:     3056                                                    

                40:     3040                                                    

                48:     3006                                                    

                                                                                

Row count: 154117                                                               

/home/dq$                            

 

##################################################

7: Csvgrep

 

You'll notice that -9 is the most common value in the AGE1 column, which is problematic since age values have to be greater than 0. We can use csvgrep to select all the rows that match a specific pattern to dive a bit deeper. By default, csvgrep will search all of the rows in the dataset but we can restrict the search to specific columns using the -c flag (just like with csvcut). We then use the -m flag to specify the pattern:

 

csvgrep -c 2 -m -9 Combined_hud.csv

This command will return all rows fromCombined_hud.csv with -9 as the value for theAGE1 column. The behavior of csvgrep can be customized using the flags. For example, you can use the -r flag to pass in a regular expression as the pattern instead. We're now going to combined several of the tools we've talked about so far so that you can see the real power of using the csvkit tools combined with other CLI tools.

Instructions

  • Display the first 10 rows from Combined_hud.csv where the value for the AGE1 column is -9 in a pretty table format

 

~$ csvgrep -c 2 -m -9 Combined_hud.csv | head -10 | csvlook

|  year | AGE1 | BURDEN | FMR  | FMTBEDRMS | FMTBUILT    | TOTSAL  |            

|-------+------+--------+------+-----------+-------------+---------|            

|  2005 | -9   | -9.000 | 702  | '2 2BR'   | '1980-1989' | -9      |            

|  2005 | -9   | -9.000 | 531  | '1 1BR'   | '1980-1989' | -9      |            

|  2005 | -9   | -9.000 | 1034 | '3 3BR'   | '2000-2009' | -9      |            

|  2005 | -9   | -9.000 | 631  | '1 1BR'   | '1980-1989' | -9      |            

|  2005 | -9   | -9.000 | 712  | '4 4BR+'  | '1990-1999' | -9      |            

|  2005 | -9   | -9.000 | 1006 | '3 3BR'   | '2000-2009' | -9      |            

|  2005 | -9   | -9.000 | 631  | '1 1BR'   | '1980-1989' | -9      |            

|  2005 | -9   | -9.000 | 712  | '3 3BR'   | '2000-2009' | -9      |            

|  2005 | -9   | -9.000 | 1087 | '3 3BR'   | '2000-2009' | -9      |            

|-------+------+--------+------+-----------+-------------+---------|            

(注:千万记得要加 head 或 tail 看数据,否则output会没完没了浪费时间)

#################################################################

8: Filtering Out Problematic Rows

Let's now filter out all of these problematic rows from the dataset since they have data quality issues. Csvkit wasn't developed with a sharp focus on editing existing files, and the easiest way to filter rows is to create a separate file with just the rows we're interested in. To accomplish this, we can redirect the output of csvgrep to a file. So far, we've only used csvgrep to select rows that match a specific pattern. We need to instead select the rows that don't match a pattern, which we can specify with the -i flag. You can read more about this flag in the documentation.

Instructions

  • Select all rows where the value for AGE1 isn't -9 and write just those rows to positive_ages_only.csv

~$ csvgrep -c 2 -m -9 -i Combined_hud.csv > positive_ages_only.csv

 

##########################################################

9: Next Steps

 

In this challenge, you learned how to use the csvkit library to explore and clean CSV files. You should use csvkit whenever you need to quickly transform or explore data from the command line, but remember that it has a few limitations:

  • Csvkit is not optimized for speed and struggles to run some commands over larger files.

  • Csvkit has very limited capabilities for actually editing problematic values in a dataset, since the community behind the library aspired to keep the library small and lightweight

转载于:https://my.oschina.net/Bettyty/blog/747188

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值