1: Data Munging (Challenge: Data Munging Using The Command Line)

In this challenge, you'll practice the command line concepts you've learned so far by munging datasets using just the command line. Data munging involves transforming datasets to make them easier to work with. Some datasets are too large to load into Python, so inspecting or transforming them beforehand can be useful. Even for smaller datasets, simple exploration on the command line is often faster than exploration in Python, and file-based tasks like unifying datasets can be faster on the command line.

You'll be interacting with datasets on U.S. housing affordability from the U.S. Department of Housing & Urban Development in this challenge. To start things off, let's explore the datasets in the first few steps.

Instructions

  • List all of the files in the current directory (the home directory), including the file names, permissions, formats, and sizes

/home/dq$ ls -l                                                                 

total 6672                                                                      

-rwxr-xr-x 1 dq dq 2051577 Sep 15 07:47 Hud_2005.csv                            

-rwxr-xr-x 1 dq dq 1874334 Sep 15 07:47 Hud_2007.csv                            

-rwxr-xr-x 1 dq dq 2902856 Sep 15 07:47 Hud_2013.csv                      
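
The sizes in the listing above are raw byte counts, which can be hard to read at a glance. As a small sketch, adding the -h flag to ls -l prints sizes in human-readable units instead. The temporary directory and the file name sample.csv are made up for illustration and are not part of the challenge data.

```shell
# Create a throwaway directory with a small, made-up file so the example is self-contained.
workdir=$(mktemp -d)
printf 'a,b\n1,2\n' > "$workdir/sample.csv"

# -l gives the long listing (permissions, owner, size, date);
# -h prints the size column in human-readable units.
listing=$(ls -lh "$workdir")
echo "$listing"
```

In the challenge environment, the equivalent command would simply be ls -lh run from the home directory.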

 

#################################################

2: Data Exploration

 

It looks like there are 3 different CSV files, each corresponding to a separate year.

You learned about the tail command, which displays the last n rows in a file. To display the first n rows (10 by default), you can instead use the head command.

Instructions

  • Use the head command to display the first 10 rows of each of the 3 CSV files

~$ head Hud_2005.csv

~$ head Hud_2007.csv

~$ head Hud_2013.csv
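
To show a different number of rows than the default 10, head accepts the -n flag. The sketch below uses a made-up file, sample.csv, in a temporary directory rather than the HUD datasets.

```shell
# Build a small, hypothetical CSV so the example runs anywhere.
workdir=$(mktemp -d)
printf 'id,value\n1,a\n2,b\n3,c\n4,d\n' > "$workdir/sample.csv"

# -n 3 limits the output to the first 3 lines (the header plus two data rows).
first_three=$(head -n 3 "$workdir/sample.csv")
echo "$first_three"
```

With wide files like the HUD datasets, a smaller row count can make the header easier to inspect.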

 

##################################################

3: Filtering

The goal is to eventually get this data into a pandas DataFrame, so let's combine the datasets into one file that can be read in easily. Each dataset contains the same columns, but you can't just append the full contents of each file to one final file, since each dataset contains its own header row. The consolidated file should contain the header row only once (in the first row). Instead, append the header row to the consolidated file once, then append only the non-header rows from the 3 datasets.

Here's a reminder of what the first 10 rows of Hud_2013.csv look like:

[Image: first 10 rows of Hud_2013.csv]

Since the header row is always the first row in each of the datasets, you can just select all rows after the header row. You can use the wc command along with the -l flag to return the number of lines in a specified file. Because that count includes the header, the number of non-header rows is the line count minus 1. You can pass that number to the tail command to return the last n lines of the file.
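
The counting approach described above can be sketched on a tiny file. The file sample.csv and the variable names total_lines, data_rows, and body are assumptions made for this illustration only.

```shell
# Build a small, hypothetical CSV: one header line plus three data rows.
workdir=$(mktemp -d)
printf 'id,value\n1,a\n2,b\n3,c\n' > "$workdir/sample.csv"

# Redirecting the file into wc prints only the number, without the file name.
total_lines=$(wc -l < "$workdir/sample.csv")

# The header occupies one line, so the data rows number total_lines - 1.
data_rows=$((total_lines - 1))

# tail -n N returns the last N lines, which here is everything except the header.
body=$(tail -n "$data_rows" "$workdir/sample.csv")
echo "$body"
```

Computing the row count in the shell avoids copying the number by hand from the wc output.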

Instructions

  • Create the file combined_hud.csv and append the header row from one of the datasets.
  • Select all non-header rows from Hud_2005.csv and append them to combined_hud.csv.
  • Display the first 10 rows in combined_hud.csv to verify your work.

 

~$ head -1 Hud_2005.csv > combined_hud.csv

~$ wc -l Hud_2005.csv

~$ tail -46853 Hud_2005.csv >> combined_hud.csv

~$ head combined_hud.csv
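
As an alternative to counting lines, tail -n +2 prints everything from the second line onward, which skips the header without needing the row count. The sketch below is not the challenge's own solution. It applies the idea to two made-up files, year_a.csv and year_b.csv, and writes to a hypothetical combined.csv.

```shell
# Build two small, hypothetical CSVs that share the same header.
workdir=$(mktemp -d)
printf 'id,value\n1,a\n2,b\n' > "$workdir/year_a.csv"
printf 'id,value\n3,c\n' > "$workdir/year_b.csv"

# Write the header once, taken from the first file (> creates or overwrites).
head -n 1 "$workdir/year_a.csv" > "$workdir/combined.csv"

# tail -n +2 starts output at line 2, so each file's header is skipped;
# >> appends to the combined file instead of overwriting it.
for f in "$workdir/year_a.csv" "$workdir/year_b.csv"; do
    tail -n +2 "$f" >> "$workdir/combined.csv"
done

combined=$(cat "$workdir/combined.csv")
echo "$combined"
```

The same loop could be pointed at Hud_2005.csv, Hud_2007.csv, and Hud_2013.csv, assuming all three share an identical header row.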

Reposted from: https://my.oschina.net/Bettyty/blog/747169
