1: Data Munging (Challenge: Data Munging Using The Command Line)

weixin_33755649

Original link: https://my.oschina.net/Bettyty/blog/747169

In this challenge, you'll practice the command line concepts you've learned so far by munging datasets using just the command line. Data munging involves transforming datasets to make them easier to work with. Some datasets are too large to load into Python, so looking at them or transforming them beforehand can be useful. Even for smaller datasets, simple exploration on the command line is faster than exploration in Python, and file-based tasks like unifying datasets can be faster on the command line.

You'll be interacting with datasets on U.S. housing affordability from the U.S. Department of Housing & Urban Development in this challenge. To start things off, let's explore the datasets in the first few steps.

Instructions

List all of the files in the current directory (the home directory), including the file names, permissions, formats, and sizes

/home/dq$ ls -l

total 6672

-rwxr-xr-x 1 dq dq 2051577 Sep 15 07:47 Hud_2005.csv

-rwxr-xr-x 1 dq dq 1874334 Sep 15 07:47 Hud_2007.csv

-rwxr-xr-x 1 dq dq 2902856 Sep 15 07:47 Hud_2013.csv

#################################################

2: Data Exploration

It looks like there are 3 different CSV files, each corresponding to a separate year.

You learned about the tail command, which displays the last n rows in a file. To display the first n rows (10 by default), you can instead use the head command.

Instructions

Use the head command to display the first 10 rows of each of the 3 CSV files

~$ head Hud_2005.csv

~$ head Hud_2007.csv

~$ head Hud_2013.csv
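As a small sketch of the same idea, the loop below previews several files in one go and passes an explicit row count to head with the -n flag. It builds two tiny stand-in files (sample_2005.csv and sample_2007.csv, hypothetical names with made-up contents, not the real HUD data) in a temporary directory so that it runs anywhere; with the real datasets you would loop over Hud_*.csv instead.

```shell
# Build two tiny stand-in CSV files in a scratch directory
# (hypothetical names and contents, not the real HUD data).
workdir=$(mktemp -d)
printf 'id,value\n1,10\n2,20\n' > "$workdir/sample_2005.csv"
printf 'id,value\n3,30\n4,40\n' > "$workdir/sample_2007.csv"

# Print the first 2 lines (header plus one data row) of each file,
# labelling each chunk with the name of the file it came from.
preview=$(
  for f in "$workdir"/sample_*.csv; do
    echo "== $(basename "$f") =="
    head -n 2 "$f"
  done
)
echo "$preview"
```

Seeing the header rows side by side is a quick way to check that the files really do share the same columns.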

##################################################

3: Filtering

The goal is to eventually get this data into a pandas DataFrame, so let's combine the datasets into one file that can be read in easily. Each dataset contains the same columns, but you can't just append the full contents of each file to one final file, because each dataset includes its own header row. The consolidated file should contain the header row only once (in the first row). Instead, append the header row to the consolidated file once, then append only the non-header rows from the 3 datasets.

Here's a reminder of what the first 10 rows of Hud_2013.csv look like:

[Image: first 10 rows of Hud_2013.csv]

Since the header row is always the first row in each of the datasets, you can just select all rows after the header row. You can use the wc command with the -l flag to return the number of lines in a specified file. Subtracting 1 from a file's line count gives its number of non-header rows, and passing that number to the tail command returns exactly those rows, since tail returns the last n lines of a file.
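The arithmetic can be sketched as follows, assuming a small stand-in file (sample.csv, a hypothetical name with made-up contents) in place of the real datasets: count the lines, subtract one for the header, and hand the result to tail.

```shell
# Build a tiny stand-in CSV file (hypothetical contents, not the real HUD data).
workdir=$(mktemp -d)
printf 'id,value\n1,10\n2,20\n3,30\n' > "$workdir/sample.csv"

# Reading from stdin (< file) makes wc print only the number, without the file name.
total_lines=$(wc -l < "$workdir/sample.csv")

# Every line except the header is a data row.
data_rows=$((total_lines - 1))

# The last data_rows lines of the file are exactly the non-header rows.
body=$(tail -n "$data_rows" "$workdir/sample.csv")
echo "$body"
```

This relies on the file ending with a newline, because wc -l counts newline characters rather than rows.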

Instructions

Create the file combined_hud.csv and append the header row from one of the datasets.
Select all non-header rows from Hud_2005.csv and append them to combined_hud.csv.
Display the first 10 rows in combined_hud.csv to verify your work.

~$ head -1 Hud_2005.csv > combined_hud.csv

~$ wc -l Hud_2005.csv

~$ tail -46853 Hud_2005.csv >> combined_hud.csv

~$ head combined_hud.csv
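The commands above handle only Hud_2005.csv. One way to finish the job without looking up each file's line count is tail -n +2, which prints a file starting from its second line. The sketch below is not part of the original exercise; it uses three tiny stand-in files (hypothetical names and contents) so that it runs anywhere. The same loop should work over Hud_2005.csv, Hud_2007.csv, and Hud_2013.csv, assuming they share an identical header.

```shell
# Build three tiny stand-in CSV files (hypothetical contents, not the real HUD data).
workdir=$(mktemp -d)
printf 'id,value\n1,10\n2,20\n' > "$workdir/sample_2005.csv"
printf 'id,value\n3,30\n' > "$workdir/sample_2007.csv"
printf 'id,value\n4,40\n5,50\n' > "$workdir/sample_2013.csv"

combined="$workdir/combined_sample.csv"

# Write the header once; > creates the output file or overwrites an old one.
head -n 1 "$workdir/sample_2005.csv" > "$combined"

# Append everything from line 2 onward of each yearly file; >> appends.
for f in "$workdir"/sample_20*.csv; do
  tail -n +2 "$f" >> "$combined"
done

cat "$combined"
```

Because the glob sample_20*.csv does not match the output file name, the combined file is never fed back into itself.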
