数据分析项目-合集-day02
需求:
- 导入文件,查看原始数据
- 将人口数据和各州的简称数据进行合并
- 将合并的数据中重复的abbreviation进行删除
- 查看存在缺失数据的列
- 找到有哪些state/region使得state的值为NaN,进行取重操作
- 为找到的这些state/region的state项补上正确的值,从而去除掉state这一列的所有NaN
- 合并各州面积数据areas
- 我们会发现area(sq.m)这一列有缺失数据,找出是哪行
- 去除含有缺失数据的行
- 找出2010年的全民人口数据
- 计算各州的人口密度
- 排序,并找出人口密度最高的州
import numpy as np
import pandas as pd
from pandas import DataFrame
import shutil
abb=pd.read_csv("../data/state-abbrevs.csv")
abb.head()
| state | abbreviation |
---|
0 | Alabama | AL |
---|
1 | Alaska | AK |
---|
2 | Arizona | AZ |
---|
3 | Arkansas | AR |
---|
4 | California | CA |
---|
area=pd.read_csv("../data/state-areas.csv")
area.head()
| state | area (sq. mi) |
---|
0 | Alabama | 52423 |
---|
1 | Alaska | 656425 |
---|
2 | Arizona | 114006 |
---|
3 | Arkansas | 53182 |
---|
4 | California | 163707 |
---|
pop=pd.read_csv("../data/state-population.csv")
pop.head()
| state/region | ages | year | population |
---|
0 | AL | under18 | 2012 | 1117489.0 |
---|
1 | AL | total | 2012 | 4817528.0 |
---|
2 | AL | under18 | 2010 | 1130966.0 |
---|
3 | AL | total | 2010 | 4785570.0 |
---|
4 | AL | under18 | 2011 | 1125763.0 |
---|
abb_pop=pd.merge(abb,pop,left_on="abbreviation",right_on="state/region",how="outer")
abb_pop.head()
| state | abbreviation | state/region | ages | year | population |
---|
0 | Alabama | AL | AL | under18 | 2012 | 1117489.0 |
---|
1 | Alabama | AL | AL | total | 2012 | 4817528.0 |
---|
2 | Alabama | AL | AL | under18 | 2010 | 1130966.0 |
---|
3 | Alabama | AL | AL | total | 2010 | 4785570.0 |
---|
4 | Alabama | AL | AL | under18 | 2011 | 1125763.0 |
---|
abb_pop.drop(labels="abbreviation",axis=1,inplace=True)
abb_pop.head()
| state | state/region | ages | year | population |
---|
0 | Alabama | AL | under18 | 2012 | 1117489.0 |
---|
1 | Alabama | AL | total | 2012 | 4817528.0 |
---|
2 | Alabama | AL | under18 | 2010 | 1130966.0 |
---|
3 | Alabama | AL | total | 2010 | 4785570.0 |
---|
4 | Alabama | AL | under18 | 2011 | 1125763.0 |
---|
abb_pop.isnull().any(axis=0)
state True
state/region False
ages False
year False
population True
dtype: bool
abb_pop.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2544 entries, 0 to 2543
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 state 2448 non-null object
1 state/region 2544 non-null object
2 ages 2544 non-null object
3 year 2544 non-null int64
4 population 2524 non-null float64
dtypes: float64(1), int64(1), object(3)
memory usage: 119.2+ KB
abb_pop
| state | state/region | ages | year | population |
---|
0 | Alabama | AL | under18 | 2012 | 1117489.0 |
---|
1 | Alabama | AL | total | 2012 | 4817528.0 |
---|
2 | Alabama | AL | under18 | 2010 | 1130966.0 |
---|
3 | Alabama | AL | total | 2010 | 4785570.0 |
---|
4 | Alabama | AL | under18 | 2011 | 1125763.0 |
---|
... | ... | ... | ... | ... | ... |
---|
2539 | NaN | USA | total | 2010 | 309326295.0 |
---|
2540 | NaN | USA | under18 | 2011 | 73902222.0 |
---|
2541 | NaN | USA | total | 2011 | 311582564.0 |
---|
2542 | NaN | USA | under18 | 2012 | 73708179.0 |
---|
2543 | NaN | USA | total | 2012 | 313873685.0 |
---|
2544 rows × 5 columns
abb_pop["state"].isnull()
abb_pop.loc[abb_pop["state"].isnull()]
abb_pop.loc[abb_pop["state"].isnull()]["state/region"]
abb_pop.loc[abb_pop["state"].isnull()]["state/region"].unique()
array(['PR', 'USA'], dtype=object)
abb_pop["state/region"]=="USA"
abb_pop.loc[abb_pop["state/region"]=="USA"]
| state | state/region | ages | year | population |
---|
2496 | NaN | USA | under18 | 1990 | 64218512.0 |
---|
2497 | NaN | USA | total | 1990 | 249622814.0 |
---|
2498 | NaN | USA | total | 1991 | 252980942.0 |
---|
2499 | NaN | USA | under18 | 1991 | 65313018.0 |
---|
2500 | NaN | USA | under18 | 1992 | 66509177.0 |
---|
2501 | NaN | USA | total | 1992 | 256514231.0 |
---|
2502 | NaN | USA | total | 1993 | 259918595.0 |
---|
2503 | NaN | USA | under18 | 1993 | 67594938.0 |
---|
2504 | NaN | USA | under18 | 1994 | 68640936.0 |
---|
2505 | NaN | USA | total | 1994 | 263125826.0 |
---|
2506 | NaN | USA | under18 | 1995 | 69473140.0 |
---|
2507 | NaN | USA | under18 | 1996 | 70233512.0 |
---|
2508 | NaN | USA | total | 1995 | 266278403.0 |
---|
2509 | NaN | USA | total | 1996 | 269394291.0 |
---|
2510 | NaN | USA | total | 1997 | 272646932.0 |
---|
2511 | NaN | USA | under18 | 1997 | 70920738.0 |
---|
2512 | NaN | USA | under18 | 1998 | 71431406.0 |
---|
2513 | NaN | USA | total | 1998 | 275854116.0 |
---|
2514 | NaN | USA | under18 | 1999 | 71946051.0 |
---|
2515 | NaN | USA | total | 2000 | 282162411.0 |
---|
2516 | NaN | USA | under18 | 2000 | 72376189.0 |
---|
2517 | NaN | USA | total | 1999 | 279040181.0 |
---|
2518 | NaN | USA | total | 2001 | 284968955.0 |
---|
2519 | NaN | USA | under18 | 2001 | 72671175.0 |
---|
2520 | NaN | USA | total | 2002 | 287625193.0 |
---|
2521 | NaN | USA | under18 | 2002 | 72936457.0 |
---|
2522 | NaN | USA | total | 2003 | 290107933.0 |
---|
2523 | NaN | USA | under18 | 2003 | 73100758.0 |
---|
2524 | NaN | USA | total | 2004 | 292805298.0 |
---|
2525 | NaN | USA | under18 | 2004 | 73297735.0 |
---|
2526 | NaN | USA | total | 2005 | 295516599.0 |
---|
2527 | NaN | USA | under18 | 2005 | 73523669.0 |
---|
2528 | NaN | USA | total | 2006 | 298379912.0 |
---|
2529 | NaN | USA | under18 | 2006 | 73757714.0 |
---|
2530 | NaN | USA | total | 2007 | 301231207.0 |
---|
2531 | NaN | USA | under18 | 2007 | 74019405.0 |
---|
2532 | NaN | USA | total | 2008 | 304093966.0 |
---|
2533 | NaN | USA | under18 | 2008 | 74104602.0 |
---|
2534 | NaN | USA | under18 | 2013 | 73585872.0 |
---|
2535 | NaN | USA | total | 2013 | 316128839.0 |
---|
2536 | NaN | USA | total | 2009 | 306771529.0 |
---|
2537 | NaN | USA | under18 | 2009 | 74134167.0 |
---|
2538 | NaN | USA | under18 | 2010 | 74119556.0 |
---|
2539 | NaN | USA | total | 2010 | 309326295.0 |
---|
2540 | NaN | USA | under18 | 2011 | 73902222.0 |
---|
2541 | NaN | USA | total | 2011 | 311582564.0 |
---|
2542 | NaN | USA | under18 | 2012 | 73708179.0 |
---|
2543 | NaN | USA | total | 2012 | 313873685.0 |
---|
indexs=abb_pop.loc[abb_pop["state/region"]=="USA"].index
abb_pop.iloc[indexs]
abb_pop.loc[indexs,"state"]="United States"
abb_pop["state/region"]=="PR"
indexs2=abb_pop.loc[abb_pop["state/region"]=="PR"].index
abb_pop.loc[indexs2,"state"]="Puerto Rico"
abb_pop_area=pd.merge(abb_pop,area,how="outer")
abb_pop_area
| state | state/region | ages | year | population | area (sq. mi) |
---|
0 | Alabama | AL | under18 | 2012 | 1117489.0 | 52423.0 |
---|
1 | Alabama | AL | total | 2012 | 4817528.0 | 52423.0 |
---|
2 | Alabama | AL | under18 | 2010 | 1130966.0 | 52423.0 |
---|
3 | Alabama | AL | total | 2010 | 4785570.0 | 52423.0 |
---|
4 | Alabama | AL | under18 | 2011 | 1125763.0 | 52423.0 |
---|
... | ... | ... | ... | ... | ... | ... |
---|
2539 | United States | USA | total | 2010 | 309326295.0 | NaN |
---|
2540 | United States | USA | under18 | 2011 | 73902222.0 | NaN |
---|
2541 | United States | USA | total | 2011 | 311582564.0 | NaN |
---|
2542 | United States | USA | under18 | 2012 | 73708179.0 | NaN |
---|
2543 | United States | USA | total | 2012 | 313873685.0 | NaN |
---|
2544 rows × 6 columns
abb_pop_area["area (sq. mi)"].isnull()
indexs3=abb_pop_area.loc[abb_pop_area["area (sq. mi)"].isnull()].index
abb_pop_area.drop(labels=indexs3,axis=0,inplace=True)
abb_pop_area
| state | state/region | ages | year | population | area (sq. mi) |
---|
0 | Alabama | AL | under18 | 2012 | 1117489.0 | 52423.0 |
---|
1 | Alabama | AL | total | 2012 | 4817528.0 | 52423.0 |
---|
2 | Alabama | AL | under18 | 2010 | 1130966.0 | 52423.0 |
---|
3 | Alabama | AL | total | 2010 | 4785570.0 | 52423.0 |
---|
4 | Alabama | AL | under18 | 2011 | 1125763.0 | 52423.0 |
---|
... | ... | ... | ... | ... | ... | ... |
---|
2491 | Puerto Rico | PR | under18 | 2010 | 896945.0 | 3515.0 |
---|
2492 | Puerto Rico | PR | under18 | 2011 | 869327.0 | 3515.0 |
---|
2493 | Puerto Rico | PR | total | 2011 | 3686580.0 | 3515.0 |
---|
2494 | Puerto Rico | PR | under18 | 2012 | 841740.0 | 3515.0 |
---|
2495 | Puerto Rico | PR | total | 2012 | 3651545.0 | 3515.0 |
---|
2496 rows × 6 columns
abb_pop_area.query('ages=="total" & year==2010')
| state | state/region | ages | year | population | area (sq. mi) |
---|
3 | Alabama | AL | total | 2010 | 4785570.0 | 52423.0 |
---|
91 | Alaska | AK | total | 2010 | 713868.0 | 656425.0 |
---|
101 | Arizona | AZ | total | 2010 | 6408790.0 | 114006.0 |
---|
189 | Arkansas | AR | total | 2010 | 2922280.0 | 53182.0 |
---|
197 | California | CA | total | 2010 | 37333601.0 | 163707.0 |
---|
283 | Colorado | CO | total | 2010 | 5048196.0 | 104100.0 |
---|
293 | Connecticut | CT | total | 2010 | 3579210.0 | 5544.0 |
---|
379 | Delaware | DE | total | 2010 | 899711.0 | 1954.0 |
---|
389 | District of Columbia | DC | total | 2010 | 605125.0 | 68.0 |
---|
475 | Florida | FL | total | 2010 | 18846054.0 | 65758.0 |
---|
485 | Georgia | GA | total | 2010 | 9713248.0 | 59441.0 |
---|
570 | Hawaii | HI | total | 2010 | 1363731.0 | 10932.0 |
---|
581 | Idaho | ID | total | 2010 | 1570718.0 | 83574.0 |
---|
666 | Illinois | IL | total | 2010 | 12839695.0 | 57918.0 |
---|
677 | Indiana | IN | total | 2010 | 6489965.0 | 36420.0 |
---|
762 | Iowa | IA | total | 2010 | 3050314.0 | 56276.0 |
---|
773 | Kansas | KS | total | 2010 | 2858910.0 | 82282.0 |
---|
858 | Kentucky | KY | total | 2010 | 4347698.0 | 40411.0 |
---|
869 | Louisiana | LA | total | 2010 | 4545392.0 | 51843.0 |
---|
954 | Maine | ME | total | 2010 | 1327366.0 | 35387.0 |
---|
965 | Montana | MT | total | 2010 | 990527.0 | 147046.0 |
---|
1050 | Nebraska | NE | total | 2010 | 1829838.0 | 77358.0 |
---|
1061 | Nevada | NV | total | 2010 | 2703230.0 | 110567.0 |
---|
1146 | New Hampshire | NH | total | 2010 | 1316614.0 | 9351.0 |
---|
1157 | New Jersey | NJ | total | 2010 | 8802707.0 | 8722.0 |
---|
1242 | New Mexico | NM | total | 2010 | 2064982.0 | 121593.0 |
---|
1253 | New York | NY | total | 2010 | 19398228.0 | 54475.0 |
---|
1338 | North Carolina | NC | total | 2010 | 9559533.0 | 53821.0 |
---|
1349 | North Dakota | ND | total | 2010 | 674344.0 | 70704.0 |
---|
1434 | Ohio | OH | total | 2010 | 11545435.0 | 44828.0 |
---|
1445 | Oklahoma | OK | total | 2010 | 3759263.0 | 69903.0 |
---|
1530 | Oregon | OR | total | 2010 | 3837208.0 | 98386.0 |
---|
1541 | Maryland | MD | total | 2010 | 5787193.0 | 12407.0 |
---|
1626 | Massachusetts | MA | total | 2010 | 6563263.0 | 10555.0 |
---|
1637 | Michigan | MI | total | 2010 | 9876149.0 | 96810.0 |
---|
1722 | Minnesota | MN | total | 2010 | 5310337.0 | 86943.0 |
---|
1733 | Mississippi | MS | total | 2010 | 2970047.0 | 48434.0 |
---|
1818 | Missouri | MO | total | 2010 | 5996063.0 | 69709.0 |
---|
1829 | Pennsylvania | PA | total | 2010 | 12710472.0 | 46058.0 |
---|
1914 | Rhode Island | RI | total | 2010 | 1052669.0 | 1545.0 |
---|
1925 | South Carolina | SC | total | 2010 | 4636361.0 | 32007.0 |
---|
2010 | South Dakota | SD | total | 2010 | 816211.0 | 77121.0 |
---|
2021 | Tennessee | TN | total | 2010 | 6356683.0 | 42146.0 |
---|
2106 | Texas | TX | total | 2010 | 25245178.0 | 268601.0 |
---|
2117 | Utah | UT | total | 2010 | 2774424.0 | 84904.0 |
---|
2202 | Vermont | VT | total | 2010 | 625793.0 | 9615.0 |
---|
2213 | Virginia | VA | total | 2010 | 8024417.0 | 42769.0 |
---|
2298 | Washington | WA | total | 2010 | 6742256.0 | 71303.0 |
---|
2309 | West Virginia | WV | total | 2010 | 1854146.0 | 24231.0 |
---|
2394 | Wisconsin | WI | total | 2010 | 5689060.0 | 65503.0 |
---|
2405 | Wyoming | WY | total | 2010 | 564222.0 | 97818.0 |
---|
2490 | Puerto Rico | PR | total | 2010 | 3721208.0 | 3515.0 |
---|
abb_pop_area["midu"]=abb_pop_area["population"]/abb_pop_area["area (sq. mi)"]
abb_pop_area
| state | state/region | ages | year | population | area (sq. mi) | midu |
---|
0 | Alabama | AL | under18 | 2012 | 1117489.0 | 52423.0 | 21.316769 |
---|
1 | Alabama | AL | total | 2012 | 4817528.0 | 52423.0 | 91.897221 |
---|
2 | Alabama | AL | under18 | 2010 | 1130966.0 | 52423.0 | 21.573851 |
---|
3 | Alabama | AL | total | 2010 | 4785570.0 | 52423.0 | 91.287603 |
---|
4 | Alabama | AL | under18 | 2011 | 1125763.0 | 52423.0 | 21.474601 |
---|
... | ... | ... | ... | ... | ... | ... | ... |
---|
2491 | Puerto Rico | PR | under18 | 2010 | 896945.0 | 3515.0 | 255.176387 |
---|
2492 | Puerto Rico | PR | under18 | 2011 | 869327.0 | 3515.0 | 247.319203 |
---|
2493 | Puerto Rico | PR | total | 2011 | 3686580.0 | 3515.0 | 1048.813656 |
---|
2494 | Puerto Rico | PR | under18 | 2012 | 841740.0 | 3515.0 | 239.470839 |
---|
2495 | Puerto Rico | PR | total | 2012 | 3651545.0 | 3515.0 | 1038.846373 |
---|
2496 rows × 7 columns
abb_pop_area.sort_values(by="midu",axis=0,ascending=False).iloc[0]["state"]
'District of Columbia'