A Simple Data Analysis Case Study: Population Analysis

Magician_liu

Last modified: 2024-07-24 22:51:15


First published: 2024-07-24 22:49:55

Article link: https://blog.csdn.net/Magician_liu/article/details/140675158

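Data Analysis: Population Analysis

Background:
- See my blog notes: "The Three Musketeers of Data Analysis: Pandas (continuously updated…)"
Goals:
- Practice using pandas
- Review concatenation and merge operations in pandas
- Learn a new function, query()
- Review common DataFrame operations
- Build practical skills
Requirements:
- Three CSV files with population data are available; import them and look at the raw data
- Merge the population data with the state abbreviation data
- Drop the redundant abbreviation column from the merged data
- Find the columns that contain missing data
- Find which state/region values have NaN in state, and deduplicate them
- Fill in the correct state value for those state/region entries, so that the state column no longer contains any NaN
- Merge in the state area data (areas)
- The area (sq. mi) column turns out to have missing data; find which rows
- Drop the rows with missing data
- Find the total-population data for 2010
- Compute each state's population density
- Sort, and find the state with the highest population density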

Steps:

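1. Three CSV files with population data are available; import them and look at the raw data

import pandas as pd

# Load the first file and look at the raw data
fullstate_state = pd.read_csv('../data/state-abbrevs.csv')  # state (full state name), abbreviation (state abbreviation)
print(fullstate_state.head())
	# full state name    state abbreviation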
	state	abbreviation
0	Alabama		AL
1	Alaska		AK
2	Arizona		AZ
3	Arkansas	AR
4	California	CA
# Load the second file and look at the raw data
fullstate_area = pd.read_csv('../data/state-areas.csv')  # state (full state name), area (sq. mi) (state area)
print(fullstate_area.head())
#   full state name  state area
	state	area (sq. mi)
0	Alabama	52423
1	Alaska	656425
2	Arizona	114006
3	Arkansas	53182
4	California	163707
# Load the third file and look at the raw data
state_age_year_population = pd.read_csv('../data/state-population.csv')  # state/region (abbreviation), ages (age group), year, population (head count)
print(state_age_year_population.head())
#   state abbreviation   age group   year   population
	state/region	ages	year	population
0	AL				under18	2012	1117489.0
1	AL				total	2012	4817528.0
2	AL				under18	2010	1130966.0
3	AL				total	2010	4785570.0
4	AL				under18	2011	1125763.0

So we now have three tables:
- 1. fullstate_state (state (full state name), abbreviation (state abbreviation))
- 2. fullstate_area (state (full state name), area (sq. mi) (area))
- 3. state_age_year_population (state/region (state abbreviation), ages (age group), year, population (head count))
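To try the later steps without the CSV files at hand, the three tables can be mimicked in memory. This is only a minimal sketch: the variable names and the two-row CSV strings are stand-ins built from the previews above, not the real files.

```python
import io

import pandas as pd

# Hypothetical miniature stand-ins for the three CSV files,
# using a couple of rows copied from the previews above.
abbrevs_csv = "state,abbreviation\nAlabama,AL\nAlaska,AK\n"
areas_csv = "state,area (sq. mi)\nAlabama,52423\nAlaska,656425\n"
population_csv = (
    "state/region,ages,year,population\n"
    "AL,under18,2012,1117489.0\n"
    "AL,total,2012,4817528.0\n"
)

# pd.read_csv accepts any file-like object, so StringIO works like a path
abbrevs = pd.read_csv(io.StringIO(abbrevs_csv))
areas = pd.read_csv(io.StringIO(areas_csv))
population = pd.read_csv(io.StringIO(population_csv))

for name, df in [("abbrevs", abbrevs), ("areas", areas), ("population", population)]:
    print(name, df.shape, list(df.columns))
```

Checking shape and the column names right after loading is a cheap way to confirm that the join keys (state, abbreviation, state/region) are spelled as expected before merging.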

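2. Merge the population data with the state abbreviation data

That is, merge the third table with the first table

# Merge the population data with the state abbreviation data
population_stage=pd.merge(state_age_year_population,fullstate_state,left_on='state/region',right_on='abbreviation',how='outer')
print(population_stage.head())
Output: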
	state/region	ages	year	population	state	abbreviation
0	AL	under18	2012	1117489.0	Alabama	AL
1	AL	total	2012	4817528.0	Alabama	AL
2	AL	under18	2010	1130966.0	Alabama	AL
3	AL	total	2010	4785570.0	Alabama	AL
4	AL	under18	2011	1125763.0	Alabama	AL
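An outer merge keeps rows from both sides even when the key has no partner, which is where NaN values come from. A small sketch of how to see that directly with the indicator parameter of pd.merge; the toy frames below are assumptions (the PR and USA population figures are placeholders, not real data).

```python
import pandas as pd

# Toy population table; the PR and USA population values are placeholders
population = pd.DataFrame({
    "state/region": ["AL", "AL", "PR", "USA"],
    "ages": ["total", "under18", "total", "total"],
    "year": [2010, 2010, 2010, 2010],
    "population": [4785570.0, 1130966.0, 100.0, 200.0],
})
# Toy abbreviation table that lacks PR and USA
abbrevs = pd.DataFrame({
    "state": ["Alabama", "Alaska"],
    "abbreviation": ["AL", "AK"],
})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
merged = pd.merge(
    population, abbrevs,
    left_on="state/region", right_on="abbreviation",
    how="outer", indicator=True,
)

# Abbreviations present in the population table but missing from abbrevs
unmatched = merged.loc[merged["_merge"] == "left_only", "state/region"].unique()
print(sorted(unmatched))
```

Rows tagged left_only are exactly the ones whose state ends up as NaN after the merge.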

3. Drop the redundant abbreviation column from the merged data

# Drop the redundant abbreviation column from the merged data
population_stage.drop(labels='abbreviation',axis=1,inplace=True)
print(population_stage.head())
Output:
	state/region	ages		year	population	state
0		AL			under18		2012	1117489.0	Alabama
1		AL			total		2012	4817528.0	Alabama
2		AL			under18		2010	1130966.0	Alabama
3		AL			total		2010	4785570.0	Alabama
4		AL			under18		2011	1125763.0	Alabama

4. Find the columns that contain missing data

Method 1: isnull, notnull, any, all

```
population_stage.isnull().any(axis=0)
```

Output

state/region    False
ages            False
year            False
population       True
state            True
dtype: bool

The state and population columns contain null values.

Method 2: info()
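Besides a True/False flag per column, it is often useful to know how many values are missing. A minimal sketch on a toy frame (the frame and its values are illustrative assumptions, not the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two columns
df = pd.DataFrame({
    "state/region": ["AL", "PR", "USA"],
    "population": [4785570.0, np.nan, 100.0],
    "state": ["Alabama", np.nan, np.nan],
})

# Count of missing values per column
null_counts = df.isnull().sum()

# Names of the columns that contain at least one missing value
has_null = df.isnull().any(axis=0)
cols_with_nulls = has_null[has_null].index.tolist()

print(null_counts)
print(cols_with_nulls)
```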

```
population_stage.info()
```

Output

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2544 entries, 0 to 2543
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   state/region  2544 non-null   object 
 1   ages          2544 non-null   object 
 2   year          2544 non-null   int64  
 3   population    2524 non-null   float64
 4   state         2448 non-null   object 
dtypes: float64(1), int64(1), object(3)
memory usage: 119.2+ KB

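Find which state/region values have NaN in state, and deduplicate them
- Find the abbreviations whose state is null, then deduplicate them
- Idea: pull out the rows where the state column is null; the abbreviation can then be read off those rows
  - ```
  # 1. Locate the nulls in state
  population_stage['state'].isnull()
  # 2. Use that boolean Series as a row indexer on the source data
  population_stage.loc[population_stage['state'].isnull()]  # rows whose state is null
  # 3. Take the abbreviation column
  population_stage.loc[population_stage['state'].isnull()]['state/region']
  # 4. Deduplicate the abbreviations
  population_stage.loc[population_stage['state'].isnull()]['state/region'].unique()
```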
- Output
  - ```
  array(['PR', 'USA'], dtype=object)
```
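  - Only PR and USA have a null full name (the first table, which maps full state names to abbreviations, has no entry for PR or USA, so the outer join leaves state null for them)
  - Fill in the correct state value for those state/region entries, so that the state column no longer contains any NaN
    - Question: can fillna be used to fill the nulls here?
    - Not directly. fillna either fills each gap from a neighbouring value or, as fillna(value='xxx'), fills every gap with one specified value; here the correct value depends on the row's state/region.
  - So the nulls are filled by assigning values to the elements instead
    - First, assign the full name to all USA rows in one go
      - 1. Find the rows for USA (these are the rows whose state is null)
        
        population_stage['state/region'] == 'USA'
        population_stage.loc[population_stage['state/region'] == 'USA']  # rows for USA
      - 2. Take the row index of those rows
        
        indexs = population_stage.loc[population_stage['state/region'] == 'USA'].index
      - 3. Assign
        
        population_stage.loc[indexs,'state'] = 'United States'
    - PR works the same way (note that the merge on state in step 5 matches strings exactly, so if the area table spells this name differently, for example Puerto Rico, these rows will not find an area)
      
        population_stage['state/region'] == 'PR'
        population_stage.loc[population_stage['state/region'] == 'PR']  # rows for PR
        indexs = population_stage.loc[population_stage['state/region'] == 'PR'].index
        population_stage.loc[indexs,'state'] = 'PUERTO RICO'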
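The two assignments above can also be written once for any number of abbreviations by keeping the replacements in a dict. This is a sketch of an alternative, not the method used above; the toy frame and the name full_names are assumptions, while the replacement strings are the ones assigned above.

```python
import numpy as np
import pandas as pd

# Toy frame: state is missing for PR and USA rows
df = pd.DataFrame({
    "state/region": ["AL", "PR", "USA", "PR"],
    "state": ["Alabama", np.nan, np.nan, np.nan],
})

# Hypothetical lookup from abbreviation to full name (values as used above)
full_names = {"USA": "United States", "PR": "PUERTO RICO"}

# Only touch rows where state is missing; map each abbreviation to its name
missing = df["state"].isnull()
df.loc[missing, "state"] = df.loc[missing, "state/region"].map(full_names)

print(df)
```

Restricting the assignment to the missing rows leaves the already-filled names untouched, and an abbreviation absent from the dict simply stays NaN instead of receiving a wrong value.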

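5. Merge in the state area data (areas)

Join the cleaned population table with the second table (areas)

# Merge in the state area data; with no key given, merge joins on the shared column, state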
abb_pop_area = pd.merge(population_stage,fullstate_area,how='outer')
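When no key is passed, pd.merge joins on the columns the two frames have in common. A short sketch, on assumed toy frames, showing that naming the key explicitly with on='state' gives the same result and makes the intent visible (the United States population figure is a placeholder):

```python
import pandas as pd

# Toy frames; the 'United States' population is a placeholder value
pop = pd.DataFrame({
    "state/region": ["AL", "USA"],
    "population": [4785570.0, 100.0],
    "state": ["Alabama", "United States"],
})
areas = pd.DataFrame({
    "state": ["Alabama", "Alaska"],
    "area (sq. mi)": [52423, 656425],
})

implicit = pd.merge(pop, areas, how="outer")              # key inferred: 'state'
explicit = pd.merge(pop, areas, on="state", how="outer")  # key stated

print(explicit)
```

With how='outer', names that exist on only one side are kept and padded with NaN: here 'United States' has no area and 'Alaska' has no population.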

6. The area (sq. mi) column turns out to have missing data; find which rows

abb_pop_area['area (sq. mi)'].isnull()
abb_pop_area.loc[abb_pop_area['area (sq. mi)'].isnull()]  # rows whose area is null
indexs = abb_pop_area.loc[abb_pop_area['area (sq. mi)'].isnull()].index

7. Drop the rows with missing data

abb_pop_area.drop(labels=indexs,axis=0,inplace=True)

The resulting table now looks like this

		state/region	ages	year	population	state	area (sq. mi)
0				AL		under18	2012.0	1117489.0	Alabama	52423.0
1				AL		total	2012.0	4817528.0	Alabama	52423.0
2				AL		under18	2010.0	1130966.0	Alabama	52423.0
3				AL		total	2010.0	4785570.0	Alabama	52423.0
4				AL		under18	2011.0	1125763.0	Alabama	52423.0
						..............
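Steps 6 and 7 (collect the index of rows with a null area, then drop them) can be condensed with dropna and its subset parameter. A minimal sketch on a toy frame; the frame and its placeholder values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame: one row lacks an area, another lacks a population
df = pd.DataFrame({
    "state": ["Alabama", "United States", "Alaska"],
    "population": [4785570.0, 100.0, np.nan],
    "area (sq. mi)": [52423.0, np.nan, 656425.0],
})

# Drop only the rows whose area is missing; other gaps are left alone
cleaned = df.dropna(subset=["area (sq. mi)"])

print(cleaned)
```

Because subset limits the check to the area column, the Alaska row with a missing population survives, which matches dropping by the index of null-area rows. dropna returns a new frame here, so df itself is unchanged.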

8. Find the total-population data for 2010

Using the query() function

abb_pop_area.query('ages == "total" & year == 2010')

Output

	state/region	ages	year	population	state	area (sq. mi)
3		AL	total	2010.0	4785570.0	Alabama	52423.0
91		AK	total	2010.0	713868.0	Alaska	656425.0
101		AZ	total	2010.0	6408790.0	Arizona	114006.0
189		AR	total	2010.0	2922280.0	Arkansas	53182.0
197		CA	total	2010.0	37333601.0	California	163707.0
283		CO	total	2010.0	5048196.0	Colorado	104100.0
293		CT	total	2010.0	3579210.0	Connecticut	5544.0
379		DE	total	2010.0	899711.0	Delaware	1954.0
389		DC	total	2010.0	605125.0	District of Columbia	68.0
475		FL	total	2010.0	18846054.0	Florida	65758.0
485		GA	total	2010.0	9713248.0	Georgia	59441.0
570		HI	total	2010.0	1363731.0	Hawaii	10932.0
581		ID	total	2010.0	1570718.0	Idaho	83574.0
666		IL	total	2010.0	12839695.0	Illinois	57918.0
677		IN	total	2010.0	6489965.0	Indiana	36420.0
762		IA	total	2010.0	3050314.0	Iowa	56276.0
773		KS	total	2010.0	2858910.0	Kansas	82282.0
858		KY	total	2010.0	4347698.0	Kentucky	40411.0
869		LA	total	2010.0	4545392.0	Louisiana	51843.0

				.......................
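For comparison, the same filter can be written as a boolean mask, and query() can read a Python variable through the @ prefix. A small sketch on a toy frame whose rows are taken from the previews above; the name target_year is an assumption.

```python
import pandas as pd

# Three Alabama rows copied from the previews above
df = pd.DataFrame({
    "state": ["Alabama", "Alabama", "Alabama"],
    "ages": ["under18", "total", "total"],
    "year": [2010, 2010, 2012],
    "population": [1130966.0, 4785570.0, 4817528.0],
})

target_year = 2010  # hypothetical variable holding the year of interest

# query(): conditions as a string, @name refers to a Python variable
by_query = df.query('ages == "total" and year == @target_year')

# Equivalent boolean-mask form; each condition needs its own parentheses
by_mask = df[(df["ages"] == "total") & (df["year"] == target_year)]

print(by_query)
```

The query form reads closer to the sentence it implements, while the mask form is the one to use when column names cannot be written as plain identifiers.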

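9. Compute each state's population density (population divided by area)

# midu (pinyin for "density"): people per square mile
abb_pop_area['midu'] = abb_pop_area['population'] / abb_pop_area['area (sq. mi)']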
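The division is element-wise, row by row, and a row with a missing population (step 4 showed that the population column has gaps) yields NaN rather than an error. A minimal sketch on a toy frame; the frame is an assumption and the column name density is used here only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: the second row has no population value
df = pd.DataFrame({
    "state": ["Alabama", "Alaska"],
    "population": [4785570.0, np.nan],
    "area (sq. mi)": [52423.0, 656425.0],
})

# Element-wise division; NaN in either operand gives NaN in the result
df["density"] = df["population"] / df["area (sq. mi)"]

print(df)
```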

10. Sort, and find the state with the highest population density

abb_pop_area.sort_values(by='midu',axis=0,ascending=False).iloc[0]['state']

Output:
- ```
'District of Columbia'
```
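The sort above ranks every row, across all years and age groups. If the ranking should refer to one year's total population, the table can be filtered first; idxmax and nlargest then give the top entries without a full sort. A sketch under that assumption, using three 2010 rows from the output in step 8 (the column name density is illustrative):

```python
import pandas as pd

# 2010 total-population rows copied from the step 8 output
df = pd.DataFrame({
    "state": ["Alabama", "Alaska", "District of Columbia"],
    "ages": ["total", "total", "total"],
    "year": [2010, 2010, 2010],
    "population": [4785570.0, 713868.0, 605125.0],
    "area (sq. mi)": [52423.0, 656425.0, 68.0],
})

# Keep one year and the total age group, then compute density on a copy
totals_2010 = df.query('ages == "total" and year == 2010').copy()
totals_2010["density"] = totals_2010["population"] / totals_2010["area (sq. mi)"]

# idxmax returns the index label of the largest density
densest = totals_2010.loc[totals_2010["density"].idxmax(), "state"]

# nlargest returns the top-n rows ordered by density
top_two = totals_2010.nlargest(2, "density")["state"].tolist()

print(densest)
print(top_two)
```

On these three rows the result agrees with the output above: District of Columbia comes first.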