pandas Exercises 2

Latest recommended article published on 2024-04-20 17:01:47

潘诺西亚的火山

Article link: https://blog.csdn.net/helldoger/article/details/109054359

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = {"grammer":["Python","C","Java","GO",np.nan,"SQL","PHP","Python"],
       "score":[1,2,np.nan,4,5,6,7,10]}

data

{'grammer': ['Python', 'C', 'Java', 'GO', nan, 'SQL', 'PHP', 'Python'],
 'score': [1, 2, nan, 4, 5, 6, 7, 10]}

df = pd.DataFrame(data)
df

	grammer	score
0	Python	1.0
1	C	2.0
2	Java	NaN
3	GO	4.0
4	NaN	5.0
5	SQL	6.0
6	PHP	7.0
7	Python	10.0

3. Extract the rows that contain the string "Python"

df['grammer'] == 'Python'

0     True
1    False
2    False
3    False
4    False
5    False
6    False
7     True
Name: grammer, dtype: bool

df[df['grammer'] == 'Python']

	grammer	score
0	Python	1.0
7	Python	10.0
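The equality test above matches only cells that are exactly "Python". If the goal is to match any cell that contains the substring, `Series.str.contains` is one option. The sketch below is a minimal illustration, assuming missing values should count as non-matches (`na=False`); the frame name `demo` is made up here so the block runs on its own.

```python
import numpy as np
import pandas as pd

# Rebuild the sample data from the start of the post (demo is a hypothetical name)
demo = pd.DataFrame({
    "grammer": ["Python", "C", "Java", "GO", np.nan, "SQL", "PHP", "Python"],
    "score": [1, 2, np.nan, 4, 5, 6, 7, 10],
})

# na=False makes the NaN cell evaluate to False instead of NaN,
# so the result can be used directly as a boolean mask
mask = demo["grammer"].str.contains("Python", na=False)
python_rows = demo[mask]
print(python_rows)
```

Without `na=False`, the missing cell would stay missing in the mask and indexing with it would fail.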

df.columns

Index(['grammer', 'score'], dtype='object')

4. Rename the second column to 'popularity'

df.rename(columns={"score":"popularity"},inplace=True)

df

	grammer	popularity
0	Python	1.0
1	C	2.0
2	Java	NaN
3	GO	4.0
4	NaN	5.0
5	SQL	6.0
6	PHP	7.0
7	Python	10.0

5. Count how many times each programming language appears in the grammer column

df["grammer"].value_counts()

Python    2
Java      1
PHP       1
GO        1
SQL       1
C         1
Name: grammer, dtype: int64

df1 = df.copy()
df2 = df.copy()

df1

	grammer	popularity
0	Python	1.0
1	C	2.0
2	Java	NaN
3	GO	4.0
4	NaN	5.0
5	SQL	6.0
6	PHP	7.0
7	Python	10.0

6. Fill missing values with the average of the values above and below

df1['popularity'] = df1['popularity'].fillna(df1['popularity'].interpolate())
df1

	grammer	popularity
0	Python	1.0
1	C	2.0
2	Java	3.0
3	GO	4.0
4	NaN	5.0
5	SQL	6.0
6	PHP	7.0
7	Python	10.0
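For a single isolated gap, linear interpolation and "the mean of the neighbours above and below" give the same number, which is why `interpolate()` works for this exercise. The following sketch checks that on a small made-up Series (`s` is not part of the original notebook). For runs of several consecutive NaN values the two approaches differ: interpolation spaces the fills evenly, while the neighbour mean repeats one value.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 4, 5])

# Linear interpolation (the default method) fills the single gap with 3.0
filled = s.interpolate()

# The same result computed explicitly as the mean of the previous
# and the next valid value
neighbour_mean = (s.ffill() + s.bfill()) / 2
explicit = s.fillna(neighbour_mean)

print(filled.tolist())
print(explicit.tolist())
```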

### To replace 0 values with NaN: df.replace(0, np.nan)

df2.iloc[:,1] = df2.iloc[:,1].fillna(df2.iloc[:,1].interpolate())

df2

	grammer	popularity
0	Python	1.0
1	C	2.0
2	Java	3.0
3	GO	4.0
4	NaN	5.0
5	SQL	6.0
6	PHP	7.0
7	Python	10.0

df

	grammer	popularity
0	Python	1.0
1	C	2.0
2	Java	NaN
3	GO	4.0
4	NaN	5.0
5	SQL	6.0
6	PHP	7.0
7	Python	10.0

df["popularity"] = df["popularity"].fillna(df["popularity"].interpolate())

df

	grammer	popularity
0	Python	1.0
1	C	2.0
2	Java	3.0
3	GO	4.0
4	NaN	5.0
5	SQL	6.0
6	PHP	7.0
7	Python	10.0

Keep the rows where popularity is greater than 3

df[df.iloc[:,1]>3]

	grammer	popularity
3	GO	4.0
4	NaN	5.0
5	SQL	6.0
6	PHP	7.0
7	Python	10.0

Drop duplicate values in the grammer column

df.drop_duplicates("grammer",inplace = True)
df

	grammer	popularity
0	Python	1.0
1	C	2.0
2	Java	3.0
3	GO	4.0
4	NaN	5.0
5	SQL	6.0
6	PHP	7.0

df["popularity"].mean()

4.0

10. Convert the grammer column to a list

df["grammer"].to_list()

['Python', 'C', 'Java', 'GO', nan, 'SQL', 'PHP']

11. Save the DataFrame to Excel

df.to_csv("./test.csv")

df.to_excel('test.xlsx',index=False)

df[(df["popularity"]>3) & (df["popularity"]<7)]

	grammer	popularity
3	GO	4.0
4	NaN	5.0
5	SQL	6.0

14. Swap the positions of the two columns

cols = df.columns[[1,0]]

cols

Index(['popularity', 'grammer'], dtype='object')

cols_1 = df.columns[[0,1]]
cols_1

Index(['grammer', 'popularity'], dtype='object')

type(cols_1)

pandas.core.indexes.base.Index

df = df[cols]
df

	popularity	grammer
0	1.0	Python
1	2.0	C
2	3.0	Java
3	4.0	GO
4	5.0	NaN
5	6.0	SQL
6	7.0	PHP

### Method 2
temp = df['popularity']
df.drop(labels=['popularity'], axis=1,inplace = True)
df.insert(0, 'popularity', temp)
df

E:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py:4167: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,

	popularity	grammer
0	1.0	Python
1	2.0	C
2	3.0	Java
3	4.0	GO
4	5.0	NaN
5	6.0	SQL
6	7.0	PHP
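The SettingWithCopyWarning above is raised by older pandas versions when an in-place operation runs on a frame that pandas still tracks as derived from another frame, which is most likely what `df = df[cols]` produced here. One way to sidestep it, sketched below on a small made-up frame (`demo` and `swapped` are hypothetical names), is to take an explicit copy and avoid `inplace=True`.

```python
import pandas as pd

demo = pd.DataFrame({"grammer": ["Python", "C", "Java"],
                     "popularity": [1.0, 2.0, 3.0]})

# An explicit copy makes the reordered frame independent of the original
swapped = demo[["popularity", "grammer"]].copy()

# Method 2 without inplace=True: pop() removes the column and returns it,
# insert() puts it back at position 0
col = swapped.pop("grammer")
swapped.insert(0, "grammer", col)
print(swapped.columns.tolist())
```

The original `demo` keeps its column order; only `swapped` is modified.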

15. Extract the row containing the maximum popularity value

df["popularity"] == df["popularity"].max()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
Name: popularity, dtype: bool

df[df["popularity"] == df["popularity"].max() ]

	popularity	grammer
6	7.0	PHP

16. View the last five rows

df.tail()

	popularity	grammer
2	3.0	Java
3	4.0	GO
4	5.0	NaN
5	6.0	SQL
6	7.0	PHP

df

	popularity	grammer
0	1.0	Python
1	2.0	C
2	3.0	Java
3	4.0	GO
4	5.0	NaN
5	6.0	SQL
6	7.0	PHP

17. Drop a row and a column

df.drop("popularity",axis=1)

	grammer
0	Python
1	C
2	Java
3	GO
4	NaN
5	SQL
6	PHP

df.drop(6,axis=0)

	popularity	grammer
0	1.0	Python
1	2.0	C
2	3.0	Java
3	4.0	GO
4	5.0	NaN
5	6.0	SQL

18. Append a row ['Perl', 6.6]

df.columns[[1,0]]

Index(['grammer', 'popularity'], dtype='object')

df = df[df.columns[[1,0]]]

df

	grammer	popularity
0	Python	1.0
1	C	2.0
2	Java	3.0
3	GO	4.0
4	NaN	5.0
5	SQL	6.0
6	PHP	7.0

a = {"grammer":"perl","popularity":6.6}

pd.concat([df, pd.DataFrame([a])], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

	grammer	popularity
0	Python	1.0
1	C	2.0
2	Java	3.0
3	GO	4.0
4	NaN	5.0
5	SQL	6.0
6	PHP	7.0
7	perl	6.6
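Besides building a new frame, a row can also be added in place with label-based assignment. The sketch below assumes the frame has a default integer index, so that `len(demo)` is the next unused label; `demo` is a made-up stand-in for `df`.

```python
import pandas as pd

demo = pd.DataFrame({"grammer": ["Python", "C"], "popularity": [1.0, 2.0]})

# With a default RangeIndex, len(demo) is the next unused label,
# so this assignment enlarges the frame by one row in place
demo.loc[len(demo)] = ["Perl", 6.6]
print(demo)
```

If the index is not a plain 0..n-1 range, `len(demo)` may collide with an existing label and overwrite that row instead.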

Add a column

a = "grammar"  # new column name
a

'grammar'

b = df["grammer"]
b

0    Python
1         C
2      Java
3        GO
4       NaN
5       SQL
6       PHP
Name: grammer, dtype: object

df.insert(0, a, b)  # position, column name, values

df

	grammar	grammer	popularity
0	Python	Python	1.0
1	C	C	2.0
2	Java	Java	3.0
3	GO	GO	4.0
4	NaN	NaN	5.0
5	SQL	SQL	6.0
6	PHP	PHP	7.0

19. Sort the data by the "popularity" column

df["popularity"].sort_values()

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
6    7.0
Name: popularity, dtype: float64

df.sort_values("popularity")

	grammar	grammer	popularity
0	Python	Python	1.0
1	C	C	2.0
2	Java	Java	3.0
3	GO	GO	4.0
4	NaN	NaN	5.0
5	SQL	SQL	6.0
6	PHP	PHP	7.0

20. Compute the length of each string in the grammer column

df = pd.DataFrame(data)
df['grammer'] = df['grammer'].fillna('R')
df

	grammer	score
0	Python	1.0
1	C	2.0
2	Java	NaN
3	GO	4.0
4	R	5.0
5	SQL	6.0
6	PHP	7.0
7	Python	10.0

df['len_str'] = df['grammer'].map(lambda x: len(x))
df

	grammer	score	len_str
0	Python	1.0	6
1	C	2.0	1
2	Java	NaN	4
3	GO	4.0	2
4	R	5.0	1
5	SQL	6.0	3
6	PHP	7.0	3
7	Python	10.0	6
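The `fillna('R')` step above is needed because `len()` fails on NaN. If replacing the missing value is not desired, the `.str.len()` accessor is an alternative that leaves missing entries missing. A minimal sketch on a made-up Series (`langs` is a hypothetical name):

```python
import numpy as np
import pandas as pd

langs = pd.Series(["Python", "C", np.nan, "GO"])

# .str.len() computes the length of each string and propagates
# the missing value instead of raising an error
lengths = langs.str.len()
print(lengths.tolist())
```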

Part 2

23. Convert the salary column to the average of its minimum and maximum

df = pd.read_excel("./pandas1206855/pandas120.xlsx")

df.tail()

	createTime	education	salary
130	2020-03-16 11:36:07	本科	10k-18k
131	2020-03-16 09:54:47	硕士	25k-50k
132	2020-03-16 10:48:32	本科	20k-40k
133	2020-03-16 10:46:31	本科	15k-23k
134	2020-03-16 11:19:38	本科	20k-40k

import re

Method 1

lst = df['salary'].values
lst[:5]

array(['20k-35k', '20k-40k', '20k-35k', '13k-20k', '10k-20k'],
      dtype=object)

b = lst[0]
b = str(b)
b

'20k-35k'

qq = b.split("-")
qq[0],qq[1]

('20k', '35k')

qqq = qq[0].strip("k")
int(qqq)

Convert the array to a list

list_1 = [i for i in lst]

Split on "-"

qa = [i.split("-") for i in list_1]

The result is a list of nested lists; separate them with comprehensions

list_min = [i[0] for i in qa]
list_max = [i[1] for i in qa]

list_min[:5],list_max[:5]

(['20k', '20k', '20k', '13k', '10k'], ['35k', '40k', '35k', '20k', '20k'])

Strip the "k" from each item

list_min_1 = [i.strip("k") for i in list_min]
list_max_1 = [i.strip("k") for i in list_max]

list_min_1[:5],list_max_1[:5]

(['20', '20', '20', '13', '10'], ['35', '40', '35', '20', '20'])

Convert the strings to integers

arr_min_2 = np.array(list_min_1,dtype=np.int32)
arr_max_2 = np.array(list_max_1,dtype=np.int32)

arr_min_2[:5],arr_max_2[:5]

(array([20, 20, 20, 13, 10]), array([35, 40, 35, 20, 20]))

Compute the average

salary_1 = (arr_max_2+arr_min_2)/2*1000

df["salary"] = salary_1
df.tail()

	createTime	education	salary
130	2020-03-16 11:36:07	本科	14000.0
131	2020-03-16 09:54:47	硕士	37500.0
132	2020-03-16 10:48:32	本科	30000.0
133	2020-03-16 10:46:31	本科	19000.0
134	2020-03-16 11:19:38	本科	30000.0

Method 3:

df = pd.read_excel("./pandas1206855/pandas120.xlsx")

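for index, row in df.iterrows():
    # raw string avoids the invalid-escape warning; select the column by name
    nums = re.findall(r'\d+', row['salary'])
    # plain arithmetic instead of eval(); the default index doubles as the position
    df.iloc[index, 2] = int((int(nums[0]) + int(nums[1])) / 2 * 1000)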

df.tail()

	createTime	education	salary
130	2020-03-16 11:36:07	本科	14000
131	2020-03-16 09:54:47	硕士	37500
132	2020-03-16 10:48:32	本科	30000
133	2020-03-16 10:46:31	本科	19000
134	2020-03-16 11:19:38	本科	30000

Method 2

df = pd.read_excel("./pandas1206855/pandas120.xlsx")

def func(df):
    lst = df['salary'].split('-')
    smin = int(lst[0].strip('k'))
    smax = int(lst[1].strip('k'))
    df['salary'] = int((smin + smax) / 2 * 1000)
    return df

df = df.apply(func,axis=1)

df.tail()

	createTime	education	salary
130	2020-03-16 11:36:07	本科	14000
131	2020-03-16 09:54:47	硕士	37500
132	2020-03-16 10:48:32	本科	30000
133	2020-03-16 10:46:31	本科	19000
134	2020-03-16 11:19:38	本科	30000
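All three methods above process the strings one row at a time. A vectorized variant is sketched below, assuming every salary has the form `<low>k-<high>k` as in the rows shown; the frame `jobs` is a small made-up stand-in for the data read from pandas120.xlsx.

```python
import pandas as pd

# Stand-in for the salary column; the real data comes from pandas120.xlsx
jobs = pd.DataFrame({"salary": ["20k-35k", "13k-20k", "10k-18k"]})

# Two capture groups produce a two-column frame with the low and high bounds
bounds = jobs["salary"].str.extract(r"(\d+)k-(\d+)k").astype(int)

# Row-wise mean of the two bounds, scaled from thousands to units
jobs["salary"] = (bounds.mean(axis=1) * 1000).astype(int)
print(jobs["salary"].tolist())
```

Rows that do not match the pattern would come back as NaN from `str.extract`, so the `astype(int)` call would fail on them; that case is not handled here.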

24. Group the data by education and compute the mean salary

df.groupby("education")[["salary"]].mean()  # select salary explicitly; newer pandas no longer drops non-numeric columns silently

	salary
education
不限	19600.000000
大专	10000.000000
本科	19361.344538
硕士	20642.857143

25. Convert the createTime column to month-day format

# Vectorized: format every timestamp as month-day in one step
df["createTime"] = pd.to_datetime(df["createTime"]).dt.strftime("%m-%d")
df.head()

	createTime	education	salary
0	03-16	本科	27500
1	03-16	本科	30000
2	03-16	不限	27500
3	03-16	本科	16500
4	03-16	本科	15000

27. View summary statistics for the numeric columns

df.describe()

	salary
count	135.000000
mean	19159.259259
std	8661.686922
min	3500.000000
25%	14000.000000
50%	17500.000000
75%	25000.000000
max	45000.000000

28. Add a column that bins the data into three groups by salary

bins = [0,10000, 20000, 50000]
group_names = ['低', '中', '高']
df['categories'] = pd.cut(df['salary'], bins, labels=group_names)
df

	createTime	education	salary	categories
0	03-16	本科	27500	高
1	03-16	本科	30000	高
2	03-16	不限	27500	高
3	03-16	本科	16500	中
4	03-16	本科	15000	中
...	...	...	...	...
130	03-16	本科	14000	中
131	03-16	硕士	37500	高
132	03-16	本科	30000	高
133	03-16	本科	19000	中
134	03-16	本科	30000	高

135 rows × 4 columns
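It helps to know which bin a boundary value such as 10000 lands in. By default `pd.cut` uses right-closed intervals, so the bins above are (0, 10000], (10000, 20000] and (20000, 50000]. The sketch below illustrates this with made-up values; the English labels stand in for the 低/中/高 labels used above.

```python
import pandas as pd

salaries = pd.Series([3500, 10000, 10001, 20000, 45000])
bins = [0, 10000, 20000, 50000]

# Default right=True: each interval includes its right edge but not its left edge
groups = pd.cut(salaries, bins, labels=["low", "mid", "high"])
print(groups.tolist())

# Values outside every interval (here 0 and 60000) come back as NaN
outside = pd.cut(pd.Series([0, 60000]), bins, labels=["low", "mid", "high"])
print(outside.isna().tolist())
```

Passing `right=False` flips the intervals to left-closed, and `include_lowest=True` makes the first interval include its left edge.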

29. Sort the data by salary in descending order

df.sort_values('salary', ascending=False)  # ascending=False sorts in descending order

	createTime	education	salary	categories
53	03-16	本科	45000	高
37	03-16	本科	40000	高
101	03-16	本科	37500	高
16	03-16	本科	37500	高
131	03-16	硕士	37500	高
...	...	...	...	...
123	03-16	本科	4500	低
126	03-16	本科	4000	低
110	03-16	本科	4000	低
96	03-16	不限	3500	低
113	03-16	本科	3500	低

135 rows × 4 columns

30. Get the 33rd row (position 32)

df.iloc[32]

createTime    03-16
education        硕士
salary        22500
categories        高
Name: 32, dtype: object
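`iloc` selects by position, counting from 0, while `loc` selects by index label. With the default index the two coincide, but they diverge as soon as the rows are reordered. A small sketch with made-up data (`demo`, `ranked`) shows the difference:

```python
import pandas as pd

demo = pd.DataFrame({"salary": [300, 100, 200]}, index=[0, 1, 2])
ranked = demo.sort_values("salary")  # index order is now 1, 2, 0

by_position = ranked.iloc[0]  # first row as displayed: salary 100
by_label = ranked.loc[0]      # row whose index label is 0: salary 300
print(by_position["salary"], by_label["salary"])
```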

31. Compute the median of the salary column

np.median(df["salary"])

17500.0

32. Plot a histogram of the salary distribution

df.salary.plot(kind='hist')
plt.show()

[Figure output_117_0.png: salary histogram; image not available]

33. Plot the salary density curve

df.salary.plot(kind='kde',xlim=(0,80000))
plt.show()

[Figure output_119_0.png: salary density curve; image not available]

34. Drop the last column, categories

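axis: 0 means the method runs downward along the row labels (index), once per column; 1 means it runs across along the column labels, once per row. For drop, axis=0 removes rows by label and axis=1 removes columns by label.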

df.drop(columns=["categories"])

	createTime	education	salary
0	03-16	本科	27500
1	03-16	本科	30000
2	03-16	不限	27500
3	03-16	本科	16500
4	03-16	本科	15000
...	...	...	...
130	03-16	本科	14000
131	03-16	硕士	37500
132	03-16	本科	30000
133	03-16	本科	19000
134	03-16	本科	30000

135 rows × 3 columns

df.drop(labels="categories", axis=1)

35. Combine the first and second columns of df into a new column

df["test"] = df["education"]+df["createTime"]

df.tail()

	createTime	education	salary	categories	test
130	03-16	本科	14000	中	本科03-16
131	03-16	硕士	37500	高	硕士03-16
132	03-16	本科	30000	高	本科03-16
133	03-16	本科	19000	中	本科03-16
134	03-16	本科	30000	高	本科03-16

df.columns

Index(['createTime', 'education', 'salary', 'categories', 'test'], dtype='object')

df.columns[[]]

Index([], dtype='object')

df.columns[[4,0,1,2,3]]

Index(['test', 'createTime', 'education', 'salary', 'categories'], dtype='object')

df[df.columns[[4,0,1,2,3]]].tail()

	test	createTime	education	salary	categories
130	本科03-16	03-16	本科	14000	中
131	硕士03-16	03-16	硕士	37500	高
132	本科03-16	03-16	本科	30000	高
133	本科03-16	03-16	本科	19000	中
134	本科03-16	03-16	本科	30000	高

36. Combine the education and salary columns into a new column

df["test_1"] = str(df["salary"])+df["education"]

df.tail()

	createTime	education	salary	categories	test	test_1
130	03-16	本科	14000	中	本科03-16	0 27500\n1 30000\n2 27500\n3 ...
131	03-16	硕士	37500	高	硕士03-16	0 27500\n1 30000\n2 27500\n3 ...
132	03-16	本科	30000	高	本科03-16	0 27500\n1 30000\n2 27500\n3 ...
133	03-16	本科	19000	中	本科03-16	0 27500\n1 30000\n2 27500\n3 ...
134	03-16	本科	30000	高	本科03-16	0 27500\n1 30000\n2 27500\n3 ...

df["test_1"] = df["salary"].map(str)+df["education"]
df.tail()

	createTime	education	salary	categories	test	test_1
130	03-16	本科	14000	中	本科03-16	14000本科
131	03-16	硕士	37500	高	硕士03-16	37500硕士
132	03-16	本科	30000	高	本科03-16	30000本科
133	03-16	本科	19000	中	本科03-16	19000本科
134	03-16	本科	30000	高	本科03-16	30000本科

37. Compute the difference between the maximum and minimum salary

df['name']    # returns a Series (no column index)
df[['name']]  # returns a DataFrame (keeps the column index)
df.name       # returns a Series (no column index)

df[["salary"]].tail()

	salary
130	14000
131	37500
132	30000
133	19000
134	30000

df["salary"].tail()

130    14000
131    37500
132    30000
133    19000
134    30000
Name: salary, dtype: int64

df.salary.tail()

130    14000
131    37500
132    30000
133    19000
134    30000
Name: salary, dtype: int64

df[["salary"]].apply(lambda x : x.max()-x.min())

salary    41500
dtype: int64
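Wrapping the column in a DataFrame and using `apply` returns a one-element Series. When only the number is needed, subtracting directly on the Series is simpler. A minimal sketch on made-up values (`salary` here is a standalone Series, not the column above):

```python
import numpy as np
import pandas as pd

salary = pd.Series([27500, 3500, 45000, 14000])

# Direct subtraction on the Series returns a scalar
spread = salary.max() - salary.min()

# np.ptp ("peak to peak") computes the same range on the underlying array
spread_np = np.ptp(salary.to_numpy())
print(spread, spread_np)
```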

38. Concatenate the first row and the last row

pd.concat([df[:1], df[-1:]])

	createTime	education	salary	categories	test	test_1
0	03-16	本科	27500	高	本科03-16	27500本科
134	03-16	本科	30000	高	本科03-16	30000本科

39. Append the row at index 8 to the end

df[8:9]

	createTime	education	salary	categories	test	test_1
8	03-16	不限	7000	低	不限03-16	7000不限

df.iloc[8,:]

createTime      03-16
education          不限
salary           7000
categories          低
test          不限03-16
test_1         7000不限
Name: 8, dtype: object

df.iloc[:,[2,4]]

	salary	test
0	27500	本科03-16
1	30000	本科03-16
2	27500	不限03-16
3	16500	本科03-16
4	15000	本科03-16
...	...	...
130	14000	本科03-16
131	37500	硕士03-16
132	30000	本科03-16
133	19000	本科03-16
134	30000	本科03-16

135 rows × 2 columns

df.iloc[[2,4],:]

	createTime	education	salary	categories	test	test_1
2	03-16	不限	27500	高	不限03-16	27500不限
4	03-16	本科	15000	中	本科03-16	15000本科

df.iloc[[8]]

	createTime	education	salary	categories	test	test_1
8	03-16	不限	7000	低	不限03-16	7000不限

pd.concat([df, df[8:9]])  # DataFrame.append was removed in pandas 2.0

	createTime	education	salary	categories	test	test_1
0	03-16	本科	27500	高	本科03-16	27500本科
1	03-16	本科	30000	高	本科03-16	30000本科
2	03-16	不限	27500	高	不限03-16	27500不限
3	03-16	本科	16500	中	本科03-16	16500本科
4	03-16	本科	15000	中	本科03-16	15000本科
...	...	...	...	...	...	...
131	03-16	硕士	37500	高	硕士03-16	37500硕士
132	03-16	本科	30000	高	本科03-16	30000本科
133	03-16	本科	19000	中	本科03-16	19000本科
134	03-16	本科	30000	高	本科03-16	30000本科
8	03-16	不限	7000	低	不限03-16	7000不限

136 rows × 6 columns

41. Set the createTime column as the index

df.set_index("createTime")

	education	salary	categories	test	test_1
createTime
03-16	本科	27500	高	本科03-16	27500本科
03-16	本科	30000	高	本科03-16	30000本科
03-16	不限	27500	高	不限03-16	27500不限
03-16	本科	16500	中	本科03-16	16500本科
03-16	本科	15000	中	本科03-16	15000本科
...	...	...	...	...	...
03-16	本科	14000	中	本科03-16	14000本科
03-16	硕士	37500	高	硕士03-16	37500硕士
03-16	本科	30000	高	本科03-16	30000本科
03-16	本科	19000	中	本科03-16	19000本科
03-16	本科	30000	高	本科03-16	30000本科

135 rows × 5 columns

42. Generate a DataFrame of random numbers with the same length as df

df1 = pd.DataFrame(pd.Series(np.random.randint(1, 10, len(df))))  # len(df) is 135 here
df1

	0
0	8
1	7
2	9
3	6
4	4
...	...
130	3
131	7
132	2
133	9
134	4

135 rows × 1 columns

43. Merge the DataFrame from the previous exercise with df

df= pd.concat([df,df1],axis=1)
df

	createTime	education	salary	categories	test	test_1	0
0	03-16	本科	27500	高	本科03-16	27500本科	8
1	03-16	本科	30000	高	本科03-16	30000本科	7
2	03-16	不限	27500	高	不限03-16	27500不限	9
3	03-16	本科	16500	中	本科03-16	16500本科	6
4	03-16	本科	15000	中	本科03-16	15000本科	4
...	...	...	...	...	...	...	...
130	03-16	本科	14000	中	本科03-16	14000本科	3
131	03-16	硕士	37500	高	硕士03-16	37500硕士	7
132	03-16	本科	30000	高	本科03-16	30000本科	2
133	03-16	本科	19000	中	本科03-16	19000本科	9
134	03-16	本科	30000	高	本科03-16	30000本科	4

135 rows × 7 columns

44. Create a new column, new, equal to salary minus the random-number column generated earlier

df["new"] = df["salary"] - df[0]
df

	createTime	education	salary	categories	test	test_1	0	new
0	03-16	本科	27500	高	本科03-16	27500本科	8	27492
1	03-16	本科	30000	高	本科03-16	30000本科	7	29993
2	03-16	不限	27500	高	不限03-16	27500不限	9	27491
3	03-16	本科	16500	中	本科03-16	16500本科	6	16494
4	03-16	本科	15000	中	本科03-16	15000本科	4	14996
...	...	...	...	...	...	...	...	...
130	03-16	本科	14000	中	本科03-16	14000本科	3	13997
131	03-16	硕士	37500	高	硕士03-16	37500硕士	7	37493
132	03-16	本科	30000	高	本科03-16	30000本科	2	29998
133	03-16	本科	19000	中	本科03-16	19000本科	9	18991
134	03-16	本科	30000	高	本科03-16	30000本科	4	29996

135 rows × 8 columns

45. Check whether the data contains any missing values

df.isnull().values.any()

False

46. Convert the salary column to float

df['salary'].astype(np.float64)

0      27500.0
1      30000.0
2      27500.0
3      16500.0
4      15000.0
        ...   
130    14000.0
131    37500.0
132    30000.0
133    19000.0
134    30000.0
Name: salary, Length: 135, dtype: float64

47. Count how many salary values are greater than 10000

len(df[df['salary'] > 10000])
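Counting the rows of a filtered frame works, but the boolean mask can also be summed directly, which skips building the intermediate frame. A minimal sketch on made-up values (`salary` is a standalone Series here):

```python
import pandas as pd

salary = pd.Series([27500, 7000, 10000, 16500, 3500])

# True counts as 1, so summing the mask counts the matching rows;
# 10000 itself is not counted because the comparison is strict
n_above = int((salary > 10000).sum())
print(n_above)
```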

48. Count how many times each education level appears

df.education.value_counts()

本科    119
硕士      7
不限      5
大专      4
Name: education, dtype: int64

df["education"].value_counts()

本科    119
硕士      7
不限      5
大专      4
Name: education, dtype: int64

df[["education"]].value_counts()

education
本科           119
硕士             7
不限             5
大专             4
dtype: int64

49. See how many distinct education levels the education column has

df["education"].unique()

array(['本科', '不限', '硕士', '大专'], dtype=object)

df["education"].nunique()

50. Extract the last 3 rows where the sum of salary and new exceeds 60000

df1 = df[['salary','new']]
rowsums = df1.apply(np.sum, axis=1)
res = df.iloc[np.where(rowsums > 60000)[0][-3:], :]
res

	createTime	education	salary	categories	test	test_1	0	new
92	03-16	本科	35000	高	本科03-16	35000本科	8	34992
101	03-16	本科	37500	高	本科03-16	37500本科	5	37495
131	03-16	硕士	37500	高	硕士03-16	37500硕士	7	37493

df1.tail()

	salary	new
130	14000	13997
131	37500	37493
132	30000	29998
133	19000	18991
134	30000	29996

rowsums

0      54992
1      59993
2      54991
3      32994
4      29996
       ...  
130    27997
131    74993
132    59998
133    37991
134    59996
Length: 135, dtype: int64
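The row-wise `apply(np.sum, axis=1)` plus `np.where` above can be expressed with plain column arithmetic and `tail`. The sketch below uses a small made-up frame (`demo`) with values in the same range as the real data.

```python
import pandas as pd

demo = pd.DataFrame({
    "salary": [35000, 14000, 37500, 30000, 37500],
    "new":    [34992, 13997, 37495, 29998, 37493],
})

# Column arithmetic is vectorized, so no row-wise apply is needed
total = demo["salary"] + demo["new"]

# Filter first, then keep the last three matching rows
res = demo[total > 60000].tail(3)
print(res)
```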

51. Read a local Excel file using an absolute path

url_one = r'D:\exercise\pandas1206855\600000.SH.xls'

df = pd.read_excel(url_one)

WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero

df.head(3)

	代码	简称	日期	前收盘价(元)	开盘价(元)	最高价(元)	最低价(元)	收盘价(元)	成交量(股)	成交金额(元)	涨跌(元)	涨跌幅(%)	均价(元)	换手率(%)	A股流通市值(元)	总市值(元)	A股流通股本(股)	市盈率
0	600000.SH	浦发银行	2016-01-04	16.1356	16.1444	16.1444	15.4997	15.7205	42240610	754425783	-0.4151	-2.5725	17.8602	0.2264	3.320318e+11	3.320318e+11	1.865347e+10	6.5614
1	600000.SH	浦发银行	2016-01-05	15.7205	15.4644	15.9501	15.3672	15.8618	58054793	1034181474	0.1413	0.8989	17.8139	0.3112	3.350163e+11	3.350163e+11	1.865347e+10	6.6204
2	600000.SH	浦发银行	2016-01-06	15.8618	15.8088	16.0208	15.6234	15.9855	46772653	838667398	0.1236	0.7795	17.9307	0.2507	3.376278e+11	3.376278e+11	1.865347e+10	6.6720

53. Count the missing values in each column

df.isnull().sum()

代码           1
简称           2
日期           2
前收盘价(元)      2
开盘价(元)       2
最高价(元)       2
最低价(元)       2
收盘价(元)       2
成交量(股)       2
成交金额(元)      2
涨跌(元)        2
涨跌幅(%)       2
均价(元)        2
换手率(%)       2
A股流通市值(元)    2
总市值(元)       2
A股流通股本(股)    2
市盈率          2
dtype: int64

54. Extract the rows where the 日期 (date) column is null

df[df["日期"].isnull()]

	代码	简称	日期	前收盘价(元)	开盘价(元)	最高价(元)	最低价(元)	收盘价(元)	成交量(股)	成交金额(元)	涨跌(元)	涨跌幅(%)	均价(元)	换手率(%)	A股流通市值(元)	总市值(元)	A股流通股本(股)	市盈率
327	NaN	NaN	NaT	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
328	数据来源：Wind资讯	NaN	NaT	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

55. Print the row indices of the missing values in each column

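for columname in df.columns:
    # compare against len(df); `data` still refers to the dict from Part 1 at this point
    if df[columname].count() != len(df):
        loc = df[columname][df[columname].isnull()].index.tolist()
        print('Column "{}": missing values at rows {}'.format(columname, loc))

Column "代码": missing values at rows [327]
Column "简称": missing values at rows [327, 328]
Column "日期": missing values at rows [327, 328]
Column "前收盘价(元)": missing values at rows [327, 328]
Column "开盘价(元)": missing values at rows [327, 328]
Column "最高价(元)": missing values at rows [327, 328]
Column "最低价(元)": missing values at rows [327, 328]
Column "收盘价(元)": missing values at rows [327, 328]
Column "成交量(股)": missing values at rows [327, 328]
Column "成交金额(元)": missing values at rows [327, 328]
Column "涨跌(元)": missing values at rows [327, 328]
Column "涨跌幅(%)": missing values at rows [327, 328]
Column "均价(元)": missing values at rows [327, 328]
Column "换手率(%)": missing values at rows [327, 328]
Column "A股流通市值(元)": missing values at rows [327, 328]
Column "总市值(元)": missing values at rows [327, 328]
Column "A股流通股本(股)": missing values at rows [327, 328]
Column "市盈率": missing values at rows [327, 328]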

56. Drop all rows that contain missing values

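'''
Notes
axis: 0 drops rows (default), 1 drops columns
how: 'any' drops when any value is missing (default), 'all' drops only when every value is missing
inplace: False returns a new object (default), True modifies the original in place
'''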
data = df
df.dropna(axis=0, how='any', inplace=True)
df.tail()

	代码	简称	日期	前收盘价(元)	开盘价(元)	最高价(元)	最低价(元)	收盘价(元)	成交量(股)	成交金额(元)	涨跌(元)	涨跌幅(%)	均价(元)	换手率(%)	A股流通市值(元)	总市值(元)	A股流通股本(股)	市盈率
322	600000.SH	浦发银行	2017-05-03	15.16	15.16	15.16	15.05	15.08	14247943	215130847	-0.08	-0.5277	15.0991	0.0659	3.260037e+11	3.260037e+11	2.161828e+10	6.1395
323	600000.SH	浦发银行	2017-05-04	15.08	15.07	15.07	14.90	14.98	19477788	291839737	-0.10	-0.6631	14.9832	0.0901	3.238418e+11	3.238418e+11	2.161828e+10	6.0988
324	600000.SH	浦发银行	2017-05-05	14.98	14.95	14.98	14.52	14.92	40194577	592160198	-0.06	-0.4005	14.7323	0.1859	3.225447e+11	3.225447e+11	2.161828e+10	6.0744
325	600000.SH	浦发银行	2017-05-08	14.92	14.78	14.90	14.51	14.86	43568576	638781010	-0.06	-0.4021	14.6615	0.2015	3.212476e+11	3.212476e+11	2.161828e+10	6.0500
326	600000.SH	浦发银行	2017-05-09	14.86	14.69	14.84	14.66	14.76	19225492	283864640	-0.10	-0.6729	14.765	0.0889	3.190858e+11	3.190858e+11	2.161828e+10	6.0093
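The effect of the `how` parameter described in the notes is easiest to see on a tiny frame. The sketch below uses made-up data (`demo`, with the hypothetical English column names `code` and `close`) that mimics the two trailing rows of the stock file: one fully empty row and one row holding only a source note. It also shows `subset`, which restricts the check to chosen columns.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "code": ["600000.SH", np.nan, "source note"],
    "close": [15.72, np.nan, np.nan],
})

any_dropped = demo.dropna(how="any")        # keeps only row 0
all_dropped = demo.dropna(how="all")        # drops only the fully empty row 1
by_column = demo.dropna(subset=["close"])   # checks the close column only
print(len(any_dropped), len(all_dropped), len(by_column))
```

None of these calls uses `inplace=True`, so `demo` itself is left unchanged.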

57. Plot a line chart of the closing price

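import matplotlib.pyplot as plt
plt.style.use('seaborn-darkgrid')  # set the plot style (named 'seaborn-v0_8-darkgrid' from matplotlib 3.6 on)
plt.rc('font', size=6)  # set the font size used in the figure
plt.rc('figure', figsize=(4,3), dpi=150)  # set the figure size and resolution
df["收盘价(元)"].plot()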

<AxesSubplot:>

[Figure output_189_1.png: line chart of the closing price; image not available]

# equivalent to
import matplotlib.pyplot as plt
plt.plot(df['收盘价(元)'])
plt.show()

[Figure output_190_0.png: line chart of the closing price drawn with plt.plot; image not available]

58. Plot the opening and closing prices together

df[['收盘价(元)','开盘价(元)']].plot()

<AxesSubplot:>



E:\ProgramData\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:238: RuntimeWarning: Glyph 25910 missing from current font.
  font.set_text(s, 0.0, flags=flags)
E:\ProgramData\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:238: RuntimeWarning: Glyph 30424 missing from current font.
  font.set_text(s, 0.0, flags=flags)
E:\ProgramData\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:238: RuntimeWarning: Glyph 20215 missing from current font.
  font.set_text(s, 0.0, flags=flags)
E:\ProgramData\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:238: RuntimeWarning: Glyph 20803 missing from current font.
  font.set_text(s, 0.0, flags=flags)
E:\ProgramData\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:238: RuntimeWarning: Glyph 24320 missing from current font.
  font.set_text(s, 0.0, flags=flags)
E:\ProgramData\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:201: RuntimeWarning: Glyph 25910 missing from current font.
  font.set_text(s, 0, flags=flags)
E:\ProgramData\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:201: RuntimeWarning: Glyph 30424 missing from current font.
  font.set_text(s, 0, flags=flags)
E:\ProgramData\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:201: RuntimeWarning: Glyph 20215 missing from current font.
  font.set_text(s, 0, flags=flags)
E:\ProgramData\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:201: RuntimeWarning: Glyph 20803 missing from current font.
  font.set_text(s, 0, flags=flags)
E:\ProgramData\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:201: RuntimeWarning: Glyph 24320 missing from current font.
  font.set_text(s, 0, flags=flags)

[Figure output_192_2.png: opening and closing prices; image not available]

59. Plot a histogram of the percentage change

plt.hist(df['涨跌幅(%)'])
# equivalent to
df['涨跌幅(%)'].hist()
plt.show()

[Figure output_194_0.png: histogram of the percentage change; image not available]

data = df
data['涨跌幅(%)'].hist(bins = 30)  # more bins for finer detail

<AxesSubplot:>

[Figure output_195_1.png: histogram of the percentage change with 30 bins; image not available]
