pandas练习2

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = {"grammer":["Python","C","Java","GO",np.nan,"SQL","PHP","Python"],
       "score":[1,2,np.nan,4,5,6,7,10]}
data
{'grammer': ['Python', 'C', 'Java', 'GO', nan, 'SQL', 'PHP', 'Python'],
 'score': [1, 2, nan, 4, 5, 6, 7, 10]}
df = pd.DataFrame(data)
df
grammerscore
0Python1.0
1C2.0
2JavaNaN
3GO4.0
4NaN5.0
5SQL6.0
6PHP7.0
7Python10.0

3提取含有字符串"Python"的行

df['grammer'] == 'Python'
0     True
1    False
2    False
3    False
4    False
5    False
6    False
7     True
Name: grammer, dtype: bool
df[df['grammer'] == 'Python']
grammerscore
0Python1.0
7Python10.0
df.columns
Index(['grammer', 'score'], dtype='object')

4.修改第二列列名为’popularity’

df.rename(columns={"score":"popularity"},inplace=True)
df
grammerpopularity
0Python1.0
1C2.0
2JavaNaN
3GO4.0
4NaN5.0
5SQL6.0
6PHP7.0
7Python10.0

5.统计grammer列中每种编程语言出现的次数

df["grammer"].value_counts()
Python    2
Java      1
PHP       1
GO        1
SQL       1
C         1
Name: grammer, dtype: int64
df1 = df.copy()
df2 = df.copy()
df1
grammerpopularity
0Python1.0
1C2.0
2JavaNaN
3GO4.0
4NaN5.0
5SQL6.0
6PHP7.0
7Python10.0

6,将空值用上下值的平均值填充

df1['popularity'] = df1['popularity'].fillna(df1['popularity'].interpolate())
df1
grammerpopularity
0Python1.0
1C2.0
2Java3.0
3GO4.0
4NaN5.0
5SQL6.0
6PHP7.0
7Python10.0
### 把0值替换为na的方法:df.replace(0,np.nan)
df2.iloc[:,1] = df2.iloc[:,1].fillna(df2.iloc[:,1].interpolate())
df2
grammerpopularity
0Python1.0
1C2.0
2Java3.0
3GO4.0
4NaN5.0
5SQL6.0
6PHP7.0
7Python10.0
df
grammerpopularity
0Python1.0
1C2.0
2JavaNaN
3GO4.0
4NaN5.0
5SQL6.0
6PHP7.0
7Python10.0
df["popularity"].fillna(df["popularity"].interpolate(),inplace=True)
df
grammerpopularity
0Python1.0
1C2.0
2Java3.0
3GO4.0
4NaN5.0
5SQL6.0
6PHP7.0
7Python10.0

保留popularity列大于3的值

df[df.iloc[:,1]>3]
grammerpopularity
3GO4.0
4NaN5.0
5SQL6.0
6PHP7.0
7Python10.0

去除grammar列 重复值

df.drop_duplicates("grammer",inplace = True)
df
grammerpopularity
0Python1.0
1C2.0
2Java3.0
3GO4.0
4NaN5.0
5SQL6.0
6PHP7.0
df["popularity"].mean()
4.0

10,将grammer列转换为list

df["grammer"].to_list()
['Python', 'C', 'Java', 'GO', nan, 'SQL', 'PHP']

11.将DataFrame保存为EXCEL

df.to_csv("./test.csv")
df.to_excel('test.xlsx',index=False)
df[(df["popularity"]>3) & (df["popularity"]<7)]
grammerpopularity
3GO4.0
4NaN5.0
5SQL6.0

14.交换两列位置

cols = df.columns[[1,0]]
cols
Index(['popularity', 'grammer'], dtype='object')
cols_1 = df.columns[[0,1]]
cols_1
Index(['grammer', 'popularity'], dtype='object')
type(cols_1)
pandas.core.indexes.base.Index
df = df[cols]
df
popularitygrammer
01.0Python
12.0C
23.0Java
34.0GO
45.0NaN
56.0SQL
67.0PHP
### 方法2
temp = df['popularity']
df.drop(labels=['popularity'], axis=1,inplace = True)
df.insert(0, 'popularity', temp)
df
E:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py:4167: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
popularitygrammer
01.0Python
12.0C
23.0Java
34.0GO
45.0NaN
56.0SQL
67.0PHP

15.提取popularity列最大值所在行

df["popularity"] == df["popularity"].max() 
0    False
1    False
2    False
3    False
4    False
5    False
6     True
Name: popularity, dtype: bool
df[df["popularity"] == df["popularity"].max() ]
popularitygrammer
67.0PHP

16查看最后五行

df.tail()
popularitygrammer
23.0Java
34.0GO
45.0NaN
56.0SQL
67.0PHP
df
popularitygrammer
01.0Python
12.0C
23.0Java
34.0GO
45.0NaN
56.0SQL
67.0PHP

17 删掉一行,一列

df.drop("popularity",axis=1)
grammer
0Python
1C
2Java
3GO
4NaN
5SQL
6PHP
df.drop(6,axis=0)
popularitygrammer
01.0Python
12.0C
23.0Java
34.0GO
45.0NaN
56.0SQL

18.添加一行数据[‘Perl’,6.6]

df.columns[[1,0]]
Index(['grammer', 'popularity'], dtype='object')
df = df[df.columns[[1,0]]]
df
grammerpopularity
0Python1.0
1C2.0
2Java3.0
3GO4.0
4NaN5.0
5SQL6.0
6PHP7.0
a = {"grammer":"perl","popularity":6.6}
df.append(a,ignore_index=True)
grammerpopularity
0Python1.0
1C2.0
2Java3.0
3GO4.0
4NaN5.0
5SQL6.0
6PHP7.0
7perl6.6

添加一列数据

a = "grammar" ## 新的列名
a
'grammar'
b = df["grammer"]
b
0    Python
1         C
2      Java
3        GO
4       NaN
5       SQL
6       PHP
Name: grammer, dtype: object
df.insert(0,a,b) # 插入的位置, 列名 ,内容
df
grammargrammerpopularity
0PythonPython1.0
1CC2.0
2JavaJava3.0
3GOGO4.0
4NaNNaN5.0
5SQLSQL6.0
6PHPPHP7.0

19.对数据按照"popularity"列值的大小进行排序

df["popularity"].sort_values()
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
6    7.0
Name: popularity, dtype: float64
df.sort_values("popularity")
grammargrammerpopularity
0PythonPython1.0
1CC2.0
2JavaJava3.0
3GOGO4.0
4NaNNaN5.0
5SQLSQL6.0
6PHPPHP7.0

20.统计grammer列每个字符串的长度

df = pd.DataFrame(data)
df['grammer'] = df['grammer'].fillna('R')
df
grammerscore
0Python1.0
1C2.0
2JavaNaN
3GO4.0
4R5.0
5SQL6.0
6PHP7.0
7Python10.0
df['len_str'] = df['grammer'].map(lambda x: len(x))
df
grammerscorelen_str
0Python1.06
1C2.01
2JavaNaN4
3GO4.02
4R5.01
5SQL6.03
6PHP7.03
7Python10.06

第二期

23 将salary列数据转换为最大值与最小值的平均值

df = pd.read_excel("./pandas1206855/pandas120.xlsx")
df.tail()
createTimeeducationsalary
1302020-03-16 11:36:07本科10k-18k
1312020-03-16 09:54:47硕士25k-50k
1322020-03-16 10:48:32本科20k-40k
1332020-03-16 10:46:31本科15k-23k
1342020-03-16 11:19:38本科20k-40k
import re

方法1

lst = df['salary'].values
lst[:5]
array(['20k-35k', '20k-40k', '20k-35k', '13k-20k', '10k-20k'],
      dtype=object)
b = lst[0]
b = str(b)
b
'20k-35k'
qq = b.split("-")
qq[0],qq[1]
('20k', '35k')
qqq = qq[0].strip("k")
int(qqq)
20
arr变为list
list_1 = [i for i in lst]
用“-”分割
qa = [i.split("-") for i in list_1]
1个列表中嵌套列表,用推导式分开
list_min = [i[0] for i in qa]
list_max = [i[1] for i in qa]
list_min[:5],list_max[:5]
(['20k', '20k', '20k', '13k', '10k'], ['35k', '40k', '35k', '20k', '20k'])
去掉列表中的k
list_min_1 = [i.strip("k") for i in list_min]
list_max_1 = [i.strip("k") for i in list_max]
list_min_1[:5],list_max_1[:5]
(['20', '20', '20', '13', '10'], ['35', '40', '35', '20', '20'])
字符串化为整型
arr_min_2 = np.array(list_min_1,dtype=np.int32)
arr_max_2 = np.array(list_max_1,dtype=np.int32)
arr_min_2[:5],arr_max_2[:5]
(array([20, 20, 20, 13, 10]), array([35, 40, 35, 20, 20]))
求取平均值
salary_1 = (arr_max_2+arr_min_2)/2*1000
df["salary"] = salary_1
df.tail()
createTimeeducationsalary
1302020-03-16 11:36:07本科14000.0
1312020-03-16 09:54:47硕士37500.0
1322020-03-16 10:48:32本科30000.0
1332020-03-16 10:46:31本科19000.0
1342020-03-16 11:19:38本科30000.0

方法3:

df = pd.read_excel("./pandas1206855/pandas120.xlsx")
for index,row in df.iterrows():
    nums = re.findall('\d+',row[2])
    df.iloc[index,2] = int(eval(f'({nums[0]} + {nums[1]}) / 2 * 1000'))
df.tail()
createTimeeducationsalary
1302020-03-16 11:36:07本科14000
1312020-03-16 09:54:47硕士37500
1322020-03-16 10:48:32本科30000
1332020-03-16 10:46:31本科19000
1342020-03-16 11:19:38本科30000

方法2

df = pd.read_excel("./pandas1206855/pandas120.xlsx")
def func(df):
    lst = df['salary'].split('-')
    smin = int(lst[0].strip('k'))
    smax = int(lst[1].strip('k'))
    df['salary'] = int((smin + smax) / 2 * 1000)
    return df

df = df.apply(func,axis=1)
df.tail()
createTimeeducationsalary
1302020-03-16 11:36:07本科14000
1312020-03-16 09:54:47硕士37500
1322020-03-16 10:48:32本科30000
1332020-03-16 10:46:31本科19000
1342020-03-16 11:19:38本科30000

24.将数据根据学历进行分组并计算平均薪资

df.groupby("education").mean()
salary
education
不限19600.000000
大专10000.000000
本科19361.344538
硕士20642.857143

25.将createTime列时间转换为月-日

for i in range(len(df)):
    df.iloc[i,0] = df.iloc[i,0].to_pydatetime().strftime("%m-%d")  
df.head()
createTimeeducationsalary
003-16本科27500
103-16本科30000
203-16不限27500
303-16本科16500
403-16本科15000

27.查看数值型列的汇总统计

df.describe()
salary
count135.000000
mean19159.259259
std8661.686922
min3500.000000
25%14000.000000
50%17500.000000
75%25000.000000
max45000.000000

28 新增一列根据salary将数据分为三组

bins = [0,10000, 20000, 50000]
group_names = ['低', '中', '高']
df['categories'] = pd.cut(df['salary'], bins, labels=group_names)
df
createTimeeducationsalarycategories
003-16本科27500
103-16本科30000
203-16不限27500
303-16本科16500
403-16本科15000
...............
13003-16本科14000
13103-16硕士37500
13203-16本科30000
13303-16本科19000
13403-16本科30000

135 rows × 4 columns

29.按照salary列对数据降序排列

df.sort_values('salary', ascending=False) # ascending:升  descend 下降
createTimeeducationsalarycategories
5303-16本科45000
3703-16本科40000
10103-16本科37500
1603-16本科37500
13103-16硕士37500
...............
12303-16本科4500
12603-16本科4000
11003-16本科4000
9603-16不限3500
11303-16本科3500

135 rows × 4 columns

30.取出第30行

df.iloc[32]
createTime    03-16
education        硕士
salary        22500
categories        高
Name: 32, dtype: object

31 计算salary列的中位数

np.median(df["salary"])
17500.0

32.绘制薪资水平频率分布直方图

df.salary.plot(kind='hist')
plt.show()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-rLYfQwx7-1602577120621)(output_117_0.png)]

33.绘制薪资水平密度曲线

df.salary.plot(kind='kde',xlim=(0,80000))
plt.show()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-jaNOE96M-1602577120624)(output_119_0.png)]

34.删除最后一列categories

axis:使用0值表示沿着每一列或行标签\索引值向下执行方法,使用1值表示沿着每一行或者列标签模向执行对应的方法
df.drop(columns=["categories"])
createTimeeducationsalary
003-16本科27500
103-16本科30000
203-16不限27500
303-16本科16500
403-16本科15000
............
13003-16本科14000
13103-16硕士37500
13203-16本科30000
13303-16本科19000
13403-16本科30000

135 rows × 3 columns

df.drop(labels=“categories”,axis=1)

35.将df的第一列与第二列合并为新的一列

df["test"] = df["education"]+df["createTime"]
df.tail()
createTimeeducationsalarycategoriestest
13003-16本科14000本科03-16
13103-16硕士37500硕士03-16
13203-16本科30000本科03-16
13303-16本科19000本科03-16
13403-16本科30000本科03-16
df.columns
Index(['createTime', 'education', 'salary', 'categories', 'test'], dtype='object')

df.columns[[]]
Index([], dtype='object')
df.columns[[4,0,1,2,3]]
Index(['test', 'createTime', 'education', 'salary', 'categories'], dtype='object')
df[df.columns[[4,0,1,2,3]]].tail()
testcreateTimeeducationsalarycategories
130本科03-1603-16本科14000
131硕士03-1603-16硕士37500
132本科03-1603-16本科30000
133本科03-1603-16本科19000
134本科03-1603-16本科30000

36.将education列与salary列合并为新的一列

df["test_1"] = str(df["salary"])+df["education"]
df.tail()
createTimeeducationsalarycategoriestesttest_1
13003-16本科14000本科03-160 27500\n1 30000\n2 27500\n3 ...
13103-16硕士37500硕士03-160 27500\n1 30000\n2 27500\n3 ...
13203-16本科30000本科03-160 27500\n1 30000\n2 27500\n3 ...
13303-16本科19000本科03-160 27500\n1 30000\n2 27500\n3 ...
13403-16本科30000本科03-160 27500\n1 30000\n2 27500\n3 ...
df["test_1"] = df["salary"].map(str)+df["education"]
df.tail()
createTimeeducationsalarycategoriestesttest_1
13003-16本科14000本科03-1614000本科
13103-16硕士37500硕士03-1637500硕士
13203-16本科30000本科03-1630000本科
13303-16本科19000本科03-1619000本科
13403-16本科30000本科03-1630000本科

37.计算salary最大值与最小值之差

df[‘name’]#得到的是不包含列索引的Series结构
df[[‘name’]]#得到是包含列索引的DataFrame结构
df.name#得到是不包含列索引的Series结构

df[["salary"]].tail()
salary
13014000
13137500
13230000
13319000
13430000
df["salary"].tail()
130    14000
131    37500
132    30000
133    19000
134    30000
Name: salary, dtype: int64
df.salary.tail()
130    14000
131    37500
132    30000
133    19000
134    30000
Name: salary, dtype: int64
df[["salary"]].apply(lambda x : x.max()-x.min())
salary    41500
dtype: int64

38.将第一行与最后一行拼接

pd.concat([df[:1],df[-2:-1]])
createTimeeducationsalarycategoriestesttest_1
003-16本科27500本科03-1627500本科
13303-16本科19000本科03-1619000本科

39.将第8行数据添加至末尾

df[8:9]
createTimeeducationsalarycategoriestesttest_1
803-16不限7000不限03-167000不限
df.iloc[8,:]
createTime      03-16
education          不限
salary           7000
categories          低
test          不限03-16
test_1         7000不限
Name: 8, dtype: object
df.iloc[:,[2,4]]
salarytest
027500本科03-16
130000本科03-16
227500不限03-16
316500本科03-16
415000本科03-16
.........
13014000本科03-16
13137500硕士03-16
13230000本科03-16
13319000本科03-16
13430000本科03-16

135 rows × 2 columns

df.iloc[[2,4],:]
createTimeeducationsalarycategoriestesttest_1
203-16不限27500不限03-1627500不限
403-16本科15000本科03-1615000本科
df.iloc[[8]]
createTimeeducationsalarycategoriestesttest_1
803-16不限7000不限03-167000不限
df.append(df[8:9])
createTimeeducationsalarycategoriestesttest_1
003-16本科27500本科03-1627500本科
103-16本科30000本科03-1630000本科
203-16不限27500不限03-1627500不限
303-16本科16500本科03-1616500本科
403-16本科15000本科03-1615000本科
.....................
13103-16硕士37500硕士03-1637500硕士
13203-16本科30000本科03-1630000本科
13303-16本科19000本科03-1619000本科
13403-16本科30000本科03-1630000本科
803-16不限7000不限03-167000不限

136 rows × 6 columns

41.将createTime列设置为索引

df.set_index("createTime")
educationsalarycategoriestesttest_1
createTime
03-16本科27500本科03-1627500本科
03-16本科30000本科03-1630000本科
03-16不限27500不限03-1627500不限
03-16本科16500本科03-1616500本科
03-16本科15000本科03-1615000本科
..................
03-16本科14000本科03-1614000本科
03-16硕士37500硕士03-1637500硕士
03-16本科30000本科03-1630000本科
03-16本科19000本科03-1619000本科
03-16本科30000本科03-1630000本科

135 rows × 5 columns

42.生成一个和df长度相同的随机数dataframe

df1 = pd.DataFrame(pd.Series(np.random.randint(1, 10, 135)))
df1
0
08
17
29
36
44
......
1303
1317
1322
1339
1344

135 rows × 1 columns

43.将上一题生成的dataframe与df合并

df= pd.concat([df,df1],axis=1)
df
createTimeeducationsalarycategoriestesttest_10
003-16本科27500本科03-1627500本科8
103-16本科30000本科03-1630000本科7
203-16不限27500不限03-1627500不限9
303-16本科16500本科03-1616500本科6
403-16本科15000本科03-1615000本科4
........................
13003-16本科14000本科03-1614000本科3
13103-16硕士37500硕士03-1637500硕士7
13203-16本科30000本科03-1630000本科2
13303-16本科19000本科03-1619000本科9
13403-16本科30000本科03-1630000本科4

135 rows × 7 columns

44.生成新的一列new为salary列减去之前生成随机数列

df["new"] = df["salary"] - df[0]
df
createTimeeducationsalarycategoriestesttest_10new
003-16本科27500本科03-1627500本科827492
103-16本科30000本科03-1630000本科729993
203-16不限27500不限03-1627500不限927491
303-16本科16500本科03-1616500本科616494
403-16本科15000本科03-1615000本科414996
...........................
13003-16本科14000本科03-1614000本科313997
13103-16硕士37500硕士03-1637500硕士737493
13203-16本科30000本科03-1630000本科229998
13303-16本科19000本科03-1619000本科918991
13403-16本科30000本科03-1630000本科429996

135 rows × 8 columns

45.检查数据中是否含有任何缺失值

df.isnull().values.any()
False

46.将salary列类型转换为浮点数

df['salary'].astype(np.float64)
0      27500.0
1      30000.0
2      27500.0
3      16500.0
4      15000.0
        ...   
130    14000.0
131    37500.0
132    30000.0
133    19000.0
134    30000.0
Name: salary, Length: 135, dtype: float64

47.计算salary大于10000的次数

len(df[df['salary']>8000])
123

48.查看每种学历出现的次数

df.education.value_counts()
本科    119
硕士      7
不限      5
大专      4
Name: education, dtype: int64
df["education"].value_counts()
本科    119
硕士      7
不限      5
大专      4
Name: education, dtype: int64
df[["education"]].value_counts()
education
本科           119
硕士             7
不限             5
大专             4
dtype: int64

49.查看education列共有几种学历

df["education"].unique()
array(['本科', '不限', '硕士', '大专'], dtype=object)
df["education"].nunique()
4

50.提取salary与new列的和大于60000的最后3行

df1 = df[['salary','new']]
rowsums = df1.apply(np.sum, axis=1)
res = df.iloc[np.where(rowsums > 60000)[0][-3:], :]
res
createTimeeducationsalarycategoriestesttest_10new
9203-16本科35000本科03-1635000本科834992
10103-16本科37500本科03-1637500本科537495
13103-16硕士37500硕士03-1637500硕士737493
df1.tail()
salarynew
1301400013997
1313750037493
1323000029998
1331900018991
1343000029996
rowsums
0      54992
1      59993
2      54991
3      32994
4      29996
       ...  
130    27997
131    74993
132    59998
133    37991
134    59996
Length: 135, dtype: int64

51.使用绝对路径读取本地Excel数据

url_one = r'D:\exercise\pandas1206855\600000.SH.xls'
df = pd.read_excel(url_one)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
df.head(3)
代码简称日期前收盘价(元)开盘价(元)最高价(元)最低价(元)收盘价(元)成交量(股)成交金额(元)涨跌(元)涨跌幅(%)均价(元)换手率(%)A股流通市值(元)总市值(元)A股流通股本(股)市盈率
0600000.SH浦发银行2016-01-0416.135616.144416.144415.499715.720542240610754425783-0.4151-2.572517.86020.22643.320318e+113.320318e+111.865347e+106.5614
1600000.SH浦发银行2016-01-0515.720515.464415.950115.367215.86185805479310341814740.14130.898917.81390.31123.350163e+113.350163e+111.865347e+106.6204
2600000.SH浦发银行2016-01-0615.861815.808816.020815.623415.9855467726538386673980.12360.779517.93070.25073.376278e+113.376278e+111.865347e+106.6720

53.查看每列数据缺失值情况

df.isnull().sum()
代码           1
简称           2
日期           2
前收盘价(元)      2
开盘价(元)       2
最高价(元)       2
最低价(元)       2
收盘价(元)       2
成交量(股)       2
成交金额(元)      2
涨跌(元)        2
涨跌幅(%)       2
均价(元)        2
换手率(%)       2
A股流通市值(元)    2
总市值(元)       2
A股流通股本(股)    2
市盈率          2
dtype: int64

54.提取日期列含有空值的行

df[df["日期"].isnull()]
代码简称日期前收盘价(元)开盘价(元)最高价(元)最低价(元)收盘价(元)成交量(股)成交金额(元)涨跌(元)涨跌幅(%)均价(元)换手率(%)A股流通市值(元)总市值(元)A股流通股本(股)市盈率
327NaNNaNNaTNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
328数据来源:Wind资讯NaNNaTNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN

55.输出每列缺失值具体行数

for columname in df.columns:
    if df[columname].count() != len(data):
        loc = df[columname][df[columname].isnull().values==True].index.tolist()
        print('列名:"{}", 第{}行位置有缺失值'.format(columname,loc))
列名:"代码", 第[327]行位置有缺失值
列名:"简称", 第[327, 328]行位置有缺失值
列名:"日期", 第[327, 328]行位置有缺失值
列名:"前收盘价(元)", 第[327, 328]行位置有缺失值
列名:"开盘价(元)", 第[327, 328]行位置有缺失值
列名:"最高价(元)", 第[327, 328]行位置有缺失值
列名:"最低价(元)", 第[327, 328]行位置有缺失值
列名:"收盘价(元)", 第[327, 328]行位置有缺失值
列名:"成交量(股)", 第[327, 328]行位置有缺失值
列名:"成交金额(元)", 第[327, 328]行位置有缺失值
列名:"涨跌(元)", 第[327, 328]行位置有缺失值
列名:"涨跌幅(%)", 第[327, 328]行位置有缺失值
列名:"均价(元)", 第[327, 328]行位置有缺失值
列名:"换手率(%)", 第[327, 328]行位置有缺失值
列名:"A股流通市值(元)", 第[327, 328]行位置有缺失值
列名:"总市值(元)", 第[327, 328]行位置有缺失值
列名:"A股流通股本(股)", 第[327, 328]行位置有缺失值
列名:"市盈率", 第[327, 328]行位置有缺失值

56.删除所有存在缺失值的行

'''
备注
axis:0-行操作(默认),1-列操作
how:any-只要有空值就删除(默认),all-全部为空值才删除
inplace:False-返回新的数据集(默认),True-在原数据集上操作
'''
data = df
df.dropna(axis=0, how='any', inplace=True)
df.tail()
代码简称日期前收盘价(元)开盘价(元)最高价(元)最低价(元)收盘价(元)成交量(股)成交金额(元)涨跌(元)涨跌幅(%)均价(元)换手率(%)A股流通市值(元)总市值(元)A股流通股本(股)市盈率
322600000.SH浦发银行2017-05-0315.1615.1615.1615.0515.0814247943215130847-0.08-0.527715.09910.06593.260037e+113.260037e+112.161828e+106.1395
323600000.SH浦发银行2017-05-0415.0815.0715.0714.9014.9819477788291839737-0.10-0.663114.98320.09013.238418e+113.238418e+112.161828e+106.0988
324600000.SH浦发银行2017-05-0514.9814.9514.9814.5214.9240194577592160198-0.06-0.400514.73230.18593.225447e+113.225447e+112.161828e+106.0744
325600000.SH浦发银行2017-05-0814.9214.7814.9014.5114.8643568576638781010-0.06-0.402114.66150.20153.212476e+113.212476e+112.161828e+106.0500
326600000.SH浦发银行2017-05-0914.8614.6914.8414.6614.7619225492283864640-0.10-0.672914.7650.08893.190858e+113.190858e+112.161828e+106.0093

57.绘制收盘价的折线图

import matplotlib.pyplot as plt 
plt.style.use('seaborn-darkgrid') # 设置画图的风格
plt.rc('font',  size=6) #设置图中字体和大小
plt.rc('figure', figsize=(4,3), dpi=150) # 设置图的大小
df["收盘价(元)"].plot()
<AxesSubplot:>

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-Q3kE8F4A-1602577120628)(output_189_1.png)]

# 等价于
import matplotlib.pyplot as plt
plt.plot(df['收盘价(元)'])
plt.show()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-fhVj3pFK-1602577120668)(output_190_0.png)]

58.同时绘制开盘价与收盘价

df[['收盘价(元)','开盘价(元)']].plot()
<AxesSubplot:>



E:\ProgramData\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:238: RuntimeWarning: Glyph 25910 missing from current font.
  font.set_text(s, 0.0, flags=flags)
E:\ProgramData\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:238: RuntimeWarning: Glyph 30424 missing from current font.
  font.set_text(s, 0.0, flags=flags)
E:\ProgramData\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:238: RuntimeWarning: Glyph 20215 missing from current font.
  font.set_text(s, 0.0, flags=flags)
E:\ProgramData\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:238: RuntimeWarning: Glyph 20803 missing from current font.
  font.set_text(s, 0.0, flags=flags)
E:\ProgramData\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:238: RuntimeWarning: Glyph 24320 missing from current font.
  font.set_text(s, 0.0, flags=flags)
E:\ProgramData\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:201: RuntimeWarning: Glyph 25910 missing from current font.
  font.set_text(s, 0, flags=flags)
E:\ProgramData\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:201: RuntimeWarning: Glyph 30424 missing from current font.
  font.set_text(s, 0, flags=flags)
E:\ProgramData\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:201: RuntimeWarning: Glyph 20215 missing from current font.
  font.set_text(s, 0, flags=flags)
E:\ProgramData\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:201: RuntimeWarning: Glyph 20803 missing from current font.
  font.set_text(s, 0, flags=flags)
E:\ProgramData\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:201: RuntimeWarning: Glyph 24320 missing from current font.
  font.set_text(s, 0, flags=flags)

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-dNhKXybj-1602577120669)(output_192_2.png)]

59.绘制涨跌幅的直方图

plt.hist(df['涨跌幅(%)'])
# 等价于
df['涨跌幅(%)'].hist()
plt.show()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-WIOCxuBJ-1602577120670)(output_194_0.png)]

data = df
data['涨跌幅(%)'].hist(bins = 30) ##更细致
<AxesSubplot:>

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-0T7ntsaM-1602577120671)(output_195_1.png)]


  • 1
    点赞
  • 11
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

潘诺西亚的火山

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值