pandas-Series
A Series is a one-dimensional array of labeled data.
Creating a Series in pandas
1. From an array
2. With an explicit index
3. From a dictionary
4. From an ndarray
import pandas as pd
import numpy as np
a = pd.Series(np.array(range(5)))
b = pd.Series([0,1,2,3,4],index=list("abcde"))
c = pd.Series({"name":"k","age":20,"sex":"man"})
print(a)
print(b)
print(c)
print(b.astype(float))
print(a.head(2)) # first 2 rows
head() returns the first five rows by default
tail() returns the last five rows by default
Slicing and indexing a pandas Series
Given t1 = pd.Series(temp_dict), elements can be reached in these ways:
1. By key (label)
2. By position
3. t1.index
4. t1.values
c = pd.Series({"name":"k","age":20,"sex":"man","sex2":"man2","sex3":"man3","sex4":"man4"})
print(c)
print("-"*100)
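print(c["name"], c.loc["name"])  # select by label
print(c.iloc[1])  # select by position; a bare c[1] on a string-labeled Series is deprecated in pandas 2.x
print(c.index)
print(c.values)
print(c[:2])  # slice: first two elements
print(c.iloc[[0, 2]])  # elements at positions 0 and 2
print(c[["name", "sex"]])  # several labels at once
print(c[c.index == "name"])  # boolean mask built from the index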
Index and values of a pandas Series
A single index label cannot be changed in place; the index can only be replaced as a whole.
c = pd.Series({"name":"k","age":20,"sex":"man","sex2":"man2","sex3":"man3","sex4":"man4"})
c2 = c.reset_index()
c3 = c.reset_index(drop=True)
print(c2)
print(c3)
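The rule about replacing the index as a whole can be checked with a short sketch. The new labels used here (`first_name`, `years`, `gender`) are made up for the illustration and are not part of the original notes.

```python
import pandas as pd

c = pd.Series({"name": "k", "age": 20, "sex": "man"})

# Changing one label in place is not allowed: Index objects are immutable.
try:
    c.index[0] = "first_name"
    changed_in_place = True
except TypeError:
    changed_in_place = False
print(changed_in_place)

# Replacing the whole index at once works (the length must match).
c.index = ["first_name", "years", "gender"]  # hypothetical labels
print(c)

# rename() returns a new Series with only the chosen labels changed.
d = c.rename({"years": "age"})
print(d.index.tolist())
```

`rename` is usually the more convenient choice when only a few labels need to change, since it leaves the original Series untouched.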
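## Series arithmetic in pandas
Values are added by matching index labels; a label that exists in only one Series yields NaN.
```python
c = pd.Series(range(10))
a = pd.Series(range(10, 15))
print(c)
print(a)
print(c + a)  # labels 5 to 9 exist only in c, so those results are NaN
```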
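When the NaN results are not wanted, one possible approach is `Series.add` with `fill_value`, which treats a missing operand as the given value. A minimal sketch using the same two Series:

```python
import pandas as pd

c = pd.Series(range(10))        # labels 0..9
a = pd.Series(range(10, 15))    # labels 0..4

# Plain "+" aligns on labels; labels 5..9 exist only in c, so they become NaN.
plain = c + a

# add() with fill_value=0 treats the missing operand as 0 instead.
filled = c.add(a, fill_value=0)

print(plain)
print(filled)
```

Note that the result of the plain addition is a float Series, because NaN cannot be stored in an integer column.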
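Reading external data with pandas
a = pd.read_csv("data.csv")  # "data.csv" is a placeholder; pass the path of your own file
b = pd.read_clipboard()  # read from the clipboard
Both calls return a DataFrame.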
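To try `read_csv` without a file on disk, the CSV text can be wrapped in `io.StringIO`. This is only a sketch: the column names and rows are invented for the example.

```python
import io
import pandas as pd

# Stand-in for a file on disk; the contents are made up.
csv_text = "name,age\nkang,20\nwu,19\n"

df = pd.read_csv(io.StringIO(csv_text))
print(type(df))
print(df.dtypes)
```

The first line of the text is taken as the header, and the column types are inferred from the values.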
Creating a DataFrame in pandas
A DataFrame is two-dimensional.
a = pd.DataFrame(np.arange(12).reshape(3,4),index=list("abc"),columns=list("1234"))
print(a)
# index: row labels
# columns: column labels
a = pd.DataFrame({"name":["kang","wu"],"age":["20","19"],"sex":["man","woman"],"guan":["love","you"]})
print(a)
from datetime import date  # needed for the 'B' entry

dict_data = {
    'A': 1.,
    'B': date(year=2019, month=8, day=29),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': ['Python','Java','C++','C#'],
    'F': 'ChinaHadoop'
}
a = pd.DataFrame(dict_data)
print(a) # all array-like values must have the same length; scalars are repeated to fill the column
data = [
{"name":"kang","age":20,"tel":187231},
{"name":"wu","age":19,"tel":123456},
{"name":"lo","age":0}
]
b = pd.DataFrame(data)
print(b)
# keys missing from a record are filled with NaN
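Basic DataFrame attributes
df.shape # (number of rows, number of columns)
df.dtypes # data type of each column
df.ndim # number of dimensions
df.index # row index
df.columns # column index
df.values # the underlying values as a 2-D ndarray
df.drop(columns=["name", "age"]) # returns a new DataFrame without these columns; df itself is unchanged
del df["name"] # removes the column from df in place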
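The difference between `drop` and `del` is easy to miss, so here is a small sketch of it. The sample frame is invented for the illustration.

```python
import pandas as pd

df = pd.DataFrame({"name": ["kang", "wu"], "age": [20, 19], "sex": ["man", "woman"]})

# drop() returns a new DataFrame and leaves df untouched.
trimmed = df.drop(columns=["name", "age"])
print(trimmed.columns.tolist())
print(df.columns.tolist())

# del removes the column from df itself.
del df["name"]
print(df.columns.tolist())
```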
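Getting an overview of a DataFrame
t2.head(3) # first rows; 5 by default
t2.tail(3) # last rows; 5 by default
t2.info() # summary of the index, column types, and non-null counts
t2.describe() # quick descriptive statistics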
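The notes do not define `t2`, so the sketch below builds a small stand-in frame (its contents are an assumption) and runs the attribute and overview calls on it.

```python
import pandas as pd

# Hypothetical stand-in for t2.
t2 = pd.DataFrame({"name": ["kang", "wu", "lo"], "age": [20, 19, 0]})

print(t2.shape)    # (rows, columns)
print(t2.ndim)
print(t2.dtypes)

t2.info()          # prints index range, column dtypes, non-null counts

summary = t2.describe()   # covers numeric columns only by default
print(summary)
```

Because `describe()` skips non-numeric columns by default, the `name` column does not appear in the summary.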
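Selecting rows and columns in pandas
A slice inside square brackets selects rows.
A string inside square brackets selects the column with that label.
df.loc selects rows by label
df.iloc selects rows by position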
a = pd.DataFrame(np.arange(12).reshape(3,4),columns=list("xyzh"),index=list("123"))
print(a)
print("*"*100)
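print(a[:2]) # first 2 rows
print(a["x"]) # column x
print(a[1:2]["x"]) # row at position 1, column x
print(a["x"][:2]) # column x, first 2 rows
print(a.loc["1"]) # row labeled "1"
print(a.loc[:,"x"]) # column labeled "x"
print(a.loc[["1","2"]]) # rows labeled "1" and "2"
print("*"*100)
print(a.iloc[2]) # row at position 2 (the third row)
print(a.iloc[:3]) # first 3 rows
print(a.iloc[2,3]) # value at row position 2, column position 3
print(a.iloc[[2, 1]]) # rows at positions 2 and 1, in that order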
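One detail worth a separate sketch is how slices behave: `loc` slices include both end labels, while `iloc` slices exclude the stop position, like ordinary Python slices. The frame is the same one used above.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("xyzh"), index=list("123"))

by_label = a.loc["1":"2", "x":"z"]    # label slices include both end points
by_position = a.iloc[0:2, 0:2]        # position slices exclude the stop

print(by_label)
print(by_position)
```

Here the label slice returns 2 rows and 3 columns, while the position slice returns 2 rows and 2 columns.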
Sorting
data = [
{"name":"kang","age":20,"tel":187231},
{"name":"wu","age":19,"tel":123456},
{"name":"lo","age":0}
]
b = pd.DataFrame(data)
print(b.sort_values(by="tel")) # sort by tel, ascending
print(b.sort_values(by="tel",ascending=False)) # sort by tel, descending
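The third record has no `tel`, which raises the question of where it ends up. A short sketch with the same data suggests that rows with a missing sort key go last in both directions unless `na_position` says otherwise.

```python
import pandas as pd

b = pd.DataFrame([
    {"name": "kang", "age": 20, "tel": 187231},
    {"name": "wu", "age": 19, "tel": 123456},
    {"name": "lo", "age": 0},
])

# Rows with a missing sort key go last by default, whatever the direction.
asc = b.sort_values(by="tel")
desc = b.sort_values(by="tel", ascending=False)

# na_position="first" moves them to the top instead.
na_first = b.sort_values(by="tel", na_position="first")

print(asc["name"].tolist())
print(desc["name"].tolist())
print(na_first["name"].tolist())
```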
DataFrame arithmetic
a = pd.DataFrame(np.ones((2,2)),columns=list("ab"))
b = pd.DataFrame(np.ones((3,3)),columns=list("abc"))
print(a)
print(b)
print("*"*100)
print(a + b)
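The operation is applied element by element on matching row and column labels; positions that exist in only one frame become NaN.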
a = pd.DataFrame(np.arange(12).reshape(3,4),columns=list("abcd"))
print(a)
print("*"*100)
print(a[a["c"]>2])
print(a[(a["c"]>2) & (a["c"]<10) ])
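The filter above combines two conditions with `&`. As a complementary sketch on the same frame, `|` expresses OR, `~` negates a mask, and `Series.between` is one way to write an inclusive range check. Each condition needs its own parentheses because `&`, `|`, and `~` bind more tightly than the comparisons.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("abcd"))

# Column c holds 2, 6, 10.
either = a[(a["c"] < 3) | (a["c"] > 9)]     # OR
negated = a[~(a["c"] > 2)]                   # NOT
between = a[a["c"].between(3, 9)]            # inclusive range check

print(either)
print(negated)
print(between)
```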
Common string methods in pandas
a = pd.DataFrame(data = [
{"name":"kang","age":20,"tel":187231},
{"name":"wu","age":19,"tel":123456},
{"name":"lo","age":0}
])
print(a)
print("*"*100)
print(len(a["name"]))  # number of rows in the column
print(a["name"].str.len())  # length of each string in the column
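A few other methods under the `.str` accessor follow the same pattern: each one is applied to every element and returns a new Series. A minimal sketch using the names from the example:

```python
import pandas as pd

names = pd.Series(["kang", "wu", "lo"])

upper = names.str.upper()                          # upper-case every string
has_a = names.str.contains("a")                    # True where the string contains "a"
swapped = names.str.replace("o", "0", regex=False) # literal substring replacement
lengths = names.str.len()

print(upper.tolist())
print(has_a.tolist())
print(swapped.tolist())
print(lengths.tolist())
```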
Handling missing data in pandas
Detect NaN values: pd.isnull(df), pd.notnull(df)
Option 1: drop the rows or columns that contain NaN: dropna(axis=0, how='any', inplace=False)
Option 2: fill in the missing values: t.fillna(t.mean()), t.fillna(t.median()), t.fillna(0)
a = pd.DataFrame(np.arange(12).reshape(3,4))
print(a)
print("*"*100)
a.iloc[1,2] = np.nan
print(a)
print(pd.isnull(a))
print(a.dropna(axis=0))
print(a.dropna(axis=1))
print(a.dropna(axis=1,how="all")) # drop a column only when every value in it is NaN
print(a.fillna(a.mean()))
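To make the effect of `fillna(a.mean())` concrete, the sketch below counts the missing values per column and checks which value is filled in. It also uses the `subset` parameter of `dropna`, which the notes above do not cover, to drop rows based on selected columns only.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(12, dtype=float).reshape(3, 4))
a.iloc[1, 2] = np.nan

n_missing = a.isna().sum()            # NaN count per column
filled = a.fillna(a.mean())           # each column is filled with its own mean
rows_kept = a.dropna(subset=[2])      # drop rows where column 2 is NaN

print(n_missing)
print(filled)
print(rows_kept)
```

Column 2 holds 2 and 10 after the NaN is inserted, so the gap is filled with their mean, 6.0.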
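Handling duplicate data in pandas
Check for duplicate rows: data.duplicated()
Remove duplicate rows: data.drop_duplicates()
subset: compare only the given columns
keep: which occurrence to keep; the first one by default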
a = pd.DataFrame({'age':[28,31,27,28],'gender':['M','M','M','M'],'name':['Liu','Li','Chen','Liu']})
print(a)
print("*"*100)
a.iloc[1,2] = np.nan
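print(a.duplicated()) # True where the whole row repeats an earlier row
print(a.duplicated(subset=["age","name"])) # compare only age and name
print(a.drop_duplicates()) # remove duplicate rows, keeping the first occurrence
print(a.drop_duplicates(keep="last")) # keep the last occurrence instead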
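The `keep` parameter also accepts `False`, which drops every row that has a duplicate. The sketch below compares the options on the original data, without the NaN assignment made above, so that rows 0 and 3 are exact duplicates.

```python
import pandas as pd

a = pd.DataFrame({
    "age": [28, 31, 27, 28],
    "gender": ["M", "M", "M", "M"],
    "name": ["Liu", "Li", "Chen", "Liu"],
})

first = a.drop_duplicates()                        # keep="first" is the default
last = a.drop_duplicates(keep="last")
none = a.drop_duplicates(keep=False)               # drop every row that has a duplicate
by_gender = a.drop_duplicates(subset=["gender"])   # compare the gender column only

print(first.index.tolist())
print(last.index.tolist())
print(none.index.tolist())
print(by_gender.index.tolist())
```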
Replacing values in pandas
replace(to_replace)
to_replace is the value to be replaced; it can be:
a number or a string
a list
a dictionary
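The three forms of `to_replace` listed above can be sketched as follows. The sample values are made up for the illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 2])

single = s.replace(2, 20)                 # scalar -> scalar
many = s.replace([1, 3], 0)               # list of values -> one value
mapped = s.replace({1: 100, 2: 200})      # dict of old value -> new value

# With strings, replace() matches whole values, not substrings.
words = pd.Series(["man", "woman"]).replace("man", "M")

print(single.tolist())
print(many.tolist())
print(mapped.tolist())
print(words.tolist())
```

For substring replacement inside each string, `Series.str.replace` is the method to reach for instead.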
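Merging data
join: by default, combines rows that share the same row index
Rows with matching index labels are placed side by side; entries with no match are filled with NaN
merge: combines data on the specified column(s), using the chosen join type
When both frames have a column with the same name, rows with equal values in that shared column are merged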
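The notes give no code for `join` and `merge`, so here is a minimal sketch of both. All frames, labels, and the `li` record are invented for the example.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])
right = pd.DataFrame({"b": [10, 20]}, index=["x", "y"])

# join aligns on the row index; it is a left join by default,
# so row "z" is kept and gets NaN in column b.
joined = left.join(right)

people = pd.DataFrame({"name": ["kang", "wu", "lo"], "age": [20, 19, 0]})
phones = pd.DataFrame({"name": ["kang", "wu", "li"], "tel": [187231, 123456, 555]})

# merge matches rows on the shared column; the default how="inner"
# keeps only the names present in both frames.
inner = people.merge(phones, on="name")

# how="outer" keeps every name and fills the gaps with NaN.
outer = people.merge(phones, on="name", how="outer")

print(joined)
print(inner)
print(outer)
```

The `how` parameter also accepts `"left"` and `"right"`, which keep all rows of one side only.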
Grouping and aggregating data
df.groupby(by="columns_name")
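`groupby` on its own only builds the groups; an aggregation has to follow to get results. A minimal sketch, assuming a made-up frame with a `sex` column to group on and an `age` column to aggregate:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "sex": ["man", "woman", "man", "woman"],
    "age": [20, 19, 30, 21],
})

grouped = df.groupby(by="sex")

mean_age = grouped["age"].mean()             # one value per group
counts = grouped.size()                      # number of rows per group
stats = grouped["age"].agg(["min", "max"])   # several aggregations at once

print(mean_age)
print(counts)
print(stats)
```

The group keys become the index of each result, so a single group can be looked up with `mean_age["man"]`.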