Learning Python------pandas module notes

TVfan

Article link: https://blog.csdn.net/qq_38420451/article/details/81357158

Introduction: pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. (For details, see http://pandas.pydata.org/index.html)

The two primary data structures of pandas: Series and DataFrame

Download and Import
pip3 install pandas (installs with pip3; see the official site for other methods)

import pandas as pd
 
import numpy as np

import matplotlib.pyplot as plt

#------Check the pandas version

print(pd.__version__)
Note: pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.

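Components:

A set of labeled array data structures, the primary of which are Series and DataFrame.
Index objects enabling both simple axis indexing and multi-level / hierarchical axis indexing.
An integrated group-by engine for aggregating and transforming data sets.
Date range generation (date_range) and custom date offsets enabling the implementation of customized frequencies.
Input/output tools: loading tabular data from flat files (CSV, delimited, Excel 2003), and saving and loading pandas objects from the fast and efficient PyTables / HDF5 format.
Memory-efficient "sparse" versions of the standard data structures for storing data that is mostly missing or mostly constant (some fixed value).
Moving window statistics (rolling mean, rolling standard deviation, etc.)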

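Detailed introduction----------Series

1. Series: a single column. A DataFrame contains one or more Series and a name for each Series.

A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floats, Python objects, etc.). The axis labels are collectively called the index.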

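2. Construction:
pd.Series(data, index=index, name=name)
Here, data can be many different things: a Python dict, an ndarray, or a scalar value (such as 5). index: the list of axis labels to use. name: assigned automatically in many cases (for example when a single column is selected from a DataFrame); read it with Series.name and change it with Series.rename.

Note: NaN (not a number) is the standard missing data marker used in pandas.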
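
To make the three kinds of data concrete, here is a minimal sketch. The variable names and values are made up for illustration and are not from the original notes.

```python
import numpy as np
import pandas as pd

# From a dict: the keys become the index labels.
from_dict = pd.Series({"a": 1.0, "b": 2.0, "c": 3.0}, name="demo")

# From an ndarray: the labels come from the index argument.
from_array = pd.Series(np.array([10, 20, 30]), index=["x", "y", "z"])

# From a scalar: the value is repeated once per index label.
from_scalar = pd.Series(5, index=["a", "b", "c"])

# Requesting a label the dict does not contain produces NaN,
# the missing-data marker mentioned above.
with_missing = pd.Series({"a": 1.0, "b": 2.0}, index=["a", "b", "z"])

# rename returns a new Series; from_dict keeps its own name.
renamed = from_dict.rename("renamed")

print(from_dict.name, renamed.name)
print(with_missing)
```
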

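3. Slicing and dict-style operations:
# Python-style slicing and dict-style operations both work on a Series
# (s is assumed to be a float Series labeled 'a' through 'e')
s.iloc[0]  # first element by position; plain s[0] on a label-indexed Series is deprecated in recent pandas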
0.469112299907186

s[:3] 

a    0.4691
b   -0.2829
c   -1.5091

s['a']
0.46911229990718628


s['e'] = 12
print(s)

a     0.4691
b    -0.2829
c    -1.5091
d    -1.1356
e    12.0000

# Other operations
s + s
s * 2
np.exp(s)
s[1:] + s[:-1]
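
The last expression above is worth a closer look, because arithmetic between two Series aligns on labels rather than on positions. A small sketch with made-up values (not the random data shown earlier):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])

# s[1:] carries labels b, c, d; s[:-1] carries labels a, b, c.
aligned_sum = s[1:] + s[:-1]

# Labels found on only one side (a and d) come out as NaN;
# b and c are added to themselves.
print(aligned_sum)

# Series.add with fill_value treats the missing side as 0 instead.
filled_sum = s[1:].add(s[:-1], fill_value=0)
print(filled_sum)
```

The result index is the union of the two label sets, which is why the sum has four entries even though each operand has three.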

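Detailed introduction---------DataFrame

1. DataFrame: you can think of it as a relational data table, with rows and named columns.

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can think of it as a spreadsheet or SQL table, or as a dict of Series objects. It is generally the most commonly used pandas object.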

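2. Construction:

pd.DataFrame(data, index=index, columns=columns):
index: the row labels.
columns: the column labels.
values: the two-dimensional array of values (available afterwards as DataFrame.values).
name: not a constructor argument; a name belongs to each Series, not to the DataFrame itself.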
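
As a quick illustration of how these pieces fit together, the sketch below builds a tiny frame from a list of rows and reads the same three pieces back. The labels and numbers are invented for the example.

```python
import pandas as pd

df_small = pd.DataFrame(
    [[1, 2.5], [3, 4.5], [5, 6.5]],   # data: one inner list per row
    index=["r1", "r2", "r3"],         # row labels
    columns=["count", "score"],       # column labels
)

print(df_small.index)    # the row labels
print(df_small.columns)  # the column labels
print(df_small.values)   # the data as a 2D NumPy array
```
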
date=pd.date_range('20170101',periods=6)
date
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06'],
              dtype='datetime64[ns]', freq='D')





# ---------Using NumPy
df = pd.DataFrame(np.random.randn(6,4),index=date,columns=['a','b','c','d'])
df
                 a         b         c         d
2017-01-01  -1.993447  1.272175 -1.578337 -1.972526
2017-01-02   0.092701 -0.503654 -0.540655 -0.126386
2017-01-03   0.191769 -0.578872 -1.693449  0.457891
2017-01-04   2.121120  0.521884 -0.419368 -1.916585
2017-01-05   1.642063  0.222134  0.108531 -1.858906
2017-01-06   0.636639  0.487491  0.617841 -1.597920

# ------Using a dict

d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
 
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN
But most of the time, you load an entire file into a DataFrame. The following example loads a file with California housing data. Run the following cell to load the data and create feature definitions:
california_housing_dataframe = pd.read_csv("https://storage.googleapis.com/mledu-datasets/california_housing_train.csv", sep=",")
3. Common attributes and methods (think of a DataFrame as a dict that maps each column name to its corresponding Series)

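DataFrame.dtypes: the data type of each column

DataFrame.index: the row labels

DataFrame.columns: the column labels

DataFrame.values: the data as a two-dimensional NumPy array

DataFrame.iloc[loc]: select a row by integer location

DataFrame.loc[label]: select a row by label

Slicing and indexing work much as they do for a Series

Adding, deleting, updating, and looking up columns works much like a Python dict (insert(), pop(), ...)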

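DataFrame.head(n): return the first n rows; n defaults to 5

DataFrame.tail(n): return the last n rows; n defaults to 5

california_housing_dataframe.hist('housing_median_age'): plot a histogram of one column

DataFrame.assign(new_column=expr): return a new DataFrame with the extra column added

DataFrame.T (transpose): the data with rows and columns swapped

DataFrame.describe(): summary statistics of the data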

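DataFrame.sort_index(axis=0|1, ascending=True|False): sort by labels; axis=0 sorts the row labels, axis=1 sorts the column labels

DataFrame.idxmin(axis=0|1): return the index of the minimum value; with axis=0, the row label of the minimum in each column; with axis=1, the column label of the minimum in each row

DataFrame.idxmax(axis=0|1): return the index of the maximum value, with the same meaning of axis as idxmin

Series.value_counts(): return the number of occurrences of each value

DataFrame.mode(): return the most frequent value(s) in a DataFrame or Series

DataFrame.apply(function, axis): function is applied to each column (axis=0) or each row (axis=1); the shape of the result depends on what the function returns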
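
Several of the entries above are easier to remember next to a tiny example. The sketch below uses a made-up three-row frame (not the California housing data) and only shows one plausible way to call each method.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [3.0, 1.0, 2.0], "two": [10.0, 30.0, 20.0]},
    index=["c", "a", "b"],
)

first_row = df.iloc[0]   # by position: the row labeled "c"
row_a = df.loc["a"]      # by label

# Sort by the row labels in ascending order.
sorted_rows = df.sort_index(axis=0, ascending=True)

# For each column, the row label where the min / max occurs.
min_labels = df.idxmin(axis=0)
max_labels = df.idxmax(axis=0)

# assign returns a new frame; df itself is left unchanged.
with_ratio = df.assign(ratio=df["two"] / df["one"])

# apply with axis=0 calls the function once per column.
col_range = df.apply(lambda col: col.max() - col.min(), axis=0)

# value_counts counts how often each value occurs in a Series.
counts = pd.Series(["x", "y", "x"]).value_counts()

# pop removes a column in place and returns it, like dict.pop.
popped = df.pop("two")

print(sorted_rows)
print(min_labels)
print(with_ratio)
```

Note that pop changes df in place, while assign and sort_index return new objects, so the order of the calls matters if you adapt this sketch.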