Installation
To install pandas, enter conda install pandas at the command line. To check which version you have installed, enter !conda list pandas in a Jupyter notebook, or conda list pandas at the command line; either returns the version number.
Pandas Series
A Pandas Series is a one-dimensional labeled array, somewhat like a dictionary. Unlike a NumPy array, a Series can hold data of different types, and it lets you assign an index label to each element.
First, we need to import the pandas library.
import pandas as pd
Creating a Pandas Series
We create one with pd.Series(data, index), where index is the list of index labels.
import pandas as pd
groceries = pd.Series(data = [30,6,'Yes','No'],index = ['eggs','apples','milk','bread'])
groceries
eggs 30
apples 6
milk Yes
bread No
dtype: object
Notice that the first column of the output is the index, and it is not 0 through 3 but the labels we set. The second column holds the data, which can be of mixed types.
Like NumPy arrays, a Series has many attributes that return information about it, such as .shape, .ndim, and .size.
print('Groceries has shape:', groceries.shape)
print('Groceries has dimension:', groceries.ndim)
print('Groceries has a total of', groceries.size, 'elements')
Groceries has shape: (4,)
Groceries has dimension: 1
Groceries has a total of 4 elements
A Series also lets you retrieve the data and the index separately.
print(groceries.values)
print(groceries.index)
[30 6 'Yes' 'No']
Index(['eggs', 'apples', 'milk', 'bread'], dtype='object')
A Pandas Series also supports the in operator for checking whether a given index label exists (it checks the labels, not the values).
x = 'bananas' in groceries
y = 'bread' in groceries
print(x)
print(y)
False
True
Accessing, Modifying, and Deleting Elements
We can access data by index label, and also by numerical index (both positive and negative indices work). To remove any ambiguity between the two, pandas provides two attributes: .loc, which always means label-based indexing, and .iloc, which always means integer-based indexing.
print(groceries['eggs'])
print(groceries[['eggs','bread']])
print(groceries.loc[['eggs','apples']])
print()
print(groceries[[0,1]])
print(groceries[[-1]])
print(groceries.iloc[[0,1]])
30
eggs 30
bread No
dtype: object
eggs 30
apples 6
dtype: object
eggs 30
apples 6
dtype: object
bread No
dtype: object
eggs 30
apples 6
dtype: object
To change an element's value, access it and assign a new value to it.
groceries['eggs'] = 2
print(groceries)
eggs 2
apples 6
milk Yes
bread No
dtype: object
We can delete data from a Series with the .drop() method, but it only returns a new Series with the given label removed; the original is unchanged. To modify the original Series, pass the keyword argument inplace = True.
groceries = pd.Series(data = [30,6,'Yes','No'],index = ['eggs','apples','milk','bread'])
print(groceries.drop('apples'))
print(groceries)
eggs 30
milk Yes
bread No
dtype: object
eggs 30
apples 6
milk Yes
bread No
dtype: object
groceries.drop('apples',inplace = True)
print(groceries)
eggs 30
milk Yes
bread No
dtype: object
Arithmetic Operations
Like NumPy arrays, a Pandas Series supports element-wise arithmetic.
fruits = pd.Series(data=[10,6,3], index = ['apples','oranges','bananas'])
print(fruits)
print()
print(fruits + 2)
print(fruits - 2)
print(fruits * 2)
print(fruits / 2)
apples 10
oranges 6
bananas 3
dtype: int64
apples 12
oranges 8
bananas 5
dtype: int64
apples 8
oranges 4
bananas 1
dtype: int64
apples 20
oranges 12
bananas 6
dtype: int64
apples 5.0
oranges 3.0
bananas 1.5
dtype: float64
Pandas also works with NumPy's mathematical functions, such as np.sqrt().
import numpy as np
print(fruits)
print(np.exp(fruits))
print(np.sqrt(fruits))
print(np.power(fruits,2))
apples 10
oranges 6
bananas 3
dtype: int64
apples 22026.465795
oranges 403.428793
bananas 20.085537
dtype: float64
apples 3.162278
oranges 2.449490
bananas 1.732051
dtype: float64
apples 100
oranges 36
bananas 9
dtype: int64
You can also apply arithmetic to selected elements only.
print(fruits['bananas'] + 2)
print(fruits.iloc[0]-2)
print(fruits[['apples','oranges']]*2)
print(fruits[['apples','oranges']] / 2)
5
8
apples 20
oranges 12
dtype: int64
apples 5.0
oranges 3.0
dtype: float64
For a Series that mixes several data types, some arithmetic operations still work. For example:
groceries*2
eggs 60
milk YesYes
bread NoNo
dtype: object
groceries / 2
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-23-3d64b17810f9> in <module>
----> 1 groceries / 2
...
TypeError: unsupported operand type(s) for /: 'str' and 'int'
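Multiplication worked on every element because * is defined for both integers and strings ('Yes' * 2 gives 'YesYes'), while / is not defined for strings. One way around the error, sketched here using the label selection shown earlier, is to apply the operation only to the entries we know are numeric:

```python
import pandas as pd

# Recreate the Series as it stands at this point (apples was dropped earlier).
groceries = pd.Series(data=[30, 'Yes', 'No'], index=['eggs', 'milk', 'bread'])

# Select the numeric entries by label before dividing.
halved = groceries[['eggs']] / 2
print(halved)   # eggs    15.0
```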
Pandas DataFrames
A Pandas DataFrame is a two-dimensional data structure with both row labels and column labels, much like a spreadsheet.
One way to build a DataFrame is to create a dictionary of Pandas Series and pass it to the pd.DataFrame() function.
items = {'Bob': pd.Series(data =[245,25,55], index = ['bike','pants','watch']),
'Alice' : pd.Series(data = [40,110,500,45], index = ['book','glasses','bike','pants'])}
print(type(items))
<class 'dict'>
shopping_carts = pd.DataFrame(items)
shopping_carts
Bob | Alice | |
---|---|---|
bike | 245.0 | 500.0 |
book | NaN | 40.0 |
glasses | NaN | 110.0 |
pants | 25.0 | 45.0 |
watch | 55.0 | NaN |
Notice that the row labels in the first column come out in alphabetical order. Many positions also show NaN, which is how pandas indicates that there is no data for that particular combination of row and column.
If the Series used to create the DataFrame have no explicit index labels, pandas generates numerical row labels instead.
data = {'Bob': pd.Series([245,25,55]),'Alice' : pd.Series([40,110,500,45])}
df = pd.DataFrame(data)
df
Bob | Alice | |
---|---|---|
0 | 245.0 | 40 |
1 | 25.0 | 110 |
2 | 55.0 | 500 |
3 | NaN | 45 |
We can extract information from a DataFrame through its attributes.
print(shopping_carts.shape)
print(shopping_carts.ndim)
print(shopping_carts.size)
print(shopping_carts.values)
print()
print(shopping_carts.index)
print(shopping_carts.columns)
(5, 2)
2
10
[[245. 500.]
[ nan 40.]
[ nan 110.]
[ 25. 45.]
[ 55. nan]]
Index(['bike', 'book', 'glasses', 'pants', 'watch'], dtype='object')
Index(['Bob', 'Alice'], dtype='object')
Sometimes we do not want the entire dictionary in the DataFrame; we can restrict it with the columns and index keyword arguments.
bob_shopping_cart = pd.DataFrame(items,columns=['Bob'])
bob_shopping_cart
Bob | |
---|---|
bike | 245 |
pants | 25 |
watch | 55 |
alice_sel_shopping_cart = pd.DataFrame(items,index = ['glasses','bike'],columns = ['Alice'])
alice_sel_shopping_cart
Alice | |
---|---|
glasses | 110 |
bike | 500 |
We can also create a DataFrame from a dictionary of lists, but all the lists in the dictionary must have the same length.
data = {'Integers' : [1,2,3],
'Floats' : [4.5, 8.2, 9.6]}
df = pd.DataFrame(data)
df
Integers | Floats | |
---|---|---|
0 | 1 | 4.5 |
1 | 2 | 8.2 |
2 | 3 | 9.6 |
Notice that the data dictionary above has no index labels; we can supply them through the index keyword of pd.DataFrame.
data = {'Integers' : [1,2,3],
'Floats' : [4.5, 8.2, 9.6]}
df = pd.DataFrame(data, index = ['label 1', 'label 2', 'label 3'])
df
Integers | Floats | |
---|---|---|
label 1 | 1 | 4.5 |
label 2 | 2 | 8.2 |
label 3 | 3 | 9.6 |
We can also create a DataFrame from a list of Python dictionaries. The usage is similar, and the index keyword works here as well.
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35},
{'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5}]
store_items = pd.DataFrame(items2, index = ['store 1', 'store 2'])
store_items
bikes | glasses | pants | watches | |
---|---|---|---|---|
store 1 | 20 | NaN | 30 | 35 |
store 2 | 15 | 50.0 | 5 | 10 |
Accessing Data
A DataFrame lets us use row and column labels to access its rows, its columns, or individual elements.
import pandas as pd
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35},
{'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5}]
store_items = pd.DataFrame(items2, index = ['store 1', 'store 2'])
print(store_items)
print()
print(store_items[['bikes']])
print()
print(store_items[['bikes','pants']])
print()
print(store_items.loc[['store 1']])
print()
print(store_items['bikes']['store 2'])
bikes glasses pants watches
store 1 20 NaN 30 35
store 2 15 50.0 5 10
bikes
store 1 20
store 2 15
bikes pants
store 1 20 30
store 2 15 5
bikes glasses pants watches
store 1 20 NaN 30 35
15
To access a single element this way, we must give the column label first and then the row label; the other way around raises an error.
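For instance, with the store_items DataFrame above (rebuilt here so the snippet is self-contained), both the bracket syntax and .loc reach the same element. Note that .loc takes the row label first:

```python
import pandas as pd

# Rebuild the two-store DataFrame from above.
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35},
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5}]
store_items = pd.DataFrame(items2, index=['store 1', 'store 2'])

# Bracket syntax: column label first, then row label.
print(store_items['bikes']['store 2'])        # 15

# .loc syntax: row label first, then column label.
print(store_items.loc['store 2', 'bikes'])    # 15
```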
We can also modify a DataFrame, for example by adding a column. To give store_items a shirts column:
store_items['shirts'] = [15,2]
store_items
bikes | glasses | pants | watches | shirts | |
---|---|---|---|---|---|
store 1 | 20 | NaN | 30 | 35 | 15 |
store 2 | 15 | 50.0 | 5 | 10 | 2 |
We can also add a new column through arithmetic between existing columns.
store_items['suits'] = store_items['pants'] + store_items['shirts']
store_items
bikes | glasses | pants | watches | shirts | suits | |
---|---|---|---|---|---|---|
store 1 | 20 | NaN | 30 | 35 | 15 | 45 |
store 2 | 15 | 50.0 | 5 | 10 | 2 | 7 |
To add a new row, we first create a new DataFrame and then append it to the original one. Say you open a third store and want to add it to the data.
new_items = [{'bikes':20,'pants':30, 'watches': 35,'glass': 4}]
new_store = pd.DataFrame(new_items,index = ['store 3'])
new_store
bikes | glass | pants | watches | |
---|---|---|---|---|
store 3 | 20 | 4 | 30 | 35 |
We can then use the append method to add the new store's information to the combined store data.
store_items = store_items.append(new_store)
store_items
bikes | glass | glasses | pants | shirts | suits | watches | |
---|---|---|---|---|---|---|---|
store 1 | 20 | NaN | NaN | 30 | 15.0 | 45.0 | 35 |
store 2 | 15 | NaN | 50.0 | 5 | 2.0 | 7.0 | 10 |
store 3 | 20 | 4.0 | NaN | 30 | NaN | NaN | 35 |
Notice that the columns come out in alphabetical order.
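Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; the same row addition is now written with pd.concat. A minimal sketch (columns trimmed for brevity):

```python
import pandas as pd

# Two existing stores and a new one, as in the example above.
store_items = pd.DataFrame([{'bikes': 20, 'pants': 30},
                            {'bikes': 15, 'pants': 5}],
                           index=['store 1', 'store 2'])
new_store = pd.DataFrame([{'bikes': 20, 'pants': 30, 'glass': 4}],
                         index=['store 3'])

# pd.concat replaces the removed DataFrame.append.
store_items = pd.concat([store_items, new_store])
print(store_items)
```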
You can also create a new column from particular rows of an existing column. For example, we can build a new watches column from rows 1 onward:
store_items['new watches'] = store_items['watches'][1:]
store_items
bikes | glass | glasses | pants | shirts | suits | watches | new watches | |
---|---|---|---|---|---|---|---|---|
store 1 | 20 | NaN | NaN | 30 | 15.0 | 45.0 | 35 | NaN |
store 2 | 15 | NaN | 50.0 | 5 | 2.0 | 7.0 | 10 | 10.0 |
store 3 | 20 | 4.0 | NaN | 30 | NaN | NaN | 35 | 35.0 |
A DataFrame also supports inserting a new column at any position with dataframe.insert(loc, label, data), where loc is the position, label is the new column's label, and data is its values. For example:
store_items.insert(4,'shoes',[8,5,0])
store_items
bikes | glass | glasses | pants | shoes | shirts | suits | watches | new watches | |
---|---|---|---|---|---|---|---|---|---|
store 1 | 20 | NaN | NaN | 30 | 8 | 15.0 | 45.0 | 35 | NaN |
store 2 | 15 | NaN | 50.0 | 5 | 5 | 2.0 | 7.0 | 10 | 10.0 |
store 3 | 20 | 4.0 | NaN | 30 | 0 | NaN | NaN | 35 | 35.0 |
Just as we can add rows and columns, we can delete them. The .pop() method deletes a column; the .drop() method can delete either rows or columns through its axis keyword.
store_items.pop('new watches')
store_items
bikes | glass | glasses | pants | shoes | shirts | suits | watches | |
---|---|---|---|---|---|---|---|---|
store 1 | 20 | NaN | NaN | 30 | 8 | 15.0 | 45.0 | 35 |
store 2 | 15 | NaN | 50.0 | 5 | 5 | 2.0 | 7.0 | 10 |
store 3 | 20 | 4.0 | NaN | 30 | 0 | NaN | NaN | 35 |
store_items = store_items.drop(['watches','shoes'],axis = 1)
store_items
bikes | glass | glasses | pants | shirts | suits | |
---|---|---|---|---|---|---|
store 1 | 20 | NaN | NaN | 30 | 15.0 | 45.0 |
store 2 | 15 | NaN | 50.0 | 5 | 2.0 | 7.0 |
store 3 | 20 | 4.0 | NaN | 30 | NaN | NaN |
store_items = store_items.drop(['store 2','store 1'],axis = 0)
store_items
bikes | glass | glasses | pants | shirts | suits | |
---|---|---|---|---|---|---|
store 3 | 20 | 4.0 | NaN | 30 | NaN | NaN |
Sometimes we need to change the row or column labels, which we can do with the .rename() method.
store_items = store_items.rename(index = {'store 3': 'last store'})
store_items
bikes | glass | glasses | pants | shirts | suits | |
---|---|---|---|---|---|---|
last store | 20 | 4.0 | NaN | 30 | NaN | NaN |
You can also make one of the columns the index.
store_items = store_items.set_index('pants')
store_items
bikes | glass | glasses | shirts | suits | |
---|---|---|---|---|---|
pants | |||||
30 | 20 | 4.0 | NaN | NaN | NaN |
Dealing with NaN
When working with large amounts of data, we usually have to clean it up first, and much of the bad data comes down to missing values. Pandas represents missing data as NaN; below we look at how to handle NaN values.
First, let's create a DataFrame that contains some NaN values.
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes':8, 'suits':45},
{'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5, 'shirts': 2, 'shoes':5, 'suits':7},
{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes':10}]
store_items = pd.DataFrame(items2, index = ['store 1', 'store 2', 'store 3'])
store_items
bikes | glasses | pants | shirts | shoes | suits | watches | |
---|---|---|---|---|---|---|---|
store 1 | 20 | NaN | 30 | 15.0 | 8 | 45.0 | 35 |
store 2 | 15 | 50.0 | 5 | 2.0 | 5 | 7.0 | 10 |
store 3 | 20 | 4.0 | 30 | NaN | 10 | NaN | 35 |
In this example we can see at a glance that there are three NaN values, but with a large dataset we could not count them by eye. Instead, we can chain a few methods to count them.
x = store_items.isnull()
print('Number of NaN values in our DataFrame:\n',x)
Number of NaN values in our DataFrame:
bikes glasses pants shirts shoes suits watches
store 1 False True False False False False False
store 2 False False False False False False False
store 3 False False False True False True False
x = store_items.isnull().sum()
print('Number of NaN values in our DataFrame:\n',x)
Number of NaN values in our DataFrame:
bikes 0
glasses 1
pants 0
shirts 1
shoes 0
suits 1
watches 0
dtype: int64
x = store_items.isnull().sum().sum()
print('Number of NaN values in our DataFrame:', x)
Number of NaN values in our DataFrame: 3
We can see that .isnull() returns a Boolean DataFrame in which True marks a NaN. Since pandas treats True as 1 and False as 0, we can count the NaN values with .sum().
Conversely, we can use .count() to see how many non-NaN values there are.
print('Number of non-NaN values in the columns of our DataFrame:\n', store_items.count())
Number of non-NaN values in the columns of our DataFrame:
bikes 3
glasses 2
pants 3
shirts 2
shoes 3
suits 2
watches 3
dtype: int64
For the NaN values we have two choices: delete them or replace them.
To delete them, we use .dropna(axis): axis = 0 drops every row containing NaN, while axis = 1 drops every column containing NaN.
store_items.dropna(axis = 1)
bikes | pants | shoes | watches | |
---|---|---|---|---|
store 1 | 20 | 30 | 8 | 35 |
store 2 | 15 | 5 | 5 | 10 |
store 3 | 20 | 30 | 10 | 35 |
Note that this operation leaves store_items itself unchanged. To modify the original DataFrame, use the keyword argument inplace = True.
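A minimal sketch of the difference, on a throwaway DataFrame (not the store data above):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 4.0]})

# Without inplace=True, dropna returns a new DataFrame; df keeps both columns.
dropped = df.dropna(axis=1)
print(df.columns.tolist())       # ['a', 'b']
print(dropped.columns.tolist())  # ['b']

# With inplace=True, df itself is modified.
df.dropna(axis=1, inplace=True)
print(df.columns.tolist())       # ['b']
```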
Alternatively, we can replace each NaN with a suitable value using .fillna(). For example, to replace every NaN with 0:
store_items.fillna(0)
bikes | glasses | pants | shirts | shoes | suits | watches | |
---|---|---|---|---|---|---|---|
store 1 | 20 | 0.0 | 30 | 15.0 | 8 | 45.0 | 35 |
store 2 | 15 | 50.0 | 5 | 2.0 | 5 | 7.0 | 10 |
store 3 | 20 | 4.0 | 30 | 0.0 | 10 | 0.0 | 35 |
We can also use .fillna() to replace each NaN with a neighboring value, using .fillna(method = 'ffill', axis). With method = 'ffill' (forward fill), each NaN is filled from the preceding value; with method = 'backfill', it is filled from the following value. The axis argument chooses whether to fill along rows or columns, just as before.
store_items.fillna(method = 'ffill',axis = 0)
bikes | glasses | pants | shirts | shoes | suits | watches | |
---|---|---|---|---|---|---|---|
store 1 | 20 | NaN | 30 | 15.0 | 8 | 45.0 | 35 |
store 2 | 15 | 50.0 | 5 | 2.0 | 5 | 7.0 | 10 |
store 3 | 20 | 4.0 | 30 | 2.0 | 10 | 7.0 | 35 |
store_items.fillna(method = 'ffill',axis=1)
bikes | glasses | pants | shirts | shoes | suits | watches | |
---|---|---|---|---|---|---|---|
store 1 | 20.0 | 20.0 | 30.0 | 15.0 | 8.0 | 45.0 | 35.0 |
store 2 | 15.0 | 50.0 | 5.0 | 2.0 | 5.0 | 7.0 | 10.0 |
store 3 | 20.0 | 4.0 | 30.0 | 30.0 | 10.0 | 10.0 | 35.0 |
store_items.fillna(method = 'backfill', axis = 0)
bikes | glasses | pants | shirts | shoes | suits | watches | |
---|---|---|---|---|---|---|---|
store 1 | 20 | 50.0 | 30 | 15.0 | 8 | 45.0 | 35 |
store 2 | 15 | 50.0 | 5 | 2.0 | 5 | 7.0 | 10 |
store 3 | 20 | 4.0 | 30 | NaN | 10 | NaN | 35 |
store_items.fillna(method = 'backfill', axis = 1)
bikes | glasses | pants | shirts | shoes | suits | watches | |
---|---|---|---|---|---|---|---|
store 1 | 20.0 | 30.0 | 30.0 | 15.0 | 8.0 | 45.0 | 35.0 |
store 2 | 15.0 | 50.0 | 5.0 | 2.0 | 5.0 | 7.0 | 10.0 |
store 3 | 20.0 | 4.0 | 30.0 | 10.0 | 10.0 | 35.0 | 35.0 |
Notice that if there is no value before (or after) the NaN in the chosen fill direction, it stays NaN. Also, .fillna() does not modify the original DataFrame; to change it in place, pass the keyword argument inplace = True.
We can also fill in suitable values by linear interpolation with the .interpolate(method = 'linear', axis) method. Again, a NaN with no preceding value stays NaN.
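A small self-contained sketch of linear interpolation on a throwaway Series:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan, 7.0])

# Each NaN is filled on a straight line between its known neighbors.
filled = s.interpolate(method='linear')
print(filled.tolist())   # [1.0, 2.0, 3.0, 5.0, 7.0]
```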
Loading Data
We often need to load data into a DataFrame from files in a variety of formats, the most popular of which is CSV. We can load a CSV file with the pd.read_csv() function.
Google_stock = pd.read_csv('./GooG.csv')
print('Google_stock is of type:',type(Google_stock))
print('Google_stock has shape:',Google_stock.shape)
Google_stock is of type: <class 'pandas.core.frame.DataFrame'>
Google_stock has shape: (3313, 7)
Google_stock
Date | Open | High | Low | Close | Adj Close | Volume | |
---|---|---|---|---|---|---|---|
0 | 2004-08-19 | 49.676899 | 51.693783 | 47.669952 | 49.845802 | 49.845802 | 44994500 |
1 | 2004-08-20 | 50.178635 | 54.187561 | 49.925285 | 53.805050 | 53.805050 | 23005800 |
2 | 2004-08-23 | 55.017166 | 56.373344 | 54.172661 | 54.346527 | 54.346527 | 18393200 |
3 | 2004-08-24 | 55.260582 | 55.439419 | 51.450363 | 52.096165 | 52.096165 | 15361800 |
4 | 2004-08-25 | 52.140873 | 53.651051 | 51.604362 | 52.657513 | 52.657513 | 9257400 |
5 | 2004-08-26 | 52.135906 | 53.626213 | 51.991844 | 53.606342 | 53.606342 | 7148200 |
6 | 2004-08-27 | 53.700729 | 53.959049 | 52.503513 | 52.732029 | 52.732029 | 6258300 |
7 | 2004-08-30 | 52.299839 | 52.404160 | 50.675404 | 50.675404 | 50.675404 | 5235700 |
8 | 2004-08-31 | 50.819469 | 51.519913 | 50.749920 | 50.854240 | 50.854240 | 4954800 |
9 | 2004-09-01 | 51.018177 | 51.152302 | 49.512966 | 49.801090 | 49.801090 | 9206800 |
10 | 2004-09-02 | 49.274517 | 50.854240 | 49.150326 | 50.427021 | 50.427021 | 15232100 |
11 | 2004-09-03 | 50.148830 | 50.541279 | 49.339096 | 49.681866 | 49.681866 | 5191000 |
12 | 2004-09-07 | 50.178635 | 50.670437 | 49.483158 | 50.461796 | 50.461796 | 5891300 |
13 | 2004-09-08 | 50.044510 | 51.182110 | 49.925285 | 50.819469 | 50.819469 | 5023000 |
14 | 2004-09-09 | 50.933723 | 51.023144 | 50.173668 | 50.824436 | 50.824436 | 4092100 |
15 | 2004-09-10 | 50.471729 | 52.935703 | 50.322701 | 52.324677 | 52.324677 | 8764200 |
16 | 2004-09-13 | 52.970478 | 53.854729 | 52.886028 | 53.402668 | 53.402668 | 7902900 |
17 | 2004-09-14 | 53.377831 | 55.638126 | 53.049961 | 55.384777 | 55.384777 | 10910200 |
18 | 2004-09-15 | 54.922779 | 56.745922 | 54.743942 | 55.638126 | 55.638126 | 10793500 |
19 | 2004-09-16 | 55.807030 | 57.525848 | 55.464260 | 56.616764 | 56.616764 | 9335800 |
20 | 2004-09-17 | 56.840309 | 58.365391 | 56.408119 | 58.365391 | 58.365391 | 9543500 |
21 | 2004-09-20 | 58.097134 | 60.407108 | 58.007717 | 59.294346 | 59.294346 | 10708500 |
22 | 2004-09-21 | 59.517895 | 59.820923 | 58.375324 | 58.539257 | 58.539257 | 7282900 |
23 | 2004-09-22 | 58.320679 | 59.448345 | 58.027588 | 58.807514 | 58.807514 | 7638000 |
24 | 2004-09-23 | 59.036026 | 60.918781 | 58.131908 | 60.019630 | 60.019630 | 8599600 |
25 | 2004-09-24 | 60.079243 | 61.649033 | 59.493053 | 59.527828 | 59.527828 | 9191900 |
26 | 2004-09-27 | 59.393700 | 60.049435 | 58.519386 | 58.747902 | 58.747902 | 7119100 |
27 | 2004-09-28 | 60.258080 | 63.288372 | 59.716602 | 63.020115 | 63.020115 | 17056100 |
28 | 2004-09-29 | 62.940632 | 67.073753 | 62.707150 | 65.116478 | 65.116478 | 30745600 |
29 | 2004-09-30 | 64.530296 | 65.722542 | 64.083199 | 64.381264 | 64.381264 | 13861300 |
... | ... | ... | ... | ... | ... | ... | ... |
3283 | 2017-09-01 | 941.130005 | 942.479980 | 935.150024 | 937.340027 | 937.340027 | 947400 |
3284 | 2017-09-05 | 933.080017 | 937.000000 | 921.960022 | 928.450012 | 928.450012 | 1326400 |
3285 | 2017-09-06 | 930.150024 | 930.914978 | 919.270020 | 927.809998 | 927.809998 | 1527700 |
3286 | 2017-09-07 | 931.729980 | 936.409973 | 923.619995 | 935.950012 | 935.950012 | 1212700 |
3287 | 2017-09-08 | 936.489990 | 936.989990 | 924.880005 | 926.500000 | 926.500000 | 1011500 |
3288 | 2017-09-11 | 934.250000 | 938.380005 | 926.919983 | 929.080017 | 929.080017 | 1267000 |
3289 | 2017-09-12 | 932.590027 | 933.479980 | 923.861023 | 932.070007 | 932.070007 | 1134400 |
3290 | 2017-09-13 | 930.659973 | 937.250000 | 929.859985 | 935.090027 | 935.090027 | 1102600 |
3291 | 2017-09-14 | 931.250000 | 932.770020 | 924.000000 | 925.109985 | 925.109985 | 1397600 |
3292 | 2017-09-15 | 924.659973 | 926.489990 | 916.359985 | 920.289978 | 920.289978 | 2505400 |
3293 | 2017-09-18 | 920.010010 | 922.080017 | 910.599976 | 915.000000 | 915.000000 | 1306900 |
3294 | 2017-09-19 | 917.419983 | 922.419983 | 912.549988 | 921.809998 | 921.809998 | 936700 |
3295 | 2017-09-20 | 922.979980 | 933.880005 | 922.000000 | 931.580017 | 931.580017 | 1669800 |
3296 | 2017-09-21 | 933.000000 | 936.530029 | 923.830017 | 932.450012 | 932.450012 | 1290600 |
3297 | 2017-09-22 | 927.750000 | 934.729980 | 926.479980 | 928.530029 | 928.530029 | 1052700 |
3298 | 2017-09-25 | 925.450012 | 926.400024 | 909.700012 | 920.969971 | 920.969971 | 1856800 |
3299 | 2017-09-26 | 923.719971 | 930.820007 | 921.140015 | 924.859985 | 924.859985 | 1666900 |
3300 | 2017-09-27 | 927.739990 | 949.900024 | 927.739990 | 944.489990 | 944.489990 | 2239400 |
3301 | 2017-09-28 | 941.359985 | 950.690002 | 940.549988 | 949.500000 | 949.500000 | 1020300 |
3302 | 2017-09-29 | 952.000000 | 959.786011 | 951.510010 | 959.109985 | 959.109985 | 1581000 |
3303 | 2017-10-02 | 959.979980 | 962.539978 | 947.840027 | 953.270020 | 953.270020 | 1283400 |
3304 | 2017-10-03 | 954.000000 | 958.000000 | 949.140015 | 957.789978 | 957.789978 | 888300 |
3305 | 2017-10-04 | 957.000000 | 960.390015 | 950.690002 | 951.679993 | 951.679993 | 952400 |
3306 | 2017-10-05 | 955.489990 | 970.909973 | 955.179993 | 969.960022 | 969.960022 | 1213800 |
3307 | 2017-10-06 | 966.700012 | 979.460022 | 963.359985 | 978.890015 | 978.890015 | 1173900 |
3308 | 2017-10-09 | 980.000000 | 985.424988 | 976.109985 | 977.000000 | 977.000000 | 891400 |
3309 | 2017-10-10 | 980.000000 | 981.570007 | 966.080017 | 972.599976 | 972.599976 | 968400 |
3310 | 2017-10-11 | 973.719971 | 990.710022 | 972.250000 | 989.250000 | 989.250000 | 1693300 |
3311 | 2017-10-12 | 987.450012 | 994.119995 | 985.000000 | 987.830017 | 987.830017 | 1262400 |
3312 | 2017-10-13 | 992.000000 | 997.210022 | 989.000000 | 989.679993 | 989.679993 | 1157700 |
3313 rows × 7 columns
Notice that pandas automatically added numerical row labels to the DataFrame and used the CSV header row as the column labels.
For a DataFrame too large to view at once, there are some handy methods such as .head() and .tail().
Google_stock.head(5)
Date | Open | High | Low | Close | Adj Close | Volume | |
---|---|---|---|---|---|---|---|
0 | 2004-08-19 | 49.676899 | 51.693783 | 47.669952 | 49.845802 | 49.845802 | 44994500 |
1 | 2004-08-20 | 50.178635 | 54.187561 | 49.925285 | 53.805050 | 53.805050 | 23005800 |
2 | 2004-08-23 | 55.017166 | 56.373344 | 54.172661 | 54.346527 | 54.346527 | 18393200 |
3 | 2004-08-24 | 55.260582 | 55.439419 | 51.450363 | 52.096165 | 52.096165 | 15361800 |
4 | 2004-08-25 | 52.140873 | 53.651051 | 51.604362 | 52.657513 | 52.657513 | 9257400 |
Google_stock.tail(5)
Date | Open | High | Low | Close | Adj Close | Volume | |
---|---|---|---|---|---|---|---|
3308 | 2017-10-09 | 980.000000 | 985.424988 | 976.109985 | 977.000000 | 977.000000 | 891400 |
3309 | 2017-10-10 | 980.000000 | 981.570007 | 966.080017 | 972.599976 | 972.599976 | 968400 |
3310 | 2017-10-11 | 973.719971 | 990.710022 | 972.250000 | 989.250000 | 989.250000 | 1693300 |
3311 | 2017-10-12 | 987.450012 | 994.119995 | 985.000000 | 987.830017 | 987.830017 | 1262400 |
3312 | 2017-10-13 | 992.000000 | 997.210022 | 989.000000 | 989.679993 | 989.679993 | 1157700 |
We can chain .isnull() and .any() to check whether any column contains NaN.
Google_stock.isnull().any()
Date False
Open False
High False
Low False
Close False
Adj Close False
Volume False
dtype: bool
We can use .describe() to get descriptive statistics for each column of the DataFrame.
Google_stock.describe()
Open | High | Low | Close | Adj Close | Volume | |
---|---|---|---|---|---|---|
count | 3313.000000 | 3313.000000 | 3313.000000 | 3313.000000 | 3313.000000 | 3.313000e+03 |
mean | 380.186092 | 383.493740 | 376.519309 | 380.072458 | 380.072458 | 8.038476e+06 |
std | 223.818650 | 224.974534 | 222.473232 | 223.853780 | 223.853780 | 8.399521e+06 |
min | 49.274517 | 50.541279 | 47.669952 | 49.681866 | 49.681866 | 7.900000e+03 |
25% | 226.556473 | 228.394516 | 224.003082 | 226.407440 | 226.407440 | 2.584900e+06 |
50% | 293.312286 | 295.433502 | 289.929291 | 293.029114 | 293.029114 | 5.281300e+06 |
75% | 536.650024 | 540.000000 | 532.409973 | 536.690002 | 536.690002 | 1.065370e+07 |
max | 992.000000 | 997.210022 | 989.000000 | 989.679993 | 989.679993 | 8.276810e+07 |
We can also restrict it to a single column.
Google_stock['Adj Close'].describe()
count 3313.000000
mean 380.072458
std 223.853780
min 49.681866
25% 226.407440
50% 293.029114
75% 536.690002
max 989.679993
Name: Adj Close, dtype: float64
Of course, we can also use any of the many statistical functions pandas provides.
print()
print('Maximum values of each column:\n', Google_stock.max())
print()
print('Minimum Close value:', Google_stock['Close'].min())
print()
print('Average value of each column:\n', Google_stock.mean())
Maximum values of each column:
Date 2017-10-13
Open 992
High 997.21
Low 989
Close 989.68
Adj Close 989.68
Volume 82768100
dtype: object
Minimum Close value: 49.681866
Average value of each column:
Open 3.801861e+02
High 3.834937e+02
Low 3.765193e+02
Close 3.800725e+02
Adj Close 3.800725e+02
Volume 8.038476e+06
dtype: float64
Another important measure is correlation; the .corr() function shows the correlation between different columns.
Google_stock.corr()
Open | High | Low | Close | Adj Close | Volume | |
---|---|---|---|---|---|---|
Open | 1.000000 | 0.999904 | 0.999845 | 0.999745 | 0.999745 | -0.564258 |
High | 0.999904 | 1.000000 | 0.999834 | 0.999868 | 0.999868 | -0.562749 |
Low | 0.999845 | 0.999834 | 1.000000 | 0.999899 | 0.999899 | -0.567007 |
Close | 0.999745 | 0.999868 | 0.999899 | 1.000000 | 1.000000 | -0.564967 |
Adj Close | 0.999745 | 0.999868 | 0.999899 | 1.000000 | 1.000000 | -0.564967 |
Volume | -0.564258 | -0.562749 | -0.567007 | -0.564967 | -0.564967 | 1.000000 |
A value close to 1 indicates a strong correlation, while a value close to 0 means the columns are essentially uncorrelated.
We can use the .groupby() method to group the data in different ways.
data = pd.read_csv('./fake_company.csv')
data
Year | Name | Department | Age | Salary | |
---|---|---|---|---|---|
0 | 1990 | Alice | HR | 25 | 50000 |
1 | 1990 | Bob | RD | 30 | 48000 |
2 | 1990 | Charlie | Admin | 45 | 55000 |
3 | 1991 | Alice | HR | 26 | 52000 |
4 | 1991 | Bob | RD | 31 | 50000 |
5 | 1991 | Charlie | Admin | 46 | 60000 |
6 | 1992 | Alice | Admin | 27 | 60000 |
7 | 1992 | Bob | RD | 32 | 52000 |
8 | 1992 | Charlie | Admin | 28 | 62000 |
We can use groupby() to compute how much the company spent on salaries each year.
data.groupby(['Year'])['Salary'].sum()
Year
1990 153000
1991 162000
1992 174000
Name: Salary, dtype: int64
If we want the average salary for each year:
data.groupby(['Year'])['Salary'].mean()
Year
1990 51000
1991 54000
1992 58000
Name: Salary, dtype: int64
And if we want to know how much salary went to each department in each year:
data.groupby(['Year','Department'])['Salary'].sum()
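The output of this last grouping is not shown above. Rebuilding the sample table from scratch (so the snippet runs on its own), the result comes out as:

```python
import pandas as pd

# Recreate the fake_company data from the table above.
data = pd.DataFrame({
    'Year': [1990, 1990, 1990, 1991, 1991, 1991, 1992, 1992, 1992],
    'Name': ['Alice', 'Bob', 'Charlie'] * 3,
    'Department': ['HR', 'RD', 'Admin', 'HR', 'RD', 'Admin', 'Admin', 'RD', 'Admin'],
    'Salary': [50000, 48000, 55000, 52000, 50000, 60000, 60000, 52000, 62000],
})

totals = data.groupby(['Year', 'Department'])['Salary'].sum()
print(totals)
# Year  Department
# 1990  Admin         55000
#       HR            50000
#       RD            48000
# 1991  Admin         60000
#       HR            52000
#       RD            50000
# 1992  Admin        122000
#       RD            52000
```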
And that's about it.