Basic Usage of pandas

Installation

To install pandas, type conda install pandas at the command line. To check which version you have installed, type !conda list pandas in a Jupyter notebook, or conda list pandas at the command line; either will return the version number.

Pandas Series

A Pandas Series is a one-dimensional labeled array, somewhat like a dictionary. Unlike a NumPy array, a Series can hold data of different types, and each element can be given its own index label.

First, we need to import the pandas library.

import pandas as pd

Creating a Series

We can create one with pd.Series(data, index), where index is a sequence of index labels.

import pandas as pd

groceries = pd.Series(data = [30,6,'Yes','No'],index = ['eggs','apples','milk','bread'])

groceries
eggs       30
apples      6
milk      Yes
bread      No
dtype: object

Notice that the first column of the output shows the index, and the index is not 0 through 3 but the labels we set. The second column holds the data, which can be of mixed types.

Like NumPy, pandas provides many attributes that return information about a Series, such as .shape, .ndim, and .size.

print('Groceries has shape:', groceries.shape)
print('Groceries has dimension:', groceries.ndim)
print('Groceries has a total of', groceries.size, 'elements')
Groceries has shape: (4,)
Groceries has dimension: 1
Groceries has a total of 4 elements

A Pandas Series also lets you retrieve the index and the data separately.

print(groceries.values)
print(groceries.index)
[30 6 'Yes' 'No']
Index(['eggs', 'apples', 'milk', 'bread'], dtype='object')

A Pandas Series also supports the in operator for checking whether it contains a given label.

x = 'bananas' in groceries
y = 'bread' in groceries

print(x)
print(y)
False
True
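Note that in tests membership against the index labels, not against the stored values. A small sketch, reusing the groceries Series from above:

```python
import pandas as pd

groceries = pd.Series(data=[30, 6, 'Yes', 'No'],
                      index=['eggs', 'apples', 'milk', 'bread'])

# `in` checks the index, not the data
print(30 in groceries)         # False: 30 is a value, not a label
print(30 in groceries.values)  # True: check the data explicitly
```

To test for a value rather than a label, check against .values explicitly, as in the last line.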

Accessing, Modifying, and Deleting Elements

First, we can access data by index label, or by integer position (both positive and negative indices are supported). To remove any ambiguity, pandas provides two attributes, .loc and .iloc: .loc is specifically for label-based indexing, and .iloc is specifically for integer-position indexing.

print(groceries['eggs'])
print(groceries[['eggs','bread']])
print(groceries.loc[['eggs','apples']])
print()

print(groceries[[0,1]])
print(groceries[[-1]])
print(groceries.iloc[[0,1]])
30
eggs     30
bread    No
dtype: object
eggs      30
apples     6
dtype: object

eggs      30
apples     6
dtype: object
bread    No
dtype: object
eggs      30
apples     6
dtype: object

To modify a value, access it directly and assign a new value to it.

groceries['eggs'] = 2
print(groceries)
eggs        2
apples      6
milk      Yes
bread      No
dtype: object

We can delete data from a Series with the .drop() method, but by default it only returns a new Series with the given index removed; the original data is unchanged. To modify the original Series, pass the keyword argument inplace = True.

groceries = pd.Series(data = [30,6,'Yes','No'],index = ['eggs','apples','milk','bread'])
print(groceries.drop('apples'))
print(groceries)
eggs      30
milk     Yes
bread     No
dtype: object
eggs       30
apples      6
milk      Yes
bread      No
dtype: object
groceries.drop('apples',inplace = True)
print(groceries)
eggs      30
milk     Yes
bread     No
dtype: object

Arithmetic Operations

Like NumPy, a Pandas Series supports element-wise operations.

fruits = pd.Series(data=[10,6,3], index = ['apples','oranges','bananas'])
print(fruits)
print()

print(fruits + 2)
print(fruits - 2)
print(fruits * 2)
print(fruits / 2)
apples     10
oranges     6
bananas     3
dtype: int64

apples     12
oranges     8
bananas     5
dtype: int64
apples     8
oranges    4
bananas    1
dtype: int64
apples     20
oranges    12
bananas     6
dtype: int64
apples     5.0
oranges    3.0
bananas    1.5
dtype: float64

pandas also supports some of NumPy's mathematical functions, such as sqrt().

import numpy as np

print(fruits)

print(np.exp(fruits))
print(np.sqrt(fruits))
print(np.power(fruits,2))
apples     10
oranges     6
bananas     3
dtype: int64
apples     22026.465795
oranges      403.428793
bananas       20.085537
dtype: float64
apples     3.162278
oranges    2.449490
bananas    1.732051
dtype: float64
apples     100
oranges     36
bananas      9
dtype: int64

pandas also lets us apply operations to selected elements only.

print(fruits['bananas'] + 2)
print(fruits.iloc[0]-2)
print(fruits[['apples','oranges']]*2)
print(fruits[['apples','oranges']] / 2)
5
8
apples     20
oranges    12
dtype: int64
apples     5.0
oranges    3.0
dtype: float64

When a Series mixes several data types, arithmetic operations sometimes still work. For example:

groceries*2
eggs         60
milk     YesYes
bread      NoNo
dtype: object
groceries / 2
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-23-3d64b17810f9> in <module>
----> 1 groceries / 2

...

TypeError: unsupported operand type(s) for /: 'str' and 'int'

Pandas DataFrames

A Pandas DataFrame is a two-dimensional data structure with row labels and column labels, somewhat like a spreadsheet in Excel.

To build a DataFrame, first create a dictionary of Pandas Series, then pass that dictionary to the pd.DataFrame() function.

items = {'Bob': pd.Series(data =[245,25,55], index = ['bike','pants','watch']),
        'Alice' : pd.Series(data = [40,110,500,45], index = ['book','glasses','bike','pants'])}

print(type(items))
<class 'dict'>
shopping_carts = pd.DataFrame(items)
shopping_carts
           Bob  Alice
bike     245.0  500.0
book       NaN   40.0
glasses    NaN  110.0
pants     25.0   45.0
watch     55.0    NaN

Notice that the row labels in the first column appear in alphabetical order. Many positions also show NaN, which is how pandas indicates that there is no data for that particular row/column position.

If the Series used to create the DataFrame have no explicit index labels, pandas generates numeric row labels.

data = {'Bob': pd.Series([245,25,55]),'Alice' : pd.Series([40,110,500,45])}

df = pd.DataFrame(data)
df
     Bob  Alice
0  245.0     40
1   25.0    110
2   55.0    500
3    NaN     45

We can use the DataFrame's attributes to extract some information about it.

print(shopping_carts.shape)
print(shopping_carts.ndim)
print(shopping_carts.size)
print(shopping_carts.values)
print()
print(shopping_carts.index)
print(shopping_carts.columns)
(5, 2)
2
10
[[245. 500.]
 [ nan  40.]
 [ nan 110.]
 [ 25.  45.]
 [ 55.  nan]]

Index(['bike', 'book', 'glasses', 'pants', 'watch'], dtype='object')
Index(['Bob', 'Alice'], dtype='object')

Sometimes we don't want to put the whole dictionary into the DataFrame; we can restrict it with the columns and index keywords.

bob_shopping_cart = pd.DataFrame(items,columns=['Bob'])

bob_shopping_cart
       Bob
bike   245
pants   25
watch   55
alice_sel_shopping_cart = pd.DataFrame(items,index = ['glasses','bike'],columns = ['Alice'])

alice_sel_shopping_cart
         Alice
glasses    110
bike       500

We can also create a DataFrame from a dictionary of lists, but all the lists in the dictionary must have the same length.

data = {'Integers' : [1,2,3],
        'Floats' : [4.5, 8.2, 9.6]}

df = pd.DataFrame(data)

df
   Integers  Floats
0         1     4.5
1         2     8.2
2         3     9.6

Notice that the data dictionary above has no index labels; we can set them with the index keyword in pd.DataFrame.

data = {'Integers' : [1,2,3],
        'Floats' : [4.5, 8.2, 9.6]}

df = pd.DataFrame(data, index = ['label 1', 'label 2', 'label 3'])

df
         Integers  Floats
label 1         1     4.5
label 2         2     8.2
label 3         3     9.6

We can also create a DataFrame from a list of Python dictionaries. The usage is similar, and the index keyword works here too.

items2 = [{'bikes': 20, 'pants': 30, 'watches': 35}, 
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5}]

store_items = pd.DataFrame(items2, index = ['store 1', 'store 2'])

store_items
         bikes  glasses  pants  watches
store 1     20      NaN     30       35
store 2     15     50.0      5       10

Accessing Data

A DataFrame supports accessing its rows, columns, or individual elements via the row and column labels.

import pandas as pd
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35}, 
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5}]

store_items = pd.DataFrame(items2, index = ['store 1', 'store 2'])


print(store_items)
print()
print(store_items[['bikes']])
print()
print(store_items[['bikes','pants']])
print()
print(store_items.loc[['store 1']])
print()
print(store_items['bikes']['store 2'])
         bikes  glasses  pants  watches
store 1     20      NaN     30       35
store 2     15     50.0      5       10

         bikes
store 1     20
store 2     15

         bikes  pants
store 1     20     30
store 2     15      5

         bikes  glasses  pants  watches
store 1     20      NaN     30       35

15

If we want to access a single element, we must give the column label first and the row label second; otherwise the program raises an error.
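As a sketch of the two access orders (rebuilding the store_items frame from above): bracket chaining goes column first, then row, while .loc takes row first, then column, and is the unambiguous form.

```python
import pandas as pd

items2 = [{'bikes': 20, 'pants': 30, 'watches': 35},
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5}]
store_items = pd.DataFrame(items2, index=['store 1', 'store 2'])

# Bracket chaining: column label first, then row label
print(store_items['bikes']['store 2'])      # 15

# .loc is unambiguous: row label first, then column label
print(store_items.loc['store 2', 'bikes'])  # 15
```

In practice .loc (or .iloc) is preferred for single elements, since it makes the row/column order explicit.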

We can also modify a DataFrame, for example by adding a column. To add a shirts column to store_items, we can do this:

store_items['shirts'] = [15,2]
store_items
         bikes  glasses  pants  watches  shirts
store 1     20      NaN     30       35      15
store 2     15     50.0      5       10       2

We can also add a new column using arithmetic between existing columns.

store_items['suits'] = store_items['pants'] + store_items['shirts']

store_items
         bikes  glasses  pants  watches  shirts  suits
store 1     20      NaN     30       35      15     45
store 2     15     50.0      5       10       2      7

To add a new row, we first create a new DataFrame and then append it to the original one. Say you open a third store and want to add it:

new_items = [{'bikes':20,'pants':30, 'watches': 35,'glass': 4}]

new_store = pd.DataFrame(new_items,index = ['store 3'])
new_store
bikesglasspantswatches
store 32043035

Then we can use the append function to add this new store's information to the full store DataFrame.

store_items = store_items.append(new_store)

store_items
bikesglassglassespantsshirtssuitswatches
store 120NaNNaN3015.045.035
store 215NaN50.052.07.010
store 3204.0NaN30NaNNaN35

Notice that the columns end up in alphabetical order.

You can also create a new column from selected rows of an existing column. For example, we can create a new watches column like this:

store_items['new watches'] = store_items['watches'][1:]

store_items
         bikes  glass  glasses  pants  shirts  suits  watches  new watches
store 1     20    NaN      NaN     30    15.0   45.0       35          NaN
store 2     15    NaN     50.0      5     2.0    7.0       10         10.0
store 3     20    4.0      NaN     30     NaN    NaN       35         35.0

A DataFrame also supports inserting a new column at any position with dataframe.insert(loc, label, data), where loc is the position, label is the new column's label, and data is the data. For example:

store_items.insert(4,'shoes',[8,5,0])
store_items
         bikes  glass  glasses  pants  shoes  shirts  suits  watches  new watches
store 1     20    NaN      NaN     30      8    15.0   45.0       35          NaN
store 2     15    NaN     50.0      5      5     2.0    7.0       10         10.0
store 3     20    4.0      NaN     30      0     NaN    NaN       35         35.0

Just as we can add rows and columns, we can also delete them. The .pop() method deletes a column, and the .drop() method deletes rows or columns via the axis keyword.

store_items.pop('new watches')

store_items
         bikes  glass  glasses  pants  shoes  shirts  suits  watches
store 1     20    NaN      NaN     30      8    15.0   45.0       35
store 2     15    NaN     50.0      5      5     2.0    7.0       10
store 3     20    4.0      NaN     30      0     NaN    NaN       35
store_items = store_items.drop(['watches','shoes'],axis = 1)

store_items
         bikes  glass  glasses  pants  shirts  suits
store 1     20    NaN      NaN     30    15.0   45.0
store 2     15    NaN     50.0      5     2.0    7.0
store 3     20    4.0      NaN     30     NaN    NaN
store_items = store_items.drop(['store 2','store 1'],axis = 0)

store_items
         bikes  glass  glasses  pants  shirts  suits
store 3     20    4.0      NaN     30     NaN    NaN

Sometimes we need to change row or column labels; for that we can use the .rename() method.

store_items = store_items.rename(index = {'store 3': 'last store'})

store_items
            bikes  glass  glasses  pants  shirts  suits
last store     20    4.0      NaN     30     NaN    NaN

You can also set the index to one of the existing columns.

store_items = store_items.set_index('pants')

store_items
       bikes  glass  glasses  shirts  suits
pants
30        20    4.0      NaN     NaN    NaN

Handling NaN

When working with large amounts of data, we usually need to clean out problem values first. Most of them come from missing data, which pandas represents as NaN. Below we'll learn how to handle NaN values.

First, let's create a DataFrame containing NaN values.

items2 = [{'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes':8, 'suits':45},
{'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5, 'shirts': 2, 'shoes':5, 'suits':7},
{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes':10}]

store_items = pd.DataFrame(items2, index = ['store 1', 'store 2', 'store 3'])

store_items
         bikes  glasses  pants  shirts  shoes  suits  watches
store 1     20      NaN     30    15.0      8   45.0       35
store 2     15     50.0      5     2.0      5    7.0       10
store 3     20      4.0     30     NaN     10    NaN       35

In this example we can see there are three NaN values, but with a large dataset we couldn't tell at a glance how many there are. We can combine a few methods to count them.

x = store_items.isnull()

print('Number of NaN values in our DataFrame:\n',x)
Number of NaN values in our DataFrame:
          bikes  glasses  pants  shirts  shoes  suits  watches
store 1  False     True  False   False  False  False    False
store 2  False    False  False   False  False  False    False
store 3  False    False  False    True  False   True    False
x = store_items.isnull().sum()

print('Number of NaN values in our DataFrame:\n',x)
Number of NaN values in our DataFrame:
 bikes      0
glasses    1
pants      0
shirts     1
shoes      0
suits      1
watches    0
dtype: int64
x =  store_items.isnull().sum().sum()

print('Number of NaN values in our DataFrame:', x)
Number of NaN values in our DataFrame: 3

We can see that .isnull() returns a Boolean DataFrame in which True marks a NaN. In pandas, True counts as 1 and False as 0, so we can count the NaN values with .sum().

Of course, we can use .count() to find out how many non-NaN values there are.

print('Number of non-NaN values in the columns of our DataFrame:\n', store_items.count())
Number of non-NaN values in the columns of our DataFrame:
 bikes      3
glasses    2
pants      3
shirts     2
shoes      3
suits      2
watches    3
dtype: int64

For the NaN values we have two options: delete them or replace them.

To delete them, we can use .dropna(axis): with axis = 0 it drops rows containing NaN, and with axis = 1 it drops columns containing NaN.

store_items.dropna(axis = 1)
         bikes  pants  shoes  watches
store 1     20     30      8       35
store 2     15      5      5       10
store 3     20     30     10       35

Note that this operation leaves store_items itself unchanged. To modify the original DataFrame, use the keyword inplace=True.
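A minimal sketch of the in-place behavior, rebuilding the same store_items frame:

```python
import pandas as pd

items2 = [{'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
          {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10}]
store_items = pd.DataFrame(items2, index=['store 1', 'store 2', 'store 3'])

# Without inplace=True, dropna returns a new frame; the original is untouched
store_items.dropna(axis=1)
print(store_items.shape)                  # still (3, 7)

# With inplace=True, the columns containing NaN are really removed
store_items.dropna(axis=1, inplace=True)
print(store_items.shape)                  # (3, 4)
```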

We can also choose to replace NaN with suitable values using .fillna(). For example, to replace every NaN with 0:

store_items.fillna(0)
         bikes  glasses  pants  shirts  shoes  suits  watches
store 1     20      0.0     30    15.0      8   45.0       35
store 2     15     50.0      5     2.0      5    7.0       10
store 3     20      4.0     30     0.0     10    0.0       35

We can also use .fillna() to replace NaN values with neighboring values, via .fillna(method = 'ffill', axis). With method 'ffill', NaN is filled forward using the preceding value; with method 'backfill', it is filled using the following value. The axis argument selects whether filling runs along rows or columns, as before.

store_items.fillna(method = 'ffill',axis = 0)
         bikes  glasses  pants  shirts  shoes  suits  watches
store 1     20      NaN     30    15.0      8   45.0       35
store 2     15     50.0      5     2.0      5    7.0       10
store 3     20      4.0     30     2.0     10    7.0       35
store_items.fillna(method = 'ffill',axis=1)
         bikes  glasses  pants  shirts  shoes  suits  watches
store 1   20.0     20.0   30.0    15.0    8.0   45.0     35.0
store 2   15.0     50.0    5.0     2.0    5.0    7.0     10.0
store 3   20.0      4.0   30.0    30.0   10.0   10.0     35.0
store_items.fillna(method = 'backfill', axis = 0)
         bikes  glasses  pants  shirts  shoes  suits  watches
store 1     20     50.0     30    15.0      8   45.0       35
store 2     15     50.0      5     2.0      5    7.0       10
store 3     20      4.0     30     NaN     10    NaN       35
store_items.fillna(method = 'backfill', axis = 1)
         bikes  glasses  pants  shirts  shoes  suits  watches
store 1   20.0     30.0   30.0    15.0    8.0   45.0     35.0
store 2   15.0     50.0    5.0     2.0    5.0    7.0     10.0
store 3   20.0      4.0   30.0    10.0   10.0   35.0     35.0

Notice that if the chosen fill direction has no value before (or after) a NaN to copy from, that NaN is kept. Also, .fillna() does not modify the original DataFrame; to change it in place, use the keyword inplace = True.
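Both caveats can be sketched as follows, reusing the store_items frame from above (.ffill() is the shorthand method for forward filling):

```python
import pandas as pd
import numpy as np

items2 = [{'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
          {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10}]
store_items = pd.DataFrame(items2, index=['store 1', 'store 2', 'store 3'])

# Forward fill down the rows; 'glasses' in store 1 has no row above it
filled = store_items.ffill(axis=0)
print(np.isnan(filled.loc['store 1', 'glasses']))  # True: stays NaN
print(store_items.isnull().sum().sum())            # 3: original unchanged

# inplace=True modifies the original frame
store_items.fillna(0, inplace=True)
print(store_items.isnull().sum().sum())            # 0
```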

We can also use the .interpolate(method = 'linear', axis) method to fill in suitable values by linear interpolation. Again, if there is no value before the NaN, it remains NaN.
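A minimal sketch of linear interpolation on a Series (the values here are made up for illustration):

```python
import pandas as pd
import numpy as np

s = pd.Series([np.nan, 1.0, np.nan, 3.0])

# Interior gaps are filled linearly from their neighbors;
# the leading NaN stays NaN since there is no earlier value
filled = s.interpolate(method='linear')
print(filled)
```

Here the NaN between 1.0 and 3.0 becomes 2.0, while the leading NaN is left untouched.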

Loading Data

We often need to load data from files produced by all kinds of sources. pandas can load data in many formats, the most popular being CSV. We can use the pd.read_csv() function to load a CSV file.

Google_stock = pd.read_csv('./GooG.csv')

print('Google_stock is of type:',type(Google_stock))
print('Google_stock has shape:',Google_stock.shape)
Google_stock is of type: <class 'pandas.core.frame.DataFrame'>
Google_stock has shape: (3313, 7)
Google_stock
            Date        Open        High         Low       Close   Adj Close    Volume
0     2004-08-19   49.676899   51.693783   47.669952   49.845802   49.845802  44994500
1     2004-08-20   50.178635   54.187561   49.925285   53.805050   53.805050  23005800
2     2004-08-23   55.017166   56.373344   54.172661   54.346527   54.346527  18393200
3     2004-08-24   55.260582   55.439419   51.450363   52.096165   52.096165  15361800
4     2004-08-25   52.140873   53.651051   51.604362   52.657513   52.657513   9257400
...          ...         ...         ...         ...         ...         ...       ...
3308  2017-10-09  980.000000  985.424988  976.109985  977.000000  977.000000    891400
3309  2017-10-10  980.000000  981.570007  966.080017  972.599976  972.599976    968400
3310  2017-10-11  973.719971  990.710022  972.250000  989.250000  989.250000   1693300
3311  2017-10-12  987.450012  994.119995  985.000000  987.830017  987.830017   1262400
3312  2017-10-13  992.000000  997.210022  989.000000  989.679993  989.679993   1157700

3313 rows × 7 columns

Notice that pandas automatically added numeric row labels to the DataFrame and used the column headers from the CSV as the column labels.

For a DataFrame too large to display, some handy functions let us peek at it, such as .head() and .tail().

Google_stock.head(5)
         Date       Open       High        Low      Close  Adj Close    Volume
0  2004-08-19  49.676899  51.693783  47.669952  49.845802  49.845802  44994500
1  2004-08-20  50.178635  54.187561  49.925285  53.805050  53.805050  23005800
2  2004-08-23  55.017166  56.373344  54.172661  54.346527  54.346527  18393200
3  2004-08-24  55.260582  55.439419  51.450363  52.096165  52.096165  15361800
4  2004-08-25  52.140873  53.651051  51.604362  52.657513  52.657513   9257400
Google_stock.tail(5)
            Date        Open        High         Low       Close   Adj Close   Volume
3308  2017-10-09  980.000000  985.424988  976.109985  977.000000  977.000000   891400
3309  2017-10-10  980.000000  981.570007  966.080017  972.599976  972.599976   968400
3310  2017-10-11  973.719971  990.710022  972.250000  989.250000  989.250000  1693300
3311  2017-10-12  987.450012  994.119995  985.000000  987.830017  987.830017  1262400
3312  2017-10-13  992.000000  997.210022  989.000000  989.679993  989.679993  1157700

We can combine the .isnull() and .any() methods to check whether any column contains NaN.

Google_stock.isnull().any()
Date         False
Open         False
High         False
Low          False
Close        False
Adj Close    False
Volume       False
dtype: bool

We can use .describe() to get descriptive statistics for each column of the DataFrame.

Google_stock.describe()
              Open         High          Low        Close    Adj Close        Volume
count  3313.000000  3313.000000  3313.000000  3313.000000  3313.000000  3.313000e+03
mean    380.186092   383.493740   376.519309   380.072458   380.072458  8.038476e+06
std     223.818650   224.974534   222.473232   223.853780   223.853780  8.399521e+06
min      49.274517    50.541279    47.669952    49.681866    49.681866  7.900000e+03
25%     226.556473   228.394516   224.003082   226.407440   226.407440  2.584900e+06
50%     293.312286   295.433502   289.929291   293.029114   293.029114  5.281300e+06
75%     536.650024   540.000000   532.409973   536.690002   536.690002  1.065370e+07
max     992.000000   997.210022   989.000000   989.679993   989.679993  8.276810e+07

We can also restrict it to a single column.

Google_stock['Adj Close'].describe()
count    3313.000000
mean      380.072458
std       223.853780
min        49.681866
25%       226.407440
50%       293.029114
75%       536.690002
max       989.679993
Name: Adj Close, dtype: float64

Of course, we can also use the many statistical functions pandas provides.

print()
print('Maximum values of each column:\n', Google_stock.max())
print()
print('Minimum Close value:', Google_stock['Close'].min())
print()
print('Average value of each column:\n', Google_stock.mean())
Maximum values of each column:
 Date         2017-10-13
Open                992
High             997.21
Low                 989
Close            989.68
Adj Close        989.68
Volume         82768100
dtype: object

Minimum Close value: 49.681866

Average value of each column:
 Open         3.801861e+02
High         3.834937e+02
Low          3.765193e+02
Close        3.800725e+02
Adj Close    3.800725e+02
Volume       8.038476e+06
dtype: float64

Another important statistic is correlation; we can use the .corr() function to look at the correlation between different columns.

Google_stock.corr()
               Open      High       Low     Close  Adj Close    Volume
Open       1.000000  0.999904  0.999845  0.999745   0.999745 -0.564258
High       0.999904  1.000000  0.999834  0.999868   0.999868 -0.562749
Low        0.999845  0.999834  1.000000  0.999899   0.999899 -0.567007
Close      0.999745  0.999868  0.999899  1.000000   1.000000 -0.564967
Adj Close  0.999745  0.999868  0.999899  1.000000   1.000000 -0.564967
Volume    -0.564258 -0.562749 -0.567007 -0.564967  -0.564967  1.000000

A value of 1 indicates strong correlation, while 0 indicates the data are not correlated at all.
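As a quick sketch with made-up numbers: a column that is an exact linear function of another has correlation 1.0, and negating it flips the sign to -1.0.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4],
                   'y': [2, 4, 6, 8],       # y = 2x: perfectly correlated
                   'z': [-1, -2, -3, -4]})  # z = -x: perfectly anti-correlated

print(df.corr())
print(df['x'].corr(df['y']))  # ~ 1.0
print(df['x'].corr(df['z']))  # ~ -1.0
```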

We can use the .groupby() function to group the data in different ways.

data = pd.read_csv('./fake_company.csv')

data
   Year     Name Department  Age  Salary
0  1990    Alice         HR   25   50000
1  1990      Bob         RD   30   48000
2  1990  Charlie      Admin   45   55000
3  1991    Alice         HR   26   52000
4  1991      Bob         RD   31   50000
5  1991  Charlie      Admin   46   60000
6  1992    Alice      Admin   27   60000
7  1992      Bob         RD   32   52000
8  1992  Charlie      Admin   28   62000

We can use the groupby() method to compute how much the company spends on salaries each year.

data.groupby(['Year'])['Salary'].sum()
Year
1990    153000
1991    162000
1992    174000
Name: Salary, dtype: int64

If we want the average salary for each year:

data.groupby(['Year'])['Salary'].mean()
Year
1990    51000
1991    54000
1992    58000
Name: Salary, dtype: int64

And if we want to know how much salary each department pays out each year:

data.groupby(['Year','Department'])['Salary'].sum()
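This returns a Series with a two-level index of (Year, Department). A self-contained sketch, rebuilding the same small dataset inline rather than reading the CSV:

```python
import pandas as pd

# Same toy data as fake_company.csv above, built inline
data = pd.DataFrame({
    'Year': [1990, 1990, 1990, 1991, 1991, 1991, 1992, 1992, 1992],
    'Name': ['Alice', 'Bob', 'Charlie'] * 3,
    'Department': ['HR', 'RD', 'Admin', 'HR', 'RD', 'Admin', 'Admin', 'RD', 'Admin'],
    'Age': [25, 30, 45, 26, 31, 46, 27, 32, 28],
    'Salary': [50000, 48000, 55000, 52000, 50000, 60000, 60000, 52000, 62000]})

# Group by two keys: one row per (Year, Department) pair
salary = data.groupby(['Year', 'Department'])['Salary'].sum()
print(salary)
# e.g. salary[(1992, 'Admin')] == 122000 (Alice 60000 + Charlie 62000)
```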

That's about it.
