Pandas Notebook
由于使用emacs-org进行编辑,为方便暂且使用英文
Table of Contents
-
import numpy as np
import pandas as pd
Series
one-dimensional, labeled list, hold any datatype.
labels are referred to as index.
s = pd.Series(data, index = index)
where data can be: python dict, np ndarray, or scalar.
non-unique index is enabled.
create
from ndarray
index should be as long as ndarray,
or by default 0 ~ len(data) - 1.
>>> s = pd.Series(np.random.rand(5),
index = ['a', 'b', 'c', 'd', 'e'])
>>> s
a 0.845771
b 0.508507
c 0.964314
d 0.924016
e 0.687160
dtype: float64
>>> s1 = pd.Series(np.random.rand(5))
>>> s1
0 0.353030
1 0.419253
2 0.121847
3 0.631009
4 0.696326
dtype: float64
from dict
if index wasn’t passed, the key of dict would be used,
if index was passed, index will match the dict’s keys,
index is prior.
>>> d = {'a': 1, 'b': 2, 'v': 3}
>>> pd.Series(d)
a 1
b 2
v 3
dtype: int64
>>>
>>> pd.Series(d, index = ['a', 'b', 'c'])
a 1.0
b 2.0
c NaN
dtype: float64
notice: nan is a float number,
generally showing missing values,
shouldn’t be compared with
somevalue == np.nan, but use
somevalue is np.nan instead.
from scalar
index is must.
value’ll be the same.
>>> pd.Series(7, index = ['a', 'b', 'c', 'z'])
a 7
b 7
c 7
z 7
dtype: int64
manip
much like numpy and python dict
>>> s = pd.Series(np.random.rand(5),
index = ['a', 'b', 'c', 'd', 'e'])
>>> s
a 0.208771
b 0.456772
c 0.055980
d 0.481634
e 0.215942
dtype: float64
>>> s[0]
0.20877084185831607
>>> s[:3]
a 0.208771
b 0.456772
c 0.055980
dtype: float64
>>> s[s > s.median()]
b 0.456772
d 0.481634
dtype: float64
>>> s[[4, 3, 1]]
e 0.215942
d 0.481634
b 0.456772
dtype: float64
>>> s * 2
a 0.417542
b 0.913543
c 0.111960
d 0.963267
e 0.431885
dtype: float64
# notice that operations recognize the index
>>> s
a 0.208771
b 0.456772
c 0.055980
d 0.481634
e 0.215942
dtype: float64
>>> s[1:] + s[:-1]
a NaN
b 0.913543
c 0.111960
d 0.963267
e NaN
dtype: float64
>>> s['a']
0.20877084185831607
>>> s.get('a')
0.20877084185831607
>>> s.get('f')
>>> # <- this is 'none'
>>> s.get('f', np.nan)
nan
# get rid of the index?
>>> s.array
<PandasArray>
[ 0.20877084185831607, 0.4567715456417446, 0.055979995016902984,
0.4816336854012986, 0.21594244986110378]
Length: 5, dtype: float64
# get a genuine array?
>>> s.to_numpy()
array([0.20877084, 0.45677155, 0.05598 , 0.48163369, 0.21594245])
# name?
>>> s.name
>>> # <- this is 'none'
>>> s = s.rename('seriesA')
>>> s.name
'seriesA'
DataFrame
SQL table
data can be:
python dict of 1-D ndarrays, lists, dicts, or Series,
2-D ndarray, structured or record ndarray,
a Series, another DataFrame.
pass index*/*columns to force row/column labels.
create
from dict of Series or dicts
nested dicts will be firstly convert to pd.Series.
the created DataFrame’s index length is decided firstly by
the passed index variable, or by the longest data (Series).
the created DataFrame’s column width is decided firstly by
the passed columns variable, or by the number of data.
Series as data don’t have to follow length requirement,
while list or ndarray as data should be as long as the
DataFrame’s index.
>>> d = {'one': pd.Series(np.arange(1., 4.),
index = ['a', 'b', 'c']),
'two': pd.Series(np.arange(1., 5.),
index = ['a', 'b', 'c', 'd'])}
>>> df = pd.DataFrame(d)
>>> df
one two
a 1.0 1.0
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
# follow index and columns --hard
>>> pd.DataFrame(d, index = ['d', 'c', 'a'])
one two
d NaN 4.0
c 3.0 3.0
a 1.0 1.0
>>> pd.DataFrame(d, index = ['d', 'c', 'a'], columns = ['one', 'X'])
one X
d NaN NaN
c 3.0 NaN
a 1.0 NaN
>>> df.index
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> df.columns
Index(['one', 'two'], dtype='object')
from dict of ndarrays/lists
ndarrays and lists should be as long as the DataFrame’s index,
and as wide as the DataFrame’s columns.
from structured or record arrays
>>> arr = [[1, 2, 'asdf'], [11, 12, 'asd']]
>>> pd.DataFrame(arr, index = ['a', 'b'], columns = ['x', 'y', 'z'])
x y z
a 1 2 asdf
b 11 12 asd
from a list of dicts
>>> ls = [{'a': 1, 'b': 2}, {'c': 3, 'd': 4, 'e': 5}]
>>> pd.DataFrame(ls)
a b c d e
0 1.0 2.0 NaN NaN NaN
1 NaN NaN 3.0 4.0 5.0
>>> pd.DataFrame(ls, index = ['x', 'y'], columns = ['a', 'e'])
a e
x 1.0 NaN
y NaN 5.0
from a dict of tuples
>>> d = {('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2, ('B', 'A'): 5},
... ('b', 'c'): {('A', 'B'): 3, ('B', 'C'): 4}}
>>> pd.DataFrame(d)
a b
b c
A B 1.0 3.0
C 2.0 NaN
B A 5.0 NaN
C NaN 4.0
>>> d = {('a', 'b'): {('A', 'B'): 1, 'A': 2},
... 'b': {('A', 'B'): 3, ('B', 'C'): 4}}
>>> pd.DataFrame(d)
(a, b) b
(A, B) 1.0 3.0
A 2.0 NaN
(B, C) NaN 4.0
from a Series
the only columns is the Series’s name.
use np.nan to imply missing data
manip
like a dict
>>> df = pd.DataFrame(np.arange(1, 13).reshape(3, 4),
index = ['a', 'b', 'c'],
columns = ['one', 'two', 'three', 'four'])
>>> del df['one']
>>> df
two three four
a 2 3 4
b 6 7 8
c 10 11 12
>>> two = df.pop('two')
>>> two
a 2
b 6
c 10
Name: two, dtype: int32
>>> df['foo'] = 'bar'
>>> df
three four foo
a 3 4 bar
b 7 8 bar
c 11 12 bar
>>> df['three_trunc'] = df['three'][:2]
>>> df
three four foo three_trunc
a 3 4 bar 3.0
b 7 8 bar 7.0
c 11 12 bar NaN
# insert some columns but not at the end?
>>> df.insert(1, 'new', np.random.rand(3))
>>> df
three new four foo three_trunc
a 3 0.932232 4 bar 3.0
b 7 0.962772 8 bar 7.0
c 11 0.792010 12 bar NaN
# numerical
>>> df['flag'] = df['new'] > df['new'].median()
>>> df['mult'] = df['three'] * df['new']
>>> df
three new four foo three_trunc flag mult
a 3 0.932232 4 bar 3.0 False 2.796697
b 7 0.962772 8 bar 7.0 True 6.739403
c 11 0.792010 12 bar NaN False 8.712110
# transpose
>>> n = 'abcdef'
>>> df1 = pd.DataFrame(np.arange(30).reshape(5, 6), columns = [x for x in n])
>>> df1
a b c d e f
0 0 1 2 3 4 5
1 6 7 8 9 10 11
2 12 13 14 15 16 17
3 18 19 20 21 22 23
4 24 25 26 27 28 29
>>> df1.T
0 1 2 3 4
a 0 6 12 18 24
b 1 7 13 19 25
c 2 8 14 20 26
d 3 9 15 21 27
e 4 10 16 22 28
f 5 11 17 23 29
# turn to array
>>> df1.to_numpy()
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29]])
>>> np.asarray(df1)
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29]])
notice: when inserting a Series that isn’t as long as the DataFrame’s
index length, it will be forced to follow the index length.
but if insert a list or ndarray, make sure its length match the index.
indexing/selection
date as index/column
>>> index = pd.date_range('1/1/2020', periods = 10)
>>> index
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
'2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
'2020-01-09', '2020-01-10'],
dtype='datetime64[ns]', freq='D')
>>> df = pd.DataFrame(np.arange(30).reshape(10, 3), index = index, columns = index[:3])
>>> df
2020-01-01 2020-01-02 2020-01-03
2020-01-01 0 1 2
2020-01-02 3 4 5
2020-01-03 6 7 8
2020-01-04 9 10 11
2020-01-05 12 13 14
2020-01-06 15 16 17
2020-01-07 18 19 20
2020-01-08 21 22 23
2020-01-09 24 25 26
2020-01-10 27 28 29
show
>>> df1
a b c d e f
0 0 1 2 3 4 5
1 6 7 8 9 10 11
2 12 13 14 15 16 17
3 18 19 20 21 22 23
4 24 25 26 27 28 29
>>> df1.describe()
a b c d e f
count 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000
mean 12.000000 13.000000 14.000000 15.000000 16.000000 17.000000
std 9.486833 9.486833 9.486833 9.486833 9.486833 9.486833
min 0.000000 1.000000 2.000000 3.000000 4.000000 5.000000
25% 6.000000 7.000000 8.000000 9.000000 10.000000 11.000000
50% 12.000000 13.000000 14.000000 15.000000 16.000000 17.000000
75% 18.000000 19.000000 20.000000 21.000000 22.000000 23.000000
max 24.000000 25.000000 26.000000 27.000000 28.000000 29.000000
>>> df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
a 5 non-null int32
b 5 non-null int32
c 5 non-null int32
d 5 non-null int32
e 5 non-null int32
f 5 non-null int32
dtypes: int32(6)
memory usage: 248.0 bytes
>>> df1.head(2)
a b c d e f
0 0 1 2 3 4 5
1 6 7 8 9 10 11
>>> df1.tail(2)
a b c d e f
3 18 19 20 21 22 23
4 24 25 26 27 28 29
>>> df1.head()
a b c d e f
0 0 1 2 3 4 5
1 6 7 8 9 10 11
2 12 13 14 15 16 17
3 18 19 20 21 22 23
4 24 25 26 27 28 29
sort
# by index
>>> df1
a b c d e f
0 0 1 2 3 4 5
1 6 7 8 9 10 11
2 12 13 14 15 16 17
3 18 19 20 21 22 23
4 24 25 26 27 28 29
>>> df1.sort_index(axis = 1, ascending = False)
# axis: 0-> by index, 1-> by columns
f e d c b a
0 5 4 3 2 1 0
1 11 10 9 8 7 6
2 17 16 15 14 13 12
3 23 22 21 20 19 18
4 29 28 27 26 25 24
>>> df1.sort_values(by = 'f', ascending = False)
a b c d e f
4 24 25 26 27 28 29
3 18 19 20 21 22 23
2 12 13 14 15 16 17
1 6 7 8 9 10 11
0 0 1 2 3 4 5
IO
# csv
df.to_csv('./foo.csv')
df1 = pd.read_csv('./foo.csv')
# hdf5
df.to_hdf('./foo.h5', 'df')
df1 = pd.read_hdf('./foo.h5', 'df')
# excel
df.to_excel('./foo.xlsx', sheet_name = 'foooo')
df1 = pd.read_excel('./foo.xlsx', sheet_name = 'foooo')