Pandas笔记

Pandas Notebook

由于使用emacs-org进行编辑,为方便暂且使用英文

Table of Contents

  1. Series

    1. create
      1. from ndarray
      2. from dict
      3. from scalar
    2. manip
  2. DataFrame

    1. create
      1. from dict of Series or dicts
      2. from dict of ndarrays/lists
      3. from structured or record arrays
      4. from a list of dicts
      5. from a dict of tuples
      6. from a Series
      7. use np.nan to imply missing data
    2. manip
    3. indexing/selection
      1. date as index/column
    4. show
    5. sort
  3. IO

    import numpy as np
    import pandas as pd

Series

one-dimensional, labeled list, hold any datatype.

labels are referred to as index.

s = pd.Series(data, index = index)

where data can be: python dict, np ndarray, or scalar.

non-unique index is enabled.

create

from ndarray

index should be as long as ndarray,

or by default 0 ~ len(data) - 1.

>>> s = pd.Series(np.random.rand(5), 
              index = ['a', 'b', 'c', 'd', 'e'])
>>> s
a    0.845771
b    0.508507
c    0.964314
d    0.924016
e    0.687160
dtype: float64

>>> s1 = pd.Series(np.random.rand(5))
>>> s1
0    0.353030
1    0.419253
2    0.121847
3    0.631009
4    0.696326
dtype: float64

from dict

if index wasn’t passed, the key of dict would be used,

if index was passed, index will match the dict’s keys,
index is prior.

>>> d = {'a': 1, 'b': 2, 'v': 3}
>>> pd.Series(d)
a    1
b    2
v    3
dtype: int64
>>>
>>> pd.Series(d, index = ['a', 'b', 'c'])
a    1.0
b    2.0
c    NaN
dtype: float64

notice: nan is a float number,
generally showing missing values,
shouldn’t be compared with
somevalue == np.nan, but use
somevalue is np.nan instead.

from scalar

index is must.

value’ll be the same.

>>> pd.Series(7, index = ['a', 'b', 'c', 'z'])
a    7
b    7
c    7
z    7
dtype: int64

manip

much like numpy and python dict

>>> s = pd.Series(np.random.rand(5), 
                  index = ['a', 'b', 'c', 'd', 'e'])
>>> s
a    0.208771
b    0.456772
c    0.055980
d    0.481634
e    0.215942
dtype: float64

>>> s[0]
0.20877084185831607

>>> s[:3]
a    0.208771
b    0.456772
c    0.055980
dtype: float64

>>> s[s > s.median()]
b    0.456772
d    0.481634
dtype: float64

>>> s[[4, 3, 1]]
e    0.215942
d    0.481634
b    0.456772
dtype: float64

>>> s * 2
a    0.417542
b    0.913543
c    0.111960
d    0.963267
e    0.431885
dtype: float64

# notice that operations recognize the index
>>> s
a    0.208771
b    0.456772
c    0.055980
d    0.481634
e    0.215942
dtype: float64
>>> s[1:] + s[:-1]
a         NaN
b    0.913543
c    0.111960
d    0.963267
e         NaN
dtype: float64

>>> s['a']
0.20877084185831607

>>> s.get('a')
0.20877084185831607

>>> s.get('f')
>>> # <- this is 'none'

>>> s.get('f', np.nan)
nan

# get rid of the index?
>>> s.array
<PandasArray>
[ 0.20877084185831607,   0.4567715456417446, 0.055979995016902984,
   0.4816336854012986,  0.21594244986110378]
Length: 5, dtype: float64

# get a genuine array?
>>> s.to_numpy()
array([0.20877084, 0.45677155, 0.05598   , 0.48163369, 0.21594245])

# name? 
>>> s.name
>>> # <- this is 'none'

>>> s = s.rename('seriesA')
>>> s.name
'seriesA'

DataFrame

SQL table

data can be:
python dict of 1-D ndarrays, lists, dicts, or Series,
2-D ndarray, structured or record ndarray,
a Series, another DataFrame.

pass index*/*columns to force row/column labels.

create

from dict of Series or dicts

nested dicts will be firstly convert to pd.Series.

the created DataFrame’s index length is decided firstly by
the passed index variable, or by the longest data (Series).

the created DataFrame’s column width is decided firstly by
the passed columns variable, or by the number of data.

Series as data don’t have to follow length requirement,

while list or ndarray as data should be as long as the
DataFrame’s index.

>>> d = {'one': pd.Series(np.arange(1., 4.), 
                          index = ['a', 'b', 'c']), 
         'two': pd.Series(np.arange(1., 5.), 
                          index = ['a', 'b', 'c', 'd'])}
>>> df = pd.DataFrame(d)
>>> df
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

# follow index and columns --hard
>>> pd.DataFrame(d, index = ['d', 'c', 'a'])
   one  two
d  NaN  4.0
c  3.0  3.0
a  1.0  1.0

>>> pd.DataFrame(d, index = ['d', 'c', 'a'], columns = ['one', 'X'])
   one    X
d  NaN  NaN
c  3.0  NaN
a  1.0  NaN

>>> df.index
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> df.columns
Index(['one', 'two'], dtype='object')

from dict of ndarrays/lists

ndarrays and lists should be as long as the DataFrame’s index,
and as wide as the DataFrame’s columns.

from structured or record arrays

>>> arr = [[1, 2, 'asdf'], [11, 12, 'asd']]

>>> pd.DataFrame(arr, index = ['a', 'b'], columns = ['x', 'y', 'z'])
    x   y     z
a   1   2  asdf
b  11  12   asd

from a list of dicts

>>> ls = [{'a': 1, 'b': 2}, {'c': 3, 'd': 4, 'e': 5}]

>>> pd.DataFrame(ls)
     a    b    c    d    e
0  1.0  2.0  NaN  NaN  NaN
1  NaN  NaN  3.0  4.0  5.0

>>> pd.DataFrame(ls, index = ['x', 'y'], columns = ['a', 'e'])
     a    e
x  1.0  NaN
y  NaN  5.0

from a dict of tuples

>>> d = {('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2, ('B', 'A'): 5},
...      ('b', 'c'): {('A', 'B'): 3, ('B', 'C'): 4}}

>>> pd.DataFrame(d)
       a    b
       b    c
A B  1.0  3.0
  C  2.0  NaN
B A  5.0  NaN
  C  NaN  4.0

>>> d = {('a', 'b'): {('A', 'B'): 1, 'A': 2},
...      'b': {('A', 'B'): 3, ('B', 'C'): 4}}

>>> pd.DataFrame(d)
        (a, b)    b
(A, B)     1.0  3.0
A          2.0  NaN
(B, C)     NaN  4.0

from a Series

the only columns is the Series’s name.

use np.nan to imply missing data

manip

like a dict

>>> df = pd.DataFrame(np.arange(1, 13).reshape(3, 4), 
                      index = ['a', 'b', 'c'], 
                      columns = ['one', 'two', 'three', 'four'])
>>> del df['one']

>>> df
   two  three  four
a    2      3     4
b    6      7     8
c   10     11    12

>>> two = df.pop('two')

>>> two
a     2
b     6
c    10
Name: two, dtype: int32

>>> df['foo'] = 'bar'

>>> df
   three  four  foo
a      3     4  bar
b      7     8  bar
c     11    12  bar

>>> df['three_trunc'] = df['three'][:2]

>>> df
   three  four  foo  three_trunc
a      3     4  bar          3.0
b      7     8  bar          7.0
c     11    12  bar          NaN

# insert some columns but not at the end?
>>> df.insert(1, 'new', np.random.rand(3))

>>> df
   three       new  four  foo  three_trunc
a      3  0.932232     4  bar          3.0
b      7  0.962772     8  bar          7.0
c     11  0.792010    12  bar          NaN

# numerical
>>> df['flag'] = df['new'] > df['new'].median()
>>> df['mult'] = df['three'] * df['new']

>>> df
   three       new  four  foo  three_trunc   flag      mult
a      3  0.932232     4  bar          3.0  False  2.796697
b      7  0.962772     8  bar          7.0   True  6.739403
c     11  0.792010    12  bar          NaN  False  8.712110

# transpose
>>> n = 'abcdef'
>>> df1 = pd.DataFrame(np.arange(30).reshape(5, 6), columns = [x for x in n])

>>> df1
    a   b   c   d   e   f
0   0   1   2   3   4   5
1   6   7   8   9  10  11
2  12  13  14  15  16  17
3  18  19  20  21  22  23
4  24  25  26  27  28  29

>>> df1.T
   0   1   2   3   4
a  0   6  12  18  24
b  1   7  13  19  25
c  2   8  14  20  26
d  3   9  15  21  27
e  4  10  16  22  28
f  5  11  17  23  29

# turn to array
>>> df1.to_numpy()
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29]])

>>> np.asarray(df1)
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29]])

notice: when inserting a Series that isn’t as long as the DataFrame’s
index length, it will be forced to follow the index length.

but if insert a list or ndarray, make sure its length match the index.

indexing/selection

operation syntax result select column df[col] Series select row by label df.loc[label] Series select row by integer location df.iloc[loc] Series slice rows df[5:10] DataFrame select rows by boolean vector df[bool\_vec] DataFrame

date as index/column

>>> index = pd.date_range('1/1/2020', periods = 10)

>>> index
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10'],
              dtype='datetime64[ns]', freq='D')

>>> df = pd.DataFrame(np.arange(30).reshape(10, 3), index = index, columns = index[:3])
>>> df
            2020-01-01  2020-01-02  2020-01-03
2020-01-01           0           1           2
2020-01-02           3           4           5
2020-01-03           6           7           8
2020-01-04           9          10          11
2020-01-05          12          13          14
2020-01-06          15          16          17
2020-01-07          18          19          20
2020-01-08          21          22          23
2020-01-09          24          25          26
2020-01-10          27          28          29

show

>>> df1
    a   b   c   d   e   f
0   0   1   2   3   4   5
1   6   7   8   9  10  11
2  12  13  14  15  16  17
3  18  19  20  21  22  23
4  24  25  26  27  28  29

>>> df1.describe()
               a          b          c          d          e          f
count   5.000000   5.000000   5.000000   5.000000   5.000000   5.000000
mean   12.000000  13.000000  14.000000  15.000000  16.000000  17.000000
std     9.486833   9.486833   9.486833   9.486833   9.486833   9.486833
min     0.000000   1.000000   2.000000   3.000000   4.000000   5.000000
25%     6.000000   7.000000   8.000000   9.000000  10.000000  11.000000
50%    12.000000  13.000000  14.000000  15.000000  16.000000  17.000000
75%    18.000000  19.000000  20.000000  21.000000  22.000000  23.000000
max    24.000000  25.000000  26.000000  27.000000  28.000000  29.000000

>>> df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
a    5 non-null int32
b    5 non-null int32
c    5 non-null int32
d    5 non-null int32
e    5 non-null int32
f    5 non-null int32
dtypes: int32(6)
memory usage: 248.0 bytes

>>> df1.head(2)
   a  b  c  d   e   f
0  0  1  2  3   4   5
1  6  7  8  9  10  11
>>> df1.tail(2)
    a   b   c   d   e   f
3  18  19  20  21  22  23
4  24  25  26  27  28  29
>>> df1.head()
    a   b   c   d   e   f
0   0   1   2   3   4   5
1   6   7   8   9  10  11
2  12  13  14  15  16  17
3  18  19  20  21  22  23
4  24  25  26  27  28  29

sort

# by index
>>> df1
    a   b   c   d   e   f
0   0   1   2   3   4   5
1   6   7   8   9  10  11
2  12  13  14  15  16  17
3  18  19  20  21  22  23
4  24  25  26  27  28  29

>>> df1.sort_index(axis = 1, ascending = False)
# axis: 0-> by index, 1-> by columns
    f   e   d   c   b   a
0   5   4   3   2   1   0
1  11  10   9   8   7   6
2  17  16  15  14  13  12
3  23  22  21  20  19  18
4  29  28  27  26  25  24

>>> df1.sort_values(by = 'f', ascending = False)
    a   b   c   d   e   f
4  24  25  26  27  28  29
3  18  19  20  21  22  23
2  12  13  14  15  16  17
1   6   7   8   9  10  11
0   0   1   2   3   4   5

IO

# csv
df.to_csv('./foo.csv')
df1 = pd.read_csv('./foo.csv')

# hdf5
df.to_hdf('./foo.h5', 'df')
df1 = pd.read_hdf('./foo.h5', 'df')

# excel
df.to_excel('./foo.xlsx', sheet_name = 'foooo')
df1 = pd.read_excel('./foo.xlsx', sheet_name = 'foooo')
1、资源项目源码均已通过严格测试验证,保证能够正常运行; 2、项目问题、技术讨论,可以给博主私信或留言,博主看到后会第一时间与您进行沟通; 3、本项目比较适合计算机领域相关的毕业设计课题、课程作业等使用,尤其对于人工智能、计算机科学与技术等相关专业,更为适合; 4、下载使用后,可先查看rEADME.md或论文文件(如有),本项目仅用作交流学习参考,请切勿用于商业用途。 5、资源来自互联网采集,如有侵权,私聊博主删除。 6、可私信博主看论文后选择购买源代码。 1、资源项目源码均已通过严格测试验证,保证能够正常运行; 2、项目问题、技术讨论,可以给博主私信或留言,博主看到后会第一时间与您进行沟通; 3、本项目比较适合计算机领域相关的毕业设计课题、课程作业等使用,尤其对于人工智能、计算机科学与技术等相关专业,更为适合; 4、下载使用后,可先查看rEADME.md或论文文件(如有),本项目仅用作交流学习参考,请切勿用于商业用途。 5、资源来自互联网采集,如有侵权,私聊博主删除。 6、可私信博主看论文后选择购买源代码。 1、资源项目源码均已通过严格测试验证,保证能够正常运行; 2、项目问题、技术讨论,可以给博主私信或留言,博主看到后会第一时间与您进行沟通; 3、本项目比较适合计算机领域相关的毕业设计课题、课程作业等使用,尤其对于人工智能、计算机科学与技术等相关专业,更为适合; 4、下载使用后,可先查看README.md或论文文件(如有),本项目仅用作交流学习参考,请切勿用于商业用途。 5、资源来自互联网采集,如有侵权,私聊博主删除。 6、可私信博主看论文后选择购买源代码。
1、资源项目源码均已通过严格测试验证,保证能够正常运行; 2、项目问题、技术讨论,可以给博主私信或留言,博主看到后会第一时间与您进行沟通; 3、本项目比较适合计算机领域相关的毕业设计课题、课程作业等使用,尤其对于人工智能、计算机科学与技术等相关专业,更为适合; 4、下载使用后,可先查看README.md文件(如有),本项目仅用作交流学习参考,请切勿用于商业用途。 1、资源项目源码均已通过严格测试验证,保证能够正常运行; 2、项目问题、技术讨论,可以给博主私信或留言,博主看到后会第一时间与您进行沟通;、 3本项目比较适合计算机领域相关的毕业设计课题、课程作业等使用,尤其对于人工智能、计算机科学与技术等相关专业,更为适合; 4、下载使用后,可先查看ReAdmE.md文件(如有),本项目仅用作交流学习参考,请切勿用于商业用途。 1、资源项目源码均已通过严格测试验证,保证能够正常运行; 2、项目问题、技术讨论,可以给博主私信或留言,博主看到后会第一时间与您进行沟通; 3、本项目比较适合计算机领域相关的毕业设计课题、课程作业等使用,尤其对于人工智能、计算机科学与技术等相关专业,更为适合; 4、下载使用后,可先查看README.md文件(如有),本项目仅用作交流学习参考,请切勿用于商业用途。
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值