Python pandas: 10 Minutes to pandas


networksu


Article link: https://blog.csdn.net/networksu/article/details/87443761


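As I take my first steps in data analysis with pandas, one question keeps coming back: what data problem can't be solved with SQL? And where plain SQL falls short, can't a stored procedure handle it?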

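So why learn pandas? I have only just started, so I won't claim to know why anyone should replace SQL with pandas. My first impression is that pandas can work with data from many sources: Excel, CSV, databases, and so on. For a programmer moving from SQL into data analysis, though, the natural way to record the learning process is to compare SQL and pandas side by side.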
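To make the "many sources" point concrete, here is a minimal sketch that loads the same rows once from CSV text and once from a database. The CSV content, the in-memory SQLite database, and the scores table are all made up for illustration; a real project would point read_csv at a file and read_sql at its own connection.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "name,score\nalice,90\nbob,85\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# Hypothetical in-memory SQLite table standing in for a real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)", [("alice", 90), ("bob", 85)])
conn.commit()
df_sql = pd.read_sql("SELECT name, score FROM scores", conn)
conn.close()

# Both sources end up as ordinary DataFrames with the same columns.
print(df_csv)
print(df_sql)
```

Once the data is in a DataFrame, the rest of the code no longer needs to know where it came from.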

Let's get an overview from "10 Minutes to pandas" on the official pandas site.

http://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html

I. Create DataFrame

import pandas as pd
import numpy as np
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df)

                   A         B         C         D
2013-01-01  1.300644  0.278368 -1.005493  1.472200
2013-01-02 -1.157154  0.550540 -1.912141 -0.099683
2013-01-03  1.058534  2.118447 -0.116872  0.023518
2013-01-04 -0.070789 -0.033575  1.578565  0.174210
2013-01-05 -0.013015  1.326789 -0.561864 -0.605839
2013-01-06 -1.639182 -0.829422  0.743933  0.850687

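1) A DataFrame can be thought of as a table in a database.

2) A DataFrame has the concept of an Index (the row labels), which a database table does not have in the same sense. The Index can be the default 0, 1, 2, 3, 4, it can be labels such as A, B, C, D, or it can be dates as in the example above.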
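The three kinds of Index mentioned above can be sketched as follows. The column name and values are made up for illustration; only the index differs between the three frames.

```python
import pandas as pd

data = {"A": [1, 2, 3]}  # hypothetical sample data

df_default = pd.DataFrame(data)                    # default index 0, 1, 2
df_labels = pd.DataFrame(data, index=list("xyz"))  # string labels
df_dates = pd.DataFrame(data, index=pd.date_range("20130101", periods=3))  # dates

# Rows are looked up by label with .loc, whatever the label type is.
print(df_default.loc[1, "A"])
print(df_labels.loc["y", "A"])
print(df_dates.loc[pd.Timestamp("2013-01-02"), "A"])
```

All three lookups address the same second row; only the label used to reach it changes.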

II. View Data

In [13]: df.head()
Out[13]: 
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401

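1) df.head(n) returns the first n rows; the default is n=5. SQL: select top 5 * from df

2) df.tail(3) returns the last three rows. SQL: select top 3 * from df order by xxx desc (note that the SQL version returns those rows in reverse order, while tail keeps the original order).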
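A small sketch of head and tail, using a made-up single-column frame so the row order is easy to follow. The last expression is my own approximation of the SQL "top 3 ... order by ... desc" query, not something from the official guide.

```python
import pandas as pd

df = pd.DataFrame({"n": range(10)})  # hypothetical data: 0..9

first = df.head()    # first 5 rows (default n=5)
last3 = df.tail(3)   # last 3 rows, in original order

# Closer to "select top 3 * from df order by n desc": sort descending, then take 3.
last3_desc = df.sort_values(by="n", ascending=False).head(3)

print(first["n"].tolist())
print(last3["n"].tolist())
print(last3_desc["n"].tolist())
```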

In [15]: df.index
Out[15]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [16]: df.columns
Out[16]: Index(['A', 'B', 'C', 'D'], dtype='object')

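3) The two attributes df.index and df.columns both return pandas Index objects (a DatetimeIndex and an Index here), not plain lists; call .tolist() when a list is needed. SQL Server has no counterpart to the row index. To look up the columns of a table:

SQL: SELECT a.name FROM syscolumns a
INNER JOIN sysobjects b ON a.id=b.id
WHERE b.type='U' AND b.name LIKE 'table_name'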
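On the pandas side, the column names are available without any catalog query. A minimal sketch, with a made-up two-column frame:

```python
import pandas as pd

# Hypothetical sample frame with a date index.
df = pd.DataFrame(
    {"A": [1.0, 2.0], "B": [3.0, 4.0]},
    index=pd.date_range("20130101", periods=2),
)

print(type(df.index).__name__)    # the row labels object
print(type(df.columns).__name__)  # the column labels object

cols = df.columns.tolist()  # convert to a plain Python list when needed
print(cols)
print(df.dtypes)            # column names together with their data types
```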

In [19]: df.describe()
Out[19]: 
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.073711 -0.431125 -0.687758 -0.233103
std    0.843157  0.922818  0.779887  0.973118
min   -0.861849 -2.104569 -1.509059 -1.135632
25%   -0.611510 -0.600794 -1.368714 -1.076610
50%    0.022070 -0.228039 -0.767252 -0.386188
75%    0.658444  0.041933 -0.034326  0.461706
max    1.212112  0.567020  0.276232  1.071804

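4) df.describe() gives a quick statistical summary of the table, computing statistics for each numeric (int, float) column.

count is the number of non-null values, mean is the average, std is the standard deviation, min is the minimum, 25%, 50%, and 75% are the quartiles, and max is the maximum.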
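To check what each row of the summary means, the same numbers can be computed one by one. This is a sketch on made-up data; the text column is there only to show that describe() skips non-numeric columns by default.

```python
import pandas as pd

# Hypothetical data: one numeric column and one text column.
df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0], "label": list("wxyz")})

summary = df.describe()  # only the numeric column A is summarized
print(summary)

# The same statistics computed individually.
manual_mean = df["A"].mean()
manual_median = df["A"].quantile(0.5)  # the "50%" row
manual_std = df["A"].std()             # sample standard deviation (ddof=1)
print(manual_mean, manual_median, manual_std)
```

In SQL terms this is roughly one query with COUNT, AVG, MIN, and MAX per column, plus percentile functions for the quartiles.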

In [20]: df.T
Out[20]: 
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.469112    1.212112   -0.861849    0.721555   -0.424972   -0.673690
B   -0.282863   -0.173215   -2.104569   -0.706771    0.567020    0.113648
C   -1.509059    0.119209   -0.494929   -1.039575    0.276232   -1.478427
D   -1.135632   -1.044236    1.071804    0.271860   -1.087401    0.524988

5) df.T transposes the table, swapping rows and columns. This is awkward to do in SQL.
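A tiny sketch of the transpose, with made-up labels so it is easy to see that the row labels become column labels and vice versa:

```python
import pandas as pd

# Hypothetical 3x2 frame with explicit row labels.
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["r1", "r2", "r3"])

t = df.T  # rows become columns and columns become rows

print(df.shape, t.shape)
print(t)
print(t.loc["A", "r2"])  # same cell as df.loc["r2", "A"]
```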

In [21]: df.sort_index(axis=1, ascending=False)
Out[21]: 
                   D         C         B         A
2013-01-01 -1.135632 -1.509059 -0.282863  0.469112
2013-01-02 -1.044236  0.119209 -0.173215  1.212112
2013-01-03  1.071804 -0.494929 -2.104569 -0.861849
2013-01-04  0.271860 -1.039575 -0.706771  0.721555
2013-01-05 -1.087401  0.276232  0.567020 -0.424972
2013-01-06  0.524988 -1.478427  0.113648 -0.673690

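6) sort_index sorts by labels: axis=1 sorts by column name, and axis=0 (the default) sorts by the row index. ascending=True sorts in ascending order and ascending=False in descending order. SQL does not seem to have a direct equivalent.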
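The two axes can be compared on a small frame whose rows and columns both start out unordered. The data below is made up for illustration.

```python
import pandas as pd

# Hypothetical frame: columns in B, A, C order and rows labeled 2, 0, 1.
df = pd.DataFrame({"B": [1, 2, 3], "A": [4, 5, 6], "C": [7, 8, 9]}, index=[2, 0, 1])

by_cols_desc = df.sort_index(axis=1, ascending=False)  # columns: C, B, A
by_rows = df.sort_index(axis=0)                        # rows: 0, 1, 2

print(by_cols_desc)
print(by_rows)
```

Note that sort_index only reorders labels; the values travel with their row or column.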

In [22]: df['E'] = [1, 1, 1, 2, 2, 4]
In [23]: df.sort_values(by=['E', 'A'], ascending=[True, False])
Out[23]: 
                   A         B         C         D  E
2013-01-01  0.360275 -2.036355  0.542733  0.833518  1
2013-01-03 -1.426237  1.311510  1.199737 -0.194401  1
2013-01-02 -1.455219 -0.098547 -1.604195  0.069465  1
2013-01-05  0.476745 -0.089874  1.761007 -0.023949  2
2013-01-04  0.197306 -0.458258  0.626157 -0.037381  2
2013-01-06  0.075495 -0.313558  0.797433  2.752348  4

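7) sort_values sorts by values: df.sort_values(by='B') sorts by column B, with ascending=True for ascending order and ascending=False for descending order. Both by and ascending accept lists, for example df.sort_values(by=['E','A'], ascending=[True,False]). SQL: select * from df order by E, A desc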
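Here is a minimal sketch of the multi-column sort, the pandas counterpart of "order by E, A desc". The values are made up so that the result order is easy to verify by hand.

```python
import pandas as pd

# Hypothetical data: E has ties, so A decides the order within each E group.
df = pd.DataFrame(
    {"A": [-1.4, 0.3, -1.5, 0.2, 0.5, 0.1], "E": [1, 1, 1, 2, 2, 4]}
)

# E ascending first, then A descending within equal E values.
result = df.sort_values(by=["E", "A"], ascending=[True, False])
print(result)

# Single-column sort for comparison.
by_a = df.sort_values(by="A")
print(by_a)
```

sort_values returns a new DataFrame and leaves df itself unchanged, just as a SELECT with ORDER BY does not reorder the underlying table.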
