http://blog.csdn.net/pipisorry/article/details/53486777
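Advanced pandas features: performance optimization, panel data, string methods, categoricals, and visualization.
Performance optimization
Speeding up row-wise operations in pandas
Using dask to speed up DataFrame operations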
python -m pip install "dask[complete]"  # Install everything
import multiprocessing
from dask import dataframe as ddf
# df[new_word_cols] = df[word_cols].applymap(lang.word2idfunc)  # Approach 1: plain pandas
# chunksize = args.file_batch_size // 8  # Approach 2
npartitions = 4 * multiprocessing.cpu_count()  # Approach 3: dask partitions
df[new_word_cols] = ddf.from_pandas(df[word_cols], npartitions=npartitions) \
    .map_partitions(lambda dds: dds.applymap(lang.word2idfunc)).compute(scheduler='processes')
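When applying a function element-wise, approach 3 takes roughly half the time of approach 1, or less.
Iteration time comparison:
[Speeding up Chinese word segmentation on millions of text rows in pandas]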
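The idea behind map_partitions (split the rows into chunks, transform each chunk in a separate worker, then concatenate the results) can be sketched with only the standard library and pandas. This is not dask's implementation, just an illustration of the pattern; WORD2ID, word2id, encode_chunk, and parallel_encode are hypothetical names, with word2id standing in for lang.word2idfunc.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import pandas as pd

# Hypothetical vocabulary standing in for lang.word2idfunc.
WORD2ID = {"hello": 1, "world": 2}


def word2id(word):
    """Map a word to its id; unknown words map to 0."""
    return WORD2ID.get(word, 0)


def encode_chunk(chunk):
    """Apply word2id to every cell of one chunk, column by column."""
    return chunk.apply(lambda col: col.map(word2id))


def parallel_encode(df, n_chunks=4, executor_cls=ProcessPoolExecutor):
    """Split df into row chunks, encode each chunk in a worker, reassemble."""
    bounds = np.linspace(0, len(df), n_chunks + 1, dtype=int)
    chunks = [
        df.iloc[start:stop]
        for start, stop in zip(bounds[:-1], bounds[1:])
        if stop > start  # skip empty chunks when df has few rows
    ]
    with executor_cls() as pool:
        parts = list(pool.map(encode_chunk, chunks))
    return pd.concat(parts)
```

When calling parallel_encode from a script, it is safest to do so under an `if __name__ == "__main__":` guard, because some platforms start worker processes by re-importing the main module. For small frames the cost of starting processes can outweigh any gain.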
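Panel data
{pandas data structures: the one-dimensional Series, the two-dimensional DataFrame, and the three-dimensional Panel} pandas has a Panel data structure, which can be thought of as a three-dimensional version of DataFrame. Note that Panel was deprecated in pandas 0.20 and removed in 0.25, so the examples in this section apply to older versions. A Panel object can be created from a dict of DataFrame objects or from a three-dimensional ndarray: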
import pandas.io.data as web
pdata = pd.Panel(dict((stk, web.get_data_yahoo(stk, '1/1/2009', '6/1/2012')) for stk in ['AAPL', 'GOOG', 'MSFT','DELL']))
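Note: stk is the stock ticker (company). Each DataFrame has 6 indicator columns (Open through Adj Close). The three dimensions are company (items), time (major_axis), and indicator (minor_axis).
Each item in a Panel (analogous to a column in a DataFrame) is a DataFrame.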
>>> pdata
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 868 (major_axis) x 6 (minor_axis)
Items axis: AAPL to MSFT
Major_axis axis: 2009-01-02 00:00:00 to 2012-06-01 00:00:00
Minor_axis axis: Open to Adj Close
>>> pdata = pdata.swapaxes('items', 'minor')
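>>> pdata['Adj Close']
Three-dimensional label indexing with ix
Label indexing with ix extends to three dimensions, so you can select all data for a given date or date range, as shown below:
>>> pdata.ix[:, '6/1/2012', :]
>>> pdata.ix['Adj Close', '5/22/2012':, :]
Another way to represent panel data (especially for fitting statistical models) is the "stacked" DataFrame form:
>>> stacked = pdata.ix[:, '5/30/2012':, :].to_frame()
>>> stacked
DataFrame has a corresponding to_panel method, which is the inverse of to_frame: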
>>> stacked.to_panel()
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 3 (major_axis) x 4 (minor_axis)
Items axis: Open to Adj Close
Major_axis axis: 2012-05-30 00:00:00 to 2012-06-01 00:00:00
Minor_axis axis: AAPL to MSFT
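On pandas versions without Panel, a similar workflow can be sketched with a DataFrame whose columns form a MultiIndex. The example below uses synthetic random numbers instead of Yahoo data, and the names frames, wide, close, and stacked are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for downloaded prices: 2 tickers, 2 fields, 3 dates.
dates = pd.date_range("2012-05-30", periods=3, name="date")
rng = np.random.default_rng(0)
frames = {
    ticker: pd.DataFrame(rng.random((3, 2)), index=dates, columns=["Open", "Close"])
    for ticker in ["AAPL", "GOOG"]
}

# Concatenating a dict of DataFrames along axis=1 yields (ticker, field) columns.
wide = pd.concat(frames, axis=1)
wide.columns.names = ["ticker", "field"]

# Roughly analogous to swapaxes + pdata['Close']: one field for every ticker.
close = wide.xs("Close", axis=1, level="field")

# Roughly analogous to pdata.ix[:, '5/31/2012':, :]: a date range via .loc.
recent = wide.loc["2012-05-31":]

# Roughly analogous to to_frame(): rows indexed by (date, ticker).
stacked = wide.stack(level="ticker")

# Roughly analogous to to_panel(): move ticker back into the columns.
unstacked = stacked.unstack("ticker")
```

Some pandas 2.x releases emit a FutureWarning from stack about a changed implementation; the resulting shape is the same.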
String Methods
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them). See more at Vectorized String Methods.
In [71]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
In [72]: s.str.lower()
Out[72]:
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
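To illustrate the point about regular expressions being the default, here is a small sketch using str.contains. The variable names are illustrative; na=False and regex=False are existing parameters of str.contains.

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

# The pattern is treated as a regular expression by default:
# match entries that start with 'A' or 'a'; treat missing values as False.
starts_with_a = s.str.contains(r'^[Aa]', na=False)

# A dot matches any character as a regex, so pass regex=False for a literal dot.
t = pd.Series(['a.b', 'axb'])
dot_as_regex = t.str.contains('.')
dot_as_literal = t.str.contains('.', regex=False)
```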
Categoricals
Since version 0.15, pandas can include categorical data in a DataFrame. For full docs, see the categorical introduction and the API documentation.
In [122]: df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
Convert the raw grades to a categorical data type.
In [123]: df["grade"] = df["raw_grade"].astype("category")
In [124]: df["grade"]
Out[124]:
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
Rename the categories to more meaningful names (assigning to Series.cat.categories is inplace!)
In [125]: df["grade"].cat.categories = ["very good", "good", "very bad"]
Reorder the categories and simultaneously add the missing categories (methods under Series.cat return a new Series by default).
In [126]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
In [127]: df["grade"]
Out[127]:
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
Sorting is per order in the categories, not lexical order.
In [128]: df.sort_values(by="grade")
Out[128]:
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good
Grouping by a categorical column also shows empty categories.
In [129]: df.groupby("grade").size()
Out[129]:
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64
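Recent pandas releases (2.0 and later, as far as I know) no longer allow assigning to Series.cat.categories, and the default handling of empty categories in groupby has been changing. A sketch of the same steps that should run on both older and newer versions uses rename_categories and an explicit observed=False:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
df["grade"] = df["raw_grade"].astype("category")

# rename_categories returns a new Series instead of modifying in place.
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])

# Reorder and add the missing categories, as in the example above.
df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"])

# observed=False keeps empty categories in the result explicitly.
counts = df.groupby("grade", observed=False).size()
```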
Visualization (Plot)
DataFrame has built-in plotting based on matplotlib
In [76]: df['GDP percap'].plot(kind='bar')
In [77]: import matplotlib.pyplot as plt
In [78]: plt.show()
Plotting directly
Plotting docs.
In [130]: ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
In [131]: ts = ts.cumsum()
In [132]: ts.plot()
Out[132]: <matplotlib.axes._subplots.AxesSubplot at 0xaf49988c>
On DataFrame, plot() is a convenience to plot all of the columns with labels:
In [133]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
.....: columns=['A', 'B', 'C', 'D'])
.....:
In [134]: df = df.cumsum()
In [135]: plt.figure(); df.plot(); plt.legend(loc='best')
Out[135]: <matplotlib.legend.Legend at 0xaf499d4c>
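Drawing box plots
Python has many visualization modules, and the most popular is the matplotlib library [matplotlib plotting basics]. As a brief aside, we can also choose the bokeh and seaborn modules [seaborn, an advanced Python plotting library].
Using matplotlib
Using the ggplot theme from R, as integrated in pandas, to beautify charts
To use ggplot, we only need to add one more line to the code above:
The result is much cleaner than the default matplotlib.pyplot theme.
Even better is to bring in the seaborn module
This module is a statistical data visualization library: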
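The original code for this subsection is not shown, so the following is only a sketch of one way to draw a box plot with the ggplot look. It assumes the matplotlib style named "ggplot" in place of the pandas-integrated theme, and it uses synthetic random data; seaborn is left out to keep the dependencies minimal.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# One extra line switches all subsequent plots to the ggplot-like style.
plt.style.use("ggplot")

# Synthetic data: 100 samples for each of four columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("ABCD"))

# One box per column; df.plot returns the matplotlib Axes it drew on.
ax = df.plot(kind="box")
ax.set_title("Box plot of four synthetic columns")
```

In an interactive session, drop the matplotlib.use("Agg") line and call plt.show() to display the figure.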
Drawing scatter plots
df:
age fat_percent
0 23 9.5
1 23 26.5
2 27 7.8
3 27 17.8
4 39 31.4
5 41 25.9
df.plot(kind='scatter', x='age', y='fat_percent')
plt.show()
Note: omitting x and y raises an error: ValueError: scatter requires and x and y column
Drawing histograms with density curves
Drawing other plots
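A self-contained version of the scatter example, including the error case, might look like the sketch below. The data is the table shown above; the Agg backend is an assumption made so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "age": [23, 23, 27, 27, 39, 41],
    "fat_percent": [9.5, 26.5, 7.8, 17.8, 31.4, 25.9],
})

# Scatter plots need both x and y column names.
ax = df.plot(kind="scatter", x="age", y="fat_percent")

# Leaving them out raises ValueError.
try:
    df.plot(kind="scatter")
    raised = False
except ValueError:
    raised = True
```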
ref: Python for Data Analysis (《利用Python进行数据分析》)*
Notebook Python: Getting Started with Data Analysis
Python and R: Is Python really faster than R?