Common pandas.DataFrame Functions Explained (content excerpted from the official documentation)

Article link: https://blog.csdn.net/yang2110862/article/details/107799207

1：pandas.DataFrame.head

DataFrame.head(n=5)

Return the first n rows.

This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it.

For negative values of n, this function returns all rows except the last n rows, equivalent to df[:-n].

Parameters

n： int, default 5

Number of rows to select.

Returns

same type as caller

The first n rows of the caller object.

Examples

 
>>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
...                    'monkey', 'parrot', 'shark', 'whale', 'zebra']})
>>> df
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey
5     parrot
6      shark
7      whale
8      zebra
 
 

Viewing the first 5 lines

 
>>> df.head()
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey
 
 

Viewing the first n lines (three in this case)

 
>>> df.head(3)
      animal
0  alligator
1        bee
2     falcon
 
 

For negative values of n

 
>>> df.head(-3)
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey
5     parrot
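
The equivalence between head(-n) and df[:-n] stated above can be checked directly. The following is a minimal sketch that reuses the sample frame from the examples; the variable names trimmed, sliced and everything are only illustrative.

```python
import pandas as pd

# Same sample frame as in the examples above.
df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
                              'monkey', 'parrot', 'shark', 'whale', 'zebra']})

# head(-3) drops the last three rows, which matches plain slicing df[:-3].
trimmed = df.head(-3)
sliced = df[:-3]
same = trimmed.equals(sliced)

# Asking for more rows than exist simply returns the whole frame.
everything = df.head(100)

print(same, len(trimmed), len(everything))
```

Because head selects by position, the result does not depend on what the index labels are.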
 
 

2：pandas.DataFrame.describe

DataFrame.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)

Generate descriptive statistics.

Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.

Parameters

percentiles： list-like of numbers, optional

The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.

include： ‘all’, list-like of dtypes or None (default), optional

A white list of data types to include in the result. Ignored for Series. Here are the options:

‘all’ : All columns of the input will be included in the output.
A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types submit numpy.number. To limit it instead to object columns submit the numpy.object data type. Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'
None (default) : The result will include all numeric columns.

exclude： list-like of dtypes or None (default), optional

A black list of data types to omit from the result. Ignored for Series. Here are the options:

A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types submit numpy.number. To exclude object columns submit the data type numpy.object. Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). To exclude pandas categorical columns, use 'category'
None (default) : The result will exclude nothing.

datetime_is_numeric： bool, default False

Whether to treat datetime dtypes as numeric. This affects statistics calculated for the column. For DataFrame input, this also controls whether datetime columns are included by default.

New in version 1.1.0.

Returns

Series or DataFrame

Summary statistics of the Series or DataFrame provided.

Examples

Describing a numeric Series.

 
>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64
 
 

Describing a categorical Series.

 
>>> s = pd.Series(['a', 'a', 'b', 'c'])
>>> s.describe()
count     4
unique    3
top       a
freq      2
dtype: object
 
 

Describing a timestamp Series.

 
>>> s = pd.Series([
...   np.datetime64("2000-01-01"),
...   np.datetime64("2010-01-01"),
...   np.datetime64("2010-01-01")
... ])
>>> s.describe(datetime_is_numeric=True)
count                      3
mean     2006-09-01 08:00:00
min      2000-01-01 00:00:00
25%      2004-12-31 12:00:00
50%      2010-01-01 00:00:00
75%      2010-01-01 00:00:00
max      2010-01-01 00:00:00
dtype: object
 
 

Describing a DataFrame. By default only numeric fields are returned.

 
>>> df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']),
...                    'numeric': [1, 2, 3],
...                    'object': ['a', 'b', 'c']
...                   })
>>> df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0
 
 

Describing all columns of a DataFrame regardless of data type.

 
>>> df.describe(include='all')
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN
 
 

Describing a column from a DataFrame by accessing it as an attribute.

 
>>> df.numeric.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64
 
 

Including only numeric columns in a DataFrame description.

 
>>> df.describe(include=[np.number])
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0
 
 

Including only string columns in a DataFrame description.

 
>>> df.describe(include=[object])
       object
count       3
unique      3
top         a
freq        1
 
 

Including only categorical columns from a DataFrame description.

 
>>> df.describe(include=['category'])
       categorical
count            3
unique           3
top              f
freq             1
 
 

Excluding numeric columns from a DataFrame description.

 
>>> df.describe(exclude=[np.number])
       categorical object
count            3      3
unique           3      3
top              f      a
freq             1      1
 
 

Excluding object columns from a DataFrame description.

 
>>> df.describe(exclude=[object])
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0
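
The examples above all use the default percentiles. As a small sketch of the percentiles parameter and of the NaN handling mentioned in the description, the snippet below asks for the 10th and 90th percentiles of a Series that contains one missing value. The data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Five real observations plus one missing value.
s = pd.Series([1, 2, 3, 4, 5, np.nan])

# Request the 10th and 90th percentiles instead of the default quartiles.
summary = s.describe(percentiles=[0.1, 0.9])

# NaN is excluded, so count is 5 and the percentiles are computed over 1..5.
count = summary["count"]
p10 = summary["10%"]
p90 = summary["90%"]

print(summary)
```

The percentile labels in the result are derived from the values passed in, so 0.1 shows up as the row "10%".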
 
 

3：pandas.DataFrame.plot

DataFrame.plot(*args, **kwargs)

Make plots of Series or DataFrame.

Uses the backend specified by the option plotting.backend. By default, matplotlib is used.

Parameters

data： Series or DataFrame

The object for which the method is called.

x： label or position, default None

Only used if data is a DataFrame.

y： label, position or list of label, positions, default None

Allows plotting of one column versus another. Only used if data is a DataFrame.

kind： str

The kind of plot to produce:

‘line’ : line plot (default)
‘bar’ : vertical bar plot
‘barh’ : horizontal bar plot
‘hist’ : histogram
‘box’ : boxplot
‘kde’ : Kernel Density Estimation plot
‘density’ : same as ‘kde’
‘area’ : area plot
‘pie’ : pie plot
‘scatter’ : scatter plot
‘hexbin’ : hexbin plot.
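
As a rough illustration of the kind parameter, the sketch below draws a vertical bar plot from a small made-up frame. It assumes matplotlib is installed and selects the non-interactive Agg backend so that it runs without a display; the column names and title are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd

# Hypothetical data: two columns, three index labels.
df = pd.DataFrame({"sales": [3, 5, 2], "returns": [1, 0, 2]},
                  index=["Q1", "Q2", "Q3"])

# kind="bar" draws one group of vertical bars per index label.
ax = df.plot(kind="bar", title="Hypothetical quarterly figures")

# df.plot.bar() is the accessor form of the same call.
ax2 = df.plot.bar()

# One rectangle per (row, column) pair: 3 rows x 2 columns.
n_bars = len(ax.patches)
```

Switching to another value from the list above, such as kind="barh" or kind="area", only requires changing the string.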

ax： matplotlib axes object, default None

An axes of the current figure.

subplots： bool, default False

Make separate subplots for each column.

sharex： bool, default True if ax is None else False

In case subplots=True, share the x axis and set some x axis labels to invisible. Defaults to True if ax is None, otherwise False if an ax is passed in. Be aware that passing in both an ax and sharex=True will alter all x axis labels for all axes in a figure.

sharey： bool, default False

In case subplots=True, share y axis and set some y axis labels to invisible.

layout： tuple, optional

(rows, columns) for the layout of subplots.

figsize： a tuple (width, height) in inches

Size of a figure object.

use_index： bool, default True

Use index as ticks for x axis.

title： str or list

Title to use for the plot. If a string is passed, print the string at the top of the figure. If a list is passed and subplots is True, print each item in the list above the corresponding subplot.
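
The subplot-related parameters above (subplots, sharey, layout, figsize and a list-valued title) can be combined as in the following sketch. It assumes matplotlib with the Agg backend, and the frame and titles are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd

df = pd.DataFrame({"sales": [3, 5, 2], "returns": [1, 0, 2]})

# One subplot per column, arranged in 1 row x 2 columns, sharing the y axis.
# With subplots=True a list of titles is applied to the subplots in order.
axes = df.plot(subplots=True, layout=(1, 2), figsize=(8, 3),
               sharey=True, title=["Sales", "Returns"])

# The call returns an array of Axes; flatten it for easy iteration.
flat = axes.ravel()
titles = [a.get_title() for a in flat]
fig_size = tuple(flat[0].figure.get_size_inches())
```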

grid： bool, default None (matlab style default)

Axis grid lines.

legend： bool or {‘reverse’}

Place legend on axis subplots.

style： list or dict

The matplotlib line style per column.

logx： bool or ‘sym’, default False

Use log scaling or symlog scaling on x axis.

Changed in version 0.25.0.

logy： bool or ‘sym’, default False

Use log scaling or symlog scaling on y axis.

Changed in version 0.25.0.

loglog： bool or ‘sym’, default False

Use log scaling or symlog scaling on both x and y axes.

Changed in version 0.25.0.
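
To see what logx, logy and loglog do, the resulting axis scales can be inspected on the returned Axes. This is a minimal sketch assuming matplotlib with the Agg backend; the data is arbitrary but strictly positive so that log scaling is meaningful.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd

df = pd.DataFrame({"value": [1, 10, 100, 1000]}, index=[1, 2, 3, 4])

# True requests a log scale, 'sym' requests a symlog scale.
ax_logy = df.plot(logy=True)
ax_sym = df.plot(logx="sym")
ax_both = df.plot(loglog=True)

scales = (ax_logy.get_yscale(), ax_sym.get_xscale(),
          ax_both.get_xscale(), ax_both.get_yscale())
print(scales)
```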

xticks： sequence

Values to use for the xticks.

yticks： sequence

Values to use for the yticks.

xlim： 2-tuple/list

Set the x limits of the current axes.

ylim： 2-tuple/list

Set the y limits of the current axes.

xlabel： label, optional

Name to use for the xlabel on x-axis. Default uses index name as xlabel.

New in version 1.1.0.

ylabel： label, optional

Name to use for the ylabel on y-axis. Default will show no ylabel.

New in version 1.1.0.
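
The axis label and limit parameters can be passed straight to plot, as sketched below. This assumes pandas 1.1.0 or later (for xlabel and ylabel) and matplotlib with the Agg backend; the labels and values are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd

df = pd.DataFrame({"speed": [1.0, 2.5, 4.0]}, index=[0, 1, 2])

# Set both axis labels and fix the y range in a single call.
ax = df.plot(xlabel="time step", ylabel="speed (m/s)", ylim=(0, 5))

x_label = ax.get_xlabel()
y_label = ax.get_ylabel()
y_limits = tuple(ax.get_ylim())
```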

rot： int, default None

Rotation for ticks (xticks for vertical, yticks for horizontal plots).

fontsize： int, default None

Font size for xticks and yticks.

colormap： str or matplotlib colormap object, default None

Colormap to select colors from. If string, load colormap with that name from matplotlib.

colorbar： bool, optional

If True, plot colorbar (only relevant for ‘scatter’ and ‘hexbin’ plots).

position： float

Specify relative alignments for bar plot layout. From 0 (left/bottom-end) to 1 (right/top-end). Default is 0.5 (center).

table： bool, Series or DataFrame, default False

If True, draw a table using the data in the DataFrame and the data will be transposed to meet matplotlib’s default layout. If a Series or DataFrame is passed, use passed data to draw a table.

yerr： DataFrame, Series, array-like, dict and str

See Plotting with Error Bars for detail.

xerr： DataFrame, Series, array-like, dict and str

Equivalent to yerr.

stacked： bool, default False in line and bar plots, and True in area plot

If True, create stacked plot.

sort_columns： bool, default False

Sort column names to determine plot ordering.

secondary_y： bool or sequence, default False

Whether to plot on the secondary y-axis. If a list/tuple, which columns to plot on the secondary y-axis.

mark_right： bool, default True

When using a secondary_y axis, automatically mark the column labels with “(right)” in the legend.
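
A secondary y-axis is useful when two columns have very different ranges. The sketch below, assuming matplotlib with the Agg backend and made-up readings, plots one column on the left axis and one on the right. Reading the secondary axis from the right_ax attribute reflects how the matplotlib backend exposes it, not something stated in the parameter description above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd

df = pd.DataFrame({"temperature": [21.0, 23.5, 22.0],
                   "humidity": [40, 55, 48]})

# Only "humidity" goes on the secondary (right) y-axis.
ax = df.plot(secondary_y=["humidity"], mark_right=True)

# The primary axes holds the remaining column; the secondary axes is
# reachable through ax.right_ax.
left_lines = len(ax.get_lines())
right_lines = len(ax.right_ax.get_lines())
```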

include_bool： bool, default False

If True, boolean values can be plotted.

backend： str, default None

Backend to use instead of the backend specified in the option plotting.backend. For instance, ‘matplotlib’. Alternatively, to specify the plotting.backend for the whole session, set pd.options.plotting.backend.

New in version 1.0.0.
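
The two ways of choosing a backend described above, per call and per session, can be sketched as follows. Only the built-in 'matplotlib' backend is used here, since any other backend name would require a third-party package; the example assumes pandas 1.0.0 or later and matplotlib with the Agg renderer.

```python
import matplotlib
matplotlib.use("Agg")  # headless renderer so the sketch runs without a display
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# The session-wide option defaults to matplotlib.
default_backend = pd.get_option("plotting.backend")

# Per-call selection through the backend keyword.
ax = df.plot(backend="matplotlib")

# Temporary session-level selection; the option is restored on exit.
with pd.option_context("plotting.backend", "matplotlib"):
    ax2 = df.plot()
```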

**kwargs

Options to pass to matplotlib plotting method.

Returns

matplotlib.axes.Axes or numpy.ndarray of them

If the backend is not the default matplotlib one, the return value will be the object returned by the backend.

Notes

See matplotlib documentation online for more on this subject.
If kind = ‘bar’ or ‘barh’, you can specify relative alignments for bar plot layout with the position keyword, from 0 (left/bottom-end) to 1 (right/top-end). Default is 0.5 (center).