pandas.DataFrame common functions explained (content excerpted from the official documentation)

1:pandas.DataFrame.head

DataFrame.head(n=5)

Return the first n rows.

This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it.

For negative values of n, this function returns all rows except the last n rows, equivalent to df[:-n].

Parameters

n: int, default 5

Number of rows to select.

Returns

same type as caller

The first n rows of the caller object.

See also

DataFrame.tail

Returns the last n rows.

Examples

>>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
...                    'monkey', 'parrot', 'shark', 'whale', 'zebra']})
>>> df
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey
5     parrot
6      shark
7      whale
8      zebra

Viewing the first 5 rows

>>> df.head()
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey

Viewing the first n rows (three in this case)

>>> df.head(3)
      animal
0  alligator
1        bee
2     falcon

For negative values of n

>>> df.head(-3)
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey
5     parrot
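
The equivalence between head(-n) and df[:-n] stated above can be checked directly. The following is a minimal sketch that reuses the animal DataFrame from the examples; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
                              'monkey', 'parrot', 'shark', 'whale', 'zebra']})

# head(-3) drops the last three rows, like the positional slice df[:-3].
all_but_last_three = df.head(-3)
same_via_slice = df[:-3]
print(all_but_last_three.equals(same_via_slice))  # True

# Asking for more rows than exist simply returns every row.
print(len(df.head(100)))  # 9

# n=0 returns an empty frame that still carries the original columns.
empty = df.head(0)
print(list(empty.columns), len(empty))  # ['animal'] 0
```

Because head selects by position, the result does not depend on the index labels.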

2:pandas.DataFrame.describe

DataFrame.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)

Generate descriptive statistics.

Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.

Parameters

percentiles: list-like of numbers, optional

The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.

include: ‘all’, list-like of dtypes or None (default), optional

A white list of data types to include in the result. Ignored for Series. Here are the options:

  • ‘all’ : All columns of the input will be included in the output.

  • A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types submit numpy.number. To limit it instead to object columns submit the object data type (the numpy.object alias is only available in older NumPy releases). Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'.

  • None (default) : The result will include all numeric columns.

exclude: list-like of dtypes or None (default), optional

A black list of data types to omit from the result. Ignored for Series. Here are the options:

  • A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types submit numpy.number. To exclude object columns submit the object data type (the numpy.object alias is only available in older NumPy releases). Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). To exclude pandas categorical columns, use 'category'.

  • None (default) : The result will exclude nothing.

datetime_is_numeric: bool, default False

Whether to treat datetime dtypes as numeric. This affects statistics calculated for the column. For DataFrame input, this also controls whether datetime columns are included by default.

New in version 1.1.0. The parameter was removed in pandas 2.0, where datetime data is always summarized as numeric.

Returns

Series or DataFrame

Summary statistics of the Series or DataFrame provided.

See also

DataFrame.count

Count number of non-NA/null observations.

DataFrame.max

Maximum of the values in the object.

DataFrame.min

Minimum of the values in the object.

DataFrame.mean

Mean of the values.

DataFrame.std

Standard deviation of the observations.

DataFrame.select_dtypes

Subset of a DataFrame including/excluding columns based on their dtype.

Notes

For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.

For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.

If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.

For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.

The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.
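
The fallback described above for a DataFrame without numeric columns can be illustrated as follows. This is a minimal sketch; the frame df_text and its column names are made up for the illustration.

```python
import pandas as pd

# A frame with no numeric columns at all (hypothetical example data).
df_text = pd.DataFrame({
    'categorical': pd.Categorical(['d', 'e', 'f']),
    'object': ['a', 'b', 'c'],
})

# With nothing numeric to analyze, describe() summarizes every column
# using the count/unique/top/freq statistics.
summary = df_text.describe()
print(summary)
```

The result has one column per input column and the rows count, unique, top and freq, the same rows that include='all' produces for non-numeric columns.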

Examples

Describing a numeric Series.

>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64

Describing a categorical Series.

>>> s = pd.Series(['a', 'a', 'b', 'c'])
>>> s.describe()
count     4
unique    3
top       a
freq      2
dtype: object

Describing a timestamp Series.

>>> s = pd.Series([
...   np.datetime64("2000-01-01"),
...   np.datetime64("2010-01-01"),
...   np.datetime64("2010-01-01")
... ])
>>> s.describe(datetime_is_numeric=True)
count                      3
mean     2006-09-01 08:00:00
min      2000-01-01 00:00:00
25%      2004-12-31 12:00:00
50%      2010-01-01 00:00:00
75%      2010-01-01 00:00:00
max      2010-01-01 00:00:00
dtype: object

Describing a DataFrame. By default only numeric fields are returned.

>>> df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']),
...                    'numeric': [1, 2, 3],
...                    'object': ['a', 'b', 'c']
...                   })
>>> df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a DataFrame regardless of data type.

>>> df.describe(include='all')  
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

Describing a column from a DataFrame by accessing it as an attribute.

>>> df.numeric.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a DataFrame description.

>>> df.describe(include=[np.number])
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Including only string columns in a DataFrame description.

>>> df.describe(include=[object])  
       object
count       3
unique      3
top         a
freq        1

Including only categorical columns from a DataFrame description.

>>> df.describe(include=['category'])
       categorical
count            3
unique           3
top              f
freq             1

Excluding numeric columns from a DataFrame description.

>>> df.describe(exclude=[np.number])  
       categorical object
count            3      3
unique           3      3
top              f      a
freq             1      1

Excluding object columns from a DataFrame description.

>>> df.describe(exclude=[object])  
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0
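
The examples above all use the default percentiles. The sketch below passes a custom percentiles list instead; the Series values are chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Request the 10th and 90th percentiles; values must lie between 0 and 1.
summary = s.describe(percentiles=[0.1, 0.9])
print(summary)

# Rows are labelled with the percentile written as a percentage.
print(summary['10%'], summary['90%'])
```

With linear interpolation over five evenly spaced values, the 10th percentile falls at 1.4 and the 90th at 4.6.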

3:pandas.DataFrame.plot

DataFrame.plot(*args, **kwargs)

Make plots of Series or DataFrame.

Uses the backend specified by the option plotting.backend. By default, matplotlib is used.

Parameters

data: Series or DataFrame

The object for which the method is called.

x: label or position, default None

Only used if data is a DataFrame.

y: label, position or list of label, positions, default None

Allows plotting of one column versus another. Only used if data is a DataFrame.

kind: str

The kind of plot to produce:

  • ‘line’ : line plot (default)

  • ‘bar’ : vertical bar plot

  • ‘barh’ : horizontal bar plot

  • ‘hist’ : histogram

  • ‘box’ : boxplot

  • ‘kde’ : Kernel Density Estimation plot

  • ‘density’ : same as ‘kde’

  • ‘area’ : area plot

  • ‘pie’ : pie plot

  • ‘scatter’ : scatter plot

  • ‘hexbin’ : hexbin plot.

ax: matplotlib axes object, default None

An axes of the current figure.

subplots: bool, default False

Make separate subplots for each column.

sharex: bool, default True if ax is None else False

In case subplots=True, share x axis and set some x axis labels to invisible; defaults to True if ax is None, otherwise False if an ax is passed in. Be aware that passing in both an ax and sharex=True will alter all x axis labels for all axes in a figure.

sharey: bool, default False

In case subplots=True, share y axis and set some y axis labels to invisible.

layout: tuple, optional

(rows, columns) for the layout of subplots.

figsize: a tuple (width, height) in inches

Size of a figure object.

use_index: bool, default True

Use index as ticks for x axis.

title: str or list

Title to use for the plot. If a string is passed, print the string at the top of the figure. If a list is passed and subplots is True, print each item in the list above the corresponding subplot.

grid: bool, default None (matlab style default)

Axis grid lines.

legend: bool or {‘reverse’}

Place legend on axis subplots.

style: list or dict

The matplotlib line style per column.

logx: bool or ‘sym’, default False

Use log scaling or symlog scaling on x axis.

Changed in version 0.25.0.

logy: bool or ‘sym’, default False

Use log scaling or symlog scaling on y axis.

Changed in version 0.25.0.

loglog: bool or ‘sym’, default False

Use log scaling or symlog scaling on both x and y axes.

Changed in version 0.25.0.

xticks: sequence

Values to use for the xticks.

yticks: sequence

Values to use for the yticks.

xlim: 2-tuple/list

Set the x limits of the current axes.

ylim: 2-tuple/list

Set the y limits of the current axes.

xlabel: label, optional

Name to use for the xlabel on x-axis. Default uses index name as xlabel.

New in version 1.1.0.

ylabel: label, optional

Name to use for the ylabel on y-axis. Default will show no ylabel.

New in version 1.1.0.

rot: int, default None

Rotation for ticks (xticks for vertical, yticks for horizontal plots).

fontsize: int, default None

Font size for xticks and yticks.

colormap: str or matplotlib colormap object, default None

Colormap to select colors from. If string, load colormap with that name from matplotlib.

colorbar: bool, optional

If True, plot colorbar (only relevant for ‘scatter’ and ‘hexbin’ plots).

position: float

Specify relative alignments for bar plot layout. From 0 (left/bottom-end) to 1 (right/top-end). Default is 0.5 (center).

table: bool, Series or DataFrame, default False

If True, draw a table using the data in the DataFrame and the data will be transposed to meet matplotlib’s default layout. If a Series or DataFrame is passed, use passed data to draw a table.

yerr: DataFrame, Series, array-like, dict and str

See Plotting with Error Bars for detail.

xerr: DataFrame, Series, array-like, dict and str

Equivalent to yerr.

stacked: bool, default False in line and bar plots, and True in area plot

If True, create stacked plot.

sort_columns: bool, default False

Sort column names to determine plot ordering.

secondary_y: bool or sequence, default False

Whether to plot on the secondary y-axis. If a list/tuple, which columns to plot on the secondary y-axis.

mark_right: bool, default True

When using a secondary_y axis, automatically mark the column labels with “(right)” in the legend.

include_bool: bool, default False

If True, boolean values can be plotted.

backend: str, default None

Backend to use instead of the backend specified in the option plotting.backend. For instance, ‘matplotlib’. Alternatively, to specify the plotting.backend for the whole session, set pd.options.plotting.backend.

New in version 1.0.0.

**kwargs

Options to pass to matplotlib plotting method.

Returns

matplotlib.axes.Axes or numpy.ndarray of them

If the backend is not the default matplotlib one, the return value will be the object returned by the backend.
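
The two return shapes described above can be seen with the default matplotlib backend. This is a small sketch that assumes matplotlib is installed; the non-interactive "Agg" backend is selected only so the code runs without a display, and the data is invented for the illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed

import numpy as np
import pandas as pd
from matplotlib.axes import Axes

df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 2, 1]})

# Without subplots, both columns are drawn on a single Axes object.
single = df.plot()
print(isinstance(single, Axes))  # True

# With subplots=True, each column gets its own Axes, returned as an array.
grid = df.plot(subplots=True)
print(isinstance(grid, np.ndarray), grid.size)
```

Code that accepts the result of plot should therefore check for an array before calling Axes methods on it.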

Notes

  • See matplotlib documentation online for more on this subject

  • If kind = ‘bar’ or ‘barh’, you can specify relative alignments for bar plot layout by position keyword. From 0 (left/bottom-end) to 1 (right/top-end). Default is 0.5 (center)
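
Several of the parameters listed above are easier to follow in combination. The sketch below assumes the default matplotlib backend (with "Agg" selected so it runs without a display); the year, sales and visits columns are hypothetical data made up for the illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed

import pandas as pd

df = pd.DataFrame({
    'year': [2019, 2020, 2021],
    'sales': [10, 15, 12],
    'visits': [1000, 1800, 1500],
})

# Vertical bar plot of one column against another. position=0.5 (the
# default) centers each bar on its tick; rot=0 keeps tick labels upright.
ax_bar = df.plot(x='year', y='sales', kind='bar', position=0.5,
                 title='Sales per year', rot=0)

# Line plot in which only 'visits' is drawn against a secondary y-axis,
# so two columns with very different scales stay readable.
ax_line = df.plot(x='year', y=['sales', 'visits'], secondary_y=['visits'])

print(ax_bar.get_title())
print(len(ax_bar.patches))  # one rectangle per bar
```

With mark_right left at its default of True, the legend entry for the secondary column is suffixed with "(right)".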
