1:pandas.DataFrame.head
-
Return the first n rows.
This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it.
For negative values of n, this function returns all rows except the last n rows, equivalent to df[:-n].
DataFrame.head(n=5)
Parameters
-
n:
int, default 5
-
Number of rows to select.
Returns
-
same type as caller
-
The first n rows of the caller object.
See also
-
DataFrame.tail : Returns the last n rows.
Examples
>>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
... 'monkey', 'parrot', 'shark', 'whale', 'zebra']})
>>> df
animal
0 alligator
1 bee
2 falcon
3 lion
4 monkey
5 parrot
6 shark
7 whale
8 zebra
Viewing the first 5 lines
>>> df.head()
animal
0 alligator
1 bee
2 falcon
3 lion
4 monkey
Viewing the first n lines (three in this case)
>>> df.head(3)
animal
0 alligator
1 bee
2 falcon
For negative values of n
>>> df.head(-3)
animal
0 alligator
1 bee
2 falcon
3 lion
4 monkey
5 parrot
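The equivalence with df[:-n] described above can be checked directly. The following is a minimal sketch that reuses the animal frame from the examples; the variable names are illustrative only, and the oversized request assumes head selects by position in the same way as an ordinary slice.

```python
import pandas as pd

df = pd.DataFrame({"animal": ["alligator", "bee", "falcon", "lion",
                              "monkey", "parrot", "shark", "whale", "zebra"]})

# head(-3) keeps everything except the last three rows, like df[:-3].
without_last_three = df.head(-3)
same_as_slice = without_last_three.equals(df[:-3])

# Asking for more rows than the frame holds returns the whole frame.
oversized = df.head(100)

print(same_as_slice)
print(len(oversized))
```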
2:pandas.DataFrame.describe
-
Generate descriptive statistics.
Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.
DataFrame.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)
Parameters
-
percentiles:
list-like of numbers, optional
-
The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.
include:
‘all’, list-like of dtypes or None (default), optional
-
A white list of data types to include in the result. Ignored for Series. Here are the options:
‘all’ : All columns of the input will be included in the output.
A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types submit numpy.number. To limit it instead to object columns submit the numpy.object data type. Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'.
None (default) : The result will include all numeric columns.
exclude:
list-like of dtypes or None (default), optional
-
A black list of data types to omit from the result. Ignored for Series. Here are the options:
A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types submit numpy.number. To exclude object columns submit the data type numpy.object. Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). To exclude pandas categorical columns, use 'category'.
None (default) : The result will exclude nothing.
datetime_is_numeric:
bool, default False
-
Whether to treat datetime dtypes as numeric. This affects statistics calculated for the column. For DataFrame input, this also controls whether datetime columns are included by default.
New in version 1.1.0.
Returns
-
Series or DataFrame
-
Summary statistics of the Series or DataFrame provided.
See also
-
DataFrame.count : Count number of non-NA/null observations.
DataFrame.max : Maximum of the values in the object.
DataFrame.min : Minimum of the values in the object.
DataFrame.mean : Mean of the values.
DataFrame.std : Standard deviation of the observations.
DataFrame.select_dtypes : Subset of a DataFrame including/excluding columns based on their dtype.
Notes
For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.
For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.
If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.
For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.
The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.
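The default for a frame without numeric columns is easy to overlook, so here is a minimal sketch of it. The frame and its column names are made up for illustration; the point is only that describe() falls back to summarizing the object and categorical columns when nothing numeric is present.

```python
import pandas as pd

# Hypothetical frame with no numeric columns at all.
text_only = pd.DataFrame({
    "categorical": pd.Categorical(["d", "e", "f"]),
    "object": ["a", "b", "c"],
})

# With nothing numeric to summarize, both columns are described.
summary = text_only.describe()

print(list(summary.columns))
print(list(summary.index))
```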
Examples
Describing a numeric Series.
>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count 3.0
mean 2.0
std 1.0
min 1.0
25% 1.5
50% 2.0
75% 2.5
max 3.0
dtype: float64
Describing a categorical Series.
>>> s = pd.Series(['a', 'a', 'b', 'c'])
>>> s.describe()
count 4
unique 3
top a
freq 2
dtype: object
Describing a timestamp Series.
>>> s = pd.Series([
... np.datetime64("2000-01-01"),
... np.datetime64("2010-01-01"),
... np.datetime64("2010-01-01")
... ])
>>> s.describe(datetime_is_numeric=True)
count 3
mean 2006-09-01 08:00:00
min 2000-01-01 00:00:00
25% 2004-12-31 12:00:00
50% 2010-01-01 00:00:00
75% 2010-01-01 00:00:00
max 2010-01-01 00:00:00
dtype: object
Describing a DataFrame. By default only numeric fields are returned.
>>> df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']),
... 'numeric': [1, 2, 3],
... 'object': ['a', 'b', 'c']
... })
>>> df.describe()
numeric
count 3.0
mean 2.0
std 1.0
min 1.0
25% 1.5
50% 2.0
75% 2.5
max 3.0
Describing all columns of a DataFrame regardless of data type.
>>> df.describe(include='all')
categorical numeric object
count 3 3.0 3
unique 3 NaN 3
top f NaN a
freq 1 NaN 1
mean NaN 2.0 NaN
std NaN 1.0 NaN
min NaN 1.0 NaN
25% NaN 1.5 NaN
50% NaN 2.0 NaN
75% NaN 2.5 NaN
max NaN 3.0 NaN
Describing a column from a DataFrame by accessing it as an attribute.
>>> df.numeric.describe()
count 3.0
mean 2.0
std 1.0
min 1.0
25% 1.5
50% 2.0
75% 2.5
max 3.0
Name: numeric, dtype: float64
Including only numeric columns in a DataFrame description.
>>> df.describe(include=[np.number])
numeric
count 3.0
mean 2.0
std 1.0
min 1.0
25% 1.5
50% 2.0
75% 2.5
max 3.0
Including only string columns in a DataFrame description.
>>> df.describe(include=[object])
object
count 3
unique 3
top a
freq 1
Including only categorical columns from a DataFrame description.
>>> df.describe(include=['category'])
categorical
count 3
unique 3
top f
freq 1
Excluding numeric columns from a DataFrame description.
>>> df.describe(exclude=[np.number])
categorical object
count 3 3
unique 3 3
top f a
freq 1 1
Excluding object columns from a DataFrame description.
>>> df.describe(exclude=[object])
categorical numeric
count 3 3.0
unique 3 NaN
top f NaN
freq 1 NaN
mean NaN 2.0
std NaN 1.0
min NaN 1.0
25% NaN 1.5
50% NaN 2.0
75% NaN 2.5
max NaN 3.0
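None of the examples above pass percentiles, so the following sketch shows one way to request custom ones. The Series values are arbitrary, and the exact set of rows returned alongside the requested percentiles may differ between pandas versions, so only the requested rows are inspected here.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Percentiles are given as fractions between 0 and 1.
summary = s.describe(percentiles=[0.1, 0.9])

# The requested percentiles appear in the index as percentage labels.
tenth = summary["10%"]
ninetieth = summary["90%"]

print(summary)
```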
3:pandas.DataFrame.plot
-
Make plots of Series or DataFrame.
Uses the backend specified by the option plotting.backend. By default, matplotlib is used.
DataFrame.plot(*args, **kwargs)
Parameters
-
data:
Series or DataFrame
-
The object for which the method is called.
x:
label or position, default None
-
Only used if data is a DataFrame.
y:
label, position or list of label, positions, default None
-
Allows plotting of one column versus another. Only used if data is a DataFrame.
kind:
str
-
The kind of plot to produce:
‘line’ : line plot (default)
‘bar’ : vertical bar plot
‘barh’ : horizontal bar plot
‘hist’ : histogram
‘box’ : boxplot
‘kde’ : Kernel Density Estimation plot
‘density’ : same as ‘kde’
‘area’ : area plot
‘pie’ : pie plot
‘scatter’ : scatter plot
‘hexbin’ : hexbin plot.
ax:
matplotlib axes object, default None
-
An axes of the current figure.
subplots:
bool, default False
-
Make separate subplots for each column.
sharex:
bool, default True if ax is None else False
-
In case subplots=True, share x axis and set some x axis labels to invisible; defaults to True if ax is None, otherwise False if an ax is passed in. Be aware that passing in both an ax and sharex=True will alter all x axis labels for all axes in a figure.
sharey:
bool, default False
-
In case subplots=True, share y axis and set some y axis labels to invisible.
layout:
tuple, optional
-
(rows, columns) for the layout of subplots.
figsize:
a tuple (width, height) in inches
-
Size of a figure object.
use_index:
bool, default True
-
Use index as ticks for x axis.
title:
str or list
-
Title to use for the plot. If a string is passed, print the string at the top of the figure. If a list is passed and subplots is True, print each item in the list above the corresponding subplot.
grid:
bool, default None (matlab style default)
-
Axis grid lines.
legend:
bool or {‘reverse’}
-
Place legend on axis subplots.
style:
list or dict
-
The matplotlib line style per column.
logx:
bool or ‘sym’, default False
-
Use log scaling or symlog scaling on x axis.
Changed in version 0.25.0.
logy:
bool or ‘sym’, default False
-
Use log scaling or symlog scaling on y axis.
Changed in version 0.25.0.
loglog:
bool or ‘sym’, default False
-
Use log scaling or symlog scaling on both x and y axes.
Changed in version 0.25.0.
xticks:
sequence
-
Values to use for the xticks.
yticks:
sequence
-
Values to use for the yticks.
xlim:
2-tuple/list
-
Set the x limits of the current axes.
ylim:
2-tuple/list
-
Set the y limits of the current axes.
xlabel:
label, optional
-
Name to use for the xlabel on x-axis. Default uses index name as xlabel.
New in version 1.1.0.
ylabel:
label, optional
-
Name to use for the ylabel on y-axis. Default will show no ylabel.
New in version 1.1.0.
rot:
int, default None
-
Rotation for ticks (xticks for vertical, yticks for horizontal plots).
fontsize:
int, default None
-
Font size for xticks and yticks.
colormap:
str or matplotlib colormap object, default None
-
Colormap to select colors from. If string, load colormap with that name from matplotlib.
colorbar:
bool, optional
-
If True, plot colorbar (only relevant for ‘scatter’ and ‘hexbin’ plots).
position:
float
-
Specify relative alignments for bar plot layout. From 0 (left/bottom-end) to 1 (right/top-end). Default is 0.5 (center).
table:
bool, Series or DataFrame, default False
-
If True, draw a table using the data in the DataFrame and the data will be transposed to meet matplotlib’s default layout. If a Series or DataFrame is passed, use passed data to draw a table.
yerr:
DataFrame, Series, array-like, dict and str
-
See Plotting with Error Bars for detail.
xerr:
DataFrame, Series, array-like, dict and str
-
Equivalent to yerr.
stacked:
bool, default False in line and bar plots, and True in area plot
-
If True, create stacked plot.
sort_columns:
bool, default False
-
Sort column names to determine plot ordering.
secondary_y:
bool or sequence, default False
-
Whether to plot on the secondary y-axis. If a list/tuple, which columns to plot on the secondary y-axis.
mark_right:
bool, default True
-
When using a secondary_y axis, automatically mark the column labels with “(right)” in the legend.
include_bool:
bool, default False
-
If True, boolean values can be plotted.
backend:
str, default None
-
Backend to use instead of the backend specified in the option plotting.backend. For instance, ‘matplotlib’. Alternatively, to specify the plotting.backend for the whole session, set pd.options.plotting.backend.
New in version 1.0.0.
**kwargs
-
Options to pass to matplotlib plotting method.
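As a rough illustration of how several of these parameters combine, the sketch below plots one column against another with a title and a log-scaled y axis. The sales frame and its column names are invented for the example, and the Agg backend is selected only so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import pandas as pd

# Hypothetical data: one column for the x axis, one for the y axis.
sales = pd.DataFrame({
    "year": [2018, 2019, 2020, 2021],
    "units": [10, 100, 1000, 10000],
})

# x and y pick the columns, kind picks the plot type,
# title labels the figure and logy switches the y axis to log scaling.
ax = sales.plot(x="year", y="units", kind="line",
                title="Units sold", logy=True, legend=False)

print(ax.get_title())
print(ax.get_yscale())
```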
Returns
-
matplotlib.axes.Axes or numpy.ndarray of them
-
If the backend is not the default matplotlib one, the return value will be the object returned by the backend.
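The two matplotlib return shapes can be told apart with a quick type check. This is a small sketch under the assumption that the default matplotlib backend is active; the frame is arbitrary, and the Agg backend is chosen only to avoid needing a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import numpy as np
import pandas as pd
from matplotlib.axes import Axes

df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 2, 1]})

# A single plot gives back one Axes object.
single = df.plot(kind="line")

# With subplots=True there is one Axes per column, returned in an array.
per_column = df.plot(subplots=True)

print(isinstance(single, Axes))
print(isinstance(per_column, np.ndarray), per_column.size)
```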
Notes
See matplotlib documentation online for more on this subject.
If kind = ‘bar’ or ‘barh’, you can specify relative alignments for bar plot layout with the position keyword, from 0 (left/bottom-end) to 1 (right/top-end). The default is 0.5 (center).
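The effect of position is easiest to see by drawing the same bars twice and comparing where they sit. The sketch below does that with a made-up frame on two explicitly created axes; it only reads back the left edge of the first bar and makes no claim about the exact offsets, which depend on the bar width.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts to draw as bars.
counts = pd.DataFrame({"count": [3, 5, 2]}, index=["a", "b", "c"])

# Separate axes, so the two alignments do not draw over each other.
fig, (ax_center, ax_edge) = plt.subplots(1, 2, figsize=(6, 3))
counts.plot(kind="bar", position=0.5, ax=ax_center, legend=False)
counts.plot(kind="bar", position=0, ax=ax_edge, legend=False)

# Left edge of the first bar under each alignment.
center_left = ax_center.patches[0].get_x()
edge_left = ax_edge.patches[0].get_x()

print(center_left, edge_left)
```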