The official
documentation for pandas defines what most
developers would know as null values
as missing or missing
data in pandas. Within pandas,
a missingvalue
is denoted by NaN.
In most cases, the terms missing and null are
interchangeable, but to abide by the standards of pandas, we’ll
continue using missing throughout
this tutorial.
Evaluating for Missing Data
At the base level, pandas offers two functions to test
for missing data, isnull()and notnull().
As you may suspect, these are simple functions that return
a boolean value
indicating whether the passed in argument value is in
fact missing data.
In addition to the above functions, pandas also provides two
methods to check for missing data
on Series and DataFrame objects. These methods evaluate each object
in the Series or DataFrame and
provide a boolean value
indicating if the data is missing or
not.
For example, let’s create a simple Series in
pandas:
import pandas as pd
import numpy as np
s = pd.Series([2,3,np.nan,7,"The Hobbit"])
Now evaluating the Series s,
the output shows each value as expected, including
index 2 which
we explicitly set as missing.
In [2]: s
Out[2]:
0 2
1 3
2 NaN
3 7
4 The Hobbit
dtype: object
To test the isnull() method
on this series, we can use s.isnull() and
view the output:
In [3]: s.isnull()
Out[3]:
0 False
1 False
2 True
3 False
4 False
dtype: bool
As expected, the only value evaluated
as missing is
index 2.
Determine if ANY Value in a Series is Missing
While the isnull() method
is useful, sometimes we may wish to evaluate whether any value
is missing in
a Series.
There are a few possibilities involving chaining multiple methods
together.
The fastest method is
performed by chaining .values.any():
In [4]: s.isnull().values.any()
Out[4]:
True
In some cases, you may wish to
determine how
many missing values
exist in the collection, in which case you can
use .sum() chained
on:
In [5]: s.isnull().sum()
Out[5]:
1
Count Missing Values in DataFrame
While the chain of .isnull().values.any() will
work for a DataFrame object
to indicate if any value is missing,
in some cases it may be useful to also count the number
of missing values
across the entire DataFrame.
Since DataFrames are
inherently multidimensional, we must
invoke two methods of
summation.
For example, first we need to create a
simple DataFrame with a
few missingvalues:
In [6]: df = pd.DataFrame(np.random.randn(5,5))
df[df > 0.9] = pd.np.nan
Now if we chain a .sum() method
on, instead of getting the total sum
of missingvalues,
we’re given a list of all the summations of
each column:
In [7]: df.isnull().sum()
Out[7]:
0 3
1 0
2 1
3 1
4 0
dtype: int64
We can see in this example, our first column contains
three missing values,
along with one each in column 2 and 3 as
well.
In order to get the total
summation of
all missing values
in the DataFrame, we chain
two .sum() methods
together:
In [8]: df.isnull().sum().sum()
Out[8]:
5