website:
10 minutes to pandas
Data Manipulation with Pandas
Contents
Link to Jupyter Notebook
Some common pandas functions
Creating a series
import pandas as pd
series_name = pd.Series(["1", "2", "3"])
Creating a dataframe
dataframe_name = pd.DataFrame({"key1_name": value1_name, "key2_name": value2_name})
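As a minimal sketch of the two constructors above, the example below builds a Series and a DataFrame from made-up values. The names `car_makes`, `colours`, and `cars`, and the data itself, are illustrative assumptions rather than part of the original notes.

```python
import pandas as pd

# Hypothetical example data (not from the original notes)
car_makes = pd.Series(["BMW", "Toyota", "Honda"])
colours = pd.Series(["Red", "Blue", "White"])

# Each dictionary key becomes a column name; each value becomes that column's data
cars = pd.DataFrame({"make": car_makes, "colour": colours})
print(cars)
```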
Importing data
dataframe_name = pd.read_csv("file_name")
OR
dataframe_name = pd.read_csv("URL_of_the_file")
Exporting data
dataframe_name.to_csv("file_name_you_want_to_store_as")
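To see importing and exporting together, here is a small round-trip sketch that writes a DataFrame to a temporary CSV file and reads it back. The file name `cars.csv` and the data are assumptions for illustration; `index=False` is an optional argument that keeps the row index out of the file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical example data
cars = pd.DataFrame({"make": ["BMW", "Toyota"], "price": [22000, 4000]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "cars.csv")
    # Export: index=False leaves the row index out of the CSV
    cars.to_csv(csv_path, index=False)
    # Import the same file back into a new DataFrame
    reloaded = pd.read_csv(csv_path)

print(reloaded)
```

Without `index=False`, the index is written as an extra unnamed column, which then shows up as a column when the file is read back.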
Describing data
.dtypes
.dtypes shows us what datatype each column contains.
dataframe_name.dtypes
.describe()
.describe() gives you a quick statistical overview of the numerical columns.
dataframe_name.describe()
.info()
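.info() shows a concise summary of a DataFrame: the index range, the column names, how many non-null values each column has, the column dtypes, and the memory usage.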
dataframe_name.info()
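The three describing tools above can be tried on a small made-up DataFrame, as in this sketch. The `cars` data is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical example data
cars = pd.DataFrame({
    "make": ["BMW", "Toyota", "Honda"],
    "price": [22000, 4000, 7000],
})

# One dtype per column
print(cars.dtypes)

# Count, mean, std, min, quartiles and max for the numeric columns only
summary = cars.describe()
print(summary)

# Prints the summary directly (it does not return a DataFrame)
cars.info()
```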
.mean(), .sum()
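You can also call various statistical and mathematical methods such as .mean() or .sum() directly on a DataFrame or Series. If the DataFrame also has non-numeric columns, .mean(numeric_only=True) restricts the calculation to the numeric ones.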
dataframe_name.mean()
OR
series_name.mean()
OR
dataframe_name.sum()
OR
series_name.sum()
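A short sketch of these methods on a Series and on a DataFrame follows. The data is made up, and `numeric_only=True` is used so the text column is skipped (recent pandas versions raise an error when asked for the mean of a text column).

```python
import pandas as pd

# Hypothetical example data
cars = pd.DataFrame({
    "make": ["BMW", "Toyota", "Honda"],
    "doors": [5, 4, 4],
    "price": [22000, 4000, 7000],
})

# On a Series: a single number
price_mean = cars["price"].mean()

# On a DataFrame: one result per numeric column
numeric_means = cars.mean(numeric_only=True)
numeric_sums = cars.sum(numeric_only=True)

print(price_mean)
print(numeric_means)
print(numeric_sums)
```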
.columns
.columns will show you all the columns of a DataFrame.
dataframe_name.columns
.index
.index will show you the values in a DataFrame’s index (the column on the far left).
dataframe_name.index
len()
len() returns the number of rows in a DataFrame.
len(dataframe_name)
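The following sketch shows .columns, .index, and len() side by side. The custom index labels `"a"`, `"b"`, `"c"` are an assumption chosen to make the index easy to spot.

```python
import pandas as pd

# Hypothetical example data with explicit index labels
cars = pd.DataFrame(
    {"make": ["BMW", "Toyota", "Honda"], "price": [22000, 4000, 7000]},
    index=["a", "b", "c"],
)

print(cars.columns)  # the column labels
print(cars.index)    # the row labels
print(len(cars))     # the number of rows
```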
Viewing and selecting data
.head()
.head() allows you to view the first 5 rows of your DataFrame by default (pass a number, such as .head(10), to see more).
dataframe_name.head()
.tail()
.tail() allows you to see the bottom 5 rows of your DataFrame. This is helpful if your changes are influencing the bottom rows of your data.
dataframe_name.tail()
.loc[]
.loc[] takes an index label as input (the label may be an integer, a string, or another value) and returns every row of your Series or DataFrame whose index label matches it.
dataframe_name.loc[index_label]
OR
series_name.loc[index_label]
.iloc[]
.iloc[] does a similar thing but selects by integer position (counting from 0), regardless of the index labels.
dataframe_name.iloc[position]
OR
series_name.iloc[position]
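The difference between the two is easiest to see when the index labels are not in order. The sketch below uses a made-up Series with a deliberately shuffled, partly repeated index; the data and names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical Series whose index labels are out of order and contain a repeat
animals = pd.Series(
    ["cat", "dog", "bird", "panda", "snake"],
    index=[0, 3, 9, 8, 3],
)

# .loc[] matches on the index label: both rows labelled 3 are returned
by_label = animals.loc[3]

# .iloc[] matches on position: the fourth item, whatever its label
by_position = animals.iloc[3]

print(by_label)
print(by_position)
```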
Select a column
If you want to select a particular column, you can use ['COLUMN_NAME'].
dataframe_name['column name']
Select the rows you want
You can combine column selection with boolean indexing: a condition placed inside the brackets selects only the rows that satisfy it.
dataframe_name[dataframe_name['column name'] > threshold_value]
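Boolean indexing can be read as two steps: build a True/False mask, then use it to filter. This sketch assumes a made-up `odometer_km` column and a threshold of 100000.

```python
import pandas as pd

# Hypothetical example data
cars = pd.DataFrame({
    "make": ["Toyota", "Honda", "BMW"],
    "odometer_km": [150000, 87000, 32000],
})

# Step 1: a boolean Series, True where the condition holds
mask = cars["odometer_km"] > 100000

# Step 2: keep only the rows where the mask is True
high_mileage = cars[mask]
print(high_mileage)
```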
pd.crosstab()
pd.crosstab() counts how often each combination of values from two columns occurs, which makes it a great way to view the two columns together and compare them.
pd.crosstab(dataframe_name["column_name_1"], dataframe_name["column_name_2"])
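As a small sketch with made-up data, the crosstab below counts how many cars of each make have each number of doors. The column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical example data
cars = pd.DataFrame({
    "make": ["Toyota", "Honda", "Toyota", "BMW"],
    "doors": [4, 4, 4, 5],
})

# Rows are the unique makes, columns are the unique door counts,
# and each cell is the number of rows with that combination
counts = pd.crosstab(cars["make"], cars["doors"])
print(counts)
```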
.groupby()
If you want to compare more columns in the context of another column, you can use .groupby().
# Group by one column and find the mean of the other numeric columns
dataframe_name.groupby(["column_name"]).mean(numeric_only=True)
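Here is a minimal groupby sketch on made-up data. It passes `numeric_only=True` so that the text column `colour` is left out of the mean; all column names are assumptions.

```python
import pandas as pd

# Hypothetical example data
cars = pd.DataFrame({
    "make": ["Toyota", "Honda", "Toyota", "BMW"],
    "colour": ["White", "Red", "Blue", "Black"],
    "price": [4000, 5000, 7000, 22000],
})

# One row per make, holding the mean of each numeric column
mean_by_make = cars.groupby("make").mean(numeric_only=True)
print(mean_by_make)
```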
%matplotlib inline
%matplotlib inline is a special command which tells Jupyter to show your plots. Commands with % at the front are called magic commands.
# Import matplotlib and tell Jupyter to show plots
import matplotlib.pyplot as plt
%matplotlib inline
.plot()
You can visualize a column by calling .plot() on it.
dataframe_name["column_name"].plot()
.hist()
You can see the distribution of a column by calling .hist() on it.
dataframe_name["column_name"].hist()
Manipulating data
Set a column to lowercase
# Lowercase the column and assign the result back (.str.lower() returns a new Series)
dataframe_name["column_name"] = dataframe_name["column_name"].str.lower()
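A quick sketch of lowercasing a text column with made-up data. Note that .str.lower() returns a new Series, so the result is assigned back to the column here.

```python
import pandas as pd

# Hypothetical example data with inconsistent capitalisation
cars = pd.DataFrame({"make": ["Toyota", "HONDA", "bmw"]})

# Assign the lowercased values back to the same column
cars["make"] = cars["make"].str.lower()
print(cars)
```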
inplace, .fillna()
Some functions have a parameter called inplace. When inplace=True, the DataFrame is updated in place without having to reassign it; when inplace=False (the default), the function returns a new object and leaves the original unchanged.
.fillna() is a function which fills missing data.
# With inplace=False, the original column keeps its missing values; fillna() returns a new Series with them replaced by the mean
dataframe_name["column_name"].fillna(dataframe_name["column_name"].mean(), inplace=False)
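The sketch below, on made-up data, shows that calling .fillna() without reassigning leaves the original column untouched, and that assigning the result back is one way to keep the change. Reassignment is used here instead of `inplace=True` as a design choice, since it behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical example data with one missing value
cars = pd.DataFrame({"odometer_km": [150000.0, None, 32000.0]})

# fillna() returns a new Series; the original column is unchanged
filled = cars["odometer_km"].fillna(cars["odometer_km"].mean())
missing_before = int(cars["odometer_km"].isna().sum())

# Assign the result back to keep the filled values
cars["odometer_km"] = filled
missing_after = int(cars["odometer_km"].isna().sum())

print(missing_before, missing_after)
```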
.dropna()
Let’s say you wanted to remove any rows which had missing data and only work with rows which had complete coverage.
You can do this using .dropna().
dataframe_name.dropna(inplace=True)
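A minimal sketch of .dropna() on made-up data follows. It uses the returned DataFrame rather than `inplace=True`, so the original stays available for comparison; the names are assumptions.

```python
import pandas as pd

# Hypothetical example data: two of the three rows have a missing value
cars = pd.DataFrame({
    "make": ["Toyota", None, "BMW"],
    "price": [4000.0, 5000.0, None],
})

# Keep only the rows with no missing values in any column
complete_rows = cars.dropna()
print(complete_rows)
```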
.drop('COLUMN_NAME', axis=1)
You can remove a column using .drop('COLUMN_NAME', axis=1).
dataframe_name = dataframe_name.drop("column_name", axis=1)
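The sketch below drops a column from a made-up DataFrame in two equivalent ways: with `axis=1` as above, and with the `columns=` keyword, which some readers find clearer. The data is an assumption for illustration.

```python
import pandas as pd

# Hypothetical example data
cars = pd.DataFrame({
    "make": ["Toyota", "Honda"],
    "colour": ["White", "Red"],
    "price": [4000, 5000],
})

# axis=1 tells pandas that "colour" is a column label, not a row label
without_colour = cars.drop("colour", axis=1)

# The columns= keyword does the same thing
same_result = cars.drop(columns=["colour"])

print(without_colour)
```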
.sample(frac=X)
To shuffle the order of the dataframe you could use .sample(frac=1).
.sample() randomly samples different rows from a DataFrame. The frac parameter dictates the fraction, where 1 = 100% of rows, 0.5 = 50% of rows, 0.01 = 1% of rows.
dataframe_1 = dataframe_name.sample(frac=1)
.reset_index()
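After shuffling, the index labels are out of order. .reset_index(drop=True) renumbers the index from 0 and discards the old labels; without drop=True, the old index is kept as a new column. It returns a new DataFrame, so reassign the result.
dataframe_1 = dataframe_1.reset_index(drop=True)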
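Shuffling and resetting the index can be sketched together on made-up data. `random_state=42` is an optional argument added here only to make the shuffle repeatable, and `drop=True` discards the old index instead of keeping it as a column.

```python
import pandas as pd

# Hypothetical example data
cars = pd.DataFrame({"make": ["Toyota", "Honda", "BMW", "Nissan"]})

# frac=1 returns 100% of the rows in random order
shuffled = cars.sample(frac=1, random_state=42)

# Renumber the index 0..n-1 and reassign, since a new DataFrame is returned
shuffled = shuffled.reset_index(drop=True)
print(shuffled)
```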
.apply()
What if you wanted to apply a function to every value in a column? You can do so using the .apply() method and passing it a function, such as a lambda.
# Replace x * 2 with the expression you want to apply to each value
dataframe_name["column_name"].apply(lambda x: x * 2)
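As one possible use of .apply(), this sketch converts a made-up kilometres column to miles. The column names and the conversion are assumptions chosen for illustration; the result is stored in a new column so the original values are kept.

```python
import pandas as pd

KM_PER_MILE = 1.609344

# Hypothetical example data
cars = pd.DataFrame({"odometer_km": [150000, 87000, 32000]})

# The lambda is called once per value in the column
cars["odometer_miles"] = cars["odometer_km"].apply(lambda km: km / KM_PER_MILE)
print(cars)
```

For simple arithmetic like this, `cars["odometer_km"] / KM_PER_MILE` gives the same result without .apply(); .apply() is most useful when the logic cannot be written as a single column operation.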