Machine Learning and Data Science (2): Introduction to pandas in Python

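Article link: https://blog.csdn.net/stellalxy/article/details/125177038

This article introduces the basic operations of the pandas library in Python, including creating Series and DataFrames, importing and exporting data, describing data, viewing and selecting data, and manipulating data. It focuses on commonly used methods such as .head(), .tail(), .describe(), and .groupby(), along with data cleaning and visualization techniques.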

Websites:
10 minutes to pandas
Data Manipulation with Pandas

Link to Jupyter Notebook

Introduction to pandas

Some commonly used pandas functions and methods are listed below.

Creating a series

import pandas as pd  # the rest of this article assumes pandas is imported as pd
series_name = pd.Series(["1", "2", "3"])

Creating a dataframe

dataframe_name = pd.DataFrame({"key1_name": value1_name, "key2_name": value2_name})
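As a concrete illustration of the two templates above, here is a minimal sketch. The column names and values are made up for this example and are not from the original notebook.

```python
import pandas as pd

# Hypothetical example data, invented for illustration
colours = pd.Series(["red", "blue", "white"])
prices = pd.Series([4000, 5500, 7000])

# Each dictionary key becomes a column name,
# and each Series supplies that column's values
cars = pd.DataFrame({"colour": colours, "price": prices})

print(cars)
```

Plain Python lists can be used as the dictionary values as well; a Series is not required.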

Importing data

dataframe_name = pd.read_csv("file_name")

dataframe_name = pd.read_csv("URL_of_the_file")

Exporting data

dataframe_name.to_csv("file_name_you_want_to_store_as")
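A quick way to check an export is to write a file and read it straight back. The sketch below does this in a temporary directory; the file name cars.csv and the data are assumptions made for the example. Passing index=False is optional, but without it the row index is written as an extra unnamed column.

```python
import os
import tempfile

import pandas as pd

# Hypothetical example data
cars = pd.DataFrame({"colour": ["red", "blue"], "price": [4000, 5500]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cars.csv")  # hypothetical file name
    # index=False stops pandas from writing the row index as a column
    cars.to_csv(path, index=False)
    # Read the same file back into a new DataFrame
    reloaded = pd.read_csv(path)

print(reloaded)
```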

Describing data

.dtypes

.dtypes shows us what datatype each column contains.

dataframe_name.dtypes

.describe()

.describe() gives you a quick statistical overview of the numerical columns.

dataframe_name.describe()

.info()

.info() prints a concise summary of a DataFrame: the number of rows, each column's name, non-null count and data type, and the memory usage.

dataframe_name.info()
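The three descriptions above can be tried on a small table. This is only a sketch with invented data; the point is that .describe() summarises the numeric column and skips the text column by default.

```python
import pandas as pd

# Hypothetical example data
cars = pd.DataFrame({
    "colour": ["red", "blue", "white"],
    "price": [4000, 5500, 7000],
})

column_types = cars.dtypes   # one data type per column
summary = cars.describe()    # count, mean, std, min, quartiles, max (numeric columns only)
cars.info()                  # prints row count, non-null counts, dtypes, memory usage

print(column_types)
print(summary)
```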

.mean(), .sum()

You can also call various statistical and mathematical methods such as .mean() or .sum() directly on a DataFrame or Series.

dataframe_name.mean(numeric_only=True)  # numeric_only=True skips text columns, which raise an error in pandas 2.0+

series_name.mean()

dataframe_name.sum()

series_name.sum()
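To see the difference between calling these methods on a whole DataFrame and on a single column, consider the sketch below. The data is invented, and numeric_only=True is used so that the text column is ignored rather than causing an error.

```python
import pandas as pd

# Hypothetical example data
cars = pd.DataFrame({
    "colour": ["red", "blue", "white"],
    "price": [4000, 5500, 7000],
    "doors": [4, 3, 5],
})

# On a DataFrame: one result per numeric column
column_means = cars.mean(numeric_only=True)

# On a Series (a single column): one number
price_total = cars["price"].sum()

print(column_means)
print(price_total)
```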

.columns

.columns will show you all the columns of a DataFrame.

dataframe_name.columns

.index

.index will show you the values in a DataFrame’s index (the row labels displayed on the far left).

dataframe_name.index

len()

len() returns the length of a DataFrame, which is its number of rows.

len(dataframe_name)
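Here is a small sketch of .columns, .index and len() together, using invented data with the default integer index.

```python
import pandas as pd

# Hypothetical example data
cars = pd.DataFrame({
    "colour": ["red", "blue", "white"],
    "price": [4000, 5500, 7000],
})

column_names = list(cars.columns)  # ['colour', 'price']
row_labels = list(cars.index)      # [0, 1, 2]
row_count = len(cars)              # 3

print(column_names, row_labels, row_count)
```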

Viewing and selecting data

.head()

.head() allows you to view the first rows of your DataFrame. It shows 5 rows by default; pass a number, such as .head(10), to see more or fewer.

dataframe_name.head()

.tail()

.tail() allows you to see the bottom 5 rows of your DataFrame. This is helpful if your changes are influencing the bottom rows of your data.

dataframe_name.tail()
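Both methods accept an optional row count. A minimal sketch on an invented ten-row DataFrame:

```python
import pandas as pd

# Hypothetical example data: ten rows numbered 0 to 9
numbers = pd.DataFrame({"n": range(10)})

first_rows = numbers.head()    # default: first 5 rows
last_rows = numbers.tail(3)    # last 3 rows

print(first_rows)
print(last_rows)
```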

.loc[]

.loc[] selects by index label. It returns whichever rows of your Series or DataFrame have an index label matching the value you pass. The label is often an integer, but it can also be a string or another label type.

dataframe_name.loc[index_label]

series_name.loc[index_label]

.iloc[]

.iloc[] does a similar thing but works with integer positions: 0 is the first row, 1 is the second, and so on, regardless of the index labels.

dataframe_name.iloc[position]

series_name.iloc[position]
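The difference between the two only shows up when the index labels do not match the row positions. The sketch below uses an invented Series with a custom, partly repeated index to make that visible.

```python
import pandas as pd

# Hypothetical example data with a custom index; the label 3 appears twice
animals = pd.Series(["cat", "dog", "bird", "panda"], index=[0, 3, 9, 3])

# .loc[3] matches the LABEL 3, so it returns both "dog" and "panda"
by_label = animals.loc[3]

# .iloc[3] matches the POSITION 3, so it returns the fourth item, "panda"
by_position = animals.iloc[3]

print(by_label)
print(by_position)
```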

Select a column

If you want to select a particular column, you can use ['COLUMN_NAME'].

dataframe_name['column name']

Select the rows you want

Boolean indexing builds on column selection. A comparison on a column produces a True/False value for every row, and placing that comparison inside the brackets selects only the rows which fulfill the condition.

dataframe_name[dataframe_name['column name'] > threshold_value]
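It can help to split boolean indexing into its two steps: building the True/False mask, then using it to filter. The column names and the 100000 threshold below are invented for the example.

```python
import pandas as pd

# Hypothetical example data
cars = pd.DataFrame({
    "make": ["Toyota", "Honda", "BMW"],
    "odometer_km": [150000, 80000, 30000],
})

# Step 1: the comparison gives one True/False value per row
mask = cars["odometer_km"] > 100000

# Step 2: the mask keeps only the rows where it is True
high_mileage = cars[mask]

print(mask.tolist())
print(high_mileage)
```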

pd.crosstab()

pd.crosstab() is a great way to view two different columns together and compare them. It builds a table that counts how often each combination of values from the two columns occurs.

pd.crosstab(dataframe_name["column_name_1"], dataframe_name["column_name_2"])
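As a sketch with invented data, the table below counts how many cars of each make have each number of doors. Combinations that never occur are filled with 0.

```python
import pandas as pd

# Hypothetical example data
cars = pd.DataFrame({
    "make": ["Toyota", "Honda", "Toyota", "BMW"],
    "doors": [4, 4, 3, 5],
})

# Rows are the unique makes, columns are the unique door counts
counts = pd.crosstab(cars["make"], cars["doors"])

print(counts)
```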

.groupby()

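If you want to compare several columns in the context of another column, you can use .groupby(). It splits the rows into groups that share the same value in the chosen column, then summarises each group.

# Group by one column and find the mean of the other numeric columns
dataframe_name.groupby(["column_name"]).mean(numeric_only=True)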
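A minimal sketch of grouping with invented data: the two Toyota rows are averaged together, while the text column is skipped because of numeric_only=True.

```python
import pandas as pd

# Hypothetical example data
cars = pd.DataFrame({
    "make": ["Toyota", "Honda", "Toyota", "BMW"],
    "colour": ["white", "red", "blue", "black"],
    "price": [4000, 5000, 6000, 22000],
})

# One row per make, holding the mean of each numeric column
mean_by_make = cars.groupby("make").mean(numeric_only=True)

print(mean_by_make)
```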

%matplotlib inline

%matplotlib inline is a special command which tells Jupyter to show your plots. Commands with % at the front are called magic commands.

# Import matplotlib and tell Jupyter to show plots
import matplotlib.pyplot as plt
%matplotlib inline

.plot()

You can visualize a column by calling .plot() on it.

dataframe_name["column_name"].plot()

.hist()

You can see the distribution of a column by calling .hist() on it.

dataframe_name["column_name"].hist()

Manipulating data

Set a column to lowercase

# Lowercase the column (this returns a new Series and leaves the DataFrame unchanged)
dataframe_name["column_name"].str.lower()
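Because .str.lower() returns a new Series, the change is lost unless the result is assigned back to the column. A short sketch with invented data:

```python
import pandas as pd

# Hypothetical example data
cars = pd.DataFrame({"make": ["Toyota", "HONDA", "bmw"]})

# Calling .str.lower() alone does not modify cars
lowered = cars["make"].str.lower()
unchanged = cars["make"].tolist()  # still the original mixed-case values

# Assign the result back to keep the change
cars["make"] = lowered

print(cars)
```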

inplace, .fillna()

Some methods have a parameter called inplace. When inplace=True, the DataFrame is updated in place without having to reassign it. When inplace=False (the default), the original is left alone and the method returns a modified copy.

.fillna() is a method which fills missing data.

# With inplace=False, the column itself is not changed; a filled copy is returned
dataframe_name["column_name"].fillna(dataframe_name["column_name"].mean(), 
                                     inplace=False)
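The sketch below fills a missing value with the column mean and assigns the result back, which is an alternative to inplace=True. The data is invented; note that .mean() ignores missing values when computing the average.

```python
import numpy as np
import pandas as pd

# Hypothetical example data with one missing value
cars = pd.DataFrame({"odometer": [100.0, np.nan, 300.0]})

# Returns a filled copy; cars is not changed yet
filled = cars["odometer"].fillna(cars["odometer"].mean())
missing_before = int(cars["odometer"].isna().sum())

# Assign back to keep the filled values
cars["odometer"] = filled
missing_after = int(cars["odometer"].isna().sum())

print(cars)
```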

.dropna()

Let’s say you wanted to remove any rows which had missing data and only work with rows which had complete coverage.

You can do this using .dropna().

dataframe_name.dropna(inplace=True)
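For comparison, here is a sketch of .dropna() without inplace=True, using invented data. The result is a new DataFrame containing only the complete rows, and the original keeps all of its rows.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: only the first row is complete
cars = pd.DataFrame({
    "make": ["Toyota", None, "BMW"],
    "price": [4000.0, 5000.0, np.nan],
})

# Drop every row that has at least one missing value
complete_rows = cars.dropna()

print(complete_rows)
```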

.drop('COLUMN_NAME', axis=1)

You can remove a column using .drop('COLUMN_NAME', axis=1).

dataframe_name = dataframe_name.drop("column_name", axis=1)
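A short sketch with invented data. It also shows the columns= keyword, which pandas accepts as an alternative to axis=1 and which some readers find easier to remember.

```python
import pandas as pd

# Hypothetical example data
cars = pd.DataFrame({
    "make": ["Toyota", "Honda"],
    "price": [4000, 5500],
    "doors": [4, 4],
})

# axis=1 tells pandas that "doors" is a column, not a row label
without_doors = cars.drop("doors", axis=1)

# Equivalent spelling using the columns keyword
same_result = cars.drop(columns=["doors"])

print(without_doors)
```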

.sample(frac=X)

To shuffle the order of a DataFrame, you can use .sample(frac=1).

.sample() randomly samples rows from a DataFrame. The frac parameter sets the fraction of rows to return, where 1 = 100% of rows, 0.5 = 50% of rows, and 0.01 = 1% of rows.

dataframe_1 = dataframe_name.sample(frac=1)

.reset_index()

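After shuffling, the index labels are out of order. .reset_index() returns a new DataFrame with a fresh 0, 1, 2, ... index. The old index is kept as an extra column unless you pass drop=True.

dataframe_1.reset_index()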
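Shuffling and resetting usually go together, so here is a sketch of both steps with invented data. The random_state argument is an addition for this example; it only makes the shuffle repeatable.

```python
import pandas as pd

# Hypothetical example data
cars = pd.DataFrame({"make": ["Toyota", "Honda", "BMW", "Nissan"]})

# frac=1 returns all rows in random order; random_state makes it repeatable
shuffled = cars.sample(frac=1, random_state=42)

# drop=True discards the old, shuffled index instead of keeping it as a column
tidy = shuffled.reset_index(drop=True)

print(shuffled)
print(tidy)
```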

.apply()

What if you wanted to apply a function to every value in a column? You can do so using the .apply() method and passing it a function, such as a lambda.

dataframe_name["column_name"].apply(lambda x: x * 2)  # x * 2 is a placeholder; use your own expression
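As a sketch of .apply() on invented data, the lambda below converts a distance column from kilometres to metres and stores the result in a new column. The column names are assumptions for this example.

```python
import pandas as pd

# Hypothetical example data
cars = pd.DataFrame({"odometer_km": [150, 80, 30]})

# The lambda is called once per value in the column
cars["odometer_m"] = cars["odometer_km"].apply(lambda km: km * 1000)

print(cars)
```

For simple arithmetic like this, writing cars["odometer_km"] * 1000 directly gives the same values; .apply() is most useful when the logic does not fit a single vectorised expression.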