source:https://towardsdatascience.com/23-great-pandas-codes-for-data-scientists-cca5ed9d8a38
Basic dataset information
(1) Read CSV data
pd.read_csv("csv_file")
The older pd.DataFrame.from_csv("csv_file") was removed in pandas 1.0, so use pd.read_csv instead.
(2) Read Excel data
pd.read_excel("excel_file")
(3) Write data directly to CSV
df.to_csv("data.csv", sep=",", index=False)
# Comma-separated, without the index
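A minimal round-trip sketch of the read and write calls above. The sample frame is hypothetical, and an in-memory buffer stands in for a file on disk so the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"name": ["a", "b"], "size": [1, 2]})

# With no path argument, to_csv returns the CSV text as a string.
csv_text = df.to_csv(sep=",", index=False)

# read_csv accepts any file-like object, so wrap the text in StringIO.
df_back = pd.read_csv(io.StringIO(csv_text))
```

Because index=False was used when writing, the frame read back gets a fresh default integer index.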
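(4) Basic information about the features
df.info()
(5) Summary statistics of the data
print(df.describe())
(6) Print the data as a table
from tabulate import tabulate
print(tabulate(print_table, headers=headers))
Here "print_table" is a list of lists and "headers" is a list of the string headers.
(7) Print column names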
df.columns
Basic data handling
(1) Drop missing values
df.dropna(axis=0, how='any')
Returns a new object with the labels on the given axis dropped where any (how='any') or all (how='all') of the data are missing.
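To make the difference between how='any' and how='all' concrete, here is a small sketch on a hypothetical frame with three rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 has one NaN, row 1 is all NaN, row 2 is complete.
df = pd.DataFrame({
    "height": [1.0, np.nan, 3.0],
    "size": [np.nan, np.nan, 6.0],
})

# how="any": drop a row if at least one value is missing.
any_dropped = df.dropna(axis=0, how="any")

# how="all": drop a row only if every value is missing.
all_dropped = df.dropna(axis=0, how="all")
```

Both calls return new frames; df itself is left unchanged.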
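(2) Replace missing values
df.replace(to_replace=None, value=None)
Replaces the values given in "to_replace" with "value". To fill missing values specifically, pass np.nan as "to_replace" or call df.fillna(value).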
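A short sketch of two ways to replace missing values, assuming that 0 is an acceptable fill value for the hypothetical "size" column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing entry.
df = pd.DataFrame({"size": [1.0, np.nan, 3.0]})

# fillna targets missing values directly.
filled = df.fillna(0)

# replace does the same when NaN is given as the value to replace.
replaced = df.replace(to_replace=np.nan, value=0)
```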
(3) Check for null values
pd.isnull(object)
Detect missing values (NaN in numeric arrays, None/NaN in object arrays)
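As an illustration, pd.isnull can be applied to a whole frame, and the resulting boolean mask can be summed to count missing values per column. The sample data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value in each column.
df = pd.DataFrame({"name": ["a", None], "size": [1.0, np.nan]})

# Element-wise mask: True where a value is missing.
mask = pd.isnull(df)

# True counts as 1, so summing gives the number of missing values per column.
missing_per_column = mask.sum()
```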
(4) Drop a feature
df.drop('feature_variable_name', axis=1)
axis is 0 for rows or 1 for columns.
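A small sketch of both axis values on a hypothetical frame. Note that drop returns a new frame by default rather than modifying the original.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"name": ["a", "b"], "size": [1, 2]})

# axis=1: drop the column labelled "size".
without_size = df.drop("size", axis=1)

# axis=0: drop the row labelled 0.
without_first_row = df.drop(0, axis=0)
```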
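(5) Convert an object column to float
pd.to_numeric(df["feature_name"], errors='coerce')
Converts object types to numeric so that computations can be performed on them (for example, when the values are stored as strings). With errors='coerce', values that cannot be parsed become NaN.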
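The sketch below shows the conversion on a hypothetical column of strings, one of which ("n/a") is not a number.

```python
import pandas as pd

# Hypothetical column stored as strings; "n/a" cannot be parsed.
df = pd.DataFrame({"feature_name": ["1.5", "2", "n/a"]})

# errors="coerce" turns unparseable entries into NaN instead of raising.
df["feature_name"] = pd.to_numeric(df["feature_name"], errors="coerce")
```

The result has to be assigned back to the column, because pd.to_numeric returns a new Series.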
(6) Convert the data to a NumPy array
df.to_numpy()
The older df.as_matrix() was removed in pandas 1.0.
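A minimal sketch of the conversion, assuming a pandas version that provides DataFrame.to_numpy (0.24 or later). The sample frame is hypothetical.

```python
import pandas as pd

# Hypothetical all-numeric frame.
df = pd.DataFrame({"height": [1.0, 2.0], "size": [3.0, 4.0]})

# Rows of the frame become rows of a 2-D NumPy array.
arr = df.to_numpy()
```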
(7) Print the first n rows
df.head(n)
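(8) Get data by feature name
df.loc[:, feature_name]
df.loc[feature_name] on its own selects by row label; for a column, use df.loc[:, feature_name] or df[feature_name].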
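Since .loc is label-based, it helps to see row and column selection side by side. The row labels "r1" and "r2" below are hypothetical.

```python
import pandas as pd

# Hypothetical frame with explicit row labels.
df = pd.DataFrame(
    {"name": ["a", "b"], "size": [1, 2]},
    index=["r1", "r2"],
)

# A single label selects a row.
row = df.loc["r1"]

# ":" for all rows plus a column label selects a column.
column = df.loc[:, "size"]
```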
Data operations
(1) Apply a function to a set of data
This multiplies all values in the "height" column of the data frame by 2
df["height"].apply(lambda height: 2 * height)
OR
def multiply(x):
    return x * 2
df["height"].apply(multiply)
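For simple arithmetic like this, a vectorized expression gives the same result as apply and is usually faster. A sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"height": [1.5, 2.0]})

# apply calls the function once per element and returns a new Series.
via_apply = df["height"].apply(lambda height: 2 * height)

# The vectorized form multiplies the whole column at once.
vectorized = df["height"] * 2

# Neither form changes df; assign the result back to keep it.
df["height"] = vectorized
```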
(2) Rename a column
Here we will rename the 3rd column of the data frame to "size"
df.rename(columns = {df.columns[2]:'size'}, inplace=True)
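As an alternative to inplace=True, rename can return a new frame and leave the original untouched. The column names "height" and "width" below are hypothetical.

```python
import pandas as pd

# Hypothetical frame; the 3rd column (position 2) is "width".
df = pd.DataFrame({"name": ["a"], "height": [1.0], "width": [2.0]})

# Look up the 3rd column by position and map it to the new name.
renamed = df.rename(columns={df.columns[2]: "size"})
```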
(3) Get the unique entries of a column
Here we will get the unique entries of the column "name"
df["name"].unique()
(4) Access a subset of the data
Here we'll grab a selection of the columns, "name" and "size", from the data frame
new_df = df[["name", "size"]]
(5) Get basic information about the data
# Sum of values in a data frame
df.sum()
# Lowest value of a data frame
df.min()
# Highest value
df.max()
# Index of the lowest value
df.idxmin()
# Index of the highest value
df.idxmax()
# Statistical summary of the data frame, with quartiles, median, etc.
df.describe()
# Average values
df.mean()
# Median values
df.median()
# Correlation between columns
df.corr()
# To get these values for only one column, just select it like this
df["size"].median()
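A few of the summaries above, run on a small hypothetical frame. Called on a data frame they work column by column; called on a single column they return one value.

```python
import pandas as pd

# Hypothetical all-numeric frame.
df = pd.DataFrame({"height": [1.0, 2.0, 6.0], "size": [5, 3, 4]})

# One result per column.
col_min = df.min()
idx_of_min = df.idxmin()

# One result for the selected column.
median_size = df["size"].median()
```

Here the smallest "size" is 3, found at index label 1, and the median of "size" is 4.0.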
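(6) Sort the data
df.sort_values(by="size", ascending=False)
DataFrame.sort_values requires "by", which names the column or columns to sort on.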
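A sketch of sorting in descending order, assuming the frame should be sorted on a hypothetical "size" column.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"name": ["a", "b", "c"], "size": [2, 5, 3]})

# Sort rows by "size", largest first; the original row labels are kept.
sorted_df = df.sort_values(by="size", ascending=False)
```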
(7) Boolean indexing
Here we'll filter the data frame on the column named "size" to show only the rows where the value equals 5
df[df["size"] == 5]
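One way to sketch the filter described above: build a boolean mask from the comparison, then use it to select rows. The sample data is hypothetical.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"name": ["a", "b", "c"], "size": [5, 3, 5]})

# Comparing a column with a value gives a boolean Series.
mask = df["size"] == 5

# Indexing with the mask keeps only the rows where it is True.
filtered = df[mask]

# The same mask works with .loc to pick a column at the same time.
names = df.loc[mask, "name"]
```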