Contents:
·describe()
·info()
·Outlier detection and handling
·numpy
·DataFrame
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# Two tables with the same columns, filled with random values
score = DataFrame(data={
    "name": ["fom", "gom", "hom", "jom", "kom", "lom"],
    "score": np.random.randint(0, 100, size=6),
    "address": np.random.randint(1000, 2000, size=6)
})
score_copy = DataFrame(data={
    "name": ["fom", "gom", "hom", "jom", "kom", "lom"],
    "score": np.random.randint(0, 100, size=6),
    "address": np.random.randint(1000, 2000, size=6)
})
# Join side by side; the result has duplicate "score" and "address" labels.
# drop() removes both "name" columns, so a single one is added back afterwards.
score = pd.concat(objs=(score, score_copy), axis=1).drop(labels="name", axis=1)
score["name"] = ["fom", "gom", "hom", "jom", "kom", "lom"]
print(score)
print()
【describe()】
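describe() only summarizes columns whose dtype supports arithmetic (numeric columns).
It gives a quick view of each column's summary statistics, which makes outliers easier to spot.
count: number of non-null values
mean: mean
std: standard deviation
min: minimum
max: maximum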
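To make the "numeric columns only" behaviour concrete, here is a minimal sketch on a tiny hand-written table. The names `demo`, `summary` and `full` are illustrative and do not come from the original notes; `include="all"` is the pandas option that also pulls in non-numeric columns.

```python
import pandas as pd

# Hypothetical demo table: one text column, one numeric column
demo = pd.DataFrame({
    "name": ["fom", "gom", "hom", "jom"],
    "score": [60, 70, 80, 90],
})

# By default only the numeric column is summarized
summary = demo.describe()
print(summary)
print(list(summary.columns))

# A single statistic can be read with .loc[row_label, column_label]
print(summary.loc["mean", "score"])

# include="all" adds the text column as well (its numeric stats show as NaN)
full = demo.describe(include="all")
print(full)
```

Reading individual cells this way is handy when a threshold, such as the mean or std, is needed later in an outlier check.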
"""describe()"""
# describe() only covers columns with numeric dtypes
# count: number of non-null values
# mean: mean
# std: standard deviation
# min: minimum
# max: maximum
print(score.describe())
print(score.describe().loc["count"])
print()
【info()】
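Column: the column labels
Non-Null Count: how many non-null values each column holds, which shows whether anything is missing
Dtype: the data type of each column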
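A small sketch of how these fields look when a column does contain a missing value. The table `demo` and the variable names are made up for illustration. It also shows that info() writes its report to an output stream and returns None, so the report is captured here through the `buf` parameter.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical demo table with one missing score
demo = pd.DataFrame({
    "name": ["fom", "gom", "hom"],
    "score": [90.0, np.nan, 75.0],
})

# info() writes into buf instead of stdout; its return value is None
buf = io.StringIO()
result = demo.info(buf=buf)
report = buf.getvalue()

print(report)   # the "score" row reports fewer non-null values than "name"
print(result)
```

Because the return value is None, calling `demo.info()` directly is enough; wrapping it in print() only adds a stray "None" line.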
"""info()"""
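# Column: column labels
# Non-Null Count: number of non-null values per column
# Dtype: data type of each column
score.info()  # info() prints the report itself and returns None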
print()
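【Outlier detection and handling】
A common rule for flagging outliers: if the data is roughly normally distributed, a value counts as an outlier when it lies more than three standard deviations from the mean, i.e. |data - data.mean()| > 3 * data.std(). For standard normal data the mean is 0, so this reduces to |data| > 3 * data.std().
This kind of outlier check applies to any numeric data.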
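The rule can be sketched on data whose mean is not zero, where centering on the mean matters. This is a minimal illustration under assumptions of my own: the seed, the distribution parameters and the planted value 120.0 are not from the original notes.

```python
import numpy as np

# Reproducible sample with mean near 50 and standard deviation near 5 (assumed values)
rng = np.random.default_rng(0)
data = rng.normal(loc=50.0, scale=5.0, size=1000)

# Plant one obviously extreme value so there is something to find
data[10] = 120.0

mean = data.mean()
std = data.std()

# 3-sigma rule: flag values more than 3 standard deviations away from the mean
mask = np.abs(data - mean) > 3 * std
outliers = data[mask]

print(outliers)
```

Without subtracting the mean, almost every value here would exceed `3 * std`, so the comparison has to be made on the distance from the mean rather than on the raw values.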
(numpy)
"""numpy"""
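# Sample from a standard normal distribution (mean 0)
some_num = np.random.randn(1000)
# Take the threshold from the raw data; the std of np.abs(data) would be too small
right_num = 3 * some_num.std()
print(some_num[np.abs(some_num) > right_num])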
print()
(DataFrame)
"""DataFrame"""
score = DataFrame(data={
    "chinese": np.random.randint(0, 100, size=6),
    "math": np.random.randint(0, 100, size=6),
    "english": np.random.randint(0, 100, size=6)},
    index=["fom", "gom", "hom", "jom", "kom", "lom"]
)
print(score)
right_num = score.std()
# Keep values within 3 standard deviations of their column mean; the rest become NaN
print(score[(score - score.mean()).abs() <= 3 * right_num])
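For the handling step, one possible approach is to blank out the flagged cells and then either drop the affected rows or fill them in. The sketch below uses a hand-written table with one planted value (500 in "math"); the data, the variable names and the choice of the column median as the fill value are assumptions for illustration, not the method of the original notes.

```python
import pandas as pd

# Hypothetical scores; the last "math" value is a planted outlier
scores = pd.DataFrame({
    "chinese": [80, 82, 78, 85, 81, 79, 83, 84, 80, 82, 81, 77],
    "math":    [70, 72, 68, 75, 71, 69, 73, 74, 70, 72, 71, 500],
})

# Detection: distance from the column mean compared with 3 column standard deviations
is_outlier = (scores - scores.mean()).abs() > 3 * scores.std()

# Handling option 1: drop every row that contains a flagged cell
dropped = scores[~is_outlier.any(axis=1)]

# Handling option 2: blank the flagged cells, then fill with the median of the remaining values
masked = scores.mask(is_outlier)
filled = masked.fillna(masked.median())

print(is_outlier.sum())
print(dropped)
print(filled)
```

Note that with very few rows a single value can never be more than 3 sample standard deviations from the mean, so this check needs a reasonably sized column (the sketch uses twelve rows) before it can flag anything.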