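When you first get hold of a data table in a data analysis project, you usually need an overall statistical overview of it: the data types, the number of samples, whether there are null values, and a spot check of a few rows. This post introduces five commonly used functions for that purpose: info(), describe(), sample(), head(), and tail().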
info()
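info() reports the data type and the number of non-null values for each column of a DataFrame. The demonstration below uses the following sample dataset (survey_visited.csv):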
ident,site,dated
619,DR-1,1927-02-08
622,DR-1,1927-02-10
734,DR-3,1939-01-07
735,DR-3,1930-01-12
751,DR-3,1930-02-26
752,DR-3,
837,MSK-4,1932-01-14
844,DR-1,1932-03-22
import pandas as pd
data = pd.read_csv('survey_visited.csv')
# info() prints its report directly and returns None, so it is not wrapped in print()
data.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 8 entries, 0 to 7
# Data columns (total 3 columns):
#  #   Column  Non-Null Count  Dtype
# ---  ------  --------------  -----
#  0   ident   8 non-null      int64
#  1   site    8 non-null      object
#  2   dated   7 non-null      object
# dtypes: int64(1), object(2)
# memory usage: 324.0+ bytes
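Here info() reports only 7 non-null values in the dated column, while the RangeIndex line shows 8 entries in total, so the dated column contains one null value. info() also shows the data type of each of the three columns.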
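To confirm which column and which row hold the missing value that info() hints at, a per-column null count can help. The following is a minimal sketch, assuming the sample data is loaded from an in-memory string with io.StringIO instead of survey_visited.csv so that it runs without the file; the variable names are illustrative and not part of the original example.

```python
import io

import pandas as pd

# The sample dataset from above, inlined so the example needs no file on disk.
csv_text = """ident,site,dated
619,DR-1,1927-02-08
622,DR-1,1927-02-10
734,DR-3,1939-01-07
735,DR-3,1930-01-12
751,DR-3,1930-02-26
752,DR-3,
837,MSK-4,1932-01-14
844,DR-1,1932-03-22
"""
data = pd.read_csv(io.StringIO(csv_text))

# Count missing values per column; this complements the non-null counts from info().
null_counts = data.isna().sum()
print(null_counts)

# Select the rows that contain at least one missing value.
rows_with_nulls = data[data.isna().any(axis=1)]
print(rows_with_nulls)
```

With this dataset, the only row selected is the one with ident 752, whose dated field is empty in the CSV.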
describe()
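describe() shows summary statistics for the numeric columns: count, mean, standard deviation, minimum, maximum, and median (the 50% row). It also reports percentiles, which default to the quartiles 0.25 and 0.75 alongside the median, and these can be changed manually.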
import pandas as pd
data = pd.read_csv('survey_visited.csv')
print(data.describe())
#             ident
# count    8.000000
# mean   736.750000
# std     83.692891
# min    619.000000
# 25%    706.000000
# 50%    743.000000
# 75%    773.250000
# max    844.000000
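If you don't want the quartiles here, set the percentiles parameter manually. Note that it must be passed as a list, otherwise an error is raised. As the output shows, the 50% row (the median) still appears.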
import pandas as pd
data = pd.read_csv('survey_visited.csv')
print(data.describe(percentiles=[0.1,0.7,0.9]))
#             ident
# count    8.000000
# mean   736.750000
# std     83.692891
# min    619.000000
# 10%    621.100000
# 50%    743.000000
# 70%    751.900000
# 90%    839.100000
# max    844.000000
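The percentile rows in the describe() output can be cross-checked against Series.quantile(), which by default uses linear interpolation between neighbouring values. The sketch below assumes the same sample data, inlined with io.StringIO so it runs without the CSV file; the variable names are illustrative.

```python
import io

import pandas as pd

# The sample dataset, inlined so the example needs no file on disk.
csv_text = """ident,site,dated
619,DR-1,1927-02-08
622,DR-1,1927-02-10
734,DR-3,1939-01-07
735,DR-3,1930-01-12
751,DR-3,1930-02-26
752,DR-3,
837,MSK-4,1932-01-14
844,DR-1,1932-03-22
"""
data = pd.read_csv(io.StringIO(csv_text))

percentiles = [0.1, 0.7, 0.9]
summary = data.describe(percentiles=percentiles)

# quantile() computes the same values directly for a single column.
quantiles = data["ident"].quantile(percentiles)

for p, q in zip(percentiles, quantiles):
    label = f"{p:.0%}"  # describe() labels its rows "10%", "70%", "90%"
    print(label, summary.loc[label, "ident"], q)
```

For example, the 10% value sits at position 0.1 × 7 = 0.7 between the two smallest values 619 and 622, giving 619 + 0.7 × 3 = 621.1, which matches the describe() row.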
sample()
sample() draws a random sample of rows for spot-checking; set the parameter n to choose how many rows to draw.
import pandas as pd
data = pd.read_csv('survey_visited.csv')
# Rows are drawn at random, so the output differs from run to run
print(data.sample(n=3))
#    ident  site       dated
# 2    734  DR-3  1939-01-07
# 3    735  DR-3  1930-01-12
# 7    844  DR-1  1932-03-22
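Because sample() is random, repeated runs return different rows. If a repeatable spot check is needed, the random_state parameter can fix the seed, and frac can be used instead of n to draw a fraction of the rows. These two parameters are not covered in the text above; the sketch below is only an illustration, with the data inlined via io.StringIO and the seed value 42 chosen arbitrarily.

```python
import io

import pandas as pd

# The sample dataset, inlined so the example needs no file on disk.
csv_text = """ident,site,dated
619,DR-1,1927-02-08
622,DR-1,1927-02-10
734,DR-3,1939-01-07
735,DR-3,1930-01-12
751,DR-3,1930-02-26
752,DR-3,
837,MSK-4,1932-01-14
844,DR-1,1932-03-22
"""
data = pd.read_csv(io.StringIO(csv_text))

# The same seed gives the same rows on every call.
first = data.sample(n=3, random_state=42)
second = data.sample(n=3, random_state=42)
print(first)
print(first.equals(second))

# frac draws a fraction of the rows instead of a fixed count (here half of 8 rows).
half = data.sample(frac=0.5, random_state=42)
print(len(half))
```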
head()/tail()
head() and tail() show the first and last rows of the data, respectively. Both show 5 rows by default, and the parameter n adjusts how many rows are shown.
import pandas as pd
data = pd.read_csv('survey_visited.csv')
print(data.head())
print(data.tail())
#    ident  site       dated
# 0    619  DR-1  1927-02-08
# 1    622  DR-1  1927-02-10
# 2    734  DR-3  1939-01-07
# 3    735  DR-3  1930-01-12
# 4    751  DR-3  1930-02-26
#    ident   site       dated
# 3    735   DR-3  1930-01-12
# 4    751   DR-3  1930-02-26
# 5    752   DR-3         NaN
# 6    837  MSK-4  1932-01-14
# 7    844   DR-1  1932-03-22
import pandas as pd
data = pd.read_csv('survey_visited.csv')
print(data.head(n=3))
print(data.tail(n=4))
#    ident  site       dated
# 0    619  DR-1  1927-02-08
# 1    622  DR-1  1927-02-10
# 2    734  DR-3  1939-01-07
#    ident   site       dated
# 4    751   DR-3  1930-02-26
# 5    752   DR-3         NaN
# 6    837  MSK-4  1932-01-14
# 7    844   DR-1  1932-03-22
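For a positive n, head(n) and tail(n) select the same rows as positional slicing with iloc. The sketch below illustrates that equivalence as a way to see what the two functions do; it assumes the same sample data, inlined with io.StringIO so it runs without the CSV file.

```python
import io

import pandas as pd

# The sample dataset, inlined so the example needs no file on disk.
csv_text = """ident,site,dated
619,DR-1,1927-02-08
622,DR-1,1927-02-10
734,DR-3,1939-01-07
735,DR-3,1930-01-12
751,DR-3,1930-02-26
752,DR-3,
837,MSK-4,1932-01-14
844,DR-1,1932-03-22
"""
data = pd.read_csv(io.StringIO(csv_text))

# head(3) returns the first three rows, the same as slicing positions 0..2.
same_head = data.head(n=3).equals(data.iloc[:3])

# tail(4) returns the last four rows, the same as slicing the final four positions.
same_tail = data.tail(n=4).equals(data.iloc[-4:])

print(same_head, same_tail)
```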