Numerical operations
import pandas as pd
You can pass in your own data and specify the index labels and column names.
data = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['a', 'b'], columns=['A', 'B', 'C'])
data
| | A | B | C |
|---|---|---|---|
| a | 1 | 2 | 3 |
| b | 4 | 5 | 6 |
data.sum()
A 5
B 7
C 9
dtype: int64
data.sum(axis=1)
a 6
b 15
dtype: int64
data.mean()
A 2.5
B 3.5
C 4.5
dtype: float64
data.mean(axis=1)
a 2.0
b 5.0
dtype: float64
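Likewise, .max(), .min(), and .median() compute the maximum, minimum, and median. Each accepts an axis argument: the default axis=0 gives one result per column, and axis=1 gives one result per row.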
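As a quick check of the axis behaviour, here is a minimal sketch that rebuilds the same small frame and calls .max(), .min(), and .median() along both axes. The result names (col_max, row_min, and so on) are illustrative only.

```python
import pandas as pd

# Same 2x3 frame as in the example above
data = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['a', 'b'], columns=['A', 'B', 'C'])

# Default axis=0: one result per column
col_max = data.max()        # A 4, B 5, C 6
col_median = data.median()  # A 2.5, B 3.5, C 4.5

# axis=1: one result per row
row_min = data.min(axis=1)        # a 1, b 4
row_median = data.median(axis=1)  # a 2.0, b 5.0

print(col_max, col_median, row_min, row_median, sep='\n\n')
```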
Pairwise statistics
df = pd.read_csv('../../datasets/titanic/test.csv')
df.head(5)
| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
df.dtypes
PassengerId int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
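The .cov() method computes the pairwise covariance between columns, and .corr() computes the pairwise correlation coefficients. Both work on numeric columns only. In pandas 2.0 and later, pass numeric_only=True so that text columns such as Name are skipped instead of raising an error.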
df.cov()
| | PassengerId | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| PassengerId | 14595.166667 | -2.720624 | -59.369047 | 0.413669 | 5.107914 | 55.514238 |
| Pclass | -2.720624 | 0.708690 | -5.906358 | 0.000820 | 0.015467 | -27.171232 |
| Age | -59.369047 | -5.906358 | 201.106695 | -1.135270 | -0.704115 | 291.838610 |
| SibSp | 0.413669 | 0.000820 | -1.135270 | 0.804178 | 0.270100 | 8.607981 |
| Parch | 5.107914 | 0.015467 | -0.704115 | 0.270100 | 0.963203 | 12.635175 |
| Fare | 55.514238 | -27.171232 | 291.838610 | 8.607981 | 12.635175 | 3125.657074 |
df.corr()
| | PassengerId | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| PassengerId | 1.000000 | -0.026751 | -0.034102 | 0.003818 | 0.043080 | 0.008211 |
| Pclass | -0.026751 | 1.000000 | -0.492143 | 0.001087 | 0.018721 | -0.577147 |
| Age | -0.034102 | -0.492143 | 1.000000 | -0.091587 | -0.061249 | 0.337932 |
| SibSp | 0.003818 | 0.001087 | -0.091587 | 1.000000 | 0.306895 | 0.171539 |
| Parch | 0.043080 | 0.018721 | -0.061249 | 0.306895 | 1.000000 | 0.230046 |
| Fare | 0.008211 | -0.577147 | 0.337932 | 0.171539 | 0.230046 | 1.000000 |
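To make the link between the two matrices concrete, the sketch below rebuilds a correlation matrix from a covariance matrix by dividing each entry by the product of the two columns' standard deviations. The frame toy and its values are made up for illustration and are not taken from the Titanic file. The sketch assumes the default Pearson method and has no missing values.

```python
import numpy as np
import pandas as pd

# Small made-up numeric frame (illustrative values only)
toy = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0],
    'y': [2.0, 4.0, 6.0, 8.0],   # y = 2x, perfectly positively correlated with x
    'z': [4.0, 3.0, 2.0, 1.0],   # z = 5 - x, perfectly negatively correlated with x
})

cov = toy.cov()    # sample covariance (divides by n - 1)
corr = toy.corr()  # Pearson correlation by default

# Rebuild the correlation matrix by hand: cov(i, j) / (std_i * std_j)
std = toy.std()    # sample standard deviation, also based on n - 1
manual = cov / np.outer(std, std)

print(corr)
print(np.allclose(manual, corr))
```

Because correlation is covariance rescaled by the standard deviations, it always lies between -1 and 1, which makes it easier to compare across columns with different units, such as Age and Fare.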
Count how many times each value occurs in a column
df['Age'].value_counts(ascending=True)
34.5 1
76.0 1
26.5 1
60.5 1
7.0 1
..
18.0 13
30.0 15
22.0 16
21.0 17
24.0 17
Name: Age, Length: 79, dtype: int64
df['Age'].value_counts(ascending=True, bins=5)
(60.834, 76.0] 10
(0.0932, 15.336] 32
(45.668, 60.834] 42
(30.502, 45.668] 80
(15.336, 30.502] 168
Name: Age, dtype: int64
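The bins argument groups values into equal-width intervals before counting. A minimal sketch on a made-up series (the values in ages are illustrative, not from the dataset) shows plain counts, binned counts, and relative frequencies side by side.

```python
import pandas as pd

# Made-up ages for illustration
ages = pd.Series([5, 15, 22, 22, 24, 30, 41, 58, 76], name='Age')

# Plain counts, most frequent value first by default
counts = ages.value_counts()

# Three equal-width intervals between the minimum and maximum,
# sorted by interval rather than by count
binned = ages.value_counts(bins=3).sort_index()

# Relative frequencies instead of raw counts
shares = ages.value_counts(normalize=True)

print(counts.head(3))
print(binned)
print(shares.loc[22])
```

pandas places the lowest bin edge slightly below the minimum so that the smallest value falls inside the first interval, which is why the first edge in the output above is not exactly the minimum age.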
df['Age'].count()
332
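Note that .count() returns the number of non-missing values, so 332 is the number of rows with a recorded Age. The sketch below, on a made-up series with two NaN entries, contrasts .count() with len() and shows how dropna=False keeps missing values in value_counts().

```python
import numpy as np
import pandas as pd

# Made-up series with two missing entries
ages = pd.Series([22.0, np.nan, 34.5, np.nan, 47.0], name='Age')

non_null = ages.count()      # 3: NaN entries are skipped
total = len(ages)            # 5: every row, missing or not
missing = ages.isna().sum()  # 2: number of NaN entries

# value_counts() drops NaN by default; dropna=False counts it as its own value
with_nan = ages.value_counts(dropna=False)

print(non_null, total, missing)
print(with_nan)
```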