5 - Numerical Operations - Data Analysis

Latest recommended article published 2021-06-14 21:18:39

Bruce小鬼

Category: Python Big Data Analysis

Article link: https://blog.csdn.net/m0_38039437/article/details/80720552

Create a DataFrame, specifying its row index labels and column labels.

In [3]:

    import pandas as pd
    df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['a', 'b'], columns=['A', 'B', 'C'])
    df

Out[3]:

	A	B	C
a	1	2	3
b	4	5	6
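
The labels passed as index and columns can then be used to look values up. A minimal sketch on the same frame; the variable names row_labels, col_labels and value are only for illustration:

```python
import pandas as pd

# Same frame as in the cell above.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['a', 'b'], columns=['A', 'B', 'C'])

row_labels = list(df.index)     # row index labels
col_labels = list(df.columns)   # column labels

# .loc selects by label: row 'b', column 'C'.
value = df.loc['b', 'C']

print(row_labels, col_labels, value)
```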

By default, sum() adds up each column.

In [4]:

    df.sum()

Out[4]:

A    5
B    7
C    9
dtype: int64

Pass axis=1 to sum across each row.

In [6]:

    df.sum(axis=1)

Out[6]:

a     6
b    15
dtype: int64

The axis can also be given by name: axis='columns' is equivalent to axis=1.

In [7]:

    df.sum(axis='columns')

Out[7]:

a     6
b    15
dtype: int64
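
The axis argument is easy to mix up, so here is a short sketch of the two directions side by side. It reuses the small frame from the start of the post; per_column and per_row are illustrative names:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['a', 'b'], columns=['A', 'B', 'C'])

# axis=0 (alias 'index'): collapse the rows, one result per column.
per_column = df.sum(axis=0)
# axis=1 (alias 'columns'): collapse the columns, one result per row.
per_row = df.sum(axis=1)

# The string aliases give the same result as the integers.
assert per_column.equals(df.sum(axis='index'))
assert per_row.equals(df.sum(axis='columns'))

print(per_column.to_dict())  # {'A': 5, 'B': 7, 'C': 9}
print(per_row.to_dict())     # {'a': 6, 'b': 15}
```

The axis names the dimension that disappears: axis='columns' removes the columns and leaves one value per row.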

mean() returns the mean of each column by default.

In [8]:

    df.mean()

Out[8]:

A    2.5
B    3.5
C    4.5
dtype: float64

Pass axis=1 for the mean of each row.

In [9]:

    df.mean(axis=1)

Out[9]:

a    2.0
b    5.0
dtype: float64

median() returns the median of each column.

In [10]:

    df.median()

Out[10]:
A    2.5
B    3.5
C    4.5
dtype: float64
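
In the frame above, mean and median coincide because each column holds only two values. They diverge once the data is skewed. A small sketch with made-up numbers (not taken from the post's data):

```python
import pandas as pd

# Made-up values with one large outlier.
s = pd.Series([1, 2, 3, 4, 100])

mean_value = s.mean()      # (1 + 2 + 3 + 4 + 100) / 5
median_value = s.median()  # middle value after sorting

print(mean_value, median_value)  # 22.0 3.0
```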

Bivariate statistics

cov(): covariance

In [11]:

    df = pd.read_csv('C:/JupyterWork/data/titanic.csv')
    df.head()

Out[11]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

In [12]:

    df.cov()

Out[12]:
	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
PassengerId	66231.000000	-0.626966	-7.561798	138.696504	-16.325843	-0.342697	161.883369
Survived	-0.626966	0.236772	-0.137703	-0.551296	-0.018954	0.032017	6.221787
Pclass	-7.561798	-0.137703	0.699015	-4.496004	0.076599	0.012429	-22.830196
Age	138.696504	-0.551296	-4.496004	211.019125	-4.163334	-2.344191	73.849030
SibSp	-16.325843	-0.018954	0.076599	-4.163334	1.216043	0.368739	8.748734
Parch	-0.342697	0.032017	0.012429	-2.344191	0.368739	0.649728	8.661052
Fare	161.883369	6.221787	-22.830196	73.849030	8.748734	8.661052	2469.436846
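
Each entry of this matrix is a sample covariance, sum((x - mean_x) * (y - mean_y)) / (n - 1). Note that from pandas 2.0 on, cov() no longer drops text columns such as Name silently. On a frame like the Titanic one you would pass numeric_only=True or select the numeric columns first. A rough sketch on a tiny frame named toy (an illustrative stand-in built from the first four rows of the head() output above):

```python
import pandas as pd

# Sex, Age and Fare from the first four rows shown in df.head().
toy = pd.DataFrame({
    'sex': ['male', 'female', 'female', 'female'],
    'age': [22.0, 38.0, 26.0, 35.0],
    'fare': [7.25, 71.2833, 7.925, 53.1],
})

# Keep only numeric columns so the call behaves the same across pandas versions.
numeric = toy.select_dtypes(include='number')
cov_matrix = numeric.cov()

# The same covariance by hand, dividing by n - 1.
dx = toy['age'] - toy['age'].mean()
dy = toy['fare'] - toy['fare'].mean()
manual = (dx * dy).sum() / (len(toy) - 1)

print(cov_matrix)
print(manual)
```

The diagonal of the matrix holds each column's variance.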

corr(): correlation coefficient (Pearson by default)

In [13]:

    df.corr()

Out[13]:

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
PassengerId	1.000000	-0.005007	-0.035144	0.036847	-0.057527	-0.001652	0.012658
Survived	-0.005007	1.000000	-0.338481	-0.077221	-0.035322	0.081629	0.257307
Pclass	-0.035144	-0.338481	1.000000	-0.369226	0.083081	0.018443	-0.549500
Age	0.036847	-0.077221	-0.369226	1.000000	-0.308247	-0.189119	0.096067
SibSp	-0.057527	-0.035322	0.083081	-0.308247	1.000000	0.414838	0.159651
Parch	-0.001652	0.081629	0.018443	-0.189119	0.414838	1.000000	0.216225
Fare	0.012658	0.257307	-0.549500	0.096067	0.159651	0.216225	1.000000
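
The Pearson coefficient is the covariance divided by the product of the two standard deviations. From the tables above, for Survived and Pclass: -0.137703 / sqrt(0.236772 * 0.699015) is approximately -0.3385, which matches the entry here. Columns with missing values, such as Age, may not reproduce exactly from the printed tables, because each pair is computed only on rows where both values are present. A small sketch of the relationship, again using the first four Age and Fare values from head() in an illustrative frame named toy:

```python
import pandas as pd

toy = pd.DataFrame({
    'age': [22.0, 38.0, 26.0, 35.0],
    'fare': [7.25, 71.2833, 7.925, 53.1],
})

# Pearson r = cov(x, y) / (std(x) * std(y)), all with sample statistics (ddof=1).
cov_xy = toy['age'].cov(toy['fare'])
r_manual = cov_xy / (toy['age'].std() * toy['fare'].std())
r_pandas = toy['age'].corr(toy['fare'])

print(r_manual, r_pandas)
```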

value_counts(): counts how often each distinct value occurs in the given column, sorted by count in descending order by default

In [14]:

    df['Age'].value_counts()

Out[14]:

24.00    30
22.00    27
18.00    26
19.00    25
30.00    25
28.00    25
21.00    24
25.00    23
36.00    22
29.00    20
32.00    18
27.00    18
35.00    18
26.00    18
16.00    17
31.00    17
20.00    15
33.00    15
23.00    15
34.00    15
39.00    14
17.00    13
42.00    13
40.00    13
45.00    12
38.00    11
50.00    10
2.00     10
4.00     10
47.00     9
         ..
71.00     2
59.00     2
63.00     2
0.83      2
30.50     2
70.00     2
57.00     2
0.75      2
13.00     2
10.00     2
64.00     2
40.50     2
32.50     2
45.50     2
20.50     1
24.50     1
0.67      1
14.50     1
0.92      1
74.00     1
34.50     1
80.00     1
12.00     1
36.50     1
53.00     1
55.50     1
70.50     1
66.00     1
23.50     1
0.42      1
Name: Age, Length: 88, dtype: int64
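
Two other options from the value_counts signature are worth a quick look: normalize returns shares instead of counts, and dropna controls whether missing values are counted. A minimal sketch on made-up ages (not the Titanic column):

```python
import pandas as pd

# Made-up ages with one missing value.
ages = pd.Series([22.0, 22.0, 24.0, None, 24.0, 24.0])

counts = ages.value_counts()                # NaN is left out by default
shares = ages.value_counts(normalize=True)  # fractions of the non-NaN total
with_nan = ages.value_counts(dropna=False)  # NaN gets its own row

print(counts)
print(shares)
print(with_nan)
```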

value_counts(): the same count per distinct value, with ascending=True to sort by count in ascending order

In [15]:

    df['Age'].value_counts(ascending=True)

Out[15]:

0.42      1
23.50     1
66.00     1
70.50     1
55.50     1
53.00     1
36.50     1
12.00     1
80.00     1
34.50     1
74.00     1
0.92      1
14.50     1
0.67      1
24.50     1
20.50     1
45.50     2
32.50     2
40.50     2
64.00     2
10.00     2
13.00     2
0.75      2
57.00     2
70.00     2
30.50     2
0.83      2
63.00     2
59.00     2
71.00     2
         ..
47.00     9
4.00     10
2.00     10
50.00    10
38.00    11
45.00    12
40.00    13
42.00    13
17.00    13
39.00    14
34.00    15
23.00    15
33.00    15
20.00    15
31.00    17
16.00    17
26.00    18
35.00    18
27.00    18
32.00    18
29.00    20
36.00    22
25.00    23
21.00    24
28.00    25
30.00    25
19.00    25
18.00    26
22.00    27
24.00    30
Name: Age, Length: 88, dtype: int64

Count how many passengers travelled in first, second, and third class.

In [16]:

    df['Pclass'].value_counts(ascending=True)

Out[16]:

2    184
1    216
3    491
Name: Pclass, dtype: int64
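
The result above is ordered by count, so class 2 comes first. If you would rather list the classes as 1, 2, 3, one option is to sort the result by its index. A sketch with made-up class labels (the counts are not the Titanic ones):

```python
import pandas as pd

# Made-up passenger classes.
pclass = pd.Series([3, 1, 3, 2, 3, 1])

by_count = pclass.value_counts()               # ordered by frequency
by_class = pclass.value_counts().sort_index()  # ordered by class label

print(by_count)
print(by_class)
```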

bins: splits the value range into the given number of equal-width intervals and counts the values in each interval

In [19]:

    df['Age'].value_counts(ascending=True, bins=5)

Out[19]:

(64.084, 80.0]       11
(48.168, 64.084]     69
(0.339, 16.336]     100
(32.252, 48.168]    188
(16.336, 32.252]    346
Name: Age, dtype: int64
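
As the help text at the end of this post says, bins is a convenience for pd.cut. The sketch below, on made-up ages, counts the same intervals both ways. The interval labels can differ slightly at the lowest edge, so only the counts are compared:

```python
import pandas as pd

# Made-up ages (not the Titanic column).
ages = pd.Series([2.0, 15.0, 22.0, 30.0, 40.0, 58.0, 80.0])

# Four equal-width bins over the range 2..80, listed in bin order.
binned = ages.value_counts(bins=4).sort_index()

# Roughly the same thing by hand: label each value with its interval, then count.
via_cut = pd.cut(ages, bins=4).value_counts().sort_index()

print(binned)
print(via_cut)
```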

count() returns the number of non-missing (non-NaN) values in the column.

In [20]:

    df['Age'].count()

       Out[20]: 
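
count() can be smaller than the number of rows, since missing entries are skipped. A minimal sketch on made-up data showing the difference from len():

```python
import pandas as pd

# Made-up ages with two missing entries.
ages = pd.Series([22.0, None, 26.0, None, 35.0])

total_rows = len(ages)       # every row, missing or not
non_missing = ages.count()   # NaN excluded
missing = ages.isna().sum()  # number of NaN entries

print(total_rows, non_missing, missing)  # 5 3 2
```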
     

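help() displays the documentation for a function. Because help() prints the text itself and returns None, wrapping it in print() adds the trailing None seen at the end of the output.

In [21]:

    print(help(pd.value_counts))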

Help on function value_counts in module pandas.core.algorithms:

value_counts(values, sort=True, ascending=False, normalize=False, bins=None, dropna=True)
    Compute a histogram of the counts of non-null values.
    
    Parameters
    ----------
    values : ndarray (1-d)
    sort : boolean, default True
        Sort by values
    ascending : boolean, default False
        Sort in ascending order
    normalize: boolean, default False
        If True then compute a relative histogram
    bins : integer, optional
        Rather than count values, group them into half-open bins,
        convenience for pd.cut, only works with numeric data
    dropna : boolean, default True
        Don't include counts of NaN
    
    Returns
    -------
    value_counts : Series

None
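
Newer pandas releases mark the top-level pd.value_counts function as deprecated in favor of the Series method, so it may be safer to look up pd.Series.value_counts instead. A small sketch that reads the parameters and docstring without going through help(); param_names and doc are illustrative names:

```python
import inspect
import pandas as pd

# Parameters of the Series method, leaving out self.
signature = inspect.signature(pd.Series.value_counts)
param_names = [name for name in signature.parameters if name != 'self']

# The text that help() renders is also available as a plain string.
doc = pd.Series.value_counts.__doc__

print(param_names)
print(doc[:200])
```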