R <---> Python, Chapter 2: Creating a Dataset

Article link: https://blog.csdn.net/ddjhpxs/article/details/109034075

This series attempts to reproduce "R in Action" (2nd edition) in Python. The chapter titles largely follow those of the book.

Chapter 2: Creating a Dataset

Data structures

Lists

mylist = [1, 2, "three", [5, 6], (7,), True]
print(type(mylist))
mylist

<class 'list'>
[1, 2, 'three', [5, 6], (7,), True]

Tuples

mytuple = (1, 2, "three", [5, 6], (7,), True)
print(type(mytuple))
mytuple

<class 'tuple'>
(1, 2, 'three', [5, 6], (7,), True)
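The list and the tuple above hold the same values; the practical difference is mutability. A minimal sketch of that difference, reusing the same values (the name error_message is introduced here only for illustration):

```python
mylist = [1, 2, "three", [5, 6], (7,), True]
mytuple = (1, 2, "three", [5, 6], (7,), True)

# A list can be modified in place.
mylist[0] = 100
mylist.append("new")

# A tuple cannot: item assignment raises TypeError.
try:
    mytuple[0] = 100
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# A mutable element stored inside a tuple can still be changed.
mytuple[3].append(7)

print(mylist)
print(error_message)
print(mytuple)
```

Note that the tuple itself never changes which objects it holds; only the inner list it refers to is modified.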

Matrices

import numpy as np

mymatrix = np.matrix([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=int)
mymatrix

matrix([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]])

Arrays

myarray = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)
myarray

array([[1., 2., 3.],
       [4., 5., 6.],
       [7., 8., 9.]])
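A matrix and a two-dimensional array look alike when printed, but they behave differently under the * operator. The sketch below uses a smaller 2x2 example to make the results easy to check by hand; the names values, as_matrix, as_array and the result variables are introduced here for illustration.

```python
import numpy as np

values = [[1, 2], [3, 4]]
as_matrix = np.matrix(values)
as_array = np.array(values)

# For np.matrix, * is the matrix product.
matrix_star = as_matrix * as_matrix

# For np.ndarray, * is the element-wise product ...
array_star = as_array * as_array

# ... and @ is the matrix product.
array_matmul = as_array @ as_array

print(matrix_star)
print(array_star)
print(array_matmul)
```

The NumPy documentation recommends regular arrays over np.matrix for new code, so np.array combined with @ is usually the safer choice.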

Data frames

import pandas as pd

# Create a data frame
frame_dict = {'patient_id':[1, 2, 3, 4],
              'age':[25, 34, 28, 52],
              'diabetes':["Type1", "Type2", "Type1", "Type1"],
              'status':["Poor", "Improved", "Excellent", "Poor"]}
myframe = pd.DataFrame(frame_dict)
myframe

	patient_id	age	diabetes	status
0	1	25	Type1	Poor
1	2	34	Type2	Improved
2	3	28	Type1	Excellent
3	4	52	Type1	Poor

Selecting elements

myframe['patient_id']

0    1
1    2
2    3
3    4
Name: patient_id, dtype: int64

myframe[['patient_id']]

	patient_id
0	1
1	2
2	3
3	4

myframe[['age','diabetes']]

	age	diabetes
0	25	Type1
1	34	Type2
2	28	Type1
3	52	Type1

myframe.status

0         Poor
1     Improved
2    Excellent
3         Poor
Name: status, dtype: object
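The examples above select whole columns. Rows and individual cells can be selected with .loc (by label) and .iloc (by position). This is a minimal sketch that rebuilds the same data frame so that it runs on its own; the names first_row, age_of_row_2 and over_30 are introduced here for illustration.

```python
import pandas as pd

myframe = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "age": [25, 34, 28, 52],
    "diabetes": ["Type1", "Type2", "Type1", "Type1"],
    "status": ["Poor", "Improved", "Excellent", "Poor"],
})

# First row, selected by position.
first_row = myframe.iloc[0]

# A single cell, selected by row label and column name.
age_of_row_2 = myframe.loc[2, "age"]

# Rows that satisfy a condition, restricted to two columns.
over_30 = myframe.loc[myframe["age"] > 30, ["patient_id", "age"]]

print(first_row)
print(age_of_row_2)
print(over_30)
```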

Contingency tables

pd.crosstab(myframe.diabetes, myframe.status)

status	Excellent	Improved	Poor
diabetes
Type1	1	0	2
Type2	0	1	0
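If row and column totals are also wanted, pd.crosstab accepts margins=True, which adds an "All" row and column. A minimal sketch on the same data, rebuilt here so the block runs on its own (the name table is introduced for illustration):

```python
import pandas as pd

myframe = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "age": [25, 34, 28, 52],
    "diabetes": ["Type1", "Type2", "Type1", "Type1"],
    "status": ["Poor", "Improved", "Excellent", "Poor"],
})

# margins=True appends the totals under the label "All".
table = pd.crosstab(myframe["diabetes"], myframe["status"], margins=True)
print(table)
```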

Descriptive statistics

myframe.describe()

	patient_id	age
count	4.000000	4.000000
mean	2.500000	34.750000
std	1.290994	12.093387
min	1.000000	25.000000
25%	1.750000	27.250000
50%	2.500000	31.000000
75%	3.250000	38.500000
max	4.000000	52.000000

Data frame information

myframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   patient_id  4 non-null      int64 
 1   age         4 non-null      int64 
 2   diabetes    4 non-null      object
 3   status      4 non-null      object
dtypes: int64(2), object(2)
memory usage: 256.0+ bytes

Data input

Importing from a delimited text file

StudentID,First,Last,Math,Science,Social Studies
011,Bob,Smith,90,80,67
012,Jane,Weary,75,,80
010,Dan,"Thornton, III",65,75,70
040,Mary,"O'Leary",90,95,92

pd.read_table("studentgrades.txt", sep=",")

	StudentID	First	Last	Math	Science	Social Studies
0	11	Bob	Smith	90	80.0	67
1	12	Jane	Weary	75	NaN	80
2	10	Dan	Thornton, III	65	75.0	70
3	40	Mary	O'Leary	90	95.0	92

grades = pd.read_csv("studentgrades.csv", sep=",")
grades

	StudentID	First	Last	Math	Science	Social Studies
0	11	Bob	Smith	90	80.0	67
1	12	Jane	Weary	75	NaN	80
2	10	Dan	Thornton, III	65	75.0	70
3	40	Mary	O'Leary	90	95.0	92
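In the output above, StudentID is parsed as an integer, so the leading zeros in the file (011, 012, ...) are lost. One way to keep them is to read that column as text with the dtype parameter. The sketch below reads the same content from an in-memory string instead of a file so that it runs on its own; csv_text is a name introduced here for illustration.

```python
import io

import pandas as pd

csv_text = """StudentID,First,Last,Math,Science,Social Studies
011,Bob,Smith,90,80,67
012,Jane,Weary,75,,80
010,Dan,"Thornton, III",65,75,70
040,Mary,"O'Leary",90,95,92
"""

# Read StudentID as a string so that "011" is not converted to 11.
grades = pd.read_csv(io.StringIO(csv_text), dtype={"StudentID": str})
print(grades)
```

With a real file, the same dtype argument can be passed to pd.read_csv("studentgrades.csv", ...).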

Importing Excel data

# Method 1
grades2 = pd.read_excel("studentgrades.xlsx", sheet_name="Sheet1")
grades2

	StudentID	First	Last	Math	Science	Social Studies
0	11	Bob	Smith	90	80.0	67
1	12	Jane	Weary	75	NaN	80
2	10	Dan	Thornton, III	65	75.0	70
3	40	Mary	O'Leary	90	95.0	92

# Method 2
reader = pd.ExcelFile("studentgrades.xlsx")
reader.sheet_names

['Sheet1', 'Sheet2', 'Sheet3']

sheet = reader.parse("Sheet1")
sheet

	StudentID	First	Last	Math	Science	Social Studies
0	11	Bob	Smith	90	80.0	67
1	12	Jane	Weary	75	NaN	80
2	10	Dan	Thornton, III	65	75.0	70
3	40	Mary	O'Leary	90	95.0	92

Importing data from the clipboard

pd.read_clipboard(sep=",")

	StudentID	First	Last	Math	Science	Social Studies
0	11	Bob	Smith	90	80.0	67
1	12	Jane	Weary	75	NaN	80
2	10	Dan	Thornton, III	65	75.0	70
3	40	Mary	O'Leary	90	95.0	92

Methods for working with data objects

df = myframe.copy() # Copy the myframe data frame to df
df

	patient_id	age	diabetes	status
0	1	25	Type1	Poor
1	2	34	Type2	Improved
2	3	28	Type1	Excellent
3	4	52	Type1	Poor
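The reason for calling .copy() rather than writing df = myframe is that plain assignment only creates a second name for the same object. A minimal sketch of the difference, with the data frame rebuilt so the block runs on its own (the name alias is introduced here for illustration):

```python
import pandas as pd

myframe = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "age": [25, 34, 28, 52],
    "diabetes": ["Type1", "Type2", "Type1", "Type1"],
    "status": ["Poor", "Improved", "Excellent", "Poor"],
})

alias = myframe        # second name for the same object
df = myframe.copy()    # independent copy

# Changing the copy leaves the original untouched.
df.loc[0, "age"] = 99

print(alias is myframe)
print(df is myframe)
print(myframe.loc[0, "age"])
print(df.loc[0, "age"])
```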

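Function or attribute	Purpose
df.shape	Dimensions (shape) of df
df.describe()	Descriptive statistics for df
df.info()	Basic information about df
df.columns	Column names of df
df.index	Index of df
df.head()	First 5 rows of df
df.tail(6)	Last 6 rows of df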
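As a quick check, the entries in the table can be tried on the four-row patient data frame. This is a minimal sketch that rebuilds the data so it runs on its own; head and tail are called with 2 here only because the frame has just four rows.

```python
import pandas as pd

df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "age": [25, 34, 28, 52],
    "diabetes": ["Type1", "Type2", "Type1", "Type1"],
    "status": ["Poor", "Improved", "Excellent", "Poor"],
})

print(df.shape)          # (rows, columns)
print(list(df.columns))  # column names as a plain list
print(df.index)          # row index
print(df.head(2))        # first 2 rows
print(df.tail(2))        # last 2 rows
```

shape, columns and index are attributes and take no parentheses, while describe(), info(), head() and tail() are methods.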