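This series attempts to reproduce R in Action (2nd edition) in Python; the chapter titles largely follow the original book.
Chapter 2: Creating a Dataset
Data structures
Lists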
mylist = [1, 2, "three", [5, 6], (7,), True]
print(type(mylist))
mylist
<class 'list'>
[1, 2, 'three', [5, 6], (7,), True]
Tuples
mytuple = (1, 2, "three", [5, 6], (7,), True)
print(type(mytuple))
mytuple
<class 'tuple'>
(1, 2, 'three', [5, 6], (7,), True)
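The list and the tuple above hold the same values; the practical difference is mutability. The following is a minimal sketch of that difference. The variable names are reused from the cells above, and the replacement value 3 is only illustrative.

```python
# Lists are mutable: an element can be replaced in place.
mylist = [1, 2, "three", [5, 6], (7,), True]
mylist[2] = 3
print(mylist)

# Tuples are immutable: item assignment raises TypeError.
mytuple = (1, 2, "three", [5, 6], (7,), True)
try:
    mytuple[2] = 3
    tuple_assignment_failed = False
except TypeError as err:
    tuple_assignment_failed = True
    print("TypeError:", err)
```

A tuple is therefore the safer choice for values that should not change after creation.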
Matrices
import numpy as np
# np.int was removed in NumPy 1.24; use the built-in int instead
mymatrix = np.matrix([[1,2,3],[4,5,6],[7,8,9]], dtype=int)
mymatrix
matrix([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]])
Arrays
# np.float was removed in NumPy 1.24; use the built-in float instead
myarray = np.array([[1,2,3],[4,5,6],[7,8,9]], dtype=float)
myarray
array([[1., 2., 3.],
       [4., 5., 6.],
       [7., 8., 9.]])
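NumPy's documentation recommends regular arrays over np.matrix for new code. The sketch below shows the one behavioural difference worth keeping in mind when making that switch: on an ndarray, `*` multiplies element by element, while `@` performs matrix multiplication. The 2×2 values are illustrative and not taken from the original.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

elementwise = a * b   # element-by-element product
matmul = a @ b        # matrix product

print(elementwise)
print(matmul)
```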
Data frames
import pandas as pd
frame_dict = {'patient_id':[1, 2, 3, 4],
              'age':[25, 34, 28, 52],
              'diabetes':["Type1", "Type2", "Type1", "Type1"],
              'status':["Poor", "Improved", "Excellent", "Poor"]}
myframe = pd.DataFrame(frame_dict)
myframe
| | patient_id | age | diabetes | status |
|---|---|---|---|---|
| 0 | 1 | 25 | Type1 | Poor |
| 1 | 2 | 34 | Type2 | Improved |
| 2 | 3 | 28 | Type1 | Excellent |
| 3 | 4 | 52 | Type1 | Poor |
Selecting elements
myframe['patient_id']
0 1
1 2
2 3
3 4
Name: patient_id, dtype: int64
myframe[['patient_id']]
myframe[['age','diabetes']]
| | age | diabetes |
|---|---|---|
| 0 | 25 | Type1 |
| 1 | 34 | Type2 |
| 2 | 28 | Type1 |
| 3 | 52 | Type1 |
myframe.status
0 Poor
1 Improved
2 Excellent
3 Poor
Name: status, dtype: object
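The selections above all pick columns. As a complementary sketch, rows can be selected by position with `iloc`, by label with `loc`, or with a boolean mask. The frame is rebuilt here so the block runs on its own; the age threshold of 30 is an arbitrary illustration.

```python
import pandas as pd

myframe = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "age": [25, 34, 28, 52],
    "diabetes": ["Type1", "Type2", "Type1", "Type1"],
    "status": ["Poor", "Improved", "Excellent", "Poor"],
})

first_two = myframe.iloc[0:2]            # rows at positions 0 and 1
one_cell = myframe.loc[2, "status"]      # row label 2, column "status"
over_30 = myframe[myframe["age"] > 30]   # rows where the mask is True

print(first_two)
print(one_cell)
print(over_30)
```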
Contingency tables
pd.crosstab(myframe.diabetes, myframe.status)
| status | Excellent | Improved | Poor |
|---|---|---|---|
| diabetes | | | |
| Type1 | 1 | 0 | 2 |
| Type2 | 0 | 1 | 0 |
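If row and column totals are also wanted, `pd.crosstab` accepts `margins=True`, which adds a row and a column labelled "All" by default. A small sketch on the same patient data, rebuilt here so the block is self-contained:

```python
import pandas as pd

myframe = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "age": [25, 34, 28, 52],
    "diabetes": ["Type1", "Type2", "Type1", "Type1"],
    "status": ["Poor", "Improved", "Excellent", "Poor"],
})

# margins=True appends the totals under the label "All"
table = pd.crosstab(myframe["diabetes"], myframe["status"], margins=True)
print(table)
```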
Descriptive statistics
myframe.describe()
| | patient_id | age |
|---|---|---|
| count | 4.000000 | 4.000000 |
| mean | 2.500000 | 34.750000 |
| std | 1.290994 | 12.093387 |
| min | 1.000000 | 25.000000 |
| 25% | 1.750000 | 27.250000 |
| 50% | 2.500000 | 31.000000 |
| 75% | 3.250000 | 38.500000 |
| max | 4.000000 | 52.000000 |
Data frame information
myframe.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   patient_id  4 non-null      int64
 1   age         4 non-null      int64
 2   diabetes    4 non-null      object
 3   status      4 non-null      object
dtypes: int64(2), object(2)
memory usage: 256.0+ bytes
Data input
Importing from a delimited text file
StudentID,First,Last,Math,Science,Social Studies
011,Bob,Smith,90,80,67
012,Jane,Weary,75,,80
010,Dan,"Thornton, III",65,75,70
040,Mary,"O'Leary",90,95,92
pd.read_table("studentgrades.txt", sep=",")
| | StudentID | First | Last | Math | Science | Social Studies |
|---|---|---|---|---|---|---|
| 0 | 11 | Bob | Smith | 90 | 80.0 | 67 |
| 1 | 12 | Jane | Weary | 75 | NaN | 80 |
| 2 | 10 | Dan | Thornton, III | 65 | 75.0 | 70 |
| 3 | 40 | Mary | O'Leary | 90 | 95.0 | 92 |
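Note that StudentID was parsed as an integer, so the leading zeros (011, 012, ...) are lost. One way to keep them, sketched here under the assumption that the IDs should be treated as text, is to pass `dtype` when reading. The file contents are held in an in-memory string via `io.StringIO` so the example runs without the file on disk.

```python
import io
import pandas as pd

# Same rows as the sample file, kept in memory for a self-contained example
raw = '''StudentID,First,Last,Math,Science,Social Studies
011,Bob,Smith,90,80,67
012,Jane,Weary,75,,80
010,Dan,"Thornton, III",65,75,70
040,Mary,"O'Leary",90,95,92
'''

# dtype={"StudentID": str} keeps the column as text, preserving leading zeros
grades = pd.read_csv(io.StringIO(raw), dtype={"StudentID": str})
print(grades)
```

The same `dtype` argument works with `pd.read_table` and a real file path.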
grades = pd.read_csv("studentgrades.csv", sep=",")
grades
| | StudentID | First | Last | Math | Science | Social Studies |
|---|---|---|---|---|---|---|
| 0 | 11 | Bob | Smith | 90 | 80.0 | 67 |
| 1 | 12 | Jane | Weary | 75 | NaN | 80 |
| 2 | 10 | Dan | Thornton, III | 65 | 75.0 | 70 |
| 3 | 40 | Mary | O'Leary | 90 | 95.0 | 92 |
Importing Excel data
grades2 = pd.read_excel("studentgrades.xlsx", sheet_name="Sheet1")
grades2
| | StudentID | First | Last | Math | Science | Social Studies |
|---|---|---|---|---|---|---|
| 0 | 11 | Bob | Smith | 90 | 80.0 | 67 |
| 1 | 12 | Jane | Weary | 75 | NaN | 80 |
| 2 | 10 | Dan | Thornton, III | 65 | 75.0 | 70 |
| 3 | 40 | Mary | O'Leary | 90 | 95.0 | 92 |
reader = pd.ExcelFile("studentgrades.xlsx")
reader.sheet_names
['Sheet1', 'Sheet2', 'Sheet3']
sheet = reader.parse("Sheet1")
sheet
| | StudentID | First | Last | Math | Science | Social Studies |
|---|---|---|---|---|---|---|
| 0 | 11 | Bob | Smith | 90 | 80.0 | 67 |
| 1 | 12 | Jane | Weary | 75 | NaN | 80 |
| 2 | 10 | Dan | Thornton, III | 65 | 75.0 | 70 |
| 3 | 40 | Mary | O'Leary | 90 | 95.0 | 92 |
Importing data from the clipboard
pd.read_clipboard(sep=",")
| | StudentID | First | Last | Math | Science | Social Studies |
|---|---|---|---|---|---|---|
| 0 | 11 | Bob | Smith | 90 | 80.0 | 67 |
| 1 | 12 | Jane | Weary | 75 | NaN | 80 |
| 2 | 10 | Dan | Thornton, III | 65 | 75.0 | 70 |
| 3 | 40 | Mary | O'Leary | 90 | 95.0 | 92 |
Functions for working with data objects
df = myframe.copy()
df
| | patient_id | age | diabetes | status |
|---|---|---|---|---|
| 0 | 1 | 25 | Type1 | Poor |
| 1 | 2 | 34 | Type2 | Improved |
| 2 | 3 | 28 | Type1 | Excellent |
| 3 | 4 | 52 | Type1 | Poor |
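| Function or attribute | Purpose |
|---|---|
| df.shape | Returns the dimensions (shape) of df |
| df.describe() | Shows descriptive statistics for df |
| df.info() | Shows basic information about df |
| df.columns | Returns the column names of df |
| df.index | Returns the index of df |
| df.head() | Shows the first 5 rows of df |
| df.tail(6) | Shows the last 6 rows of df |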
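As a quick illustration of the entries in the table, the sketch below applies several of them to the patient data frame, rebuilt here so the block runs on its own. The argument 2 passed to `head` and `tail` is an arbitrary choice for a four-row frame.

```python
import pandas as pd

df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "age": [25, 34, 28, 52],
    "diabetes": ["Type1", "Type2", "Type1", "Type1"],
    "status": ["Poor", "Improved", "Excellent", "Poor"],
})

shape = df.shape               # (rows, columns) tuple
columns = list(df.columns)     # column names as a plain list
index = df.index               # RangeIndex for a default index
top = df.head(2)               # first 2 rows
bottom = df.tail(2)            # last 2 rows

print(shape)
print(columns)
print(index)
print(top)
print(bottom)
```

Note that `shape`, `columns`, and `index` are attributes and take no parentheses, while `describe`, `info`, `head`, and `tail` are methods.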