R <---> Python, Chapter 2: Creating a Dataset

This series attempts to reproduce R in Action (2nd edition) in Python. The chapter titles largely follow the original book.


Chapter 2: Creating a Dataset

Data structures

Lists

mylist = [1, 2, "three", [5, 6], (7,), True]
print(type(mylist))
mylist
<class 'list'>
[1, 2, 'three', [5, 6], (7,), True]

Tuples

mytuple = (1, 2, "three", [5, 6], (7,), True)
print(type(mytuple))
mytuple
<class 'tuple'>
(1, 2, 'three', [5, 6], (7,), True)
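The list and the tuple above hold the same elements; the practical difference is mutability. The following is a minimal sketch of that difference (the names `demo_list` and `demo_tuple` are illustrative and not part of the original post):

```python
# Lists are mutable: elements can be replaced or appended in place.
demo_list = [1, 2, "three"]
demo_list[0] = 100
demo_list.append(4)

# Tuples are immutable: item assignment raises TypeError.
demo_tuple = (1, 2, "three")
try:
    demo_tuple[0] = 100
    tuple_changed = True
except TypeError:
    tuple_changed = False

print(demo_list)
print(tuple_changed)
```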

Matrices

import numpy as np

# np.int was removed in NumPy 1.24; the built-in int is the replacement
mymatrix = np.matrix([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=int)
mymatrix
matrix([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]])

Arrays

# np.float was removed in NumPy 1.24; the built-in float is the replacement
myarray = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)
myarray
array([[1., 2., 3.],
       [4., 5., 6.],
       [7., 8., 9.]])
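One practical difference between `np.matrix` and `np.array` is the meaning of the `*` operator. The sketch below illustrates it with small 2x2 inputs (the variable names are illustrative). The NumPy documentation recommends regular arrays over `np.matrix` for new code.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
m = np.matrix([[1, 2], [3, 4]])

elementwise = a * a      # ndarray: * multiplies element by element
matmul_array = a @ a     # ndarray: @ is the matrix product
matmul_matrix = m * m    # np.matrix: * is already the matrix product

print(elementwise)
print(matmul_array)
print(matmul_matrix)
```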

Data frames

import pandas as pd

# Create a data frame
frame_dict = {'patient_id':[1, 2, 3, 4],
              'age':[25, 34, 28, 52],
              'diabetes':["Type1", "Type2", "Type1", "Type1"],
              'status':["Poor", "Improved", "Excellent", "Poor"]}
myframe = pd.DataFrame(frame_dict)
myframe
   patient_id  age diabetes     status
0           1   25    Type1       Poor
1           2   34    Type2   Improved
2           3   28    Type1  Excellent
3           4   52    Type1       Poor
Selecting elements
myframe['patient_id']
0    1
1    2
2    3
3    4
Name: patient_id, dtype: int64
myframe[['patient_id']]
   patient_id
0           1
1           2
2           3
3           4
myframe[['age','diabetes']]
   age diabetes
0   25    Type1
1   34    Type2
2   28    Type1
3   52    Type1
myframe.status
0         Poor
1     Improved
2    Excellent
3         Poor
Name: status, dtype: object
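The selections above all pick columns. Rows can be selected as well; the following is a minimal sketch using `iloc`, `loc`, and a boolean mask on a rebuilt copy of the same data (the names `demo`, `first_two`, `by_label`, and `older` are illustrative and not from the original post):

```python
import pandas as pd

demo = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "age": [25, 34, 28, 52],
    "diabetes": ["Type1", "Type2", "Type1", "Type1"],
    "status": ["Poor", "Improved", "Excellent", "Poor"],
})

first_two = demo.iloc[0:2]                    # rows by position, end excluded
by_label = demo.loc[0:2, ["age", "status"]]   # rows by label, end included
older = demo[demo["age"] > 30]                # rows where the condition holds

print(first_two)
print(by_label)
print(older)
```

Note that `iloc` slices exclude the end position while `loc` slices include the end label, so the two calls above return two and three rows respectively.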
Contingency table
pd.crosstab(myframe.diabetes, myframe.status)
status    Excellent  Improved  Poor
diabetes                           
Type1             1         0     2
Type2             0         1     0
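If row and column totals are wanted as well, `pd.crosstab` accepts `margins=True`, which appends an `All` row and column. A small sketch on the same two columns (the frame is rebuilt here so the example runs on its own):

```python
import pandas as pd

demo = pd.DataFrame({
    "diabetes": ["Type1", "Type2", "Type1", "Type1"],
    "status": ["Poor", "Improved", "Excellent", "Poor"],
})

# margins=True adds the "All" totals row and column
table = pd.crosstab(demo["diabetes"], demo["status"], margins=True)
print(table)
```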
Descriptive statistics
myframe.describe()
       patient_id        age
count    4.000000   4.000000
mean     2.500000  34.750000
std      1.290994  12.093387
min      1.000000  25.000000
25%      1.750000  27.250000
50%      2.500000  31.000000
75%      3.250000  38.500000
max      4.000000  52.000000
Data frame information
myframe.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   patient_id  4 non-null      int64 
 1   age         4 non-null      int64 
 2   diabetes    4 non-null      object
 3   status      4 non-null      object
dtypes: int64(2), object(2)
memory usage: 256.0+ bytes

Data input

Importing from a delimited text file

StudentID,First,Last,Math,Science,Social Studies
011,Bob,Smith,90,80,67
012,Jane,Weary,75,,80
010,Dan,"Thornton, III",65,75,70
040,Mary,"O'Leary",90,95,92

pd.read_table("studentgrades.txt", sep=",")
   StudentID First           Last  Math  Science  Social Studies
0         11   Bob          Smith    90     80.0              67
1         12  Jane          Weary    75      NaN              80
2         10   Dan  Thornton, III    65     75.0              70
3         40  Mary        O'Leary    90     95.0              92
grades = pd.read_csv("studentgrades.csv", sep=",")
grades
   StudentID First           Last  Math  Science  Social Studies
0         11   Bob          Smith    90     80.0              67
1         12  Jane          Weary    75      NaN              80
2         10   Dan  Thornton, III    65     75.0              70
3         40  Mary        O'Leary    90     95.0              92
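In the output above, the StudentID values lose their leading zeros (011 becomes 11) because pandas infers an integer column. One way to keep them, sketched here as an assumption rather than as the original post's method, is to read that column as text with the `dtype` parameter. The example uses an inline copy of the sample data (with Jane's missing Science score written as an empty field) in place of the file on disk:

```python
import io
import pandas as pd

# Inline stand-in for studentgrades.csv so the sketch is self-contained
csv_text = '''StudentID,First,Last,Math,Science,Social Studies
011,Bob,Smith,90,80,67
012,Jane,Weary,75,,80
010,Dan,"Thornton, III",65,75,70
040,Mary,"O'Leary",90,95,92
'''

# dtype={"StudentID": str} keeps the IDs as text, preserving leading zeros
grades_demo = pd.read_csv(io.StringIO(csv_text), dtype={"StudentID": str})
print(grades_demo)
```

The double quotes around "Thornton, III" keep the embedded comma from being treated as a field separator.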

Importing Excel data

# Method 1
grades2 = pd.read_excel("studentgrades.xlsx", sheet_name="Sheet1")
grades2
   StudentID First           Last  Math  Science  Social Studies
0         11   Bob          Smith    90     80.0              67
1         12  Jane          Weary    75      NaN              80
2         10   Dan  Thornton, III    65     75.0              70
3         40  Mary        O'Leary    90     95.0              92
# Method 2
reader = pd.ExcelFile("studentgrades.xlsx")
reader.sheet_names
['Sheet1', 'Sheet2', 'Sheet3']
sheet = reader.parse("Sheet1")
sheet
   StudentID First           Last  Math  Science  Social Studies
0         11   Bob          Smith    90     80.0              67
1         12  Jane          Weary    75      NaN              80
2         10   Dan  Thornton, III    65     75.0              70
3         40  Mary        O'Leary    90     95.0              92

Importing data from the clipboard

pd.read_clipboard(sep=",")
   StudentID First           Last  Math  Science  Social Studies
0         11   Bob          Smith    90     80.0              67
1         12  Jane          Weary    75      NaN              80
2         10   Dan  Thornton, III    65     75.0              70
3         40  Mary        O'Leary    90     95.0              92

Methods for working with data objects

df = myframe.copy()  # Copy the myframe data frame to df
df
   patient_id  age diabetes     status
0           1   25    Type1       Poor
1           2   34    Type2   Improved
2           3   28    Type1  Excellent
3           4   52    Type1       Poor
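Function        Purpose
df.shape        Show the dimensions (shape) of df
df.describe()   Show descriptive statistics for df
df.info()       Show basic information about df
df.columns      Show the column names of df
df.index        Show the index of df
df.head()       Show the first 5 rows of df
df.tail(6)      Show the last 6 rows of df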
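As a quick illustration, the attributes and methods listed above can be tried on a rebuilt copy of the four-row data frame. This is a minimal sketch (the name `demo` is illustrative); because the frame has only four rows, `head()` returns all of them.

```python
import pandas as pd

demo = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "age": [25, 34, 28, 52],
    "diabetes": ["Type1", "Type2", "Type1", "Type1"],
    "status": ["Poor", "Improved", "Excellent", "Poor"],
})

print(demo.shape)          # (rows, columns)
print(list(demo.columns))  # column names
print(list(demo.index))    # row labels
print(demo.head())         # up to the first 5 rows
print(demo.tail(2))        # the last 2 rows
```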
