pandas-1: Introduction to Series and DataFrame

薄荷杂学

Published 2023-07-14 21:41:51


Article link: https://blog.csdn.net/weixin_43825323/article/details/131731956

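What is a Series?

Definition

In the pandas library for Python, a Series is a data structure that you can think of as a labeled one-dimensional array. Each label is associated with one data value in the array. Labels can be of any hashable type, most commonly integers or strings.

When a Series is displayed, the index is on the left and the values are on the right. The default index runs from 0 to the length of the data minus 1, and you can also supply a custom index. The values and index attributes return the data and the index of the Series, respectively.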

import pandas as pd
import numpy as np
# Create a Series
s = pd.Series([1, 3, 5, np.nan, 6, 8])
# Display the Series
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

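In this example we created a Series with 6 elements: 1.0, 3.0, 5.0, NaN ("not a number", which pandas uses to mark a missing value), 6.0 and 8.0. Each value has an associated label; here the labels are the integers 0 through 5. Note that if you do not provide labels, pandas uses a default integer sequence. The integers are displayed as floats because NaN is a floating-point value, so the whole Series is stored as float64.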
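The index and values attributes mentioned above can be inspected directly. The following is a minimal sketch that rebuilds the same Series; the use of isna() and loc is an addition for illustration and is not part of the original example.

```python
import numpy as np
import pandas as pd

# Same data as in the example above
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# The index holds the labels; with no index given it is a RangeIndex
print(s.index)   # RangeIndex(start=0, stop=6, step=1)

# The values attribute exposes the underlying NumPy array
print(s.values)

# Count the missing values (NaN entries)
missing_count = s.isna().sum()
print(missing_count)   # 1

# Look up a single value by its label
value_at_4 = s.loc[4]
print(value_at_4)   # 6.0
```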

Ways to create a Series

Creating a Series from a one-dimensional array

import numpy as np
import pandas as pd
arr1 = np.arange(10)
s1 = pd.Series(arr1)
s1

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int32

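Note: the dtype shown above depends on the platform's default integer type. int32 typically appears on Windows with NumPy versions earlier than 2.0; on most other setups the result is int64.

Supplementary notes:

np.arange is a NumPy function that generates an evenly spaced sequence (an arithmetic progression) within a given range. The name is short for "array range".

The basic syntax of np.arange is:

numpy.arange([start, ]stop, [step, ]dtype=None)

where:

start is the first value of the range. If only one positional argument is given, it is treated as stop and the sequence starts at 0.
stop is the end of the range. Note that the end value itself is not included.
step is the difference between two consecutive values. The default is 1.
dtype is the data type of the resulting array. If not given, it is inferred from the other arguments.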

import numpy as np

# Generate the sequence from 0 to 9
a = np.arange(10)
print(a)  # Output: [0 1 2 3 4 5 6 7 8 9]

# Generate the sequence from 1 to 10
b = np.arange(1, 11)
print(b)  # Output: [ 1  2  3  4  5  6  7  8  9 10]

# Generate the sequence from 1 to 10 with a step of 2
c = np.arange(1, 11, 2)
print(c)  # Output: [1 3 5 7 9]

[0 1 2 3 4 5 6 7 8 9]
[ 1  2  3  4  5  6  7  8  9 10]
[1 3 5 7 9]
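The examples above only use integer arguments. As a small additional sketch (the variable names d, e and f are chosen here for illustration), the same function also accepts a fractional step, an explicit dtype, and a negative step:

```python
import numpy as np

# A fractional step produces a float array; stop (1) is still excluded
d = np.arange(0, 1, 0.25)
print(d)  # [0.   0.25 0.5  0.75]

# An explicit dtype overrides the inferred integer type
e = np.arange(5, dtype=float)
print(e)  # [0. 1. 2. 3. 4.]

# A negative step counts downwards; stop (0) is excluded
f = np.arange(10, 0, -3)
print(f)  # [10  7  4  1]
```

With non-integer steps, floating-point rounding can make the number of elements hard to predict, so np.linspace is often the safer choice when an exact count matters.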

Creating a Series from a list

import pandas as pd
s = pd.Series([1, 2, 3, 4, 5])
s

0    1
1    2
2    3
3    4
4    5
dtype: int64

Creating a Series from a dictionary

import pandas as pd
dict_data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(dict_data)
s
# In this example, the dictionary keys are used as the index of the Series.

a    0.0
b    1.0
c    2.0
dtype: float64

dic1 = {'a':10,'b':20,'c':30,'d':40,'e':50}
print(dic1)
print(type(dic1))

{'a': 10, 'b': 20, 'c': 30, 'd': 40, 'e': 50}
<class 'dict'>
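The dictionary dic1 above can be passed to pd.Series in the same way. The sketch below also passes an explicit index together with the dictionary, which selects values by key; the names s_from_dict and s_subset and the missing key 'z' are made up for illustration.

```python
import pandas as pd

dic1 = {'a': 10, 'b': 20, 'c': 30, 'd': 40, 'e': 50}

# The dictionary keys become the index labels
s_from_dict = pd.Series(dic1)
print(s_from_dict)

# With an explicit index, values are looked up by key;
# a label that is not a key in the dictionary ('z') becomes NaN
s_subset = pd.Series(dic1, index=['a', 'c', 'z'])
print(s_subset)
```

Because NaN is a float, s_subset is stored as float64 even though the dictionary values are integers.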

Creating a Series from a scalar

import pandas as pd

s = pd.Series(5, index=[0, 1, 2, 3])
s

0    5
1    5
2    5
3    5
dtype: int64

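In this example we created a Series in which every element has the same value, namely the scalar we passed in. The index list supplied at creation time determines how many times the scalar is repeated.

The index of a Series

In pandas, the index of a Series can be specified when the Series is created, or changed afterwards. Some examples follow.

Specifying the index at creation time

import pandas as pd
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s
# In this example we created a Series with the values [1, 2, 3, 4] and the index ['a', 'b', 'c', 'd'].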

a    1
b    2
c    3
d    4
dtype: int64

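Changing the index after creation

After a Series has been created, you can change its index by direct assignment. For example:

# In this example we first create a Series with the default index (0, 1, 2, 3)
# and then change its index to ['a', 'b', 'c', 'd'].
import pandas as pd

# Create a Series
s = pd.Series([1, 2, 3, 4])
print(s)

# Change the index
s.index = ['a', 'b', 'c', 'd']
print(s)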

0    1
1    2
2    3
3    4
dtype: int64
a    1
b    2
c    3
d    4
dtype: int64
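Once a Series has string labels, those labels can be used for lookups. The following sketch shows label-based access and two details of index assignment that the text does not cover: rename() returns a new Series instead of changing the original, and assigning an index of the wrong length fails. The variable names are chosen for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# Access a single value and several values by label
print(s['b'])              # 2
print(s.loc[['a', 'd']])

# rename() returns a new Series; s itself keeps its index
renamed = s.rename(index={'a': 'A'})
print(renamed.index.tolist())   # ['A', 'b', 'c', 'd']

# The new index must have the same length as the data
try:
    s.index = ['x', 'y']
    length_error = False
except ValueError as exc:
    length_error = True
    print("Index assignment failed:", exc)
```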

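What is a DataFrame?

A first look at DataFrame

DataFrame is one of the most important data structures in pandas.

You can think of it as a table in which each column can hold a different type (for example numbers, strings, or booleans) and each column has a name.

A DataFrame has both a row index and a column index.

Here is a simple example of creating a DataFrame: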

import pandas as pd
data = {
    'name': ['Tom', 'Nick', 'John', 'Tom'],
    'age': [20, 21, 19, 18],
    'city': ['New York', 'London', 'Toronto', 'Paris']
}
df = pd.DataFrame(data)
print(df)

   name  age      city
0   Tom   20  New York
1  Nick   21    London
2  John   19   Toronto
3   Tom   18     Paris

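In this example we first define a dictionary, data. Its keys ('name', 'age', 'city') become the column names of the DataFrame, and the corresponding values (lists) become the data in each column. We then create a DataFrame from the dictionary and print its contents.

As you can see, the DataFrame has two indexes: a row index (0 to 3 in this example) and a column index ('name', 'age', 'city' in this example).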
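The two indexes described above are available as attributes. This is a minimal sketch using the same data dictionary; the extra calls (shape, dtypes, mean) are added here for illustration.

```python
import pandas as pd

data = {
    'name': ['Tom', 'Nick', 'John', 'Tom'],
    'age': [20, 21, 19, 18],
    'city': ['New York', 'London', 'Toronto', 'Paris']
}
df = pd.DataFrame(data)

# Row index and column index
print(df.index)             # RangeIndex(start=0, stop=4, step=1)
print(df.columns.tolist())  # ['name', 'age', 'city']

# Number of rows and columns, and the type of each column
print(df.shape)             # (4, 3)
print(df.dtypes)

# Selecting one column by name gives a Series
ages = df['age']
print(ages.mean())          # 19.5
```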

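A DataFrame can be created in many ways, for example by reading data from a CSV file or an SQL database, or from a Python dictionary, a list, or another DataFrame (see below).

DataFrame provides a wide range of operations: viewing, accessing, selecting, deleting, replacing, sorting, grouping, merging, reshaping, statistical analysis, and more. It is one of the main tools for data processing and analysis, and the later chapters are mostly about working with DataFrames.

Ways to create a DataFrame

Creating a DataFrame from a two-dimensional array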

# np.arange already returns an ndarray, so no extra np.array() call is needed
arr2 = np.arange(12).reshape(4, 3)
print(arr2)
print(type(arr2))

[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]
<class 'numpy.ndarray'>

df1 = pd.DataFrame(arr2)
print(df1)
print(type(df1))

   0   1   2
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
<class 'pandas.core.frame.DataFrame'>
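In the output above, both the rows and the columns get default integer labels. As a sketch of how custom labels could be supplied, the columns and index parameters of pd.DataFrame can be passed along with the array; the label names used here ('x', 'y', 'z', 'r0' to 'r3') are invented for illustration.

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape(4, 3)

# Supply column names and row labels instead of the default 0..n-1
df_labeled = pd.DataFrame(
    arr,
    columns=['x', 'y', 'z'],
    index=['r0', 'r1', 'r2', 'r3'],
)
print(df_labeled)

# Look up one cell by row label and column label
cell = df_labeled.loc['r2', 'y']
print(cell)  # 7
```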

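Creating a DataFrame from a dictionary

When you create a DataFrame from a dictionary, the keys become the column labels, and the list given as each key's value becomes the data in that column. All lists must have the same length.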

dic2 = {'a':[1,2,3,4],'b':[5,6,7,8],'c':[9,10,11,12],'d':[13,14,15,16]}
print(dic2)
df2 = pd.DataFrame(dic2)
print(df2)

{'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8], 'c': [9, 10, 11, 12], 'd': [13, 14, 15, 16]}
   a  b   c   d
0  1  5   9  13
1  2  6  10  14
2  3  7  11  15
3  4  8  12  16

dic3 = {'one':{'a':1,'b':2,'c':3,'d':4},'two':{'a':5,'b':6,'c':7,'d':8},'three':{'a':9,'b':10,'c':11,'d':12}}
print(dic3)
df3 = pd.DataFrame(dic3)
print(df3)

{'one': {'a': 1, 'b': 2, 'c': 3, 'd': 4}, 'two': {'a': 5, 'b': 6, 'c': 7, 'd': 8}, 'three': {'a': 9, 'b': 10, 'c': 11, 'd': 12}}
   one  two  three
a    1    5      9
b    2    6     10
c    3    7     11
d    4    8     12
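With a nested dictionary such as dic3, the outer keys become columns and the inner keys become row labels. Two related variations are sketched below: DataFrame.from_dict with orient='index', which treats the outer keys as rows instead, and a list of dictionaries, where each dictionary is one row. The records list and the variable names are made up for illustration.

```python
import pandas as pd

dic3 = {
    'one': {'a': 1, 'b': 2, 'c': 3, 'd': 4},
    'two': {'a': 5, 'b': 6, 'c': 7, 'd': 8},
    'three': {'a': 9, 'b': 10, 'c': 11, 'd': 12},
}

# orient='index': outer keys become row labels, inner keys become columns
df_rows = pd.DataFrame.from_dict(dic3, orient='index')
print(df_rows)

# A list of dictionaries: each dictionary is one row;
# keys missing from a row ('c' in the first one) are filled with NaN
records = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4, 'c': 5}]
df_records = pd.DataFrame(records)
print(df_records)
```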

Creating a DataFrame from another DataFrame

df4 = df3[['one','three']]
df4

   one  three
a    1      9
b    2     10
c    3     11
d    4     12

s3 = df3['one']
s3

a    1
b    2
c    3
d    4
Name: one, dtype: int64
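The two selections above differ in what they return: a list of column names gives a DataFrame, while a single column name gives a Series. The sketch below makes that explicit and, as an assumption about typical usage rather than something the original states, calls copy() so that later changes to the new DataFrame clearly do not affect df3.

```python
import pandas as pd

dic3 = {
    'one': {'a': 1, 'b': 2, 'c': 3, 'd': 4},
    'two': {'a': 5, 'b': 6, 'c': 7, 'd': 8},
    'three': {'a': 9, 'b': 10, 'c': 11, 'd': 12},
}
df3 = pd.DataFrame(dic3)

# Single label -> Series; list of labels -> DataFrame
col_as_series = df3['one']
col_as_frame = df3[['one']]
print(type(col_as_series).__name__)  # Series
print(type(col_as_frame).__name__)   # DataFrame

# An explicit copy is independent of the original DataFrame
df_copy = df3[['one', 'three']].copy()
df_copy.loc['a', 'one'] = 100
print(df3.loc['a', 'one'])      # 1
print(df_copy.loc['a', 'one'])  # 100
```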

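Building a DataFrame by reading a CSV or Excel file (most common)

pandas provides convenient functions for reading data from a CSV or Excel file straight into a DataFrame. Some examples follow.

Reading data from a CSV file:

import pandas as pd
# Read a CSV file into a DataFrame
df = pd.read_csv('filename.csv')
# Look at the first few rows of the DataFrame
df.head()

In this example, pd.read_csv() reads the CSV file and creates a DataFrame. Replace 'filename.csv' with the actual path and name of your CSV file. df.head() returns the first rows of the DataFrame (5 by default).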

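Reading data from an Excel file:

import pandas as pd
# Read an Excel file into a DataFrame
df = pd.read_excel('filename.xlsx')
# Look at the first few rows of the DataFrame
df.head()

In this example, pd.read_excel() reads the Excel file and creates a DataFrame. Replace 'filename.xlsx' with the actual path and name of your Excel file. df.head() returns the first rows of the DataFrame (5 by default). Reading .xlsx files requires an optional engine package such as openpyxl to be installed.

Note: if the CSV or Excel file is located at a URL on the internet, you can pass the URL directly to pd.read_csv() or pd.read_excel().

In addition, if the Excel file has several worksheets, you can use the sheet_name parameter to choose which one to read. For example, pd.read_excel('filename.xlsx', sheet_name='Sheet1') reads the worksheet named 'Sheet1'.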
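The examples above need a real file on disk. To try read_csv without one, the CSV text can be wrapped in io.StringIO, which read_csv accepts like a file. This is only a sketch; the CSV content below is made up to mirror the earlier name/age/city example.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for 'filename.csv'
csv_text = """name,age,city
Tom,20,New York
Nick,21,London
John,19,Toronto
Tom,18,Paris
"""

# read_csv accepts a path, a URL, or a file-like object
df = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows (5 when n is omitted)
first_two = df.head(2)
print(first_two)
```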

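Building a DataFrame from other data sources

The read functions are called as pd.read_*; the write functions are DataFrame methods such as df.to_csv().

Description	Read	Write
Comma-separated data	read_csv	to_csv
JSON data	read_json	to_json
Tables in a web page	read_html	to_html
Clipboard contents	read_clipboard	to_clipboard
MS Excel files	read_excel	to_excel
HDF5 files (accessed through HDFStore)	read_hdf	to_hdf
Feather data (a fast, interoperable binary data frame format)	read_feather	to_feather
Parquet data (a columnar storage format from the Hadoop ecosystem)	read_parquet	to_parquet
MessagePack data (a binary format similar to JSON; removed from pandas in version 1.0)	read_msgpack	to_msgpack
Stata data	read_stata	to_stata
Python pickle data	read_pickle	to_pickle
SQL databases such as MySQL	read_sql	to_sql
Google BigQuery (interactive analysis of large datasets, usable together with Google storage; deprecated in recent pandas versions in favor of the pandas-gbq package)	read_gbq	to_gbq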
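Each pair in the table works the same way: a to_* method writes the DataFrame out and the matching read_* function loads it back. As a minimal sketch of one such pair, the following writes a small made-up DataFrame to CSV in memory and reads it back; an in-memory buffer is used instead of a file so the example leaves nothing on disk.

```python
import io

import pandas as pd

# A small DataFrame invented for this example
df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# Write to an in-memory text buffer; index=False leaves out the row labels
buffer = io.StringIO()
df.to_csv(buffer, index=False)
print(buffer.getvalue())

# Rewind the buffer and read the same data back
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

Without index=False, the row labels would be written as an extra unnamed column and would come back as a regular column when the file is read again.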
