NumPy Tutorial: Data Analysis with Python

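This tutorial covers data analysis with NumPy. Starting from CSV data held in a list of lists, it shows how to create 2-dimensional arrays with NumPy; how to index, slice, and assign values; and how to work with 1-dimensional and multidimensional arrays, including array math, data type conversion, array comparison, and reshaping. Worked examples on wine quality data show how convenient NumPy is for data processing.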

NumPy is a commonly used Python data analysis package. By using NumPy, you can speed up your workflow, and interface with other packages in the Python ecosystem, like scikit-learn, that use NumPy under the hood. NumPy was originally developed in the mid 2000s, and arose from an even older package called Numeric. This longevity means that almost every data analysis or machine learning package for Python leverages NumPy in some way.

In this tutorial, we’ll walk through using NumPy to analyze data on wine quality. The data contains information on various attributes of wines, such as pH and fixed acidity, along with a quality score between 0 and 10 for each wine. The quality score is the average of at least 3 human taste testers. As we learn how to work with NumPy, we’ll try to figure out more about the perceived quality of wine.

The wines we’ll be analyzing are from the Minho region of Portugal.

The data was downloaded from the UCI Machine Learning Repository, and is available here. Here are the first few rows of the winequality-red.csv file, which we’ll be using throughout this tutorial:

"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5

The data is in what I’m going to call ssv (semicolon separated values) format: the fields within each record are separated by semicolons (;), and the records (rows) are separated by newlines. There are 1600 rows in the file, including a header row, and 12 columns.

Before we get started, a quick version note – we’ll be using Python 3.5. Our code examples will be done using Jupyter notebook.

Lists Of Lists for CSV Data

Before using NumPy, we’ll first try to work with the data using Python and the csv package. We can read in the file using the csv.reader object, which will allow us to read in and split up all the content from the ssv file.

In the below code, we:

  • Import the csv library.
  • Open the winequality-red.csv file.
    • With the file open, create a new csv.reader object.
      • Pass in the keyword argument delimiter=";" to make sure that the records are split up on the semicolon character instead of the default comma character.
    • Call the list type to get all the rows from the file.
    • Assign the result to wines.

In [1]:
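The code for this cell did not survive in this copy. A minimal sketch of the steps listed above follows. To stay self-contained it reads a two-row sample through io.StringIO, which is an assumption made for illustration; the tutorial itself opens winequality-red.csv.

```python
import csv
import io

# Stand-in for the real file: the header plus the first two data rows shown above.
# With the file on disk, replace io.StringIO(sample) with
# open("winequality-red.csv", "r").
sample = (
    '"fixed acidity";"volatile acidity";"citric acid";"residual sugar";'
    '"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";'
    '"pH";"sulphates";"alcohol";"quality"\n'
    "7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5\n"
    "7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5\n"
)

with io.StringIO(sample) as f:
    # delimiter=";" splits fields on semicolons instead of the default comma
    wines = list(csv.reader(f, delimiter=";"))
```

Every field comes back as a string, including the numeric ones, which is what the rest of this section works around.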

Once we’ve read in the data, we can print out the first 3 rows:

In [3]:

print(wines[:3])

[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality'], ['7.4', '0.7', '0', '1.9', '0.076', '11', '34', '0.9978', '3.51', '0.56', '9.4', '5'], ['7.8', '0.88', '0', '2.6', '0.098', '25', '67', '0.9968', '3.2', '0.68', '9.8', '5']]

The data has been read into a list of lists. Each inner list is a row from the ssv file. As you may have noticed, each item in the entire list of lists is represented as a string, which will make it harder to do computations.

We’ll format the data into a table to make it easier to view:

fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density  pH    sulphates  alcohol  quality
7.4            0.70              0            1.9             0.076      11                   34                    0.9978   3.51  0.56       9.4      5
7.8            0.88              0            2.6             0.098      25                   67                    0.9968   3.20  0.68       9.8      5

As you can see from the table above, we’ve read in three rows, the first of which contains column headers. Each row after the header row represents a wine. The first element of each row is the fixed acidity, the second is the volatile acidity, and so on. We can find the average quality of the wines. The below code will:

  • Extract the last element from each row after the header row.
  • Convert each extracted element to a float.
  • Assign all the extracted elements to the list qualities.
  • Divide the sum of all the elements in qualities by the total number of elements in qualities to get the mean.

In [4]:
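The code for this cell is missing from this copy. The steps above can be sketched as follows, assuming a shortened sample list (two columns, three wines) in place of the full wines list, so the mean printed here differs from the full-file result shown below.

```python
# Sample in the same shape csv.reader produces: a header row, then rows of strings.
# The two-column layout is an assumption to keep the example short.
wines = [
    ["alcohol", "quality"],
    ["9.4", "5"],
    ["9.8", "5"],
    ["9.8", "6"],
]

# Skip the header, take the last field of each row, and convert it to a float.
qualities = [float(row[-1]) for row in wines[1:]]

# The mean is the sum divided by the number of elements.
mean_quality = sum(qualities) / len(qualities)
print(mean_quality)
```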

Out[4]:

5.6360225140712945

Although we were able to do the calculation we wanted, the code is fairly complex, and it won’t be fun to have to do something similar every time we want to compute a quantity. Luckily, we can use NumPy to make it easier to work with our data.

NumPy 2-Dimensional Arrays

With NumPy, we work with multidimensional arrays. We’ll dive into all of the possible types of multidimensional arrays later on, but for now, we’ll focus on 2-dimensional arrays. A 2-dimensional array is also known as a matrix, and is something you should be familiar with. In fact, it’s just a different way of thinking about a list of lists. A matrix has rows and columns. By specifying a row number and a column number, we’re able to extract an element from a matrix.

In the below matrix, the first row is the header row, and the first column is the fixed acidity column:

fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density  pH    sulphates  alcohol  quality
7.4            0.70              0            1.9             0.076      11                   34                    0.9978   3.51  0.56       9.4      5
7.8            0.88              0            2.6             0.098      25                   67                    0.9968   3.20  0.68       9.8      5

If we picked the element at the first row and the second column, we’d get volatile acidity. If we picked the element in the third row and the second column, we’d get 0.88.

In a NumPy array, the number of dimensions is called the rank, and each dimension is called an axis. So the rows are the first axis, and the columns are the second axis.

Now that you understand the basics of matrices, let’s see how we can get from our list of lists to a NumPy array.

Creating A NumPy Array

We can create a NumPy array using the numpy.array function. If we pass in a list of lists, it will automatically create a NumPy array with the same number of rows and columns. Because we want all of the elements in the array to be float elements for easy computation, we’ll leave off the header row, which contains strings. One of the limitations of NumPy is that all the elements in an array have to be of the same type, so if we include the header row, all the elements in the array will be read in as strings. Because we want to be able to do computations like find the average quality of the wines, we need the elements to all be floats.

In the below code, we:

  • Import the numpy package.
  • Pass the list of lists wines into the array function, which converts it into a NumPy array.
    • Exclude the header row with list slicing.
    • Specify the keyword argument dtype to make sure each element is converted to a float. We’ll dive more into what the dtype is later on.

In [5]:

import numpy as np

# dtype=float replaces np.float, an alias that was removed in NumPy 1.24
wines = np.array(wines[1:], dtype=float)

If we display wines, we’ll now get a NumPy array:

Out[6]:

array([[  7.4  ,   0.7  ,   0.   , ...,   0.56 ,   9.4  ,   5.   ],
       [  7.8  ,   0.88 ,   0.   , ...,   0.68 ,   9.8  ,   5.   ],
       [  7.8  ,   0.76 ,   0.04 , ...,   0.65 ,   9.8  ,   5.   ],
       ..., 
       [  6.3  ,   0.51 ,   0.13 , ...,   0.75 ,  11.   ,   6.   ],
       [  5.9  ,   0.645,   0.12 , ...,   0.71 ,  10.2  ,   5.   ],
       [  6.   ,   0.31 ,   0.47 , ...,   0.66 ,  11.   ,   6.   ]])

We can check the number of rows and columns in our data using the shape property of NumPy arrays:

In [7]:

wines.shape

Out[7]:

(1599, 12)
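To illustrate why the header row is left out, here is a small sketch on an assumed two-column sample (not the full wines data). It compares the array built with the header to the one built without it.

```python
import numpy as np

# Assumed sample in the same form as the list of lists: header row plus string rows.
rows = [
    ["alcohol", "quality"],
    ["9.4", "5"],
    ["9.8", "5"],
]

with_header = np.array(rows)                # every element stays a string
numeric = np.array(rows[1:], dtype=float)   # numeric strings are parsed into floats

print(with_header.dtype.kind)  # "U" marks a unicode string dtype
print(numeric.dtype)
print(numeric.shape)
```

With the header included, NumPy has to settle on a single type that fits every element, so the whole array becomes strings and arithmetic on it is no longer possible.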

Alternative NumPy Array Creation Methods

There are a variety of methods that you can use to create NumPy arrays. To start with, you can create an array where every element is zero. The below code will create an array with 3 rows and 4 columns, where every element is 0, using numpy.zeros:

In [8]:

empty_array = np.zeros((3, 4))
empty_array

Out[8]:

array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])

It’s useful to create an array with all zero elements in cases when you need an array of fixed size, but don’t have any values for it yet.

You can also create an array where each element is a random number using numpy.random.rand. Here’s an example:

In [9]:

np.random.rand(3, 4)

Out[9]:

array([[ 0.2247223 ,  0.92240549,  0.14541893,  0.61731257],
       [ 0.00154957,  0.82342197,  0.74044906,  0.11466845],
       [ 0.6152478 ,  0.14433138,  0.13009583,  0.22981301]])

Creating arrays full of random numbers can be useful when you want to quickly test your code with sample arrays.

Using NumPy To Read In Files

It’s possible to use NumPy to directly read csv or other files into arrays. We can do this using the numpy.genfromtxt function. We can use it to read in our initial data on red wines.

In the below code, we:

  • Use the genfromtxt function to read in the winequality-red.csv file.
  • Specify the keyword argument delimiter=";" so that the fields are parsed properly.
  • Specify the keyword argument skip_header=1 so that the header row is skipped.

In [62]:

wines = np.genfromtxt("winequality-red.csv", delimiter=";", skip_header=1)

wines will end up looking the same as if we read it into a list then converted it to an array of floats. NumPy will automatically pick a data type for the elements in an array based on their format.
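If the file is not at hand, the same call can be tried on in-memory text. This sketch assumes a two-column sample passed through io.StringIO; the column choice and variable name are illustrative only.

```python
import io

import numpy as np

# Assumed sample: a quoted header line followed by two data rows.
sample = (
    '"alcohol";"quality"\n'
    "9.4;5\n"
    "9.8;5\n"
)

# genfromtxt accepts a file-like object as well as a filename.
wines_sample = np.genfromtxt(io.StringIO(sample), delimiter=";", skip_header=1)
print(wines_sample.dtype, wines_sample.shape)
```

Even the integer-looking quality values come back as floats here, so the result matches the array of floats built from the list of lists.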

Indexing NumPy Arrays

We now know how to create arrays, but unless we can retrieve results from them, there isn’t a lot we can do with NumPy. We can use array indexing to select individual elements, groups of elements, or entire rows and columns. One important thing to keep in mind is that, just like Python lists, NumPy arrays are zero-indexed: the index of the first row is 0, and the index of the first column is 0. If we want to work with the fourth row, we’d use index 3; if we want to work with the second row, we’d use index 1; and so on. We’ll again work with the wines array:

7.4   0.70  0.00  1.9  0.076  11  34  0.9978  3.51  0.56  9.4  5
7.8   0.88  0.00  2.6  0.098  25  67  0.9968  3.20  0.68  9.8  5
7.8   0.76  0.04  2.3  0.092  15  54  0.9970  3.26  0.65  9.8  5
11.2  0.28  0.56  1.9  0.075  17  60  0.9980  3.16  0.58  9.8  6
7.4   0.70  0.00  1.9  0.076  11  34  0.9978  3.51  0.56  9.4  5

Let’s select the element at row 3 and column 4. In the below code, we pass in the index 2 as the row index, and the index 3 as the column index. This retrieves the value from the fourth column of the third row:

In [11]:

wines[2, 3]

Out[11]:

2.2999999999999998

Since we’re working with a 2-dimensional array in NumPy, we specify 2 indexes to retrieve an element. The first index selects the row (the first axis, which NumPy numbers as axis 0), and the second index selects the column (the second axis, axis 1). Any element in wines can be retrieved using 2 indexes.

Slicing NumPy Arrays

If we instead want to select the first three items from the fourth column, we can do it using a colon (:). A colon indicates that we want to select all the elements from the starting index up to but not including the ending index. This is also known as a slice:

In [12]:

wines[0:3, 3]

Out[12]:

array([ 1.9,  2.6,  2.3])

Just like with list slicing, it’s possible to omit the 0 to just retrieve all the elements from the beginning up to element 3:

In [13]:

wines[:3, 3]

Out[13]:

array([ 1.9,  2.6,  2.3])

We can select an entire column by specifying that we want all the elements, from the first to the last. We specify this by just using the colon (:), with no starting or ending indices. The below code will select the entire fourth column:

In [14]:

wines[:, 3]

Out[14]:

array([ 1.9,  2.6,  2.3, ...,  2.3,  2. ,  3.6])

We selected an entire column above, but we can also extract an entire row:

In [15]:

wines[3, :]

Out[15]:

array([ 11.2  ,   0.28 ,   0.56 ,   1.9  ,   0.075,  17.   ,  60.   ,
         0.998,   3.16 ,   0.58 ,   9.8  ,   6.   ])

If we take our indexing to the extreme, we can select the entire array using two colons to select all the rows and columns in wines. This is a great party trick, but doesn’t have a lot of good applications:

In [16]:

wines[:, :]

Out[16]:

array([[  7.4  ,   0.7  ,   0.   , ...,   0.56 ,   9.4  ,   5.   ],
       [  7.8  ,   0.88 ,   0.   , ...,   0.68 ,   9.8  ,   5.   ],
       [  7.8  ,   0.76 ,   0.04 , ...,   0.65 ,   9.8  ,   5.   ],
       ..., 
       [  6.3  ,   0.51 ,   0.13 , ...,   0.75 ,  11.   ,   6.   ],
       [  5.9  ,   0.645,   0.12 , ...,   0.71 ,  10.2  ,   5.   ],
       [  6.   ,   0.31 ,   0.47 , ...,   0.66 ,  11.   ,   6.   ]])
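As a recap of the indexing and slicing forms in this section, here is a small self-contained sketch. It assumes only the first four data rows shown earlier rather than the full file, so the column results are shorter than the outputs above.

```python
import numpy as np

# First four data rows of the wine data, as listed in the table above.
wines = np.array([
    [7.4, 0.70, 0.00, 1.9, 0.076, 11, 34, 0.9978, 3.51, 0.56, 9.4, 5],
    [7.8, 0.88, 0.00, 2.6, 0.098, 25, 67, 0.9968, 3.20, 0.68, 9.8, 5],
    [7.8, 0.76, 0.04, 2.3, 0.092, 15, 54, 0.9970, 3.26, 0.65, 9.8, 5],
    [11.2, 0.28, 0.56, 1.9, 0.075, 17, 60, 0.9980, 3.16, 0.58, 9.8, 6],
])

element = wines[2, 3]        # third row, fourth column
first_three = wines[:3, 3]   # rows 0 to 2 of the fourth column
column = wines[:, 3]         # the entire fourth column
row = wines[3, :]            # the entire fourth row
everything = wines[:, :]     # all rows and all columns

print(element, first_three, column.shape, row.shape, everything.shape)
```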

Assigning Values To NumPy Arrays

We can also use indexing to assign values to certain elements in arrays. We can do this by assigning directly to the indexed value:

In [17]:
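The code for this cell is missing from this copy, so the indices the tutorial used are not known here. A minimal sketch of assigning to a single indexed element, using an assumed small array and hypothetical indices:

```python
import numpy as np

# Assumed stand-in for wines: a 3x4 array of zeros.
sample = np.zeros((3, 4))

# Assign directly to the element at row index 1, column index 2.
sample[1, 2] = 10

print(sample)
```

Only the indexed element changes; every other element keeps its previous value.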

We can do the same for slices. To overwrite an entire column, we can do this:

In [18]:

wines[:, 10] = 50

The above code overwrites all the values in the eleventh column with 50.

1-Dimensional NumPy Arrays

So far, we’ve worked with 2-dimensional arrays, such as wines. However, NumPy is a package for working with multidimensional arrays. One of the most common types of multidimensional arrays is the 1-dimensional array, or vector. As you may have noticed above, when we sliced wines, we retrieved a 1-dimensional array. A 1-dimensional array only needs a single index to retrieve an element. Each row and column in a 2-dimensional array is a 1-dimensional array. Just like a list of lists is analogous to a 2-dimensional array, a single list is analogous to a 1-dimensional array. If we slice wines and only retrieve the third row, we get a 1-dimensional array:

In [20]:

third_wine = wines[3, :]

Here’s how third_wine looks:

11.200
0.280
0.560
1.900
0.075
17.000
60.000
0.998
3.160
0.580
9.800
6.000

We can retrieve individual elements from third_wine using a single index. The below code will display the second item in third_wine:

In [21]:

third_wine[1]

Out[21]:

0.28000000000000003

Most NumPy functions that we’ve worked with, such as numpy.random.rand, can be used with multidimensional arrays. Here’s how we’d use numpy.random.rand to generate a random vector:

In [22]:

np.random.rand(3)

Out[22]:

array([ 0.88588862,  0.85693478,  0.19496774])

Previously, when we called np.random.rand, we passed in a shape for a 2-dimensional array, so the result was a 2-dimensional array. This time, we passed in a shape for a single dimensional array. The shape specifies the number of dimensions, and the size of the array in each dimension. A shape of (10,10) will be a 2-dimensional array with 10 rows and 10 columns. A shape of (10,) will be a 1-dimensional array with 10 elements.
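A short sketch of the two shapes described above. Note that np.random.rand takes the dimension sizes as separate arguments rather than as a tuple; the variable names are illustrative.

```python
import numpy as np

matrix = np.random.rand(10, 10)   # 2-dimensional: 10 rows and 10 columns
vector = np.random.rand(10)       # 1-dimensional: 10 elements

# shape lists the size along each dimension; ndim is the number of dimensions.
print(matrix.shape, matrix.ndim)
print(vector.shape, vector.ndim)
```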

Where NumPy gets more complex is when we start to deal with arrays that have more than 2 dimensions.

N-Dimensional NumPy Arrays

This doesn’t happen extremely often, but there are cases when you’ll want to deal with arrays that have 3 or more dimensions. One way to think of this is as a list of lists of lists. Let’s say we want to store the monthly earnings of a store, but we want to be able to quickly look up the results for a quarter, and for a year. The earnings for one year might look like this:

[500, 505, 490, 810, 450, 678, 234, 897, 430, 560, 1023, 640]

The store earned $500 in January, $505 in February, and so on. We can split up these earnings by quarter into a list of lists:

In [23]:

year_one = [
    [500, 505, 490],
    [810, 450, 678],
    [234, 897, 430],
    [560, 1023, 640]
]

We can retrieve the earnings from January by calling year_one[0][0]. If we want the results for a whole quarter, we can call year_one[0] or year_one[1]. We now have a 2-dimensional array, or matrix. But what if we now want to add the results from another year? We have to add a third dimension:

In [24]:
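The code for this cell is missing from this copy. A sketch of what adding a second year could look like follows. The first-quarter values for year two (600, 605, 490) match the outputs shown later in this section; the other year-two values are placeholders assumed for illustration.

```python
year_one = [
    [500, 505, 490],
    [810, 450, 678],
    [234, 897, 430],
    [560, 1023, 640],
]

# Only the first quarter below is confirmed by later outputs;
# the remaining three quarters are placeholder values.
year_two = [
    [600, 605, 490],
    [345, 900, 1000],
    [780, 730, 710],
    [670, 540, 324],
]

# A list of years, each a list of quarters, each a list of months.
earnings = [year_one, year_two]

print(earnings[0][0][0])  # January of the first year
```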

We can retrieve the earnings from January of the first year by calling earnings[0][0][0]. We now need three indexes to retrieve a single element. A three-dimensional array in NumPy is much the same. In fact, we can convert earnings to an array and then get the earnings for January of the first year:

In [25]:

earnings = np.array(earnings)
earnings[0, 0, 0]

We can also find the shape of the array:

In [26]:

earnings.shape

Indexing and slicing work the exact same way with a 3-dimensional array, but now we have an extra axis to pass in. If we wanted to get the earnings for January of all years, we could do this:

In [27]:

earnings[:, 0, 0]

Out[27]:

array([500, 600])

If we wanted to get first quarter earnings from both years, we could do this:

In [28]:

earnings[:, 0, :]

Out[28]:

array([[500, 505, 490],
       [600, 605, 490]])

Adding more dimensions can make it much easier to query your data if it’s organized in a certain way. As we go from 3-dimensional arrays to 4-dimensional and larger arrays, the same properties apply, and they can be indexed and sliced in the same ways.

NumPy Data Types

As we mentioned earlier, each NumPy array can store elements of a single data type. For example, wines contains only float values. NumPy stores values using its own data types, which are distinct from Python types like float and str. This is because the core of NumPy is written in a programming language called C, which stores data differently than the Python data types. NumPy data types map between Python and C, allowing us to use NumPy arrays without any conversion hitches.

You can find the data type of a NumPy array by accessing the dtype property:

In [29]:

wines.dtype

Out[29]:

dtype('float64')

NumPy has several different data types, which mostly map to Python data types, like float, and str. You can find a full listing of NumPy data types here, but here are a few important ones:

  • float – numeric floating point data.
  • int – integer data.
  • string – character data.
  • object – Python objects.

Data types additionally end with a suffix that indicates how many bits of memory they take up. So int32 is a 32 bit integer data type, and float64 is a 64 bit float data type.
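The relationship between the suffix and the memory used can be checked directly. This sketch relies on the itemsize attribute of a dtype, which reports the size of one element in bytes.

```python
import numpy as np

for name in ("int32", "float64"):
    dt = np.dtype(name)
    # itemsize is in bytes, so multiply by 8 to get bits.
    print(name, dt.itemsize * 8)

bits_int32 = np.dtype("int32").itemsize * 8
bits_float64 = np.dtype("float64").itemsize * 8
```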

Converting Data Types

You can use the numpy.ndarray.astype method to convert an array to a different type. The method will actually copy the array, and return a new array with the specified data type. For instance, we can convert wines to the int data type:

In [30]:

wines.astype(int)

Out[30]:

array([[ 7,  0,  0, ...,  0,  9,  5],
       [ 7,  0,  0, ...,  0,  9,  5],
       [ 7,  0,  0, ...,  0,  9,  5],
       ..., 
       [ 6,  0,  0, ...,  0, 11,  6],
       [ 5,  0,  0, ...,  0, 10,  5],
       [ 6,  0,  0, ...,  0, 11,  6]])

As you can see above, all of the items in the resulting array are integers. Note that we used the Python int type instead of a NumPy data type when converting wines. This is because several Python data types, including float, int, and str, can be used with NumPy, and are automatically converted to NumPy data types.

We can check the name property of the dtype of the resulting array to see what data type NumPy mapped the resulting array to:

In [31]:

int_wines = wines.astype(int)
int_wines.dtype.name

The array has been converted to a 64-bit integer data type. This allows for very long integer values, but takes up more space in memory than storing the values as 32-bit integers.
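A small sketch of the conversion on assumed sample values. It shows that astype returns a new array, that the fractional part is dropped, and that a 32-bit type takes less memory. Which integer size the Python int maps to depends on the platform and NumPy version, so the sketch does not assume 64 bits.

```python
import numpy as np

values = np.array([7.4, 0.88, 9.8, 5.0])

as_int = values.astype(int)         # platform default integer size
as_int32 = values.astype(np.int32)  # explicitly 32-bit

print(as_int)     # fractional parts are dropped, not rounded
print(values)     # the original array is unchanged
print(as_int32.dtype.name, as_int32.nbytes)  # 4 elements at 4 bytes each
```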

If you want more control over how the array is stored in memory, you can directly create NumPy dtype objects like numpy.int32:

In [32]:

np.int32

Out[32]:

numpy.int32

You can use these directly to convert between types:

In [33]:

wines.astype(np.int32)

Out[33]:

array([[ 7,  0,  0, ...,  0,  9,  5],
       [ 7,  0,  0, ...,  0,  9,  5],
       [ 7,  0,  0, ...,  0,  9,  5],
       ..., 
       [ 6,  0,  0, ...,  0, 11,  6],
       [ 5,  0,  0, ...,  0, 10,  5],
       [ 6,  0,  0, ...,  0, 11,  6]], dtype=int32)

NumPy Array Operations

NumPy makes it simple to perform mathematical operations on arrays. This is one of the primary advantages of NumPy, and makes it quite easy to do computations.

Single Array Math

If you do any of the basic mathematical operations (/, *, -, +, **) with an array and a value, it will apply the operation to each of the elements in the array. Note that exponentiation is written ** in Python; ^ is bitwise XOR and is not defined for float arrays.

Let’s say we want to add 10 points to each quality score because we’re drunk and feeling generous. Here’s how we’d do that:

In [34]:

wines[:, 11] + 10

Out[34]:

array([ 15.,  15.,  15., ...,  16.,  15.,  16.])

Note that the above operation won’t change the wines array – it will return a new 1-dimensional array where 10 has been added to each element in the quality column of wines.

If we instead did +=, we’d modify the array in place:

In [35]:

wines[:, 11] += 10
wines[:, 11]

Out[35]:

array([ 15.,  15.,  15., ...,  16.,  15.,  16.])
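The difference between + and += can be checked on a small array. This sketch assumes a two-column sample (alcohol and quality) in place of wines.

```python
import numpy as np

# Assumed sample: column 0 is alcohol, column 1 is quality.
scores = np.array([
    [9.4, 5.0],
    [9.8, 5.0],
    [9.8, 6.0],
])

bumped = scores[:, 1] + 10     # returns a new array
before = scores[:, 1].copy()   # snapshot: the source column is still unchanged

scores[:, 1] += 10             # modifies the column of scores in place

print(before)
print(scores[:, 1])
```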

All the other operations work the same way. For example, if we want to multiply each of the quality scores by 2, we could do it like this:

In [36]:

wines[:, 11] * 2

Out[36]:

array([ 30.,  30.,  30., ...,  32.,  30.,  32.])

Multiple Array Math

It’s also possible to do mathematical operations between arrays. This will apply the operation to pairs of elements. For example, if we add the quality column to itself, here’s what we get:

In [38]:

wines[:, 11] + wines[:, 11]

Out[38]:

array([ 10.,  10.,  10., ...,  12.,  10.,  12.])

Note that this is equivalent to wines[:, 11] * 2, because NumPy adds each pair of elements. The first element in the first array is added to the first element in the second array, the second to the second, and so on.

We can also use this to multiply arrays. Let’s say we want to pick a wine that maximizes alcohol content and quality (we want to get drunk, but we’re classy). We’d multiply alcohol by quality, and select the wine with the highest score:

In [39]:

wines[:, 10] * wines[:, 11]

Out[39]:

array([ 47.,  49.,  49., ...,  66.,  51.,  66.])
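The multiplication above produces the scores but does not yet pick the winner. One way to do that, sketched here on assumed sample values, is numpy.argmax, which returns the index of the largest element.

```python
import numpy as np

# Assumed sample values for three wines.
alcohol = np.array([9.4, 9.8, 11.0])
quality = np.array([5.0, 5.0, 6.0])

score = alcohol * quality   # elementwise product, one score per wine
best = np.argmax(score)     # index of the wine with the highest score

print(best, score[best])
```

With the full data, the same index could be used to look up the winning row, for example wines[best, :].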

All of the common operations (/, *, -, +, **) will work between arrays.

Broadcasting

Unless the arrays that you’re operating on are the exact same size, it’s not possible to do elementwise operations. In cases like this, NumPy performs broadcasting to try to match up elements. Essentially, broadcasting involves a few steps:

  • The last dimension of each array is compared.
    • If the dimension lengths are equal, or one of the dimensions is of length 1, then we keep going.
    • If the dimension lengths aren’t equal, and none of the dimensions have length 1, then there’s an error.
  • Continue checking dimensions until the shortest array is out of dimensions.

For example, the following two shapes are compatible:

This is because the length of the trailing dimension of array A is 3, and the length of the trailing dimension of array B is 3. They’re equal, so that dimension is okay. Array B is then out of elements, so we’re okay, and the arrays are compatible for mathematical operations.

The following two shapes are also compatible:

A: (1,2)
B: (50,2)

The last dimension matches, and A is of length 1 in the first dimension.

These two arrays don’t match:

The lengths of the dimensions aren’t equal, and neither array has either dimension length equal to 1.
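The compatibility steps described in this section can be sketched as a small helper. The function name shapes_compatible is hypothetical, and the shapes (50, 3) with (3,) and (50, 50) with (49, 49) are assumed here for illustration, since two of the shape pairs are missing from this copy. This is a simplified check, not NumPy's own implementation.

```python
import numpy as np


def shapes_compatible(shape_a, shape_b):
    """Return True if two shapes pass the trailing-dimension check."""
    # Compare dimensions from the last one backwards; zip stops when
    # the shorter shape runs out of dimensions.
    for dim_a, dim_b in zip(reversed(shape_a), reversed(shape_b)):
        if dim_a != dim_b and dim_a != 1 and dim_b != 1:
            return False
    return True


print(shapes_compatible((50, 3), (3,)))        # trailing dimensions are equal
print(shapes_compatible((1, 2), (50, 2)))      # a length-1 dimension stretches
print(shapes_compatible((50, 50), (49, 49)))   # unequal, and neither is 1

# NumPy agrees: compatible shapes combine, incompatible ones raise ValueError.
combined = np.zeros((1, 2)) + np.zeros((50, 2))
try:
    np.zeros((50, 50)) + np.zeros((49, 49))
    raised = False
except ValueError:
    raised = True
```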

There’s a detailed explanation of broadcasting here, but we’ll go through a few examples to illustrate the principle:

In [40]:

wines * np.array([1, 2])

The above example didn’t work because the two arrays don’t have a matching trailing dimension. Here’s an example where the last dimension does match:

In [72]:

array_one = np.array(
    [
        [1, 2],
        [3, 4]
    ]
)
array_two = np.array([4, 5])

array_one + array_two

Out[72]:

array([[5, 7],
       [7, 9]])

As you can see, array_two has been broadcast across each row of array_one. Here’s an example with our wines data:

In [41]:

在[41]中:

Out[41]:

出[41]:


array([[  8.08375389,   0.89047394,   0.77022918, ...,   0.94917479,
         10.34668852,   5.34569289],
       [  8.48375389,   1.07047394,   0.77022918, ...,   1.06917479,
         10.74668852,   5.34569289],
       [  8.48375389,   0.95047394,   0.81022918, ...,   1.03917479,
         10.74668852,   5.34569289],
       ..., 
       [  6.98375389,   0.70047394,   0.90022918, ...,   1.13917479,
         11.94668852,   6.34569289],
       [  6.58375389,   0.83547394,   0.89022918, ...,   1.09917479,
         11.14668852,   5.34569289],
       [  6.68375389,   0.50047394,   1.24022918, ...,   1.04917479,
         11.94668852,   6.34569289]])

Elements of rand_array are broadcast over each row of wines, so the first column of wines has the first value in rand_array added to it, and so on.
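
To see the column-by-column behaviour on something small enough to check by hand, here is a minimal sketch. The array `small` is a made-up stand-in for wines, and `offsets` plays the role of rand_array with fixed values instead of random ones.

```python
import numpy as np

# Stand-in for the wines array: 4 rows, 3 columns.
small = np.arange(12, dtype=float).reshape(4, 3)
offsets = np.array([0.1, 0.2, 0.3])  # one value per column

shifted = small + offsets

# Column 0 got offsets[0] added, column 1 got offsets[1], and so on.
for col in range(small.shape[1]):
    assert np.allclose(shifted[:, col], small[:, col] + offsets[col])
print(shifted.shape)  # (4, 3)
```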

NumPy Array Methods

In addition to the common mathematical operations, NumPy also has several methods that you can use for more complex calculations on arrays. An example of this is the numpy.ndarray.sum method. This finds the sum of all the elements in an array by default:

In [42]:

wines[:,11].sum()

The total of all of our quality ratings is 9012.0. We can pass the axis keyword argument into the sum method to find sums over an axis. If we call sum across the wines matrix, and pass in axis=0, we’ll find the sums over the first axis of the array. This will give us the sum of all the values in every column. It may seem backwards that summing over the first axis gives us the sum of each column, but one way to think about this is that the specified axis is the one “going away”. So if we specify axis=0, we want the rows to go away, and we want to find the sum of each column across all of the rows:

In [43]:

wines.sum(axis=0)

Out[43]:


array([ 13303.1    ,    843.985  ,    433.29   ,   4059.55   ,
          139.859  ,  25384.     ,  74302.     ,   1593.79794,
         5294.47   ,   1052.38   ,  16666.35   ,   9012.     ])

We can verify that we did the sum correctly by checking the shape. The shape should be (12,), corresponding to the number of columns:

In [44]:

wines.sum(axis=0).shape

If we pass in axis=1, we’ll find the sums over the second axis of the array. This will give us the sum of each row:

In [45]:

wines.sum(axis=1)

Out[45]:


array([  74.5438 ,  123.0548 ,   99.699  , ...,  100.48174,  105.21547,
         92.49249])
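
The “going away” idea is easier to check on a tiny array. The sketch below uses a made-up 3-by-4 array named `data` rather than the wines data.

```python
import numpy as np

# Stand-in array with 3 rows and 4 columns.
data = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 10, 11, 12]])

col_sums = data.sum(axis=0)  # rows "go away": one sum per column
row_sums = data.sum(axis=1)  # columns "go away": one sum per row

print(col_sums, col_sums.shape)  # [15 18 21 24] (4,)
print(row_sums, row_sums.shape)  # [10 26 42] (3,)
print(data.sum())                # 78
```

In both cases the length of the result equals the length of the axis that was kept, which is the same check we did with the shape above.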

There are several other methods that behave like the sum method, including numpy.ndarray.mean, numpy.ndarray.prod, and numpy.ndarray.max.

The NumPy documentation has a full list of array methods.
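
As a quick illustration of how such methods take the same axis argument as sum, here is a small sketch on a made-up 2-by-2 array. The methods shown (mean, prod, max) are standard ndarray methods; the variable names are only for this example.

```python
import numpy as np

data = np.array([[1.0, 2.0],
                 [3.0, 4.0]])

overall_mean = data.mean()       # mean of all four values: 2.5
col_means = data.mean(axis=0)    # one mean per column: 2.0 and 3.0
row_products = data.prod(axis=1) # one product per row: 2.0 and 12.0
col_maxes = data.max(axis=0)     # one maximum per column: 3.0 and 4.0

print(overall_mean, col_means, row_products, col_maxes)
```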

NumPy Array Comparisons

NumPy makes it possible to test to see if rows match certain values using mathematical comparison operations like <, >, >=, <=, and ==. For example, if we want to see which wines have a quality rating higher than 5, we can do this:

In [46]:

wines[:,11] > 5

Out[46]:


array([False, False, False, ...,  True, False,  True], dtype=bool)

We get a Boolean array that tells us which of the wines have a quality rating greater than 5. We can do something similar with the other operators. For instance, we can see if any wines have a quality rating equal to 10:

In [47]:

wines[:,11] == 10

Out[47]:


array([False, False, False, ..., False, False, False], dtype=bool)
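
A long Boolean array is hard to read when most of it is elided, so it can help to summarize it. The sketch below uses made-up quality scores in place of wines[:,11]; `any` and `sum` are standard array methods, and the variable names are only for illustration.

```python
import numpy as np

# Made-up quality scores, standing in for wines[:,11].
quality = np.array([5.0, 6.0, 7.0, 5.0, 8.0])

above_five = quality > 5
count_above_five = above_five.sum()   # True counts as 1, False as 0
any_perfect = (quality == 10).any()   # is there at least one score of 10?

print(above_five)        # [False  True  True False  True]
print(count_above_five)  # 3
print(any_perfect)       # False
```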

Subsetting

One of the powerful things we can do with a Boolean array and a NumPy array is select only certain rows or columns in the NumPy array. For example, the code below will only select rows in wines where the quality is over 7, and show the first 3 of them:

In [64]:

high_quality = wines[:,11] > 7
wines[high_quality,:][:3,:]

Out[64]:


array([[  7.90000000e+00,   3.50000000e-01,   4.60000000e-01,
          3.60000000e+00,   7.80000000e-02,   1.50000000e+01,
          3.70000000e+01,   9.97300000e-01,   3.35000000e+00,
          8.60000000e-01,   1.28000000e+01,   8.00000000e+00],
       [  1.03000000e+01,   3.20000000e-01,   4.50000000e-01,
          6.40000000e+00,   7.30000000e-02,   5.00000000e+00,
          1.30000000e+01,   9.97600000e-01,   3.23000000e+00,
          8.20000000e-01,   1.26000000e+01,   8.00000000e+00],
       [  5.60000000e+00,   8.50000000e-01,   5.00000000e-02,
          1.40000000e+00,   4.50000000e-02,   1.20000000e+01,
          8.80000000e+01,   9.92400000e-01,   3.56000000e+00,
          8.20000000e-01,   1.29000000e+01,   8.00000000e+00]])

We select only the rows where high_quality contains a True value, and all of the columns. This subsetting makes it simple to filter arrays for certain criteria. For example, we can look for wines with a lot of alcohol and high quality. In order to specify multiple conditions, we have to place each condition in parentheses, and separate conditions with an ampersand (&):

In [63]:

high_quality_and_alcohol = (wines[:,10] > 10) & (wines[:,11] > 7)
wines[high_quality_and_alcohol,10:]

Out[63]:


array([[ 12.8,   8. ],
       [ 12.6,   8. ],
       [ 12.9,   8. ],
       [ 13.4,   8. ],
       [ 11.7,   8. ],
       [ 11. ,   8. ],
       [ 11. ,   8. ],
       [ 14. ,   8. ],
       [ 12.7,   8. ],
       [ 12.5,   8. ],
       [ 11.8,   8. ],
       [ 13.1,   8. ],
       [ 11.7,   8. ],
       [ 14. ,   8. ],
       [ 11.3,   8. ],
       [ 11.4,   8. ]])
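
Here is the same pattern on a tiny made-up array, where column 0 stands in for alcohol and column 1 for quality. The sketch also shows `|` for “or”, which works the same way as `&`; all names are hypothetical.

```python
import numpy as np

# Made-up rows: column 0 stands in for alcohol, column 1 for quality.
sample = np.array([[12.8, 8.0],
                   [ 9.4, 5.0],
                   [ 9.8, 8.0],
                   [13.1, 6.0]])

strong = sample[:, 0] > 10
good = sample[:, 1] > 7

both = strong & good     # element-wise AND
either = strong | good   # element-wise OR

print(sample[both, :])          # only the first row satisfies both conditions
print(sample[either, :].shape)  # (3, 2)
```

The parentheses in the one-line form matter because `&` binds more tightly than `>` in Python. Without them, Python would try to evaluate `10 & sample[:, 1]` first.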

We can combine subsetting and assignment to overwrite certain values in an array:

In [50]:

high_quality_and_alcohol = (wines[:,10] > 10) & (wines[:,11] > 7)
wines[high_quality_and_alcohol,10:] = 20
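
Assignment through a Boolean mask changes the array in place, so it can be worth keeping a copy if the original values are still needed. A minimal sketch with a made-up array (the names `sample`, `backup`, and `mask` are only for illustration):

```python
import numpy as np

sample = np.array([[12.8, 8.0],
                   [ 9.4, 5.0],
                   [13.1, 6.0]])

backup = sample.copy()  # keep the original values
mask = (sample[:, 0] > 10) & (sample[:, 1] > 7)
sample[mask, :] = 20    # overwrite every column of the matching rows

print(sample[0])  # the first row matched, so both values are now 20
print(backup[0])  # the copy still holds the original values
```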

Reshaping NumPy Arrays

We can change the shape of arrays while still preserving all of their elements. This can often make it easier to access array elements. The simplest reshaping is to flip the axes, so rows become columns, and vice versa. We can accomplish this with the numpy.transpose function:

In [51]:

np.transpose(wines).shape

Out[51]:


(12, 1599)
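
On a small made-up array it is easy to confirm what transposing does to individual elements. The sketch also uses the `.T` attribute, which is a shorthand for the same operation.

```python
import numpy as np

data = np.arange(6).reshape(2, 3)

flipped = np.transpose(data)
print(flipped.shape)                    # (3, 2)
print(np.array_equal(flipped, data.T))  # True: .T gives the same result
print(flipped[2, 1] == data[1, 2])      # True: row and column indices swap
```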

We can use the numpy.ravel function to turn an array into a one-dimensional representation. It will essentially flatten an array into a long sequence of values:

In [52]:

wines.ravel()

Out[52]:


array([  7.4 ,   0.7 ,   0.  , ...,   0.66,  11.  ,   6.  ])

Here’s an example where we can see the ordering of numpy.ravel:

In [73]:

Out[73]:


array([1, 2, 3, 4, 5, 6, 7, 8])
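
The input for this example is not shown above. One input that produces this result, assumed here purely for illustration, is a 2-by-4 array. The sketch also shows the optional `order` argument of ravel for comparison.

```python
import numpy as np

# Assumed input: a 2x4 array (the original input is not shown).
array_one = np.array([[1, 2, 3, 4],
                      [5, 6, 7, 8]])

by_rows = array_one.ravel()              # default order: row by row
by_columns = array_one.ravel(order="F")  # column by column instead

print(by_rows)     # [1 2 3 4 5 6 7 8]
print(by_columns)  # [1 5 2 6 3 7 4 8]
```

By default the values come out one row after another, which is why the first row of wines appears at the start of the flattened wines array.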

Finally, we can use the numpy.reshape function to reshape an array to a certain shape we specify. The code below will turn the second row of wines into a 2-dimensional array with 2 rows and 6 columns:

In [53]:

wines[1,:].reshape((2,6))

Out[53]:


array([[  7.8   ,   0.88  ,   0.    ,   2.6   ,   0.098 ,  25.    ],
       [ 67.    ,   0.9968,   3.2   ,   0.68  ,   9.8   ,   5.    ]])
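
A reshape only works when the new shape holds exactly the same number of elements. The sketch below uses a made-up 12-element array named `row` to show a valid reshape, the `-1` placeholder that lets NumPy work out one dimension, and a shape that fails.

```python
import numpy as np

row = np.arange(12)

two_by_six = row.reshape((2, 6))
three_by_four = row.reshape((3, -1))  # -1 lets NumPy infer the length (4)

try:
    row.reshape((5, 3))  # 15 slots for 12 elements
    reshape_error = None
except ValueError as err:
    reshape_error = err

print(two_by_six.shape)           # (2, 6)
print(three_by_four.shape)        # (3, 4)
print(reshape_error is not None)  # True
```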

Combining NumPy Arrays

With NumPy, it’s very common to combine multiple arrays into a single unified array. We can use numpy.vstack to vertically stack multiple arrays. Think of it like the second array’s items being added as new rows to the first array. We can read in the winequality-white.csv dataset that contains information on the quality of white wines, then combine it with our existing dataset, wines, which contains information on red wines.

In the code below, we:

  • Read in winequality-white.csv.
  • Display the shape of white_wines.

In [55]:

Out[55]:


(4898, 12)

As you can see, we have attributes for 4898 wines. Now that we have the white wines data, we can combine all the wine data.
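
The code that reads the file is not shown above. One way to read a semicolon-separated file with a header row is numpy.genfromtxt, sketched below under that assumption; it may not be what the original cell used. To keep the example self-contained it writes a tiny made-up file to a temporary location instead of relying on winequality-white.csv being present, and the names `csv_text`, `path`, and `sample_wines` are hypothetical.

```python
import os
import tempfile
import numpy as np

# A tiny made-up file in the same layout as the wine CSVs:
# a quoted header row, then semicolon-separated numbers.
csv_text = '"fixed acidity";"pH";"quality"\n7.0;3.0;6\n6.3;3.3;6\n'

with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as handle:
    handle.write(csv_text)
    path = handle.name

# skip_header=1 drops the header row so every value parses as a float.
sample_wines = np.genfromtxt(path, delimiter=";", skip_header=1)
os.remove(path)

print(sample_wines.shape)  # (2, 3)
```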

In the code below, we:

  • Use the vstack function to combine wines and white_wines.
  • Display the shape of the result.

In [56]:

all_wines = np.vstack((wines, white_wines))
all_wines.shape

Out[56]:


(6497, 12)

As you can see, the result has 6497 rows, which is the sum of the number of rows in wines and the number of rows in white_wines.

If we want to combine arrays horizontally, where the number of rows stays constant but the columns are joined, then we can use the numpy.hstack function. The arrays we combine need to have the same number of rows for this to work.
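
A short sketch of hstack on made-up arrays, including what happens when the row counts differ. The array names are only for illustration.

```python
import numpy as np

left = np.ones((3, 2))
right = np.zeros((3, 4))

wide = np.hstack((left, right))
print(wide.shape)  # (3, 6): rows unchanged, columns joined

try:
    np.hstack((left, np.zeros((5, 4))))  # 3 rows against 5 rows
    hstack_error = None
except ValueError as err:
    hstack_error = err
print(hstack_error is not None)  # True
```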

Finally, we can use numpy.concatenate as a general purpose version of hstack and vstack. If we want to concatenate two arrays, we pass them into concatenate, then specify the axis keyword argument that we want to concatenate along. Concatenating along the first axis is similar to vstack, and concatenating along the second axis is similar to hstack:

In [57]:

np.concatenate((wines, white_wines), axis=0)

Out[57]:


array([[  7.4 ,   0.7 ,   0.  , ...,   0.56,   9.4 ,   5.  ],
       [  7.8 ,   0.88,   0.  , ...,   0.68,   9.8 ,   5.  ],
       [  7.8 ,   0.76,   0.04, ...,   0.65,   9.8 ,   5.  ],
       ..., 
       [  6.5 ,   0.24,   0.19, ...,   0.46,   9.4 ,   6.  ],
       [  5.5 ,   0.29,   0.3 , ...,   0.38,  12.8 ,   7.  ],
       [  6.  ,   0.21,   0.38, ...,   0.32,  11.8 ,   6.  ]])
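
For two-dimensional arrays, the correspondence with vstack and hstack can be checked directly. The arrays `top` and `bottom` below are made up for this sketch.

```python
import numpy as np

top = np.arange(6).reshape(2, 3)
bottom = np.arange(6, 12).reshape(2, 3)

rows_joined = np.concatenate((top, bottom), axis=0)
cols_joined = np.concatenate((top, bottom), axis=1)

print(rows_joined.shape)  # (4, 3)
print(cols_joined.shape)  # (2, 6)
print(np.array_equal(rows_joined, np.vstack((top, bottom))))  # True
print(np.array_equal(cols_joined, np.hstack((top, bottom))))  # True
```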

Further Reading

You should now have a good grasp of NumPy, and how to apply it to a data set.

If you want to dive into more depth, the official NumPy documentation is a good place to start.

As you go further, keep two limitations of NumPy arrays in mind:

  • You can’t mix multiple data types in an array.
  • You have to remember what type of data each column contains.

Translated from: https://www.pybloggers.com/2016/10/numpy-tutorial-data-analysis-with-python/
