NumPy Array Operations: Advanced Operations Using NumPy Arrays

李_涛

Original link: https://medium.com/analytics-vidhya/advanced-operations-using-numpy-arrays-cedfc3d3c700

In my previous post, I talked about reduction operations on NumPy arrays. You may want to read through it before moving on to the more advanced operations below.

The topics covered in this post are the dot product, singular value decomposition, and the inverse of a matrix.

Introduction

NumPy includes a subpackage called linalg, which provides functions for linear algebra, an integral part of how many deep learning (DL) and machine learning (ML) algorithms work. We will discuss several of these operations along with their NumPy implementations, which will inevitably become part of your data science toolkit. We will cover three of the most important operations on NumPy arrays, all heavily used in DL and ML applications such as natural language processing, image retrieval, and customer recommendation. Without further delay, let's get started!

The Dot Product

In vector algebra, the dot product is a scalar quantity obtained by multiplying the corresponding components of two n-dimensional vectors and summing the products.
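As a minimal sketch of that definition, the snippet below computes the dot product by hand and with NumPy. The vectors a and b are made-up values chosen only for illustration.

```python
import numpy as np

# Two illustrative 3-dimensional vectors (values are arbitrary).
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Multiply matching components, then sum the products.
manual = np.sum(a * b)

# np.dot (or the @ operator) performs the same computation in one call.
dot = np.dot(a, b)

print(manual, dot, a @ b)  # 1*4 + 2*5 + 3*6 = 32.0 in all three cases
```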

In linear algebra, the dot product is typically used to check whether two vectors are perpendicular to each other, to find the magnitude of a single vector, or to find the projection of one vector along another.
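The three uses just mentioned can be sketched as follows. The vectors u, v and w are hypothetical examples, and np.isclose is used for the perpendicularity check because floating-point results are rarely exactly zero.

```python
import numpy as np

# Hypothetical 2-D vectors chosen so the results are easy to verify by hand.
u = np.array([3.0, 4.0])
v = np.array([-4.0, 3.0])
w = np.array([1.0, 0.0])

# Perpendicular vectors have a dot product of zero.
is_perpendicular = np.isclose(np.dot(u, v), 0.0)

# The magnitude of a vector is the square root of its dot product with itself.
magnitude_u = np.sqrt(np.dot(u, u))  # 5.0

# Projection of u onto w: scale w by (u . w) / (w . w).
projection = (np.dot(u, w) / np.dot(w, w)) * w  # [3., 0.]

print(is_perpendicular, magnitude_u, projection)
```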

In data science, the dot product is typically used to measure the similarity or distance between two or more vectors in a high-dimensional space. This is what we typically use when we perform nearest-neighbour search. When the dot product is divided by the magnitudes of the two vectors, the resulting similarity is called cosine similarity, because the dot product is given by the expression

$$\mathbf{a} \cdot \mathbf{b} = \lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert \cos\theta$$

where $\theta$ is the angle between the two vectors.

The smaller the angle between the two vectors, the more closely aligned they are, since the cosine of an angle is high (close to 1) when the angle itself is small.
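To make the link between angle and similarity concrete, here is a small sketch. The helper name cosine_similarity is my own and not part of NumPy; it simply divides the dot product by the two magnitudes, which it obtains from np.linalg.norm.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (hypothetical helper, not a NumPy API)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

x = np.array([1.0, 0.0])
same_direction = np.array([2.0, 0.0])   # angle of 0 degrees
right_angle = np.array([0.0, 3.0])      # angle of 90 degrees
opposite = np.array([-1.0, 0.0])        # angle of 180 degrees

sim_same = cosine_similarity(x, same_direction)
sim_right = cosine_similarity(x, right_angle)
sim_opposite = cosine_similarity(x, opposite)

# Similarity falls from 1 to -1 as the angle grows from 0 to 180 degrees.
print(sim_same, sim_right, sim_opposite)
```

Note that the vector lengths do not affect the result: same_direction is twice as long as x, yet the similarity is still 1.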

In natural language processing applications, words are individually represented as vectors, or embeddings, of a fixed length (50, 100, 200, 300, 512, etc.). Dot products are used to identify word similarity and the word relationships that emerge from words being used in similar contexts, such as country-capital pairs or male-female pairs.

In applications such as visual search, each image is converted into a one-dimensional vector, and these vectors are compared using the dot product, as discussed above, to retrieve similar-looking items or images.
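A retrieval step of this kind might look like the sketch below. The items matrix and the query vector are invented toy data standing in for real word or image embeddings, which would normally have hundreds of dimensions. Each row is scaled to unit length first, so that the dot product equals the cosine similarity.

```python
import numpy as np

# Toy "embeddings": 4 items, 3 dimensions each (made-up values).
items = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.05, 0.0])

# Scale every vector to unit length so dot product == cosine similarity.
items_unit = items / np.linalg.norm(items, axis=1, keepdims=True)
query_unit = query / np.linalg.norm(query)

# One matrix-vector product scores the query against every item at once.
scores = items_unit @ query_unit

# Item indices ordered from most similar to least similar.
ranking = np.argsort(scores)[::-1]
print(ranking, scores[ranking])
```

Scoring all items with a single matrix-vector product, rather than looping over them, is what makes this approach practical for large collections.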

Singular Value Decomposition

This concept is very commonly used in recommendation systems. Basically, it is used to extract topics, genres, or gists of information from a matrix whose rows are consumers/users/records and whose columns are products/movies/songs/features. (We'll use users and songs in our example.)

The topics extracted by SVD are abstract and may not always be interpretable by humans. In general, though, it is usually possible to identify what a topic broadly represents. For example, suppose we have constructed a matrix with the names of novels as rows and the individual words in the novels as columns, where each value represents the relevance of that word to a particular novel (the underlying method is called tf-idf, which we won't dig into now). This matrix can be decomposed into three matrices, which help us understand the vectors in more detail.

The first is the user-topic matrix, the second is the topic-importance matrix (always a diagonal matrix whose diagonal elements represent the importance of the respective topic), and the third is the topic-song matrix. Once we have these matrices, we can use them to group users into buckets and suggest music from the topics they like to hear. The terms users and songs are used here for ease of understanding; the same decomposition applies equally to novels and words, users and movies, and so on.
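The three matrices can be obtained with np.linalg.svd, as in the sketch below. The ratings matrix is invented for illustration (4 users by 3 songs). NumPy returns the diagonal of the topic-importance matrix as a 1-D array of singular values, so np.diag is used to rebuild the diagonal matrix.

```python
import numpy as np

# Made-up user-song ratings: 4 users (rows) x 3 songs (columns).
ratings = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 0.0, 5.0],
    [0.0, 1.0, 4.0],
])

# full_matrices=False gives the compact shapes used in this discussion.
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)

print(U.shape)   # (4, 3): user-topic matrix
print(s.shape)   # (3,):   topic importances (singular values, largest first)
print(Vt.shape)  # (3, 3): topic-song matrix

# Multiplying the three factors back together recovers the original matrix.
reconstructed = U @ np.diag(s) @ Vt
print(np.allclose(reconstructed, ratings))
```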

By default, the number of topics is the same as the number of songs/words in the above case, but that doesn't serve the purpose, because every song or word would then be a genre/topic of its own. So in practice we use another version of SVD called truncated SVD, in which the number of topics is restricted. This gives us a workable number of topics that are meaningful for recommendations.

NumPy doesn't offer a dedicated function to perform truncated SVD. We have to use another library, scikit-learn, for that; don't worry much about it for now, as I will cover it in detail in a future post. It is important to note, however, that truncated SVD is the more commonly used variant, because it extracts meaning from these numbers while compressing the information substantially into a useful form.
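Although scikit-learn is the route suggested above, the idea of truncation can be sketched with NumPy alone by computing the full SVD and keeping only the k largest singular values. This is an illustrative stand-in, not the scikit-learn implementation; the function name truncated_svd and the ratings data are my own assumptions.

```python
import numpy as np

def truncated_svd(matrix, k):
    """Keep only the k strongest topics (hypothetical helper built on np.linalg.svd)."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# Made-up user-song ratings: 4 users x 3 songs.
ratings = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 0.0, 5.0],
    [0.0, 1.0, 4.0],
])

U_k, s_k, Vt_k = truncated_svd(ratings, k=2)

# Rank-2 approximation of the original matrix.
approx = U_k @ np.diag(s_k) @ Vt_k

# The Frobenius-norm error equals the singular value that was dropped.
error = np.linalg.norm(ratings - approx)
print(U_k.shape, Vt_k.shape, error)
```

For large, sparse matrices a full SVD followed by slicing is wasteful, which is one reason dedicated truncated solvers exist.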

If you're interested in digging deeper into the topic, I recommend reading this Medium post, which covers SVD in great depth with all the math worked out beautifully.

Inverse of a Matrix

The inverse of a square matrix is a matrix that, when multiplied with the original matrix, gives the identity matrix. Now, why is such a matrix useful?

Solving systems of linear equations to find unknown variables has been common practice for a long time, and in my honest opinion it could be considered the origin of data science. Understanding natural phenomena, quantifying them as constraints, and expressing them as a system of linear equations to be solved by numerical methods has always involved working with observational data, and the approach is still used extensively in computational fields.

That's why the inverse of a matrix becomes an important quantity when we solve these equations. The general formulation of the problem is

$$A\mathbf{x} = \mathbf{b}$$

$$\mathbf{x} = A^{-1}\mathbf{b}$$

The first quantity on the right-hand side of the second equation is called the inverse of matrix A. NumPy offers an eponymous function, inv in the linalg subpackage, to compute the inverse of a matrix.
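As a worked sketch, the snippet below solves a small made-up system (2x + y = 5 and x + 3y = 10) using np.linalg.inv. It also shows np.linalg.solve, which reaches the same answer without forming the inverse explicitly and is generally the preferred choice when only the solution is needed.

```python
import numpy as np

# Illustrative system:  2x +  y = 5
#                        x + 3y = 10
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([5.0, 10.0])

# Inverse of A; multiplying it with A gives the identity matrix.
A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))

# x = A^{-1} b
x = A_inv @ b
print(x)  # approximately [1. 3.]

# Same solution without computing the inverse explicitly.
x_solve = np.linalg.solve(A, b)
print(np.allclose(x, x_solve))
```

np.linalg.inv raises a LinAlgError when the matrix is singular, that is, when no inverse exists.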

Matrices are an expansive topic with no end to it. These three operations are among the most commonly used in data science, which makes them invaluable tools in your data science toolbox. Another method, PCA, is very similar to SVD. Beyond these, there are other tools such as the cross product (vector product) that are not very commonly used in data science, so we'll save those for a later day.

The code snippets above can be viewed on my GitHub in the Numpy Explained repository.

Thanks for reading through this entire series on NumPy arrays. I hope these tools work in your favour when you build your data science solutions!