数据集准备及数据预处理_数据理解和准备–数据集的基础工作

最新推荐文章于 2024-04-20 00:10:25 发布

culuo4781

最新推荐文章于 2024-04-20 00:10:25 发布

阅读量1.7k

点赞数

文章标签：数据库 java python 数据分析大数据

原文链接：https://www.sqlshack.com/data-understanding-and-preparation-basic-work-with-datasets/

版权

数据集准备及数据预处理

In my previous four articles, I worked on a single variable of a dataset. I have shown example code in T-SQL, R, and Python languages. I always used the same dataset. Therefore, you might have gotten the impression that in R and in Python, you can operate on a dataset the same way like you operate on an SQL Server table. However, there is a big difference between an SQL Server table and Python or R data frame.

在前四篇文章中，我研究了数据集的单个变量。我已经用T-SQL，R和Python语言显示了示例代码。我总是使用相同的数据集。因此，您可能会产生一种印象，即在R和Python中，您可以像对SQL Server表一样操作数据集。但是，SQL Server表和Python或R数据帧之间有很大的区别。

In this article, will do a bit more formal introduction to R and Python data frames. I will show how to make basic operations on data frames, like filter them, make a projection, join and bind them, and sort them. For the sake of completeness, I am starting with T-SQL. Of course, the first part is really just a very brief recapitulation of the basic SELECT statement.

在本文中，将对R和Python数据框架做一些更正式的介绍。我将展示如何对数据框进行基本操作，例如过滤它们，进行投影，联接和绑定它们以及对其进行排序。为了完整起见，我从T-SQL开始。当然，第一部分实际上只是对基本SELECT语句的简要概述。

核心T-SQL SELECT语句元素 (Core T-SQL SELECT statement elements)

The simplest query to retrieve the data you can write includes the SELECT and the FROM clauses. In the select clause, you can use the star character, literally SELECT *, to denote that you need all columns from a table in the result set.

检索可写数据的最简单查询包括SELECT和FROM子句。在select子句中，可以使用星号字符SELECT *表示您需要结果集中表中的所有列。

Better than using SELECT * is to explicitly list only the columns you need. This means you are returning only a projection on the table. A projection means you filter the columns. Of course, you can filter also the rows with the WHERE clause.

比使用SELECT *更好的是仅显式列出所需的列。这意味着您只返回桌子上的投影。投影意味着您可以过滤列。当然，您也可以使用WHERE子句过滤行。

In a relational database, you typically have data spread in multiple tables. Each table represents a set of entities of the same kind, like customers, or products, or orders. In order to get result sets meaningful for the business your database supports, you most of the time need to retrieve data from multiple tables in the same query. You need to join two or more tables based on some conditions. The most frequent kind of a join is the inner join. Rows returned are those for which the condition in the join predicate for the two tables joined evaluates to true. Note that in a relational database, you have three-valued logic, because there is always a possibility that a piece of data is unknown. You mark the unknown with the NULL keyword. A predicate can thus evaluate to true, false or NULL. For an inner join, the order of the tables involved in the join is not important.

在关系数据库中，通常您的数据分布在多个表中。每个表代表一组相同种类的实体，例如客户，产品或订单。为了获得对数据库支持的业务有意义的结果集，大多数时候您需要从同一查询的多个表中检索数据。您需要根据某些条件联接两个或多个表。最常见的联接类型是内部联接。返回的行是连接的两个表的连接谓词中的条件评估为true的行。请注意，在关系数据库中，您具有三值逻辑，因为总有一种数据未知的可能性。您使用NULL关键字标记未知。因此，谓词可以评估为true ， false或NULL 。对于内部联接，联接中涉及的表的顺序并不重要。

In the query where you join multiple tables, you should use table aliases. If a column’s name is unique across all tables in the query, then you can use it without table name. You can shorten the two-part column names by using table aliases. You specify table aliases in the FROM clause. Once you specify table aliases, you must always use the aliases; you can’t refer to the original table names in that query anymore. Please note that a column name might be unique in the query at the moment when you write the query. However, later somebody could add a column with the same name in another table involved in the query. If the column name is not preceded by an alias or by the table name, you would get an error when executing the query because of the ambiguous column name. In order to make the code more stable and more readable, you should always use table aliases for each column in the query. You can specify column aliases as well.

在连接多个表的查询中，应使用表别名。如果列名在查询中的所有表中都是唯一的，则可以不使用表名就使用它。您可以使用表别名来缩短两部分的列名。您可以在FROM子句中指定表别名。一旦指定了表别名，就必须始终使用别名。您将无法再在该查询中引用原始表名。请注意，在编写查询时，列名在查询中可能是唯一的。但是，稍后有人可以在查询所涉及的另一个表中添加具有相同名称的列。如果列名前面没有别名或表名，则由于列名模棱两可，执行查询时会出现错误。为了使代码更稳定和可读性更好，应始终对查询中的每一列使用表别名。您也可以指定列别名。

A table in SQL Server represents a set. In a set, the order of the elements is not defined. Therefore, you cannot refer to a row or a column in a table by position. Also, the result of the SELECT statement does not guarantee any specific ordering. If you want to return the data in a specific order, you need to use the ORDER BY clause. The following query shows these basic SELECT elements.

SQL Server中的表代表一个集合。在一组中，未定义元素的顺序。因此，您不能按位置引用表中的行或列。同样，SELECT语句的结果也不保证任何特定的顺序。如果要按特定顺序返回数据，则需要使用ORDER BY子句。以下查询显示了这些基本的SELECT元素。

USE AdventureWorksDW2016;
GO
-- Basic SELECT statement elements
SELECT c.CustomerKey, c.FirstName, c.LastName,
 g.City, g.StateProvinceName AS State
FROM dbo.DimCustomer AS c
 INNER JOIN dbo.DimGeography AS g
  ON c.GeographyKey = g.GeographyKey
WHERE g.EnglishCountryRegionName = N'Australia'
ORDER BY c.LastName DESC, c.FirstName DESC;

The following figure shows the partial results, limited to the first six rows only. Please note the order of the rows and other elements of the basic SELECT statement implemented.

下图显示了部分结果，仅限于前六行。请注意所执行的基本SELECT语句的行顺序和其他元素。

R数据帧

最低0.47元/天解锁文章

culuo4781

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
数据集准备及数据预处理_数据理解和准备–数据集的基础工作

数据集准备及数据预处理 In my previous four articles, I worked on a single variable of a dataset. I have shown example code in T-SQL, R, and Python languages. I always used the same dataset. Therefore, you mig...
复制链接

扫一扫