电子表格转换成数据库_创建数据库,将电子表格转换为关系数据库,第1部分...

本文介绍了如何将电子表格转换为关系数据库,包括使用Python和SQL等工具进行数据整理和数据库创建的过程,旨在帮助用户更好地管理和操作数据。
摘要由CSDN通过智能技术生成

电子表格转换成数据库

Part 1: Creating an Entity Relational Diagram (ERD)

第1部分:创建实体关系图(ERD)

A Relational Database Management System (RDMS) is a program that allows us to create, update, and manage a relational database. Structured Query Language (SQL) is a programming language used to communicate with data stored in the RDMS. The SQL skill for using a RDMA is required for many data-related positions these days. In the social media forums like Quora or Reddit, there are many people who search for a public database for practicing their SQL querying skills. However, although there are many public data sets in a single spreadsheet, there are not many public databases online. Even if you found a data set for the topic you have interest in, the format of the data is usually just one spreadsheet, not a database for most cases. Therefore, it will be very useful to know how to convert a data set in one spreadsheet to a database with multiple tables fitting a relational database format. Knowing the process of this conversion can give us many chances to practice SQL querying skills with a variety of databases.

关系数据库管理系统 (RDMS)是允许我们创建,更新和管理关系数据库的程序。 结构化查询语言 (SQL)是一种编程语言,用于与RDMS中存储的数据进行通信。 如今,与许多数据相关的职位都需要使用RDMASQL技能。 在QuoraReddit等社交媒体论坛中,有很多人在搜索公共数据库以练习其SQL查询技能。 但是,尽管单个电子表格中有许多公共数据集,但在线的公共数据库并不多。 即使您找到了感兴趣的主题的数据集,数据的格式通常也只是一个电子表格,在大多数情况下不是数据库。 因此,了解如何将一个电子表格中的数据集转换为具有多个符合关系数据库格式的表的数据库将非常有用。 了解这种转换的过程可以使我们有很多机会来练习各种数据库SQL查询技能。

This is the first article in a two part series. The goal of this series is to show how to create a relational database for SQL. The whole process is to convert a data in one spreadsheet to a relational database for SQL. In this first article I create an Entity Relational Diagram (ERD) which is a graphical representation showing the relationships between entities.

这是分两部分的系列文章中的第一篇。 本系列的目的是展示如何为SQL创建关系数据库。 整个过程是将一个电子表格中的数据转换为SQL的关系数据库。 在第一篇文章中,我创建一个实体关系图(ERD) ,该图是显示实体之间关系的图形表示。

数据集 (The Data Set)

Let’s find a data set for this practice. There are many public data on the Kaggle dataset. Among a bunch of data sets, I selected a data set named Sample Sales Data. The following pictures show the original format of the data which is contained in one spreadsheet.

让我们为该练习找到一个数据集。 Kaggle数据集上有许多公共数据。 在一系列数据集中,我选择了一个名为Sample Sales Data的数据集 下图显示了一个电子表格中包含的数据的原始格式。

Image for post
Image for post
Sample Sales Data from the Kaggle Dataset
来自Kaggle数据集的样本销售数据

The data has 25 columns and 2824 rows including headers. The list of the headers is as follows:

数据有25列和2824行,包括标题。 标头列表如下:

  • ORDERNUMBER: the identification number for each order

    ORDERNUMBER :每个订单的标识号

  • QUANTITYORDERED: the quantity ordered

    QUANTITYORDERED :订购的数量

  • PRICEEACH: the actual price paid for the transaction (variable across transactions)

    PRICEEACH :为交易支付的实际价格(随交易而变化)

  • ORDERLINENUMBER: the number of the order line

    ORDERLINENUMBER :订单行的编号

  • SALES: the amount of sales

    销售 :销售额

  • ORDERDATE: the order date

    ORDERDATE :订单日期

  • STATUS: the shipping status (Shipped, Resolved, Cancelled, On Hold, Disputed, and In Progress)

    状态 :运输状态(已发货,已解决,已取消,保留,有争议和进行中)

  • QTR_ID: the quarter of the order date

    QTR_ID :订购日期的四分之一

  • MONTH_ID: the month of the order date

    MONTH_ID :订购日期的月份

  • YEAR_ID: the year of the order date

    YEAR_ID :订单日期的年份

  • PRODUCTLINE: the category of products

    PRODUCTLINE :产品类别

  • MSRP: the manufacture’s suggested retail price (constant across transactions)

    MSRP :制造商的建议零售价(在交易中保持不变)

  • PRODUCTCODE: the identification code for each product

    PRODUCTCODE :每个产品的识别码

  • CUSTOMERNAME: the names of customers

    客户名称:客户名称

  • PHONE: the phone numbers of customers

    电话 :客户的电话号码

  • ADDRESSLINE1: addressline 1 for customers

    ADDRESSLINE1 :客户的地址行1

  • ADDRESSLINE2: address line 2 for customers

    ADDRESSLINE2 :客户的地址行2

  • CITY: city names for customers

    城市 :客户的城市名称

  • STATE: state names for customers (only for customers located in the US)

    STATE :客户的州名(仅适用于位于美国的客户)

  • POSTALCODE: postal codes for customers

    邮政编码 :为客户邮政编码

  • COUNTRY: countries for customers

    国家 :客户所在的国家

  • TERRITORY: the regional names of each country (NA, EMEA, Japan, and APAC)

    地区 :每个国家的区域名称(NA,EMEA,日本和亚太地区)

  • CONTACTLASTNAME and CONTACTFIRSTNAME: the last and first names of customers

    CONTACTLASTNAMECONTACTFIRSTNAME :客户的姓氏和名字

  • DEALSIZE: the deal sizes of orders

    DEALSIZE :订单的交易大小

创建实体关系图(ERD) (Creating an Entity Relationship Diagram (ERD))

The information contained in the data can be basically divided into three entities: Customers, Products, and Orders.

数据中包含的信息基本上可以分为三个实体: 客户产品订单

顾客 (Customers)

Based on the columns in the data, the attributes to be included in the entity for Customers are CUSTOMERNAME, CONTACTLASTNAME, CONTACTFIRSTNAME, PHONE, ADDRESSLINE1, ADDRESSLINE2, POSTALCODE, CITY, STATE, COUNTRY, and TERRITORY. CustomerID is added into the entity as the primary key. And for convenience sake, the cases of the attributes are changed as follows:

根据数据中的列,要包含在“客户”实体中的属性是CUSTOMERNAMECONTACTLASTNAME,CONTACTFIRSTNAME, PHONEADDRESSLINE1ADDRESSLINE2POSTALCODECITY,STATE,COUNTRYTERRITORYCustomerID作为主键添加到实体中。 为了方便起见,属性的情况如下更改:

Image for post

The problem of keeping City, State, Country, and Territory in the entity for Customers is that there will be many duplicate rows in the table for Customers. Therefore, entities for City, Country, and Territory are separately created, and then the links between them and Customers are made using foreign keys. The reason why the entity for State is not created is that there are many null values in State so that states do not determine countries and territories in this data.

保留城市国家地区的问题 客户实体中 是表中的客户会有很多重复的行。 因此,将分别创建用于城市国家地区的实体,然后使用外键在它们和客户之间建立链接。 为什么不建立国有实体的原因是,在许多国家空值,这样的状态并不确定这个数据的国家和地区。

In the ERD, there are three types of relationships between entities: One-to-One, One-to-Many(or Many-to-One), and Many-to-Many. The relationships between Territories and Countries, Countries and Cities, and Cities and Customers are One-to-Many relationships. For example, each customer lives in one city in this data while one city has many customers. Therefore, using the Crow’s Foot Notation, the ERD between Customers, Cities, Countries, and Territories is as follows:

在ERD中,实体之间存在三种类型的关系: 一对一,一对多(或多对一)和多对多区域国家国家城市以及城市客户之间的关系是一对多关系。 例如,在此数据中,每个客户居住在一个城市中,而一个城市有许多客户。 因此,使用乌鸦脚符号客户之间的ERD, 城市国家 ,和地区如下:

Image for post
The Crow’s Foot Notations for Customers, Cities, Countries, and Territories
客户,城市,国家和地区的鱼尾纹符号

产品展示 (Products)

The attributes to be included in the entity for Products are PRODUCTCODE, PRODUCTLINE, MSRP, and ORDERNUMBER. Among these attributes, PRODUCTCODE can be a candidate for the primary key in the entity since it is unique and does not have null values. However, in order to keep the format of the primary keys consistent across entities, ProductID is added into the entity and is set to be the primary key. The letter cases are changed as follows:

产品实体中要包含的属性是PRODUCTCODEPRODUCTLINEMSRPORDERNUMBER 。 在这些属性中, PRODUCTCODE可以是实体中主键的候选者,因为它是唯一的并且没有空值。 但是,为了使主键的格式在各个实体之间保持一致, 产品编号 被添加到实体中并设置为主键。 字母大小写更改如下:

Image for post

As we saw in the case for the entity for Customers, including Productline in this entity generates duplicate rows as well. Therefore, the entity for ProductLines is created separately, and is linked to the entity for Products using a foreign key. Since each product is classified into one product line while each product line has many products, the relationship between ProductLines and Products is One-to-Many.

正如我们在“ 客户 ”实体案例中看到的那样,在该实体中包括Productline也会生成重复的行。 因此, ProductLine的实体是单独创建的,并使用外键链接到Products的实体。 由于每个产品分类为一个产品线,而每个产品线具有许多产品,因此ProductLineProducts之间的关系为一对多

Image for post
The Crows’ Notation for ProductLines and Products
产品线和产品的乌鸦符号

This is not the final version for Products since OrderNumber will be included in the entity for Orders as well. Including OrderNumber in the entity for Orders will make some changes in the entity for Products. This will be covered after the entity for Orders is created.

这并非以来ORDERNUMBER 产品将包含在订单的实体以及最终版本。 在订单的实体,包括订单号码将在产品实体的一些变化。 创建订单实体后将对此进行介绍。

命令 (Orders)

The entity for Orders should include ORDERNUMBER, ORDERDATE, PRODUCTCODE, STATUS, QUARTER_ID, MONTH_ID, YEAR_ID,and CUSTOMERNAME as its attributes. OrderID is added into this entity as the primary key. The names and cases of the attributes are changed as follows for convenience sake:

订单实体应包括ORDERNUMBERORDERDATEPRODUCTCODESTATUSQUARTER_IDMONTH_IDYEAR_IDCUSTOMERNAME作为其属性。 OrderID作为主键添加到该实体中。 为了方便起见,属性的名称和大小写更改如下:

Image for post

In order to remove duplicate rows generated by including Status and CustomerName in this entity, the entities for Status and CustomerNames are created separately and are linked to the entity for Orders using foreign keys. The entity for CustomerNames (which is equivalent to the Customers) was already created previously. The relationships of Orders with Status and Customers are Many-to-One, respectively.

为了删除通过在此实体中包含StatusCustomerName生成的重复行,分别创建了StatusCustomerNames的实体,并使用外键将其链接到Order的实体。 CustomerNames的实体(等同于Customer )是先前已创建的。 订单状态客户的关系分别是多对一的

Image for post
The Crow’s Foot Notations for Orders, Customers, and Status
订单,客户和状态的鱼尾纹

订单和产品 (Orders and Products)

Each order contains a list of products in it. This means that Orders and Products are related to each other. What is the relationship between these two entries? Each product can be ordered by multiple orders and each order can contain multiple products. Therefore, the relationship is many-to-many.

每个订单中都包含一个产品列表。 这意味着订单产品相互关联。 这两个条目之间是什么关系? 每个产品可以按多个订单订购,每个订单可以包含多个产品。 因此,这种关系是多对多的

Image for post
The Crow’s Foot Notation for Orders and Products 1
订单和产品的鱼尾纹1

The way to create tables which are in the many-to-many relationship is to create a junction table between the two original tables and then to make one-to-many relationships between the original tables and the junction table. The junction table is the table for the combination of Orders and Products, and then one-to-many relationships are made between Orders and the junction, and between Products and the junction. The primary key for this junction entity should be the composite of OrderID and ProductID.

创建作为在许多一对多的关系表的方法是将原来的两个表之间创建一个表,然后做一个对许多原始表和联接表之间的关系。 联结表是OrdersProducts组合的表,然后在Orders和联结之间以及Products和联结之间建立一对多关系。 此联结实体的主键应为OrderID的组合 ProductID

The remaining columns which are not included in any entity so far are QUANTITYORDERED, PRICEEACH, ORDERLINENUMBER, SALES, and DEALSIZE. These columns cannot be determined by Orders or Products by themselves. For example, the quantity of a product in an order cannot be identified by an order or a product alone. Only the combination of an order and a product can identify it. Therefore, those columns are included in the junction entity as its attributes. The entities and their relationships for Orders and Products are as follows:

到目前为止,尚未包含在任何实体中的其余列为QUANTITYORDEREDPRICEEACHORDERLINENUMBERSALESDEALSIZE 。 这些列不能由订单产品自行确定。 例如,订单中的产品数量无法通过订单或单独的产品来识别。 只有订单和产品的组合才能识别它。 因此,这些列作为其属性包含在联结实体中。 订单产品的实体及其关系如下:

Image for post
The Crow’s Foot Notations for Orders and Products 2
订单和产品的鱼尾纹2

Due to duplicate rows, the entity for DealSizes is separately created and linked to the entity for Orders_Products using foreign key for SizeID.

由于重复行,为DealSizes实体被单独创建并链接到实体使用SizeID外键Orders_Products。

整个ERD (The Entire ERD)

Now, we completed creating all the entities for this data and making relationships between them. The entire ERD is as follows:

现在,我们完成了为该数据创建所有实体并在它们之间建立关系。 整个ERD如下:

Image for post
The Final Version of the ERD for Sample Sales Data
样品销售数据ERD的最终版本

The usefulness of the ERD is that it makes us to easily understand the relationships between the columns (or the entities) in the data set and to see the entire structure of the data set at a glance.

ERD的有用之处在于,它使我们能够轻松了解数据集中的列(或实体)之间的关系,并一目了然地看到数据集的整个结构。

下一步是什么? (What’s Next?)

This article covered how to create an Entity Relationship Diagram using a data set. The data for this practice is Sample Sales Data obtained from the Kaggle dataset. It is briefly covered how to divide the original data set into entities and make relationships between the entities.

本文介绍了如何使用数据集创建实体关系图。 该实践的数据是从Kaggle数据集获得的样本销售数据 。 简要介绍了如何将原始数据集划分为实体以及如何在实体之间建立关系。

In the next article, the practice of converting a data in a single spread sheet to a relational database by using MySQL will be covered.

在下一篇文章中 ,将介绍通过使用MySQL将单个电子表格中的数据转换为关系数据库的实践。

翻译自: https://medium.com/swlh/creating-a-database-converting-a-spreadsheet-to-a-relational-database-part-1-2a9a228bf77a

电子表格转换成数据库

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值