Databases 101: Introduction to Databases for Data Scientists

Data science is one of the fastest-growing fields, and I can’t see it slowing down any time soon, not with our dependence on data growing day by day. Data science is all about data: collecting it, cleaning it, analyzing it, visualizing it, and using it to make our lives better.

Handling large amounts of data can be a challenging task for data scientists. Most of the time, the data we need to process and analyze is much larger than the capacity of our devices (the size of the RAM). Storing the information on the hard drive instead might make our code much slower.

Not to mention that to make sense of the data and process it efficiently, we need to have it organized in some way. This is where databases come into play.

A database is a structured set of data, held in a computer’s memory or in the cloud, that can be accessed in various ways.

As a data scientist, you will need to design, create, and interact with databases in most of the projects you work on. Sometimes you will need to create everything from scratch, while at other times you will just need to know how to communicate with an existing database.

When I first started my journey in data science, handling databases was one of the most challenging aspects to master. That’s why I decided to write a series of articles about all things databases.

This article is a brief introduction to databases: why we need them, the different types of databases, and what SQL is.

Why use databases?

Data surrounds us; everything we use in our daily lives is based on massive amounts of data. You turn on Netflix, and it suggests what you should watch next based on your previous selections. You open the Spotify app, and it recommends songs you might like based on your preferences.

Collecting and analyzing data is one of the ways to personalize the experience for each of us. It’s a way to build one product that can fit everyone.

But to do that, this data needs to be stored and structured somewhere that is easy to access, provides fast communication, and is secure.

Databases make structured storage secure, efficient, and fast. They provide a framework for how the data should be stored, structured, and retrieved. Having databases saves you the hassle of figuring out what to do with your data in every new project.

Types of databases


Relational databases

In a relational database, the data is organized and stored in tables that can be linked to each other using some relation. For example, an airline company can have one table of passengers for all flights and another for the passengers on a specific flight. A flight code can connect these two tables.

This ability to connect tables allows us, as developers and data scientists, to better understand the relations between the different elements of the tables. Understanding these relations can give us hints and insights that make analyzing and visualizing the data an easier task.

We communicate and interact with relational databases using SQL.
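As a rough illustration of the airline example, here is a minimal sketch using Python’s built-in sqlite3 module. The table names, columns, and sample rows are assumptions made up for this example, and the layout is slightly simplified to one table of flights and one of passengers, linked by a flight code.

```python
import sqlite3

# In-memory database, so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one table of flights, one table of passengers.
conn.execute("CREATE TABLE flights (flight_code TEXT PRIMARY KEY, destination TEXT)")
conn.execute(
    "CREATE TABLE passengers ("
    "id INTEGER PRIMARY KEY, name TEXT, "
    "flight_code TEXT REFERENCES flights(flight_code))"
)

conn.executemany(
    "INSERT INTO flights VALUES (?, ?)",
    [("AB123", "Tokyo"), ("CD456", "Cairo")],
)
conn.executemany(
    "INSERT INTO passengers (name, flight_code) VALUES (?, ?)",
    [("Alice", "AB123"), ("Bob", "CD456"), ("Carol", "AB123")],
)

# The shared flight_code column is the relation that links the two tables.
rows = conn.execute(
    "SELECT p.name, f.destination "
    "FROM passengers AS p JOIN flights AS f ON p.flight_code = f.flight_code "
    "WHERE f.flight_code = ? ORDER BY p.name",
    ("AB123",),
).fetchall()
print(rows)  # [('Alice', 'Tokyo'), ('Carol', 'Tokyo')]
```

The join is what the relation buys us: the destination is stored once in the flights table, yet every passenger row can reach it through the flight code.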

Non-relational databases

Non-relational databases, also known as NoSQL databases, connect the information stored in them by categories rather than relations.

One of the most popular forms of NoSQL database is the key-value store, which you can think of the same way you think of a Python dictionary. Keys have to be unique; as long as they are, a key-value pair can store all the related information in one document.

Relational databases use tables as their core storage unit. A table in a database consists of a collection of rows and columns, and you can connect several tables using relations. In NoSQL, however, the data is stored in document-like storage. You can still perform all the everyday tasks, such as adding, deleting, and updating your data, as long as you know how the document is structured.
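To make the dictionary analogy concrete, here is a small sketch that uses a plain Python dictionary as a stand-in for a key-value store. It is not a real NoSQL client; the keys and document fields are made up for illustration.

```python
# A plain dictionary standing in for a key-value store (hypothetical data).
store = {}

# Add: each unique key maps to one self-contained document.
store["passenger:1"] = {"name": "Alice", "flights": ["AB123"]}
store["passenger:2"] = {"name": "Bob", "flights": ["CD456"]}

# Update: knowing the document's structure is enough to change it.
store["passenger:1"]["flights"].append("EF789")

# Delete: remove the document by its key.
del store["passenger:2"]

print(store)  # {'passenger:1': {'name': 'Alice', 'flights': ['AB123', 'EF789']}}
```

Notice that the passenger’s flights live inside the passenger’s own document instead of in a separate linked table.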

Structured Query Language (SQL)

SQL is a powerful language used to manipulate data in a relational database management system (RDBMS). It is relatively easy to learn, yet powerful and efficient. Developers and data scientists use SQL to add, delete, update, or perform other specific operations on a relational database.

SQL is not just for performing simple operations on databases; it can also be used to design databases or run some analytics on the stored data.
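The operations mentioned here (designing a table, adding, updating, and deleting rows, and running simple analytics) might look like the following sketch, again using Python’s built-in sqlite3 module. The sales table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Design: define the table (hypothetical schema).
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# Add rows.
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 100.0), ("north", 50.0), ("south", 80.0), ("south", 20.0)],
)

# Update one row, delete another.
conn.execute("UPDATE sales SET amount = 120.0 WHERE id = 1")
conn.execute("DELETE FROM sales WHERE id = 4")

# Simple analytics: total and row count per region.
summary = conn.execute(
    "SELECT region, SUM(amount), COUNT(*) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(summary)  # [('north', 170.0, 2), ('south', 80.0, 1)]
```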

Why SQL?

SQL is very popular and widely used in software development in general, and in data science in particular, for various reasons, including:

  1. Flexibility: SQL allows you to add or delete columns and tables, rename relations, and make other changes while the database is up and running and queries are being executed. Moreover, it can be integrated into many scripting languages hassle-free.

  2. Ease of use: Learning the basics of SQL is simple and straightforward. There is no confusing syntax or hidden tricks to master before you can use it.

  3. No redundancy: Because of the relational nature of SQL databases, you can keep all the information you need about an entry in one location and link to it, so you do not need to repeat the same information across all tables.

  4. Reliability: Most relational databases are easy to export and import, making backup and restore a breeze. These exports can be performed while the database is running, which makes restoring after a failure easy.
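Two of the points above, flexibility and reliability, can be sketched with Python’s built-in sqlite3 module. The users table and its contents are assumptions for illustration, and other database systems have their own export tools, so treat this as one possible way to do it, not the only one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('Alice')")

# Flexibility: change the schema while the database is in use.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")
conn.execute("ALTER TABLE users RENAME TO members")
conn.execute("UPDATE members SET email = 'alice@example.com' WHERE name = 'Alice'")
conn.commit()

# Reliability: export the whole database as SQL text...
dump = "\n".join(conn.iterdump())

# ...and restore it into a fresh database.
restored = sqlite3.connect(":memory:")
restored.executescript(dump)
restored_rows = restored.execute("SELECT id, name, email FROM members").fetchall()
print(restored_rows)  # [(1, 'Alice', 'alice@example.com')]
```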

SQL vs. NoSQL databases

Whenever you are assigned a new project or attempt to design a new database, the first question you probably ask yourself is: “Which database should I use, SQL or NoSQL?”

When trying to choose the right database type, I often refer to the CAP theorem. The CAP theorem describes the relationship between three aspects of your database: consistency, availability, and partition tolerance.

  1. Consistency: This means that every query to the database should return the most recent value. There are several levels of consistency, ranging from strong (results are up to date immediately) to eventual (results may be temporarily out of date).

  2. Availability: This means anyone can make a request for data and get a response, even if one or more nodes of the database are down.

  3. Partition tolerance: A partition is a communications break within a system. Tolerance means the database should keep functioning adequately even if the communication between parts of it is broken.

To select a database type, you need to prioritize two of the three aspects of the CAP theorem. If you care most about consistency and availability, you should choose a relational database. However, if you care more about availability and partition tolerance, or about consistency and partition tolerance, a NoSQL database will work better for your project.
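The rule of thumb above can be written down as a tiny helper. This is only a sketch of the heuristic described in this article; the function name and the returned labels are made up for illustration, and real database choices depend on much more than these three properties.

```python
def suggest_database_type(priorities):
    """Map two prioritized CAP properties to the article's rule of thumb."""
    chosen = frozenset(priorities)
    valid = {"consistency", "availability", "partition tolerance"}
    if len(chosen) != 2 or not chosen <= valid:
        raise ValueError("pick exactly two of: " + ", ".join(sorted(valid)))
    if chosen == {"consistency", "availability"}:
        return "relational (SQL)"
    # Either remaining pair includes partition tolerance.
    return "NoSQL"


print(suggest_database_type(["consistency", "availability"]))          # relational (SQL)
print(suggest_database_type(["availability", "partition tolerance"]))  # NoSQL
```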


Conclusion

Data is the most crucial part of data science; you can’t have data science without data. Designing, creating, and communicating with databases is essential for any data scientist who wants to grow their career and enrich their knowledge base.

Databases are a vast and broad field; I couldn’t fit everything into a single article. That’s why I decided to divide the topic into three articles covering the essential knowledge about databases that a data scientist should be aware of.

The next article will cover the basics of designing and interacting with a database (an introduction to SQL). The final article will cover the common database libraries used in Python and how to choose the right one for your data and your application.

Original article: https://towardsdatascience.com/databases-101-introduction-to-databases-for-data-scientists-ee18c9f0785d
