pandas sql_您可以使用SQL在Pandas中做所有事情

本文介绍如何利用pandas库中的SQL语法进行数据分析,展示了在Python中使用类似SQL的方法处理DataFrame的强大功能。
摘要由CSDN通过智能技术生成

pandas sql

Pandas is one of the most useful tools a data scientist can use. It provides several handy functionalities to extract information. Unfortunately, using Pandas requires data to be loaded into DataFrames, which aren’t great for handling massive quantities of data — quantities of data common at a company where you would need such data manipulation skills.

熊猫是数据科学家可以使用的最有用的工具之一。 它提供了一些方便的功能来提取信息。 不幸的是,使用Pandas要求将数据加载到DataFrames中,这对于处理大量数据(对于需要此类数据处理技能的公司中常见的数据量)而言并不是一个很好的选择。

If you were to download a complete, massive dataset — perhaps store it as a .csv file — you would need to spend at least several minutes waiting for the file to complete download and be converted into a DataFrame by Pandas. On top of this, any operations you perform will be slow because Pandas has trouble dealing with these massive amounts of data and needs to run through every row.

如果要下载完整的海量数据集(也许将其存储为.csv文件),则需要花费至少几分钟的时间等待文件完成下载,并由Pandas转换为DataFrame。 最重要的是,您执行的任何操作都会很慢,因为Pandas无法处理这些海量数据,并且需要遍历每一行。

A better solution? First query the database with SQL — the native, efficient processing language it likely runs on — then download a reduced dataset, if necessary, and use Pandas to operate on the smaller-scale tables it is designed to work on. While SQL is very efficient with large databases, it cannot replace the value of Panda’s plotting integration with other libraries and with the Python language in general.

更好的解决方案? 首先使用SQL查询数据库(它可能在其上运行的本机高效处理语言)进行查询,然后在必要时下载精简的数据集,然后使用Pandas在设计用于较小规模的表上进行操作。 尽管SQL在大型数据库中非常高效,但它无法取代Panda与其他库以及一般的Python语言进行绘图集成的价值。

In the case where we only need to find numerical answers, however, like the average cost of an item or how many employees receive commission but not an hourly wage above $40, there is (usually) no need to touch Pandas and a Python environment at all.

但是,在只需要找到数字答案的情况下,例如一件商品的平均成本或有多少雇员获得佣金,但时薪不超过40美元,(通常)无需接触Pandas和Python环境。所有。

While most people’s knowledge of SQL stops at SELECT * FROM table WHERE id = ‘Bob’, one would be surprised by the functionalities that SQL offers.

尽管大多数人对SQL的了解只停留在SELECT * FROM table WHERE id = 'Bob' ,但人们会对SQL提供的功能感到惊讶。

As an example, we will work on the SQL Tryit Editor database provided by w3schools. This site allows you to run SQL queries on a fictional database. The table Customers (you can see the full list of tables on the right panel) has 7 columns with numerical and text data:

例如,我们将使用w3schools提供SQL Tryit Editor数据库。 该站点允许您在虚拟数据库上运行SQL查询。 客户表(您可以在右侧面板上查看表的完整列表)具有7列,其中包含数字和文本数据:

Image for post

Say we want to send a prepared statement in the form of “Name LIVES IN Address, City”. This is simple: the double pipes operator || acts as a concatenation. Then, we can add column values and strings, and save the result under an alias, or the column name of the result, using AS.

假设我们要以“姓名住址,城市”的形式发送准备好的声明。 这很简单:双管道运算符|| 充当串联。 然后,我们可以添加列值和字符串,并使用AS将结果保存在别名或结果的列名称下。

Image for post

Note that since we additionally specified Country=’Mexico’, our results are all of addresses and cities within Mexico. This type of operation would be more difficult and less easy to do in Python.

请注意,由于我们另外指定了Country='Mexico' ,因此我们的结果是墨西哥境内的所有地址和城市。 在Python中,这种类型的操作将更加困难且不那么容易。

Note that not all databases support the same syntax. This article uses syntax for PostgreSQL, although every operation discussed has the operations in DB2, Oracle, MySQL, and SQL Server databases, sometimes with the same syntax, sometimes with different syntaxes. StackOverflow or Google can help you find these database-specific keywords.

请注意,并非所有数据库都支持相同的语法。 本文使用了PostgreSQL语法,尽管讨论的每个操作都具有DB2,Oracle,MySQL和SQL Server数据库中的操作,有时使用相同的语法,有时使用不同的语法。 StackOverflow或Google可以帮助您找到这些特定于数据库的关键字。

Say that, in the Products table, we want to group products into three price buckets: Cheap, Regular, and Expensive, for prices less than $12, between $12 and $21, and above $21, respectively.

假设在Products表中,我们希望将产品分为三个价格段: CheapRegularExpensive ,价格分别低于12美元,12美元至21美元以及21美元以上。

Image for post

No problem! The CASE keyword can help. This keyword acts like an if/else if/else statement in other languages like Python.

没问题! CASE关键字可以提供帮助。 在其他语言(例如Python)中,此关键字的作用类似于if / else if / else语句。

The CASE keyword uses the syntax WHEN condition THEN value. When multiple WHENs are stacked, they assume an ‘else-if’ relationship. Lastly, an ELSE value can be added if the condition is not met. Lastly, END is written to indicate the end of the CASE statement, and the results are saved (aliased) to a column named Bucket through AS Bucket.

CASE关键字使用语法WHEN condition THEN value 。 当多个WHEN堆叠在一起时,它们假定为“ else-if”关系。 最后,如果不满足条件,则可以添加ELSE value 。 最后,写入END以指示CASE语句的结束,并且结果通过AS Bucket保存(混叠)到名为Bucket的列中。

Image for post

This could also be done in Pandas with .apply(), with much slower speed.

这也可以在熊猫中使用.apply() ,但速度要慢得多。

Say we want to randomly sample 5 rows from Products. Although there is no direct method to do this, we can get creative by using both the ORDER BY and LIMIT keywords. ORDER BY orders the data in a certain format; for instance, using ORDER BY Price ASC would order the data such that the price was in ascending format. Using DESC uses descending, and ORDER BY works with strings by sorting them alphabetically.

假设我们要从Products随机抽取5行。 尽管没有直接的方法可以执行此操作,但是我们可以同时使用ORDER BYLIMIT关键字来发挥创意。 ORDER BY以某种格式对数据进行排序; 例如,使用ORDER BY Price ASC将对数据进行排序,以使价格采用升序格式。 使用DESC使用降序,而ORDER BY通过按字母顺序对字符串进行处理。

ORDER BY random() orders the data randomly, and LIMIT x returns the first x rows in the selected subset of data. This way, five random rows are selected from the data (* means all columns).

ORDER BY random()对数据进行随机排序,并且LIMIT x返回所选数据子集中的前x行。 这样,从数据中选择了五个随机行( *表示所有列)。

Note: Unfortunately, SQL Tryit Editor does not support random(), but real databases do (or use a variant, like rand()).

注意:不幸的是,SQL Tryit Editor不支持random(),但实际数据库却支持(或使用诸如rand()类的变体)。

Like this task of randomly sampling, most of SQL is about chaining together several simpler commands like SELECT and integrating them with built-in functions to yield astonishingly complex results.

像这项随机​​抽样任务一样,大多数SQL都是将几个简单的命令(如SELECT链接在一起,并将它们与内置函数集成在一起,以产生惊人的复杂结果。

Moreover, SQL provides all the statistical functions you may need. With everything from MIN() to MAX() to COUNT() to SUM() to AVG() to ASIN() (arcsine), you’re set. You can either use package extensions for metrics like standard deviation or create them yourself using existing default functions, which is not difficult to do at all.

此外,SQL提供了您可能需要的所有统计功能。 从MIN()MAX()COUNT()SUM()AVG()ASIN() (arcsine),一切就绪。 您可以将包扩展名用于标准差之类的指标,也可以使用现有的默认功能自行创建,这一点也不难。

These are standard tasks — but what is more amazing is that SQL is a Turing-complete language. Put simply, you could represent a program in, say, Python or C++, in SQL, by building your own complex memory systems and using elements of SQL like functions, if/elses, recursion, etc. You can view some fascinating demonstrations of Turing-complete SQL here. The main point of this is not to encourage you to use SQL as an operational language, but to demonstrate that SQL can be used to do so much more than you thought.

这些是标准任务-但更令人惊奇的是SQL是图灵完备的语言。 简而言之,您可以通过构建自己的复杂内存系统并使用SQL元素(例如函数,if / els,递归等)来表示SQL中的Python或C ++程序。您可以查看一些有趣的图灵演示-完整SQL here 。 这样做的主要目的不是鼓励您将SQL用作一种操作语言,而是要证明SQL可以做的事情比您想的还要多。

There’s so much more that you can do in SQL that we haven’t discussed:

您可以在SQL中做很多事情,但我们没有讨论过:

  • Specify your own custom functions, like you would declare a function in Python or C++. These can be used to, for example, parse IP addresses.

    指定自己的自定义函数,就像在Python或C ++中声明一个函数一样。 这些可用于例如解析IP地址。
  • Use recursion to create complex looping and data generation with the WITH keyword.

    使用递归通过WITH关键字创建复杂的循环和数据生成。

  • Sort string columns by a substring.

    子字符串对字符串列进行排序。

  • Perform complex joining between multiple tables.

    在多个表之间执行复杂的联接。
  • Use SQL to generate SQL (automating tasks).

    使用SQL生成SQL(自动任务)。
  • Generate forecasts using statistical models.

    使用统计模型生成预测。
  • Create histograms.

    创建直方图。
  • Build tree structures (with leaf, branch, root nodes).

    构建树结构(具有叶,分支,根节点)。

It’s true that you can do a lot more with SQL than you can with Pandas. That being said, usually that additional functionality is not necessary. The main reason why you should be using SQL is because it is built to handle large quantities of data in a custom environment that DataFrames are not.

的确,使用SQL可以比使用Pandas做更多的事情。 话虽如此,通常不需要其他功能。 您应该使用SQL的主要原因是因为它是为在DataFrames不能提供的自定义环境中处理大量数据而构建的。

Generally, SQL is a simple but sometimes very messy language, and it should usually be used just to reduce the size of the data until it becomes more manageable to handle in Pandas’ smaller environment.

通常,SQL是一种简单但有时很混乱的语言,通常应将其用于减小数据大小,直到在Pandas较小的环境中处理起来更易于管理为止。

关键点 (Key Points)

  • Pandas isn’t good at handling big data, and its features can all be done with SQL. However, Pandas’ value comes from its integration with other plotting libraries, machine learning libraries, and the Python language.

    Pandas并不擅长处理大数据,并且其所有功能都可以通过SQL完成。 但是,Pandas的价值来自与其他绘图库,机器学习库和Python语言的集成。
  • The goal should usually be to use SQL to narrow down a large dataset into one that is more relevant for the task, then to handle it in a Python environment, using Pandas’ DataFrame as the basis for storage.

    目标通常应该是使用SQL将大型数据集缩小为与任务更相关的数据集,然后在Pandas的DataFrame作为存储基础的Python环境中进行处理。
  • Don’t be scared to touch SQL to handle big data problems. As demonstrated above, SQL’s syntax is simple and is almost all about chaining together simple commands to yield more complex results. If you have a clear vision of the result, you can make it happen with SQL.

    不要害怕接触SQL来处理大数据问题。 如上所示,SQL的语法很简单,几乎所有内容都与将简单命令链接在一起以产生更复杂的结果有关。 如果您对结果有清晰的认识,则可以使用SQL来实现它。
  • SQL can do a lot more than most people realize.

    SQL可以做的事比大多数人意识到的要多。

翻译自: https://towardsdatascience.com/you-can-do-everything-in-pandas-with-sql-354adb30fbf9

pandas sql

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值