pandas sql_将SQL与Pandas一起使用

pandas sql

pandas is a very popular solution for many looking to move away from databases and toward smaller, more realistic solutions when it comes to manipulating data. Because many developers, analysts and data scientists come from a SQL background, there have been efforts to make pandas more friendly to those users.

对于处理数据的许多人来说, pandas是一个非常受欢迎的解决方案,他们希望从数据库转向更小巧,更现实的解决方案。 由于许多开发人员,分析人员和数据科学家都来自SQL背景,因此一直在努力使熊猫对这些用户更加友好。

熊猫语 (Pandasql)

One such popular solution is a package called pandasql, which is a very convenient way of running SQL queries on pandas DataFrames. Since this package has been around for awhile, many users have opted for this solution. And why wouldn’t they? Using SQLite as its backend, pandasql has complete and reliable SQL syntax.

这样一种流行的解决方案是一个名为pandasql的软件包,这是一种在pandas DataFrames上运行SQL查询的非常方便的方法。 由于此软件包已经存在了一段时间,因此许多用户选择了此解决方案。 他们为什么不呢? 使用SQLite作为后端,pandasql具有完整且可靠SQL语法。

Now as a little aside, I’ll explain how pandasql works internally. To load all pandas DataFrames into pandasql, the package provides the syntax,

现在,除了一点点,我将解释pandasql在内部如何工作。 要将所有pandas DataFrames加载到pandasql中,该软件包提供了语法,

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

This essentially means that sqldf will look through all existing Python variables and load the ones it identifies as DataFrames into a SQLite database.

从本质上讲,这意味着sqldf将浏览所有现有的Python变量并将其标识为DataFrame的变量加载到SQLite数据库中。

为什么这是个问题? (Why is this a problem?)

To take a step away from pandasql for a moment, let’s recall why we are using pandas — as an alternative to the functionality that databases provide. So what is actually occurring with pandasql, is that users are given the illusion of using pandas, along with SQL, when they are in fact only using SQLite, which is often not intended for production use due to it’s lack of scalability SQLite is an excellent piece of software, but in cases like these it can present many bottlenecks, which can be seen in the many performance issues raised on pandasql’s GitHub repo. One such issue can be seen here.

暂时离开pandasql ,让我们回想一下为什么要使用pandas —作为数据库提供的功能的替代方法。 所以pandasql实际发生的是,当用户实际上仅使用SQLite时,就会给用户使用pandas和SQL的幻觉,由于缺乏可伸缩性, SQLite通常不适合用于生产用途SQLite是一个很好的选择软件,但是在这种情况下,它可能会带来许多瓶颈,这可以从pandasql的GitHub存储库上引发的许多性能问题中看出。 这样的问题可以在这里看到。

This problem is something I noticed quite a while ago, but unfortunately pandasql is no longer maintained. At the time of this article it hasn’t had any commits since early 2017, so it’s unlikely that the way the package works will be changed.

这个问题是我很久以前注意到的,但是不幸的是不再维护pandasql 。 在撰写本文时,自2017年初以来未进行任何提交,因此不太可能更改软件包的工作方式。

The solution to this problem and the only way to get the most out of pandas functionality is create a package that has no dependency on external databases, only on pandas itself. This requires creating a SQL parser that translates SQL syntax directly into pandas operations. In fact, many DataFrame-like projects like dask, rapids, and modin could share and benefit from such an interface.

解决此问题的方法以及充分利用熊猫功能的唯一方法是创建一个不依赖外部数据库的包,而仅依赖熊猫本身。 这需要创建一个SQL解析器,将SQL语法直接转换为pandas操作。 事实上,许多数据帧像这样的界面像DASK,急流大黄素可以分享和受益的项目。

DataFrameSQL (DataFrame Sql)

So over the course of a few months I developed a new python package called dataframe_sql. This package uses a custom SQL dialect that I wrote, with the intent of it being compatible with almost all major SQL dialects. Using this package users can register individual dataframes, like so

因此,在几个月的时间内,我开发了一个名为dataframe_sql的新python包。 该软件包使用了我编写的自定义SQL方言,目的是与几乎所有主要SQL方言兼容。 使用此软件包,用户可以注册单个数据帧,如下所示

import pandas
import dataframe_sqldf = pandas.DataFrame({"column1": [1, 2, 3], "column2": ["4", "5", "6"]})

dataframe_sql.register_temp_table(df, "my_table")

and query them as if they were tables in a database,

并像查询数据库中的表一样查询它们,

dataframe_sql.query(""""select column1, cast(column2 as integer) + 1 as my_col2 from my_table"""")

resulting in this DataFrame,

导致此DataFrame,

+---------+---------+
| column1 | my_col2 |
+=========+=========+
| 1 | 5 |
+---------+---------+
| 2 | 6 |
+---------+---------+
| 3 | 7 |
+---------+---------+

The cool part about what happens in the query function though, and what removes any bottlenecks, is that instead of going through an external database to retrieve a query result, the code¹

关于查询功能中发生的事情以及消除任何瓶颈的最酷的部分是,代码¹无需通过外部数据库来检索查询结果,而是

df = df.copy()
df['my_col2'].astype('integer', inplace=True)
df['my_col2'] = df['column2'] + 1
df.drop(columns=['column2'], inplace=True)

is executed.

被执行。

dataframe_sql provides full SQL select statement coverage, including MS SQL server style window functions and is available for download from PyPi,

dataframe_sql提供了完整SQL选择语句覆盖范围,包括MS SQL Server样式窗口功能,可以从PyPi下载,

pip install dataframe_sql

If you encounter any issues using the package or would like to view the source code, you can find dataframe_sql on GitHub.

如果您在使用该软件包时遇到任何问题,或者想查看源代码,可以在GitHub上找到dataframe_sql。

[1] Note that this is not the exact syntax that is generated, but a functional equivalent.

[1]请注意,这不是生成的确切语法,而是功能上等效的语法。

翻译自: https://medium.com/swlh/using-sql-with-pandas-595ae9382cd3

pandas sql

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值