数据科学家SQL技巧

最新推荐文章于 2024-08-04 20:56:26 发布

weixin_26737625

最新推荐文章于 2024-08-04 20:56:26 发布

阅读量127

点赞数

文章标签： sql java python mysql

原文链接：https://towardsdatascience.com/sql-tricks-for-data-scientists-53298467dd5

版权

I found myself buried in cron jobs and CSV files that were requested by various teams within my company. There were endless requests for new data exports or updates to those exports. Anytime anybody wanted to add a field, I was the single point of failure for that task. I had to first remember which service generated that report as well as remember the point of the report. Then I had to investigate if the new field that was desired was available, could be derived from other columns, or required a new database connection.

我发现自己陷入了公司内部各个团队要求的cron作业和CSV文件中。不断有新的数据导出或更新这些导出的请求。每当有人要添加字段时，我就是该任务的单点失败。我必须首先记住哪个服务生成了该报告以及该报告的要点。然后，我必须调查所需的新字段是否可用，是否可以从其他列派生或需要新的数据库连接。

I needed to find an application that could help me to keep track of all the reports, manage all the various database connections, and allow someone to maintain the notifications on their own. A final important feature would be to offload some of the report generating off of my plate and allow people to self-serve all the data.

我需要找到一个应用程序，该应用程序可以帮助我跟踪所有报告，管理所有各种数据库连接以及允许某人自己维护通知。最后一个重要的功能是减轻我的印版生成的某些报告的负担，并允许人们自助服务所有数据。

I settled on Metabase because it fit all the criteria I was looking for. It’s open-source, works with a variety of different data sources, has user/permission management, many charting/dashboarding options, and various different types of notifications.

我选择了Metabase，因为它符合我所寻找的所有条件。它是开源的，可以处理各种不同的数据源，具有用户/权限管理，许多图表/仪表板选项以及各种不同类型的通知。

There was only one problem — Metabase is entirely SQL based. My workflow of using simple selects to query the database and transforming the data into CSVs wasn’t going to be available. I had to use raw SQL. How could I insert any logic into SQL? How could I loop over results? How could I generate date ranges? How can I use rolling windows? Those types of questions made it sound like a SQL-only workflow wasn’t going to cut it.

只有一个问题-元数据库完全基于SQL。我使用简单选择查询数据库并将数据转换为CSV的工作流将不可用。我不得不使用原始SQL。如何将任何逻辑插入SQL？我该如何遍历结果？如何生成日期范围？如何使用滚动窗口？这些类型的问题听起来像纯SQL的工作流并不会减少它。

But what if those operations were actually possible? What if SQL was actually a Turing-complete language with recursion? What if there was a way to pivot data or use windows? Below I’ll go over a few tips that I discovered on my journey to take full advantage of the power of SQL.

但是，如果这些操作实际上可行，该怎么办？如果SQL实际上是具有递归的图灵完备语言，该怎么办？如果可以透视数据或使用窗口怎么办？下面，我将介绍一些在旅途中发现的技巧，这些技巧可充分利用SQL的功能。

建立 (Setup)

If you want to be able to execute the examples on your own, you can follow the below instructions. If you just want to read, skip to the next section.

如果您希望自己执行示例，则可以按照以下说明进行操作。 如果您只想阅读，请跳至下一部分。

I’m using MySQL 8 and the sample Sakila database for the examples. If you have Docker installed, I have provided lines of code to use to get a working MySQL 8 server running.

我使用MySQL 8和示例Sakila数据库作为示例。如果您已安装Docker，则我提供了几行代码以使运行中MySQL 8服务器运行。

The following starts the docker container.

以下将启动docker容器。

docker run -d --publish=3307:3306 --name=mysql_test -e MYSQL_ROOT_PASSWORD=root mysql/mysql-server:latest

The database sample database can be loaded by downloading the sample files and extracting them.

可以通过下载示例文件并将其解压缩来加载数据库示例数据库。

#enter the shell
docker exec -it mysql_test bin/bash#install wget
yum install wget#get file
wget https://downloads.mysql.com/docs/sakila-db.tar.gz#install tar
yum install tar#extract files
tar xvzf sakila-db.tar.gz#start the mysql console, the password was set in the docker run cmd
mysql -u root -p#import the schema
SOURCE sakila-db/sakila-schema.sql#import the data
SOURCE sakila-db/sakila-data.sql#use the newly created database
use sakila;#if all went well, the following cmd should show all the tables
show tables;

After the setup is completed, you can try out the examples for yourself.

设置完成后，您可以自己尝试示例。

查找重复活动 (Find a Repeating Event)

I’ll start off with an easier — yet still powerful — example. Joining a table with another table can cause a cross (or cartesian) join if that table has a one to many or many to many relationships. We can use a subquery to narrow down the result set to one row from each table.

我将从一个简单但仍然强大的示例开始。如果该表具有一对多或多对多关系，则将一个表与另一个表联接会导致交叉(或笛卡尔)联接。我们可以使用子查询将结果集缩小到每个表的一行。

For instance, let's say we want to find each actor and the last film they appeared in alphabetical order.

例如，假设我们要查找每个演员和他们按字母顺序出现的最后一部电影。

SELECT a.first_name, a.last_name, f.title 
FROM actor a 
JOIN film_actor fa ON fa.actor_id = a.actor_id 
JOIN film f ON f.film_id = fa.film_id;

The above query will list all the actors and all the films. But if you want to list just one film — the last one — we need to add a subquery to use only the films we want.

上面的查询将列出所有演员和所有电影。但是，如果您只想列出一部电影(最后一部)，我们需要添加一个子查询以仅使用我们想要的电影。

SELECT a.first_name, fa.film_id, f.title 
FROM actor a 
LEFT JOIN (
  SELECT actor_id, MAX(film_id) as film_id 
  FROM film_actor group by actor_id
) fa ON fa.actor_id = a.actor_id 
LEFT JOIN film f ON f.film_id = fa.film_id ;

Image for post — example response when listing only one film

For tables that have repeating events — like logging in and creating objects — you can find the first or last occurrence or a difference between the first and last occurrence. Below is the SQL to find the longevity of the customers where longevity is defined as the difference (in days) between their first and last rental date.

对于具有重复事件(例如，登录和创建对象)的表，您可以找到第一个或最后一个匹配项，或第一次和最后一个匹配项之间的差异。以下是用于查找客户寿命SQL，其中寿命定义为客户的第一个和最后一个租赁日期之间的差(天)。

SELECT c.customer_id,c.first_name, c.last_name, r.first_rental, r.last_rental, DATEDIFF(r.last_rental, r.first_rental) as customer_longevity 
FROM customer c 
LEFT JOIN (
  SELECT customer_id, MIN(rental_date) as first_rental, MAX(rental_date) AS last_rental 
  FROM rental GROUP BY customer_id
) r ON r.customer_id = c.customer_id;

旋转 (Pivoting)

Data that you want to convert the rows to columns — for presentation and/or charting — can be pivoted and summarized to fit your needs. Take the following query and sample output.

您可以将要转换为表格和/或图表的行转换为列的数据，以适应您的需求。进行以下查询和示例输出。

SELECT MONTHNAME(r.rental_date), c.name, count(r.rental_id) 
FROM rental r 
LEFT JOIN film f ON f.film_id = r.inventory_id 
LEFT JOIN film_category fc ON fc.film_id = f.film_id 
LEFT JOIN category c ON c.category_id = fc.category_id 
GROUP BY MONTHNAME(r.rental_date),c.name;

Using CASE WHEN we can move the monthly breakdown into one row per month instead of one row per month per category.

使用CASE WHEN，我们可以将每月细目移动到每个类别每月而不是每个类别每月一行。

SELECT MONTHNAME(r.rental_date), 
COUNT(CASE WHEN c.name = ‘Horror’ THEN r.rental_id ELSE NULL END) AS HorrorCount, 
COUNT(CASE WHEN c.name = ‘Action’ THEN r.rental_id ELSE NULL END) AS ActionCount, 
COUNT(CASE WHEN c.name = ‘Comedy’ THEN r.rental_id ELSE NULL END) AS ComedyCount, 
COUNT(CASE WHEN c.name = ‘Sci-Fi’ THEN r.rental_id ELSE NULL END) AS ScifiCount 
FROM rental r 
LEFT JOIN film f ON f.film_id = r.inventory_id 
LEFT JOIN film_category fc ON fc.film_id = f.film_id 
LEFT JOIN category c ON c.category_id = fc.category_id 
GROUP BY MONTHNAME(r.rental_date);

I didn’t use all the categories to avoid a one page SQL statement but you could add the other columns to complete the query. MySQL doesn’t have a built-in way to dynamically create each column but there is the ability to use a prepared statement to avoid having to spell out each column.

我并没有使用所有类别来避免使用一页SQL语句，但是您可以添加其他列来完成查询。 MySQL没有内置的方法来动态创建每一列，但可以使用准备好的语句来避免拼出每一列。

SET @sql = NULL;
SELECT
 GROUP_CONCAT(DISTINCT
 CONCAT(
 ‘COUNT(CASE WHEN c.name = ‘’’,
 name,
 ‘’’ THEN r.rental_id ELSE NULL END) AS `’,
 name,
 ‘`’
 )
 ) INTO @sql
FROM category;
SET @sql = CONCAT(‘SELECT MONTHNAME(r.rental_date), ‘, @sql ,’ FROM rental r LEFT JOIN film f ON f.film_id = r.inventory_id LEFT JOIN film_category fc ON fc.film_id = f.film_id LEFT JOIN category c ON c.category_id = fc.category_id GROUP BY MONTHNAME(r.rental_date)’);PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

Note: The group_concat_max_len variable needs to be sufficiently long enough to hold all the possible categories. The default value is 1024 but that can be changed per session if you run into any errors when trying to run the above SQL statement.

注意：group_concat_max_len变量必须足够长才能容纳所有可能的类别。默认值为1024，但是如果尝试运行上述SQL语句时遇到任何错误，则可以在每个会话中更改该默认值。

SET SESSION group_concat_max_len = 1000000;

卷帘窗 (Rolling windows)

Continuing with the theme of aggregating data, we can also use window functions to create rolling aggregates. Earlier we used a query to list all the rentals by genre and month. How could we add another column to show a running total per month as well as what percentage of rentals for the month were in each genre?

继续以聚合数据为主题，我们还可以使用窗口函数来创建滚动聚合。之前我们使用查询按类型和月份列出了所有租金。我们如何添加另一列以显示每月的总运行量以及每种类型的当月租金百分比？

SELECT MONTHNAME(r.rental_date), c.name, count(r.rental_id),     SUM(count(r.rental_id)) over(PARTITION BY MONTHNAME(r.rental_date)) as rental_month_total, count(rental_id) / SUM(count(r.rental_id)) over(PARTITION BY MONTHNAME(r.rental_date)) * 100 as percentage_of_rentals 
FROM rental r 
LEFT JOIN film f ON f.film_id = r.inventory_id 
LEFT JOIN film_category fc ON fc.film_id = f.film_id 
LEFT JOIN category c ON c.category_id = fc.category_id 
GROUP BY MONTHNAME(r.rental_date),c.name;

The Overkeyword allows you to define how to partition the data. In this case, we are using the rental date month.

Over关键字使您可以定义如何对数据进行分区。在这种情况下，我们使用的是出租日期月份。

产生资料 (Generating Data)

I often run across data that I want to track over time. The problem is that there are instances where the data is sparse and may not have values for each time unit I want to display. For instance, take the following query:

我经常会遇到需要跟踪的数据。问题是在某些情况下数据稀疏并且可能没有我要显示的每个时间单位的值。例如，采用以下查询：

SELECT DATE_FORMAT(r.rental_date,"%Y-%M") as rental_date, count(r.rental_id) as rental_count FROM rental r  
LEFT JOIN film f ON f.film_id = r.inventory_id  
LEFT JOIN film_category fc ON fc.film_id = f.film_id  
LEFT JOIN category c ON c.category_id = fc.category_id  
GROUP BY DATE_FORMAT(r.rental_date,"%Y-%M");

The previous query shows the count by month as expected but what if I want to see a 0 for the other columns.

上一个查询按预期显示按月计数，但是如果我想看到其他列为0，该怎么办。

WITH RECURSIVE t(v) as (   
  SELECT  DATE('2005-03-01')   
  UNION ALL   
  SELECT v + INTERVAL 1 MONTH   
  FROM t    
  LIMIT 12 
) 
SELECT DATE_FORMAT(t.v,"%Y-%M") as rental_date, count(r.rental_id) as rental_count FROM rental r  
LEFT JOIN film f ON f.film_id = r.inventory_id   
LEFT JOIN film_category fc ON fc.film_id = f.film_id   
LEFT JOIN category c ON c.category_id = fc.category_id   
RIGHT JOIN t on DATE_FORMAT(t.v,"%Y-%M")  = DATE_FORMAT(r.rental_date,"%Y-%M") 
GROUP BY DATE_FORMAT(t.v,"%Y-%M");

Recursive CTEs look more intimidating than they are. A better explanation than I could provide on the subject can be found here.

递归CTE看上去比实际情况更具威慑力。在这里可以找到比我能提供的更好的解释。

结论 (Conclusion)

SQL offers a powerful language to extract data. I’ve gone over a few tips on how to use SQL to handle more complex filtering and aggregation instead of needing to perform those operations outside of the database.

SQL提供了一种强大的语言来提取数据。我已经讲解了一些技巧，这些技巧说明了如何使用SQL来处理更复杂的过滤和聚合，而不需要在数据库外部执行这些操作。

I do recommend Metabase for those wishing to add a business intelligence reporting engine to the organization. The only criteria that it has yet to satisfy is to actually get my coworkers to self-serve the report generation. But that’s probably a function of me saying “yes” to all report requests.

对于那些希望向组织添加商业智能报告引擎的人， 我确实建议使用 Metabase 。 它唯一尚未满足的标准是实际上让我的同事为报告生成服务。 但这可能是我对所有报告要求说“是”的功能。

翻译自: https://towardsdatascience.com/sql-tricks-for-data-scientists-53298467dd5

weixin_26737625

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
数据科学家SQL技巧

I found myself buried in cron jobs and CSV files that were requested by various teams within my company. There were endless requests for new data exports or updates to those exports. Anytime anybody w...
复制链接

扫一扫