【IMDB】IMDB数据集导入PostgreSQL和join order benchmark(JOB)查询生成

文章介绍了如何将IMDB数据集导入到PostgreSQL数据库中,使用JOINOrderBenchmark(JOB)进行查询性能测试。IMDB数据集为电影相关的信息,JOB基准测试关注数据库优化器在复杂查询时的表连接顺序。文章详细阐述了从下载CSV文件到创建表和导入数据的步骤,并提供了验证数据导入成功的技巧。
摘要由CSDN通过智能技术生成

简述

  • IMDB数据库是一个很大的,被广泛使用的电影,电视节目和演员信息的数据库,它包括了有关电影、电视节目、演员、制作公司、编剧、导演等信息。IMDB数据集可以为电影评论、分类、预测以及其他机器学习任务提供有用的参考信息。
  • Join Order Benchmark(JOB)是一个数据库基准测试,旨在评估数据库优化器的能力,特别是在确定关系表之间连接顺序方面。该基准测试涉及多个关系表的连接,挑战数据库优化器在处理复杂查询时进行最佳加入顺序决策的能力。论文链接:http://www.vldb.org/pvldb/vol9/p204-leis.pdf

join order benchmark(JOB)查询获取

IMDB数据集导入PostgreSQL和join order benchmark(JOB)查询生成:

join order benchmark(JOB)-github-含有安装教程

进入github,需要查询语句直接下载即可:

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-UK92X7yI-1686801757755)(typora_img/image-20230615111540448.png)]

注意,代码里有给出IMDB数据集的下载,但是第二步的网站链接失效了,所以用其它方法导入:

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-3AJ1dPZb-1686801757756)(typora_img/image-20230615111713043.png)]

IMDB导入数据到PG

数据集 TPC-H、TPC-DS、IMDB的导入使用

(1)下载CSV等文件

下载 imdb.tgz,放置到到某个路径,记住该路径,后面有用。作者这里放置在/var/lib/pgsql/benchmark

接着,解压imdb.tgz:

tar -zxvf imdb.tgz

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-L6aDiH1J-1686801757757)(typora_img/image-20230615112333547.png)]

以下命令都需要进入psql后运行:

(2)psql进入PG,创建数据库:

CREATE DATABASE imdbload;

使用imdbload数据库:

\c imdbload

(2)执行sql脚本创建,注意讲前面的路径修改为imdb.tgz的放置路径:

\i /var/lib/pgsql/benchmark/schematext.sql;

(3)导入数据,注意讲前面的路径修改为imdb.tgz的放置路径:

\copy aka_name from '/var/lib/pgsql/benchmark/aka_name.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy aka_title from '/var/lib/pgsql/benchmark/aka_title.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy cast_info from '/var/lib/pgsql/benchmark/cast_info.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy char_name from '/var/lib/pgsql/benchmark/char_name.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy comp_cast_type from '/var/lib/pgsql/benchmark/comp_cast_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy company_name from '/var/lib/pgsql/benchmark/company_name.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy company_type from '/var/lib/pgsql/benchmark/company_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy complete_cast from '/var/lib/pgsql/benchmark/complete_cast.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy info_type from '/var/lib/pgsql/benchmark/info_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy keyword from '/var/lib/pgsql/benchmark/keyword.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy kind_type from '/var/lib/pgsql/benchmark/kind_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy link_type from '/var/lib/pgsql/benchmark/link_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy movie_companies from '/var/lib/pgsql/benchmark/movie_companies.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy movie_info from '/var/lib/pgsql/benchmark/movie_info.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy movie_info_idx from '/var/lib/pgsql/benchmark/movie_info_idx.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy movie_keyword from '/var/lib/pgsql/benchmark/movie_keyword.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy movie_link from '/var/lib/pgsql/benchmark/movie_link.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy name from '/var/lib/pgsql/benchmark/name.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy person_info from '/var/lib/pgsql/benchmark/person_info.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy role_type from '/var/lib/pgsql/benchmark/role_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy title from '/var/lib/pgsql/benchmark/title.csv' with delimiter as ',' csv quote '"' escape as '\';

(4)检验数据(可有可无

导入后我们并不知道是否导入成功,可以写个shell脚本检验下。当然,如果嫌麻烦可以跳过,检查一两个表即可。

bash命令显示imdbload的所有表:

echo "\dt" | psql -t -A -d imdbload

如果显示的结果是:

public|aka_name|table|postgres
public|aka_title|table|postgres
public|cast_info|table|postgres
public|char_name|table|postgres
public|comp_cast_type|table|postgres
public|company_name|table|postgres
public|company_type|table|postgres
public|complete_cast|table|postgres
public|info_type|table|postgres
public|keyword|table|postgres
public|kind_type|table|postgres
public|link_type|table|postgres
public|movie_companies|table|postgres
public|movie_info|table|postgres
public|movie_info_idx|table|postgres
public|movie_keyword|table|postgres
public|movie_link|table|postgres
public|name|table|postgres
public|person_info|table|postgres
public|role_type|table|postgres
public|title|table|postgres

那么脚本需要分割|

#!/bin/bash

# 获取所有表格名称
TABLES=$(echo "\dt" | psql -t -A -d imdbload)

# 遍历每个表格并获取其记录数
for table in $TABLES; do
	table=$(echo "${table}" | cut -d '|' -f 2)
    count=$(echo "SELECT COUNT(*) FROM $table" | psql -t -A -d imdbload)
    echo "$table: $count"
done

如果只是movie_info,那么去掉第八行table=$(echo "${table}" | cut -d '|' -f 2)

测试:

vim test_imdb.sh,将完整的脚本写入,wq退出。然后sh test_imdb.sh,如果结果:

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-vYnH2HRs-1686801757757)(typora_img/image-20230615114426024.png)]

发现都有数据,那么说明导入数据成功!

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

暗夜无风

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值