简述
- IMDB数据库是一个很大的,被广泛使用的电影,电视节目和演员信息的数据库,它包括了有关电影、电视节目、演员、制作公司、编剧、导演等信息。IMDB数据集可以为电影评论、分类、预测以及其他机器学习任务提供有用的参考信息。
- Join Order Benchmark(JOB)是一个数据库基准测试,旨在评估数据库优化器的能力,特别是在确定关系表之间连接顺序方面。该基准测试涉及多个关系表的连接,挑战数据库优化器在处理复杂查询时进行最佳加入顺序决策的能力。论文链接:http://www.vldb.org/pvldb/vol9/p204-leis.pdf
join order benchmark(JOB)查询获取
IMDB数据集导入PostgreSQL和join order benchmark(JOB)查询生成:
join order benchmark(JOB)-github-含有安装教程
进入github,需要查询语句直接下载即可:
注意,代码里有给出IMDB数据集的下载,但是第二步的网站链接失效了,所以用其它方法导入:
IMDB导入数据到PG
(1)下载CSV等文件
下载 imdb.tgz,放置到到某个路径,记住该路径,后面有用。作者这里放置在/var/lib/pgsql/benchmark
接着,解压imdb.tgz:
tar -zxvf imdb.tgz
以下命令都需要进入psql后运行:
(2)psql
进入PG,创建数据库:
CREATE DATABASE imdbload;
使用imdbload数据库:
\c imdbload
(2)执行sql脚本创建表,注意讲前面的路径修改为imdb.tgz的放置路径:
\i /var/lib/pgsql/benchmark/schematext.sql;
(3)导入数据,注意讲前面的路径修改为imdb.tgz的放置路径:
\copy aka_name from '/var/lib/pgsql/benchmark/aka_name.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy aka_title from '/var/lib/pgsql/benchmark/aka_title.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy cast_info from '/var/lib/pgsql/benchmark/cast_info.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy char_name from '/var/lib/pgsql/benchmark/char_name.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy comp_cast_type from '/var/lib/pgsql/benchmark/comp_cast_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy company_name from '/var/lib/pgsql/benchmark/company_name.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy company_type from '/var/lib/pgsql/benchmark/company_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy complete_cast from '/var/lib/pgsql/benchmark/complete_cast.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy info_type from '/var/lib/pgsql/benchmark/info_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy keyword from '/var/lib/pgsql/benchmark/keyword.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy kind_type from '/var/lib/pgsql/benchmark/kind_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy link_type from '/var/lib/pgsql/benchmark/link_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy movie_companies from '/var/lib/pgsql/benchmark/movie_companies.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy movie_info from '/var/lib/pgsql/benchmark/movie_info.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy movie_info_idx from '/var/lib/pgsql/benchmark/movie_info_idx.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy movie_keyword from '/var/lib/pgsql/benchmark/movie_keyword.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy movie_link from '/var/lib/pgsql/benchmark/movie_link.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy name from '/var/lib/pgsql/benchmark/name.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy person_info from '/var/lib/pgsql/benchmark/person_info.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy role_type from '/var/lib/pgsql/benchmark/role_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy title from '/var/lib/pgsql/benchmark/title.csv' with delimiter as ',' csv quote '"' escape as '\';
(4)检验数据(可有可无)
导入后我们并不知道是否导入成功,可以写个shell脚本检验下。当然,如果嫌麻烦可以跳过,检查一两个表即可。
bash命令显示imdbload的所有表:
echo "\dt" | psql -t -A -d imdbload
如果显示的结果是:
public|aka_name|table|postgres
public|aka_title|table|postgres
public|cast_info|table|postgres
public|char_name|table|postgres
public|comp_cast_type|table|postgres
public|company_name|table|postgres
public|company_type|table|postgres
public|complete_cast|table|postgres
public|info_type|table|postgres
public|keyword|table|postgres
public|kind_type|table|postgres
public|link_type|table|postgres
public|movie_companies|table|postgres
public|movie_info|table|postgres
public|movie_info_idx|table|postgres
public|movie_keyword|table|postgres
public|movie_link|table|postgres
public|name|table|postgres
public|person_info|table|postgres
public|role_type|table|postgres
public|title|table|postgres
那么脚本需要分割|
:
#!/bin/bash
# 获取所有表格名称
TABLES=$(echo "\dt" | psql -t -A -d imdbload)
# 遍历每个表格并获取其记录数
for table in $TABLES; do
table=$(echo "${table}" | cut -d '|' -f 2)
count=$(echo "SELECT COUNT(*) FROM $table" | psql -t -A -d imdbload)
echo "$table: $count"
done
如果只是movie_info
,那么去掉第八行table=$(echo "${table}" | cut -d '|' -f 2)
测试:
vim test_imdb.sh
,将完整的脚本写入,wq
退出。然后sh test_imdb.sh
,如果结果:
发现都有数据,那么说明导入数据成功!