duckdb学习-1

15 篇文章 1 订阅
14 篇文章 0 订阅

DuckDB is a fast in-process analytical database
DuckDB supports a feature-rich SQL dialect complemented with deep integrations into client APIs

在notebook中使用duckdb

安装

pip install duckdb

示例代码:


#> pip install jupysql
#> pip install duckdb-engine
import duckdb 
import pandas as pd

#在 jupysql 上设置配置,直接将数据输出到 Pandas,并简化打印到笔记本的输出。


%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False


#使用 SQLAlchemy 样式的连接字符串将 jupysql 连接到 DuckDB。 连接到新的内存中 DuckDB、默认连接或文件支持的数据库:
#%sql duckdb:///:memory:
#%sql duckdb:///:default:
#%sql duckdb:///path/to/file.db

# 原生方式连接到DuckDB
#con = duckdb.connect("file.db")

%load_ext sql 
con = duckdb.connect("file.db")
%sql con --alias duckdb

# 查询
# 查询结果将显示为 Pandas DataFrame
# 单行的模式
%sql SELECT 'Off and flying!' AS a_duckdb_column;
# 多行的模式
%%sql
SELECT
    schema_name,
    function_name
FROM duckdb_functions()
ORDER BY ALL DESC
LIMIT 5;

# 结果赋值给变量
%sql res << SELECT 'Off and flying!' AS a_duckdb_column;

# 和pandas联动
%sql output_df << SELECT sum(i) AS total_i FROM input_df;
#
con.sql("select * fromdf")
con.sql("create table input_df as select * from input_df;")
con.sql("insert into input_df select * from input_df;")
# 导出
temp_df=con.sql("select * from input_df").df()

# DuckDB offers a relational API that can be used to chain together query operations. These are lazily evaluated so that DuckDB can optimize their execution. These operators can act on Pandas DataFrames, DuckDB tables or views (which can point to any underlying storage format that DuckDB can read, such as CSV or Parquet files, etc.). Here we show a simple example of reading from a Pandas DataFrame and returning a DataFrame.

rel=con.from_df(input_df)
transformed_rel =rel.filter("i>1").project("i,j,i*2 as k").order("i desc")
transformed_rel.df()

和ibis集成:  the portable Python dataframe library
Ibis (ibis-project.org)

数据的导入和导出

# csv
SELECT * FROM read_csv('input.csv');
COPY tbl FROM 'input.csv';
# 导出
COPY tbl TO 'output.csv' (HEADER, DELIMITER ',');
COPY (SELECT * FROM tbl) TO 'output.csv' (HEADER, DELIMITER ',');



# parquet
SELECT * FROM read_parquet('input.parquet')
COPY tbl FROM 'input.parquet' (FORMAT PARQUET);
COPY tbl TO 'output.parquet' (FORMAT PARQUET);
COPY (SELECT * FROM tbl) TO 'output.parquet' (FORMAT PARQUET);


# json
SELECT * FROM read_json_auto('input.json');
COPY tbl FROM 'input.json';

COPY tbl TO 'output.json';
COPY (SELECT * FROM tbl) TO 'output.json';

# Excel
INSTALL spatial; 
LOAD spatial;
SELECT * FROM st_read('test_excel.xlsx', layer = 'Sheet1');

#Importing a Sheet with/without a Header
#The option HEADERS has three possible values:
#* FORCE: treat the first row as a header
#* DISABLE treat the first row as a row of data
#* AUTO attempt auto-detection (default)

SELECT * FROM st_read( 'test_excel.xlsx', layer = 'Sheet1', open_options = ['HEADERS=FORCE'] );

#The option FIELD_TYPE defines how field types should be treated:
#* STRING: all fields should be loaded as strings (VARCHAR type)
#* AUTO: field types should be auto-detected (default)
#For example, to treat the first row as a header and use auto-detection for types, run:

SELECT *FROM st_read(
    'test_excel.xlsx',
    layer = 'Sheet1',
    open_options = ['HEADERS=FORCE', 'FIELD_TYPES=AUTO']);
# 导出
COPY tbl TO 'output.xlsx' WITH (FORMAT GDAL, DRIVER 'xlsx');
COPY (SELECT * FROM tbl) TO 'output.xlsx' WITH (FORMAT GDAL, DRIVER 'xlsx');


# 从其他数据库导入
INSTALL mysql;
load mysql;
ATTACH 'host=localhost user=root port=0 database=mysqlscanner' AS mysql_db (TYPE mysql_scanner, READ_ONLY);
USE mysql_db;
# 可以直接对mysql进行读写

INSTALL postgres;
load postgres;
SELECT * FROM postgres_scan('host=localhost port=5432 dbname=mydb', 'public', 'mytable');

INSTALL sqlite;
load sqlite;

SELECT * FROM sqlite_scan('test.db', 'tbl_name');
-- attach the SQLite file "test.db" 
ATTACH 'test.db' AS test (TYPE sqlite); 
-- the table "tbl_name" can now be queried as if it is a regular table 
SELECT * FROM test.tbl_name; 
-- switch the active database to "test" 
USE test; 
-- list all tables in the file SHOW TABLES;

# 直接读取文件
SELECT size, parse_path(filename), content FROM read_text('test/sql/table_function/files/*.txt');

查询数据库的一些基础信息

#查看表信息: describe, show

describe tbname;
show tbname;

#描述查询: 
describe select * from dual;

#describe 可以使用子查询,这允许从描述创建表.
CREATE TABLE tbl_description AS SELECT * FROM (DESCRIBE tbl);

#解释执行计划:
EXPLAIN SELECT * FROM tbl;
SET explain_output = 'all';

EXPLAIN SELECT c_count, count(*) AS custdist FROM ( SELECT c_custkey, count(o_orderkey) FROM customer LEFT OUTER JOIN orders ON c_custkey = o_custkey AND o_comment NOT LIKE '%special%requests%' GROUP BY c_custkey) AS c_orders (c_custkey, c_count) GROUP BY c_count ORDER BY custdist DESC, c_count DESC;

#列出表信息:
show tables;
show all tables;

#对表或者查询进行summary: 
SUMMARIZE tbl;
SUMMARIZE SELECT * FROM tbl;

# 其他:
SELECT version();
PRAGMA platform;
SELECT * FROM duckdb_extensions();

meta table functions

  • duckdb_columns(): columns
  • duckdb_constraints(): constraints
  • duckdb_databases(): lists the databases that are accessible from within the current DuckDB process
  • duckdb_dependencies(): dependencies between objects
  • duckdb_extensions(): extensions
  • duckdb_functions(): functions
  • duckdb_indexes(): secondary indexes
  • duckdb_keywords(): DuckDB’s keywords and reserved words
  • duckdb_optimizers(): the available optimization rules in the DuckDB instance
  • duckdb_schemas(): schemas
  • duckdb_sequences(): sequences
  • duckdb_settings(): settings
  • duckdb_tables(): base tables
  • duckdb_types(): data types
  • duckdb_views(): views
  • duckdb_temporary_files(): the temporary files DuckDB has written to disk, to offload data from memory
DuckDB是一款嵌入式数据库,主要用于嵌入其他程序执行快速的SQL分析查询。它被称为“分析型数据库中的SQLite”,因为它易于安装和使用,没有外部依赖,并提供了C/C++、Python和R的语言绑定。下面是DuckDB的详细讲解: 1. DuckDB的特点 - 嵌入式数据库:DuckDB是一个嵌入式数据库,可以轻松地嵌入到其他应用程序中。 - 零依赖:DuckDB没有外部依赖,因此安装和使用非常简单。 - 高性能:DuckDB在处理大型数据集时表现出色,特别是在分析查询方面。 - 支持SQL:DuckDB支持SQL,包括复杂的查询和聚合操作。 - 支持多种编程语言DuckDB提供了C/C++、Python和R的语言绑定,因此可以轻松地在这些语言中使用它。 2. DuckDB的安装 DuckDB可以通过多种方式进行安装,包括Python和R的包管理器,以及源码安装。下面是Python和R中安装DuckDB的方法: - Python中安装DuckDB: ```shell pip install duckdb ``` - R中安装DuckDB: ```shell install.packages("duckdb") ``` 3. DuckDB的使用 DuckDB的使用非常简单,可以通过SQL查询来操作数据。下面是一个使用Python进行DuckDB查询的例子: ```python import duckdb # 创建连接 con = duckdb.connect(database=':memory:', read_only=False) # 创建表 con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name VARCHAR(100), age INTEGER)") # 插入数据 con.execute("INSERT INTO users VALUES (1, 'Alice', 25)") con.execute("INSERT INTO users VALUES (2, 'Bob', 30)") con.execute("INSERT INTO users VALUES (3, 'Charlie', 35)") # 查询数据 result = con.execute("SELECT * FROM users WHERE age > 30") print(result.fetchall()) # 关闭连接 con.close() ``` 上面的代码创建了一个名为“users”的表,并向其中插入了一些数据。然后,它执行了一个查询,以查找年龄大于30岁的用户,并打印了结果。
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

白日与明月

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值