Multiple Catalog

RFC PR: datafuselabs/databend#8255
Tracking Issue: datafuselabs/databend#8300

Summary

Allow users to maintain multiple catalogs for the databend.

Motivation

Databend organize data in three layers:

catalog -> database -> table

catalog: the biggest layer for databend, contains all databases and tables, provided by Catalog
database: the container of tables, provided by Database
table: the smallest unit in databend, provided by Table

By default, all databases and tables will be stored in default catalog (powered by metasrv).

Databend supports multiple catalogs now, but only in a static way.

To allow accessing the hive catalog, users need to configure hive inside databend-query.toml in this way:

[catalog]
meta_store_address = "127.0.0.1:9083"
protocol = "binary"

Users can't add/alter/remove the catalogs during runtime.

By allowing users to maintain multiple catalogs for the databend, we can integrate more catalogs like iceberg more quickly.

Guide-level explanation

After this RFC has been implemented, users can create new catalogs like:

CREATE CATALOG my_hive
  TYPE=HIVE
  CONNECTION = (URL='<hive-meta-store>' THRIFT_PROTOCOL=BINARY);
SELECT * FROM my_hive.DB.table;

Besides, users can alter or drop a catalog:

DROP CATALOG my_hive;

Users can add more catalogs like:

CREATE CATALOG my_iceberg
  TYPE=ICEBERG
  CONNECTION = (URL='s3://my_bucket/path/to/iceberg');
SELECT * FROM my_iceberg.DB.table;

With this feature, users can join data from different catalogs now:

select
    my_iceberg.DB.purchase_records.Client_ID,
    my_iceberg.DB.purchase_records.Item,
    my_iceberg.DB.purchase_records.QTY
from my_hive.DB.vip_info
inner join my_iceberg.DB.purchase_records
    on my_hive.DB.vip_info.Client_ID = my_iceberg.DB.purchase_records.Client_ID;

Reference-level explanation

Databend has a framework for multiple catalogs now. The only change for us is to store catalog-related information in metasrv instead.

To make it possible to start a query without metasrv, we will also support configuring catalogs in config like:

[catalogs.my_hive]
meta_store_address = "127.0.0.1:9083"
protocol = "binary"

[catalogs.my_iceberg]
URL = "s3://bucket"

Static catalogs will always be loaded from configs and can't be altered or dropped.

Drawbacks

None.

Rationale and alternatives

None.

Prior art

Presto

Presto is an open-source SQL query engine that's fast, reliable, and efficient at scale. It doesn't have persisted states, so all its connectors will be configured.

Take iceberg as an example:

connector.name=iceberg
hive.metastore.uri=hostname:port
iceberg.catalog.type=hive

While using:

USE iceberg.tpch;
CREATE TABLE IF NOT EXISTS ctas_nation AS (SELECT * FROM nation);
DESCRIBE ctas_nation;

Unresolved questions

None.

Future possibilities

Iceberg Catalog

Discussed in RFC Iceberg External Table

Discussed in Tracking issues of integration with delta sharing

Summary​

Motivation​

Guide-level explanation​

Reference-level explanation​

Drawbacks​

Rationale and alternatives​

Prior art​

Presto​

Unresolved questions​

Future possibilities​

Iceberg Catalog​

Delta Sharing Catalog​

Summary

Motivation

Guide-level explanation

Reference-level explanation

Drawbacks

Rationale and alternatives

Prior art

Presto

Unresolved questions

Future possibilities

Iceberg Catalog

Delta Sharing Catalog