20 changes: 10 additions & 10 deletions docs/table-design/data-partitioning/basic-concepts.mdx
Original file line number Diff line number Diff line change
@@ -16,7 +16,7 @@ This document introduces the partitioning (Partition) and bucketing (Bucket) mec

## 1. Overview

Doris uses a two-tier data partitioning approach of **Partition + Bucket** to organize the data of a table in an orderly manner across the nodes of the cluster:
Doris distributes a table's data across the cluster in two tiers, **partitions** and **buckets**:

- **Partition**: horizontally divides the table into smaller subsets by column values (such as time or region), making it easier to perform query pruning and data lifecycle management.
- **Bucket**: further evenly distributes data within each partition into multiple data shards (Tablets), fully utilizing the parallelism of the cluster.
@@ -29,7 +29,7 @@ The data flow can be summarized as:
Table ──► Partition ──► Bucket ──► Tablet (data shard, stored on BE nodes)
```
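
As a rough illustration of the flow above, the sketch below lists one tablet per (partition, bucket) pair. The partition names and bucket counts are hypothetical, replicas are ignored, and this is not how Doris assigns real tablet IDs; it only shows how the number of tablets grows with partitions and buckets.

```python
from typing import Dict, List, Tuple


def enumerate_tablets(buckets_per_partition: Dict[str, int]) -> List[Tuple[str, int]]:
    """List one (partition_name, bucket_index) pair per tablet.

    Assumption: each bucket of each partition corresponds to exactly one
    tablet, and replicas are not counted.
    """
    return [
        (partition, bucket)
        for partition, num_buckets in buckets_per_partition.items()
        for bucket in range(num_buckets)
    ]


# Hypothetical table: two monthly partitions with 4 buckets each.
tablets = enumerate_tablets({"p202401": 4, "p202402": 4})
print(len(tablets), tablets[0], tablets[-1])
```

Under these assumptions, a table with two partitions of four buckets each has eight tablets before replication.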

A reasonable partitioning and bucketing design brings the following benefits at the same time: **faster queries** (partition pruning, parallel scanning), **more flexible management** (archiving/cleanup by time), and **more even writes** (avoiding hotspots).
Good partitioning and bucketing give you three things at once: **faster queries** (partition pruning and parallel scans), **easier management** (archive or clean up by time), and **more even writes** (no hotspots).

## 2. Core Concepts

@@ -62,27 +62,27 @@ Doris supports two **partition types**:

If no partition is specified at table creation time, Doris generates a default partition that is transparent to the user, containing all the data in the table.

A reasonable partition design brings the following benefits:
Good partition design provides:

- **Improved query performance**: through partition pruning, the system can filter out irrelevant partitions based on query conditions, reducing the amount of data scanned and significantly lowering the I/O burden, which is especially suitable for large-scale datasets.
- **Management flexibility**: data can be split along logical dimensions such as time or region, making archiving, cleanup, and backup easier. For example, partitioning by time enables efficient management of historical and incremental data, supporting time-based data maintenance strategies.
- **Faster queries**: partition pruning skips partitions that can't match the query, so Doris scans less data and does less I/O. This matters most on large datasets.
- **Easier management**: splitting by time or region makes archiving, cleanup, and backup simpler. For example, partitioning by time lets you manage historical and incoming data separately.
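
Partition pruning can be pictured with a minimal sketch, assuming range partitions with half-open `[lower, upper)` bounds on a date column. The partition names and the `prune` helper are hypothetical; the Doris planner does this internally and its implementation is not shown here.

```python
from datetime import date

# Hypothetical monthly range partitions: name -> [lower, upper) bounds.
PARTITIONS = {
    "p202401": (date(2024, 1, 1), date(2024, 2, 1)),
    "p202402": (date(2024, 2, 1), date(2024, 3, 1)),
    "p202403": (date(2024, 3, 1), date(2024, 4, 1)),
}


def prune(partitions, query_lo, query_hi):
    """Return the partitions whose range overlaps the query range [query_lo, query_hi)."""
    return [
        name
        for name, (lo, hi) in partitions.items()
        # Two half-open ranges overlap when each one starts before the other ends.
        if lo < query_hi and query_lo < hi
    ]


# A filter on ten days in February only needs to scan one partition.
print(prune(PARTITIONS, date(2024, 2, 10), date(2024, 2, 20)))
```

The partitions left out of the result are never read, which is where the I/O saving comes from.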

### 2.3 Bucket

Bucketing further divides the data within a partition into smaller, mutually disjoint data units according to certain rules. Each row of data belongs to exactly one specific bucket.

Unlike partitions that divide by ranges of column values, the goal of bucketing is to **evenly distribute** data across predefined buckets, thereby reducing data skew and improving query execution performance through better data locality.
Partitions divide data by ranges or lists of column values. Bucketing instead spreads data **evenly** across the buckets in a partition, which reduces skew and improves query performance through better data locality.

Doris supports two **bucketing methods**:

- **Hash bucketing**: computes the `crc32` hash of the bucketing column values and takes the modulo with the number of buckets to evenly distribute the data.
- **Random bucketing**: randomly assigns data to buckets. When using Random bucketing, you can combine the `load_to_single_tablet` parameter to optimize fast writes for small-scale data.
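
The hash bucketing rule above (hash the bucketing column value, then take it modulo the bucket count) can be sketched as follows. This is an illustration only: it assumes the value is hashed as a UTF-8 string with Python's `zlib.crc32`, which will not necessarily reproduce the bucket Doris picks for the same value, and `bucket_for` and `distribute` are hypothetical helper names.

```python
import zlib
from typing import Iterable, List


def bucket_for(value: str, num_buckets: int) -> int:
    """Return a bucket index in [0, num_buckets) for one bucketing-column value."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    # zlib.crc32 returns an unsigned 32-bit integer in Python 3.
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def distribute(values: Iterable[str], num_buckets: int) -> List[int]:
    """Count how many values land in each bucket."""
    counts = [0] * num_buckets
    for value in values:
        counts[bucket_for(value, num_buckets)] += 1
    return counts


# Hypothetical user IDs spread over 8 buckets.
print(distribute((f"user_{i}" for i in range(10_000)), 8))
```

The same value always maps to the same bucket, so a low-cardinality or heavily repeated bucketing column can still produce skew; the rule only spreads data evenly when the column has many distinct values.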

A reasonable bucketing design brings the following benefits:
Good bucketing provides:

- **Even data distribution**: reduces the risk of data concentration or skew, and avoids overloading some nodes or storage devices.
- **Reduced hotspots**: prevents some nodes or partitions from being overloaded, improving system stability and processing capability.
- **Improved concurrent performance**: when multiple query requests need to access different data within the same partition, bucketing allows the system to process multiple requests in parallel effectively, thereby improving throughput.
- **Even data distribution**: less risk of skew, and no single node or disk gets overloaded.
- **Fewer hotspots**: no node or partition gets overloaded, which keeps the system stable.
- **Better concurrency**: Doris reads different buckets in the same partition in parallel, which improves throughput.

### 2.4 Tablet and Node Architecture

Original file line number Diff line number Diff line change
@@ -64,14 +64,14 @@ Doris 支持两种**分区类型**:

合理分区可以带来以下收益:

- **查询性能提升**:通过分区裁剪,系统可以根据查询条件过滤掉不相关的分区,减少数据扫描量,显著降低 I/O 负担,特别适合大规模数据集;
- **管理灵活性**:可按时间、地域等逻辑维度对数据进行分割,便于归档、清理和备份。例如按时间分区可高效管理历史数据与新增数据,支持基于时间的数据维护策略
- **查询更快**:分区裁剪会跳过无法匹配查询的分区,从而减少扫描的数据量和 I/O;数据集越大,收益越明显。
- **管理更简单**:按时间或地域切分,便于归档、清理和备份。例如按时间分区,可分别管理历史数据与新增数据

### 2.3 分桶(Bucket)

分桶是指将一个分区中的数据,按照某种规则进一步划分为更小的、互不相交的数据单元。每一行数据属于且仅属于一个特定的分桶。

与按列值范围划分的分区不同,分桶的目标是将数据**均匀分布**到预定义的桶中,从而减少数据倾斜,并通过提高数据局部性来提升查询执行性能
分区按列值的范围或枚举来划分数据;分桶则在分区内将数据**均匀分布**到各个桶中,从而减少数据倾斜,并通过更好的数据局部性提升查询性能

Doris 支持两种**分桶方式**:

@@ -82,7 +82,7 @@ Doris 支持两种**分桶方式**:

- **数据均匀分布**:减少数据集中或倾斜的风险,避免部分节点或存储设备资源过载;
- **减少热点**:避免某些节点或分区过度负载,提升系统稳定性和处理能力;
- **提高并发性能**:当多个查询请求需要访问同一分区中的不同数据时,分桶可使系统有效地并行处理多个请求,从而提升吞吐量。
- **提高并发性能**:Doris 可以并行读取同一分区中的不同分桶,从而提升吞吐量。

### 2.4 数据分片(Tablet)与节点架构

Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
---

Check warning on line 1 in i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/table-design/data-partitioning/basic-concepts.mdx

GitHub Actions / Build Check

seo-title-duplicate

Rendered SEO title is duplicated across indexable pages: "基本概念 - Apache Doris". Add a version, locale, or page-specific qualifier. Owner: @apache/doris-website-maintainers
{
"title": "基本概念",
"language": "zh-CN",
@@ -18,7 +18,7 @@

Doris 采用 **分区(Partition)+ 分桶(Bucket)** 的两层数据划分方式,将一张表的数据有序地组织到集群的各个节点上:

- **分区**:按列值(如时间、地区)将表水平切分为更小的子集,便于查询裁剪与数据生命周期管理;

Check warning on line 21 in i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/table-design/data-partitioning/basic-concepts.mdx

GitHub Actions / Build Check

markdown-code-fence-language

Code fence should declare a language. Owner: @apache/doris-website-maintainers
- **分桶**:在每个分区内进一步将数据均匀打散到多个数据分片(Tablet)中,以充分利用集群并行能力。

![doris-data-partitioning](/images/next/table-design/data-partitioning.jpg)
@@ -64,14 +64,14 @@

合理分区可以带来以下收益:

- **查询性能提升**:通过分区裁剪,系统可以根据查询条件过滤掉不相关的分区,减少数据扫描量,显著降低 I/O 负担,特别适合大规模数据集;
- **管理灵活性**:可按时间、地域等逻辑维度对数据进行分割,便于归档、清理和备份。例如按时间分区可高效管理历史数据与新增数据,支持基于时间的数据维护策略
- **查询更快**:分区裁剪会跳过无法匹配查询的分区,从而减少扫描的数据量和 I/O;数据集越大,收益越明显。
- **管理更简单**:按时间或地域切分,便于归档、清理和备份。例如按时间分区,可分别管理历史数据与新增数据

### 2.3 分桶(Bucket)

分桶是指将一个分区中的数据,按照某种规则进一步划分为更小的、互不相交的数据单元。每一行数据属于且仅属于一个特定的分桶。

与按列值范围划分的分区不同,分桶的目标是将数据**均匀分布**到预定义的桶中,从而减少数据倾斜,并通过提高数据局部性来提升查询执行性能
分区按列值的范围或枚举来划分数据;分桶则在分区内将数据**均匀分布**到各个桶中,从而减少数据倾斜,并通过更好的数据局部性提升查询性能

Doris 支持两种**分桶方式**:

@@ -82,14 +82,14 @@

- **数据均匀分布**:减少数据集中或倾斜的风险,避免部分节点或存储设备资源过载;
- **减少热点**:避免某些节点或分区过度负载,提升系统稳定性和处理能力;
- **提高并发性能**:当多个查询请求需要访问同一分区中的不同数据时,分桶可使系统有效地并行处理多个请求,从而提升吞吐量。
- **提高并发性能**:Doris 可以并行读取同一分区中的不同分桶,从而提升吞吐量。

### 2.4 数据分片(Tablet)与节点架构

一个分桶在物理上对应一个**数据分片(Tablet)**,Tablet 是 Doris 中数据管理的最小单元,也是数据移动、复制等操作的基本物理单位。

Doris 集群由两类节点组成:

Check warning on line 92 in i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/table-design/data-partitioning/basic-concepts.mdx

GitHub Actions / Build Check

markdown-cjk-spacing

Chinese text should contain spaces around adjacent English words or numbers. Owner: @apache/doris-website-maintainers
- **FE 节点(Frontend)**:管理集群元数据(如表、分片信息),负责 SQL 的解析与执行规划;
- **BE 节点(Backend)**:存储 Tablet 数据,负责计算任务的执行;BE 的结果汇总后由 FE 返回给用户。

@@ -252,7 +252,7 @@
);
```

关于该功能的细节说明,详见 [自动分区与动态分区联用](./auto-partitioning#与动态分区联用)。

Check failure on line 255 in i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/table-design/data-partitioning/basic-concepts.mdx

GitHub Actions / Build Check

link-missing-anchor

Anchor #与动态分区联用 does not exist in i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/table-design/data-partitioning/auto-partitioning.md. Owner: @apache/doris-website-maintainers

</TabItem>

Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
---

Check notice on line 1 in versioned_docs/version-4.x/table-design/data-partitioning/basic-concepts.mdx

GitHub Actions / Build Check

i18n-sync-locale-candidate

Japanese docs are report-only. Generate a candidate translation from the changed files and merge it only after human review. Owner: @apache/doris-website-maintainers

Check notice on line 1 in versioned_docs/version-4.x/table-design/data-partitioning/basic-concepts.mdx

GitHub Actions / Build Check

i18n-sync-version-candidate

A 3.x counterpart exists. Confirm whether the change is supported in 3.x before leaving it unsynced. Owner: @apache/doris-website-maintainers

Check warning on line 1 in versioned_docs/version-4.x/table-design/data-partitioning/basic-concepts.mdx

GitHub Actions / Build Check

i18n-sync-version-missing

English 4.x and current docs are strongly synchronized, but the current counterpart is missing. Add it or explain the version-specific exception in the PR description. Owner: @apache/doris-website-maintainers

Check warning on line 1 in versioned_docs/version-4.x/table-design/data-partitioning/basic-concepts.mdx

GitHub Actions / Build Check

seo-description-length

SEO description should be 80-160 characters; current length is 293. Owner: @apache/doris-website-maintainers

Check warning on line 1 in versioned_docs/version-4.x/table-design/data-partitioning/basic-concepts.mdx

GitHub Actions / Build Check

sidebar-orphan-doc

Document table-design/data-partitioning/basic-concepts is not referenced by versioned_sidebars/version-4.x-sidebars.json. Owner: @apache/doris-website-maintainers
{
"title": "Basic Concepts",
"language": "en",
@@ -16,9 +16,9 @@

## 1. Overview

Doris uses a two-tier data partitioning approach of **Partition + Bucket** to organize the data of a table in an orderly manner across the nodes of the cluster:
Doris distributes a table's data across the cluster in two tiers, **partitions** and **buckets**:

- **Partition**: horizontally divides the table into smaller subsets by column values (such as time or region), making it easier to perform query pruning and data lifecycle management.

Check warning on line 21 in versioned_docs/version-4.x/table-design/data-partitioning/basic-concepts.mdx

GitHub Actions / Build Check

markdown-code-fence-language

Code fence should declare a language. Owner: @apache/doris-website-maintainers
- **Bucket**: further evenly distributes data within each partition into multiple data shards (Tablets), fully utilizing the parallelism of the cluster.

![doris-data-partitioning](/images/next/table-design/data-partitioning.jpg)
@@ -29,7 +29,7 @@
Table ──► Partition ──► Bucket ──► Tablet (data shard, stored on BE nodes)
```

A reasonable partitioning and bucketing design brings the following benefits at the same time: **faster queries** (partition pruning, parallel scanning), **more flexible management** (archiving/cleanup by time), and **more even writes** (avoiding hotspots).
Good partitioning and bucketing give you three things at once: **faster queries** (partition pruning and parallel scans), **easier management** (archive or clean up by time), and **more even writes** (no hotspots).

## 2. Core Concepts

@@ -62,27 +62,27 @@

If no partition is specified at table creation time, Doris generates a default partition that is transparent to the user, containing all the data in the table.

A reasonable partition design brings the following benefits:
Good partition design provides:

- **Improved query performance**: through partition pruning, the system can filter out irrelevant partitions based on query conditions, reducing the amount of data scanned and significantly lowering the I/O burden, which is especially suitable for large-scale datasets.
- **Management flexibility**: data can be split along logical dimensions such as time or region, making archiving, cleanup, and backup easier. For example, partitioning by time enables efficient management of historical and incremental data, supporting time-based data maintenance strategies.
- **Faster queries**: partition pruning skips partitions that can't match the query, so Doris scans less data and does less I/O. This matters most on large datasets.
- **Easier management**: splitting by time or region makes archiving, cleanup, and backup simpler. For example, partitioning by time lets you manage historical and incoming data separately.

### 2.3 Bucket

Bucketing further divides the data within a partition into smaller, mutually disjoint data units according to certain rules. Each row of data belongs to exactly one specific bucket.

Unlike partitions that divide by ranges of column values, the goal of bucketing is to **evenly distribute** data across predefined buckets, thereby reducing data skew and improving query execution performance through better data locality.
Partitions divide data by ranges or lists of column values. Bucketing instead spreads data **evenly** across the buckets in a partition, which reduces skew and improves query performance through better data locality.

Doris supports two **bucketing methods**:

- **Hash bucketing**: computes the `crc32` hash of the bucketing column values and takes the modulo with the number of buckets to evenly distribute the data.
- **Random bucketing**: randomly assigns data to buckets. When using Random bucketing, you can combine the `load_to_single_tablet` parameter to optimize fast writes for small-scale data.

A reasonable bucketing design brings the following benefits:
Good bucketing provides:

- **Even data distribution**: reduces the risk of data concentration or skew, and avoids overloading some nodes or storage devices.
- **Reduced hotspots**: prevents some nodes or partitions from being overloaded, improving system stability and processing capability.
- **Improved concurrent performance**: when multiple query requests need to access different data within the same partition, bucketing allows the system to process multiple requests in parallel effectively, thereby improving throughput.
- **Even data distribution**: less risk of skew, and no single node or disk gets overloaded.
- **Fewer hotspots**: no node or partition gets overloaded, which keeps the system stable.
- **Better concurrency**: Doris reads different buckets in the same partition in parallel, which improves throughput.

### 2.4 Tablet and Node Architecture
