大数据Doris（二十四）：Doris数据Insert Into导入方式介绍网站首页 技术交流

大数据Doris（二十四）：Doris数据Insert Into导入方式介绍

Lansonli 2024-06-17 11:27:15

简介大数据Doris（二十四）：Doris数据Insert Into导入方式介绍

文章目录

Doris数据Insert Into导入方式介绍

Doris数据Insert Into导入方式介绍

Doris 提供多种数据导入方案，可以针对不同的数据源进行选择不同的数据导入方式。Doris支持各种各样的数据导入方式：Insert Into、json格式数据导入、Binlog Load、Broker Load、Routine Load、Spark Load、Stream Load、S3 Load，后面文章分别进行介绍。

注意：Doris 中的所有导入操作都有原子性保证，即一个导入作业中的数据要么全部成功，要么全部失败，不会出现仅部分数据导入成功的情况。

Insert Into 语句的使用方式和 MySQL 等数据库中 Insert Into 语句的使用方式类似。但在 Doris 中，所有的数据写入都是一个独立的导入作业，所以这里将 Insert Into 也作为一种导入方式介绍。

一、语法及参数

Insert Into插入数据的语法如下：

INSERT INTO table_name
[ PARTITION (p1, ...) ]
[ WITH LABEL label]
[ (column [, ...]) ]
{ VALUES ( { expression | DEFAULT } [, ...] ) [, ...] | query }

以上语法参数的解释如下：

tablet_name: 导入数据的目的表。可以是 db_name.table_name 形式。
partitions: 指定待导入的分区，必须是 table_name 中存在的分区，多个分区名称用逗号分隔。
label: 为 Insert 任务指定一个 label。
column_name: 指定的目的列，必须是 table_name 中存在的列。
expression: 需要赋值给某个列的对应表达式。
DEFAULT: 让对应列使用默认值。
query: 一个普通查询，查询的结果会写入到目标中。

Insert Into 命令需要通过 MySQL 协议提交，创建导入请求会同步返回导入结果，主要的Insert Into 命令包含以下两种：

INSERT INTO tbl SELECT ...
INSERT INTO tbl (col1, col2, ...) VALUES (1, 2, ...), (1,3, ...);

二、案例

下面创建表tbl1,来演示Insert Into操作。

#创建表 tbl1
CREATE TABLE IF NOT EXISTS example_db.tbl1
(
`user_id` BIGINT NOT NULL COMMENT "用户id",
`date` DATE NOT NULL COMMENT "日期",
`username` VARCHAR(32) NOT NULL COMMENT "用户名称",
`age` BIGINT NOT NULL COMMENT "年龄",
`score` BIGINT NOT NULL DEFAULT "0" COMMENT "分数"
)
DUPLICATE KEY(`user_id`)
PARTITION BY RANGE(`date`)
(
PARTITION `p1` VALUES [("2023-01-01"),("2023-02-01")),
PARTITION `p2` VALUES [("2023-02-01"),("2023-03-01")),
PARTITION `p3` VALUES [("2023-03-01"),("2023-04-01"))
)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);

#通过Insert Into 向表中插入数据
mysql> insert into example_db.tbl1 values  (1,"2023-01-01","zs",18,100), (2,"2023-02-01","ls",19,200);
Query OK, 2 rows affected (0.09 sec)
{'label':'insert_1b2ba205dee54110_b7a9c0e53b866215', 'status':'VISIBLE', 'txnId':'6015'}

#创建表tbl2 ，表结构与tbl1一样,同时数据会复制过来。
mysql> create table tbl2  as select * from tbl1;
Query OK, 2 rows affected (0.43 sec)
{'label':'insert_fad2b6e787fa451a_90ba76071950c3ae', 'status':'VISIBLE', 'txnId':'6016'}

#向表tbl2中使用Insert into select 方式插入数据
mysql> insert into tbl2 select * from tbl1;
Query OK, 2 rows affected (0.18 sec)
{'label':'insert_7a52e9f60f7b454b_a9807cd2281932dc', 'status':'VISIBLE', 'txnId':'6017'}

#Insert into 还可以指定Label，指定导入作业的标识
mysql> insert into example_db.tbl2 with label mylabel values (3,"2023-03-01","ww",20,300),(4,"2023-03-01","ml",21,400);
Query OK, 2 rows affected (0.11 sec)
{'label':'mylabel', 'status':'VISIBLE', 'txnId':'6018'}

#查询表tbl2中的数据
mysql> select * from tbl2;
+---------+------------+----------+------+-------+
| user_id | date       | username | age  | score |
+---------+------------+----------+------+-------+
|       1 | 2023-01-01 | zs       |   18 |   100 |
|       1 | 2023-01-01 | zs       |   18 |   100 |
|       4 | 2023-03-01 | ml       |   21 |   400 |
|       2 | 2023-02-01 | ls       |   19 |   200 |
|       2 | 2023-02-01 | ls       |   19 |   200 |
|       3 | 2023-03-01 | ww       |   20 |   300 |
+---------+------------+----------+------+-------+
6 rows in set (0.12 sec)

Insert Into 本身就是一个 SQL 命令，其返回结果会根据执行结果的不同，分为结果集为空和结果集不为空两种情况。

结果集为空时，返回“Query OK, 0 rows affected”。结果集不为空时分为导入成功和导入失败，导入失败直接返回对应的错误，导入成功返回一个包含“label”、“status”、“txnId”等字段的json串，例如：

{'label':'my_label1', 'status':'visible', 'txnId':'4005'}

{'label':'insert_f0747f0e-7a35-46e2-affa-13a235f4020d', 'status':'committed', 'txnId':'4005'}

{'label':'my_label1', 'status':'visible', 'txnId':'4005', 'err':'some other error'}

label 为用户指定的 label 或自动生成的 label。Label 是该 Insert Into 导入作业的标识。每个导入作业，都有一个在单 database 内部唯一的 Label。
status 表示导入数据是否可见。如果可见，显示 visible，如果不可见，显示 committed。数据不可见是一个临时状态，这批数据最终是一定可见的。
txnId 为这个 insert 对应的导入事务的 id。
err 字段会显示一些其他非预期错误。

当前执行 INSERT 语句时，对于有不符合目标表格式的数据，默认的行为是过滤，比如字符串超长等。 但是对于有要求数据不能够被过滤的业务场景，可以通过设置会话变量 enable_insert_strict 为 true （默认true,建议为true）来确保当有数据被过滤掉的时候，INSERT 不会被执行成功 。也可以通过命令：set enable_insert_strict=false；设置为false，插入数据时至少有一条数据被正确导入，则返回成功，那么错误的数据会自动过滤不插入数据表，当需要查看被过滤的行时，用户可以通过“SHOW LOAD ”语句查看，举例如下：

#向表tbl1中插入包含错误数据的数据集，返回报错信息
mysql> insert into example_db.tbl1 values (3,"2023-03-01","wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww",20,300),(4,"2023-03-01","ml",21,400);
ERROR 5025 (HY000): Insert has filtered data in strict mode, tracking_url=http://192.168.179.6:8040/api/_load_error_log?file=__shard_0/error_log_insert_stmt_34684048e4234210-b0c4a99c9aabcb20_34684048e4234210_b0c4a99c9aabcb20

#设置 enable_insert_strict 为false
set enable_insert_strict=false;

#向表tbl1中插入包含错误数据的数据集
mysql> insert into example_db.tbl1 values (3,"2023-03-01","wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww",20,300),(4,"2023-03-01","ml",21,400);
Query OK, 1 row affected, 1 warning (0.18 sec)
{'label':'insert_43d97ba2ec544fde_b4339d3f1c93753c', 'status':'VISIBLE', 'txnId':'7010'}

#show load查看过滤的数据获取URL
mysql> show loadG;
*************************** 1. row ***************************
         JobId: 21007
         Label: insert_43d97ba2ec544fde_b4339d3f1c93753c
         State: FINISHED
      Progress: ETL:100%; LOAD:100%
          Type: INSERT
       EtlInfo: NULL
      TaskInfo: cluster:N/A; timeout(s):3600; max_filter_ratio:0.0
      ErrorMsg: NULL
    CreateTime: 2023-02-10 20:47:06
  EtlStartTime: 2023-02-10 20:47:06
 EtlFinishTime: 2023-02-10 20:47:06
 LoadStartTime: 2023-02-10 20:47:06
LoadFinishTime: 2023-02-10 20:47:06
           URL: http://192.168.179.7:8040/api/_load_error_log?file=__shard_0/error_log_insert_stmt_43d97ba2ec544fde-b4339d3f1c93753d_43d97ba2ec544fde_b4339d3f1c93753d    JobDetails: {"Unfinished backends":{},"ScannedRows":0,"TaskNumber":0,"LoadBytes":0,"All backends":{},"FileNumber":0,"FileSize":0}
 TransactionId: 7010
  ErrorTablets: {}

#执行 SHOW LOAD WARNINGS ON "url" 来查询被过滤数据信息
mysql> SHOW LOAD WARNINGS ON "http://192.168.179.7:8040/api/_load_error_log?file=__shard_0/error_log_insert_stmt_43d97ba2ec544fde-b4339d3f1c93753d_43d97ba2ec544fde_b4339d3f1c93753d"G;
*************************** 1. row ***************************
         JobId: -1
         Label: NULL
ErrorMsgDetail: Reason: column_name[username], the length of input is too long than schema. first 32 bytes of input str: [wwwwwwwwwwwwwwwwwwwwwww
wwwwwwwww] schema length: 32; actual length: 36; . src line []; 1 row in set (0.01 sec)

三、注意事项

1、关于插入数据量

Insert Into 对数据量没有限制，大数据量导入也可以支持。但Insert Into 有默认的超时时间，用户预估的导入数据量过大，就需要修改系统的Insert Into导入超时时间。如何预估导入时间，估算方式如下：

假设有36G数据需要导入到Doris，Doris集群数据导入速度为10M/s（最大限速为10M/s，可以根据先前导入的数据量/消耗秒计算出当前集群平均的导入速度），那么预估导入时间为36G*1024M/(10M/s) = ~3686秒。

2、关于insert操作返回结果

如果返回结果为 ERROR 1064 (HY000)，则表示导入失败。
如果返回结果为 Query OK，则表示执行成功。
- 如果 rows affected 为 0，表示结果集为空，没有数据被导入。
- 如果 rows affected 大于 0：
  - 如果 status 为 committed，表示数据还不可见。需要通过 show transaction 语句查看状态直到 visible
  - 如果 status 为 visible，表示数据导入成功。
  - 如果 warnings 大于 0，表示有数据被过滤，可以通过 show load 语句获取 url 查看被过滤的行。

3、关于导入任务超时

导入任务的超时时间(以秒为单位)，导入任务在设定的 timeout 时间内未完成则会被系统取消，变成 CANCELLED。目前 Insert Into 并不支持自定义导入的 timeout 时间，所有 Insert Into 导入的超时时间是统一的，默认的 timeout 时间为1小时。如果导入的源文件无法在规定时间内完成导入，则需要调整 FE 的参数insert_load_default_timeout_second。

同时Insert Into语句受到Session变量query_timeout的限制。可以通过 SET query_timeout = xxx; 来增加超时时间，单位是秒。