KDB加载文本数据最佳方案--kdloader

技术2022-07-11 171

KDB快速加载文本数据方式

kdloader介绍kdloader最佳性能实践测试环境信息测试步骤总结注意事项kdloader参数说明

kdloader介绍

kdLoader 是 K-DB 数据库提供的数据加载工具。只需要创建保存数据的文本文件，从文本文件中导入数据到数据库。可轻松的将大量数据批量载入 K-DB 数据库。

kdloader最佳性能实践

测试环境信息

ENV： 2 + 3 DISK: PCIE SSD CPU: Intel® Xeon® CPU E5-2680 v4 @ 2.40GHz 56cores 机器： NF5280M4 测试表/rows/大小: account / 5725269 / 4G /

测试步骤

普通加载，日志产生量为4.2G，耗时33min kdloader userid=test/test control=ACCOUNT.ctl log=account1.log 设置表不记录redo，日志产生量为4.2G。耗时33min，无效果 alter table account nologging; kdloader userid=test/test control=ACCOUNT.ctl log=account2.log 调整bindsize, 从65536 改为最大值10485760，日志产生量为3.3.G。耗时6min，提高了5倍加载速度 kdloader errors=500 bindsize=10485760 userid=test/test control=ACCOUNT.ctl log=accoun3.log 使用并行加载 direct=Y 方式，日志产生量为0.02G。耗时2min，提高了15倍速度，单个CPU core打满 kdloader errors=500 bindsize=10485760 readsize=2097152 userid=test/test direct=Y control=ACCOUNT.ctl log=accoun5.log 把大文件拆分成多个小文件。日志产生量为0.02G. 耗时0.5min，提高了60倍速度。多个CPU core打满 kdloader errors=500 bindsize=10485760 userid=test/test direct=Y control=ACCOUNT1.ctl log=accoun6.log

总结

kdloader程序，最快的加载方式为：使用直接路径加载，把大文件拆分为多个小文件，并行加载。速度最快

注意事项

设置 LD_MEM_PROTECT=N --没有明显效果如果表上有索引，不能使用并行直接路径加载方式。表上有索引的，建议导入数据之后重建索引数据文件中每个列最好不用双引号分割，否则parallel参数无效。

kdloader参数说明

$ kdloader --help

Options: -h|–help Display the more detailed information. -c|–charset Display usable character sets in this version. -v|–version Display the version information.

Controls: userid=username/password@dbname Logon information to connect. The dbname has to be matched with the name described in tbdsn.tbr exactly.

control= The name of the control file. Allowed to use a absolute path and a relative path.

data= The name of the data file to load. Allowed to use a absolute path and a relative path.

log= The name of the log file. This is for logging the entire loading process. Allowed to use a absolute path and a relative path. If this parameter is omitted, it is used .log as default.

bad= The name of the bad file. This is for logging the records to be failed to load. Allowed to use a absolute path and a relative path. If this parameter is omitted, it is used .bad as default.

discard= The name of the discard file. This is for logging the records to be discarded. Allowed to use a absolute path and a relative path. If this parameter is omitted, the records does not be logged.

skip= The count of line to skip in data file from the start line. The default is 0.

errors= The count of error to allow. If error is raised more than this value, abort the loading immediately. The default is 50.

rows= The count of row for each commit. If this value is smaller, the performance is lower. The default is ‘commit after the end’.

message= The count of records to process to print out. The value is smaller, the performance is slower. The default is 0.

readsize= The size of buffer to read from the data file. This value is allowed to use the maximum 2M(2,097,152) byte. The default is 65536 byte.

disable_idx=<Y|N> Whether to disable indexes before loading. The default is ‘N’.

direct=<Y|N> Whether to use the Direct Path Loading. If you choose ‘y|Y’, load by the Direct Path Loading. Or load by the conventional loading. The default is ‘N’.

dpl_log=<Y|N> Whether to log about the Direct Path Loading. If you choose ‘y|Y’, be logged the loading process and use it by recovery. But the performance is low. If you choose ‘n|N’, do not be logged. The default is ‘N’.

parallel= The count of threads to execute Parallel Direct Path Loading. The default is 1.

bindsize= The size of buffer to read in the Direct Path Loading. This value is allowed to use the maximum 10M(10,485,760) byte. The default is 65536 byte.

Processed: 0.024, SQL: 9