KingbaseES R3 集群主库归档失败案例


案例说明:
本案例用于KingbaseES R3集群归档进程归档日志失败的处理,对于一线的生产环境具有 一定的参考意义。

数据库版本:

TEST=# select version();
                                                         VERSION                                                         
---------------------------------------------------------------------------------------------------------------
 Kingbase V008R003C002B0270 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46), 64-bit
(1 row)

集群架构:

一、案例故障现象(主库归档失败)

1、主库数据库服务进程

[kingbase@node1 bin]$ ps -ef |grep kingbase
kingbase  8180     1  0 13:44 ?        00:00:00 /home/kingbase/cluster/kha/db/bin/kingbase -D /home/kingbase/cluster/kha/db/data
kingbase  8181  8180  0 13:44 ?        00:00:00 kingbase: logger process   
kingbase  8183  8180  0 13:44 ?        00:00:00 kingbase: checkpointer process   
kingbase  8184  8180  0 13:44 ?        00:00:00 kingbase: writer process   
kingbase  8185  8180  0 13:44 ?        00:00:00 kingbase: wal writer process   
kingbase  8186  8180  0 13:44 ?        00:00:00 kingbase: autovacuum launcher process   
kingbase  8187  8180  0 13:44 ?        00:00:00 kingbase: archiver process   failed on 000000020000000000000010
kingbase  8188  8180  0 13:44 ?        00:00:00 kingbase: stats collector process   
kingbase  8189  8180  0 13:44 ?        00:00:00 kingbase: bgworker: syslogical supervisor   
kingbase  8253  8180  0 13:44 ?        00:00:00 kingbase: wal sender process SYSTEM 192.168.7.243(51049) streaming 0/160001B0

2、查看sys_log日志

2021-03-01 13:49:30.693 CST,,,8187,,603c7f31.1ffb,23,,2021-03-01 13:44:17 CST,,0,LOG,00000,"archive command failed with exit code 45","The failed archive command was: /home/kingbase/cluster/kha/db/bin/sys_rman_v6 --config /home/kingbase/kbbr3_repo/sys_rman_v6.conf --stanza=kingbase archive-push sys_xlog/000000020000000000000010",,,,,,,,""
2021-03-01 13:49:30.694 CST,,,8187,,603c7f31.1ffb,24,,2021-03-01 13:44:17 CST,,0,WARNING,01000,"archiving transaction log file ""000000020000000000000010"" failed too many times, will try again later",,,,,,,,,""

二、故障处理步骤

1、查看archive日志归档配置

2、查看归档配置文件和目录信息

[kingbase@node1 sys_log]$ ls -lh /home/kingbase/kbbr3_repo/sys_rman_v6.conf
-rw-rw-r-- 1 kingbase kingbase 589 Mar  1 12:26 /home/kingbase/kbbr3_repo/sys_rman_v6.conf

[kingbase@node1 sys_log]$ cat /home/kingbase/kbbr3_repo/sys_rman_v6.conf
# Genarate by script at 20210301122559, should not change manually
[kingbase]
kb1-path=/home/kingbase/cluster/kha/db/data
kb1-port=54321
kb1-user=SUPERMANAGER_V8ADMIN
kb1-pass=S0lOR0JBU0VBRE1JTg==
kb2-path=/home/kingbase/cluster/kha/db/data
kb2-port=54321
kb2-user=SUPERMANAGER_V8ADMIN
kb2-pass=S0lOR0JBU0VBRE1JTg==
kb2-host=192.168.7.243
kb2-host-user=kingbase

[global]
repo1-path=/home/kingbase/kbbr3_repo
repo1-retention-full=5
log-path=/tmp/
log-level-file=info
log-level-console=info
log-subprocess=y
process-max=4
#### default gz, support: gz none
compress-type=gz
compress-level=3

3、执行手工归档

[kingbase@node1 bin]$ /home/kingbase/cluster/kha/db/bin/sys_rman_v6 --config /home/kingbase/kbbr3_repo/sys_rman_v6.conf --stanza=kingbase archive-push /home/kingbase/cluster/kha/db/data/sys_xlog/000000020000000000000010

2021-03-01 14:43:20.928 P00   INFO: archive-push command begin 2.27: [/home/kingbase/cluster/kha/db/data/sys_xlog/000000020000000000000010] --compress-level=3 --compress-type=gz --config=/home/kingbase/kbbr3_repo/sys_rman_v6.conf --log-level-console=info --log-level-file=info --log-path=/tmp --log-subprocess --kb2-host=192.168.7.243 --kb1-path=/home/kingbase/cluster/kha/db/data --kb2-path=/home/kingbase/cluster/kha/db/data --process-max=4 --repo1-path=/home/kingbase/kbbr3_repo --stanza=kingbase
2021-03-01 14:43:21.203 P00   INFO: pushed WAL file '000000020000000000000010' to the archive
2021-03-01 14:43:21.204 P00   INFO: archive-push command end: completed successfully (276ms)

# 查看归档日志文件
[kingbase@node1 data]$ ls -lh /home/kingbase/kbbr3_repo/archive/kingbase/9.6-1/0000000200000000
total 208K
-rw-r----- 1 kingbase kingbase 91K Mar  1 12:20 00000002000000000000000E-583ac46b5270f365463cb0bfb3b96185af6492dd.gz
-rw-r----- 1 kingbase kingbase 303 Mar  1 12:20 00000002000000000000000F.00000028.backup
-rw-r----- 1 kingbase kingbase 83K Mar  1 12:20 00000002000000000000000F-1c450ca422ee6312e8a69dcb7d8c446a99425995.gz
-rw-r----- 1 kingbase kingbase 28K Mar  1 12:21 000000020000000000000010-044ab3927144a6510a42ce9c2bc331cf209aff56.gz

=从以上信息可知,手工归档成功,可以说明归档的配置文件及目录权限等没有问题。=

4、重启集群测试

[kingbase@node1 bin]$ ps -ef |grep kingbase

kingbase  8180     1  0 13:44 ?        00:00:00 /home/kingbase/cluster/kha/db/bin/kingbase -D /home/kingbase/cluster/kha/db/data
kingbase  8181  8180  0 13:44 ?        00:00:00 kingbase: logger process   
kingbase  8183  8180  0 13:44 ?        00:00:00 kingbase: checkpointer process   
kingbase  8184  8180  0 13:44 ?        00:00:00 kingbase: writer process   
kingbase  8185  8180  0 13:44 ?        00:00:00 kingbase: wal writer process   
kingbase  8186  8180  0 13:44 ?        00:00:00 kingbase: autovacuum launcher process   
kingbase  8187  8180  0 13:44 ?        00:00:00 kingbase: archiver process   failed on 000000020000000000000010
kingbase  8188  8180  0 13:44 ?        00:00:00 kingbase: stats collector process   
kingbase  8189  8180  0 13:44 ?        00:00:00 kingbase: bgworker: syslogical supervisor   
kingbase  8253  8180  0 13:44 ?        00:00:00 kingbase: wal sender process SYSTEM 192.168.7.243(51049) streaming 0/160001B0

=== 从以上信息可知,归档仍然失败。===

5、修改archive_command配置(跳过归档)

6、重启集群测试

[kingbase@node1 bin]$ ps -ef |grep kingbase

kingbase 21906  5688  0 14:09 pts/0    00:00:00 ./ksql -U SYSTEM -W ******** TEST
kingbase 23167     1  0 14:11 ?        00:00:00 /home/kingbase/cluster/kha/db/bin/kingbase -D /home/kingbase/cluster/kha/db/data
kingbase 23168 23167  0 14:11 ?        00:00:00 kingbase: logger process   
kingbase 23170 23167  0 14:11 ?        00:00:00 kingbase: checkpointer process   
kingbase 23171 23167  0 14:11 ?        00:00:00 kingbase: writer process   
kingbase 23172 23167  0 14:11 ?        00:00:00 kingbase: wal writer process   
kingbase 23173 23167  0 14:11 ?        00:00:00 kingbase: autovacuum launcher process   
kingbase 23174 23167  0 14:11 ?        00:00:00 kingbase: archiver process   
kingbase 23175 23167  0 14:11 ?        00:00:00 kingbase: stats collector process   
kingbase 23176 23167  0 14:11 ?        00:00:00 kingbase: bgworker: syslogical supervisor   
kingbase 23194 23167  0 14:11 ?        00:00:00 kingbase: wal sender process SYSTEM 192.168.7.243(54037) streaming 0/180000D0

=== 从以上信息可知,已经没有归档失败的状态信息。===

7、在主库手工执行wal日志切换

TEST=# select sys_switch_xlog();
 SYS_SWITCH_XLOG 
-----------------
 0/180000E8
(1 row)

TEST=# select sys_switch_xlog();
 SYS_SWITCH_XLOG 
-----------------
 0/19000078
(1 row)

TEST=# select sys_switch_xlog();
 SYS_SWITCH_XLOG 
-----------------
 0/1A000000
(1 row)

8、再重新恢复archive_command配置

9、重启集群测试

[kingbase@node1 bin]$ ps -ef |grep kingbase
kingbase 25979     1  0 14:15 ?        00:00:00 /home/kingbase/cluster/kha/db/bin/kingbase -D /home/kingbase/cluster/kha/db/data
kingbase 25980 25979  0 14:15 ?        00:00:00 kingbase: logger process   
kingbase 25983 25979  0 14:15 ?        00:00:00 kingbase: checkpointer process   
kingbase 25984 25979  0 14:15 ?        00:00:00 kingbase: writer process   
kingbase 25985 25979  0 14:15 ?        00:00:00 kingbase: wal writer process   
kingbase 25986 25979  0 14:15 ?        00:00:00 kingbase: autovacuum launcher process   
kingbase 25987 25979  0 14:15 ?        00:00:00 kingbase: archiver process   
kingbase 25988 25979  0 14:15 ?        00:00:00 kingbase: stats collector process   
kingbase 25989 25979  0 14:15 ?        00:00:00 kingbase: bgworker: syslogical supervisor   
kingbase 26006 25979  0 14:15 ?        00:00:00 kingbase: wal sender process SYSTEM 192.168.7.243(54457) streaming 0/1B0000D0

=== 从以上信息可知,已经没有归档失败的状态信息。===

10、测试wal日志归档

1)执行wal日志切换(主库)

TEST=# select sys_switch_xlog();
 SYS_SWITCH_XLOG 
-----------------
 0/1C000078
(1 row)

TEST=# select sys_switch_xlog();
 SYS_SWITCH_XLOG 
-----------------
 0/1E000000
(1 row)

2)查看归档信息

[kingbase@node1 0000000200000000]$ pwd
/home/kingbase/kbbr3_repo/archive/kingbase/9.6-1/0000000200000000
[kingbase@node1 0000000200000000]$ ls -lh
total 460K
-rw-r----- 1 kingbase kingbase 91K Mar  1 12:20 00000002000000000000000E-583ac46b5270f365463cb0bfb3b96185af6492dd.gz
-rw-r----- 1 kingbase kingbase 303 Mar  1 12:20 00000002000000000000000F.00000028.backup
-rw-r----- 1 kingbase kingbase 83K Mar  1 12:20 00000002000000000000000F-1c450ca422ee6312e8a69dcb7d8c446a99425995.gz
-rw-r----- 1 kingbase kingbase 28K Mar  1 12:21 000000020000000000000010-044ab3927144a6510a42ce9c2bc331cf209aff56.gz
-rw-r----- 1 kingbase kingbase 83K Mar  1 14:22 00000002000000000000001B-f64eadf9b3ecb50ce6925cdc8c196bf33af4cc8c.gz
-rw-r----- 1 kingbase kingbase 84K Mar  1 14:22 00000002000000000000001C-300ed282be050b1194cc15b974b5b11b120d2076.gz
-rw-r----- 1 kingbase kingbase 84K Mar  1 14:23 00000002000000000000001D-249ab2fafe02a51ca06846098d4bad2f786d1422.gz

=从归档日志信息看,在wal日志发生切换时,产生了归档;但是前面在修改archive_command='/bin/true'参数后,导致一部分wal日志没有归档。==

11、查看数据库服务进程

[kingbase@node1 0000000200000000]$ ps -ef |grep kingbase
.......
kingbase 25979     1  0 14:15 ?        00:00:00 /home/kingbase/cluster/kha/db/bin/kingbase -D /home/kingbase/cluster/kha/db/data
kingbase 25980 25979  0 14:15 ?        00:00:00 kingbase: logger process   
kingbase 25983 25979  0 14:15 ?        00:00:00 kingbase: checkpointer process   
kingbase 25984 25979  0 14:15 ?        00:00:00 kingbase: writer process   
kingbase 25985 25979  0 14:15 ?        00:00:00 kingbase: wal writer process   
kingbase 25986 25979  0 14:15 ?        00:00:00 kingbase: autovacuum launcher process   
kingbase 25987 25979  0 14:15 ?        00:00:00 kingbase: archiver process   last was 00000002000000000000001D
kingbase 25988 25979  0 14:15 ?        00:00:00 kingbase: stats collector process   
kingbase 25989 25979  0 14:15 ?        00:00:00 kingbase: bgworker: syslogical supervisor   
kingbase 26006 25979  0 14:15 ?        00:00:00 kingbase: wal sender process SYSTEM 192.168.7.243(54457) streaming 0/1E000060

=== 从以上信息获知,数据库归档进程正常。===

三、总结

此案例解决了KingbaseES R3集群归档失败的故障,对于数据库归档失败在原生PostgreSQL也发现过此类问题,处理起来比较繁琐,主库归档恢复正常后,需要对数据库做一次物理的全备,因为中间缺失了一部分wal日志的归档。

相关