proxysql运维实践
前置基础参考:https://www.cnblogs.com/gered/p/15868767.html#autoid-4-0-0
【1】环境
当前架构
【2】主从场景
(1)1主2从,主库挂掉
10秒就超时(这个阈值是连接超时)反馈出来了,但如果我们的DML、select 等其他操作 超过10秒是没有关系的;
主库挂了,非 mysql_query_rules 定位到从组的请求,都会连不上,我们查看一下相关日志
select * from runtime_mysql_servers; select from_unixtime(time_start_us/1000/1000) as `datetime`,* from mysql_server_ping_log where ping_error is not null limit 10; select from_unixtime(time_start_us/1000/1000) as `datetime`,* from mysql_server_connect_log where connect_error is not null order by time_start_us desc limit 10; select from_unixtime(time_start_us/1000/1000) as `datetime`,* from mysql_server_read_only_log where error is not null limit 10;
结果:
admin@mysqldb 17:02:48 [(none)]>select * from runtime_mysql_servers; +--------------+----------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+ | hostgroup_id | hostname | port | gtid_port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment | +--------------+----------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+ | 10 | 192.168.148.39 | 3306 | 0 | SHUNNED | 1 | 0 | 1000 | 0 | 0 | 0 | | | 20 | 192.168.148.30 | 3306 | 0 | ONLINE | 1 | 0 | 1000 | 0 | 0 | 0 | | | 20 | 192.168.148.27 | 3306 | 0 | ONLINE | 1 | 0 | 1000 | 0 | 0 | 0 | | +--------------+----------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+ 3 rows in set (0.00 sec) admin@mysqldb 17:02:52 [(none)]>select from_unixtime(time_start_us/1000/1000) as `datetime`,* from mysql_server_ping_log where ping_error is not null limit 10; +---------------------+----------------+------+------------------+----------------------+---------------------------------------------------------+ | datetime | hostname | port | time_start_us | ping_success_time_us | ping_error | +---------------------+----------------+------+------------------+----------------------+---------------------------------------------------------+ | 2022-02-16 08:56:43 | 192.168.148.39 | 3306 | 1645001803735858 | 0 | Can't connect to MySQL server on '192.168.148.39' (115) | | 2022-02-16 08:56:45 | 192.168.148.39 | 3306 | 1645001805750545 | 0 | Can't connect to MySQL server on '192.168.148.39' (115) | | 2022-02-16 08:56:47 | 192.168.148.39 | 3306 | 1645001807770656 | 0 | Can't connect to MySQL server on '192.168.148.39' (115) | | 2022-02-16 08:56:49 | 192.168.148.39 | 3306 | 1645001809742158 | 0 | Can't connect to MySQL server on '192.168.148.39' (115) | | 2022-02-16 08:56:51 | 192.168.148.39 | 3306 | 1645001811771683 | 0 | Can't connect to MySQL server on '192.168.148.39' (115) | | 2022-02-16 08:56:53 | 192.168.148.39 | 3306 | 1645001813752847 | 0 | Can't connect to MySQL server on '192.168.148.39' (115) | | 2022-02-16 08:56:55 | 192.168.148.39 | 3306 | 1645001815746628 | 0 | Can't connect to MySQL server on '192.168.148.39' (115) | | 2022-02-16 08:56:57 | 192.168.148.39 | 3306 | 1645001817742752 | 0 | Can't connect to MySQL server on '192.168.148.39' (115) | | 2022-02-16 08:56:59 | 192.168.148.39 | 3306 | 1645001819728439 | 0 | Can't connect to MySQL server on '192.168.148.39' (115) | | 2022-02-16 08:57:01 | 192.168.148.39 | 3306 | 1645001821729269 | 0 | Can't connect to MySQL server on '192.168.148.39' (115) | +---------------------+----------------+------+------------------+----------------------+---------------------------------------------------------+ 10 rows in set (0.00 sec) admin@mysqldb 17:02:52 [(none)]>select from_unixtime(time_start_us/1000/1000) as `datetime`,* from mysql_server_connect_log where connect_error is not null order by time_start_us desc limit 10; +---------------------+----------------+------+------------------+-------------------------+---------------------------------------------------------+ | datetime | hostname | port | time_start_us | connect_success_time_us | connect_error | +---------------------+----------------+------+------------------+-------------------------+---------------------------------------------------------+ | 2022-02-16 08:56:47 | 192.168.148.39 | 3306 | 1645001807770108 | 0 | Can't connect to MySQL server on '192.168.148.39' (115) | | 2022-02-16 08:56:45 | 192.168.148.39 | 3306 | 1645001805716648 | 0 | Can't connect to MySQL server on '192.168.148.39' (115) | | 2022-02-16 08:56:43 | 192.168.148.39 | 3306 | 1645001803713712 | 0 | Can't connect to MySQL server on '192.168.148.39' (115) | | 2022-02-16 08:56:41 | 192.168.148.39 | 3306 | 1645001801713565 | 0 | Can't connect to MySQL server on '192.168.148.39' (115) | +---------------------+----------------+------+------------------+-------------------------+---------------------------------------------------------+ 4 rows in set (0.00 sec) admin@mysqldb 17:02:52 [(none)]>select from_unixtime(time_start_us/1000/1000) as `datetime`,* from mysql_server_read_only_log where error is not null limit 10; +---------------------+----------------+------+------------------+-----------------+-----------+---------------------------------------------------------------------------------------------+ | datetime | hostname | port | time_start_us | success_time_us | read_only | error | +---------------------+----------------+------+------------------+-----------------+-----------+---------------------------------------------------------------------------------------------+ | 2022-02-16 08:56:42 | 192.168.148.39 | 3306 | 1645001802730203 | 0 | NULL | Lost connection to MySQL server during query | | 2022-02-16 08:56:44 | 192.168.148.39 | 3306 | 1645001804750337 | 0 | NULL | timeout on creating new connection: Can't connect to MySQL server on '192.168.148.39' (115) | | 2022-02-16 08:56:46 | 192.168.148.39 | 3306 | 1645001806742312 | 0 | NULL | timeout on creating new connection: Can't connect to MySQL server on '192.168.148.39' (115) | +---------------------+----------------+------+------------------+-----------------+-----------+---------------------------------------------------------------------------------------------+ 3 rows in set (0.00 sec)
然后我们看一下最新的
select from_unixtime(time_start_us/1000/1000) as `datetime`,* from mysql_server_ping_log order by time_start_us desc limit 10; select from_unixtime(time_start_us/1000/1000) as `datetime`,* from mysql_server_connect_log order by time_start_us desc limit 10; select from_unixtime(time_start_us/1000/1000) as `datetime`,* from mysql_server_read_only_log order by time_start_us desc limit 10;
从下图我们可以看出, ping 还是一直不通,而 connect 、read_only 中甚至直接剔除对 主库 192.168.148.39 的连接尝试了
结合上面的所有,我们发现 ping_log 报错了 300次(2s一次的频率),而 connect_log(2秒一次的频率)只报了4次;
如下图,我们发现我们选中的参数,次数是3,所以就 connect 重试3次后,所以 connect_log 中只有4次记录是正常的;
(2)1主2从,主库挂掉后重新启动恢复
恢复主库后:可无缝连上,现有连接无需重连,无需断开原长连接
、
恢复后,我们看看 mysql server,hostgrou_id
我们可以看到,status 还是 shunned 状态
我们执行一个路由到主库的相关操作,它状态就变回来了;
(3)1主2从=》主库挂掉,从变主库=》1主1从
关闭主库,从库1: 192.168.148.27 执行如下代码,变成 proxysql中的主库
stop slave; set global read_only=0;
如下图:
为什么主库变成下图这样,即在10 写组,又在20 读组?因为如下参数
mysql-monitor_writer_is_also_reader = true(当原read_only=1 的server read_only=0时,迁移一份到 writer 组的同时,也保留它的 reader 角色)
这个是为了避免故障转移后只有1个实例的时候,可以同时兼顾读写,从而业务连续;
我们再查询看看:是可以都读负载均衡分发的;写也是写到 我们的新主库 192.168.148.27 中去了;
注意:proxysql 连接在任意时候都没断开过,所以主从切换等情况不影响它的长连接,也不需要重连等等;
(4)1主2从=》从库全部挂掉
现有连接,断掉了;
如下图,我们发现从库都挂掉了
如下图,我们发现,只能写,不能读了,这是因为 读规则里的所有机器都挂掉了;
所以我们要么立马所有20(读路由)的 query_rules,全部干掉
delete from mysql_query_rules where destination_hostgroup=20;
删除前是不行的,删除 20的 query rule之后然后可以查了;
(5)1主2从=》全部从库挂掉的解决办法
需求:当我们所有从挂掉的时候,proxysql的读请求无法访问!! 因为select 路由还是会路由到 20分组,而20分组所有实例不可用;
解决:
把主库也额外加入到写组去,权重设置为1,其他的设置为1000,那就99.95% 请求都不会分发到权重为1的 实例上去;
1、所有规则还原到【1】中的去
2、测试权重的作用
update mysql_servers set weight=1000 where hostgroup_id=20; INSERT INTO mysql_servers(hostgroup_id,hostname,port) VALUES (20,'192.168.148.39',3306); load mysql servers to runtime; save mysql servers to disk;
查询路由如下:
我们来看看短连接 查询3000次;
for i in {1..3000}; do mysql -uproxysql -p123456 -P6033 -h 192.168.148.39 -e"select @@server_id" >>1.txt; done
发现没有 393306的,也就是3000次居然没有一次分发到 主库;这就是权重的原因;
我们再来试试,1W次,分到主库的查询才6次,可以理解成近2000次查询请求,才分配到1次给主库;
3、模拟2个从库都挂掉
如下图,我们可以看到 2个从库都是 shunned 状态了;无法连接的上了
查询还是可以查的,主库承接起来了查询请求;
成功;
【总结】主从模式的设计与规划
(1)mysql-monitor_writer_is_also_reader = true
在read_only=1 的读实例 变成主后(变成 writer 后),依然保留一份它作为 reader
1主1从的时候:
为了防止主库挂掉后,从库变成新主库导致没有 read_only=1 的 查询实例,所以加上这个参数就可以解决这个问题,新主库的单实例读写均可;
(2)mysql_servers 下的 weight 字段;权重的设置
权重的设置,非多主的情况,主库也要加入读组,可以通过权重来解决;
大概可以换算成,比如我们现在是 2个从库的权重是1000,主库的只读配置的权重是 1,那么主库承接请求的可能性就是 1/(1000+1000+1);
但在一主多从的情况下,要结合 mysql-monitor_writer_is_also_reader 来查看使用;