Zabbix proxy troubleshooting
Zabbix proxy is a major component of the overall Zabbix architecture. As a result, the failure of a single proxy can have dramatic consequences for the whole monitoring setup, causing a storm of events and problems.
Contents
I. Setup (0:57)
II. Why is the queue growing? (2:57)
1. Misconfiguration
2. Proxy cannot connect to the agent (6:26)
3. Proxy cannot connect to the server (12:15)
4. Proxy cannot send data fast enough (13:52)
5. Proxy doesn’t have enough processes (18:14)
6. Server cannot keep up with the data (23:34)
7. Too many unsent values (26:08)
# The following is an official blog post that covers the most common Zabbix proxy failures, to help you locate problems quickly. (Real cases are often much more complex; treat this as a rough guide.)
In this post, the most common proxy troubleshooting cases are covered, which will give you a quick idea of where to look when something goes wrong.
1. Setup
Setup (installed from the packages):
- Zabbix 5.0.1,
- CentOS 8,
- Zabbix proxy on the same machine with the Zabbix Server,
- MariaDB database engine hosts Zabbix Server database, and Zabbix Proxy database.
- Three Zabbix server hosts (duplicated) in Configuration > Hosts,
- ‘training proxy’, an active proxy without encryption and with compression turned on, added in Administration > Proxies.
In Administration > Queue, the numbers are growing, and judging by the number of monitored items, no data is being received from any items on my hosts.
Queue overview
In a production instance, this would also lead to problems: all triggers with the nodata() trigger function would fire and create a lot of problems.
2. Why is the queue growing?
Zabbix proxy monitors our Zabbix agents. Checks can be of either type, passive or active.
Active checks mean that the agent connects to the proxy to request the configuration, which contains information about what data should be collected; the agent then gathers that data on the host and pushes it to the proxy. Passive agent checks work the opposite way: the proxy connects to the agent and polls for the value that has to be collected.
The proxy itself can also work in active or passive mode, which describes the same connection specifics.
(Both the agent and the proxy have active and passive modes.)
2.1 Misconfiguration
If we have an issue with the proxy, we should definitely check the proxy logs, which means we need to be able to SSH to the proxy server.
# When something goes wrong, check the logs first
# Mind your own log path: tail -f /var/log/zabbix/zabbix_proxy.log
‘cannot send proxy data to server’ message spamming
The message that proxy data cannot be sent to the server at the localhost IP address, because the proxy ‘Training proxy’ is not found, appears in only one case: when the Hostname parameter in the Zabbix proxy configuration file does not match the proxy name in the Zabbix frontend.
While we have the ‘Training proxy’ running, in the frontend, in Administration > Proxies, the name is specified in lowercase as ‘training proxy’. This is enough for the proxy to stop reporting data to the server and start writing these errors to the log.
Different proxy name spelling
So, we need to change the proxy name and click Update.
Updating the proxy name
Then, to save time, force a reload of the configuration cache on the Zabbix server:
# If the cause was a server-side configuration change, you can manually reload the config cache after fixing it
zabbix_server -R config_cache_reload
and then check the proxy logs. (After the change, tail -f the log to confirm the previous errors are gone.)
You’ll see that the message has stopped spamming. Note that the server updates its configuration cache automatically every minute by default.
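The case-sensitive name mismatch can also be caught without the frontend by reading the Hostname parameter directly from the proxy configuration file. A minimal sketch, where the path, the sample file contents, and the expected name are assumptions for illustration (the real file is usually /etc/zabbix/zabbix_proxy.conf):

```shell
# Assumed name as shown in Administration > Proxies
expected='Training proxy'
# Sample config stand-in for /etc/zabbix/zabbix_proxy.conf
cat > /tmp/zabbix_proxy.conf <<'EOF'
Server=127.0.0.1
Hostname=training proxy
EOF
# Extract the Hostname parameter and compare it case-sensitively
actual=$(sed -n 's/^Hostname=//p' /tmp/zabbix_proxy.conf)
if [ "$actual" = "$expected" ]; then
    echo "Hostname matches the frontend proxy name"
else
    echo "Mismatch: config has '$actual', frontend has '$expected'"
fi
```

On a real host, point `sed` at the actual configuration file instead of the sample.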
# On the frontend queue page, there is no longer any delayed queue information
In Administration > Queue, no values are delayed, and all the data is processed.
Proxy queue overview
Summary 1: in most cases the proxy log reflects the problem; in rare cases, check the server log as well.
So, in misconfiguration cases, you need to:
a) check the log of the proxy, which in most cases will be enough, and
b) in rare cases, check the log of the server.
2.2 Proxy cannot connect to the agent
# If the proxy cannot get data from the agent, check mainly the proxy and agent logs.
If the proxy is not able to get the data from an agent, it has nothing to do with the proxy-server communication or with the server itself. So, we need to check the proxy and agent log files for hints about the problem.
# Check the logs first
First, check the proxy. Most likely, in the logs you will see ‘network error’, ‘host unavailable’, ‘connection timed out’, TCP connection problems, or similar errors. So, if there is an error message that the connection is lost, for instance, to agent 1, but everything else is working fine, then the problem is most likely in the network. There might be some changes in the network that prevent the proxy from pulling the data from the agent. We can run zabbix_get from the proxy with the IP of the agent we are trying to troubleshoot and some simple key to test connectivity.
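This triage can be sketched against a sample log excerpt: counting connectivity errors per host quickly shows whether one agent is failing or all of them are. The log lines below are a made-up sample in the usual Zabbix log format; the real file is typically /var/log/zabbix/zabbix_proxy.log, and the exact message wording may differ between versions.

```shell
# Sample proxy log excerpt (stand-in for /var/log/zabbix/zabbix_proxy.log)
cat > /tmp/zabbix_proxy.log <<'EOF'
1234:20240101:120000.001 Zabbix agent item "system.cpu.load" on host "agent1" failed: first network error, wait for 15 seconds
1234:20240101:120015.002 resuming Zabbix agent checks on host "agent2": connection restored
1234:20240101:120030.003 Zabbix agent item "vfs.fs.size[/]" on host "agent1" failed: another network error, wait for 15 seconds
EOF
# Count failures per host: one noisy host suggests that agent, all hosts
# failing suggests the network
grep -o 'host "[^"]*" failed' /tmp/zabbix_proxy.log | sort | uniq -c
```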
# You can also test with the zabbix_get tool
# Use zabbix_get to fetch the value of the item system.cpu.load: zabbix_get -s 127.0.0.1 -k system.cpu.load
If there are no problems, the agent should report back.
If the agent is reporting but in the frontend we still don’t see any data, there might be other issues, for instance, if the proxy doesn’t have enough processes that are responsible for this monitoring type.
The agent might be simply stopped, crashed, or deleted. In this case, we would see error messages when running zabbix_get.
If we get this message, the proxy is not able to get the value, and the problem is somewhere in the connection between the proxy and the agent, or with the agent itself. In this case, we need to check the monitored host: whether the agent is running and listening on the port, or whether there are firewall rules blocking the proxy connection to the agent. Maybe the Server= IP address in the agent configuration simply differs from the proxy IP address (i.e., check whether the agent’s Server setting matches the proxy IP).
Summary 2: check the proxy and agent logs, compare against other agents, and use zabbix_get to test a single item key.
So, if the proxy can’t connect to the agent, you need to:
a) check proxy logs,
b) check agent logs,
c) isolate agents that are not working. If all agents are not working, then it’s a network problem,
d) use zabbix_get or snmpget (for SNMP problems) to test the connection between the proxy and the host.
2.3 Proxy cannot connect to the server
When the proxy cannot connect to the server, we might not see anything on the server log as the connection is broken and the server will not receive any data. We need to check the proxy logs where we’ll see error messages, such as ‘cannot connect to server‘ or ‘cannot send proxy data to server‘.
Summary 3: check the proxy logs for errors connecting to the server or sending data.
If the proxy can’t connect to the server, you need to:
a) check the proxy logs.
2.4 Proxy cannot send data fast enough
If there is a queue from the proxy but some data is going through, the proxy is able to connect to the server and to the agents, as the data is received and sent, though with a delay.
This means that the proxy cannot keep up with the data flow, and the data in the proxy is piling up faster than it is sent.
You can see the problem in the Latest data where the graphs will have gaps and dots.
(Note: in Monitoring > Latest data, find the relevant items and check whether their graphs have gaps or isolated dots.)
To check for the problem, you need to execute ps ax | grep sender on Zabbix proxy host.
# Check the running processes; the data sender is what to look at when troubleshooting an active proxy: ps ax | grep sender
There is only one such process, the data sender, on the Zabbix proxy; it doesn’t exist on the server, and it is the only process responsible for sending data to the server.
Here, we are interested in the lines describing the sender.
If we run the command multiple times, the lines describing the sender will be changing.
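The per-iteration value count can be read straight off the process title. A sketch parsing a sample line; the line below is only an example of what `ps ax | grep sender` may print on the proxy, and the exact wording can differ between Zabbix versions:

```shell
# Sample ps output line for the data sender process (illustrative)
line='12345 ?  S  0:01 /usr/sbin/zabbix_proxy: data sender [sent 250 values in 0.042887 sec, idle 1 sec]'
# Extract the number of values sent in the last iteration
sent=$(echo "$line" | sed -n 's/.*sent \([0-9]*\) values.*/\1/p')
echo "values sent in the last iteration: $sent"
```

Comparing this number across several runs shows whether the sender is keeping up with the incoming data rate.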
The problem exists if we see that the number of values sent per second is not enough, and one iteration of the data sender takes hundreds of seconds or even more. In this case, we can open the database and run the query:
# Check the queued history backlog: select max(id)-(select nextid from ids where table_name="proxy_history" limit 1) from proxy_history;
This query will tell us how many values in the proxy database are left that are not yet sent to the Zabbix server. Basically, that’s a backlog queue on the proxy.
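What the query computes can be sketched in plain arithmetic: max(id) in proxy_history is the newest collected value, nextid in ids marks how far the data sender has got, and the difference is the unsent backlog. The numbers below are made up for illustration:

```shell
# select max(id) from proxy_history  (illustrative value)
max_id=1500000
# select nextid from ids where table_name="proxy_history"  (illustrative value)
next_to_send=1499980
# Their difference is the number of values not yet sent to the server
echo "backlog: $((max_id - next_to_send)) values"
```

A small, stable difference is normal; a number that keeps growing between runs means the backlog is piling up.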
(The screenshot shows a case with no data backlog.)
In the example above, everything works fine, the queue is not piling up.
If we observed that it takes hundreds of seconds for the data sender to send values to the server, the query result would most likely be non-zero, and running the query multiple times would show that the backlog on the proxy keeps increasing.
These problems might have several causes; one of them is a slow network connection between the Zabbix proxy and the Zabbix server. In this case, ping alone is usually not enough to tell whether the network is good or bad. If all the processes are not busy, but the data sender is struggling to send the data, then you should probably consult your networking team. It could also be a problem on the Zabbix server side. The data sender is a single process sending data to the server, but on the server side only trappers accept that data. So make sure you have enough trappers on the server and that they are not 100% busy.
So, if the proxy cannot send data fast enough, you need to:
a) check data sender,
b) check the queue on the proxy database,
c) check the network connection speed.
2.5 Proxy doesn’t have enough processes
Suppose, we have a Zabbix server up and running, Zabbix frontend, and most of the environment is monitored through the Zabbix server.
In Administration > Proxies, we see that our Training proxy has only three hosts added.
These three hosts are receiving data, there’s no queue, and no data is missing. Suppose in a month we need to deploy network discovery or active agent registration on the proxy and add more hosts to be monitored. We might end up with around 30,000 hosts, and then we’ll see the problems: gaps, or the queue rising on the proxies.
If we check the server log, there might be nothing wrong in it: no problems, no slow queries. We can also go to Monitoring > Hosts and display, for instance, the ‘internal process busy’ graph.
Then the issue might be on the proxy. As we’ve added 30,000 new hosts, the number of processes on the proxy might not be enough for the current amount of data going through it.
Processes running on the proxy
So, when we deploy 30,000 hosts, the number of processes we had previously is not enough for the current setup. This problem will not show up in the server logs or the server graphs. In this case, we need to monitor the proxy. To do that, go to Configuration > Hosts, create a host, add it to a group, specify that the host is monitored by a proxy, and select the proxy itself. Click Add.
Adding host
NOTE. Users often set the agent interface to the external address of the proxy, which is not exactly correct.
You’ll monitor the proxy with the Template App Zabbix Proxy template from Configuration > Templates. This template uses internal items that do not rely on the IP address specified in the Zabbix agent interface.
Internal items on the proxy
If you don’t configure this host to be monitored by the proxy itself in Configuration > Hosts, you will still see data on those performance graphs, but that data will be coming from the Zabbix server, and you might be misled by the Monitoring > Hosts graphs for data collector and internal processes. You would actually be seeing data from the server, which might not be busy, while the proxy might be struggling. In this case, all you need is to add more processes: pollers, trappers, etc.
If the proxy doesn’t have enough processes, you need to make sure you monitor the proxy correctly and check the performance graphs of the proxy:
a) data gathering processes busy,
b) internal processes busy, and
c) cache free percentage.
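When those graphs show the proxy’s data collectors or trappers running near 100% busy, the fix is raising the corresponding Start* parameters in zabbix_proxy.conf. An illustrative fragment; the parameter names are standard zabbix_proxy.conf options, but the values are examples, not recommendations, and should be tuned against the performance graphs:

```shell
# Data gathering processes for passive checks
StartPollers=20
# Processes accepting connections from active agents
StartTrappers=10
# ICMP ping pollers
StartPingers=5
# Configuration cache size
CacheSize=64M
```

After changing the file, restart the proxy for the new process counts to take effect.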
2.6 Server cannot keep up with the data
# If data flows from the agents to the proxy in time, the problem may need to be looked for on the server
If the data is seamlessly pulled from the agents by the proxy, which then pushes it to the server, there might be problems on the server.
We’ll be able to notice this in the server logs, where we would most likely see some slow queries or timeouts.
Most importantly, we will see the same information in the performance graphs in Monitoring > Hosts.
Here, we need to check, for instance, ‘data gathering process busy‘.
‘data gathering process busy‘ performance graphs
More specifically, we are interested in the trappers, as only trappers work with the active proxies we are considering in this example.
So, if trappers are completely busy, most likely the only thing you need to do is simply increase the number of trappers.
If trappers are busy, and in the ‘Zabbix internal process busy %’ graph we see that the housekeeper, history syncer, and basically most of the processes are running at 100% load, and there are dots and gaps in the graphs, this means you have serious performance issues with the Zabbix server.
(High load on the server-side metrics, or choppy graphs, in most cases means the Zabbix server has hit a performance bottleneck or has a configuration problem.)
In 99% of cases, this is caused by database performance issues or a problem with the Zabbix setup.
Summary 6: check the server logs, the server’s own performance graphs, and slow queries.
So, if the server cannot keep up with the data, you need to:
a) check the server logs,
b) check server performance graphs,
c) check for slow queries in the log. If you see slow queries to insert the data in the history table, then it is the database that is not keeping up.
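Step c) can be sketched against a sample log excerpt. The lines below are a made-up sample in the usual Zabbix log format; the real file is typically /var/log/zabbix/zabbix_server.log:

```shell
# Sample server log excerpt (stand-in for /var/log/zabbix/zabbix_server.log)
cat > /tmp/zabbix_server.log <<'EOF'
5678:20240101:120000.001 slow query: 3.001245 sec, "insert into history (itemid,clock,ns,value) values (...);"
5678:20240101:120005.002 syncing history data...
EOF
# Count slow-query entries; slow inserts into history tables point at the database
grep -c 'slow query' /tmp/zabbix_server.log
```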
2.7 Too many unsent values
Sometimes any of the above problems can leave the proxy with a backlog, with the queue not dropping, or dropping extremely slowly, even after the problem is fixed. In this case, we can run a query to check the backlog in the proxy database.
# The current max(id) in proxy_history minus nextid in ids is the current data backlog: select max(id)-(select nextid from ids where table_name="proxy_history" limit 1) from proxy_history;
If you see, for instance, millions of values, then the proxy has not been functioning for some period, has a huge backlog in its database, and the queue is still piling up. In this case, the only thing we can do is drop the backlog: delete all the data stored in the proxy database and start from scratch. We’ll lose the unsent history for that period, but at least monitoring will resume.
To do that, we need:
1. stop the Zabbix proxy,
systemctl stop zabbix-proxy
2. open the database,
mysql
3. truncate two tables, proxy_history and ids,
truncate proxy_history;
truncate ids;
4. exit from the database and start Zabbix proxy,
systemctl start zabbix-proxy
Dropping proxy backlog
5. reload the configuration cache,
zabbix_server -R config_cache_reload
6. open the database,
mysql
7. use the Zabbix proxy database,
use zabbix_proxy;
and when you run the query again, you will see no backlog.
Proxy backlog dropped
NOTE. Don’t forget to truncate the ids table. If you truncate only the proxy_history table, nothing is going to work.
So, if there are too many unsent values and the proxy queue is piling up, you need to drop the backlog.
Summary 7: when a problem leaves a large backlog in the proxy queue and the queue drains slowly after recovery, consider clearing the current queue.
Note: this discards the data currently queued.