The Wayback Machine - https://web.archive.org/web/20220519161313/https://github.com/haproxy/haproxy/issues/51

Server goes UP without tcp-check if it resolves again #51

Open
Leen15 opened this issue Feb 28, 2019 · 10 comments
Labels
good first issue help wanted status: reviewed subsystem: checks subsystem: dns type: feature

Comments

@Leen15 Leen15 commented Feb 28, 2019

Hi all, I have a problem with an HAProxy instance (1.9.4) in front of a Redis cluster (3 nodes), all inside Kubernetes (k8s).

I configured haproxy for a tcp-check like this:

backend bk_redis
  option tcp-check
  tcp-check send AUTH\ RedisTest\r\n
  tcp-check expect string +OK
  tcp-check send PING\r\n
  tcp-check expect string +PONG
  tcp-check send info\ replication\r\n
  tcp-check expect string role:master
  tcp-check send QUIT\r\n
  tcp-check expect string +OK
  default-server  check resolvers kubedns inter 1s downinter 1s fastinter 1s fall 1 rise 30 maxconn 330 no-agent-check on-error mark-down
  server redis-0 redis-ha-server-0.redis-ha.redis-ha.svc.cluster.local:6379
  server redis-1 redis-ha-server-1.redis-ha.redis-ha.svc.cluster.local:6379
  server redis-2 redis-ha-server-2.redis-ha.redis-ha.svc.cluster.local:6379
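
For reference, the send/expect rules above amount to a simple substring-matching script: each "expect string" step passes when the server's reply contains the given substring. Below is a minimal model of that matching in plain Python (not HAProxy code); the canned replies are illustrative examples of what a Redis master vs. a replica would answer to "info replication":

```python
# Expected substrings, in the order of the tcp-check "expect string" rules above.
EXPECTS = ["+OK", "+PONG", "role:master", "+OK"]

def run_check(replies):
    """Return None if every expect step matches, else the index of the
    first failing expect rule. (HAProxy's own logs number send and expect
    steps together, so its step numbers differ from this index.)"""
    for i, (reply, expect) in enumerate(zip(replies, EXPECTS)):
        if expect not in reply:
            return i
    return None

# Canned example replies for AUTH, PING, "info replication", QUIT.
master_replies  = ["+OK", "+PONG", "# Replication\r\nrole:master\r\n", "+OK"]
replica_replies = ["+OK", "+PONG", "# Replication\r\nrole:slave\r\n",  "+OK"]

print(run_check(master_replies))   # None: all steps pass, server is healthy
print(run_check(replica_replies))  # 2: fails on 'role:master', marked DOWN
```

This is why only the current master passes the check: replicas answer the replication info request with "role:slave" and fail the third expect rule.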

When the master node goes down, everything works fine: a replica is promoted to master and HAProxy redirects traffic to it.
The problem comes when the old master returns with a new IP: HAProxy does not re-check for the master role but instead immediately marks the old node as UP.

This is the log:

[NOTICE] 058/125637 (1) : New worker #1 (6) forked
[WARNING] 058/125637 (6) : Health check for server bk_redis/redis-0 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 0ms, status: 1/1 UP.
[WARNING] 058/125639 (6) : Health check for server bk_redis/redis-1 failed, reason: Layer7 timeout, info: " at step 6 of tcp-check (expect string 'role:master')", check duration: 1001ms, status: 0/30 DOWN.
[WARNING] 058/125639 (6) : Server bk_redis/redis-1 is DOWN. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 058/125639 (6) : Health check for server bk_redis/redis-2 failed, reason: Layer7 timeout, info: " at step 6 of tcp-check (expect string 'role:master')", check duration: 1001ms, status: 0/30 DOWN.
[WARNING] 058/125639 (6) : Server bk_redis/redis-2 is DOWN. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 058/125657 (6) : Health check for server bk_redis/redis-0 failed, reason: Layer4 timeout, info: " at step 1 of tcp-check (send)", check duration: 1001ms, status: 0/30 DOWN.
[WARNING] 058/125657 (6) : Server bk_redis/redis-0 is DOWN. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 058/125657 (6) : backend 'bk_redis' has no server available!
[WARNING] 058/125706 (6) : Health check for server bk_redis/redis-2 failed, reason: Layer7 invalid response, info: "TCPCHK did not match content 'role:master' at step 6", check duration: 532ms, status: 0/30 DOWN.
[WARNING] 058/125706 (6) : Health check for server bk_redis/redis-1 failed, reason: Layer7 invalid response, info: "TCPCHK did not match content 'role:master' at step 6", check duration: 835ms, status: 0/30 DOWN.
[WARNING] 058/125707 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 2ms, status: 1/30 DOWN.
[WARNING] 058/125708 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 2ms, status: 2/30 DOWN.
[WARNING] 058/125708 (6) : Health check for server bk_redis/redis-1 failed, reason: Layer7 timeout, info: " at step 6 of tcp-check (expect string 'role:master')", check duration: 1001ms, status: 0/30 DOWN.
[WARNING] 058/125709 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 4ms, status: 3/30 DOWN.
[WARNING] 058/125710 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 4ms, status: 4/30 DOWN.
[WARNING] 058/125711 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 2ms, status: 5/30 DOWN.
[WARNING] 058/125712 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 2ms, status: 6/30 DOWN.
[WARNING] 058/125713 (6) : Server bk_redis/redis-0 was DOWN and now enters maintenance (DNS NX status).
[WARNING] 058/125713 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 2ms, status: 7/30 DOWN.
[WARNING] 058/125714 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 3ms, status: 8/30 DOWN.
[WARNING] 058/125715 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 3ms, status: 9/30 DOWN.
[WARNING] 058/125716 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 4ms, status: 10/30 DOWN.
[WARNING] 058/125717 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 3ms, status: 11/30 DOWN.
[WARNING] 058/125718 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 4ms, status: 12/30 DOWN.
[WARNING] 058/125719 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 2ms, status: 13/30 DOWN.
[WARNING] 058/125720 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 2ms, status: 14/30 DOWN.
[WARNING] 058/125721 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 2ms, status: 15/30 DOWN.
[WARNING] 058/125722 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 3ms, status: 16/30 DOWN.
[WARNING] 058/125723 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 3ms, status: 17/30 DOWN.
[WARNING] 058/125724 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 4ms, status: 18/30 DOWN.
[WARNING] 058/125725 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 3ms, status: 19/30 DOWN.
[WARNING] 058/125726 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 2ms, status: 20/30 DOWN.
[WARNING] 058/125727 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 2ms, status: 21/30 DOWN.
[WARNING] 058/125728 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 3ms, status: 22/30 DOWN.
[WARNING] 058/125729 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 3ms, status: 23/30 DOWN.
[WARNING] 058/125730 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 4ms, status: 24/30 DOWN.
[WARNING] 058/125731 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 4ms, status: 25/30 DOWN.
[WARNING] 058/125732 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 3ms, status: 26/30 DOWN.
[WARNING] 058/125733 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 3ms, status: 27/30 DOWN.
[WARNING] 058/125734 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 2ms, status: 28/30 DOWN.
[WARNING] 058/125735 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 2ms, status: 29/30 DOWN.
[WARNING] 058/125736 (6) : Health check for server bk_redis/redis-2 succeeded, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 1ms, status: 1/1 UP.
[WARNING] 058/125736 (6) : Server bk_redis/redis-2 is UP. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 058/125945 (6) : bk_redis/redis-0 changed its IP from 10.42.4.85 to 10.42.4.87 by kubedns/namesrv1.
[WARNING] 058/125945 (6) : Server bk_redis/redis-0 ('redis-ha-server-0.redis-ha.redis-ha.svc.cluster.local') is UP/READY (resolves again).
[WARNING] 058/125945 (6) : Server bk_redis/redis-0 administratively READY thanks to valid DNS answer.
[WARNING] 058/125947 (6) : Health check for server bk_redis/redis-0 failed, reason: Layer7 timeout, info: " at step 6 of tcp-check (expect string 'role:master')", check duration: 1000ms, status: 0/30 DOWN.
[WARNING] 058/125947 (6) : Server bk_redis/redis-0 is DOWN. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

As you can see in the last lines, when bk_redis/redis-0 gets a new IP (BUT IT WAS DOWN), it immediately goes UP without running the tcp-check (which starts a second later and, of course, fails).

How can I avoid this?
Is there a way to force the server, when its IP resolves again, to wait for a successful tcp-check before going UP?

@Leen15 Leen15 added status: needs-triage type: bug labels Feb 28, 2019
@Leen15 Leen15 changed the title Server goes UP without tcp-check if it resolvers again Server goes UP without tcp-check if it resolves again Feb 28, 2019
@git001 git001 (Contributor) commented Feb 28, 2019

What are the settings in your resolvers block?

@Leen15 Leen15 (Author) commented Feb 28, 2019

global
  daemon
  maxconn 1000

resolvers kubedns
  nameserver namesrv1 kube-dns.kube-system.svc.cluster.local:53
  resolve_retries  3
  timeout retry 1s
  hold other 1s
  hold refused 1s
  hold nx 1s
  hold timeout 1s
  hold valid 1s

defaults REDIS
  mode tcp
  timeout connect  4s
  timeout server  30s
  timeout client  30s
  option  log-health-checks

frontend ft_redis
  bind :6379 name redis
  default_backend bk_redis

backend bk_redis
  option tcp-check
  tcp-check send AUTH\ RedisTest\r\n
  tcp-check expect string +OK
  tcp-check send PING\r\n
  tcp-check expect string +PONG
  tcp-check send info\ replication\r\n
  tcp-check expect string role:master
  tcp-check send QUIT\r\n
  tcp-check expect string +OK
  default-server  check resolvers kubedns inter 1s downinter 1s fastinter 1s fall 1 rise 30 maxconn 330 no-agent-check on-error mark-down
  server redis-0 redis-ha-server-0.redis-ha.redis-ha.svc.cluster.local:6379
  server redis-1 redis-ha-server-1.redis-ha.redis-ha.svc.cluster.local:6379
  server redis-2 redis-ha-server-2.redis-ha.redis-ha.svc.cluster.local:6379

@Leen15 Leen15 (Author) commented Feb 28, 2019

It's fine that it changes IP, because a new instance exists. What is not normal is that HAProxy marks it UP before running the tcp-check and honoring the rise parameter...

@lukastribus lukastribus (Member) commented Feb 28, 2019

It is expected behavior that new servers are by default UP, not DOWN, before the first health check is done.

This of course does not work with the configuration you are using, because here you are basically misusing the health check system for application master/slave logic, which is not the use case it's designed for.

I can see how it would be useful to be able to configure the default state for new servers coming from DNS though.

@bedis any opinion about this?

@lukastribus lukastribus added type: feature help wanted subsystem: dns and removed status: needs-triage type: bug labels Feb 28, 2019
@Leen15 Leen15 (Author) commented Feb 28, 2019

I don't understand how this is possible... If the server is DOWN, shouldn't it first follow the "rise" parameter logic before going UP?
It's the same as when the master role passes to another server: the replica is down (but its DNS is fine) and it follows the rise logic before going UP. So why does the DNS resolver have a higher priority than the rise logic?

@lukastribus lukastribus (Member) commented Feb 28, 2019

Servers are UP by default.

When health checks are not used, all servers are UP. When health checks are used but have not started yet, or the status is not yet determined, the current server status will be UP.

This is documented and expected behavior.

"rise" is about health check behavior. Not about pre health-check behavior.
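
This distinction can be modeled as a tiny state machine (a sketch, not HAProxy's actual implementation): the rise/fall counters only govern transitions driven by check results, while a server with no check result yet simply starts in the UP state:

```python
class Server:
    """Toy model of rise/fall accounting, not HAProxy's real code.
    A new server starts UP before any check result exists."""
    def __init__(self, rise=30, fall=1):
        self.rise, self.fall = rise, fall
        self.up = True      # default state before the first check
        self.health = 0     # consecutive results toward a transition

    def check_result(self, passed):
        if self.up and not passed:
            self.health += 1
            if self.health >= self.fall:
                self.up, self.health = False, 0
        elif not self.up and passed:
            self.health += 1
            if self.health >= self.rise:
                self.up, self.health = True, 0
        else:
            self.health = 0  # a contrary result resets the streak

s = Server(rise=30, fall=1)       # matches the reporter's config
print(s.up)                       # True: UP before any check has run
s.check_result(False)             # one failure, fall=1 -> DOWN at once
print(s.up)                       # False
for _ in range(29):
    s.check_result(True)          # 29 successes: still DOWN (29/30)
print(s.up)                       # False
s.check_result(True)              # 30th consecutive success -> UP
print(s.up)                       # True
```

The initial `self.up = True` is the behavior being discussed here: "rise" never comes into play until a first check result exists.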

@wtarreau wtarreau (Member) commented Mar 2, 2019

We wanted to have an "init-state {up|down}" setting a while ago when developing the server-template feature, and we figured that "init-addr none" already covered that, so it was not implemented. But here is an example proving this is not the case. While we were originally focused on the server state when the process starts, we didn't think about the state once the server has an address again. I'm wondering if it's a resolver thing or a status thing. I don't know what happens when we set an IP on a server from the CLI: does it automatically go up? If not, we could address this with an extra resolver option. If it does, it's a wider thing to address: we need to set the server state after it is assigned an address.

@wtarreau wtarreau (Member) commented Mar 2, 2019

From what I'm seeing in the code, the resolver puts the server into the maintenance state, which makes sense and matches what appears in Luca's logs above. So what puts the server UP is that it leaves maintenance mode. After all, being able to configure this state when leaving maintenance is more general and not specific to the resolver: for instance, an admin who disabled a server for an upgrade could want it to go through checks first when turning it back on.

The code responsible for this is in srv_update_status(), in this block :
else if ((s->cur_admin & SRV_ADMF_MAINT) && !(s->next_admin & SRV_ADMF_MAINT)) {
More specifically, this part:

                if (s->check.state & CHK_ST_ENABLED) {
                        s->check.state &= ~CHK_ST_PAUSED;
                        check->health = check->rise; /* start OK but check immediately */
                }

I'm seeing that we already support switching out of maintenance into the down or starting state: it's the case when a server tracks another one, in which case it takes that other server's state. To me this proves that all the logic to handle the transition exists and is safe to reuse. Thus we could have an "init-state" config option to change this behaviour.

I'm tagging this as a good first issue in case someone is interested in jumping into this development which seems quite accessible to me.
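
To make the proposal concrete, here is a hypothetical sketch of what such an option could change in the maintenance-exit path quoted above (a Python model of the C logic, not HAProxy code; the "init-state" keyword and its semantics are assumptions, since no such option exists at the time of this discussion): instead of always restarting with health = rise (UP at once, then "check immediately"), the server could restart with health = 0 so it must climb through the rise checks first.

```python
RISE = 30  # matches 'rise 30' in the reporter's config

def health_after_maintenance(init_state_up):
    """Sketch of the srv_update_status() block quoted above.
    A server is treated as UP once health >= RISE.
    init_state_up=True  -> today's behavior: health = rise, UP at once.
    init_state_up=False -> hypothetical 'init-state down': health = 0,
    so the server must pass 'rise' consecutive checks before going UP."""
    return RISE if init_state_up else 0

# Today: the server leaves maintenance already UP, before any check ran.
print(health_after_maintenance(True) >= RISE)   # True

# Hypothetical: it leaves maintenance DOWN and must earn its way up.
health = health_after_maintenance(False)
passing_checks = 0
while health < RISE:
    health += 1          # one successful tcp-check
    passing_checks += 1
print(passing_checks)    # 30
```

With "init-state down", the Redis scenario above would keep a returning old master DOWN until it passed 30 consecutive role:master checks, instead of briefly taking traffic while still a replica.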

@wtarreau wtarreau added good first issue subsystem: checks status: reviewed labels Mar 2, 2019
@sveniu sveniu commented Aug 29, 2019

I'm hitting the same issue. My use case is for often-changing cloud infrastructure in AWS:

  • Running HAProxy on AWS ECS Fargate.
  • Running server backends on AWS ECS Fargate, too.
  • Backend ECS tasks (the containers, basically) register their IPs using AWS Cloud Map (service discovery).
  • Backends end up being available on service.sd.example.com, resolving to all IPs.
  • haproxy.cfg uses server-template mybackend 16 ... to handle a maximum of 16 backends.
  • Backend ECS tasks go up and down as new deploys and autoscaling happens.
  • HAProxy does the "right thing", reusing backends by detecting that IPs change.
  • When detecting new backends via DNS, they're immediately marked as UP.
  • The TCP check then follows a bit after, marking them down since they're still starting up.
  • The TCP check succeeds a bit after that, marking them up again.

Brief excerpt from haproxy.cfg:

resolvers dnsserver
  parse-resolv-conf
  hold valid 1s

defaults
  default-server init-addr none resolvers dnsserver weight 50 check inter 10s fastinter 2s fall 3 rise 20

listen myservice
  option tcp-check
  tcp-check connect
  tcp-check send-binary ...
  server-template mybackend 16 myservice.sd.example.com:80

@rayitopy rayitopy commented Jun 10, 2020

I find this behavior very annoying too. I think that assuming a server is UP without performing the health check is not a good thing.
