🐛 BUG: unable to find host with relay error after a relayed client is restarted #1241

Open
theblop opened this issue Oct 10, 2024 · 0 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

Comments

theblop commented Oct 10, 2024

What version of nebula are you using? (nebula -version)

1.9.4

What operating system are you using?

Linux

Describe the Bug

setup

I have a nebula client configured with 3 relays (which are also lighthouses but I don't think it matters) to connect to the mesh. Other clients don't need the relays:

client (relayed, in private subnet) -> relays (lighthouses) -> (MESH on public internet) <- other clients (non-relayed)

problem

When the relayed client first registers with the relays, everything works fine. But when the client restarts and reconnects to the relays, some non-relayed clients can't connect back to it until I either restart them one by one, or restart at least one of the relays (which fixes all the problematic non-relayed clients in one go).

  • The relayed client is in a private subnet.
  • The lighthouses/relays are exposed to the internet with 1:1 NAT.
  • We don't have full control over the network infra of the non-relayed clients: they are on-prem at various customers' locations. We asked the customers to open and forward the nebula port, but I think some may still have problematic NAT (that's why we have punchy enabled).
  • Some non-relayed clients are not affected by this issue (they reconnect immediately to the restarted relayed client), but since we don't control their network infra it's hard to tell what distinguishes them from the affected clients.

Note: traffic between all the non-relayed clients is never affected.

fix (workaround)

Either:

  • restart the non-relayed clients (fixes the connection to the relayed client only for each restarted client),
  • or restart at least one lighthouse (fixes all clients' connections to the relayed client in one go).

Logs from affected hosts

Some anonymized logs from when the error happens (the timestamps don't match exactly, but the same messages loop forever anyway):

lighthouses (relays):

logs:

time="2024-10-09T13:55:22Z" level=info msg="Failed to find target host info by ip" certName=otherclient1.mesh error="unable to find host with relay" localIndex=80854473 relayTo=100.96.2.13 remoteIndex=3176915269 vpnIp=100.99.63.1
...

(this message is quickly repeated for all the other clients and loops forever. 100.96.2.13 is the relayed client)
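The fields in these log lines follow a simple key=value (logfmt-style) layout, so a small parser makes it easy to pull out which relayed host the lighthouse cannot find and which peer asked for it. A sketch (the regex handles both quoted and bare values; the sample line is the lighthouse error above):

```python
import re

# Parse a logfmt-style nebula log line into a dict of key/value pairs.
# Handles quoted values (msg="...") and bare values (vpnIp=100.99.63.1).
FIELD_RE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')

def parse_log_line(line):
    return {k: q or b for k, q, b in FIELD_RE.findall(line)}

line = ('time="2024-10-09T13:55:22Z" level=info '
        'msg="Failed to find target host info by ip" '
        'certName=otherclient1.mesh error="unable to find host with relay" '
        'localIndex=80854473 relayTo=100.96.2.13 '
        'remoteIndex=3176915269 vpnIp=100.99.63.1')

fields = parse_log_line(line)
print(fields["relayTo"], fields["vpnIp"])  # → 100.96.2.13 100.99.63.1
```

This makes it straightforward to correlate `relayTo` (the relayed client) with `vpnIp` (the peer trying to reach it) across all three sets of logs.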

client (relayed):

logs:

time="2024-10-09T13:54:35Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2192126849 localIndex=2192126849 remoteIndex=0 udpAddrs="[X:X:X:X:4242 10.10.1.23:4242]" vpnIp=100.99.63.1
time="2024-10-09T13:54:41Z" level=info msg="Handshake timed out" durationNs=6073492769 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2192126849 localIndex=2192126849 remoteIndex=0 udpAddrs="[X:X:X:X:4242 10.10.1.23:4242]" vpnIp=100.99.63.1
...

(loops forever)

other clients (not relayed):

logs:

time="2024-10-09T15:55:22+02:00" level=info msg="Attempt to relay through hosts" localIndex=4278002031 relays="[100.96.0.1 100.96.0.2 100.96.0.3 100.96.0.1 100.96.0.2 100.96.0.3 100.96.0.1 100.96.0.2 100.96.0.3]" remoteIndex=0 vpnIp=100.96.2.13
time="2024-10-09T15:55:22+02:00" level=info msg="Send handshake via relay" localIndex=4278002031 relay=100.96.0.1 remoteIndex=0 vpnIp=100.96.2.13
time="2024-10-09T15:55:23+02:00" level=info msg="Handshake timed out" durationNs=3420931786 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=4278002031 localIndex=4278002031 remoteIndex=0 udpAddrs="[10.88.37.91:4242]" vpnIp=100.96.2.13

(loops forever)
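Since the same messages loop forever, a quick way to see which peers are stuck is to count "Handshake timed out" lines per vpnIp; any peer appearing repeatedly is in the loop. A sketch over hypothetical sample lines modeled on the logs above:

```python
import re
from collections import Counter

VPNIP_RE = re.compile(r'vpnIp=(\S+)')

def stuck_peers(lines, min_count=2):
    """Count 'Handshake timed out' lines per vpnIp and return the
    peers seen at least min_count times (i.e. looping handshakes)."""
    counts = Counter(
        m.group(1)
        for line in lines
        if "Handshake timed out" in line
        for m in [VPNIP_RE.search(line)]
        if m
    )
    return {ip: n for ip, n in counts.items() if n >= min_count}

# hypothetical sample: the same timeout repeating for the relayed client
sample = [
    'level=info msg="Handshake timed out" vpnIp=100.96.2.13',
    'level=info msg="Handshake timed out" vpnIp=100.96.2.13',
    'level=info msg="Send handshake via relay" vpnIp=100.96.2.13',
]
print(stuck_peers(sample))  # → {'100.96.2.13': 2}
```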

Config files from affected hosts

lighthouses (relays):

config:

static_host_map:
lighthouse:
  am_lighthouse: true
punchy:
  punch: true
relay:
  am_relay: true
  use_relays: true

client (relayed):

config:

static_host_map:
  100.96.0.1:
    - lh1.example.org:4242
  100.96.0.2:
    - lh2.example.org:4242
  100.96.0.3:
    - lh3.example.org:4242
lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - 100.96.0.1
    - 100.96.0.2
    - 100.96.0.3
relay:
  am_relay: false
  use_relays: true
  relays:
    - 100.96.0.1
    - 100.96.0.2
    - 100.96.0.3
punchy:
  punch: true

other clients (not relayed):

config:

static_host_map:
  100.96.0.1:
    - lh1.example.org:4242
  100.96.0.2:
    - lh2.example.org:4242
  100.96.0.3:
    - lh3.example.org:4242
lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "100.96.0.3"
    - "100.96.0.2"
    - "100.96.0.1"
relay:
  am_relay: false
  use_relays: true
punchy:
  punch: true
  respond: true
johnmaguire added the NeedsInvestigation label on Oct 11, 2024.