TAPA unreachable for ~10 sec when its NSM interface is replaced #548

Open
zolug opened this issue Oct 18, 2024 · 4 comments
Labels
kind/bug Something isn't working

Comments


zolug commented Oct 18, 2024

Describe the bug
When an NSM interface in a TAPA is replaced during NSM heal (the old connection is closed, which includes removing the old interface), the new interface will most probably end up with a different MAC address. Yet the IP address(es) assigned by the proxy component will most likely stay the same.

During such an NSM heal event the LBs are currently not informed about the temporary unavailability of the TAPA/Target. However, the Linux neighbor cache in an LB might still contain a neighbor entry for the Target with the old, now invalid MAC. Renewal of the neighbor entry is delayed by delay_first_probe_time seconds (default: 5), and then probes are first sent to the invalid MAC in the cache ucast_solicit times (default: 3), spaced retrans_time_ms apart (default: 1000 ms).
So even if NSM heal replaced the NSM interface in the TAPA instantly, it would take at least 8 seconds (5 + 3 × 1) until the LBs could learn the new MAC address.
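
The arithmetic behind the 8 seconds can be sketched as below. This is only an illustration, assuming the unicast probes are spaced retrans_time_ms apart (default: 1000 ms) and that the values are read from the usual procfs paths; the function names are made up for this example.

```shell
#!/bin/sh
# Worst-case time (in ms) a stale neighbor entry keeps being used:
# delay_first_probe_time (sec) plus ucast_solicit probes, one per retrans_time_ms.
neigh_relearn_delay_ms() {
    delay_first_probe_sec=$1
    ucast_solicit=$2
    retrans_time_ms=$3
    echo $(( delay_first_probe_sec * 1000 + ucast_solicit * retrans_time_ms ))
}

# Same calculation with the values read from procfs (IPv4 neighbor table).
# $1 is an interface name; "default" holds the per-namespace defaults.
neigh_relearn_delay_ms_for() {
    base="/proc/sys/net/ipv4/neigh/$1"
    neigh_relearn_delay_ms \
        "$(cat "$base/delay_first_probe_time")" \
        "$(cat "$base/ucast_solicit")" \
        "$(cat "$base/retrans_time_ms")"
}
```

With the defaults, `neigh_relearn_delay_ms 5 3 1000` prints 8000. Inside an LB, `neigh_relearn_delay_ms_for <nsm-interface>` would show what applies to a given NSM interface.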

Context

  • Kernel: 6.8
  • Network Service Mesh: v1.14.1
  • Meridio: 1.1.4
@zolug zolug added the kind/bug Something isn't working label Oct 18, 2024
@zolug zolug changed the title from "NSM interface replacement in TAPA can lead to 10 seconds traffic disturbance" to "TAPA unreachable for ~10 sec when its NSM interface is replaced" Oct 18, 2024
@zolug zolug added this to Meridio Oct 18, 2024

zolug commented Oct 18, 2024

Unfortunately, enabling arp_notify in the TAPA won't resolve the problem. Because of the chained architecture of NSM, the interface is first created and set to UP, and the addresses are configured only afterwards. For IPv4, Linux won't send a gARP in that case: https://github.com/torvalds/linux/blob/v6.8/net/ipv4/devinet.c#L1606

For IPv6, ndisc_notify does work though: https://github.com/torvalds/linux/blob/v6.8/net/ipv6/addrconf.c#L4292

There's also the issue of how to set any sysctl in the TAPA, i.e. the application POD. A privileged init container could be used for this purpose. If that's not feasible, a multus network-attachment-definition relying on the tuning CNI could be used to set ndisc_notify for the default interface (so that NSM interfaces created later inherit the default value). Or the NSM feature set could be extended to include setting the arp_notify/ndisc_notify sysctls for a particular interface.
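
A minimal sketch of what such a privileged init container could run, assuming a `sysctl` binary is available in the image. The function name and the DRY_RUN switch are only for illustration and are not part of Meridio or NSM.

```shell
#!/bin/sh
# Enable unsolicited neighbor advertisements for IPv6.
# With no argument the "default" entry is set, which is meant to be
# inherited by interfaces created later in the same network namespace.
# DRY_RUN=1 only prints the command instead of applying it.
set_ndisc_notify() {
    iface=${1:-default}
    key="net.ipv6.conf.${iface}.ndisc_notify"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "sysctl -w ${key}=1"
    else
        sysctl -w "${key}=1"
    fi
}
```

Calling `set_ndisc_notify` before the application container starts would cover the IPv6 case only; as noted above, the IPv4 counterpart arp_notify does not help here.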


zolug commented Oct 21, 2024

Ideas to resolve:

1., Lowering the probe delay by tweaking sysctl params in the LB, primarily the ucast_solicit and delay_first_probe_time values:
Unfortunately, for the 'default' interface these two can only be set in the init Linux network namespace, so neither a privileged init container nor a tuning network-attachment-definition can be used. For this to work, the values would have to be set per NSM interface by some privileged entity.

2., Use the arping program in the TAPA (NSC) to send out a gARP:
It would require at least the CAP_NET_RAW capability. Any privilege requirement towards TAPA users must be avoided. Could be outsourced to NSM.

3., Change the NSM Target registration protocol, e.g. report when the associated NSM connection's state becomes DELETE, upon which the LBs could remove any cached neighbor entries. This would be sensitive to NSP unavailability.

4., The proxy sends a promisc gARP instead of the TAPA when a new NSM connection gets established between a TAPA and said proxy:
The proxy would need to learn the MAC of the NSM interface in the TAPA, for which it could use a ping or something similar. Once the MAC is learned, it could send a gARP on behalf of the TAPA using the arping program in promisc mode. Due to the promisc mode, it would most probably require more privileges than just CAP_NET_RAW (needed to create the RAW socket for sending the gARP).

5., Using some custom communication channel, the proxy could signal the Target IPs to the LBs when a new TAPA->proxy connection gets established, so that the LBs could remove any cached neighbor entries for them.

6., LBs could rely on NSM's Connection Monitor feature to learn when a TAPA->Proxy NSM connection's state changes to DELETE, and extract the associated IPs from the connection to issue neighbor delete operations. Unfortunately, MonitorScopeSelector in NSM currently only supports PathSegment related filters like path name, id and token, which are not known by the LB and can change dynamically over time. On the other hand, without filters there's a risk of receiving far too many events, depending on the deployments using NSM. At the moment I don't see any obstacle or drawback to adding a new filter option based on network service name.
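
For reference, the two low-level operations the ideas above rely on (sending a gARP, and removing a cached neighbor entry in the LB) could look roughly like this. It is a sketch only: it assumes iputils arping and iproute2 are available, the interface names and IPs are placeholders, and the helper names and the DRY_RUN switch are made up for this example.

```shell
#!/bin/sh
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Idea 2: send one gratuitous ARP for IP $2 out of interface $1.
# iputils arping, -U = unsolicited ARP mode; needs CAP_NET_RAW.
send_garp() {
    run arping -U -c 1 -I "$1" "$2"
}

# Ideas 3, 5 and 6: drop the cached neighbor entries of the given
# Target IPs on LB interface $1, so that the next packet towards
# them triggers a fresh resolution.
flush_target_neighbors() {
    dev=$1
    shift
    for ip in "$@"; do
        run ip neigh flush to "$ip" dev "$dev"
    done
}
```

Which entity would call these (TAPA, proxy, LB or NSM itself) and how it learns the affected IPs is exactly what differs between the ideas; the commands themselves stay the same.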


zolug commented Dec 5, 2024

NSM MonitorScopeSelector should from now on support network service based filtering:
networkservicemesh/api#179

@zolug zolug moved this to 🏗 In progress in Meridio Dec 11, 2024

zolug commented Jan 14, 2025

Test:

1., Deploy a Trench with a Conduit, Stream etc. and a single Attractor (LB).
2., Deploy example-target with a single Target POD connecting the Conduit and Stream.
3., Send external traffic for a short time (e.g. using ctraffic) towards the VIP hosted by the Target POD to let the LB resolve the MAC(s) of selected target IP(s).
4., Provoke an NSM heal event between the TAPA (Target POD) and its local Meridio Proxy without container impact.
e.g.:
while true; do kubectl delete nses -n nsm $(kubectl get pods -l app=proxy-load-balancer-a1 -o jsonpath='{.items[*].metadata.name}') 2>/dev/null ; done
Note: This should make the next refresh attempt fail, thus triggering heal. Stop the script afterwards to allow recovery.
5., In the baseline case, the LB will keep the old MAC address associated with the Target IP(s).
6., Once NSM heal recovers the connection, start the test traffic again. If old neighbor cache entries are present, expect >8 sec traffic disturbance.
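
To check step 5, the neighbor cache in the LB can be compared against the Target's current MAC. A rough sketch, assuming iproute2 and awk are available in the LB container and that `ip neigh show` prints entries in its usual "IP dev DEV lladdr MAC STATE" form; the helper names are made up for this example.

```shell
#!/bin/sh
# Extract the link-layer address from `ip neigh show` output on stdin.
# Prints nothing for entries without an lladdr (e.g. FAILED or INCOMPLETE).
neigh_mac() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "lladdr") { print $(i + 1); exit } }'
}

# Succeeds if the MAC cached for Target IP $1 differs from the
# Target's current MAC $2, i.e. the LB still holds the old entry.
neigh_is_outdated() {
    cached=$(ip neigh show to "$1" | neigh_mac)
    [ -n "$cached" ] && [ "$cached" != "$2" ]
}
```

Run `neigh_is_outdated <target-ip> <current-mac>` inside the LB after heal has recovered the connection; the current MAC can be read with `ip link show` in the Target POD.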
