Federation¶
NovaEdge federation connects multiple Kubernetes clusters into a unified control plane, enabling cross-cluster service discovery, configuration synchronization, and traffic failover. Federation supports three operating modes -- hub-spoke, mesh, and unified -- to match different organizational and network topologies.
Overview¶
In a single-cluster deployment, NovaEdge manages load balancing, routing, and VIPs for services within one Kubernetes cluster. Federation extends this to multiple clusters:
- Configuration synchronization -- routing rules, policies, and backend definitions propagate across clusters automatically
- Cross-cluster endpoint merging -- the snapshot builder combines local and remote service endpoints so agents can route traffic to backends in any cluster
- Locality-aware routing -- traffic prefers local backends, overflowing to remote clusters only when local capacity is insufficient
- Split-brain protection -- vector clocks, quorum checks, and write fencing prevent conflicting updates during network partitions
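As a rough illustration of cross-cluster endpoint merging, the sketch below combines a local endpoint list with endpoints learned from remote clusters. The Endpoint type, its fields, and merge_endpoints are hypothetical names chosen for this example; they are not NovaEdge's snapshot builder API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    address: str          # "ip:port" of a backend (illustrative field)
    cluster: str          # cluster that owns the backend (illustrative field)
    healthy: bool = True


def merge_endpoints(local, remote_by_cluster):
    """Combine local and remote endpoints, dropping duplicate addresses.

    Local endpoints are kept first so that a later locality-aware
    picker can prefer them.
    """
    merged, seen = [], set()
    remote = [ep for eps in remote_by_cluster.values() for ep in eps]
    for ep in list(local) + remote:
        if ep.address not in seen:
            seen.add(ep.address)
            merged.append(ep)
    return merged


local = [Endpoint("10.0.0.1:8080", "cluster-us")]
remote = {"cluster-eu": [Endpoint("10.1.0.1:8080", "cluster-eu")]}
merged = merge_endpoints(local, remote)
```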
Federation is managed through two CRDs:
| CRD | Purpose |
|---|---|
| NovaEdgeFederation | Defines the federation topology, sync settings, conflict resolution, and split-brain policy |
| NovaEdgeRemoteCluster | Registers a remote or edge cluster with connection details, routing weights, and health checks |
Federation Modes¶
NovaEdge supports three federation modes, each suited to different use cases:
| Mode | Sync Direction | Endpoint Merging | Best For |
|---|---|---|---|
| hub-spoke | Hub pushes to spokes | No | Centralized management with edge clusters |
| mesh | Bidirectional | Yes | Active-active multi-region deployments |
| unified | Bidirectional | Yes (all endpoints) | Global anycast-style load balancing |
Hub-Spoke¶
A central hub cluster owns all configuration and pushes it one-directionally to spoke clusters. Spokes execute the configuration but do not originate changes.
When to use:
- A single team manages all routing and policy
- Edge or satellite clusters need to mirror the hub's configuration
- You want a simple, predictable sync topology
flowchart TB
subgraph Hub["Hub Cluster (us-east-1)"]
HC["Controller<br/>(authoritative)"]
HA["Agents"]
end
subgraph Spoke1["Spoke Cluster (eu-west-1)"]
SC1["Controller<br/>(receiver)"]
SA1["Agents"]
end
subgraph Spoke2["Spoke Cluster (ap-southeast-1)"]
SC2["Controller<br/>(receiver)"]
SA2["Agents"]
end
HC -->|"push config"| SC1
HC -->|"push config"| SC2
HA -->|"primary"| HC
SA1 -->|"primary"| SC1
SA2 -->|"primary"| SC2
style Hub fill:#e8f5e9
style Spoke1 fill:#e1f5ff
style Spoke2 fill:#fff5e6
Mesh¶
All clusters synchronize bidirectionally. Each controller is both a producer and consumer of configuration. Endpoint merging allows services in one cluster to route traffic to backends in another.
When to use:
- Active-active multi-region deployments
- Cross-cluster service discovery and failover
- Multiple teams manage configuration in their own clusters
flowchart TB
subgraph ClusterA["Cluster US (us-east-1)"]
CA["Controller"]
AA["Agents"]
end
subgraph ClusterB["Cluster EU (eu-west-1)"]
CB["Controller"]
AB["Agents"]
end
subgraph ClusterC["Cluster AP (ap-south-1)"]
CC["Controller"]
AC["Agents"]
end
CA <-->|"bidirectional sync"| CB
CB <-->|"bidirectional sync"| CC
CA <-->|"bidirectional sync"| CC
AA -->|"primary"| CA
AB -->|"primary"| CB
AC -->|"primary"| CC
style ClusterA fill:#e8f5e9
style ClusterB fill:#e1f5ff
style ClusterC fill:#fff5e6
Unified¶
All clusters share a single service namespace and behave as one logical load balancer. Endpoints from every cluster are merged, and locality-aware routing directs traffic to the nearest healthy backend. This is the most tightly coupled of the three modes.
When to use:
- Global anycast-style load balancing
- Region-aware failover with automatic cross-region fallback
- A single control-plane view across all regions
flowchart TB
Client((Client)) --> DNS{{"DNS / Anycast"}}
DNS --> GW1
DNS --> GW2
DNS --> GW3
subgraph Unified["Unified Namespace"]
subgraph US["us-east-1"]
GW1["Gateway"] --> EP1a["backend-a<br/>(local)"]
GW1 -.->|"overflow"| EP2a
GW1 -.->|"overflow"| EP3a
end
subgraph EU["eu-west-1"]
GW2["Gateway"] --> EP2a["backend-a<br/>(local)"]
GW2 -.->|"overflow"| EP1a
end
subgraph AP["ap-south-1"]
GW3["Gateway"] --> EP3a["backend-a<br/>(local)"]
GW3 -.->|"overflow"| EP2a
end
end
style US fill:#e8f5e9
style EU fill:#e1f5ff
style AP fill:#fff5e6
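The solid and dotted arrows in the diagram can be read as a two-step selection: use healthy local backends first, and overflow to other regions only when local capacity falls short. The sketch below is one way to express that. The function name, the dictionary keys, and the min_local threshold are assumptions made for illustration, not NovaEdge's actual routing logic.

```python
def pick_backends(endpoints, local_region, min_local=1):
    """Return healthy local backends, adding remote ones only on overflow.

    endpoints: list of dicts with "address", "region", and "healthy" keys.
    min_local: hypothetical threshold; with fewer healthy local backends
               than this, local capacity is treated as insufficient.
    """
    healthy = [e for e in endpoints if e["healthy"]]
    local = [e for e in healthy if e["region"] == local_region]
    if len(local) >= min_local:
        return local
    remote = [e for e in healthy if e["region"] != local_region]
    return local + remote
```

With one healthy backend in us-east-1 and one in eu-west-1, a gateway in us-east-1 would use only the local backend; if that backend became unhealthy, the eu-west-1 backend would be returned instead.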
Setup Walkthrough¶
Prerequisites¶
Before configuring federation:
- Kubernetes 1.29+ on all participating clusters
- NovaEdge installed on each cluster
- Network connectivity between controllers on gRPC port 9443 (or via a tunnel for NAT/firewall environments)
- TLS certificates for secure controller-to-controller mTLS communication
Step 1: Create TLS Secrets¶
Federation controllers communicate over gRPC with mutual TLS. Generate the CA once, issue a client certificate for each cluster, and create both secrets in every participating cluster:
# Generate a CA (or use your organization's CA)
openssl req -x509 -newkey ec -pkeyopt ec_paramgen_curve:P-256 \
-days 3650 -nodes -keyout ca.key -out ca.crt \
-subj "/CN=NovaEdge Federation CA"
# Generate client certificate for this cluster
openssl req -newkey ec -pkeyopt ec_paramgen_curve:P-256 \
-nodes -keyout client.key -out client.csr \
-subj "/CN=cluster-us"
openssl x509 -req -in client.csr -CA ca.crt -CAkey ca.key \
-CAcreateserial -days 365 -out client.crt
# Create Kubernetes secrets
kubectl -n novaedge-system create secret tls federation-ca \
--cert=ca.crt --key=ca.key
kubectl -n novaedge-system create secret tls federation-client-cert \
--cert=client.crt --key=client.key
Use the same CA across all clusters
Every cluster in the federation must trust the same CA certificate so that mTLS handshakes succeed between any pair of controllers.
Step 2: Create the NovaEdgeFederation Resource¶
Deploy a NovaEdgeFederation resource on each cluster. The localMember section identifies the current cluster, while members lists the peers.
Hub-spoke example (deploy on the hub):
apiVersion: novaedge.io/v1alpha1
kind: NovaEdgeFederation
metadata:
  name: global-federation
  namespace: novaedge-system
spec:
  mode: hub-spoke
  federationID: global-prod
  localMember:
    name: hub-us-east
    region: us-east-1
    zone: us-east-1a
    endpoint: "federation-hub.example.com:9443"
  members:
    - name: spoke-eu-west
      region: eu-west-1
      zone: eu-west-1a
      endpoint: "spoke-eu.example.com:9443"
      priority: 10
      tls:
        enabled: true
        caSecretRef:
          name: federation-ca
          namespace: novaedge-system
        clientCertSecretRef:
          name: federation-client-cert
          namespace: novaedge-system
        serverName: spoke-eu.example.com
  sync:
    interval: 5s
    timeout: 30s
    batchSize: 100
    compression: true
  healthCheck:
    interval: 10s
    timeout: 5s
    failureThreshold: 3
    successThreshold: 1
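The failureThreshold and successThreshold settings suggest threshold-based health tracking: a peer is marked unhealthy only after several consecutive failed probes and healthy again after enough consecutive successes. The sketch below assumes that conventional behavior. HealthTracker is a hypothetical class for illustration, and NovaEdge's actual probe logic may differ.

```python
class HealthTracker:
    """Track peer health using consecutive failure and success thresholds."""

    def __init__(self, failure_threshold=3, success_threshold=1):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.healthy = True
        self._fails = 0   # consecutive failed probes
        self._oks = 0     # consecutive successful probes

    def record(self, ok):
        """Record one probe result and return the resulting health state."""
        if ok:
            self._oks += 1
            self._fails = 0
            if not self.healthy and self._oks >= self.success_threshold:
                self.healthy = True
        else:
            self._fails += 1
            self._oks = 0
            if self.healthy and self._fails >= self.failure_threshold:
                self.healthy = False
        return self.healthy
```

Under these assumptions, the values in the example above (interval 10s, failureThreshold 3) would mean a peer is marked unhealthy roughly 30 seconds after it stops responding.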
Mesh example (deploy on each cluster, adjusting localMember):
apiVersion: novaedge.io/v1alpha1
kind: NovaEdgeFederation
metadata:
  name: mesh-federation
  namespace: novaedge-system
spec:
  mode: mesh
  federationID: prod-mesh
  localMember:
    name: cluster-us
    region: us-east-1
    zone: us-east-1b
    endpoint: "novaedge-us.example.com:9443"
  members:
    - name: cluster-eu
      region: eu-west-1
      zone: eu-west-1a
      endpoint: "novaedge-eu.example.com:9443"
      priority: 100
      tls:
        enabled: true
        caSecretRef:
          name: federation-ca
          namespace: novaedge-system
        clientCertSecretRef:
          name: federation-client-cert
          namespace: novaedge-system
        serverName: novaedge-eu.example.com
  conflictResolution:
    strategy: Merge
    vectorClocks: true
    tombstoneTTL: 24h
  splitBrain:
    enabled: true
    partitionTimeout: 30s
    quorumMode: AgentAssisted
    quorumRequired: true
    fencingEnabled: true
    healingGracePeriod: 5s
    autoResolveOnHeal: true
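Because the mesh resource is deployed on each cluster with only localMember and members changing, the per-cluster specs can be derived from one shared cluster list. The sketch below shows that derivation with plain dictionaries. mesh_specs is a hypothetical helper, and it deliberately omits the tls, priority, conflictResolution, and splitBrain fields, which would still need to be filled in.

```python
def mesh_specs(federation_id, clusters):
    """Build one spec per cluster: itself as localMember, all others as members."""
    specs = {}
    for cluster in clusters:
        specs[cluster["name"]] = {
            "mode": "mesh",
            "federationID": federation_id,
            "localMember": cluster,
            "members": [c for c in clusters if c["name"] != cluster["name"]],
        }
    return specs


clusters = [
    {"name": "cluster-us", "region": "us-east-1", "zone": "us-east-1b",
     "endpoint": "novaedge-us.example.com:9443"},
    {"name": "cluster-eu", "region": "eu-west-1", "zone": "eu-west-1a",
     "endpoint": "novaedge-eu.example.com:9443"},
]
specs = mesh_specs("prod-mesh", clusters)
```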
Step 3: Register Remote Clusters (Optional)¶
For hub-spoke deployments with edge or satellite clusters, create NovaEdgeRemoteCluster resources on the hub to register each remote cluster:
apiVersion: novaedge.io/v1alpha1
kind: NovaEdgeRemoteCluster
metadata:
  name: edge-eu-west
  namespace: novaedge-system
spec:
  clusterName: edge-eu-west
  region: eu-west-1
  zone: eu-west-1a
  connection:
    mode: Direct
    controllerEndpoint: "novaedge-hub.example.com:9090"
    tls:
      enabled: true
      caSecretRef:
        name: remote-cluster-ca
        namespace: novaedge-system
      clientCertSecretRef:
        name: remote-agent-cert
        namespace: novaedge-system
      serverName: novaedge-hub.example.com
  routing:
    enabled: true
    priority: 100
    weight: 50
    localPreference: true
    allowCrossClusterTraffic: true
  healthCheck:
    enabled: true
    interval: 30s
    timeout: 10s
    unhealthyThreshold: 3
    failoverEnabled: true
Step 4: Verify Federation Health¶
Check the federation status:
# View federation status
kubectl -n novaedge-system get novaedgefederations
# Detailed status with member health
kubectl -n novaedge-system describe novaedgefederation global-federation
# Check remote clusters
kubectl -n novaedge-system get novaedgeremoteclusters
A healthy federation reports Phase: Healthy, with every member listed as healthy.
TLS Configuration¶
All federation communication uses mutual TLS (mTLS) by default. Each peer entry in the NovaEdgeFederation resource specifies its TLS configuration:
tls:
  # Enable mTLS (default: true)
  enabled: true
  # CA certificate used to verify the peer's server certificate
  caSecretRef:
    name: federation-ca
    namespace: novaedge-system
  # Client certificate and key for authenticating to the peer
  clientCertSecretRef:
    name: federation-client-cert
    namespace: novaedge-system
  # Expected server name in the peer's certificate (for SNI verification)
  serverName: novaedge-eu.example.com
  # Skip certificate verification (NOT recommended for production)
  insecureSkipVerify: false
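To make the role of each field concrete, the sketch below maps them onto a generic TLS client built with Python's standard ssl module. This is only an analogy for how such settings typically behave: client_context is a hypothetical helper, the file paths stand in for the contents of the referenced secrets, and nothing here describes how NovaEdge itself configures gRPC.

```python
import ssl


def client_context(ca_file=None, cert_file=None, key_file=None,
                   insecure_skip_verify=False):
    """Build a client-side TLS context mirroring the tls fields above."""
    # PROTOCOL_TLS_CLIENT verifies the peer certificate and hostname by default.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if ca_file:
        # Counterpart of caSecretRef: the CA used to verify the peer.
        ctx.load_verify_locations(cafile=ca_file)
    if cert_file and key_file:
        # Counterpart of clientCertSecretRef: our identity for mTLS.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if insecure_skip_verify:
        # Counterpart of insecureSkipVerify: disables all peer verification.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx
```

In this analogy, serverName corresponds to the server_hostname argument passed when wrapping a socket, for example ctx.wrap_socket(sock, server_hostname="novaedge-eu.example.com").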
Certificate rotation
Update the Kubernetes secrets with new certificates before expiry. NovaEdge reloads TLS certificates automatically when the underlying secrets change.
Conflict Resolution¶
In mesh and unified modes, concurrent writes to the same resource from different clusters create conflicts. NovaEdge provides three strategies:
| Strategy | Behavior | When to Use |
|---|---|---|
| LastWriterWins | Most recent change wins (vector clock ordering) | Default; simple and predictable |
| Merge | Non-overlapping field changes are merged; overlapping fields fall back to LastWriterWins | Teams editing different parts of the same resource |
| Manual | Conflicts are flagged in status for operator resolution | Strict change control requirements |
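The Merge row can be pictured as a field-level three-way merge against a common base version. The sketch below is a simplified model under that assumption: it works on flat dictionaries, treats a missing key and a None value as the same thing, and takes the "later writer" as an input instead of deriving it from vector clocks. merge_changes is a hypothetical name, not a NovaEdge function.

```python
def merge_changes(base, ours, theirs, theirs_is_later):
    """Merge two edits of `base`; overlapping edits go to the later writer."""
    merged = dict(base)
    for key in set(base) | set(ours) | set(theirs):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:
            value = o                            # same result on both sides
        elif o == b:
            value = t                            # only theirs changed this field
        elif t == b:
            value = o                            # only ours changed this field
        else:
            value = t if theirs_is_later else o  # overlap: last writer wins
        if value is None:
            merged.pop(key, None)                # field deleted (or never set)
        else:
            merged[key] = value
    return merged
```

For example, if one cluster changes weight and another changes timeout on the same resource, both edits survive; if both change weight, the later write is kept.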
Vector clocks provide causal ordering of changes across clusters. They are enabled by default and should remain enabled for all multi-writer modes.
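For readers unfamiliar with vector clocks, the sketch below shows the standard comparison: one clock precedes another if every counter is less than or equal and the clocks differ, and two clocks are concurrent when neither precedes the other. Clocks are modeled as plain dictionaries from cluster name to counter. This is the textbook technique, not NovaEdge's internal representation.

```python
def compare(a, b):
    """Return "before", "after", "equal", or "concurrent" for clocks a and b."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither change saw the other: a real conflict


def merge(a, b):
    """Combine two clocks by taking the per-cluster maximum."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}
```

Only the "concurrent" case needs a conflict resolution strategy; "before" and "after" have an unambiguous causal order.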
Tombstones prevent deleted resources from reappearing during sync. The tombstoneTTL (default: 24h) controls how long deletion markers are retained.
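A minimal sketch of how a tombstone TTL can work, assuming tombstones are stored as a mapping from resource key to deletion time. The function names and the key format are hypothetical; only the 24h default comes from the text above.

```python
from datetime import datetime, timedelta, timezone

TOMBSTONE_TTL = timedelta(hours=24)  # mirrors the documented tombstoneTTL default


def is_suppressed(key, tombstones, now, ttl=TOMBSTONE_TTL):
    """True if `key` was deleted less than `ttl` ago, so sync must not re-create it."""
    deleted_at = tombstones.get(key)
    return deleted_at is not None and now - deleted_at < ttl


def purge_expired(tombstones, now, ttl=TOMBSTONE_TTL):
    """Return only the tombstones that are still within their TTL."""
    return {k: t for k, t in tombstones.items() if now - t < ttl}


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
tombstones = {
    "default/route-a": now - timedelta(hours=1),   # recent deletion
    "default/route-b": now - timedelta(hours=30),  # older than the TTL
}
```

The trade-off in this model is that a TTL shorter than the longest expected partition would let a peer that missed the deletion bring the resource back.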
Split-Brain Protection¶
When network partitions prevent controllers from reaching each other, split-brain protection prevents divergent writes:
stateDiagram-v2
[*] --> Healthy
Healthy --> Suspected: peer unreachable
Suspected --> Confirmed: timeout exceeded
Suspected --> Healthy: peer reachable
Confirmed --> Healing: peer reachable again
Healing --> Healthy: grace period elapsed
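The state diagram above can be written as a small transition table. The state names come from the diagram; the event names are shorthand invented for this sketch.

```python
# (current state, event) -> next state, following the diagram above.
TRANSITIONS = {
    ("Healthy", "peer_unreachable"): "Suspected",
    ("Suspected", "timeout_exceeded"): "Confirmed",
    ("Suspected", "peer_reachable"): "Healthy",
    ("Confirmed", "peer_reachable"): "Healing",
    ("Healing", "grace_elapsed"): "Healthy",
}


def next_state(state, event):
    """Return the next partition state; unknown events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(events, state="Healthy"):
    """Apply a sequence of events and return the final state."""
    for event in events:
        state = next_state(state, event)
    return state
```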
Configuration options:
| Field | Default | Description |
|---|---|---|
| partitionTimeout | 30s | Time without peer contact before confirming partition |
| quorumMode | Controllers | Controllers (3+ controllers) or AgentAssisted (2 controllers using agent witnesses) |
| quorumRequired | false | Reject writes when quorum is lost |
| fencingEnabled | false | Block all writes during confirmed partition |
| healingGracePeriod | 5s | Wait time after healing before resuming normal writes |
| autoResolveOnHeal | true | Automatically resolve conflicts when partition heals |
Agent-assisted quorum
With only two controllers, a traditional majority quorum is impossible: when the link between them fails, neither side holds a majority. AgentAssisted mode counts reachable agents as additional voters. Controllers carry a configurable weight (default: 10) and agents a lower one (default: 1), which prevents a controller that can reach no agents from claiming quorum.
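The weighting can be checked with simple arithmetic. The sketch below assumes quorum means holding a strict majority of the total vote weight, which is a common definition but is an assumption here; has_quorum and its parameters are hypothetical names.

```python
def has_quorum(reachable_controllers, reachable_agents,
               total_controllers, total_agents,
               controller_weight=10, agent_weight=1):
    """True if the reachable voters hold a strict majority of total weight.

    reachable_controllers includes the controller doing the counting.
    """
    total = total_controllers * controller_weight + total_agents * agent_weight
    mine = reachable_controllers * controller_weight + reachable_agents * agent_weight
    return 2 * mine > total
```

With two controllers and four agents the total weight is 24. During a partition, a controller that reaches three agents holds 13 and keeps quorum, while a controller that reaches none holds only 10 and does not. If each side reaches two agents, both hold exactly 12 and neither has quorum under this definition.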
Monitoring¶
Prometheus Metrics¶
NovaEdge exposes federation-specific Prometheus metrics:
| Metric | Type | Description |
|---|---|---|
| novaedge_federation_peers_total | Gauge | Total number of federation peers |
| novaedge_federation_peers_healthy | Gauge | Number of healthy peers |
| novaedge_federation_peers_connected | Gauge | Number of connected peers |
| novaedge_federation_phase | Gauge (labeled) | Current federation phase (1 = active) |
| novaedge_federation_sync_total | Counter | Total sync operations by type and direction |
| novaedge_federation_sync_duration_seconds | Histogram | Sync operation duration per peer |
| novaedge_federation_changes_received_total | Counter | Changes received from each peer |
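As one possible use of these gauges, the sketch below parses a scraped metrics payload in the Prometheus text format and derives the number of unhealthy peers. The sample payload is made up for illustration, and the parser is deliberately minimal: it assumes each sample line is "name value" with no trailing timestamp.

```python
def parse_samples(text):
    """Parse Prometheus text-format lines into {series: value}."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        values[series] = float(value)
    return values


def unhealthy_peers(values):
    """Peers counted in peers_total but not in peers_healthy."""
    return (values["novaedge_federation_peers_total"]
            - values["novaedge_federation_peers_healthy"])


sample = """
# TYPE novaedge_federation_peers_total gauge
novaedge_federation_peers_total 3
# TYPE novaedge_federation_peers_healthy gauge
novaedge_federation_peers_healthy 2
"""
values = parse_samples(sample)
```

In practice the same comparison is usually written as an alerting rule in Prometheus itself; the Python version only shows the relationship between the two gauges.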
Status Conditions¶
The NovaEdgeFederation status includes standard Kubernetes conditions and per-member health:
kubectl -n novaedge-system get fed global-federation -o jsonpath='{.status.phase}'
# Healthy
kubectl -n novaedge-system get fed global-federation -o jsonpath='{.status.members}' | jq .
Key status fields:
- phase -- overall federation health (Initializing, Syncing, Healthy, Degraded, Partitioned)
- members[].healthy -- per-peer health
- members[].syncLag -- how far behind this peer is
- splitBrain.partitionState -- current partition state (Healthy, Suspected, Confirmed, Healing)
- splitBrain.haveQuorum -- whether we currently hold quorum
- conflictsPending -- number of unresolved conflicts
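These fields can be combined into a single health summary, for example in a script that reads the status as JSON. The sketch below operates on an already-parsed status dictionary. The field names follow the list above, except members[].name, which is assumed here for readable messages.

```python
def federation_problems(status):
    """Return a list of human-readable problems found in a federation status."""
    problems = []
    phase = status.get("phase")
    if phase != "Healthy":
        problems.append(f"phase is {phase}")
    for member in status.get("members", []):
        if not member.get("healthy"):
            # "name" is an assumed field, used only for the message.
            problems.append(f"member {member.get('name', '?')} is unhealthy")
    split_brain = status.get("splitBrain", {})
    if split_brain.get("partitionState", "Healthy") != "Healthy":
        problems.append(f"partition state is {split_brain['partitionState']}")
    if split_brain.get("haveQuorum") is False:
        problems.append("quorum lost")
    if status.get("conflictsPending", 0) > 0:
        problems.append(f"{status['conflictsPending']} conflicts pending")
    return problems
```

An empty list means none of the key status fields reports a problem.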
Troubleshooting¶
Federation stays in Initializing phase¶
Cause: Controllers cannot reach each other on the gRPC port.
Resolution:
- Verify network connectivity: nc -zv <peer-endpoint> 9443
- Check that firewall rules allow traffic on port 9443 between clusters
- Verify that the TLS secrets exist and contain valid certificates
- Check controller logs: kubectl -n novaedge-system logs deploy/novaedge-controller | grep federation
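Where nc is not available, the same TCP reachability check can be scripted. The sketch below is a rough equivalent of nc -zv using only the Python standard library; can_connect is a hypothetical helper. It confirms only that the port accepts connections, not that the TLS handshake would succeed.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, can_connect("spoke-eu.example.com", 9443) would be run from a pod or node in the hub cluster.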
Peers show as unhealthy¶
Cause: Health checks are failing.
Resolution:
- Check the error field in the member status: kubectl -n novaedge-system get fed -o yaml
- Verify the peer's endpoint is correct and resolvable
- Ensure TLS serverName matches the peer's certificate CN/SAN
- Increase healthCheck.timeout if the network has high latency
Conflicts pending in mesh mode¶
Cause: Concurrent writes to the same resource from different clusters.
Resolution:
- Check pending conflicts: kubectl -n novaedge-system get fed -o jsonpath='{.items[*].status.conflictsPending}'
- If using Manual strategy, resolve conflicts by updating the resource in one cluster
- Consider switching to Merge or LastWriterWins for automatic resolution
- Review which teams are editing shared resources and establish ownership conventions
Split-brain detected¶
Cause: Network partition between controllers.
Resolution:
- Check partition state: kubectl -n novaedge-system get fed -o jsonpath='{.items[*].status.splitBrain}'
- Verify network connectivity between clusters
- If using tunnels, check tunnel health (see Federation Tunnels)
- If writes are fenced, they will resume automatically once the partition heals and the grace period elapses
Sync lag increasing¶
Cause: Sync operations are falling behind, possibly due to high change rate or network latency.
Resolution:
- Increase sync.batchSize to sync more resources per batch
- Decrease sync.interval for more frequent sync cycles
- Enable sync.compression to reduce bandwidth
- Check controller CPU and memory -- sync operations are CPU-intensive with large resource sets
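A back-of-the-envelope estimate can show whether the first two settings are the bottleneck. The sketch below assumes, purely for illustration, that each sync cycle transfers at most one batch per peer, so the ceiling is batchSize divided by interval. That model is an assumption, not documented NovaEdge behavior, and the helper names are hypothetical.

```python
def parse_duration(value):
    """Parse simple durations such as "5s", "2m", or "1h" into seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    return float(value[:-1]) * units[value[-1]]


def max_sync_rate(batch_size, interval):
    """Upper bound on resources synced per second, one batch per interval."""
    return batch_size / parse_duration(interval)


def keeps_up(change_rate, batch_size, interval):
    """True if the assumed sync ceiling covers the observed change rate."""
    return max_sync_rate(batch_size, interval) >= change_rate
```

Under this model, the example settings of batchSize 100 and interval 5s cap sync at 20 resources per second, so a sustained change rate of 50 per second would fall behind until batchSize is raised or the interval shortened.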
Related Guides¶
- Cross-Cluster Routing -- how endpoint merging and locality-aware routing work
- Federation Tunnels -- WireGuard, SSH, and WebSocket tunnels for NAT/firewall environments
- Architecture: Federation -- internal architecture details