# Federated Control Plane Architecture

NovaEdge supports a federated active/active control plane architecture that eliminates single points of failure by distributing the management plane across multiple clusters.

## Overview

In a federated deployment:

- Multiple controllers run in different clusters, all active simultaneously
- Controllers sync configuration bidirectionally in real time
- Agents connect to a primary controller but can fail over to secondary controllers
- Changes can be made on any controller and propagate to all others
```mermaid
flowchart TB
    subgraph Federation["Federated Control Plane"]
        subgraph Cluster1["Cluster 1 (Region: US-West)"]
            C1["Controller 1<br/>(Active)"]
            A1["Agents"]
        end
        subgraph Cluster2["Cluster 2 (Region: EU-West)"]
            C2["Controller 2<br/>(Active)"]
            A2["Agents"]
        end
        subgraph Cluster3["Cluster 3 (Region: AP-East)"]
            C3["Controller 3<br/>(Active)"]
            A3["Agents"]
        end
    end

    C1 <-->|"Sync"| C2
    C2 <-->|"Sync"| C3
    C1 <-->|"Sync"| C3

    A1 -->|"Primary"| C1
    A1 -.->|"Failover"| C2
    A1 -.->|"Failover"| C3
    A2 -->|"Primary"| C2
    A2 -.->|"Failover"| C1
    A2 -.->|"Failover"| C3
    A3 -->|"Primary"| C3
    A3 -.->|"Failover"| C1
    A3 -.->|"Failover"| C2

    style Cluster1 fill:#e6f3ff
    style Cluster2 fill:#f0fff0
    style Cluster3 fill:#fff5e6
```
## Key Concepts

### Federation Member

A Federation Member is a NovaEdge controller that participates in the federation. Each member:

- Has a unique identifier within the federation
- Maintains a complete copy of all configuration
- Can accept configuration changes from users/operators
- Syncs changes to other federation members
- Serves configuration to agents (local and remote)
### Agent Controller Preferences

Each agent is configured with:

- **Primary Controller**: the preferred controller (usually in the same cluster)
- **Secondary Controllers**: an ordered list of failover controllers
- **Failover Policy**: when and how to fail over
### Configuration Sync

Controllers use a CRDT-based (Conflict-free Replicated Data Type) sync protocol:

- Last-Writer-Wins for simple fields (with vector clocks)
- Merge semantics for lists (routes, backends, etc.)
- Tombstones for deletions (with TTL cleanup)
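
A minimal sketch of how last-writer-wins with tombstones could look, assuming the resolver runs only after vector clocks have already flagged two versions as concurrent (the `Resource` and `Version` types here are illustrative, not NovaEdge's actual implementation):

```go
package main

import "fmt"

// Version carries the metadata used to order concurrent writes: a
// logical clock per controller, a wall-clock timestamp for
// tie-breaking, and the ID of the last writer.
type Version struct {
	Clock      map[string]uint64 // logical clock per controller
	Timestamp  int64             // wall clock, used only to break ties
	LastWriter string
}

// Resource is a replicated object; Deleted marks it as a tombstone
// that keeps syncing until TTL cleanup removes it, so deletions win
// conflicts the same way updates do.
type Resource struct {
	Name    string
	Data    []byte
	Deleted bool // tombstone
	Version Version
}

// resolveLWW picks a winner between two concurrent versions of the
// same resource: the higher timestamp wins, and the writer ID breaks
// exact ties so every controller converges on the same winner.
func resolveLWW(a, b Resource) Resource {
	if a.Version.Timestamp != b.Version.Timestamp {
		if a.Version.Timestamp > b.Version.Timestamp {
			return a
		}
		return b
	}
	if a.Version.LastWriter > b.Version.LastWriter {
		return a
	}
	return b
}

func main() {
	local := Resource{Name: "route-1", Data: []byte("v1"),
		Version: Version{Timestamp: 100, LastWriter: "us-west"}}
	remote := Resource{Name: "route-1", Deleted: true, // remote deletion
		Version: Version{Timestamp: 101, LastWriter: "eu-west"}}
	winner := resolveLWW(local, remote)
	fmt.Printf("winner: %s (deleted=%v)\n", winner.Version.LastWriter, winner.Deleted)
}
```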
## Architecture Components

### NovaEdgeFederation CRD

Defines the federation and its members:
```yaml
apiVersion: novaedge.io/v1alpha1
kind: NovaEdgeFederation
metadata:
  name: global-federation
  namespace: novaedge-system
spec:
  # Unique federation identifier
  federationID: "prod-global"

  # This controller's identity in the federation
  localMember:
    name: "us-west-controller"
    region: "us-west"
    zone: "us-west-2a"
    endpoint: "controller.us-west.novaedge.example.com:9090"

  # Other federation members
  members:
    - name: "eu-west-controller"
      region: "eu-west"
      zone: "eu-west-1a"
      endpoint: "controller.eu-west.novaedge.example.com:9090"
      tls:
        secretRef:
          name: federation-eu-west-tls
    - name: "ap-east-controller"
      region: "ap-east"
      zone: "ap-east-1a"
      endpoint: "controller.ap-east.novaedge.example.com:9090"
      tls:
        secretRef:
          name: federation-ap-east-tls

  # Sync configuration
  sync:
    # How often to sync with peers
    interval: "5s"
    # Timeout for sync operations
    timeout: "30s"
    # Batch size for incremental sync
    batchSize: 100
    # Enable compression for sync traffic
    compression: true

  # Conflict resolution strategy
  conflictResolution:
    # Strategy: LastWriterWins, Merge, or Manual
    strategy: "LastWriterWins"
    # Use vector clocks for ordering
    vectorClocks: true

  # Health check configuration
  healthCheck:
    interval: "10s"
    timeout: "5s"
    failureThreshold: 3
```
### Agent Controller Configuration

Agents are configured with controller preferences:
```yaml
apiVersion: novaedge.io/v1alpha1
kind: NovaEdgeCluster
metadata:
  name: novaedge
  namespace: novaedge-system
spec:
  agent:
    # Controller connection configuration
    controllers:
      # Primary controller (highest priority)
      primary:
        endpoint: "controller.us-west.novaedge.example.com:9090"
        tls:
          secretRef:
            name: controller-us-west-tls
      # Secondary controllers (ordered by priority)
      secondary:
        - endpoint: "controller.eu-west.novaedge.example.com:9090"
          priority: 100
          tls:
            secretRef:
              name: controller-eu-west-tls
        - endpoint: "controller.ap-east.novaedge.example.com:9090"
          priority: 200
          tls:
            secretRef:
              name: controller-ap-east-tls
    # Failover behavior
    failover:
      # How long to wait before failing over
      timeout: "30s"
      # How often to check primary availability
      healthCheckInterval: "10s"
      # Number of failures before failing over
      failureThreshold: 3
      # How long to wait before trying to return to the primary
      recoveryDelay: "60s"
      # Prefer the lower-latency controller during failover
      latencyAware: true
```
## Sync Protocol

### State Synchronization
```mermaid
sequenceDiagram
    participant C1 as Controller 1
    participant C2 as Controller 2
    participant C3 as Controller 3

    Note over C1,C3: Initial Sync (Full State)
    C1->>C2: SyncRequest{type: FULL, vectorClock: {}}
    C2->>C1: SyncResponse{resources: [...], vectorClock: {c2: 5}}
    C1->>C3: SyncRequest{type: FULL, vectorClock: {}}
    C3->>C1: SyncResponse{resources: [...], vectorClock: {c3: 3}}

    Note over C1,C3: Incremental Sync (Changes Only)
    loop Every 5s
        C1->>C2: SyncRequest{type: INCREMENTAL, since: {c2: 5}}
        C2->>C1: SyncResponse{changes: [...], vectorClock: {c2: 7}}
    end

    Note over C1,C3: Change Propagation
    C1->>C1: User creates ProxyRoute
    C1->>C2: PushChange{resource: ProxyRoute, op: CREATE, clock: {c1: 10}}
    C1->>C3: PushChange{resource: ProxyRoute, op: CREATE, clock: {c1: 10}}
    C2->>C1: Ack{clock: {c1: 10, c2: 8}}
    C3->>C1: Ack{clock: {c1: 10, c3: 4}}
```
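
On the requesting side, the incremental loop could look roughly like this (a library-style sketch; `SyncClient` stands in for the generated `FederationService` stub, and the field names are assumptions based on the diagram above):

```go
package federation

import (
	"context"
	"log"
	"time"
)

// SyncResponse mirrors the incremental reply in the sequence above:
// the changes since the caller's watermark plus the peer's current
// vector clock.
type SyncResponse struct {
	Changes     [][]byte
	VectorClock map[string]uint64
}

// SyncClient stands in for the generated FederationService stub.
type SyncClient interface {
	IncrementalSync(ctx context.Context, since map[string]uint64) (*SyncResponse, error)
}

// RunIncrementalSync polls a peer on every tick, applies any new
// changes, and advances the watermark to the peer's vector clock so
// the next request only asks for newer changes.
func RunIncrementalSync(ctx context.Context, peer string, client SyncClient,
	interval time.Duration, apply func([]byte)) {
	lastSeen := map[string]uint64{}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			resp, err := client.IncrementalSync(ctx, lastSeen)
			if err != nil {
				log.Printf("incremental sync with %s failed: %v", peer, err)
				continue // keep the old watermark and retry next tick
			}
			for _, change := range resp.Changes {
				apply(change)
			}
			lastSeen = resp.VectorClock
		}
	}
}
```

Advancing `lastSeen` only after a successful response is what makes a failed sync safe to retry on the next tick.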
### Conflict Resolution

When the same resource is modified on multiple controllers simultaneously:
```mermaid
flowchart TD
    A[Change on C1] --> C{Conflict?}
    B[Change on C2] --> C
    C -->|No| D[Apply Both]
    C -->|Yes| E{Resolution Strategy}
    E -->|LastWriterWins| F[Compare Timestamps<br/>Keep Latest]
    E -->|Merge| G[Merge Fields<br/>Union Lists]
    E -->|Manual| H[Flag for<br/>Operator Review]
    F --> I[Propagate Winner]
    G --> I
    H --> J[Hold Until Resolved]
```
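
For the Merge strategy, list-valued fields are combined as a set union so that concurrent additions on different controllers are both kept. A sketch for a list of backend addresses (illustrative only):

```go
package main

import (
	"fmt"
	"sort"
)

// mergeBackends unions two backend lists, deduplicating by address,
// so concurrent additions on different controllers are both kept.
func mergeBackends(a, b []string) []string {
	seen := map[string]bool{}
	var out []string
	for _, addr := range append(a, b...) {
		if !seen[addr] {
			seen[addr] = true
			out = append(out, addr)
		}
	}
	sort.Strings(out) // deterministic order so all members converge
	return out
}

func main() {
	us := []string{"10.0.1.5:8080", "10.0.1.6:8080"} // added on C1
	eu := []string{"10.0.1.6:8080", "10.0.2.9:8080"} // added on C2
	fmt.Println(mergeBackends(us, eu))
	// Output: [10.0.1.5:8080 10.0.1.6:8080 10.0.2.9:8080]
}
```

Note that a plain union cannot distinguish "removed on one side" from "never added on the other"; that is what the tombstones described earlier are for.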
### Vector Clocks

Each resource carries a vector clock for ordering:
```protobuf
message ResourceVersion {
  // Logical clock per controller
  map<string, uint64> vector_clock = 1;

  // Wall clock timestamp (for tie-breaking)
  int64 timestamp = 2;

  // Controller that last modified the resource
  string last_writer = 3;
}
```
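
A receiver can classify two clocks by walking the union of their keys, treating missing entries as zero; if each clock is ahead somewhere, the writes were concurrent and the conflict-resolution strategy kicks in. A minimal sketch:

```go
package main

import "fmt"

// Ordering of two vector clocks: a happened before b, after b, they
// are equal, or they are concurrent (a genuine conflict).
type Ordering int

const (
	Equal Ordering = iota
	Before
	After
	Concurrent
)

// compare walks the union of both clocks' keys; a missing entry
// counts as zero. If a is ahead somewhere and b is ahead somewhere
// else, neither write saw the other: they are concurrent.
func compare(a, b map[string]uint64) Ordering {
	aAhead, bAhead := false, false
	keys := map[string]bool{}
	for k := range a {
		keys[k] = true
	}
	for k := range b {
		keys[k] = true
	}
	for k := range keys {
		switch {
		case a[k] > b[k]:
			aAhead = true
		case a[k] < b[k]:
			bAhead = true
		}
	}
	switch {
	case aAhead && bAhead:
		return Concurrent
	case aAhead:
		return After
	case bAhead:
		return Before
	default:
		return Equal
	}
}

func main() {
	// c1 and c2 each advanced their own component: concurrent writes.
	fmt.Println(compare(
		map[string]uint64{"c1": 10, "c2": 7},
		map[string]uint64{"c1": 9, "c2": 8},
	)) // prints 3 (Concurrent)
}
```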
## Agent Failover

### Failover State Machine
```mermaid
stateDiagram-v2
    [*] --> ConnectedPrimary: Start
    ConnectedPrimary --> CheckingPrimary: Health Check Failed
    CheckingPrimary --> ConnectedPrimary: Health Check OK
    CheckingPrimary --> FailingOver: Threshold Exceeded
    FailingOver --> ConnectedSecondary: Secondary Connected
    FailingOver --> Disconnected: All Controllers Unavailable
    ConnectedSecondary --> RecoveryCheck: Recovery Timer
    RecoveryCheck --> ConnectedSecondary: Primary Still Down
    RecoveryCheck --> ReturningToPrimary: Primary Available
    ReturningToPrimary --> ConnectedPrimary: Handover Complete
    ReturningToPrimary --> ConnectedSecondary: Handover Failed
    Disconnected --> FailingOver: Retry Timer
    Disconnected --> AutonomousMode: Extended Outage
    AutonomousMode --> FailingOver: Controller Detected
```
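
The primary-side transitions reduce to counting consecutive health-check failures against `failureThreshold`. A minimal sketch (state names mirror the diagram; the types are illustrative, and the full machine also covers recovery and autonomous mode):

```go
package main

import "fmt"

// State mirrors the failover state machine above.
type State int

const (
	ConnectedPrimary State = iota
	CheckingPrimary
	FailingOver
	ConnectedSecondary
	Disconnected
	AutonomousMode
)

// Agent tracks consecutive health-check failures against the
// configured failureThreshold (3 in the example configuration).
type Agent struct {
	state     State
	failures  int
	threshold int
}

// observeHealthCheck advances the state machine on each primary
// health-check result; only the primary-side transitions are shown.
func (a *Agent) observeHealthCheck(ok bool) {
	switch a.state {
	case ConnectedPrimary:
		if !ok {
			a.failures = 1
			a.state = CheckingPrimary
		}
	case CheckingPrimary:
		if ok {
			a.failures = 0
			a.state = ConnectedPrimary
			return
		}
		a.failures++
		if a.failures >= a.threshold {
			a.state = FailingOver // try secondaries in priority order
		}
	}
}

func main() {
	a := &Agent{threshold: 3}
	for i := 0; i < 3; i++ {
		a.observeHealthCheck(false)
	}
	fmt.Println(a.state == FailingOver) // true
}
```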
### Autonomous Mode

When all controllers are unavailable, agents enter Autonomous Mode:

- Continue serving traffic with the last known configuration
- Persist configuration to disk for restart resilience
- Coordinate VIPs locally via agent-to-agent communication
- Queue local changes (health status, metrics) for later sync
```yaml
agent:
  autonomousMode:
    # Enable autonomous operation when disconnected
    enabled: true
    # Path to persist configuration
    configPath: "/var/lib/novaedge/config.json"
    # Enable agent-to-agent VIP coordination
    localVIPCoordination: true
    # How long to keep queued updates
    queueRetention: "24h"
```
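
One way the disk persistence could work: write the snapshot to a temporary file and rename it into place, so a crash mid-write never leaves a truncated config at `configPath`. The snapshot shape below is an assumption for illustration:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// PersistedConfig is the last known configuration an agent writes to
// configPath so it can keep serving traffic across restarts while
// all controllers are unreachable. The shape is illustrative.
type PersistedConfig struct {
	Version string            `json:"version"`
	Routes  map[string]string `json:"routes"`
}

// save writes the snapshot atomically: write to a temp file, then
// rename, so a crash mid-write never leaves a truncated config.
func save(path string, cfg PersistedConfig) error {
	data, err := json.Marshal(cfg)
	if err != nil {
		return err
	}
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

// load restores the snapshot on startup; callers fall back to
// waiting for a controller if no snapshot exists yet.
func load(path string) (PersistedConfig, error) {
	var cfg PersistedConfig
	data, err := os.ReadFile(path)
	if err != nil {
		return cfg, err
	}
	err = json.Unmarshal(data, &cfg)
	return cfg, err
}

func main() {
	path := "/tmp/novaedge-config.json"
	_ = save(path, PersistedConfig{Version: "42",
		Routes: map[string]string{"/api": "10.0.1.5:8080"}})
	cfg, _ := load(path)
	fmt.Println(cfg.Version)
}
```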
## gRPC Service Extensions

### Federation Service

A new gRPC service handles controller-to-controller communication:
```protobuf
service FederationService {
  // Full state sync (initial connection)
  rpc FullSync(FullSyncRequest) returns (FullSyncResponse);

  // Incremental sync (ongoing)
  rpc IncrementalSync(IncrementalSyncRequest) returns (IncrementalSyncResponse);

  // Push change to peer (real-time)
  rpc PushChange(stream ChangeEvent) returns (stream ChangeAck);

  // Health check
  rpc Ping(PingRequest) returns (PingResponse);

  // Get federation status
  rpc GetFederationStatus(GetFederationStatusRequest) returns (FederationStatus);
}

message FullSyncRequest {
  string federation_id = 1;
  string member_id = 2;
  map<string, uint64> vector_clock = 3;
}

message FullSyncResponse {
  repeated ResourceSnapshot resources = 1;
  map<string, uint64> vector_clock = 2;
}

message ChangeEvent {
  string resource_type = 1;  // ProxyGateway, ProxyRoute, etc.
  string namespace = 2;
  string name = 3;
  ChangeOperation operation = 4;
  bytes resource_data = 5;
  ResourceVersion version = 6;
}

enum ChangeOperation {
  CREATE = 0;
  UPDATE = 1;
  DELETE = 2;
}

message ChangeAck {
  string resource_id = 1;
  bool accepted = 2;
  string error = 3;
  map<string, uint64> vector_clock = 4;
}
```
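
On the receiving side, a PushChange handler acks each event so the sender can track what its peer has accepted. A sketch (the local `ChangeStream` interface stands in for the generated `FederationService_PushChangeServer` stream, and the trimmed-down event types are illustrative):

```go
package federation

// ChangeEvent and ChangeAck are trimmed-down stand-ins for the
// protobuf messages above.
type ChangeEvent struct {
	ResourceType, Namespace, Name string
	Data                          []byte
}

type ChangeAck struct {
	ResourceID string
	Accepted   bool
	Error      string
}

// ChangeStream abstracts the bidirectional PushChange stream.
type ChangeStream interface {
	Recv() (*ChangeEvent, error)
	Send(*ChangeAck) error
}

// HandlePushChange applies each pushed change and acks it, so the
// sender can advance its replication watermark per resource.
func HandlePushChange(stream ChangeStream, apply func(*ChangeEvent) error) error {
	for {
		ev, err := stream.Recv()
		if err != nil {
			return err // io.EOF when the peer closes the stream
		}
		ack := &ChangeAck{ResourceID: ev.Namespace + "/" + ev.Name, Accepted: true}
		if err := apply(ev); err != nil {
			ack.Accepted = false
			ack.Error = err.Error()
		}
		if err := stream.Send(ack); err != nil {
			return err
		}
	}
}
```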
### Extended Config Service

The agent-facing service is extended for failover:
```protobuf
service ConfigService {
  // Existing: Stream configuration updates
  rpc StreamConfig(StreamConfigRequest) returns (stream ConfigSnapshot);

  // New: Get controller info for failover
  rpc GetControllerInfo(GetControllerInfoRequest) returns (ControllerInfo);

  // New: Report failover event
  rpc ReportFailover(FailoverEvent) returns (FailoverAck);

  // New: Handover to new controller
  rpc InitiateHandover(HandoverRequest) returns (HandoverResponse);
}

message StreamConfigRequest {
  string node_name = 1;
  string cluster_name = 2;
  map<string, string> labels = 3;
  string config_version = 4;
  // New: Agent's controller preferences
  ControllerPreferences preferences = 5;
}

message ControllerPreferences {
  string primary_controller = 1;
  repeated string secondary_controllers = 2;
  string current_controller = 3;
  bool is_failover = 4;
}

message ControllerInfo {
  string controller_id = 1;
  string federation_id = 2;
  repeated FederationMember federation_members = 3;
  bool is_primary_for_agent = 4;
}

message FederationMember {
  string id = 1;
  string endpoint = 2;
  string region = 3;
  bool healthy = 4;
  int64 last_sync = 5;
}
```
## Deployment Patterns

### Pattern 1: Regional Federation

One controller per region; agents prefer the local controller:
```text
┌─────────────────────────────────────────────────────────────┐
│                      Global Federation                      │
├───────────────────┬───────────────────┬─────────────────────┤
│      US-West      │      EU-West      │       AP-East       │
│                   │                   │                     │
│  ┌───────────┐    │  ┌───────────┐    │  ┌───────────┐      │
│  │Controller │◄──►│  │Controller │◄──►│  │Controller │      │
│  └─────┬─────┘    │  └─────┬─────┘    │  └─────┬─────┘      │
│        │          │        │          │        │            │
│  ┌─────▼─────┐    │  ┌─────▼─────┐    │  ┌─────▼─────┐      │
│  │  Agents   │    │  │  Agents   │    │  │  Agents   │      │
│  │ (Primary) │    │  │ (Primary) │    │  │ (Primary) │      │
│  └───────────┘    │  └───────────┘    │  └───────────┘      │
└───────────────────┴───────────────────┴─────────────────────┘
```
### Pattern 2: Active/Standby Federation

Two controllers, with one primary datacenter:
```text
┌─────────────────────────────────────────────────────────────┐
│                         Federation                          │
├─────────────────────────────┬───────────────────────────────┤
│         Primary DC          │          Standby DC           │
│                             │                               │
│       ┌───────────┐         │        ┌───────────┐          │
│       │Controller │◄────────┼───────►│Controller │          │
│       │ (Active)  │  Sync   │        │ (Standby) │          │
│       └─────┬─────┘         │        └─────┬─────┘          │
│             │               │              │                │
│       ┌─────▼─────┐         │        ┌─────▼─────┐          │
│       │  Agents   │         │        │  Agents   │          │
│       │ (Primary) │─────────┼───────►│(Failover) │          │
│       └───────────┘         │        └───────────┘          │
└─────────────────────────────┴───────────────────────────────┘
```
### Pattern 3: Mesh Federation

All controllers peer with all others:
```text
              ┌───────────┐
              │Controller │
              │     A     │
              └─────┬─────┘
                   ╱│╲
                  ╱ │ ╲
                 ╱  │  ╲
┌───────────┐◄──╱   │   ╲──►┌───────────┐
│Controller │       │       │Controller │
│     B     │◄──────┴──────►│     C     │
└─────┬─────┘               └─────┬─────┘
      │                           │
┌─────▼─────┐               ┌─────▼─────┐
│  Agents   │               │  Agents   │
└───────────┘               └───────────┘
```
## Consistency Guarantees

### Eventual Consistency

The federation provides eventual consistency:

- Changes propagate to all controllers within the sync interval
- During network partitions, controllers operate independently
- After a partition heals, state converges automatically
### Read-Your-Writes

Within a single controller:

- Changes are immediately visible
- Agents connected to that controller see updates in real time
### Conflict Windows

The potential conflict window is approximately `sync_interval + network_latency`: for example, a 5s sync interval plus 100ms of cross-region latency leaves roughly 5.1s during which concurrent writes to the same resource can conflict.

Recommendations:

| Sync Interval | Conflict Risk | Network Cost |
|---|---|---|
| 1s | Very Low | High |
| 5s | Low | Medium |
| 30s | Medium | Low |
## Monitoring and Observability

### Federation Metrics

```promql
# Sync lag between controllers
novaedge_federation_sync_lag_seconds{peer="eu-west"}

# Sync failures
rate(novaedge_federation_sync_failures_total[5m])

# Conflicts detected
rate(novaedge_federation_conflicts_total[5m])

# Federation member health
novaedge_federation_member_healthy{member="eu-west"}

# Agent failover events
rate(novaedge_agent_failover_total[5m])

# Current controller per agent
novaedge_agent_controller{agent="worker-1", controller="us-west"}
```
### Federation Status

```bash
# Check federation status
novactl federation status

# Output:
# Federation: prod-global
# Local Member: us-west-controller
#
# Members:
#   NAME                 REGION    STATUS    SYNC LAG   LAST SYNC
#   us-west-controller   us-west   Local     -          -
#   eu-west-controller   eu-west   Healthy   1.2s       5s ago
#   ap-east-controller   ap-east   Healthy   2.5s       5s ago
#
# Agents:
#   CLUSTER   AGENTS   PRIMARY       FAILOVER
#   us-west   5        us-west (5)   eu-west (0)
#   eu-west   3        eu-west (3)   us-west (0)
#   ap-east   4        ap-east (4)   eu-west (0)
```
## Security Considerations

### mTLS Between Controllers

All federation traffic uses mTLS:

```yaml
spec:
  members:
    - name: "eu-west-controller"
      tls:
        # CA for validating the peer certificate
        caSecretRef:
          name: federation-ca
        # Client cert for authenticating to the peer
        clientCertSecretRef:
          name: federation-client-cert
        # Expected server name
        serverName: "controller.eu-west.novaedge.example.com"
```
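
In Go, the client side of such a peer connection might build its TLS config like this (a sketch; the file paths are illustrative stand-ins for the mounted Secret contents, and the helper name is hypothetical):

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"log"
	"os"
)

// newPeerTLS builds the client-side TLS config for a federation
// peer: trust only the federation CA, present our client cert, and
// pin the expected server name from the member spec.
func newPeerTLS(caFile, certFile, keyFile, serverName string) (*tls.Config, error) {
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no CA certificates found in " + caFile)
	}
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, err
	}
	return &tls.Config{
		RootCAs:      pool,                    // validate the peer against the federation CA
		Certificates: []tls.Certificate{cert}, // authenticate ourselves to the peer
		ServerName:   serverName,              // must match the peer's cert
		MinVersion:   tls.VersionTLS12,
	}, nil
}

func main() {
	cfg, err := newPeerTLS("ca.crt", "client.crt", "client.key",
		"controller.eu-west.novaedge.example.com")
	if err != nil {
		log.Fatal(err)
	}
	_ = cfg // pass to gRPC transport credentials or tls.Dial
}
```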
### RBAC for Federation

Federation runs under a dedicated service account with its own ClusterRole:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: novaedge-federation
rules:
  - apiGroups: ["novaedge.io"]
    resources: ["*"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "watch"]
    # RBAC resourceNames must be exact names; wildcards such as
    # "federation-*" are not supported, so list each secret.
    resourceNames:
      - federation-ca
      - federation-client-cert
      - federation-eu-west-tls
      - federation-ap-east-tls
```
### Audit Logging

All federation operations are logged:

```json
{
  "level": "info",
  "ts": "2024-01-15T10:30:00.123Z",
  "msg": "federation sync completed",
  "peer": "eu-west-controller",
  "changes_sent": 5,
  "changes_received": 3,
  "conflicts": 0,
  "duration_ms": 45
}
```
## Migration Guide

### From Hub-Spoke to Federation

1. Deploy additional controllers in the other clusters
2. Create a `NovaEdgeFederation` resource on each controller
3. Wait for the initial sync to complete
4. Update the agent configuration with secondary controllers
5. Verify failover by testing controller failure scenarios
### Rollback

To roll back to hub-spoke:

1. Remove `NovaEdgeFederation` resources
2. Update the agent configuration to a single controller
3. Decommission the additional controllers
## Limitations

- Network connectivity required between controllers for sync
- Conflict resolution may require manual intervention for complex cases
- Increased resource usage for sync traffic and state storage
- Eventual consistency means brief inconsistency windows
## Next Steps

- Installation Guide - Deploy NovaEdge
- Multi-Cluster Guide - Hub-spoke architecture
- CRD Reference - Federation CRD details