# Federated Control Plane Architecture

NovaEdge supports a federated active/active control plane architecture that eliminates single points of failure by distributing the management plane across multiple clusters.

## Overview

In a federated deployment:

- Multiple controllers run in different clusters, all active simultaneously
- Controllers sync configuration bidirectionally in real time
- Agents connect to a primary controller but can fail over to secondary controllers
- Changes can be made on any controller and propagate to all others
```mermaid
flowchart TB
    subgraph Federation["Federated Control Plane"]
        subgraph Cluster1["Cluster 1 (Region: US-West)"]
            C1["Controller 1<br/>(Active)"]
            A1["Agents"]
        end
        subgraph Cluster2["Cluster 2 (Region: EU-West)"]
            C2["Controller 2<br/>(Active)"]
            A2["Agents"]
        end
        subgraph Cluster3["Cluster 3 (Region: AP-East)"]
            C3["Controller 3<br/>(Active)"]
            A3["Agents"]
        end
    end

    C1 <-->|"Sync"| C2
    C2 <-->|"Sync"| C3
    C1 <-->|"Sync"| C3

    A1 -->|"Primary"| C1
    A1 -.->|"Failover"| C2
    A1 -.->|"Failover"| C3
    A2 -->|"Primary"| C2
    A2 -.->|"Failover"| C1
    A2 -.->|"Failover"| C3
    A3 -->|"Primary"| C3
    A3 -.->|"Failover"| C1
    A3 -.->|"Failover"| C2

    style Cluster1 fill:#e6f3ff
    style Cluster2 fill:#f0fff0
    style Cluster3 fill:#fff5e6
```
## Key Concepts

### Federation Member

A Federation Member is a NovaEdge controller that participates in the federation. Each member:

- Has a unique identifier within the federation
- Maintains a complete copy of all configuration
- Can accept configuration changes from users/operators
- Syncs changes to other federation members
- Serves configuration to agents (local and remote)
### Agent Controller Preferences

Each agent is configured with:

- **Primary Controller**: the preferred controller (usually in the same cluster)
- **Secondary Controllers**: an ordered list of failover controllers
- **Failover Policy**: when and how to fail over
### Configuration Sync

Controllers use a CRDT-based (Conflict-free Replicated Data Type) sync protocol:

- Last-Writer-Wins for simple fields (with vector clocks)
- Merge semantics for lists (routes, backends, etc.)
- Tombstones for deletions (with TTL cleanup)
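
A minimal sketch of how last-writer-wins with tombstones could look, assuming the resolver runs only after vector clocks have already flagged two versions as concurrent (the `Resource` and `Version` types here are illustrative, not NovaEdge's actual implementation):

```go
package main

import "fmt"

// Version carries the metadata used to order concurrent writes: a
// logical clock per controller, a wall-clock timestamp for
// tie-breaking, and the ID of the last writer.
type Version struct {
	Clock      map[string]uint64 // logical clock per controller
	Timestamp  int64             // wall clock, used only to break ties
	LastWriter string
}

// Resource is a replicated object; Deleted marks it as a tombstone
// that keeps syncing until TTL cleanup removes it, so deletions win
// conflicts the same way updates do.
type Resource struct {
	Name    string
	Data    []byte
	Deleted bool // tombstone
	Version Version
}

// resolveLWW picks a winner between two concurrent versions of the
// same resource: the higher timestamp wins, and the writer ID breaks
// exact ties so every controller converges on the same winner.
func resolveLWW(a, b Resource) Resource {
	if a.Version.Timestamp != b.Version.Timestamp {
		if a.Version.Timestamp > b.Version.Timestamp {
			return a
		}
		return b
	}
	if a.Version.LastWriter > b.Version.LastWriter {
		return a
	}
	return b
}

func main() {
	local := Resource{Name: "route-1", Data: []byte("v1"),
		Version: Version{Timestamp: 100, LastWriter: "us-west"}}
	remote := Resource{Name: "route-1", Deleted: true, // remote deletion
		Version: Version{Timestamp: 101, LastWriter: "eu-west"}}
	winner := resolveLWW(local, remote)
	fmt.Printf("winner: %s (deleted=%v)\n", winner.Version.LastWriter, winner.Deleted)
}
```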
## Architecture Components

### NovaEdgeFederation CRD

Defines the federation and its members:
```yaml
apiVersion: novaedge.io/v1alpha1
kind: NovaEdgeFederation
metadata:
  name: global-federation
  namespace: novaedge-system
spec:
  # Unique federation identifier
  federationID: "prod-global"

  # This controller's identity in the federation
  localMember:
    name: "us-west-controller"
    region: "us-west"
    zone: "us-west-2a"
    endpoint: "controller.us-west.novaedge.example.com:9090"

  # Other federation members
  members:
    - name: "eu-west-controller"
      region: "eu-west"
      zone: "eu-west-1a"
      endpoint: "controller.eu-west.novaedge.example.com:9090"
      tls:
        secretRef:
          name: federation-eu-west-tls
    - name: "ap-east-controller"
      region: "ap-east"
      zone: "ap-east-1a"
      endpoint: "controller.ap-east.novaedge.example.com:9090"
      tls:
        secretRef:
          name: federation-ap-east-tls

  # Sync configuration
  sync:
    # How often to sync with peers
    interval: "5s"
    # Timeout for sync operations
    timeout: "30s"
    # Batch size for incremental sync
    batchSize: 100
    # Enable compression for sync traffic
    compression: true

  # Conflict resolution strategy
  conflictResolution:
    # Strategy: LastWriterWins, Merge, or Manual
    strategy: "LastWriterWins"
    # Use vector clocks for ordering
    vectorClocks: true

  # Health check configuration
  healthCheck:
    interval: "10s"
    timeout: "5s"
    failureThreshold: 3
```
### Agent Controller Configuration

Agents are configured with controller preferences:
```yaml
apiVersion: novaedge.io/v1alpha1
kind: NovaEdgeCluster
metadata:
  name: novaedge
  namespace: novaedge-system
spec:
  agent:
    # Controller connection configuration
    controllers:
      # Primary controller (highest priority)
      primary:
        endpoint: "controller.us-west.novaedge.example.com:9090"
        tls:
          secretRef:
            name: controller-us-west-tls
      # Secondary controllers (ordered by priority)
      secondary:
        - endpoint: "controller.eu-west.novaedge.example.com:9090"
          priority: 100
          tls:
            secretRef:
              name: controller-eu-west-tls
        - endpoint: "controller.ap-east.novaedge.example.com:9090"
          priority: 200
          tls:
            secretRef:
              name: controller-ap-east-tls
    # Failover behavior
    failover:
      # How long to wait before failing over
      timeout: "30s"
      # How often to check primary availability
      healthCheckInterval: "10s"
      # Number of failures before failing over
      failureThreshold: 3
      # How long to wait before trying to return to the primary
      recoveryDelay: "60s"
      # Prefer the lower-latency controller during failover
      latencyAware: true
```
## Sync Protocol

### State Synchronization
```mermaid
sequenceDiagram
    participant C1 as Controller 1
    participant C2 as Controller 2
    participant C3 as Controller 3

    Note over C1,C3: Initial Sync (Full State)
    C1->>C2: SyncRequest{type: FULL, vectorClock: {}}
    C2->>C1: SyncResponse{resources: [...], vectorClock: {c2: 5}}
    C1->>C3: SyncRequest{type: FULL, vectorClock: {}}
    C3->>C1: SyncResponse{resources: [...], vectorClock: {c3: 3}}

    Note over C1,C3: Incremental Sync (Changes Only)
    loop Every 5s
        C1->>C2: SyncRequest{type: INCREMENTAL, since: {c2: 5}}
        C2->>C1: SyncResponse{changes: [...], vectorClock: {c2: 7}}
    end

    Note over C1,C3: Change Propagation
    C1->>C1: User creates ProxyRoute
    C1->>C2: PushChange{resource: ProxyRoute, op: CREATE, clock: {c1: 10}}
    C1->>C3: PushChange{resource: ProxyRoute, op: CREATE, clock: {c1: 10}}
    C2->>C1: Ack{clock: {c1: 10, c2: 8}}
    C3->>C1: Ack{clock: {c1: 10, c3: 4}}
```
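
On the requesting side, the incremental loop could look roughly like this (a library-style sketch; `SyncClient` stands in for the generated `FederationService` stub, and the field names are assumptions based on the diagram above):

```go
package federation

import (
	"context"
	"log"
	"time"
)

// SyncResponse mirrors the incremental reply in the sequence above:
// the changes since the caller's watermark plus the peer's current
// vector clock.
type SyncResponse struct {
	Changes     [][]byte
	VectorClock map[string]uint64
}

// SyncClient stands in for the generated FederationService stub.
type SyncClient interface {
	IncrementalSync(ctx context.Context, since map[string]uint64) (*SyncResponse, error)
}

// RunIncrementalSync polls a peer on every tick, applies any new
// changes, and advances the watermark to the peer's vector clock so
// the next request only asks for newer changes.
func RunIncrementalSync(ctx context.Context, peer string, client SyncClient,
	interval time.Duration, apply func([]byte)) {
	lastSeen := map[string]uint64{}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			resp, err := client.IncrementalSync(ctx, lastSeen)
			if err != nil {
				log.Printf("incremental sync with %s failed: %v", peer, err)
				continue // keep the old watermark and retry next tick
			}
			for _, change := range resp.Changes {
				apply(change)
			}
			lastSeen = resp.VectorClock
		}
	}
}
```

Advancing `lastSeen` only after a successful response is what makes a failed sync safe to retry on the next tick.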
### Conflict Resolution

When the same resource is modified on multiple controllers simultaneously:
```mermaid
flowchart TD
    A[Change on C1] --> C{Conflict?}
    B[Change on C2] --> C
    C -->|No| D[Apply Both]
    C -->|Yes| E{Resolution Strategy}
    E -->|LastWriterWins| F[Compare Timestamps<br/>Keep Latest]
    E -->|Merge| G[Merge Fields<br/>Union Lists]
    E -->|Manual| H[Flag for<br/>Operator Review]
    F --> I[Propagate Winner]
    G --> I
    H --> J[Hold Until Resolved]
```
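
For the Merge strategy, list-valued fields are combined as a set union so that concurrent additions on different controllers are both kept. A sketch for a list of backend addresses (illustrative only):

```go
package main

import (
	"fmt"
	"sort"
)

// mergeBackends unions two backend lists, deduplicating by address,
// so concurrent additions on different controllers are both kept.
func mergeBackends(a, b []string) []string {
	seen := map[string]bool{}
	var out []string
	for _, addr := range append(a, b...) {
		if !seen[addr] {
			seen[addr] = true
			out = append(out, addr)
		}
	}
	sort.Strings(out) // deterministic order so all members converge
	return out
}

func main() {
	us := []string{"10.0.1.5:8080", "10.0.1.6:8080"} // added on C1
	eu := []string{"10.0.1.6:8080", "10.0.2.9:8080"} // added on C2
	fmt.Println(mergeBackends(us, eu))
	// Output: [10.0.1.5:8080 10.0.1.6:8080 10.0.2.9:8080]
}
```

Note that a plain union cannot distinguish "removed on one side" from "never added on the other"; that is what the tombstones described earlier are for.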
### Vector Clocks

Each resource carries a vector clock for ordering:
```protobuf
message ResourceVersion {
  // Logical clock per controller
  map<string, uint64> vector_clock = 1;

  // Wall clock timestamp (for tie-breaking)
  int64 timestamp = 2;

  // Controller that last modified the resource
  string last_writer = 3;
}
```
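
A receiver can classify two clocks by walking the union of their keys, treating missing entries as zero; if each clock is ahead somewhere, the writes were concurrent and the conflict-resolution strategy kicks in. A minimal sketch:

```go
package main

import "fmt"

// Ordering of two vector clocks: a happened before b, after b, they
// are equal, or they are concurrent (a genuine conflict).
type Ordering int

const (
	Equal Ordering = iota
	Before
	After
	Concurrent
)

// compare walks the union of both clocks' keys; a missing entry
// counts as zero. If a is ahead somewhere and b is ahead somewhere
// else, neither write saw the other: they are concurrent.
func compare(a, b map[string]uint64) Ordering {
	aAhead, bAhead := false, false
	keys := map[string]bool{}
	for k := range a {
		keys[k] = true
	}
	for k := range b {
		keys[k] = true
	}
	for k := range keys {
		switch {
		case a[k] > b[k]:
			aAhead = true
		case a[k] < b[k]:
			bAhead = true
		}
	}
	switch {
	case aAhead && bAhead:
		return Concurrent
	case aAhead:
		return After
	case bAhead:
		return Before
	default:
		return Equal
	}
}

func main() {
	// c1 and c2 each advanced their own component: concurrent writes.
	fmt.Println(compare(
		map[string]uint64{"c1": 10, "c2": 7},
		map[string]uint64{"c1": 9, "c2": 8},
	)) // prints 3 (Concurrent)
}
```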
## Agent Failover

### Failover State Machine
```mermaid
stateDiagram-v2
    [*] --> ConnectedPrimary: Start
    ConnectedPrimary --> CheckingPrimary: Health Check Failed
    CheckingPrimary --> ConnectedPrimary: Health Check OK
    CheckingPrimary --> FailingOver: Threshold Exceeded
    FailingOver --> ConnectedSecondary: Secondary Connected
    FailingOver --> Disconnected: All Controllers Unavailable
    ConnectedSecondary --> RecoveryCheck: Recovery Timer
    RecoveryCheck --> ConnectedSecondary: Primary Still Down
    RecoveryCheck --> ReturningToPrimary: Primary Available
    ReturningToPrimary --> ConnectedPrimary: Handover Complete
    ReturningToPrimary --> ConnectedSecondary: Handover Failed
    Disconnected --> FailingOver: Retry Timer
    Disconnected --> AutonomousMode: Extended Outage
    AutonomousMode --> FailingOver: Controller Detected
```
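
The primary-side transitions reduce to counting consecutive health-check failures against `failureThreshold`. A minimal sketch (state names mirror the diagram; the types are illustrative, and the full machine also covers recovery and autonomous mode):

```go
package main

import "fmt"

// State mirrors the failover state machine above.
type State int

const (
	ConnectedPrimary State = iota
	CheckingPrimary
	FailingOver
	ConnectedSecondary
	Disconnected
	AutonomousMode
)

// Agent tracks consecutive health-check failures against the
// configured failureThreshold (3 in the example configuration).
type Agent struct {
	state     State
	failures  int
	threshold int
}

// observeHealthCheck advances the state machine on each primary
// health-check result; only the primary-side transitions are shown.
func (a *Agent) observeHealthCheck(ok bool) {
	switch a.state {
	case ConnectedPrimary:
		if !ok {
			a.failures = 1
			a.state = CheckingPrimary
		}
	case CheckingPrimary:
		if ok {
			a.failures = 0
			a.state = ConnectedPrimary
			return
		}
		a.failures++
		if a.failures >= a.threshold {
			a.state = FailingOver // try secondaries in priority order
		}
	}
}

func main() {
	a := &Agent{threshold: 3}
	for i := 0; i < 3; i++ {
		a.observeHealthCheck(false)
	}
	fmt.Println(a.state == FailingOver) // true
}
```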
### Autonomous Mode

When all controllers are unavailable, agents enter Autonomous Mode:

- Continue serving traffic with the last known configuration
- Persist configuration to disk for restart resilience
- Coordinate VIPs locally via agent-to-agent communication
- Queue local changes (health status, metrics) for later sync
```yaml
agent:
  autonomousMode:
    # Enable autonomous operation when disconnected
    enabled: true
    # Path to persist configuration
    configPath: "/var/lib/novaedge/config.json"
    # Enable agent-to-agent VIP coordination
    localVIPCoordination: true
    # How long to keep queued updates
    queueRetention: "24h"
```
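
One way the disk persistence could work: write the snapshot to a temporary file and rename it into place, so a crash mid-write never leaves a truncated config at `configPath`. The snapshot shape below is an assumption for illustration:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// PersistedConfig is the last known configuration an agent writes to
// configPath so it can keep serving traffic across restarts while
// all controllers are unreachable. The shape is illustrative.
type PersistedConfig struct {
	Version string            `json:"version"`
	Routes  map[string]string `json:"routes"`
}

// save writes the snapshot atomically: write to a temp file, then
// rename, so a crash mid-write never leaves a truncated config.
func save(path string, cfg PersistedConfig) error {
	data, err := json.Marshal(cfg)
	if err != nil {
		return err
	}
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

// load restores the snapshot on startup; callers fall back to
// waiting for a controller if no snapshot exists yet.
func load(path string) (PersistedConfig, error) {
	var cfg PersistedConfig
	data, err := os.ReadFile(path)
	if err != nil {
		return cfg, err
	}
	err = json.Unmarshal(data, &cfg)
	return cfg, err
}

func main() {
	path := "/tmp/novaedge-config.json"
	_ = save(path, PersistedConfig{Version: "42",
		Routes: map[string]string{"/api": "10.0.1.5:8080"}})
	cfg, _ := load(path)
	fmt.Println(cfg.Version)
}
```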
## gRPC Service Extensions

### Federation Service

A new gRPC service handles controller-to-controller communication:
```protobuf
service FederationService {
  // Full state sync (initial connection)
  rpc FullSync(FullSyncRequest) returns (FullSyncResponse);

  // Incremental sync (ongoing)
  rpc IncrementalSync(IncrementalSyncRequest) returns (IncrementalSyncResponse);

  // Push change to peer (real-time)
  rpc PushChange(stream ChangeEvent) returns (stream ChangeAck);

  // Health check
  rpc Ping(PingRequest) returns (PingResponse);

  // Get federation status
  rpc GetFederationStatus(GetFederationStatusRequest) returns (FederationStatus);
}

message FullSyncRequest {
  string federation_id = 1;
  string member_id = 2;
  map<string, uint64> vector_clock = 3;
}

message FullSyncResponse {
  repeated ResourceSnapshot resources = 1;
  map<string, uint64> vector_clock = 2;
}

message ChangeEvent {
  string resource_type = 1;  // ProxyGateway, ProxyRoute, etc.
  string namespace = 2;
  string name = 3;
  ChangeOperation operation = 4;
  bytes resource_data = 5;
  ResourceVersion version = 6;
}

enum ChangeOperation {
  CREATE = 0;
  UPDATE = 1;
  DELETE = 2;
}

message ChangeAck {
  string resource_id = 1;
  bool accepted = 2;
  string error = 3;
  map<string, uint64> vector_clock = 4;
}
```
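
On the receiving side, a PushChange handler acks each event so the sender can track what its peer has accepted. A sketch (the local `ChangeStream` interface stands in for the generated `FederationService_PushChangeServer` stream, and the trimmed-down event types are illustrative):

```go
package federation

// ChangeEvent and ChangeAck are trimmed-down stand-ins for the
// protobuf messages above.
type ChangeEvent struct {
	ResourceType, Namespace, Name string
	Data                          []byte
}

type ChangeAck struct {
	ResourceID string
	Accepted   bool
	Error      string
}

// ChangeStream abstracts the bidirectional PushChange stream.
type ChangeStream interface {
	Recv() (*ChangeEvent, error)
	Send(*ChangeAck) error
}

// HandlePushChange applies each pushed change and acks it, so the
// sender can advance its replication watermark per resource.
func HandlePushChange(stream ChangeStream, apply func(*ChangeEvent) error) error {
	for {
		ev, err := stream.Recv()
		if err != nil {
			return err // io.EOF when the peer closes the stream
		}
		ack := &ChangeAck{ResourceID: ev.Namespace + "/" + ev.Name, Accepted: true}
		if err := apply(ev); err != nil {
			ack.Accepted = false
			ack.Error = err.Error()
		}
		if err := stream.Send(ack); err != nil {
			return err
		}
	}
}
```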
### Extended Config Service

The agent-facing service is extended for failover:
```protobuf
service ConfigService {
  // Existing: Stream configuration updates
  rpc StreamConfig(StreamConfigRequest) returns (stream ConfigSnapshot);

  // New: Get controller info for failover
  rpc GetControllerInfo(GetControllerInfoRequest) returns (ControllerInfo);

  // New: Report failover event
  rpc ReportFailover(FailoverEvent) returns (FailoverAck);

  // New: Handover to new controller
  rpc InitiateHandover(HandoverRequest) returns (HandoverResponse);
}

message StreamConfigRequest {
  string node_name = 1;
  string cluster_name = 2;
  map<string, string> labels = 3;
  string config_version = 4;
  // New: Agent's controller preferences
  ControllerPreferences preferences = 5;
}

message ControllerPreferences {
  string primary_controller = 1;
  repeated string secondary_controllers = 2;
  string current_controller = 3;
  bool is_failover = 4;
}

message ControllerInfo {
  string controller_id = 1;
  string federation_id = 2;
  repeated FederationMember federation_members = 3;
  bool is_primary_for_agent = 4;
}

message FederationMember {
  string id = 1;
  string endpoint = 2;
  string region = 3;
  bool healthy = 4;
  int64 last_sync = 5;
}
```
## Deployment Patterns

### Pattern 1: Regional Federation

One controller per region; agents prefer the local controller:
```text
┌─────────────────────────────────────────────────────────────┐
│                      Global Federation                      │
├───────────────────┬───────────────────┬─────────────────────┤
│      US-West      │      EU-West      │       AP-East       │
│                   │                   │                     │
│  ┌───────────┐    │  ┌───────────┐    │  ┌───────────┐      │
│  │Controller │◄──►│  │Controller │◄──►│  │Controller │      │
│  └─────┬─────┘    │  └─────┬─────┘    │  └─────┬─────┘      │
│        │          │        │          │        │            │
│  ┌─────▼─────┐    │  ┌─────▼─────┐    │  ┌─────▼─────┐      │
│  │  Agents   │    │  │  Agents   │    │  │  Agents   │      │
│  │ (Primary) │    │  │ (Primary) │    │  │ (Primary) │      │
│  └───────────┘    │  └───────────┘    │  └───────────┘      │
└───────────────────┴───────────────────┴─────────────────────┘
```
### Pattern 2: Active/Standby Federation

Two controllers, with one primary datacenter:
```text
┌─────────────────────────────────────────────────────────────┐
│                         Federation                          │
├─────────────────────────────┬───────────────────────────────┤
│         Primary DC          │          Standby DC           │
│                             │                               │
│       ┌───────────┐         │        ┌───────────┐          │
│       │Controller │◄────────┼───────►│Controller │          │
│       │ (Active)  │  Sync   │        │ (Standby) │          │
│       └─────┬─────┘         │        └─────┬─────┘          │
│             │               │              │                │
│       ┌─────▼─────┐         │        ┌─────▼─────┐          │
│       │  Agents   │         │        │  Agents   │          │
│       │ (Primary) │─────────┼───────►│(Failover) │          │
│       └───────────┘         │        └───────────┘          │
└─────────────────────────────┴───────────────────────────────┘
```
### Pattern 3: Mesh Federation

All controllers peer with all others:
```text
              ┌───────────┐
              │Controller │
              │     A     │
              └─────┬─────┘
                   ╱│╲
                  ╱ │ ╲
                 ╱  │  ╲
┌───────────┐◄──╱   │   ╲──►┌───────────┐
│Controller │       │       │Controller │
│     B     │◄──────┴──────►│     C     │
└─────┬─────┘               └─────┬─────┘
      │                           │
┌─────▼─────┐               ┌─────▼─────┐
│  Agents   │               │  Agents   │
└───────────┘               └───────────┘
```
## Consistency Guarantees

### Eventual Consistency

The federation provides eventual consistency:

- Changes propagate to all controllers within the sync interval
- During network partitions, controllers operate independently
- After a partition heals, state converges automatically
### Read-Your-Writes

Within a single controller:

- Changes are immediately visible
- Agents connected to that controller see updates in real time
### Conflict Windows

The potential conflict window is approximately `sync_interval + network_latency`: for example, a 5s sync interval plus 100ms of cross-region latency leaves roughly 5.1s during which concurrent writes to the same resource can conflict.

Recommendations:

| Sync Interval | Conflict Risk | Network Cost |
|---|---|---|
| 1s | Very Low | High |
| 5s | Low | Medium |
| 30s | Medium | Low |
## Monitoring and Observability

### Federation Metrics

```promql
# Sync lag between controllers
novaedge_federation_sync_lag_seconds{peer="eu-west"}

# Sync failures
rate(novaedge_federation_sync_failures_total[5m])

# Conflicts detected
rate(novaedge_federation_conflicts_total[5m])

# Federation member health
novaedge_federation_member_healthy{member="eu-west"}

# Agent failover events
rate(novaedge_agent_failover_total[5m])

# Current controller per agent
novaedge_agent_controller{agent="worker-1", controller="us-west"}
```
### Federation Status

```bash
# Check federation status
novactl federation status

# Output:
# Federation: prod-global
# Local Member: us-west-controller
#
# Members:
#   NAME                 REGION    STATUS    SYNC LAG   LAST SYNC
#   us-west-controller   us-west   Local     -          -
#   eu-west-controller   eu-west   Healthy   1.2s       5s ago
#   ap-east-controller   ap-east   Healthy   2.5s       5s ago
#
# Agents:
#   CLUSTER   AGENTS   PRIMARY       FAILOVER
#   us-west   5        us-west (5)   eu-west (0)
#   eu-west   3        eu-west (3)   us-west (0)
#   ap-east   4        ap-east (4)   eu-west (0)
```
## Security Considerations

### mTLS Between Controllers

All federation traffic uses mTLS:

```yaml
spec:
  members:
    - name: "eu-west-controller"
      tls:
        # CA for validating the peer certificate
        caSecretRef:
          name: federation-ca
        # Client cert for authenticating to the peer
        clientCertSecretRef:
          name: federation-client-cert
        # Expected server name
        serverName: "controller.eu-west.novaedge.example.com"
```
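
In Go, the client side of such a peer connection might build its TLS config like this (a sketch; the file paths are illustrative stand-ins for the mounted Secret contents, and the helper name is hypothetical):

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"log"
	"os"
)

// newPeerTLS builds the client-side TLS config for a federation
// peer: trust only the federation CA, present our client cert, and
// pin the expected server name from the member spec.
func newPeerTLS(caFile, certFile, keyFile, serverName string) (*tls.Config, error) {
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no CA certificates found in " + caFile)
	}
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, err
	}
	return &tls.Config{
		RootCAs:      pool,                    // validate the peer against the federation CA
		Certificates: []tls.Certificate{cert}, // authenticate ourselves to the peer
		ServerName:   serverName,              // must match the peer's cert
		MinVersion:   tls.VersionTLS12,
	}, nil
}

func main() {
	cfg, err := newPeerTLS("ca.crt", "client.crt", "client.key",
		"controller.eu-west.novaedge.example.com")
	if err != nil {
		log.Fatal(err)
	}
	_ = cfg // pass to gRPC transport credentials or tls.Dial
}
```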
### RBAC for Federation

Federation runs under a dedicated service account with its own ClusterRole:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: novaedge-federation
rules:
  - apiGroups: ["novaedge.io"]
    resources: ["*"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "watch"]
    # RBAC resourceNames must be exact names; wildcards such as
    # "federation-*" are not supported, so list each secret.
    resourceNames:
      - federation-ca
      - federation-client-cert
      - federation-eu-west-tls
      - federation-ap-east-tls
```
### Audit Logging

All federation operations are logged:

```json
{
  "level": "info",
  "ts": "2024-01-15T10:30:00.123Z",
  "msg": "federation sync completed",
  "peer": "eu-west-controller",
  "changes_sent": 5,
  "changes_received": 3,
  "conflicts": 0,
  "duration_ms": 45
}
```
## Migration Guide

### From Hub-Spoke to Federation

1. Deploy additional controllers in the other clusters
2. Create a `NovaEdgeFederation` resource on each controller
3. Wait for the initial sync to complete
4. Update the agent configuration with secondary controllers
5. Verify failover by testing controller failure scenarios
### Rollback

To roll back to hub-spoke:

1. Remove `NovaEdgeFederation` resources
2. Update the agent configuration to a single controller
3. Decommission the additional controllers
## Limitations

- Network connectivity required between controllers for sync
- Conflict resolution may require manual intervention for complex cases
- Increased resource usage for sync traffic and state storage
- Eventual consistency means brief inconsistency windows
## Next Steps

- Installation Guide - Deploy NovaEdge
- Multi-Cluster Guide - Hub-spoke architecture
- CRD Reference - Federation CRD details