
Service Mesh

NovaEdge includes a sidecar-free service mesh for east-west (pod-to-pod) traffic. It intercepts ClusterIP traffic transparently using iptables TPROXY, authenticates services with SPIFFE-based mTLS, and enforces authorization policies -- all without injecting sidecar containers.

Overview

Traditional service meshes inject a sidecar proxy into every pod, adding latency, memory overhead, and operational complexity. NovaEdge takes a different approach: the node agent (DaemonSet) intercepts service traffic at the kernel level using TPROXY and tunnels it over mTLS HTTP/2 connections between nodes.

Key properties:

  • No sidecars -- traffic interception happens at the node level via iptables TPROXY
  • Opt-in per service -- annotate services with novaedge.io/mesh: "enabled" to enroll them
  • SPIFFE identities -- each agent gets a workload certificate with a SPIFFE URI SAN
  • mTLS everywhere -- node-to-node tunnel traffic is encrypted with TLS 1.3
  • Authorization policies -- control which services can communicate using ALLOW/DENY rules
  • Automatic certificate rotation -- certificates are renewed at 80% of their 24-hour lifetime

Architecture

flowchart TB
    subgraph Node1["Node 1 (NovaEdge Agent)"]
        Pod1["Pod A<br/>(client)"] -->|"ClusterIP:port"| IPT1["iptables TPROXY<br/>NOVAEDGE_MESH chain"]
        IPT1 -->|"redirect"| TL1["Transparent Listener<br/>:15001"]
        TL1 --> PD1["Protocol Detect"]
        PD1 --> ST1["Service Table<br/>Lookup"]
        ST1 --> TP1["Tunnel Pool<br/>(HTTP/2 mTLS client)"]
    end

    TP1 -->|"mTLS HTTP/2<br/>CONNECT :15002"| TS2

    subgraph Node2["Node 2 (NovaEdge Agent)"]
        TS2["Tunnel Server<br/>:15002"] --> AZ2["Authorizer<br/>(ALLOW/DENY)"]
        AZ2 --> Pod2["Pod B<br/>(backend)"]
    end

    subgraph Controller["NovaEdge Controller"]
        CA["Mesh CA<br/>(ECDSA P-384)"]
        SB["Config Snapshot<br/>Builder"]
    end

    CA -.->|"Sign CSR<br/>(gRPC)"| TL1
    SB -.->|"Push services +<br/>authz policies"| Node1
    SB -.->|"Push services +<br/>authz policies"| Node2

    style Pod1 fill:#e1f5ff
    style Pod2 fill:#e1f5ff
    style CA fill:#fff4e6
    style AZ2 fill:#f3e5f5
    style TS2 fill:#e8f5e9
    style TL1 fill:#e8f5e9

Components

| Component | File | Port | Purpose |
| --- | --- | --- | --- |
| TPROXY Manager | internal/agent/mesh/tproxy.go | -- | Manages iptables mangle rules in NOVAEDGE_MESH chain |
| Transparent Listener | internal/agent/mesh/listener.go | 15001 | Accepts TPROXY-redirected connections |
| Protocol Detector | internal/agent/mesh/detect.go | -- | Peeks at first bytes to identify HTTP/1, HTTP/2, TLS, or opaque TCP |
| Service Table | internal/agent/mesh/manager.go | -- | Maps ClusterIP:port to backend endpoints with round-robin LB |
| Tunnel Server | internal/agent/mesh/tunnel.go | 15002 | HTTP/2 CONNECT server for incoming mTLS tunnels |
| Tunnel Pool | internal/agent/mesh/tunnel.go | -- | Persistent HTTP/2 client pool for outbound tunnels |
| TLS Provider | internal/agent/mesh/tls.go | -- | Manages TLS certificates with mutex-protected rotation |
| Certificate Requester | internal/agent/mesh/cert.go | -- | Generates CSR, requests cert from controller, auto-renews |
| Authorizer | internal/agent/mesh/authz.go | -- | Evaluates ALLOW/DENY policies per service |
| Mesh CA | internal/controller/meshca/ca.go | -- | Controller-side CA that signs workload certificates |

Enabling the Service Mesh

Annotate services

Add the novaedge.io/mesh annotation to any Kubernetes Service you want to enroll:

apiVersion: v1
kind: Service
metadata:
  name: my-backend
  annotations:
    novaedge.io/mesh: "enabled"
spec:
  selector:
    app: my-backend
  ports:
    - port: 8080
      targetPort: 8080

When the NovaEdge controller detects this annotation, it includes the service in the InternalService list pushed to agents via ConfigSnapshot. The agent then creates iptables TPROXY rules to intercept traffic to the service's ClusterIP.
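
If it helps to picture the controller side, enrollment boils down to an annotation check. The Go sketch below is illustrative only; isMeshEnabled is a hypothetical helper, not part of the NovaEdge code base:

package mesh

import corev1 "k8s.io/api/core/v1"

// isMeshEnabled reports whether a Service has opted into the mesh via the
// novaedge.io/mesh annotation. Only the exact value "enabled" enrolls it.
func isMeshEnabled(svc *corev1.Service) bool {
    return svc.Annotations["novaedge.io/mesh"] == "enabled"
}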

Disable mesh for a service

Remove the annotation or set it to any value other than "enabled":

annotations:
  novaedge.io/mesh: "disabled"

The agent will remove the corresponding TPROXY rules on the next config reconciliation.

How TPROXY Interception Works

NovaEdge uses iptables TPROXY in the mangle table to intercept traffic destined for mesh-enrolled ClusterIP services without modifying the packets. This preserves the original destination address so the agent can look up the correct backend.

Packet flow

sequenceDiagram
    participant App as Pod A (client)
    participant IPT as iptables (mangle)
    participant TL as Transparent Listener (:15001)
    participant ST as Service Table
    participant Backend as Pod B (backend)

    App->>IPT: TCP SYN to 10.43.0.50:8080 (ClusterIP)
    Note over IPT: PREROUTING -> NOVAEDGE_MESH chain<br/>Match: -d 10.43.0.50 --dport 8080<br/>Action: TPROXY --on-port 15001<br/>Set fwmark 0x1

    IPT->>TL: Connection redirected (original dst preserved)
    TL->>TL: Extract original destination (local address of the accepted socket)
    TL->>TL: DetectProtocol (peek first 16 bytes)
    TL->>ST: Lookup("10.43.0.50", 8080)
    ST-->>TL: Endpoint{address: "10.42.3.15", port: 8080}
    TL->>Backend: TCP connect to 10.42.3.15:8080
    Note over TL,Backend: Bidirectional proxy (io.Copy)

iptables rules created

The TPROXY manager creates the following network configuration:

# 1. Custom chain in the mangle table
iptables -t mangle -N NOVAEDGE_MESH

# 2. Jump from PREROUTING to the custom chain
iptables -t mangle -A PREROUTING -j NOVAEDGE_MESH

# 3. Per-service TPROXY rules (one per ClusterIP:port)
iptables -t mangle -A NOVAEDGE_MESH \
  -p tcp -d 10.43.0.50 --dport 8080 \
  -j TPROXY --tproxy-mark 0x1/0x1 --on-port 15001

# 4. Policy routing to deliver marked packets locally
ip rule add fwmark 1 lookup 100
ip route add local 0.0.0.0/0 dev lo table 100

Rules are reconciled on every config update: new services get rules added, removed services get rules deleted. On shutdown, all rules are cleaned up.

Why TPROXY instead of REDIRECT

TPROXY preserves the original destination address on the intercepted socket itself, so the transparent listener can read it directly from the accepted connection's local address (getsockname). With REDIRECT/DNAT, the packet's destination is rewritten to the proxy's address and must be recovered from the connection-tracking table via the SO_ORIGINAL_DST socket option. TPROXY also avoids the connection-tracking overhead of NAT.
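
To accept TPROXY-redirected connections whose destination address is not local, the listening socket needs the IP_TRANSPARENT option; the original destination is then simply the accepted connection's local address. The following Go sketch illustrates the idea using golang.org/x/sys/unix. It is a minimal sketch, not the NovaEdge listener implementation:

package mesh

import (
    "context"
    "log"
    "net"
    "syscall"

    "golang.org/x/sys/unix"
)

// newTransparentListener opens a TCP listener with IP_TRANSPARENT set so the
// kernel will deliver TPROXY-redirected connections whose destination
// (the ClusterIP) is not an address assigned to this host.
func newTransparentListener(addr string) (net.Listener, error) {
    lc := net.ListenConfig{
        Control: func(network, address string, c syscall.RawConn) error {
            var optErr error
            if err := c.Control(func(fd uintptr) {
                optErr = unix.SetsockoptInt(int(fd), unix.SOL_IP, unix.IP_TRANSPARENT, 1)
            }); err != nil {
                return err
            }
            return optErr
        },
    }
    return lc.Listen(context.Background(), "tcp", addr)
}

// handleIntercepted shows where the original destination comes from: with
// TPROXY, the accepted socket's local address is the ClusterIP:port the
// client dialed, so no conntrack lookup is required.
func handleIntercepted(conn net.Conn) {
    defer conn.Close()
    orig := conn.LocalAddr().(*net.TCPAddr)
    log.Printf("intercepted connection for %s", orig.String())
}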

How the mTLS Tunnel Works

When a connection needs to reach a pod on a different node, the agent establishes an HTTP/2 CONNECT tunnel through the peer agent's tunnel server. All tunnel traffic is encrypted with mTLS using SPIFFE certificates.

sequenceDiagram
    participant Client as Source Agent (Node 1)
    participant Server as Dest Agent (Node 2)
    participant Backend as Backend Pod

    Note over Client,Server: TLS 1.3 handshake<br/>ALPN: h2<br/>Both sides present SPIFFE certs

    Client->>Server: HTTP/2 CONNECT 10.42.3.15:8080<br/>X-NovaEdge-Source-ID: spiffe://cluster.local/agent/node-1<br/>X-NovaEdge-Dest-Service: my-backend.default

    Server->>Server: Verify client cert (mTLS)
    Server->>Server: Extract SPIFFE ID from peer cert
    Server->>Server: Authorize(source, dest, method, path)

    alt Authorized
        Server->>Backend: TCP connect to 10.42.3.15:8080
        Server-->>Client: 200 OK
        Note over Client,Backend: Bidirectional data over HTTP/2 stream
    else Denied
        Server-->>Client: 403 Forbidden
    end

Tunnel configuration

| Parameter | Value | Description |
| --- | --- | --- |
| Port | 15002 | Tunnel server listen port |
| TLS version | TLS 1.3 minimum | Enforced via MinVersion: tls.VersionTLS13 |
| Client auth | RequireAndVerifyClientCert | Both sides must present valid certificates |
| ALPN | h2 | HTTP/2 protocol negotiation |
| Connect timeout | 5 seconds | Timeout for dialing backend pods |

Connection pooling

The TunnelPool maintains persistent HTTP/2 connections to peer agents, keyed by node address. Multiple tunnel streams are multiplexed over a single TLS connection, reducing handshake overhead for subsequent requests to the same node.
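
The sketch below shows the general shape of one outbound tunnel stream: a TLS connection with ALPN h2 to the peer agent, turned into an HTTP/2 client connection, over which a CONNECT request opens a bidirectional stream. It uses golang.org/x/net/http2; dialTunnel and its signature are illustrative, not the actual TunnelPool API:

package mesh

import (
    "crypto/tls"
    "fmt"
    "io"
    "net/http"
    "net/url"

    "golang.org/x/net/http2"
)

// dialTunnel opens a single CONNECT stream to a peer agent's tunnel server
// and returns the write half (request body) and read half (response body)
// of the proxied byte stream. A real pool would cache the ClientConn per
// node address and multiplex many streams over it.
func dialTunnel(peerAddr, backendAddr, destService string, cfg *tls.Config) (io.WriteCloser, io.ReadCloser, error) {
    // cfg is expected to carry the SPIFFE client certificate, the mesh CA
    // pool, MinVersion tls.VersionTLS13, and NextProtos ["h2"].
    conn, err := tls.Dial("tcp", peerAddr, cfg)
    if err != nil {
        return nil, nil, err
    }
    cc, err := (&http2.Transport{}).NewClientConn(conn)
    if err != nil {
        return nil, nil, err
    }

    pr, pw := io.Pipe()
    req := &http.Request{
        Method: http.MethodConnect,
        URL:    &url.URL{Scheme: "https", Host: backendAddr},
        Host:   backendAddr, // becomes the :authority the server dials
        Header: http.Header{"X-NovaEdge-Dest-Service": {destService}},
        Body:   pr,
    }
    resp, err := cc.RoundTrip(req)
    if err != nil {
        return nil, nil, err
    }
    if resp.StatusCode != http.StatusOK {
        resp.Body.Close()
        return nil, nil, fmt.Errorf("tunnel rejected: %s", resp.Status)
    }
    // Bytes written to pw are forwarded to the backend; resp.Body carries
    // the backend's reply.
    return pw, resp.Body, nil
}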

Certificate Lifecycle

NovaEdge uses SPIFFE-compatible workload certificates for mesh identity. The certificate lifecycle is fully automatic.

sequenceDiagram
    participant Agent as Node Agent
    participant CR as Cert Requester
    participant Controller as Controller (Mesh CA)

    Note over Agent: Agent starts mesh manager

    CR->>CR: Generate ECDSA P-256 key pair
    CR->>CR: Build CSR with SPIFFE URI SAN<br/>spiffe://cluster.local/agent/<node-name>
    CR->>Controller: gRPC: RequestMeshCertificate(CSR, nodeName)

    Controller->>Controller: Verify CSR signature
    Controller->>Controller: Sign with CA key (ECDSA P-384)<br/>Validity: 24 hours<br/>ExtKeyUsage: ClientAuth + ServerAuth
    Controller-->>CR: Certificate + CA bundle + SPIFFE ID + Expiry

    CR->>Agent: UpdateTLSCertificate(cert, key, ca, spiffeID)
    Agent->>Agent: TLSProvider updates cert under write lock
    Note over Agent: Tunnel server and pool use<br/>dynamic TLS callbacks (read lock)

    Note over CR: Wait for 80% of lifetime (19.2h)

    CR->>CR: Generate new key pair + CSR
    CR->>Controller: gRPC: RequestMeshCertificate(CSR, nodeName)
    Controller-->>CR: New certificate
    CR->>Agent: UpdateTLSCertificate(...)
    Note over Agent: Zero-downtime rotation<br/>(mutex-protected swap)
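
The key pair and CSR in the first step can be produced with the standard library alone. The sketch below is illustrative (buildCSR and its signature are not from the NovaEdge code base); it builds a CSR whose only SAN is the agent's SPIFFE URI:

package mesh

import (
    "crypto/ecdsa"
    "crypto/elliptic"
    "crypto/rand"
    "crypto/x509"
    "encoding/pem"
    "net/url"
)

// buildCSR generates a fresh ECDSA P-256 key pair and a PEM-encoded CSR
// carrying the agent identity as a SPIFFE URI SAN, e.g.
// spiffe://cluster.local/agent/node-1.
func buildCSR(trustDomain, nodeName string) (*ecdsa.PrivateKey, []byte, error) {
    key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
    if err != nil {
        return nil, nil, err
    }
    spiffeID := &url.URL{Scheme: "spiffe", Host: trustDomain, Path: "/agent/" + nodeName}
    der, err := x509.CreateCertificateRequest(rand.Reader, &x509.CertificateRequest{
        URIs: []*url.URL{spiffeID}, // SPIFFE URI SAN
    }, key)
    if err != nil {
        return nil, nil, err
    }
    csrPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE REQUEST", Bytes: der})
    return key, csrPEM, nil
}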

Certificate properties

| Property | Value |
| --- | --- |
| Key algorithm | ECDSA P-256 (workload), ECDSA P-384 (CA) |
| SPIFFE URI SAN | spiffe://<trust-domain>/agent/<node-name> |
| Default trust domain | cluster.local |
| Workload cert validity | 24 hours |
| Root CA validity | ~10 years |
| Renewal threshold | 80% of lifetime (19.2 hours for 24h certs) |
| Minimum renewal interval | 30 seconds (prevents tight loops) |
| CSR request timeout | 30 seconds |
| Retry delay on failure | 5 seconds |

Mesh CA

The controller runs an embedded Mesh CA (internal/controller/meshca/) that signs workload certificates:

  • Root CA key: ECDSA P-384, stored in Kubernetes Secret novaedge-mesh-ca in namespace novaedge-system
  • On first startup, the CA generates a new root key and persists it to the Secret
  • On subsequent startups, it loads the existing key from the Secret
  • Issued certificates include SPIFFE URI SANs and both ClientAuth and ServerAuth extended key usage
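
On the signing side, the essence of the operation described above fits in a short function. This is a hedged sketch, not the meshca package itself: it verifies the CSR, copies its SPIFFE URI SAN, and issues a 24-hour certificate with both ClientAuth and ServerAuth extended key usage.

package meshca

import (
    "crypto/ecdsa"
    "crypto/rand"
    "crypto/x509"
    "math/big"
    "time"
)

// signWorkloadCert signs a verified CSR with the CA key and returns the
// DER-encoded certificate. Field values mirror the properties documented
// above; names and error handling are illustrative.
func signWorkloadCert(caCert *x509.Certificate, caKey *ecdsa.PrivateKey, csr *x509.CertificateRequest) ([]byte, error) {
    if err := csr.CheckSignature(); err != nil {
        return nil, err
    }
    serial, err := rand.Int(rand.Reader, new(big.Int).Lsh(big.NewInt(1), 128))
    if err != nil {
        return nil, err
    }
    tmpl := &x509.Certificate{
        SerialNumber: serial,
        URIs:         csr.URIs, // spiffe://<trust-domain>/agent/<node-name>
        NotBefore:    time.Now().Add(-time.Minute),
        NotAfter:     time.Now().Add(24 * time.Hour),
        KeyUsage:     x509.KeyUsageDigitalSignature,
        ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth, x509.ExtKeyUsageServerAuth},
    }
    return x509.CreateCertificate(rand.Reader, tmpl, caCert, csr.PublicKey, caKey)
}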

TLS rotation

The TLSProvider uses dynamic TLS callbacks (GetCertificate, GetClientCertificate, GetConfigForClient) so that certificate rotation is transparent to active connections. New connections automatically use the latest certificate without restarting the tunnel server or pool.
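
A minimal sketch of this pattern, assuming a simple RWMutex-guarded holder (rotatingTLS is a hypothetical type, not the actual TLSProvider):

package mesh

import (
    "crypto/tls"
    "sync"
)

// rotatingTLS keeps the current mesh certificate behind a RWMutex so a
// rotation can swap it without restarting the tunnel server or pool.
type rotatingTLS struct {
    mu   sync.RWMutex
    cert *tls.Certificate
}

// Update installs a freshly issued certificate (write lock).
func (r *rotatingTLS) Update(c *tls.Certificate) {
    r.mu.Lock()
    defer r.mu.Unlock()
    r.cert = c
}

// ServerConfig returns a tls.Config that resolves the certificate per
// handshake (read lock), so new connections pick up rotated certs without
// a listener restart. ClientCAs for verifying peer SPIFFE certificates
// would also be set here.
func (r *rotatingTLS) ServerConfig() *tls.Config {
    return &tls.Config{
        MinVersion: tls.VersionTLS13,
        NextProtos: []string{"h2"},
        ClientAuth: tls.RequireAndVerifyClientCert,
        GetCertificate: func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
            r.mu.RLock()
            defer r.mu.RUnlock()
            return r.cert, nil
        },
    }
}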

Authorization Policies

The mesh authorizer enforces service-level access control. Policies are pushed by the controller as part of the ConfigSnapshot.

Policy evaluation order

  1. DENY policies are evaluated first. If any DENY rule matches, the request is denied immediately.
  2. ALLOW policies are evaluated next. If any ALLOW rule matches, the request is allowed.
  3. If ALLOW policies exist but none match, the request is denied (default-deny when explicit ALLOW rules are present).
  4. If only DENY policies exist and none match, the request is allowed.
  5. If no policies exist for the destination service, the request is allowed (default-allow).
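
Expressed as code, the ordering above reduces to a few lines. The types and the authorize function in this sketch are simplified placeholders, not the actual authorizer API:

package mesh

// Request carries the attributes the authorizer matches on; the real type
// includes the full source identity extracted from the peer certificate.
type Request struct {
    SourceNamespace string
    SourceSPIFFEID  string
    Method, Path    string
}

// Policy is reduced here to a match predicate.
type Policy struct {
    Matches func(Request) bool
}

// authorize applies the evaluation order listed above: DENY first, then
// ALLOW, with default-deny only when explicit ALLOW policies exist.
func authorize(denies, allows []Policy, req Request) bool {
    for _, p := range denies {
        if p.Matches(req) {
            return false // step 1: any matching DENY wins immediately
        }
    }
    for _, p := range allows {
        if p.Matches(req) {
            return true // step 2: a matching ALLOW permits the request
        }
    }
    // steps 3-5: deny only if ALLOW policies were defined but none matched;
    // otherwise (no policies, or only non-matching DENY policies) allow.
    return len(allows) == 0
}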

Policy structure

Policies are defined per target service and include source (from) and destination (to) constraints:

MeshAuthorizationPolicy:
  name: string
  action: "ALLOW" | "DENY"
  target_service: string        # e.g., "my-backend"
  target_namespace: string      # e.g., "default"
  rules:
    - from:                     # Source constraints (empty = match all)
        - namespaces: [...]
          serviceAccounts: [...]
          spiffeIds: [...]      # Glob patterns
      to:                       # Destination constraints (empty = match all)
        - methods: [...]        # HTTP methods (case-insensitive)
          paths: [...]          # Glob patterns

Source matching

| Field | Match type | Example |
| --- | --- | --- |
| namespaces | Exact | ["production", "staging"] |
| serviceAccounts | Exact | ["frontend-sa"] |
| spiffeIds | Glob | ["spiffe://cluster.local/ns/*/sa/frontend-*"] |

Destination matching

| Field | Match type | Example |
| --- | --- | --- |
| methods | Case-insensitive exact | ["GET", "POST"] |
| paths | Glob | ["/api/*", "/health"] |

For opaque TCP connections (non-HTTP), destination rules with methods or paths set will not match. Use source-only rules for L4 authorization.
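
Glob matching can be approximated with the standard library, as in the sketch below. Note that path.Match treats "*" as not crossing "/" boundaries; the semantics of NovaEdge's actual matcher may differ, so treat this as an illustration only:

package mesh

import "path"

// matchGlob returns true if the value matches any pattern. An empty
// pattern list is treated as "match all", mirroring the empty-constraint
// behavior described above.
func matchGlob(patterns []string, value string) bool {
    if len(patterns) == 0 {
        return true
    }
    for _, p := range patterns {
        if ok, err := path.Match(p, value); err == nil && ok {
            return true
        }
    }
    return false
}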

Example: allow only frontend to access backend

# Pushed via ConfigSnapshot (protobuf MeshAuthorizationPolicy)
action: ALLOW
target_service: my-backend
target_namespace: default
rules:
  - from:
      - namespaces: ["default"]
        serviceAccounts: ["frontend-sa"]
    to:
      - methods: ["GET", "POST"]
        paths: ["/api/*"]

Example: deny a specific namespace

action: DENY
target_service: my-backend
target_namespace: default
rules:
  - from:
      - namespaces: ["untrusted"]

Troubleshooting

Check if mesh is active on a node

# Verify the NOVAEDGE_MESH chain exists
iptables -t mangle -L NOVAEDGE_MESH -n -v

# Expected output shows per-service TPROXY rules:
# Chain NOVAEDGE_MESH (1 references)
#  pkts bytes target     prot opt in     out     source     destination
#   142  8520 TPROXY     tcp  --  *      *       0.0.0.0/0  10.43.0.50    tcp dpt:8080 TPROXY redirect 0.0.0.0:15001 mark 0x1/0x1

Check TPROXY routing setup

# Verify the ip rule for fwmark 1
ip rule show | grep "fwmark 0x1"
# Expected (rule priority is assigned automatically, e.g. 32765):
#   from all fwmark 0x1 lookup 100

# Verify the local route in table 100
ip route show table 100
# Expected: local default dev lo scope host

Check the transparent listener

# Verify port 15001 is listening
ss -tlnp | grep 15001

Check the tunnel server

# Verify port 15002 is listening
ss -tlnp | grep 15002

Check certificate status

# Check agent logs for certificate lifecycle events
kubectl logs -n novaedge-system -l app.kubernetes.io/name=novaedge-agent | grep "mesh.*cert"

# Expected log lines:
# "Mesh certificate obtained, scheduling renewal" expiry=... lifetime=24h0m0s renew_in=19h12m0s
# "Mesh certificate applied" spiffe_id=spiffe://cluster.local/agent/node-1

# Verify the CA secret exists
kubectl get secret novaedge-mesh-ca -n novaedge-system

Check mesh service count

# Look for mesh config application in agent logs
kubectl logs -n novaedge-system -l app.kubernetes.io/name=novaedge-agent | grep "Mesh config applied"

# Expected: "Mesh config applied" services=5 intercept_rules=8 routing_entries=8 authz_policies=3

Connection not being intercepted

If traffic to a mesh-enrolled service is not being intercepted:

  1. Verify the service has the annotation: kubectl get svc <name> -o jsonpath='{.metadata.annotations.novaedge\.io/mesh}'
  2. Check that the corresponding iptables rule exists: iptables -t mangle -L NOVAEDGE_MESH -n | grep <clusterIP>
  3. Verify the agent received the service in its config: check agent logs for intercept_rules count
  4. Confirm the transparent listener is accepting connections: ss -tlnp | grep 15001

Tunnel connection failures

# Check for tunnel errors in agent logs
kubectl logs -n novaedge-system -l app.kubernetes.io/name=novaedge-agent | grep -i tunnel

# Common issues:
# - "no mesh TLS certificate loaded" -> cert requester has not obtained a cert yet
# - "CONNECT ... returned status 403" -> authorization policy is denying the connection
# - "Failed to dial backend" -> backend pod is unreachable from the destination node

Authorization denied unexpectedly

# Check authorizer debug logs (set log level to debug)
kubectl logs -n novaedge-system -l app.kubernetes.io/name=novaedge-agent | grep "mesh authorization"

# Expected for denials:
# "mesh authorization denied by DENY policy" policy=... source=... dest=...
# "mesh authorization denied: no ALLOW policy matched" source=... dest=...

Protocol Detection

The transparent listener peeks at the first 16 bytes of each intercepted connection to detect the application protocol:

| Protocol | Detection method | Handling |
| --- | --- | --- |
| HTTP/1.x | Starts with GET, POST, PUT, etc. | L4 proxy (L7 routing planned) |
| HTTP/2 | Starts with PRI * HTTP/2 (connection preface) | L4 proxy |
| TLS | Starts with 0x16 0x03 (ClientHello) | L4 proxy |
| Opaque TCP | None of the above | L4 proxy (passthrough) |

All protocols are currently proxied as L4 TCP. HTTP-aware routing (L7 mesh) is planned for a future release.
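
A detector along these lines can be written with a buffered reader so the peeked bytes are not consumed before proxying. This is a simplified sketch of the technique, not the detect.go implementation (a production detector would also set a read deadline while peeking):

package mesh

import (
    "bufio"
    "bytes"
    "net"
)

// detectProtocol peeks at the first bytes of an intercepted connection and
// classifies it as in the table above. The returned bufio.Reader still
// holds the peeked bytes, so they are forwarded unchanged.
func detectProtocol(conn net.Conn) (string, *bufio.Reader, error) {
    br := bufio.NewReader(conn)
    prefix, err := br.Peek(16)
    if err != nil && len(prefix) == 0 {
        return "", br, err
    }
    switch {
    case bytes.HasPrefix(prefix, []byte("PRI * HTTP/2")):
        return "http2", br, nil // HTTP/2 client connection preface
    case len(prefix) >= 2 && prefix[0] == 0x16 && prefix[1] == 0x03:
        return "tls", br, nil // TLS handshake record (ClientHello)
    case bytes.HasPrefix(prefix, []byte("GET ")),
        bytes.HasPrefix(prefix, []byte("POST ")),
        bytes.HasPrefix(prefix, []byte("PUT ")),
        bytes.HasPrefix(prefix, []byte("HEAD ")),
        bytes.HasPrefix(prefix, []byte("DELETE ")),
        bytes.HasPrefix(prefix, []byte("PATCH ")),
        bytes.HasPrefix(prefix, []byte("OPTIONS ")):
        return "http1", br, nil
    default:
        return "tcp", br, nil // opaque TCP passthrough
    }
}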

Related Documentation

  • TLS -- TLS certificate management for ingress traffic
  • Policies -- Rate limiting, authentication, and WAF policies for north-south traffic
  • VIP Management -- Virtual IP management for external access