The microservices monitoring crisis

Traditional uptime monitoring asks: "Is your website responding?"

A modern microservices architecture asks: "Are all 47 microservices responding, and are they responding correctly to each other?"

A microservices platform consists of dozens of interconnected services:

API Gateway → Auth Service → User Service → Database
             → Payment Service → Stripe
             → Notification Service → SendGrid
             → Reporting Service → Data Warehouse
             → Search Service → Elasticsearch
             → Cache → Redis

If just ONE of these services fails:

Auth Service down = nobody can log in
Payment Service down = customers cannot pay
Search Service down = customers cannot find products
Redis cache down = system becomes slow (cascading failure)

Traditional uptime monitoring misses this complexity entirely.

Why microservices monitoring is different

1. Distributed failures

In monolithic systems, failure is largely binary: up or down.

In microservices, failures are partial and they cascade:

Scenario 1: Search Service Slow
  - Search requests take 10 seconds (normally 200ms)
  - Frontend requests time out waiting for search results
  - API request latency spikes 10x
  - Users see "page loading slowly"
  - Not detected by simple uptime checks (API is technically responding)

Scenario 2: Cache Service Degraded
  - Redis cache hit rate drops from 90% to 20%
  - Database gets roughly 8x more queries (miss rate rises from 10% to 80%)
  - Database CPU spikes to 100%
  - All requests slow down (even unrelated ones)
  - Cascading failure across entire system

Scenario 3: Service Communication Failure
  - Auth Service is up
  - User Service is up
  - But network issue prevents Auth → User communication
  - Users can't log in
  - Both services show as "up" in monitoring
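
The load shift in Scenario 2 can be estimated with simple arithmetic. The sketch below assumes every cache miss turns into exactly one database query, which is a simplification; the function name is illustrative.

```python
def db_load_multiplier(hit_rate_before: float, hit_rate_after: float) -> float:
    """Return how many times more requests reach the database after the hit rate drops.

    Assumes each cache miss results in exactly one database query.
    """
    miss_before = 1.0 - hit_rate_before
    miss_after = 1.0 - hit_rate_after
    if miss_before <= 0:
        raise ValueError("hit_rate_before must be below 1.0")
    return miss_after / miss_before


# Hit rate falling from 90% to 20%: misses go from 10% to 80% of requests.
print(round(db_load_multiplier(0.90, 0.20), 1))
```

A drop in hit rate that looks moderate on a dashboard can therefore multiply database load, which is why the cache hit rate deserves its own alert.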

2. Kubernetes-specific failures

Kubernetes adds complexity:

Pod Restart Cycle:
  - Pod crashes due to memory leak
  - Kubernetes automatically restarts pod
  - Pod is restarting, not responding to traffic
  - Appears "up" (pod exists) but isn't serving requests

Node Failure:
  - Kubernetes node goes down
  - Pods on the failed node are recreated on other nodes
  - Replacement pods are healthy, but capacity is reduced until they are ready
  - Brief request failures during transition

Deployment Issue:
  - New deployment has bug in startup
  - Pods fail to start
  - Existing pods still serving requests
  - System is degraded but not fully down

3. Service mesh complexity

With a service mesh (Istio, Linkerd):

Service Mesh adds another layer:
  - Service → Sidecar → Network → Sidecar → Service
  - Sidecar can inject failures (rate limiting, timeouts)
  - Circuit breakers can trip (preventing requests)
  - Load balancing can be uneven (some instances getting more traffic)

Monitoring must include:
  - Sidecar health
  - Circuit breaker status
  - Load balancing distribution
  - Service mesh control plane health

Microservices monitoring architecture

Level 1: service availability

Monitor each service individually:

Auth Service:
  - Is the service running? (pod count > 0)
  - Is it responding to health checks? (GET /health returns 200)
  - What's the response time? (P95 < 100ms)
  - What's the error rate? (< 0.1%)

Payment Service:
  - Is the service running?
  - Is it responding? (GET /health)
  - Can it communicate with Stripe? (Stripe is responding)
  - What's the latency? (P95 < 500ms)
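
As a rough illustration of the per-service checks above, the following sketch evaluates a window of latency samples against a P95 limit and an error-rate limit. The function names and the nearest-rank percentile method are assumptions for this example; a real setup would normally let the metrics backend compute percentiles.

```python
from typing import List


def p95(latencies_ms: List[float]) -> float:
    """Nearest-rank 95th percentile of a list of latency samples."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def check_service(latencies_ms: List[float], error_count: int, request_count: int,
                  p95_limit_ms: float, error_rate_limit: float) -> List[str]:
    """Return a list of threshold violations; an empty list means healthy."""
    problems = []
    observed = p95(latencies_ms)
    if observed >= p95_limit_ms:
        problems.append(f"P95 latency {observed}ms >= {p95_limit_ms}ms")
    error_rate = error_count / request_count if request_count else 0.0
    if error_rate >= error_rate_limit:
        problems.append(f"error rate {error_rate:.4%} >= {error_rate_limit:.4%}")
    return problems


# Auth Service thresholds from the list above: P95 < 100ms, error rate < 0.1%.
print(check_service([20] * 90 + [300] * 10, error_count=0, request_count=1000,
                    p95_limit_ms=100, error_rate_limit=0.001))
```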

Level 2: service communication

Monitor service-to-service communication:

Auth Service → User Service:
  - Can Auth Service reach User Service? (network connectivity)
  - What's the latency? (P95 < 50ms)
  - What's the error rate? (< 0.01%)

User Service → Database:
  - Can User Service reach database? (connection pool health)
  - Is connection pool exhausted? (waiting for connections)
  - What's the query latency? (P95 < 100ms)

Level 3: distributed transaction tracing

Monitor complete request flows:

User Login Flow:
  1. API Gateway receives login request
  2. Routes to Auth Service (latency: 10ms)
  3. Auth Service queries User Service (latency: 15ms)
  4. User Service queries database (latency: 20ms)
  5. Auth Service signs JWT token (latency: 5ms)
  6. API Gateway returns response (latency: 2ms)

Total latency: 52ms (acceptable)

If User Service query takes 500ms instead of 20ms:
  - Total latency jumps to 532ms
  - User sees slow login
  - Application performance degrades
  - Monitoring catches the root cause (User Service query slow)
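
The latency breakdown above can be reproduced with a few lines of code. This is a minimal sketch that treats the login flow as a flat list of sequential spans; real tracing systems model nested and parallel spans, and the span names here are only labels for this example.

```python
from typing import List, Tuple

Span = Tuple[str, int]  # (step name, latency in milliseconds)


def summarize_trace(spans: List[Span]) -> Tuple[int, Span]:
    """Return the total latency and the slowest span of a sequential trace."""
    total_ms = sum(latency for _, latency in spans)
    slowest = max(spans, key=lambda span: span[1])
    return total_ms, slowest


healthy = [
    ("API Gateway -> Auth Service", 10),
    ("Auth Service -> User Service", 15),
    ("User Service -> database", 20),
    ("Auth Service signs JWT", 5),
    ("API Gateway returns response", 2),
]
# Same flow, but the database query takes 500ms instead of 20ms.
degraded = [(name, 500 if name == "User Service -> database" else ms)
            for name, ms in healthy]

print(summarize_trace(healthy))
print(summarize_trace(degraded))
```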

Level 4: Kubernetes-specific monitoring

Monitor the Kubernetes infrastructure:

Pod Health:
  - Pod restart count (alert if > 5 in 24h)
  - Pod memory usage (alert if approaching limits)
  - Pod CPU usage (alert if sustained > 80%)

Node Health:
  - Node status (Ready, NotReady, Unknown)
  - Node disk pressure (alert if > 85% full)
  - Node memory pressure (alert if < 10% available)

Deployment Health:
  - Desired replicas vs. ready replicas (alert if mismatch)
  - Pod creation latency (alert if > 30 seconds)
  - Image pull errors (alert if pods can't start)
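
To make the thresholds above concrete, here is a small sketch that turns a snapshot of one deployment into alert messages. The `DeploymentSnapshot` type and its field names are hypothetical; in practice these values would come from the Kubernetes API or a metrics backend.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeploymentSnapshot:
    """Hypothetical point-in-time view of one deployment."""
    name: str
    desired_replicas: int
    ready_replicas: int
    restarts_last_24h: int
    cpu_utilization: float  # sustained usage as a fraction of the CPU limit


def evaluate(snapshot: DeploymentSnapshot) -> List[str]:
    """Apply the pod and deployment thresholds listed above."""
    alerts = []
    if snapshot.restarts_last_24h > 5:
        alerts.append(f"{snapshot.name}: {snapshot.restarts_last_24h} restarts in 24h")
    if snapshot.cpu_utilization > 0.80:
        alerts.append(f"{snapshot.name}: sustained CPU above 80%")
    if snapshot.ready_replicas != snapshot.desired_replicas:
        alerts.append(
            f"{snapshot.name}: {snapshot.ready_replicas}/{snapshot.desired_replicas} replicas ready"
        )
    return alerts


print(evaluate(DeploymentSnapshot("notification-service", 3, 1, 7, 0.45)))
```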

Real-world example: a microservices monitoring failure

Organization: FinTech platform with 30+ microservices, 50 Kubernetes pods

Architecture:

API Gateway (5 replicas)
Auth Service (3 replicas)
Payment Service (5 replicas)
Notification Service (3 replicas)
Data Pipeline (2 replicas)
Cache (Redis cluster)
Database (PostgreSQL)

The incident: payment processing slow

Timeline:

14:00: Notification Service pods restart (memory leak)
14:05: Notification Service pod creation is slow (Kubernetes scheduler overloaded)
14:10: Payment Service requests to Notification Service time out (service not responding)
14:10: Payment Service starts client-side retries (exponential backoff)
14:15: Retry storm cascades to API Gateway
14:15: API Gateway overloaded, request queue fills up
14:20: All user requests time out
14:20: Incident declared, on-call engineer paged

What monitoring should have caught:

14:01: Alert "Notification Service pod restart count exceeded threshold"
14:06: Alert "Notification Service pod creation latency > 30 seconds"
14:07: Alert "Notification Service health check failing"
14:08: Alert "Payment Service → Notification Service latency spike"
14:09: Alert "Payment Service error rate spike"

Early alerts could have prevented the cascade: by restarting Notification Service manually, or by adjusting the retry strategy in Payment Service.
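
One way to keep retries from turning into a retry storm is to cap both the number of attempts and the delay, and to add random jitter so clients do not retry in lockstep. The sketch below is an assumption about what an adjusted retry strategy could look like, not a description of the platform in this incident; all names and default values are illustrative.

```python
import random
import time


def backoff_delays(max_attempts=4, base_s=0.2, cap_s=5.0, rng=random.random):
    """Yield the wait before each retry: uniform in [0, min(cap, base * 2**n)]."""
    for attempt in range(max_attempts - 1):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        yield rng() * ceiling


def call_with_retries(operation, max_attempts=4, sleep=time.sleep):
    """Call operation(), retrying on ConnectionError within a fixed attempt budget."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = list(backoff_delays(max_attempts))
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the failure instead of retrying forever
            sleep(delays[attempt])
```

The `sleep` parameter is injectable so the behavior can be tested without waiting. Combined with a circuit breaker, a bounded retry budget limits how much extra load a failing dependency can push upstream.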

Implementing microservices monitoring

Option 1: DIY with Prometheus + Grafana

Prometheus:
  - Scrapes /metrics endpoint from each service
  - Stores metrics in time-series database
  - Runs alert rules (if metric exceeds threshold, alert)

Grafana:
  - Visualizes metrics from Prometheus
  - Per-service and cross-service dashboards

Implementation: Open-source, free, requires infrastructure knowledge
Cost: Low for tooling (infrastructure only), high in engineering time
Suitable for: Teams with DevOps expertise
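
For a sense of what Prometheus scrapes, the sketch below renders a counter in the Prometheus text exposition format using only the standard library. Real services would normally use an official client library instead; the metric name and labels are made up for this example, and label values are assumed not to need escaping.

```python
from typing import Dict, List, Tuple


def render_counter(name: str, help_text: str,
                   samples: List[Tuple[Dict[str, str], int]]) -> str:
    """Render one counter metric as Prometheus text exposition lines."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        # Sort labels so the output is stable between scrapes.
        label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# What a /metrics response body for one counter could look like.
print(render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [({"service": "auth", "status": "200"}, 1027),
     ({"service": "auth", "status": "500"}, 3)],
), end="")
```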

Option 2: managed service (Datadog, New Relic)

Datadog/New Relic Agent:
  - Typically runs as a DaemonSet (one agent per node)
  - Collects metrics, logs, traces
  - Sends to managed service

Dashboard:
  - Pre-built dashboards for Kubernetes
  - APM for service communication
  - Distributed tracing built-in

Implementation: Vendor tool, complex configuration, powerful
Cost: High in licensing (per-host/per-GB pricing), low in engineering time
Suitable for: Teams with budget, less ops expertise

Option 3: service mesh with built-in monitoring

Istio/Linkerd:
  - Sidecar proxy intercepts all service communication
  - Automatically tracks latency, errors, traffic
  - Provides observability without code changes

Monitoring:
  - Service mesh dashboards show service dependencies
  - Circuit breaker status
  - Traffic distribution
  - Request latency per route

Implementation: Infrastructure-level change, powerful but complex
Cost: Infrastructure overhead (sidecar CPU/memory)
Suitable for: Large organizations with many services

Microservices monitoring checklist

Per service

☐ Health check endpoint implemented (/health returns 200 when healthy)
☐ Metrics endpoint exposed (/metrics for Prometheus scraping)
☐ Service availability monitored (pod running, health check passing)
☐ Service response time monitored (P95 latency tracked)
☐ Service error rate monitored (4xx, 5xx errors tracked)
☐ Service logs centralized (ELK, Splunk, or similar)

Per service-to-service connection

☐ Latency between services monitored (service A → service B)
☐ Error rate per service communication tracked
☐ Circuit breaker status monitored (if applicable)
☐ Timeout handling verified (proper retry logic)

Per Kubernetes cluster

☐ Pod health monitored (restart count, memory, CPU)
☐ Node health monitored (status, disk, memory)
☐ Deployment health monitored (desired vs. ready replicas)
☐ Persistent volume health monitored (available space, I/O errors)
☐ Namespace resource usage monitored (CPU, memory limits)

Distributed tracing

☐ Request traces collected (request ID propagated through services)
☐ Trace visualization available (see request flow)
☐ Trace latency breakdown visible (which service is slow?)
☐ Trace error detection (which service failed?)

Best practices for microservices monitoring

1. Correlation IDs for distributed tracing

Every request should carry a unique ID that is passed along through all services:

User Request:
  Header: X-Request-ID: 12345

Service A:
  logs: "Request 12345: Started"
  calls Service B with header: X-Request-ID: 12345

Service B:
  logs: "Request 12345: Received from Service A"
  calls Service C with header: X-Request-ID: 12345

Service C:
  logs: "Request 12345: Processing complete"

Later, retrieve all logs for request 12345 to see full flow
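
A minimal sketch of that propagation, assuming headers are available as a plain dictionary. The helper names are hypothetical; web frameworks and tracing libraries usually provide middleware that does the same job.

```python
import uuid
from typing import Dict, Optional

REQUEST_ID_HEADER = "X-Request-ID"


def ensure_request_id(incoming_headers: Dict[str, str]) -> str:
    """Reuse the caller's request ID, or create one if this is the first hop."""
    for key, value in incoming_headers.items():
        if key.lower() == REQUEST_ID_HEADER.lower() and value:
            return value
    return str(uuid.uuid4())


def outgoing_headers(request_id: str,
                     extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Build headers for a downstream call that carry the same request ID."""
    headers = dict(extra or {})
    headers[REQUEST_ID_HEADER] = request_id
    return headers


# Service A receives a request and calls Service B with the same ID.
request_id = ensure_request_id({"x-request-id": "12345"})
print(f"Request {request_id}: Started")
print(outgoing_headers(request_id, {"Accept": "application/json"}))
```

Header names are matched case-insensitively because HTTP header names are not case-sensitive.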

2. Structured logging

Don't log: "Error occurred". Do log: JSON with context.

{
  "timestamp": "2026-02-20T14:32:15Z",
  "request_id": "12345",
  "service": "payment-service",
  "level": "error",
  "message": "Payment authorization failed",
  "error_code": "STRIPE_AUTH_FAILED",
  "attempt": 1,
  "latency_ms": 2500,
  "status_code": 503
}
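
A log line like the one above can be produced with Python's standard `logging` module and a small custom formatter. This is a sketch under the assumption that request context is passed through the `extra` argument under a `context` key; that key name is a convention chosen for this example, not something `logging` defines.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "service": self.service,
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Merge per-request context supplied via extra={"context": {...}}.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(service: str, stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


log = build_logger("payment-service")
log.error("Payment authorization failed", extra={"context": {
    "request_id": "12345", "error_code": "STRIPE_AUTH_FAILED",
    "attempt": 1, "latency_ms": 2500, "status_code": 503}})
```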

3. Circuit breaker pattern

Implement circuit breakers to prevent cascading failures:

Normal:
  Payment Service → calls Stripe API
  Stripe returns 200 (success)
  Request continues

Failure Mode (Circuit Open):
  Stripe API returns 503 (service down)
  After 5 failures, circuit breaker opens
  Subsequent requests fail fast (no Stripe call attempt)
  Failed requests return in about 50ms instead of waiting 2s on Stripe
  System degrades gracefully instead of cascading failure

Recovery:
  Circuit breaker half-open (test 1 request)
  If successful, slowly increase requests to Stripe
  If fails, circuit goes back to open
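
The three states above can be sketched as a small class. This is a simplified, single-threaded illustration: it closes the circuit after one successful trial call instead of ramping traffic up gradually, and the class name, defaults, and injectable clock are assumptions made for this example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout_s:
            return "half-open"  # timeout elapsed: allow a trial request
        return "open"

    def call(self, operation):
        state = self.state
        if state == "open":
            # Fail fast without touching the dependency.
            raise CircuitOpenError("dependency presumed down")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Exposing `state` as a property also gives monitoring something to scrape, which ties back to the "circuit breaker status" item in the checklist.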

Kubernetes-specific monitoring pitfalls

Pitfall 1: pod IP changes when a pod is replaced

When a pod is replaced (for example after an eviction, a node failure, or a rollout), the new pod gets a different IP. A container restart inside the same pod keeps the IP. Service endpoints, and DNS records for headless services, must be updated to point at the new pod.

Monitoring must account for this:

Monitor by service name, not by pod IP
Use Kubernetes DNS (service-name.namespace.svc.cluster.local)
Monitor Kubernetes service endpoints (are pods registered?)

Pitfall 2: init container delays

Kubernetes init containers run before the main container. If the init container is slow:

Pod Status: "Init:0/1" (pod exists and is scheduled, phase is still Pending)
But actually: Init container still running, main container not started yet
Health checks fail (service not responding yet)

Solutions:

Don't alert on health check failures during the first 60 seconds
Use startup probes (separate from liveness probes)
Monitor init container completion latency
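
The first of these solutions amounts to a grace period in the alerting logic. A minimal sketch, assuming the monitoring system knows the pod's age and counts consecutive failed health checks; the threshold of three consecutive failures is an assumption added for this example.

```python
def should_alert(pod_age_s: float, consecutive_failures: int,
                 grace_period_s: float = 60.0, failure_threshold: int = 3) -> bool:
    """Suppress health check alerts while a pod is still starting up."""
    if pod_age_s < grace_period_s:
        return False  # still inside the startup grace period
    return consecutive_failures >= failure_threshold


print(should_alert(pod_age_s=20, consecutive_failures=4))   # young pod: suppressed
print(should_alert(pod_age_s=300, consecutive_failures=4))  # established pod: alert
```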

Pitfall 3: resource limits hurt performance

If the pod's CPU limit is too low:

Pod Status: Running
But actually: CPU throttled, latency spike
Monitoring shows a latency spike while CPU usage sits pinned at the limit because the container is being throttled

Solution: Monitor both CPU usage AND CPU throttling
Alert if CPU throttled > 20% of the time
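
The throttling percentage can be derived from two counters that the Linux CFS scheduler keeps per container: the number of enforcement periods and the number of periods in which the container was throttled. cAdvisor exposes these as `container_cpu_cfs_periods_total` and `container_cpu_cfs_throttled_periods_total`. The sketch below assumes two readings of those counters taken some time apart; the function names are illustrative.

```python
def throttled_fraction(periods_start: int, throttled_start: int,
                       periods_end: int, throttled_end: int) -> float:
    """Fraction of CFS periods in which the container was throttled between two readings."""
    elapsed_periods = periods_end - periods_start
    if elapsed_periods <= 0:
        return 0.0  # no new periods (or a counter reset): nothing to report
    return (throttled_end - throttled_start) / elapsed_periods


def should_alert_on_throttling(fraction: float, limit: float = 0.20) -> bool:
    """Apply the 'throttled more than 20% of the time' rule from above."""
    return fraction > limit


fraction = throttled_fraction(1000, 50, 1600, 230)
print(fraction, should_alert_on_throttling(fraction))
```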

Nova Uptime for microservices monitoring

Nova Uptime's free tier can monitor microservices:

Service health checks: monitor the health endpoints of each service
API monitoring: track response times and error rates
Webhook monitoring: verify service-to-service communication
Email monitoring: verify that the notification service delivers emails
Public API monitoring: if services expose public APIs

For comprehensive microservices monitoring, use specialized tools (Prometheus, Datadog, New Relic). Nova Uptime can complement them by monitoring external service health and customer-facing APIs.

Summary: microservices monitoring beyond simple uptime

Microservices call for more advanced monitoring than traditional systems.

Your action plan:

Implement health check endpoints in every microservice
Expose metrics (/metrics endpoint for Prometheus)
Add correlation IDs to all request flows
Implement circuit breakers to prevent cascading failures
Use structured logging with request context
Monitor service communication (latency, errors)
Monitor Kubernetes infrastructure (pods, nodes, deployments)
Set up distributed tracing (see the full request flow)
Build per-service dashboards (visibility into each service)
Alert on degradation (not only on complete failures)

Start with Nova Uptime to monitor your customer-facing APIs. Add health check monitoring. Supplement with Prometheus/Grafana or a managed service for deeper infrastructure monitoring.

Your microservices are complex. Your monitoring has to match that complexity.

Monitoring microservices and Kubernetes: beyond simple uptime checks