Monitoring microservices on Kubernetes: beyond simple uptime checks
Microservices call for distributed monitoring. Learn how to monitor service dependencies, orchestration health, and distributed failures.
The microservices monitoring crisis
Traditional uptime monitoring asks: "Is your website responding?"
Modern microservices architecture asks: "Are all 47 microservices responding, and are they responding correctly to each other?"
A microservices platform consists of dozens of interconnected services:
API Gateway → Auth Service → User Service → Database
            → Payment Service → Stripe
            → Notification Service → SendGrid
            → Reporting Service → Data Warehouse
            → Search Service → Elasticsearch
            → Cache → Redis
If just ONE of these services fails:
- Auth Service down = nobody can log in
- Payment Service down = customers cannot pay
- Search Service down = customers cannot find products
- Redis cache down = the system slows down (cascading failure)
Traditional uptime monitoring misses this complexity entirely.
Why microservices monitoring is different
1. Distributed failures
In monolithic systems, failure is binary: up or down.
In microservices, failures are partial and they cascade:
Scenario 1: Search Service Slow
- Search requests take 10 seconds (normally 200ms)
- Frontend requests time out waiting for search results
- API request latency spikes 10x
- Users see "page loading slowly"
- Not detected by simple uptime checks (API is technically responding)
Scenario 2: Cache Service Degraded
- Redis cache hit rate drops from 90% to 20%
- Database gets 8x more queries (cache miss rate rises from 10% to 80%)
- Database CPU spikes to 100%
- All requests slow down (even unrelated ones)
- Cascading failure across entire system
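The database load in Scenario 2 follows from the miss rate rather than the hit rate. A minimal sketch of that arithmetic, assuming every cache miss turns into exactly one database read (the function name is illustrative, not from any library):

```python
def db_load_multiplier(old_hit_rate: float, new_hit_rate: float) -> float:
    """Return how many times more reads reach the database after the hit rate changes."""
    old_miss = 1.0 - old_hit_rate  # share of requests that already reached the database
    new_miss = 1.0 - new_hit_rate  # share of requests that reach it after the degradation
    if old_miss <= 0:
        raise ValueError("old hit rate must be below 1.0")
    return new_miss / old_miss


# Hit rate falls from 90% to 20%: misses go from 10% to 80% of traffic.
multiplier = db_load_multiplier(0.90, 0.20)
print(f"Database read load grows roughly {multiplier:.1f}x")
```

A small drop in hit rate near the top of the range matters most: going from 99% to 98% already doubles database reads under the same assumption.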
Scenario 3: Service Communication Failure
- Auth Service is up
- User Service is up
- But network issue prevents Auth → User communication
- Users can't log in
- Both services show as "up" in monitoring
2. Kubernetes-specific failures
Kubernetes adds complexity:
Pod Restart Cycle:
- Pod crashes due to memory leak
- Kubernetes automatically restarts pod
- Pod is restarting, not responding to traffic
- Appears "up" (pod exists) but isn't serving requests
Node Failure:
- Kubernetes node goes down
- Pods are recreated on other nodes (rescheduled, not live-migrated)
- Workloads are fine but temporarily unavailable until the replacement pods are ready
- Brief request failures during transition
Deployment Issue:
- New deployment has bug in startup
- Pods fail to start
- Existing pods still serving requests
- System is degraded but not fully down
3. Service mesh complexity
With a service mesh (Istio, Linkerd):
Service Mesh adds another layer:
- Service → Sidecar → Network → Sidecar → Service
- Sidecar can inject failures (rate limiting, timeouts)
- Circuit breakers can trip (preventing requests)
- Load balancing can be uneven (some instances getting more traffic)
Monitoring must include:
- Sidecar health
- Circuit breaker status
- Load balancing distribution
- Service mesh control plane health
Microservices monitoring architecture
Level 1: service availability
Monitor each service individually:
Auth Service:
- Is the service running? (pod count > 0)
- Is it responding to health checks? (GET /health returns 200)
- What's the response time? (P95 < 100ms)
- What's the error rate? (< 0.1%)
Payment Service:
- Is the service running?
- Is it responding? (GET /health)
- Can it communicate with Stripe? (Stripe is responding)
- What's the latency? (P95 < 500ms)
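The per-service thresholds above can be checked mechanically once latency samples and error counts are available. The following is a rough sketch using a nearest-rank P95; the function names and default limits are assumptions chosen to mirror the Auth Service example, not part of any monitoring tool:

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies (ms)."""
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer form of ceil(0.95 * n)
    return ordered[rank - 1]


def check_service(latencies_ms, error_count, request_count,
                  p95_limit_ms=100, error_rate_limit=0.001):
    """Return a list of human-readable threshold violations (empty if healthy)."""
    violations = []
    observed_p95 = p95(latencies_ms)
    if observed_p95 >= p95_limit_ms:
        violations.append(f"P95 latency {observed_p95}ms >= {p95_limit_ms}ms")
    error_rate = error_count / request_count
    if error_rate >= error_rate_limit:
        violations.append(f"error rate {error_rate:.2%} >= {error_rate_limit:.2%}")
    return violations


# 10% of requests are slow and 0.2% fail: both limits are breached.
for problem in check_service([40] * 90 + [250] * 10, error_count=2, request_count=1000):
    print(problem)
```

A percentile is used instead of an average because a small share of very slow requests barely moves the mean but is exactly what users notice.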
Level 2: service communication
Monitor service-to-service communication:
Auth Service → User Service:
- Can Auth Service reach User Service? (network connectivity)
- What's the latency? (P95 < 50ms)
- What's the error rate? (< 0.01%)
User Service → Database:
- Can User Service reach database? (connection pool health)
- Is connection pool exhausted? (waiting for connections)
- What's the query latency? (P95 < 100ms)
Level 3: distributed transaction tracing
Monitor complete request flows:
User Login Flow:
1. API Gateway receives login request
2. Routes to Auth Service (latency: 10ms)
3. Auth Service queries User Service (latency: 15ms)
4. User Service queries database (latency: 20ms)
5. Auth Service signs JWT token (latency: 5ms)
6. API Gateway returns response (latency: 2ms)
Total latency: 52ms (acceptable)
If User Service query takes 500ms instead of 20ms:
- Total latency jumps to 532ms
- User sees slow login
- Application performance degrades
- Monitoring catches the root cause (User Service query slow)
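The latency breakdown for the login flow can be reproduced with a few lines. This sketch models each step as a (name, milliseconds) pair; the span names are made up for illustration, and a real tracing system would supply them from collected traces:

```python
healthy_login = [
    ("gateway -> auth-service", 10),
    ("auth-service -> user-service", 15),
    ("user-service -> database", 20),
    ("auth-service: sign JWT", 5),
    ("gateway: return response", 2),
]


def summarize(spans):
    """Return (total latency in ms, name of the slowest span)."""
    total = sum(ms for _, ms in spans)
    slowest_name, _ = max(spans, key=lambda span: span[1])
    return total, slowest_name


# Same flow, but the database query now takes 500ms instead of 20ms.
degraded_login = [
    (name, 500 if name == "user-service -> database" else ms)
    for name, ms in healthy_login
]

print(summarize(healthy_login))
print(summarize(degraded_login))
```

The total alone only says the login is slow; the per-span view is what points at the database query as the cause.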
Level 4: Kubernetes-specific monitoring
Monitor the Kubernetes infrastructure:
Pod Health:
- Pod restart count (alert if > 5 in 24h)
- Pod memory usage (alert if approaching limits)
- Pod CPU usage (alert if sustained > 80%)
Node Health:
- Node status (Ready, NotReady, Unknown)
- Node disk pressure (alert if > 85% full)
- Node memory pressure (alert if < 10% available)
Deployment Health:
- Desired replicas vs. ready replicas (alert if mismatch)
- Pod creation latency (alert if > 30 seconds)
- Image pull errors (alert if pods can't start)
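A few of these rules can be expressed as a plain function over already-collected numbers. The sketch below assumes the data has been gathered elsewhere (for example from the Kubernetes API or kube-state-metrics); the dictionary keys are hypothetical and the CPU value is assumed to be an average over a sustained window:

```python
def kubernetes_alerts(pods, deployments):
    """Return alert strings for the pod and deployment thresholds listed above."""
    alerts = []
    for pod in pods:
        if pod["restarts_24h"] > 5:
            alerts.append(f"{pod['name']}: {pod['restarts_24h']} restarts in 24h")
        if pod["cpu_usage_ratio"] > 0.80:  # assumed to be a windowed average
            alerts.append(f"{pod['name']}: sustained CPU above 80%")
    for dep in deployments:
        if dep["ready_replicas"] != dep["desired_replicas"]:
            alerts.append(
                f"{dep['name']}: {dep['ready_replicas']}/{dep['desired_replicas']} replicas ready"
            )
    return alerts


pods = [
    {"name": "notification-7f9c", "restarts_24h": 8, "cpu_usage_ratio": 0.35},
    {"name": "payment-5d2a", "restarts_24h": 0, "cpu_usage_ratio": 0.91},
]
deployments = [{"name": "notification", "desired_replicas": 3, "ready_replicas": 1}]

for alert in kubernetes_alerts(pods, deployments):
    print(alert)
```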
Real-world example: a microservices monitoring failure
Organization: FinTech platform with 30+ microservices, 50 Kubernetes pods
Architecture:
- API Gateway (5 replicas)
- Auth Service (3 replicas)
- Payment Service (5 replicas)
- Notification Service (3 replicas)
- Data Pipeline (2 replicas)
- Cache (Redis cluster)
- Database (PostgreSQL)
The incident: slow payment processing
Timeline:
- 14:00: Notification Service pods restart (memory leak)
- 14:05: Notification Service pod creation is slow (Kubernetes scheduler overloaded)
- 14:10: Payment Service requests to Notification Service time out (service not responding)
- 14:10: Payment Service starts client-side retries (exponential backoff)
- 14:15: Retry storm cascades to the API Gateway
- 14:15: API Gateway overloaded, request queue fills up
- 14:20: All user requests time out
- 14:20: Incident declared, on-call engineer paged
What monitoring should have caught:
- 14:01: Alert "Notification Service pod restart count exceeded threshold"
- 14:06: Alert "Notification Service pod creation latency > 30 seconds"
- 14:07: Alert "Notification Service health check failing"
- 14:08: Alert "Payment Service → Notification Service latency spike"
- 14:09: Alert "Payment Service error rate spike"
Early alerts could have prevented the cascade, for example by manually restarting Notification Service or by adjusting the retry strategy in Payment Service.
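One common way to make a retry strategy less likely to turn into a retry storm is to cap the number of attempts and add random jitter to the exponential backoff. The sketch below illustrates that idea only; it is not the platform's actual retry code, and the parameter values are assumptions:

```python
import random


def backoff_delays(base_s=0.1, cap_s=5.0, max_attempts=4, rng=None):
    """Return one randomized wait time (seconds) per retry attempt.

    Uses "full jitter": each delay is drawn uniformly between 0 and the
    capped exponential ceiling, so clients do not retry in lockstep.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_attempts):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


# After max_attempts the caller should give up and surface the error
# instead of adding more load to a dependency that is already struggling.
print([round(delay, 3) for delay in backoff_delays()])
```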
Implementing microservices monitoring
Option 1: DIY with Prometheus + Grafana
Prometheus:
- Scrapes /metrics endpoint from each service
- Stores metrics in time-series database
- Runs alert rules (if metric exceeds threshold, alert)
Grafana:
- Visualizes metrics from Prometheus
- Per-service and cross-service dashboards
Implementation: Open-source, free, requires infrastructure knowledge
Cost: Low tooling cost (just infrastructure), high engineering time
Suitable for: Teams with DevOps expertise
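What Prometheus scrapes from a /metrics endpoint is plain text. The sketch below renders a counter in that text exposition format using only the standard library, to show what the payload looks like; in practice a client library would normally generate it, and the metric and label names here are illustrative:

```python
def render_counter(name, help_text, samples):
    """Render one counter metric in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


payload = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [
        ({"service": "auth", "status": "200"}, 1027),
        ({"service": "auth", "status": "500"}, 3),
    ],
)
print(payload, end="")
```

An alert rule on such a counter would typically look at the rate of the status="500" series relative to the total, rather than at the raw counts.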
Option 2: managed service (Datadog, New Relic)
Datadog/New Relic Agent:
- Typically runs as a DaemonSet (one agent pod per node)
- Collects metrics, logs, traces
- Sends to managed service
Dashboard:
- Pre-built dashboards for Kubernetes
- APM for service communication
- Distributed tracing built-in
Implementation: Vendor tool, complex configuration, powerful
Cost: High vendor cost (per-host/per-GB pricing), low engineering time
Suitable for: Teams with budget, less ops expertise
Option 3: service mesh with built-in monitoring
Istio/Linkerd:
- Sidecar proxy intercepts all service communication
- Automatically tracks latency, errors, traffic
- Provides observability without code changes
Monitoring:
- Service mesh dashboards show service dependencies
- Circuit breaker status
- Traffic distribution
- Request latency per route
Implementation: Infrastructure-level change, powerful but complex
Cost: Infrastructure overhead (sidecar CPU/memory)
Suitable for: Large organizations with many services
Microservices monitoring checklist
Per service
☐ Health check endpoint implemented (/health returns 200 when healthy)
☐ Metrics endpoint exposed (/metrics for Prometheus scraping)
☐ Service availability monitored (pod running, health check passing)
☐ Service response time monitored (P95 latency tracked)
☐ Service error rate monitored (4xx, 5xx errors tracked)
☐ Service logs centralized (ELK, Splunk, or similar)
Per service-to-service connection
☐ Latency between services monitored (service A → service B)
☐ Error rate per service communication tracked
☐ Circuit breaker status monitored (if applicable)
☐ Timeout handling verified (proper retry logic)
Per Kubernetes cluster
☐ Pod health monitored (restart count, memory, CPU)
☐ Node health monitored (status, disk, memory)
☐ Deployment health monitored (desired vs. ready replicas)
☐ Persistent volume health monitored (available space, I/O errors)
☐ Namespace resource usage monitored (CPU, memory limits)
Distributed tracing
☐ Request traces collected (request ID propagated through services)
☐ Trace visualization available (see request flow)
☐ Trace latency breakdown visible (which service is slow?)
☐ Trace error detection (which service failed?)
Best practices for microservices monitoring
1. Correlation IDs for distributed tracing
Every request needs a unique ID that is passed along through all services:
User Request:
  Header: X-Request-ID: 12345
Service A:
  logs: "Request 12345: Started"
  calls Service B with header: X-Request-ID: 12345
Service B:
  logs: "Request 12345: Received from Service A"
  calls Service C with header: X-Request-ID: 12345
Service C:
  logs: "Request 12345: Processing complete"
Later, retrieve all logs for request 12345 to see full flow
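The propagation above can be sketched with three in-process functions standing in for the services. This is a simplified illustration: real services would pass the header over HTTP or gRPC, and the helper name `with_request_id` is hypothetical:

```python
import uuid

HEADER = "X-Request-ID"
log_lines = []  # stands in for a centralized log store


def with_request_id(incoming_headers):
    """Reuse the caller's request ID, or create one at the edge if it is missing."""
    request_id = incoming_headers.get(HEADER) or str(uuid.uuid4())
    return request_id, {HEADER: request_id}


def service_c(headers):
    request_id, _ = with_request_id(headers)
    log_lines.append(f"Request {request_id}: Processing complete")


def service_b(headers):
    request_id, outgoing = with_request_id(headers)
    log_lines.append(f"Request {request_id}: Received from Service A")
    service_c(outgoing)  # forward the same ID downstream


def service_a(headers):
    request_id, outgoing = with_request_id(headers)
    log_lines.append(f"Request {request_id}: Started")
    service_b(outgoing)


service_a({HEADER: "12345"})
print("\n".join(line for line in log_lines if "12345" in line))
```

Generating the ID only when it is missing means the first service that sees the request decides the ID, and every later hop reuses it.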
2. Structured logging
Don't log: "Error occurred"
Do log: JSON with context
{
  "timestamp": "2026-02-20T14:32:15Z",
  "request_id": "12345",
  "service": "payment-service",
  "level": "error",
  "message": "Payment authorization failed",
  "error_code": "STRIPE_AUTH_FAILED",
  "attempt": 1,
  "latency_ms": 2500,
  "status_code": 503
}
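A log line of this shape can be produced with the standard library alone. The helper below is a minimal sketch; `log_event` is a hypothetical name, and a production setup would more likely plug a JSON formatter into the logging framework already in use:

```python
import json
from datetime import datetime, timezone


def log_event(service, level, message, request_id, **context):
    """Emit one JSON log line with fixed core fields plus free-form context."""
    record = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "request_id": request_id,
        "service": service,
        "level": level,
        "message": message,
    }
    record.update(context)  # e.g. error_code, attempt, latency_ms
    line = json.dumps(record)
    print(line)
    return line


line = log_event(
    "payment-service", "error", "Payment authorization failed", "12345",
    error_code="STRIPE_AUTH_FAILED", attempt=1, latency_ms=2500, status_code=503,
)
```

Because every field is a key, the log store can filter on `request_id` or `error_code` directly instead of parsing free text.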
3. Circuit breaker pattern
Implement circuit breakers to prevent cascading failures:
Normal:
  Payment Service → calls Stripe API
  Stripe returns 200 (success)
  Request continues
Failure Mode (Circuit Open):
  Stripe API returns 503 (service down)
  After 5 failures, circuit breaker opens
  Subsequent requests fail fast (no Stripe call attempt)
  Latency of failed requests drops from 2s to 50ms (no waiting on Stripe)
  System degrades gracefully instead of cascading failure
Recovery:
  Circuit breaker half-open (test 1 request)
  If successful, slowly increase requests to Stripe
  If fails, circuit goes back to open
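The three states can be captured in a small class. This is a simplified sketch, not a production implementation: it closes the circuit fully after one successful trial call instead of ramping traffic up gradually, it is not thread-safe, and the class and parameter names are assumptions:

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("failing fast, dependency assumed down")
            # Timeout elapsed: half-open, let this single trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```

A caller would wrap each outbound request, for example `breaker.call(charge_card, order)`, where `charge_card` is whatever function talks to the payment provider, and treat `CircuitOpenError` as a signal to degrade gracefully.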
Kubernetes-specific monitoring pitfalls
Pitfall 1: pod IP changes on restart
When a pod is recreated, it gets a new IP address, and the Service endpoints behind the DNS name have to be updated.
Monitoring has to account for this:
- Monitor by service name, not by pod IP
- Use Kubernetes DNS (service-name.namespace.svc.cluster.local)
- Monitor Kubernetes service endpoints (are pods registered?)
Pitfall 2: init container delays
Kubernetes init containers run before the main container. If the init container is slow:
Pod Status: "Init:0/1" (pod exists, phase still Pending)
Actual state: init container still running, main container not started yet
Health checks fail (service not responding yet)
Solutions:
- Don't alert on health check failures during the first 60 seconds
- Use startup probes (distinct from liveness probes)
- Monitor init container completion latency
Pitfall 3: resource limits hurt performance
If the pod's CPU limit is too low:
Pod Status: Running
But actually: CPU throttled, latency spike
Monitoring shows a latency spike while CPU usage sits pinned at the limit (throttled)
Solution: Monitor both CPU usage AND CPU throttling
Alert if CPU throttled > 20% of the time
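The throttling share can be derived from two counters that the container runtime exposes: the number of CPU scheduling periods and the number of those periods in which the container was throttled (cAdvisor publishes them as container_cpu_cfs_periods_total and container_cpu_cfs_throttled_periods_total). The sketch below compares two samples of these counters; the dictionary keys are illustrative:

```python
def throttled_fraction(prev, curr):
    """Fraction of CPU periods that were throttled between two counter samples."""
    periods = curr["periods"] - prev["periods"]
    throttled = curr["throttled_periods"] - prev["throttled_periods"]
    if periods <= 0:
        return 0.0  # no new periods observed (or counters were reset)
    return throttled / periods


earlier = {"periods": 1000, "throttled_periods": 50}
later = {"periods": 1600, "throttled_periods": 230}

fraction = throttled_fraction(earlier, later)
if fraction > 0.20:
    print(f"ALERT: CPU throttled in {fraction:.0%} of periods")
```

Working with the difference between samples matters because both values are ever-increasing counters; the raw totals say little about what is happening right now.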
Nova Uptime for microservices monitoring
Nova Uptime's free tier can monitor microservices:
- Service health checks: monitor the health endpoints of each service
- API monitoring: track response times and error rates
- Webhook monitoring: verify service-to-service communication
- Email monitoring: verify that the notification service delivers emails
- Public API monitoring: for services that expose public APIs
For comprehensive microservices monitoring, use specialized tools (Prometheus, Datadog, New Relic). Nova Uptime can complement them by monitoring external service health and customer-facing APIs.
Summary: microservices monitoring beyond simple uptime
Microservices require more advanced monitoring than traditional systems.
Your action plan:
- Implement health check endpoints in every microservice
- Expose metrics (/metrics endpoint for Prometheus)
- Add correlation IDs to all request flows
- Implement circuit breakers to prevent cascading failures
- Use structured logging with request context
- Monitor service communication (latency, errors)
- Monitor Kubernetes infrastructure (pods, nodes, deployments)
- Set up distributed tracing (see the full request flow)
- Build per-service dashboards (visibility into each service)
- Alert on degradation (not only on complete failures)
Start with Nova Uptime to monitor your customer-facing APIs. Add health check monitoring. Supplement with Prometheus/Grafana or a managed service for deeper infrastructure monitoring.
Your microservices are complex. Your monitoring needs to match that complexity.