2026-05-01
When your service goes down at 2 AM, logs are the difference between a 10-minute fix and a 4-hour guessing game. But most teams either log too little ("something went wrong") or too much (terabytes of noise). Good observability rests on three pillars: logs, metrics, and traces. Each answers a different question.
Logs answer "what happened?" They're discrete events. Metrics answer "how is the system behaving overall?" They're aggregated numbers — request rate, error rate, latency percentiles. Traces answer "what was the journey of this specific request?" They follow a single operation across service boundaries.
Here's a practical example. A user reports that checkout is slow. Metrics tell you the p99 latency for /checkout spiked from 200ms to 3s at 14:32. Traces let you grab a slow request and see it spent 2.8s waiting on the payment service. Logs from the payment service reveal it was retrying against a failing downstream provider.
Structured logging is non-negotiable. Stop writing log.info("Processing order for user"). Instead, emit structured fields:
{"event": "order.processing", "user_id": "u-482", "order_id": "ord-991", "amount": 84.50, "duration_ms": 47}
Structured logs are searchable, filterable, and aggregatable. Unstructured strings are grep fodder at best.
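As a minimal sketch of emitting an event like the one above, the snippet below uses only Python's standard logging and json modules. The JsonFormatter class, the build_logger helper, the "checkout" logger name, and the "fields" key passed through extra are all assumptions made for illustration, not part of any particular library.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # "fields" is a hypothetical attribute name chosen for this sketch;
        # it arrives via logging's standard `extra` argument.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # assumed logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info(
        "order.processing",
        extra={"fields": {"user_id": "u-482", "order_id": "ord-991",
                          "amount": 84.50, "duration_ms": 47}},
    )
```

Keeping the event name short and stable, and putting everything variable into fields, is what makes the output easy to filter and aggregate later.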
Log levels matter. Use them consistently across your team:
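A common convention, sketched here as an assumption rather than a fixed standard, ties each level to intent: DEBUG for diagnostic detail, INFO for normal noteworthy events, WARNING for anomalies the system recovered from, and ERROR for operations that failed. The level_for_outcome helper and the event names are hypothetical; the point is only that the choice of level follows a rule the whole team can apply.

```python
import logging

# Assumed convention for this sketch (adapt it to your team's agreement):
#   DEBUG    - detail useful only while developing or diagnosing
#   INFO     - normal, noteworthy events (order placed, job finished)
#   WARNING  - something unexpected that the system recovered from
#   ERROR    - an operation failed and needs attention


def level_for_outcome(succeeded: bool, retries: int) -> int:
    """Pick a level for a finished operation (hypothetical helper)."""
    if not succeeded:
        return logging.ERROR
    if retries > 0:
        return logging.WARNING
    return logging.INFO


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("payments")
    log.log(level_for_outcome(True, 0), "payment.captured")
    log.log(level_for_outcome(True, 2), "payment.captured_after_retry")
    log.log(level_for_outcome(False, 3), "payment.failed")
    # Below the configured INFO threshold, so this line is not emitted.
    log.debug("payment.raw_response_received")
```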
Rule of thumb for log volume: in production, aim for roughly 5–15 log lines per request at INFO level. If you're consistently above 50 per request, you're logging too much: you pay for the storage, and the useful lines get buried in noise. If you're below 3, you'll be blind during incidents.
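The thresholds above can be turned into a quick check against your own counts. This is a rough sketch: the classify_log_volume function and its labels are made up for illustration, and the "borderline" label covers the ranges the rule of thumb does not speak to (3 to 5 and 15 to 50 lines per request).

```python
def classify_log_volume(info_lines: int, requests: int) -> str:
    """Compare INFO lines per request against the rule-of-thumb bands."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    per_request = info_lines / requests
    if per_request < 3:
        return "too_sparse"   # likely blind during incidents
    if per_request > 50:
        return "too_noisy"    # storage cost and buried signal
    if 5 <= per_request <= 15:
        return "on_target"
    return "borderline"       # between the bands; worth a look, not alarming


if __name__ == "__main__":
    # Example: 1,200,000 INFO lines over 100,000 requests is 12 per request.
    print(classify_log_volume(1_200_000, 100_000))
```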
Two common mistakes to avoid. First, logging sensitive data: never log passwords, tokens, full credit card numbers, or PII without masking. This isn't just good practice; it's a compliance requirement under regulations such as GDPR and PCI-DSS. Second, missing correlation IDs. Every request entering your system should get a unique ID that propagates across all services. Without it, connecting logs from different services for the same user action is nearly impossible.
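Both fixes fit in a few lines of standard-library Python. The sketch below is one possible approach, not a complete solution: SENSITIVE_KEYS is a hypothetical deny-list, redact and start_request are illustrative helpers, and the ID "req-7f3a" stands in for whatever value your edge service propagates.

```python
import contextvars
import logging
import uuid
from typing import Optional

# Holds the ID of the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

# Hypothetical deny-list; extend it to match your own field names.
SENSITIVE_KEYS = {"password", "token", "card_number", "email"}


def redact(fields: dict) -> dict:
    """Return a copy of fields with sensitive values masked."""
    return {k: "***" if k in SENSITIVE_KEYS else v for k, v in fields.items()}


def start_request(incoming_id: Optional[str] = None) -> str:
    """Reuse a propagated ID if the caller sent one, else mint a new one."""
    cid = incoming_id or uuid.uuid4().hex
    correlation_id.set(cid)
    return cid


class CorrelationFilter(logging.Filter):
    """Stamp every record with the current correlation ID."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.addFilter(CorrelationFilter())
    handler.setFormatter(logging.Formatter("%(correlation_id)s %(message)s"))
    log = logging.getLogger("orders")
    log.addHandler(handler)
    log.setLevel(logging.INFO)

    start_request("req-7f3a")  # e.g. read from an inbound request header
    log.info("login %s", redact({"user_id": "u-482", "password": "hunter2"}))
```

A deny-list by key name only catches fields you anticipated; sensitive values embedded in free-text messages still need separate handling. When calling another service, forward the value returned by start_request so the downstream logs carry the same ID.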
Finally, invest in alerting on symptoms, not causes. Alert on "error rate exceeds 1%" or "p99 latency above 2s," not on "database CPU is high." High CPU might be fine. Broken user experience is never fine.
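To make the two symptom rules concrete, here is a small sketch that evaluates them over one window of request data. The function names, the nearest-rank percentile method, and the shape of the inputs are assumptions for illustration; a real setup would normally express these rules in the monitoring system's own query language.

```python
from typing import List, Sequence


def p99(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of the samples."""
    if not latencies_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.99 * n), avoiding float rounding surprises.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def symptom_alerts(latencies_ms: Sequence[float], errors: int, requests: int,
                   max_error_rate: float = 0.01,
                   max_p99_ms: float = 2000.0) -> List[str]:
    """Return the names of the user-facing symptoms that crossed a threshold."""
    alerts = []
    if requests > 0 and errors / requests > max_error_rate:
        alerts.append("error_rate")
    if latencies_ms and p99(latencies_ms) > max_p99_ms:
        alerts.append("p99_latency")
    return alerts


if __name__ == "__main__":
    window = [200.0] * 98 + [3000.0] * 2  # two slow requests out of 100
    print(symptom_alerts(window, errors=3, requests=100))
```

Nothing in this check looks at CPU, memory, or any other cause; it fires only when users are actually seeing errors or slow responses.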
