Observability

  • Ability to measure a system’s current state based on the data it generates, such as logs, metrics and traces.
  • Advantages:
    • Better Visibility
    • Better Alerting
    • Better Efficiency
  • 3 Pillars:
    • Logs
    • Metrics
    • Tracing
  • The data produced by these 3 pillars is referred to as Telemetry Data
  • Facade libraries in Spring Boot:
    • SLF4J (logs)
    • Micrometer (metrics)
    • Spring Cloud Sleuth (tracing), succeeded by Micrometer Tracing

Logging

  • Detailed records of individual events happening in your application
  • Centralized logging is preferable, since simple per-instance logging:
    • Doesn’t scale well
    • Has low usability: it is difficult to reconstruct the chain of events from concurrent logs
  • Stream log information to a centralized logging system
  • Use a standard log format: it helps the centralized logging system index entries and makes searches faster
  • Example tools:
    • ELK stack: Elasticsearch, Logstash, Kibana
    • Splunk
    • Graylog
    • SolarWinds Loggly
    • Cloud Providers:
      • Google Cloud Logging
      • Amazon CloudWatch
      • Azure Log Analytics
  • Spring Boot Logging
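
To illustrate why a standard format helps indexing and search, here is a minimal sketch that renders one log event as a single JSON line. JSON is an assumption (the notes do not name a format), and the class name JsonLogLine, the field names and the sample values are hypothetical. Plain Java is used instead of SLF4J so that the sketch runs without dependencies.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: renders one log event as a single JSON line.
class JsonLogLine {

    // Escapes the characters that would break a JSON string literal.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters become numeric escapes.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Field names are illustrative; a real system would follow its own schema.
    static String format(Instant timestamp, String level, String service,
                         String traceId, String message) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("timestamp", timestamp.toString());
        fields.put("level", level);
        fields.put("service", service);
        fields.put("traceId", traceId);
        fields.put("message", message);

        StringJoiner json = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            json.add("\"" + e.getKey() + "\":\"" + escape(e.getValue()) + "\"");
        }
        return json.toString();
    }

    public static void main(String[] args) {
        System.out.println(format(Instant.now(), "INFO", "order-service",
                "4bf92f3577b34da6", "Order created"));
    }
}
```

Because every event carries the same named fields, a centralized system can index on fields such as level or traceId instead of parsing free text.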

Metrics

  • Aggregated information like counts, averages etc. about application features
  • Some types of metrics:
    • Counters
    • Gauges
    • Timers
    • Summaries
  • Examples:
    • CPU Usage: 18%
    • Memory Usage: 195 MB
    • Disk Read/Write: 51.2 MB
    • Network I/O: 3.7 GB/1.8 GB
  • Advantages:
    • Alerts: Based on some criteria on metrics an alert can be created
    • Trends: How metrics change over time
    • Impact of failure: In case of failures it can provide visibility of the impact
    • Performance tuning
    • Verifies the system architecture
  • There are two ways metrics are collected:
    • Push (e.g. New Relic, AppDynamics)
    • Pull (e.g. Prometheus)
  • Metrics are held in memory by the application, so it is better to publish them to a monitoring system, which usually stores them in a time series database
  • Time Series database examples:
    • Prometheus
    • Wavefront
    • Dynatrace
  • Examples of monitoring systems that metrics are published to:
    • Elastic APM
    • Prometheus
    • Dynatrace
    • Wavefront
  • Spring Boot Metrics
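
The metric types listed above can be sketched in plain Java to show how they differ. This is not the Micrometer API: the class names Counter, Gauge and Timer inside MetricTypesSketch are hypothetical stand-ins, and a summary would look like the timer but record arbitrary values (for example payload sizes) instead of durations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.DoubleSupplier;

// Hypothetical in-memory meters; names and methods are illustrative only.
class MetricTypesSketch {

    // Counter: a value that only increases (e.g. requests served).
    static class Counter {
        private final AtomicLong count = new AtomicLong();
        void increment() { count.incrementAndGet(); }
        long count() { return count.get(); }
    }

    // Gauge: reads the current value of something on demand (e.g. queue size).
    static class Gauge {
        private final DoubleSupplier source;
        Gauge(DoubleSupplier source) { this.source = source; }
        double value() { return source.getAsDouble(); }
    }

    // Timer: tracks how many events happened and how long they took.
    static class Timer {
        private long count;
        private long totalNanos;
        private long maxNanos;

        synchronized void record(long nanos) {
            count++;
            totalNanos += nanos;
            maxNanos = Math.max(maxNanos, nanos);
        }
        synchronized long count() { return count; }
        synchronized double meanMillis() {
            return count == 0 ? 0.0 : totalNanos / 1_000_000.0 / count;
        }
        synchronized double maxMillis() { return maxNanos / 1_000_000.0; }
    }

    public static void main(String[] args) {
        Counter requests = new Counter();
        List<String> queue = new ArrayList<>();
        Gauge queueSize = new Gauge(queue::size);
        Timer latency = new Timer();

        requests.increment();
        queue.add("job-1");
        latency.record(2_000_000L); // 2 ms expressed in nanoseconds

        System.out.println("requests=" + requests.count()
                + " queueSize=" + queueSize.value()
                + " meanLatencyMs=" + latency.meanMillis());
    }
}
```

These values live only in the application's memory, which is why they are published to a monitoring system for storage and alerting.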

Tracing

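  • Sampled information about requests as they travel across multiple services
  • Sampling traces some but not all requests, since tracing every request can overload the system
  • The default sampling depends on the library: Spring Cloud Sleuth rate-limits to 10 traces per second, while Micrometer Tracing in Spring Boot 3 samples with a probability of 0.1 (10% of requests)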
  • Advantages:
    • Create service map to show communication between services
    • Path breakdown
    • Timing information for each service
    • Improve Mean Time to Detect (MTTD) and Mean Time to Repair (MTTR)
  • Tracing backend examples:
    • Wavefront
    • Zipkin (OpenZipkin, originally developed by Twitter and based on Google’s Dapper paper)
    • Jaeger (part of CNCF, originally developed by Uber)
  • Spring Boot Tracing
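
A "traces per second" sampling limit can be sketched as a fixed-window rate limiter. This is a simplified illustration, not the sampler that Sleuth or its underlying tracer actually ships; the class name RateLimitedSampler and the injectable clock are assumptions made so the behavior is easy to test.

```java
import java.util.function.LongSupplier;

// Hypothetical sampler: allows at most N traces per one-second window.
class RateLimitedSampler {
    private final int tracesPerSecond;
    private final LongSupplier clockMillis;
    private long windowStart;
    private int sampledInWindow;

    RateLimitedSampler(int tracesPerSecond, LongSupplier clockMillis) {
        this.tracesPerSecond = tracesPerSecond;
        this.clockMillis = clockMillis;
        this.windowStart = clockMillis.getAsLong();
    }

    // Returns true if the current request should be traced.
    synchronized boolean shouldSample() {
        long now = clockMillis.getAsLong();
        if (now - windowStart >= 1000) {
            // A new one-second window begins: reset the budget.
            windowStart = now;
            sampledInWindow = 0;
        }
        if (sampledInWindow < tracesPerSecond) {
            sampledInWindow++;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        RateLimitedSampler sampler =
                new RateLimitedSampler(10, System::currentTimeMillis);
        int sampled = 0;
        for (int i = 0; i < 25; i++) {
            if (sampler.shouldSample()) {
                sampled++;
            }
        }
        System.out.println("sampled " + sampled + " of 25 requests");
    }
}
```

A probability-based sampler would instead trace each request with a fixed chance, so its overhead grows with traffic, whereas a rate limit caps it.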

Correlation

  • TraceID is used to correlate logging and tracing
  • methodName (URL) is used to correlate tracing and metrics
  • With the right data, effective correlations can be obtained to find the root cause
  • For example:
    • Traffic spike is highly correlated with the user john@email.com

Monitoring

  • collecting and analyzing predefined data types (network bandwidth, CPU utilization rates, etc.) in order to detect abnormal behaviors that might indicate problems.
  • part of Observability
  • with monitoring, you might be asking “is an individual piece (network, website, application or other service) up and running as expected?”
  • with observability, you’re asking a bigger question: “How well is everything working?”

Profiling

  • Profiling refers to the practice of collecting and analyzing data about the performance and behavior of software applications or systems

Real User Monitoring

  • aka RUM
  • used in frontend
  • collects information on the users of your apps and the actions they perform on the frontend applications

Instrumentation

  • refers to adding capabilities to systems and applications to track and capture information that can be used to observe the behavior and performance

OpenTelemetry (aka OTel)

  • open source observability framework that provides standardized protocols and tools for collecting and routing telemetry data.

APM

  • Application Performance Monitoring
  • examples:
    • New Relic
    • Dynatrace
    • AppDynamics

Log Analysis

  • ELK stack
  • Splunk
  • Graylog

Log Visualization

  • Grafana
  • Kibana

Incident Response Metrics

  • Mean Time to Detect
  • Mean Time to Respond
  • Mean Time to Resolve
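
These metrics are averages over incident timestamps. The sketch below assumes one common set of definitions, which varies between teams: time to detect is measured from the start of the incident to its detection, and time to resolve from the start to the resolution. The Incident type and the sample timestamps are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical calculation of incident response metrics.
class IncidentMetrics {

    static class Incident {
        final Instant started;
        final Instant detected;
        final Instant resolved;

        Incident(Instant started, Instant detected, Instant resolved) {
            this.started = started;
            this.detected = detected;
            this.resolved = resolved;
        }
    }

    // Averages durations at millisecond precision; empty input yields zero.
    private static Duration mean(List<Duration> durations) {
        if (durations.isEmpty()) {
            return Duration.ZERO;
        }
        long totalMillis = 0;
        for (Duration d : durations) {
            totalMillis += d.toMillis();
        }
        return Duration.ofMillis(totalMillis / durations.size());
    }

    // Assumed definition: mean of (detected - started).
    static Duration meanTimeToDetect(List<Incident> incidents) {
        List<Duration> gaps = new ArrayList<>();
        for (Incident i : incidents) {
            gaps.add(Duration.between(i.started, i.detected));
        }
        return mean(gaps);
    }

    // Assumed definition: mean of (resolved - started).
    static Duration meanTimeToResolve(List<Incident> incidents) {
        List<Duration> gaps = new ArrayList<>();
        for (Incident i : incidents) {
            gaps.add(Duration.between(i.started, i.resolved));
        }
        return mean(gaps);
    }

    public static void main(String[] args) {
        List<Incident> incidents = List.of(
                new Incident(Instant.parse("2024-01-01T10:00:00Z"),
                             Instant.parse("2024-01-01T10:05:00Z"),
                             Instant.parse("2024-01-01T11:00:00Z")),
                new Incident(Instant.parse("2024-01-02T12:00:00Z"),
                             Instant.parse("2024-01-02T12:15:00Z"),
                             Instant.parse("2024-01-02T12:30:00Z")));

        System.out.println("MTTD: " + meanTimeToDetect(incidents).toMinutes() + " min");
        System.out.println("Mean time to resolve: "
                + meanTimeToResolve(incidents).toMinutes() + " min");
    }
}
```

Mean Time to Respond could be computed the same way once a "response started" timestamp is recorded for each incident.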