Eight core monitoring concepts every L2 engineer must understand — each mapped to what it is, the tools and commands you use, and the real operational use case where it matters most.
Every monitoring concept follows the same three-column structure that directly maps to how you use it on the job:
- **Concept** — what this monitoring type actually is and what it measures.
- **Tools / Commands** — the specific tools and terminal commands you use to access it.
- **Ops Use Case** — the real L2 situation where this concept saves you time or prevents an incident.
Together these 8 concepts form your complete monitoring knowledge base — from raw numbers to root cause.
**Metrics.** CPU, Memory, Disk, and Network usage — numeric values measured repeatedly over time. Metrics give you the big picture of how healthy your server and application are at any moment. They are the first signal that something is wrong.

- `top` — live CPU and memory view
- `free -h` — memory and swap
- `df -h` — disk usage per partition
- Grafana — visual dashboards for all metrics

**Detect server overload.** When CPU is at 94% or disk hits 100%, metrics tell you before a single client complaint arrives. This is the difference between proactive and reactive support.
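As a minimal sketch of how those commands combine during a capacity triage (the 90% threshold is an illustrative choice, not a universal rule, and GNU/Linux is assumed):

```bash
# Flag any filesystem above 90% usage (POSIX df columns: Use% is $5, mount is $6)
df -P | awk -v t=90 'NR > 1 {
  gsub(/%/, "", $5)                       # strip the % sign from Use%
  if ($5 + 0 >= t) printf "WARNING: %s is at %s%%\n", $6, $5
}'

# One-shot CPU and memory snapshot (batch mode, single iteration)
top -b -n 1 | head -n 5
free -h
```

Grafana shows the same numbers plotted over time; the shell versions are what you reach for when you are already on the box.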
**Logs.** Application and system logs — timestamped records of every event your software and OS write. Logs tell you exactly what happened and when, with the precise error message. Where metrics show a number, logs show the story behind it.

- `tail -f app.log` — live log stream
- `grep "ERROR" app.log` — search by pattern
- ELK Stack — Elasticsearch, Logstash, Kibana
- Loki — lightweight log aggregation

**Find errors.** When a metric alert fires, logs tell you the exact error — `DB_CONNECTION_TIMEOUT`, `NullPointerException`, `SOCKET_TIMEOUT` — so you know what to fix, not just that something broke.
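A sketch of the search workflow those two commands support (`app.log` and the timestamp format are illustrative):

```bash
# Follow the log live, showing only error lines as they arrive
tail -f app.log | grep --line-buffered "ERROR"

# Show each ERROR with three lines of context before and after
grep -n -B 3 -A 3 "ERROR" app.log

# Count errors per minute to spot when the spike began
# (assumes lines start with a timestamp like "2024-05-01 14:32:07")
grep "ERROR" app.log | cut -c 1-16 | sort | uniq -c | sort -rn | head
```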
**Alerts.** Threshold-based notifications — rules that fire automatically when a metric crosses a defined limit. Alerts are the bridge between monitoring and human response. Without them you must watch screens constantly. With them the system watches for you.

- Alertmanager — routes and manages alerts
- Email — low-urgency notification channel
- Slack webhook — team channel alerts
- PagerDuty — on-call phone escalation

**Immediate action.** A CRITICAL alert at 3 AM wakes the on-call engineer before any client notices. A WARNING alert during office hours creates a Slack message to investigate at the next opportunity. Severity routing means the right person gets the right urgency.
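For illustration, a minimal sketch of the Slack webhook channel; the URL is a placeholder that Slack issues per workspace, and the hostname in the message is invented:

```bash
# Post a WARNING-severity alert to a team channel via a Slack incoming webhook.
# The URL below is a placeholder; Slack generates the real one per workspace.
WEBHOOK_URL="https://hooks.slack.com/services/XXX/YYY/ZZZ"

curl -s -X POST "$WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d '{"text": "WARNING: disk usage on app-server-01 crossed 85%"}'
```

In practice Alertmanager makes this kind of call for you based on its routing rules; the point is that an alert is just a threshold plus a delivery channel.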
**Health Checks.** Service availability verification — automated checks that confirm your application and its dependencies (DB, MQ, external APIs) are reachable and responding. A health check endpoint returns a structured status telling you the health of every component in one call.

- `curl /health` — call the health endpoint
- `systemctl status` — check service status
- `ping` — basic network reachability
- `netstat -an` — check open connections

**Ensure uptime.** When a client reports failures, calling the `/health` endpoint is the fastest first step — it tells you whether the application, DB, and MQ are all up, or which one is broken, in under 2 seconds.
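A sketch of that first step in practice; the host, port, service name, and JSON shape are illustrative assumptions:

```bash
# Ask the application itself first (jq pretty-prints the JSON response)
curl -s http://localhost:8080/health | jq .
# e.g. {"status":"DOWN","app":"UP","db":"UP","mq":"DOWN"}

# Is the service process running at all?
systemctl status myapp --no-pager

# Can you reach the box, and is the port listening?
ping -c 3 app-server-01
netstat -an | grep LISTEN | grep 8080
```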
**APM.** Latency and response-time monitoring — Application Performance Monitoring tracks how long each API endpoint takes to respond, the error rate per endpoint, and how many requests per second each one handles. It watches the software layer, not just the server.

- Grafana APM — per-endpoint performance
- Datadog APM — traces and latency maps
- Jaeger — distributed request tracing
- New Relic — full-stack APM platform

**Detect slow APIs.** When the system feels slow but CPU and disk look fine, APM shows you that `/raast/transfer` has p99 latency of 8 seconds while all other endpoints are under 200 ms. That specificity makes root cause identification immediate.
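When no APM agent is available, you can approximate a one-endpoint latency probe with curl's built-in timers; the endpoint path is taken from the example above, and the host is an assumption:

```bash
# Sample response times ten times; a real APM collects this continuously
# and gives you percentiles and traces rather than raw samples.
for i in $(seq 1 10); do
  curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' \
    http://localhost:8080/raast/transfer
done
```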
**Dashboards.** Visual status overview — dashboards pull together metrics, logs, and APM data into a single visual screen. Every key number is visible at once — CPU%, error rate, response time, queue depth — colour-coded so you can read system health in under 5 seconds.

- Grafana — most common, highly customisable
- Kibana — log-based dashboards (ELK)
- Datadog — all-in-one with built-in dashboards
- Splunk — enterprise dashboards for log data

**Quick decision making.** During an S1 bridge call, opening the dashboard gives you everything needed to answer "what is the current status" in seconds. You read from the dashboard — error rate, affected services, when it started — without opening a terminal.
**SLA / SLO.** Service reliability targets — SLA (Service Level Agreement) is the contractual commitment to a client on uptime and response time. SLO (Service Level Objective) is the internal target your team works to. Both define what "acceptable" looks like numerically — e.g. 99.9% uptime per month.

- Reports — monthly uptime and incident reports
- Grafana SLO panels — visual SLO tracking
- Incident log — duration of each outage
- Error budget — how much downtime remains

**Measure performance.** Every incident you handle contributes to or drains the SLA. A 15-minute outage on a 99.9% SLA burns roughly a third of the month's ~43-minute error budget. Knowing the SLA makes every incident feel appropriately urgent.
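The arithmetic behind that claim, as a quick worked example (a 30-day month is assumed):

```bash
# Error-budget arithmetic for a 99.9% monthly SLA
awk 'BEGIN {
  minutes_in_month = 30 * 24 * 60           # 43200
  budget = minutes_in_month * (1 - 0.999)   # 43.2 minutes of allowed downtime
  outage = 15                               # minutes of this incident
  printf "Budget: %.1f min, outage used %.1f%% of it\n",
         budget, outage / budget * 100
}'
# Prints: Budget: 43.2 min, outage used 34.7% of it
```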
**RCA.** Root cause analysis — the structured investigation after an incident to find not just what failed but why it failed. A good RCA uses evidence from logs and metrics to produce a timeline, a root cause statement, and next steps that prevent recurrence.

- Logs — timestamps, error messages, build-up
- Metrics — when did numbers change
- Jira — where the RCA is documented
- `grep` / Kibana — searching evidence

**Prevent recurrence.** An RCA that says "DB pool exhausted because of a connection leak" leads to a code fix. Without the RCA the same incident happens again in 2 weeks. Good RCAs turn reactive incidents into proactive improvements.
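A sketch of the evidence-gathering step, reusing the `app.log`, error string, and timestamp format from the earlier examples:

```bash
# First occurrence of the failure: this anchors the RCA timeline
grep -n "DB_CONNECTION_TIMEOUT" app.log | head -n 1

# The build-up: 20 lines of context before the first occurrence
grep -B 20 -m 1 "DB_CONNECTION_TIMEOUT" app.log

# How the error rate grew, minute by minute
grep "DB_CONNECTION_TIMEOUT" app.log | cut -c 1-16 | sort | uniq -c
```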
| Topic | What it answers | Key Tool | Ops Use Case |
|---|---|---|---|
| Metrics | How much? How fast? Is the server stressed? | `top`, `free`, `df`, Grafana | Detect server overload |
| Logs | What happened? When? What was the error? | `tail`, `grep`, ELK/Loki | Find errors |
| Alerts | Did anything cross a threshold I care about? | Alertmanager, Slack, email | Immediate action |
| Health Checks | Is my service and its dependencies up right now? | `curl`, `systemctl` | Ensure uptime |
| APM | Which API is slow and by how much? | Grafana APM, Datadog | Detect slow APIs |
| Dashboards | What is the overall system status right now? | Grafana | Quick decision making |
| SLA / SLO | Are we meeting our reliability commitments? | Reports, error budget | Measure performance |
| RCA | Why did it fail and how do we stop it recurring? | Logs + metrics | Prevent recurrence |