Eight core monitoring concepts every L2 engineer must understand — each mapped to what it is, the tools and commands you use, and the real operational use case where it matters most.
Every monitoring concept follows the same three-column structure that directly maps to how you use it on the job:
- **Concept** — what this monitoring type actually is and what it measures.
- **Tools / Commands** — the specific tools and terminal commands you use to access it.
- **Ops Use Case** — the real L2 situation where this concept saves you time or prevents an incident.
Together these 8 concepts form your complete monitoring knowledge base — from raw numbers to root cause.
**Metrics.** CPU, Memory, Disk, and Network usage — numeric values measured repeatedly over time. Metrics give you the big picture of how healthy your server and application are at any moment. They are the first signal that something is wrong.

- `top` — live CPU and memory view
- `free -h` — memory and swap
- `df -h` — disk usage per partition
- Grafana — visual dashboards for all metrics

**Detect server overload.** When CPU is at 94% or disk hits 100%, metrics tell you before a single client complaint arrives. This is the difference between proactive and reactive support.
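As a minimal sketch of how those commands combine during a capacity triage (the 90% threshold is an illustrative choice, not a universal rule, and GNU/Linux is assumed):

```bash
# Flag any filesystem above 90% usage (POSIX df columns: Use% is $5, mount is $6)
df -P | awk -v t=90 'NR > 1 {
  gsub(/%/, "", $5)                       # strip the % sign from Use%
  if ($5 + 0 >= t) printf "WARNING: %s is at %s%%\n", $6, $5
}'

# One-shot CPU and memory snapshot (batch mode, single iteration)
top -b -n 1 | head -n 5
free -h
```

Grafana shows the same numbers plotted over time; the shell versions are what you reach for when you are already on the box.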
**Logs.** Application and system logs — timestamped records of every event your software and OS write. Logs tell you exactly what happened and when, with the precise error message. Where metrics show a number, logs show the story behind it.

- `tail -f app.log` — live log stream
- `grep "ERROR" app.log` — search by pattern
- ELK Stack — Elasticsearch, Logstash, Kibana
- Loki — lightweight log aggregation

**Find errors.** When a metric alert fires, logs tell you the exact error — `DB_CONNECTION_TIMEOUT`, `NullPointerException`, `SOCKET_TIMEOUT` — so you know what to fix, not just that something broke.
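A sketch of the search workflow those two commands support (`app.log` and the timestamp format are illustrative):

```bash
# Follow the log live, showing only error lines as they arrive
tail -f app.log | grep --line-buffered "ERROR"

# Show each ERROR with three lines of context before and after
grep -n -B 3 -A 3 "ERROR" app.log

# Count errors per minute to spot when the spike began
# (assumes lines start with a timestamp like "2024-05-01 14:32:07")
grep "ERROR" app.log | cut -c 1-16 | sort | uniq -c | sort -rn | head
```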
**Alerts.** Threshold-based notifications — rules that fire automatically when a metric crosses a defined limit. Alerts are the bridge between monitoring and human response. Without them you must watch screens constantly. With them the system watches for you.

- Alertmanager — routes and manages alerts
- Email — low-urgency notification channel
- Slack webhook — team channel alerts
- PagerDuty — on-call phone escalation

**Immediate action.** A CRITICAL alert at 3 AM wakes the on-call engineer before any client notices. A WARNING alert during office hours creates a Slack message to investigate at the next opportunity. Severity routing means the right person gets the right urgency.
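For illustration, a minimal sketch of the Slack webhook channel; the URL is a placeholder that Slack issues per workspace, and the hostname in the message is invented:

```bash
# Post a WARNING-severity alert to a team channel via a Slack incoming webhook.
# The URL below is a placeholder; Slack generates the real one per workspace.
WEBHOOK_URL="https://hooks.slack.com/services/XXX/YYY/ZZZ"

curl -s -X POST "$WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d '{"text": "WARNING: disk usage on app-server-01 crossed 85%"}'
```

In practice Alertmanager makes this kind of call for you based on its routing rules; the point is that an alert is just a threshold plus a delivery channel.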
**Health Checks.** Service availability verification — automated checks that confirm your application and its dependencies (DB, MQ, external APIs) are reachable and responding. A health check endpoint returns a structured status telling you the health of every component in one call.

- `curl /health` — call the health endpoint
- `systemctl status` — check service status
- `ping` — basic network reachability
- `netstat -an` — check open connections

**Ensure uptime.** When a client reports failures, calling the `/health` endpoint is the fastest first step — it tells you whether the application, DB, and MQ are all up, or which one is broken, in under 2 seconds.
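A sketch of that first step in practice; the host, port, service name, and JSON shape are illustrative assumptions:

```bash
# Ask the application itself first (jq pretty-prints the JSON response)
curl -s http://localhost:8080/health | jq .
# e.g. {"status":"DOWN","app":"UP","db":"UP","mq":"DOWN"}

# Is the service process running at all?
systemctl status myapp --no-pager

# Can you reach the box, and is the port listening?
ping -c 3 app-server-01
netstat -an | grep LISTEN | grep 8080
```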
**APM.** Latency and response-time monitoring — Application Performance Monitoring tracks how long each API endpoint takes to respond, the error rate per endpoint, and how many requests per second each one handles. It watches the software layer, not just the server.

- Grafana APM — per-endpoint performance
- Datadog APM — traces and latency maps
- Jaeger — distributed request tracing
- New Relic — full-stack APM platform

**Detect slow APIs.** When the system feels slow but CPU and disk look fine, APM shows you that `/raast/transfer` has p99 latency of 8 seconds while all other endpoints are under 200 ms. That specificity makes root cause identification immediate.
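When no APM agent is available, you can approximate a one-endpoint latency probe with curl's built-in timers; the endpoint path is taken from the example above, and the host is an assumption:

```bash
# Sample response times ten times; a real APM collects this continuously
# and gives you percentiles and traces rather than raw samples.
for i in $(seq 1 10); do
  curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' \
    http://localhost:8080/raast/transfer
done
```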
**Dashboards.** Visual status overview — dashboards pull together metrics, logs, and APM data into a single visual screen. Every key number is visible at once — CPU%, error rate, response time, queue depth — colour-coded so you can read system health in under 5 seconds.

- Grafana — most common, highly customisable
- Kibana — log-based dashboards (ELK)
- Datadog — all-in-one with built-in dashboards
- Splunk — enterprise dashboards for log data

**Quick decision making.** During an S1 bridge call, opening the dashboard gives you everything needed to answer "what is the current status" in seconds. You read from the dashboard — error rate, affected services, when it started — without opening a terminal.
**SLA / SLO.** Service reliability targets — SLA (Service Level Agreement) is the contractual commitment to a client on uptime and response time. SLO (Service Level Objective) is the internal target your team works to. Both define what "acceptable" looks like numerically — e.g. 99.9% uptime per month.

- Reports — monthly uptime and incident reports
- Grafana SLO panels — visual SLO tracking
- Incident log — duration of each outage
- Error budget — how much downtime remains

**Measure performance.** Every incident you handle contributes to or drains the SLA. A 15-minute outage on a 99.9% SLA burns roughly a third of the month's ~43-minute error budget. Knowing the SLA makes every incident feel appropriately urgent.
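The arithmetic behind that claim, as a quick worked example (a 30-day month is assumed):

```bash
# Error-budget arithmetic for a 99.9% monthly SLA
awk 'BEGIN {
  minutes_in_month = 30 * 24 * 60           # 43200
  budget = minutes_in_month * (1 - 0.999)   # 43.2 minutes of allowed downtime
  outage = 15                               # minutes of this incident
  printf "Budget: %.1f min, outage used %.1f%% of it\n",
         budget, outage / budget * 100
}'
# Prints: Budget: 43.2 min, outage used 34.7% of it
```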
**RCA.** Root cause analysis — the structured investigation after an incident to find not just what failed but why it failed. A good RCA uses evidence from logs and metrics to produce a timeline, a root cause statement, and next steps that prevent recurrence.

- Logs — timestamps, error messages, build-up
- Metrics — when did numbers change
- Jira — where the RCA is documented
- `grep` / Kibana — searching evidence

**Prevent recurrence.** An RCA that says "DB pool exhausted because of a connection leak" leads to a code fix. Without the RCA the same incident happens again in 2 weeks. Good RCAs turn reactive incidents into proactive improvements.
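A sketch of the evidence-gathering step, reusing the `app.log`, error string, and timestamp format from the earlier examples:

```bash
# First occurrence of the failure: this anchors the RCA timeline
grep -n "DB_CONNECTION_TIMEOUT" app.log | head -n 1

# The build-up: 20 lines of context before the first occurrence
grep -B 20 -m 1 "DB_CONNECTION_TIMEOUT" app.log

# How the error rate grew, minute by minute
grep "DB_CONNECTION_TIMEOUT" app.log | cut -c 1-16 | sort | uniq -c
```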
| Topic | What it answers | Key Tool | Ops Use Case |
|---|---|---|---|
| Metrics | How much? How fast? Is the server stressed? | `top`, `free`, `df`, Grafana | Detect server overload |
| Logs | What happened? When? What was the error? | `tail`, `grep`, ELK/Loki | Find errors |
| Alerts | Did anything cross a threshold I care about? | Alertmanager, Slack, email | Immediate action |
| Health Checks | Is my service and its dependencies up right now? | `curl`, `systemctl` | Ensure uptime |
| APM | Which API is slow and by how much? | Grafana APM, Datadog | Detect slow APIs |
| Dashboards | What is the overall system status right now? | Grafana | Quick decision making |
| SLA / SLO | Are we meeting our reliability commitments? | Reports, error budget | Measure performance |
| RCA | Why did it fail and how do we stop it recurring? | Logs + metrics | Prevent recurrence |