L2 Support Engineer · Fintech · Extra Learning
Monitoring Concepts
Extra Learning · Reference Guide

Eight core monitoring concepts every L2 engineer must understand — each mapped to what it is, the tools and commands you use, and the real operational use case where it matters most.

Metrics · Logs · Alerts · Health Checks · APM · Dashboards · SLA/SLO · RCA
01 How to Use This Reference
Purpose

Every monitoring concept follows the same three-part structure that maps directly to how you use it on the job:

Concept — what this monitoring type actually is and what it measures.
Tools / Commands — the specific tools and terminal commands you use to access it.
Ops Use Case — the real L2 situation where this concept saves you time or prevents an incident.

Together these 8 concepts form your complete monitoring knowledge base — from raw numbers to root cause.

02 The 8 Monitoring Concepts
01
Metrics
Numbers over time
Concept

CPU, Memory, Disk, and Network usage — numeric values measured repeatedly over time. Metrics give you the big picture of how healthy your server and application are at any moment. They are the first signal that something is wrong.

Tools / Commands
  • top — live CPU and memory view
  • free -h — memory and swap
  • df -h — disk usage per partition
  • Grafana — visual dashboards for all metrics
Ops Use Case

Detect server overload. When CPU is at 94% or disk hits 100%, metrics tell you before a single client complaint arrives. This is the difference between proactive and reactive support.
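
As a concrete illustration, here is a minimal shell sketch of that proactive check: scan df output and flag any partition over a threshold. It assumes GNU df (for --output), and the 90% limit is an arbitrary example, not a standard.

  #!/usr/bin/env bash
  # Minimal disk check: flag any partition above a usage threshold.
  # The 90% threshold is an illustrative assumption.
  THRESHOLD=90
  df --output=pcent,target | tail -n +2 | while read -r pcent mount; do
    usage=${pcent%\%}                  # strip the trailing % sign
    if [ "$usage" -ge "$THRESHOLD" ]; then
      echo "WARNING: $mount is at ${usage}% used"
    fi
  done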

02
Logs
Event records
Concept

Application and system logs — timestamped records of every event your software and OS write. Logs tell you exactly what happened and when, with the precise error message. Where metrics show a number, logs show the story behind it.

Tools / Commands
  • tail -f app.log — live log stream
  • grep "ERROR" app.log — search by pattern
  • ELK Stack — Elasticsearch, Logstash, Kibana
  • Loki — lightweight log aggregation
Ops Use Case

Find errors. When a metric alert fires, logs tell you the exact error — DB_CONNECTION_TIMEOUT, NullPointerException, SOCKET_TIMEOUT — so you know what to fix, not just that something broke.
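
A few tail/grep patterns cover most of this in practice. The file name app.log and the "ERROR <CODE>" layout are assumptions about your log format:

  # Stream only the errors from the live log:
  tail -f app.log | grep --line-buffered "ERROR"

  # Rank error types by frequency, most common first:
  grep -o "ERROR [A-Z_]*" app.log | sort | uniq -c | sort -rn

  # Show three lines of context before each timeout to see the build-up:
  grep -B3 "SOCKET_TIMEOUT" app.log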

03
Alerts
Threshold triggers
Concept

Threshold-based notifications — rules that fire automatically when a metric crosses a defined limit. Alerts are the bridge between monitoring and human response. Without them you must watch screens constantly. With them the system watches for you.

Tools / Commands
  • Alertmanager — routes and manages alerts
  • email — low-urgency notification channel
  • Slack webhook — team channel alerts
  • PagerDuty — on-call phone escalation
Ops Use Case

Immediate action. A CRITICAL alert at 3 AM wakes the on-call engineer before any client notices. A WARNING alert during office hours creates a Slack message to investigate at the next opportunity. Severity routing means the right person gets the right urgency.
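
A minimal sketch of the threshold-to-notification idea, assuming a standard Slack incoming webhook. In production Alertmanager owns the rules and routing; the 85% limit and the webhook URL below are placeholders.

  #!/usr/bin/env bash
  # Threshold alert in miniature: post to Slack when CPU crosses a limit.
  LIMIT=85
  WEBHOOK="https://hooks.slack.com/services/XXX/YYY/ZZZ"   # placeholder
  idle=$(vmstat 1 2 | tail -1 | awk '{print $15}')   # idle CPU, 2nd sample
  usage=$((100 - idle))
  if [ "$usage" -ge "$LIMIT" ]; then
    curl -s -X POST -H 'Content-type: application/json' \
      --data "{\"text\":\"WARNING: CPU at ${usage}% on $(hostname)\"}" \
      "$WEBHOOK"
  fi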

04
Health Checks
Service availability
Concept

Service availability verification — automated checks that confirm your application and its dependencies (DB, MQ, external APIs) are reachable and responding. A health check endpoint returns a structured status telling you the health of every component in one call.

Tools / Commands
  • curl /health — call the health endpoint
  • systemctl status — check service status
  • ping — basic network reachability
  • netstat -an — check open connections
Ops Use Case

Ensure uptime. When a client reports failures, calling the /health endpoint is the fastest first step — it tells you whether the application, DB, and MQ are all up or which one is broken, in under 2 seconds.
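
In practice that first step looks something like this. The port, path, service name, and response shape are assumptions, since every stack names them differently:

  # Fastest first step on a client report:
  curl -s http://localhost:8080/health

  # A typical structured response pinpoints the broken dependency:
  # {"status":"DEGRADED","db":"UP","mq":"DOWN","external_api":"UP"}

  # If the endpoint does not answer at all, check the process and the port:
  systemctl status payment-service     # service name is illustrative
  netstat -an | grep 8080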

05
APM
App performance
Concept

Latency and response monitoring — Application Performance Monitoring tracks how long each API endpoint takes to respond, what the error rate is per endpoint, and how many requests per second each one handles. It watches the software layer, not just the server.

Tools / Commands
  • Grafana APM — per-endpoint performance
  • Datadog APM — traces and latency maps
  • Jaeger — distributed request tracing
  • New Relic — full-stack APM platform
Ops Use Case

Detect slow APIs. When the system feels slow but CPU and disk look fine, APM shows you that /raast/transfer has p99 latency of 8 seconds while all other endpoints are under 200ms. That specificity makes root cause identification immediate.
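
Even without an APM product you can approximate this check from the terminal. A rough sketch, assuming the endpoint is reachable locally:

  # Poor man's latency probe; the URL is an illustrative assumption.
  for i in $(seq 1 10); do
    curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" \
      http://localhost:8080/raast/transfer
  done
  # Mostly ~0.2s with occasional 8s responses is exactly the pattern an
  # APM p99 panel would have flagged.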

06
Dashboards
Visual overview
Concept

Visual status overview — dashboards pull together metrics, logs, and APM data into a single visual screen. Every key number is visible at once — CPU%, error rate, response time, queue depth — colour-coded so you can read system health in under 5 seconds.

Tools / Commands
  • Grafana — most common, highly customisable
  • Kibana — log-based dashboards (ELK)
  • Datadog — all-in-one with built-in dashboards
  • Splunk — enterprise dashboard for log data
Ops Use Case

Quick decision making. During an S1 bridge call, opening the dashboard gives you everything needed to answer "what is the current status" in seconds. You read from the dashboard — error rate, affected services, when it started — without opening a terminal.

07
SLA / SLO
Reliability targets
Concept

Service reliability targets — SLA (Service Level Agreement) is the contractual commitment to a client on uptime and response time. SLO (Service Level Objective) is the internal target your team works to. Both define what "acceptable" looks like numerically — e.g. 99.9% uptime per month.

Tools / Commands
  • Reports — monthly uptime and incident reports
  • Grafana SLO panels — visual SLO tracking
  • Incident log — duration of each outage
  • Error budget — how much downtime remains
Ops Use Case

Measure performance. Every incident you handle contributes to or drains the SLA. A 15-minute outage against a 99.9% monthly SLA consumes roughly a third of the ~43 minutes of downtime the month allows. Knowing the SLA makes every incident feel appropriately urgent.
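
The arithmetic behind that number, as a quick shell sketch (a 30-day month is assumed):

  # Error budget for a 99.9% monthly SLA:
  total=$((30 * 24 * 60))               # 43200 minutes in the month
  budget=$(echo "$total * 0.001" | bc)  # 99.9% allows 43.2 min of downtime
  used=15                               # this month's outage so far
  echo "error budget: ${used}m of ${budget}m used"   # ~35% gone in one incident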

08
RCA
Root cause analysis
Concept

Root cause analysis — the structured investigation after an incident to find not just what failed but why it failed. A good RCA uses evidence from logs and metrics to produce a timeline, a root cause statement, and next steps that prevent recurrence.

Tools / Commands
  • Logs — timestamps, error messages, build-up
  • Metrics — when did numbers change
  • Jira — where the RCA is documented
  • grep / Kibana — searching evidence
Ops Use Case

Prevent recurrence. An RCA that says "DB pool exhausted because of a connection leak" leads to a code fix. Without the RCA the same incident happens again in 2 weeks. Good RCAs turn reactive incidents into proactive improvements.
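
A sketch of how that evidence is typically gathered from the terminal. The timestamps and error codes below are illustrative and must match your log's own format:

  # 1. Everything the app logged in the window before the failure:
  awk '/2024-05-01 03:10/,/2024-05-01 03:15/' app.log > incident.log
  # 2. Which errors appeared, ranked by frequency:
  grep -o "ERROR [A-Z_]*" incident.log | sort | uniq -c | sort -rn
  # 3. The first occurrence anchors the start of the timeline:
  grep -n -m1 "DB_CONNECTION_TIMEOUT" incident.log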

03 Quick Reference — All 8 Concepts at a Glance
💡 Keep this table as your monitoring reference. Each concept has a distinct job — know which one to reach for depending on what question you are trying to answer.
🔖 Monitoring Concepts Cheat Sheet
Topic          | What it answers                                    | Key Tool                   | Ops Use Case
Metrics        | How much? How fast? Is the server stressed?        | top, free, df, Grafana     | Detect server overload
Logs           | What happened? When? What was the error?           | tail, grep, ELK/Loki       | Find errors
Alerts         | Did anything cross a threshold I care about?       | Alertmanager, Slack, email | Immediate action
Health Checks  | Is my service and its dependencies up right now?   | curl, systemctl            | Ensure uptime
APM            | Which API is slow and by how much?                 | Grafana APM, Datadog       | Detect slow APIs
Dashboards     | What is the overall system status right now?       | Grafana                    | Quick decision making
SLA / SLO      | Are we meeting our reliability commitments?        | Reports, error budget      | Measure performance
RCA            | Why did it fail and how do we stop it recurring?   | Logs + metrics             | Prevent recurrence
