Before you can fix a problem, you need to know it exists. Monitoring is how your system tells you what is happening — in real time, all the time, even when you are not watching.
Think of monitoring like the dashboard of a car. While you drive, the dashboard constantly tells you: current speed (metric), engine temperature (metric), fuel level (metric). If something goes wrong — the check engine light turns on (alert).
You do not have to open the bonnet every 5 minutes to check if the engine is fine. The dashboard watches for you. That is exactly what monitoring does for your server. It watches everything, measures the important numbers, and alerts you the moment something crosses a limit — before your clients notice.
Monitoring means continuously collecting, measuring, and displaying data about your system — so you can see what is happening right now and compare it with what normal looks like.
In fintech, monitoring covers everything — how many transactions are processed per second, how long each one takes, how full the disk is, how much CPU the payment service is using, and how many errors the logs contain. All of it is measured every few seconds, displayed on a dashboard, and used to trigger alerts when limits are crossed.
Without monitoring — you only find out something is wrong when a client calls. With monitoring — you know before the client does.
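To make this concrete, here is a minimal sketch of that collect-measure-compare cycle in Python. It is an illustration, not a production tool: the readings come from the third-party psutil package, and the thresholds are example values, not official limits.

```python
# Minimal monitoring loop: take a few host readings every 10 seconds and
# flag anything that crosses an illustrative threshold.
# Assumes psutil is installed (pip install psutil).
import time
import psutil

THRESHOLDS = {            # illustrative limits, not official values
    "cpu_percent": 90,
    "disk_percent": 90,
    "mem_used_percent": 85,
}

def collect_metrics():
    """Take one reading of the basic host metrics."""
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "disk_percent": psutil.disk_usage("/").percent,
        "mem_used_percent": psutil.virtual_memory().percent,
    }

if __name__ == "__main__":
    while True:
        metrics = collect_metrics()
        for name, value in metrics.items():
            status = "ALERT" if value >= THRESHOLDS[name] else "ok"
            print(f"{time.strftime('%H:%M:%S')} {name}={value:.1f}% [{status}]")
        time.sleep(10)
```

Real monitoring stacks do the same three things, just at scale: collect on a schedule, compare against a known-good range, and surface anything that falls outside it.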
Monitoring tells you that something is wrong — CPU is high, error rate is up, disk is full.
Observability tells you why it is wrong — which query is slow, which transaction failed, what the exact error message was, and how long it has been happening. Observability is built on three pillars: Metrics, Logs, and Traces.
- **Metrics:** Numbers measured over time. CPU %, transaction count per second, error rate, response time in ms. Metrics tell you how much and how fast.
- **Logs:** Timestamped records of events. Every INFO, WARN, ERROR line your application writes. Logs tell you what happened and when exactly.
- **Traces:** The full journey of one request through all systems, showing exactly where time was spent. Traces tell you where the delay is in the chain.
| Pillar | What it answers | Example in fintech | Tool examples |
|---|---|---|---|
| Metrics | How much? How fast? How many? | Payment success rate: 98.2% / DB CPU: 74% / Queue depth: 12 | Grafana, Prometheus |
| Logs | What happened? When? What was the error? | [14:02] ERROR TXN-501 DB_CONNECTION_TIMEOUT after 30s | Kibana, Splunk |
| Traces | Where is the slowness? Which step took longest? | API call total 3.1s — breakdown: DB 2.8s, network 0.3s | Jaeger, Zipkin |
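As a toy illustration of how one payment request feeds all three pillars, the sketch below uses only the Python standard library: a counter dictionary stands in for metrics, the logging module writes the log lines, and a list of timed steps plays the role of a trace. A real system would ship these to tools like Prometheus, Kibana, and Jaeger instead of printing them.

```python
# One payment request seen through the three pillars (toy example).
import logging
import time

logging.basicConfig(format="[%(asctime)s] %(levelname)s %(message)s",
                    datefmt="%H:%M:%S", level=logging.INFO)

metrics = {"payments_total": 0, "payments_failed": 0}   # pillar 1: metrics
trace = []                                              # pillar 3: trace spans

def timed_step(name, func):
    """Run one step of the request and record how long it took (ms)."""
    start = time.perf_counter()
    result = func()
    trace.append((name, round((time.perf_counter() - start) * 1000, 1)))
    return result

def process_payment():
    metrics["payments_total"] += 1
    try:
        timed_step("validate", lambda: time.sleep(0.01))   # simulated work
        timed_step("db_insert", lambda: time.sleep(0.05))  # simulated work
        logging.info("payment processed OK")               # pillar 2: logs
    except Exception as exc:
        metrics["payments_failed"] += 1
        logging.error("payment failed: %s", exc)
        raise

process_payment()
print("metrics:", metrics)
print("trace (ms per step):", trace)
```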
| Metric | What it measures | Normal range | When to act |
|---|---|---|---|
| TXN Success Rate | % of transactions completing successfully | Above 99% | Below 95% — investigate immediately |
| Error Rate | % of requests resulting in an error | Below 1% | Above 5% — P1 candidate |
| Response Time (p99) | How long 99% of requests take to complete | Under 500ms | Above 2s — something is slow |
| CPU Usage | How much of the processor is being used | Under 70% | Above 90% — find the heavy process |
| Disk Usage | How full the hard drive is | Under 80% | Above 90% — clean up or expand |
| Memory / RAM | How much RAM is available | Above 30% free | Below 15% free — investigate swap usage |
| Queue Depth | How many messages are waiting in the MQ | Near 0 | Growing steadily — consumer may be down |
| DB Connection Pool | How many DB connections are in use | Under 80% | Above 95% — pool near exhaustion |
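The "when to act" column above can be encoded as simple threshold checks. The sketch below mirrors the table's limits; the metric keys and the function are ours, chosen for the example, and queue depth is omitted because "growing steadily" is a trend rather than a single threshold.

```python
# Encode the act-now limits from the table as simple comparisons.
ACT_THRESHOLDS = {
    # metric name: (direction that means trouble, limit)
    "txn_success_rate":  ("below", 95),    # %
    "error_rate":        ("above", 5),     # %
    "response_time_p99": ("above", 2000),  # ms
    "cpu_usage":         ("above", 90),    # %
    "disk_usage":        ("above", 90),    # %
    "memory_free":       ("below", 15),    # %
    "db_pool_usage":     ("above", 95),    # %
}

def needs_action(metric: str, value: float) -> bool:
    """Return True when the reading has crossed its act-now threshold."""
    direction, limit = ACT_THRESHOLDS[metric]
    return value < limit if direction == "below" else value > limit

# Example readings: the success rate triggers action, the CPU does not.
for metric, value in {"txn_success_rate": 94.0, "cpu_usage": 72.0}.items():
    print(metric, value, "ACT" if needs_action(metric, value) else "ok")
```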
An alert fires automatically when a metric crosses a configured threshold. You do not have to sit watching a dashboard all day — the monitoring system watches for you and notifies you when something needs attention. This is what your Slack and WhatsApp scripts from Week 5 were doing — they were basic alerting systems.
- **Healthy:** All metrics within expected ranges. No action needed. The system is healthy. Example: CPU at 45%, error rate 0.2%, disk at 62%.
- **Warning:** A metric is elevated but not yet causing failures. Monitor it closely. It may resolve itself or worsen into Critical. Example: CPU at 78%, disk at 84%, response time creeping up.
- **Critical:** A metric is clearly abnormal. Investigate now, before it becomes a full outage. Example: CPU at 89%, error rate at 4%, queue depth growing steadily for 10 minutes.
- **Outage / P1:** A metric has crossed the point of causing real failures or outages. Open the bridge immediately. Example: CPU at 98%, success rate below 80%, DB pool at 100%, disk full.
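A basic alerting script, in the spirit of the Week 5 Slack script, only needs to read a metric, classify it against bands like the ones above, and post a message when attention is needed. The sketch below assumes the requests and psutil packages are installed; the webhook URL is a placeholder and the CPU bands are illustrative.

```python
# Threshold-based alert sketch: classify CPU and notify Slack when not healthy.
import socket
import psutil
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def classify_cpu(cpu: float) -> str:
    """Map a CPU reading onto illustrative severity bands."""
    if cpu >= 95:
        return "OUTAGE"
    if cpu >= 85:
        return "CRITICAL"
    if cpu >= 70:
        return "WARNING"
    return "HEALTHY"

def check_and_alert():
    cpu = psutil.cpu_percent(interval=1)
    severity = classify_cpu(cpu)
    if severity != "HEALTHY":
        message = f"[{severity}] CPU at {cpu:.0f}% on {socket.gethostname()}"
        requests.post(SLACK_WEBHOOK, json={"text": message}, timeout=5)

if __name__ == "__main__":
    check_and_alert()
```

Run it from cron every minute and you have a crude but working alerting system; tools like Prometheus Alertmanager or PagerDuty do the same job with deduplication, escalation, and on-call routing built in.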
Dashboards are the visual interface of your monitoring system. They show metrics as numbers, graphs, and bars — all updating in real time. Your job as L2 is to look at this dashboard and instantly know whether everything is fine or something needs attention.
| Tool | Type | What it does |
|---|---|---|
| Grafana | Metrics | Visualises metrics in dashboards — graphs, gauges, and alert panels. The most common L2 dashboard tool. |
| Prometheus | Metrics | Collects and stores metrics from all your services. Grafana reads from Prometheus to build dashboards. |
| Kibana | Logs | Search and visualise logs from all servers in one place. Part of the ELK stack (Elasticsearch, Logstash, Kibana). |
| Splunk | Logs | Enterprise log management. Search across millions of log lines instantly. Common in large fintech companies. |
| Jaeger | Traces | Distributed tracing — shows the full journey of a request across all microservices with timing at each step. |
| PagerDuty | Alerting | Routes alerts to the on-call engineer's phone. Escalates automatically if the first responder doesn't acknowledge. |
| Datadog | All-in-one | Metrics + Logs + Traces in one platform. Increasingly popular. Expensive but very powerful for L2 work. |
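For Grafana and Prometheus to show anything, the application first has to expose metrics. The sketch below uses the official prometheus_client package to publish two counters and a gauge on a local HTTP endpoint that Prometheus can scrape; the metric names, port, and simulated values are made up for the example.

```python
# Expose application metrics for Prometheus to scrape (sketch).
# Assumes prometheus_client is installed (pip install prometheus-client).
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

PAYMENTS = Counter("payments", "Payments processed")
PAYMENTS_FAILED = Counter("payments_failed", "Payments that failed")
QUEUE_DEPTH = Gauge("payment_queue_depth", "Messages waiting in the queue")

if __name__ == "__main__":
    start_http_server(8000)   # metrics served at http://localhost:8000/metrics
    while True:
        PAYMENTS.inc()
        if random.random() < 0.01:   # simulate the occasional failure
            PAYMENTS_FAILED.inc()
        QUEUE_DEPTH.set(random.randint(0, 5))
        time.sleep(1)
```

Point a Prometheus scrape job at that endpoint and the numbers appear in Grafana within a scrape interval or two.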
At 9 AM you open the Grafana dashboard. TXN success rate is at 94%. Error rate is 6%. The success rate has dropped below its 95% threshold and the error rate is above the 5% limit. You immediately check logs — DB_CONNECTION_TIMEOUT. You escalate to the DBA before any client calls. You found it from the metric before anyone reported it.
Alert fires at 2 AM: Disk at 93%. Metric told you first. You check logs — the application has been writing debug logs for 3 days and they are enormous. You archive old logs, disk drops to 71%. Crisis averted. Nobody lost sleep except you — and you fixed it in 10 minutes.
Response time metric is showing 3.2 seconds average. Normal is under 500ms. Metric says something is slow. Log says DB_SLOW_QUERY appearing since 13:45. Trace shows the SELECT query on TRANSACTIONS_LOG taking 3,100ms. DBA adds an index. Response time drops to 180ms within 2 minutes.
Manager asks: "How was the system during last night's batch run?" You open the Grafana dashboard, change the time range to 11 PM–3 AM. You see CPU spiked to 91% at midnight and came back down by 1 AM. TXN rate was high but success rate stayed at 99.1%. System handled it well. Report delivered in 2 minutes.
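You can also answer that kind of question without opening Grafana at all, by querying the Prometheus HTTP API for a past time range. The sketch below assumes Prometheus is reachable on localhost:9090 and that node_exporter provides node_cpu_seconds_total; the address and the query would need adjusting for a real environment.

```python
# Pull last night's CPU usage (11 PM to 3 AM) straight from Prometheus.
# Assumes the requests package and a node_exporter-backed Prometheus.
from datetime import datetime, timedelta

import requests

PROMETHEUS = "http://localhost:9090"
QUERY = '100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'

end = datetime.now().replace(hour=3, minute=0, second=0, microsecond=0)
start = end - timedelta(hours=4)   # the 11 PM to 3 AM batch window

resp = requests.get(
    f"{PROMETHEUS}/api/v1/query_range",
    params={"query": QUERY, "start": start.timestamp(),
            "end": end.timestamp(), "step": "300"},
    timeout=10,
)
results = resp.json()["data"]["result"]
for ts, value in (results[0]["values"] if results else []):
    print(datetime.fromtimestamp(ts).strftime("%H:%M"), f"{float(value):.1f}% CPU")
```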