L2 Support Engineer · Fintech · Week 6.5
Week 6.5 · Day 1

Monitoring Basics

Before you can fix a problem, you need to know it exists. Monitoring is how your system tells you what is happening — in real time, all the time, even when you are not watching.

Topics: Metrics · Logs · Traces · Observability · Alert Levels · Dashboards
01 The Simple Idea First
Real-life Analogy

Think of monitoring like the dashboard of a car. While you drive, the dashboard constantly tells you: current speed (metric), engine temperature (metric), fuel level (metric). If something goes wrong — the check engine light turns on (alert).

You do not have to open the bonnet every 5 minutes to check if the engine is fine. The dashboard watches for you. That is exactly what monitoring does for your server. It watches everything, measures the important numbers, and alerts you the moment something crosses a limit — before your clients notice.

What is Monitoring?

Monitoring means continuously collecting, measuring, and displaying data about your system — so you can see what is happening right now and compare it with what normal looks like.

In fintech, monitoring covers everything — how many transactions are processing per second, how long each one takes, how full the disk is, how much CPU the payment service is using, and how many errors the logs contain. All of it, measured every few seconds, displayed on a dashboard, and triggering alerts when limits are crossed.

Without monitoring — you only find out something is wrong when a client calls. With monitoring — you know before the client does.

02 The 3 Pillars of Observability

Observability vs Monitoring — What is the difference?

Monitoring tells you that something is wrong — CPU is high, error rate is up, disk is full.

Observability tells you why it is wrong — which query is slow, which transaction failed, what the exact error message was, and how long it has been happening. Observability is built from 3 pillars: Metrics, Logs, and Traces.

📊 Pillar 1 — Metrics

Numbers measured over time. CPU %, transaction count per second, error rate, response time in ms. Metrics tell you how much and how fast.

CPU: 87% ↑ · TXN/sec: 142 · Error rate: 4.2%

📋 Pillar 2 — Logs

Timestamped records of events. Every INFO, WARN, ERROR line your application writes. Logs tell you what happened and when exactly.

[14:02] ERROR DB_TIMEOUT TXN-501 — pool exhausted

🔗 Pillar 3 — Traces

The full journey of one request through all systems. Shows exactly where time was spent. Traces tell you where the delay is in the chain.

API: 12ms · DB query: 2,840ms ⚠️ · Network: 4ms
📋 Metrics vs Logs vs Traces — Quick Reference

| Pillar  | What it answers                                 | Example in fintech                                          | Tool examples       |
| Metrics | How much? How fast? How many?                   | Payment success rate: 98.2% / DB CPU: 74% / Queue depth: 12 | Grafana, Prometheus |
| Logs    | What happened? When? What was the error?        | [14:02] ERROR TXN-501 DB_CONNECTION_TIMEOUT after 30s       | Kibana, Splunk      |
| Traces  | Where is the slowness? Which step took longest? | API call total 3.1s — breakdown: DB 2.8s, network 0.3s      | Jaeger, Zipkin      |
💡 As L2, you use all three. Metrics tell you something is wrong → Logs show you what the error was → Traces show you where in the system it happened. They work together, not separately.
03 Key Metrics Every L2 Should Know
📈 The Metrics You Watch Every Day
| Metric              | What it measures                                | Normal range   | When to act                             |
| TXN Success Rate    | % of transactions completing successfully       | Above 99%      | Below 95% — investigate immediately     |
| Error Rate          | % of requests resulting in an error             | Below 1%       | Above 5% — P1 candidate                 |
| Response Time (p99) | The time within which 99% of requests complete  | Under 500ms    | Above 2s — something is slow            |
| CPU Usage           | How much of the processor is being used         | Under 70%      | Above 90% — find the heavy process      |
| Disk Usage          | How full the hard drive is                      | Under 80%      | Above 90% — clean up or expand          |
| Memory / RAM        | How much RAM is available                       | Above 30% free | Below 15% free — investigate swap usage |
| Queue Depth         | How many messages are waiting in the MQ         | Near 0         | Growing steadily — consumer may be down |
| DB Connection Pool  | How many DB connections are in use              | Under 80%      | Above 95% — pool near exhaustion        |
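Most of these numbers arrive through the monitoring stack, but a few you can approximate yourself straight from a log file. Below is a minimal sketch for error rate, assuming a simplified log format in which every request writes exactly one INFO or ERROR line (it reuses the same ~/payment-service.log file the lab below works with):

terminal — estimate an error rate from a log (sketch)
# Assumes one INFO or ERROR line per request — adjust the patterns to your real log format
ERRORS=$(grep -c "ERROR" ~/payment-service.log)
TOTAL=$(grep -cE "INFO|ERROR" ~/payment-service.log)

# awk handles the floating-point division and prints a percentage
awk -v e="$ERRORS" -v t="$TOTAL" \
  'BEGIN { if (t > 0) printf "Error rate: %.1f%%\n", e * 100 / t; else print "No requests logged" }'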
04 Alert Levels — OK, Warning, High, Critical

What is an Alert?

An alert fires automatically when a metric crosses a configured threshold. You do not have to sit watching a dashboard all day — the monitoring system watches for you and notifies you when something needs attention. This is what your Slack and WhatsApp scripts from Week 5 were doing — they were basic alerting systems.

OK — Everything normal

All metrics within expected ranges. No action needed. System is healthy. Example: CPU at 45%, error rate 0.2%, disk at 62%.

WARNING — Something to watch

A metric is elevated but not yet causing failures. Monitor it closely. It may resolve itself or worsen into Critical. Example: CPU at 78%, disk at 84%, response time creeping up.

HIGH — Action required soon

A metric is clearly abnormal. Investigate now — before it becomes a full outage. Example: CPU at 89%, error rate at 4%, queue depth growing steadily for 10 minutes.

CRITICAL — Immediate action / P1

A metric has crossed the point of causing real failures or outages. Open the bridge immediately. Example: CPU at 98%, success rate below 80%, DB pool at 100%, disk full.
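To make the levels concrete, here is a minimal sketch of how a monitoring check turns one raw metric — disk usage — into an alert level. The cut-offs below are illustrative only; real thresholds come from your team's runbook, not from this example.

check-disk-level.sh — map one metric to an alert level (sketch)
#!/bin/bash
# Current usage % of the root partition, with the trailing "%" stripped
DISK=$(df / | awk 'NR==2 {gsub("%",""); print $5}')

# Illustrative thresholds, loosely following the table above
if   [ "$DISK" -ge 95 ]; then LEVEL="CRITICAL"
elif [ "$DISK" -ge 90 ]; then LEVEL="HIGH"
elif [ "$DISK" -ge 80 ]; then LEVEL="WARNING"
else                          LEVEL="OK"
fi

echo "Disk usage: ${DISK}% — $LEVEL"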

05 Reading a Monitoring Dashboard

What you see on a real monitoring dashboard

Dashboards are the visual interface of your monitoring system. They show metrics as numbers, graphs, and bars — all updating in real time. Your job as L2 is to look at this dashboard and instantly know whether everything is fine or something needs attention.

PaySysLabs — Payment Service Dashboard ● LIVE

TXN Success Rate: 98.7% (last 15 minutes) — normal
CPU Usage: 83% — ⚠ above 80% threshold
Error Rate: 0.8% — normal range
Disk Usage: 97% — 🚨 critical, act now

DB Connection Pool: 68% used — OK
CPU: 83% — WARNING
Disk: 97% — CRITICAL
RAM Available: 12% free — WARNING (below the 15% threshold)
🚨 Looking at this dashboard — what action do you take? Disk at 97% is CRITICAL and the most urgent item. Investigate immediately, before the disk hits 100% and the DB stops writing. CPU at 83% is a WARNING — monitor it, but it is less urgent than the disk. RAM at 12% free is also a WARNING — check swap usage once the disk is under control. TXN success rate and error rate are healthy — there is no transaction issue right now.
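When disk is the metric about to go critical, the usual first move is to find out what is filling it. A quick sketch of the commands most L2s reach for, assuming you have permission to read the directories involved — paths like /var and /var/log are typical locations, not a rule:

terminal — find what is filling the disk (sketch)
# Which partition is actually full?
df -h

# Largest directories under /var — application and system logs usually live here
sudo du -xh /var 2>/dev/null | sort -rh | head -15

# Individual files over 100 MB under /var/log
sudo find /var/log -type f -size +100M -exec ls -lh {} \; 2>/dev/null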
06 Hands-on Lab — Identify Metrics from a Dashboard

🔬 Lab: Read a Dashboard and Identify What Needs Action

Observability · Decision Making
01
Check your server's real metrics right now on Kali
These commands pull the same data a monitoring dashboard shows — you are building it manually first so you understand where the numbers come from.
terminal — pull all key metrics
# CPU — what % is idle (lower idle = higher load)
top -bn1 | grep "Cpu(s)"

# Memory — available RAM
free -h

# Disk — all partitions with usage %
df -h

# Load average — 1min, 5min, 15min
cat /proc/loadavg

# Error count from your log file (prints 0 if the file is missing or has no errors)
ERRS=$(grep -c "ERROR" ~/payment-service.log 2>/dev/null); echo "${ERRS:-0} errors"
→ You now have real metric numbers: CPU%, free RAM, disk%, load average, and error count — exactly what a dashboard shows.
02
Build a simple personal dashboard script
This script collects all metrics and prints them in a dashboard format — your own monitoring view.
my-dashboard.sh — create and run this
#!/bin/bash
# Simple personal monitoring dashboard
# Run: chmod +x my-dashboard.sh && ./my-dashboard.sh

echo "================================================"
echo " My Monitoring Dashboard — $(date)"
echo "================================================"

# CPU — on standard procps top output, field 8 of the "Cpu(s)" line is the idle %
CPU_IDLE=$(top -bn1 | grep "Cpu(s)" | awk '{print $8}' | tr -d '%id,')
CPU_USED=$(echo "100 - $CPU_IDLE" | bc 2>/dev/null || echo "N/A")
echo "[CPU ] Used: ${CPU_USED}%"

# Memory
MEM_AVAIL=$(free -h | awk 'NR==2 {print $7}')
echo "[MEMORY ] Available: $MEM_AVAIL"

# Disk
DISK=$(df / | awk 'NR==2 {print $5}')
echo "[DISK ] Used: $DISK"

# Load
LOAD=$(cat /proc/loadavg | awk '{print $1, $2, $3}')
echo "[LOAD ] 1m 5m 15m: $LOAD"

# Errors — ${ERRS:-0} prints 0 if the log file does not exist yet
ERRS=$(grep -c "ERROR" ~/payment-service.log 2>/dev/null)
echo "[ERRORS ] In log: ${ERRS:-0}"
echo "================================================"
→ Run this and you see your own dashboard. Every number tells a story — compare against the thresholds table above to know what is OK and what needs attention.
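→ To make it feel like a live dashboard, you can let watch re-run the script every few seconds: watch -n 5 ./my-dashboard.sh refreshes the output automatically, much like Grafana's auto-refresh. (watch ships with procps, so it should already be on Kali.)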
03
Practice reading a dashboard scenario — make the call
Look at this set of metrics and decide: what is OK, what is a Warning, what is Critical?
Scenario — read and classify each metric
Dashboard Reading at 14:30:

TXN Success Rate : 96.4%
Error Rate : 3.6%
CPU Usage : 71%
Disk Usage : 88%
RAM Available : 24%
Queue Depth : 0 messages
DB Pool : 62% used
Response Time : 1,400ms

Your job: classify each metric and decide
what to investigate first.
→ TXN Success at 96.4% = WARNING (below 99%). Error rate 3.6% = WARNING (above 1%). Disk 88% = WARNING, approaching Critical. Response time 1.4s = WARNING (above 500ms). CPU 71% = just over the 70% comfort line — watch it, but it is not the priority. RAM 24% free = a little low, still above the 15% action line. Queue 0 = OK. DB pool 62% = OK. Start with the disk — it will become critical fastest.
04
Understand how metrics connect to logs and traces
A metric tells you something is wrong. A log tells you what. A trace tells you where.
The investigation chain — 3 pillars working together
Step 1 — Metric fires alert:
Error rate jumped from 0.8% to 4.2% at 14:02

Step 2 — Check logs to find what error:
grep "ERROR" ~/payment-service.log | tail -10
→ Finds: DB_CONNECTION_TIMEOUT repeated every 10 seconds

Step 3 — Check trace to find where delay is:
→ DB query step took 28,000ms (28 seconds)
→ All other steps normal (<100ms)

Conclusion: The DB is the source. Escalate to DBA.
→ Metric → Log → Trace. Three steps. Each one narrows down the problem until you have a clear root cause. ✅
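If a tracing tool is not in front of you, the logs alone can still narrow the timeline. Here is a minimal sketch that turns ERROR lines into a per-minute count, assuming the timestamps look like [14:02] at the start of each line (as in the examples above):

terminal — per-minute error counts from a log (sketch)
# Strip the brackets from the [HH:MM] timestamp, then count ERROR lines per minute
grep "ERROR" ~/payment-service.log \
  | awk '{ gsub(/\[|\]/, "", $1); print $1 }' \
  | sort | uniq -c | sort -k2

A sudden jump in the per-minute counts gives you the exact minute the problem started — usually the first question asked on the bridge.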
07 Common Monitoring Tools You Will Encounter
🛠️ Monitoring Tools — What Each One Is Used For
| Tool       | Type       | What it does                                                                                                       |
| Grafana    | Metrics    | Visualises metrics in dashboards — graphs, gauges, and alert panels. The most common L2 dashboard tool.            |
| Prometheus | Metrics    | Collects and stores metrics from all your services. Grafana reads from Prometheus to build dashboards.             |
| Kibana     | Logs       | Search and visualise logs from all servers in one place. Part of the ELK stack (Elasticsearch, Logstash, Kibana).  |
| Splunk     | Logs       | Enterprise log management. Search across millions of log lines instantly. Common in large fintech companies.       |
| Jaeger     | Traces     | Distributed tracing — shows the full journey of a request across all microservices, with timing at each step.      |
| PagerDuty  | Alerting   | Routes alerts to the on-call engineer's phone. Escalates automatically if the first responder doesn't acknowledge. |
| Datadog    | All-in-one | Metrics + logs + traces in one platform. Increasingly popular. Expensive but very powerful for L2 work.            |
08 Real L2 Scenarios
01

At 9 AM you open the Grafana dashboard. TXN success rate is at 94%. Error rate is 6%. Both are above warning threshold. You immediately check logs — DB_CONNECTION_TIMEOUT. You escalate to DBA before any client calls. You found it from the metric before anyone reported it.

02

Alert fires at 2 AM: Disk at 93%. Metric told you first. You check logs — the application has been writing debug logs for 3 days and they are enormous. You archive old logs, disk drops to 71%. Crisis averted. Nobody lost sleep except you — and you fixed it in 10 minutes.

03

Response time metric is showing 3.2 seconds average. Normal is under 500ms. Metric says something is slow. Log says DB_SLOW_QUERY appearing since 13:45. Trace shows the SELECT query on TRANSACTIONS_LOG taking 3,100ms. DBA adds an index. Response time drops to 180ms within 2 minutes.

04

Manager asks: "How was the system during last night's batch run?" You open the Grafana dashboard, change the time range to 11 PM–3 AM. You see CPU spiked to 91% at midnight and came back down by 1 AM. TXN rate was high but success rate stayed at 99.1%. System handled it well. Report delivered in 2 minutes.

✅ Week 6.5 · Day 1 Outcomes