Before you can fix a problem, you need to know it exists. Monitoring is how your system tells you what is happening — in real time, all the time, even when you are not watching.
Think of monitoring like the dashboard of a car. While you drive, the dashboard constantly tells you: current speed (metric), engine temperature (metric), fuel level (metric). If something goes wrong — the check engine light turns on (alert).
You do not have to open the bonnet every 5 minutes to check if the engine is fine. The dashboard watches for you. That is exactly what monitoring does for your server. It watches everything, measures the important numbers, and alerts you the moment something crosses a limit — before your clients notice.
Monitoring means continuously collecting, measuring, and displaying data about your system — so you can see what is happening right now and compare it with what normal looks like.
In fintech, monitoring covers everything — how many transactions are processed per second, how long each one takes, how full the disk is, how much CPU the payment service is using, and how many errors the logs contain. All of it is measured every few seconds, displayed on a dashboard, and used to trigger alerts when limits are crossed.
Without monitoring — you only find out something is wrong when a client calls. With monitoring — you know before the client does.
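To make this concrete, here is a minimal sketch of that collect-measure-compare cycle in Python. It is an illustration, not a production tool: the readings come from the third-party psutil package, and the thresholds are example values, not official limits.

```python
# Minimal monitoring loop: take a few host readings every 10 seconds and
# flag anything that crosses an illustrative threshold.
# Assumes psutil is installed (pip install psutil).
import time
import psutil

THRESHOLDS = {            # illustrative limits, not official values
    "cpu_percent": 90,
    "disk_percent": 90,
    "mem_used_percent": 85,
}

def collect_metrics():
    """Take one reading of the basic host metrics."""
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "disk_percent": psutil.disk_usage("/").percent,
        "mem_used_percent": psutil.virtual_memory().percent,
    }

if __name__ == "__main__":
    while True:
        metrics = collect_metrics()
        for name, value in metrics.items():
            status = "ALERT" if value >= THRESHOLDS[name] else "ok"
            print(f"{time.strftime('%H:%M:%S')} {name}={value:.1f}% [{status}]")
        time.sleep(10)
```

Real monitoring stacks do the same three things, just at scale: collect on a schedule, compare against a known-good range, and surface anything that falls outside it.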
Monitoring tells you that something is wrong — CPU is high, error rate is up, disk is full.
Observability tells you why it is wrong — which query is slow, which transaction failed, what the exact error message was, and how long it has been happening. Observability is built on three pillars: Metrics, Logs, and Traces.
- **Metrics:** Numbers measured over time. CPU %, transaction count per second, error rate, response time in ms. Metrics tell you how much and how fast.
- **Logs:** Timestamped records of events. Every INFO, WARN, ERROR line your application writes. Logs tell you what happened and when exactly.
- **Traces:** The full journey of one request through all systems, showing exactly where time was spent. Traces tell you where the delay is in the chain.
| Pillar | What it answers | Example in fintech | Tool examples |
|---|---|---|---|
| Metrics | How much? How fast? How many? | Payment success rate: 98.2% / DB CPU: 74% / Queue depth: 12 | Grafana, Prometheus |
| Logs | What happened? When? What was the error? | [14:02] ERROR TXN-501 DB_CONNECTION_TIMEOUT after 30s | Kibana, Splunk |
| Traces | Where is the slowness? Which step took longest? | API call total 3.1s — breakdown: DB 2.8s, network 0.3s | Jaeger, Zipkin |
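As a toy illustration of how one payment request feeds all three pillars, the sketch below uses only the Python standard library: a counter dictionary stands in for metrics, the logging module writes the log lines, and a list of timed steps plays the role of a trace. A real system would ship these to tools like Prometheus, Kibana, and Jaeger instead of printing them.

```python
# One payment request seen through the three pillars (toy example).
import logging
import time

logging.basicConfig(format="[%(asctime)s] %(levelname)s %(message)s",
                    datefmt="%H:%M:%S", level=logging.INFO)

metrics = {"payments_total": 0, "payments_failed": 0}   # pillar 1: metrics
trace = []                                              # pillar 3: trace spans

def timed_step(name, func):
    """Run one step of the request and record how long it took (ms)."""
    start = time.perf_counter()
    result = func()
    trace.append((name, round((time.perf_counter() - start) * 1000, 1)))
    return result

def process_payment():
    metrics["payments_total"] += 1
    try:
        timed_step("validate", lambda: time.sleep(0.01))   # simulated work
        timed_step("db_insert", lambda: time.sleep(0.05))  # simulated work
        logging.info("payment processed OK")               # pillar 2: logs
    except Exception as exc:
        metrics["payments_failed"] += 1
        logging.error("payment failed: %s", exc)
        raise

process_payment()
print("metrics:", metrics)
print("trace (ms per step):", trace)
```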
| Metric | What it measures | Normal range | When to act |
|---|---|---|---|
| TXN Success Rate | % of transactions completing successfully | Above 99% | Below 95% — investigate immediately |
| Error Rate | % of requests resulting in an error | Below 1% | Above 5% — P1 candidate |
| Response Time (p99) | How long 99% of requests take to complete | Under 500ms | Above 2s — something is slow |
| CPU Usage | How much of the processor is being used | Under 70% | Above 90% — find the heavy process |
| Disk Usage | How full the hard drive is | Under 80% | Above 90% — clean up or expand |
| Memory / RAM | How much RAM is available | Above 30% free | Below 15% free — investigate swap usage |
| Queue Depth | How many messages are waiting in the MQ | Near 0 | Growing steadily — consumer may be down |
| DB Connection Pool | How many DB connections are in use | Under 80% | Above 95% — pool near exhaustion |
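The "when to act" column above can be encoded as simple threshold checks. The sketch below mirrors the table's limits; the metric keys and the function are ours, chosen for the example, and queue depth is omitted because "growing steadily" is a trend rather than a single threshold.

```python
# Encode the act-now limits from the table as simple comparisons.
ACT_THRESHOLDS = {
    # metric name: (direction that means trouble, limit)
    "txn_success_rate":  ("below", 95),    # %
    "error_rate":        ("above", 5),     # %
    "response_time_p99": ("above", 2000),  # ms
    "cpu_usage":         ("above", 90),    # %
    "disk_usage":        ("above", 90),    # %
    "memory_free":       ("below", 15),    # %
    "db_pool_usage":     ("above", 95),    # %
}

def needs_action(metric: str, value: float) -> bool:
    """Return True when the reading has crossed its act-now threshold."""
    direction, limit = ACT_THRESHOLDS[metric]
    return value < limit if direction == "below" else value > limit

# Example readings: the success rate triggers action, the CPU does not.
for metric, value in {"txn_success_rate": 94.0, "cpu_usage": 72.0}.items():
    print(metric, value, "ACT" if needs_action(metric, value) else "ok")
```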
An alert fires automatically when a metric crosses a configured threshold. You do not have to sit watching a dashboard all day — the monitoring system watches for you and notifies you when something needs attention. This is what your Slack and WhatsApp scripts from Week 5 were doing — they were basic alerting systems.
- **Healthy:** All metrics within expected ranges. No action needed. The system is healthy. Example: CPU at 45%, error rate 0.2%, disk at 62%.
- **Warning:** A metric is elevated but not yet causing failures. Monitor it closely. It may resolve itself or worsen into Critical. Example: CPU at 78%, disk at 84%, response time creeping up.
- **Critical:** A metric is clearly abnormal. Investigate now, before it becomes a full outage. Example: CPU at 89%, error rate at 4%, queue depth growing steadily for 10 minutes.
- **Outage / P1:** A metric has crossed the point of causing real failures or outages. Open the bridge immediately. Example: CPU at 98%, success rate below 80%, DB pool at 100%, disk full.
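A basic alerting script, in the spirit of the Week 5 Slack script, only needs to read a metric, classify it against bands like the ones above, and post a message when attention is needed. The sketch below assumes the requests and psutil packages are installed; the webhook URL is a placeholder and the CPU bands are illustrative.

```python
# Threshold-based alert sketch: classify CPU and notify Slack when not healthy.
import socket
import psutil
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def classify_cpu(cpu: float) -> str:
    """Map a CPU reading onto illustrative severity bands."""
    if cpu >= 95:
        return "OUTAGE"
    if cpu >= 85:
        return "CRITICAL"
    if cpu >= 70:
        return "WARNING"
    return "HEALTHY"

def check_and_alert():
    cpu = psutil.cpu_percent(interval=1)
    severity = classify_cpu(cpu)
    if severity != "HEALTHY":
        message = f"[{severity}] CPU at {cpu:.0f}% on {socket.gethostname()}"
        requests.post(SLACK_WEBHOOK, json={"text": message}, timeout=5)

if __name__ == "__main__":
    check_and_alert()
```

Run it from cron every minute and you have a crude but working alerting system; tools like Prometheus Alertmanager or PagerDuty do the same job with deduplication, escalation, and on-call routing built in.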
Dashboards are the visual interface of your monitoring system. They show metrics as numbers, graphs, and bars — all updating in real time. Your job as L2 is to look at this dashboard and instantly know whether everything is fine or something needs attention.
| Tool | Type | What it does |
|---|---|---|
| Grafana | Metrics | Visualises metrics in dashboards — graphs, gauges, and alert panels. The most common L2 dashboard tool. |
| Prometheus | Metrics | Collects and stores metrics from all your services. Grafana reads from Prometheus to build dashboards. |
| Kibana | Logs | Search and visualise logs from all servers in one place. Part of the ELK stack (Elasticsearch, Logstash, Kibana). |
| Splunk | Logs | Enterprise log management. Search across millions of log lines instantly. Common in large fintech companies. |
| Jaeger | Traces | Distributed tracing — shows the full journey of a request across all microservices with timing at each step. |
| PagerDuty | Alerting | Routes alerts to the on-call engineer's phone. Escalates automatically if the first responder doesn't acknowledge. |
| Datadog | All-in-one | Metrics + Logs + Traces in one platform. Increasingly popular. Expensive but very powerful for L2 work. |
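For Grafana and Prometheus to show anything, the application first has to expose metrics. The sketch below uses the official prometheus_client package to publish two counters and a gauge on a local HTTP endpoint that Prometheus can scrape; the metric names, port, and simulated values are made up for the example.

```python
# Expose application metrics for Prometheus to scrape (sketch).
# Assumes prometheus_client is installed (pip install prometheus-client).
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

PAYMENTS = Counter("payments", "Payments processed")
PAYMENTS_FAILED = Counter("payments_failed", "Payments that failed")
QUEUE_DEPTH = Gauge("payment_queue_depth", "Messages waiting in the queue")

if __name__ == "__main__":
    start_http_server(8000)   # metrics served at http://localhost:8000/metrics
    while True:
        PAYMENTS.inc()
        if random.random() < 0.01:   # simulate the occasional failure
            PAYMENTS_FAILED.inc()
        QUEUE_DEPTH.set(random.randint(0, 5))
        time.sleep(1)
```

Point a Prometheus scrape job at that endpoint and the numbers appear in Grafana within a scrape interval or two.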
At 9 AM you open the Grafana dashboard. TXN success rate is at 94%. Error rate is 6%. The success rate has dropped below its 95% threshold and the error rate is above the 5% limit. You immediately check logs — DB_CONNECTION_TIMEOUT. You escalate to the DBA before any client calls. You found it from the metric before anyone reported it.
Alert fires at 2 AM: Disk at 93%. Metric told you first. You check logs — the application has been writing debug logs for 3 days and they are enormous. You archive old logs, disk drops to 71%. Crisis averted. Nobody lost sleep except you — and you fixed it in 10 minutes.
Response time metric is showing 3.2 seconds average. Normal is under 500ms. Metric says something is slow. Log says DB_SLOW_QUERY appearing since 13:45. Trace shows the SELECT query on TRANSACTIONS_LOG taking 3,100ms. DBA adds an index. Response time drops to 180ms within 2 minutes.
Manager asks: "How was the system during last night's batch run?" You open the Grafana dashboard, change the time range to 11 PM–3 AM. You see CPU spiked to 91% at midnight and came back down by 1 AM. TXN rate was high but success rate stayed at 99.1%. System handled it well. Report delivered in 2 minutes.
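You can also answer that kind of question without opening Grafana at all, by querying the Prometheus HTTP API for a past time range. The sketch below assumes Prometheus is reachable on localhost:9090 and that node_exporter provides node_cpu_seconds_total; the address and the query would need adjusting for a real environment.

```python
# Pull last night's CPU usage (11 PM to 3 AM) straight from Prometheus.
# Assumes the requests package and a node_exporter-backed Prometheus.
from datetime import datetime, timedelta

import requests

PROMETHEUS = "http://localhost:9090"
QUERY = '100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'

end = datetime.now().replace(hour=3, minute=0, second=0, microsecond=0)
start = end - timedelta(hours=4)   # the 11 PM to 3 AM batch window

resp = requests.get(
    f"{PROMETHEUS}/api/v1/query_range",
    params={"query": QUERY, "start": start.timestamp(),
            "end": end.timestamp(), "step": "300"},
    timeout=10,
)
results = resp.json()["data"]["result"]
for ts, value in (results[0]["values"] if results else []):
    print(datetime.fromtimestamp(ts).strftime("%H:%M"), f"{float(value):.1f}% CPU")
```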