L2 Support Engineer · Fintech · Week 6.5
Week 6.5 · Day 3
Application Monitoring
System metrics tell you the server is struggling. Application metrics tell you which part of your software is causing it. Today you learn to read API latency, error rate, and throughput — the three numbers that reveal your application's health.
01 The Simple Idea First
Real-life Analogy
Think of your payment application like a post office counter. System metrics (CPU, disk, memory) tell you how tired the staff are. Application metrics tell you something more specific — how long each customer is waiting at the counter, how many letters got lost, and how many customers are being served per hour.
A post office where staff are not tired but customers wait 20 minutes — that is an application problem, not a server problem.
Application monitoring catches that. It watches the software itself — how fast it responds, how often it fails, and how much work it is doing.
Why Application Monitoring is different from System Monitoring
System monitoring (Day 2) watches the hardware layer — CPU%, disk%, RAM. It tells you the machine is stressed but not which part of the application is causing it.
Application monitoring watches the software layer — each specific API endpoint, each service, each function. It tells you that your /payment/initiate endpoint takes 4 seconds to respond while /account/balance takes 80ms. That specificity is what lets you find the root cause instead of guessing.
In fintech, application monitoring is especially critical because different endpoints carry different risk. A slow /health endpoint is fine. A slow /payment/process endpoint at 8 seconds means real money is being held up for thousands of customers.
02 The 3 Core Application Metrics
⏱️
Metric 1
API Latency
How long an API endpoint takes to respond to a request. Measured in milliseconds. Low latency = fast response. High latency = slow, customers waiting.
Normal: < 500ms
Warning: 500ms – 3s
Critical: > 3s
❌
Metric 2
Error Rate
The percentage of requests that result in an error (HTTP 5xx, timeout, exception). A rising error rate means the application is failing for a portion of users.
Normal: < 1%
Warning: 1% – 5%
Critical: > 5%
📦
Metric 3
Throughput
How many requests per second (RPS) the application is handling. Unusually low throughput means traffic has dropped or requests are piling up; unusually high throughput can overload the system.
Know your baseline
Drop: investigate
Spike: may overload
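The latency and error-rate thresholds above can be combined into a single per-endpoint status. A minimal sketch in shell (the function name and sample values are illustrative; throughput is left out because it is judged against your own baseline rather than a fixed threshold):

```shell
#!/bin/sh
# Sketch: combine latency (ms) and error rate (%) into one status,
# using the thresholds from the metric cards above.
classify() {
  latency_ms=$1
  error_pct=$2
  status="OK"
  [ "$latency_ms" -ge 500 ] && status="WARNING"
  # sh arithmetic is integer-only, so awk handles the fractional error rate
  awk -v e="$error_pct" 'BEGIN{exit !(e >= 1)}' && status="WARNING"
  [ "$latency_ms" -gt 3000 ] && status="CRITICAL"
  awk -v e="$error_pct" 'BEGIN{exit !(e > 5)}' && status="CRITICAL"
  echo "$status"
}
classify 142 0.2     # OK
classify 1840 3.4    # WARNING
classify 6100 12.8   # CRITICAL
```

The CRITICAL checks run last so the most severe matching level wins.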
03 Understanding API Latency in Detail
Latency is not one number — it is a distribution
Instead of just an average, monitoring tools show you percentiles. This matters because averages hide problems.
p50 (median) — 50% of requests are faster than this. Your typical customer experience.
p95 — 95% of requests are faster than this. Shows the slower edge of your users.
p99 — 99% of requests are faster than this. Shows the slowest 1% — often your most important customers doing large transactions.
Example: p50 = 120ms, p95 = 340ms, p99 = 4,200ms. The average might show 200ms and look fine. But 1 in 100 customers is waiting 4.2 seconds — that is the outlier hidden by the average. Percentiles reveal it.
📊 Latency Percentiles — What Each One Tells You
| Percentile | What it means | When it matters |
| --- | --- | --- |
| p50 (median) | Half of requests are faster than this | Your typical customer experience — the baseline |
| p95 | 95% of requests complete within this time | Shows if most users have a good experience |
| p99 | 99% of requests complete within this time | Shows worst-case experience for almost everyone |
| p99.9 | 99.9% of requests complete within this time | Rare outliers — timeouts, stuck requests |
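Percentiles are easy to compute by hand from raw latency samples. A rough sketch using the nearest-rank method with sort and awk (the file path and demo data are made up for illustration):

```shell
#!/bin/sh
# Nearest-rank percentile: sort the samples, then pick the value at
# rank ceil(N * p / 100). Usage: percentile FILE PCT
percentile() {
  sort -n "$1" | awk -v p="$2" '
    { v[NR] = $1 }
    END { idx = int((NR * p + 99) / 100); if (idx < 1) idx = 1; print v[idx] }'
}
# Demo data: 100 samples, 99 fast ones (3ms..297ms) plus one 4.2s outlier
seq 1 99 | awk '{print $1 * 3}' > /tmp/latencies.txt
echo 4200 >> /tmp/latencies.txt
echo "p50   = $(percentile /tmp/latencies.txt 50)ms"     # p50   = 150ms
echo "p95   = $(percentile /tmp/latencies.txt 95)ms"     # p95   = 285ms
echo "p99   = $(percentile /tmp/latencies.txt 99)ms"     # p99   = 297ms
echo "p99.9 = $(percentile /tmp/latencies.txt 99.9)ms"   # p99.9 = 4200ms
```

Note how the single 4.2-second outlier is invisible at p50 and p95; with exactly 100 samples it sits in the top rank, so under nearest-rank it only surfaces in the extreme tail.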
04 Reading an API Health Table
This is what you read when checking application health
Every row is a different API endpoint. Every column is a different health metric. Your job is to scan this table and immediately identify which endpoint is problematic and what kind of problem it has.
| Endpoint | Latency p50 | Latency p99 | Error Rate | Req/sec | Status |
| --- | --- | --- | --- | --- | --- |
| /payment/initiate | 142ms | 380ms | 0.2% | 84 | OK |
| /payment/status | 1,840ms | 8,200ms | 3.4% | 210 | WARNING |
| /account/balance | 68ms | 190ms | 0.1% | 340 | OK |
| /raast/transfer | 6,100ms | 31,000ms | 12.8% | 22 | CRITICAL |
| /ibft/transfer | 220ms | 640ms | 0.4% | 56 | OK |
| /health | 12ms | 28ms | 0% | 18 | OK |
🚨 Reading this table: /raast/transfer is CRITICAL — p99 latency of 31 seconds and 12.8% error rate. This means 1 in 8 RAAST transactions is failing and some are waiting 31 seconds. Investigate this endpoint first. /payment/status is WARNING — high p99 latency may indicate a DB query issue on the status lookup.
05 Health Check Endpoints — What They Are
What is a health check endpoint?
A health check endpoint is a special API route — usually /health or /status — that the application exposes specifically for monitoring. It returns a simple response telling you whether the application and its dependencies are working.
A good health check response tells you: Is the application running? Can it reach the database? Can it reach the message queue? Can it reach external APIs? All in one response, in under 100ms.
What a health check response looks like
# Call the health endpoint with curl
curl -s https://payment-api.company.com/health
# Healthy response
{
  "status": "OK",
  "database": "connected",
  "message_queue": "connected",
  "sbp_gateway": "reachable",
  "response_time_ms": 48
}
# Unhealthy response — DB is down
{
  "status": "DEGRADED",
  "database": "disconnected",
  "message_queue": "connected",
  "sbp_gateway": "reachable",
  "response_time_ms": 2100
}
💡 As L2 — whenever you suspect an application issue, calling the health endpoint is one of the first things you do. The response tells you which dependency is broken in seconds, without reading logs.
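When the monitoring UI is unavailable, the broken dependency can be pulled out of the health response with grep alone. A sketch that parses a saved copy of the unhealthy response above rather than calling a live endpoint (the file path is illustrative):

```shell
#!/bin/sh
# Save the health response to a file (in real life: curl -s .../health > file),
# then grep for failure states.
cat > /tmp/health.json << 'EOF'
{"status":"DEGRADED","database":"disconnected","message_queue":"connected","sbp_gateway":"reachable","response_time_ms":2100}
EOF
# Overall status
grep -o '"status":"[^"]*"' /tmp/health.json
# Which dependency is reporting a failure state?
grep -oE '"[a-z_]+":"(disconnected|unreachable|down)"' /tmp/health.json
```

With jq installed, `jq -r '.status' /tmp/health.json` is the cleaner equivalent, but grep works on any box.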
06 Hands-on Lab — Find the Slow API
🔬 Lab: Identify and Investigate a Slow API Endpoint
Log Analysis · App Health
Create a simulated API access log with latency data
This log shows every API call with its response time — exactly what a real web server log looks like.
terminal
cat > ~/api-access.log << 'EOF'
2024-03-15 14:00:01 POST /payment/initiate 200 142ms
2024-03-15 14:00:02 GET /account/balance 200 68ms
2024-03-15 14:00:03 POST /raast/transfer 200 6240ms
2024-03-15 14:00:04 GET /payment/status 200 1820ms
2024-03-15 14:00:05 POST /payment/initiate 200 138ms
2024-03-15 14:00:06 POST /raast/transfer 500 8100ms
2024-03-15 14:00:07 GET /account/balance 200 71ms
2024-03-15 14:00:08 POST /raast/transfer 500 9400ms
2024-03-15 14:00:09 GET /payment/status 200 2100ms
2024-03-15 14:00:10 POST /ibft/transfer 200 218ms
2024-03-15 14:00:11 POST /raast/transfer 500 31000ms
2024-03-15 14:00:12 POST /payment/initiate 200 155ms
2024-03-15 14:00:13 GET /account/balance 200 74ms
2024-03-15 14:00:14 POST /raast/transfer 200 7200ms
2024-03-15 14:00:15 GET /health 200 12ms
EOF
→ API log created with 15 entries across 5 endpoints. Some fast, some slow, some failing.
Step 1 — Find which endpoint is slowest
Sort all API calls by response time to immediately see the worst offender.
terminal
# Sort all requests by response time — slowest first (latency is field 6)
sort -k6,6 -rn ~/api-access.log
# Show only requests slower than 1000ms
awk '{gsub("ms","",$6); if($6+0 > 1000) print $0}' ~/api-access.log
→ /raast/transfer appears at the top — slowest response times including one at 31,000ms. /payment/status is second with 1,820ms and 2,100ms.
Step 2 — Find which endpoint has the highest error rate
HTTP 5xx means server error. Find which endpoints are returning errors.
terminal
# Show all 5xx errors
grep " 500 " ~/api-access.log
# Count errors per endpoint
grep " 500 " ~/api-access.log | awk '{print $4}' | sort | uniq -c | sort -rn
# Error rate for raast/transfer (errors / total calls)
echo "Total raast calls: $(grep 'raast' ~/api-access.log | wc -l)"
echo "Failed raast calls: $(grep 'raast' ~/api-access.log | grep ' 500 ' | wc -l)"
→ 3 out of 5 raast/transfer calls failed with 500 error = 60% error rate. This is CRITICAL. All other endpoints show 0 errors.
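The per-endpoint arithmetic above can also be done for every endpoint at once in a single awk pass. A self-contained sketch on an excerpt of the log (in the lab, run it against ~/api-access.log instead):

```shell
#!/bin/sh
# Excerpt of the lab log; field 4 is the endpoint, field 5 the HTTP status
cat > /tmp/api-error-sample.log << 'EOF'
2024-03-15 14:00:03 POST /raast/transfer 200 6240ms
2024-03-15 14:00:06 POST /raast/transfer 500 8100ms
2024-03-15 14:00:08 POST /raast/transfer 500 9400ms
2024-03-15 14:00:10 POST /ibft/transfer 200 218ms
2024-03-15 14:00:11 POST /raast/transfer 500 31000ms
2024-03-15 14:00:14 POST /raast/transfer 200 7200ms
EOF
awk '{
  total[$4]++
  if ($5 >= 500) err[$4]++             # count every 5xx response
} END {
  for (e in total)
    printf "%-20s total=%d errors=%d rate=%.1f%%\n",
           e, total[e], err[e] + 0, 100 * (err[e] + 0) / total[e]
}' /tmp/api-error-sample.log
# e.g. /raast/transfer      total=5 errors=3 rate=60.0%
```

The `err[e] + 0` coerces never-set counters to zero so clean endpoints print `errors=0 rate=0.0%` instead of an empty field.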
Step 3 — Calculate throughput per endpoint
How many requests is each endpoint receiving? Low volume can mean clients stopped sending — they may have given up.
terminal
# Count requests per endpoint
awk '{print $4}' ~/api-access.log | sort | uniq -c | sort -rn
# Compare success vs failure throughput for raast
echo "Successful raast: $(grep 'raast' ~/api-access.log | grep ' 200 ' | wc -l)"
echo "Failed raast : $(grep 'raast' ~/api-access.log | grep ' 500 ' | wc -l)"
→ /raast/transfer receives the most calls (5), but 3 of them are errors — effective successful throughput is only 2. /account/balance and /payment/initiate each handle 3 successful calls. That is very low successful volume for a critical payment endpoint.
Step 4 — Build an app health check script
Automate the check — this script reads the log and reports the health of each endpoint.
app-health-check.sh
#!/bin/bash
# App health check — reads API log and reports per endpoint
# chmod +x app-health-check.sh && ./app-health-check.sh
LOG="$HOME/api-access.log"
LATENCY_WARN=1000   # ms — requests slower than this are flagged
echo "============================================"
echo " App Health Check — $(date)"
echo "============================================"
# Slow requests (latency is field 6, e.g. "142ms")
echo ""
echo "[LATENCY] Requests above ${LATENCY_WARN}ms:"
awk -v warn="$LATENCY_WARN" '{gsub("ms","",$6); if($6+0 > warn) print "  "$4" — "$6"ms"}' "$LOG"
# Errors (field 5 is the HTTP status code)
echo ""
echo "[ERRORS ] 5xx errors by endpoint:"
awk '$5 >= 500 {print "  "$4}' "$LOG" | sort | uniq -c
# Throughput
echo ""
echo "[TRAFFIC] Requests per endpoint:"
awk '{print "  "$4}' "$LOG" | sort | uniq -c | sort -rn
echo ""
echo "============================================"
→ Single script covers latency, errors, and throughput for every endpoint. Run it to get a full app health report in 2 seconds. ✅
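To get this report without running it by hand, the script can be scheduled with cron. A sketch crontab entry, with both paths illustrative (adjust to wherever you saved the script and want the output):

```shell
# crontab -e, then add a line like this to run the check every 5 minutes
# and append each report to a rolling log file:
*/5 * * * * /home/l2user/app-health-check.sh >> /var/log/app-health.log 2>&1
```

The `2>&1` captures errors from the script itself, so a silently failing monitor does not go unnoticed.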
07 Real L2 Scenarios
01
Client says: "RAAST transfers are failing." You open the API health table — /raast/transfer shows 12.8% error rate and p99 latency of 31 seconds. You call the /health endpoint — database shows disconnected. Root cause found immediately — DB is down, causing RAAST to fail. You declare P1 and call the DBA.
02
System metrics show CPU and disk are perfectly normal. But users are complaining of slowness. You check the API health dashboard — /payment/status has p99 latency of 8.2 seconds. All other endpoints are fast. This is not a server problem — it is a specific slow query on the status-check endpoint. L3 to optimise that query.
03
Throughput on /ibft/transfer dropped from 56 req/sec to 4 req/sec in the last 10 minutes. Error rate is still 0%. No errors but traffic disappeared — clients stopped sending. This means the issue is upstream — the client's system is not generating transfers, or they got errors on their end and stopped retrying. You investigate the client's side.
04
Monitoring alert fires: error rate above 5%. You open the app dashboard — only /raast/transfer is affected, all other endpoints are green. This tells you immediately it is not a system-wide outage. It is isolated to RAAST. You check the SBP status page — SBP is experiencing issues. You inform the bridge and the client: our system is healthy, SBP is degraded.
✅ Week 6.5 · Day 3 Outcomes
- Explain the difference between system monitoring and application monitoring — and why both are needed
- Define the 3 core application metrics — API latency, error rate, and throughput — and know the healthy thresholds for each
- Understand latency percentiles — p50, p95, p99 — and why p99 is more important than the average
- Read an API health dashboard table and instantly identify which endpoint is problematic and what kind of problem it has
- Explain what a health check endpoint is and how to use it to identify a broken dependency in seconds
- Complete the lab — create an API access log, find the slowest endpoint by sorting response times, calculate error rate per endpoint, measure throughput, and build app-health-check.sh that reports all three metrics automatically
- Distinguish between a system issue (affects all endpoints equally) and an application issue (affects only one specific endpoint)