Before a client even calls you, the system should already be screaming that something is wrong. That's what monitoring is — your early warning system.
Think about the dashboard of a car. While you drive, it quietly watches everything — fuel level, engine temperature, oil pressure, speed. You don't manually check the engine every 5 minutes. The car tells you when something needs attention.
A small yellow light = warning, keep an eye on it. A red flashing light = stop the car right now.
Monitoring tools do exactly this for your servers and applications. They watch everything 24/7 and send you alerts when something looks wrong — before it becomes a disaster.
Monitoring means your systems are being watched continuously, every second of every day. Tools like Datadog, Zabbix, or Nagios collect metrics from your servers and apps (like CPU usage, memory, and error rates), and a dashboard such as Grafana displays them so you can see everything at a glance.
When a number crosses a limit — say CPU goes above 90% — the monitoring tool fires an alert. That alert lands in your email, Slack, or PagerDuty. As an L2 engineer, you receive these alerts and decide what to do next.
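Under the hood the idea is simple: measure a number, compare it to a limit, and send a message when the limit is crossed. Here's a minimal Python sketch of that loop. It is not how Grafana or Datadog work internally; the webhook URL, the thresholds, and the psutil/requests libraries are just assumptions to make the idea concrete.

```python
import shutil
import psutil    # third-party: pip install psutil
import requests  # third-party: pip install requests

# Placeholder webhook URL -- substitute your own Slack/Teams incoming webhook.
WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

# Example thresholds; real monitoring tools let you configure these per metric.
CPU_CRITICAL = 90.0   # percent
DISK_WARNING = 80.0   # percent

def check_and_alert():
    alerts = []

    # Sample CPU usage over one second.
    cpu = psutil.cpu_percent(interval=1)
    if cpu > CPU_CRITICAL:
        alerts.append(f"CRITICAL: CPU at {cpu:.0f}% (limit {CPU_CRITICAL:.0f}%)")

    # Check root filesystem usage.
    usage = shutil.disk_usage("/")
    disk_pct = usage.used / usage.total * 100
    if disk_pct > DISK_WARNING:
        alerts.append(f"WARNING: disk at {disk_pct:.0f}% (limit {DISK_WARNING:.0f}%)")

    # Deliver each alert to the chat webhook.
    for message in alerts:
        requests.post(WEBHOOK_URL, json={"text": message}, timeout=5)

if __name__ == "__main__":
    check_and_alert()
```

Real tools run checks like this with agents and rule engines at scale, but the mechanics are the same: metric, threshold, notification.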
Think of it like a shop. The building itself (walls, electricity, AC, water) is the infrastructure. The work happening inside (cashier processing payments, stock being tracked) is the application. Both can have problems. Both need to be watched.
"Is the building healthy?"
"Is the work inside running fine?"
If the alert is infra-related (high CPU, disk full), you check the server, restart services, or clear space. If it's app-related (high error rate, slow API), you dig into the application logs, check the code or configuration, or look at recent deployments. The starting point is different for each.
A very common mistake for new engineers is treating all alerts the same. They're not. Always ask first: is this an infra problem or an app problem?
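Here's a tiny sketch of that first question in code. The keyword lists are made up for illustration; most tools tag alerts for you, but when they don't, this is the mental routing you do in your head.

```python
# Hypothetical keyword lists -- adjust them to the alert names your tool actually sends.
INFRA_KEYWORDS = ("cpu", "memory", "ram", "disk", "server", "connection refused")
APP_KEYWORDS = ("error rate", "api", "login", "transaction", "deployment", "queue")

def classify_alert(message: str) -> str:
    """Return 'infra' or 'app' so you know where to start digging."""
    text = message.lower()
    # Crude substring matching; good enough for a sketch.
    if any(word in text for word in INFRA_KEYWORDS):
        return "infra"   # start on the server: top, df -h, service status
    if any(word in text for word in APP_KEYWORDS):
        return "app"     # start in the app: logs, config, recent deployments
    return "unknown"     # escalate or ask; never guess silently

print(classify_alert("Disk usage reached 82% on DB server"))   # infra
print(classify_alert("Payment API returning 500 errors"))      # app
```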
Looking at the dashboard above — disk space is critical at 96%. Your first action: log into the server, find what's eating disk space (usually old log files), clean it up or archive it. Don't wait — at 100% the server will crash and stop writing any new data.
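If you have to do that investigation by hand, a few lines of Python (or `df -h` and `du -sh` on the shell) will show how full the disk is and which files are the biggest. The log directory below is an assumption; your servers may keep logs somewhere else.

```python
import os
import shutil

# Illustrative path -- check with the app team where logs really live on your servers.
LOG_DIR = "/var/log"

# How full is the disk right now?
usage = shutil.disk_usage("/")
print(f"Disk used: {usage.used / usage.total:.0%}")

# List the ten largest files under the log directory -- the usual suspects.
sizes = []
for root, _dirs, files in os.walk(LOG_DIR):
    for name in files:
        path = os.path.join(root, name)
        try:
            sizes.append((os.path.getsize(path), path))
        except OSError:
            pass  # file rotated or deleted while scanning -- skip it

for size, path in sorted(sizes, reverse=True)[:10]:
    print(f"{size / 1e6:8.1f} MB  {path}")
```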
The transaction error rate warning at 4.2% also needs attention. This is an app-level alert. You'd check the application logs to see what type of transactions are failing and why.
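One quick way to see "what type and why" is to count the error names in the log. This is a sketch, assuming a hypothetical log path and a format that prints lines like `ERROR TxnDeclined: ...`; your application will differ.

```python
import re
from collections import Counter

# Hypothetical log location -- point this at wherever your app actually writes logs.
LOG_FILE = "/var/log/payments/app.log"

errors = Counter()
with open(LOG_FILE, encoding="utf-8", errors="replace") as f:
    for line in f:
        # Count lines like "ERROR TxnDeclined: ..." by their error name.
        match = re.search(r"ERROR\s+(\w+)", line)
        if match:
            errors[match.group(1)] += 1

# The top offenders tell you which transaction type to investigate first.
for name, count in errors.most_common(5):
    print(f"{count:6d}  {name}")
```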
For every alert, ask two questions:

1. Is this Infra or App?
2. What severity is it: OK, Info, Warning, or Critical?
This is exactly what you'll do every morning when you open your monitoring tool. Here's a full classification exercise:
| Alert Message | Type | Severity | What You Should Do |
|---|---|---|---|
| Server CPU at 97% for last 10 minutes | Infra | Critical | Create P1 ticket. Check what process is eating CPU. Restart if needed. |
| Payment API returning 500 errors — 80% failure rate | App | Critical | Drop everything. Check app logs. Likely a P1 — alert the team immediately. |
| Disk usage reached 82% on DB server | Infra | Warning | Schedule log cleanup. Monitor closely. Will become critical at 90%+. |
| Login failures increased by 30% in last hour | App | Warning | Check if it's a specific user, IP, or region. Could be fraud attempt or an auth bug. |
| Nightly backup completed successfully | Infra | Info | Just log it. No action needed. Good news. |
| API response time: 210ms (normal is under 500ms) | App | OK | All good. No action needed. System is healthy. |
| Database server not responding — connection refused | Infra | Critical | P1 immediately. Every app that uses this DB is now broken. Wake up the team. |
| Transaction queue length growing — 500 jobs waiting | App | Warning | Check if the queue processor is running. If stuck, restart the worker service. |
| New deployment pushed to PROD at 3:00 PM | App | Info | Watch closely for 30 minutes post-deploy. Any new alerts after this = likely the deployment caused it. |
| RAM usage dropped back to 45% after restart | Infra | OK / Resolved | Previous warning is cleared. Update the Jira ticket. Inform client if they were affected. |
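Once you've answered the two questions, the first action is almost mechanical. Here's a hypothetical playbook lookup that mirrors the table above; the exact wording should come from your own runbooks, not from this sketch.

```python
# Hypothetical playbook: (type, severity) -> first action. Adapt to your runbooks.
PLAYBOOK = {
    ("infra", "critical"): "Create a P1 ticket and get on the server now.",
    ("infra", "warning"):  "Schedule cleanup or remediation before it turns critical.",
    ("infra", "info"):     "Log it. No action needed.",
    ("app", "critical"):   "P1. Check app logs and alert the team immediately.",
    ("app", "warning"):    "Investigate logs and config; watch the trend.",
    ("app", "info"):       "Note it and watch the next 30 minutes of alerts.",
}

def first_action(alert_type: str, severity: str) -> str:
    return PLAYBOOK.get((alert_type.lower(), severity.lower()),
                        "Unknown combination: escalate to L3.")

print(first_action("Infra", "Critical"))  # Create a P1 ticket and get on the server now.
print(first_action("App", "Warning"))     # Investigate logs and config; watch the trend.
```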
At 7 AM you get a CRITICAL alert: "DB server not responding." No client has called yet. You open a P1 ticket immediately, start investigating, and send an early heads-up to the client — "We detected an issue and are already working on it." This is proactive support and clients love it.
You get a WARNING: disk at 83%. You think "it's not critical yet, I'll handle it later." Two hours later it hits 100% and the server crashes. Warnings are your chance to prevent the fire. Always act on warnings before they become critical.
A CRITICAL alert fires right after a deployment. First question to ask: "Was anything deployed recently?" If yes, the deployment is your prime suspect. You check what changed and whether rolling it back fixes the issue.
You get 20 alerts at once — CPU, memory, API errors, login failures all firing together. Don't panic. Usually one root cause triggers everything else. Start with the infra alerts first — if the server is overloaded, that's probably causing all the app-level alerts too.
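In practice that means sorting the flood before reading it: infra before app, critical before warning. A small sketch with made-up alert records shows the triage order.

```python
# Hypothetical alert records, roughly as a monitoring tool might hand them to you.
alerts = [
    {"msg": "Login failures up 30%",          "type": "app",   "severity": "warning"},
    {"msg": "Payment API 500 errors",         "type": "app",   "severity": "critical"},
    {"msg": "CPU at 97% on app-server-01",    "type": "infra", "severity": "critical"},
    {"msg": "Memory at 92% on app-server-01", "type": "infra", "severity": "warning"},
]

# Triage order: infra before app, critical before warning.
TYPE_RANK = {"infra": 0, "app": 1}
SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2, "ok": 3}

for alert in sorted(alerts, key=lambda a: (TYPE_RANK[a["type"]], SEVERITY_RANK[a["severity"]])):
    print(alert["severity"].upper(), "|", alert["type"], "|", alert["msg"])
```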