Logs are the black box of your system. When something breaks, the log file already knows what happened, when it happened, and why. Your job is to read it correctly.
Think of a log file as a CCTV recording of your system. Everything the system does gets recorded with a timestamp: every request received, every decision made, every error thrown.
When something goes wrong, you don't guess — you rewind the CCTV. You go to the exact time the problem started, watch what happened, and find exactly where things went wrong. That's log investigation.
A log file is a plain text file that your application writes to continuously. Every action gets a line — with a timestamp, a level (INFO, WARN, ERROR), and a message describing what happened. As an L2 engineer, you read these lines to understand what the system was doing at the moment something failed.
The key skill is not just finding errors — it's reading what happened BEFORE the error to understand why it happened. That context is everything.
1. Timestamp — tells you exactly when each event happened. Always note the time the first ERROR appeared. Everything before that is the lead-up.
2. Log Level — INFO means normal operation, WARN means something is drifting toward trouble, ERROR means it broke. Notice how the WARNs appeared BEFORE the ERROR: that's the pattern.
3. Message — the actual description of what happened. DB_CONNECTION_TIMEOUT tells you exactly what failed. You now know to investigate the database connection pool, not the payment logic.
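If you want to see those three parts in isolation, awk can split a line on its brackets. A minimal sketch, assuming the exact [timestamp] [LEVEL] message layout used in this lesson:

echo '[2024-03-15 14:02:05] [ERROR] DB_CONNECTION_TIMEOUT — failed to acquire connection after 30s' | awk -F'[][]' '{ print "timestamp:", $2; print "level:", $4; print "message:", $5 }'
→ Prints the timestamp, the level, and the message on three separate lines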
Yes — partially. Since you have Kali Linux set up on VirtualBox, you can simulate this entire investigation using a dummy log file we create ourselves. You don't need a real production server. Here is exactly how to do it step by step.
This gives you the same experience as a real investigation — reading real log content, running real Linux commands, and finding a real root cause.
cd ~
cat > payment-service.log << 'EOF'
[2024-03-15 14:01:55] [INFO ] Transaction TXN-9820 received. Amount: 2000
[2024-03-15 14:01:56] [INFO ] TXN-9820 validated and queued successfully
[2024-03-15 14:01:57] [INFO ] TXN-9820 processed successfully — Status: SUCCESS
[2024-03-15 14:02:00] [INFO ] Transaction TXN-9821 received. Amount: 8500
[2024-03-15 14:02:01] [INFO ] TXN-9821 validated and queued successfully
[2024-03-15 14:02:02] [WARN ] DB connection pool at 82% capacity
[2024-03-15 14:02:03] [INFO ] Transaction TXN-9822 received. Amount: 3000
[2024-03-15 14:02:03] [WARN ] DB connection pool at 91% capacity — approaching limit
[2024-03-15 14:02:04] [INFO ] Transaction TXN-9823 received. Amount: 5000
[2024-03-15 14:02:04] [WARN ] DB connection pool at 97% capacity — CRITICAL
[2024-03-15 14:02:05] [ERROR] DB_CONNECTION_TIMEOUT — failed to acquire connection after 30s
[2024-03-15 14:02:05] [ERROR] Transaction TXN-9823 FAILED — unable to write to database
[2024-03-15 14:02:05] [INFO ] Rollback initiated for TXN-9823
[2024-03-15 14:02:06] [ERROR] DB_CONNECTION_TIMEOUT — failed to acquire connection after 30s
[2024-03-15 14:02:06] [ERROR] Transaction TXN-9822 FAILED — unable to write to database
[2024-03-15 14:02:07] [WARN ] Retry attempt 1/3 for TXN-9822
[2024-03-15 14:02:09] [WARN ] Retry attempt 2/3 for TXN-9822
[2024-03-15 14:02:11] [WARN ] Retry attempt 3/3 for TXN-9822
[2024-03-15 14:02:12] [ERROR] Max retries exceeded — TXN-9822 moved to dead-letter queue
[2024-03-15 14:02:15] [INFO ] Transaction TXN-9824 received. Amount: 1200
[2024-03-15 14:02:16] [INFO ] TXN-9824 processed successfully — Status: SUCCESS
EOF
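Before investigating, it's worth confirming the file was written correctly:

wc -l payment-service.log
→ 21 payment-service.log (21 lines; if you see fewer, re-run the cat block above)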
grep "ERROR" payment-service.log
→ You should see 5 ERROR lines: two DB_CONNECTION_TIMEOUT lines, two FAILED transactions, and one max-retries-exceeded line
grep -c "ERROR" payment-service.log
→ Output: 5 (there are 5 error lines in this file)
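grep -c counts one level at a time. To count every level in a single pass, a small awk sketch (assuming the same bracketed format) works too:

awk -F'[][]' '{ gsub(/ /, "", $4); count[$4]++ } END { for (lvl in count) print lvl, count[lvl] }' payment-service.log
→ INFO 10, WARN 6, ERROR 5 (awk prints these in arbitrary order)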
grep "TXN-9823" payment-service.log
→ Shows all 3 lines about TXN-9823: received → FAILED → rollback
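Run the same trace on TXN-9822 and you get the longer story, including the retry loop:

grep "TXN-9822" payment-service.log
→ Shows all 7 lines: received → queued → FAILED → retry 1/3 → retry 2/3 → retry 3/3 → dead-letter queue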
grep -n "ERROR\|WARN" payment-service.log
→ You see WARN lines at 82%, 91%, 97% BEFORE the ERROR lines appear
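You can also make grep pull the lead-up automatically. With GNU grep (the default on Kali), -B prints the lines before each match:

grep -B 2 "DB_CONNECTION_TIMEOUT" payment-service.log
→ Each timeout appears with the 2 lines that preceded it; above the first one sits the 97% CRITICAL warning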
grep "SUCCESS\|FAILED" payment-service.log
→ TXN-9820: SUCCESS | TXN-9823: FAILED | TXN-9822: FAILED | TXN-9824: SUCCESS
tail -n 20 payment-service.log
→ The last 20 lines are displayed (this file has 21, so only the very first TXN-9820 line is cut off). On a real log with millions of lines, tail is how you jump straight to the most recent activity instead of opening the whole file.
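One more tail trick for live systems: on a real server the log grows while you read it, and tail -f streams new lines as the application writes them (press Ctrl-C to stop):

tail -f payment-service.log
→ Our dummy file is static, so nothing new appears, but on a production log you would watch errors arrive in real time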
Issue: Transaction TXN-9823 failed at 14:02:05, and TXN-9822 failed at 14:02:06 and exhausted its retries by 14:02:12.
Root Cause: The database connection pool was exhausted. After climbing past 97% capacity, the pool could no longer hand out connections, and new transactions failed with DB_CONNECTION_TIMEOUT after waiting 30s.
Evidence: The log shows WARN messages at 14:02:02 (82%), 14:02:03 (91%), and 14:02:04 (97%) before the first ERROR at 14:02:05. TXN-9822 was then moved to the dead-letter queue after 3 failed retries.
Next Step: Escalate to the DBA team to investigate a possible connection leak. Check whether a long-running query is holding connections open without releasing them.
| Error in Log | Root Cause Indicator | Severity | Who to Escalate To |
|---|---|---|---|
| DB_CONNECTION_TIMEOUT | Database connection pool exhausted or DB server down | Critical | DBA / L3 |
| SocketTimeoutException | External service (SBP, gateway) not responding | Critical | External vendor / L3 |
| NullPointerException | Missing or empty field in the request data | High | Check request data first, then L3 if code bug |
| OutOfMemoryError | Application has exhausted its memory (heap full or server RAM full) | Critical | Infra team / L3 immediately |
| Connection refused | Target server or service is not running | Critical | Infra team to check target server |
| Max retries exceeded | Underlying failure persisted through all retry attempts | High | Check dead-letter queue, find original error |
| 401 Unauthorized | API token expired or wrong credentials | Medium | Check API key/token config — usually a config fix |
| Response time > 10000ms | Slow query or overloaded external service | High | DBA for slow query / vendor for external slowness |
Client says "payments started failing at 2 PM." You grep the log for ERROR around 14:00. You find SocketTimeoutException pointing to the SBP gateway. You check SBP's status page — they had a 15-minute outage. Not your system's fault. You document it and inform the client.
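A time-window search like that is just two filters chained together. A minimal sketch, assuming the same timestamp format (the file name here is hypothetical):

grep "2024-03-15 14:0" gateway.log | grep "ERROR"
→ Only ERROR lines stamped between 14:00 and 14:09 survive both filters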
You see 200 ERROR lines in the log but they're all the same: same error, same service, repeating every 2 seconds. This is called an error storm — one root cause is triggering hundreds of errors. Focus on the first occurrence, not the count. Fix the first one and the rest stop.
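Two commands tame an error storm. The first jumps to the first occurrence; the second collapses repeats into unique messages with counts (the sed pattern assumes the timestamp-first format from this lesson):

grep -n "ERROR" payment-service.log | head -n 1
→ 11:[2024-03-15 14:02:05] [ERROR] DB_CONNECTION_TIMEOUT — failed to acquire connection after 30s
grep "ERROR" payment-service.log | sed 's/^\[[^]]*\] //' | sort | uniq -c | sort -rn
→ Each distinct error message with its count, most frequent first; a storm of 200 lines collapses to a handful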
A client reports "transaction TXN-4421 is showing pending but never completed." You grep the log for TXN-4421 — you find it entered the queue but there are no further lines about it. The queue processor likely stopped. You check the queue — it's backed up with 3,000 messages. The worker crashed silently.
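The fastest way to see where a stuck transaction stalled is its last known line (the TXN id and file name are from this hypothetical scenario):

grep "TXN-4421" application.log | tail -n 1
→ If the last line says "queued" and nothing follows, the worker that drains the queue is your prime suspect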
You open a log and see no ERROR lines at all but the client says it's broken. You look at the INFO lines — the last entry is from 2 hours ago. The application stopped logging entirely. Either the app crashed silently or the disk is full and it can't write logs. Run df -h immediately.
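Two quick checks when a log goes silent (the log path is hypothetical; substitute your application's actual file):

tail -n 1 /var/log/app/application.log
→ The timestamp on the last line tells you exactly when logging stopped
df -h
→ A partition at 100% use explains why the application can't write; check the one holding the log directory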