L2 Support Engineer · Fintech · Week 2
Week 2 · Day 2
Today's Topic

Log Investigation

Logs are the black box of your system. When something breaks, the log file already knows what happened, when it happened, and why. Your job is to read it correctly.

Error Pattern Analysis · Root Cause Indicators · Failed Transaction Trace
01 The Simple Idea First
Real-life Analogy

Think of a log file like a CCTV recording of your system. Every single thing the system does gets recorded with a timestamp — every request received, every decision made, every error that happened.

When something goes wrong, you don't guess — you rewind the CCTV. You go to the exact time the problem started, watch what happened, and find exactly where things went wrong. That's log investigation.

What is a Log File?

A log file is a plain text file that your application writes to continuously. Every action gets a line — with a timestamp, a level (INFO, WARN, ERROR), and a message describing what happened. As an L2 engineer, you read these lines to understand what the system was doing at the moment something failed.

The key skill is not just finding errors — it's reading what happened BEFORE the error to understand why it happened. That context is everything.
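On a real Linux server, the first tool for reading a live log is tail. A minimal sketch, assuming the application writes to /var/log/payment-service.log (the path is illustrative; the lab below uses a file in your home directory):

# Show the last 50 lines, then keep printing new lines as they arrive (-f = follow)
tail -n 50 -f /var/log/payment-service.log
# Press Ctrl+C to stop following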

02 Anatomy of a Log Line
payment-service.log
[2024-03-15 14:02:01] [INFO ] Transaction TXN-9823 received. Amount: 5000. Account: 0012XXXX
[2024-03-15 14:02:02] [INFO ] Validating transaction TXN-9823 — passed all checks
[2024-03-15 14:02:02] [INFO ] TXN-9823 queued successfully for processing
[2024-03-15 14:02:03] [WARN ] DB connection pool at 85% capacity — approaching limit
[2024-03-15 14:02:04] [WARN ] DB connection pool at 94% capacity
[2024-03-15 14:02:05] [ERROR] DB_CONNECTION_TIMEOUT — failed to acquire connection after 30s
[2024-03-15 14:02:05] [ERROR] Transaction TXN-9823 FAILED — unable to write to database
[2024-03-15 14:02:05] [INFO ] Rollback initiated for TXN-9823

How to read this log — 3 things to always look for

1. Timestamp — tells you exactly when each event happened. Always note the time the first ERROR appeared. Everything before that is the lead-up.

2. Log Level — INFO means normal, WARN means something is trending wrong, ERROR means it broke. Notice how the WARNs appeared BEFORE the ERROR — that's the pattern.

3. The message — the actual description of what happened. DB_CONNECTION_TIMEOUT tells you exactly what failed. You now know to investigate the database connection pool — not the payment logic.
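Because the level sits in a fixed position, you can get a health summary of an entire log in one command. A minimal sketch, assuming the bracketed level format shown above:

# Tally log lines by level: a quick feel for how healthy the log is
grep -oE "\[(INFO |WARN |ERROR)\]" payment-service.log | sort | uniq -c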

03 Common Error Patterns You Will See
🗄️
Pattern 1
Database Timeout / Connection Failure
"App can't talk to the database"
Typical log lines
[WARN ] DB connection pool at 90% capacity
[ERROR] DB_CONNECTION_TIMEOUT after 30000ms
[ERROR] Cannot acquire connection from pool — pool exhausted
Root Cause Indicator: Database connection pool is full. Too many open connections, no new ones available. Check if a previous process leaked connections without closing them. Escalate to DBA or L3.
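A quick way to confirm this pattern is to pull the pool warnings and the timeout into one view, so you can watch the capacity climb. A sketch against the sample log used in this lesson:

# Watch the pool fill up and hit the timeout, with line numbers for context
grep -nE "connection pool|DB_CONNECTION_TIMEOUT" payment-service.log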
💥
Pattern 2
NullPointerException / Unexpected Null
"System received empty data where it expected something"
Typical log lines
[ERROR] NullPointerException at PaymentProcessor.java:142
[ERROR] Field 'accountNumber' is null — cannot process transaction
[WARN ] Incoming request missing required field: beneficiaryIBAN
Root Cause Indicator: The request came in with missing or empty data. Usually a problem with how the client sent the request — a field was blank or not included. Check what data arrived in the request and compare to what was expected.
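To check what data actually arrived, search for null and missing-field complaints together with the lines just before them. A sketch; the log name is a placeholder and the wording mirrors the examples above:

# Show null/missing-field complaints plus 2 lines of context before each
grep -B 2 -iE "null|missing required field" payment-service.log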
⏱️
Pattern 3
External Service / API Timeout
"System called an external service and got no reply"
Typical log lines
[INFO ] Calling SBP gateway — POST /raast/payment
[WARN ] SBP gateway response time: 8500ms — exceeding threshold
[ERROR] SocketTimeoutException — no response from SBP after 10000ms
[ERROR] TXN-9823 failed — external gateway unreachable
Root Cause Indicator: Your system is fine — the external service (SBP, bank gateway) did not respond in time. Check whether the external service has a status page or a known outage. This is NOT your system's fault, but you still need to document it and inform the client.
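You can trace the whole exchange with one search, since the slow-response WARN usually precedes the timeout. A sketch, assuming gateway lines like the examples above:

# Follow the gateway call end to end: request, slow-response warning, timeout
grep -nE "SBP gateway|SocketTimeoutException" payment-service.log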
🔁
Pattern 4
Repeated Retries / Infinite Loop
"System keeps trying the same thing and failing repeatedly"
Typical log lines
[WARN ] Retry attempt 1/3 for TXN-9823
[WARN ] Retry attempt 2/3 for TXN-9823
[WARN ] Retry attempt 3/3 for TXN-9823
[ERROR] Max retries exceeded — TXN-9823 moved to dead-letter queue
Root Cause Indicator: The system tried multiple times and all failed. The transaction is now in the dead-letter queue — a holding area for transactions that couldn't be processed. You need to check why all retries failed (usually the same root cause as above) and decide whether to reprocess or refund.
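Retry storms are easiest to read as counts per transaction. A sketch that tallies retry attempts by transaction ID, assuming the "Retry attempt" wording shown above:

# Count retry attempts per transaction ID, busiest first
grep "Retry attempt" payment-service.log | grep -oE "TXN-[0-9]+" | sort | uniq -c | sort -rn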
04 What to Look For — Quick Reference
🔴
Immediate Red Flags

Stop and act now (a one-line scan for these appears below)

  • Words: ERROR, FATAL, CRITICAL, EXCEPTION
  • DB_CONNECTION_TIMEOUT or pool exhausted
  • OutOfMemoryError
  • StackOverflowError
  • Service unreachable or connection refused
🟡
Warning Signs

Investigate before it breaks

  • Words: WARN, WARNING, SLOW, RETRY
  • Response time exceeding threshold
  • Connection pool nearing limit
  • Retry attempts happening
  • Queue depth growing
🔍
Context Clues

Always check before the error

  • What was the last INFO before the ERROR?
  • Was there a deployment recently?
  • Did WARNs appear before the ERROR?
  • Is the same error repeating?
  • What Transaction ID is involved?
🟢
Healthy Signs

System is working normally

  • Consistent INFO lines flowing steadily
  • Transaction IDs completing with SUCCESS
  • Response times within normal range
  • No identical error lines repeating
  • No WARNs escalating over time
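As promised above, the Immediate Red Flags list compresses into a single first-pass scan you can run on any log before reading it line by line. A minimal sketch:

# First-pass triage: surface every red-flag line with its line number
grep -nE "ERROR|FATAL|CRITICAL|EXCEPTION|pool exhausted|[Cc]onnection refused" payment-service.log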
05 Hands-on Lab — Investigate a Failed Transaction

Can you perform this on your own machine?

Yes — partially. Since you have Kali Linux set up on VirtualBox, you can simulate this entire investigation using a dummy log file we create ourselves. You don't need a real production server. Here is exactly how to do it step by step.

This gives you the same experience as a real investigation — reading real log content, running real Linux commands, and finding a real root cause.

🔬 Lab: Simulate & Investigate a Failed Transaction Log

Performable on Kali Linux · VirtualBox
01
Open your Kali Linux terminal
Boot up your VirtualBox Kali machine and open the terminal. Navigate to your home directory.
cd ~
02
Create a dummy log file with real-looking entries
Copy and paste this entire block into your terminal. It creates a realistic payment log file with a failure built into it.
cat > payment-service.log << 'EOF'
[2024-03-15 14:01:55] [INFO ] Transaction TXN-9820 received. Amount: 2000
[2024-03-15 14:01:56] [INFO ] TXN-9820 validated and queued successfully
[2024-03-15 14:01:57] [INFO ] TXN-9820 processed successfully — Status: SUCCESS
[2024-03-15 14:02:00] [INFO ] Transaction TXN-9821 received. Amount: 8500
[2024-03-15 14:02:01] [INFO ] TXN-9821 validated and queued successfully
[2024-03-15 14:02:02] [WARN ] DB connection pool at 82% capacity
[2024-03-15 14:02:03] [INFO ] Transaction TXN-9822 received. Amount: 3000
[2024-03-15 14:02:03] [WARN ] DB connection pool at 91% capacity — approaching limit
[2024-03-15 14:02:04] [INFO ] Transaction TXN-9823 received. Amount: 5000
[2024-03-15 14:02:04] [WARN ] DB connection pool at 97% capacity — CRITICAL
[2024-03-15 14:02:05] [ERROR] DB_CONNECTION_TIMEOUT — failed to acquire connection after 30s
[2024-03-15 14:02:05] [ERROR] Transaction TXN-9823 FAILED — unable to write to database
[2024-03-15 14:02:05] [INFO ] Rollback initiated for TXN-9823
[2024-03-15 14:02:06] [ERROR] DB_CONNECTION_TIMEOUT — failed to acquire connection after 30s
[2024-03-15 14:02:06] [ERROR] Transaction TXN-9822 FAILED — unable to write to database
[2024-03-15 14:02:07] [WARN ] Retry attempt 1/3 for TXN-9822
[2024-03-15 14:02:09] [WARN ] Retry attempt 2/3 for TXN-9822
[2024-03-15 14:02:11] [WARN ] Retry attempt 3/3 for TXN-9822
[2024-03-15 14:02:12] [ERROR] Max retries exceeded — TXN-9822 moved to dead-letter queue
[2024-03-15 14:02:15] [INFO ] Transaction TXN-9824 received. Amount: 1200
[2024-03-15 14:02:16] [INFO ] TXN-9824 processed successfully — Status: SUCCESS
EOF
✅ This creates a file called payment-service.log in your home directory with 21 log lines.
03
First — find all ERROR lines
Use grep to pull out only the error lines from the file.
grep "ERROR" payment-service.log → You should see 5 ERROR lines all related to DB_CONNECTION_TIMEOUT and failed transactions
04
Count how many errors occurred
Use -c to count the total number of error lines — quick way to know the scale of the problem.
grep -c "ERROR" payment-service.log → Output: 5 (there are 5 error lines in this file)
05
Find the specific failed transaction — TXN-9823
Use grep to trace everything related to that single transaction ID.
grep "TXN-9823" payment-service.log → Shows all 3 lines about TXN-9823: received → FAILED → rollback
🔍 You can now see the full story of TXN-9823 — it arrived fine but failed at the DB write stage.
06
Get context — what happened just BEFORE the errors?
Use -n to see line numbers, then look at the lines before the first ERROR.
grep -n "ERROR\|WARN" payment-service.log → You see WARN lines at 82%, 91%, 97% BEFORE the ERROR lines appear
💡 Key finding: The WARNs about DB connection pool climbing (82% → 91% → 97%) appeared before the ERROR. This is the pattern — the pool was filling up and nobody acted on the warnings.
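A shortcut worth knowing: grep can print those context lines for you, so you do not have to jump between line numbers by hand. A sketch against the same file (-B prints the lines Before each match):

# Print each ERROR plus the 3 lines leading up to it
grep -B 3 "ERROR" payment-service.log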
07
Check which transactions succeeded vs failed
Compare successful vs failed transactions in the log.
grep "SUCCESS\|FAILED" payment-service.log → TXN-9820: SUCCESS | TXN-9823: FAILED | TXN-9822: FAILED | TXN-9824: SUCCESS
Conclusion: TXN-9820 and TXN-9824 succeeded. TXN-9822 and TXN-9823 failed. The failures happened exactly when the DB pool was exhausted. TXN-9822 also went to the dead-letter queue after 3 retries.
08
View the full log in order to see the complete picture
Our file is only 21 lines, so cat shows everything; on a large production log you would use tail -n 50 to read just the most recent entries.
cat payment-service.log
→ Full log displayed — you can now read the entire incident from start to finish

📋 Your Root Cause Report — What to Write in the Jira Ticket

Issue: Transactions TXN-9823 and TXN-9822 failed at 14:02:05 and 14:02:06 respectively.

Root Cause: Database connection pool became exhausted (reached 97% capacity). New transactions could not acquire a database connection, causing them to fail with DB_CONNECTION_TIMEOUT.

Evidence: Log shows WARN messages at 14:02:02 (82%), 14:02:03 (91%), and 14:02:04 (97%) before the first ERROR at 14:02:05. TXN-9822 was also moved to dead-letter queue after 3 failed retries.

Next Step: Escalate to DBA team to investigate connection leak. Check if any long-running query is holding connections open without releasing them.

06 Root Cause Indicators — Quick Reference
🎯 Error Message → What it Likely Means
Error in Log | Root Cause Indicator | Severity | Who to Escalate To
DB_CONNECTION_TIMEOUT | Database connection pool exhausted or DB server down | Critical | DBA / L3
SocketTimeoutException | External service (SBP, gateway) not responding | Critical | External vendor / L3
NullPointerException | Missing or empty field in the request data | High | Check request data first, then L3 if code bug
OutOfMemoryError | Server RAM is completely full | Critical | Infra team / L3 immediately
Connection refused | Target server or service is not running | Critical | Infra team to check target server
Max retries exceeded | Underlying failure persisted through all retry attempts | High | Check dead-letter queue, find original error
401 Unauthorized | API token expired or wrong credentials | Medium | Check API key/token config — usually a config fix
Response time > 10000ms | Slow query or overloaded external service | High | DBA for slow query / vendor for external slowness
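If you find yourself looking this table up constantly, a small shell function can run the first pass for you. A rough sketch, not an official tool; the patterns simply mirror the table above, and the function name is made up:

# triage LOGFILE: print the first occurrence of each known error pattern
triage() {
  for pattern in "DB_CONNECTION_TIMEOUT" "SocketTimeoutException" \
                 "NullPointerException" "OutOfMemoryError" \
                 "Connection refused" "Max retries exceeded" "401 Unauthorized"; do
    # -m 1 stops after the first match; the first occurrence is what matters
    grep -m 1 -n "$pattern" "$1" && echo "   -> escalate per the table: $pattern"
  done
}

Run it as: triage payment-service.log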
07 Real L2 Scenarios
01

Client says "payments started failing at 2 PM." You grep the log for ERROR around 14:00. You find SocketTimeoutException pointing to the SBP gateway. You check SBP's status page — they had a 15-minute outage. Not your system's fault. You document it and inform the client.
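To reproduce that first move yourself: filter by the hour, then by level. A sketch, assuming the timestamp format from this lesson and a placeholder log file name:

# All ERROR lines logged between 14:00 and 14:09
grep "2024-03-15 14:0" payment-service.log | grep "ERROR"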

02

You see 200 ERROR lines in the log but they're all the same: same error, same service, repeating every 2 seconds. This is called an error storm — one root cause is triggering hundreds of errors. Focus on the first occurrence, not the count. Fix the first one and the rest stop.
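To see through an error storm, collapse the duplicates and count them, then jump to the first occurrence. A sketch; the cut strips the leading timestamp (everything up to the first ]) so identical messages group together:

# Count duplicate ERROR messages, most frequent first
grep "ERROR" payment-service.log | cut -d']' -f2- | sort | uniq -c | sort -rn
# Find the first occurrence: that is the one to investigate
grep -n "ERROR" payment-service.log | head -n 1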

03

A client reports "transaction TXN-4421 is showing pending but never completed." You grep the log for TXN-4421 — you find it entered the queue but there are no further lines about it. The queue processor likely stopped. You check the queue — it's backed up with 3,000 messages. The worker crashed silently.
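The trace technique from the lab applies directly: follow the transaction ID and note where its trail stops. A sketch; TXN-4421 is the ID from this scenario and the file name is a placeholder:

# Trace the stuck transaction; if the trail ends at "queued", the worker stopped
grep -n "TXN-4421" payment-service.log
# Check the last thing the service logged at all
tail -n 5 payment-service.log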

04

You open a log and see no ERROR lines at all but the client says it's broken. You look at the INFO lines — the last entry is from 2 hours ago. The application stopped logging entirely. Either the app crashed silently or the disk is full and it can't write logs. Run df -h immediately.
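Two quick checks for a silent log, assuming it lives at /var/log/payment-service.log (the path is illustrative):

# 1. When was the log last written? A timestamp hours old confirms logging stopped
ls -lh /var/log/payment-service.log
# 2. Is the disk full? 100% use on the log's filesystem means the app cannot write
df -h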

✅ Week 2 · Day 2 Outcomes — Can You Do This?