L2 Support Engineer · Fintech · Week 2
Week 2 · Day 2
Today's Topic

Log Investigation

Logs are the black box of your system. When something breaks, the log file already knows what happened, when it happened, and why. Your job is to read it correctly.

Error Pattern Analysis · Root Cause Indicators · Failed Transaction Trace
01 The Simple Idea First
Real-life Analogy

Think of a log file like a CCTV recording of your system. Every single thing the system does gets recorded with a timestamp — every request received, every decision made, every error that happened.

When something goes wrong, you don't guess — you rewind the CCTV. You go to the exact time the problem started, watch what happened, and find exactly where things went wrong. That's log investigation.

What is a Log File?

A log file is a plain text file that your application writes to continuously. Every action gets a line — with a timestamp, a level (INFO, WARN, ERROR), and a message describing what happened. As an L2 engineer, you read these lines to understand what the system was doing at the moment something failed.

The key skill is not just finding errors — it's reading what happened BEFORE the error to understand why it happened. That context is everything.
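On a real Linux server, the first tool for reading a live log is tail. A minimal sketch, assuming the application writes to /var/log/payment-service.log (the path is illustrative; the lab below uses a file in your home directory):

# Show the last 50 lines, then keep printing new lines as they arrive (-f = follow)
tail -n 50 -f /var/log/payment-service.log
# Press Ctrl+C to stop following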

02 Anatomy of a Log Line
payment-service.log
[2024-03-15 14:02:01] [INFO ] Transaction TXN-9823 received. Amount: 5000. Account: 0012XXXX
[2024-03-15 14:02:02] [INFO ] Validating transaction TXN-9823 — passed all checks
[2024-03-15 14:02:02] [INFO ] TXN-9823 queued successfully for processing
[2024-03-15 14:02:03] [WARN ] DB connection pool at 85% capacity — approaching limit
[2024-03-15 14:02:04] [WARN ] DB connection pool at 94% capacity
[2024-03-15 14:02:05] [ERROR] DB_CONNECTION_TIMEOUT — failed to acquire connection after 30s
[2024-03-15 14:02:05] [ERROR] Transaction TXN-9823 FAILED — unable to write to database
[2024-03-15 14:02:05] [INFO ] Rollback initiated for TXN-9823

How to read this log — 3 things to always look for

1. Timestamp — tells you exactly when each event happened. Always note the time the first ERROR appeared. Everything before that is the lead-up.

2. Log Level — INFO means normal, WARN means something is trending wrong, ERROR means it broke. Notice how the WARNs appeared BEFORE the ERROR — that's the pattern.

3. The message — the actual description of what happened. DB_CONNECTION_TIMEOUT tells you exactly what failed. You now know to investigate the database connection pool — not the payment logic.
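Because the level sits in a fixed position, you can get a health summary of an entire log in one command. A minimal sketch, assuming the bracketed level format shown above:

# Tally log lines by level: a quick feel for how healthy the log is
grep -oE "\[(INFO |WARN |ERROR)\]" payment-service.log | sort | uniq -c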

03 Common Error Patterns You Will See
🗄️
Pattern 1
Database Timeout / Connection Failure
"App can't talk to the database"
Typical log lines
[WARN ] DB connection pool at 90% capacity
[ERROR] DB_CONNECTION_TIMEOUT after 30000ms
[ERROR] Cannot acquire connection from pool — pool exhausted
Root Cause Indicator: Database connection pool is full. Too many open connections, no new ones available. Check if a previous process leaked connections without closing them. Escalate to DBA or L3.
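A quick way to confirm this pattern is to pull the pool warnings and the timeout into one view, so you can watch the capacity climb. A sketch against the sample log used in this lesson:

# Watch the pool fill up and hit the timeout, with line numbers for context
grep -nE "connection pool|DB_CONNECTION_TIMEOUT" payment-service.log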
💥
Pattern 2
NullPointerException / Unexpected Null
"System received empty data where it expected something"
Typical log lines
[ERROR] NullPointerException at PaymentProcessor.java:142
[ERROR] Field 'accountNumber' is null — cannot process transaction
[WARN ] Incoming request missing required field: beneficiaryIBAN
Root Cause Indicator: The request came in with missing or empty data. Usually a problem with how the client sent the request — a field was blank or not included. Check what data arrived in the request and compare to what was expected.
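To check what data actually arrived, search for null and missing-field complaints together with the lines just before them. A sketch; the log name is a placeholder and the wording mirrors the examples above:

# Show null/missing-field complaints plus 2 lines of context before each
grep -B 2 -iE "null|missing required field" payment-service.log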
⏱️
Pattern 3
External Service / API Timeout
"System called an external service and got no reply"
Typical log lines
[INFO ] Calling SBP gateway — POST /raast/payment
[WARN ] SBP gateway response time: 8500ms — exceeding threshold
[ERROR] SocketTimeoutException — no response from SBP after 10000ms
[ERROR] TXN-9823 failed — external gateway unreachable
Root Cause Indicator: Your system is fine — the external service (SBP, bank gateway) did not respond in time. Check whether the external service has a status page or a known outage. This is NOT your system's fault, but you still need to document it and inform the client.
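You can trace the whole exchange with one search, since the slow-response WARN usually precedes the timeout. A sketch, assuming gateway lines like the examples above:

# Follow the gateway call end to end: request, slow-response warning, timeout
grep -nE "SBP gateway|SocketTimeoutException" payment-service.log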
🔁
Pattern 4
Repeated Retries / Infinite Loop
"System keeps trying the same thing and failing repeatedly"
Typical log lines
[WARN ] Retry attempt 1/3 for TXN-9823
[WARN ] Retry attempt 2/3 for TXN-9823
[WARN ] Retry attempt 3/3 for TXN-9823
[ERROR] Max retries exceeded — TXN-9823 moved to dead-letter queue
Root Cause Indicator: The system tried multiple times and all failed. The transaction is now in the dead-letter queue — a holding area for transactions that couldn't be processed. You need to check why all retries failed (usually the same root cause as above) and decide whether to reprocess or refund.
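Retry storms are easiest to read as counts per transaction. A sketch that tallies retry attempts by transaction ID, assuming the "Retry attempt" wording shown above:

# Count retry attempts per transaction ID, busiest first
grep "Retry attempt" payment-service.log | grep -oE "TXN-[0-9]+" | sort | uniq -c | sort -rn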
04 What to Look For — Quick Reference
🔴
Immediate Red Flags

Stop and act now (a one-line scan for these appears below)

  • Words: ERROR, FATAL, CRITICAL, EXCEPTION
  • DB_CONNECTION_TIMEOUT or pool exhausted
  • OutOfMemoryError
  • StackOverflowError
  • Service unreachable or connection refused
🟡
Warning Signs

Investigate before it breaks

  • Words: WARN, WARNING, SLOW, RETRY
  • Response time exceeding threshold
  • Connection pool nearing limit
  • Retry attempts happening
  • Queue depth growing
🔍
Context Clues

Always check before the error

  • What was the last INFO before the ERROR?
  • Was there a deployment recently?
  • Did WARNs appear before the ERROR?
  • Is the same error repeating?
  • What Transaction ID is involved?
🟢
Healthy Signs

System is working normally

  • Consistent INFO lines flowing steadily
  • Transaction IDs completing with SUCCESS
  • Response times within normal range
  • No identical error lines repeating
  • No WARNs escalating over time
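As promised above, the Immediate Red Flags list compresses into a single first-pass scan you can run on any log before reading it line by line. A minimal sketch:

# First-pass triage: surface every red-flag line with its line number
grep -nE "ERROR|FATAL|CRITICAL|EXCEPTION|pool exhausted|[Cc]onnection refused" payment-service.log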
05 Hands-on Lab — Investigate a Failed Transaction

Can you perform this on your own machine?

Yes — partially. Since you have Kali Linux set up on VirtualBox, you can simulate this entire investigation using a dummy log file we create ourselves. You don't need a real production server. Here is exactly how to do it step by step.

This gives you the same experience as a real investigation — reading real log content, running real Linux commands, and finding a real root cause.

🔬 Lab: Simulate & Investigate a Failed Transaction Log

Performable on Kali Linux · VirtualBox
01
Open your Kali Linux terminal
Boot up your VirtualBox Kali machine and open the terminal. Navigate to your home directory.
cd ~
02
Create a dummy log file with real-looking entries
Copy and paste this entire block into your terminal. It creates a realistic payment log file with a failure built into it.
cat > payment-service.log << 'EOF'
[2024-03-15 14:01:55] [INFO ] Transaction TXN-9820 received. Amount: 2000
[2024-03-15 14:01:56] [INFO ] TXN-9820 validated and queued successfully
[2024-03-15 14:01:57] [INFO ] TXN-9820 processed successfully — Status: SUCCESS
[2024-03-15 14:02:00] [INFO ] Transaction TXN-9821 received. Amount: 8500
[2024-03-15 14:02:01] [INFO ] TXN-9821 validated and queued successfully
[2024-03-15 14:02:02] [WARN ] DB connection pool at 82% capacity
[2024-03-15 14:02:03] [INFO ] Transaction TXN-9822 received. Amount: 3000
[2024-03-15 14:02:03] [WARN ] DB connection pool at 91% capacity — approaching limit
[2024-03-15 14:02:04] [INFO ] Transaction TXN-9823 received. Amount: 5000
[2024-03-15 14:02:04] [WARN ] DB connection pool at 97% capacity — CRITICAL
[2024-03-15 14:02:05] [ERROR] DB_CONNECTION_TIMEOUT — failed to acquire connection after 30s
[2024-03-15 14:02:05] [ERROR] Transaction TXN-9823 FAILED — unable to write to database
[2024-03-15 14:02:05] [INFO ] Rollback initiated for TXN-9823
[2024-03-15 14:02:06] [ERROR] DB_CONNECTION_TIMEOUT — failed to acquire connection after 30s
[2024-03-15 14:02:06] [ERROR] Transaction TXN-9822 FAILED — unable to write to database
[2024-03-15 14:02:07] [WARN ] Retry attempt 1/3 for TXN-9822
[2024-03-15 14:02:09] [WARN ] Retry attempt 2/3 for TXN-9822
[2024-03-15 14:02:11] [WARN ] Retry attempt 3/3 for TXN-9822
[2024-03-15 14:02:12] [ERROR] Max retries exceeded — TXN-9822 moved to dead-letter queue
[2024-03-15 14:02:15] [INFO ] Transaction TXN-9824 received. Amount: 1200
[2024-03-15 14:02:16] [INFO ] TXN-9824 processed successfully — Status: SUCCESS
EOF
✅ This creates a file called payment-service.log in your home directory with 21 log lines.
03
First — find all ERROR lines
Use grep to pull out only the error lines from the file.
grep "ERROR" payment-service.log → You should see 5 ERROR lines all related to DB_CONNECTION_TIMEOUT and failed transactions
04
Count how many errors occurred
Use -c to count the total number of error lines — quick way to know the scale of the problem.
grep -c "ERROR" payment-service.log → Output: 5 (there are 5 error lines in this file)
05
Find the specific failed transaction — TXN-9823
Use grep to trace everything related to that single transaction ID.
grep "TXN-9823" payment-service.log → Shows all 3 lines about TXN-9823: received → FAILED → rollback
🔍 You can now see the full story of TXN-9823 — it arrived fine but failed at the DB write stage.
06
Get context — what happened just BEFORE the errors?
Use -n to see line numbers, then look at the lines before the first ERROR.
grep -n "ERROR\|WARN" payment-service.log → You see WARN lines at 82%, 91%, 97% BEFORE the ERROR lines appear
💡 Key finding: The WARNs about DB connection pool climbing (82% → 91% → 97%) appeared before the ERROR. This is the pattern — the pool was filling up and nobody acted on the warnings.
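A shortcut worth knowing: grep can print those context lines for you, so you do not have to jump between line numbers by hand. A sketch against the same file (-B prints the lines Before each match):

# Print each ERROR plus the 3 lines leading up to it
grep -B 3 "ERROR" payment-service.log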
07
Check which transactions succeeded vs failed
Compare successful vs failed transactions in the log.
grep "SUCCESS\|FAILED" payment-service.log → TXN-9820: SUCCESS | TXN-9823: FAILED | TXN-9822: FAILED | TXN-9824: SUCCESS
Conclusion: TXN-9820 and TXN-9824 succeeded. TXN-9822 and TXN-9823 failed. The failures happened exactly when the DB pool was exhausted. TXN-9822 also went to the dead-letter queue after 3 retries.
08
View the full log in order to see the complete picture
Our file is only 21 lines, so cat shows everything; on a large production log you would use tail -n 50 to read just the most recent entries.
cat payment-service.log
→ Full log displayed — you can now read the entire incident from start to finish

📋 Your Root Cause Report — What to Write in the Jira Ticket

Issue: Transactions TXN-9823 and TXN-9822 failed at 14:02:05 and 14:02:06 respectively.

Root Cause: Database connection pool became exhausted (reached 97% capacity). New transactions could not acquire a database connection, causing them to fail with DB_CONNECTION_TIMEOUT.

Evidence: Log shows WARN messages at 14:02:02 (82%), 14:02:03 (91%), and 14:02:04 (97%) before the first ERROR at 14:02:05. TXN-9822 was also moved to dead-letter queue after 3 failed retries.

Next Step: Escalate to DBA team to investigate connection leak. Check if any long-running query is holding connections open without releasing them.

06 Root Cause Indicators — Quick Reference
🎯 Error Message → What it Likely Means
Error in Log | Root Cause Indicator | Severity | Who to Escalate To
DB_CONNECTION_TIMEOUT | Database connection pool exhausted or DB server down | Critical | DBA / L3
SocketTimeoutException | External service (SBP, gateway) not responding | Critical | External vendor / L3
NullPointerException | Missing or empty field in the request data | High | Check request data first, then L3 if code bug
OutOfMemoryError | Server RAM is completely full | Critical | Infra team / L3 immediately
Connection refused | Target server or service is not running | Critical | Infra team to check target server
Max retries exceeded | Underlying failure persisted through all retry attempts | High | Check dead-letter queue, find original error
401 Unauthorized | API token expired or wrong credentials | Medium | Check API key/token config — usually a config fix
Response time > 10000ms | Slow query or overloaded external service | High | DBA for slow query / vendor for external slowness
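If you find yourself looking this table up constantly, a small shell function can run the first pass for you. A rough sketch, not an official tool; the patterns simply mirror the table above, and the function name is made up:

# triage LOGFILE: print the first occurrence of each known error pattern
triage() {
  for pattern in "DB_CONNECTION_TIMEOUT" "SocketTimeoutException" \
                 "NullPointerException" "OutOfMemoryError" \
                 "Connection refused" "Max retries exceeded" "401 Unauthorized"; do
    # -m 1 stops after the first match; the first occurrence is what matters
    grep -m 1 -n "$pattern" "$1" && echo "   -> escalate per the table: $pattern"
  done
}

Run it as: triage payment-service.log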
07 Real L2 Scenarios
01

Client says "payments started failing at 2 PM." You grep the log for ERROR around 14:00. You find SocketTimeoutException pointing to the SBP gateway. You check SBP's status page — they had a 15-minute outage. Not your system's fault. You document it and inform the client.
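To reproduce that first move yourself: filter by the hour, then by level. A sketch, assuming the timestamp format from this lesson and a placeholder log file name:

# All ERROR lines logged between 14:00 and 14:09
grep "2024-03-15 14:0" payment-service.log | grep "ERROR"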

02

You see 200 ERROR lines in the log but they're all the same: same error, same service, repeating every 2 seconds. This is called an error storm — one root cause is triggering hundreds of errors. Focus on the first occurrence, not the count. Fix the first one and the rest stop.
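To see through an error storm, collapse the duplicates and count them, then jump to the first occurrence. A sketch; the cut strips the leading timestamp (everything up to the first ]) so identical messages group together:

# Count duplicate ERROR messages, most frequent first
grep "ERROR" payment-service.log | cut -d']' -f2- | sort | uniq -c | sort -rn
# Find the first occurrence: that is the one to investigate
grep -n "ERROR" payment-service.log | head -n 1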

03

A client reports "transaction TXN-4421 is showing pending but never completed." You grep the log for TXN-4421 — you find it entered the queue but there are no further lines about it. The queue processor likely stopped. You check the queue — it's backed up with 3,000 messages. The worker crashed silently.
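The trace technique from the lab applies directly: follow the transaction ID and note where its trail stops. A sketch; TXN-4421 is the ID from this scenario and the file name is a placeholder:

# Trace the stuck transaction; if the trail ends at "queued", the worker stopped
grep -n "TXN-4421" payment-service.log
# Check the last thing the service logged at all
tail -n 5 payment-service.log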

04

You open a log and see no ERROR lines at all but the client says it's broken. You look at the INFO lines — the last entry is from 2 hours ago. The application stopped logging entirely. Either the app crashed silently or the disk is full and it can't write logs. Run df -h immediately.
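Two quick checks for a silent log, assuming it lives at /var/log/payment-service.log (the path is illustrative):

# 1. When was the log last written? A timestamp hours old confirms logging stopped
ls -lh /var/log/payment-service.log
# 2. Is the disk full? 100% use on the log's filesystem means the app cannot write
df -h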

✅ Week 2 · Day 2 Outcomes — Can You Do This?