L2 Support Engineer · Fintech · Week 6.5 · Day 6

Log Monitoring

Logs are the most detailed record your system keeps. Central log monitoring means pulling logs from every service into one place and being able to search all of them at once — so finding root cause takes seconds, not hours.

Central Logs · Search Patterns · Root Cause · grep · Log Analysis
01 The Simple Idea First
Real-life Analogy

Imagine you manage 10 branches of a bank across a city. Each branch has its own filing cabinet of daily reports. If you need to find a specific customer complaint — you drive to each branch, open each cabinet, and search manually. That takes all day.

Now imagine all those reports are automatically sent to one central office every hour. You sit at one desk and search through all 10 branches simultaneously in under 10 seconds.

That is central log monitoring. Instead of SSHing into 5 different servers one by one to grep through logs — all logs flow into one central system where you search everything from one screen, instantly.

What is Central Log Monitoring?

In a real fintech environment, you have many services — the payment service, the API gateway, the MQ consumer, the DB, the OpenConnect service — each running on different servers and writing their own log files. Central logging collects all of those logs and sends them to one central platform.

Tools like ELK Stack (Elasticsearch, Logstash, Kibana) or Splunk are the most common. Logstash collects and ships logs. Elasticsearch stores and indexes them. Kibana is the search interface where you run queries and see results.

Without central logging — an incident investigation requires logging into 5 servers and running grep on each one separately. With central logging — you search all 5 servers in one query in under 3 seconds.
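To make that contrast concrete, the old manual approach looks something like the loop below — a minimal sketch, where the hostnames and log paths are illustrative assumptions, not values from a real environment.

terminal — the manual way central logging replaces
# Without central logging: SSH into each server and grep separately
# (hostnames and paths are illustrative only)
for host in pay-01 pay-02 mq-01 gw-01 db-01; do
  echo "=== $host ==="
  ssh "$host" 'grep "ERROR" /var/log/*.log'
done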

02 Anatomy of a Log Line — What Each Part Means

Every log line has the same structure

Before you can search logs effectively, you need to understand what each part of a log line tells you. Every well-structured log line contains: when it happened, how serious it is, which service wrote it, and what the message says.

[2024-03-15 14:02:05]  [ERROR]  [payment-service]  DB_CONNECTION_TIMEOUT failed after 30000ms  txn_id=TXN-501 server=pay-01
Timestamp: when it happened
Level: how serious
Service: which app wrote it
Message: what happened
Metadata: extra context
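Because the fields sit at fixed positions, you can pull any one of them out with awk. A small sketch, assuming the whitespace-separated format shown above (the bracketed timestamp counts as two fields):

terminal — extract fields from a log line
# $3 = level, $4 = service, $5 = first message token
echo '[2024-03-15 14:02:05]  [ERROR]  [payment-service]  DB_CONNECTION_TIMEOUT failed after 30000ms  txn_id=TXN-501 server=pay-01' \
  | awk '{print $3, $4, $5}'
# Output: [ERROR] [payment-service] DB_CONNECTION_TIMEOUT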
03 How Central Logging Works — The Flow

📡 Log Flow — From Service to Search

⚙️ Step 1 — Services
Every service writes its own log file
The payment service writes to payment-service.log. The MQ consumer writes to mq-consumer.log. The API gateway writes to api-gateway.log. Each server, each service — its own file.
pay-server-01: /var/log/payment-service.log
mq-server-01: /var/log/mq-consumer.log
📤 Step 2 — Log Shipper
Log agent ships logs to central platform
A lightweight agent (Filebeat, Logstash, Fluentd) runs on each server. It watches log files, reads new lines as they are written, and ships them to the central platform in real time.
Filebeat → watches /var/log/*.log → ships to Elasticsearch
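For reference, a minimal Filebeat configuration for this step might look like the sketch below — the paths and the Elasticsearch host are assumptions for illustration, not values from this course's environment.

filebeat.yml — minimal sketch (assumed paths and host)
filebeat.inputs:
  - type: log
    paths:
      - /var/log/*.log
output.elasticsearch:
  hosts: ["localhost:9200"]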
🗄️ Step 3 — Central Store
Logs stored and indexed centrally
Elasticsearch (or Splunk) receives all logs and indexes them — meaning every word is catalogued so you can search for any term across all logs instantly, regardless of which server it came from.
Elasticsearch indexes → 50 million log lines, all searchable
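You can check what Elasticsearch has indexed with its _cat API — a quick sketch, assuming Elasticsearch is reachable at localhost:9200:

terminal — check what is indexed
# List all indices with document counts and sizes
curl -s 'http://localhost:9200/_cat/indices?v'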
🔍 Step 4 — Search & Visualise
You search everything from one screen
Kibana (or Splunk's UI) is your search interface. You type a query — "ERROR AND DB_CONNECTION_TIMEOUT" — and within 2 seconds you see every matching line from every service across every server, sorted by time.
Kibana search: "ERROR" last 1 hour → 42 results, 5 services
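Under the hood, that Kibana search is just an Elasticsearch query. The same search from the command line, as a sketch — the index pattern (logs-*) and the field names (level, message) are assumptions that depend on how your shipper maps fields:

terminal — the same search without Kibana
# Lucene query-string search against Elasticsearch directly
curl -s 'http://localhost:9200/logs-*/_search?q=level:ERROR%20AND%20message:DB_CONNECTION_TIMEOUT&size=10'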
04 Search Patterns — How to Find What You Need

What is a search pattern?

A search pattern is a specific phrase, keyword, or combination of words you search for in logs to find relevant entries. Knowing the right pattern to search for is the difference between finding root cause in 30 seconds and reading logs for 30 minutes.

Good L2 engineers keep a mental (or written) library of patterns — the known error messages, the timeout text, the specific identifiers — so they can search for them immediately when an incident fires. The sketch after the table below shows one way to script such a library.

🔍 Essential Search Patterns for Fintech L2
Pattern to search | What it finds | Why you search it
ERROR | All error-level log entries | First search in any investigation — how many errors and what type
DB_CONNECTION_TIMEOUT | Database connection failures | DB pool exhausted or DB is down
SOCKET_TIMEOUT | External API timeouts | SBP, 1LINK, or gateway not responding
FAILED AND TXN- | Specific failed transactions | Find all failures, then check individual TXN IDs
PENDING longer than 30 | Stuck transactions | Transactions that never got a callback response
NullPointerException | Code-level crash | Application bug — escalate to L3/dev team
OutOfMemoryError | RAM exhausted | Application running out of memory — server resource issue
CALLBACK_ENDPOINT | Callback delivery issues | SBP callback not reaching your system
MAX_RETRY_EXCEEDED | Messages in dead letter | Something failed all retries and needs manual action
Consumer crashed | MQ consumer stopped | Queue will start piling up — restart needed
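One way to turn that library into a script — a minimal sketch that counts hits for each saved pattern against a log file (the file name is illustrative):

terminal — run a saved pattern library against a log
# Count how many times each known pattern appears
for pattern in "DB_CONNECTION_TIMEOUT" "SOCKET_TIMEOUT" "NullPointerException" "OutOfMemoryError" "MAX_RETRY_EXCEEDED"; do
  printf '%-25s %s\n' "$pattern" "$(grep -c "$pattern" payment-service.log)"
done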
05 grep Search Patterns — From Simple to Powerful

grep is your log search tool on the command line

In a central logging tool like Kibana, you use a search bar. On Kali Linux directly, you use grep. Both are doing the same thing — searching for patterns in log data. Knowing grep well means you can investigate even without a fancy tool.

grep search patterns — from basic to advanced
# BASIC — find all ERROR lines
grep "ERROR" payment-service.log

# COUNT — how many errors total
grep -c "ERROR" payment-service.log

# CONTEXT — show 2 lines BEFORE and 3 lines AFTER each error
# This shows what was happening just before the error
grep -B 2 -A 3 "ERROR" payment-service.log

# TWO PATTERNS — find lines with ERROR that also mention timeout
grep "ERROR" payment-service.log | grep "TIMEOUT"

# TIME RANGE — find errors only from 14:00 to 15:00
grep "2024-03-15 14:" payment-service.log | grep "ERROR"

# SPECIFIC TXN — trace one transaction through the log
grep "TXN-501" payment-service.log

# MULTIPLE FILES — search across ALL log files at once
grep -r "DB_CONNECTION_TIMEOUT" ~/logs/

# UNIQUE ERRORS — count each error type once
grep "ERROR" payment-service.log | awk '{print $4}' | sort | uniq -c | sort -rn

# LIVE MONITORING — watch log as new lines appear in real time
tail -f payment-service.log | grep --line-buffered "ERROR"
💡 The most powerful pattern for root cause: grep -B 3 "ERROR" — shows the 3 lines before each error. Almost always, those 3 lines contain the WARN messages that led to the error. This single command shows you the build-up and the crash together.
06 Hands-on Lab — Find Root Cause from Logs

Lab scenario

Client reports: "Multiple payments failed between 14:00 and 15:00. We don't know which ones or why." You have access to the payment service log. Your job is to find the root cause using search patterns only — no database access yet.

🔬 Lab: Trace Root Cause Using Log Search Patterns

grep · Pattern Search · Kali Linux
01
Create the investigation log — multi-service scenario
This log combines entries from the payment service and the MQ consumer — simulating what a central log platform would show you when you search across services.
terminal — create investigation log
cat > ~/investigation.log << 'EOF'
[2024-03-15 13:58:00] [INFO ] [payment-service] Service running normally. TXN rate: 142/sec
[2024-03-15 13:59:10] [INFO ] [mq-consumer] Queue depth: 2. Processing normally.
[2024-03-15 14:00:00] [INFO ] [payment-service] TXN-501 received. Amount: 15000
[2024-03-15 14:00:01] [INFO ] [payment-service] TXN-501 validated. Sending to DB.
[2024-03-15 14:00:01] [WARN ] [payment-service] DB connection pool at 81%
[2024-03-15 14:00:02] [WARN ] [payment-service] DB connection pool at 89%
[2024-03-15 14:00:03] [WARN ] [payment-service] DB connection pool at 96%
[2024-03-15 14:00:04] [ERROR] [payment-service] DB_CONNECTION_TIMEOUT TXN-501 failed after 30000ms
[2024-03-15 14:00:04] [ERROR] [payment-service] TXN-501 FAILED cannot write to database
[2024-03-15 14:00:05] [ERROR] [payment-service] DB_CONNECTION_TIMEOUT TXN-502 failed after 30000ms
[2024-03-15 14:00:05] [ERROR] [payment-service] TXN-502 FAILED cannot write to database
[2024-03-15 14:00:06] [ERROR] [mq-consumer] Cannot process message. DB unreachable. Retrying.
[2024-03-15 14:00:10] [WARN ] [mq-consumer] Queue depth growing: 15 messages pending
[2024-03-15 14:01:00] [ERROR] [payment-service] DB_CONNECTION_REFUSED connection refused on port 5432
[2024-03-15 14:01:00] [ERROR] [payment-service] TXN-503 FAILED DB service appears down
[2024-03-15 14:01:00] [ERROR] [mq-consumer] Queue depth: 28. Consumer paused. Waiting for DB.
[2024-03-15 14:15:00] [INFO ] [payment-service] DB connection restored. Pool at 12%.
[2024-03-15 14:15:01] [INFO ] [payment-service] TXN-504 processed successfully. Status: SUCCESS
[2024-03-15 14:15:02] [INFO ] [mq-consumer] Queue draining. Depth: 22. Processing resumed.
[2024-03-15 14:18:00] [INFO ] [mq-consumer] Queue depth: 0. All messages processed.
EOF
→ investigation.log created with 20 entries spanning 2 services over a 20-minute incident window.
02
Step 1 — First search: count all errors and see what type they are
The very first thing you do in any investigation — how many errors and what kind.
terminal
# Total error count
echo "Total errors: $(grep -c 'ERROR' ~/investigation.log)"

# Error types — what errors appeared and how many times each
grep "ERROR" ~/investigation.log | awk '{print $5}' | sort | uniq -c | sort -rn

# Which services had errors?
grep "ERROR" ~/investigation.log | awk '{print $4}' | sort | uniq -c
→ 8 total errors. DB_CONNECTION_TIMEOUT appears twice, DB_CONNECTION_REFUSED once. Both payment-service and mq-consumer affected. Pattern: it is a DB issue.
03
Step 2 — Find the build-up before the first error
Show 3 lines before each ERROR — this reveals the warning signs that appeared before the crash.
terminal
# Show context before every error — the build-up
grep -B 3 "ERROR" ~/investigation.log | head -15

# When did the first warning appear?
grep -m 1 "WARN" ~/investigation.log

# When did the first error appear?
grep -m 1 "ERROR" ~/investigation.log
→ WARN lines appeared at 14:00:01, 14:00:02, 14:00:03 — DB pool climbing. First ERROR at 14:00:04. The 3 WARN lines showed the build-up. Root cause: DB connection pool exhausted.
04
Step 3 — List all affected transactions
Which specific transactions failed? The client needs this for reconciliation.
terminal
# All failed transactions
grep "FAILED" ~/investigation.log

# Just the TXN IDs of failures
grep "FAILED" ~/investigation.log | grep -o "TXN-[0-9]*"

# When did recovery happen?
grep "restored\|SUCCESS\|resumed" ~/investigation.log
→ TXN-501, TXN-502, TXN-503 all failed. Recovery at 14:15. Outage window: 14:00:04 to 14:15:00 — approximately 15 minutes.
05
Step 4 — Search across multiple log files (simulate central logging)
In a real environment you search all services at once. Here we simulate it by creating a second log file and searching both.
terminal — multi-file search
# Create a second service log (simulating another server)
echo "[2024-03-15 14:00:05] [ERROR] [api-gateway] Upstream DB unreachable. 503 returned." > ~/api-gateway.log
echo "[2024-03-15 14:01:00] [ERROR] [api-gateway] All payment endpoints returning 503." >> ~/api-gateway.log
echo "[2024-03-15 14:15:05] [INFO ] [api-gateway] Upstream restored. Endpoints healthy." >> ~/api-gateway.log

# Now search ALL log files at once — central logging style
# (grep prefixes each match with its file name when given multiple files)
grep "ERROR" ~/investigation.log ~/api-gateway.log

# Count errors per service across both files
grep -h "ERROR" ~/investigation.log ~/api-gateway.log | awk '{print $4}' | sort | uniq -c
→ Errors found across all 3 services: payment-service, mq-consumer, api-gateway — all caused by the same DB outage. This is the power of central search.
06
Step 5 — Write the root cause from log evidence
Produce the Jira RCA from what the logs told you — no guessing, only evidence.
Jira RCA — log-based evidence
Incident : Payment failures 14:00 – 14:15
Evidence : investigation.log + api-gateway.log

Timeline (from logs):
14:00:01 — DB pool WARN at 81%
14:00:03 — DB pool WARN at 96% (critical threshold)
14:00:04 — DB_CONNECTION_TIMEOUT — pool exhausted
14:00:04 — TXN-501, TXN-502 FAILED
14:01:00 — DB_CONNECTION_REFUSED — DB process down
14:01:00 — TXN-503 FAILED, api-gateway returning 503
14:15:00 — DB connection restored. Queue draining.
14:18:00 — All systems recovered

Root Cause : DB connection pool exhausted at 14:00:04.
DB process subsequently crashed at 14:01:00.
3 transactions failed. MQ and API gateway
both affected as downstream services.
Next Steps : DBA to investigate pool exhaustion cause.
Reconcile TXN-501, TXN-502, TXN-503.
→ Complete log-based RCA. Every claim backed by a log entry with a timestamp. This is professional incident documentation. ✅
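The timeline section of that RCA can be pulled straight from the logs rather than typed by hand — a sketch that extracts the evidence lines from both files, merged in time order (lexical sort works because every line starts with the timestamp):

terminal — extract RCA timeline evidence
# Pull WARN, ERROR, and recovery lines from both logs
grep -hE "WARN|ERROR|restored|resumed" ~/investigation.log ~/api-gateway.log | sort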
07 Real L2 Scenarios
01

Incident fires. Instead of SSHing into 4 servers one by one, you open Kibana and type "ERROR AND DB_CONNECTION_TIMEOUT" with a time range of the last 30 minutes. In 3 seconds you see 48 results from 4 different services — all pointing to the same DB. Root cause identified before you even open a terminal.

02

Client says: "TXN-9981 failed but we don't know why." You grep for TXN-9981 across all logs. You find: INFO received → WARN DB pool 94% → ERROR DB timeout → FAILED. Full story in 4 log lines. You tell the client: the transaction failed because the DB connection pool was exhausted. Escalated to DBA.

03

Manager asks: "Did we have any errors yesterday between 2 AM and 4 AM?" In a central log tool you set the time range and search ERROR. Result: 0 errors in that window. Definitive answer in 5 seconds. Without central logging this would require SSHing into every server and grepping individual files — 20 minutes of work.

04

You notice a pattern — the same WARN message about DB pool appears every day around 3 PM. It has never crossed into ERROR — but it is building. You raise a proactive ticket before it ever becomes an incident. This is log monitoring used for prevention, not just investigation. The DBA increases the pool size and the warning disappears.

✅ Week 6.5 · Day 6 Outcomes