The standard ECID-hunt scripts — find the session, build the kill statement, fire it — are table stakes.
Any competent DBA can do that in five minutes. But what happens after you kill the session and the user
reports the same hang forty minutes later? What happens when the ECID has already left
GV$SESSION by the time you respond, and you need to reconstruct exactly what that WebLogic
thread was doing at 02:17 AM on a Tuesday from AWR history? What do you do when 14 different ECIDs
are all piling onto the same SQL, and killing them one by one is just bailing water out of a sinking boat?
This post takes the standard ECID analysis toolkit and builds it into a full forensic investigation framework — combining ASH, AWR, DBA_HIST, analytic window functions, execution plan deltas, cross-instance RAC correlation, and PL/SQL automation — so you can answer not just "which session?" but "why, since when, from which EBS module, and what do I do to prevent it happening again?"
- ECID Propagation Architecture — The Full Stack
- Why ASH is the Right Tool (and its Limits)
- Query 1 — Enriched ASH ECID Scan with Temporal Heat
- Query 2 — ECID Timeline Reconstruction (Event Sequence)
- Query 3 — Top ECIDs by DB Time with Wait Profile
- Query 4 — ECID to FND User Mapping (No Guessing)
- Query 5 — Cross-Instance RAC ECID Correlation (GV$ASH)
- Query 6 — ECID Plan Stability Check (Child Cursor Explosion)
- Query 7 — AWR Historical Reconstruction (DBA_HIST_ASH + DBA_HIST_SQLSTAT)
- Query 8 — ECID Blocking Chain Tree (Hierarchical + Cross-Node)
- Query 9 — ECID-to-Concurrent-Request Bridge
- Query 10 — PL/SQL Automation: ECID Watchdog with Auto-Kill + Alert Log
- Bash: Extract ECID from oacore Logs (Production-Grade)
- Decision Framework — Which Query When
- Summary & RCA Template
🏗️ 1. ECID Propagation Architecture — The Full Stack
👶 Fresher
Before writing a single SQL query, understand where the ECID lives in the EBS stack — because the right extraction point depends on where the hang is occurring.
┌─────────────────────────────────────────────────────────┐
│ BROWSER (User's request) │
│ HTTP POST → F5 LTM (SSL VIP :443) │
└────────────────────┬────────────────────────────────────┘
│ ECID generated here by WLS
┌────────────────────▼────────────────────────────────────┐
│ WebLogic 12c — oacore Managed Server │
│ ▸ ECID stamped in ODL context: ecid=1.xxx...;kXjE │
│ ▸ Visible in: oacore-diagnostic.log / access.log │
│ ▸ Propagated via JDBC thin driver into DB session │
└────────────────────┬────────────────────────────────────┘
│
┌────────────────────▼────────────────────────────────────┐
│ Oracle DB 19c RAC (ASM / Solaris SPARC) │
│ ▸ GV$SESSION.ECID ← live sessions │
│ ▸ GV$ACTIVE_SESSION_HISTORY.ECID ← last ~60 mins │
│ ▸ DBA_HIST_ACTIVE_SESS_HISTORY.ECID ← AWR history │
│ ▸ FND_LOG_MESSAGES.EXECUTION_CONTEXT_ID ← EBS FND │
└─────────────────────────────────────────────────────────┘
🎓 Experienced
The ECID format in 19c is typically: 1.<random_hex>;<short_suffix>.
The short suffix (;kXjE, ;0f3a) is the part that identifies
the WLS request uniquely. When grepping logs, use the middle random hex string as your
LIKE pattern — it is the most stable and unique fragment. Avoid matching on the prefix
1. as it is constant across all ECIDs on the same instance.
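As a quick illustration, the stable middle fragment can be cut out of an ECID in one line for use as a LIKE pattern (the ECID value below is hypothetical; real values come from oacore-diagnostic.log):

```shell
# Extract the middle hex fragment of an ODL ECID for use as a SQL LIKE pattern.
# The sample ECID is a made-up placeholder, not a real value.
ecid="1.0059BfmkChLFw5PMOAQBVz0000qC00023B;kXjE"
frag="${ecid#1.}"      # drop the constant "1." prefix
frag="${frag%%;*}"     # drop the ";suffix" short identifier
echo "$frag"           # the unique middle fragment to grep / LIKE on
```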
- GV$ACTIVE_SESSION_HISTORY — in-memory, roughly the last hour (sampled every 1 second; 1-in-10 samples persisted to AWR)
- DBA_HIST_ACTIVE_SESS_HISTORY — the AWR-flushed copy, default 8-day retention, 10-second effective sample interval
- FND_LOG_MESSAGES — EBS-controlled, retention driven by the FND: Debug Log Retention profile option
Key implication: if the incident is more than 1 hour old, use DBA_HIST — and always confirm AWR retention first:
SELECT retention FROM dba_hist_wr_control;
⚙️ 2. Why ASH is the Right Tool (and its Limits)
✅ ASH Strengths
1-second granularity for live sessions. ECID persisted in every sample. Includes SQL_ID, wait event, blocking session, plan hash per sample. No performance overhead on the application.
⚠️ ASH Limitations
Only samples active sessions (STATUS=ACTIVE). Idle/INACTIVE sessions (e.g., holding a lock but not running SQL) are NOT captured. Short-lived sessions (<1 second) may be missed entirely. GV$ is in-memory only — flushed on instance restart.
The biggest blind spot is an idle blocker: the victims show up in ASH (waiting on enq: TX - row lock contention), but the blocker is invisible. For these cases, always cross-check with GV$SESSION live + GV$LOCK for the blocker SID, then trace that blocker separately.
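A live cross-check along these lines surfaces the idle blocker that ASH misses (a sketch; FINAL_BLOCKING_SESSION/FINAL_BLOCKING_INSTANCE exist in 11.2+):

```sql
-- Sketch: find idle blockers that ASH cannot see.
-- FINAL_BLOCKING_SESSION points at the root of the blocking chain,
-- even when that session is INACTIVE and never sampled by ASH.
SELECT s.inst_id, s.sid, s.serial#, s.status, s.event,
       s.blocking_instance, s.blocking_session,
       s.final_blocking_instance, s.final_blocking_session
FROM   gv$session s
WHERE  s.blocking_session IS NOT NULL          -- the waiters
   OR  (s.inst_id, s.sid) IN (                 -- and their (possibly idle) blockers
         SELECT final_blocking_instance, final_blocking_session
         FROM   gv$session
         WHERE  final_blocking_session IS NOT NULL);
```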
📊 3. Query 1 — Enriched ASH ECID Scan with Temporal Heat ADVANCED
The basic version groups by sample_time which produces too many rows. This version
uses time buckets (5-minute windows) and a heat ratio to immediately
show you when the ECID was burning DB time — not just that it was.
-- ============================================================
-- Q1: ECID Temporal Heat Map — 5-Min Buckets, Last 12 Hours
-- Shows WHEN each ECID caused DB pressure, not just IF
-- ============================================================
SELECT
TO_CHAR(TRUNC(a.sample_time, 'MI') -
MOD(TO_NUMBER(TO_CHAR(a.sample_time,'MI')),5)/(24*60),
'YYYY-MM-DD HH24:MI') AS time_bucket_5min,
a.ecid,
COUNT(*) AS total_samples,
SUM(CASE WHEN NVL(a.wait_class,'CPU') != 'Idle'
THEN 1 ELSE 0 END) AS active_samples,
ROUND(SUM(CASE WHEN NVL(a.wait_class,'CPU') != 'Idle'
THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS pct_active,
SUM(CASE WHEN a.wait_class = 'User I/O'
THEN 1 ELSE 0 END) AS io_samples,
SUM(CASE WHEN a.wait_class = 'Application'
THEN 1 ELSE 0 END) AS lock_samples,
SUM(CASE WHEN a.wait_class = 'Concurrency'
THEN 1 ELSE 0 END) AS concurrency_samples,
MAX(a.module) AS module,
STATS_MODE(a.sql_id) AS dominant_sql_id,
STATS_MODE(a.event) AS dominant_event,
-- Heat bar: filled blocks proportional to this bucket's share of
-- the ECID's busiest bucket (20 chars = 100%)
RPAD(RPAD('█',
GREATEST(ROUND(SUM(CASE WHEN NVL(a.wait_class,'CPU') != 'Idle'
THEN 1 ELSE 0 END) * 20.0
/ MAX(COUNT(*)) OVER (PARTITION BY a.ecid)), 1),
'█'), 20, '░')
AS heat_bar
FROM gv$active_session_history a
WHERE a.sample_time > SYSDATE - 12/24
AND a.ecid IS NOT NULL
AND a.program LIKE '%oacore%'
GROUP BY
TO_CHAR(TRUNC(a.sample_time, 'MI') -
MOD(TO_NUMBER(TO_CHAR(a.sample_time,'MI')),5)/(24*60),
'YYYY-MM-DD HH24:MI'),
a.ecid
HAVING COUNT(*) > 3
ORDER BY time_bucket_5min DESC, active_samples DESC;
The nested RPAD('█', ..., '░') construct renders an ASCII bar chart
inline in SQL*Plus / SQL Developer output. Each row becomes a visual timeline entry — you can
immediately see which ECID was burning active samples in which 5-minute window without leaving the DB tool.
STATS_MODE(a.sql_id) and STATS_MODE(a.event) pick the single most-frequent SQL_ID and EVENT per ECID per
bucket — the statistical mode, not MAX by sort order. (STATS_MODE is an aggregate function only; it does not accept an OVER clause.)
🔬 4. Query 2 — ECID Event Sequence Reconstruction ADVANCED
This rebuilds the exact sequence of events for a specific ECID as a timeline — useful for proving to the business exactly what the request did step-by-step and for how long.
-- ============================================================
-- Q2: ECID Event Timeline — Ordered Sequence with Gaps
-- Reconstructs what the WLS thread did, step by step
-- ============================================================
SELECT
a.sample_time,
a.ecid,
a.session_id AS sid,
a.session_serial# AS serial#,
a.inst_id,
a.sql_id,
a.sql_child_number,
a.plan_hash_value,
a.event,
a.wait_class,
ROUND(a.time_waited / 1000, 2) AS time_waited_ms,
a.blocking_session,
a.blocking_inst_id,
-- Gap analysis: seconds since previous sample for this ECID
ROUND((CAST(a.sample_time AS DATE)
- CAST(LAG(a.sample_time)
OVER (PARTITION BY a.ecid
ORDER BY a.sample_time) AS DATE))
* 86400, 1) AS gap_from_prev_secs,
-- Cumulative DB time spent (non-idle) for this ECID
SUM(CASE WHEN NVL(a.wait_class,'CPU') != 'Idle' THEN 1 ELSE 0 END)
OVER (PARTITION BY a.ecid ORDER BY a.sample_time
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS cumul_active_samples,
-- Event change flag: did the wait event change from prior sample?
CASE WHEN a.event = LAG(a.event)
OVER (PARTITION BY a.ecid
ORDER BY a.sample_time)
THEN '→ same'
ELSE '★ CHANGED'
END AS event_change,
a.module,
a.action,
a.program,
a.machine
FROM gv$active_session_history a
WHERE a.sample_time > SYSDATE - 12/24
AND a.ecid LIKE '%&ECID_EXTRACT%'
ORDER BY a.sample_time ASC; -- ASC = chronological for readability
Watch for rows where event_change = '★ CHANGED' — these are state transitions.
A session that goes CPU → enq: TX row lock → SQL*Net from client → CPU tells a very different
story than one stuck on db file sequential read for 200 consecutive samples.
The gap_from_prev_secs column reveals moments where the session was INACTIVE or completed —
a gap >10 seconds means the ECID was idle (waiting for next WLS call or finished a round-trip).
📈 5. Query 3 — Top ECIDs by DB Time with Full Wait Profile ADVANCED
-- ============================================================
-- Q3: Top ECIDs — Multi-Dimensional Wait Profile
-- Shows the complete wait breakdown per ECID in one row
-- ============================================================
SELECT
ecid,
module,
MIN(sample_time) AS first_seen,
MAX(sample_time) AS last_seen,
ROUND((CAST(MAX(sample_time) AS DATE)
- CAST(MIN(sample_time) AS DATE)) * 1440, 1)
AS elapsed_mins,
COUNT(*) AS total_ash_samples,
-- DB Time breakdown (non-idle active samples)
SUM(CASE WHEN NVL(wait_class,'CPU') != 'Idle' THEN 1 ELSE 0 END)
AS db_time_samples,
-- CPU samples carry a NULL wait_class; test SESSION_STATE instead
SUM(CASE WHEN session_state = 'ON CPU' THEN 1 ELSE 0 END)
AS cpu_samples,
SUM(CASE WHEN wait_class = 'User I/O' THEN 1 ELSE 0 END)
AS io_samples,
SUM(CASE WHEN wait_class = 'Application' THEN 1 ELSE 0 END)
AS lock_samples,
SUM(CASE WHEN wait_class = 'Concurrency' THEN 1 ELSE 0 END)
AS concurrency_samples,
SUM(CASE WHEN wait_class = 'Network' THEN 1 ELSE 0 END)
AS network_samples,
-- Most frequent SQL ID for this ECID (using subquery-free pivot)
STATS_MODE(sql_id) AS dominant_sql_id,
STATS_MODE(event) AS dominant_event,
STATS_MODE(plan_hash_value) AS dominant_plan_hash,
-- Plan instability flag (multiple plan hashes for same ECID)
CASE WHEN COUNT(DISTINCT plan_hash_value) > 1
THEN '⚠ PLAN UNSTABLE ('
|| COUNT(DISTINCT plan_hash_value) || ' plans)'
ELSE '✔ stable'
END AS plan_stability,
COUNT(DISTINCT session_id || '@' || inst_id) AS distinct_sessions
FROM gv$active_session_history
WHERE sample_time > SYSDATE - 12/24
AND ecid IS NOT NULL
AND program LIKE '%oacore%'
GROUP BY ecid, module
HAVING COUNT(*) > 5
ORDER BY db_time_samples DESC
FETCH FIRST 30 ROWS ONLY;
STATS_MODE(sql_id) returns the statistical mode — the most frequently occurring value.
This is far superior to MAX(sql_id) (which returns alphabetically last, meaningless)
or a correlated subquery. It natively answers: "which SQL did this ECID spend the most time on?"
The plan_stability column immediately flags ECIDs with plan instability —
a key signal that the real problem is a bad execution plan flip, not a lock or I/O issue.
👤 6. Query 4 — ECID to FND User Mapping ADVANCED
GV$SESSION.USERNAME is always APPS in EBS. You need to join through
FND_LOG_MESSAGES or V$SESSION.CLIENT_IDENTIFIER to get the actual
FND user. This query does it directly from ASH without requiring a live session.
-- ============================================================
-- Q4: ECID → FND User + Responsibility Resolution
-- Works even after the session has ended (uses FND log history)
-- ============================================================
SELECT DISTINCT
a.ecid,
a.client_id AS fnd_user_or_client_id,
fu.user_name AS fnd_username,
fu.description AS fnd_user_desc,
flm.module AS fnd_module,
MIN(a.sample_time) OVER (PARTITION BY a.ecid)
AS ecid_first_seen,
MAX(a.sample_time) OVER (PARTITION BY a.ecid)
AS ecid_last_seen,
COUNT(*) OVER (PARTITION BY a.ecid) AS ecid_total_samples,
-- STATS_MODE is aggregate-only (no OVER clause); use the
-- last-seen event/SQL per ECID via KEEP ... OVER instead
MAX(a.event) KEEP (DENSE_RANK LAST ORDER BY a.sample_time)
OVER (PARTITION BY a.ecid) AS dominant_event,
MAX(a.sql_id) KEEP (DENSE_RANK LAST ORDER BY a.sample_time)
OVER (PARTITION BY a.ecid) AS dominant_sql_id
FROM gv$active_session_history a
LEFT JOIN fnd_user fu
ON fu.user_name = UPPER(a.client_id)
LEFT JOIN (
SELECT DISTINCT execution_context_id, module
FROM fnd_log_messages
WHERE log_sequence > (SELECT MAX(log_sequence) - 1000000
FROM fnd_log_messages)
) flm
ON flm.execution_context_id = a.ecid
WHERE a.sample_time > SYSDATE - 12/24
AND a.ecid IS NOT NULL
AND a.program LIKE '%oacore%'
ORDER BY ecid_total_samples DESC;
EBS populates GV$SESSION.CLIENT_IDENTIFIER
(and therefore ASH.CLIENT_ID) with the FND username via DBMS_SESSION.SET_IDENTIFIER.
The join to FND_USER converts this to the full user record. The FND_LOG_MESSAGES
subquery adds the OAF module context — the actual page or responsibility the user was navigating.
The log_sequence predicate avoids a full table scan on the often-huge FND_LOG_MESSAGES table.
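For context, this is the mechanism by which the FND username lands in CLIENT_IDENTIFIER — a minimal sketch of what the FND security layer does on connect, not the actual EBS code ('JSMITH' is a placeholder):

```sql
-- Sketch: how a mid-tier connection tags its DB session.
-- EBS performs the equivalent of this call after authenticating the user.
BEGIN
  DBMS_SESSION.SET_IDENTIFIER('JSMITH');  -- placeholder FND username
END;
/
-- From this point on, GV$SESSION.CLIENT_IDENTIFIER (and ASH.CLIENT_ID)
-- carry 'JSMITH' in every sample of that session.
```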
🌐 7. Query 5 — Cross-Instance RAC ECID Correlation ADVANCED
In a 2-node (or 3-node) RAC, a single WLS request can spawn activity on multiple instances — for example, the SQL runs on node 1 but the blocking session is on node 2, causing global cache waits. This query maps the full multi-node footprint of an ECID.
-- ============================================================
-- Q5: Cross-Instance RAC Footprint for a Specific ECID
-- Identifies global cache pressure, cross-node blocking
-- ============================================================
SELECT
a.inst_id,
a.session_id AS sid,
a.session_serial# AS serial#,
MIN(a.sample_time) AS first_seen,
MAX(a.sample_time) AS last_seen,
COUNT(*) AS samples_on_node,
STATS_MODE(a.event) AS dominant_event,
STATS_MODE(a.sql_id) AS dominant_sql_id,
-- GC (global cache) wait detection
SUM(CASE WHEN a.event LIKE 'gc%' THEN 1 ELSE 0 END)
AS gc_wait_samples,
SUM(CASE WHEN a.event LIKE 'gc cr%' THEN 1 ELSE 0 END)
AS gc_cr_samples,
SUM(CASE WHEN a.event LIKE 'gc buffer busy%' THEN 1 ELSE 0 END)
AS gc_buffer_busy,
-- Cross-node blocking (blocker on different inst_id)
SUM(CASE WHEN a.blocking_inst_id != a.inst_id
AND a.blocking_inst_id IS NOT NULL
THEN 1 ELSE 0 END) AS cross_node_block_samples,
a.machine AS app_server
FROM gv$active_session_history a
WHERE a.sample_time > SYSDATE - 12/24
AND a.ecid LIKE '%&ECID_EXTRACT%'
GROUP BY a.inst_id, a.session_id, a.session_serial#, a.machine
ORDER BY a.inst_id, samples_on_node DESC;
If gc_buffer_busy is non-zero,
the ECID is hitting a hot block shared between RAC nodes — usually a right-growing
index under heavy concurrent DML, or a sequence created NOCACHE/ORDER. This is a tuning issue,
not a session kill issue. Killing the session will not fix it.
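If the hot block traces back to an uncached or ordered sequence, the usual remediation is along these lines (the sequence name is a placeholder; size the cache to your insert rate):

```sql
-- Placeholder sequence name. CACHE + NOORDER let each RAC node
-- hand out values from its own cached range, removing cross-node
-- contention on the sequence and easing hot index-block pressure.
ALTER SEQUENCE apps.xx_demo_seq CACHE 1000 NOORDER;
```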
📐 8. Query 6 — ECID Plan Stability Check ADVANCED
-- ============================================================
-- Q6: Plan Instability Detection per ECID
-- Identifies ECIDs where the execution plan changed mid-flight
-- ============================================================
SELECT
a.ecid,
a.sql_id,
a.plan_hash_value,
COUNT(*) AS samples_with_this_plan,
MIN(a.sample_time) AS plan_first_seen,
MAX(a.sample_time) AS plan_last_seen,
AVG(a.time_waited) / 1000 AS avg_wait_ms_this_plan,
-- Is this plan the current one still in shared pool?
CASE WHEN EXISTS (
SELECT 1 FROM gv$sql sq
WHERE sq.sql_id = a.sql_id
AND sq.plan_hash_value = a.plan_hash_value
) THEN '✔ in pool'
ELSE '✘ aged out'
END AS plan_in_shared_pool,
-- Relative plan performance vs best plan for same SQL
ROUND(AVG(a.time_waited) /
NULLIF(MIN(AVG(a.time_waited)) OVER
(PARTITION BY a.ecid, a.sql_id), 0), 2)
AS relative_cost_vs_best_plan
FROM gv$active_session_history a
WHERE a.sample_time > SYSDATE - 12/24
AND a.ecid LIKE '%&ECID_EXTRACT%'
AND a.plan_hash_value IS NOT NULL
GROUP BY a.ecid, a.sql_id, a.plan_hash_value
ORDER BY a.sql_id, samples_with_this_plan DESC;
MIN(AVG(time_waited)) OVER (PARTITION BY ecid, sql_id) finds the lowest
average wait among all plan variants for the same SQL. The ratio tells you how much worse
the bad plan is: a value of 8.3 means the bad plan is 8.3× slower than the best
known plan for that SQL. This is a quantified argument for raising an SR or pinning a plan
with SQL Plan Management (SPM).
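Once Q6 has identified the good plan hash, pinning it with SPM can look like this sketch (the SQL_ID and plan hash below are the illustrative values from the RCA template, not real ones; requires the Tuning Pack licensing for SPM evolution features you may use later):

```sql
-- Load the known-good plan from the cursor cache into an ACCEPTED
-- SQL plan baseline, so the optimizer stops flipping to the bad plan.
DECLARE
  l_loaded PLS_INTEGER;
BEGIN
  l_loaded := DBMS_SPM.LOAD_PLANS_FROM_CURSOR_CACHE(
                sql_id          => 'd8txf935msbx5',   -- illustrative
                plan_hash_value => 1234567890);       -- illustrative
  DBMS_OUTPUT.PUT_LINE('Plans loaded: ' || l_loaded);
END;
/
```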
🗄️ 9. Query 7 — AWR Historical ECID Reconstruction ADVANCED
When ASH memory has flushed (>1 hour ago), pivot to DBA_HIST_ACTIVE_SESS_HISTORY
and join with DBA_HIST_SQLSTAT for aggregated SQL performance for the same period.
-- ============================================================
-- Q7: AWR Historical ECID Reconstruction + SQL Stats
-- For incidents older than 1 hour — requires Diagnostics Pack
-- ============================================================
WITH ash_hist AS (
SELECT
h.snap_id,
h.ecid,
h.session_id AS sid,
h.session_serial# AS serial#,
h.instance_number AS inst_id,
h.sample_time,
h.sql_id,
h.plan_hash_value,
h.event,
h.wait_class,
h.time_waited,
h.blocking_session,
h.module,
h.action,
h.client_id,
COUNT(*) OVER (PARTITION BY h.ecid) AS ecid_total_samples,
-- STATS_MODE has no OVER clause; take the last-seen SQL per ECID
MAX(h.sql_id) KEEP (DENSE_RANK LAST ORDER BY h.sample_time)
OVER (PARTITION BY h.ecid) AS dominant_sql_id
FROM dba_hist_active_sess_history h
WHERE h.sample_time > SYSDATE - 12/24
AND h.ecid LIKE '%&ECID_EXTRACT%'
),
sql_perf AS (
-- AWR SQL stats for the same snap range covered by the ASH hits
SELECT
ss.snap_id,
ss.sql_id,
ss.plan_hash_value,
ss.executions_delta AS execs,
ROUND(ss.elapsed_time_delta /
NULLIF(ss.executions_delta, 0) / 1000, 2)
AS avg_elapsed_ms,
ROUND(ss.cpu_time_delta /
NULLIF(ss.executions_delta, 0) / 1000, 2)
AS avg_cpu_ms,
ROUND(ss.buffer_gets_delta /
NULLIF(ss.executions_delta, 0), 0)
AS avg_buffer_gets,
ROUND(ss.disk_reads_delta /
NULLIF(ss.executions_delta, 0), 0)
AS avg_disk_reads,
ss.rows_processed_delta AS total_rows
FROM dba_hist_sqlstat ss
WHERE ss.executions_delta > 0
)
SELECT
a.ecid,
a.sample_time,
a.sid,
a.serial#,
a.inst_id,
a.sql_id,
a.plan_hash_value,
a.event,
a.wait_class,
a.module,
a.client_id,
a.ecid_total_samples,
a.dominant_sql_id,
sp.execs,
sp.avg_elapsed_ms,
sp.avg_cpu_ms,
sp.avg_buffer_gets,
sp.avg_disk_reads
FROM ash_hist a
LEFT JOIN sql_perf sp
ON sp.snap_id = a.snap_id
AND sp.sql_id = a.sql_id
AND sp.plan_hash_value = a.plan_hash_value
ORDER BY a.sample_time DESC;
DBA_HIST samples at 10-second intervals (not 1 second), so short events (<10 sec) may be invisible.
The
snap_id join ensures SQL stats are from the same AWR snapshot period as the ASH samples,
giving you elapsed time and buffer gets that were actually measured during the incident — not a 24-hour aggregate.
🔗 10. Query 8 — ECID Blocking Chain Tree (Hierarchical + Cross-Node) ADVANCED
-- ============================================================
-- Q8: Full Blocking Chain from ECID Session Outward
-- Handles RAC cross-instance chains; shows lock object name
-- ============================================================
SELECT
LPAD(' ', LEVEL * 3) || s.inst_id || ':' || s.sid
AS session_tree,
LEVEL AS chain_depth,
s.status,
s.last_call_et AS active_secs,
s.event,
s.wait_class,
s.seconds_in_wait,
s.sql_id,
s.module,
s.action,
s.username,
s.machine,
s.blocking_session,
s.blocking_instance,
s.blocking_session_status,
-- Resolve lock object (for TX/TM waits)
(SELECT o.object_name || ' (' || o.object_type || ')'
FROM gv$lock lk
JOIN dba_objects o ON o.object_id = lk.id1
WHERE lk.sid = s.sid
AND lk.inst_id = s.inst_id
AND lk.type = 'TM'
AND rownum = 1) AS locked_object,
-- OS SPID for escalation to OS kill if needed
p.spid AS os_spid
FROM gv$session s
LEFT JOIN gv$process p
ON p.addr = s.paddr
AND p.inst_id = s.inst_id
START WITH s.ecid LIKE '%&ECID_EXTRACT%'
AND s.program LIKE '%oacore%'
CONNECT BY NOCYCLE
PRIOR s.blocking_session = s.sid
AND PRIOR s.blocking_instance = s.inst_id
ORDER SIBLINGS BY s.last_call_et DESC;
Mutual blocking can form a cycle in GV$SESSION. Without NOCYCLE, the query loops infinitely.
The locked_object column resolves the TM lock to the actual table name — critical for
knowing which EBS table is being locked (e.g., PO_HEADERS_ALL,
OE_ORDER_LINES_ALL) so you can engage the functional team.
🔀 11. Query 9 — ECID to Concurrent Request Bridge ADVANCED
Concurrent Manager requests spawn their own DB sessions. If an ECID traces back to a CM worker — for example, a self-service submission through oacore that launched a request — this query bridges the gap.
-- ============================================================
-- Q9: ECID ↔ Concurrent Request Cross-Reference
-- Links ASH ECID data to FND_CONCURRENT_REQUESTS
-- ============================================================
SELECT
a.ecid,
fcr.request_id,
fcr.phase_code,
fcr.status_code,
fcr.actual_start_date,
fcr.actual_completion_date,
ROUND((fcr.actual_completion_date - fcr.actual_start_date) * 1440, 1)
AS runtime_mins,
fcp.concurrent_program_name,
fcp.user_concurrent_program_name,
fu.user_name AS submitted_by,
fcr.oracle_session_id,
a.sample_time,
a.sql_id,
a.event,
a.wait_class,
a.module
FROM gv$active_session_history a
JOIN fnd_concurrent_requests fcr
-- NB: ORACLE_SESSION_ID commonly stores the AUDSID, not the SID;
-- if this join returns nothing, map SID → AUDSID via GV$SESSION first
ON fcr.oracle_session_id = a.session_id
AND fcr.actual_start_date < a.sample_time
AND NVL(fcr.actual_completion_date, SYSDATE) >= a.sample_time
JOIN fnd_concurrent_programs fcp
ON fcp.concurrent_program_id = fcr.concurrent_program_id
AND fcp.application_id = fcr.program_application_id
LEFT JOIN fnd_user fu
ON fu.user_id = fcr.requested_by
WHERE a.sample_time > SYSDATE - 12/24
AND a.ecid LIKE '%&ECID_EXTRACT%'
ORDER BY a.sample_time DESC;
🤖 12. Query 10 — PL/SQL ECID Watchdog with Auto-Kill + Alert Log ADVANCED
Rather than reacting to incidents, this procedure runs proactively (via DBMS_SCHEDULER every 5 minutes) and auto-kills oacore sessions that have been active on the same ECID beyond a configurable threshold — with a full audit trail written to a custom log table.
-- ============================================================
-- Setup: Create the watchdog audit table (run once)
-- ============================================================
CREATE TABLE apps.ecid_watchdog_log (
log_id NUMBER GENERATED ALWAYS AS IDENTITY,
log_time DATE DEFAULT SYSDATE,
ecid VARCHAR2(64),
sid NUMBER,
serial# NUMBER,
inst_id NUMBER,
event VARCHAR2(64),
module VARCHAR2(64),
last_call_et NUMBER,
sql_id VARCHAR2(13),
action_taken VARCHAR2(20),
kill_cmd VARCHAR2(120),
notes VARCHAR2(500),
CONSTRAINT ecid_wdg_pk PRIMARY KEY (log_id)
);
-- ============================================================
-- The Watchdog Procedure
-- ============================================================
CREATE OR REPLACE PROCEDURE apps.ecid_watchdog_proc (
p_threshold_secs IN NUMBER DEFAULT 600, -- 10-min default
p_auto_kill IN VARCHAR2 DEFAULT 'N', -- 'Y' to kill, 'N' for log-only
p_program_filter IN VARCHAR2 DEFAULT '%oacore%'
)
AS
l_kill_cmd VARCHAR2(200);
l_action VARCHAR2(20);
BEGIN
FOR r IN (
SELECT
s.ecid,
s.sid,
s.serial#,
s.inst_id,
s.event,
s.module,
s.sql_id,
s.last_call_et,
s.status,
s.blocking_session,
s.blocking_instance
FROM gv$session s
WHERE s.ecid IS NOT NULL
AND s.program LIKE p_program_filter
AND s.status = 'ACTIVE'
AND s.last_call_et > p_threshold_secs
AND s.username NOT IN ('SYS', 'SYSTEM')
AND s.type = 'USER'
ORDER BY s.last_call_et DESC
)
LOOP
l_kill_cmd := 'ALTER SYSTEM KILL SESSION '''
|| r.sid || ',' || r.serial# || ',@' || r.inst_id
|| ''' IMMEDIATE';
l_action := 'LOGGED';
IF p_auto_kill = 'Y' THEN
BEGIN
EXECUTE IMMEDIATE l_kill_cmd;
l_action := 'KILLED';
EXCEPTION
WHEN OTHERS THEN
l_action := 'KILL_FAILED:' || SUBSTR(SQLERRM, 1, 30);
END;
END IF;
INSERT INTO apps.ecid_watchdog_log
(ecid, sid, serial#, inst_id, event, module, last_call_et,
sql_id, action_taken, kill_cmd, notes)
VALUES
(r.ecid, r.sid, r.serial#, r.inst_id, r.event, r.module,
r.last_call_et, r.sql_id, l_action, l_kill_cmd,
'blocker=' || r.blocking_session || '@' || r.blocking_instance
|| ' status=' || r.status);
DBMS_OUTPUT.PUT_LINE(l_action || ': ECID=' || SUBSTR(r.ecid,1,30)
|| ' SID=' || r.sid || ' ET=' || r.last_call_et || 's');
END LOOP;
COMMIT;
END ecid_watchdog_proc;
/
-- ============================================================
-- Schedule as DBMS_SCHEDULER job (every 5 minutes)
-- ============================================================
BEGIN
DBMS_SCHEDULER.CREATE_JOB (
job_name => 'APPS.ECID_WATCHDOG_JOB',
job_type => 'STORED_PROCEDURE',
job_action => 'APPS.ECID_WATCHDOG_PROC',
number_of_arguments => 3,
start_date => SYSTIMESTAMP,
repeat_interval => 'FREQ=MINUTELY;INTERVAL=5',
enabled => FALSE,
auto_drop => FALSE,
comments => 'ECID Watchdog - log-only by default'
);
-- Set args: 300s threshold, log-only mode, oacore filter
DBMS_SCHEDULER.SET_JOB_ARGUMENT_VALUE('APPS.ECID_WATCHDOG_JOB', 1, '300');
DBMS_SCHEDULER.SET_JOB_ARGUMENT_VALUE('APPS.ECID_WATCHDOG_JOB', 2, 'N');
DBMS_SCHEDULER.SET_JOB_ARGUMENT_VALUE('APPS.ECID_WATCHDOG_JOB', 3, '%oacore%');
DBMS_SCHEDULER.ENABLE('APPS.ECID_WATCHDOG_JOB');
END;
/
-- Query the watchdog audit log
SELECT * FROM apps.ecid_watchdog_log
ORDER BY log_time DESC
FETCH FIRST 50 ROWS ONLY;
- Always start with p_auto_kill = 'N' (log-only) for at least 1 week. Review the logs to ensure it would not kill legitimate long-running batch jobs.
- Add exclusion ECIDs: prefix-match known batch ECIDs and skip them in the cursor WHERE clause.
- Gate on blocking_session IS NULL if you only want to kill self-hung sessions, not sessions blocked by others (which need a different fix).
- Set a daily count alert: if the watchdog kills >N sessions in 24 hours, it signals a recurring architectural problem — not a random hang.
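A minimal sketch of the exclusion and blocker gates (the exclusion table and its columns are hypothetical, not part of EBS):

```sql
-- Hypothetical exclusion table for known batch ECID prefixes.
CREATE TABLE apps.ecid_watchdog_exclusions (
  ecid_prefix VARCHAR2(64) PRIMARY KEY,
  reason      VARCHAR2(200)
);
-- Extra predicates to add to the cursor in ecid_watchdog_proc:
--   AND s.blocking_session IS NULL            -- self-hung sessions only
--   AND NOT EXISTS (SELECT 1
--                   FROM apps.ecid_watchdog_exclusions x
--                   WHERE s.ecid LIKE x.ecid_prefix || '%')
```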
🐚 13. Bash: Production-Grade ECID Extraction from oacore Logs
#!/bin/bash
# ============================================================
# ecid_extract.sh — Extract ECID from oacore WLS logs
# Usage: ./ecid_extract.sh [minutes_back] [ecid_fragment]
# ============================================================
DOMAIN_HOME="/u01/oracle/domains/EBSdomain"
OACORE_LOG="${DOMAIN_HOME}/servers/oacore/logs"
MINS_BACK=${1:-60}
ECID_FRAG=${2:-}
echo "=== ECID Extractor: Last ${MINS_BACK} minutes ==="
# Find log files modified in the last N minutes
find "${OACORE_LOG}" -name "*.log" -mmin "-${MINS_BACK}" 2>/dev/null | \
xargs grep -h "ecid=" 2>/dev/null | \
grep -oP 'ecid=\K[^;,\] ]+' | \
sort -u | \
awk 'length($0) > 10' > /tmp/ecid_candidates.txt
echo "Found $(wc -l < /tmp/ecid_candidates.txt) unique ECIDs"
# If a fragment was passed, filter to just matching ECIDs
if [ -n "${ECID_FRAG}" ]; then
grep "${ECID_FRAG}" /tmp/ecid_candidates.txt
else
head -20 /tmp/ecid_candidates.txt
fi
# Extract stuck thread ECIDs from WLS thread dump
echo
echo "=== ECIDs from stuck thread dumps ==="
grep -B5 -A10 "STUCK" "${OACORE_LOG}/oacore_server1.out" 2>/dev/null | \
grep -oP '(?<=ECID-Context: )[^\]"]+' | \
sort -u
# Cross-reference with current time for correlation
echo
echo "=== Generating SQL WHERE clause ==="
echo "-- Paste this into Phase 3.1 / Q1-Q6 scripts:"
head -5 /tmp/ecid_candidates.txt | while read ecid; do
echo " AND s.ecid LIKE '%$(echo ${ecid} | cut -c1-20)%'"
done
🗺️ 14. Decision Framework — Which Query When
| Situation | Use This | Why |
|---|---|---|
| Live incident, session still active | Phase 3.1 (GV$SESSION) | Fastest path to kill command. Zero ASH needed. |
| Session died before you could look | Q2 (ASH Timeline) | Reconstruct exactly what happened, in order, from ASH memory. |
| Incident >1 hour ago, ASH flushed | Q7 (DBA_HIST_ASH) | AWR holds 8 days. Join with DBA_HIST_SQLSTAT for SQL perf context. |
| "Which requests are heaviest right now?" | Q3 (Top ECID Profile) | STATS_MODE + wait breakdown surfaces the real offenders in one row per ECID. |
| "Which EBS user is it?" | Q4 (FND User Mapping) | CLIENT_ID → FND_USER join + FND_LOG cross-reference. |
| RAC: session active across multiple nodes | Q5 (Cross-Node RAC) | Reveals gc waits, cross-node blocking — invisible to V$SESSION on one node. |
| Recurring slow query (same SQL, varies in speed) | Q6 (Plan Stability) | Quantifies how much worse the bad plan is. Drives SPM/hint/profile solution. |
| Need to identify the locked table | Q8 (Blocking Chain Tree) | TM lock join to DBA_OBJECTS gives exact table name + chain depth. |
| ECID came from a Concurrent Request | Q9 (CR Bridge) | Links ASH ECID directly to REQUEST_ID + program + submitter. |
| Prevent future repeats, automate response | Q10 (Watchdog PL/SQL) | Scheduled every 5 minutes, log-only or auto-kill, full audit trail. |
| "Show me the timeline in a heat chart" | Q1 (5-Min Bucket Heat) | ASCII bar chart per 5-minute window — no external tool needed. |
📝 15. Summary & RCA Template
Every ECID incident that you resolve should generate an RCA entry. Here is a minimal post-incident template built from the outputs of the queries above:
/* ============================================================
RCA TEMPLATE — ECID Session Incident
Copy into your incident ticket / Jira / SNOW
============================================================ */
-- 1. INCIDENT WINDOW
-- Start: YYYY-MM-DD HH24:MI (extracted from Q1 heat map first_seen)
-- End: YYYY-MM-DD HH24:MI
-- Duration: NNN minutes
-- 2. ECID
-- Full ECID: 1.xxxxx...;kXjE
-- Source: oacore_server1.out / access.log / FND_LOG_MESSAGES
-- 3. IMPACTED SESSION (from Phase 3.1 / Q2)
-- SID/Serial#/Inst: 247,8931,@2
-- FND User: JSMITH (from Q4)
-- EBS Module: OAF Page: PO/PORequisitionCreate (from MODULE column)
-- OS PID: 12847
-- 4. ROOT CAUSE
-- Dominant Event: enq: TX - row lock contention (from Q3 STATS_MODE)
-- Blocker: SID 189@1 (session on node 1, from Q8 chain tree)
-- Blocked by Table: PO_REQUISITION_LINES_ALL (TM lock, from Q8 locked_object)
-- SQL ID: d8txf935msbx5 (dominant SQL from Q3)
-- Plan Hash: 1234567890 (stable / unstable — from Q6)
-- 5. ACTION TAKEN
-- ALTER SYSTEM KILL SESSION '247,8931,@2' IMMEDIATE;
-- Executed by: [DBA Name], [Time]
-- User impact: Session terminated; user re-submitted — successful
-- 6. PREVENTION
-- Short-term: ECID Watchdog job enabled (log-only, 5-min interval)
-- Medium-term: Analyse SQL d8txf935msbx5 for index opportunities
-- Long-term: Review table locking strategy for PO_REQUISITION_LINES_ALL
-- with functional team; consider row-level lock optimization