Saturday, May 9, 2026

Advanced ECID-ASH Forensics in Oracle EBS 12.2

🧠 Advanced ECID–ASH Forensics in Oracle EBS 12.2 — Beyond the Basic Kill
Audience levels: 👶 Fresher 🎓 Experienced DBA 🔥 Lead / COE

The standard ECID-hunt scripts — find the session, build the kill statement, fire it — are table stakes. Any competent DBA can do that in five minutes. But what happens after you kill the session and the user reports the same hang forty minutes later? What happens when the ECID has already left GV$SESSION by the time you respond, and you need to reconstruct exactly what that WebLogic thread was doing at 02:17 AM on a Tuesday from AWR history? What do you do when 14 different ECIDs are all piling onto the same SQL, and killing them one by one is just bailing water out of a sinking boat?

This post takes the standard ECID analysis toolkit and builds it into a full forensic investigation framework — combining ASH, AWR, DBA_HIST, analytic window functions, execution plan deltas, cross-instance RAC correlation, and PL/SQL automation — so you can answer not just "which session?" but "why, since when, from which EBS module, and what do I do to prevent it happening again?"

🏗️ 1. ECID Propagation Architecture — The Full Stack

👶 Fresher

Before writing a single SQL query, understand where the ECID lives in the EBS stack — because the right extraction point depends on where the hang is occurring.

  ┌─────────────────────────────────────────────────────────┐
  │  BROWSER (User's request)                               │
  │  HTTP POST → F5 LTM (SSL VIP :443)                      │
  └────────────────────┬────────────────────────────────────┘
                       │ ECID generated here by WLS
  ┌────────────────────▼────────────────────────────────────┐
  │  WebLogic 12c  — oacore Managed Server                  │
  │  ▸ ECID stamped in ODL context: ecid=1.xxx...;kXjE      │
  │  ▸ Visible in: oacore-diagnostic.log / access.log        │
  │  ▸ Propagated via JDBC thin driver into DB session       │
  └────────────────────┬────────────────────────────────────┘

  ┌────────────────────▼────────────────────────────────────┐
  │  Oracle DB 19c RAC (ASM / Solaris SPARC)                │
  │  ▸ GV$SESSION.ECID         ← live sessions              │
  │  ▸ GV$ACTIVE_SESSION_HISTORY.ECID ← last ~60 mins       │
  │  ▸ DBA_HIST_ACTIVE_SESS_HISTORY.ECID ← AWR history      │
  │  ▸ FND_LOG_MESSAGES.EXECUTION_CONTEXT_ID ← EBS FND      │
  └─────────────────────────────────────────────────────────┘
🎓 Experienced

The ECID format in 19c is typically: 1.<random_hex>;<short_suffix>. The short suffix (;kXjE, ;0f3a) is the part that identifies the WLS request uniquely. When grepping logs, use the middle random hex string as your LIKE pattern — it is the most stable and unique fragment. Avoid matching on the prefix 1. as it is constant across all ECIDs on the same instance.
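That grep advice can be sketched as a small helper. A hedged Python illustration (the function name is invented, and it assumes the `1.<random_hex>;<short_suffix>` shape described above):

```python
import re

def ecid_like_fragment(ecid: str) -> str:
    """Return the most selective fragment of an ECID for LIKE/grep matching.

    Assumes the 1.<random_hex>;<short_suffix> shape described above: the
    middle hex run is unique per request, while the '1.' prefix is shared
    by every ECID on the instance and the short suffix can repeat.
    """
    m = re.match(r'^1\.(?P<hex>[0-9a-fA-F]+);(?P<suffix>\S+)$', ecid)
    if not m:
        # Unknown shape: fall back to the whole string minus the prefix chars
        return ecid.lstrip('1.')
    return m.group('hex')

# The fragment plugs straight into the &ECID_EXTRACT substitution variable:
#   AND a.ecid LIKE '%<fragment>%'
```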

📌 ASH Retention Rules (Critical to Know):
GV$ACTIVE_SESSION_HISTORY — in-memory, ~1 hour (sampled every 1 second, 1-in-10 written to AWR)
DBA_HIST_ACTIVE_SESS_HISTORY — AWR flushed copy, default 8-day retention, 10-second sample interval
FND_LOG_MESSAGES — EBS-controlled, retention driven by FND: Debug Log Retention profile option
Key implication: If the incident is >1 hour old and AWR retention is set to default, use DBA_HIST. Always confirm AWR retention: SELECT retention FROM dba_hist_wr_control;
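The retention rules reduce to a simple routing decision. A minimal sketch — the function name and thresholds are illustrative, using the defaults quoted above (always confirm your actual AWR retention first):

```python
def pick_ash_source(incident_age_hours: float,
                    awr_retention_days: float = 8.0) -> str:
    """Decide which ECID history source to query, per the retention rules above.

    Thresholds are the defaults quoted in the post: ~1 hour of in-memory
    ASH and 8 days of AWR retention (confirm with dba_hist_wr_control).
    """
    if incident_age_hours <= 1:
        return 'GV$ACTIVE_SESSION_HISTORY'        # 1-second samples, in memory
    if incident_age_hours <= awr_retention_days * 24:
        return 'DBA_HIST_ACTIVE_SESS_HISTORY'     # 10-second samples, on disk
    return 'FND_LOG_MESSAGES'                     # only EBS-side trace remains
```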

⚙️ 2. Why ASH is the Right Tool (and its Limits)

✅ ASH Strengths

1-second granularity for live sessions. ECID persisted in every sample. Includes SQL_ID, wait event, blocking session, and plan hash per sample. Negligible overhead — ASH samples from memory and requires no tracing to be enabled on the application.

⚠️ ASH Limitations

Only samples active sessions (STATUS=ACTIVE). Idle/INACTIVE sessions (e.g., holding a lock but not running SQL) are NOT captured. Short-lived sessions (<1 second) may be missed entirely. GV$ is in-memory only — the buffer is lost on instance restart.

⚠️ The invisible lock holder problem: A session holding a TX lock with STATUS=INACTIVE does NOT appear in ASH. The waiter appears (high enq: TX - row lock contention), but the blocker is invisible. For these cases, always cross-check with GV$SESSION live + GV$LOCK for the blocker SID, then trace that blocker separately.

📊 3. Query 1 — Enriched ASH ECID Scan with Temporal Heat ADVANCED

The basic version groups by raw sample_time, which produces one row per sample — far too many to scan. This version uses 5-minute time buckets and a heat ratio to immediately show you when the ECID was burning DB time — not just that it was.

-- ============================================================
-- Q1: ECID Temporal Heat Map — 5-Min Buckets, Last 12 Hours
-- Shows WHEN each ECID caused DB pressure, not just IF
-- ============================================================
SELECT
    TO_CHAR(TRUNC(a.sample_time, 'MI') -
        MOD(TO_NUMBER(TO_CHAR(a.sample_time,'MI')),5)/(24*60),
        'YYYY-MM-DD HH24:MI')              AS time_bucket_5min,
    a.ecid,
    COUNT(*)                                AS total_samples,
    SUM(CASE WHEN a.session_state = 'ON CPU'
               OR a.wait_class != 'Idle'
             THEN 1 ELSE 0 END)           AS active_samples,
    ROUND(SUM(CASE WHEN a.session_state = 'ON CPU'
               OR a.wait_class != 'Idle'
             THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS pct_active,
    SUM(CASE WHEN a.wait_class = 'User I/O'
             THEN 1 ELSE 0 END)           AS io_samples,
    SUM(CASE WHEN a.wait_class = 'Application'
             THEN 1 ELSE 0 END)           AS lock_samples,
    SUM(CASE WHEN a.wait_class = 'Concurrency'
             THEN 1 ELSE 0 END)           AS concurrency_samples,
    MAX(a.module)                           AS module,
    -- Most frequent SQL_ID / event in this bucket (statistical mode)
    STATS_MODE(a.sql_id)                    AS dominant_sql_id,
    STATS_MODE(a.event)                     AS dominant_event,
    -- Heat bar: filled blocks scaled to this ECID's busiest bucket
    -- (GREATEST keeps at least one block so the bar never collapses to NULL)
    RPAD(RPAD('█',
          GREATEST(1, ROUND(SUM(CASE WHEN a.session_state = 'ON CPU'
                    OR a.wait_class != 'Idle' THEN 1 ELSE 0 END) * 20.0
              / MAX(COUNT(*)) OVER (PARTITION BY a.ecid))), '█'),
         20, '░')                           AS heat_bar
FROM   gv$active_session_history a
WHERE  a.sample_time > SYSDATE - 12/24
  AND  a.ecid        IS NOT NULL
  AND  a.program     LIKE '%oacore%'
GROUP BY
    TO_CHAR(TRUNC(a.sample_time, 'MI') -
        MOD(TO_NUMBER(TO_CHAR(a.sample_time,'MI')),5)/(24*60),
        'YYYY-MM-DD HH24:MI'),
    a.ecid
HAVING COUNT(*) > 3
ORDER BY time_bucket_5min DESC, active_samples DESC;
🔥 Advanced Technique — heat_bar: The nested RPAD construct ('█' blocks padded out with '░') renders an ASCII bar chart inline in SQL*Plus / SQL Developer output. Each row becomes a visual timeline entry — you can immediately see which ECID was burning active samples in which 5-minute window without leaving the DB tool. The dominant_sql_id and dominant_event columns are the statistical mode — the single most frequent value per ECID per bucket — not MAX by sort order. One ASH subtlety: ON CPU samples carry a NULL event and NULL wait_class, which is why active-sample logic must test SESSION_STATE as well as WAIT_CLASS.
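The bucket-flooring arithmetic and the heat-bar scaling are easier to verify outside SQL. A small Python sketch of the same two calculations (function names are illustrative):

```python
from datetime import datetime

def bucket_5min(ts: datetime) -> datetime:
    """Floor a sample timestamp to its 5-minute bucket (the TRUNC/MOD trick in Q1)."""
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)

def heat_bar(active_samples: int, max_samples: int, width: int = 20) -> str:
    """Render the heat_bar column: filled blocks scaled to the busiest bucket."""
    filled = round(active_samples * width / max_samples) if max_samples else 0
    filled = max(0, min(width, filled))
    return '█' * filled + '░' * (width - filled)
```

For example, a sample taken at 02:17:42 lands in the 02:15 bucket, and a bucket with half the peak's active samples renders a half-filled bar.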

🔬 4. Query 2 — ECID Event Sequence Reconstruction ADVANCED

This rebuilds the exact sequence of events for a specific ECID as a timeline — useful for proving to the business exactly what the request did step-by-step and for how long.

-- ============================================================
-- Q2: ECID Event Timeline — Ordered Sequence with Gaps
-- Reconstructs what the WLS thread did, step by step
-- ============================================================
SELECT
    a.sample_time,
    a.ecid,
    a.session_id                                        AS sid,
    a.session_serial#                                   AS serial#,
    a.inst_id,
    a.sql_id,
    a.sql_child_number,
    a.plan_hash_value,
    a.event,
    a.wait_class,
    ROUND(a.time_waited / 1000, 2)                    AS time_waited_ms,
    a.blocking_session,
    a.blocking_inst_id,
    -- Gap analysis: seconds since previous sample for this ECID
    -- (cast to DATE so the difference is a NUMBER of days, not an INTERVAL)
    ROUND((CAST(a.sample_time AS DATE)
          - CAST(LAG(a.sample_time)
                   OVER (PARTITION BY a.ecid
                         ORDER BY a.sample_time) AS DATE))
         * 86400, 1)                                  AS gap_from_prev_secs,
    -- Cumulative DB time spent (non-idle) for this ECID
    SUM(CASE WHEN a.session_state = 'ON CPU'
               OR a.wait_class != 'Idle' THEN 1 ELSE 0 END)
        OVER (PARTITION BY a.ecid ORDER BY a.sample_time
              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
                                                        AS cumul_active_samples,
    -- Event change flag: did the wait event change from the prior sample?
    -- (ON CPU samples have NULL event, so map them to 'ON CPU' first)
    CASE WHEN NVL(a.event, 'ON CPU') =
              LAG(NVL(a.event, 'ON CPU'))
                  OVER (PARTITION BY a.ecid
                        ORDER BY a.sample_time)
         THEN '→ same'
         ELSE '★ CHANGED'
    END                                                 AS event_change,
    a.module,
    a.action,
    a.program,
    a.machine
FROM   gv$active_session_history a
WHERE  a.sample_time > SYSDATE - 12/24
  AND  a.ecid LIKE '%&ECID_EXTRACT%'
ORDER BY a.sample_time ASC;   -- ASC = chronological for readability
💡 Reading the output: Look for rows where event_change = '★ CHANGED' — these are state transitions. A session that goes CPU → enq: TX row lock → SQL*Net from client → CPU tells a very different story than one stuck on db file sequential read for 200 consecutive samples. The gap_from_prev_secs column reveals moments where the session was INACTIVE or completed — a gap >10 seconds means the ECID was idle (waiting for next WLS call or finished a round-trip).
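The LAG()-based gap and event-change logic can be prototyped on a toy sample list. A hedged Python sketch (the sample data is invented; times are seconds offsets):

```python
def annotate_timeline(samples):
    """Tag each (time_secs, event) sample with its gap from the previous
    sample and whether the wait event changed — the LAG() logic in Q2."""
    out, prev = [], None
    for t, event in samples:
        gap = round(t - prev[0], 1) if prev else None
        change = '→ same' if prev and event == prev[1] else '★ CHANGED'
        out.append((t, event, gap, change))
        prev = (t, event)
    return out

timeline = annotate_timeline([
    (0.0,  'ON CPU'),
    (1.0,  'enq: TX - row lock contention'),
    (2.0,  'enq: TX - row lock contention'),
    (14.0, 'ON CPU'),   # 12-second gap: the session was idle in between
])
```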

📈 5. Query 3 — Top ECIDs by DB Time with Full Wait Profile ADVANCED

-- ============================================================
-- Q3: Top ECIDs — Multi-Dimensional Wait Profile
-- Shows the complete wait breakdown per ECID in one row
-- ============================================================
SELECT
    ecid,
    module,
    MIN(sample_time)                                AS first_seen,
    MAX(sample_time)                                AS last_seen,
    ROUND((CAST(MAX(sample_time) AS DATE)
          - CAST(MIN(sample_time) AS DATE)) * 1440, 1)
                                                     AS elapsed_mins,
    COUNT(*)                                        AS total_ash_samples,
    -- DB Time breakdown (ON CPU + non-idle waits)
    SUM(CASE WHEN session_state = 'ON CPU'
               OR wait_class != 'Idle' THEN 1 ELSE 0 END)
                                                     AS db_time_samples,
    SUM(CASE WHEN session_state = 'ON CPU'   THEN 1 ELSE 0 END)
                                                     AS cpu_samples,
    SUM(CASE WHEN wait_class = 'User I/O'    THEN 1 ELSE 0 END)
                                                     AS io_samples,
    SUM(CASE WHEN wait_class = 'Application' THEN 1 ELSE 0 END)
                                                     AS lock_samples,
    SUM(CASE WHEN wait_class = 'Concurrency' THEN 1 ELSE 0 END)
                                                     AS concurrency_samples,
    SUM(CASE WHEN wait_class = 'Network'     THEN 1 ELSE 0 END)
                                                     AS network_samples,
    -- Most frequent SQL_ID for this ECID (no correlated subquery needed)
    STATS_MODE(sql_id)                              AS dominant_sql_id,
    STATS_MODE(event)                               AS dominant_event,
    STATS_MODE(plan_hash_value)                     AS dominant_plan_hash,
    -- Plan instability flag (multiple plan hashes for same ECID)
    CASE WHEN COUNT(DISTINCT plan_hash_value) > 1
         THEN '⚠ PLAN UNSTABLE ('
              || COUNT(DISTINCT plan_hash_value) || ' plans)'
         ELSE '✔ stable'
    END                                             AS plan_stability,
    COUNT(DISTINCT session_id || '@' || inst_id) AS distinct_sessions
FROM   gv$active_session_history
WHERE  sample_time > SYSDATE - 12/24
  AND  ecid        IS NOT NULL
  AND  program     LIKE '%oacore%'
GROUP BY ecid, module
HAVING  COUNT(*) > 5
ORDER BY db_time_samples DESC
FETCH FIRST 30 ROWS ONLY;
🔥 Advanced Technique — STATS_MODE(): STATS_MODE(sql_id) returns the statistical mode — the most frequently occurring value. This is far superior to MAX(sql_id) (which returns alphabetically last, meaningless) or a correlated subquery. It natively answers: "which SQL did this ECID spend the most time on?" The plan_stability column immediately flags ECIDs with plan instability — a key signal that the real problem is a bad execution plan flip, not a lock or I/O issue.
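The mode-versus-MAX distinction is easy to demonstrate outside the database. A short Python illustration (the sql_id values are invented):

```python
from collections import Counter

def stats_mode(values):
    """Most frequently occurring non-NULL value — what STATS_MODE() returns.

    (On ties Oracle picks one value nondeterministically; Counter.most_common
    keeps first-encountered order, close enough for illustration.)
    """
    filtered = [v for v in values if v is not None]
    return Counter(filtered).most_common(1)[0][0] if filtered else None

# Hypothetical ASH sql_id samples for one ECID:
samples = ['7h2k9xqw1a2b3', '7h2k9xqw1a2b3', 'zz9aaaaaaaaaa', '7h2k9xqw1a2b3']
```

`max(samples)` returns 'zz9aaaaaaaaaa' — merely the alphabetically last value — while the mode correctly identifies where the ECID actually spent its time.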

👤 6. Query 4 — ECID to FND User Mapping ADVANCED

GV$SESSION.USERNAME is always APPS in EBS. You need to join through FND_LOG_MESSAGES or V$SESSION.CLIENT_IDENTIFIER to get the actual FND user. This query does it directly from ASH without requiring a live session.

-- ============================================================
-- Q4: ECID → FND User + Responsibility Resolution
-- Works even after the session has ended (uses FND log history)
-- ============================================================
SELECT
    a.ecid,
    a.client_id                               AS fnd_user_or_client_id,
    fu.user_name                              AS fnd_username,
    fu.description                            AS fnd_user_desc,
    flm.module                                AS fnd_module,
    MIN(a.sample_time)                        AS ecid_first_seen,
    MAX(a.sample_time)                        AS ecid_last_seen,
    COUNT(*)                                  AS ecid_total_samples,
    -- STATS_MODE is aggregate-only (no OVER clause), so group per ECID
    STATS_MODE(a.event)                       AS dominant_event,
    STATS_MODE(a.sql_id)                      AS dominant_sql_id
FROM       gv$active_session_history  a
LEFT JOIN  fnd_user                   fu
        ON fu.user_name = UPPER(a.client_id)
LEFT JOIN  (
    SELECT DISTINCT execution_context_id, module
    FROM  fnd_log_messages
    WHERE log_sequence > (SELECT MAX(log_sequence) - 1000000
                           FROM  fnd_log_messages)
) flm
        ON flm.execution_context_id = a.ecid
WHERE  a.sample_time > SYSDATE - 12/24
  AND  a.ecid        IS NOT NULL
  AND  a.program     LIKE '%oacore%'
GROUP BY  a.ecid, a.client_id, fu.user_name, fu.description, flm.module
ORDER BY  ecid_total_samples DESC;
📌 CLIENT_ID in EBS: WebLogic propagates the FND username into GV$SESSION.CLIENT_IDENTIFIER (and therefore ASH.CLIENT_ID) via DBMS_SESSION.SET_IDENTIFIER. The join to FND_USER converts this to the full user record. The FND_LOG_MESSAGES subquery adds the OAF module context — the actual page or responsibility the user was navigating. The log_sequence predicate avoids a full table scan on the often-huge FND_LOG_MESSAGES table.

🌐 7. Query 5 — Cross-Instance RAC ECID Correlation ADVANCED

In a 2-node (or 3-node) RAC, a single WLS request can spawn activity on multiple instances — for example, the SQL runs on node 1 but the blocking session is on node 2, causing global cache waits. This query maps the full multi-node footprint of an ECID.

-- ============================================================
-- Q5: Cross-Instance RAC Footprint for a Specific ECID
-- Identifies global cache pressure, cross-node blocking
-- ============================================================
SELECT
    a.inst_id,
    a.session_id                                        AS sid,
    a.session_serial#                                   AS serial#,
    MIN(a.sample_time)                                  AS first_seen,
    MAX(a.sample_time)                                  AS last_seen,
    COUNT(*)                                            AS samples_on_node,
    STATS_MODE(a.event)                                 AS dominant_event,
    STATS_MODE(a.sql_id)                                AS dominant_sql_id,
    -- GC (global cache) wait detection
    SUM(CASE WHEN a.event LIKE 'gc%' THEN 1 ELSE 0 END)
                                                        AS gc_wait_samples,
    SUM(CASE WHEN a.event LIKE 'gc cr%' THEN 1 ELSE 0 END)
                                                        AS gc_cr_samples,
    SUM(CASE WHEN a.event LIKE 'gc buffer busy%' THEN 1 ELSE 0 END)
                                                        AS gc_buffer_busy,
    -- Cross-node blocking (blocker on different inst_id)
    SUM(CASE WHEN a.blocking_inst_id != a.inst_id
              AND  a.blocking_inst_id IS NOT NULL
         THEN 1 ELSE 0 END)                           AS cross_node_block_samples,
    a.machine                                           AS app_server
FROM   gv$active_session_history a
WHERE  a.sample_time > SYSDATE - 12/24
  AND  a.ecid LIKE '%&ECID_EXTRACT%'
GROUP BY a.inst_id, a.session_id, a.session_serial#, a.machine
ORDER BY a.inst_id, samples_on_node DESC;
⚠️ gc buffer busy = RAC hot block: If gc_buffer_busy is non-zero, the ECID is hitting a hot block shared between RAC nodes — usually caused by an unindexed column with high DML frequency, or sequences without caching. This is a tuning issue, not a session kill issue. Killing the session will not fix it.

📐 8. Query 6 — ECID Plan Stability Check ADVANCED

-- ============================================================
-- Q6: Plan Instability Detection per ECID
-- Identifies ECIDs where the execution plan changed mid-flight
-- ============================================================
SELECT
    a.ecid,
    a.sql_id,
    a.plan_hash_value,
    COUNT(*)                                AS samples_with_this_plan,
    MIN(a.sample_time)                       AS plan_first_seen,
    MAX(a.sample_time)                       AS plan_last_seen,
    AVG(a.time_waited) / 1000              AS avg_wait_ms_this_plan,
    -- Is this plan the current one still in shared pool?
    CASE WHEN EXISTS (
        SELECT 1 FROM gv$sql sq
        WHERE  sq.sql_id        = a.sql_id
          AND  sq.plan_hash_value = a.plan_hash_value
    ) THEN '✔ in pool'
      ELSE '✘ aged out'
    END                                     AS plan_in_shared_pool,
    -- Relative plan performance vs best plan for same SQL
    ROUND(AVG(a.time_waited) /
          NULLIF(MIN(AVG(a.time_waited)) OVER
              (PARTITION BY a.ecid, a.sql_id), 0), 2)
                                            AS relative_cost_vs_best_plan
FROM   gv$active_session_history a
WHERE  a.sample_time   > SYSDATE - 12/24
  AND  a.ecid          LIKE '%&ECID_EXTRACT%'
  AND  a.plan_hash_value IS NOT NULL
GROUP BY a.ecid, a.sql_id, a.plan_hash_value
ORDER BY a.sql_id, samples_with_this_plan DESC;
🔥 Advanced Technique — relative_cost_vs_best_plan: The MIN(AVG(time_waited)) OVER (PARTITION BY ecid, sql_id) finds the lowest average wait among all plan variants for the same SQL. The ratio tells you how much worse the bad plan is: a value of 8.3 means the bad plan is 8.3× slower than the best known plan for that SQL. This is a quantified argument for raising an SR or pinning a plan with SQL Plan Management (SPM).
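The ratio itself is trivial arithmetic. A sketch with hypothetical per-plan average waits (the plan hash values are invented):

```python
def relative_cost_vs_best(avg_wait_by_plan):
    """Ratio of each plan's average wait to the best plan's — the
    MIN(AVG(...)) OVER (...) ratio in Q6, computed per sql_id."""
    best = min(v for v in avg_wait_by_plan.values() if v > 0)
    return {plan: round(avg / best, 2) for plan, avg in avg_wait_by_plan.items()}

# Hypothetical avg_wait_ms per plan_hash_value for one SQL_ID:
ratios = relative_cost_vs_best({1823771104: 12.0, 990817253: 99.6})
```

Here the second plan comes out 8.3× slower than the best known plan — exactly the kind of quantified figure to put in an SR or SPM justification.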

🗄️ 9. Query 7 — AWR Historical ECID Reconstruction ADVANCED

When ASH memory has flushed (>1 hour ago), pivot to DBA_HIST_ACTIVE_SESS_HISTORY and join with DBA_HIST_SQLSTAT for aggregated SQL performance for the same period.

-- ============================================================
-- Q7: AWR Historical ECID Reconstruction + SQL Stats
-- For incidents older than 1 hour — requires Diagnostics Pack
-- ============================================================
WITH ash_raw AS (
    SELECT
        h.snap_id,
        h.ecid,
        h.session_id                           AS sid,
        h.session_serial#                      AS serial#,
        h.instance_number                      AS inst_id,
        h.sample_time,
        h.sql_id,
        h.plan_hash_value,
        h.event,
        h.wait_class,
        h.time_waited,
        h.blocking_session,
        h.module,
        h.action,
        h.client_id,
        -- Per-ECID/SQL sample count feeds the dominant-SQL pick below
        COUNT(*) OVER (PARTITION BY h.ecid, h.sql_id) AS sql_sample_cnt
    FROM   dba_hist_active_sess_history h
    WHERE  h.sample_time > SYSDATE - 12/24
      AND  h.ecid         LIKE '%&ECID_EXTRACT%'
),
ash_hist AS (
    -- STATS_MODE has no analytic (OVER) form, so derive the dominant
    -- SQL_ID with FIRST/LAST KEEP over the precomputed counts
    SELECT r.*,
           COUNT(*) OVER (PARTITION BY r.ecid)       AS ecid_total_samples,
           MAX(r.sql_id) KEEP (DENSE_RANK LAST ORDER BY r.sql_sample_cnt)
               OVER (PARTITION BY r.ecid)            AS dominant_sql_id
    FROM   ash_raw r
),
sql_perf AS (
    -- AWR SQL stats for the same snap range covered by the ASH hits
    SELECT
        ss.snap_id,
        ss.sql_id,
        ss.plan_hash_value,
        ss.executions_delta                    AS execs,
        ROUND(ss.elapsed_time_delta /
              NULLIF(ss.executions_delta, 0) / 1000, 2)
                                               AS avg_elapsed_ms,
        ROUND(ss.cpu_time_delta /
              NULLIF(ss.executions_delta, 0) / 1000, 2)
                                               AS avg_cpu_ms,
        ROUND(ss.buffer_gets_delta /
              NULLIF(ss.executions_delta, 0), 0)
                                               AS avg_buffer_gets,
        ROUND(ss.disk_reads_delta /
              NULLIF(ss.executions_delta, 0), 0)
                                               AS avg_disk_reads,
        ss.rows_processed_delta                AS total_rows
    FROM  dba_hist_sqlstat ss
    WHERE ss.executions_delta > 0
)
SELECT
    a.ecid,
    a.sample_time,
    a.sid,
    a.serial#,
    a.inst_id,
    a.sql_id,
    a.plan_hash_value,
    a.event,
    a.wait_class,
    a.module,
    a.client_id,
    a.ecid_total_samples,
    a.dominant_sql_id,
    sp.execs,
    sp.avg_elapsed_ms,
    sp.avg_cpu_ms,
    sp.avg_buffer_gets,
    sp.avg_disk_reads
FROM      ash_hist  a
LEFT JOIN sql_perf  sp
       ON sp.snap_id       = a.snap_id
      AND sp.sql_id        = a.sql_id
      AND sp.plan_hash_value = a.plan_hash_value
ORDER BY a.sample_time DESC;
📌 DBA_HIST vs GV$ASH — Key Differences:
DBA_HIST samples at 10-second intervals (not 1 second), so short events (<10 sec) may be invisible.
The snap_id join ensures SQL stats are from the same AWR snapshot period as the ASH samples, giving you elapsed time and buffer gets that were actually measured during the incident — not a 24-hour aggregate.
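A useful corollary of the sampling intervals: each ASH sample approximates one interval of DB time, so sample counts convert directly to rough seconds. A tiny illustration (function name is invented):

```python
def estimated_db_seconds(sample_count: int, source: str = 'GV$ASH') -> int:
    """Approximate DB time represented by ASH sample counts: each sample
    stands for roughly one sampling interval of active time
    (1s for in-memory ASH, 10s for DBA_HIST)."""
    interval_secs = {'GV$ASH': 1, 'DBA_HIST': 10}[source]
    return sample_count * interval_secs
```

So 300 DBA_HIST samples for an ECID imply roughly 50 minutes of DB time — the same count in GV$ASH implies only 5 minutes.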

🔗 10. Query 8 — ECID Blocking Chain Tree (Hierarchical + Cross-Node) ADVANCED

-- ============================================================
-- Q8: Full Blocking Chain from ECID Session Outward
-- Handles RAC cross-instance chains; shows lock object name
-- ============================================================
SELECT
    LPAD('  ', LEVEL * 3) || s.inst_id || ':' || s.sid
                                              AS session_tree,
    LEVEL                                   AS chain_depth,
    s.status,
    s.last_call_et                          AS active_secs,
    s.event,
    s.wait_class,
    s.seconds_in_wait,
    s.sql_id,
    s.module,
    s.action,
    s.username,
    s.machine,
    s.blocking_session,
    s.blocking_instance,
    s.blocking_session_status,
    -- Resolve lock object (for TX/TM waits)
    (SELECT o.object_name || ' (' || o.object_type || ')'
       FROM  gv$lock    lk
       JOIN  dba_objects o ON o.object_id = lk.id1
      WHERE  lk.sid     = s.sid
        AND  lk.inst_id = s.inst_id
        AND  lk.type    = 'TM'
        AND  rownum     = 1)               AS locked_object,
    -- OS SPID for escalation to OS kill if needed
    p.spid                                  AS os_spid
FROM       gv$session s
LEFT JOIN  gv$process p
        ON p.addr    = s.paddr
       AND p.inst_id = s.inst_id
START WITH  s.ecid LIKE '%&ECID_EXTRACT%'
        AND s.program LIKE '%oacore%'
CONNECT BY NOCYCLE
    PRIOR s.blocking_session  = s.sid
    AND   PRIOR s.blocking_instance = s.inst_id
ORDER SIBLINGS BY s.last_call_et DESC;
💡 NOCYCLE is mandatory here. In rare cases, Oracle can report circular blocking chains (two sessions each blocking the other — a deadlock resolved by Oracle but not yet cleaned up in GV$SESSION). Without NOCYCLE, the query fails with ORA-01436 (CONNECT BY loop in user data). The locked_object column resolves the TM lock to the actual table name — critical for knowing which EBS table is being locked (e.g., PO_HEADERS_ALL, OE_ORDER_LINES_ALL) so you can engage the functional team.
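The chain walk that CONNECT BY NOCYCLE performs can be mimicked in a few lines. A Python sketch with an invented waiter→blocker map (sessions as (inst_id, sid) tuples):

```python
def blocking_chain(start, blocker_of):
    """Walk outward from a session to its ultimate blocker, refusing to
    loop on circular chains — the CONNECT BY NOCYCLE behaviour in Q8.

    blocker_of maps waiter -> blocker, both as (inst_id, sid) tuples.
    """
    chain, seen = [start], {start}
    cur = start
    while cur in blocker_of:
        nxt = blocker_of[cur]
        if nxt in seen:          # circular chain: stop, as NOCYCLE does
            break
        chain.append(nxt)
        seen.add(nxt)
        cur = nxt
    return chain

# Hypothetical RAC chain: ECID session 1:4711 waits on 2:302, which waits on 2:77
edges = {(1, 4711): (2, 302), (2, 302): (2, 77)}
```

The last element of the returned chain is the root blocker — the session worth tracing (or killing) first.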

🔀 11. Query 9 — ECID to Concurrent Request Bridge ADVANCED

Concurrent Manager requests spawn their own DB sessions. When an ECID traces back to a request submitted through self-service (so the activity surfaces under oacore), or you need to tie ASH activity to a specific REQUEST_ID, this query bridges the gap.

-- ============================================================
-- Q9: ECID ↔ Concurrent Request Cross-Reference
-- Links ASH ECID data to FND_CONCURRENT_REQUESTS
-- ============================================================
SELECT
    a.ecid,
    fcr.request_id,
    fcr.phase_code,
    fcr.status_code,
    fcr.actual_start_date,
    fcr.actual_completion_date,
    ROUND((fcr.actual_completion_date - fcr.actual_start_date) * 1440, 1)
                                           AS runtime_mins,
    fcp.concurrent_program_name,
    fcp.user_concurrent_program_name,
    fu.user_name                           AS submitted_by,
    fcr.oracle_session_id,
    a.sample_time,
    a.sql_id,
    a.event,
    a.wait_class,
    a.module
FROM       gv$active_session_history  a
JOIN       fnd_concurrent_requests    fcr
        ON fcr.oracle_session_id = a.session_id
           -- NB: ORACLE_SESSION_ID holds the AUDSID, not the SID; for live
           -- sessions validate the mapping via GV$SESSION.AUDSID first
       AND fcr.actual_start_date < a.sample_time
       AND NVL(fcr.actual_completion_date, SYSDATE) >= a.sample_time
JOIN       fnd_concurrent_programs    fcp
        ON fcp.concurrent_program_id = fcr.concurrent_program_id
       AND fcp.application_id        = fcr.program_application_id
LEFT JOIN  fnd_user                   fu
        ON fu.user_id = fcr.requested_by
WHERE  a.sample_time > SYSDATE - 12/24
  AND  a.ecid LIKE '%&ECID_EXTRACT%'
ORDER BY  a.sample_time DESC;

🤖 12. Query 10 — PL/SQL ECID Watchdog with Auto-Kill + Alert Log ADVANCED

Rather than reacting to incidents, this procedure runs proactively (via DBMS_SCHEDULER every 5 minutes) and auto-kills oacore sessions that have been active on the same ECID beyond a configurable threshold — with a full audit trail written to a custom log table.

-- ============================================================
-- Setup: Create the watchdog audit table (run once)
-- ============================================================
CREATE TABLE apps.ecid_watchdog_log (
    log_id          NUMBER         GENERATED ALWAYS AS IDENTITY,
    log_time        DATE          DEFAULT SYSDATE,
    ecid            VARCHAR2(64),
    sid             NUMBER,
    serial#         NUMBER,
    inst_id         NUMBER,
    event           VARCHAR2(64),
    module          VARCHAR2(64),
    last_call_et    NUMBER,
    sql_id          VARCHAR2(13),
    action_taken    VARCHAR2(20),
    kill_cmd        VARCHAR2(120),
    notes           VARCHAR2(500),
    CONSTRAINT ecid_wdg_pk PRIMARY KEY (log_id)
);

-- ============================================================
-- The Watchdog Procedure
-- ============================================================
CREATE OR REPLACE PROCEDURE apps.ecid_watchdog_proc (
    p_threshold_secs  IN NUMBER   DEFAULT 600,   -- 10-min default
    p_auto_kill       IN VARCHAR2 DEFAULT 'N',   -- 'Y' to kill, 'N' for log-only
    p_program_filter  IN VARCHAR2 DEFAULT '%oacore%'
)
AS
    l_kill_cmd  VARCHAR2(200);
    l_action    VARCHAR2(20);
BEGIN
    FOR r IN (
        SELECT
            s.ecid,
            s.sid,
            s.serial#,
            s.inst_id,
            s.event,
            s.module,
            s.sql_id,
            s.last_call_et,
            s.status,
            s.blocking_session,
            s.blocking_instance
        FROM   gv$session s
        WHERE  s.ecid         IS NOT NULL
          AND  s.program      LIKE p_program_filter
          AND  s.status       = 'ACTIVE'
          AND  s.last_call_et > p_threshold_secs
          AND  s.username     NOT IN ('SYS', 'SYSTEM')
          AND  s.type         = 'USER'
        ORDER BY s.last_call_et DESC
    )
    LOOP
        l_kill_cmd := 'ALTER SYSTEM KILL SESSION '''
                      || r.sid || ',' || r.serial# || ',@' || r.inst_id
                      || ''' IMMEDIATE';
        l_action := 'LOGGED';

        IF p_auto_kill = 'Y' THEN
            BEGIN
                EXECUTE IMMEDIATE l_kill_cmd;
                l_action := 'KILLED';
            EXCEPTION
                WHEN OTHERS THEN
                    l_action := 'KILL_FAILED:' || SUBSTR(SQLERRM, 1, 30);
            END;
        END IF;

        INSERT INTO apps.ecid_watchdog_log
            (ecid, sid, serial#, inst_id, event, module, last_call_et,
             sql_id, action_taken, kill_cmd, notes)
        VALUES
            (r.ecid, r.sid, r.serial#, r.inst_id, r.event, r.module,
             r.last_call_et, r.sql_id, l_action, l_kill_cmd,
             'blocker=' || r.blocking_session || '@' || r.blocking_instance
             || ' status=' || r.status);

        DBMS_OUTPUT.PUT_LINE(l_action || ': ECID=' || SUBSTR(r.ecid,1,30)
            || ' SID=' || r.sid || ' ET=' || r.last_call_et || 's');
    END LOOP;

    COMMIT;
END ecid_watchdog_proc;
/

-- ============================================================
-- Schedule as DBMS_SCHEDULER job (every 5 minutes)
-- ============================================================
BEGIN
    DBMS_SCHEDULER.CREATE_JOB (
        job_name        => 'APPS.ECID_WATCHDOG_JOB',
        job_type        => 'STORED_PROCEDURE',
        job_action      => 'APPS.ECID_WATCHDOG_PROC',
        number_of_arguments => 3,
        start_date      => SYSTIMESTAMP,
        repeat_interval => 'FREQ=MINUTELY;INTERVAL=5',
        enabled         => FALSE,
        auto_drop       => FALSE,
        comments        => 'ECID Watchdog - log-only by default'
    );
    -- Set args: 300s threshold, log-only mode, oacore filter
    DBMS_SCHEDULER.SET_JOB_ARGUMENT_VALUE('APPS.ECID_WATCHDOG_JOB', 1, '300');
    DBMS_SCHEDULER.SET_JOB_ARGUMENT_VALUE('APPS.ECID_WATCHDOG_JOB', 2, 'N');
    DBMS_SCHEDULER.SET_JOB_ARGUMENT_VALUE('APPS.ECID_WATCHDOG_JOB', 3, '%oacore%');
    DBMS_SCHEDULER.ENABLE('APPS.ECID_WATCHDOG_JOB');
END;
/

-- Query the watchdog audit log
SELECT * FROM apps.ecid_watchdog_log
ORDER BY log_time DESC
FETCH FIRST 50 ROWS ONLY;
🚨 Production Safeguards for the Watchdog:
  • Always start with p_auto_kill = 'N' (log-only) for at least 1 week. Review the logs to ensure it would not kill legitimate long-running batch jobs.
  • Add exclusion ECIDs: prefix-match known batch ECIDs and skip them in the cursor WHERE clause.
  • Gate on blocking_session IS NULL if you only want to kill self-hung sessions, not sessions blocked by others (which need a different fix).
  • Set a daily count alert: if the watchdog kills >N sessions in 24 hours, it signals a recurring architectural problem — not a random hang.
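The safeguard checklist amounts to a small decision function. A hedged Python sketch — names, return codes, and defaults are illustrative, mirroring the procedure's 600-second default threshold:

```python
def watchdog_decision(ecid, last_call_et, blocking_session=None,
                      threshold_secs=600, auto_kill=False,
                      excluded_prefixes=()):
    """Mirror the safeguard rules above for one candidate session."""
    if any(ecid.startswith(p) for p in excluded_prefixes):
        return 'SKIP_EXCLUDED'        # known batch ECID prefix: leave alone
    if last_call_et <= threshold_secs:
        return 'OK'                   # under threshold: no action
    if blocking_session is not None:
        return 'LOGGED_BLOCKED'       # victim of a blocker: fix the blocker
    return 'KILLED' if auto_kill else 'LOGGED'
```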

🐚 13. Bash: Production-Grade ECID Extraction from oacore Logs

#!/bin/bash
# ============================================================
# ecid_extract.sh — Extract ECID from oacore WLS logs
# Usage: ./ecid_extract.sh [minutes_back] [ecid_fragment]
# ============================================================

DOMAIN_HOME="/u01/oracle/domains/EBSdomain"
OACORE_LOG="${DOMAIN_HOME}/servers/oacore/logs"
MINS_BACK=${1:-60}
ECID_FRAG=${2:-}

echo "=== ECID Extractor: Last ${MINS_BACK} minutes ==="

# Find log files modified in the last N minutes
find "${OACORE_LOG}" -name "*.log" -mmin -"${MINS_BACK}" 2>/dev/null | \
xargs -r grep -h "ecid=" 2>/dev/null | \
grep -oP 'ecid=\K[^;,\] ]+' | \
sort -u | \
awk 'length($0) > 10' > /tmp/ecid_candidates.txt

echo "Found $(wc -l < /tmp/ecid_candidates.txt) unique ECIDs"

# If a fragment was passed, filter to just matching ECIDs
if [ -n "${ECID_FRAG}" ]; then
    grep "${ECID_FRAG}" /tmp/ecid_candidates.txt
else
    head -20 /tmp/ecid_candidates.txt
fi

# Extract stuck thread ECIDs from WLS thread dump
echo
echo "=== ECIDs from stuck thread dumps ==="
grep -B5 -A10 "STUCK" "${OACORE_LOG}/oacore_server1.out" 2>/dev/null | \
grep -oP '(?<=ECID-Context: )[^\]"]+' | \
sort -u

# Cross-reference with current time for correlation
echo
echo "=== Generating SQL WHERE clause ==="
echo "-- Paste this into Phase 3.1 / Q1-Q6 scripts:"
head -5 /tmp/ecid_candidates.txt | while read ecid; do
    echo "  AND s.ecid LIKE '%$(echo ${ecid} | cut -c1-20)%'"
done
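The `grep -oP 'ecid=\K[^;,\] ]+'` pipeline does the heavy lifting here. A quick self-contained demonstration of what that regex captures from a typical ODL-style log line (the sample line below is fabricated for illustration):

```shell
# Sample ODL-style log line (fabricated). \K discards the "ecid=" prefix
# from the match; the class [^;,\] ] stops at ';', ',', ']' or a space.
sample='[2026-05-09T02:17:03] [oacore_server1] [NOTIFICATION] ecid=1.005bQabc123DEF;kXjE, rid=0:1'
echo "${sample}" | grep -oP 'ecid=\K[^;,\] ]+'
# → 1.005bQabc123DEF
```

Note that `-P` (PCRE) requires GNU grep; on platforms without it, the same extraction can be done with `sed -n 's/.*ecid=\([^;,] ]*\).*/\1/p'`-style substitution.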

🗺️ 14. Decision Framework — Which Query When

Situation → Query to use → Why

  • Live incident, session still active → Phase 3.1 (GV$SESSION): fastest path to the kill command; zero ASH needed.
  • Session died before you could look → Q2 (ASH Timeline): reconstruct exactly what happened, in order, from ASH memory.
  • Incident >1 hour ago, ASH flushed → Q7 (DBA_HIST_ASH): AWR holds 8 days; join with DBA_HIST_SQLSTAT for SQL performance context.
  • "Which requests are heaviest right now?" → Q3 (Top ECID Profile): STATS_MODE + wait breakdown surfaces the real offenders in one row per ECID.
  • "Which EBS user is it?" → Q4 (FND User Mapping): CLIENT_ID → FND_USER join + FND_LOG cross-reference.
  • RAC: session active across multiple nodes → Q5 (Cross-Node RAC): reveals gc waits and cross-node blocking, invisible to V$SESSION on one node.
  • Recurring slow query (same SQL, varying speed) → Q6 (Plan Stability): quantifies how much worse the bad plan is; drives an SPM/hint/profile solution.
  • Need to identify the locked table → Q8 (Blocking Chain Tree): TM lock join to DBA_OBJECTS gives the exact table name plus chain depth.
  • ECID came from a Concurrent Request → Q9 (CR Bridge): links the ASH ECID directly to REQUEST_ID, program, and submitter.
  • Prevent future repeats, automate response → Q10 (Watchdog PL/SQL): scheduled every 5 minutes, log-only or auto-kill, full audit trail.
  • "Show me the timeline in a heat chart" → Q1 (5-Min Bucket Heat): ASCII bar chart per 5-minute window; no external tool needed.

📝 15. Summary & RCA Template

Every ECID incident that you resolve should generate an RCA entry. Here is a minimal post-incident template built from the outputs of the queries above:

/* ============================================================
   RCA TEMPLATE — ECID Session Incident
   Copy into your incident ticket / Jira / SNOW
   ============================================================ */

-- 1. INCIDENT WINDOW
--    Start:     YYYY-MM-DD HH24:MI (extracted from Q1 heat map first_seen)
--    End:       YYYY-MM-DD HH24:MI
--    Duration:  NNN minutes

-- 2. ECID
--    Full ECID:     1.xxxxx...;kXjE
--    Source:        oacore_server1.out / access.log / FND_LOG_MESSAGES

-- 3. IMPACTED SESSION (from Phase 3.1 / Q2)
--    SID/Serial#/Inst: 247,8931,@2
--    FND User:         JSMITH (from Q4)
--    EBS Module:       OAF Page: PO/PORequisitionCreate (from MODULE column)
--    OS PID:           12847

-- 4. ROOT CAUSE
--    Dominant Event:   enq: TX - row lock contention (from Q3 STATS_MODE)
--    Blocker:          SID 189@1 (session on node 1, from Q8 chain tree)
--    Blocked by Table: PO_REQUISITION_LINES_ALL (TM lock, from Q8 locked_object)
--    SQL ID:           d8txf935msbx5 (dominant SQL from Q3)
--    Plan Hash:        1234567890 (stable / unstable — from Q6)

-- 5. ACTION TAKEN
--    ALTER SYSTEM KILL SESSION '247,8931,@2' IMMEDIATE;
--    Executed by:  [DBA Name], [Time]
--    User impact:  Session terminated; user re-submitted — successful

-- 6. PREVENTION
--    Short-term:   ECID Watchdog job enabled (log-only, 5-min interval)
--    Medium-term:  Analyse SQL d8txf935msbx5 for index opportunities
--    Long-term:    Review table locking strategy for PO_REQUISITION_LINES_ALL
--                  with functional team; consider row-level lock optimization
💡 Final Advanced Tip — Proactive ECID Baseline: Run Q3 (Top ECIDs Summary) daily during business hours for 2 weeks. Store the output. You will quickly build a baseline of which ECID patterns are normal (same module, same SQL, same wait class, consistent samples) vs anomalous (sudden spike in lock_samples, new plan_hash, cross-node blocking). This baseline turns reactive incident response into proactive capacity and tuning governance.
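One way to operationalise that baseline without any extra tooling is a small awk comparison: dump Q3's per-module lock_samples to a CSV each day, then diff today's numbers against the stored averages. Everything below — the file paths, the 3x spike threshold, and the sample figures — is an illustrative assumption, not output from a real system:

```shell
# Hypothetical baseline: module,avg_lock_samples built from 2 weeks of Q3 output
cat > /tmp/ecid_baseline.csv <<'EOF'
PO/PORequisitionCreate,12
AR/ARInvoiceEntry,4
EOF

# Today's Q3 snapshot in the same format
cat > /tmp/ecid_today.csv <<'EOF'
PO/PORequisitionCreate,55
AR/ARInvoiceEntry,5
EOF

# Flag any module whose lock_samples exceed 3x its baseline average.
# NR==FNR loads the baseline file into an array; the second pass compares.
awk -F, 'NR==FNR { base[$1] = $2; next }
         ($1 in base) && $2 > 3 * base[$1] {
             printf "SPIKE %s: %d vs baseline %d\n", $1, $2, base[$1]
         }' /tmp/ecid_baseline.csv /tmp/ecid_today.csv
# → SPIKE PO/PORequisitionCreate: 55 vs baseline 12
```

Wire the awk step into the same cron that runs Q3 and you get a zero-cost daily anomaly alert before users ever raise a ticket.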
