Wednesday, May 13, 2026

Oracle EBS Long Running Concurrent Request RCA: CPU Bound SQL, TEMP Spill, and Stale Statistics

 

Oracle EBS Long Running Concurrent Request RCA: CPU Bound SQL, TEMP Spill, and Stale Statistics

1. Issue Summary

A long-running Oracle EBS concurrent request was reported during production processing. The request was active for several hours and required DBA investigation to identify whether the delay was due to storage, blocking, CPU, TEMP usage, stale statistics, or execution plan inefficiency.


2. Initial Concurrent Request Monitoring

ColumnSample Value
Request ID<REQUEST_ID>
Program Name<CONCURRENT_PROGRAM_NAME>
SID<SID>
SQL_ID<SQL_ID>
Wait Eventdb file sequential read
StatusACTIVE
RuntimeSeveral hours

At first glance, the wait event appeared to indicate single-block read activity.


3. Step 1: Check Current Session Wait

SELECT inst_id,
sid,
event,
wait_class,
p1text,
p1,
p2text,
p2,
p3text,
p3,
seconds_in_wait,
state
FROM gv$session_wait
WHERE sid = <SID>;

Sample Output

EVENTWAIT_CLASSP1TEXTP1P2TEXTP2STATE
db file sequential readUser I/Ofile#xxxblock#xxxxxWAITED SHORT TIME

Initial Interpretation

db file sequential read normally indicates single-block reads through index access or rowid lookup.

However, one current wait event alone does not prove the root cause.


4. Step 2: Check ASH Wait Profile

SELECT event,
wait_class,
sql_id,
COUNT(*) samples
FROM gv$active_session_history
WHERE session_id = <SID>
AND sample_time > SYSDATE - 1/24
GROUP BY event, wait_class, sql_id
ORDER BY samples DESC;

Sample Output

EventWait ClassSQL_IDSamples
null / ON CPUnull<SQL_ID>3574
direct path write tempUser I/O<SQL_ID>18
db file sequential readUser I/O<SQL_ID>3

Finding

The SQL was primarily CPU-bound and not storage-bound.


5. Step 3: Check Parallel Execution

SELECT sql_id,
px_servers_executions,
executions,
elapsed_time/1000000 elapsed_sec
FROM gv$sql
WHERE sql_id = '<SQL_ID>';

Sample Output

SQL_IDPX_SERVERS_EXECUTIONSEXECUTIONSELAPSED_SEC
<SQL_ID>0236863

Finding

The SQL executed completely in serial mode.


6. Step 4: Execution Plan Analysis

SELECT *
FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(
sql_id => '<SQL_ID>',
cursor_child_no => NULL,
format => 'ALLSTATS LAST +MEMSTATS +PARALLEL +PEEKED_BINDS'
));

Sample Explain Plan Output

--------------------------------------------------------------------------------------------
| Id | Operation | Name | A-Rows | Used-Tmp |
--------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | |
| 1 | SORT UNIQUE | | 1 | |
| 2 | VIEW | | 1 | |
| 3 | UNION ALL (RECURSIVE WITH) | | | |
| 4 | TABLE ACCESS BY INDEX ROWID | ***_LEVELS_TBL | 8 | |
| 5 | INDEX RANGE SCAN | ***_LEVELS_N3 | 45 | |
| 6 | HASH JOIN | | 1 | 24G |
| 7 | RECURSIVE WITH PUMP | | | |
| 8 | BUFFER SORT (REUSE) | | 5600 | |
| 9 | TABLE ACCESS BY INDEX ROWID | ***_LEVELS_TBL | 700 | |
| 10 | INDEX RANGE SCAN | ***_LEVELS_N3 | 851 | |
--------------------------------------------------------------------------------------------

Important Execution Plan Findings

A. Recursive WITH Processing

UNION ALL (RECURSIVE WITH)
RECURSIVE WITH PUMP

Meaning

The SQL is performing recursive hierarchy expansion.

This type of SQL is commonly used in:

  • BOM explosion
  • Planning hierarchy
  • Recursive validation
  • Cost rollup
  • Parent-child hierarchy traversal

Recursive queries can exponentially increase intermediate rowsets.


B. HASH JOIN Operation

HASH JOIN
Used-Tmp : 24G

Meaning

Oracle selected a HASH JOIN strategy.

However, the HASH JOIN workarea overflowed and spilled approximately:

24 GB TEMP

This caused:

  • heavy TEMP usage
  • CPU-intensive processing
  • long runtime

C. TABLE ACCESS BY INDEX ROWID

TABLE ACCESS BY INDEX ROWID

Meaning

Oracle used index-based row retrieval rather than full table scan.

This explains the occasional:

db file sequential read

seen in wait events.

However, this was not the root cause of the slowdown.


D. INDEX RANGE SCAN

INDEX RANGE SCAN

Meaning

Oracle used selective index access paths.

The issue was not missing indexes, but inefficient join/workarea processing caused by optimizer misestimation.


E. Serial Execution

PX_SERVERS_EXECUTIONS = 0

Meaning

The SQL executed entirely in serial mode.

Large recursive hierarchy processing in serial mode significantly increased elapsed runtime.


7. Step 5: Check PGA Configuration

SHOW PARAMETER pga;

Sample Output

ParameterValue
pga_aggregate_target80G
pga_aggregate_limit0

Finding

PGA was already configured with a large target.

Therefore the issue was not simply low PGA allocation.


8. Step 6: Validate Table Statistics

SELECT owner,
table_name,
num_rows,
last_analyzed,
stale_stats
FROM dba_tab_statistics
WHERE owner = '<OWNER>'
AND table_name = '<TABLE_NAME>';

Sample Output

OWNERTABLE_NAMENUM_ROWSLAST_ANALYZEDSTALE_STATS
<OWNER><TABLE_NAME>89<DATE>YES

9. Step 7: Validate Actual Row Count

SELECT COUNT(*)
FROM <OWNER>.<TABLE_NAME>;

Sample Output

COUNT(*)
1347

Critical Observation

SourceRow Count
Optimizer statistics89
Actual table rows1347

Optimizer statistics were stale and underestimated table cardinality.


10. Root Cause Chain

Stale statistics
→ wrong cardinality estimate
→ optimizer underestimated rows
→ inefficient HASH JOIN strategy
→ insufficient workarea estimation
→ HASH JOIN TEMP spill (~24GB)
→ CPU-intensive recursive processing
→ long serial execution runtime

11. Final RCA Statement

The concurrent request experienced prolonged runtime due to inefficient execution plan selection for a recursive hierarchy SQL.

Execution plan analysis identified recursive WITH processing combined with HASH JOIN operations. The HASH JOIN spilled significantly to TEMP (~24GB), causing CPU-intensive execution.

Further validation confirmed stale optimizer statistics on one of the key processing tables. Optimizer statistics estimated only 89 rows while actual table count was 1347. This cardinality misestimation caused the optimizer to underestimate recursive row expansion and allocate insufficient HASH JOIN workarea memory.

ASH analysis confirmed the session was predominantly CPU-bound with minimal storage waits, indicating the issue was related to execution plan inefficiency rather than storage subsystem latency.


12. Immediate Mitigation

Gather Fresh Table Statistics

EXEC DBMS_STATS.GATHER_TABLE_STATS(
ownname => '<OWNER>',
tabname => '<TABLE_NAME>',
estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,
cascade => TRUE,
degree => 4,
method_opt => 'FOR ALL COLUMNS SIZE AUTO'
);

13. Preventive Mitigation

Recommended Operational Flow

Data load/interface refresh
→ Gather targeted table statistics
→ Execute critical concurrent request
→ Validate execution plan stability

Proactive Stale Stats Validation

SELECT owner,
table_name,
stale_stats,
last_analyzed,
num_rows
FROM dba_tab_statistics
WHERE owner = '<OWNER>'
AND table_name = '<TABLE_NAME>';

14. Best Practice Recommendation

For volatile Oracle EBS interface/staging/recursive processing tables:

Do NOT rely only on weekly schema statistics.

Implement:

  • targeted post-load stats gathering
  • month-end validation checks
  • execution plan monitoring
  • stale stats validation before critical requests

15. Customer-Facing Summary

The delay was caused by stale optimizer statistics on a key processing table, leading to incorrect cardinality estimates and inefficient recursive HASH JOIN processing. The SQL spilled heavily to TEMP during hierarchy expansion, resulting in CPU-intensive serial execution and extended runtime. Targeted statistics gathering after data load and before critical concurrent processing is recommended to prevent recurrence.

No comments:

Post a Comment