Oracle EBS Long Running Concurrent Request RCA: CPU Bound SQL, TEMP Spill, and Stale Statistics
1. Issue Summary
A long-running Oracle EBS concurrent request was reported during production processing. The request was active for several hours and required DBA investigation to identify whether the delay was due to storage, blocking, CPU, TEMP usage, stale statistics, or execution plan inefficiency.
2. Initial Concurrent Request Monitoring
| Column | Sample Value |
|---|---|
| Request ID | <REQUEST_ID> |
| Program Name | <CONCURRENT_PROGRAM_NAME> |
| SID | <SID> |
| SQL_ID | <SQL_ID> |
| Wait Event | db file sequential read |
| Status | ACTIVE |
| Runtime | Several hours |
At first glance, the wait event appeared to indicate single-block read activity.
3. Step 1: Check Current Session Wait
SELECT inst_id,
sid,
event,
wait_class,
p1text,
p1,
p2text,
p2,
p3text,
p3,
seconds_in_wait,
state
FROM gv$session_wait
WHERE sid = <SID>;
Sample Output
| EVENT | WAIT_CLASS | P1TEXT | P1 | P2TEXT | P2 | STATE |
|---|---|---|---|---|---|---|
| db file sequential read | User I/O | file# | xxx | block# | xxxxx | WAITED SHORT TIME |
Initial Interpretation
db file sequential read normally indicates single-block reads through index access or rowid lookup.
However, one current wait event alone does not prove the root cause.
4. Step 2: Check ASH Wait Profile
SELECT event,
wait_class,
sql_id,
COUNT(*) samples
FROM gv$active_session_history
WHERE session_id = <SID>
AND sample_time > SYSDATE - 1/24
GROUP BY event, wait_class, sql_id
ORDER BY samples DESC;
Sample Output
| Event | Wait Class | SQL_ID | Samples |
|---|---|---|---|
| null / ON CPU | null | <SQL_ID> | 3574 |
| direct path write temp | User I/O | <SQL_ID> | 18 |
| db file sequential read | User I/O | <SQL_ID> | 3 |
Finding
The SQL was primarily CPU-bound and not storage-bound.
5. Step 3: Check Parallel Execution
SELECT sql_id,
px_servers_executions,
executions,
elapsed_time/1000000 elapsed_sec
FROM gv$sql
WHERE sql_id = '<SQL_ID>';
Sample Output
| SQL_ID | PX_SERVERS_EXECUTIONS | EXECUTIONS | ELAPSED_SEC |
|---|---|---|---|
<SQL_ID> | 0 | 2 | 36863 |
Finding
The SQL executed completely in serial mode.
6. Step 4: Execution Plan Analysis
SELECT *
FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(
sql_id => '<SQL_ID>',
cursor_child_no => NULL,
format => 'ALLSTATS LAST +MEMSTATS +PARALLEL +PEEKED_BINDS'
));
Sample Explain Plan Output
--------------------------------------------------------------------------------------------
| Id | Operation | Name | A-Rows | Used-Tmp |
--------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | |
| 1 | SORT UNIQUE | | 1 | |
| 2 | VIEW | | 1 | |
| 3 | UNION ALL (RECURSIVE WITH) | | | |
| 4 | TABLE ACCESS BY INDEX ROWID | ***_LEVELS_TBL | 8 | |
| 5 | INDEX RANGE SCAN | ***_LEVELS_N3 | 45 | |
| 6 | HASH JOIN | | 1 | 24G |
| 7 | RECURSIVE WITH PUMP | | | |
| 8 | BUFFER SORT (REUSE) | | 5600 | |
| 9 | TABLE ACCESS BY INDEX ROWID | ***_LEVELS_TBL | 700 | |
| 10 | INDEX RANGE SCAN | ***_LEVELS_N3 | 851 | |
--------------------------------------------------------------------------------------------
Important Execution Plan Findings
A. Recursive WITH Processing
UNION ALL (RECURSIVE WITH)
RECURSIVE WITH PUMP
Meaning
The SQL is performing recursive hierarchy expansion.
This type of SQL is commonly used in:
- BOM explosion
- Planning hierarchy
- Recursive validation
- Cost rollup
- Parent-child hierarchy traversal
Recursive queries can exponentially increase intermediate rowsets.
B. HASH JOIN Operation
HASH JOIN
Used-Tmp : 24G
Meaning
Oracle selected a HASH JOIN strategy.
However, the HASH JOIN workarea overflowed and spilled approximately:
24 GB TEMP
This caused:
- heavy TEMP usage
- CPU-intensive processing
- long runtime
C. TABLE ACCESS BY INDEX ROWID
TABLE ACCESS BY INDEX ROWID
Meaning
Oracle used index-based row retrieval rather than full table scan.
This explains the occasional:
db file sequential read
seen in wait events.
However, this was not the root cause of the slowdown.
D. INDEX RANGE SCAN
INDEX RANGE SCAN
Meaning
Oracle used selective index access paths.
The issue was not missing indexes, but inefficient join/workarea processing caused by optimizer misestimation.
E. Serial Execution
PX_SERVERS_EXECUTIONS = 0
Meaning
The SQL executed entirely in serial mode.
Large recursive hierarchy processing in serial mode significantly increased elapsed runtime.
7. Step 5: Check PGA Configuration
SHOW PARAMETER pga;
Sample Output
| Parameter | Value |
|---|---|
| pga_aggregate_target | 80G |
| pga_aggregate_limit | 0 |
Finding
PGA was already configured with a large target.
Therefore the issue was not simply low PGA allocation.
8. Step 6: Validate Table Statistics
SELECT owner,
table_name,
num_rows,
last_analyzed,
stale_stats
FROM dba_tab_statistics
WHERE owner = '<OWNER>'
AND table_name = '<TABLE_NAME>';
Sample Output
| OWNER | TABLE_NAME | NUM_ROWS | LAST_ANALYZED | STALE_STATS |
|---|---|---|---|---|
<OWNER> | <TABLE_NAME> | 89 | <DATE> | YES |
9. Step 7: Validate Actual Row Count
SELECT COUNT(*)
FROM <OWNER>.<TABLE_NAME>;
Sample Output
| COUNT(*) |
|---|
| 1347 |
Critical Observation
| Source | Row Count |
|---|---|
| Optimizer statistics | 89 |
| Actual table rows | 1347 |
Optimizer statistics were stale and underestimated table cardinality.
10. Root Cause Chain
Stale statistics
→ wrong cardinality estimate
→ optimizer underestimated rows
→ inefficient HASH JOIN strategy
→ insufficient workarea estimation
→ HASH JOIN TEMP spill (~24GB)
→ CPU-intensive recursive processing
→ long serial execution runtime
11. Final RCA Statement
The concurrent request experienced prolonged runtime due to inefficient execution plan selection for a recursive hierarchy SQL.
Execution plan analysis identified recursive WITH processing combined with HASH JOIN operations. The HASH JOIN spilled significantly to TEMP (~24GB), causing CPU-intensive execution.
Further validation confirmed stale optimizer statistics on one of the key processing tables. Optimizer statistics estimated only 89 rows while actual table count was 1347. This cardinality misestimation caused the optimizer to underestimate recursive row expansion and allocate insufficient HASH JOIN workarea memory.
ASH analysis confirmed the session was predominantly CPU-bound with minimal storage waits, indicating the issue was related to execution plan inefficiency rather than storage subsystem latency.
12. Immediate Mitigation
Gather Fresh Table Statistics
EXEC DBMS_STATS.GATHER_TABLE_STATS(
ownname => '<OWNER>',
tabname => '<TABLE_NAME>',
estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,
cascade => TRUE,
degree => 4,
method_opt => 'FOR ALL COLUMNS SIZE AUTO'
);
13. Preventive Mitigation
Recommended Operational Flow
Data load/interface refresh
→ Gather targeted table statistics
→ Execute critical concurrent request
→ Validate execution plan stability
Proactive Stale Stats Validation
SELECT owner,
table_name,
stale_stats,
last_analyzed,
num_rows
FROM dba_tab_statistics
WHERE owner = '<OWNER>'
AND table_name = '<TABLE_NAME>';
14. Best Practice Recommendation
For volatile Oracle EBS interface/staging/recursive processing tables:
Do NOT rely only on weekly schema statistics.
Implement:
- targeted post-load stats gathering
- month-end validation checks
- execution plan monitoring
- stale stats validation before critical requests
15. Customer-Facing Summary
The delay was caused by stale optimizer statistics on a key processing table, leading to incorrect cardinality estimates and inefficient recursive HASH JOIN processing. The SQL spilled heavily to TEMP during hierarchy expansion, resulting in CPU-intensive serial execution and extended runtime. Targeted statistics gathering after data load and before critical concurrent processing is recommended to prevent recurrence.
No comments:
Post a Comment