Tuesday, May 12, 2026

APPS DBA = SQL Plan Regression Runbook


SQL_ID: <SQL_ID> Incident: Month-End Performance Degradation Bad Plan Hash: <BAD_PLAN_HASH> Good Plan Hash (if known): <GOOD_PLAN_HASH>


1. Purpose

This runbook provides a standardized, production-safe procedure to diagnose and resolve SQL plan regression for the identified SQL statement during month-end processing.

2. Scope

Applies when this SQL_ID shows significantly higher elapsed time due to plan change from good to bad plan hash value.

3. Prerequisites

  • Access to the database as a privileged user (with SELECT on AWR views, DBA_* views).
  • AWR snapshots available for the affected period.
  • SQL Tuning Advisor and SQL Plan Management licenses (Enterprise Edition).

4. Step-by-Step Diagnosis

Step 4.1: Confirm Multiple Plans Exist

SQL
SELECT DISTINCT plan_hash_value,
       MIN(begin_interval_time) first_seen,
       MAX(begin_interval_time) last_seen,
       COUNT(*) snaps
FROM dba_hist_sqlstat
WHERE sql_id = '<SQL_ID>'
GROUP BY plan_hash_value
ORDER BY first_seen;

Step 4.2: Compare Performance by Plan

SQL
SELECT 
  plan_hash_value,
  SUM(executions_delta) executions,
  ROUND(SUM(elapsed_time_delta)/1e6 / NULLIF(SUM(executions_delta),0), 2) "AVG_ELA_SEC",
  ROUND(SUM(cpu_time_delta)/1e6 / NULLIF(SUM(executions_delta),0), 2) "AVG_CPU_SEC",
  ROUND(SUM(buffer_gets_delta) / NULLIF(SUM(executions_delta),0), 0) "AVG_BUF_GETS",
  ROUND(SUM(disk_reads_delta) / NULLIF(SUM(executions_delta),0), 0) "AVG_DISK_READS"
FROM dba_hist_sqlstat s
JOIN dba_hist_snapshot sn USING (snap_id, instance_number)
WHERE sql_id = '<SQL_ID>'
  AND executions_delta > 0
GROUP BY plan_hash_value
ORDER BY "AVG_ELA_SEC" DESC;

Decision: If <BAD_PLAN_HASH> has significantly higher AVG_ELA_SEC (typically ≥ 3x), proceed.

Step 4.3: Compare Execution Plans

SQL
-- Current / Bad Plan
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('<SQL_ID>', 0, 'ADVANCED ALLSTATS LAST'));

-- Historical Good Plan (AWR)
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_AWR('<SQL_ID>', <GOOD_PLAN_HASH>, NULL, 'ADVANCED'));

Look for differences in: Full Table Scan vs Index, Join order/method, Cardinality estimates.

Step 4.4: Check Stale Statistics

SQL
EXEC DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO;

SELECT owner, table_name, partition_name, last_analyzed, stale_stats 
FROM dba_tab_statistics 
WHERE stale_stats = 'YES' OR last_analyzed IS NULL;

Step 4.5: Check Bind Sensitivity & Child Cursors

SQL
SELECT child_number, plan_hash_value, is_bind_sensitive, is_bind_aware, executions, 
       elapsed_time/1e6 avg_ela_sec
FROM v$sql 
WHERE sql_id = '<SQL_ID>';

SELECT * FROM v$sql_shared_cursor WHERE sql_id = '<SQL_ID>';

5. Corrective Actions (Production-Safe Order)

PriorityActionCommand / StepsRisk Level
1 (Recommended)SQL Plan Baseline (fixes plan stability)```sql
2SQL Profile (if no good plan in cache)Run SQL Tuning Advisor → Accept ProfileLow-Medium
3Gather fresh statistics (maintenance window)EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname=>'OWNER', tabname=>'TABLE', degree=>8);Medium

Do NOT use OPTIMIZER_GATHER_STATS hints or drop plans in production without testing.


6. Verification Steps

After applying fix:

  1. Re-run performance comparison query (Step 4.2).
  2. Confirm new executions are using <GOOD_PLAN_HASH>.
  3. Monitor real-time:
    SQL
    SELECT sql_id, plan_hash_value, executions, elapsed_time/1e6 
    FROM v$sql WHERE sql_id = '<SQL_ID>';
  4. Check AWR report for next snapshot.

Success Criteria: Average elapsed time returns to normal/good plan levels.


7. Rollback / Backout Plan

  • Drop Baseline:
    SQL
    EXEC DBMS_SPM.DROP_SQL_PLAN_BASELINE(plan_name => '<PLAN_NAME>');
  • Drop SQL Profile:
    SQL
    EXEC DBMS_SQLTUNE.DROP_SQL_PROFILE(profile_name => '<PROFILE_NAME>');

8. Root Cause Analysis (RCA) Template for Management

Root Cause: The SQL with ID <SQL_ID> experienced a plan regression during month-end from Plan Hash <GOOD_PLAN_HASH> (efficient) to <BAD_PLAN_HASH> (sub-optimal). This was triggered by [stale statistics / bind peeking issue / data volume skew / cardinality misestimation].

Business Impact: Elapsed time increased by ~X times, causing month-end batch delay of XX minutes/hours.

Corrective Action: Implemented SQL Plan Baseline (or Profile) to enforce the good plan. Statistics gathering scheduled.

Prevention Measures:

  • Enable SQL Plan Management for critical SQLs.
  • Review statistics gathering strategy for month-end tables.
  • Monitor plan changes via AWR / custom alerts.

9. Escalation Path

  • Level 1: DBA performing analysis
  • Level 2: Senior DBA / Performance Architect
  • Level 3: Application Owner + Architecture Team

10. References

  • Oracle Documentation: DBMS_SPM, DBMS_SQLTUNE, DBMS_XPLAN
  • AWR Views: DBA_HIST_SQLSTAT, DBA_HIST_SNAPSHOT

No comments:

Post a Comment