## Introduction
This document provides a comprehensive guide for performing root cause analysis (RCA) on a Solaris-based server running Oracle 19c Gold Tier databases experiencing high server load. The environment includes 22 integrations, which may contribute to the performance issues. This guide includes SQL queries, Solaris commands, a step-by-step RCA process, and actionable recommendations to resolve the identified issues.
## 1. System-Level Diagnostics (Solaris Commands)
### 1.1 CPU Utilization
```bash
# Overall CPU utilization
mpstat 5 5
prstat -a -c -n 20 -s cpu 5 5
# Process-specific CPU usage (Oracle processes)
prstat -a -c -n 20 -s cpu -P 'oracle'
# Thread-level analysis
prstat -amLc -n 20 -s cpu 5 5
# Historical CPU data
sar -u 5 5
```
### 1.2 Memory Usage
```bash
# Overall memory statistics
vmstat 5 5
echo "::memstat" | mdb -k
# Process memory usage
prstat -a -c -n 20 -s size 5 5
pmap -x `pgrep -f ora_pmon`
# Swap usage
swap -s
swap -l
```
### 1.3 Disk I/O Performance
```bash
# Overall I/O statistics
iostat -xnz 5 5
# I/O by process
iotop -P 'oracle' 5 5
# File system usage
df -h
zpool status
zpool iostat -v 5 5
# I/O wait time
iostat -xnzc 5 5
```
### 1.4 Network Activity
```bash
# Network interface statistics
netstat -i 5
nicstat 5 5
# Connection statistics
netstat -an | grep ESTABLISHED | wc -l
netstat -an | grep "\.1521" | wc -l
# Network throughput
nicstat -i all 5 5
```
### 1.5 Process Activity
```bash
# Process listing
ps -ef | grep oracle
ps -o pid,pcpu,pmem,vsz,rss,time,args -p `pgrep -f ora_`
# Process tree
ptree `pgrep -f ora_pmon`
# Process resource usage over time
prstat -a -c -n 20 5 5
```
### 1.6 System Logs
```bash
# Check system logs for errors
tail -100 /var/adm/messages
dmesg | tail -100
# Check Oracle alert logs
tail -100 $ORACLE_BASE/diag/rdbms/$ORACLE_SID/$ORACLE_SID/trace/alert_$ORACLE_SID.log
```
## 2. Database-Level Diagnostics (Oracle SQL Queries)
### 2.1 Overall Database Health
```sql
-- Instance status and resource usage
SELECT instance_name, status, database_status, active_state, blocked
FROM v$instance;
-- Database resource usage summary
SELECT * FROM v$resource_limit
WHERE resource_name IN ('processes', 'sessions', 'transactions')
ORDER BY resource_name;
-- Database time model
SELECT stat_name, value, ROUND(value/1000000,2) as seconds
FROM v$sys_time_model
WHERE stat_name IN ('DB time', 'DB CPU', 'background cpu time', 'background elapsed time')
ORDER BY value DESC;
```
### 2.2 High-Load SQL Statements
```sql
-- Top SQL by CPU usage
SELECT sql_id, plan_hash_value, executions,
ROUND(elapsed_time/1000000,2) as elapsed_sec,
ROUND(cpu_time/1000000,2) as cpu_sec,
ROUND(buffer_gets/DECODE(executions,0,1,executions),2) as buffer_gets_per_exec,
ROUND(disk_reads/DECODE(executions,0,1,executions),2) as disk_reads_per_exec,
ROUND(rows_processed/DECODE(executions,0,1,executions),2) as rows_per_exec,
parsing_schema_name
FROM v$sql
WHERE executions > 0
ORDER BY cpu_time DESC
FETCH FIRST 20 ROWS ONLY;
-- Top SQL by elapsed time
SELECT sql_id, plan_hash_value, executions,
ROUND(elapsed_time/1000000,2) as elapsed_sec,
ROUND(cpu_time/1000000,2) as cpu_sec,
ROUND(buffer_gets/DECODE(executions,0,1,executions),2) as buffer_gets_per_exec,
ROUND(disk_reads/DECODE(executions,0,1,executions),2) as disk_reads_per_exec,
ROUND(rows_processed/DECODE(executions,0,1,executions),2) as rows_per_exec,
parsing_schema_name
FROM v$sql
WHERE executions > 0
ORDER BY elapsed_time DESC
FETCH FIRST 20 ROWS ONLY;
-- SQL text for high-load statements
SELECT sql_id, sql_text
FROM v$sqlarea
WHERE sql_id IN (
SELECT sql_id FROM (
SELECT sql_id FROM v$sql
ORDER BY cpu_time DESC
FETCH FIRST 20 ROWS ONLY
)
);
-- SQL execution plans for high-load statements
SELECT * FROM table(DBMS_XPLAN.DISPLAY_CURSOR('&sql_id', NULL, 'ALLSTATS LAST'));
```
### 2.3 Active Session History (ASH) Analysis
```sql
-- Top wait events from ASH
SELECT event, COUNT(*) as wait_count
FROM v$active_session_history
WHERE sample_time > SYSDATE - INTERVAL '1' HOUR
GROUP BY event
ORDER BY wait_count DESC
FETCH FIRST 20 ROWS ONLY;
-- Top SQL from ASH
SELECT sql_id, COUNT(*) as activity_count
FROM v$active_session_history
WHERE sample_time > SYSDATE - INTERVAL '1' HOUR
AND sql_id IS NOT NULL
GROUP BY sql_id
ORDER BY activity_count DESC
FETCH FIRST 20 ROWS ONLY;
-- Top sessions from ASH
SELECT session_id, session_serial#, user_id, COUNT(*) as activity_count
FROM v$active_session_history
WHERE sample_time > SYSDATE - INTERVAL '1' HOUR
GROUP BY session_id, session_serial#, user_id
ORDER BY activity_count DESC
FETCH FIRST 20 ROWS ONLY;
-- ASH by wait class
SELECT wait_class, COUNT(*) as wait_count
FROM v$active_session_history
WHERE sample_time > SYSDATE - INTERVAL '1' HOUR
GROUP BY wait_class
ORDER BY wait_count DESC;
```
### 2.4 Wait Events Analysis
```sql
-- System-wide wait events
SELECT event, total_waits, total_timeouts,
ROUND(time_waited/100,2) as time_waited_sec,
ROUND(average_wait/100,2) as average_wait_ms
FROM v$system_event
WHERE wait_class != 'Idle'
ORDER BY time_waited DESC
FETCH FIRST 20 ROWS ONLY;
-- Session wait events
SELECT s.sid, s.serial#, s.username, s.status, s.machine, s.program,
sw.event, sw.wait_class, sw.state,
ROUND(sw.seconds_in_wait,2) as wait_sec
FROM v$session s, v$session_wait sw
WHERE s.sid = sw.sid
AND sw.wait_class != 'Idle'
AND s.status = 'ACTIVE'
ORDER BY sw.seconds_in_wait DESC;
-- Historical wait events (AWR)
SELECT event_name, wait_class,
ROUND(time_waited_micro/1000000,2) as time_waited_sec
FROM dba_hist_system_event
WHERE snap_id BETWEEN &start_snap_id AND &end_snap_id
AND wait_class != 'Idle'
ORDER BY time_waited_micro DESC
FETCH FIRST 20 ROWS ONLY;
```
### 2.5 Session Activity
```sql
-- Active sessions
SELECT s.sid, s.serial#, s.username, s.status, s.machine, s.program,
s.last_call_et, s.logon_time, s.sql_id, s.prev_sql_id,
p.spid as os_pid, s.blocking_session
FROM v$session s, v$process p
WHERE s.paddr = p.addr
AND s.status = 'ACTIVE'
AND s.username IS NOT NULL
ORDER BY s.last_call_et DESC;
-- Session resource usage
SELECT se.sid, se.serial#, se.username, se.status, se.machine, se.program,
ss.value as logical_reads
FROM v$session se, v$sesstat ss, v$statname sn
WHERE se.sid = ss.sid
AND ss.statistic# = sn.statistic#
AND sn.name = 'session logical reads'
AND se.username IS NOT NULL
ORDER BY ss.value DESC
FETCH FIRST 20 ROWS ONLY;
-- Long-running operations
SELECT sid, serial#, opname, target, sofar, totalwork,
ROUND(elapsed_seconds/60,2) as elapsed_min,
ROUND(time_remaining/60,2) as remaining_min,
ROUND(sofar/totalwork*100,2) as pct_complete
FROM v$session_longops
WHERE sofar < totalwork
ORDER BY elapsed_seconds DESC;
```
### 2.6 Database Resource Consumption
```sql
-- Memory usage
SELECT * FROM v$sgastat
ORDER BY bytes DESC
FETCH FIRST 20 ROWS ONLY;
SELECT * FROM v$pgastat;
-- Undo usage
SELECT * FROM v$undostat
ORDER BY begin_time DESC
FETCH FIRST 10 ROWS ONLY;
-- Redo log activity
SELECT * FROM v$log;
SELECT thread#, sequence#, ROUND(bytes/1024/1024,2) as size_mb,
members, archived, status
FROM v$log
ORDER BY thread#, sequence#;
-- Tablespace usage
SELECT df.tablespace_name, df.file_name, df.bytes/1024/1024 as size_mb,
ROUND((df.bytes - fs.bytes)/1024/1024,2) as used_mb,
ROUND(fs.bytes/1024/1024,2) as free_mb,
ROUND((df.bytes - fs.bytes) / df.bytes * 100, 2) as pct_used
FROM dba_data_files df,
(SELECT file_id, SUM(bytes) as bytes
FROM dba_free_space
GROUP BY file_id) fs
WHERE df.file_id = fs.file_id
ORDER BY pct_used DESC;
```
### 2.7 Contended Objects and Locking Issues
```sql
-- Locked objects
SELECT l.session_id, l.oracle_username, l.os_user_name,
o.owner, o.object_name, o.object_type,
l.locked_mode
FROM v$locked_object l, dba_objects o
WHERE l.object_id = o.object_id
ORDER BY l.session_id;
-- Blocking sessions
SELECT s1.username || '@' || s1.machine as blocker,
s2.username || '@' || s2.machine as blockee,
l1.sid, l2.sid, l1.type,
o.owner, o.object_name, o.object_type,
l1.lmode, l2.request
FROM v$lock l1, v$lock l2, v$session s1, v$session s2, dba_objects o
WHERE l1.id1 = l2.id1
AND l1.block = 1
AND l2.request > 0
AND l1.sid = s1.sid
AND l2.sid = s2.sid
AND l1.id1 = o.object_id;
-- Hot objects (frequently accessed)
SELECT o.owner, o.object_name, o.object_type, o.status,
s.statistic_name, s.value
FROM v$segment_statistics s, dba_objects o
WHERE s.owner = o.owner
AND s.object_name = o.object_name
AND s.object_type = o.object_type
AND s.statistic_name IN ('logical reads', 'physical reads', 'physical writes')
ORDER BY s.value DESC
FETCH FIRST 20 ROWS ONLY;
```
### 2.8 Integration-Specific Analysis
```sql
-- Sessions by module (to identify integrations)
SELECT module, COUNT(*) as session_count
FROM v$session
WHERE module IS NOT NULL
GROUP BY module
ORDER BY session_count DESC;
-- Resource usage by module
SELECT se.module, COUNT(DISTINCT se.sid) as session_count,
SUM(ss.value) as total_logical_reads
FROM v$session se, v$sesstat ss, v$statname sn
WHERE se.sid = ss.sid
AND ss.statistic# = sn.statistic#
AND sn.name = 'session logical reads'
AND se.module IS NOT NULL
GROUP BY se.module
ORDER BY total_logical_reads DESC;
-- SQL by module
SELECT se.module, sq.sql_id, COUNT(*) as execution_count
FROM v$session se, v$sql sq
WHERE se.sql_id = sq.sql_id
AND se.module IS NOT NULL
GROUP BY se.module, sq.sql_id
ORDER BY execution_count DESC
FETCH FIRST 20 ROWS ONLY;
```
## 3. AWR and ADDM Analysis
### 3.1 Generate AWR Reports
```sql
-- List available snapshots
SELECT snap_id, begin_interval_time, end_interval_time
FROM dba_hist_snapshot
ORDER BY snap_id DESC
FETCH FIRST 20 ROWS ONLY;
-- Generate AWR report
SELECT output FROM TABLE(DBMS_WORKLOAD_REPOSITORY.AWR_REPORT_TEXT(
l_dbid => &database_id,
l_inst_num => &instance_number,
l_bid => &begin_snap_id,
l_eid => &end_snap_id,
l_options => 0
));
-- Generate AWR Global report for RAC
SELECT output FROM TABLE(DBMS_WORKLOAD_REPOSITORY.AWR_GLOBAL_REPORT_TEXT(
l_dbid => &database_id,
l_inst_num => &instance_number,
l_bid => &begin_snap_id,
l_eid => &end_snap_id,
l_options => 0
));
```
### 3.2 Generate ADDM Reports
```sql
-- Run ADDM analysis
DECLARE
task_name VARCHAR2(30);
task_desc VARCHAR2(256);
task_id NUMBER;
BEGIN
task_name := 'ADDM_' || TO_CHAR(SYSDATE, 'YYYYMMDD_HH24MISS');
task_desc := 'ADDM run for high load analysis';
DBMS_ADVISOR.CREATE_TASK(
advisor_name => 'ADDM',
task_id => task_id,
task_name => task_name,
task_desc => task_desc
);
DBMS_ADVISOR.SET_TASK_PARAMETER(
task_name => task_name,
parameter => 'START_SNAPSHOT',
value => &begin_snap_id
);
DBMS_ADVISOR.SET_TASK_PARAMETER(
task_name => task_name,
parameter => 'END_SNAPSHOT',
value => &end_snap_id
);
DBMS_ADVISOR.EXECUTE_TASK(task_name => task_name);
END;
/
-- View ADDM report
SELECT DBMS_ADVISOR.GET_TASK_REPORT('&task_name', 'TEXT', 'ALL') as report
FROM DUAL;
```
## 4. Step-by-Step RCA Process
### 4.1 Initial Assessment
1. **Collect baseline metrics**:
- Run `mpstat`, `vmstat`, `iostat`, and `prstat` to establish current system load
- Query v$instance, v$resource_limit, and v$system_time_model for database status
2. **Identify immediate symptoms**:
- Check for CPU saturation using `mpstat` and `prstat`
- Check for memory pressure using `vmstat` and `echo "::memstat" | mdb -k`
- Check for I/O bottlenecks using `iostat -xnz`
- Check for network issues using `netstat` and `nicstat`
3. **Review alert logs**:
- Check system logs: `/var/adm/messages` and `dmesg`
- Check Oracle alert logs: `$ORACLE_BASE/diag/rdbms/$ORACLE_SID/$ORACLE_SID/trace/alert_$ORACLE_SID.log`
### 4.2 Database Performance Analysis
1. **Generate and analyze AWR reports**:
- Identify peak load periods from AWR snapshots
- Generate AWR reports for those periods
- Review Top 5 Timed Events section to identify main wait events
2. **Analyze active sessions**:
- Query v$session for currently active sessions
- Query v$session_wait for current wait events
- Query v$active_session_history for historical session activity
3. **Identify high-load SQL statements**:
- Query v$sql for top SQL by CPU and elapsed time
- Get execution plans for high-load SQL statements
- Analyze SQL statistics for inefficient execution
4. **Check for resource contention**:
- Query v$locked_object for object locks
- Check for blocking sessions
- Identify hot objects from v$segment_statistics
### 4.3 Integration-Specific Analysis
1. **Identify active integrations**:
- Query v$session by module to identify active integration connections
- Group sessions by module to identify high-resource integrations
2. **Analyze integration patterns**:
- Check connection frequency and duration
- Identify SQL statements executed by each integration
- Analyze resource consumption patterns by integration
3. **Correlation with system metrics**:
- Match integration activity peaks with system load spikes
- Identify integrations causing excessive I/O or CPU usage
### 4.4 Root Cause Identification
1. **Cross-reference all metrics**:
- Correlate system-level bottlenecks with database wait events
- Match high-load periods with specific SQL or integration activity
- Identify resource contention patterns
2. **Categorize issues by impact**:
- CPU-bound issues (high CPU usage, runaway processes)
- Memory-bound issues (insufficient SGA/PGA, excessive sorting)
- I/O-bound issues (disk bottlenecks, inefficient I/O patterns)
- Concurrency issues (locks, latches, contention)
- Application design issues (inefficient SQL, excessive connections)
3. **Validate findings**:
- Confirm patterns across multiple data points
- Verify timing correlation between symptoms and causes
- Test hypotheses with targeted queries
## 5. Recommendations for Resolution
### 5.1 SQL Tuning Recommendations
1. **Optimize high-load SQL statements**:
```sql
-- Create SQL profile for high-load statement
DECLARE
l_sql_text CLOB;
BEGIN
SELECT sql_fulltext INTO l_sql_text
FROM v$sql
WHERE sql_id = '&sql_id'
AND rownum = 1;
DBMS_SQLTUNE.CREATE_TUNING_TASK(
sql_text => l_sql_text,
task_name => 'TUNE_&sql_id',
description => 'Tuning task for high-load SQL'
);
DBMS_SQLTUNE.EXECUTE_TUNING_TASK(task_name => 'TUNE_&sql_id');
END;
/
-- View tuning recommendations
SELECT DBMS_SQLTUNE.REPORT_TUNING_TASK('TUNE_&sql_id') AS recommendations
FROM DUAL;
-- Implement SQL profile if recommended
DECLARE
l_result VARCHAR2(4000);
BEGIN
l_result := DBMS_SQLTUNE.ACCEPT_SQL_PROFILE(
task_name => 'TUNE_&sql_id',
name => 'PROF_&sql_id',
force_match => TRUE
);
END;
/
```
2. **Create missing indexes**:
```sql
-- Identify missing indexes
SELECT recommendation, benefit
FROM TABLE(DBMS_ADVISOR.GET_TASK_REPORT('TUNE_&sql_id', 'TEXT', 'ALL'))
WHERE recommendation LIKE '%CREATE INDEX%';
-- Create recommended index
CREATE INDEX owner.index_name ON owner.table_name(column1, column2);
```
3. **Consider SQL plan baselines**:
```sql
-- Create SQL plan baseline
DECLARE
l_sql_handle VARCHAR2(30);
l_plan_name VARCHAR2(30);
BEGIN
DBMS_SPM.LOAD_PLANS_FROM_CURSOR_CACHE(
sql_id => '&sql_id',
plan_hash_value => &plan_hash_value,
sql_handle => l_sql_handle,
plan_name => l_plan_name
);
END;
/
```
### 5.2 Database Configuration Changes
1. **Adjust memory parameters**:
```sql
-- Check current memory allocation
SHOW PARAMETER sga;
SHOW PARAMETER pga;
-- Adjust SGA and PGA sizes
ALTER SYSTEM SET sga_target = 16G SCOPE=SPFILE;
ALTER SYSTEM SET pga_aggregate_target = 8G SCOPE=SPFILE;
-- Adjust shared pool size if needed
ALTER SYSTEM SET shared_pool_size = 4G SCOPE=SPFILE;
```
2. **Optimize redo log configuration**:
```sql
-- Check current redo log configuration
SELECT group#, bytes/1024/1024 as size_mb, members, status FROM v$log;
-- Add redo log groups or resize if needed
ALTER DATABASE ADD LOGFILE GROUP 4 ('/path/to/redo04a.log', '/path/to/redo04b.log') SIZE 1G;
ALTER DATABASE ADD LOGFILE GROUP 5 ('/path/to/redo05a.log', '/path/to/redo05b.log') SIZE 1G;
```
3. **Adjust database parameters**:
```sql
-- Optimize parallel execution
ALTER SYSTEM SET parallel_max_servers = 32 SCOPE=BOTH;
ALTER SYSTEM SET parallel_min_servers = 8 SCOPE=BOTH;
-- Optimize optimizer parameters
ALTER SYSTEM SET optimizer_index_cost_adj = 50 SCOPE=BOTH;
ALTER SYSTEM SET optimizer_index_caching = 75 SCOPE=BOTH;
-- Adjust session limits for integrations
ALTER SYSTEM SET sessions = 1000 SCOPE=SPFILE;
ALTER SYSTEM SET processes = 800 SCOPE=SPFILE;
```
### 5.3 System-Level Optimizations
1. **Solaris kernel parameter tuning**:
```bash
# Check current kernel parameters
/usr/sbin/ndd -get /dev/tcp tcp_conn_req_max_q
/usr/sbin/ndd -get /dev/tcp tcp_conn_req_max_q0
/usr/sbin/ndd -get /dev/tcp tcp_xmit_hiwat
/usr/sbin/ndd -get /dev/tcp tcp_recv_hiwat
# Set optimized parameters
echo "tcp_conn_req_max_q=1024" >> /etc/system
echo "tcp_conn_req_max_q0=4096" >> /etc/system
echo "tcp_xmit_hiwat=65536" >> /etc/system
echo "tcp_recv_hiwat=65536" >> /etc/system
```
2. **I/O scheduling optimization**:
```bash
# Check current I/O scheduler
echo "::iosched" | mdb -k
# Adjust I/O priority for Oracle processes
priocntl -e -c RT -p 59 -i pid `pgrep -f ora_dbw`
priocntl -e -c RT -p 59 -i pid `pgrep -f ora_lgwr`
```
3. **File system optimization**:
```bash
# Check current mount options
mount | grep oracle
# Remount with optimized options
mount -o remount,forcedirectio,nologging /oracle
```
### 5.4 Integration-Specific Recommendations
1. **Connection pooling optimization**:
- Implement connection pooling for all integrations
- Set appropriate pool sizes based on workload
- Configure timeout and validation parameters
2. **Batch processing improvements**:
- Convert real-time queries to batch processing where possible
- Schedule integration jobs during off-peak hours
- Implement incremental processing for large data sets
3. **Load balancing**:
- Distribute integration connections across RAC nodes
- Implement service-based routing for workload management
- Configure services with appropriate runtime priorities
### 5.5 Monitoring and Alerting Setup
1. **Set up proactive monitoring**:
- Configure OEM metrics and thresholds
- Create custom metrics for integration-specific monitoring
- Set up baseline deviation alerts
2. **Implement automated response**:
- Create incident rules for common issues
- Configure automatic trace collection for problematic sessions
- Implement automatic SQL tuning for high-load statements
3. **Long-term trend analysis**:
- Configure AWR retention for longer periods
- Set up regular AWR comparison reports
- Implement capacity planning based on growth trends
## 6. Implementation Plan
### 6.1 Immediate Actions (0-24 hours)
1. Identify and tune top 5 resource-intensive SQL statements
2. Resolve any blocking sessions or locking issues
3. Adjust memory parameters if insufficient
4. Implement connection pooling for high-volume integrations
5. Set up basic monitoring alerts for critical resources
### 6.2 Short-Term Actions (1-7 days)
1. Complete SQL tuning for all high-load statements
2. Implement recommended indexes and SQL profiles
3. Adjust database parameters based on workload analysis
4. Optimize Solaris kernel parameters
5. Implement batch processing improvements for integrations
### 6.3 Long-Term Actions (1-4 weeks)
1. Redesign inefficient application code or database structures
2. Implement comprehensive monitoring and alerting
3. Develop capacity planning based on growth projections
4. Consider hardware upgrades if necessary
5. Implement regular performance review process
## 7. Conclusion
This root cause analysis guide provides a comprehensive approach to diagnosing and resolving high server load issues on a Solaris-based Oracle 19c Gold Tier database environment with multiple integrations. By systematically collecting and analyzing both system and database metrics, you can identify the specific bottlenecks causing performance degradation and implement targeted solutions to address them.
The recommendations provided are designed to be implemented in a phased approach, starting with immediate actions to alleviate critical issues, followed by short-term and long-term improvements to ensure sustained performance. Regular monitoring and proactive tuning will help prevent similar issues from recurring in the future.
No comments:
Post a Comment