SQL Server Profiling – How to get information on bad SQLs


1. Start the profiling with the following settings in Profiler

Start with custom profiling settings (it will start with all columns and events).

Events not to be included

  • Broker
  • CLR
  • Database
  • Deprecation
  • Locks (particularly the lock acquired and lock released events)
  • Errors and Warnings (most of them; the exceptions to keep are listed below)
  • OLEDB
  • Objects
  • Scans (in particular)
  • Security Audit
  • Server
  • Sessions
  • Transactions (particularly the transaction start/stop events)

Events to be included

  • Cursors (mostly)
  • Errors and Warnings (Attention, Exchange Spill, Blocked process report)
  • Performance (mostly)
  • Sessions (ExistingConnection, for testing)
  • Stored Procedures
  • TSQL

Columns (to be selected)

  • ApplicationName
  • ClientProcessID
  • DatabaseName
  • EventSequence
  • SPID
  • StartTime
  • EndTime
  • TransactionID
  • XactSequence
  • Reads
  • Writes
  • RowCounts

2. Capture the trace and save the trace information

(a)  Save the trace to a trace file and then read the file from a query session using the function fn_trace_gettable

(b) Save the results directly into a table (or load the trace file into a table, as sketched below)
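A minimal sketch of loading a saved trace file into a table from T-SQL, assuming the file path used in the example further below and dbo.trace_table as the (placeholder) table name referenced by the queries in step 3:

SELECT ROW_NUMBER() OVER (ORDER BY EventSequence) AS RowNumber,   -- mimic the RowNumber column Profiler adds when saving to a table
       t.*
INTO   dbo.trace_table
FROM   fn_trace_gettable('C:\trace\badsqls.trc', DEFAULT) AS t;   -- DEFAULT also reads any rollover files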

Trace related queries

3. Run the queries below to get the main problem queries

Note: These queries can return the same query multiple times (rows 1 and 2 can point to the same query). Still, they give a clear idea of which SQLs are the top time consumers, along with their timings. A rough de-duplicating variant is sketched after query (b) below.

When a trace file is used to store the trace information

Run the below query in a SQL Server Management Studio session:

select textdata, duration, reads, writes, rowcounts
from fn_trace_gettable('C:\trace\badsqls.trc', DEFAULT) trace
order by duration desc;

When a table is used to store the trace information

In SQL Server, multiple user sessions can be active at any given point of time.

a- This query finds the timings on a per-connection basis.

select applicationname, clientprocessid,
       sum(duration)/1000 "Total time in millisecs"
from dbo.trace_table   -- give the table name as mentioned in step 2
where
-- this leaves out SQLs that were already running before the start of the trace, if any
starttime >= (select starttime from dbo.trace_table where rownumber = 1)
group by clientprocessid, applicationname
order by 3 desc;

b- This query finds the timings of all SQLs for a particular connection.

select starttime, endtime, applicationname, clientprocessid,
       duration/1000 "Total time in millisecs",
       transactionid, xactsequence,
       objectname, textdata
from dbo.trace_table
where clientprocessid in (31680)   -- captured above
  and starttime >= (select starttime from dbo.trace_table where rownumber = 1)
order by duration desc;   -- to get the SQLs in descending order of elapsed time
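As noted above, the same statement text can show up many times. A rough way to aggregate by statement is sketched below; it casts TextData to nvarchar(4000), so statements are grouped by their first 4000 characters (a simplification), and dbo.trace_table is again the placeholder table name:

select cast(textdata as nvarchar(4000)) as sql_text,   -- TextData is ntext in the trace table, so cast before grouping
       count(*)           as executions,
       sum(duration)/1000 as total_time_ms
from dbo.trace_table
where starttime >= (select starttime from dbo.trace_table where rownumber = 1)
group by cast(textdata as nvarchar(4000))
order by total_time_ms desc;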

Disclaimer: Coming from an Oracle background, the idea was to have something similar to trace/AWR. I follow this approach on a regular basis during SQL Server tuning exercises to find the top culprit SQLs.

Debugging ORA-12518 and a permissions issue with the Oracle listener


As per the documentation, it means that the TNS listener process is not able to hand off the client connection.

The function of the listener can be thought of as a two-step process:

1. It acts as the ear on the database side ("hey, I am here") and then..

2. It passes the incoming request (after hearing it) on to the particular database instance (saying hello onwards to the database).

This error relates to the second step, so it has nothing to do with the listener as such. The listener has worked fine in its job of hearing requests, but something happened while saying hello to the database.

This second step can further be divided into two steps:

1. The listener process checks whether it is able to communicate with the Oracle process. This check effectively means the OS verifies, at the user level, execute permissions for both the TNS listener and Oracle processes.

Putting it more simply, the OS wants the owner of the listener process to have the necessary access permissions on the Oracle processes.

2. After the permission check, it checks whether the Oracle processes are really running.

In most cases when we hit this error, we find that Oracle is down (or not responding), as the first step generally passes successfully.

In my case, a few days back on a Windows 2008 machine, I found that the database was up, the listener was up, and still this error!

Checking further at the services level (services.msc), it turned out that the Oracle and TNS listener services were not started by the same user (generally Administrator on a Windows server).

After setting the TNS listener service user in services.msc to the same user as the Oracle server process, everything was fine.

So checking that the service owners are the same for both the listener and Oracle can save us from this kind of situation.
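For a quick check without opening services.msc, the log-on accounts of the two services can also be compared from an elevated command prompt. The service names below are only typical defaults (they depend on your SID and Oracle home name), so treat them as placeholders:

sc qc OracleServiceORCL | findstr SERVICE_START_NAME
sc qc OracleOraDb11g_home1TNSListener | findstr SERVICE_START_NAME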

SDU Settings bug in 11.2.0 documentation


Issue:

If you are hitting the issue where, even after doing all the settings as per the Oracle documentation, the SDU size change is still not reflected in the listener trace file, then what else are we missing?

Solution:

Then you might be hitting the situation mentioned in the blog below, the same way I was hitting it for 2 days:

http://oraganism.wordpress.com/2011/09/24/setting-sdu-size-mainly-in-11-2/

The main thing is that we have to set the value in the listener's DESCRIPTION as well, apart from setting it in tnsnames.ora (client side), sqlnet.ora (both client and server side) and listener.ora (server side, in SID_LIST_LISTENER), so that the new SDU value gets picked up.

LISTENER =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (SDU = 32768)
      (ADDRESS = (PROTOCOL = TCP)(HOST = 172.16.89.11)(PORT = 1521))
    )
  )
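For completeness, a sketch of the matching client-side settings; the alias, host and service name here are placeholders and not taken from the system described above:

tnsnames.ora (client side):

MYDB =
  (DESCRIPTION =
    (SDU = 32768)
    (ADDRESS = (PROTOCOL = TCP)(HOST = myhost)(PORT = 1521))
    (CONNECT_DATA = (SERVICE_NAME = mydb))
  )

sqlnet.ora (client and server side):

DEFAULT_SDU_SIZE = 32768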

As part of this testing, the main thing we noticed was that SDU could at most be set to 32768 and not more, even though some places mention that it can be increased to 64K.

Need to Increase the SDU size

If, at the database level, we are seeing "SQL*Net more data to client" among the top 5 timed events in the AWR report, it means we are sending more data than the SDU size in a single fetch call. In that case we need to increase the SDU beyond the default value of 8K, so that we can send more bytes in a single SDU and thereby reduce these wait events.
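Outside of AWR, a quick check of how often this event is being hit instance-wide can be done against the standard v$system_event view (a simple sketch):

select event, total_waits, time_waited
from   v$system_event
where  event = 'SQL*Net more data to client';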

Understanding Oracle traces and plan


Below are a few excellent notes on execution plans and tracing from the expert Charles Hooper:

http://hoopercharles.wordpress.com/2010/01/30/execution-plans-what-is-the-plan-and-where-do-i-find-it/ (3 part series)
http://hoopercharles.wordpress.com/2009/12/01/10046-extended-sql-trace-interpretation/ (2 part series)
http://hoopercharles.wordpress.com/2010/01/11/explain-plan-lies-autotrace-lies-tkprof-lies-what-is-the-plan/

 

Parallelism and DEFAULT value


Recently, in one of the client systems, a performance issue was reported for SQLs which were quite simple. The explain plan revealed that they were going for parallelism, but we had not given any parallel hint in the queries.

So where was it coming from…

A 10053 trace of the query showed that default parallelism was being used. The parallel_query_default_dop parameter in the 10053 trace tells us about this default parallelism. It is an optimizer-defined run-time parameter and not an init.ora parameter.
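For reference, a common way to generate a 10053 trace for the problem statement (a sketch; the trace is only written on a hard parse, so run it in a fresh session or make the statement text unique):

alter session set events '10053 trace name context forever, level 1';
-- parse/run the problem SQL here, e.g. via EXPLAIN PLAN FOR <statement>
alter session set events '10053 trace name context off';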

The DEFAULT value of the DEGREE column in dba_tables for that particular table showed the path. It gets set to DEFAULT when we run ALTER TABLE with the PARALLEL option but without a degree:

alter table XA_1 parallel;

The default Degree of Parallelism (DOP) is determined based on the system configuration. People (at least me!) generally assume that when the DEGREE column has the value DEFAULT it means parallel = 1, that is, serial mode, but that is not the case.

DEFAULT DOP = cpu_count * parallel_threads_per_cpu * cluster_database_instances

In our case cpu_count was 64, with parallel_threads_per_cpu = 2 and cluster_database_instances = 2 (since the system was RAC), so DEFAULT DOP = 64 * 2 * 2 = 256. This exactly matches the parallel_query_default_dop parameter found in the 10053 trace.

Yes! This was the issue: since DEGREE was DEFAULT, it was picking a parallel plan.

By changing the DEGREE to 1, the plans were as expected.

alter table XA_1 parallel 1;

So in cases where plans come out with the PARALLEL option, checking the DEGREE column can be a good starting point (a quick check is sketched below). Generally we never want this to be DEFAULT.
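A simple query for that starting point; DEGREE is stored as a blank-padded string in dba_tables, hence the TRIM:

select owner, table_name, degree
from   dba_tables
where  trim(degree) = 'DEFAULT';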

Below are few good links:

http://gerardnico.com/wiki/database/oracle/dop
https://forums.oracle.com/forums/thread.jspa?threadID=1120327

Getting row cache lock wait information from the AWR repository


There was a case in one of the production environments where the "row cache lock" wait event was the topmost wait event. Below is one of the approaches one can follow to see what the real issue is.

Row Cache Lock

As seen from the AWR below, for the time period 12:00 to 18:00, most of the time was lost on the "row cache lock" wait event.

Snap Id Snap Time Sessions Curs/Sess
--------- ------------------- -------- ---------
Begin Snap: 38729 03-Dec-12 12:00:36 408 114.4
End Snap: 38735 03-Dec-12 18:00:10 282 118.7
Elapsed: 359.57 (mins)
DB Time: 6,839.92 (mins)

Top 5 Timed Events Avg %Total
~~~~~~~~~~~~~~~~~~ wait Call
Event Waits Time (s) (ms) Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
row cache lock 3,980,592 285,082 72 69.5 Concurrenc
CPU time 48,706 11.9
gc buffer busy 1,591,825 13,796 9 3.4 Cluster
db file sequential read 5,521,456 12,893 2 3.1 User I/O
enq: TX - row lock contention 202,525 12,454 61 3.0 Applicatio
-------------------------------------------------------------

Now the question was: what is the major culprit?

Checked the row cache lock wait count for that period, 3 Dec (12:00-18:00), using the SQLs below.

1. To check what the wait events were during the particular time period (based on snap IDs)
It will reconfirm from AWR that the row cache lock wait is at the top.
select
EVENT, count(1)
from
DBA_HIST_ACTIVE_SESS_HISTORY
where
instance_number =1 and
snap_id between 38729 and 38735
group by event order by 2 asc;

2. To check the SQL IDs based on the event
It will give the SQLs that contributed to this wait event.

select distinct sql_id
from
DBA_HIST_ACTIVE_SESS_HISTORY
where
instance_number =1 and
snap_id between 38729 and 38735
and event='row cache lock';

3. To check SQL, program and module details for the SQLs behind the particular wait

select sql_id, PROGRAM, MODULE,count(event)
from
DBA_HIST_ACTIVE_SESS_HISTORY
where
instance_number =2 and
snap_id between 38729 and 38735
and event='row cache lock' and sql_id in ('bbacffquhrm08','b6sddsvvh6cq4')
group by sql_id, PROGRAM, MODULE order by 2 asc;

The row cache lock waits turned out to be related to cache id 13, which corresponds to dc_sequences (confirmed in steps 4 and 5 below).

SQL_ID        PROGRAM  MODULE  COUNT(EVENT)
------------- -------- ------- ------------
b6sdnpwzh6cq4                         10535
bbarfqquhrm08                         11884

4. To check the various parameters of the wait event
For the SQLs (captured above) that contribute the most, this will give the details of the wait.

select distinct sql_id, p1, p1text, p2, p2text, p3, p3text
from
DBA_HIST_ACTIVE_SESS_HISTORY
where
instance_number =2 and
snap_id between 38729 and 38735
and event='row cache lock' and sql_id in (:b1, :b2);

The wait was related to sequences.

SQL_ID        P1  P1TEXT    P2  P2TEXT
------------- --- --------- --- ------
bbacffquhrm08 13  cache id  0   mode
b6sddsvvh6cq4 13  cache id  0   mode

In this environment, queries related to fnd_concurrent_requests were the major contributors to the "row cache lock" wait.

5. Now you can see what object type it refers to.

select CACHE#, PARAMETER from v$rowcache where cache#=13;

CACHE# PARAMETER
------ ------------
    13 dc_sequences

It signifies that the row cache lock waits were related to objects of the SEQUENCE type.

In the same manner you can find the object type for the cache id seen in your case.

6. The query below can be used to see other row cache related information during the problem time period.

select PARAMETER ,b.snap_id, b.instance_number, gets, GETMISSES, SCANS, SCANMISSES, MODIFICATIONS, FLUSHES
from
dba_hist_rowcache_summary a,
dba_hist_snapshot b
where a.SNAP_ID =b.SNAP_ID and a.instance_number=b.instance_number
--and b.instance_number=1
and parameter like 'dc_sequences%' -- change as per your case
and
(
b.END_INTERVAL_TIME between to_timestamp('2012-12-03-12-30-00','YYYY-MM-DD-HH24-MI-SS')
and to_timestamp('2012-12-03-18-30-00','YYYY-MM-DD-HH24-MI-SS')
) order by 2 asc;

Sequence objects – row cache lock details

PARAMETER     SNAP_ID  INSTANCE_NUMBER  GETS     GETMISSES  SCANS  SCANMISSES  MODIFICATIONS  FLUSHES
dc_sequences  38730    2                1063376  418020     0      0           1063376        1063376
dc_sequences  38730    1                4019806  411951     0      0           4019806        4019806

So in my case this gave me the lead that there is some sequence-related issue on RAC. This was already a known fact, but from here I have to dig more :) One possible next step is sketched below.
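As one possible next step (a hedged sketch, not part of the original investigation): on RAC, sequences defined with a small cache or with NOCACHE/ORDER are the usual suspects for dc_sequences row cache lock contention, so listing them is a reasonable place to start (filter by owner as appropriate for your environment):

select sequence_owner, sequence_name, cache_size, order_flag
from   dba_sequences
where  cache_size < 20 or order_flag = 'Y'
order  by cache_size;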