Identifying excessive load causes with oracle enterprise manager 11g


There are times when people ask me to identify the root causes of excessive Oracle Database (and host) loads in the past, still being astonished how i do manage to get this information out of nothing. Well, really, there is no out of nothing, never. There may be a complex scenario and a couple of tools at hand which span another level of complexity, unfortunately. That’s all of it. You see, complexity is a bad thing, something you will want to avoid or eliminate at all times. Some people furthermore aim to prove their skills in talking about what degree of complexity they are able to handle… hhm, poor ones, I don’t envy them, always struggling to tighten a plethora of strings, over and over, I’m more a convention-over-configuration guy, but that’s another discussion.

Back to to the point, Oracle Database is a vast stack of technology, indeed inducing a fair level of complexity in analysis and so is Oracle Enterprise Manager, the number one monitoring tool for the database and further (so called) targets like hosts and jvms, in operation. In fact, Oracle Enterprise Manager, being based on data collection in the database (AWR – Automatic Workload Repository), data evaluation in the database (ADDM – Automatic Database Diagnostic Monitor) and data transfer to some production instance (Enterprise Manager Agent) constitutes another mountain top in the Oracle landscape not everyone is able or willing to climb. So iff you misplaced your hiking boots for today, the following is for you.

Ok then, log yourself in, using your EM credentials.

In the targets/databases view, do search for the instance in question and give it a click.

Then change to the perfomance tab, another login will come up. Don’t use a sysdba, it is not necessary just for reading data.


On the perfomance tab, scroll down to the related links and select advisor central, which will get you to an overview of all the available advisory process executions for the different advisor types. However, this time, we just want to run ADDM for a dedicated time period, so we select ADDM on the top left.


Running ADDM for a dedicated time period may require you to scroll to the day in question using the timeline control at first. You should see the reported load peak in the chart. You will furthermore have to select the start- and end-snapshots for the run in order to narrow down to the data actually important.

The resulting report gives a couple of first indications about the cause of the problem. Here, we obviously have some grumpy sql around and a huge overload in cpu usage. In addition, the report also points out what resources did not pose any issues in the timespan. That is, one may already conclude, that the cpu overload has been solely caused by the sql engine processing data, not i/o-ing data or parsing or concurrency or network or whatever.

Peeking in to the cpu usage finding confirms our first suspicion concerning the sql engine processing data but also emphasises the fact, that cpu overload in particular will show up as extraordinary wait times in charts and reports, exactly fitting the picture we saw before.

The top sql statements finding again names the same sql id we know from other views with an predominant impact, so let’s have a look what is going on for this code execution by clicking the sql id into a new tab.

Coming up now, we have the statistics tab for this specific sql id (v$sql you know) with the sql text according to a plan hash from the sql area. You again may have to select a snapshot on the timeline control (may have to be scrolled either) to inspect the dedicated time period. As you may already have noticed, there is no longer any AWR data available for the time in question because the instance has been bounced to get off the peak load … But, it won’t matter, there still is more than enough quirksmode to discover anyway.

You may, for example, take note of the elapsed time nearly equaling the cpu time, knowing from earlier reports, that the shown wait time is a side effect of excessive cpu load. We also again have this sql time near the elapsed time/cpu time and a regardable of pl/sql time. Remember, that any switch between sql and pl/sql is a costly operation because both languages use a slightly different runtime environment. Finally, we need to take into account that the statement fetches all rows from one or more underlying tables for 100 % of all executions. That might be inevitable for some environments, data warehouses for example, but is unlikely for that high number of executions.

The given sql exution plan on the plan tab, while still not original to the problem time, gives first directions what to go for in sql optimization. A hash unique is very costly, comparing all column values for the complete result set (distinct) and overloading cpu as well as memory/disk temp space. Really necessary? hash join outer and nested loops outer… maybe in-resultset-windowing analytic functions usually help here. That is to say, they have been introduced to overcome these full outer joinings even self-joinings, but a lot of people don’t even know they exist. union-all will in turn cram the cpu and the temp space and merge join cartesian, given that there is no lost join condition, points to cases where the optimizer just helps itself to get around the so called over-indexing.

Have fun (anyway), Peter.

Advertisements

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s