Client-failover for dataguard switchover and failover


Introduction

The following is nothing more than some extension of the oracle metalink document:

[ID 740029.1] Step By Step Guide On How To Configure And Test Client-Failover For Dataguard Switchover And Failover

That is, the article does not only present the cookbook how-to of its anchestor but also examines what happens in the database and what the clients experience will be concerning a so called transparent application failover (TAF). The scenario has been tested on an oracle 10.2.0.3 on windows, please note that the metalink document points to a some more applicable technique available with oracle 11gR2 (Client Failover Best Practices for Highly Available Oracle Databases: Oracle Database 11g Release 2).

Client-failover for dataguard, away from other failover scenarios, essentially aims at using a general tnsnames entry against some server-side database endpoint, no matter what server-side database instance is currently running in a dataguard primary role. This is actually the transparency in a transparent application failover where the application, client here, does not have to care to switch its network configuration in any way (some simple implementation would be, not uncommonly, to provide some tnsnames.ora, tnsnames.ora.prm and tnsnames.ora.stb and rename the files) during a failover or switchover.

The key to the solution is to have an equally named service being prepared to run on any database instance within the dataguard network. However, the service is defined to actually run on only one of these  database instances. This is achieved by firing a database trigger after startup that checks for the database role being primary along in v$database and then starts the service (see below). Do note, that services may also serve other requirements, such as workload management and analysis as well as load balancing. Such a way there is no overhead in having a couple of services being active on a database (you already know or use the xdb service that comes with a standard installation).

Set-up and test

The set-up consists of two database instances plt and plts where the first one is the usual primary. As sys, you at first create a service on plt being the current primary as follows:

--exec dbms_service.stop_service('pltp');
--exec dbms_service.delete_service('pltp');
exec dbms_service.create_service('pltp', 'pltp');

This will just register the service but not yet launch any code or whatsoever. The next step creates the database startup trigger to actually start the service for a primary instance:

--drop trigger manage_dg_service;
create or replace trigger manage_dg_service after startup on database
declare
  v_role varchar(30);
begin
  select database_role into v_role from v$database;
  if (v_role = 'PRIMARY') then
    dbms_service.start_service('pltp');
  end if;
end;
/

With the next planned switchover (or an explicit dbms_service.start_service(...);, depending on whatever 24X7 requirements you have), the service will be started. I verified this at two locations. On one hand I inspected the alert.log for any logging entries and in fact, the regarded service will, in terms of the trigger, be started after database startup but not before alter database open with an ALTER SYSTEM SET in memory only:

...
Wed Apr 13 17:17:16 2011
ALTER SYSTEM SET service_names='pltp' SCOPE=MEMORY SID='plts';
Wed Apr 13 17:17:19 2011
Completed: alter database open
...

The other spot i inspected was v$services with a subselect to include information from v$active_services as well. Its interesting to see, that service on a standby do not even have the service_id column populated:

-- current primary
SQL> column service_name format a20
SQL> column network_name format a20
SQL> set colsep ' | '
SQL> select service_id, name as service_name, network_name,
  2    (select count(*) from v$active_services where name = A.name) as active
  3  from v$services A order by name;

SERVICE_ID | SERVICE_NAME     | NETWORK_NAME     |     ACTIVE
---------- | ---------------- | ---------------- | ----------
         4 | plt              | plt              |          1
         5 | pltp             | pltp             |          1
         3 | pltXDB           | pltXDB           |          1
         1 | SYS$BACKGROUND   |                  |          1
         2 | SYS$USERS        |                  |          1

-- current standby
SQL> column service_name format a20
SQL> column network_name format a20
SQL> set colsep ' | '
SQL> select service_id, name as service_name, network_name,
  2    (select count(*) from v$active_services where name = A.name) as active
  3  from v$services A order by name;

SERVICE_ID | SERVICE_NAME     | NETWORK_NAME     |     ACTIVE
---------- | ---------------- | ---------------- | ----------
         0 | plts             | plts             |          0
         0 | SYS$BACKGROUND   |                  |          1
         0 | SYS$USERS        |                  |          1
         0 |                  | pltsXDB          |          0

Ok, having the regarded service being up and running successfully, we go and configure an according tnsname entry with the tnsnames.ora on the clients. It may look as follows whereby the two database instance host ip’s serve as the failover network endpoints in the ADDRESS_LIST / ADDRESS blocks of the configuration:

dg_plt_prod =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.29.48)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.29.49)(PORT = 1521)))
    (CONNECT_DATA =
      (SERVICE_NAME = pltp)))

Let’s now test what happens on a switchover as seen from some client side. I prepared a database link and a simple procedure on a third database instance (client) to select from v$database of the current primary and print some relevant information, most importantly v$database.db_unique_name to sqlplus:

create database link dg_plt_prod connect to xxx identified by xxx using 'dg_plt_prod';

create or replace procedure sfo_test as
begin
  for cur in (
    select '[' || to_char(sysdate, 'hh24:mi:ss') ||
      '] name: ' || name || ', db_unique_name: ' || db_unique_name || ', database_role: ' || database_role
      as status
    from v$database@dg_plt_prod
  ) loop
    dbms_output.put_line(cur.status);
  end loop;
end;
/

The output of the third database instance (client) sqlplus session reads as follows (the exception messages are in german but you can examine the error codes for inspection):

SQL> set serveroutput on

-- dg environment is running smoothly
SQL> exec sfo_test
[10:50:43] name: PLT, db_unique_name: plt, database_role: PRIMARY
PL/SQL-Prozedur wurde erfolgreich abgeschlossen.

-- last primary is going down, the connection being cancelled
SQL> exec sfo_test
BEGIN sfo_test; END;
*
FEHLER in Zeile 1:
ORA-02068: aufgrund eines schwerwiegenden Fehlers von DG_PLT_PROD
ORA-03113: Unerwartetes _bertragungsende in Kommunikation
ORA-06512: in "XXX.SFO_TEST", Zeile 3
ORA-06512: in Zeile 1

-- last primary is down, last standby not yet up,
--   the regarded service name unknown with sql*net
SQL> exec sfo_test
BEGIN sfo_test; END;
*
FEHLER in Zeile 1:
ORA-12514: TNS: Listener kennt den Service aktuell nicht, der in dem
Connect-Deskriptor angefordert wird
ORA-06512: in "XXX.SFO_TEST", Zeile 3
ORA-06512: in Zeile 1

-- last primary is the new standby, last standby is the new primary,
--   the dg environment is running smoothly again
SQL> exec sfo_test
[10:53:15] name: PLT, db_unique_name: plts, database_role: PRIMARY
PL/SQL-Prozedur wurde erfolgreich abgeschlossen.

So far, anything works as expected. The big shot is that there is no need to reconnect with the third database instance (client) due to the fixed user database link. The client session on the third database instance will care for this automatically (like with a short network outage).

Although this was all the functionality I actually needed for my business case, I mused about the coming out of a directly connected session. Will it failover that transparently as well? Let’s see:

-- still everything ok
select '[' || to_char(sysdate, 'hh24:mi:ss') || '] name: ' || name || ', db_unique_name: ' ||
  db_unique_name || ', database_role: ' || database_role as status
from v$database;

STATUS
--------------------------------------------------------------------
[14:03:27] name: PLT, db_unique_name: plts, database_role: PRIMARY

-- switchover in progress
select '[' || to_char(sysdate, 'hh24:mi:ss') || '] name: ' || name || ', db_unique_name: ' ||
  db_unique_name || ', database_role: ' || database_role as status
from v$database;

*
FEHLER in Zeile 1:
ORA-03113: Unerwartetes _bertragungsende in Kommunikation
Prozess-ID: 0
Session-ID: 128 Seriennummer: 216
ERROR: ORA-03114: Nicht mit ORACLE verbunden

-- switchover being done
select '[' || to_char(sysdate, 'hh24:mi:ss') || '] name: ' || name || ', db_unique_name: ' ||
  db_unique_name || ', database_role: ' || database_role as status
from v$database;

ERROR: ORA-03114: Nicht mit ORACLE verbunden
ERROR: ORA-03114: Nicht mit ORACLE verbunden

-- another explicit reconnect on sqlplus
select '[' || to_char(sysdate, 'hh24:mi:ss') || '] name: ' || name || ', db_unique_name: ' ||
  db_unique_name || ', database_role: ' || database_role as status
from v$database;

STATUS
-----------------------------------------------------------------------------
[14:24:07] name: PLT, db_unique_name: plt, database_role: PRIMARY

Nope, obviously, it won’t. Only the explicit reconnect serves us with the required response again.

But, why not dig deeper into the TAF-functionality of services by setting the failover_*-parameters of dbms_service.create_service(...); or dbms_service.modify_service(...);, respectively. Likewise we may continously have a look into the TAF-columns of v$session, namely failover_type, failover_method and failed_over.

At a first spot the TAF-columns of our directly connected session read like this so far:

select sid, serial#, logon_time, status, failover_type, failover_method, failed_over
  from v$session where program = 'sqlplus.exe';

       SID|   SERIAL#|STATUS  |FAILOVER_TYPE|FAILOVER_M|FAI
----------|----------|--------|-------------|----------|---
       125|        96|ACTIVE  |NONE         |NONE      |NO

A modify against our regarded service (within another session on the server) sets all the above mentioned parameters according to the possibilities of 10gR2 (for a detailed documentation see http://download.oracle.com/docs/cd/B19306_01/appdev.102/b14258/d_serv.htm and http://vnull.pcnet.com.pl/dl/oracle/MAA_WP_10gR2_ClientFailoverBestPractices.pdf). This includes not applying aq_ha_notifications because although fast application notification (FAN) for avoiding long tcp timeouts is available since 10gR2 it is not be integrated into data guard, specifically the broker, before 11gR2 (!):

FAILOVER_METHOD_BASIC: Server side TAF method is BASIC. BASIC is the only value currently supported. This means that a new connection is established at failure time. It is not possible to pre-establish a backup connection. (which is to say, PRECONNECT is not supported).

FAILOVER_TYPE_SESSION: Server side TAF failover type is SESSION. At failure time, if the failover type is SESSION, TAF will re-connect to a surviving node and re-establish a vanilla database session. Customizations (for example, ALTER SESSION) must be re-executed in a failover callback.

That results in a session-level TAF at a sum of 5 minutes by 60 attempts at 5 secs.

begin
  dbms_service.modify_service(
    service_name => 'pltp',
    failover_method => dbms_service.FAILOVER_METHOD_BASIC,
    failover_type => dbms_service.FAILOVER_TYPE_SESSION,
    failover_retries => 60,
    failover_delay => 5);
end;
/

Let’s reconnect the directly connected session and remark the changed failover_type and failover_method values being listed immediately.

select sid, serial#, status, failover_type, failover_method, failed_over
  from v$session where program = 'sqlplus.exe';

       SID|   SERIAL#|STATUS  |FAILOVER_TYPE|FAILOVER_M|FAI
----------|----------|--------|-------------|----------|---
       125|        98|ACTIVE  |SESSION      |BASIC     |NO

ok, now execute another manual switchover in the background and check again what happens:

-- launch the test set-up standard select for the direct session
select '[' || to_char(sysdate, 'hh24:mi:ss') || '] name: ' || name || ', db_unique_name: ' ||
  db_unique_name || ', database_role: ' || database_role as status
from v$database;

STATUS
-----------------------------------------------------------------------------
[15:07:03] name: PLT, db_unique_name: plt, database_role: PRIMARY

-- switchover being underway the following select ***waits*** for completion
--   and does not thow exceptions about network errors like before
--   (however, after the switchover it complains about an impossible
--   replay)
select sid, serial#, status, failover_type, failover_method, failed_over
  from v$session where program = 'sqlplus.exe';
*
FEHLER in Zeile 1:
ORA-25408: Aufruf kann nicht sicher wiederholt werden

-- just try again and see it works (!) and features a failed_over = 'YES'
select sid, serial#, status, failover_type, failover_method, failed_over
  from v$session where program = 'sqlplus.exe';

       SID|   SERIAL#|STATUS  |FAILOVER_TYPE|FAILOVER_M|FAI
----------|----------|--------|-------------|----------|---
       111|         1|ACTIVE  |SESSION      |BASIC     |YES

-- well, try the test select again, works also
select '[' || to_char(sysdate, 'hh24:mi:ss') || '] name: ' || name || ', db_unique_name: ' ||
  db_unique_name || ', database_role: ' || database_role as status
from v$database;

STATUS
------------------------------------------------------------------------------
[15:09:59] name: PLT, db_unique_name: plts, database_role: PRIMARY

Great stuff really. But what about that ORA-25408 error? This is expected behaviour as according to metalink note 473145.1: SQLPLUS does not failover service receiving FAN events. Sqlplus can’t just handle this scenario before 10.2.0.4 and unless being started with the -F option. Martin Bach writes in an article (RAC One Node and Database Protection) that it is more or less the responsibility of the calling environment to handle just this exception (when the database is unable to replay the transactional operations that occur past the last committed transaction). I would go with that, especially concerning the logic to perform a sufficient rollback; before proceeding with transactions in the failed over session. OCI-JDBC, beware of the recent change from TAF to fast connection failover (FCF), in fact provides a couple of methods to get along with (http://download.oracle.com/docs/cd/B19306_01/java.102/b14355/ocitaf.htm or metalink note 276225.1 JDBC Sample on How to Implement OCI TAF callbacks with DML and PL/SQL).

conclusion

So, what’s next? Thinking about having a much smarter solution for the actual business case with a database link client session now? I got to disappoint you, sorry. TAF being based on service and an accordingly set failover_method as well as failover_type will not work over database links as described in metalink note 399453.1: Database Link Usage with Connect Time Failover or Transparent Application Failover:

Database Link Usage with Connect Time Failover or Transparent Application Failover [ID 399453.1]
Modified 06-AUG-2010 Type HOWTO Status PUBLISHED
Applies to:
Oracle Net Services – Version: 9.2.0.1.0 and later [Release: 9.2 and later ]
Information in this document applies to any platform.
Goal
Database link usage with Connect Time Failover (CTF) or Transparent Application Failover (TAF)
Solution
Database links do not currently support either Connection Time Failover or Transparent Application Failover.
This is confirmed by Oracle Development in the Bug 2393357 ORA-25402 USING TAF WITH DB-LINKS. This is closed with code “Status 92,Closed, Not a Bug”
The following enhancement request is outstanding for this issue. Bug 3389688 ORA-25402 WHEN USING TAF WITH DATABASE LINKS

One has to catch all this disconnect and unknown service exceptions. There is no way out. It’s a shame.

There is another valuable collection of metalink notes for TAF on 10g from Julian Dyke at TAF References (where Julian Dyke = ‘Pro Oracle Database 10g RAC on Linux’ and Martin Bach = ‘Pro Oracle Database 11g RAC on Linux’, ok?).

update concerning cold fusion

The last few days I also tested the service-based TAF, as described above, against(adobe) cold fusion. In fact, cold fusion does support TAF but only in the enterprise (or developer) edition and by employing the macromedia driver. I tested cold fusion version 9.x with a 11g jdbc driver and used the following datasource settings (note that you’ll need to select the other type driver class when creating the datasource):

--> JDBC URL
jdbc:macromedia:oracle://192.168.29.48:1521;ServiceName=hugopltp;AlternateServers=(192.168.29.49:1521);ConnectionRetryCount=60;ConnectionRetryDelay=5
--> Driver Class
macromedia.jdbc.MacromediaDriver

To keep it short, it does work with just one poor flaw. Upon a planned switchover the web site in action notices the role change somehow and correctly waits for completion. However, when the following finishes for the former primary:

alter database commit to switchover to physical standby with session shutdown;

and the database was dismounted, the web site throws an exception with this text:

Error Executing Database Query.
[Macromedia][Oracle JDBC Driver][Oracle]ORA-00604: error occurred at recursive SQL level 1 ORA-01219: database not open: queries allowed on fixed tables/views only
 
The error occurred in D:/wwwroot/_test/plt/plt_db_test.cfm: line 25

23 : 	name="qry_selectTest" result="res">
24 : 	
25 : 	#preserveSingleQuotes(form.sqlStatement)#
26 : 	
27 : </cfquery>

VENDORERRORCODE 	  604
SQLSTATE 	  HY008
SQL 	   select '[' || to_char(sysdate, 'hh24:mi:ss') || '] name: ' || name || ', db_unique_name: ' || db_unique_name || ', database_role: ' || database_role as status from v$database
DATASOURCE 	  PLT_DG_TEST

Iff you do a reload on that, the web site waits again until the switchover has been done and redisplays the data in order. So what? Is this cold fusion’s way to tell us about ORA-25408 from above or is it in fact different. Me, i don’t know. An recommended article from from datadirect (http://www.datadirect.com/resources/jdbc/oracle-rac/failover.html), where datadirect seems to be the actual producer of macromedia.jdbc.MacromediaDriver, does not really encourage someone to examine this question further (whereupon the article itself is ok but not the bottom line). That’s it.

Advertisements

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s