Monday, July 30, 2018

Oracle scheduler jobs and client timezone

Two weeks back I was working on an issue for one of our clients, "Chedraui". The implementation team reported that two of the IGP jobs had been running long for the last couple of weeks. Unfortunately these were weekly jobs, so if we missed them this time, the next chance to debug would come only a week later.

It was also reported that the same job had run long earlier as well, and that it went fine after a trigger was disabled.

The environment had also recently been refreshed from production again.

When I started looking into the issue, I observed that there were no active sessions in the database. It was a custom IGP loading job, so there could not be anything on the application server side.

I verified the job logs as well: every run took close to 2 hours, whereas earlier the job used to take close to 10 minutes.

I also checked the statspack report, but nothing in it was useful.

It was really difficult to tell what was going wrong. After going through the complete job code and all related packages and procedures, I started querying the job tables. As these were Oracle Scheduler jobs, I started with the table below.

select * from sys.scheduler$_job;

The table had an entry without the end date populated. That gave a hint that the job was in fact in place but waiting for something.

ex: 241680 BEGIN NWMGR.call_proc; END; 7/30/2018 6:29:29.000000 AM -07:00 7/30/2018 4:28:59.594874 AM -07:00 17372 3 1 0 9.0071992548803E15 Test job to check completion time NWMGR NLS_LANGUAGE='AMERICAN' NLS_TERRITORY='AMERICA' NLS_CURRENCY='$' NLS_ISO_CURRENCY='AMERICA' NLS_NUMERIC_CHARACTERS='.,' NLS_CALENDAR='GREGORIAN' NLS_DATE_FORMAT='DD-MON-RR' NLS_DATE_LANGUAGE='AMERICAN' NLS_SORT='BINARY' NLS_TIME_FORMAT='HH.MI.SSXFF AM' NLS_TIMESTAMP_FORMAT='DD-MON-RR HH.MI.SSXFF AM' NLS_TIME_TZ_FORMAT='HH.MI.SSXFF AM TZR' NLS_TIMESTAMP_TZ_FORMAT='DD-MON-RR HH.MI.SSXFF AM TZR' NLS_DUAL_CURRENCY='$' NLS_COMP='BINARY' NLS_LENGTH_SEMANTICS='CHAR' NLS_NCHAR_CONV_EXCP='FALSE' 0102000000000000 0 PRIMARY 0 1 7/30/2018 6:29:29.000000 AM -07:00 0 0 0 2147483648 0 0 0 0

Then I queried below table.

SELECT * FROM USER_SCHEDULER_RUNNING_JOBS;

This view also shows information about currently running jobs. I missed capturing the details that day, so I do not have that output to share.

So, to check the historical details of these jobs, I started querying the job history views, filtering on keywords taken from the code.

select * from USER_SCHEDULER_JOB_RUN_DETAILS where job_name LIKE '%LOAD_OO6%' ORDER BY ACTUAL_START_DATE DESC;

With the historical data in this view, things became much clearer. My first observation was that each run showed a gap of exactly 2 hours. That could not be just a coincidence.

100900 7/5/2018 2:19:17.210435 PM -05:00 NWMGR LOAD_OO6 SUCCEEDED 0 7/5/2018 12:18:39.000000 PM -07:00 7/5/2018 12:18:39.408172 PM -07:00 +00 00:00:38.000000 1 533,36729 12685 +00 00:00:36.920000
99830 7/4/2018 11:26:46.054728 AM -05:00 NWMGR LOAD_OO6 SUCCEEDED 0 7/4/2018 9:26:25.000000 AM -07:00 7/4/2018 9:26:25.747494 AM -07:00 +00 00:00:20.000000 1 289,12857 19307 +00 00:00:20.090000

In the records above you can see that there is close to a 2-hour difference between the start time and the log date, while the jobs themselves ran for only 38 seconds and 20 seconds respectively.

The most important piece of information you can collect from this is the time zone. You can see that the start date carries a -07:00 offset while the log date carries -05:00. So it was pretty clear by now that it was not the database taking time; the job was waiting before it started executing the code.

So my suspicion was that the client machine and the DB server were configured with different time zones.

But I did not have access to both servers, and there was nobody available to help find out the current time zone configuration on the client and DB servers.
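Had it been possible to open a SQL session from the client machine, a quick comparison like the one below would have exposed the mismatch directly (a minimal sketch: SESSIONTIMEZONE is inherited from the client environment, while DBTIMEZONE and SYSTIMESTAMP belong to the database server itself):

```sql
-- Compare the client session's timezone with the database's.
-- SYSTIMESTAMP uses the server OS clock and offset;
-- CURRENT_TIMESTAMP is expressed in the session's timezone.
SELECT DBTIMEZONE,
       SESSIONTIMEZONE,
       SYSTIMESTAMP,
       CURRENT_TIMESTAMP
  FROM DUAL;
```

If the offsets in SYSTIMESTAMP and CURRENT_TIMESTAMP disagree, the client and server are not aligned.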

So I started writing code to test the theory.

Sample code:

CREATE TABLE job_test
(
   id      NUMBER,
   descr   VARCHAR2 (200)
);

CREATE OR REPLACE PROCEDURE test_proc
AS
BEGIN
   INSERT INTO job_test
        VALUES (1, 'Job completed at ' || SYSTIMESTAMP);
END;
/

show errors;


CREATE OR REPLACE PROCEDURE call_proc
AS
   v_stat   VARCHAR2 (200);
BEGIN
   v_stat := 'BEGIN TEST_PROC; COMMIT; END;';

   EXECUTE IMMEDIATE v_stat;
END;
/

show errors;

CREATE OR REPLACE PROCEDURE run_proc
AS
BEGIN
   DBMS_SCHEDULER.create_job (
      job_name          => 'test_job',
      job_type          => 'PLSQL_BLOCK',
      job_action        => 'BEGIN NWMGR.call_proc; END;',
      start_date        => SYSTIMESTAMP + (0.5 / 1440),
      repeat_interval   => NULL,
      end_date          => NULL,
      ENABLED           => TRUE,
      comments          => 'Test job to check completion time');
END;
/

show errors;

begin
run_proc;
end;
/

alter session set time_zone='-07:00';

begin
run_proc;
end;
/

In the first test the job executed immediately and the record was inserted into the job_test table. But when I changed the time zone and re-submitted the job, I noticed that the job was waiting. It completed after two hours.

Now I had proof to take up with the implementation team. I advised them to check the time zone on the application server, and they realized it had been changed recently as part of a server maintenance activity.

After changing the time zone back, the job started completing within a few seconds.
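Related to this, the Scheduler keeps its own default_timezone attribute, which is worth checking in situations like this. Below is a sketch using the documented DBMS_SCHEDULER attribute API; the region name is only an example, pick whatever matches your servers:

```sql
-- Check the Scheduler's default timezone
SET SERVEROUTPUT ON
DECLARE
   v_tz   VARCHAR2 (100);
BEGIN
   DBMS_SCHEDULER.get_scheduler_attribute ('default_timezone', v_tz);
   DBMS_OUTPUT.put_line ('Scheduler default timezone: ' || v_tz);
END;
/

-- Prefer a named region over a fixed numeric offset
-- so daylight saving transitions are handled correctly
BEGIN
   DBMS_SCHEDULER.set_scheduler_attribute ('default_timezone',
                                           'America/Mexico_City');
END;
/
```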

Party time!!!!
 

resmgr:cpu quantum and oracle 12.1.0.2 standard edition

After upgrading the Oracle database to 12.1.*, we have been facing a lot of issues, especially when the database is Standard Edition.

There are multiple reasons for that:
1. Oracle SE2 allows a maximum of 16 CPU threads at any time.
2. With earlier versions of Oracle, and with Enterprise Edition, we could disable Resource Manager, but we cannot with Oracle SE2. Our application sessions get stuck on resmgr:cpu quantum for long stretches, and that creates a lot of performance issues.

To mitigate the performance issues related to Resource Manager, we keep it disabled. But with Oracle SE2, even if we set the resource manager plan to null and disable all default maintenance windows, we still see the same wait event.
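For reference, this is roughly what "disabling" Resource Manager looks like on Enterprise Edition (a sketch; the FORCE: prefix pins the empty plan so maintenance windows cannot switch a plan back on, and the window attribute must be cleared for each maintenance window):

```sql
-- Clear the active plan and prevent windows from re-activating one
ALTER SYSTEM SET resource_manager_plan = 'FORCE:' SCOPE=BOTH;

-- Also clear the plan associated with each maintenance window,
-- e.g. for the Monday window:
BEGIN
   DBMS_SCHEDULER.set_attribute (
      name        => 'SYS.MONDAY_WINDOW',
      attribute   => 'RESOURCE_PLAN',
      value       => '');
END;
/
```

On SE2, however, an internal plan that cages CPU usage remains in force, which is why the wait event persists even after these steps.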

This becomes even more critical when even a single application query gets stuck on Resource Manager and, when you check in the background, CPU utilization on the DB server is very low.

Ex: 
Cpu(s):  1.9%us,  0.6%sy,  0.0%ni, 97.0%id,  0.5%wa,  0.0%hi,  0.0%si,  0.0%st

Above is the CPU utilization on the database server while the query was stuck for almost 50 minutes.
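To spot the affected sessions from inside the database, a query along these lines against v$session works (a sketch; columns can be extended as needed):

```sql
-- Sessions currently throttled by Resource Manager
SELECT sid,
       serial#,
       username,
       sql_id,
       event,
       seconds_in_wait
  FROM v$session
 WHERE event = 'resmgr:cpu quantum';
```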

On further analysis I found that when even a single core (a single thread) reaches ~100% CPU, that thread gets caged by the instance.

Ex: 
top - 03:38:16 up 21 days,  2:36,  1 user,  load average: 2.48, 4.06, 4.38
Tasks: 481 total,   2 running, 479 sleeping,   0 stopped,   0 zombie
Cpu(s):  9.6%us,  0.3%sy,  0.0%ni, 90.0%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  74252448k total, 52846708k used, 21405740k free,  1178888k buffers
Swap: 58720212k total,    24576k used, 58695636k free, 45195604k cached

The server is 90% idle, yet the query had still been waiting on CPU for the last 285 seconds, and the query plan itself was fine:
Plan hash value: 2943048428

------------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name                | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                     |       |       |   393 (100)|          |
|   1 |  SORT AGGREGATE                      |                     |     1 |   450 |            |          |
|   2 |   NESTED LOOPS                       |                     |       |       |            |          |
|   3 |    NESTED LOOPS                      |                     |     1 |   450 |   393   (1)| 00:00:01 |
|   4 |     VIEW                             |                     |     1 |   408 |   390   (1)| 00:00:01 |
|   5 |      HASH UNIQUE                     |                     |     1 |    75 |   390   (1)| 00:00:01 |
|*  6 |       FILTER                         |                     |       |       |            |          |
|   7 |        NESTED LOOPS                  |                     |       |       |            |          |
|   8 |         NESTED LOOPS                 |                     |     1 |    75 |   389   (0)| 00:00:01 |
|*  9 |          HASH JOIN                   |                     |     1 |    61 |   388   (0)| 00:00:01 |
|  10 |           TABLE ACCESS BY INDEX ROWID| LOC                 |   450 |  8550 |    16   (0)| 00:00:01 |
|* 11 |            INDEX RANGE SCAN          | LOC_STORESTATUS_IDX |   450 |       |     1   (0)| 00:00:01 |
|  12 |           TABLE ACCESS BY INDEX ROWID| SOURCING            |   374 | 15708 |   372   (0)| 00:00:01 |
|* 13 |            INDEX RANGE SCAN          | XIE3SOURCING        |   374 |       |     5   (0)| 00:00:01 |
|* 14 |          INDEX UNIQUE SCAN           | ITEM_PK             |     1 |       |     0   (0)|

On further drill-down, I identified that one of the cores was running at 100% CPU.

Tasks: 494 total,   3 running, 491 sleeping,   0 stopped,   0 zombie
Cpu0  :  2.3%us,  0.3%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 99.7%us,  0.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  1.7%us,  0.0%sy,  0.0%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.3%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.7%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.7%us,  0.3%sy,  0.0%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.7%us,  0.3%sy,  0.0%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  1.0%us,  0.3%sy,  0.0%ni, 98.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  :  0.7%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  :  1.3%us,  0.0%sy,  0.0%ni, 98.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 :  0.3%us,  0.3%sy,  0.0%ni, 99.0%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

So any thread that tries to consume 100% CPU will get caged by Resource Manager.

So the only option to work around this issue is to run smaller queries that do not consume as much CPU. Otherwise you will be doomed.

Note: In another environment where we hit the same issue and a process was taking 4-5 hours to complete, increasing the CPU cores from 12 to 16 and enabling huge pages helped the process complete in 40 minutes.

Useful links: https://blog.dbi-services.com/oracle-standard-edition-two/

Tuesday, July 17, 2018

Oracle Weblogic : Tuning JDBC Data Source parameters

In this post we will cover some of the parameters and factors that affect JDBC performance in WebLogic Server. But first I would like to show how JDBC connections are made, and for that we need to understand two terms: data sources and connection pools.
Data sources:
– Are administered factory objects that provide JDBC connections.
– Are bound into the Java Naming and Directory Interface (JNDI) and configured using the Administration Console.
– Make the application code portable across databases.
Creating a connection to the database is an expensive operation, which is where connection pools come in.

Connection pools:
– Remove the overhead of establishing connections.
– Improve server performance by sharing database connections among multiple users accessing the web application.
Simple ways to boost JDBC application performance and avoid wasting resources:
1. JNDI lookups are relatively expensive, so caching an object that requires a lookup in client or application code avoids incurring the performance cost repeatedly.
2. When client or application code has a connection, maximize the reuse of that connection rather than closing it and reacquiring a new one. Although acquiring and returning an existing connection is much less expensive than creating a new one, excessive acquisitions and returns create contention in the connection pool and degrade application performance.
3. Do not hold connections any longer than is necessary to complete the work. Getting a connection once, completing all necessary work, and returning it as soon as possible provides the best balance for overall performance.

Parameters that affect JDBC data source performance (all of which can be changed from the WebLogic console):
1. Connection Pool Capacity (most important):
– Connection creation is expensive.
– For applications that consistently involve heavy database traffic, determine the optimal Maximum Capacity of a data source experimentally, and set Initial Capacity and Maximum Capacity to the same value.
– For applications where peak database load is intermittent, use different values for the initial and maximum sizes, and tune Capacity Increment and Shrink Frequency on the basis of load changes.
In my environment I have kept the minimum at 20 and the maximum at 100. The minimum can't be set too high, as idle connections carry extra overhead.
2. Connection testing:
– WebLogic Server can test a connection from the connection pool before giving it to a client.
– The Test Connection On Reserve parameter enables automatic testing of database connections.
– Connections can also be tested periodically for validity by using the Test Frequency parameter.
– Both of these parameters can degrade performance, so avoid them in production instances.

3. Shrink Frequency:
– WebLogic Server periodically shrinks the connection pool to its initial capacity based on usage.
– The Shrink Frequency parameter specifies the number of seconds to wait before shrinking the connection pool.
– When set to 0, shrinking is disabled. This can be helpful in a production environment.

4. Configuring row prefetch:
– Row prefetching improves performance by fetching multiple rows from the server to the client in one server access.
– The optimal prefetch size depends on the particulars of the query.
– In general, increasing this number increases performance, until a particular value is reached. Very rarely will increased performance result from exceeding 100 rows. The maximum value is 65536 and the minimum value is 2.
Note: This is applicable only to external clients, not to clients in the same Java Virtual Machine (JVM) as WebLogic Server.
5. Statement caching:
The three types of statements in JDBC are:
– Statements (not cached)
– Prepared statements
– Callable statements
Prepared and callable statements are cacheable and improve overall performance through reuse.
The Statement Cache Type determines how the cache is refreshed:
– LRU: replaces the least recently used statement with the new one.
– FIXED: no replacement is done.
6. Statement Cache Size: The Statement Cache Size attribute determines the total number of prepared and callable statements to cache for each connection in each instance of the data source. Caching statements can increase system performance. However, you must consider how your DBMS handles open prepared and callable statements: in many cases the DBMS maintains a cursor for each open statement, and this applies to the prepared and callable statements in the statement cache. If you cache too many statements, you may exceed the limit of open cursors on your database server. Setting the statement cache size to 0 turns off statement caching.
– Minimum value: 0
– Maximum value: 1024

7. Connection Pinned to Thread:
– A data source can dedicate, or "pin", a connection to the first server thread that requests it.
– This may increase performance by eliminating potential contention for connections between threads.
– It is not supported with multi data sources or Oracle Real Application Clusters (RAC).

8. Inactive Connection Timeout:
– The number of seconds a reserved connection may sit inactive before WebLogic Server reclaims it and releases it back into the connection pool.
– You can use this feature to reclaim leaked connections, i.e. connections that were not explicitly closed by the application. Note that it is not intended as a substitute for properly closing connections.
– When set to 0, the feature is disabled.
Things to consider on the Oracle database instance side:
Number of processes:
– Includes simultaneous users and Oracle background processes.
– The default setting is usually too low.
Shared pool size:
– Contains data and control information for each instance.
– Is critical to performance.
– Even moderate use of stored procedures and triggers may require an increase.
Maximum open cursors:
– Should be increased if you are using JDBC statement caching.
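The instance-side settings above map to the processes and open_cursors initialization parameters; below is a sketch of checking and raising them (the values shown are examples only, size them for your workload):

```sql
-- Current values
SELECT name, value
  FROM v$parameter
 WHERE name IN ('processes', 'open_cursors');

-- processes is static: change it in the spfile and restart the instance
ALTER SYSTEM SET processes = 500 SCOPE=SPFILE;

-- open_cursors can be changed online
ALTER SYSTEM SET open_cursors = 1000 SCOPE=BOTH;
```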
There are several other factors that affect JDBC performance, which I will try to cover later; the ones discussed above are the primary and most important. Performance issues differ from environment to environment and may be due to other reasons or factors not mentioned above, but I hope understanding these parameters will certainly help.