Checking the alert log – the easy way

Do you check the alert log of your databases every day? In the morning when you get in? But what about the alerts which happen during the day? How do you spot them, especially if you don't have Grid Control or Cloud Control configured? Even if you do have a full monitoring solution, this can be useful for a belt-and-braces approach.

Here's a short bash shell script which uses adrci to read through each ADR home under the DIAG location and check every alert log contained therein, using adrci's pattern-matching functionality to search for problems. I usually schedule it within each host (using cron) to minimise the moving parts, and therefore minimise the opportunity for it to stop working. Any problems, and I get an email. I hope you find it useful. I usually keep it in /opt/oracle/bin, but you can stick it in your script home of choice.

This should work for 11G and 12C databases (tested to 12.1.0.2), unless I've made a cut/paste error :-)

#!/bin/bash
#########################################################################################
# Description: Read each Oracle Home directory. Run adrci matching for problems
# Author : N Chandler.2014-03-28
#
# crontab : # Check Alert Log 30.03.2014
# 00,30 * * * * /opt/oracle/bin/adrci_alert.sh > /opt/oracle/bin/log/adrci.cron.log 2>&1
#
#########################################################################################
# Which HOME?
 export ORACLE_HOME=/opt/app/oracle/product/11g
 export DIAG_LOC=/opt/app/oracle/diag/rdbms
# Who gets the alert?
 export RECIPIENT='neil@chandler.uk.com'
# Other Variables
 export LD_LIBRARY_PATH=$ORACLE_HOME/lib
 export HOST=`hostname`
 export PATH=$ORACLE_HOME/bin:$PATH
 export NLS_DATE_FORMAT='yyyy-mm-dd hh24:mi:ss'
 export SUBJECT="Oracle ALERTS on ${HOST} OK"
 export LOG=/tmp
 export ALERT=$LOG/error.txt

# Write the alert log message header for the email
 echo "${HOST} `date +%Y-%m-%d.%H:%M:%S-%Z`" > ${ALERT}
 echo "All alerts in ADRCI Alert log for the last 30 minutes" >> ${ALERT}
 echo "THIS ALERT WILL NOT BE REPEATED!!! TAKE ACTION NOW!!!" >> ${ALERT}
 echo "Follow-up on this email and check the alert log on ${HOST}" >> ${ALERT}

# find out the homes
 adrci_homes=( $(adrci exec="show homes" | grep -e rdbms -e asm))

# run through Each home found and examine the alert log
# Here we are looking for ORA- messages, Deadlock, anything which raises an incident or anything which is instance-level
# IN THE LAST 30 MINUTES (1/48), so we need to run this code every 30 minutes or we may miss something. 
 for adrci_home in ${adrci_homes[@]}
 do
   echo "Checking: ${adrci_home}" >> ${ALERT}
   echo $adrci_home' Alert Log' >> ${ALERT}
   adrci exec="set home ${adrci_home} ; show alert -p \\\"(message_text like '%ORA-%' or message_text like '%Deadlock%' or message_text like '%instance%' or message_text like '%incident%') and originating_timestamp>=systimestamp-(1/48) \\\"" -term >>${ALERT}
 done
# count the errors. This is a good place to exclude specific errors you wish to ignore with a -v match.
# note - your grep must be aligned with the pattern match above for this to work
num_errors=`grep -e 'TNS' -e 'ORA' -e 'Deadlock' -e 'instance' -e 'incident' ${ALERT} | grep -vc 'ORA-28'`

# If there are any errors, lets email the alert information to someone
if [ $num_errors -gt 0 ]
then
  SUBJECT="ERROR in Oracle ALERT log on ${HOST}"
  mail -s "${SUBJECT}" ${RECIPIENT} < ${ALERT}
fi
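
Once the script is in place, it's worth a quick manual run before handing it over to cron, just to confirm adrci and mail behave as expected on that host. A minimal check, using the paths from the variables above:

# make it executable, run it once, then look at what would have been emailed
chmod u+x /opt/oracle/bin/adrci_alert.sh
/opt/oracle/bin/adrci_alert.sh
cat /tmp/error.txt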

Adding a DEFAULT column in 12C

I was at a talk recently, and there was an update by Jason Arneil about adding columns to tables with DEFAULT values in Oracle 12C. The NOT NULL restriction has been lifted and now Oracle cleverly intercepts the null value and replaces it with the DEFAULT meta-data without storing it in the table. To repeat the 11G experiment I ran recently:
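
The table setup from that earlier post isn't repeated here, but it follows the same pattern as the scripts further down. Roughly, as a reconstruction rather than the original script:

SQL> create table ncha.tab1 (pk number, c2 timestamp, filler char(1000)) pctfree 1;
SQL> alter table ncha.tab1 add constraint tab1_pk primary key (pk);
SQL> insert into ncha.tab1 (pk, c2, filler) select rownum, sysdate, 'A' from dual connect by level <= 10000;
SQL> commit;
SQL> exec dbms_stats.gather_table_stats('NCHA','TAB1',null,100);

With that in place, the column is added with a DEFAULT and NOT NULL: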

 

SQL> alter table ncha.tab1 add (filler_default char(1000) default 'EXPAND' not null);
Table altered.

SQL> select table_name,num_rows,blocks,avg_space,avg_row_len 
      from user_tables where table_name = 'TAB1';
TABLE_NAME NUM_ROWS       BLOCKS  AVG_SPACE AVG_ROW_LEN
---------- ---------- ---------- ---------- -----------
TAB1            10000       1504          0        2017


In both releases we then issue:
SQL> alter table ncha.tab1 modify (filler_default null);
Table altered.


IN 11G
SQL> select table_name,num_rows,blocks,avg_space,avg_row_len
      from user_tables where table_name = 'TAB1';

TABLE_NAME NUM_ROWS       BLOCKS  AVG_SPACE AVG_ROW_LEN
---------- ---------- ---------- ---------- -----------
TAB1            10000       3394          0        2017

BUT IN 12C
SQL> select table_name,num_rows,blocks,avg_space,avg_row_len
      from user_tables where table_name = 'TAB1';
TABLE_NAME NUM_ROWS       BLOCKS  AVG_SPACE AVG_ROW_LEN
---------- ---------- ---------- ---------- -----------
TAB1            10000       1504          0        2017

So, as we can see, making the column NULLABLE in 12C didn't cause it to go through and update every row in the way it must in 11G. It's still a chained-row update accident waiting to happen, but it's a more flexible accident :-)

However, I think it’s worth pointing out that you only get “free data storage” when you add the column. When inserting a record, simply having a column with a DEFAULT value means that the DEFAULT gets physically stored with the record if it is not specified. The meta-data effect is ONLY for subsequently added columns with DEFAULT values.

SQL> create table ncha.tab1 (pk number, c2 timestamp, filler char(1000), filler2 char(1000) DEFAULT 'FILLER2' NOT NULL) pctfree 1;
Table created.

SQL> alter table ncha.tab1 add constraint tab1_pk primary key (pk);
Table altered.

Insert 10,000 rows into the table, but not into FILLER2 with the DEFAULT
SQL> insert into ncha.tab1 (pk, c2, filler) select rownum id, sysdate, 'A' from dual connect by level <= 10000;
commit;
Commit complete.

Gather some stats and have a look after loading the table. Check for chained rows at the same time.
SQL> exec dbms_stats.gather_table_stats('NCHA','TAB1',null,100);
PL/SQL procedure successfully completed.

SQL> select table_name,num_rows,blocks,avg_space,avg_row_len
     from user_tables where table_name = 'TAB1';

TABLE_NAME   NUM_ROWS	  BLOCKS  AVG_SPACE AVG_ROW_LEN
---------- ---------- ---------- ---------- -----------
TAB1		10000	    3394	  0	   2017

For a bit of fun, I thought I would see just how weird the stats might look if I played around with adding defaults.

SQL> drop table ncha.tab1;
Table dropped.

SQL> create table ncha.tab1 (pk number) pctfree 1;
Table created.

SQL> alter table ncha.tab1 add constraint tab1_pk primary key (pk);
Table altered.

Insert 10,000 rows into the table

SQL> insert into ncha.tab1 (pk) select rownum id from dual connect by level <= 10000;
commit;
Commit complete.

Gather some stats and have a look after loading the table. Check for chained rows at the same time.
SQL> exec dbms_stats.gather_table_stats('NCHA','TAB1',null,100);

PL/SQL procedure successfully completed.

SQL> select table_name,num_rows,blocks,avg_space,avg_row_len
  2    from user_tables
  3   where table_name = 'TAB1';

TABLE_NAME   NUM_ROWS	  BLOCKS  AVG_SPACE AVG_ROW_LEN
---------- ---------- ---------- ---------- -----------
TAB1		10000	      20	  0	      4

Now let's add a lot of defaults:
SQL> alter table ncha.tab1 add (filler_1 char(2000) default 'F1' not null, filler_2 char(2000) default 'F2' null, filler_3 char(2000) default 'F3', filler_4 char(2000) default 'how big?' null );
Table altered.

Gather some stats and have a look after adding the column. Check for chained rows at the same time.
SQL> exec dbms_stats.gather_table_stats('NCHA','TAB1',null,100);

PL/SQL procedure successfully completed.

SQL> select table_name,num_rows,blocks,avg_space,avg_row_len
  2    from user_tables
  3   where table_name = 'TAB1';

TABLE_NAME   NUM_ROWS	  BLOCKS  AVG_SPACE AVG_ROW_LEN
---------- ---------- ---------- ---------- -----------
TAB1		10000	      20	  0	   8008

10,000 rows with an AVG_ROW_LEN of 8008, all in 20 blocks. Magic!

Just to finish off, let's update each DEFAULT column so the table expands…

SQL> select filler_1, filler_2, filler_3, filler_4,count(*) from ncha.tab1 group by filler_1,filler_2,filler_3,filler_4;

FILLER_1   FILLER_2   FILLER_3	 FILLER_4     COUNT(*)
---------- ---------- ---------- ---------- ----------
F1	   F2	      F3	 how big?	 10000

So it's all there. The metadata is intercepting the nulls and converting them to the default on the fly, rather than storing them in the blocks.
So what happens if we actually UPDATE the table?

SQL> update ncha.tab1 set filler_1 = 'EXPAND', filler_2 = 'EXPAND', filler_3='EXPAND', filler_4='THIS BIG!';
10000 rows updated.

SQL> select filler_1, filler_2, filler_3, filler_4,count(*) from ncha.tab1 group by filler_1,filler_2,filler_3,filler_4;

FILLER_1   FILLER_2   FILLER_3	 FILLER_4     COUNT(*)
---------- ---------- ---------- ---------- ----------
EXPAND	   EXPAND     EXPAND	 THIS BIG!	 10000

Gather some stats and have a look after the update, checking for chained rows at the same time.
SQL> exec dbms_stats.gather_table_stats('NCHA','TAB1',null,100);

PL/SQL procedure successfully completed.

SQL> select table_name,num_rows,blocks,avg_space,avg_row_len
     from user_tables where table_name = 'TAB1';

TABLE_NAME   NUM_ROWS	  BLOCKS  AVG_SPACE AVG_ROW_LEN
---------- ---------- ---------- ---------- -----------
TAB1		10000	   19277	  0	   8010
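
The chained-row check which follows relies on the CHAINED_ROWS table already existing in the schema. If it isn't there, the standard utlchain.sql script creates it; it's shown here as a reminder rather than as part of the original run:

SQL> @?/rdbms/admin/utlchain.sql
Table created.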

SQL> 
SQL> analyze table tab1 list chained rows into chained_rows;

Table analyzed.

SQL> select count(*) CHAINED_ROWS from chained_rows;

CHAINED_ROWS
------------
       10000

Yep. That’s bigger.

Club Oracle in London – 12th November 2014

Just a timely note to publicise the e-DBA and Jonathan Lewis #cluboracle event in London on 12th November 2014.

I'm a great fan of community events, as I hope my blog, tweets, presentations and support of the UKOUG (chairing many SIGs over the last few years) will attest. It's the second-best way to learn anything: talking with your peers who have already worked it out, and with those who have the same problem to discuss. [ NOTE: The best way to learn is to be sat with the manuals, MOS and Google at 2am, fixing something very broken. It's not the nicest way to learn but you WILL learn :-) ]

This event will see several great speakers feeding us new information about the new 12c In-Memory option and 12c enhancements, as well as some interesting take-aways from OOW14, with e-DBA feeding us pizza and beer (optional) too.

There will also be a 15 minute Q&A at the end, so register, come armed with a really challenging question and enjoy the community. Also, it’s free! What more could you possibly want?

Please say Hi! to me too – I’ll be there, digesting all of the above.

When to use the NOLOCK hint in SQL Server

I frequently hear of, and see, developers and DBAs using the NOLOCK hint within SQL Server to bypass the locking mechanism and return their data sets as soon as possible. There are times when this is OK, such as when you are running an ad hoc query and are only interested in approximate results. It is somewhat less OK to write this hint into application code and reports, unless you don't actually care whether the data returned is accurate.

The big problem with NOLOCK is that the effects of using it are not fully understood by many of the coders who are using it. The common perception is that you're simply reading uncommitted data, and the odd roll-back isn't too much to worry about. If that were the full extent of the problem, the developer would be fairly right: we tend not to roll back too often, so don't worry about it. However, there are more insidious side effects which are not generally understood, caused by how the underlying database actually works.

To try and explain the true nature of issuing READ UNCOMMITTED selects, via NOLOCK, I have created an example so you can see the issue at work.

Here are two scripts. SCRIPT1 creates a table, puts some static data into it, then starts inserting lots of data. Each row is padded for realism, to get a few rows per block. The Primary Key is a UNIQUEIDENTIFIER, so we should expect the keys to be spread across the index and subsequent inserts to land in the same blocks as our initial data. This should generate some block splits, something that happens a lot in SQL Server.
SCRIPT1:

IF OBJECT_ID('dbo.test_table') IS NOT NULL
DROP TABLE dbo.test_table;
-- create a table and pad it out so we only get a few rows per block
CREATE TABLE dbo.test_table
(
 pk UNIQUEIDENTIFIER DEFAULT ( NEWID() ) NOT NULL
,search_col VARCHAR(10)
,count_col  INT
,padding CHAR(100) DEFAULT ( 'pad' )
);
alter TABLE dbo.test_table add constraint test_table_pk primary key clustered (pk);
DECLARE @LOOP1 INT
SET @LOOP1=0
WHILE (@LOOP1 < 100)
BEGIN
 SET @LOOP1=@LOOP1+1
 INSERT INTO dbo.test_table ( search_col, count_col ) VALUES('THIS_ONE',1),('THIS_ONE',1),('THIS_ONE',1),('THIS_ONE',1),('THIS_ONE',1),('THIS_ONE',1),('THIS_ONE',1),('THIS_ONE',1),('THIS_ONE',1),('THIS_ONE',1);
END;
select getdate(),sum(count_col) from dbo.test_table (NOLOCK) where search_col = 'THIS_ONE';
set nocount on
-- insert 100,000 rows, which should cause some lovely block splits as the PK will look to insert into the same block as the data we already have in there
-- we need to run the select in another window at the same time
DECLARE @LOOP INT
SET @LOOP=0
WHILE (@LOOP < 100000)
BEGIN
 SET @LOOP=@LOOP+1
 INSERT INTO dbo.test_table ( search_col, count_col ) VALUES ( CAST( RAND() * 1000000 AS CHAR) , 100000 )
END
select getdate(),sum(count_col) from dbo.test_table (NOLOCK) where search_col = 'THIS_ONE';

Output from SCRIPT1 – note that the 2 selects, before and after inserts, give the same output.

----------------------- -----------
2014-10-12 23:51:34.210 1000
---------------------- -----------
2014-10-12 23:51:53.490 1000

Whilst SCRIPT1 is running, run SCRIPT2 in another window against the same database. It just repeats the same SELECT with (NOLOCK) over and over again. The WHERE clause doesn't change, and the correct result set should never change… but due to the block splits we see it change. A lot. As the data from a block split is duplicated into the newly split block before cleanup of the old block, the NOLOCK select, performing a READ UNCOMMITTED read, sees the duplicated data in the newly split block.
SCRIPT2:

set nocount on
DECLARE @LOOP INT
SET @LOOP=0
WHILE (@LOOP < 10000)
begin
 SET @LOOP=@LOOP+1
 select getdate(),sum(count_col) from dbo.test_table (NOLOCK) where search_col = 'THIS_ONE';
end;

Output from SCRIPT2 (trimmed)

2014-10-12 23:51:35.473 1000
.
2014-10-12 23:51:35.530 1000
2014-10-12 23:51:35.530 1000
2014-10-12 23:51:35.533 1005
2014-10-12 23:51:35.533 1000
2014-10-12 23:51:35.537 1000
2014-10-12 23:51:35.537 1000
2014-10-12 23:51:35.540 1003
2014-10-12 23:51:35.540 1000
2014-10-12 23:51:35.543 1001
2014-10-12 23:51:35.543 1000
2014-10-12 23:51:35.547 1000
2014-10-12 23:51:35.550 1000
2014-10-12 23:51:35.550 1000
2014-10-12 23:51:35.553 1000
2014-10-12 23:51:35.557 1006
2014-10-12 23:51:35.557 1003
2014-10-12 23:51:35.560 1000
2014-10-12 23:51:35.560 1000
.
2014-10-12 23:51:53.383 1000
2014-10-12 23:51:53.400 1000
2014-10-12 23:51:53.417 1004
2014-10-12 23:51:53.433 1001
2014-10-12 23:51:53.450 1000
2014-10-12 23:51:53.467 1002
2014-10-12 23:51:53.483 1000
2014-10-12 23:51:53.507 1000
Query was cancelled by user.

 
So, using the NOLOCK hint can return incorrect results, even if the data you are selecting is unchanged, unchanging, and NOT subject to rollback.
Locking is there for a reason. ACID transactions exist for a reason.
If you care about your data, you should try to access it correctly and treat it well, otherwise you have to ask if the code you are writing really has value. If it doesn’t have value, why are you storing the data in an expensive relational database, when you could use a freeware database engine or just pipe it straight to /dev/null – that’s really quick.

One solution to this problem is to change the locking method of SQL Server and start using Read Committed Snapshot Isolation** mode. This allows readers to access the data without blocking writers or being blocked by them. It works similarly to Oracle's Multi-Version Concurrency Control, and (sweeping generalisation alert!) allows SQL Server to scale better.

**NOLOCK still “works” the same in this mode – it needs to be removed from your code.
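
Switching a database over to RCSI is a single ALTER DATABASE (the database name below is just a placeholder). It needs a moment with no other active sessions in the database, so either pick a quiet window or let WITH ROLLBACK IMMEDIATE clear them out:

-- enable Read Committed Snapshot Isolation for one database
ALTER DATABASE YourDatabase SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;

-- confirm it took effect
SELECT name, is_read_committed_snapshot_on FROM sys.databases WHERE name = 'YourDatabase';

Bear in mind this changes the behaviour of every default READ COMMITTED read in that database, so test the application against it before rolling it out.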

Extending an ACFS filesystem dynamically.

To extend an ACFS cluster filesystem dynamically, we need to use the acfsutil command:

node01:/u01/grid>/sbin/acfsutil size +10G /u02
acfsutil size: ACFS-03008: The volume could not be resized.  The volume expansion limit has been reached.
acfsutil size: ACFS-03216: The ADVM compatibility attribute for the diskgroup was below the required
                           version (11.2.0.4.0) for unlimited volume expansions.

Oh dear, not 11.2.0.4, so you can only extend volumes dynamically a few times (5) before the global bitmap becomes full. So, now it's an offline change. :-(
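
(If you want to check up front where a diskgroup stands, the attribute can be read back from ASM before committing to an outage. Something along these lines, assuming the acfsdisk diskgroup used later in this post; anything below 11.2.0.4.0 means the expansion limit applies.)

ASMCMD [+] > lsattr -l -G acfsdisk compatible.advm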

Check what is accessing /u02 and stop it:

node01:/opt/oracle>sudo -s
[root@node01 oracle]# lsof /u02

COMMAND   PID   USER   FD   TYPE    DEVICE SIZE/OFF NODE NAME
bash     5566 oracle  cwd    DIR 252,50177    12288   78 /u02/goldengate/bin11
su      29509   root  cwd    DIR 252,50177    12288   78 /u02/goldengate/bin11

erm. kill -9 5566 29509 :-)

DO THIS ON EVERY RAC NODE!

[root@node01 oracle]# umount -t acfs /u02
[root@node02 oracle]# umount -t acfs /u02
[root@node-n oracle]# umount -t acfs /u02

Once unmounted, we can “repair” the global bitmap:

[root@node02 oracle]# fsck -y -t acfs  /dev/asm/acfsdisk_u02-98
fsck from util-linux-ng 2.17.2
version                   = 11.2.0.4.0
*****************************
********** Pass: 1 **********
*****************************
Oracle ASM Cluster File System (ACFS) On-Disk Structure Version: 39.0
 ACFS file system created at: Thu Jan  2 17:08:02 2014
 checking primary file system
 Files checked in primary file system: 25%
 Files checked in primary file system: 100%

 fsck.acfs: ACFS-07728: The Global_BitMap file has reached the maximum number of extents (5).
 The file system can no longer be expanded. 

 Running fsck.acfs in fixer mode will attempt to consolidate the storage bitmap into 
 fewer extents which would allow for file system expansion

 Checking if any files are orphaned...
 0 orphans found
 Checker completed with no errors.

So let's fix it. Output seriously trimmed, but with the important bit:

[root@node02 oracle]# /sbin/fsck.acfs -a -v /dev/asm/acfsdisk_u02-98
fsck from util-linux-ng 2.17.2
version                   = 11.2.0.4.0
 *****************************
 ********** Pass: 1 **********
 *****************************
 Oracle ASM Cluster File System (ACFS) On-Disk Structure Version: 39.0
 ACFS file system created at: Thu Jan  2 17:08:02 2014

 checking primary file system
 Files checked in primary file system: 25%
 Files checked in primary file system: 100%

 fsck.acfs: ACFS-07729: The Global_Bitmap file has been
 consolidated into 2 extents.
 This may allow for file system expansion via the 'acfsutil size' command.
  
 Checking if any files are orphaned...
 0 orphans found
 Checker completed with no errors.

So, we’re done and can re-mount ON EVERY NODE. Given it’s now 2 extents, and the max we can have is 5, we have 3 more dynamic extensions before we need to do this again.

mount -t acfs /dev/asm/acfsdisk_u02-98 /u02

And re-attempt to expand the filesystem

node01:/u01/grid>df -h /u02
Filesystem              Size  Used Avail Use% Mounted on
/dev/asm/acfsdisk_u02-98  325G   36G  290G  12% /u02

node01:/u01/grid>/sbin/acfsutil size +10G /u02
acfsutil size: new file system size: 359703511040 (343040MB)
node01:/u01/grid>df -h

node01:/u01/grid>df -h /u02
Filesystem              Size  Used Avail Use% Mounted on
/dev/asm/acfsdisk_u02-98  335G   36G  300G  11% /u02
node01:/u01/grid>

Yey! Bigger filesystem! Let's minimise the number of times it needs to be extended in the future by doing it in big lumps. Might just save a planned outage.

 

Alternatively, upgrade Grid Infra to at least 11.2.0.4 and set advm compatibility to 11.2.0.4 and the restriction will be gone for good:

ALTER DISKGROUP acfsdisk SET ATTRIBUTE 'compatible.asm' = '11.2.0.4', 'compatible.rdbms' = '11.2.0.4', 'compatible.advm' = '11.2.0.4';
(or ASMCMD [+] > setattr -G acfsdisk compatible.advm 11.2.0.4)
(or right-click on the disk group in asmca and click "edit attributes")

Grid Infrastructure Disk Space Problem – CHM DB file: crfclust.bdb

The Grid Infrastructure filesystem was reporting that it was a bit full today (release 11.2.0.4). This was tracked down to the "crfclust.bdb" file, which records information about the cluster health for monitoring purposes. It was 26GB. It's not supposed to get bigger than 1GB, so this is probably a bug, but let's explicitly resolve the size issue right now and search Oracle Support later. Worst case, the bdb (Berkeley DB) files get regenerated when the CHM (ora.crf) resource is restarted. You only lose the (OS) statistics that CHM has gathered; deleting the bdb files has no other impact, and CHM will start collecting the OS statistics again.
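
For the record, that worst-case route is only a few commands. A sketch, using the same crsctl calls and db directory that appear later in this post; try it on a sandpit first:

# stop CHM, remove its Berkeley DB files, then restart it; the files are recreated empty
crsctl stop res ora.crf -init
rm -f /u01/app/11g/grid/crf/db/node01/*.bdb
crsctl start res ora.crf -init

Repeat on each node if you go that way. This post takes the gentler route of resizing the repository instead, which is what follows.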

 

df -h /u01

Filesystem                Size  Used Avail Use% Mounted on
/dev/sdc1                  48G   36G  9.0G  81% /u01

pwd
/u01/app/11g/grid/crf/db/node01

ls -lh
total 29G

-rw-r--r-- 1 root root 2.1M Jul 22 12:12 22-JUL-2014-12:12:03.txt
-rw-r--r-- 1 root root 1.3M Apr 23 14:28 23-APR-2014-14:28:04.txt
-rw-r--r-- 1 root root 1.2M Apr 23 14:33 23-APR-2014-14:33:34.txt
-rw-r--r-- 1 root root 1.3M Jul 23 12:53 23-JUL-2014-12:53:02.txt
-rw-r--r-- 1 root root 946K Apr 26 03:57 26-APR-2014-03:57:21.txt
-rw-r----- 1 root root 492M Aug 26 10:33 crfalert.bdb
-rw-r----- 1 root root  26G Aug 26 10:33 crfclust.bdb   <-26G!
-rw-r----- 1 root root 8.0K Jul 23 12:52 crfconn.bdb
-rw-r----- 1 root root 521M Aug 26 10:33 crfcpu.bdb
-rw-r----- 1 root root 513M Aug 26 10:33 crfhosts.bdb
-rw-r----- 1 root root 645M Aug 26 10:33 crfloclts.bdb
-rw-r----- 1 root root 418M Aug 26 10:33 crfts.bdb
-rw-r----- 1 root root  24K Aug  1 16:07 __db.001
-rw-r----- 1 root root 392K Aug 26 10:33 __db.002
-rw-r----- 1 root root 2.6M Aug 26 10:33 __db.003
-rw-r----- 1 root root 2.1M Aug 26 10:34 __db.004
-rw-r----- 1 root root 1.2M Aug 26 10:33 __db.005
-rw-r----- 1 root root  56K Aug 26 10:34 __db.006
-rw-r----- 1 root root  16M Aug 26 10:17 log.0000008759
-rw-r----- 1 root root  16M Aug 26 10:33 log.0000008760
-rw-r----- 1 root root 8.0K Aug 26 10:33 repdhosts.bdb
-rw-r--r-- 1 root root 115M Jul 22 12:12 node01.ldb

Let's see how big the repository is…

oclumon manage -get repsize
CHM Repository Size = 1073736016

Wow. Seems a bit oversized. Change the repository retention to the desired number of seconds, between 3600 (1 hour) and 259200 (3 days):

oclumon manage -repos resize 259200

node01 --> retention check successful
node02 --> retention check successful

New retention is 259200 and will use 4524595200 bytes of disk space
CRS-9115-Cluster Health Monitor repository size change completed on all nodes.

If we now check the size, we get an error as the repository is bigger than the max allowed size.

oclumon manage -get repsize
CRS-9011-Error manage: Failed to initialize connection to the Cluster Logger Service

So we need to stop and start the ora.crf service to get everything working again. It should be OK to do this on a running system with no impact, but I’d start with your sandpit to test it. Don’t take my word for it!

Check for process:

node01:/u01/app/11g/grid/bin>ps -ef |grep crf
root     26983     1  0 10:44 ?        00:00:00 /u01/app/11g/grid/bin/ologgerd -m node02 -r -d /u01/app/11g/grid/crf/db/node01

Stop service:
node01:/u01/app/11g/grid/bin>crsctl stop res ora.crf -init

CRS-2673: Attempting to stop 'ora.crf' on 'node01'
CRS-2677: Stop of 'ora.crf' on 'node01' succeeded

Start Service:
node01:/u01/app/11g/grid/bin>crsctl start res ora.crf -init
CRS-2672: Attempting to start 'ora.crf' on 'node01'
CRS-2676: Start of 'ora.crf' on 'node01' succeeded

Check for Process:
node01:/u01/app/11g/grid/bin>ps -ef  |grep crf
root     28000     1  5 10:49 ?        00:00:00 /u01/app/11g/grid/bin/ologgerd -m node02 -r -d /u01/app/11g/grid/crf/db/node01

Check the size – as specified:
node01:/u01/app/11g/grid/bin>oclumon manage -get repsize

CHM Repository Size = 259200

Done

And the space is released and reclaimed.

node01:/u01/app/11g/grid/bin>df -h /u01

Filesystem                Size  Used Avail Use% Mounted on
/dev/sdc1                  48G  7.7G   38G  18% /u01

The space has been returned. Marvellous.
Now repeat the stop/start on each node.

 

UPDATE: From Oracle Support: Having very large bdb files (greater than 2GB) is likely due to a bug since the default size limits the bdb to 1GB unless the CHM data retention time is increased.  One such bug is 10165314.
