Padstow (08)

Monday 21 July, 2008 - 08:34

The padstow cluster had a few problems over the past week:

The DATA disk group was corrupted again, and the same tablespace (SYSAUX) was affected.
The cluster had timing problems - at one point the padstow2 node was nine seconds ahead of padstow1.
The Grid Control agent was not collecting data or picking up targets on either node of the cluster.

Now I know there are some errors that ASM cannot protect against. I had to do a point-in-time recovery (PITR) because the archive logs were not duplexed across the DATA and FRA disk groups. At least I am getting practice with RMAN backups and restores.

To overcome the timing problems, I decided to go back to using NTP with gridctrl as the local NTP server. Although the other nodes list gridctrl as a peer (in the output of ntpq -p), they still insist on using their own local clock as the timing source.
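A quick way to confirm which source a node has really selected is to look at the tally character in the ntpq peers listing: the line that starts with '*' is the peer ntpd is currently synchronising to. The helper below is only a sketch; the function name selected_peer is made up for illustration. An answer of 127.127.1.0 would mean the node is following its own undisciplined local clock rather than gridctrl.

```shell
# Print the peer that ntpd has currently selected as its sync source.
# Reads `ntpq -pn` style output on stdin; the selected peer's line
# starts with '*'. Prints nothing if no peer has been selected yet.
# (selected_peer is a hypothetical helper name, not part of NTP.)
selected_peer() {
    awk '/^\*/ { print substr($1, 2) }'
}

# Typical use on a cluster node (needs a running ntpd):
#   ntpq -pn | selected_peer
```

The -n option makes ntpq print numeric addresses, so the local clock driver shows up as 127.127.1.0 instead of a name such as LOCAL(0).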

The implementation procedure for NTP I have been using is:

vi /etc/ntp.conf (to add "server gridctrl")
Get the ntp service to recognise the new NTP server:

service ntpd restart
ntptime # to check the time
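Because I end up repeating this on every node, the edit step can be scripted so that it is safe to run more than once. This is a rough sketch rather than what I actually ran: add_ntp_server is a made-up function name, and it only appends the server line if the file does not already have one for that host.

```shell
# Append "server <host>" to an ntp.conf-style file unless it is already there.
# add_ntp_server is a hypothetical helper; on the cluster nodes the file
# would be /etc/ntp.conf and the host would be gridctrl.
add_ntp_server() {
    conf_file=$1
    host=$2
    # Look for an existing "server <host>" line, allowing extra options after it.
    if ! grep -Eq "^[[:space:]]*server[[:space:]]+${host}([[:space:]]|\$)" "$conf_file"; then
        printf 'server %s\n' "$host" >> "$conf_file"
    fi
}

# Typical use as root on each node:
#   add_ntp_server /etc/ntp.conf gridctrl
#   service ntpd restart
#   ntpq -pn
```

The check matters because a second run would otherwise leave duplicate server lines in the file.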

The Grid Control agent took several attempts at reinstallation before all the targets were detected. I am still getting data collection errors, but at least I did not have to recreate the cluster from scratch to get this far.