I’ve modified the words spoken by Prince Hamlet (William Shakespeare) a bit to frame one of the most commonly asked questions about patching strategy, especially in today’s complex IT environments.
Probably every large company has its own policy on how often it needs to patch its IT infrastructure (for example storage, network, OS, databases, application servers…).
Parts of the IT infrastructure that are exposed to external users, such as web servers, should be patched even more frequently (immediately), as should client PCs (see the recent threat from the WannaCry ransomware).
There are many relevant sites that advise just that: upgrade ASAP (take a look at the great blog at ).
My opinion is that such a strategy can be fine for smaller systems where you can take downtime whenever you like.
But what if your system is running some very complex SW such as SAP, Oracle E-Business Suite, or Oracle Retail?
Constantly upgrading such a system would mean being in a complex upgrade process all the time, with huge costs, a great deal of work, and many interruptions for end users.
This is not feasible, as even the SW provider (Oracle in this case) is not able to certify all components and align them with the latest release.
There is another valid argument that I’m going to explain here.
First, you have to be aware that every patch resolves some issues but also introduces new ones (otherwise you would eventually end up with SW without a single bug).
The main question you need to answer is: are your IT systems in a stable condition or not?
In other words, if you experience only minor bugs, some of which have a good workaround, do you really need to patch your system?
The same question applies when your version of the OS, database, or other SW has bugs in functionality that you are not using at all.
My policy for critical systems is: patch when you must (when you have a problem that is hurting your IT system and causing end-user complaints).
Rule No. 2: when you have to patch, never use the latest release of the patch, as you never know whether that very fresh patch introduces new bugs that can seriously hit your system.
Those two rules hold on the assumption that you are not exposing your critical infrastructure to the public.
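The first rule can be written down as a small decision function. This is only a sketch of how I read the rule, not a tool we actually run: the `KnownIssue` fields and the `must_patch` name are hypothetical, and the assumption that publicly exposed systems are always patched comes from the remark about web servers earlier in this post.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KnownIssue:
    """One bug or problem known to exist in the currently running release (hypothetical model)."""
    description: str
    affects_used_feature: bool    # False if the bug is in functionality we never use
    has_good_workaround: bool     # True if a workaround keeps users productive
    causes_user_complaints: bool  # True if the problem visibly hurts end users


def must_patch(publicly_exposed: bool, issues: List[KnownIssue]) -> bool:
    """Return True when a critical system should be patched now."""
    # Assumption: publicly exposed components fall outside the conservative rules.
    if publicly_exposed:
        return True
    # Otherwise patch only for issues that actually hurt and cannot be worked around.
    return any(
        issue.affects_used_feature
        and issue.causes_user_complaints
        and not issue.has_good_workaround
        for issue in issues
    )


stable_system = [
    KnownIssue("bug in an unused feature", False, False, False),
    KnownIssue("minor bug with a workaround", True, True, False),
]
print(must_patch(publicly_exposed=False, issues=stable_system))  # False
```

A stable system with only minor or irrelevant bugs stays untouched, while the same list on a publicly exposed server would still trigger patching.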
The following is a true story that supports my rules for patching business-critical systems.
Recently we did OS patching as part of the standard company patching strategy.
After a while we had the following errors in the alert log file, and the critical system was out of service for about 4 hours, during which no one could buy anything in any store.
Incomplete read from log member ‘+DATA/xxxxxxx/onlinelog/group_7.494.904765777’. Trying next member.
ARC1: Log corruption near block 668010 change 11367460664057 time ?
CORRUPTION DETECTED: thread 1 sequence 37928 log 7 at block 668010. Arch found corrupt blocks
Errors in file /opt/rtl/oracle/diag/rdbms/xxxxxxx/xxxxxxx/trace/xxxxxxx_arc1_12386446.trc (incident=481827):
ORA-00353: log corruption near block 668010 change 11367460664057 time 04/11/2017 13:10:15
ORA-00312: online log 7 thread 1: ‘+DATA/xxxxxxx/onlinelog/group_7.494.904765777’
Incident details in: /opt/rtl/oracle/diag/rdbms/xxxxxxx/xxxxxxx/incident/incdir_481827/xxxxxxx_arc1_12386446_i481827.trc
ARC1: All Archive destinations made inactive due to error 354
ARC1: Closing local archive destination LOG_ARCHIVE_DEST_1: ‘+FRA/mom1p/archlog/1_37928_898946333.arc’ (error 354) (xxxxxxx)
Committing creation of archivelog ‘+FRA/mom1p/archlog/1_37928_898946333.arc’ (error 354)
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance xxxxxxx – Archival Error
ORA-16038: log 7 sequence# 37928 cannot be archived
ORA-00354: corrupt redo log block header
ORA-00312: online log 7 thread 1: ‘+DATA/xxxxxxx/onlinelog/group_7.494.904765777’
Tue Apr 11 13:40:16 2017
Sweep [inc][481827]: completed
Tue Apr 11 13:40:16 2017
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance xxxxxxx – Archival Error
ORA-16014: log 7 sequence# 37928 not archived, no available destinations
ORA-00312: online log 7 thread 1: ‘+DATA/xxxxxxx/onlinelog/group_7.494.904765777’
Tue Apr 11 13:40:16 2017
Dumping diagnostic data in directory=[cdmp_20170411134016], requested by (instance=1, osid=12386446 (ARC1)), summary=[incident=481827].
ERROR:
ORA-00257: archiver error. Connect internal only, until freed.
ORA-00353: log corruption near block string change string time string
Cause: Some type of redo log corruption has been discovered. This error describes the location of the corruption. Accompanying errors describe the type of corruption.
Action: Do recovery with a good version of the log or do incomplete recovery up to the indicated change or time.
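After OS patching, a quick scan of the alert log for ORA- codes can surface a problem like this one early. The following is a minimal sketch, assuming the alert log is available as plain text; the function name `ora_error_counts` is hypothetical, and the sample lines are copied from the excerpt above.

```python
import re
from collections import Counter

# Oracle error codes in the alert log look like ORA-00353 (five digits).
ORA_CODE = re.compile(r"\bORA-(\d{5})\b")


def ora_error_counts(alert_log_text: str) -> Counter:
    """Count how often each ORA- error code appears in the given text."""
    return Counter(ORA_CODE.findall(alert_log_text))


sample = """\
ORA-00353: log corruption near block 668010 change 11367460664057 time 04/11/2017 13:10:15
ORA-00312: online log 7 thread 1: '+DATA/xxxxxxx/onlinelog/group_7.494.904765777'
ORA-16038: log 7 sequence# 37928 cannot be archived
ORA-00354: corrupt redo log block header
ORA-00312: online log 7 thread 1: '+DATA/xxxxxxx/onlinelog/group_7.494.904765777'
"""

for code, count in ora_error_counts(sample).most_common():
    print(f"ORA-{code}: {count}")
```

Counting by code rather than by line keeps the report short even when the archiver retries and repeats the same errors for hours.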
It was a nationwide outage.
Buyers were forced to go to competitors’ stores.
And what explanation do you have for your managers? “You know, the system was down for 4+ hours and no one was able to buy anything, but the good news is that we have IBM AIX patched.”
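In this case, the solution was to execute the following command, where X is the number of the corrupt log group (group 7 in the log above):
SQL> ALTER DATABASE CLEAR UNARCHIVED LOGFILE GROUP X;
Keep in mind that clearing an unarchived log discards redo that was never archived, so take a full database backup right afterwards.
A more detailed explanation can be found in the following note: 2237498.1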
This is an extreme case, where not listening to the right people and blindly applying the company’s general rules caused major damage in money and in lost trust among partners and buyers.
There are many more modest cases that can also produce a lot of headaches after patching.
One example is execution plan instability, which can occur even if you just restart your database (not to mention when you patch the database with new functionality in the Cost Based Optimizer).
To summarize:
I’m not advising against patching at all.
Of course you should patch, but each important system must have its own patching strategy that defines when to patch, how often to patch, and which patch to apply.
I’ll always stay on the conservative side when patching, as it’s better that someone else reports the bug on some minor system and leaves the SW provider (IBM in this case) time to fix it before I apply the patch.
In this case, that strategy would have meant:
– inspecting which problems the IBM AIX patch resolves
– checking whether my critical system has been hit by any of those bugs in the running AIX release
– finding out which bugs the new IBM AIX patch introduces into the system
– deciding which patch to implement (if several patches have been released, by inspecting the bugs that each patch introduces into the system)
– checking how old the patch is (when a patch is very fresh, users have not had enough time to report bugs to the SW provider, which is IBM in this case)
The same applies to any other SW vendor or to open source code.
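The checklist above can be turned into a simple selection routine. This is only a sketch under stated assumptions: the `PatchCandidate` structure, the patch names, the bug identifiers, and the 90-day minimum age are all hypothetical values chosen for illustration, not figures from IBM or from our own policy.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, List, Optional, Set


@dataclass(frozen=True)
class PatchCandidate:
    name: str
    released: date
    fixes: FrozenSet[str]       # bug ids the patch resolves
    introduces: FrozenSet[str]  # bug ids reported against the patch itself


def choose_patch(
    candidates: List[PatchCandidate],
    bugs_hitting_us: Set[str],
    today: date,
    min_age_days: int = 90,  # assumed threshold, tune it per system
) -> Optional[PatchCandidate]:
    """Pick a patch that is old enough and fixes a bug we actually suffer from."""
    eligible = [
        p for p in candidates
        if (today - p.released).days >= min_age_days and p.fixes & bugs_hitting_us
    ]
    if not eligible:
        return None  # nothing relevant or mature enough: do not patch yet
    # Prefer the fewest known regressions, then the most of our bugs fixed.
    return min(
        eligible,
        key=lambda p: (len(p.introduces), -len(p.fixes & bugs_hitting_us)),
    )


candidates = [
    PatchCandidate("patch-old", date(2016, 10, 1), frozenset({"BUG-1"}), frozenset({"BUG-9"})),
    PatchCandidate("patch-mid", date(2016, 12, 1), frozenset({"BUG-1", "BUG-2"}), frozenset()),
    PatchCandidate("patch-fresh", date(2017, 4, 1), frozenset({"BUG-1", "BUG-2"}), frozenset()),
]
chosen = choose_patch(candidates, {"BUG-1", "BUG-2"}, today=date(2017, 4, 11))
print(chosen.name if chosen else "do not patch yet")  # patch-mid
```

The freshest patch is skipped purely because of its age, which is exactly the point of rule No. 2: let other users find its bugs first.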