This is one of the longest-running issues I’ve had to resolve, because in busy production environments you rarely get a time window for working on DNS.
First let me explain the issue.
The client where I’ve been working needs to replace one very old DNS server with a new one.
Each server has two DNS servers defined, and we simulated the real scenario of one of the two being down by blocking the DNS port on the network switch, thus preventing all clients from resolving host names through that DNS server.
We expected many difficulties, and most of them were easy to resolve (wrong client DNS settings, apps with a hard-coded DNS server name, etc.).
Still, one issue remained that was difficult to explain.
While one of the DNS servers was down (simulated by blocking the DNS port on the switch for that server), some apps started to freeze completely or partially.
It took me a while to understand what all those different apps (Java Web Start, some JEE apps deployed on the WebLogic app server, etc.) had in common, as the issues occurred on more than one host (on both Linux and IBM AIX boxes).
To make diagnostics more difficult, some apps kept working perfectly without any interruption while others were frozen on the same hosts.
There are several steps involved in resolving the issue.
1.
I started with network monitoring, using Net Activity Viewer, netstat, iftop or any other tool that can monitor traffic between your PC and the host where the problematic app is running.
On a Windows PC you can install TCPView, a nice tool that is free of charge (at least it was before I replaced Windows with Linux as my desktop OS).
The picture was taken from the web, so it doesn’t show the real system.
After starting the network monitoring tool, I tried to establish a connection to the problematic apps from my PC, looking for the host name that needs to be resolved by DNS.
In addition, I inspected my local PC’s DNS configuration and tried to resolve the host name used by the problematic apps:
nslookup some_host_name.mydomain.com
Once I had proved that host name resolution worked fine from my local PC, it was time to move forward.
2.
You need to check that the resolv.conf files are correctly configured on all servers where the problematic apps are running.
In my case, the resolv.conf file has the following content:
search domain_name.com
nameserver 192.168.255.1
nameserver 192.168.255.2
Replace domain_name with your own domain name. The DNS server IP addresses are not real (I made them up), so change them to suit your environment.
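When the same file has to be checked on many servers, a small shell helper can save some typing. The following is only a sketch, assuming a POSIX shell with awk; the function names list_nameservers and list_suspect_lines are made up for this example. The first prints the address from every active nameserver line, the second flags lines whose keyword merely starts with "nameserver" (for example a numbered variant), which resolvers typically ignore silently.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting a resolv.conf-style file.
# Both take the file path as first argument (default: /etc/resolv.conf).

# Print the address from every active "nameserver" line.
# Commented-out lines (#... or ;...) never match because their
# first field is not exactly "nameserver".
list_nameservers() {
    awk '$1 == "nameserver" && NF >= 2 { print $2 }' "${1:-/etc/resolv.conf}"
}

# Print "line-number: line" for keywords that start with "nameserver"
# but are not exactly "nameserver" (likely typos).
list_suspect_lines() {
    awk '$1 ~ /^nameserver/ && $1 != "nameserver" { print NR ": " $0 }' "${1:-/etc/resolv.conf}"
}
```

Running list_nameservers with no argument reads /etc/resolv.conf; if it prints fewer addresses than you expect, list_suspect_lines may show why.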
3.
Check if you have DNS caching in place.
On AIX, execute the following command as the root user:
lssrc -ls netcd
To clear the cache, perform the following:
stopsrc -s netcd (stop DNS caching)
lssrc -ls netcd (check that DNS caching has been stopped)
startsrc -s netcd (start DNS caching)
lssrc -ls netcd (check that DNS caching has been started)
On a Linux box, DNS caching is not likely to be enabled.
4.
First, test at the host level whether resolving is working.
Comment out one DNS server in the resolv.conf file, as in the following example:
search domain_name.com
#nameserver 192.168.255.1
nameserver 192.168.255.2
After that you can use nslookup to see if resolving is working properly:
nslookup some_host_name.mydomain.com
If everything is working as expected, you can proceed by commenting out the second DNS server instead, as in the following example:
search domain_name.com
nameserver 192.168.255.1
#nameserver 192.168.255.2
And try to resolve some host by name:
nslookup some_host_name.mydomain.com
On AIX, for example, you can set the RES_OPTIONS environment variable to get debug output:
RES_OPTIONS=debug host some_host_name.mydomain.com | grep Query
and see if resolving is working fine.
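As an alternative to editing the file by hand, each DNS server can also be queried directly, because nslookup accepts the server to ask as a second argument. The sketch below loops over the active nameserver lines and reports which servers answer. The function name check_each_nameserver and the LOOKUP_CMD override are made up for this example, and it assumes that your nslookup returns a non-zero exit status on failure, which may not be true on older systems.

```shell
#!/bin/sh
# Hypothetical helper: ask every active nameserver for the same host name.
# Usage: check_each_nameserver HOST [RESOLV_CONF]
# LOOKUP_CMD (an assumption of this sketch) lets you swap nslookup for
# another command that takes "HOST SERVER" and signals failure by exit status.
check_each_nameserver() {
    host_name=$1
    conf=${2:-/etc/resolv.conf}
    failed=0
    for ns in $(awk '$1 == "nameserver" && NF >= 2 { print $2 }' "$conf"); do
        if ${LOOKUP_CMD:-nslookup} "$host_name" "$ns" >/dev/null 2>&1; then
            echo "OK $ns"
        else
            echo "FAIL $ns"
            failed=1
        fi
    done
    # Return 1 if at least one server did not answer.
    return "$failed"
}
```

For example, check_each_nameserver some_host_name.mydomain.com prints one OK or FAIL line per server, so a blocked DNS server shows up without touching resolv.conf.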
If both tests finish successfully, you have only proved that DNS names are resolved correctly by the host.
Unfortunately this is not enough, as some apps still don’t work.
Welcome to the last step in DNS resolving.
5.
To prevent DNS spoofing attacks, the JVM (Java Virtual Machine, no matter whether it is JRockit, Oracle JDK, IBM Java, OpenJDK…) caches the DNS server on the first host name lookup.
It can be the first or the last DNS server in the resolv.conf file.
For that reason, some Java apps will work and some won’t if you bring down one of the DNS servers listed in the resolv.conf file.
For example, imagine that you have WebLogic with two managed servers.
On Managed Server 1 you have App1, while on Managed Server 2 you have App2.
If, on the first host name resolution, App1 uses the first DNS server (192.168.255.1) and App2 uses the second DNS server (192.168.255.2), then all subsequent host name resolutions by App1 will use the first DNS server, while App2 will use the second one.
For that reason it is possible (in fact, that is what happened here) that some apps keep working fine while others freeze (the apps bound to the unreachable DNS server).
To resolve the issue, you need to find out which JVM your application/application server is using.
Then you’ll have to find the java.security file, which is in the $JAVA_HOME/jre/lib/security directory (for example /usr/java/jdk1.6.0_45/jre/lib/security on one of my test machines).
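The location depends on the Java version: a standalone JRE keeps the file under lib/security, and from Java 9 onward it lives under conf/security. A minimal sketch for locating it, assuming a POSIX shell (the function name find_java_security is made up for this example):

```shell
#!/bin/sh
# Hypothetical helper: print the path of java.security under a Java home.
# Usage: find_java_security [JAVA_HOME]   (defaults to $JAVA_HOME)
find_java_security() {
    home=${1:-${JAVA_HOME:-}}
    # JDK up to 8, standalone JRE up to 8, Java 9 and later.
    for rel in jre/lib/security lib/security conf/security; do
        if [ -f "$home/$rel/java.security" ]; then
            echo "$home/$rel/java.security"
            return 0
        fi
    done
    # Nothing found under this Java home.
    return 1
}
```

Calling find_java_security /usr/java/jdk1.6.0_45 would print the path from the example above if the file exists there.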
After that, you’ll need to uncomment the following line:
#networkaddress.cache.ttl=-1 -> networkaddress.cache.ttl=-1
As you can see from the comments in the same file, there is a warning that this action can expose you to a “DNS spoofing attack” unless your security is properly configured.
The change I made sets the caching policy for successful name lookups from the DNS service.
The value is an integer that specifies the number of seconds to cache successful lookups.
A value of -1 indicates that the JVM will cache successful lookups forever.
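With several JVMs installed on one host it is easy to edit the wrong copy, so it can help to check what each java.security file actually says. The following is a sketch only, assuming a POSIX shell with sed and grep; the function name cache_ttl_status is made up for this example. It reports whether networkaddress.cache.ttl is active (and with which value), commented out, or missing.

```shell
#!/bin/sh
# Hypothetical helper: report the state of networkaddress.cache.ttl
# in the given java.security file.
# Usage: cache_ttl_status /path/to/java.security
cache_ttl_status() {
    file=$1
    # Value of the last active (uncommented) setting, if any.
    # networkaddress.cache.negative.ttl does not match this pattern.
    value=$(sed -n 's/^[[:space:]]*networkaddress\.cache\.ttl[[:space:]]*=[[:space:]]*//p' "$file" | tail -n 1)
    if [ -n "$value" ]; then
        echo "active: $value"
    elif grep -q '^[[:space:]]*#[[:space:]]*networkaddress\.cache\.ttl' "$file"; then
        echo "commented out"
    else
        echo "not present"
    fi
}
```

After the edit described above, the function should report "active: -1" for the JVM your application server uses.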
If you bring down one of the DNS servers (I assume you have at least two DNS servers in the resolv.conf file), the JVM will contact the other DNS server for the name info and, if successful, will cache the result forever.
After making changes to the java.security file, you need to restart all problematic apps for the changes to take effect (basically, you need to restart the JVM).
In the case of WebLogic (the same goes for WebSphere, JBoss and Tomcat), you need to restart the managed servers (you can restart the Admin server as well).
After that, all subsequent simulations of the DNS replacement (blocking the DNS port for one DNS server) were performed successfully.
I hope this will save you some time troubleshooting DNS issues, as all types of Java apps (whether client side or J2EE, and no matter which app server you are using: WebLogic, JBoss, Tomcat, WebSphere…) were affected.