Erlang/OTP: republishing node info to epmd after it’s failed
Part of the Erlang/OTP model is a separate naming server called epmd (protocol info), which keeps track of all Erlang nodes local to a machine and the ports other nodes can use to connect to them directly. A problem with the current design is that epmd stores its data in memory only, so if it crashes, all the data about previously registered nodes is lost. Nodes that are already connected won’t be affected, and new nodes will be able to reach each other, but old and new nodes will have no way to connect to each other. epmd is designed to be simple so that crashes are very unlikely, but for a system which is supposed to provide nine 9’s reliability I find it odd that a simple retry isn’t even attempted.
http://blog.ulfurinn.net/2009/1/17/emergency-epmd-reanimation
After finding this out I searched Google to confirm and ran across a few mailing list posts which recommended placing epmd in /etc/inittab so that it is restarted if it ever dies or is killed, but this doesn’t solve the problem that the old nodes are no longer registered with epmd. The link above shows how to use the erl_epmd server to reregister the node, but one needs to know the name and port on which the Erlang instance is listening. After digging around in lib/kernel/src/erl_epmd.erl I found that the name and port are kept in its state and that there is an undocumented call, ‘client_info_req’, which returns them. So rather than using netstat or lsof and remembering the node name, as required above, one can automate the process.
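As a rough sketch of that automation, something like the following should work. The module name epmd_reregister is made up for illustration, and because client_info_req is undocumented, the reply shape matched here (an ok tuple carrying a protocol tag, the name and the port) is an assumption that should be checked against the erl_epmd.erl shipped with your release.

```erlang
-module(epmd_reregister).
-export([reregister/0]).

%% Ask the local erl_epmd server for the name and port it keeps in its
%% state, then register them with epmd again.
%% ASSUMPTION: client_info_req is an undocumented call; the reply is taken
%% to be {ok, {ProtocolTag, Name, Port}}. Anything else is returned as an
%% error rather than guessed at.
%% Note: the call exits with noproc if the node was started without
%% distribution, since erl_epmd is not running in that case.
reregister() ->
    case gen_server:call(erl_epmd, client_info_req) of
        {ok, {_ProtocolTag, Name, Port}} ->
            %% Same call the linked post makes by hand.
            erl_epmd:register_node(Name, Port);
        Other ->
            {error, {unexpected_reply, Other}}
    end.
```

Calling epmd_reregister:reregister() from the shell of the orphaned node, once epmd is running again, should put the node back in epmd's list.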
Which I’ve done. I threw together an OTP app which polls epmd and scans for an entry for the node. If the entry isn’t found, the app reregisters the node. Optionally it will also attempt to restart epmd in daemon mode. An event-based implementation would be best, but I currently don’t know enough to know if that’s even possible. For the time being this will help maintain that nine 9’s uptime.
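The polling idea can be sketched in a few lines. This is not the app itself, just a minimal illustration: the module name epmd_watch and its functions are hypothetical, the re-registration step is passed in as a fun, and restarting epmd assumes the epmd binary is on the PATH.

```erlang
-module(epmd_watch).
-export([start/2, check/1]).

%% Start a bare polling process (not a supervised OTP process).
%% Reregister is a zero-arity fun that performs the re-registration.
start(IntervalMs, Reregister) when is_function(Reregister, 0) ->
    spawn(fun() -> loop(IntervalMs, Reregister) end).

loop(IntervalMs, Reregister) ->
    _ = check(Reregister),
    timer:sleep(IntervalMs),
    loop(IntervalMs, Reregister).

%% One polling step: look for this node's short name in epmd's list.
check(Reregister) ->
    %% node() is name@host; epmd lists only the part before the @.
    [ShortName | _] = string:tokens(atom_to_list(node()), "@"),
    case net_adm:names() of
        {ok, Names} ->
            %% Names is a list of {NameString, Port} tuples.
            case lists:keymember(ShortName, 1, Names) of
                true  -> ok;
                false -> Reregister()
            end;
        {error, _Reason} ->
            %% epmd is not answering. Optionally restart it in daemon
            %% mode (ASSUMPTION: epmd is on the PATH) and let the next
            %% poll do the re-registration.
            _ = os:cmd("epmd -daemon"),
            {error, epmd_down}
    end.
```

A call such as epmd_watch:start(5000, Fun) would check every five seconds, where Fun is whatever re-registers the node. A real OTP app would wrap this in a gen_server under a supervisor instead of a bare spawn.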

