Commercial solutions are available, but they are very expensive, costing many thousands of dollars in additional hardware and software. More distressing is their obvious complexity: most require several hundred man-hours of consulting time just to set up. Finally, they don't use the Web as their interface, which makes truly remote monitoring impossible and data sharing extremely difficult.
Big Brother is a simple, effective solution to the Systems Monitoring problem, and is presented here for your comments and suggestions.
Big Brother is a loosely-coupled, distributed set of tools for monitoring and displaying the current status of an entire network, and for notifying the admin should the need arise. It came about as the result of automating the day-to-day tasks encountered while actively administering Unix systems.
It consists of five major parts:
Big Brother was designed to provide instant information about the health of a Systems and Network environment to anyone, anywhere, with Web access to the site.
Network information is now instantly available to those who need it most: managers, systems administrators, and people on the help desk can actively and simply monitor the health of the network.
If any condition is severe, the administrator will have been paged, can use Big Brother to get additional information, and can proceed to fix the problem. Problem verification, data sharing and correction should improve immediately, since everyone has implicit access to the same information.
Finally, since warnings are displayed, corrective action can be taken even before users notice that there is a problem.
The display matrix shows a status of green (ok), yellow (warning), red (severe), purple (no contact), or clear (not tested) for each system/area combination. Furthermore, the entire screen changes color to reflect the most serious condition on the network. In order of increasing severity these conditions are: green, yellow, purple, red.
Therefore a single warning anywhere on the network turns the entire display yellow, which is highly visible, even from far away.
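The rule for the overall screen color can be sketched in a few lines of Python. This is only an illustration, not Big Brother's own code: the name page_color is made up, the ranking assumes that the purple no-contact status sits between yellow and red, and placing clear (not tested) below green is likewise an assumption.

```python
# Severity ranking, lowest to highest. Putting "clear" below green and
# "purple" between yellow and red are assumptions of this sketch.
SEVERITY = ["clear", "green", "yellow", "purple", "red"]


def page_color(statuses):
    """Return the most severe color among all reported statuses."""
    if not statuses:
        return "clear"  # nothing has been tested yet
    return max(statuses, key=SEVERITY.index)
```

With this ranking, a matrix holding many green entries and one yellow entry yields a yellow screen.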
Each of the elements in the display matrix can then be clicked on to provide additional information, including the code, time, and specific information about the area being monitored.
Additionally, Big Brother now makes detailed information about every server in the network instantly accessible, just by clicking on the server name.
Every element in the etc/bb-hosts file is accessed every 5 minutes. Any loss of contact results in a code red and a page to the administrator, unless this condition is explicitly permitted, for example for dialup devices that may not be online.
The BBNET machine tests each element in the etc/bb-hosts file every 5 minutes using a bbnet. Each system can have specific services like ftp, http, and mail tested, as required. Inability to contact a server is a severe condition. Inability to access a page due to a "Server Error" results in a warning condition.
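A service test of this kind might look roughly like the following Python sketch. It is not how Big Brother itself performs the test: the function names and the default timeout are hypothetical, and treating the HTTP 5xx status codes as the "Server Error" case is an assumption.

```python
import socket


def can_connect(host, port, timeout=15.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False


def http_color(status_code):
    """Map an HTTP result to a display color.

    status_code is None when the server could not be contacted at all.
    """
    if status_code is None:
        return "red"     # no contact: severe condition
    if 500 <= status_code <= 599:
        return "yellow"  # assumed meaning of "Server Error": warning
    return "green"
```

A caller would first use can_connect to establish contact and only then request a page and pass its status code to http_color.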
All systems are monitored for disk usage. Any disk over 90% full is considered a warning condition. Disks over 95% full are marked as a severe condition, since this situation can quickly result in a system crash or hang.
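The disk thresholds translate directly into a small check. The sketch below assumes Python's standard shutil.disk_usage as the source of the numbers, which may differ slightly from what df reports; the function names are made up for illustration.

```python
import shutil


def disk_color(percent_used, warn=90.0, panic=95.0):
    """Classify a disk by how full it is, using the thresholds from the text."""
    if percent_used > panic:
        return "red"     # over 95% full: severe
    if percent_used > warn:
        return "yellow"  # over 90% full: warning
    return "green"


def percent_used(path):
    """Return the percentage of the filesystem holding path that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # pseudo-filesystems can report a zero size
    return 100.0 * usage.used / usage.total
```

For example, disk_color(percent_used("/")) gives the color for the root filesystem.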
All systems are monitored for CPU usage. A load average over 1.50 is a warning condition; 3.00 merits a severe condition. These values are configurable.
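The load check can be sketched the same way. The text does not say which load average is used or whether exactly 3.00 already counts as severe, so the sketch assumes the one-minute average and treats 3.00 and above as severe; the function names are hypothetical.

```python
import os


def load_color(load, warn=1.50, panic=3.00):
    """Classify a load average; both thresholds are configurable."""
    if load >= panic:
        return "red"     # assumed: 3.00 itself is already severe
    if load > warn:
        return "yellow"
    return "green"


def current_load_color():
    """Classify the one-minute load average of this machine (Unix only)."""
    one_minute, _five, _fifteen = os.getloadavg()
    return load_color(one_minute)
```

A busier server could be given its own limits, for instance load_color(2.0, warn=4.0, panic=8.0).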
Processes are monitored on each system as well. The choice of what is monitored depends on what each system actually does. Sendmail and http are monitored on all systems by default. If any of these important processes is not running, a page may be generated or a warning condition results.
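One way to picture the process check is the Python sketch below. It is an illustration under assumptions: it relies on the POSIX ps command, and the process names sendmail and httpd in the usage note are guesses at what "Sendmail and http" correspond to on a given system.

```python
import subprocess


def running_commands():
    """Return the set of command names currently running, as listed by ps."""
    result = subprocess.run(
        ["ps", "-e", "-o", "comm="],  # POSIX: every process, command name only
        capture_output=True,
        text=True,
        check=True,
    )
    # Some systems print a full path here; callers may need to strip it.
    return {line.strip() for line in result.stdout.splitlines() if line.strip()}


def missing_processes(required, running):
    """Return the required process names that are absent from running."""
    return [name for name in required if name not in running]
```

Calling missing_processes(["sendmail", "httpd"], running_commands()) returns the names that would trigger a page or a warning; an empty list means all is well.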
System messages are monitored. Big Brother watches the system message file for NOTICE and WARNING conditions. NOTICE conditions result in the admin being paged immediately. WARNING conditions cause a yellow dot to appear. Clicking on the corresponding dot will report the message that caused the display.
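The message scan amounts to sorting log lines into two groups. The sketch below assumes a plain substring match on the words NOTICE and WARNING, which is a guess at the matching rule; the function name is made up.

```python
def scan_messages(lines):
    """Split message-file lines into those that page and those that warn.

    Returns (pages, warnings): NOTICE lines page the admin, WARNING lines
    produce a yellow dot.
    """
    pages, warnings = [], []
    for line in lines:
        if "NOTICE" in line:
            pages.append(line)
        elif "WARNING" in line:
            warnings.append(line)
    return pages, warnings
```

Keeping the matching lines, rather than just a count, is what allows the display to report the message that caused the condition.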
And finally, the whole system itself is monitored by the BBDISPLAY. Any report over 30 minutes old results in that report, and the entire screen, being marked in purple, indicating a likely loss of contact within the Big Brother system itself.
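The staleness rule can be expressed as a simple age comparison. In this sketch, timestamps are assumed to be seconds since the epoch, and the function and constant names are hypothetical.

```python
import time

MAX_AGE_SECONDS = 30 * 60  # reports older than 30 minutes are stale


def report_color(reported_color, report_time, now=None, max_age=MAX_AGE_SECONDS):
    """Return purple for a stale report, otherwise the color it reported."""
    if now is None:
        now = time.time()
    if now - report_time > max_age:
        return "purple"  # likely loss of contact with the reporting system
    return reported_color
```

Passing now explicitly makes the rule easy to check without waiting half an hour.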
Note that all of the above are configurable parameters.
Some of the guidelines involved in the design of Big Brother are the following:
Big Brother is not a replacement for a qualified and experienced Systems Administrator. On the contrary, it is a big brother to the Sys Admin. It does not shut down machines or terminate processes, although it could be programmed to do so. It just identifies and notifies.
Big Brother does not explicitly monitor individual hardware components. However, failure of a hardware component is very likely to cause a severe condition through loss of service.
Big Brother does not monitor the performance of the network, servers, databases, or any individual application. It will, however, provide information about CPU loads and implicit information about response time; for example, telnet connections have 15 seconds to answer.
Your Comments and Suggestions are welcome!
Send them via e-mail to sean@maclawran.ca
The photo which adorns the Big Brother Display is of Sean MacGuire. He is Big Brother... it is meant to be reminiscent of George Orwell's book 1984. That's why it's not a pretty picture.