How not to troubleshoot an unexplained server reboot

We asked our provider to investigate why one of our servers rebooted last night. In the process they accidentally rebooted it again. This is root’s bash_history from just before it happened; note line 971:

  954  2011-08-17_15:10:39 sar -q
  955  2011-08-17_15:10:59 sar -q|less
  956  2011-08-17_15:11:09 sar -r|less
  957  2011-08-17_15:11:24 last -x|less
  958  2011-08-17_15:11:49 history |grep -i shutd
  959  2011-08-17_15:11:21 history
  960  2011-08-17_15:11:32 date
  961  2011-08-17_15:13:52 cd /var/log/
  962  2011-08-17_15:13:53 ls
  963  2011-08-17_15:13:54 ls -lah
  964  2011-08-17_15:13:58 less audit/
  965  2011-08-17_15:14:04 less audit/audit.log
  966  2011-08-17_15:14:25 less secure
  967  2011-08-17_15:15:15 grep -v nagios secure | less
  968  2011-08-17_15:16:11 dmesg
  969  2011-08-17_15:17:57 sar -r
  970  2011-08-17_15:18:19 dmesg
  971  2011-08-17_15:18:30 dmesg | reboot
  972  2011-08-17_16:20:20 [LOGOUT]: xxxx     pts/2        2011-08-17 15:27 (xxx.xxx.xxx.xxx)


This prompted an informal discussion about habits that would prevent this kind of accidental command execution (the command was presumably meant to be dmesg | grep reboot). Suggestions included quoting the search term (a quoted command name still executes), never piping to grep (which gives up a lot of functionality), and searching only for partial words (in this case grepping for “reboo”).
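To see why quoting does not help, here is a minimal sketch using a harmless stand-in. The function name danger is made up for illustration and takes the place of reboot, so nothing here touches the real command:

```shell
# "danger" is a hypothetical, harmless stand-in for reboot.
danger() {
    cat > /dev/null          # swallow the piped input, much as reboot would ignore it
    echo "danger executed"
}

# Quoting the word does not make it a search pattern.
# Without grep in front of it, it is still the command name, so it runs.
echo "some log line" | "danger"
```

Worth knowing: in bash, quoting a command name also stops alias expansion, so a quoted "reboot" would bypass an alias and run the real binary.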

Probably the best one I can think of is aliasing dangerous commands like this, e.g.:

alias reboot='echo Please use /sbin/reboot or shutdown -r now to reboot this server'

You can add the line above to .bashrc for existing users, and adding it to /etc/skel/.bashrc will apply it to any user accounts created in future. The most important account is root (/root/.bashrc or /root/.bash_aliases, depending on distro). Ubuntu and Debian keep aliases in a separate ~/.bash_aliases file, which the .bashrc should source.
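If you want to script that, something along these lines should work. It is only a sketch: add_reboot_alias is a hypothetical helper name, and the paths in the example calls are the ones mentioned above, which may differ on your distro. It appends the alias only if the exact line is not already in the file, so it is safe to run more than once:

```shell
# Append the guard alias to a shell startup file, unless it is already there.
# add_reboot_alias is a hypothetical helper name, not a standard tool.
add_reboot_alias() {
    rcfile=$1
    line="alias reboot='echo Please use /sbin/reboot or shutdown -r now to reboot this server'"
    # grep -F: fixed string, -x: match the whole line, -q: no output.
    # If the file is missing or lacks the line, append it.
    if ! grep -Fxq "$line" "$rcfile" 2>/dev/null; then
        printf '%s\n' "$line" >> "$rcfile"
    fi
}

# Example calls (run as root; adjust the paths for your distro):
#   add_reboot_alias /root/.bashrc
#   add_reboot_alias /etc/skel/.bashrc
```

New login shells will pick up the alias; existing sessions need to re-read the file (for example with `. ~/.bashrc`).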

This is what would have happened if our friend had run the same command:

$ dmesg |reboot
Please use /sbin/reboot or shutdown -r now to reboot this server

Similar aliases could be created for halt and poweroff.
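As a rough sketch, they might look like this. The wording and the /sbin paths are my assumptions, modelled on the reboot alias, so check where the binaries actually live on your system:

```shell
# Guard aliases for the other two commands; message text is illustrative.
alias halt='echo Please use /sbin/halt or shutdown -h now to halt this server'
alias poweroff='echo Please use /sbin/poweroff to power off this server'
```

As with the reboot alias, these only protect interactive shells that have loaded them. Scripts and anything calling the full path are unaffected, which is the point.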

Fortunately in this case no damage was done, as the server is part of a load balanced pool!
