Category Archives: Sysadmin

How not to troubleshoot an unexplained server reboot

We asked our provider to investigate why one of our servers rebooted last night. In the process they accidentally rebooted it again… this is root’s bash_history just before it happened, note line 971:

  954  2011-08-17_15:10:39 sar -q
  955  2011-08-17_15:10:59 sar -q|less
  956  2011-08-17_15:11:09 sar -r|less
  957  2011-08-17_15:11:24 last -x|less
  958  2011-08-17_15:11:49 history |grep -i shutd
  959  2011-08-17_15:11:21 history
  960  2011-08-17_15:11:32 date
  961  2011-08-17_15:13:52 cd /var/log/
  962  2011-08-17_15:13:53 ls
  963  2011-08-17_15:13:54 ls -lah
  964  2011-08-17_15:13:58 less audit/
  965  2011-08-17_15:14:04 less audit/audit.log
  966  2011-08-17_15:14:25 less secure
  967  2011-08-17_15:15:15 grep -v nagios secure | less
  968  2011-08-17_15:16:11 dmesg
  969  2011-08-17_15:17:57 sar -r
  970  2011-08-17_15:18:19 dmesg
  971  2011-08-17_15:18:30 dmesg | reboot
  972  2011-08-17_16:20:20 [LOGOUT]: xxxx     pts/2        2011-08-17 15:27 (

Continue reading

Getting the most out of Terminator

Terminator in action

Terminator is a must-have tool for Linux administrators. It’s a terminal emulator that supports multiple terminals via tabs, but also by dividing up its window with horizontal and vertical splits.

The user documentation is a bit sparse, in fact what you see in the man page is what you get. In this post I’ll take you through the features that I think make Terminator the best terminal emulator around.
Continue reading

Recovering a RAID5 mdadm array with two failed devices

Before reading this article you should know that it is now quite old and there is a better method – ‘mdadm –assemble –force’ (it may have been there all along). This will try to assemble the array by marking previously failed drives as good. From the man page:

If mdadm cannot find enough working devices to start the array, but can find some devices that are recorded as having failed, then it will mark those devices as working so that the array can be started.

I would however strongly suggest that you first disconnect the drive that failed first. If you need to discover which device failed first, or assemble doesn’t work and you need to manually recreate the array, then read on.

I found myself in an interesting situation with my parents home server today (Ubuntu 10.04). Hardware wise it’s not the best setup – two of the drives are in an external enclose connected with eSATA cables. I did encourage Dad to buy a proper enclosure, but was unsuccessful. This is a demonstration of why eSATA is a very bad idea for RAID devices.

What happened was that one of the cables had been bumped, disconnecting one of the drives. Thus the array was running in a degraded state for over a month – not good. Anyway I noticed this when logging in one day to fix something else. The device wasn’t visible so I told Dad to check the cable, but unfortunately when he went to secure the cable, he must have somehow disconnected the another one. This caused a second drive to fail so the array immediately stopped.

Despite having no hardware failure, the situation is similar to someone replacing the wrong drive in a raid array. Recovering it was an interesting experience, so here I’ve documented the process.
Continue reading

Splitting files with dd

We have an ESXi box hosted with Rackspace, it took a bit of pushing to get them to install ESXi it in the first place as they tried to get us to use their cloud offering. But this is a staging environment and we want something dedicated on hardware we control so we can get an idea of performance without other people’s workloads muddying the water.

Anyway, I’ve been having a bit of fun getting our server template uploaded to it, which is only 11GB compressed – not exactly large, but apparently large enough to be inconvenient.

In my experience the datastore upload tool in the vSphere client frequently fails on large files. In this case I was getting the “Failed to log into NFC server” error, which is probably due to a requisite port not being open. I didn’t like that tool anyway, move on.

The trusty-but-slow scp method was also failing however. Uploads would start but consistently stall at about the 1GB mark. Not sure if it’s a buffer or something getting filled in dropbear (which is designed to be a lightweight ssh server and really shouldn’t need to deal with files this large), but Googling didn’t turn up much.
Continue reading

Likewise Open – problems rejoining domain after upgrade

There seems to be a common problem with Likewise open not gracefully upgrading on Ubuntu, e.g. – upgrading a system from the distribution supplied Likewise-open 5 in Ubuntu 10.10 to the latest packages from the Likewise website (Likewise 6.0 at the time of writing).

The system in this case was an old Ubuntu 9.10 server using Likewise Open 5. After some patching and an update to the current vmware tools it started failing to authenticate domain users, so I decided to upgrade to the latest version. However after the upgrade I was getting an error when trying to join the domain:

Error: ERROR_FILE_NOT_FOUND code 0x00000002

The obvious solution is to remove all likewise packages and purge the config, however that didn’t seem to work either. What DID work, was removing & purging the config, manually removing a few directories that were not empty, purging a few other seemingly related packages which were marked as no longer required after the uninstall, and finally reinstalling.
Continue reading

Bash script to alert when memory gets low

We have a web server that’s running out of memory about once every couple of weeks. The problem is that when it happens the swap file is totally full, the system is unresponsive and it usually needs a hard reboot. So it’s a bit difficult to debug. To avoid digging through log files I don’t understand I elected to put a script in /etc/cron.hourly which checks the total amount of free memory (including swap and physical). If there is less than 256mb free (this server has 512mb of ram and a 1gb swap so at this point the situation is serious), it dumps the process list to /tmp/processes.txt and sends me an email with it attached.

Note that mutt must be installed (‘apt-get install mutt’ on Debian/Ubuntu, or ‘yum install mutt’ on RedHat/CentOS).


free=`free -mt | grep Total | awk '{print $4}'`

if [ $free -lt 256 ]; then
        ps -eo %mem,pid,user,args >/tmp/processes.txt
        echo 'Warning, free memory is '$free'mb' | mutt -a /tmp/processes.txt -s "Server alert"

Then of course make it executable and symlink to cron.hourly:

chmod +x /etc/scripts/
ln -s -t /etc/cron.hourly/ /etc/scripts/