Category Archives: Sysadmin

How not to program Bash

Came across this gem today:
[shell]
COMP=$1

if [ -e $COMP ]; then
echo “Please supply a competition_id”
exit
fi
[/shell]

On first read it looks backwards – “if exist $COMP then exit”, and the programmer clearly wanted to do the exact opposite. However it actually works as desired. Mostly.

If no argument is supplied, $COMP will be a blank string, i.e. “”. Thus [ -e $COMP ] will parse to [ -e ], which returns 0 (effectively true). It’s basically like saying “does nothing exist”.

This is because the -e argument tests whether a file exists. So if an argument is supplied, [ -e $COMP ] will return 1 (effectively false), which does what the programmer intended UNLESS the argument supplied is a valid file path.

In this case a number was expected, so sure it’s unlikely to fail in this way, but it’s still an incredibly bad way to test if an argument is set. Not to mention confusing to read!

The correct way to test this by the way would be to use -z, which simply tests if a string’s length is zero:
[shell]
COMP=$1

if [ -z “$COMP” ]; then
echo “Please supply a competition id”
exit
fi
[/shell]

Or better still, use getopts. For more info run ‘man test’ from a bash terminal.

Xen Server “The SR failed to complete the operation”

Had this problem when trying to start a newly created VM and install from certain ISO files. Some ISO images would work and others would not.

Removing the ISO image allowed the VM to start but obviously there was no OS image to install from.

The error message is generic and not very helpful, but if you do encounter it check that the ISO is readable by the Xen server. In our case it was on an NFS share but the permissions were read-only to all but the owner…. so a simple `chmod a+r *.iso` fixed it!

Percona Monitoring Plugins for Cacti and Redis auth

In implementing the Percona Monitoring Plugins for Redis on our Cacti server we discovered that they don’t support authentication. This creates a problem when your servers require authentication to issue the “INFO” command.

The Percona templates use ss_get_by_ssh.php to fetch the data, and there are functions specific to Redis in this file. So I added a variable to store the password, and modified the redis_get function to run an AUTH command on the socket before INFO.
Continue reading

I Sincerely Hope I Never Write a Script like this Again

This sort of stuff destroys your soul:
[shell]
#!/bin/bash
logfile=~/slave-watcher.log

while true; do
status=$(mysql –execute=”show slave status\G”|grep “Seconds_Behind_Master:”|awk ‘{print $2}’)

if [ $status == “NULL” ]; then
mysql –execute=”show slave status\G” | grep “Last_SQL_Error:” | tee -a $logfile
mysql –execute=”set global sql_slave_skip_counter=1; start slave;”
fi

sleep 1
done
[/shell]
What it’s doing is looking at MySQL’s slave status and skipping over any statement that causes an error. There are so many reasons why you should never do this, and I won’t go into detail on what necessitated it here, but you’ll be glad to know the slave I set this loose on was not in production!

How not to troubleshoot an unexplained server reboot

We asked our provider to investigate why one of our servers rebooted last night. In the process they accidentally rebooted it again… this is root’s bash_history just before it happened, note line 971:

  954  2011-08-17_15:10:39 sar -q
  955  2011-08-17_15:10:59 sar -q|less
  956  2011-08-17_15:11:09 sar -r|less
  957  2011-08-17_15:11:24 last -x|less
  958  2011-08-17_15:11:49 history |grep -i shutd
  959  2011-08-17_15:11:21 history
  960  2011-08-17_15:11:32 date
  961  2011-08-17_15:13:52 cd /var/log/
  962  2011-08-17_15:13:53 ls
  963  2011-08-17_15:13:54 ls -lah
  964  2011-08-17_15:13:58 less audit/
  965  2011-08-17_15:14:04 less audit/audit.log
  966  2011-08-17_15:14:25 less secure
  967  2011-08-17_15:15:15 grep -v nagios secure | less
  968  2011-08-17_15:16:11 dmesg
  969  2011-08-17_15:17:57 sar -r
  970  2011-08-17_15:18:19 dmesg
  971  2011-08-17_15:18:30 dmesg | reboot
  972  2011-08-17_16:20:20 [LOGOUT]: xxxx     pts/2        2011-08-17 15:27 (xxx.xxx.xxx.xxx)

Continue reading

Getting the most out of Terminator

Terminator in action

Terminator is a must-have tool for Linux administrators. It’s a terminal emulator that supports multiple terminals via tabs, but also by dividing up its window with horizontal and vertical splits.

The user documentation is a bit sparse, in fact what you see in the man page is what you get. In this post I’ll take you through the features that I think make Terminator the best terminal emulator around.
Continue reading

Recovering a RAID5 mdadm array with two failed devices

Update 1/11/2019

If you’ve reached this article by Googling how to recover a RAID array, I suggest you don’t use this guide. The Linux RAID wiki has much more correct, complete, and authoritative information. In retrospect I was very lucky not to destroy the data, but this article is an interesting account of how I did it.

Update

Before reading this article you should know that it is now quite old and there is a better method – ‘mdadm –assemble –force’ (it may have been there all along). This will try to assemble the array by marking previously failed drives as good. From the man page:

If mdadm cannot find enough working devices to start the array, but can find some devices that are recorded as having failed, then it will mark those devices as working so that the array can be started.

I would however strongly suggest that you first disconnect the drive that failed first. If you need to discover which device failed first, or assemble doesn’t work and you need to manually recreate the array, then read on.

I found myself in an interesting situation with my parents home server today (Ubuntu 10.04). Hardware wise it’s not the best setup – two of the drives are in an external enclose connected with eSATA cables. I did encourage Dad to buy a proper enclosure, but was unsuccessful. This is a demonstration of why eSATA is a very bad idea for RAID devices.

What happened was that one of the cables had been bumped, disconnecting one of the drives. Thus the array was running in a degraded state for over a month – not good. Anyway I noticed this when logging in one day to fix something else. The device wasn’t visible so I told Dad to check the cable, but unfortunately when he went to secure the cable, he must have somehow disconnected the another one. This caused a second drive to fail so the array immediately stopped.

Despite having no hardware failure, the situation is similar to someone replacing the wrong drive in a raid array. Recovering it was an interesting experience, so here I’ve documented the process.
Continue reading

Splitting files with dd

We have an ESXi box hosted with Rackspace, it took a bit of pushing to get them to install ESXi it in the first place as they tried to get us to use their cloud offering. But this is a staging environment and we want something dedicated on hardware we control so we can get an idea of performance without other people’s workloads muddying the water.

Anyway, I’ve been having a bit of fun getting our server template uploaded to it, which is only 11GB compressed – not exactly large, but apparently large enough to be inconvenient.

In my experience the datastore upload tool in the vSphere client frequently fails on large files. In this case I was getting the “Failed to log into NFC server” error, which is probably due to a requisite port not being open. I didn’t like that tool anyway, move on.

The trusty-but-slow scp method was also failing however. Uploads would start but consistently stall at about the 1GB mark. Not sure if it’s a buffer or something getting filled in dropbear (which is designed to be a lightweight ssh server and really shouldn’t need to deal with files this large), but Googling didn’t turn up much.
Continue reading

Likewise Open – problems rejoining domain after upgrade

There seems to be a common problem with Likewise open not gracefully upgrading on Ubuntu, e.g. – upgrading a system from the distribution supplied Likewise-open 5 in Ubuntu 10.10 to the latest packages from the Likewise website (Likewise 6.0 at the time of writing).

The system in this case was an old Ubuntu 9.10 server using Likewise Open 5. After some patching and an update to the current vmware tools it started failing to authenticate domain users, so I decided to upgrade to the latest version. However after the upgrade I was getting an error when trying to join the domain:

Error: ERROR_FILE_NOT_FOUND code 0x00000002

The obvious solution is to remove all likewise packages and purge the config, however that didn’t seem to work either. What DID work, was removing & purging the config, manually removing a few directories that were not empty, purging a few other seemingly related packages which were marked as no longer required after the uninstall, and finally reinstalling.
Continue reading