phaq

“a geeks daily life”

Archive for the 'HA' Category

Is RAID1 possible on a USB stick?

Friday, October 26th, 2007

Last week we had a discussion at the office about whether it would be possible to span a RAID across USB sticks.
That question came up as a joke while I was working on a RAID system for evaluation purposes.
My friend doubted it when I replied that it would definitely work with a FreeBSD software RAID using gmirror (geom vinum works too, as a matter of fact).

Proof?

Here it is, a ‘dmesg’ from my Sony Vaio PCG-C1MGP booted off two gmirrored 256 MB USB sticks:

Copyright (c) 1992-2007 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 6.2-RELEASE #0: Fri Jan 12 10:40:27 UTC 2007
root@dessler.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Transmeta(tm) Crusoe(tm) Processor TM5800 (727.84-MHz 586-class CPU)
  Origin = "GenuineTmx86" Id = 0x543  Stepping = 3
  Features=0x80893f
real memory  = 251658240 (240 MB)
avail memory = 232452096 (221 MB)
kbd1 at kbdmux9
ath_hal: 0.9.17.2 (AR5210, AR5211, AR5212, RF5111, RF5112, RF2413, RF5413)
acpi0:  on motherboard
Timecounter "ACPI-safe" frequency 3579545 Hz quality 1000
acpi_ec0:  port 0x62,0x66 on acpi0
acpi_timer0: <32-bit timer at 3.579545MHz> port 0x8008-0x800b on acpi0
acpi_lid0:  on acpi0
acpi_button0:  on acpi0
[ output omitted ]
umass0: Sony USB Memory Stick Slot, rev 1.10/1.83 addr2
umass1: vendor 0x4146 USB Mass Storage Device, rev 2.00/1.00, addr 2
umass2: vendor 0x4146 USB Mass Storage Device, rev 2.00/1.00, addr 3
[ output omitted ]
da0 at umass-sim1 bus 1 target 0 lun 0
da0: <-pretec 256 MB 1.10> Removable Direct Access SCSI device
da0: 1.000MB/s transfers
da0: 242 MB (4964000 512 byte sectors: 64H 32S/T 242C)
da1 at umass-sim1 bus 2 target 0 lun 0
da1: <-pretec 256 MB 1.10> Removable Direct Access SCSI device
da1: 1.000MB/s transfers
da1: 242 MB (4964000 512 byte sectors: 64H 32S/T 242C)
GEOM_MIRROR: Device gm0 created (id=1986392903).
GEOM_MIRROR: Device gm0: provider da0 detected.
GEOM_MIRROR: Device gm0: provider da1 detected.
GEOM_MIRROR: Device gm0: provider da0 activated.
GEOM_MIRROR: Device gm0: provider da1 activated.
GEOM_MIRROR: Device gm0: provider mirror/gm0 launched.
Trying to mount root from ufs:/dev/mirror/gm0s1a

Of course it’s not incredibly fast, but it works after all, and that was the whole point :-)
Where could it be used? Projects like FreeNAS, which support USB installs, could possibly benefit from doing RAID1 on the sticks while also storing sensitive configuration data on them.
I could also imagine taking backups this way: keep one working copy on the active stick in the computer while swapping in spare sticks, which then automatically rebuild the mirror.

I suppose this also works with Linux ‘md’ software RAID and NetBSD’s RAIDframe, though I have not tested either.

What about Windows? Definitely not with stock functionality. However, as there’s also a way to patch software RAID1 functionality into Windows 2000 and XP, one never knows … ;-)

FreeBSD software RAID0: gvinum vs. gstripe

Thursday, June 7th, 2007

Some time ago I announced that I would review FreeBSD’s GEOM software RAID implementations.

Today’s article compares geom stripe (gstripe) with geom vinum (gvinum) for disk striping (RAID0).

All testing was done on the same hardware as before to get results comparable to previous tests.

Benchmarks were taken using stripe sizes of 64k, 128k and 256k and measured using dd, bonnie++ and rawio as before.

As for the technology, gstripe follows the same approach as gmirror, which I looked at previously.

# rawio benchmark results

rawio was chosen to measure I/O speed during concurrent access. rawio was set to run all tests (random read, seq read, random write, seq write) with eight processes on the /dev/stripe/* and /dev/gvinum/* devices.

Results for the single disk are provided as well to compare performance not only between the different frameworks but also against the native disk performance.

[Images: rawio_bench_values (result values) and rawio_bench (chart)]

# dd benchmark results

dd was chosen to measure raw block access to the /dev/stripe/* and /dev/gvinum/* devices. dd was set to run sequential read and write tests using block sizes from 16k to 1024k.
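
A run like this could be scripted roughly as follows. This is only a sketch: the device name /dev/stripe/st0, the amount of data moved per run and the doubling of the block size between runs are assumptions, not the settings used for the published numbers. The function only prints the dd commands, because the write test overwrites whatever is on the device.

```shell
#!/bin/sh
# Sketch only: print one read test and one write test per block size.
# DEVICE and TOTAL_KB are hypothetical values, not taken from the article.
DEVICE="${DEVICE:-/dev/stripe/st0}"
TOTAL_KB="${TOTAL_KB:-1048576}"   # 1 GiB moved per run (assumption)

dd_plan() {
    bs=16
    while [ "$bs" -le 1024 ]; do
        # keep the amount of data constant across block sizes
        count=$((TOTAL_KB / bs))
        echo "dd if=$DEVICE of=/dev/null bs=${bs}k count=$count"
        # WARNING: the write test destroys the data on DEVICE
        echo "dd if=/dev/zero of=$DEVICE bs=${bs}k count=$count"
        bs=$((bs * 2))
    done
}

dd_plan
```

Piping the output through sh (on a scratch device only) would execute the tests; dd reports the transfer rate for each run.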

[Images: dd_bench_values (result values) and dd_bench (chart)]

# bonnie++ benchmark results

Finally, bonnie++ was used to get pure file system performance.

[Images: bonnie_bench_values (result values) and bonnie_bench (chart)]

# conclusion

Looking at raw disk access, I must conclude that neither framework beats single-disk performance overall when it comes to block-wise input/output with dd.
gvinum generally performs better than gstripe except when using 256k stripe sizes.

Since ‘dd’ is very synthetic by its nature, rawio gives a much better picture of how the devices would perform in a more “real-life” situation.
Although the rawio benchmark results may look low, these numbers were achieved by running 8 processes at once. They best reflect what can be expected in a true multi-user environment with concurrent access.
The results show no absolute winner: depending on the stripe size, either implementation outperforms the other.

Finally, for bonnie++ we see some interesting results. Performance is almost identical for all implementations.
One notable exception was seen with gvinum (64k stripe size), which clearly outperformed its competitors.
One must keep in mind that the first six tests performed by bonnie++ (rand delete/read/create, seq delete/read/create) are limited by the I/O performance of both the system bus and the device itself. The hardware I used for testing was capable of about 160 - 170 I/Os per second. I admit that the results could differ if the tests were re-run on faster hardware with higher I/O throughput. It’s possible that modern hardware would reveal an I/O barrier for abstracted devices which cannot be seen in my tests.

Personally I prefer gstripe over gvinum because of its simpler configuration approach. In terms of performance, however, gvinum seems to have the edge when it comes to disk striping.

The next article will discuss gvinum and gstripe for RAID10.

Convert Single Disk to GEOM Mirror

Saturday, October 28th, 2006

This article should have been published last month; unfortunately the draft was left forgotten in my mailbox for a while.

As some may remember, I published two articles earlier this year on how to set up GEOM disk mirroring on alpha and sparc. These articles were originally based upon Ralf Engelschall’s disk mirroring howto.

I have installed GEOM disk mirrors on various occasions since then, though I always felt that converting to a GEOM mirror that way involved too much work.

So I went looking for a better and faster way to achieve the task, and I found one.

But be warned first: it is faster and, IMHO, even less error-prone, but still dangerous if you are not very careful. Above all, this procedure involves single user mode, so local access (or a serial terminal if you do this remotely) is a must!

The instructions assume that you have two hard disks of identical size, /dev/ad0 and /dev/ad1. Adapt the instructions to fit your own system.

#1 Reboot into Single User Mode

#2 fsck and mount / fs read/write

First run an fsck on your root filesystem:

#fsck -p /

Then mount it with read/write access:

#mount -w /

#3 Edit /etc/fstab

First change your /etc/fstab. Replace all single-disk slices with the GEOM provider. The example shows a disk with one root and one swap partition.

# Device            Mountpoint  FStype  Options  Dump  Pass#
/dev/ad0s1b         none        swap    sw       0     0
/dev/ad0s1a         /           ufs     rw       1     1

Change it to look similar to this:

# Device            Mountpoint  FStype  Options  Dump  Pass#
/dev/mirror/gm0s1b  none        swap    sw       0     0
/dev/mirror/gm0s1a  /           ufs     rw       1     1
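
If the file has many entries, the substitution can be previewed with sed before editing by hand. The following is a minimal sketch, assuming all entries to be changed start with /dev/ad0s; the function name is made up for illustration, and it only filters text from stdin to stdout without touching /etc/fstab.

```shell
#!/bin/sh
# Sketch only: rewrite device names from the single disk to the mirror
# provider. Reads fstab text on stdin and writes the result to stdout,
# so nothing on disk is changed. "ad0" and "gm0" are the names used in
# this article; to_mirror_fstab is a hypothetical helper name.
to_mirror_fstab() {
    sed 's|^/dev/ad0s|/dev/mirror/gm0s|'
}

# Preview against the live file (does not modify it):
# to_mirror_fstab < /etc/fstab
printf '/dev/ad0s1a\t/\tufs\trw\t1\t1\n' | to_mirror_fstab
```

Comment lines and entries for other devices pass through unchanged, so the output can be compared with the original before anything is written back.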

#4 Enable GEOM MIRROR kernel module

Edit your /boot/loader.conf and add the following line:

geom_mirror_load="YES"

#5 Enable Crash Dumps (optional, but recommended)

If you want to enable crash dumps to your GEOM provider you should add this line to your /etc/rc.early:

gmirror configure -b prefer gm0

And this should go to your /etc/rc.local:

gmirror configure -b split gm0

Have a look at gmirror(8) to understand what this means exactly.

#6 mount / fs read-only

Sync and mount your root file system read-only. It is very important that this step completes successfully; this is also why the procedure must be done in single user mode.

#sync && sync
#mount -r /

#7 Initialize GEOM provider

First you must raise the GEOM debug flags:

#sysctl kern.geom.debugflags=16

This allows a GEOM provider to be initialized even while the kernel holds the device open. Then label the mirror:

#gmirror label -v -b split gm0 ad0

#8 Reset

Now it’s best to press the reset button (yes, you read that right!). This is absolutely safe as long as your root fs is in read-only mode.

Do not try to invoke reboot / shutdown, as this will lead to a kernel panic. By labelling the GEOM provider you have taken the underlying device away from the kernel.

The machine would reboot after the panic anyway, but the reset button is the cleaner choice.

#9 Boot

Your system should now boot from the GEOM mirror gm0 with your /dev/ad0 as the first provider.

It does not actually matter whether you boot into multi-user or single-user mode. In either case you must add the second hard disk to the GEOM mirror.

#gmirror insert gm0 ad1

This will cause your /dev/ad1 to be synced against /dev/ad0. Your dmesg should print something like this when starting:

Aug 19 11:26:37 localhost kernel: GEOM_MIRROR: Device gm0: rebuilding provider ad1

And something like this when finished:

Aug 19 12:34:17 localhost kernel: GEOM_MIRROR: Device gm0: rebuilding provider ad1 finished.
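
Waiting for that message can be scripted. The function below is a small sketch that checks log text on stdin for the "finished" line shown above; the function name and the log path in the usage comment are assumptions.

```shell
#!/bin/sh
# Sketch only: succeed once the log text on stdin reports that the
# rebuild of the given provider has finished. The message format is the
# one shown above; rebuild_finished is a hypothetical helper name.
rebuild_finished() {
    grep -q "GEOM_MIRROR: Device gm0: rebuilding provider $1 finished\."
}

# Possible use on a live system (the log path is an assumption):
# rebuild_finished ad1 < /var/log/messages && echo "mirror is in sync"
```

While the rebuild is still running, "gmirror status" can be used to watch the progress.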

Congratulations: your GEOM mirror is now up and running!

These instructions were originally put together when trying to upgrade my PC-BSD to a GEOM mirror. They have been verified at least a dozen times since then in the lab.

Realtime File System Replication On FreeBSD

Friday, August 11th, 2006

This article describes a concept for implementing realtime file system replication on a dual-node FreeBSD cluster to provide real HA services.
Maybe you are familiar with DRBD (distributed replicated block device) from the Linux world already, which basically does something we could call network-RAID1.

Since DRBD does not run on FreeBSD one might be tempted to believe that realtime file system replication would not be possible at all. This is not true however. FreeBSD provides you with two valuable geom classes which will allow you to implement a very similar setup: ggate and gmirror.

Requirements

The absolute minimum requirements for this setup are as follows:

  • two hardware nodes running FreeBSD
  • ethernet connection between both nodes
  • a free (as in “unused”) disk slice on each node

All right, this is just good enough to get it going.
If you are serious about using it, you may want to aim for something better than that:

  • use FreeBSD 6.x whenever possible, 5.x has some serious locking issues
  • Don’t use the same ethernet connection for public access AND replication; use a dedicated interface instead, preferably Gigabit ethernet. We’re talking about data replication over a LAN here, so latency and network load are a concern.
  • For the same reasons you should avoid geographic separation, especially over slow links or VPN. Stay within the same network segment.
  • Use identical hardware for both nodes.
  • Use identical disk partition and slice setup on both nodes.
  • Use fast disks and fast disk controllers with good IO performance.
  • Refrain from using geom/ataraid or other software RAID on partitions/slices mirrored to the second node. Use a real hardware RAID controller instead. If you don’t, deadlocks may occur.
  • Keep the partitions to be mirrored as small as possible. The reason for this is that a complete resync is required if the mirror breaks. While a 20 GB partition might synchronize within ~30 minutes across a 100 Mbit network, a 500 GB partition will take over 11 hours.
  • You should probably not export more than one disk slice to a remote node. Every request (especially with lots and lots of write transactions) is sent over your network, which causes load and latency on both nodes.
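
The resync times quoted in the list can be reproduced with a back-of-the-envelope calculation. The sketch below assumes decimal units (1 GB = 8000 Mbit) and a link that is fully available for the copy; it ignores protocol overhead and disk speed, so real resyncs will take longer. The function name is made up for illustration.

```shell
#!/bin/sh
# Sketch only: best-case time to copy a partition across the replication
# link. Uses decimal units (1 GB = 8000 Mbit) and ignores protocol
# overhead and disk speed. sync_seconds is a hypothetical helper name.
sync_seconds() {
    size_gb=$1
    link_mbit=$2
    echo $(( size_gb * 8000 / link_mbit ))
}

echo "20 GB over 100 Mbit: $(sync_seconds 20 100) seconds"
echo "500 GB over 100 Mbit: $(sync_seconds 500 100) seconds"
```

This gives 1600 seconds (about 27 minutes) for 20 GB and 40000 seconds (a little over 11 hours) for 500 GB on a 100 Mbit link, which matches the figures above.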

Pros

  • Build a two-node HA cluster using FreeBSD
  • Implement realtime file system replication for mission critical failover scenarios
  • Use commodity hardware, no need for special shared storage like SAN or iSCSI
  • Do not rely on snapshot-based synchronisation (like rsync for example)
  • Do not rely on NFS or other file servers which could become a single point of failure themselves

Cons

  • Still experimental, not tested under heavy load, possibly unstable
  • No support, if it breaks you’re on your own
  • Implementation not as mature as DRBD
  • Still a lot of manual work involved

#1 General System Setup

I have already given some recommendations for the system setup above. Sticking to them may save you some trouble.

When you install FreeBSD, make sure you pick a current 6.x series release. The 5.x series might work too, but it was a bit flaky for me due to locking issues. YMMV.

There are no special considerations except for the partition layout: reserve a partition to hold the data to be replicated to the remote host. Don’t make it too big, as the whole thing has to be synchronized over the network.

Choose the size according to your actual disk space requirement, the network speed and latency, and the IO performance of your system. A 500 GB partition may be too big, even when running over Gigabit ethernet. A size anywhere from 100 megs to 20 gigs may be ok though.
Since you will hopefully have two identical nodes, make the partition tables and disk slices match each other. This greatly reduces issues caused by differing device names.
You should also refrain from using any geom/ataraid software RAID on the disks/slices to be exported. Remember that you are already running a software RAID1 over the network. Placing another software RAID on the underlying device will lead to deadlocks in most cases. It also doubles the load on your system, as the data actually has to be written out four times.
If you really want the additional safety of local disk RAID, do yourself a favor and use a real hardware RAID controller instead. This will also help you get good IO performance. Of course, fast disks are a must then.

My setup consisted of two machines with Intel P-III 800 MHz CPU, 1 GB RAM, two 100 Mbit network interfaces (one public, one private) and a RAID1 array with 20 GB SCSI disks (I used ICP Vortex controllers).

This is what my disk slices look like:

/dev/da0s1a / 8 GB
/dev/da0s1b swap 2 GB
/dev/da0s1d [unused] 10 GB

#2 Enable Kernel Modules

Now make sure both nodes support the GEOM mirroring module. Enable it by adding the following line to your /boot/loader.conf:

geom_mirror_load="YES"

Do the same for the GEOM gate module:

geom_gate_load="YES"

If your secure level allows kernel modules to be loaded at runtime, you may omit these steps.
Check it like this:
#sysctl kern.securelevel

Any return value other than 0 or -1 means that kernel modules cannot be loaded at runtime. In this case a reboot is required to load the modules. But check out step #3 first.
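
The same decision can be made in a script. This is a minimal sketch of the rule just described; the function name is made up for illustration, and the hard-coded level only stands in for the value that sysctl would report on a live system.

```shell
#!/bin/sh
# Sketch only: decide from a securelevel value whether kernel modules
# can still be loaded at runtime (true only for -1 and 0, as described
# above). can_load_modules is a hypothetical helper name.
can_load_modules() {
    case "$1" in
        -1|0) return 0 ;;
        *)    return 1 ;;
    esac
}

# On a live FreeBSD system the value would come from sysctl:
# level=$(sysctl -n kern.securelevel)
level=0   # placeholder value for illustration
if can_load_modules "$level"; then
    echo "securelevel $level: modules can be loaded at runtime (kldload)"
else
    echo "securelevel $level: reboot needed, rely on /boot/loader.conf"
fi
```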

#3 Configure Network Interfaces

Make sure your network interfaces are configured properly.

Since I have two of them, I use one as the public interface and the other as the private one.

The latter uses private IP addresses according to RFC 1918 and is connected to the remote host with a crossover cable.

On both hosts fxp0 is the public interface (using the address 172.16.100.1 on the master node and 172.16.100.2 on the failover node).

On the master node the additional public IP address 172.16.100.12 is bound as an alias and used to provide public services. It is monitored by freevrrpd and moves over to the failover node when required.

fxp1 is the private interface used for data replication (192.168.100.1 for the master node and 192.168.100.2 for the failover node).

Restart networking or reboot the machine (if required by step #2), whatever applies to you.

#4 Install Failover Software

On FreeBSD freevrrpd may be used for IP takeovers and optional script execution. Install it from the ports (/usr/ports/net/freevrrpd/) or as a binary package (pkg_add -r freevrrpd).

The configuration of the failover setup is fairly easy and well documented in the freevrrpd man page. An example might look like this:

#
# config for usual master server
#
[VRID]
serverid = 1
interface = fxp0
priority = 255 # denotes priority = master
addr = 172.16.100.12/32 # denotes failover IP
password = anyoneulike
masterscript = /usr/local/bin/become_master
backupscript = /usr/local/bin/become_standby

And this would be an example for a standby node:

#
# config for usual standby server
#
[VRID]
serverid = 1
interface = fxp0
priority = 240 # denotes priority = failover
addr = 172.16.100.12/32 # denotes failover IP
password = anyoneulike
masterscript = /usr/local/bin/become_master
backupscript = /usr/local/bin/become_standby

I’d strongly recommend reading the man page and adapting the config file to your needs. You will also need to write the master and backup scripts which perform the actions required for the failover to work properly.

I leave this up to you as this is beyond the scope of this howto.

#5 Export Disk Slices

Now export the slices which shall be used for replication (/dev/da0s1d in my case). You do this by creating a file called /etc/gg.exports on the master server:

192.168.100.2 RW /dev/da0s1d

And the same on the standby server:

192.168.100.1 RW /dev/da0s1d

You’ll find more on this in the ggated man page. Basically you’re just exporting the underlying device to the given IP address in read/write mode.

Now since ggated does not support any password protection or encryption at all it is best to use a dedicated network for this anyway. This will also lower the load you place on the public network segment.
For optimum performance Gigabit ethernet is recommended.

When you’re set with the config files, ggated must be started on the failover node (yes: the failover node, not on the master!). You do this by running:

#ggated -v

This runs ggated in verbose mode in the foreground, which is useful for debugging purposes. Later on, when everything works fine, the option can be omitted.

Please note that you should not export the partition on both nodes at the same time. Run ggated only on the host which is the current failover node. Use the freevrrpd master/backup scripts to start/stop the service as required.

#6 Import Disk Slices

Turning to the primary node, the remote disk slices must now be imported.

This is done through ggatec, the client component of ggated. Run it as follows:

#ggatec create 192.168.100.2 /dev/da0s1d

This command returns the device node name; for the first device created this is usually ‘ggate0’.

Note that you should run ggatec only on the designated primary node. Use the freevrrpd master/backup script facilities to create/delete the ggate device node according to its state.

Do not create the device node on the failover node as long as it is not in primary state. Do not delete the device node as long as the host is in master state (except for recovery purpose, but this will be covered later).

#7 Setup Replication

Now it’s actually time to bring up replication. This is where the gmirror kernel module enabled previously comes in handy.

Make sure you’re on the primary node, then initialize a new GEOM mirror:

#gmirror label -v -n -b prefer gm0 /dev/ggate0

Then insert the local disk slice:

#gmirror insert -p 100 gm0 /dev/ad0s1e

Rebuild the mirror:

#gmirror rebuild gm0 ggate0

If you want to use the geom mirror auto synchronisation features, you can enable these as follows:

#gmirror configure -a gm0

This causes the disk slices to be synchronized: the data from the local ad0s1e is copied over to the remote ggate0 device.

This will surely take some time, depending on the size of your partition and the speed of your network. When finished, a message like this will appear in the dmesg log of your primary node:

GEOM_MIRROR: Device gm0: rebuilding provider ggate0 finished.
GEOM_MIRROR: Device gm0: provider ggate0 activated.

You may have noticed the “prefer” balance algorithm. This setting actually means that read requests shall only be directed to the geom provider with the highest priority.

By adding /dev/ad0s1e (which is always the local disk) with a priority of 100 (actually any priority higher than that of ggate0 according to the “gmirror list gm0” output is fine) you force all read requests to be directed to this device only.

You could actually use the “round-robin” balance algorithm as well; however, this requires a fast network connection with low latency, otherwise your read performance will drop significantly.

You may now “newfs” your gm0 device, mount and use it as you would with any other data partition.

First of all, you should now test the setup. Monitor the system performance on both hosts by using “vmstat” or a similar tool. Keep an eye on network interface and IO statistics.

If you experience lags, timeouts or sluggish behaviour during everyday actions like copying files and directories, these statistics will help you find the cause. In most cases it’s related to network bandwidth or limits in disk IO.

#8 Failing-Over To The Standby Node

Now that your replication is up and running it’s time to test a failover scenario. We do it by hand so you can see what you actually need to put in freevrrpd master/backup scripts for this purpose.

So go and unplug your current master node (yes, really do it. If you don’t do it now you’ll never do it and it is likely to never work properly).

So you unplugged it? Fine, that’s what we want.
Now connect to your failover node and stop the ggated service.

This should cause geom mirror to pick up the gm0 device with provider /dev/da0s1e automatically.

GEOM_MIRROR: Device gm0 created (id=2381431211).
GEOM_MIRROR: Device gm0: provider ad0s1e detected.
GEOM_MIRROR: Force device gm0 start due to timeout.
GEOM_MIRROR: Device gm0: provider ad0s1e activated.
GEOM_MIRROR: Device gm0: provider mirror/gm0 launched.

It may take a few moments for the device to become ready.

Now you must run fsck to ensure filesystem integrity (you really must do this as the filesystem will always be dirty):

#fsck -t ufs /dev/mirror/gm0

Then you can mount the device:

#mount /dev/mirror/gm0 /mnt
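
The manual steps above could be collected into a freevrrpd master script along the following lines. This is a hedged sketch, not a tested failover script: stopping ggated with pkill, the function names and the DRY_RUN switch are assumptions made for illustration, and by default the commands are only printed.

```shell
#!/bin/sh
# Sketch only: the manual failover steps as one function. With DRY_RUN=1
# (the default here) commands are printed instead of executed.
# run and become_master are hypothetical helper names.
DRY_RUN="${DRY_RUN:-1}"
MIRROR=/dev/mirror/gm0
MOUNTPOINT=/mnt

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

become_master() {
    # stop exporting the local slice (assumes pkill is an acceptable way
    # to stop ggated on your system)
    run pkill ggated
    # the file system is always dirty after a failover
    run fsck -t ufs "$MIRROR" || return 1
    run mount "$MIRROR" "$MOUNTPOINT"
}

become_master
```

A real script would also have to wait until the mirror device has appeared and would probably need a non-interactive fsck invocation; both are left out here.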

Step #9 will explain how the mirror may be rebuilt if the previous master node becomes available again.

#9 Recovering

To bring the master host back into the active cluster you will need to make sure that the gm0 device is actually shut down on the failed host.

You remember that we enabled permanent loading of the geom mirror module previously?
This is required to circumvent some problematic situations when a kernel secure level is in effect. But it also means that geom mirror will automatically pick up the gm0 device. This, however, prevents you from exporting the underlying device through geom gate, so gm0 must be stopped first. You can do it like this:
#gmirror stop gm0

As soon as it is stopped you may run ggated to export the partition (we’re doing it in debug mode):

#ggated -v

If you get an error stating failure to open the /dev/da0s1e device it may still be locked by the geom mirror class. Just look at “gmirror list” output and stop the device as required.
If ggated is running after all, turn to your failover host and turn off auto configuration on the geom mirror:

#gmirror configure -n gm0

Then make the ggate device available to your node:

#ggatec create 192.168.100.2 /dev/da0s1e

Reinsert the ggate device into the geom mirror using a low priority of ‘0’:
#gmirror insert -p 0 gm0 /dev/ggate0

and re-enable auto-configuration on the mirror:

#gmirror configure -a gm0

I’d recommend always rebuilding the mirror unless you’re absolutely sure that no new data has been added to the gm0 device in the meantime.

#gmirror rebuild gm0 ggate0

Make sure you give the ggate0 device as the last argument, which makes it the “sync target”. If you accidentally run “gmirror rebuild gm0 da0s1e”, the sync goes the other way round, most likely leaving you with corrupt or lost data.

The rebuild will take some time depending on the partition size and network speed. After it has finished you will see a message like this in your kernel log:

GEOM_MIRROR: Device gm0: rebuilding provider ggate0 finished.
GEOM_MIRROR: Device gm0: provider ggate0 activated.

Now you will have to remove the local /dev/ad0s1d device from the mirror and reinsert it using a high priority:

#gmirror remove gm0 /dev/ad0s1d
#gmirror insert -p 100 gm0 /dev/ad0s1d

The geom mirror will automatically rebuild the provider if required.

This is needed to fix the read priority I talked about previously, though only if you want the previous failover node to become your new master node.

If you do not intend to switch the designated roles and want to make your failed primary the active node again, have a look at the next sections.

#10 What If The Failover Node Fails?

Imagine you need to reboot your failover node, let’s say to install some updates. Or even worse: it has rebooted due to a kernel panic, power loss or some other real-life incident.
In any case you should put the geom mirror on the master host into degraded mode by forcibly removing the ggate0 device.

On the master, make sure ggate0 is disconnected from the mirror:

#ggatec destroy -f -u 0

This will result in this kernel message:

GEOM_GATE: Device ggate0 destroyed.
GEOM_MIRROR: Device gm0: provider ggate0 disconnected.

The gm0 device is now running in degraded state until you re-insert your failover node into the configuration.

There is no harm in doing it this way, as you have to do a full resync in either case after the failover node is up again.

The reason to remove the ggate0 device is to prevent IO locking on the geom mirror device.

#11 How To Recover Replication

To bring the failover host back into the active cluster you will need to make sure that the gm0 device is actually shut down on the failed host.

#gmirror stop gm0

As soon as it is stopped you may run ggated to export the partition (we’re doing it in debug mode):

#ggated -v

If you get an error stating failure to open the /dev/da0s1e device it may still be locked by the geom mirror class. Just look at “gmirror list” output and stop the device as required.
If ggated is running after all, make the remote disk slice available on the other host:

#ggatec create 192.168.100.2 /dev/da0s1e

This creates the ggate0 device and adds it to your gm0 device automatically.

GEOM_MIRROR: Device gm0: provider ggate0 detected.
GEOM_MIRROR: Device gm0: provider ggate0 activated.

I’d recommend always rebuilding the mirror unless you’re absolutely sure that no new data has been added to the gm0 device in the meantime.

#gmirror rebuild gm0 ggate0

Make sure you give the ggate0 device as the last argument, which makes it the “sync target”. If you accidentally run “gmirror rebuild gm0 da0s1e”, the sync goes the other way round, most likely leaving you with corrupt or lost data.

The rebuild will take some time depending on the partition size and network speed. After it has finished you will see a message like this in your kernel log:

GEOM_MIRROR: Device gm0: rebuilding provider ggate0 finished.
GEOM_MIRROR: Device gm0: provider ggate0 activated.

#12 Data Integrity Considerations

Some special considerations must be taken to ensure data integrity:

  • ggated cannot export a slice if it is in use by geom mirror
  • don’t try any fancy primary-primary replication stuff, it is not possible
  • never (as in never) mount the filesystem (the underlying partition to be exact), on the failover node
  • to access the data, mount the geom mirror device, which is only possible on the master node. Don’t ever do it on the failover node unless you have taken proper recovery action as described above
  • always run fsck on the geom mirror after failover
  • it’s better not to mount the geom mirror through fstab automatically. Use some freevrrpd recovery magic instead
  • Always take backups. This solution provides realtime replication for HA services; it is never a substitute for proper backups.

#13 Security Considerations

As you may have noticed, ggated doesn’t support any security or encryption mechanism by default. “Security” consists only of IP-based access restrictions combined with read/write or read-only flags.

To enhance security a bit you should always use a dedicated network interface for data replication, preferably a private one which is not connected to the internet. Crossover host-to-host cabling is fine.

If you need to go over the (insecure) public network, please use additional firewall rules to restrict port access to authorized hosts.

Both ggated and ggatec also allow using a port different from their default, so it would be possible to set up a redirect through stunnel. This may, however, add a further performance penalty on your hosts, especially if your network connection is laggy or slow.

#14 Observations

It may look a bit complicated at first glance, but it is basically nothing more than spanning a software RAID1 across networked hosts.

In theory it’s possible to apply any RAID configuration supported by geom across networked hosts, but there is no practical reason for doing so.

The possibilities offered by this setup are huge if implemented properly. You can easily apply HA conditions to services which do not support them on their own.

If you happen to build a live environment on this technology some day, just let me know how it worked out.

Using NetBSD’s RaidFrame On Alpha

Wednesday, June 21st, 2006

Yet another software RAID howto; I seem to be churning these out in a hurry ;-)

This time I will cover NetBSD’s RaidFrame and how to set up RAID1 disk mirroring on a Digital Alpha PWS433au.

While the basic outline largely follows the concept given in the NetBSD Guide, I thought I would write down my personal experience with it.
The main reason for this is the fact that there is so much discrepancy in the information available. Some of the (partially misleading) claims I came across:

  • RaidFrame does not support RootFS booting on Alpha
  • RaidFrame requires fiddling around with ‘sector offsets’ on Alpha
  • RaidFrame does not work at all on Alpha
  • etc.

It is incredible how much information exists which does not agree with the official manual or the manpages, or is simply not accurate enough for whatever reason. Most of it is actually *very* outdated and does not reflect the newest developments at all.

After reading the docs, scanning the newsgroups and trying things out on my own, I found the following to be true instead:

  • RaidFrame does support booting RootFS on Alpha
  • RaidFrame does *not* require any sector offset magic on Alpha
  • RaidFrame works in general on Alpha

Although the NetBSD Guide covers RaidFrame only on x86 and sparc64, most of it applies to Alpha as well. The differences are outlined below using the original headlines from the NetBSD Guide.

15.3.4 Preparing Disk1

Wipe out the disk first (note: on Alpha partition ‘c’ covers the whole disk, not partition ‘d’ as on x86):

#dd if=/dev/zero of=/dev/rsd1c bs=8k count=1

Then recreate the partition table:

#disklabel -r -e -I sd1
type: unknown
disk: Disk1
label:
flags:
bytes/sector: 512
sectors/track: 63
tracks/cylinder: 16
sectors/cylinder: 1008
cylinders: 19386
total sectors: 19541088
[…snip…]
8 partitions:
# size offset fstype [fsize bsize cpg/sgs]
a: 19541088 0 RAID # (Cyl. 0 - 19385)
c: 19541088 0 unused 0 0 # (Cyl. 0 - 19385)

Just make the RAID partition ‘a’ start at offset ‘0’ and everything will be fine. You do not (and should not) add any additional sector offset to the partition, as SRM requires all bootable partitions to be located at sector 0.

15.3.7. Setting up kernel dumps

This is basically the same as outlined in the manual though you may omit the first offset. The numbers are taken from the original manual, so change them according to your own disklabel output.

#dc

64 # RF_PROTECTED_SECTORS
+
19015680 # offset of raid0b
+p
19015744 # offset of swap within sd1
q
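
The same sum can be done with plain shell arithmetic instead of dc. This is a small sketch using the sample numbers from the guide; the variable names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: the same addition without dc. Both numbers are the
# sample values from the NetBSD Guide; use your own disklabel output.
RF_PROTECTED_SECTORS=64
RAID0B_OFFSET=19015680          # offset of raid0b

swap_offset=$((RF_PROTECTED_SECTORS + RAID0B_OFFSET))
echo "$swap_offset"             # offset of swap within sd1
```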

15.3.8. Migrating System to RAID

Most of this part is identical. The difference comes in only when installing the boot loader. This command will do the trick:

#installboot -v /dev/rsd1c /usr/mdec/bootxx_ffs

The boot loader also needs to be installed on the second disk later, in step 15.3.10. Use the same command there.

15.3.11. Testing Boot Blocks

If the boot loader is properly installed, your RaidFrame array should boot from both disks without any problems. This can easily be tested by issuing ‘boot dkc100’ or ‘boot dkc200’. The actual device names of your hard drives can be seen by running ‘show device’ on the SRM console.

To let SRM automatically try multiple devices for system boot you will need to alter your SRM environment.

#set bootdef_dev dkc100,dkc200

Now SRM should always boot from dkc100 and use dkc200 in case of a failure of the first disk.