Highly available disk stacks are nothing new. At the time of writing, Dell will sell you a no-single-point-of-failure disk array with a couple of drives for approximately $4,500. Not terrible, eh? Still, if you're on a first-name basis with Linux and have a couple of machines to spare, you can set up a shared-nothing disk cluster for next to nothing.

Just how can that be? The good folks at LINBIT have kindly provided their Distributed Replicated Block Device (DRBD) under the GPL licence. DRBD is an online disk clustering suite that, in their own words, can be seen as a "network-based RAID-1".
DRBD works by injecting a thin layer in between the file system (and the buffer cache) and the disk driver. The DRBD kernel module intercepts all requests from the file system and splits them down two paths – one to the real disk and another to a mirrored disk on a peer node. Should the former fail, the file system can be mounted on the opposing node and the data will be available for use.
DRBD works on two nodes at a time – one is given the role of the primary node, the other – a secondary role. Reads and writes can only occur on the primary node. The secondary node must not mount the file system, not even in read-only mode. This last point warrants some explanation. While it's true to say that the secondary node sees all updates on the primary node, it can't expose these updates to the file system, as DRBD is completely file system agnostic. That is, DRBD has no specific knowledge of the file system and, as such, has no way of communicating the changes upstream to the file system driver. The two-at-a-time rule does not actually prevent DRBD from operating on more than two nodes. DRBD supports "stacking", where a higher-level DRBD module acting as a block device to the operating system forks to a pair of lower-level block devices which themselves are DRBD modules (and so on). A sketch of what a stacked configuration might look like follows below.
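The listing below is only a rough sketch of a stacked setup, loosely following the stacked-resource syntax described in the DRBD 8.3 documentation. The host name 'archive', the /dev/drbd10 and /dev/sdb1 devices, the addresses and the port are made up for illustration. The lower-level resource r0 is defined as usual between the two local nodes, and a second resource is stacked on top of it to replicate to a third machine:

resource r0-U {
  protocol A;
  stacked-on-top-of r0 {
    device    /dev/drbd10;
    address   192.168.200.1:7790;
  }
  on archive {
    device    /dev/drbd10;
    disk      /dev/sdb1;
    address   192.168.200.2:7790;
    meta-disk internal;
  }
}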
Replication takes place using one of three protocols:
Protocol A queues the data written on the primary node to the secondary node, but does not wait for the secondary node to confirm the receipt of the data before acknowledging to its own host that the data has been safely committed. Those familiar with NFS will draw parallels with "asynchronous replication", for this is, indeed, the case. Being asynchronous, it is the fastest of all replication protocols but suffers from one major drawback – the failure of the primary device does not guarantee that all of the data is available on the secondary device. However, the data on the secondary device is always consistent, that is, it accurately represents the data stored on the primary device at the time of the last synchronisation.
Protocol B awaits the response from the secondary host prior to acknowledging the successful commit of the data to its own host. However, the secondary host isn't required to immediately persist the replicated changes to stable storage – it can do so some time after confirming the receipt of the changes from the primary host. This guarantees that, in the event of a failure, the secondary node isn't only consistent, but fully up to date with respect to the primary node's data. In the authors' own words, this protocol can be seen as "semi-synchronous" replication. This protocol is somewhat slower than protocol A, because it exercises the network on each and every write operation.
Protocol C not only awaits the response from the secondary host, but also mandates that the secondary host secures the updates to stable storage prior to responding to the primary. Because of the added disk I/O overhead, protocol C is notably slower than protocol B. Drawing back to our NFS example, this protocol equates to fully synchronous replication.
The protocols above represent varying degrees of guarantee with respect to the integrity of the data replication process, and trade speed for safety. Protocol A is the fastest of all, but is not especially safe. Protocol C offers the most resiliency to failure, but incurs the greatest amount of latency. LINBIT claim that most customers should be using protocol C. This is debatable – protocol B is just as safe, while incurring far less overhead. Protocol B only comes unstuck if both nodes were to black out or power-cycle at exactly the same time. This scenario should be guarded against using a UPS and/or redundant power lines. If redundant power is not available, protocol C is, indeed, the most suitable.
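In configuration terms, the choice of protocol amounts to a single line in drbd.conf. The snippet below is just an illustration of selecting protocol B in the common section; the full configuration file used in this article appears later, and sticks with protocol C:

common {
  protocol B;
}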
Setting up DRBD
Obtaining DRBD
DRBD has been incorporated into the Linux kernel since 2.6.33. If you have been blessed with an older kernel but are a paying customer of LINBIT, you might be provided with a pre-built package to suit your distribution. But since this is an "on a budget" article, you will just need to download a tarball distribution from the DRBD website (or get a recent kernel). The following instructions apply to DRBD version 8.3.8.1.
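Assuming the tarball is named after the version (drbd-8.3.8.1.tar.gz is an assumption on our part – check what you actually downloaded), unpacking it looks something like this:

$ tar xzf drbd-8.3.8.1.tar.gz
$ cd drbd-8.3.8.1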
Building DRBD
You're probably acquainted with the well-known Linux trio: configure-make-install. This one's no different, although you do have to specify an additional switch or two to get the build going.
$ ./configure --with-km --sysconfdir=/etc
$ make
# make install
NB: In every relevant place, the DRBD documentation states that configuration files are searched for in the sequence /etc/drbd-83.conf, /etc/drbd-08.conf, followed by /etc/drbd.conf. However, the header file (user/config.h) generated by running the configure script points to the /usr/local/etc directory instead, contradicting all documentation, including the man pages. The --sysconfdir switch overrides this behaviour. Furthermore, according to the source code of version 8.3.8.1 (user/drbdadm_main.c), there is an additional configuration file, drbd-82.conf, that is searched after drbd-83.conf, which has been omitted from the documentation. Our advice to the folks at LINBIT would be to either switch the default configuration directory to /etc, or to update the documentation to indicate otherwise.
Verifying the build
After building, load the module to confirm that it was built correctly:
# modprobe drbd
If modprobe fails to load the module, it may be because DRBD has placed the module in the wrong directory – one that is not commensurate with your kernel release (this wouldn't be the first time DRBD got confused). You can try to search for the module, like so:
# find /lib/modules -name drbd.ko
If you locate the module, copy it to your /lib/modules/`uname -r`/kernel/drivers/block directory. Having done that, register the module:
# depmod -a
Alternatively, enter the drbd subdirectory of the DRBD source tree and run the following (this time forcing the kernel revision):
$ make clean
$ make KDIR=/lib/modules/`uname -r`/build
# make install
Then try running modprobe drbd again.
Configuring DRBD
The layout of the DRBD disk cluster must be described in a single configuration file located at /etc/drbd.conf. In our example, replication will take place over two virtual machines, interconnected by a single private link. The machines are named 'spark' and 'flare'. Both hosts will be replicating the /dev/sda3 block device. The corresponding configuration file is depicted below:
global {
  usage-count yes;
}
common {
  protocol C;
}
resource r0 {
  device    /dev/drbd1;
  disk      /dev/sda3;
  meta-disk internal;
  on spark {
    address 192.168.100.10:7789;
  }
  on flare {
    address 192.168.100.20:7789;
  }
}
It's apparent from the configuration that protocol C is being employed. The resource section lists the details of a single resource named r0. (DRBD may have multiple resources configured and operational.) The two on sections represent the configurations specific to the nodes 'spark' and 'flare'. The device, disk and meta-disk entries are common to both nodes. However, if any of these items were to vary between the two nodes, you would be expected to move them down into the on sections, as sketched below. The address entries will invariably differ between the two nodes. I feel compelled to mention that the two addresses must be cross-routable, and appropriate arrangements must be made to allow DRBD traffic to traverse any firewalls on the ports nominated in the address entry.
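For illustration only – suppose flare's backing partition were /dev/sdb2 rather than /dev/sda3 (a hypothetical device, made up for this example). The varying entry would then be pushed down into the per-node on sections, along these lines:

resource r0 {
  device    /dev/drbd1;
  meta-disk internal;
  on spark {
    disk    /dev/sda3;
    address 192.168.100.10:7789;
  }
  on flare {
    disk    /dev/sdb2;
    address 192.168.100.20:7789;
  }
}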
Configuring the metadata
DRBD requires a dedicated storage area on each node for maintaining metadata – information about the current state of synchronisation between the DRBD nodes.
Metadata can be external, in which case you must dedicate an area on the disk outside of the partition you wish to replicate. External metadata offers the greatest performance, since you can employ a second disk on each node to parallelise I/O operations.
Metadata can also be internal, that is, inlined with the partition being replicated. This mode offers worse I/O performance compared to external metadata. It is somewhat simpler, however, and does have the advantage of keeping the metadata close to the actual data – should you ever need to physically relocate the disk. Internal metadata is placed at the end of the partition or Logical Volume (LV) occupying the target file system. To prevent the metadata from overwriting the end of the file system, the latter must first be shrunk to make room for the metadata.
In our example we'll be using internal metadata. In either case, metadata takes up some space on the device; the amount varies depending on the size of the replicated file system. Before determining the size of the metadata, we must accurately gauge the size of the file system to be replicated. When we talk about sizes, we refer to the raw size of the file system, i.e. the amount of space it takes up on the disk – not the amount of usable space the file system presents to applications. The best way to determine the size of the file system is to look at the size of the underlying partition or LV, since file systems tend to occupy the entire partition/LV. We'll use the parted utility in our example of replicating /dev/sda3 – a 4GB Ext3 partition.
# parted /dev/sda3 unit s print
Model: Unknown (unknown)
Disk /dev/sda3: 8193150s
Sector size (logical/physical): 512B/512B
Partition Table: loop

Number  Start  End       Size      File system  Flags
 1      0s     8193149s  8193150s  ext3
Determine the size of the metadata:
given by: ceiling(Size / 2^18) x 8 + 72 = ceiling(8193150 / 262144) x 8 + 72 = 32 x 8 + 72 = 328 sectors (where the ceiling function rounds its input up to the nearest integer)
NB: The observant among you will note that the actual internal metadata requirement will be slightly smaller than the stated figure, because by shrinking the file system we are reducing the demand for metadata. Still, the difference in size will be negligible, and it's simplest to compute the metadata block size from the pre-shrunk size.
Check the file system for errors (Ext2/Ext3 file systems):
# e2fsck -f /dev/sda3
Calculate the new size of the file system, taking into account the DRBD metadata:
given by: Size - 328 = 8193150 - 328 = 8192822 sectors
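If you prefer to let the machine do the arithmetic, the two figures above can be double-checked with plain shell arithmetic (the numbers are the ones from our example partition; the ceiling is taken by adding 262143 before the integer division):

$ echo $(( (8193150 + 262143) / 262144 * 8 + 72 ))
328
$ echo $(( 8193150 - 328 ))
8192822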
Resize the file system:
# resize2fs /dev/sda3 8192822s
Finally, create the metadata block:
# drbdadm create-md r0
Loading DRBD on startup
In most cases it's desirable to load the DRBD kernel module and activate DRBD replication on start-up. DRBD is distributed with a daemon for just this purpose. (Replace DRBD_DIR with the directory that DRBD was unpacked to.)
# cp DRBD_DIR/scripts/drbd /etc/rc.d/init.d
# chkconfig --add drbd
Activating DRBD
Start the daemon:
# service drbd start
Observe the status of the disks:
$ cat /proc/drbd
version: 8.3.8.1 (api:88/proto:86-94)
GIT-hash: 0d8589fcc32c874df57c930ca1691399b55ec893 build by [email protected], 2010-08-04 20:45:00
 1: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:4096408
The Inconsistent/Inconsistent disk state is expected at this point. It simply means that the disks have never been synchronised.
Initial synchronisation
The next step is the initial synchronisation, and involves the complete overwrite of the data on one peer's disk, sourced from the disk of the other peer. You must decide which of the peers holds the good data, and issue the following command on that peer:
# drbdadm -- --overwrite-data-of-peer primary r0
Now, on either of the peer nodes, do:
$ watch “cat /proc/drbd”
You will see a progress bar, similar to the one below:

version: 8.3.8.1 (api:88/proto:86-94)
GIT-hash: 0d8589fcc32c874df57c930ca1691399b55ec893 build by [email protected], 2010-08-04 20:45:00
 1: cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate C r----
    ns:0 nr:24064 dw:24064 dr:0 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:4072344
    [>...................] sync'ed:  0.7% (4072344/4096408)K
    finish: 2:49:40 speed: 324 (320) K/sec
Depending on the size of your file system, and the speed of the network, this operation may take some time to complete. Using a pair of virtual machines and a virtual internal network, a 4GB Ext3 file system took approximately 3.5 hours to synchronise. That said, you should be able to start using the primary disk as soon as it's up, without waiting for the synchronisation process to finish. However, refrain from performing any mission-critical operations on the primary file system until the initial synchronisation completes (even if using protocol C).
Mounting the file system
Next, we can mount the disk on the primary node. But first, we must ensure that one node is designated as the primary node. On the primary node, issue the following:
# drbdadm primary r0
Observe the output of cat /proc/drbd, having made a node primary:
version: 8.3.8.1 (api:88/proto:86-94)
GIT-hash: 0d8589fcc32c874df57c930ca1691399b55ec893 build by [email protected], 2010-08-06 08:01:01
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:32768 nr:0 dw:0 dr:32984 al:0 bm:4 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
The output of cat /proc/drbd on the secondary node should be very similar, only the Primary/Secondary roles will appear reversed.
The order of our HA disk stack (lowest level first) is as follows:
Physical disk partition, LVM (if applicable), DRBD, file system
When mounting the disk, we refer to the special DRBD block device, rather than the real device (e.g. /dev/sda3). Like real partitions, DRBD devices are suffixed with a 1-based index. For convenience, it is worth appending the following entry to the end of the /etc/fstab file:
/dev/drbd1 /mnt/drbd1 ext3 noauto 0 0
The noauto option in the 'mount options' column tells the operating system to refrain from mounting the device at startup. Otherwise, one of the nodes would invariably fail trying to mount the file system, as only one node may have the file system mounted at any given time.
Now mount the block device:
# mount /dev/drbd1
NB: Because of the entry in /etc/fstab we don't have to specify a mountpoint to the mount command.
So there you have it: a highly available, no-single-point-of-failure disk stack for the price of a couple of Linux boxes. And all in the time it took you to drink 17 cups of espresso.
Further reading
Gridlock
DRBD fully integrates with Gridlock – the world's best high availability cluster. Whether you're after a high performance, highly available shared-nothing architecture, or off-site replication and disaster recovery, Gridlock is up to the task.
The trouble with using Linux-based (or any OS-specific) clustering software is that you will always be tied to the operating system.
Gridlock, however, works at the application level and isn't coupled to the operating system. I think this is the way forward, particularly considering that many organisations are running a mixed bag of Windows and Linux servers – being able to cluster Windows and Linux machines together can be a real benefit. It also makes installation and configuration easier, since you don't have separate instructions for a dozen different operating systems and hardware configurations.
The other neat thing about Gridlock is that it doesn't use quorum and doesn't rely on NIC bonding/teaming to achieve multipath configurations – instead it combines redundant networks at the application level, which means it works on any network card and doesn't require specialised switch gear.
Split brain
When running in an active-standby configuration, only one DRBD node may be made primary at any given time. Two (or more) disks coexisting in the primary state can result in the branching of the data sets. Stated otherwise, one node could have changes not seen by its peer, and vice versa. This situation is known as a split brain. When the drbd daemon is started, it will check for a split brain condition, and abort synchronisation while appending an error message to /var/log/messages.
The first step in recovering from a split brain situation is to identify the changes made to both nodes following the split brain event. If both nodes have important data that needs to be merged, it is best to back up one of the nodes (call it node A, or the trailing node) and re-sync the data from the other node (node B, or the leading node). When the re-sync is complete, both nodes will contain the data set of node B, with the latter being the primary node. Following that, demote node B to secondary status, and promote node A to primary status. Hand-merge the changes from the backup data set on node A – these changes will propagate to node B.
On the trailing node, back up the data and issue the following commands:
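(What follows is a sketch of the usual manual split-brain recovery sequence in DRBD 8.3, assuming our resource r0 – the trailing node discards its own changes and reconnects to the leading node.)

# drbdadm secondary r0
# drbdadm -- --discard-my-data connect r0

If the leading node has also dropped the connection and sits in a StandAlone state, tell it to reconnect as well (on the leading node):

# drbdadm connect r0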
Observe /proc/drbd – it should now show the nodes synchronising.
Having synchronised the nodes, reverse the roles and manually merge the changes on the new primary node.
Startup barrier
By default, when a DRBD node starts up, it waits for its peer node to start. This prevents a scenario where the cluster is booted using only one node, and mission-critical data is written without being replicated onto the peer's disk. The default timeout is 'infinite', that is, a node will wait indefinitely for its peer to come up before proceeding with its own boot sequence. Despite this, DRBD will present you with an option to skip the wait. To control the timeouts, add a startup section within the resource section, as shown below:
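A minimal sketch of such a section, using DRBD's wfc-timeout (wait-for-connection timeout) parameter; treat the surrounding resource details as the ones we defined earlier:

resource r0 {
  startup {
    wfc-timeout 10;   # wait at most 10 seconds for the peer at boot
  }
  # device, disk, meta-disk and on sections as before
}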
In this case, we have explicitly specified the timeout to be 10 seconds. The node will allow some time for its peer to come up, but the absence of the peer won't prevent the node from booting.
Synchronisation options
DRBD's synchronisation mechanism is optimised for slow computers with slow network connections by default. This is too bad, because the out-of-the-box configuration requires quite a bit of tinkering to get going on even the most basic hardware. The default synchronisation rate is capped at around 250 KB/s, which is roughly 2% of a 100 Mbps LAN. While the presence of a throttling feature is good, its default settings are too conservative. Furthermore, DRBD by default will transmit all blocks that it thinks may be out of sync. Compare this with the rolling checksum and compression used by tools such as rsync. While compression is not yet an option, it is possible to tell DRBD to compare the digest of each block with the primary's copy, and only transfer the block if the digests differ. Bear in mind though – using a checksum will trade CPU cycles for bandwidth. A more free-flowing throttle cap and the use of an MD5 checksum for a quicker re-sync can be specified by adding a syncer section to the common section, as shown below:
common {
  protocol C;
  syncer {
    rate      5M;
    csums-alg md5;
  }
}
In the example above, the sync rate has been capped at 5 MB/s, which is around 50% of the capacity of a 100Base-T Ethernet fabric, taking into account TCP/IP framing overheads. This configuration uses the MD5 algorithm to compute digests over the replicated blocks, which must be supported by your kernel (most will). The two settings are entirely independent: one can specify a new throttle without setting a checksum algorithm, and vice versa.
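As a usage note – after editing /etc/drbd.conf on both nodes, the running resource can usually be nudged to pick up the new syncer settings without a full restart, for example:

# drbdadm adjust r0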