Logical Volume Manager 2 (lvm2) is a very powerful toolset for managing physical storage devices and logical volumes. I’ve been using it instead of plain disk partitions for over a decade now. LVM gives you full control over where logical volumes are placed, plus a ton of other features I have not even tried out yet. It can provide software RAID and error correction, and you can move logical volumes around while they are actively in use. In short, LVM is an awesome tool that should be in every Linux admin’s toolbox.
Today I want to show how I used LVM’s cache volume feature to drastically speed up a Btrfs RAID1 situated on two slow desktop HDDs, using two cheap SSDs also attached to the same computer, while still maintaining reasonable error resilience against single failing devices.
Creating the cached LVs and Btrfs RAID1
The setup is as follows:
- 2x 4TB HDD (slow), /dev/sda1, /dev/sdb1
- 2x 128GB SSD (consumer-grade, SATA), /dev/sdc1, /dev/sdd1
- All of these devices are part of the Volume Group vg0
- Goal is to use Btrfs RAID1 mode instead of an MD RAID or lvmraid, because Btrfs has built-in checksums and can detect and correct problems a little better, since it can determine which leg of the mirror holds the correct data.
Note: you don’t have to start from scratch. If your Btrfs devices are already based on Logical Volumes (LVs), it is absolutely possible to attach the cache later on. The important part is: you should always start off with LVM and logical volumes instead of native partitions.
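If you’re not sure whether your existing setup is LV-based, a quick check (assuming the VG is called vg0, as in my setup) could look like this:

# LVs show up with TYPE "lvm" in the device stack
lsblk -o NAME,TYPE,FSTYPE,SIZE
# list the LVs in the VG and the PVs they live on
lvs -o lv_name,devices vg0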
The first step is to create all the required Logical Volumes (LVs). The steps below will only use 2TB of the 4TB disks, and 64GB of the SSDs for caching.
# create data base volumes
lvcreate -n data-btrfs1 -L 2048G vg0 /dev/sda1
lvcreate -n data-btrfs2 -L 2048G vg0 /dev/sdb1
# cache volumes
lvcreate -n data-btrfs1_ssdcache -L 64G vg0 /dev/sdc1
lvcreate -n data-btrfs2_ssdcache -L 64G vg0 /dev/sdd1
The important part is that you need to specify the Physical Volume (PV) on which each of the logical volumes should be located. If you don’t, LVM will decide the placement itself, and you’ll likely end up with one physical disk holding several of the logical volumes, which is not suited for the RAID setup (and would prevent the cache from actually being beneficial).
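You can double-check the placement afterwards by asking LVM which devices each LV actually occupies (just a sanity check, not part of the setup):

# each data LV should list exactly one HDD, each cache LV exactly one SSD
lvs -o lv_name,devices vg0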
In the following step, the SSD cache volumes are attached in writeback mode, which makes them a write cache. This is dangerous and you should only do that if you understand what it means. lvconvert will warn you, and the warning is to be taken seriously: in case the SSD fails, it will corrupt the cached HDD’s data irrecoverably. But that’s why we’ll use RAID1 on the Btrfs level.
# attach cache volumes in writeback mode
lvconvert --type cache --cachevol data-btrfs1_ssdcache --cachemode writeback vg0/data-btrfs1
lvconvert --type cache --cachevol data-btrfs2_ssdcache --cachemode writeback vg0/data-btrfs2
That’s it. /dev/vg0/data-btrfs1 and /dev/vg0/data-btrfs2 are now fully cached, in writeback mode.
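You can verify the result with lvs; the -a flag also shows the hidden internal cache volumes (exact field names may differ slightly between lvm2 versions):

# show the cached LVs, their cache volumes and the cache mode
lvs -a -o lv_name,pool_lv,cachemode,devices vg0

With both LVs cached, the Btrfs RAID1 can now be created on top of them: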
mkfs.btrfs -m raid1 -d raid1 /dev/vg0/data-btrfs1 /dev/vg0/data-btrfs2
Done. This filesystem can now be mounted:
mount.btrfs /dev/vg0/data-btrfs1 /mnt/
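To confirm that both devices ended up in the filesystem and that data and metadata really are RAID1:

btrfs filesystem show /mnt
btrfs filesystem df /mnt

For reference, a matching /etc/fstab entry could look roughly like this (the UUID is a placeholder, take the real one from the output above; a Btrfs multi-device filesystem has a single UUID shared by all members):

UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /mnt  btrfs  defaults  0  0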
Failure Scenarios
Let’s explore different failure scenarios to show the dangers and limitations we face. Basically, this setup will survive the failure of any one of the four devices. If you are exceptionally lucky, it could survive the failure of two (but don’t count on it).
You should be familiar with how to handle failed devices in Btrfs, especially with if, when, and how to mount degraded arrays and how to repair them. The Btrfs Kernel Wiki is a good resource for that.
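For reference, mounting a degraded array (deliberately, e.g. for a repair) looks like this, using whichever member device is still alive:

# mount the surviving leg read-write in degraded mode
mount -o degraded /dev/vg0/data-btrfs2 /mnt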
HDD failure
In case a single one of the HDDs fails, you should physically remove it to prevent further interference with the system, and potentially also the corresponding SSD. It might be a bit tricky to get LVM to activate all other LVs, but it is doable. Once all other LVs are active, you need to follow the “Replacing failed drive” guide in the Btrfs wiki.
I highly recommend removing the failed HDD from the Volume Group and adding the replacement drive, so that you can benefit from LVM again, maybe even re-using the SSD as cache for the replacement drive.
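A rough sketch of that procedure, assuming /dev/sda (carrying PV /dev/sda1 and the LV data-btrfs1) died and the replacement disk shows up as /dev/sde (adapt all names to your system, and treat this as an outline rather than a copy-paste recipe):

# activate the remaining LVs despite the missing PV (may need partial mode)
vgchange -ay --activationmode partial vg0
# drop the missing PV; --force also removes the LVs that lived on the dead disk
vgreduce --removemissing --force vg0
# prepare the replacement disk and add it to the VG
pvcreate /dev/sde1
vgextend vg0 /dev/sde1
# new data LV, again pinned to its own PV
lvcreate -n data-btrfs3 -L 2048G vg0 /dev/sde1
# mount degraded, look up the devid of the missing device, and rebuild the mirror
mount -o degraded /dev/vg0/data-btrfs2 /mnt
btrfs filesystem show /mnt
btrfs replace start <devid-of-missing-device> /dev/vg0/data-btrfs3 /mnt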
In case both of the HDDs fail, you are screwed. That’s what backups are for. Nothing to be done here. You’d need RAID1c3 to survive that, which would require 3 physical HDDs.
SSD failure
In case a single one of the SSDs fails, you should attempt to break up the cache, maybe saving a bit of the data still sitting in the write cache back to the HDD. After you have broken up the cache, the HDD will be uncached (and its data is likely corrupted). Everything will be slow again for this disk.
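Breaking up the cache is a single lvconvert call. As far as I understand it, --uncache flushes whatever dirty blocks it can still read back to the origin LV before removing the cache; if the SSD is completely dead you may have to add --force and accept the loss:

# detach the failed SSD cache from the origin LV
lvconvert --uncache vg0/data-btrfs1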
Attempt to repair the Btrfs mirror by doing a btrfs scrub. The self-healing of Btrfs might repair the failed mirror leg. If this fails, because the HDD’s data was too inconsistent due to the write-caching, the easiest approach is to delete the LV, create a new one with a new name (e.g. “data-btrfs3”), wipe it with dd if=/dev/zero of=/dev/vg0/data-btrfs3 and then use it as a replacement for the failed device in Btrfs, following again the “Replacing failed drive” guide. From Btrfs’ perspective the new LV will in this case be a new device, even though it might occupy the same physical space (and that’s why you should zero it out first).
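Put together, that recovery could look roughly like this (names and sizes match the example setup above; the devid of the missing device has to be looked up with btrfs filesystem show):

# drop the corrupted LV and create a fresh one, here on the same HDD
lvremove vg0/data-btrfs1
lvcreate -n data-btrfs3 -L 2048G vg0 /dev/sda1
# zero it so Btrfs treats it as a brand-new device
dd if=/dev/zero of=/dev/vg0/data-btrfs3 bs=1M status=progress
# mount the surviving leg degraded and rebuild the mirror onto the new LV
mount -o degraded /dev/vg0/data-btrfs2 /mnt
btrfs replace start <devid-of-missing-device> /dev/vg0/data-btrfs3 /mnt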
Once the Btrfs RAID1 is healthy again, you could swap in another SSD and use it as cache again.
In case both SSDs fail, you are screwed. You can attempt to break up the cache and see if btrfs scrub can save anything, but it’s very unlikely. That’s what backups are for.
Multiple device failures
If more than one disk fails, you can assume your data is lost and reach for your backups. There is one scenario in which you might be lucky: if the HDD and the SSD of the same cache pair fail. From Btrfs’ perspective you still only lost one device.
Pitfalls / Caveats
Don’t cheat on the number of disks!
It might be tempting to use just one SSD and split its space for the cache. You must not do that! It is really important that every physical HDD has its own physical SSD for the cache. Otherwise an SSD failure will affect both legs of the mirror and result in total data loss. In that setup, just don’t bother with RAID1, use -d single instead. You’ll lose data with any failing disk. And that can be a risk you can accept. It’s up to you and your backup strategy.
But RAID is all about uptime?!
You will often hear: “RAID is about uptime, it’s not about backup”. And that’s true. Btrfs RAID is not really about uptime, though. It will, for instance, refuse to mount degraded by default, and there is a good reason not to add -o degraded to your /etc/fstab.
I use this setup because I want to reduce the number of times I have to reach for my backups, not to increase uptime. If uptime were the goal, I’d have to use a hardware RAID with hot-swappable disks and/or other file systems like ZFS that prioritize uptime. Btrfs refuses to silently keep working without admin intervention if there is anything wrong with the mirror. And for my scenario, that’s the better choice. Your mileage may (and likely will) differ.
In case that’s important to you, you might look into ZFS; it has caching features built in, but I never had the time to evaluate it. I’d be happy to hear about your experiences.
Why are you only using half of each of your drives?
The rest of each drive is used for other logical volumes, some VMs, containers, etc. Of course you can use all the space if you want to. I normally only allocate what I need and expand on demand.
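Expanding on demand then works with the usual LVM and Btrfs tools. A rough sketch for growing one leg of the mirror (depending on your lvm2 version you may have to detach the cache before resizing the origin LV and re-attach it afterwards):

# temporarily detach the cache, grow the origin on its own PV, re-attach the cache
lvconvert --splitcache vg0/data-btrfs1
lvextend -L +500G vg0/data-btrfs1 /dev/sda1
lvconvert --type cache --cachevol data-btrfs1_ssdcache --cachemode writeback vg0/data-btrfs1
# repeat for data-btrfs2 on /dev/sdb1, then tell Btrfs to use the new space
btrfs filesystem resize 1:max /mnt
btrfs filesystem resize 2:max /mnt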