Speeding up Btrfs RAID1 with LVM Cache

Logical Volume Manager 2 (lvm2) is a very powerful toolset for managing physical storage devices and logical volumes. I’ve been using it instead of plain disk partitions for over a decade now. LVM gives you full control over where logical volumes are placed, and a ton of other features I have not even tried out yet. It can provide software RAID, it can provide error correction, and you can move logical volumes around while they are actively in use. In short, LVM is an awesome tool that should be in every Linux admin’s toolbox.

Today I want to show how I used LVM’s cache volume feature to drastically speed up a Btrfs RAID1 that lives on two slow desktop HDDs, using two cheap SSDs attached to the same computer, while still maintaining reasonable resilience against a single failing device.

Creating the cached LVs and Btrfs RAID1

The setup is as follows:

  • 2x 4TB HDD (slow), /dev/sda1, /dev/sdb1
  • 2x 128GB SSD (consumer-grade, SATA), /dev/sdc1, /dev/sdd1
  • All of these devices are part of the Volume Group vg0
  • The goal is to use Btrfs RAID1 mode instead of an MD RAID or lvmraid, because Btrfs has built-in checksums and can detect and correct problems a little better: it can determine which leg of the mirror holds the correct data.

Note: you don’t have to start from scratch. If your Btrfs devices are already based on Logical Volumes (LVs), it is absolutely possible to attach the cache later on. The important part is: you should always start off with LVM and logical volumes instead of native partitions.
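
If you are starting from blank disks, the Volume Group from the list above could be assembled roughly like this (partitioning of the drives not shown; device names as listed above):

# turn the partitions into Physical Volumes and group them into vg0
pvcreate /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
vgcreate vg0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1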

The first step is to create all the required Logical Volumes (LVs). The steps below will only use 2TB of each 4TB disk, and 64GB of each SSD for caching.

# create the backing data volumes (one per HDD)
lvcreate -n data-btrfs1 -L 2048G vg0 /dev/sda1
lvcreate -n data-btrfs2 -L 2048G vg0 /dev/sdb1
# create the cache volumes (one per SSD)
lvcreate -n data-btrfs1_ssdcache -L 64G vg0 /dev/sdc1
lvcreate -n data-btrfs2_ssdcache -L 64G vg0 /dev/sdd1

The important part is that you need to specify the Physical Volume (PV) on which each of the logical volumes is to be located. If you don’t, LVM will decide the placement internally and you’ll likely end up with one physical disk holding several of the logical volumes, which is unsuitable for the RAID setup (and would prevent the cache from actually being beneficial).
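
To double-check that every LV really ended up on the intended device, the placement can be listed like this (column names may vary slightly between lvm2 versions):

# show which physical device backs each logical volume
lvs -o name,size,devices vg0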

In the following step, the SSD cache volumes are attached in writeback mode, which makes them a write cache. This is dangerous and you should only do it if you understand what that means. lvconvert will warn you, and the warning is to be taken seriously: if an SSD fails, it will corrupt the cached HDD’s data irrecoverably. But that’s why we’ll use RAID1 on the Btrfs level.

# attach cache volumes in writeback mode
lvconvert --type cache --cachevol data-btrfs1_ssdcache --cachemode writeback vg0/data-btrfs1
lvconvert --type cache --cachevol data-btrfs2_ssdcache --cachemode writeback vg0/data-btrfs2

That’s it. /dev/vg0/data-btrfs1 and /dev/vg0/data-btrfs2 are now fully cached, in writeback mode.
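
To confirm that the cache is attached and running in the intended mode, something along these lines should work (the cache_mode field requires a reasonably recent lvm2 version):

# list the cached LVs together with their segment type and cache mode
lvs -a -o name,segtype,cache_mode vg0

With both LVs cached, the Btrfs RAID1 is created on top of them: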

mkfs.btrfs -m raid1 -d raid1 /dev/vg0/data-btrfs1 /dev/vg0/data-btrfs2

Done. This filesystem can now be mounted:

mount.btrfs /dev/vg0/data-btrfs1 /mnt/
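
To double-check that data and metadata really ended up with the raid1 profile, the block group summary shows it:

# both Data and Metadata should be reported as RAID1
btrfs filesystem df /mnt/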

Failure Scenarios

Let’s explore different failure scenarios to show the dangers and limitations we face. Basically, this setup will survive the failure of one of the four devices. If you are exceptionally lucky, it could survive the failure of two (but don’t count on it).

You should be familiar with how to handle failed devices in Btrfs. In particular, you should know if, when, and how to mount degraded arrays and how to repair them. The Btrfs Kernel Wiki is a good resource for that.
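
For reference, mounting a two-device RAID1 with one device missing typically looks like this; only do it when you actually intend to repair or evacuate the filesystem:

# mount the surviving leg in degraded mode
mount -o degraded /dev/vg0/data-btrfs2 /mnt/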

HDD failure

If a single HDD fails, you should physically remove it to prevent further interference with the system, and potentially also the corresponding SSD. It might be a bit tricky to get LVM to activate all the other LVs, but it is doable. Once all the other LVs are active, you need to follow the “Replacing failed drive” guide in the Btrfs wiki.

I highly recommend removing the failed HDD from the Volume Group and adding the replacement drive, so that you can benefit from LVM again, maybe even re-using the SSD as cache for the replacement drive.
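
A rough sketch of what that could look like, assuming the replacement disk shows up as /dev/sde1 (hypothetical) and the failed HDD had Btrfs device ID 1 (check with btrfs filesystem show):

# activate the surviving LVs; depending on the lvm2 version this may need extra coaxing
vgchange -ay --activationmode degraded vg0
# drop the missing PV (and any LV still referencing it) from the VG
vgreduce --removemissing vg0
# add the replacement disk and create a fresh LV on it
pvcreate /dev/sde1
vgextend vg0 /dev/sde1
lvcreate -n data-btrfs3 -L 2048G vg0 /dev/sde1
# with the filesystem mounted degraded on /mnt, let Btrfs rebuild the mirror
btrfs replace start 1 /dev/vg0/data-btrfs3 /mnt/
btrfs replace status /mnt/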

If both HDDs fail, you are screwed. That’s what backups are for. Nothing to be done here. You’d need RAID1c3 to survive that, which would require 3 physical HDDs.

SSD failure

If a single SSD fails, you should attempt to break up the cache, ideally flushing whatever is left in the write cache back to the HDD. After you have broken up the cache, the HDD will be uncached (and its data is likely corrupted). Everything will be slow again for this disk.
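
Breaking up the cache is done with lvconvert. A sketch, assuming the cache on data-btrfs1 failed; if the SSD is gone entirely, LVM may have to be forced to drop the unflushed data:

# flush what can be flushed and detach the cache from the HDD-backed LV
lvconvert --uncache vg0/data-btrfs1
# if the cache device is completely dead, forcing may be required
lvconvert --uncache --force vg0/data-btrfs1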

Attempt to repair the Btrfs mirror by running a btrfs scrub. The self-healing of Btrfs might repair the damaged mirror leg. If this fails because the HDD’s data was too inconsistent due to the write caching, the easiest approach is to delete the LV, create a new one with a new name (e.g. “data-btrfs3”), wipe it with dd if=/dev/zero of=/dev/vg0/data-btrfs3 and then use the new LV as a replacement for the failed device in Btrfs, following again the “Replacing failed drive” guide. From Btrfs’ perspective the new LV will be a new device, even though it might occupy the same physical space (and that’s why you should zero it out first).
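
In commands, that recovery path could look roughly like this; the LV name data-btrfs3 and the Btrfs device ID 2 are just examples (check btrfs filesystem show for the real ID), and this is a sketch, not a tested recovery recipe:

# try self-healing first
btrfs scrub start /mnt/
btrfs scrub status /mnt/

# if that does not help: recreate the LV under a new name and zero it out
lvremove vg0/data-btrfs1
lvcreate -n data-btrfs3 -L 2048G vg0 /dev/sda1
dd if=/dev/zero of=/dev/vg0/data-btrfs3 bs=1M status=progress
# then, with the filesystem mounted degraded, let Btrfs rebuild onto the new device
btrfs replace start 2 /dev/vg0/data-btrfs3 /mnt/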

Once the Btrfs RAID1 is healthy again, you could swap in another SSD and use it as a cache again.

If both SSDs fail, you are screwed. You can attempt to break up the caches and see if btrfs scrub can save anything, but it’s very unlikely. That’s what backups are for.

Multiple device failures

If more than one disk fails, you can assume your data is lost and reach for your backups. There is one scenario in which you might be lucky: if the HDD and the SSD of the same cache pair fail. From Btrfs’ perspective you still only lost one device.

Pitfalls / Caveats

Don’t cheat on the number of disks!

It might be tempting to use just one SSD and split its space for the caches. You must not do that! It is really important that every physical HDD has its own physical SSD for the cache. Otherwise a single SSD failure will affect both legs of the mirror and mean total data loss. In that setup, just don’t bother with RAID1, use -d single instead. You’ll lose data with any failing disk, and that can be a risk you accept. It’s up to you and your backup strategy.

But RAID is all about uptime?!

You will often hear: “RAID is about uptime, it’s not about backup”. And that’s true. Btrfs RAID is not really about uptime, though. It will, for instance, refuse to mount a degraded array by default, and there is a good reason not to add -o degraded to your /etc/fstab.
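
For completeness, a plain fstab entry for this filesystem might look like the following (mount point and options are just an example); note the absence of degraded:

# /etc/fstab – mount one RAID1 member, Btrfs finds the second device via the usual udev device scan
/dev/vg0/data-btrfs1  /mnt/data  btrfs  defaults,noatime  0  0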

I use this setup because I want to reduce the number of times I have to reach for my backups, not to increase uptime. If uptime were the goal, I’d have to use a hardware RAID with hot-swappable disks and/or other filesystems like ZFS that prioritize uptime. Btrfs refuses to silently keep working without admin intervention if there is anything wrong with the mirror. And for my scenario, that’s the better choice. Your mileage may (and likely will) differ.

If that’s important to you, you might look into ZFS, which has caching features built in, but I never had the time to evaluate it. I’d be happy to hear about your experiences.

4 thoughts on “Speeding up Btrfs RAID1 with LVM Cache”

    1. The rest of the drive is used for other logical volumes, some VMs, containers, etc. Of course you can use all the space if you want to. I normally only allocate what I need and expand on-demand.

  1. Nice blog, I like it.
    I am also a Linux enthusiast looking to experiment with btrfs on my laptop, so I’ve decided to try the raid thing 🙂
    I love the idea of having the stability of raid1, but I feel that I’m gonna miss the speed of raid0.
    *I do have 2 SSDs, each 1TB, and they are fast!

    Can you drop an analogy/comparison between raid0 and raid1, and maybe what we can improve for speed using raid1?
    The LVM cache will only help if I had another 2 SSDs, if my understanding is correct; please drop a few lines on that 🙂

    Much appreciated!

    1. So, first of all, if you already have 2 SSDs as your main storage, I don’t think you’ll benefit much from adding a cache to them. Just use them in whatever redundancy configuration suits your needs (-d raid1, -d single). If you had “largish” SATA SSDs and some “smallish” NVMe SSDs, this setup might help you, but I highly doubt it for anything real-world.

      I have never experimented with Btrfs RAID0. To my understanding, the main difference between -d single (which is Btrfs’ equivalent of JBOD) and RAID0 is the striping. Since chunks are striped across devices, I expect some limitations in flexibility when using devices of different sizes, but as I said, I have no real experience with it. My home-lab setup never had a use-case where I needed to speed up two SSDs. For HDDs, the presented cache solution provided more than enough benefit in my personal use-case, since the random-access performance of SSDs is that much better.

      If you have any experiences, please share them, I’m eager to learn 🙂
