Wednesday, January 13, 2010

Increase performance of an encrypted RAID5

Maybe someone has read my last post on encrypting a RAID5 with cryptsetup (dm-crypt/LUKS). I knew that the encryption layer would cost me some performance, but I wasn't aware of how much. Keep in mind that we're talking about an encrypted software RAID5 here, which means we lose performance both to calculating the parity information and to the encryption layer. If you don't need or want encryption, or you use a hardware RAID controller, the performance should of course be much better. Nevertheless, the tuning I had to apply also holds true for unencrypted RAID arrays.
So after the RAID5 was set up I started to copy my data back from the backup disk. Curious as I was, I did this with the time command. But first let's look at what amount of data we're talking about:
root@server:~# du -csh /mnt/RAID/DATEN/
752G    /mnt/RAID/DATEN/
752G    total
That's quite a bit. But hey, needs must, so let's fire it up:
root@server:~# time cp -a /mnt/BACKUP/DATEN/ /mnt/RAID/

real    789m46.834s
user    0m11.391s
sys     55m13.869s
Oh f***, that took a while. Simple math (752 GiB in roughly 790 minutes) gives us something like 16 MiB/s write performance, and that should not be acceptable. OK, I know that copying such an amount of data with cp can't be taken as a serious write performance test, but anyhow it gives you an idea. Searching through the net I quickly found out what I was missing when creating the encrypted RAID5, and especially the ext3 filesystem on it. I have to admit that I still don't fully understand all the chunk-size, stripe-size, block-size hard disk slang 1337 5P34K, but I could at least extract the information I needed.
By default, mdadm sets the chunk size of the RAID array to 64K. I've read somewhere that for a file server with large files you could increase this to 128K or 256K. I don't know if that's true or how big the performance gain would be, but what I do know is that I didn't want to spend another two days reshaping the RAID array. So I stuck with 64K.
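By the way, if you're not sure which chunk size your array ended up with, mdadm will tell you. Just a quick example check, assuming the array is /dev/md0 like on my box (a different chunk size would have to be chosen at creation time, e.g. with mdadm --create ... --chunk=128):
root@server:~# mdadm --detail /dev/md0 | grep -i chunk
     Chunk Size : 64K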

I have to make a little digression here. I also didn't care much about partition alignment. As far as I understand it, it is not that important for a software RAID assembled out of partitions, e.g. /dev/sda1, /dev/sdb1 etc. It may have an impact when using whole devices (/dev/sda etc.) instead and putting the partition table on top of the RAID device /dev/md0. I simply wanted to avoid any trouble with the partitions and running into the situation of having to reshape the RAID array again.
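If you do want to have a quick look at alignment anyway, the start sectors are easy enough to check. A small sketch, assuming the member disks are /dev/sda, /dev/sdb and /dev/sdc; a start sector divisible by 8 means the partition begins on a 4K boundary, which is what the newer 4K-sector drives care about:
root@server:~# fdisk -lu /dev/sda
(and the same for /dev/sdb and /dev/sdc; the "Start" column holds the sector numbers to look at)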

OK, back to the topic. The last time I put the filesystem on the encrypted RAID device I used
me@server:~$ sudo mkfs.ext3 -c /dev/mapper/cryptraid
which created the filesystem with a default block size of 4K. But because the data is written to the RAID in 64K chunks (see above), I should have passed additional parameters to mkfs.ext3. This time I did:
me@server:~$ sudo mkfs.ext3 -b 4096 -R stride=16,stripe-width=32 /dev/mapper/cryptraid
which tells the filesystem that the data comes in chunks of 16*4K = 64K and that the stripe width of my three-disk array is 32 blocks. The math goes like this:
  • chunk-size = 64K (that's what mdadm uses as default, maybe not the best choice here..but anyway)
  • block-size = 4K (that's the recommended block size for a large-file filesystem)
  • stride = chunk-size / block-size = 64K / 4K = 16 blocks
  • stripe-width = stride * [(disks in RAID5) - 1] = 16 * (3 - 1) = 32 blocks
If the chunk size is 64K, it means that 64K of consecutive data will reside on one disk. If one builds a filesystem with a 4K block size, there are 16 filesystem blocks in one array chunk. The stripe width is calculated by multiplying the stride of 16 by the number of data disks in the array. That's why we subtract one from the number of disks: in each stripe, one disk's worth of space holds the parity information.
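Just to have the whole calculation in one place, here it is as a little shell snippet (nothing fancy, just my numbers plugged in; adjust CHUNK_KB, BLOCK_KB and DISKS for your own array):
# stride / stripe-width calculation for my 3-disk RAID5
CHUNK_KB=64                              # mdadm chunk size in KiB
BLOCK_KB=4                               # ext3 block size in KiB
DISKS=3                                  # total number of disks in the RAID5
STRIDE=$((CHUNK_KB / BLOCK_KB))          # 16 filesystem blocks per chunk
STRIPE_WIDTH=$((STRIDE * (DISKS - 1)))   # 32 blocks, one disk per stripe holds parity
echo "stride=$STRIDE stripe-width=$STRIPE_WIDTH"
As far as I can tell, tune2fs should be able to set these values on an existing ext3 filesystem as well (something along the lines of tune2fs -E stride=16,stripe_width=32 /dev/mapper/cryptraid), but I haven't tried that myself.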
Making a simple dd if=/dev/zero of=/mnt/RAID/test.dat bs=X count=Y benchmark gives:
  • bs=   4K count = 480000 ≈ 2 GB ==>  80.9 MiB/s
  • bs=   8K count = 240000 ≈ 2 GB ==>  88.8 MiB/s
  • bs=  16K count = 120000 ≈ 2 GB ==>  98.0 MiB/s
  • bs=  32K count =  60000 ≈ 2 GB ==> 107.0 MiB/s
  • bs=  64K count =  30000 ≈ 2 GB ==>  93.1 MiB/s
  • bs= 128K count =  15000 ≈ 2 GB ==>  94.2 MiB/s
  • bs= 256K count =   7500 ≈ 2 GB ==>  95.8 MiB/s
  • bs= 512K count =   3750 ≈ 2 GB ==> 103.0 MiB/s
  • bs=1024K count =   1875 ≈ 2 GB ==>  93.5 MiB/s
So it seems that the maximum write performance of 107 MiB/s is reached at a block size of 32K, which happens to match the stripe-width value (in K rather than blocks). The other peak, at 512K, is stride * stripe-width. Maybe this should tell me something, but right now it doesn't. ;-)
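In case someone wants to reproduce this, the individual dd runs can be wrapped in a little loop like the following (a rough sketch only; plain dd like this also measures the effect of the page cache, so adding conv=fdatasync would give more honest absolute numbers):
# quick-and-dirty write test over a range of block sizes (~2 GB per run)
for BS in 4K 8K 16K 32K 64K 128K 256K 512K 1024K; do
    dd if=/dev/zero of=/mnt/RAID/test.dat bs=$BS count=$((1920000 / ${BS%K}))
    rm /mnt/RAID/test.dat
done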

But while reading through several posts on the net, I noticed that people keep mentioning you should check the stripe_cache_size value of the RAID.
root@server:~# cat /sys/block/md0/md/stripe_cache_size
256
That seemed quite low to me, because everyone was talking about 4K, 8K and 16K stripe_cache_sizes. So I decided to bump it to 8K
root@server:~# echo 8192 > /sys/block/md0/md/stripe_cache_size
and to fire up the copying again. This time I got
root@server:~# time cp -a /mnt/BACKUP/DATEN/ /mnt/RAID/

real    234m27.764s
user    0m19.860s
sys     58m33.030s
which works out to something like 55 MiB/s and looks much better than before. hdparm shows me around 90 MiB/s for buffered disk reads, which together with the current write performance is enough for my needs.
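One caveat with the stripe_cache_size tweak: the echo above does not survive a reboot, and the cache is not free either. As far as I know it costs page_size * number_of_disks * stripe_cache_size of RAM, so 4K * 3 * 8192 = 96 MiB in my case. A crude way to make it permanent would be something like this in /etc/rc.local (before the exit 0 line):
# re-apply the stripe cache setting for /dev/md0 at boot
echo 8192 > /sys/block/md0/md/stripe_cache_size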

I'd appreciate any comments and tips on this topic because, first of all, I'm really not very deep into this, and second, I really don't know which of the two tunings had the bigger influence on the performance increase. I found very helpful information here, here and here, and of course on the unholy G.

cheers
