Btrfs subvolume quota still in its infancy with btrfs version 4.2.2


Ever tried to dive into the btrfs subvolume topic, especially in combination with quotas (not snapshots here)? It looks really promising… administering subvolumes in hierarchies automagically offers quota management on the (summed-up) top level and on (dedicated) sub-levels by design, see : Btrfs SysadminGuide Subvolumes or Btrfs: Subvolumes and snapshots, for example. With the later 4.x kernels there is btrfs 4.2.2, representing a huge step forward in btrfs development, so I thought I'd give it another try on a Red Hat / Oracle UEK based 7.2 system.

In the following, I'm going to show what I attempted to achieve, the how-tos, the workarounds I tried and, intermixed, the quite odd behaviour that I observed. Odd to a magnitude that makes me recommend that everyone steer clear of this promising but still half-finished (?) technology.

Setting up a subvolume dedicated to quota control is quite easy and takes only a couple of keystrokes. Understand, though, that quota support in btrfs can only be enabled for the filesystem as a whole; a limit can then be set per subvolume individually.
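In a nutshell, the whole recipe boils down to three commands; a minimal sketch, assuming a btrfs root filesystem mounted at / (the session below walks through the same steps with checks in between):

# minimal sketch: enable quotas filesystem-wide, then limit a single subvolume
btrfs quota enable /                  # quota support is switched on for the whole filesystem
btrfs subvolume create /mnt/vlogs     # a new subvolume automatically gets a qgroup 0/<id>
btrfs qgroup limit 100m /mnt/vlogs    # the limit itself is set per subvolume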

# see what subvolumes already exist
# so /home is a child of / and /var/lib/machines is a child of /home
btrfs subvolume list -pt /
    ID      gen     parent  top level       path
    --      ---     ------  ---------       ----
    257     443195  5       5               home
    306     443188  257     257             var/lib/machines

# check the default (initial) top-level subvolume, whose subvolume id should be 5
btrfs subvolume get-default /
    ID 5 (FS_TREE)

# enable quotas on the file system level (only)
btrfs quota enable /

# see the quota groups for subvolumes
# no need to do "btrfs subvolume list <path> | cut -d' ' -f2 | xargs -I{} -n1 btrfs qgroup create 0/{} <path>"
#   for original (install time) subvolumes, btrfs does seem to do that automatically
# if "WARNING: Qgroup data inconsistent, rescan recommended" is given here
#   do a : "btrfs quota rescan -w /"
btrfs qgroup show -r /
    qgroupid         rfer         excl     max_rfer
    --------         ----         ----     --------
    0/5          16.00KiB     16.00KiB         none
    0/257         3.82GiB      3.82GiB         none
    0/306        16.00KiB     16.00KiB         none

# add another (newer) subvolume
btrfs subvolume create /mnt/vlogs
    Create subvolume '/mnt/vlogs'

# see the new subvolume existing (excerpt)
btrfs subvolume list -pt /
    ID      gen     parent  top level       path
    --      ---     ------  ---------       ----
    308     443211  257     257             mnt/vlogs

# subvolume quota group automatically generated again, as expected from the lesson above (excerpt)
btrfs qgroup show -r /
    qgroupid         rfer         excl     max_rfer
    --------         ----         ----     --------
    0/308        16.00KiB     16.00KiB         none

# set a limit on the space assigned to this qgroup
btrfs qgroup limit 100m /mnt/vlogs

# show the qgroup again, see the limit now (excerpt)
btrfs qgroup show -r /
    qgroupid         rfer         excl     max_rfer
    --------         ----         ----     --------
    0/308        16.00KiB     16.00KiB    100.00MiB
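Side note: a limit set this way caps the referenced bytes, which is why it shows up in the max_rfer column; btrfs qgroup limit also accepts -e to cap the exclusively assigned space instead (not used any further here).

# variant: cap exclusive (excl) rather than referenced (rfer) usage
btrfs qgroup limit -e 100m /mnt/vlogs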

That's it, more or less. What follows next, of course, is the limit check. So the question is what happens when… first an initial below-limit write, then intentionally exceeding the limit.

# test check 1, an initial write below the limit
dd if=/dev/zero of=/mnt/vlogs/junk bs=10k count=100
    100+0 records in
    100+0 records out
    1024000 bytes (1.0 MB) copied, 0.00180985 s, 566 MB/s

# du
du -sh /mnt/vlogs/
    1.0M   /mnt/vlogs/

# rescan the quotas and show the qgroup (excerpt)
# the btrfs quota scanner does not seem to run on callback but on a schedule, such that
#   an output of qgroup show may not ad hoc give the expected numbers; an estimated delay
#   of 10 secs was seen with btrfs 4.2.2
# so, for ad hoc correct results, prepend a rescan to the show
#   the same is true for the inverse case, e.g. an rm /mnt/vlogs/junk
(btrfs quota rescan -w / &&  btrfs qgroup show -r /)
    quota rescan started
    qgroupid         rfer         excl     max_rfer
    --------         ----         ----     --------
    0/5          16.00KiB     16.00KiB         none
    0/257         3.82GiB      3.82GiB         none
    0/306        16.00KiB     16.00KiB         none
    0/308      1016.00KiB   1016.00KiB    100.00MiB
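Since this rescan-plus-show pair recurs throughout, it is handy to wrap it into a tiny shell function; just a convenience sketch, the name qshow is mine:

# convenience sketch: wait for a full quota rescan, then show the qgroups of a path
qshow() { btrfs quota rescan -w "$1" && btrfs qgroup show -r "$1"; }
# usage: qshow /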

So far, so good; now exceed the limit.

# test check 2, exceed the limit
dd if=/dev/zero of=/mnt/vlogs/junk bs=10k count=100000
    dd: error writing ‘/mnt/vlogs/junk’: Disk quota exceeded
    10123+0 records in
    10122+0 records out
    103649280 bytes (104 MB) copied, 0.125126 s, 828 MB/s

du -sh /mnt/vlogs/
    99M     /mnt/vlogs/

(btrfs quota rescan -w / &&  btrfs qgroup show -r /)
    quota rescan started
    qgroupid         rfer         excl     max_rfer
    --------         ----         ----     --------
    0/5          16.00KiB     16.00KiB         none
    0/257         3.82GiB      3.82GiB         none
    0/306        16.00KiB     16.00KiB         none
    0/308        98.86MiB     98.86MiB    100.00MiB

Ok, up to 99M have been written, then the exception got raised. Everything as expected and intended. To save the day in that situation, room needs to be freed by trashing unimportant files, and this is, well, a touchy aspect for copy-on-write filesystems like btrfs, see : Copy-on-write in storage media and especially Re: working quota example?, because even deleting data requires allocating space, for a while at least. But never mind, the idea is this: since there should be enough free space on the next upper level of the subvolume, i.e. on the root volume, the limit may be temporarily reset to none, unimportant files removed, and the limit put back into effect again (a small helper wrapping these steps is sketched after the session below).

Note, by the way, that the famous "echo '' > /mnt/vlogs/junk" will not work reliably to erase (or truncate) a file after the overflow, although a lot of people on the net claim it does. Some public wikis even declare this procedure a documented fix. I saw it work, but only about one time in ten. Don't trust it.
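For the record, that one-liner and its common variants all boil down to a truncate, which presumably runs into the same CoW allocation problem as rm does:

# the quoted fix and equivalent truncate variants - don't rely on any of them here
echo '' > /mnt/vlogs/junk
: > /mnt/vlogs/junk
truncate -s 0 /mnt/vlogs/junk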

# copy-on-write...
rm /mnt/vlogs/junk
    rm: cannot remove ‘/mnt/vlogs/junk’: Disk quota exceeded

# unset the limit (i.e. set it to none/infinite) so that rm can succeed
# other option :
#   btrfs qgroup destroy 0/308 /mnt/vlogs
#   btrfs qgroup create 0/308 /mnt/vlogs
btrfs qgroup limit none /mnt/vlogs

# retry rm with success
rm /mnt/vlogs/junk

# reestablish the limit
btrfs qgroup limit 100m /mnt/vlogs
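If this lift-remove-relimit dance has to be performed more than once, the three steps can go into a small helper; just a sketch, with a function name and interface of my own making:

# sketch: delete files from an over-quota subvolume by temporarily lifting its limit
# usage: qgroup_rm <subvolume> <limit-to-reapply> <file>...
qgroup_rm() {
    local subvol="$1" limit="$2"; shift 2
    btrfs qgroup limit none "$subvol" || return 1   # lift the limit...
    rm -f -- "$@"                                   # ...so rm can allocate its CoW space
    btrfs qgroup limit "$limit" "$subvol"           # put the old limit back into effect
}
# example: qgroup_rm /mnt/vlogs 100m /mnt/vlogs/junk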

Everything is fine again? Not even close; this is where the odd behaviour begins. Without boring anyone with endless logs of what has been done, two types of oddity were spotted. First, btrfs erroneously announced a Disk quota exceeded even when there was enough (CoW-doubled) room on the subvolume for the file being written. Second, btrfs seems to recheck quota violations only in cycles of time, I think I witnessed a pattern of around 10 secs, such that a write to a subvolume with a quota will fail this time but may succeed the next time. Anyway, see this log for example, where the empty subvolume (set up like above) succeeds in writing twice before the write fails for some reason other than lack of space. How come?

# check the limit
btrfs qgroup show -r /
    qgroupid         rfer         excl     max_rfer
    --------         ----         ----     --------
    0/307        16.00KiB     16.00KiB    100.00MiB

# write 1m
dd if=/dev/zero of=/mnt/vlogs/junk bs=1M count=1
    1+0 records in
    1+0 records out
    1048576 bytes (1.0 MB) copied, 0.00448722 s, 234 MB/s

# write 5m
dd if=/dev/zero of=/mnt/vlogs/junk bs=1M count=5
    5+0 records in
    5+0 records out
    5242880 bytes (5.2 MB) copied, 0.0128641 s, 408 MB/s

# write 5m again - limit violation for whatever reason
dd if=/dev/zero of=/mnt/vlogs/junk bs=1M count=5
    dd: error writing ‘/mnt/vlogs/junk’: Disk quota exceeded
    1+0 records in
    0+0 records out
    917504 bytes (918 kB) copied, 0.00250201 s, 367 MB/s

# check the limit - there would have been plenty of room for just another 5m
btrfs qgroup show -r /
    qgroupid         rfer         excl     max_rfer
    --------         ----         ----     --------
    0/307         5.02MiB      5.02MiB    100.00MiB
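To pin down the suspected roughly-10-second enforcement cycle, a probe loop along these lines could log each write attempt with a timestamp (just a sketch; interval and iteration count are guesses to be tuned):

# probe sketch: repeat small writes and watch when 'Disk quota exceeded' comes and goes
for i in $(seq 1 20); do
    date '+%H:%M:%S'
    dd if=/dev/zero of=/mnt/vlogs/junk bs=1M count=5 2>&1 | tail -n1
    sleep 2
done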

Already disappointed but still not beaten, I came up with the idea of just switching off copy-on-write for the subvolume in question. The documentation says this is possible in two ways: either by specifying nodatacow in the mount options for the subvolume, or by setting the C attribute on the subvolume directory, which then applies to any (newly created) files within it (see : Btrfs Disabling CoW). However, the mount option technique (see : FAQ: Can I mount subvolumes with different mount options?) is currently not available, at least according to : Mount options (at the very top). The alternate approach, the C flag, did not show the hoped-for effect on testing. Well, it did at first, but then…

# wipe the slate: delete the subvolume
btrfs subvolume delete -c /mnt/vlogs
    Delete subvolume (commit): '/mnt/vlogs'

# gone?
ll /mnt/vlogs/
    ls: cannot access /mnt/vlogs/: No such file or directory

# anew
btrfs subvolume create /mnt/vlogs
    Create subvolume '/mnt/vlogs'

# view
ll /mnt
  drwxr-xr-x. 1 root root   0 Jun 24 14:01 vlogs

# do the C thing
chattr +C /mnt/vlogs/
    (no output)

# check C
lsattr -Ra /mnt/vlogs/
    ---------------C /mnt/vlogs/.
    ---------------- /mnt/vlogs/..

# limit again
btrfs qgroup limit 100m /mnt/vlogs
    (no output)

# limit check
(btrfs quota rescan -w / &&  btrfs qgroup show -r /)
    qgroupid         rfer         excl     max_rfer
    --------         ----         ----     --------
    0/309        16.00KiB     16.00KiB    100.00MiB

# exceed
dd if=/dev/zero of=/mnt/vlogs/junk bs=10M count=100000
    dd: error writing ‘/mnt/vlogs/junk’: Disk quota exceeded
    10+0 records in
    9+0 records out
    102760448 bytes (103 MB) copied, 0.0776324 s, 1.3 GB/s

# exceed check
(btrfs quota rescan -w / &&  btrfs qgroup show -r /)
    qgroupid         rfer         excl     max_rfer
    --------         ----         ----     --------
    0/309        98.02MiB     98.02MiB    100.00MiB

# the C thing on the file ??
lsattr -Ra /mnt/vlogs/
    ---------------C /mnt/vlogs/.
    ---------------- /mnt/vlogs/..
    ---------------C /mnt/vlogs/junk

# du says what?
du -h /mnt/vlogs
    98M     /mnt/vlogs

# yippie, it works
rm -f /mnt/vlogs/junk
    (no output)

# try again
dd if=/dev/zero of=/mnt/vlogs/junk bs=10M count=100000
    dd: error writing ‘/mnt/vlogs/junk’: Disk quota exceeded
    10+0 records in
    9+0 records out
    102760448 bytes (103 MB) copied, 0.0707266 s, 1.5 GB/s

# yippie, it works it works it works it works
rm -f /mnt/vlogs/junk
    (no output)

# not true: this is only 50m on an empty subvolume -> giving up
dd if=/dev/zero of=/mnt/vlogs/junk bs=10M count=5
  dd: error writing ‘/mnt/vlogs/junk’: Disk quota exceeded
  1+0 records in
  0+0 records out
  1966080 bytes (2.0 MB) copied, 0.00459045 s, 428 MB/s

Really giving up at this stage of affairs.

PS: One more point I remember, though not shown here because I don't have the logs anymore, and it may not apply to the C attribute approach explicitly without copy-on-write: beware that the file triggering the Disk quota exceeded on write may (or will) be gone entirely afterwards. This contradicts the purpose of a copy-on-write filesystem, right? This is the scenario that should not happen by design, right?

Have fun, Peter
