safeguard against reuse of active caching device #115

onlyjob · 2016-07-15T17:27:09Z

I managed to wreck the dm-writeboost module by accidentally reusing an active caching device: I constructed another dm-writeboosted device on top of the same caching device.

dm-writeboost went into an endless loop, constantly reading from the caching device:

Jul 16 03:09:28 deblab kernel: [1901555.620293] dm-4: rw=1, want=2233227040, limit=1749020672
Jul 16 03:09:28 deblab kernel: [1901555.620294] attempt to access beyond end of device
Jul 16 03:09:28 deblab kernel: [1901555.620294] dm-4: rw=1, want=2233227816, limit=1749020672
Jul 16 03:09:28 deblab kernel: [1901555.620295] attempt to access beyond end of device
Jul 16 03:09:28 deblab kernel: [1901555.620296] dm-4: rw=1, want=2233227824, limit=1749020672
Jul 16 03:09:28 deblab kernel: [1901555.620296] attempt to access beyond end of device
Jul 16 03:09:28 deblab kernel: [1901555.620297] dm-4: rw=1, want=2233227832, limit=1749020672
Jul 16 03:09:28 deblab kernel: [1901555.620298] attempt to access beyond end of device
Jul 16 03:09:28 deblab kernel: [1901555.620299] dm-4: rw=1, want=2233227840, limit=1749020672
Jul 16 03:09:28 deblab kernel: [1901555.620299] attempt to access beyond end of device
Jul 16 03:09:28 deblab kernel: [1901555.620300] dm-4: rw=1, want=2233227944, limit=1749020672
Jul 16 03:09:28 deblab kernel: [1901555.620301] attempt to access beyond end of device
Jul 16 03:09:28 deblab kernel: [1901555.620301] dm-4: rw=1, want=2233227952, limit=1749020672

I could not find a way to interrupt it, so I had to resort to an emergency reboot.

Please consider introducing a safeguard to prevent accidental reuse of a caching device.

akiradeveloper · 2016-07-16T02:04:48Z

@onlyjob Sharing a caching device between two dm-writeboost'd devices is invalid.

I once considered guarding against this by embedding a unique identifier of the backing device into the superblock of the caching device, but concluded that there is no perfect solution.

The first reason is that no such perfect identifier exists (neither the device name nor the UUID is valid for this purpose).

The second reason is that we should let the userland tools (e.g. your writeboost) manage invalid operations. There are a lot of invalid operations that are theoretically possible, but guarding against them is the role of the userland tools, not of the kernel module.

However, looping forever until you have to reboot sounds like something that should be fixed, so I want to ask you for more details. I couldn't see what you actually did.

Give me lsblk
What does "active" mean?
Are the first wb'd device and the second one backed by devices of different sizes?
Did you create the second wb'd device before removing the first wb'd device, or after?

akiradeveloper · 2016-07-16T09:12:05Z

@onlyjob

I wrote a test to reproduce the error message you saw in dmesg:

class REPRO_115 extends DMTestSuite {
  test("endless loop when try to use active caching device") {
    Memory(Sector.M(8)) { caching => // 1
      Memory(Sector.M(128)) { backing1 => // 2
        Writeboost.sweepCaches(caching) // 3
        Writeboost.Table(backing1, caching).create { s => // 4
          Shell(s"dd if=/dev/urandom of=${s.bdev.path} bs=1M count=128") // write // 5
        }
      } // 6 (when moving out of block, device is removed)
      Memory(Sector.M(64)) { backing2 => // 7
        Writeboost.Table(backing2, caching).create { s => // 8
          Shell(s"dd if=${s.bdev.path} of=/dev/null bs=1M count=128") // read // 9
        }
      }
    }
  }
}

[ 3478.223470] attempt to access beyond end of device
[ 3478.223471] loop1: rw=1, want=254400, limit=131072
[ 3478.223497] attempt to access beyond end of device
[ 3478.223498] loop1: rw=1, want=254408, limit=131072
[ 3478.223525] attempt to access beyond end of device
[ 3478.223526] loop1: rw=1, want=254416, limit=131072
[ 3478.223552] attempt to access beyond end of device
[ 3478.223553] loop1: rw=1, want=254424, limit=131072
[ 3478.223580] attempt to access beyond end of device
[ 3478.223581] loop1: rw=1, want=254432, limit=131072
[ 3478.223607] attempt to access beyond end of device
[ 3478.223609] loop1: rw=1, want=254440, limit=131072
[ 3478.223635] attempt to access beyond end of device
[ 3478.223636] loop1: rw=1, want=254448, limit=131072
[ 3478.223663] attempt to access beyond end of device
[ 3478.223664] loop1: rw=1, want=254456, limit=131072

The test code does the following:

1. create an 8MB caching device
2. create a 128MB backing1
3. zero out the 1st sector of caching
4. create wb'd device s
5. write to s with dd
6. remove s
7. create a 64MB backing2 (smaller than backing1)
8. create wb'd device s again without zeroing out the caching device
9. read from s (but this is irrelevant)

The numbers correspond to the comments in the test code, so you can follow the code (not difficult, is it?).

The dmesg output comes from the writeback thread, because the dirty cache's destination address is far beyond 64MB (it's around 120MB).
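
To make the numbers in the log concrete, here is a minimal sketch of the arithmetic, assuming the usual 512-byte sectors. The helper names `sectors_to_mib` and `beyond_end` are hypothetical and are not part of dm-writeboost or the kernel; `beyond_end` only approximates the condition that produces "attempt to access beyond end of device".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SIZE 512ULL  /* assumed bytes per sector */
#define SECTORS_PER_MIB (1024ULL * 1024ULL / SECTOR_SIZE)

/* Convert a sector count to whole MiB (rounded down). */
uint64_t sectors_to_mib(uint64_t sectors)
{
    return sectors / SECTORS_PER_MIB;
}

/*
 * Approximation of the check behind the dmesg message:
 * "want" is the end sector of the request, "limit" is the
 * size of the device in sectors.
 */
bool beyond_end(uint64_t want, uint64_t limit)
{
    assert(limit > 0);
    return want > limit;
}
```

With the values from the log above, limit=131072 sectors is exactly 64 MiB (the size of backing2), while want=254400 sectors is about 124 MiB, an address that only exists on the 128 MiB backing1.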

onlyjob · 2016-07-17T00:43:42Z

Give me lsblk

I believe it would be irrelevant. Too many disks, etc.

What does "active" mean?

"Active" means in use.

I created one dm-writeboosted device (HDD1+SSD1), then I created another dm-writeboosted device (HDD2+SSD1), mistakenly reusing the caching device SSD1. The second writeboosted device was constructed while the first one was already in use.

Are the first wb'd device and the second one backed by devices of different sizes?

Probably the same size. IMHO irrelevant, as you can't reliably use size to detect whether a device is in use.

Did you create the second wb'd device before removing the first wb'd device, or after?

The actively used dm-writeboost device was not stopped, so yes, the second device was created before stopping the first one.

I'm not concerned about embedding a unique ID in caching devices. What's important here is to check whether the caching device is already allocated to another device before constructing a new dm-writeboosted device. Only a run-time check to prevent deadlock and potential data loss. Thanks.

akiradeveloper · 2016-07-17T09:39:45Z

I wrote a test for that case, but there is no error.

  test("making another device while in use") {
    Memory(Sector.M(8)) { caching =>
      Memory(Sector.M(64)) { backing1 =>
        Memory(Sector.M(64)) { backing2 =>
          Writeboost.sweepCaches(caching)
          Writeboost.Table(backing1, caching).create { s =>
            Writeboost.Table(backing2, caching).create { s =>
            }
          }
        }
      }
    }
  }

What's important here is to check whether the caching device is already allocated to another device before constructing a new dm-writeboosted device. Only a run-time check to prevent deadlock and potential data loss. Thanks.

dm_dev has a reference count internally, but target code can't access it. I'd rather not maintain my own data structure to manage that.

onlyjob · 2016-07-17T09:51:30Z

Well, I don't understand how your test works, but I wrecked my system once when I improperly constructed a writeboosted device by accidentally reusing the caching disk of another writeboosted device that was already in use. It was too easy to make such a mistake. A safeguard is important for safety.
This is similar to why the system prevents attempts to format a device with a mounted file system: it is foolproofing that may protect from accidents.

akiradeveloper · 2016-08-05T12:38:36Z

If we maintain a list of (backing, caching) pairs where both values are simply the names passed in, that is only effective between two reboots (and hot-swapping is another case where this can't avoid the problem described here).

I will look deeper into the dm code to see whether we can use the internally managed reference count.
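
As an illustration of the name-keyed list of (backing, caching) pairs discussed above, here is a minimal userspace sketch. Everything in it (`struct wb_registry`, `wb_registry_claim`, `wb_registry_release`, the fixed table size) is hypothetical and is not dm-writeboost code. It also has exactly the limitation already noted: it compares the names passed in, so it does not survive a reboot and cannot detect renamed or hot-swapped devices.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define MAX_PAIRS 16  /* arbitrary table size for the sketch */
#define NAME_LEN  64  /* arbitrary maximum name length */

/* One (backing, caching) pair, keyed by the names passed at construction. */
struct wb_pair {
    char backing[NAME_LEN];
    char caching[NAME_LEN];
    int in_use;
};

struct wb_registry {
    struct wb_pair pairs[MAX_PAIRS];
};

/* Record a pair; refuse with -EBUSY if the caching name is already claimed. */
int wb_registry_claim(struct wb_registry *r, const char *backing,
                      const char *caching)
{
    int free_slot = -1;

    assert(r && backing && caching);
    if (strlen(backing) >= NAME_LEN || strlen(caching) >= NAME_LEN)
        return -ENAMETOOLONG;

    for (int i = 0; i < MAX_PAIRS; i++) {
        if (!r->pairs[i].in_use) {
            if (free_slot < 0)
                free_slot = i;  /* remember the first empty slot */
            continue;
        }
        if (strcmp(r->pairs[i].caching, caching) == 0)
            return -EBUSY;      /* caching device already in use */
    }
    if (free_slot < 0)
        return -ENOSPC;         /* table is full */

    strcpy(r->pairs[free_slot].backing, backing);
    strcpy(r->pairs[free_slot].caching, caching);
    r->pairs[free_slot].in_use = 1;
    return 0;
}

/* Drop the pair that owns this caching name, if any. */
void wb_registry_release(struct wb_registry *r, const char *caching)
{
    assert(r && caching);
    for (int i = 0; i < MAX_PAIRS; i++) {
        if (r->pairs[i].in_use &&
            strcmp(r->pairs[i].caching, caching) == 0)
            r->pairs[i].in_use = 0;
    }
}
```

Starting from a zero-initialized registry, claiming (HDD1, SSD1) succeeds, and a later claim of (HDD2, SSD1) is rejected with -EBUSY until SSD1 is released.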

akiradeveloper · 2016-09-15T07:03:27Z

Let's try adding the FMODE_EXCL flag and see what changes. No existing target adds this flag, though.

static int consume_essential_argv(struct wb_device *wb, struct dm_arg_set *as)
{
    int err = 0;
    struct dm_target *ti = wb->ti;

    err = dm_get_device(ti, dm_shift_arg(as), dm_table_get_mode(ti->table),
                &wb->backing_dev);
    if (err) {
        DMERR("Failed to get backing_dev");
        return err;
    }

    err = dm_get_device(ti, dm_shift_arg(as), dm_table_get_mode(ti->table),
                &wb->cache_dev);

akiradeveloper · 2016-09-15T07:08:53Z

By default the mode doesn't include FMODE_EXCL.

static inline fmode_t get_mode(struct dm_ioctl *param)
{
        fmode_t mode = FMODE_READ | FMODE_WRITE;

        if (param->flags & DM_READONLY_FLAG)
                mode = FMODE_READ;

        return mode;
}

It's either READ | WRITE (normal) or READ (read-only).

akiradeveloper · 2016-09-18T14:39:18Z

The idea of adding FMODE_EXCL alone doesn't work.
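
For a check on the userland side, one possible building block is the Linux behaviour documented in open(2): O_EXCL without O_CREAT on a block device fails with EBUSY when the device is in use by the system (for example, mounted). The sketch below is only that probe; the function name `probe_exclusive` is hypothetical, and whether a caching device already held by a dm-writeboost table is reported as busy this way is an assumption to verify, not something confirmed in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Try to open a block device exclusively.
 * Returns 0 if the exclusive open succeeded, or a negative errno
 * (e.g. -EBUSY if the device is in use, -ENOENT if the path is missing).
 */
int probe_exclusive(const char *path)
{
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY | O_EXCL);
    if (fd < 0)
        return -errno;
    close(fd);  /* the probe does not keep the device open */
    return 0;
}
```

A tool could call this on the caching device before building a table and refuse to continue on -EBUSY. The probe is inherently racy, since the device can be claimed right after the check returns.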

akiradeveloper added the enhancement label Aug 2, 2016

akiradeveloper mentioned this issue Jan 22, 2017

Add validation: the backing or cache device is already used? akiradeveloper/dm-writeboost-tools#11
