Weird full heal on Distributed-Disperse volume with sharding


Weird full heal on Distributed-Disperse volume with sharding

Dmitry Antipov
For testing purposes, I've set up a localhost-only configuration with 6x16M
ramdisks (formatted as ext4), mounted (with '-o user_xattr') at
/tmp/ram/{0,1,2,3,4,5}, and SHARD_MIN_BLOCK_SIZE lowered to 4K. The resulting
volume is:

Volume Name: test
Type: Distributed-Replicate
Volume ID: 241d6679-7cd7-48b4-bdc5-8bc1c9940ac3
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: [local-ip]:/tmp/ram/0
Brick2: [local-ip]:/tmp/ram/1
Brick3: [local-ip]:/tmp/ram/2
Brick4: [local-ip]:/tmp/ram/3
Brick5: [local-ip]:/tmp/ram/4
Brick6: [local-ip]:/tmp/ram/5
Options Reconfigured:
features.shard-block-size: 64KB
features.shard: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
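
For reference, a setup along these lines can be reproduced roughly as follows
(the ramdisk module parameters and device names are assumptions, not
necessarily the exact commands used):

# modprobe brd rd_nr=6 rd_size=16384        # 6 ramdisks of 16M each (assumed)
# for i in 0 1 2 3 4 5; do mkfs.ext4 -q /dev/ram$i; \
    mkdir -p /tmp/ram/$i; mount -o user_xattr /dev/ram$i /tmp/ram/$i; done
# gluster volume create test replica 3 [local-ip]:/tmp/ram/{0,1,2,3,4,5} force
# gluster volume start test
# gluster volume set test features.shard on
# gluster volume set test features.shard-block-size 64KB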

Then I mount it under /mnt/test:

# mount -t glusterfs [local-ip]:/test /mnt/test

and create a 4M file on it:

# dd if=/dev/random of=/mnt/test/file0 bs=1M count=4

This creates 189 shards of 64K each, in /tmp/ram/?/.shard:

/tmp/ram/0/.shard: 24
/tmp/ram/1/.shard: 24
/tmp/ram/2/.shard: 24
/tmp/ram/3/.shard: 39
/tmp/ram/4/.shard: 39
/tmp/ram/5/.shard: 39
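
The per-brick counts above can be gathered with something like:

# for d in /tmp/ram/*/.shard; do echo "$d: $(ls $d | wc -l)"; done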

To simulate data loss I just remove 2 arbitrary .shard directories,
for example:

# rm -rfv /tmp/ram/0/.shard /tmp/ram/5/.shard

Finally, I do a full heal:

# gluster volume heal test full

and successfully get all shards under /tmp/ram/{0,5}/.shard back.
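(Completion of the heal can be verified with 'gluster volume heal test info',
which should report zero pending entries for every brick once it finishes.)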

But things seem to go weird with the following volume:

Volume Name: test
Type: Distributed-Disperse
Volume ID: aa621c7e-1693-427a-9fd5-d7b38c27035e
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: [local-ip]:/tmp/ram/0
Brick2: [local-ip]:/tmp/ram/1
Brick3: [local-ip]:/tmp/ram/2
Brick4: [local-ip]:/tmp/ram/3
Brick5: [local-ip]:/tmp/ram/4
Brick6: [local-ip]:/tmp/ram/5
Options Reconfigured:
features.shard: on
features.shard-block-size: 64KB
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
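
A disperse volume like this one is presumably created with something along
these lines (the exact command is an assumption):

# gluster volume create test disperse 3 redundancy 1 [local-ip]:/tmp/ram/{0,1,2,3,4,5} force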

After creating a 4M file as before, I got the same 189 shards,
but 32K each. After deleting /tmp/ram/{0,5}/.shard and doing a full heal,
I was able to get all shards back. But after deleting
/tmp/ram/{3,4}/.shard and doing a full heal, I ended up with the following:

/tmp/ram/0/.shard:
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.10
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.11
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.12
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.13
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.14
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.15
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.16
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.17
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.2
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.22
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.23
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.27
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.28
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.3
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.31
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.34
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.35
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.37
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.39
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.4
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.40
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.44
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.45
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.46
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.47
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.53
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.54
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.55
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.57
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.58
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.6
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.63
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.7
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.9

/tmp/ram/1/.shard:
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.10
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.11
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.12
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.13
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.14
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.15
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.16
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.17
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.2
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.22
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.23
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.27
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.28
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.3
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.31
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.34
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.35
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.37
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.39
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.4
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.40
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.44
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.45
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.46
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.47
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.53
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.54
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.55
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.57
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.58
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.6
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.63
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.7
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.9

/tmp/ram/2/.shard:
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.10
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.11
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.12
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.13
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.14
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.15
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.16
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.17
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.2
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.22
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.23
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.27
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.28
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.3
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.31
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.34
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.35
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.37
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.39
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.4
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.40
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.44
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.45
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.46
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.47
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.53
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.54
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.55
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.57
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.58
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.6
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.63
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.7
-rw-r--r-- 2 root root 32768 Sep 29 12:01 951d7c52-7230-420b-b8bb-da887fffd41e.9

So, /tmp/ram/{3,4}/.shard were not recovered. Even worse, /tmp/ram/5/.shard
has disappeared completely. And of course this breaks all I/O on /mnt/test/file0,
for example:

# dd if=/dev/random of=/mnt/test/file0 bs=1M count=4
dd: error writing '/mnt/test/file0': No such file or directory
dd: closing output file '/mnt/test/file0': No such file or directory

Any ideas on what's going on here?

Dmitry

Re: Weird full heal on Distributed-Disperse volume with sharding

Xavi Hernandez
Hi Dmitry,

my comments below...

On Tue, Sep 29, 2020 at 11:19 AM Dmitry Antipov <[hidden email]> wrote:
> [...]
>
> Volume Name: test
> Type: Distributed-Disperse
> Number of Bricks: 2 x (2 + 1) = 6
> [...]
>
> After creating a 4M file as before, I got the same 189 shards,
> but 32K each.

This is normal. A dispersed volume writes encoded fragments of each block in each brick. In this case it's a 2+1 configuration, so each block is divided into 2 fragments. A third fragment is generated for redundancy and stored on the third brick.
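In numbers: with a 64KB shard-block-size and 2 data fragments, each 64 KiB shard is stored as two 32 KiB data fragments plus one 32 KiB redundancy fragment, one per brick of the disperse set, which matches the 32768-byte shard files in the listings above.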
 
> After deleting /tmp/ram/{0,5}/.shard and doing a full heal,
> I was able to get all shards back. But after deleting
> /tmp/ram/{3,4}/.shard and doing a full heal, I ended up with the following:

This is not right. A disperse 2+1 configuration only supports a single failure. Wiping 2 fragments from the same file makes the file unrecoverable. Disperse works using the Reed-Solomon erasure code, which requires at least 2 healthy fragments to recover the data (in a 2+1 configuration).

If you want to be able to recover from 2 disk failures, you need to create a 4+2 configuration.

To make it more clear: a 2+1 configuration is like a traditional RAID5 with 3 disks. If you lose 2 disks, data is lost. A 4+2 is similar to a RAID6.
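
With the same six bricks, a single 4+2 disperse set can be created with something like (the exact command is an illustration):

# gluster volume create test disperse 6 redundancy 2 [local-ip]:/tmp/ram/{0,1,2,3,4,5} force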

Regards,

Xavi



Re: Weird full heal on Distributed-Disperse volume with sharding

Dmitry Antipov
On 9/30/20 8:58 AM, Xavi Hernandez wrote:

> This is normal. A dispersed volume writes encoded fragments of each block in each brick. In this case it's a 2+1 configuration, so each block is divided into 2 fragments. A third fragment is generated
> for redundancy and stored on the third brick.

OK. But for a Distributed-Replicate 2 x 3 setup and 64K shards, a 4M file should be split into (4096 / 64) * 3 = 192 shards, not 189. So why 189?

And if all bricks are considered equal and have enough free space, the shard distribution {24, 24, 24, 39, 39, 39} looks suboptimal.
Why not {31, 32, 31, 32, 31, 32}? Isn't that a bug?

> This is not right. A disperse 2+1 configuration only supports a single failure. Wiping 2 fragments from the same file makes the file unrecoverable. Disperse works using the Reed-Solomon erasure code,
> which requires at least 2 healthy fragments to recover the data (in a 2+1 configuration).

It seems that I missed the point that all bricks are considered equal, regardless of the physical host they're attached to.

So, for a Distributed-Disperse 2 x (2 + 1) setup with 3 hosts, 2 bricks each, and two files, A and B, it's possible to have
the following layout:

Host0:                  Host1:                  Host2:
|- Brick0: A0 B0        |- Brick0: A1           |- Brick0: A2
|- Brick1: B1           |- Brick1: B2           |- Brick1:

This setup can tolerate a single brick failure but not a single host failure, because if Host0 goes down, two fragments of B are lost
and B becomes unrecoverable (while A is not).

If this is so, is it possible/hard to enforce 'one fragment per *host*' behavior? If we can guarantee the following:

Host0:                  Host1:                  Host2:
|- Brick0: A0           |- Brick0: A1           |- Brick0: A2
|- Brick1: B1           |- Brick1: B2           |- Brick1: B0

this setup can tolerate both single brick and single host failures.

Dmitry

Re: Weird full heal on Distributed-Disperse volume with sharding

Xavi Hernandez
Hi Dmitry,

On Wed, Sep 30, 2020 at 9:21 AM Dmitry Antipov <[hidden email]> wrote:
> On 9/30/20 8:58 AM, Xavi Hernandez wrote:

>> This is normal. A dispersed volume writes encoded fragments of each block in each brick. In this case it's a 2+1 configuration, so each block is divided into 2 fragments. A third fragment is generated
>> for redundancy and stored on the third brick.

> OK. But for a Distributed-Replicate 2 x 3 setup and 64K shards, a 4M file should be split into (4096 / 64) * 3 = 192 shards, not 189. So why 189?

In fact, there aren't 189 shards. There are 63 shards replicated 3 times each. The shard 0 is not inside the .shard directory. It's placed in the directory where the file was created. So there are a total of 64 chunks of 64 KiB = 4 MiB.
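In numbers: a 4 MiB file at 64 KiB per shard gives 4096 / 64 = 64 blocks; block 0 stays with the base file, the remaining 63 land in .shard, and with replica 3 that makes 63 * 3 = 189 shard files across the bricks.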


> And if all bricks are considered equal and have enough free space, the shard distribution {24, 24, 24, 39, 39, 39} looks suboptimal.

Shards are distributed exactly the same way as regular files. This means that they are balanced based on a random distribution (with some correction when free space is not equal, but that is irrelevant here). Random distributions tend to balance the number of files very well, but only with a large number of files. Statistics on a small number of files may be biased.

If you keep adding new files to the volume, the balance will improve.
 
> Why not {31, 32, 31, 32, 31, 32}? Isn't that a bug?

This can't happen. When you create a 2 x 3 replicated volume, you are creating 2 independent replica 3 subvolumes. The first replica set is composed of the first 3 bricks, and the second of the last 3. The distribution layer chooses on which replica set to put each file.

It's not a bug. It's by design. Gluster can work with multiple clients creating files simultaneously. To force a perfect distribution, all of them would have to synchronize to decide where to create each file. This would have a significant performance impact. Instead of that, distribution is done randomly, which allows each client to work independently and it will balance files pretty well in the long term.


>> This is not right. A disperse 2+1 configuration only supports a single failure. Wiping 2 fragments from the same file makes the file unrecoverable. Disperse works using the Reed-Solomon erasure code,
>> which requires at least 2 healthy fragments to recover the data (in a 2+1 configuration).

> It seems that I missed the point that all bricks are considered equal, regardless of the physical host they're attached to.

All bricks are considered equal inside a single replica/disperse set. A 2 x (2 + 1) configuration has 2 independent disperse sets, so only one brick from each of them may fail without data loss. If you want to support any 2 brick failures, you need to use a 1 x (4 + 2) configuration. In this case there's a single disperse set which tolerates up to 2 brick failures.
 

> So, for a Distributed-Disperse 2 x (2 + 1) setup with 3 hosts, 2 bricks each, and two files, A and B, it's possible to have
> the following layout:
>
> Host0:                  Host1:                  Host2:
> |- Brick0: A0 B0        |- Brick0: A1           |- Brick0: A2
> |- Brick1: B1           |- Brick1: B2           |- Brick1:

No, this won't happen. A single file will go either to brick0 of all hosts or brick1 of all hosts. They won't be mixed.


> This setup can tolerate a single brick failure but not a single host failure, because if Host0 goes down, two fragments of B are lost
> and B becomes unrecoverable (while A is not).
>
> If this is so, is it possible/hard to enforce 'one fragment per *host*' behavior? If we can guarantee the following:
>
> Host0:                  Host1:                  Host2:
> |- Brick0: A0           |- Brick0: A1           |- Brick0: A2
> |- Brick1: B1           |- Brick1: B2           |- Brick1: B0

This is how it currently works. You only need to take care of creating the volume with the bricks in the right order. In this case the order should be H0/B0, H1/B0, H2/B0, H0/B1, H1/B1, H2/B1. Anyway, if you create the volume in an incorrect order and two bricks of the same disperse set end up on the same host, the operation will complain about it. Gluster will only accept such a layout if you create the volume with the 'force' option.
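
For the 2 x (2 + 1) example above, that brick order corresponds to a create command along these lines (host names are placeholders):

# gluster volume create test disperse 3 redundancy 1 \
      host0:/brick0 host1:/brick0 host2:/brick0 \
      host0:/brick1 host1:/brick1 host2:/brick1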

Regards,

Xavi

