Replica 3 volume with forced quorum 1 fault tolerance and recovery

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Replica 3 volume with forced quorum 1 fault tolerance and recovery

Dmitry Antipov
It seems that consistency of replica 3 volume with quorum forced to 1 becomes
broken after a few forced volume restarts initiated after 2 brick failures.
At least it breaks GFAPI clients, and even volume restart doesn't help.

Volume setup is:

Volume Name: test0
Type: Replicate
Volume ID: 919352fb-15d8-49cb-b94c-c106ac68f072
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.1.112:/glusterfs/test0-000
Brick2: 192.168.1.112:/glusterfs/test0-001
Brick3: 192.168.1.112:/glusterfs/test0-002
Options Reconfigured:
cluster.quorum-count: 1
cluster.quorum-type: fixed
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

Client is fio with the following options:

[global]
name=write
filename=testfile
ioengine=gfapi_async
volume=test0
brick=localhost
create_on_open=1
rw=randwrite
direct=1
numjobs=1
time_based=1
runtime=600

[test-4-kbytes]
bs=4k
size=1G
iodepth=128

How to reproduce:

0) start the volume;
1) run fio;
2) run 'gluster volume status', select 2 arbitrary brick processes
    and kill them;
3) make sure fio is OK;
4) wait a few seconds, then issue 'gluster volume start [VOL] force'
    to restart bricks, and finally issue 'gluster volume status' again
    to check whether all bricks are running;
5) restart from 2).

This is likely to work for a few times but, sooner or later, it breaks
at 3) and fio detects an I/O error, most probably EIO or ENOTCONN. Starting
from this point, killing and restarting fio yields in error in glfs_creat(),
and even the manual volume restart doesn't help.

NOTE: as of 7914c6147adaf3ef32804519ced850168fff1711, fio's gfapi_async
engine is still incomplete and _silently ignores I/O errors_. Currently
I'm using the following tweak to detect and report them (YMMV, consider
experimental):

diff --git a/engines/glusterfs_async.c b/engines/glusterfs_async.c
index 0392ad6e..27ebb6f1 100644
--- a/engines/glusterfs_async.c
+++ b/engines/glusterfs_async.c
@@ -7,6 +7,7 @@
  #include "gfapi.h"
  #define NOT_YET 1
  struct fio_gf_iou {
+ struct thread_data *td;
  struct io_u *io_u;
  int io_complete;
  };
@@ -80,6 +81,7 @@ static int fio_gf_io_u_init(struct thread_data *td, struct io_u *io_u)
      }
      io->io_complete = 0;
      io->io_u = io_u;
+    io->td = td;
      io_u->engine_data = io;
  return 0;
  }
@@ -95,7 +97,20 @@ static void gf_async_cb(glfs_fd_t * fd, ssize_t ret, void *data)
  struct fio_gf_iou *iou = io_u->engine_data;

  dprint(FD_IO, "%s ret %zd\n", __FUNCTION__, ret);
- iou->io_complete = 1;
+ if (ret != io_u->xfer_buflen) {
+ if (ret >= 0) {
+ io_u->resid = io_u->xfer_buflen - ret;
+ io_u->error = 0;
+ iou->io_complete = 1;
+ } else
+ io_u->error = errno;
+ }
+
+ if (io_u->error) {
+ log_err("IO failed (%s).\n", strerror(io_u->error));
+ td_verror(iou->td, io_u->error, "xfer");
+ } else
+ iou->io_complete = 1;
  }

  static enum fio_q_status fio_gf_async_queue(struct thread_data fio_unused * td,

--

Dmitry
_______________________________________________

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968




Gluster-devel mailing list
[hidden email]
https://lists.gluster.org/mailman/listinfo/gluster-devel

Reply | Threaded
Open this post in threaded view
|

Re: [Gluster-users] Replica 3 volume with forced quorum 1 fault tolerance and recovery

Strahil Nikolov
Replica 3 with quorum 1 ?
This is not good. I doubt anyone will help you with this. The idea of replica 3 volumes is to tolerate 1 node ,as when a second one is dead - only 1 will accept writes.

You can imagine the situation when 2 bricks are down and data is writen to brick 3. What happens when the brick 1 and 2 is up and running -> how is gluster going to decide where to heal from ?
2 is more than 1 , so the third node should delete the file instead of the opposite.

What are you trying to achive with the quorum 1 ?


Best Regards,
Strahil Nikolov






В вторник, 1 декември 2020 г., 14:09:32 Гринуич+2, Dmitry Antipov <[hidden email]> написа:





It seems that consistency of replica 3 volume with quorum forced to 1 becomes
broken after a few forced volume restarts initiated after 2 brick failures.
At least it breaks GFAPI clients, and even volume restart doesn't help.

Volume setup is:

Volume Name: test0
Type: Replicate
Volume ID: 919352fb-15d8-49cb-b94c-c106ac68f072
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.1.112:/glusterfs/test0-000
Brick2: 192.168.1.112:/glusterfs/test0-001
Brick3: 192.168.1.112:/glusterfs/test0-002
Options Reconfigured:
cluster.quorum-count: 1
cluster.quorum-type: fixed
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

Client is fio with the following options:

[global]
name=write
filename=testfile
ioengine=gfapi_async
volume=test0
brick=localhost
create_on_open=1
rw=randwrite
direct=1
numjobs=1
time_based=1
runtime=600

[test-4-kbytes]
bs=4k
size=1G
iodepth=128

How to reproduce:

0) start the volume;
1) run fio;
2) run 'gluster volume status', select 2 arbitrary brick processes
    and kill them;
3) make sure fio is OK;
4) wait a few seconds, then issue 'gluster volume start [VOL] force'
    to restart bricks, and finally issue 'gluster volume status' again
    to check whether all bricks are running;
5) restart from 2).

This is likely to work for a few times but, sooner or later, it breaks
at 3) and fio detects an I/O error, most probably EIO or ENOTCONN. Starting
from this point, killing and restarting fio yields in error in glfs_creat(),
and even the manual volume restart doesn't help.

NOTE: as of 7914c6147adaf3ef32804519ced850168fff1711, fio's gfapi_async
engine is still incomplete and _silently ignores I/O errors_. Currently
I'm using the following tweak to detect and report them (YMMV, consider
experimental):

diff --git a/engines/glusterfs_async.c b/engines/glusterfs_async.c
index 0392ad6e..27ebb6f1 100644
--- a/engines/glusterfs_async.c
+++ b/engines/glusterfs_async.c
@@ -7,6 +7,7 @@
  #include "gfapi.h"
  #define NOT_YET 1
  struct fio_gf_iou {
+    struct thread_data *td;
      struct io_u *io_u;
      int io_complete;
  };
@@ -80,6 +81,7 @@ static int fio_gf_io_u_init(struct thread_data *td, struct io_u *io_u)
      }
      io->io_complete = 0;
      io->io_u = io_u;
+    io->td = td;
      io_u->engine_data = io;
      return 0;
  }
@@ -95,7 +97,20 @@ static void gf_async_cb(glfs_fd_t * fd, ssize_t ret, void *data)
      struct fio_gf_iou *iou = io_u->engine_data;

      dprint(FD_IO, "%s ret %zd\n", __FUNCTION__, ret);
-    iou->io_complete = 1;
+    if (ret != io_u->xfer_buflen) {
+        if (ret >= 0) {
+            io_u->resid = io_u->xfer_buflen - ret;
+            io_u->error = 0;
+            iou->io_complete = 1;
+        } else
+            io_u->error = errno;
+    }
+
+    if (io_u->error) {
+        log_err("IO failed (%s).\n", strerror(io_u->error));
+        td_verror(iou->td, io_u->error, "xfer");
+    } else
+        iou->io_complete = 1;
  }

  static enum fio_q_status fio_gf_async_queue(struct thread_data fio_unused * td,

--

Dmitry
________



Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
[hidden email]
https://lists.gluster.org/mailman/listinfo/gluster-users
_______________________________________________

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968




Gluster-devel mailing list
[hidden email]
https://lists.gluster.org/mailman/listinfo/gluster-devel

Reply | Threaded
Open this post in threaded view
|

Re: [Gluster-users] Replica 3 volume with forced quorum 1 fault tolerance and recovery

Dmitry Antipov
On 12/1/20 4:15 PM, Strahil Nikolov wrote:

> You can imagine the situation when 2 bricks are down and data is writen to brick 3. What happens when the brick 1 and 2 is up and running -> how is gluster going to decide where to heal from ?

At least I can imagine the volume option to specify "let's assume that the only live brick contains the
most recent (and so hopefully valid) data, so newly (re)started ones are pleased to heal from it" behavior.

Dmitry
_______________________________________________

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968




Gluster-devel mailing list
[hidden email]
https://lists.gluster.org/mailman/listinfo/gluster-devel