Re: [Gluster-users] "rpc_clnt_ping_timer_expired" errors


Re: [Gluster-users] "rpc_clnt_ping_timer_expired" errors

Raghavendra Gowdappa

I would like to point out another issue. Even if what I suggested prevents disconnects, it would only be symptomatic treatment and wouldn't address the root cause of the problem. In most ping-timer-expiry issues, the root cause is increased load on bricks and the inability of the bricks to stay responsive under that load. So the actual solution would be to do either or both of the following:
* identify the source of the increased load and, if possible, throttle it. Internal heal processes like self-heal, rebalance and quota heal are known to pump traffic into bricks without much throttling (io-threads _might_ do some throttling, but my understanding is that it's not sufficient).
* identify the reason the bricks become unresponsive under load. This may be a fixable issue, like too few event-threads reading from the network; a difficult-to-fix issue, like an fsync on the backend filesystem freezing the process; or a semi-fixable issue (in code), like lock contention.

So any genuine effort to fix ping-timer issues (to be honest, most of the time they are not issues related to rpc/network) would involve performance characterization of the various subsystems on bricks and clients. These subsystems include (but are not necessarily limited to) the underlying OS/filesystem, the glusterfs processes, CPU consumption, etc.
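As a concrete starting point for that characterization, gluster's own profiling support can show which fops are slow on each brick. A minimal sketch, assuming the volume is named "tier2" (as the client translator names in your logs suggest):

# enable per-brick fop statistics (run on any server node)
gluster volume profile tier2 start
# ... reproduce the load, then dump cumulative and interval stats
gluster volume profile tier2 info
# turn it off afterwards, since profiling adds some overhead
gluster volume profile tier2 stop

Correlating the slow fops reported there with top/iostat on the brick nodes should narrow down which subsystem is the bottleneck.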

regards,
Raghavendra

On Mon, Mar 4, 2019 at 9:31 PM Mauro Tridici <[hidden email]> wrote:
Thank you, let’s try!
I will inform you about the effects of the change.

Regards,
Mauro

On 4 Mar 2019, at 16:55, Raghavendra Gowdappa <[hidden email]> wrote:



On Mon, Mar 4, 2019 at 8:54 PM Mauro Tridici <[hidden email]> wrote:
Hi Raghavendra,

thank you for your reply.
Yes, you are right. It is a problem that seems to happen randomly.
At the moment, the server.event-threads value is 4. I will try to increase this value to 8. Do you think that would be a valid value?

Yes, we can try with that. At the very least, you should see the frequency of ping-timer-related disconnects reduce with this value (even if it doesn't eliminate the problem completely).
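For reference, a sketch of the change, assuming the volume is named "tier2" as the client names in your logs suggest:

# raise the number of epoll worker threads on the brick side
gluster volume set tier2 server.event-threads 8

client.event-threads is the client-side counterpart and may be worth tuning as well if the mount itself is busy.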


Regards,
Mauro


On 4 Mar 2019, at 15:36, Raghavendra Gowdappa <[hidden email]> wrote:



On Mon, Mar 4, 2019 at 8:01 PM Nithya Balachandran <[hidden email]> wrote:
Hi Mauro,

It looks like there is some problem on s06. Are all your other nodes ok? Can you send us the gluster logs from this node?

[hidden email], do you have any idea how this can be debugged? Maybe by running top? Or with debug brick logs?

If we can reproduce the problem, collecting a tcpdump on both ends of the connection will help. But one common problem is that these bugs are inconsistently reproducible, so we may not be able to capture the tcpdump at the correct intervals. Other than that, we can try to collect some evidence that the poller threads were busy (waiting on locks), but I'm not sure what debug data provides that information.
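If you do manage to catch it, something along these lines on both the client and the brick node would be enough (the address and port here are just copied from one of the disconnect messages in your logs):

# capture full packets for a single brick connection
tcpdump -i any -s 0 -w /tmp/gluster-ping.pcap host 192.168.0.55 and port 49158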

From what I know, it's difficult to collect evidence for this issue, and we can only reason about it.

We can try a workaround though: increase server.event-threads and see whether the ping-timer expiry issues go away at an optimal value. If that's the case, it provides some proof for our hypothesis.



Regards,
Nithya

On Mon, 4 Mar 2019 at 15:25, Mauro Tridici <[hidden email]> wrote:
Hi All,

a few minutes ago I received this message from the NAGIOS server:

***** Nagios *****

Notification Type: PROBLEM

Service: Brick - /gluster/mnt2/brick
Host: s06
Address: s06-stg
State: CRITICAL

Date/Time: Mon Mar 4 10:25:33 CET 2019

Additional Info:
CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.

I checked the network, RAM and CPU usage on the s06 node and everything seems to be ok.
No bricks are in an error state. In /var/log/messages, I again detected a crash of “check_vol_utili”, which I think is a module used by the NRPE executable (the NAGIOS client).

Mar  4 10:15:29 s06 kernel: traps: check_vol_utili[161224] general protection ip:7facffa0a66d sp:7ffe9f4e6fc0 error:0 in libglusterfs.so.0.0.1[7facff9b7000+f7000]
Mar  4 10:15:29 s06 abrt-hook-ccpp: Process 161224 (python2.7) of user 0 killed by SIGSEGV - dumping core
Mar  4 10:15:29 s06 abrt-server: Generating core_backtrace
Mar  4 10:15:29 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  4 10:16:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:16:01 s06 systemd: Starting User Slice of root.
Mar  4 10:16:01 s06 systemd: Started Session 201010 of user root.
Mar  4 10:16:01 s06 systemd: Starting Session 201010 of user root.
Mar  4 10:16:01 s06 systemd: Removed slice User Slice of root.
Mar  4 10:16:01 s06 systemd: Stopping User Slice of root.
Mar  4 10:16:24 s06 abrt-server: Duplicate: UUID
Mar  4 10:16:24 s06 abrt-server: DUP_OF_DIR: /var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041
Mar  4 10:16:24 s06 abrt-server: Deleting problem directory ccpp-2019-03-04-10:15:29-161224 (dup of ccpp-2018-09-25-12:27:42-13041)
Mar  4 10:16:24 s06 abrt-server: Generating core_backtrace
Mar  4 10:16:24 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  4 10:16:24 s06 abrt-server: Cannot notify '/var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041' via uReport: Event 'report_uReport' exited with 1
Mar  4 10:16:24 s06 abrt-hook-ccpp: Process 161391 (python2.7) of user 0 killed by SIGABRT - dumping core
Mar  4 10:16:25 s06 abrt-server: Generating core_backtrace
Mar  4 10:16:25 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  4 10:17:01 s06 systemd: Created slice User Slice of root.

Also, I noticed the following errors that I think are very critical:

Mar  4 10:21:12 s06 glustershd[20355]: [2019-03-04 09:21:12.954798] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-55: server 192.168.0.55:49158 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:22:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:22:01 s06 systemd: Starting User Slice of root.
Mar  4 10:22:01 s06 systemd: Started Session 201017 of user root.
Mar  4 10:22:01 s06 systemd: Starting Session 201017 of user root.
Mar  4 10:22:01 s06 systemd: Removed slice User Slice of root.
Mar  4 10:22:01 s06 systemd: Stopping User Slice of root.
Mar  4 10:22:03 s06 glustershd[20355]: [2019-03-04 09:22:03.964120] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-60: server 192.168.0.54:49165 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:23:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:23:01 s06 systemd: Starting User Slice of root.
Mar  4 10:23:01 s06 systemd: Started Session 201018 of user root.
Mar  4 10:23:01 s06 systemd: Starting Session 201018 of user root.
Mar  4 10:23:02 s06 systemd: Removed slice User Slice of root.
Mar  4 10:23:02 s06 systemd: Stopping User Slice of root.
Mar  4 10:24:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:24:01 s06 systemd: Starting User Slice of root.
Mar  4 10:24:01 s06 systemd: Started Session 201019 of user root.
Mar  4 10:24:01 s06 systemd: Starting Session 201019 of user root.
Mar  4 10:24:01 s06 systemd: Removed slice User Slice of root.
Mar  4 10:24:01 s06 systemd: Stopping User Slice of root.
Mar  4 10:24:03 s06 glustershd[20355]: [2019-03-04 09:24:03.982502] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-16: server 192.168.0.52:49158 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746109] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-3: server 192.168.0.51:49153 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746215] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-10: server 192.168.0.52:49156 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746260] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-21: server 192.168.0.51:49159 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746296] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-25: server 192.168.0.52:49161 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746413] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-60: server 192.168.0.54:49165 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:07 s06 glustershd[20355]: [2019-03-04 09:24:07.982952] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-45: server 192.168.0.54:49155 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:18 s06 glustershd[20355]: [2019-03-04 09:24:18.990929] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-25: server 192.168.0.52:49161 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:31 s06 glustershd[20355]: [2019-03-04 09:24:31.995781] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-20: server 192.168.0.53:49159 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:25:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:25:01 s06 systemd: Starting User Slice of root.
Mar  4 10:25:01 s06 systemd: Started Session 201020 of user root.
Mar  4 10:25:01 s06 systemd: Starting Session 201020 of user root.
Mar  4 10:25:01 s06 systemd: Removed slice User Slice of root.
Mar  4 10:25:01 s06 systemd: Stopping User Slice of root.
Mar  4 10:25:57 s06 systemd: Created slice User Slice of root.
Mar  4 10:25:57 s06 systemd: Starting User Slice of root.
Mar  4 10:25:57 s06 systemd-logind: New session 201021 of user root.
Mar  4 10:25:57 s06 systemd: Started Session 201021 of user root.
Mar  4 10:25:57 s06 systemd: Starting Session 201021 of user root.
Mar  4 10:26:01 s06 systemd: Started Session 201022 of user root.
Mar  4 10:26:01 s06 systemd: Starting Session 201022 of user root.
Mar  4 10:26:21 s06 nrpe[162388]: Error: Could not complete SSL handshake with 192.168.1.56: 5
Mar  4 10:27:01 s06 systemd: Started Session 201023 of user root.
Mar  4 10:27:01 s06 systemd: Starting Session 201023 of user root.
Mar  4 10:28:01 s06 systemd: Started Session 201024 of user root.
Mar  4 10:28:01 s06 systemd: Starting Session 201024 of user root.
Mar  4 10:29:01 s06 systemd: Started Session 201025 of user root.
Mar  4 10:29:01 s06 systemd: Starting Session 201025 of user root.

But, unfortunately, I don’t understand why it is happening.
Now the NAGIOS server shows that the s06 status is ok:

***** Nagios *****

Notification Type: RECOVERY

Service: Brick - /gluster/mnt2/brick
Host: s06
Address: s06-stg
State: OK

Date/Time: Mon Mar 4 10:35:23 CET 2019

Additional Info:
OK: Brick /gluster/mnt2/brick is up

Nothing has changed from the RAM, CPU and NETWORK point of view.
The /var/log/messages file has been updated:

Mar  4 10:32:01 s06 systemd: Starting Session 201029 of user root.
Mar  4 10:32:30 s06 glustershd[20355]: [2019-03-04 09:32:30.069082] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-10: server 192.168.0.52:49156 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:32:55 s06 glustershd[20355]: [2019-03-04 09:32:55.074689] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-66: server 192.168.0.54:49167 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:33:01 s06 systemd: Started Session 201030 of user root.
Mar  4 10:33:01 s06 systemd: Starting Session 201030 of user root.
Mar  4 10:34:01 s06 systemd: Started Session 201031 of user root.
Mar  4 10:34:01 s06 systemd: Starting Session 201031 of user root.
Mar  4 10:35:01 s06 nrpe[162562]: Could not read request from client 192.168.1.56, bailing out...
Mar  4 10:35:01 s06 nrpe[162562]: INFO: SSL Socket Shutdown.
Mar  4 10:35:01 s06 systemd: Started Session 201032 of user root.
Mar  4 10:35:01 s06 systemd: Starting Session 201032 of user root.

Could you please help me to understand what is happening?
Thank you in advance.

Regards,
Mauro


On 1 Mar 2019, at 12:17, Mauro Tridici <[hidden email]> wrote:


Thank you, Milind.
I executed the instructions you suggested:

- grep “blocked for” /var/log/messages on s06 returns no output (the word “blocked” is not detected in the messages file);
- in the /var/log/messages file I can see this kind of error repeated many times:

Mar  1 08:43:01 s06 systemd: Starting Session 196071 of user root.
Mar  1 08:43:01 s06 systemd: Removed slice User Slice of root.
Mar  1 08:43:01 s06 systemd: Stopping User Slice of root.
Mar  1 08:43:02 s06 kernel: traps: check_vol_utili[57091] general protection ip:7f88e76ee66d sp:7ffe5a5bcc30 error:0 in libglusterfs.so.0.0.1[7f88e769b000+f7000]
Mar  1 08:43:02 s06 abrt-hook-ccpp: Process 57091 (python2.7) of user 0 killed by SIGSEGV - dumping core
Mar  1 08:43:02 s06 abrt-server: Generating core_backtrace
Mar  1 08:43:02 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  1 08:43:58 s06 abrt-server: Duplicate: UUID
Mar  1 08:43:58 s06 abrt-server: DUP_OF_DIR: /var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041
Mar  1 08:43:58 s06 abrt-server: Deleting problem directory ccpp-2019-03-01-08:43:02-57091 (dup of ccpp-2018-09-25-12:27:42-13041)
Mar  1 08:43:58 s06 dbus[1872]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Mar  1 08:43:58 s06 dbus[1872]: [system] Successfully activated service 'org.freedesktop.problems'
Mar  1 08:43:58 s06 abrt-server: Generating core_backtrace
Mar  1 08:43:58 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  1 08:43:58 s06 abrt-server: Cannot notify '/var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041' via uReport: Event 'report_uReport' exited with 1
Mar  1 08:44:01 s06 systemd: Created slice User Slice of root.
Mar  1 08:44:01 s06 systemd: Starting User Slice of root.
Mar  1 08:44:01 s06 systemd: Started Session 196072 of user root.
Mar  1 08:44:01 s06 systemd: Starting Session 196072 of user root.
Mar  1 08:44:01 s06 systemd: Removed slice User Slice of root.

- in the /var/log/messages file I can also see 4 errors related to other cluster servers:

Mar  1 11:05:01 s06 systemd: Starting User Slice of root.
Mar  1 11:05:01 s06 systemd: Started Session 196230 of user root.
Mar  1 11:05:01 s06 systemd: Starting Session 196230 of user root.
Mar  1 11:05:01 s06 systemd: Removed slice User Slice of root.
Mar  1 11:05:01 s06 systemd: Stopping User Slice of root.
Mar  1 11:05:59 s06 glustershd[70117]: [2019-03-01 10:05:59.347094] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-33: server 192.168.0.51:49163 has not responded in the last 42 seconds, disconnecting.
Mar  1 11:06:01 s06 systemd: Created slice User Slice of root.
Mar  1 11:06:01 s06 systemd: Starting User Slice of root.
Mar  1 11:06:01 s06 systemd: Started Session 196231 of user root.
Mar  1 11:06:01 s06 systemd: Starting Session 196231 of user root.
Mar  1 11:06:01 s06 systemd: Removed slice User Slice of root.
Mar  1 11:06:01 s06 systemd: Stopping User Slice of root.
Mar  1 11:06:12 s06 glustershd[70117]: [2019-03-01 10:06:12.351319] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-1: server 192.168.0.52:49153 has not responded in the last 42 seconds, disconnecting.
Mar  1 11:06:38 s06 glustershd[70117]: [2019-03-01 10:06:38.356920] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-7: server 192.168.0.52:49155 has not responded in the last 42 seconds, disconnecting.
Mar  1 11:07:01 s06 systemd: Created slice User Slice of root.
Mar  1 11:07:01 s06 systemd: Starting User Slice of root.
Mar  1 11:07:01 s06 systemd: Started Session 196232 of user root.
Mar  1 11:07:01 s06 systemd: Starting Session 196232 of user root.
Mar  1 11:07:01 s06 systemd: Removed slice User Slice of root.
Mar  1 11:07:01 s06 systemd: Stopping User Slice of root.
Mar  1 11:07:36 s06 glustershd[70117]: [2019-03-01 10:07:36.366259] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-0: server 192.168.0.51:49152 has not responded in the last 42 seconds, disconnecting.
Mar  1 11:08:01 s06 systemd: Created slice User Slice of root.

The word “blocked” does not appear in the /var/log/messages files on the other cluster servers.
In attachment, the /var/log/messages file from the s06 server.

Thank you in advance,
Mauro

<messages.zip>


On 1 Mar 2019, at 11:47, Milind Changire <[hidden email]> wrote:

The traces of very high disk activity on the servers are often found in /var/log/messages.
You might want to grep for "blocked for" in /var/log/messages on s06 and correlate the timestamps to confirm the unresponsiveness reported in the gluster client logs.
In cases of high disk activity, although the operating system continues to respond to ICMP pings, the processes writing to disk often get blocked by a large flush to disk, which can span beyond 42 seconds and hence result in ping-timer-expiry logs.

As a side note:
If you indeed find gluster processes being blocked in /var/log/messages, you might want to tweak the sysctl tunables vm.dirty_background_ratio or vm.dirty_background_bytes to a smaller value than the existing one. Please read up more on those tunables before touching the settings.
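For illustration only (the value below is just an example; choose one appropriate for your RAM size and workload, and note that setting the _ratio variant clears the _bytes variant and vice versa):

# check the current values
sysctl vm.dirty_background_ratio vm.dirty_background_bytes
# start background writeback earlier, e.g. at 5% of RAM
sysctl -w vm.dirty_background_ratio=5
# persist across reboots
echo 'vm.dirty_background_ratio = 5' >> /etc/sysctl.conf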


On Fri, Mar 1, 2019 at 4:06 PM Mauro Tridici <[hidden email]> wrote:

Hi all,

in attachment the client log captured after changing network.ping-timeout option.
I noticed this error involving server 192.168.0.56 (s06):

[2019-03-01 09:23:36.077287] I [rpc-clnt.c:1962:rpc_clnt_reconfig] 0-tier2-client-71: changing ping timeout to 42 (from 0)
[2019-03-01 09:23:36.078213] I [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
[2019-03-01 09:23:36.078432] I [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
[2019-03-01 09:23:36.092357] I [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
[2019-03-01 09:23:36.094146] I [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
[2019-03-01 10:06:24.708082] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-50: server 192.168.0.56:49156 has not responded in the last 42 seconds, disconnecting.

I don’t know why this happens; the s06 server seems to be reachable.

[athena_login2][/users/home/sysm02/]> telnet 192.168.0.56 49156
Trying 192.168.0.56...
Connected to 192.168.0.56.
Escape character is '^]'.
^CConnection closed by foreign host.
[athena_login2][/users/home/sysm02/]> ping 192.168.0.56
PING 192.168.0.56 (192.168.0.56) 56(84) bytes of data.
64 bytes from 192.168.0.56: icmp_seq=1 ttl=64 time=0.116 ms
64 bytes from 192.168.0.56: icmp_seq=2 ttl=64 time=0.101 ms

--- 192.168.0.56 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1528ms
rtt min/avg/max/mdev = 0.101/0.108/0.116/0.012 ms

[athena_login2][/users/home/sysm02/]> telnet 192.168.0.56 49156
Trying 192.168.0.56...
Connected to 192.168.0.56.
Escape character is '^]'.

Thank you for your help,
Mauro



On 1 Mar 2019, at 10:29, Mauro Tridici <[hidden email]> wrote:

Hi all,

thank you for the explanation.
I just changed network.ping-timeout option to default value (network.ping-timeout=42).

I will check the logs to see if the errors appear again.

Regards,
Mauro

On 1 Mar 2019, at 04:43, Milind Changire <[hidden email]> wrote:

network.ping-timeout should not be set to zero for non-glusterd clients.
glusterd is a special case for which ping-timeout is set to zero via /etc/glusterfs/glusterd.vol

Setting network.ping-timeout to zero disables arming of the ping timer for connections. This disables testing the connection for responsiveness and hence avoids proactive fail-over.

Please reset network.ping-timeout to a non-zero positive value, e.g. 42.
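A sketch of the reset, assuming the volume is named "tier2" as in your logs:

gluster volume set tier2 network.ping-timeout 42
# or drop the explicit setting entirely and fall back to the default
gluster volume reset tier2 network.ping-timeout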


On Thu, Feb 28, 2019 at 5:07 PM Nithya Balachandran <[hidden email]> wrote:
Adding Raghavendra and Milind to comment on this.

What is the effect of setting network.ping-timeout to 0, and should it be set back to 42?
Regards,
Nithya

On Thu, 28 Feb 2019 at 16:01, Mauro Tridici <[hidden email]> wrote:
Hi Nithya,

sorry for the late reply.
network.ping-timeout has been set to 0 in order to try to solve some timeout problems, but it didn’t help.
I can set it to the default value. 

Can I proceed with the change?

Thank you,
Mauro


On 28 Feb 2019, at 04:41, Nithya Balachandran <[hidden email]> wrote:

Hi Mauro,

Is network.ping-timeout still set to 0? The default value is 42. Is there a particular reason why this was changed?

Regards,
Nithya


On Wed, 27 Feb 2019 at 21:32, Mauro Tridici <[hidden email]> wrote:

Hi Xavi,

thank you for the detailed explanation and suggestions.
Yes, transport.listen-backlog option is still set to 1024.

I will check the network and connectivity status using “ping” and “telnet” as soon as the errors come back.

Regards,
Mauro

On 27 Feb 2019, at 16:42, Xavi Hernandez <[hidden email]> wrote:

Hi Mauro,

those errors say that the mount point is not connected to some of the bricks while executing operations. I see references to the 3rd and 6th bricks of several disperse sets, which seem to map to server s06. For some reason, gluster is having trouble connecting from the client machine to that particular server. At the end of the log I see that, after a long time, a reconnect is done to both of them. However, shortly after, other bricks from s05 get disconnected and a reconnect times out.

That's really odd. It seems as if the connection to s06 is cut for some time, then restored, and then the same happens to the next server.

If the servers are really online and it's only a communication issue, that explains why server memory and network usage have increased: if the problem only exists between the client and the servers, any write made by the client will automatically mark the file as damaged, since some of the servers have not been updated. Since self-heal runs from the server nodes, they will probably be correctly connected to all bricks, which allows them to heal the just-damaged file, and this increases memory and network usage.

I guess you still have transport.listen-backlog set to 1024, right?

Just to try to identify whether the problem really comes from the network, can you check if you lose some pings from the client to all of the servers while you are seeing those errors in the log file?

You can also check whether, during those errors, you can telnet to the port of the brick from the client.

Xavi



On Tue, Feb 26, 2019 at 10:17 AM Mauro Tridici <[hidden email]> wrote:
Hi Nithya,

the “df -h” operation is no longer slow, but no users are using the volume; RAM and NETWORK usage is ok on the client node.

I was worried about this kind of warnings/errors:

[2019-02-25 10:59:00.664323] W [MSGID: 122035] [ec-common.c:571:ec_child_select] 0-tier2-disperse-6: Executing operation with some subvolumes unavailable (20)

[2019-02-26 03:11:35.212603] E [rpc-clnt.c:350:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] ))))) 0-tier2-client-50: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2019-02-26 03:10:56.549903 (xid=0x106f1c5)

[2019-02-26 03:13:03.313831] E [socket.c:2376:socket_connect_finish] 0-tier2-client-50: connection to 192.168.0.56:49156 failed (Timeout della connessione); disconnecting socket

It seems that some subvolumes are not available and that the 192.168.0.56 server (s06) is not reachable.
But the gluster servers are up and running and the bricks are ok.

In attachment the updated tier2.log file.


Thank you.
Regards,
Mauro

On 26 Feb 2019, at 04:03, Nithya Balachandran <[hidden email]> wrote:

Hi,

I see a lot of EC messages in the log but they don't seem very serious. Xavi, can you take a look?

The only errors I see are:
[2019-02-25 10:58:45.519871] E [rpc-clnt.c:350:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] ))))) 0-tier2-client-50: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2019-02-25 10:57:47.429969 (xid=0xd26fe7)
[2019-02-25 10:58:51.461493] E [rpc-clnt.c:350:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] ))))) 0-tier2-client-41: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2019-02-25 10:57:47.499174 (xid=0xf47d6a)
[2019-02-25 11:07:57.152874] E [socket.c:2376:socket_connect_finish] 0-tier2-client-70: connection to 192.168.0.55:49163 failed (Timeout della connessione); disconnecting socket


Is the df -h operation still slow? If yes, can you take a tcpdump of the client while running df -h and send that across?

Regards,
Nithya

On Mon, 25 Feb 2019 at 17:27, Mauro Tridici <[hidden email]> wrote:

Sorry, a few minutes after my last mail message I noticed that the “df -h” command hung for a while before returning the prompt.
Yesterday everything was ok in the gluster client log, but today I see a lot of errors (please take a look at the attached file).

On the client node, I detected significant RAM and NETWORK usage.



Do you think that the errors have been caused by the client’s resource usage?

Thank you in advance,
Mauro




-------------------------
Mauro Tridici

Fondazione CMCC
CMCC Supercomputing Center
presso Complesso Ecotekne - Università del Salento -
Strada Prov.le Lecce - Monteroni sn
73100 Lecce  IT
http://www.cmcc.it

mobile: (+39) 327 5630841
email: [hidden email]
https://it.linkedin.com/in/mauro-tridici-5977238b


_______________________________________________
Gluster-devel mailing list
[hidden email]
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-users] "rpc_clnt_ping_timer_expired" errors

Mauro Tridici
Dear All,

do you have any suggestions about the right way to "debug" this issue?
In attachment, the updated logs of the “s06” gluster server.

I noticed a lot of intermittent warning and error messages.

Thank you in advance,
Mauro





<s06-20190311.logs.tar.gz>

Re: [Gluster-users] "rpc_clnt_ping_timer_expired" errors

Raghavendra Gowdappa
Was the suggestion to increase the server.event-threads value tried? If yes, what were the results?
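If it helps, the value currently in effect can be verified with something like the following (assuming the volume is named "tier2" as in the logs; `gluster volume get` needs a reasonably recent release):

gluster volume get tier2 server.event-threads
# on older releases, a reconfigured value shows up in volume info
gluster volume info tier2 | grep event-threads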

On Mon, Mar 11, 2019 at 2:40 PM Mauro Tridici <[hidden email]> wrote:
Dear All,

do you have any suggestions about the right way to "debug" this issue?
In attachment, the updated logs of the “s06” gluster server.

I noticed a lot of intermittent warning and error messages.

Thank you in advance,
Mauro



On 4 Mar 2019, at 18:45, Raghavendra Gowdappa <[hidden email]> wrote:


I would like to point out another issue. Even if what I suggested prevents disconnects, part of the solution would be only symptomatic treatment and doesn't address the root cause of the problem. In most of the ping-timer-expiry issues, the root cause is the increased load on bricks and the inability of bricks to be responsive under high load. So, the actual solution would be doing any or both of the following:
* identify the source of increased load and if possible throttle it. Internal heal processes like self-heal, rebalance, quota heal are known to pump traffic into bricks without much throttling (io-threads _might_ do some throttling, but my understanding is its not sufficient).
* identify the reason for bricks to become unresponsive during load. This may be fixable issues like not enough event-threads to read from network or difficult to fix issues like fsync on backend fs freezing the process or semi fixable issues (in code) like lock contention.

So any genuine effort to fix ping-timer-issues (to be honest most of the times they are not issues related to rpc/network) would involve performance characterization of various subsystems on bricks and clients. Various subsystems can include (but not necessarily limited to), underlying OS/filesystem, glusterfs processes, CPU consumption etc

regards,
Raghavendra

On Mon, Mar 4, 2019 at 9:31 PM Mauro Tridici <[hidden email]> wrote:
Thank you, let’s try!
I will inform you about the effects of the change.

Regards,
Mauro

On 4 Mar 2019, at 16:55, Raghavendra Gowdappa <[hidden email]> wrote:



On Mon, Mar 4, 2019 at 8:54 PM Mauro Tridici <[hidden email]> wrote:
Hi Raghavendra,

thank you for your reply.
Yes, you are right. It is a problem that seems to happen randomly.
At this moment, server.event-threads value is 4. I will try to increase this value to 8. Do you think that it could be a valid value ?

Yes. We can try with that. You should see at least frequency of ping-timer related disconnects  reduce with this value (even if it doesn't eliminate the problem completely).


Regards,
Mauro


On 4 Mar 2019, at 15:36, Raghavendra Gowdappa <[hidden email]> wrote:



On Mon, Mar 4, 2019 at 8:01 PM Nithya Balachandran <[hidden email]> wrote:
Hi Mauro,

It looks like some problem on s06. Are all your other nodes ok? Can you send us the gluster logs from this node?

[hidden email] , do you have any idea as to how this can be debugged? Maybe running top ? Or debug brick logs?

If we can reproduce the problem, collecting tcpdump on both ends of connection will help. But, one common problem is these bugs are inconsistently reproducible and hence we may not be able to capture tcpdump at correct intervals. Other than that, we can try to collect some evidence that poller threads were busy (waiting on locks). But, not sure what debug data provides that information.

From what I know, it's difficult to collect evidence for this issue and we can only reason about it.

We can try a workaround though - try increasing server.event-threads and see whether ping-timer expiry issues go away with an optimal value. If that's the case, it kind of provides proof for our hypothesis.
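
For example, assuming the volume here is named tier2 (as client translator names like 0-tier2-client-55 in the logs suggest), the tuning and its verification would look like:

gluster volume set tier2 server.event-threads 8    # volume name "tier2" assumed from the logs
gluster volume get tier2 server.event-threads      # confirm the new value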



Regards,
Nithya

On Mon, 4 Mar 2019 at 15:25, Mauro Tridici <[hidden email]> wrote:
Hi All,

some minutes ago I received this message from NAGIOS server 

***** Nagios *****

Notification Type: PROBLEM

Service: Brick - /gluster/mnt2/brick
Host: s06
Address: s06-stg
State: CRITICAL

Date/Time: Mon Mar 4 10:25:33 CET 2019

Additional Info:
CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.

I checked the network, RAM and CPU usage on the s06 node and everything seems to be OK.
No bricks are in an error state. In /var/log/messages, I again detected a crash of “check_vol_utili”, which I think is a module used by the NRPE executable (that is, the NAGIOS client).

Mar  4 10:15:29 s06 kernel: traps: check_vol_utili[161224] general protection ip:7facffa0a66d sp:7ffe9f4e6fc0 error:0 in libglusterfs.so.0.0.1[7facff9b7000+f7000]
Mar  4 10:15:29 s06 abrt-hook-ccpp: Process 161224 (python2.7) of user 0 killed by SIGSEGV - dumping core
Mar  4 10:15:29 s06 abrt-server: Generating core_backtrace
Mar  4 10:15:29 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  4 10:16:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:16:01 s06 systemd: Starting User Slice of root.
Mar  4 10:16:01 s06 systemd: Started Session 201010 of user root.
Mar  4 10:16:01 s06 systemd: Starting Session 201010 of user root.
Mar  4 10:16:01 s06 systemd: Removed slice User Slice of root.
Mar  4 10:16:01 s06 systemd: Stopping User Slice of root.
Mar  4 10:16:24 s06 abrt-server: Duplicate: UUID
Mar  4 10:16:24 s06 abrt-server: DUP_OF_DIR: /var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041
Mar  4 10:16:24 s06 abrt-server: Deleting problem directory ccpp-2019-03-04-10:15:29-161224 (dup of ccpp-2018-09-25-12:27:42-13041)
Mar  4 10:16:24 s06 abrt-server: Generating core_backtrace
Mar  4 10:16:24 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  4 10:16:24 s06 abrt-server: Cannot notify '/var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041' via uReport: Event 'report_uReport' exited with 1
Mar  4 10:16:24 s06 abrt-hook-ccpp: Process 161391 (python2.7) of user 0 killed by SIGABRT - dumping core
Mar  4 10:16:25 s06 abrt-server: Generating core_backtrace
Mar  4 10:16:25 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  4 10:17:01 s06 systemd: Created slice User Slice of root.
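
(As a side note, assuming abrtd here is the stock ABRT service shipped with CentOS/RHEL, the duplicated crash can be inspected with "abrt-cli list" and "abrt-cli info /var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041", using the problem directory reported above.)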

Also, I noticed the following errors that I think are very critical:

Mar  4 10:21:12 s06 glustershd[20355]: [2019-03-04 09:21:12.954798] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-55: server 192.168.0.55:49158 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:22:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:22:01 s06 systemd: Starting User Slice of root.
Mar  4 10:22:01 s06 systemd: Started Session 201017 of user root.
Mar  4 10:22:01 s06 systemd: Starting Session 201017 of user root.
Mar  4 10:22:01 s06 systemd: Removed slice User Slice of root.
Mar  4 10:22:01 s06 systemd: Stopping User Slice of root.
Mar  4 10:22:03 s06 glustershd[20355]: [2019-03-04 09:22:03.964120] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-60: server 192.168.0.54:49165 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:23:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:23:01 s06 systemd: Starting User Slice of root.
Mar  4 10:23:01 s06 systemd: Started Session 201018 of user root.
Mar  4 10:23:01 s06 systemd: Starting Session 201018 of user root.
Mar  4 10:23:02 s06 systemd: Removed slice User Slice of root.
Mar  4 10:23:02 s06 systemd: Stopping User Slice of root.
Mar  4 10:24:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:24:01 s06 systemd: Starting User Slice of root.
Mar  4 10:24:01 s06 systemd: Started Session 201019 of user root.
Mar  4 10:24:01 s06 systemd: Starting Session 201019 of user root.
Mar  4 10:24:01 s06 systemd: Removed slice User Slice of root.
Mar  4 10:24:01 s06 systemd: Stopping User Slice of root.
Mar  4 10:24:03 s06 glustershd[20355]: [2019-03-04 09:24:03.982502] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-16: server 192.168.0.52:49158 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746109] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-3: server 192.168.0.51:49153 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746215] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-10: server 192.168.0.52:49156 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746260] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-21: server 192.168.0.51:49159 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746296] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-25: server 192.168.0.52:49161 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746413] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-60: server 192.168.0.54:49165 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:07 s06 glustershd[20355]: [2019-03-04 09:24:07.982952] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-45: server 192.168.0.54:49155 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:18 s06 glustershd[20355]: [2019-03-04 09:24:18.990929] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-25: server 192.168.0.52:49161 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:31 s06 glustershd[20355]: [2019-03-04 09:24:31.995781] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-20: server 192.168.0.53:49159 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:25:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:25:01 s06 systemd: Starting User Slice of root.
Mar  4 10:25:01 s06 systemd: Started Session 201020 of user root.
Mar  4 10:25:01 s06 systemd: Starting Session 201020 of user root.
Mar  4 10:25:01 s06 systemd: Removed slice User Slice of root.
Mar  4 10:25:01 s06 systemd: Stopping User Slice of root.
Mar  4 10:25:57 s06 systemd: Created slice User Slice of root.
Mar  4 10:25:57 s06 systemd: Starting User Slice of root.
Mar  4 10:25:57 s06 systemd-logind: New session 201021 of user root.
Mar  4 10:25:57 s06 systemd: Started Session 201021 of user root.
Mar  4 10:25:57 s06 systemd: Starting Session 201021 of user root.
Mar  4 10:26:01 s06 systemd: Started Session 201022 of user root.
Mar  4 10:26:01 s06 systemd: Starting Session 201022 of user root.
Mar  4 10:26:21 s06 nrpe[162388]: Error: Could not complete SSL handshake with 192.168.1.56: 5
Mar  4 10:27:01 s06 systemd: Started Session 201023 of user root.
Mar  4 10:27:01 s06 systemd: Starting Session 201023 of user root.
Mar  4 10:28:01 s06 systemd: Started Session 201024 of user root.
Mar  4 10:28:01 s06 systemd: Starting Session 201024 of user root.
Mar  4 10:29:01 s06 systemd: Started Session 201025 of user root.
Mar  4 10:29:01 s06 systemd: Starting Session 201025 of user root.

But, unfortunately, I don’t understand why it is happening.
Now, NAGIOS server shows that s06 status is ok:

***** Nagios *****

Notification Type: RECOVERY

Service: Brick - /gluster/mnt2/brick
Host: s06
Address: s06-stg
State: OK

Date/Time: Mon Mar 4 10:35:23 CET 2019

Additional Info:
OK: Brick /gluster/mnt2/brick is up

Nothing has changed from the RAM, CPU, and network point of view.
The /var/log/messages file has been updated:

Mar  4 10:32:01 s06 systemd: Starting Session 201029 of user root.
Mar  4 10:32:30 s06 glustershd[20355]: [2019-03-04 09:32:30.069082] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-10: server 192.168.0.52:49156 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:32:55 s06 glustershd[20355]: [2019-03-04 09:32:55.074689] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-66: server 192.168.0.54:49167 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:33:01 s06 systemd: Started Session 201030 of user root.
Mar  4 10:33:01 s06 systemd: Starting Session 201030 of user root.
Mar  4 10:34:01 s06 systemd: Started Session 201031 of user root.
Mar  4 10:34:01 s06 systemd: Starting Session 201031 of user root.
Mar  4 10:35:01 s06 nrpe[162562]: Could not read request from client 192.168.1.56, bailing out...
Mar  4 10:35:01 s06 nrpe[162562]: INFO: SSL Socket Shutdown.
Mar  4 10:35:01 s06 systemd: Started Session 201032 of user root.
Mar  4 10:35:01 s06 systemd: Starting Session 201032 of user root.

Could you please help me understand what is happening?
Thank you in advance.

Regards,
Mauro


On 1 Mar 2019, at 12:17, Mauro Tridici <[hidden email]> wrote:


Thank you, Milind.
I executed the instructions you suggested:

- grep “blocked for” /var/log/messages on s06 returns no output (the word “blocked” is not detected in the messages file);
- in the /var/log/messages file I can see this kind of error repeated many times:

Mar  1 08:43:01 s06 systemd: Starting Session 196071 of user root.
Mar  1 08:43:01 s06 systemd: Removed slice User Slice of root.
Mar  1 08:43:01 s06 systemd: Stopping User Slice of root.
Mar  1 08:43:02 s06 kernel: traps: check_vol_utili[57091] general protection ip:7f88e76ee66d sp:7ffe5a5bcc30 error:0 in libglusterfs.so.0.0.1[7f88e769b000+f7000]
Mar  1 08:43:02 s06 abrt-hook-ccpp: Process 57091 (python2.7) of user 0 killed by SIGSEGV - dumping core
Mar  1 08:43:02 s06 abrt-server: Generating core_backtrace
Mar  1 08:43:02 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  1 08:43:58 s06 abrt-server: Duplicate: UUID
Mar  1 08:43:58 s06 abrt-server: DUP_OF_DIR: /var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041
Mar  1 08:43:58 s06 abrt-server: Deleting problem directory ccpp-2019-03-01-08:43:02-57091 (dup of ccpp-2018-09-25-12:27:42-13041)
Mar  1 08:43:58 s06 dbus[1872]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Mar  1 08:43:58 s06 dbus[1872]: [system] Successfully activated service 'org.freedesktop.problems'
Mar  1 08:43:58 s06 abrt-server: Generating core_backtrace
Mar  1 08:43:58 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  1 08:43:58 s06 abrt-server: Cannot notify '/var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041' via uReport: Event 'report_uReport' exited with 1
Mar  1 08:44:01 s06 systemd: Created slice User Slice of root.
Mar  1 08:44:01 s06 systemd: Starting User Slice of root.
Mar  1 08:44:01 s06 systemd: Started Session 196072 of user root.
Mar  1 08:44:01 s06 systemd: Starting Session 196072 of user root.
Mar  1 08:44:01 s06 systemd: Removed slice User Slice of root.

- in the /var/log/messages file I can also see 4 errors related to other cluster servers:

Mar  1 11:05:01 s06 systemd: Starting User Slice of root.
Mar  1 11:05:01 s06 systemd: Started Session 196230 of user root.
Mar  1 11:05:01 s06 systemd: Starting Session 196230 of user root.
Mar  1 11:05:01 s06 systemd: Removed slice User Slice of root.
Mar  1 11:05:01 s06 systemd: Stopping User Slice of root.
Mar  1 11:05:59 s06 glustershd[70117]: [2019-03-01 10:05:59.347094] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-33: server 192.168.0.51:49163 has not responded in the last 42 seconds, disconnecting.
Mar  1 11:06:01 s06 systemd: Created slice User Slice of root.
Mar  1 11:06:01 s06 systemd: Starting User Slice of root.
Mar  1 11:06:01 s06 systemd: Started Session 196231 of user root.
Mar  1 11:06:01 s06 systemd: Starting Session 196231 of user root.
Mar  1 11:06:01 s06 systemd: Removed slice User Slice of root.
Mar  1 11:06:01 s06 systemd: Stopping User Slice of root.
Mar  1 11:06:12 s06 glustershd[70117]: [2019-03-01 10:06:12.351319] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-1: server 192.168.0.52:49153 has not responded in the last 42 seconds, disconnecting.
Mar  1 11:06:38 s06 glustershd[70117]: [2019-03-01 10:06:38.356920] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-7: server 192.168.0.52:49155 has not responded in the last 42 seconds, disconnecting.
Mar  1 11:07:01 s06 systemd: Created slice User Slice of root.
Mar  1 11:07:01 s06 systemd: Starting User Slice of root.
Mar  1 11:07:01 s06 systemd: Started Session 196232 of user root.
Mar  1 11:07:01 s06 systemd: Starting Session 196232 of user root.
Mar  1 11:07:01 s06 systemd: Removed slice User Slice of root.
Mar  1 11:07:01 s06 systemd: Stopping User Slice of root.
Mar  1 11:07:36 s06 glustershd[70117]: [2019-03-01 10:07:36.366259] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-0: server 192.168.0.51:49152 has not responded in the last 42 seconds, disconnecting.
Mar  1 11:08:01 s06 systemd: Created slice User Slice of root.

The word “blocked” does not appear in the /var/log/messages files on the other cluster servers.
In attachment, the /var/log/messages file from s06 server.

Thank you in advance,
Mauro

<messages.zip>


On 1 Mar 2019, at 11:47, Milind Changire <[hidden email]> wrote:

The traces of very high disk activity on the servers are often found in /var/log/messages.
You might want to grep for "blocked for" in /var/log/messages on s06 and correlate the timestamps to confirm the unresponsiveness as reported in the gluster client logs.
In cases of high disk activity, although the operating system continues to respond to ICMP pings, the processes writing to disks often get blocked on a large flush to the disk, which could span beyond 42 seconds and hence result in ping-timer-expiry logs.
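
For reference, the kernel's hung-task watchdog entries look like "INFO: task glusterfsd:1234 blocked for more than 120 seconds." (task name and pid here are only illustrative), so a slightly broader pattern would also catch them:

grep "blocked for more than" /var/log/messages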

As a side note:
If you indeed find gluster processes being blocked in /var/log/messages, you might want to tweak the sysctl tunables vm.dirty_background_ratio or vm.dirty_background_bytes to a smaller value than the existing one. Please read up more on those tunables before touching the settings.
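
A minimal sketch (the value below is only illustrative; check the current values first):

sysctl vm.dirty_background_ratio vm.dirty_ratio    # show current values
sysctl -w vm.dirty_background_ratio=5              # example value, not a recommendation

To make such a change persistent across reboots, the same setting would go into /etc/sysctl.conf or a file under /etc/sysctl.d/.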


On Fri, Mar 1, 2019 at 4:06 PM Mauro Tridici <[hidden email]> wrote:

Hi all,

In attachment, the client log captured after changing the network.ping-timeout option.
I noticed this error involving server 192.168.0.56 (s06):

[2019-03-01 09:23:36.077287] I [rpc-clnt.c:1962:rpc_clnt_reconfig] 0-tier2-client-71: changing ping timeout to 42 (from 0)
[2019-03-01 09:23:36.078213] I [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
[2019-03-01 09:23:36.078432] I [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
[2019-03-01 09:23:36.092357] I [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
[2019-03-01 09:23:36.094146] I [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
[2019-03-01 10:06:24.708082] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-50: server 192.168.0.56:49156 has not responded in the last 42 seconds, disconnecting.

I don’t know why it happens; the s06 server seems to be reachable:

[athena_login2][/users/home/sysm02/]> telnet 192.168.0.56 49156
Trying 192.168.0.56...
Connected to 192.168.0.56.
Escape character is '^]'.
^CConnection closed by foreign host.
[athena_login2][/users/home/sysm02/]> ping 192.168.0.56
PING 192.168.0.56 (192.168.0.56) 56(84) bytes of data.
64 bytes from 192.168.0.56: icmp_seq=1 ttl=64 time=0.116 ms
64 bytes from 192.168.0.56: icmp_seq=2 ttl=64 time=0.101 ms

--- 192.168.0.56 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1528ms
rtt min/avg/max/mdev = 0.101/0.108/0.116/0.012 ms

[athena_login2][/users/home/sysm02/]> telnet 192.168.0.56 49156
Trying 192.168.0.56...
Connected to 192.168.0.56.
Escape character is '^]'.

Thank you for your help,
Mauro



On 1 Mar 2019, at 10:29, Mauro Tridici <[hidden email]> wrote:

Hi all,

thank you for the explanation.
I just changed the network.ping-timeout option to the default value (network.ping-timeout=42).

I will check the logs to see if the errors appear again.

Regards,
Mauro

On 1 Mar 2019, at 04:43, Milind Changire <[hidden email]> wrote:

network.ping-timeout should not be set to zero for non-glusterd clients.
glusterd is a special case for which ping-timeout is set to zero via /etc/glusterfs/glusterd.vol

Setting network.ping-timeout to zero disables arming of the ping timer for connections. This disables testing the connection for responsiveness and hence avoids proactive fail-over.

Please reset network.ping-timeout to a non-zero positive value, e.g. 42.
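
For example, assuming the volume is named tier2 as the logs suggest, either of these should restore the default:

gluster volume reset tier2 network.ping-timeout    # revert the option to its default
gluster volume set tier2 network.ping-timeout 42   # or set the default explicitly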


On Thu, Feb 28, 2019 at 5:07 PM Nithya Balachandran <[hidden email]> wrote:
Adding Raghavendra and Milind to comment on this.

What is the effect of setting network.ping-timeout to 0 and should it be set back to 42?
Regards,
Nithya

On Thu, 28 Feb 2019 at 16:01, Mauro Tridici <[hidden email]> wrote:
Hi Nithya,

sorry for the late reply.
network.ping-timeout has been set to 0 in order to try to solve some timeout problems, but it didn’t help.
I can set it to the default value. 

Can I proceed with the change?

Thank you,
Mauro


On 28 Feb 2019, at 04:41, Nithya Balachandran <[hidden email]> wrote:

Hi Mauro,

Is network.ping-timeout still set to 0? The default value is 42. Is there a particular reason why this was changed?

Regards,
Nithya


On Wed, 27 Feb 2019 at 21:32, Mauro Tridici <[hidden email]> wrote:

Hi Xavi,

thank you for the detailed explanation and suggestions.
Yes, transport.listen-backlog option is still set to 1024.

I will check the network and connectivity status using “ping” and “telnet” as soon as the errors come back.

Regards,
Mauro

On 27 Feb 2019, at 16:42, Xavi Hernandez <[hidden email]> wrote:

Hi Mauro,

those errors say that the mount point is not connected to some of the bricks while executing operations. I see references to the 3rd and 6th bricks of several disperse sets, which seem to map to server s06. For some reason, gluster is having trouble connecting from the client machine to that particular server. At the end of the log I see that, after a long time, a reconnect is done to both of them. However, a little after, other bricks from s05 get disconnected and a reconnect times out.

That's really odd. It seems as if server/communication to s06 is cut for some time, then restored, and then the same happens to the next server.

If the servers are really online and it's only a communication issue, it explains why server memory and network usage have increased: if the problem only exists between the client and the servers, any write made by the client will automatically mark the file as damaged, since some of the servers have not been updated. Since self-heal runs from the server nodes, they will probably be correctly connected to all bricks, which allows them to heal the just-damaged file; that healing increases memory and network usage.

I guess you still have transport.listen-backlog set to 1024, right?

Just to try to identify whether the problem really comes from the network, can you check if you lose some pings from the client to all of the servers while you are seeing those errors in the log file?

You can also check whether, during those errors, you can telnet to the brick's port from the client.

Xavi



On Tue, Feb 26, 2019 at 10:17 AM Mauro Tridici <[hidden email]> wrote:
Hi Nithya,

The “df -h” operation is not slow anymore, but no users are using the volume; RAM and network usage is OK on the client node.

I was worried about these kinds of warnings/errors:

[2019-02-25 10:59:00.664323] W [MSGID: 122035] [ec-common.c:571:ec_child_select] 0-tier2-disperse-6: Executing operation with some subvolumes unavailable (20)

[2019-02-26 03:11:35.212603] E [rpc-clnt.c:350:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] ))))) 0-tier2-client-50: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2019-02-26 03:10:56.549903 (xid=0x106f1c5)

[2019-02-26 03:13:03.313831] E [socket.c:2376:socket_connect_finish] 0-tier2-client-50: connection to 192.168.0.56:49156 failed (Timeout della connessione); disconnecting socket

It seems that some subvolumes are not available and the 192.168.0.56 server (s06) is not reachable.
But the gluster servers are up and running and the bricks are OK.

In attachment the updated tier2.log file.


Thank you.
Regards,
Mauro

On 26 Feb 2019, at 04:03, Nithya Balachandran <[hidden email]> wrote:

Hi,

I see a lot of EC messages in the log but they don't seem very serious. Xavi, can you take a look?

The only errors I see are:
[2019-02-25 10:58:45.519871] E [rpc-clnt.c:350:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] ))))) 0-tier2-client-50: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2019-02-25 10:57:47.429969 (xid=0xd26fe7)
[2019-02-25 10:58:51.461493] E [rpc-clnt.c:350:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] ))))) 0-tier2-client-41: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2019-02-25 10:57:47.499174 (xid=0xf47d6a)
[2019-02-25 11:07:57.152874] E [socket.c:2376:socket_connect_finish] 0-tier2-client-70: connection to 192.168.0.55:49163 failed (Timeout della connessione); disconnecting socket


Is the df -h operation still slow? If yes, can you take a tcpdump of the client while running df -h and send that across?
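
Something along these lines on the client should do (the interface name, output path and host filter are placeholders):

tcpdump -i eth0 -s 0 -w /tmp/df-h.pcap host 192.168.0.56

i.e. start the capture in one terminal, run df -h in another, then stop the capture with Ctrl-C.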

Regards,
Nithya

On Mon, 25 Feb 2019 at 17:27, Mauro Tridici <[hidden email]> wrote:

Sorry, some minutes after my last mail message, I noticed that the “df -h” command hung for a while before returning the prompt.
Yesterday everything was OK in the gluster client log, but today I see a lot of errors (please take a look at the attached file).

On the client node, I detected significant RAM and network usage.



Do you think that the errors have been caused by the client's resource usage?

Thank you in advance,
Mauro







_______________________________________________
Gluster-devel mailing list
[hidden email]
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-users] "rpc_clnt_ping_timer_expired" errors

Mauro Tridici
Hi Raghavendra,

Yes, server.event-threads has been changed from 4 to 8.
Over the last few days, I noticed that the error events are still present, although they have been considerably reduced.

So, I ran grep against the log files in order to give you an overall picture of the warning, error and critical events that appeared today at 06:xx (I hope it may be useful).
I collected the info from the s06 gluster server, but the behaviour is almost the same on the other gluster servers.

ERRORS:  
CWD: /var/log/glusterfs 
COMMAND: grep " E " *.log |grep "2019-03-13 06:"

(I can see a lot of this kind of message in the same period, but I'm reporting only one record for each type of error)

glusterd.log:[2019-03-13 06:12:35.982863] E [MSGID: 101042] [compat.c:569:gf_umount_lazy] 0-management: Lazy unmount of /var/run/gluster/tier2_quota_list/

glustershd.log:[2019-03-13 06:14:28.666562] E [rpc-clnt.c:350:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f4a71ddcebb] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f4a71ba1d9e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f4a71ba1ebe] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7f4a71ba3640] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7f4a71ba4130] ))))) 0-tier2-client-55: forced unwinding frame type(GlusterFS 3.3) op(INODELK(29)) called at 2019-03-13 06:14:14.858441 (xid=0x17fddb50)

glustershd.log:[2019-03-13 06:17:48.883825] E [socket.c:2376:socket_connect_finish] 0-tier2-client-55: connection to 192.168.0.55:49158 failed (Connection timed out); disconnecting socket
glustershd.log:[2019-03-13 06:19:58.931798] E [socket.c:2376:socket_connect_finish] 0-tier2-client-55: connection to 192.168.0.55:49158 failed (Connection timed out); disconnecting socket
glustershd.log:[2019-03-13 06:22:08.979829] E [socket.c:2376:socket_connect_finish] 0-tier2-client-55: connection to 192.168.0.55:49158 failed (Connection timed out); disconnecting socket
glustershd.log:[2019-03-13 06:22:36.226847] E [MSGID: 114031] [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-tier2-client-55: remote operation failed [Transport endpoint is not connected]
glustershd.log:[2019-03-13 06:22:36.306669] E [MSGID: 114031] [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-tier2-client-55: remote operation failed [Transport endpoint is not connected]
glustershd.log:[2019-03-13 06:22:36.385257] E [MSGID: 114031] [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-tier2-client-55: remote operation failed [Transport endpoint is not connected]

WARNINGS:
CWD: /var/log/glusterfs 
COMMAND: grep " W " *.log |grep "2019-03-13 06:"

(I can see a lot of this kind of message in the same period, but I'm reporting only one record for each type of warning)

glustershd.log:[2019-03-13 06:14:28.666772] W [MSGID: 114031] [client-rpc-fops.c:1080:client3_3_getxattr_cbk] 0-tier2-client-55: remote operation failed. Path: <gfid:b6b35d0f-f34d-4c25-bbe8-74bde0248d7e> (b6b35d0f-f34d-4c25-bbe8-74bde0248d7e). Key: (null) [Transport endpoint is not connected]

glustershd.log:[2019-03-13 06:14:31.421576] W [MSGID: 122035] [ec-common.c:571:ec_child_select] 0-tier2-disperse-9: Executing operation with some subvolumes unavailable (2)

glustershd.log:[2019-03-13 06:15:31.547417] W [MSGID: 122032] [ec-heald.c:266:ec_shd_index_sweep] 0-tier2-disperse-9: unable to get index-dir on tier2-client-55 [Operation now in progress]

quota-mount-tier2.log:[2019-03-13 06:12:36.116277] W [MSGID: 101002] [options.c:995:xl_opt_validate] 0-glusterfs: option 'address-family' is deprecated, preferred is 'transport.address-family', continuing with correction
quota-mount-tier2.log:[2019-03-13 06:12:36.198430] W [MSGID: 101174] [graph.c:363:_log_if_unknown_option] 0-tier2-readdir-ahead: option 'parallel-readdir' is not recognized
quota-mount-tier2.log:[2019-03-13 06:12:37.945007] W [glusterfsd.c:1375:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7e25) [0x7f340892be25] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x55ef010164b5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x55ef0101632b] ) 0-: received signum (15), shutting down

CRITICALS:
CWD: /var/log/glusterfs 
COMMAND: grep " C " *.log |grep "2019-03-13 06:"

No critical errors at 06:xx; only one critical error during the day:

[root@s06 glusterfs]# grep " C " *.log |grep "2019-03-13"
glustershd.log:[2019-03-13 02:21:29.126279] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-55: server 192.168.0.55:49158 has not responded in the last 42 seconds, disconnecting.
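
Since the warnings above show disperse subvolumes becoming unavailable, one more check that may be worth running during these windows is whether a heal backlog builds up, e.g. "gluster volume heal tier2 info" (volume name assumed from the logs).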


Thank you very much for your help.
Regards,
Mauro

On 12 Mar 2019, at 05:17, Raghavendra Gowdappa <[hidden email]> wrote:

Was the suggestion to increase the server.event-threads value tried? If yes, what were the results?

On Mon, Mar 11, 2019 at 2:40 PM Mauro Tridici <[hidden email]> wrote:
Dear All,

do you have any suggestions about the right way to "debug" this issue?
In attachment, the updated logs of “s06” gluster server.

I noticed a lot of intermittent warning and error messages.

Thank you in advance,
Mauro




Re: [Gluster-users] "rpc_clnt_ping_timer_expired" errors

Raghavendra Gowdappa


On Wed, Mar 13, 2019 at 3:55 PM Mauro Tridici <[hidden email]> wrote:
Hi Raghavendra,

Yes, server.event-threads has been changed from 4 to 8.

Was the client.event-threads value also changed to 8? If not, I would like to know the results of including this tuning too. Also, if possible, can you get the output of the following command from the problematic clients and bricks (during the period when load tends to be high and ping-timer expiry is seen)?

# top -bHd 3
 
This will help us know the CPU utilization of the event threads.
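
As a sketch (volume name tier2 assumed, as elsewhere in this thread), the client-side counterpart of the earlier tuning would be:

gluster volume set tier2 client.event-threads 8    # volume name assumed

and the top output can be narrowed down to the gluster brick processes with something like top -bHd 3 -p "$(pgrep -d, glusterfsd)", since -H lists the individual threads.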

And I forgot to ask: what version of GlusterFS are you using?
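
(For example, "glusterfs --version" or "gluster --version" on any of the nodes will report it.)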

During last days, I noticed that the error events are still here although they have been considerably reduced.

So, I used grep command against the log files in order to provide you a global vision about the warning, error and critical events appeared today at 06:xx (may be useful I hope).
I collected the info from s06 gluster server, but the behaviour is the the almost the same on the other gluster servers.

ERRORS:  
CWD: /var/log/glusterfs 
COMMAND: grep " E " *.log |grep "2019-03-13 06:"

(I can see a lot of this kind of message in the same period but I'm notifying you only one record for each type of error)

glusterd.log:[2019-03-13 06:12:35.982863] E [MSGID: 101042] [compat.c:569:gf_umount_lazy] 0-management: Lazy unmount of /var/run/gluster/tier2_quota_list/

glustershd.log:[2019-03-13 06:14:28.666562] E [rpc-clnt.c:350:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f4a71ddcebb] (--> /lib64/libgfr
pc.so.0(saved_frames_unwind+0x1de)[0x7f4a71ba1d9e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f4a71ba1ebe] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup
+0x90)[0x7f4a71ba3640] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7f4a71ba4130] ))))) 0-tier2-client-55: forced unwinding frame type(GlusterFS 3.3) op(INODELK(29)) 
called at 2019-03-13 06:14:14.858441 (xid=0x17fddb50) 

glustershd.log:[2019-03-13 06:17:48.883825] E [socket.c:2376:socket_connect_finish] 0-tier2-client-55: connection to 192.168.0.55:49158 failed (Connection timed out); disconnecting socket
glustershd.log:[2019-03-13 06:19:58.931798] E [socket.c:2376:socket_connect_finish] 0-tier2-client-55: connection to 192.168.0.55:49158 failed (Connection timed out); disconnecting socket
glustershd.log:[2019-03-13 06:22:08.979829] E [socket.c:2376:socket_connect_finish] 0-tier2-client-55: connection to 192.168.0.55:49158 failed (Connection timed out); disconnecting socket
glustershd.log:[2019-03-13 06:22:36.226847] E [MSGID: 114031] [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-tier2-client-55: remote operation failed [Transport endpoint is not connected]
glustershd.log:[2019-03-13 06:22:36.306669] E [MSGID: 114031] [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-tier2-client-55: remote operation failed [Transport endpoint is not connected]
glustershd.log:[2019-03-13 06:22:36.385257] E [MSGID: 114031] [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-tier2-client-55: remote operation failed [Transport endpoint is not connected]

WARNINGS:
CWD: /var/log/glusterfs 
COMMAND: grep " W " *.log |grep "2019-03-13 06:"

(I can see many messages of this kind in the same period, but I'm reporting only one record for each type of warning)

glustershd.log:[2019-03-13 06:14:28.666772] W [MSGID: 114031] [client-rpc-fops.c:1080:client3_3_getxattr_cbk] 0-tier2-client-55: remote operation failed. Path: <gfid:b6b35d0f-f34d-4c25-bbe8-74bde0248d7e> (b6b35d0f-f34d-4c25-bbe8-74bde0248d7e). Key: (null) [Transport endpoint is not connected]

glustershd.log:[2019-03-13 06:14:31.421576] W [MSGID: 122035] [ec-common.c:571:ec_child_select] 0-tier2-disperse-9: Executing operation with some subvolumes unavailable (2)

glustershd.log:[2019-03-13 06:15:31.547417] W [MSGID: 122032] [ec-heald.c:266:ec_shd_index_sweep] 0-tier2-disperse-9: unable to get index-dir on tier2-client-55 [Operation now in progress]

quota-mount-tier2.log:[2019-03-13 06:12:36.116277] W [MSGID: 101002] [options.c:995:xl_opt_validate] 0-glusterfs: option 'address-family' is deprecated, preferred is 'transport.address-family', continuing with correction
quota-mount-tier2.log:[2019-03-13 06:12:36.198430] W [MSGID: 101174] [graph.c:363:_log_if_unknown_option] 0-tier2-readdir-ahead: option 'parallel-readdir' is not recognized
quota-mount-tier2.log:[2019-03-13 06:12:37.945007] W [glusterfsd.c:1375:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7e25) [0x7f340892be25] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x55ef010164b5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x55ef0101632b] ) 0-: received signum (15), shutting down

CRITICALS:
CWD: /var/log/glusterfs 
COMMAND: grep " C " *.log |grep "2019-03-13 06:"

No critical errors at 06:xx; only one critical error during the day:

[root@s06 glusterfs]# grep " C " *.log |grep "2019-03-13"
glustershd.log:[2019-03-13 02:21:29.126279] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-55: server 192.168.0.55:49158 has not responded in the last 42 seconds, disconnecting.


Thank you very much for your help.
Regards,
Mauro

On 12 Mar 2019, at 05:17, Raghavendra Gowdappa <[hidden email]> wrote:

Was the suggestion to increase the server.event-threads value tried? If yes, what were the results?

On Mon, Mar 11, 2019 at 2:40 PM Mauro Tridici <[hidden email]> wrote:
Dear All,

do you have any suggestions about the right way to "debug" this issue?
In attachment are the updated logs of the "s06" gluster server.

I noticed a lot of intermittent warning and error messages.

Thank you in advance,
Mauro



On 4 Mar 2019, at 18:45, Raghavendra Gowdappa <[hidden email]> wrote:


I would like to point out another issue. Even if what I suggested prevents disconnects, part of the solution would be only symptomatic treatment and doesn't address the root cause of the problem. In most of the ping-timer-expiry issues, the root cause is the increased load on bricks and the inability of bricks to be responsive under high load. So, the actual solution would be doing any or both of the following:
* identify the source of increased load and if possible throttle it. Internal heal processes like self-heal, rebalance, quota heal are known to pump traffic into bricks without much throttling (io-threads _might_ do some throttling, but my understanding is its not sufficient).
* identify the reason for bricks to become unresponsive during load. This may be fixable issues like not enough event-threads to read from network or difficult to fix issues like fsync on backend fs freezing the process or semi fixable issues (in code) like lock contention.

So any genuine effort to fix ping-timer-issues (to be honest most of the times they are not issues related to rpc/network) would involve performance characterization of various subsystems on bricks and clients. Various subsystems can include (but not necessarily limited to), underlying OS/filesystem, glusterfs processes, CPU consumption etc

regards,
Raghavendra

On Mon, Mar 4, 2019 at 9:31 PM Mauro Tridici <[hidden email]> wrote:
Thank you, let’s try!
I will inform you about the effects of the change.

Regards,
Mauro

On 4 Mar 2019, at 16:55, Raghavendra Gowdappa <[hidden email]> wrote:



On Mon, Mar 4, 2019 at 8:54 PM Mauro Tridici <[hidden email]> wrote:
Hi Raghavendra,

thank you for your reply.
Yes, you are right. It is a problem that seems to happen randomly.
At this moment, server.event-threads value is 4. I will try to increase this value to 8. Do you think that it could be a valid value ?

Yes. We can try with that. You should see at least frequency of ping-timer related disconnects  reduce with this value (even if it doesn't eliminate the problem completely).


Regards,
Mauro


On 4 Mar 2019, at 15:36, Raghavendra Gowdappa <[hidden email]> wrote:



On Mon, Mar 4, 2019 at 8:01 PM Nithya Balachandran <[hidden email]> wrote:
Hi Mauro,

It looks like some problem on s06. Are all your other nodes ok? Can you send us the gluster logs from this node?

[hidden email] , do you have any idea as to how this can be debugged? Maybe running top ? Or debug brick logs?

If we can reproduce the problem, collecting tcpdump on both ends of connection will help. But, one common problem is these bugs are inconsistently reproducible and hence we may not be able to capture tcpdump at correct intervals. Other than that, we can try to collect some evidence that poller threads were busy (waiting on locks). But, not sure what debug data provides that information.

From what I know, its difficult to collect evidence for this issue and we could only reason about it.

We can try a workaround though - try increasing server.event-threads and see whether ping-timer expiry issues go away with an optimal value. If that's the case, it kind of provides proof for our hypothesis.



Regards,
Nithya

On Mon, 4 Mar 2019 at 15:25, Mauro Tridici <[hidden email]> wrote:
Hi All,

some minutes ago I received this message from NAGIOS server 

***** Nagios *****

Notification Type: PROBLEM

Service: Brick - /gluster/mnt2/brick
Host: s06
Address: s06-stg
State: CRITICAL

Date/Time: Mon Mar 4 10:25:33 CET 2019

Additional Info:
CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.

I checked the network, RAM and CPU usage on the s06 node and everything seems to be ok.
No bricks are in an error state. In /var/log/messages, I again detected a crash of “check_vol_utili”, which I think is a module used by the NRPE executable (that is, the NAGIOS client).

Mar  4 10:15:29 s06 kernel: traps: check_vol_utili[161224] general protection ip:7facffa0a66d sp:7ffe9f4e6fc0 error:0 in libglusterfs.so.0.0.1[7facff9b7000+f7000]
Mar  4 10:15:29 s06 abrt-hook-ccpp: Process 161224 (python2.7) of user 0 killed by SIGSEGV - dumping core
Mar  4 10:15:29 s06 abrt-server: Generating core_backtrace
Mar  4 10:15:29 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  4 10:16:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:16:01 s06 systemd: Starting User Slice of root.
Mar  4 10:16:01 s06 systemd: Started Session 201010 of user root.
Mar  4 10:16:01 s06 systemd: Starting Session 201010 of user root.
Mar  4 10:16:01 s06 systemd: Removed slice User Slice of root.
Mar  4 10:16:01 s06 systemd: Stopping User Slice of root.
Mar  4 10:16:24 s06 abrt-server: Duplicate: UUID
Mar  4 10:16:24 s06 abrt-server: DUP_OF_DIR: /var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041
Mar  4 10:16:24 s06 abrt-server: Deleting problem directory ccpp-2019-03-04-10:15:29-161224 (dup of ccpp-2018-09-25-12:27:42-13041)
Mar  4 10:16:24 s06 abrt-server: Generating core_backtrace
Mar  4 10:16:24 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  4 10:16:24 s06 abrt-server: Cannot notify '/var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041' via uReport: Event 'report_uReport' exited with 1
Mar  4 10:16:24 s06 abrt-hook-ccpp: Process 161391 (python2.7) of user 0 killed by SIGABRT - dumping core
Mar  4 10:16:25 s06 abrt-server: Generating core_backtrace
Mar  4 10:16:25 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  4 10:17:01 s06 systemd: Created slice User Slice of root.

Also, I noticed the following errors that I think are very critical:

Mar  4 10:21:12 s06 glustershd[20355]: [2019-03-04 09:21:12.954798] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-55: server 192.168.0.55:49158 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:22:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:22:01 s06 systemd: Starting User Slice of root.
Mar  4 10:22:01 s06 systemd: Started Session 201017 of user root.
Mar  4 10:22:01 s06 systemd: Starting Session 201017 of user root.
Mar  4 10:22:01 s06 systemd: Removed slice User Slice of root.
Mar  4 10:22:01 s06 systemd: Stopping User Slice of root.
Mar  4 10:22:03 s06 glustershd[20355]: [2019-03-04 09:22:03.964120] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-60: server 192.168.0.54:49165 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:23:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:23:01 s06 systemd: Starting User Slice of root.
Mar  4 10:23:01 s06 systemd: Started Session 201018 of user root.
Mar  4 10:23:01 s06 systemd: Starting Session 201018 of user root.
Mar  4 10:23:02 s06 systemd: Removed slice User Slice of root.
Mar  4 10:23:02 s06 systemd: Stopping User Slice of root.
Mar  4 10:24:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:24:01 s06 systemd: Starting User Slice of root.
Mar  4 10:24:01 s06 systemd: Started Session 201019 of user root.
Mar  4 10:24:01 s06 systemd: Starting Session 201019 of user root.
Mar  4 10:24:01 s06 systemd: Removed slice User Slice of root.
Mar  4 10:24:01 s06 systemd: Stopping User Slice of root.
Mar  4 10:24:03 s06 glustershd[20355]: [2019-03-04 09:24:03.982502] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-16: server 192.168.0.52:49158 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746109] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-3: server 192.168.0.51:49153 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746215] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-10: server 192.168.0.52:49156 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746260] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-21: server 192.168.0.51:49159 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746296] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-25: server 192.168.0.52:49161 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746413] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-60: server 192.168.0.54:49165 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:07 s06 glustershd[20355]: [2019-03-04 09:24:07.982952] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-45: server 192.168.0.54:49155 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:18 s06 glustershd[20355]: [2019-03-04 09:24:18.990929] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-25: server 192.168.0.52:49161 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:31 s06 glustershd[20355]: [2019-03-04 09:24:31.995781] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-20: server 192.168.0.53:49159 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:25:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:25:01 s06 systemd: Starting User Slice of root.
Mar  4 10:25:01 s06 systemd: Started Session 201020 of user root.
Mar  4 10:25:01 s06 systemd: Starting Session 201020 of user root.
Mar  4 10:25:01 s06 systemd: Removed slice User Slice of root.
Mar  4 10:25:01 s06 systemd: Stopping User Slice of root.
Mar  4 10:25:57 s06 systemd: Created slice User Slice of root.
Mar  4 10:25:57 s06 systemd: Starting User Slice of root.
Mar  4 10:25:57 s06 systemd-logind: New session 201021 of user root.
Mar  4 10:25:57 s06 systemd: Started Session 201021 of user root.
Mar  4 10:25:57 s06 systemd: Starting Session 201021 of user root.
Mar  4 10:26:01 s06 systemd: Started Session 201022 of user root.
Mar  4 10:26:01 s06 systemd: Starting Session 201022 of user root.
Mar  4 10:26:21 s06 nrpe[162388]: Error: Could not complete SSL handshake with 192.168.1.56: 5
Mar  4 10:27:01 s06 systemd: Started Session 201023 of user root.
Mar  4 10:27:01 s06 systemd: Starting Session 201023 of user root.
Mar  4 10:28:01 s06 systemd: Started Session 201024 of user root.
Mar  4 10:28:01 s06 systemd: Starting Session 201024 of user root.
Mar  4 10:29:01 s06 systemd: Started Session 201025 of user root.
Mar  4 10:29:01 s06 systemd: Starting Session 201025 of user root.

But, unfortunately, I don’t understand why this is happening.
Now, the NAGIOS server shows that the s06 status is ok:

***** Nagios *****

Notification Type: RECOVERY

Service: Brick - /gluster/mnt2/brick
Host: s06
Address: s06-stg
State: OK

Date/Time: Mon Mar 4 10:35:23 CET 2019

Additional Info:
OK: Brick /gluster/mnt2/brick is up

Nothing has changed from the RAM, CPU, and NETWORK point of view.
The /var/log/messages file has been updated:

Mar  4 10:32:01 s06 systemd: Starting Session 201029 of user root.
Mar  4 10:32:30 s06 glustershd[20355]: [2019-03-04 09:32:30.069082] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-10: server 192.168.0.52:49156 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:32:55 s06 glustershd[20355]: [2019-03-04 09:32:55.074689] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-66: server 192.168.0.54:49167 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:33:01 s06 systemd: Started Session 201030 of user root.
Mar  4 10:33:01 s06 systemd: Starting Session 201030 of user root.
Mar  4 10:34:01 s06 systemd: Started Session 201031 of user root.
Mar  4 10:34:01 s06 systemd: Starting Session 201031 of user root.
Mar  4 10:35:01 s06 nrpe[162562]: Could not read request from client 192.168.1.56, bailing out...
Mar  4 10:35:01 s06 nrpe[162562]: INFO: SSL Socket Shutdown.
Mar  4 10:35:01 s06 systemd: Started Session 201032 of user root.
Mar  4 10:35:01 s06 systemd: Starting Session 201032 of user root.

Could you please help me to understand what is happening?
Thank you in advance.

Regards,
Mauro


On 1 Mar 2019, at 12:17, Mauro Tridici <[hidden email]> wrote:


Thank you, Milind.
I executed the instructions you suggested:

- grep “blocked for” /var/log/messages on s06 returns no output (the word “blocked” is not detected in the messages file);
- in the /var/log/messages file I can see this kind of error repeated many times:

Mar  1 08:43:01 s06 systemd: Starting Session 196071 of user root.
Mar  1 08:43:01 s06 systemd: Removed slice User Slice of root.
Mar  1 08:43:01 s06 systemd: Stopping User Slice of root.
Mar  1 08:43:02 s06 kernel: traps: check_vol_utili[57091] general protection ip:7f88e76ee66d sp:7ffe5a5bcc30 error:0 in libglusterfs.so.0.0.1[7f88e769b000+f7000]
Mar  1 08:43:02 s06 abrt-hook-ccpp: Process 57091 (python2.7) of user 0 killed by SIGSEGV - dumping core
Mar  1 08:43:02 s06 abrt-server: Generating core_backtrace
Mar  1 08:43:02 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  1 08:43:58 s06 abrt-server: Duplicate: UUID
Mar  1 08:43:58 s06 abrt-server: DUP_OF_DIR: /var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041
Mar  1 08:43:58 s06 abrt-server: Deleting problem directory ccpp-2019-03-01-08:43:02-57091 (dup of ccpp-2018-09-25-12:27:42-13041)
Mar  1 08:43:58 s06 dbus[1872]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Mar  1 08:43:58 s06 dbus[1872]: [system] Successfully activated service 'org.freedesktop.problems'
Mar  1 08:43:58 s06 abrt-server: Generating core_backtrace
Mar  1 08:43:58 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  1 08:43:58 s06 abrt-server: Cannot notify '/var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041' via uReport: Event 'report_uReport' exited with 1
Mar  1 08:44:01 s06 systemd: Created slice User Slice of root.
Mar  1 08:44:01 s06 systemd: Starting User Slice of root.
Mar  1 08:44:01 s06 systemd: Started Session 196072 of user root.
Mar  1 08:44:01 s06 systemd: Starting Session 196072 of user root.
Mar  1 08:44:01 s06 systemd: Removed slice User Slice of root.

- in the /var/log/messages file I can also see 4 errors related to other cluster servers:

Mar  1 11:05:01 s06 systemd: Starting User Slice of root.
Mar  1 11:05:01 s06 systemd: Started Session 196230 of user root.
Mar  1 11:05:01 s06 systemd: Starting Session 196230 of user root.
Mar  1 11:05:01 s06 systemd: Removed slice User Slice of root.
Mar  1 11:05:01 s06 systemd: Stopping User Slice of root.
Mar  1 11:05:59 s06 glustershd[70117]: [2019-03-01 10:05:59.347094] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-33: server 192.168.0.51:49163 has not responded in the last 42 seconds, disconnecting.
Mar  1 11:06:01 s06 systemd: Created slice User Slice of root.
Mar  1 11:06:01 s06 systemd: Starting User Slice of root.
Mar  1 11:06:01 s06 systemd: Started Session 196231 of user root.
Mar  1 11:06:01 s06 systemd: Starting Session 196231 of user root.
Mar  1 11:06:01 s06 systemd: Removed slice User Slice of root.
Mar  1 11:06:01 s06 systemd: Stopping User Slice of root.
Mar  1 11:06:12 s06 glustershd[70117]: [2019-03-01 10:06:12.351319] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-1: server 192.168.0.52:49153 has not responded in the last 42 seconds, disconnecting.
Mar  1 11:06:38 s06 glustershd[70117]: [2019-03-01 10:06:38.356920] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-7: server 192.168.0.52:49155 has not responded in the last 42 seconds, disconnecting.
Mar  1 11:07:01 s06 systemd: Created slice User Slice of root.
Mar  1 11:07:01 s06 systemd: Starting User Slice of root.
Mar  1 11:07:01 s06 systemd: Started Session 196232 of user root.
Mar  1 11:07:01 s06 systemd: Starting Session 196232 of user root.
Mar  1 11:07:01 s06 systemd: Removed slice User Slice of root.
Mar  1 11:07:01 s06 systemd: Stopping User Slice of root.
Mar  1 11:07:36 s06 glustershd[70117]: [2019-03-01 10:07:36.366259] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-0: server 192.168.0.51:49152 has not responded in the last 42 seconds, disconnecting.
Mar  1 11:08:01 s06 systemd: Created slice User Slice of root.

The word “blocked” does not appear in the /var/log/messages files on the other cluster servers.
In attachment is the /var/log/messages file from the s06 server.

Thank you in advance,
Mauro

<messages.zip>


On 1 Mar 2019, at 11:47, Milind Changire <[hidden email]> wrote:

Traces of very high disk activity on the servers are often found in /var/log/messages.
You might want to grep for "blocked for" in /var/log/messages on s06 and correlate the timestamps to confirm the unresponsiveness reported in the gluster client logs.
In cases of high disk activity, although the operating system continues to respond to ICMP pings, the processes writing to disks often get blocked on a large flush to the disk, which can span beyond 42 seconds and hence result in ping-timer-expiry logs.

As a side note:
If you indeed find gluster processes being blocked in /var/log/messages, you might want to tweak the sysctl tunables vm.dirty_background_ratio or vm.dirty_background_bytes to a smaller value than the existing one. Please read up on those tunables before touching the settings.
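A minimal sketch of checking and lowering them (the byte value is purely illustrative; note that setting vm.dirty_background_bytes to a non-zero value makes the kernel ignore vm.dirty_background_ratio):

# sysctl vm.dirty_background_ratio vm.dirty_background_bytes
# sysctl -w vm.dirty_background_bytes=268435456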


On Fri, Mar 1, 2019 at 4:06 PM Mauro Tridici <[hidden email]> wrote:

Hi all,

in attachment is the client log captured after changing the network.ping-timeout option.
I noticed this error involving server 192.168.0.56 (s06):

[2019-03-01 09:23:36.077287] I [rpc-clnt.c:1962:rpc_clnt_reconfig] 0-tier2-client-71: changing ping timeout to 42 (from 0)
[2019-03-01 09:23:36.078213] I [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
[2019-03-01 09:23:36.078432] I [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
[2019-03-01 09:23:36.092357] I [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
[2019-03-01 09:23:36.094146] I [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
[2019-03-01 10:06:24.708082] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-50: server 192.168.0.56:49156 has not responded in the last 42 seconds, disconnecting.

I don’t know why this happens; the s06 server seems to be reachable.

[athena_login2][/users/home/sysm02/]> telnet 192.168.0.56 49156
Trying 192.168.0.56...
Connected to 192.168.0.56.
Escape character is '^]'.
^CConnection closed by foreign host.
[athena_login2][/users/home/sysm02/]> ping 192.168.0.56
PING 192.168.0.56 (192.168.0.56) 56(84) bytes of data.
64 bytes from 192.168.0.56: icmp_seq=1 ttl=64 time=0.116 ms
64 bytes from 192.168.0.56: icmp_seq=2 ttl=64 time=0.101 ms

--- 192.168.0.56 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1528ms
rtt min/avg/max/mdev = 0.101/0.108/0.116/0.012 ms

[athena_login2][/users/home/sysm02/]> telnet 192.168.0.56 49156
Trying 192.168.0.56...
Connected to 192.168.0.56.
Escape character is '^]'.

Thank you for your help,
Mauro



On 1 Mar 2019, at 10:29, Mauro Tridici <[hidden email]> wrote:

Hi all,

thank you for the explanation.
I just changed the network.ping-timeout option to the default value (network.ping-timeout=42).

I will check the logs to see if the errors will appear again.

Regards,
Mauro

On 1 Mar 2019, at 04:43, Milind Changire <[hidden email]> wrote:

network.ping-timeout should not be set to zero for non-glusterd clients.
glusterd is a special case for which ping-timeout is set to zero via /etc/glusterfs/glusterd.vol

Setting network.ping-timeout to zero disables arming of the ping timer for connections. This disables testing the connection for responsiveness and hence avoids proactive fail-over.

Please reset network.ping-timeout to a non-zero positive value, e.g. 42.
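A sketch of the reset and verification (the volume name "tier2" is assumed from the log prefixes):

# gluster volume set tier2 network.ping-timeout 42
# gluster volume get tier2 network.ping-timeout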


On Thu, Feb 28, 2019 at 5:07 PM Nithya Balachandran <[hidden email]> wrote:
Adding Raghavendra and Milind to comment on this.

What is the effect of setting network.ping-timeout to 0 and should it be set back to 42?
Regards,
Nithya

On Thu, 28 Feb 2019 at 16:01, Mauro Tridici <[hidden email]> wrote:
Hi Nithya,

sorry for the delay.
network.ping-timeout was set to 0 in order to try to solve some timeout problems, but it didn’t help.
I can set it to the default value. 

Can I proceed with the change?

Thank you,
Mauro


On 28 Feb 2019, at 04:41, Nithya Balachandran <[hidden email]> wrote:

Hi Mauro,

Is network.ping-timeout still set to 0? The default value is 42. Is there a particular reason why this was changed?

Regards,
Nithya


On Wed, 27 Feb 2019 at 21:32, Mauro Tridici <[hidden email]> wrote:

Hi Xavi,

thank you for the detailed explanation and suggestions.
Yes, the transport.listen-backlog option is still set to 1024.

I will check the network and connectivity status using “ping” and “telnet” as soon as the errors come back again.

Regards,
Mauro

On 27 Feb 2019, at 16:42, Xavi Hernandez <[hidden email]> wrote:

Hi Mauro,

those errors say that the mount point is not connected to some of the bricks while executing operations. I see references to the 3rd and 6th bricks of several disperse sets, which seem to map to server s06. For some reason, gluster is having trouble connecting from the client machine to that particular server. At the end of the log I see that, after a long time, a reconnect is done to both of them. However, a little after, other bricks from s05 get disconnected and a reconnect times out.

That's really odd. It seems as if the communication to s06 is cut for some time, then restored, and then the same happens to the next server.

If the servers are really online and it's only a communication issue, it explains why server memory and network usage have increased: if the problem only exists between the client and the servers, any write made by the client will automatically mark the file as damaged, since some of the servers have not been updated. Since self-heal runs from the server nodes, they will probably be correctly connected to all bricks, which allows them to heal the just-damaged file, and this increases memory and network usage.

I guess you still have transport.listen-backlog set to 1024, right?

Just to try to identify whether the problem really comes from the network, can you check if you lose some pings from the client to all of the servers while you are seeing those errors in the log file?

You can also check whether, during those errors, you can telnet to the port of the brick from the client.
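A quick sketch for both checks (the server IPs are taken from the logs above; the brick port is only an example, since each brick listens on its own port):

# for ip in 192.168.0.51 192.168.0.52 192.168.0.53 192.168.0.54 192.168.0.55 192.168.0.56; do ping -c 3 $ip; done
# telnet 192.168.0.56 49156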

Xavi



On Tue, Feb 26, 2019 at 10:17 AM Mauro Tridici <[hidden email]> wrote:
Hi Nithya,

The “df -h” operation is not slow anymore, but no users are using the volume; RAM and NETWORK usage is ok on the client node.

I was worried about these kinds of warnings/errors:

[2019-02-25 10:59:00.664323] W [MSGID: 122035] [ec-common.c:571:ec_child_select] 0-tier2-disperse-6: Executing operation with some subvolumes unavailable (20)

[2019-02-26 03:11:35.212603] E [rpc-clnt.c:350:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] ))))) 0-tier2-client-50: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2019-02-26 03:10:56.549903 (xid=0x106f1c5)

[2019-02-26 03:13:03.313831] E [socket.c:2376:socket_connect_finish] 0-tier2-client-50: connection to 192.168.0.56:49156 failed (Connection timed out); disconnecting socket

It seems that some subvolumes are not available and the server 192.168.0.56 (s06) is not reachable.
But the gluster servers are up and running and the bricks are ok.

In attachment is the updated tier2.log file.


Thank you.
Regards,
Mauro

On 26 Feb 2019, at 04:03, Nithya Balachandran <[hidden email]> wrote:

Hi,

I see a lot of EC messages in the log but they don't seem very serious. Xavi, can you take a look?

The only errors I see are:
[2019-02-25 10:58:45.519871] E [rpc-clnt.c:350:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] ))))) 0-tier2-client-50: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2019-02-25 10:57:47.429969 (xid=0xd26fe7)
[2019-02-25 10:58:51.461493] E [rpc-clnt.c:350:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] ))))) 0-tier2-client-41: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2019-02-25 10:57:47.499174 (xid=0xf47d6a)
[2019-02-25 11:07:57.152874] E [socket.c:2376:socket_connect_finish] 0-tier2-client-70: connection to 192.168.0.55:49163 failed (Connection timed out); disconnecting socket


Is the df -h operation still slow? If yes, can you take a tcpdump of the client while running df -h and send that across?

Regards,
Nithya

On Mon, 25 Feb 2019 at 17:27, Mauro Tridici <[hidden email]> wrote:

Sorry, a few minutes after my last mail message, I noticed that the “df -h” command hung for a while before returning the prompt.
Yesterday everything was ok in the gluster client log, but today I see a lot of errors (please take a look at the attached file).

On the client node, I detected significant RAM and NETWORK usage.



Do you think that the errors have been caused by the client's resource usage?

Thank you in advance,
Mauro

Re: [Gluster-users] "rpc_clnt_ping_timer_expired" errors

Mauro Tridici
Hi Raghavendra,

I just changed the client.event-threads option value to 8.
I will check the volume behaviour over the next few hours.

The GlusterFS version is 3.12.14.

I will provide you with the logs as soon as the activity load is high.
Thank you,
Mauro

On 14 Mar 2019, at 04:57, Raghavendra Gowdappa <[hidden email]> wrote:



On Wed, Mar 13, 2019 at 3:55 PM Mauro Tridici <[hidden email]> wrote:
Hi Raghavendra,

Yes, server.event-thread has been changed from 4 to 8.

Was client.event-thread value too changed to 8? If not, I would like to know the results of including this tuning too. Also, if possible, can you get the output of following command from problematic clients and bricks (during the duration when load tends to be high and ping-timer-expiry is seen)?

# top -bHd 3
 
This will help us to know  CPU utilization of event-threads.

And I forgot to ask, what version of Glusterfs are you using?

During last days, I noticed that the error events are still here although they have been considerably reduced.

So, I used grep command against the log files in order to provide you a global vision about the warning, error and critical events appeared today at 06:xx (may be useful I hope).
I collected the info from s06 gluster server, but the behaviour is the the almost the same on the other gluster servers.

ERRORS:  
CWD: /var/log/glusterfs 
COMMAND: grep " E " *.log |grep "2019-03-13 06:"

(I can see a lot of this kind of message in the same period but I'm notifying you only one record for each type of error)

glusterd.log:[2019-03-13 06:12:35.982863] E [MSGID: 101042] [compat.c:569:gf_umount_lazy] 0-management: Lazy unmount of /var/run/gluster/tier2_quota_list/

glustershd.log:[2019-03-13 06:14:28.666562] E [rpc-clnt.c:350:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f4a71ddcebb] (--> /lib64/libgfr
pc.so.0(saved_frames_unwind+0x1de)[0x7f4a71ba1d9e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f4a71ba1ebe] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup
+0x90)[0x7f4a71ba3640] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7f4a71ba4130] ))))) 0-tier2-client-55: forced unwinding frame type(GlusterFS 3.3) op(INODELK(29)) 
called at 2019-03-13 06:14:14.858441 (xid=0x17fddb50) 

glustershd.log:[2019-03-13 06:17:48.883825] E [socket.c:2376:socket_connect_finish] 0-tier2-client-55: connection to 192.168.0.55:49158 failed (Connection timed out); disco
nnecting socket
glustershd.log:[2019-03-13 06:19:58.931798] E [socket.c:2376:socket_connect_finish] 0-tier2-client-55: connection to 192.168.0.55:49158 failed (Connection timed out); disco
nnecting socket
glustershd.log:[2019-03-13 06:22:08.979829] E [socket.c:2376:socket_connect_finish] 0-tier2-client-55: connection to 192.168.0.55:49158 failed (Connection timed out); disco
nnecting socket
glustershd.log:[2019-03-13 06:22:36.226847] E [MSGID: 114031] [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-tier2-client-55: remote operation failed [Transport endpoint 
is not connected]
glustershd.log:[2019-03-13 06:22:36.306669] E [MSGID: 114031] [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-tier2-client-55: remote operation failed [Transport endpoint 
is not connected]
glustershd.log:[2019-03-13 06:22:36.385257] E [MSGID: 114031] [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-tier2-client-55: remote operation failed [Transport endpoint 
is not connected]

WARNINGS:
CWD: /var/log/glusterfs 
COMMAND: grep " W " *.log |grep "2019-03-13 06:"

(I can see a lot of this kind of message in the same period but I'm notifying you only one record for each type of warnings)

glustershd.log:[2019-03-13 06:14:28.666772] W [MSGID: 114031] [client-rpc-fops.c:1080:client3_3_getxattr_cbk] 0-tier2-client-55: remote operation failed. Path: <gfid:b6b35d
0f-f34d-4c25-bbe8-74bde0248d7e> (b6b35d0f-f34d-4c25-bbe8-74bde0248d7e). Key: (null) [Transport endpoint is not connected]

glustershd.log:[2019-03-13 06:14:31.421576] W [MSGID: 122035] [ec-common.c:571:ec_child_select] 0-tier2-disperse-9: Executing operation with some subvolumes unavailable (2)

glustershd.log:[2019-03-13 06:15:31.547417] W [MSGID: 122032] [ec-heald.c:266:ec_shd_index_sweep] 0-tier2-disperse-9: unable to get index-dir on tier2-client-55 [Operation 
now in progress]

quota-mount-tier2.log:[2019-03-13 06:12:36.116277] W [MSGID: 101002] [options.c:995:xl_opt_validate] 0-glusterfs: option 'address-family' is deprecated, preferred is 'trans
port.address-family', continuing with correction
quota-mount-tier2.log:[2019-03-13 06:12:36.198430] W [MSGID: 101174] [graph.c:363:_log_if_unknown_option] 0-tier2-readdir-ahead: option 'parallel-readdir' is not recognized
quota-mount-tier2.log:[2019-03-13 06:12:37.945007] W [glusterfsd.c:1375:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7e25) [0x7f340892be25] -->/usr/sbin/glusterfs(gluste
rfs_sigwaiter+0xe5) [0x55ef010164b5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x55ef0101632b] ) 0-: received signum (15), shutting down

CRITICALS:
CWD: /var/log/glusterfs 
COMMAND: grep " C " *.log |grep "2019-03-13 06:"

no critical errors at 06:xx
only one critical error during the day

[root@s06 glusterfs]# grep " C " *.log |grep "2019-03-13"
glustershd.log:[2019-03-13 02:21:29.126279] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-55: server 192.168.0.55:49158 has not responded in the last 42 seconds, disconnecting.


Thank you very much for your help.
Regards,
Mauro

On 12 Mar 2019, at 05:17, Raghavendra Gowdappa <[hidden email]> wrote:

Was the suggestion to increase server.event-thread values tried? If yes, what were the results?

On Mon, Mar 11, 2019 at 2:40 PM Mauro Tridici <[hidden email]> wrote:
Dear All,

do you have any suggestions about the right way to "debug" this issue?
In attachment, the updated logs of “s06" gluster server.

I noticed a lot of intermittent warning and error messages.

Thank you in advance,
Mauro



On 4 Mar 2019, at 18:45, Raghavendra Gowdappa <[hidden email]> wrote:


I would like to point out another issue. Even if what I suggested prevents disconnects, part of the solution would be only symptomatic treatment and doesn't address the root cause of the problem. In most of the ping-timer-expiry issues, the root cause is the increased load on bricks and the inability of bricks to be responsive under high load. So, the actual solution would be doing any or both of the following:
* identify the source of increased load and if possible throttle it. Internal heal processes like self-heal, rebalance, quota heal are known to pump traffic into bricks without much throttling (io-threads _might_ do some throttling, but my understanding is its not sufficient).
* identify the reason for bricks to become unresponsive during load. This may be fixable issues like not enough event-threads to read from network or difficult to fix issues like fsync on backend fs freezing the process or semi fixable issues (in code) like lock contention.

So any genuine effort to fix ping-timer-issues (to be honest most of the times they are not issues related to rpc/network) would involve performance characterization of various subsystems on bricks and clients. Various subsystems can include (but not necessarily limited to), underlying OS/filesystem, glusterfs processes, CPU consumption etc

regards,
Raghavendra

On Mon, Mar 4, 2019 at 9:31 PM Mauro Tridici <[hidden email]> wrote:
Thank you, let’s try!
I will inform you about the effects of the change.

Regards,
Mauro

On 4 Mar 2019, at 16:55, Raghavendra Gowdappa <[hidden email]> wrote:



On Mon, Mar 4, 2019 at 8:54 PM Mauro Tridici <[hidden email]> wrote:
Hi Raghavendra,

thank you for your reply.
Yes, you are right. It is a problem that seems to happen randomly.
At this moment, server.event-threads value is 4. I will try to increase this value to 8. Do you think that it could be a valid value ?

Yes. We can try with that. You should see at least frequency of ping-timer related disconnects  reduce with this value (even if it doesn't eliminate the problem completely).


Regards,
Mauro


On 4 Mar 2019, at 15:36, Raghavendra Gowdappa <[hidden email]> wrote:



On Mon, Mar 4, 2019 at 8:01 PM Nithya Balachandran <[hidden email]> wrote:
Hi Mauro,

It looks like some problem on s06. Are all your other nodes ok? Can you send us the gluster logs from this node?

[hidden email] , do you have any idea as to how this can be debugged? Maybe running top ? Or debug brick logs?

If we can reproduce the problem, collecting tcpdump on both ends of connection will help. But, one common problem is these bugs are inconsistently reproducible and hence we may not be able to capture tcpdump at correct intervals. Other than that, we can try to collect some evidence that poller threads were busy (waiting on locks). But, not sure what debug data provides that information.

From what I know, its difficult to collect evidence for this issue and we could only reason about it.

We can try a workaround though - try increasing server.event-threads and see whether ping-timer expiry issues go away with an optimal value. If that's the case, it kind of provides proof for our hypothesis.



Regards,
Nithya

On Mon, 4 Mar 2019 at 15:25, Mauro Tridici <[hidden email]> wrote:
Hi All,

some minutes ago I received this message from NAGIOS server 

***** Nagios *****

Notification Type: PROBLEM

Service: Brick - /gluster/mnt2/brick
Host: s06
Address: s06-stg
State: CRITICAL

Date/Time: Mon Mar 4 10:25:33 CET 2019

Additional Info:
CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.

I checked the network, RAM and CPUs usage on s06 node and everything seems to be ok.
No bricks are in error state. In /var/log/messages, I detected again a crash of “check_vol_utili” that I think it is a module used by NRPE executable (that is the NAGIOS client).

Mar  4 10:15:29 s06 kernel: traps: check_vol_utili[161224] general protection ip:7facffa0a66d sp:7ffe9f4e6fc0 error:0 in libglusterfs.so.0.0.1[7facff9b7000+f7000]
Mar  4 10:15:29 s06 abrt-hook-ccpp: Process 161224 (python2.7) of user 0 killed by SIGSEGV - dumping core
Mar  4 10:15:29 s06 abrt-server: Generating core_backtrace
Mar  4 10:15:29 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  4 10:16:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:16:01 s06 systemd: Starting User Slice of root.
Mar  4 10:16:01 s06 systemd: Started Session 201010 of user root.
Mar  4 10:16:01 s06 systemd: Starting Session 201010 of user root.
Mar  4 10:16:01 s06 systemd: Removed slice User Slice of root.
Mar  4 10:16:01 s06 systemd: Stopping User Slice of root.
Mar  4 10:16:24 s06 abrt-server: Duplicate: UUID
Mar  4 10:16:24 s06 abrt-server: DUP_OF_DIR: /var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041
Mar  4 10:16:24 s06 abrt-server: Deleting problem directory ccpp-2019-03-04-10:15:29-161224 (dup of ccpp-2018-09-25-12:27:42-13041)
Mar  4 10:16:24 s06 abrt-server: Generating core_backtrace
Mar  4 10:16:24 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  4 10:16:24 s06 abrt-server: Cannot notify '/var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041' via uReport: Event 'report_uReport' exited with 1
Mar  4 10:16:24 s06 abrt-hook-ccpp: Process 161391 (python2.7) of user 0 killed by SIGABRT - dumping core
Mar  4 10:16:25 s06 abrt-server: Generating core_backtrace
Mar  4 10:16:25 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  4 10:17:01 s06 systemd: Created slice User Slice of root.

Also, I noticed the following errors that I think are very critical:

Mar  4 10:21:12 s06 glustershd[20355]: [2019-03-04 09:21:12.954798] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-55: server 192.168.0.55:49158 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:22:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:22:01 s06 systemd: Starting User Slice of root.
Mar  4 10:22:01 s06 systemd: Started Session 201017 of user root.
Mar  4 10:22:01 s06 systemd: Starting Session 201017 of user root.
Mar  4 10:22:01 s06 systemd: Removed slice User Slice of root.
Mar  4 10:22:01 s06 systemd: Stopping User Slice of root.
Mar  4 10:22:03 s06 glustershd[20355]: [2019-03-04 09:22:03.964120] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-60: server 192.168.0.54:49165 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:23:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:23:01 s06 systemd: Starting User Slice of root.
Mar  4 10:23:01 s06 systemd: Started Session 201018 of user root.
Mar  4 10:23:01 s06 systemd: Starting Session 201018 of user root.
Mar  4 10:23:02 s06 systemd: Removed slice User Slice of root.
Mar  4 10:23:02 s06 systemd: Stopping User Slice of root.
Mar  4 10:24:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:24:01 s06 systemd: Starting User Slice of root.
Mar  4 10:24:01 s06 systemd: Started Session 201019 of user root.
Mar  4 10:24:01 s06 systemd: Starting Session 201019 of user root.
Mar  4 10:24:01 s06 systemd: Removed slice User Slice of root.
Mar  4 10:24:01 s06 systemd: Stopping User Slice of root.
Mar  4 10:24:03 s06 glustershd[20355]: [2019-03-04 09:24:03.982502] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-16: server 192.168.0.52:49158 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746109] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-3: server 192.168.0.51:49153 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746215] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-10: server 192.168.0.52:49156 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746260] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-21: server 192.168.0.51:49159 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746296] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-25: server 192.168.0.52:49161 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746413] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-60: server 192.168.0.54:49165 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:07 s06 glustershd[20355]: [2019-03-04 09:24:07.982952] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-45: server 192.168.0.54:49155 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:18 s06 glustershd[20355]: [2019-03-04 09:24:18.990929] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-25: server 192.168.0.52:49161 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:31 s06 glustershd[20355]: [2019-03-04 09:24:31.995781] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-20: server 192.168.0.53:49159 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:25:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:25:01 s06 systemd: Starting User Slice of root.
Mar  4 10:25:01 s06 systemd: Started Session 201020 of user root.
Mar  4 10:25:01 s06 systemd: Starting Session 201020 of user root.
Mar  4 10:25:01 s06 systemd: Removed slice User Slice of root.
Mar  4 10:25:01 s06 systemd: Stopping User Slice of root.
Mar  4 10:25:57 s06 systemd: Created slice User Slice of root.
Mar  4 10:25:57 s06 systemd: Starting User Slice of root.
Mar  4 10:25:57 s06 systemd-logind: New session 201021 of user root.
Mar  4 10:25:57 s06 systemd: Started Session 201021 of user root.
Mar  4 10:25:57 s06 systemd: Starting Session 201021 of user root.
Mar  4 10:26:01 s06 systemd: Started Session 201022 of user root.
Mar  4 10:26:01 s06 systemd: Starting Session 201022 of user root.
Mar  4 10:26:21 s06 nrpe[162388]: Error: Could not complete SSL handshake with 192.168.1.56: 5
Mar  4 10:27:01 s06 systemd: Started Session 201023 of user root.
Mar  4 10:27:01 s06 systemd: Starting Session 201023 of user root.
Mar  4 10:28:01 s06 systemd: Started Session 201024 of user root.
Mar  4 10:28:01 s06 systemd: Starting Session 201024 of user root.
Mar  4 10:29:01 s06 systemd: Started Session 201025 of user root.
Mar  4 10:29:01 s06 systemd: Starting Session 201025 of user root.

But, unfortunately, I don’t understand why it is happening.
Now, NAGIOS server shows that s06 status is ok:

***** Nagios *****

Notification Type: RECOVERY

Service: Brick - /gluster/mnt2/brick
Host: s06
Address: s06-stg
State: OK

Date/Time: Mon Mar 4 10:35:23 CET 2019

Additional Info:
OK: Brick /gluster/mnt2/brick is up

Nothing is changed from RAM, CPUs, and NETWORK point of view.
/var/log/message file has been updated:

Mar  4 10:32:01 s06 systemd: Starting Session 201029 of user root.
Mar  4 10:32:30 s06 glustershd[20355]: [2019-03-04 09:32:30.069082] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-10: server 192.168.0.52:49156 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:32:55 s06 glustershd[20355]: [2019-03-04 09:32:55.074689] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-66: server 192.168.0.54:49167 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:33:01 s06 systemd: Started Session 201030 of user root.
Mar  4 10:33:01 s06 systemd: Starting Session 201030 of user root.
Mar  4 10:34:01 s06 systemd: Started Session 201031 of user root.
Mar  4 10:34:01 s06 systemd: Starting Session 201031 of user root.
Mar  4 10:35:01 s06 nrpe[162562]: Could not read request from client 192.168.1.56, bailing out...
Mar  4 10:35:01 s06 nrpe[162562]: INFO: SSL Socket Shutdown.
Mar  4 10:35:01 s06 systemd: Started Session 201032 of user root.
Mar  4 10:35:01 s06 systemd: Starting Session 201032 of user root.

Could you please help me to understand what it’s happening ?
Thank you in advance.

Rergards,
Mauro


On 1 Mar 2019, at 12:17, Mauro Tridici <[hidden email]> wrote:


Thank you, Milind.
I executed the instructions you suggested:

- grep “blocked for” /var/log/messages on s06 returns no output (no “blocked” word is detected in messages file);
- in /var/log/messages file I can see this kind of error repeated for a lot of times:

Mar  1 08:43:01 s06 systemd: Starting Session 196071 of user root.
Mar  1 08:43:01 s06 systemd: Removed slice User Slice of root.
Mar  1 08:43:01 s06 systemd: Stopping User Slice of root.
Mar  1 08:43:02 s06 kernel: traps: check_vol_utili[57091] general protection ip:7f88e76ee66d sp:7ffe5a5bcc30 error:0 in libglusterfs.so.0.0.1[7f88e769b000+f7000]
Mar  1 08:43:02 s06 abrt-hook-ccpp: Process 57091 (python2.7) of user 0 killed by SIGSEGV - dumping core
Mar  1 08:43:02 s06 abrt-server: Generating core_backtrace
Mar  1 08:43:02 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  1 08:43:58 s06 abrt-server: Duplicate: UUID
Mar  1 08:43:58 s06 abrt-server: DUP_OF_DIR: /var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041
Mar  1 08:43:58 s06 abrt-server: Deleting problem directory ccpp-2019-03-01-08:43:02-57091 (dup of ccpp-2018-09-25-12:27:42-13041)
Mar  1 08:43:58 s06 dbus[1872]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Mar  1 08:43:58 s06 dbus[1872]: [system] Successfully activated service 'org.freedesktop.problems'
Mar  1 08:43:58 s06 abrt-server: Generating core_backtrace
Mar  1 08:43:58 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  1 08:43:58 s06 abrt-server: Cannot notify '/var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041' via uReport: Event 'report_uReport' exited with 1
Mar  1 08:44:01 s06 systemd: Created slice User Slice of root.
Mar  1 08:44:01 s06 systemd: Starting User Slice of root.
Mar  1 08:44:01 s06 systemd: Started Session 196072 of user root.
Mar  1 08:44:01 s06 systemd: Starting Session 196072 of user root.
Mar  1 08:44:01 s06 systemd: Removed slice User Slice of root.

- in /var/log/messages file I can see also 4 errors related to other cluster servers:

Mar  1 11:05:01 s06 systemd: Starting User Slice of root.
Mar  1 11:05:01 s06 systemd: Started Session 196230 of user root.
Mar  1 11:05:01 s06 systemd: Starting Session 196230 of user root.
Mar  1 11:05:01 s06 systemd: Removed slice User Slice of root.
Mar  1 11:05:01 s06 systemd: Stopping User Slice of root.
Mar  1 11:05:59 s06 glustershd[70117]: [2019-03-01 10:05:59.347094] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-33: server 192.168.0.51:49163 has not responded in the last 42 seconds, disconnecting.
Mar  1 11:06:01 s06 systemd: Created slice User Slice of root.
Mar  1 11:06:01 s06 systemd: Starting User Slice of root.
Mar  1 11:06:01 s06 systemd: Started Session 196231 of user root.
Mar  1 11:06:01 s06 systemd: Starting Session 196231 of user root.
Mar  1 11:06:01 s06 systemd: Removed slice User Slice of root.
Mar  1 11:06:01 s06 systemd: Stopping User Slice of root.
Mar  1 11:06:12 s06 glustershd[70117]: [2019-03-01 10:06:12.351319] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-1: server 192.168.0.52:49153 has not responded in the last 42 seconds, disconnecting.
Mar  1 11:06:38 s06 glustershd[70117]: [2019-03-01 10:06:38.356920] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-7: server 192.168.0.52:49155 has not responded in the last 42 seconds, disconnecting.
Mar  1 11:07:01 s06 systemd: Created slice User Slice of root.
Mar  1 11:07:01 s06 systemd: Starting User Slice of root.
Mar  1 11:07:01 s06 systemd: Started Session 196232 of user root.
Mar  1 11:07:01 s06 systemd: Starting Session 196232 of user root.
Mar  1 11:07:01 s06 systemd: Removed slice User Slice of root.
Mar  1 11:07:01 s06 systemd: Stopping User Slice of root.
Mar  1 11:07:36 s06 glustershd[70117]: [2019-03-01 10:07:36.366259] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-0: server 192.168.0.51:49152 has not responded in the last 42 seconds, disconnecting.
Mar  1 11:08:01 s06 systemd: Created slice User Slice of root.

No “blocked” word is in /var/log/messages files on other cluster servers.
In attachment, the /var/log/messages file from s06 server.

Thank you in advance,
Mauro

<messages.zip>


On 1 Mar 2019, at 11:47, Milind Changire <[hidden email]> wrote:

The traces of very high disk activity on the servers are often found in /var/log/messages
You might want to grep for "blocked for" in /var/log/messages on s06 and correlate the timestamps to confirm the unresponsiveness as reported in gluster client logs.
In cases of high disk activity, although the operating system continues to respond to ICMP pings, the processes writing to disks often get blocked to a large flush to the disk which could span beyond 42 seconds and hence result in ping-timer-expiry logs.

As a side note:
If you indeed find gluster processes being blocked in /var/log/messages, you might want to tweak sysctl tunables called vm.dirty_background_ratio or vm.dirty_background_bytes to a smaller value than the existing. Please read up more on those tunables before touching the settings.


On Fri, Mar 1, 2019 at 4:06 PM Mauro Tridici <[hidden email]> wrote:

Hi all,

in attachment the client log captured after changing network.ping-timeout option.
I noticed this error involving server 192.168.0.56 (s06)

[2019-03-01 09:23:36.077287] I [rpc-clnt.c:1962:rpc_clnt_reconfig] 0-tier2-client-71: changing ping timeout to 42 (from 0)
[2019-03-01 09:23:36.078213] I [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
[2019-03-01 09:23:36.078432] I [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
[2019-03-01 09:23:36.092357] I [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
[2019-03-01 09:23:36.094146] I [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
[2019-03-01 10:06:24.708082] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-50: server 192.168.0.56:49156 has not responded in the last 42 seconds, disconnecting.

I don’t know why it happens, s06 server seems to be reachable.

[athena_login2][/users/home/sysm02/]> telnet 192.168.0.56 49156
Trying 192.168.0.56...
Connected to 192.168.0.56.
Escape character is '^]'.
^CConnection closed by foreign host.
[athena_login2][/users/home/sysm02/]> ping 192.168.0.56
PING 192.168.0.56 (192.168.0.56) 56(84) bytes of data.
64 bytes from 192.168.0.56: icmp_seq=1 ttl=64 time=0.116 ms
64 bytes from 192.168.0.56: icmp_seq=2 ttl=64 time=0.101 ms

--- 192.168.0.56 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1528ms
rtt min/avg/max/mdev = 0.101/0.108/0.116/0.012 ms

[athena_login2][/users/home/sysm02/]> telnet 192.168.0.56 49156
Trying 192.168.0.56...
Connected to 192.168.0.56.
Escape character is '^]'.

Thank you for your help,
Mauro



On 1 Mar 2019, at 10:29, Mauro Tridici <[hidden email]> wrote:

Hi all,

thank you for the explanation.
I just changed network.ping-timeout option to default value (network.ping-timeout=42).

I will check the logs to see if the errors will appear again.

Regards,
Mauro

On 1 Mar 2019, at 04:43, Milind Changire <[hidden email]> wrote:

network.ping-timeout should not be set to zero for non-glusterd clients.
glusterd is a special case for which ping-timeout is set to zero via /etc/glusterfs/glusterd.vol

Setting network.ping-timeout to zero disables arming of the ping timer for connections. This disables testing the connection for responsiveness and hence avoids proactive fail-over.

Please reset network.ping-timeout to a non-zero positive value, eg. 42


On Thu, Feb 28, 2019 at 5:07 PM Nithya Balachandran <[hidden email]> wrote:
Adding Raghavendra and Milind to comment on this.

What is the effect of setting network.ping-timeout to 0 and should it be set back to 42?
Regards,
Nithya

On Thu, 28 Feb 2019 at 16:01, Mauro Tridici <[hidden email]> wrote:
Hi Nithya,

sorry for the late.
network.ping-timeout has been set to 0 in order to try to solve some timeout problems, but it didn’t help.
I can set it to the default value. 

Can I proceed with the change?

Thank you,
Mauro


On 28 Feb 2019, at 04:41, Nithya Balachandran <[hidden email]> wrote:

Hi Mauro,

Is network.ping-timeout still set to 0. The default value is 42. Is there a particular reason why this was changed?

Regards,
Nithya


On Wed, 27 Feb 2019 at 21:32, Mauro Tridici <[hidden email]> wrote:

Hi Xavi,

thank you for the detailed explanation and suggestions.
Yes, transport.listen-backlog option is still set to 1024.

I will check the network and connectivity status using “ping” and “telnet” as soon as the errors will come back again.

Regards,
Mauro

On 27 Feb 2019, at 16:42, Xavi Hernandez <[hidden email]> wrote:

Hi Mauro,

those errors say that the mount point is not connected to some of the bricks while executing operations. I see references to the 3rd and 6th bricks of several disperse sets, which seem to map to server s06. For some reason, gluster is having trouble connecting from the client machine to that particular server. At the end of the log I see that, after a long time, a reconnect is done to both of them. However, shortly after, other bricks from s05 get disconnected and the reconnect times out.

That's really odd. It seems as if communication to s06 is cut for some time, then restored, and then the same happens to the next server.

If the servers are really online and it's only a communication issue, it explains why server memory and network has increased: if the problem only exists between the client and servers, any write made by the client will automatically mark the file as damaged, since some of the servers have not been updated. Since self-heal runs from the server nodes, they will probably be correctly connected to all bricks, which allows them to heal the just damaged file, which increases memory and network usage.

I guess you still have transport.listen-backlog set to 1024, right?

Just to try to identify whether the problem really comes from the network, can you check if you lose some pings from the client to all of the servers while you are seeing those errors in the log file?

You can also check whether, during those errors, you can telnet to the port of the brick from the client.
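
For example, a small bash loop like the following could be left running on the client while the errors show up (just a sketch: the IP list and the brick port 49156 come from the examples in this thread, and each brick listens on its own port):

for ip in 192.168.0.51 192.168.0.52 192.168.0.53 192.168.0.54 192.168.0.55 192.168.0.56; do
    # ICMP reachability
    ping -c 2 -W 1 "$ip" >/dev/null 2>&1 && echo "$ip: ping ok" || echo "$ip: ping lost"
    # TCP reachability of one brick port (bash's /dev/tcp, with a 3-second timeout)
    timeout 3 bash -c "echo > /dev/tcp/$ip/49156" 2>/dev/null && echo "$ip:49156 ok" || echo "$ip:49156 unreachable"
done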

Xavi



On Tue, Feb 26, 2019 at 10:17 AM Mauro Tridici <[hidden email]> wrote:
Hi Nithya,

the “df -h” operation is no longer slow, but no users are using the volume; RAM and NETWORK usage is ok on the client node.

I was worried about these kinds of warnings/errors:

[2019-02-25 10:59:00.664323] W [MSGID: 122035] [ec-common.c:571:ec_child_select] 0-tier2-disperse-6: Executing operation with some subvolumes unavailable (20)

[2019-02-26 03:11:35.212603] E [rpc-clnt.c:350:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] ))))) 0-tier2-client-50: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2019-02-26 03:10:56.549903 (xid=0x106f1c5)

[2019-02-26 03:13:03.313831] E [socket.c:2376:socket_connect_finish] 0-tier2-client-50: connection to 192.168.0.56:49156 failed (Timeout della connessione); disconnecting socket

It seems that some subvolumes are not available and the 192.168.0.56 server (s06) is not reachable.
But the gluster servers are up and running and the bricks are ok.

In attachment, the updated tier2.log file.


Thank you.
Regards,
Mauro

On 26 Feb 2019, at 04:03, Nithya Balachandran <[hidden email]> wrote:

Hi,

I see a lot of EC messages in the log but they don't seem very serious. Xavi, can you take a look?

The only errors I see are:
[2019-02-25 10:58:45.519871] E [rpc-clnt.c:350:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] ))))) 0-tier2-client-50: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2019-02-25 10:57:47.429969 (xid=0xd26fe7)
[2019-02-25 10:58:51.461493] E [rpc-clnt.c:350:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] ))))) 0-tier2-client-41: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2019-02-25 10:57:47.499174 (xid=0xf47d6a)
[2019-02-25 11:07:57.152874] E [socket.c:2376:socket_connect_finish] 0-tier2-client-70: connection to 192.168.0.55:49163 failed (Timeout della connessione); disconnecting socket


Is the df -h operation still slow? If yes, can you take a tcpdump of the client while running df -h and send that across?
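
A capture could be taken on the client along these lines (a sketch: the interface name and output file are placeholders to adapt):

# tcpdump -i eth0 -s 0 -w /tmp/df-h.pcap host 192.168.0.56 &
# df -h
# kill %1

The "host 192.168.0.56" filter keeps the capture small while still covering the problematic server; drop it to record traffic to all bricks.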

Regards,
Nithya

On Mon, 25 Feb 2019 at 17:27, Mauro Tridici <[hidden email]> wrote:

Sorry, some minutes after my last mail message, I noticed that the “df -h” command hung for a while before returning the prompt.
Yesterday, everything was ok in the gluster client log, but today I see a lot of errors (please take a look at the attached file).

On the client node, I detected significant RAM and NETWORK usage.



Do you think that the errors have been caused by the client's resource usage?

Thank you in advance,
Mauro












Re: [Gluster-users] "rpc_clnt_ping_timer_expired" errors

Raghavendra Gowdappa
Thanks Mauro.

On Thu, Mar 14, 2019 at 3:38 PM Mauro Tridici <[hidden email]> wrote:
Hi Raghavendra,

I just changed the client option value to 8.
I will check the volume behaviour during the next hours.

The GlusterFS version is 3.12.14.

I will provide you with the logs as soon as the activity load is high.
Thank you,
Mauro

On 14 Mar 2019, at 04:57, Raghavendra Gowdappa <[hidden email]> wrote:



On Wed, Mar 13, 2019 at 3:55 PM Mauro Tridici <[hidden email]> wrote:
Hi Raghavendra,

Yes, server.event-thread has been changed from 4 to 8.

Was the client.event-threads value also changed to 8? If not, I would like to know the results of including this tuning too. Also, if possible, can you get the output of the following command from the problematic clients and bricks (during periods when load tends to be high and ping-timer-expiry is seen)?
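
If the client-side tuning still has to be applied, it could look like this (a sketch, assuming the volume name "tier2"):

# gluster volume set tier2 client.event-threads 8
# gluster volume get tier2 client.event-threads        (verify the new value)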

# top -bHd 3
 
This will help us know the CPU utilization of the event threads.
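
To capture this over a window of high load, the batch output can be redirected to a file, for example (the sample count is illustrative):

# top -bHd 3 -n 200 > /tmp/top-bHd3.txt        (-n 200 stops after 200 samples, roughly 10 minutes at a 3-second delay)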

And I forgot to ask, what version of Glusterfs are you using?

During the last few days, I noticed that the error events are still present, although they have been considerably reduced.

So, I used the grep command against the log files in order to give you an overview of the warning, error and critical events that appeared today at 06:xx (I hope it may be useful).
I collected the info from the s06 gluster server, but the behaviour is almost the same on the other gluster servers.

ERRORS:  
CWD: /var/log/glusterfs 
COMMAND: grep " E " *.log |grep "2019-03-13 06:"

(I can see a lot of this kind of message in the same period, but I'm reporting only one record for each type of error)

glusterd.log:[2019-03-13 06:12:35.982863] E [MSGID: 101042] [compat.c:569:gf_umount_lazy] 0-management: Lazy unmount of /var/run/gluster/tier2_quota_list/

glustershd.log:[2019-03-13 06:14:28.666562] E [rpc-clnt.c:350:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f4a71ddcebb] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f4a71ba1d9e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f4a71ba1ebe] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7f4a71ba3640] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7f4a71ba4130] ))))) 0-tier2-client-55: forced unwinding frame type(GlusterFS 3.3) op(INODELK(29)) called at 2019-03-13 06:14:14.858441 (xid=0x17fddb50)

glustershd.log:[2019-03-13 06:17:48.883825] E [socket.c:2376:socket_connect_finish] 0-tier2-client-55: connection to 192.168.0.55:49158 failed (Connection timed out); disconnecting socket
glustershd.log:[2019-03-13 06:19:58.931798] E [socket.c:2376:socket_connect_finish] 0-tier2-client-55: connection to 192.168.0.55:49158 failed (Connection timed out); disconnecting socket
glustershd.log:[2019-03-13 06:22:08.979829] E [socket.c:2376:socket_connect_finish] 0-tier2-client-55: connection to 192.168.0.55:49158 failed (Connection timed out); disconnecting socket
glustershd.log:[2019-03-13 06:22:36.226847] E [MSGID: 114031] [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-tier2-client-55: remote operation failed [Transport endpoint is not connected]
glustershd.log:[2019-03-13 06:22:36.306669] E [MSGID: 114031] [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-tier2-client-55: remote operation failed [Transport endpoint is not connected]
glustershd.log:[2019-03-13 06:22:36.385257] E [MSGID: 114031] [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-tier2-client-55: remote operation failed [Transport endpoint is not connected]

WARNINGS:
CWD: /var/log/glusterfs 
COMMAND: grep " W " *.log |grep "2019-03-13 06:"

(I can see a lot of this kind of message in the same period, but I'm reporting only one record for each type of warning)

glustershd.log:[2019-03-13 06:14:28.666772] W [MSGID: 114031] [client-rpc-fops.c:1080:client3_3_getxattr_cbk] 0-tier2-client-55: remote operation failed. Path: <gfid:b6b35d0f-f34d-4c25-bbe8-74bde0248d7e> (b6b35d0f-f34d-4c25-bbe8-74bde0248d7e). Key: (null) [Transport endpoint is not connected]

glustershd.log:[2019-03-13 06:14:31.421576] W [MSGID: 122035] [ec-common.c:571:ec_child_select] 0-tier2-disperse-9: Executing operation with some subvolumes unavailable (2)

glustershd.log:[2019-03-13 06:15:31.547417] W [MSGID: 122032] [ec-heald.c:266:ec_shd_index_sweep] 0-tier2-disperse-9: unable to get index-dir on tier2-client-55 [Operation now in progress]

quota-mount-tier2.log:[2019-03-13 06:12:36.116277] W [MSGID: 101002] [options.c:995:xl_opt_validate] 0-glusterfs: option 'address-family' is deprecated, preferred is 'transport.address-family', continuing with correction
quota-mount-tier2.log:[2019-03-13 06:12:36.198430] W [MSGID: 101174] [graph.c:363:_log_if_unknown_option] 0-tier2-readdir-ahead: option 'parallel-readdir' is not recognized
quota-mount-tier2.log:[2019-03-13 06:12:37.945007] W [glusterfsd.c:1375:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7e25) [0x7f340892be25] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x55ef010164b5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x55ef0101632b] ) 0-: received signum (15), shutting down

CRITICALS:
CWD: /var/log/glusterfs 
COMMAND: grep " C " *.log |grep "2019-03-13 06:"

No critical errors at 06:xx; only one critical error during the day:

[root@s06 glusterfs]# grep " C " *.log |grep "2019-03-13"
glustershd.log:[2019-03-13 02:21:29.126279] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-55: server 192.168.0.55:49158 has not responded in the last 42 seconds, disconnecting.


Thank you very much for your help.
Regards,
Mauro

On 12 Mar 2019, at 05:17, Raghavendra Gowdappa <[hidden email]> wrote:

Was the suggestion to increase the server.event-threads value tried? If yes, what were the results?

On Mon, Mar 11, 2019 at 2:40 PM Mauro Tridici <[hidden email]> wrote:
Dear All,

do you have any suggestions about the right way to "debug" this issue?
In attachment, the updated logs of the “s06” gluster server.

I noticed a lot of intermittent warning and error messages.

Thank you in advance,
Mauro




On Mon, 4 Mar 2019 at 15:25, Mauro Tridici <[hidden email]> wrote:
Hi All,

some minutes ago I received this message from NAGIOS server 

***** Nagios *****

Notification Type: PROBLEM

Service: Brick - /gluster/mnt2/brick
Host: s06
Address: s06-stg
State: CRITICAL

Date/Time: Mon Mar 4 10:25:33 CET 2019

Additional Info:
CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.

I checked the network, RAM and CPU usage on the s06 node and everything seems to be ok.
No bricks are in an error state. In /var/log/messages, I again detected a crash of “check_vol_utili”, which I think is a module used by the NRPE executable (that is, the NAGIOS client).

Mar  4 10:15:29 s06 kernel: traps: check_vol_utili[161224] general protection ip:7facffa0a66d sp:7ffe9f4e6fc0 error:0 in libglusterfs.so.0.0.1[7facff9b7000+f7000]
Mar  4 10:15:29 s06 abrt-hook-ccpp: Process 161224 (python2.7) of user 0 killed by SIGSEGV - dumping core
Mar  4 10:15:29 s06 abrt-server: Generating core_backtrace
Mar  4 10:15:29 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  4 10:16:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:16:01 s06 systemd: Starting User Slice of root.
Mar  4 10:16:01 s06 systemd: Started Session 201010 of user root.
Mar  4 10:16:01 s06 systemd: Starting Session 201010 of user root.
Mar  4 10:16:01 s06 systemd: Removed slice User Slice of root.
Mar  4 10:16:01 s06 systemd: Stopping User Slice of root.
Mar  4 10:16:24 s06 abrt-server: Duplicate: UUID
Mar  4 10:16:24 s06 abrt-server: DUP_OF_DIR: /var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041
Mar  4 10:16:24 s06 abrt-server: Deleting problem directory ccpp-2019-03-04-10:15:29-161224 (dup of ccpp-2018-09-25-12:27:42-13041)
Mar  4 10:16:24 s06 abrt-server: Generating core_backtrace
Mar  4 10:16:24 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  4 10:16:24 s06 abrt-server: Cannot notify '/var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041' via uReport: Event 'report_uReport' exited with 1
Mar  4 10:16:24 s06 abrt-hook-ccpp: Process 161391 (python2.7) of user 0 killed by SIGABRT - dumping core
Mar  4 10:16:25 s06 abrt-server: Generating core_backtrace
Mar  4 10:16:25 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  4 10:17:01 s06 systemd: Created slice User Slice of root.

Also, I noticed the following errors that I think are very critical:

Mar  4 10:21:12 s06 glustershd[20355]: [2019-03-04 09:21:12.954798] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-55: server 192.168.0.55:49158 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:22:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:22:01 s06 systemd: Starting User Slice of root.
Mar  4 10:22:01 s06 systemd: Started Session 201017 of user root.
Mar  4 10:22:01 s06 systemd: Starting Session 201017 of user root.
Mar  4 10:22:01 s06 systemd: Removed slice User Slice of root.
Mar  4 10:22:01 s06 systemd: Stopping User Slice of root.
Mar  4 10:22:03 s06 glustershd[20355]: [2019-03-04 09:22:03.964120] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-60: server 192.168.0.54:49165 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:23:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:23:01 s06 systemd: Starting User Slice of root.
Mar  4 10:23:01 s06 systemd: Started Session 201018 of user root.
Mar  4 10:23:01 s06 systemd: Starting Session 201018 of user root.
Mar  4 10:23:02 s06 systemd: Removed slice User Slice of root.
Mar  4 10:23:02 s06 systemd: Stopping User Slice of root.
Mar  4 10:24:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:24:01 s06 systemd: Starting User Slice of root.
Mar  4 10:24:01 s06 systemd: Started Session 201019 of user root.
Mar  4 10:24:01 s06 systemd: Starting Session 201019 of user root.
Mar  4 10:24:01 s06 systemd: Removed slice User Slice of root.
Mar  4 10:24:01 s06 systemd: Stopping User Slice of root.
Mar  4 10:24:03 s06 glustershd[20355]: [2019-03-04 09:24:03.982502] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-16: server 192.168.0.52:49158 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746109] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-3: server 192.168.0.51:49153 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746215] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-10: server 192.168.0.52:49156 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746260] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-21: server 192.168.0.51:49159 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746296] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-25: server 192.168.0.52:49161 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746413] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-60: server 192.168.0.54:49165 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:07 s06 glustershd[20355]: [2019-03-04 09:24:07.982952] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-45: server 192.168.0.54:49155 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:18 s06 glustershd[20355]: [2019-03-04 09:24:18.990929] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-25: server 192.168.0.52:49161 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:31 s06 glustershd[20355]: [2019-03-04 09:24:31.995781] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-20: server 192.168.0.53:49159 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:25:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:25:01 s06 systemd: Starting User Slice of root.
Mar  4 10:25:01 s06 systemd: Started Session 201020 of user root.
Mar  4 10:25:01 s06 systemd: Starting Session 201020 of user root.
Mar  4 10:25:01 s06 systemd: Removed slice User Slice of root.
Mar  4 10:25:01 s06 systemd: Stopping User Slice of root.
Mar  4 10:25:57 s06 systemd: Created slice User Slice of root.
Mar  4 10:25:57 s06 systemd: Starting User Slice of root.
Mar  4 10:25:57 s06 systemd-logind: New session 201021 of user root.
Mar  4 10:25:57 s06 systemd: Started Session 201021 of user root.
Mar  4 10:25:57 s06 systemd: Starting Session 201021 of user root.
Mar  4 10:26:01 s06 systemd: Started Session 201022 of user root.
Mar  4 10:26:01 s06 systemd: Starting Session 201022 of user root.
Mar  4 10:26:21 s06 nrpe[162388]: Error: Could not complete SSL handshake with 192.168.1.56: 5
Mar  4 10:27:01 s06 systemd: Started Session 201023 of user root.
Mar  4 10:27:01 s06 systemd: Starting Session 201023 of user root.
Mar  4 10:28:01 s06 systemd: Started Session 201024 of user root.
Mar  4 10:28:01 s06 systemd: Starting Session 201024 of user root.
Mar  4 10:29:01 s06 systemd: Started Session 201025 of user root.
Mar  4 10:29:01 s06 systemd: Starting Session 201025 of user root.

But, unfortunately, I don’t understand why this is happening.
Now, the NAGIOS server shows that the s06 status is ok:

***** Nagios *****

Notification Type: RECOVERY

Service: Brick - /gluster/mnt2/brick
Host: s06
Address: s06-stg
State: OK

Date/Time: Mon Mar 4 10:35:23 CET 2019

Additional Info:
OK: Brick /gluster/mnt2/brick is up

Nothing has changed from the RAM, CPU, and NETWORK point of view.
The /var/log/messages file has been updated:

Mar  4 10:32:01 s06 systemd: Starting Session 201029 of user root.
Mar  4 10:32:30 s06 glustershd[20355]: [2019-03-04 09:32:30.069082] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-10: server 192.168.0.52:49156 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:32:55 s06 glustershd[20355]: [2019-03-04 09:32:55.074689] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-66: server 192.168.0.54:49167 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:33:01 s06 systemd: Started Session 201030 of user root.
Mar  4 10:33:01 s06 systemd: Starting Session 201030 of user root.
Mar  4 10:34:01 s06 systemd: Started Session 201031 of user root.
Mar  4 10:34:01 s06 systemd: Starting Session 201031 of user root.
Mar  4 10:35:01 s06 nrpe[162562]: Could not read request from client 192.168.1.56, bailing out...
Mar  4 10:35:01 s06 nrpe[162562]: INFO: SSL Socket Shutdown.
Mar  4 10:35:01 s06 systemd: Started Session 201032 of user root.
Mar  4 10:35:01 s06 systemd: Starting Session 201032 of user root.

Could you please help me understand what is happening?
Thank you in advance.

Regards,
Mauro


On 1 Mar 2019, at 12:17, Mauro Tridici <[hidden email]> wrote:


Thank you, Milind.
I executed the instructions you suggested:

- grep “blocked for” /var/log/messages on s06 returns no output (the word “blocked” is not detected in the messages file);
- in the /var/log/messages file I can see this kind of error repeated many times:

Mar  1 08:43:01 s06 systemd: Starting Session 196071 of user root.
Mar  1 08:43:01 s06 systemd: Removed slice User Slice of root.
Mar  1 08:43:01 s06 systemd: Stopping User Slice of root.
Mar  1 08:43:02 s06 kernel: traps: check_vol_utili[57091] general protection ip:7f88e76ee66d sp:7ffe5a5bcc30 error:0 in libglusterfs.so.0.0.1[7f88e769b000+f7000]
Mar  1 08:43:02 s06 abrt-hook-ccpp: Process 57091 (python2.7) of user 0 killed by SIGSEGV - dumping core
Mar  1 08:43:02 s06 abrt-server: Generating core_backtrace
Mar  1 08:43:02 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  1 08:43:58 s06 abrt-server: Duplicate: UUID
Mar  1 08:43:58 s06 abrt-server: DUP_OF_DIR: /var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041
Mar  1 08:43:58 s06 abrt-server: Deleting problem directory ccpp-2019-03-01-08:43:02-57091 (dup of ccpp-2018-09-25-12:27:42-13041)
Mar  1 08:43:58 s06 dbus[1872]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Mar  1 08:43:58 s06 dbus[1872]: [system] Successfully activated service 'org.freedesktop.problems'
Mar  1 08:43:58 s06 abrt-server: Generating core_backtrace
Mar  1 08:43:58 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  1 08:43:58 s06 abrt-server: Cannot notify '/var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041' via uReport: Event 'report_uReport' exited with 1
Mar  1 08:44:01 s06 systemd: Created slice User Slice of root.
Mar  1 08:44:01 s06 systemd: Starting User Slice of root.
Mar  1 08:44:01 s06 systemd: Started Session 196072 of user root.
Mar  1 08:44:01 s06 systemd: Starting Session 196072 of user root.
Mar  1 08:44:01 s06 systemd: Removed slice User Slice of root.

- in the /var/log/messages file I can also see 4 errors related to the other cluster servers:

Mar  1 11:05:01 s06 systemd: Starting User Slice of root.
Mar  1 11:05:01 s06 systemd: Started Session 196230 of user root.
Mar  1 11:05:01 s06 systemd: Starting Session 196230 of user root.
Mar  1 11:05:01 s06 systemd: Removed slice User Slice of root.
Mar  1 11:05:01 s06 systemd: Stopping User Slice of root.
Mar  1 11:05:59 s06 glustershd[70117]: [2019-03-01 10:05:59.347094] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-33: server 192.168.0.51:49163 has not responded in the last 42 seconds, disconnecting.
Mar  1 11:06:01 s06 systemd: Created slice User Slice of root.
Mar  1 11:06:01 s06 systemd: Starting User Slice of root.
Mar  1 11:06:01 s06 systemd: Started Session 196231 of user root.
Mar  1 11:06:01 s06 systemd: Starting Session 196231 of user root.
Mar  1 11:06:01 s06 systemd: Removed slice User Slice of root.
Mar  1 11:06:01 s06 systemd: Stopping User Slice of root.
Mar  1 11:06:12 s06 glustershd[70117]: [2019-03-01 10:06:12.351319] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-1: server 192.168.0.52:49153 has not responded in the last 42 seconds, disconnecting.
Mar  1 11:06:38 s06 glustershd[70117]: [2019-03-01 10:06:38.356920] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-7: server 192.168.0.52:49155 has not responded in the last 42 seconds, disconnecting.
Mar  1 11:07:01 s06 systemd: Created slice User Slice of root.
Mar  1 11:07:01 s06 systemd: Starting User Slice of root.
Mar  1 11:07:01 s06 systemd: Started Session 196232 of user root.
Mar  1 11:07:01 s06 systemd: Starting Session 196232 of user root.
Mar  1 11:07:01 s06 systemd: Removed slice User Slice of root.
Mar  1 11:07:01 s06 systemd: Stopping User Slice of root.
Mar  1 11:07:36 s06 glustershd[70117]: [2019-03-01 10:07:36.366259] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-0: server 192.168.0.51:49152 has not responded in the last 42 seconds, disconnecting.
Mar  1 11:08:01 s06 systemd: Created slice User Slice of root.

The word “blocked” does not appear in the /var/log/messages files on the other cluster servers.
In attachment, the /var/log/messages file from the s06 server.

Thank you in advance,
Mauro

<messages.zip>


On 1 Mar 2019, at 11:47, Milind Changire <[hidden email]> wrote:

Traces of very high disk activity on the servers are often found in /var/log/messages.
You might want to grep for "blocked for" in /var/log/messages on s06 and correlate the timestamps to confirm the unresponsiveness reported in the gluster client logs.
In cases of high disk activity, although the operating system continues to respond to ICMP pings, the processes writing to disk often get blocked by a large flush to disk, which can take longer than 42 seconds and hence result in ping-timer-expiry logs.

As a side note:
If you do find gluster processes being blocked in /var/log/messages, you might want to tweak the sysctl tunables vm.dirty_background_ratio or vm.dirty_background_bytes to a smaller value than the existing one, as sketched below. Please read up on those tunables before touching the settings.
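
A sketch of what inspecting and lowering them could look like (the 256 MB figure is purely illustrative, not a recommendation):

# sysctl vm.dirty_background_ratio vm.dirty_background_bytes      (print the current values)
# sysctl -w vm.dirty_background_bytes=268435456                   (start background writeback after ~256 MB of dirty pages)

Note that the _ratio and _bytes variants are mutually exclusive (setting one resets the other to 0), and a value set with sysctl -w is lost on reboot unless persisted, e.g. in /etc/sysctl.conf.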



Re: [Gluster-users] "rpc_clnt_ping_timer_expired" errors

Raghavendra Gowdappa


On Thu, Mar 21, 2019 at 4:10 PM Mauro Tridici <[hidden email]> wrote:
Hi Raghavendra,

the number of errors has been reduced, but during the last few days I received some error notifications from the Nagios server similar to the following one:

***** Nagios *****

Notification Type: PROBLEM

Service: Brick - /gluster/mnt5/brick
Host: s04
Address: s04-stg
State: CRITICAL

Date/Time: Mon Mar 18 19:56:36 CET 2019

Additional Info:

CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.

The error was related only to s04 gluster server.

So, following your suggestions, I executed the top command on the s04 node.
In attachment, you can find the related output.

The top output doesn't contain cmd/thread names. Was there anything wrong?
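
In case it helps for the next run, top can be asked explicitly for thread names and a wider output line; a sketch (these flags exist in the procps-ng top shipped with CentOS 7, adjust as needed):

# top -bHd 3 -n 20 -c -w 256 > /tmp/top-s04.txt

Here -H lists individual threads, -c switches the COMMAND column to the full command line, -w widens the batch output so names are not truncated, and -n bounds the number of samples.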


Thank you very much for your help.
Regards,
Mauro



On 14 Mar 2019, at 13:31, Raghavendra Gowdappa <[hidden email]> wrote:

Thanks Mauro.

On Thu, Mar 14, 2019 at 3:38 PM Mauro Tridici <[hidden email]> wrote:
Hi Raghavendra,

I just changed the client option value to 8.
I will check the volume behaviour during the next hours.

The GlusterFS version is 3.12.14.

I will provide you the logs as soon as the activity load will be high.
Thank you,
Mauro

On 14 Mar 2019, at 04:57, Raghavendra Gowdappa <[hidden email]> wrote:



On Wed, Mar 13, 2019 at 3:55 PM Mauro Tridici <[hidden email]> wrote:
Hi Raghavendra,

Yes, server.event-thread has been changed from 4 to 8.

Was client.event-thread value too changed to 8? If not, I would like to know the results of including this tuning too. Also, if possible, can you get the output of following command from problematic clients and bricks (during the duration when load tends to be high and ping-timer-expiry is seen)?

# top -bHd 3
 
This will help us to know  CPU utilization of event-threads.

And I forgot to ask, what version of Glusterfs are you using?

During last days, I noticed that the error events are still here although they have been considerably reduced.

So, I used grep command against the log files in order to provide you a global vision about the warning, error and critical events appeared today at 06:xx (may be useful I hope).
I collected the info from s06 gluster server, but the behaviour is the the almost the same on the other gluster servers.

ERRORS:  
CWD: /var/log/glusterfs 
COMMAND: grep " E " *.log |grep "2019-03-13 06:"

(I can see a lot of this kind of message in the same period but I'm notifying you only one record for each type of error)

glusterd.log:[2019-03-13 06:12:35.982863] E [MSGID: 101042] [compat.c:569:gf_umount_lazy] 0-management: Lazy unmount of /var/run/gluster/tier2_quota_list/

glustershd.log:[2019-03-13 06:14:28.666562] E [rpc-clnt.c:350:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f4a71ddcebb] (--> /lib64/libgfr
pc.so.0(saved_frames_unwind+0x1de)[0x7f4a71ba1d9e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f4a71ba1ebe] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup
+0x90)[0x7f4a71ba3640] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7f4a71ba4130] ))))) 0-tier2-client-55: forced unwinding frame type(GlusterFS 3.3) op(INODELK(29)) 
called at 2019-03-13 06:14:14.858441 (xid=0x17fddb50) 

glustershd.log:[2019-03-13 06:17:48.883825] E [socket.c:2376:socket_connect_finish] 0-tier2-client-55: connection to 192.168.0.55:49158 failed (Connection timed out); disco
nnecting socket
glustershd.log:[2019-03-13 06:19:58.931798] E [socket.c:2376:socket_connect_finish] 0-tier2-client-55: connection to 192.168.0.55:49158 failed (Connection timed out); disco
nnecting socket
glustershd.log:[2019-03-13 06:22:08.979829] E [socket.c:2376:socket_connect_finish] 0-tier2-client-55: connection to 192.168.0.55:49158 failed (Connection timed out); disco
nnecting socket
glustershd.log:[2019-03-13 06:22:36.226847] E [MSGID: 114031] [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-tier2-client-55: remote operation failed [Transport endpoint 
is not connected]
glustershd.log:[2019-03-13 06:22:36.306669] E [MSGID: 114031] [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-tier2-client-55: remote operation failed [Transport endpoint 
is not connected]
glustershd.log:[2019-03-13 06:22:36.385257] E [MSGID: 114031] [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-tier2-client-55: remote operation failed [Transport endpoint 
is not connected]

WARNINGS:
CWD: /var/log/glusterfs 
COMMAND: grep " W " *.log |grep "2019-03-13 06:"

(I can see a lot of this kind of message in the same period but I'm notifying you only one record for each type of warnings)

glustershd.log:[2019-03-13 06:14:28.666772] W [MSGID: 114031] [client-rpc-fops.c:1080:client3_3_getxattr_cbk] 0-tier2-client-55: remote operation failed. Path: <gfid:b6b35d
0f-f34d-4c25-bbe8-74bde0248d7e> (b6b35d0f-f34d-4c25-bbe8-74bde0248d7e). Key: (null) [Transport endpoint is not connected]

glustershd.log:[2019-03-13 06:14:31.421576] W [MSGID: 122035] [ec-common.c:571:ec_child_select] 0-tier2-disperse-9: Executing operation with some subvolumes unavailable (2)

glustershd.log:[2019-03-13 06:15:31.547417] W [MSGID: 122032] [ec-heald.c:266:ec_shd_index_sweep] 0-tier2-disperse-9: unable to get index-dir on tier2-client-55 [Operation 
now in progress]

quota-mount-tier2.log:[2019-03-13 06:12:36.116277] W [MSGID: 101002] [options.c:995:xl_opt_validate] 0-glusterfs: option 'address-family' is deprecated, preferred is 'trans
port.address-family', continuing with correction
quota-mount-tier2.log:[2019-03-13 06:12:36.198430] W [MSGID: 101174] [graph.c:363:_log_if_unknown_option] 0-tier2-readdir-ahead: option 'parallel-readdir' is not recognized
quota-mount-tier2.log:[2019-03-13 06:12:37.945007] W [glusterfsd.c:1375:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7e25) [0x7f340892be25] -->/usr/sbin/glusterfs(gluste
rfs_sigwaiter+0xe5) [0x55ef010164b5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x55ef0101632b] ) 0-: received signum (15), shutting down

CRITICALS:
CWD: /var/log/glusterfs 
COMMAND: grep " C " *.log |grep "2019-03-13 06:"

no critical errors at 06:xx
only one critical error during the day

[root@s06 glusterfs]# grep " C " *.log |grep "2019-03-13"
glustershd.log:[2019-03-13 02:21:29.126279] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-55: server 192.168.0.55:49158 has not responded in the last 42 seconds, disconnecting.


Thank you very much for your help.
Regards,
Mauro

On 12 Mar 2019, at 05:17, Raghavendra Gowdappa <[hidden email]> wrote:

Was the suggestion to increase server.event-thread values tried? If yes, what were the results?

On Mon, Mar 11, 2019 at 2:40 PM Mauro Tridici <[hidden email]> wrote:
Dear All,

do you have any suggestions about the right way to "debug" this issue?
In attachment, the updated logs of “s06" gluster server.

I noticed a lot of intermittent warning and error messages.

Thank you in advance,
Mauro



On 4 Mar 2019, at 18:45, Raghavendra Gowdappa <[hidden email]> wrote:


I would like to point out another issue. Even if what I suggested prevents disconnects, part of the solution would be only symptomatic treatment and doesn't address the root cause of the problem. In most of the ping-timer-expiry issues, the root cause is the increased load on bricks and the inability of bricks to be responsive under high load. So, the actual solution would be doing any or both of the following:
* identify the source of increased load and if possible throttle it. Internal heal processes like self-heal, rebalance, quota heal are known to pump traffic into bricks without much throttling (io-threads _might_ do some throttling, but my understanding is its not sufficient).
* identify the reason for bricks to become unresponsive during load. This may be fixable issues like not enough event-threads to read from network or difficult to fix issues like fsync on backend fs freezing the process or semi fixable issues (in code) like lock contention.

So any genuine effort to fix ping-timer-issues (to be honest most of the times they are not issues related to rpc/network) would involve performance characterization of various subsystems on bricks and clients. Various subsystems can include (but not necessarily limited to), underlying OS/filesystem, glusterfs processes, CPU consumption etc

regards,
Raghavendra

On Mon, Mar 4, 2019 at 9:31 PM Mauro Tridici <[hidden email]> wrote:
Thank you, let’s try!
I will inform you about the effects of the change.

Regards,
Mauro

On 4 Mar 2019, at 16:55, Raghavendra Gowdappa <[hidden email]> wrote:



On Mon, Mar 4, 2019 at 8:54 PM Mauro Tridici <[hidden email]> wrote:
Hi Raghavendra,

thank you for your reply.
Yes, you are right. It is a problem that seems to happen randomly.
At this moment, server.event-threads value is 4. I will try to increase this value to 8. Do you think that it could be a valid value ?

Yes. We can try with that. You should see at least frequency of ping-timer related disconnects  reduce with this value (even if it doesn't eliminate the problem completely).


Regards,
Mauro


On 4 Mar 2019, at 15:36, Raghavendra Gowdappa <[hidden email]> wrote:



On Mon, Mar 4, 2019 at 8:01 PM Nithya Balachandran <[hidden email]> wrote:
Hi Mauro,

It looks like some problem on s06. Are all your other nodes ok? Can you send us the gluster logs from this node?

[hidden email] , do you have any idea as to how this can be debugged? Maybe running top ? Or debug brick logs?

If we can reproduce the problem, collecting tcpdump on both ends of connection will help. But, one common problem is these bugs are inconsistently reproducible and hence we may not be able to capture tcpdump at correct intervals. Other than that, we can try to collect some evidence that poller threads were busy (waiting on locks). But, not sure what debug data provides that information.

From what I know, its difficult to collect evidence for this issue and we could only reason about it.

We can try a workaround though - try increasing server.event-threads and see whether ping-timer expiry issues go away with an optimal value. If that's the case, it kind of provides proof for our hypothesis.



Regards,
Nithya

On Mon, 4 Mar 2019 at 15:25, Mauro Tridici <[hidden email]> wrote:
Hi All,

some minutes ago I received this message from NAGIOS server 

***** Nagios *****

Notification Type: PROBLEM

Service: Brick - /gluster/mnt2/brick
Host: s06
Address: s06-stg
State: CRITICAL

Date/Time: Mon Mar 4 10:25:33 CET 2019

Additional Info:
CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.

I checked the network, RAM and CPUs usage on s06 node and everything seems to be ok.
No bricks are in error state. In /var/log/messages, I detected again a crash of “check_vol_utili” that I think it is a module used by NRPE executable (that is the NAGIOS client).

Mar  4 10:15:29 s06 kernel: traps: check_vol_utili[161224] general protection ip:7facffa0a66d sp:7ffe9f4e6fc0 error:0 in libglusterfs.so.0.0.1[7facff9b7000+f7000]
Mar  4 10:15:29 s06 abrt-hook-ccpp: Process 161224 (python2.7) of user 0 killed by SIGSEGV - dumping core
Mar  4 10:15:29 s06 abrt-server: Generating core_backtrace
Mar  4 10:15:29 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  4 10:16:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:16:01 s06 systemd: Starting User Slice of root.
Mar  4 10:16:01 s06 systemd: Started Session 201010 of user root.
Mar  4 10:16:01 s06 systemd: Starting Session 201010 of user root.
Mar  4 10:16:01 s06 systemd: Removed slice User Slice of root.
Mar  4 10:16:01 s06 systemd: Stopping User Slice of root.
Mar  4 10:16:24 s06 abrt-server: Duplicate: UUID
Mar  4 10:16:24 s06 abrt-server: DUP_OF_DIR: /var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041
Mar  4 10:16:24 s06 abrt-server: Deleting problem directory ccpp-2019-03-04-10:15:29-161224 (dup of ccpp-2018-09-25-12:27:42-13041)
Mar  4 10:16:24 s06 abrt-server: Generating core_backtrace
Mar  4 10:16:24 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  4 10:16:24 s06 abrt-server: Cannot notify '/var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041' via uReport: Event 'report_uReport' exited with 1
Mar  4 10:16:24 s06 abrt-hook-ccpp: Process 161391 (python2.7) of user 0 killed by SIGABRT - dumping core
Mar  4 10:16:25 s06 abrt-server: Generating core_backtrace
Mar  4 10:16:25 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  4 10:17:01 s06 systemd: Created slice User Slice of root.

Also, I noticed the following errors that I think are very critical:

Mar  4 10:21:12 s06 glustershd[20355]: [2019-03-04 09:21:12.954798] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-55: server 192.168.0.55:49158 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:22:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:22:01 s06 systemd: Starting User Slice of root.
Mar  4 10:22:01 s06 systemd: Started Session 201017 of user root.
Mar  4 10:22:01 s06 systemd: Starting Session 201017 of user root.
Mar  4 10:22:01 s06 systemd: Removed slice User Slice of root.
Mar  4 10:22:01 s06 systemd: Stopping User Slice of root.
Mar  4 10:22:03 s06 glustershd[20355]: [2019-03-04 09:22:03.964120] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-60: server 192.168.0.54:49165 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:23:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:23:01 s06 systemd: Starting User Slice of root.
Mar  4 10:23:01 s06 systemd: Started Session 201018 of user root.
Mar  4 10:23:01 s06 systemd: Starting Session 201018 of user root.
Mar  4 10:23:02 s06 systemd: Removed slice User Slice of root.
Mar  4 10:23:02 s06 systemd: Stopping User Slice of root.
Mar  4 10:24:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:24:01 s06 systemd: Starting User Slice of root.
Mar  4 10:24:01 s06 systemd: Started Session 201019 of user root.
Mar  4 10:24:01 s06 systemd: Starting Session 201019 of user root.
Mar  4 10:24:01 s06 systemd: Removed slice User Slice of root.
Mar  4 10:24:01 s06 systemd: Stopping User Slice of root.
Mar  4 10:24:03 s06 glustershd[20355]: [2019-03-04 09:24:03.982502] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-16: server 192.168.0.52:49158 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746109] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-3: server 192.168.0.51:49153 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746215] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-10: server 192.168.0.52:49156 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746260] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-21: server 192.168.0.51:49159 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746296] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-25: server 192.168.0.52:49161 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746413] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-60: server 192.168.0.54:49165 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:07 s06 glustershd[20355]: [2019-03-04 09:24:07.982952] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-45: server 192.168.0.54:49155 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:18 s06 glustershd[20355]: [2019-03-04 09:24:18.990929] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-25: server 192.168.0.52:49161 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:24:31 s06 glustershd[20355]: [2019-03-04 09:24:31.995781] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-20: server 192.168.0.53:49159 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:25:01 s06 systemd: Created slice User Slice of root.
Mar  4 10:25:01 s06 systemd: Starting User Slice of root.
Mar  4 10:25:01 s06 systemd: Started Session 201020 of user root.
Mar  4 10:25:01 s06 systemd: Starting Session 201020 of user root.
Mar  4 10:25:01 s06 systemd: Removed slice User Slice of root.
Mar  4 10:25:01 s06 systemd: Stopping User Slice of root.
Mar  4 10:25:57 s06 systemd: Created slice User Slice of root.
Mar  4 10:25:57 s06 systemd: Starting User Slice of root.
Mar  4 10:25:57 s06 systemd-logind: New session 201021 of user root.
Mar  4 10:25:57 s06 systemd: Started Session 201021 of user root.
Mar  4 10:25:57 s06 systemd: Starting Session 201021 of user root.
Mar  4 10:26:01 s06 systemd: Started Session 201022 of user root.
Mar  4 10:26:01 s06 systemd: Starting Session 201022 of user root.
Mar  4 10:26:21 s06 nrpe[162388]: Error: Could not complete SSL handshake with 192.168.1.56: 5
Mar  4 10:27:01 s06 systemd: Started Session 201023 of user root.
Mar  4 10:27:01 s06 systemd: Starting Session 201023 of user root.
Mar  4 10:28:01 s06 systemd: Started Session 201024 of user root.
Mar  4 10:28:01 s06 systemd: Starting Session 201024 of user root.
Mar  4 10:29:01 s06 systemd: Started Session 201025 of user root.
Mar  4 10:29:01 s06 systemd: Starting Session 201025 of user root.

But, unfortunately, I don’t understand why it is happening.
Now, NAGIOS server shows that s06 status is ok:

***** Nagios *****

Notification Type: RECOVERY

Service: Brick - /gluster/mnt2/brick
Host: s06
Address: s06-stg
State: OK

Date/Time: Mon Mar 4 10:35:23 CET 2019

Additional Info:
OK: Brick /gluster/mnt2/brick is up

Nothing is changed from RAM, CPUs, and NETWORK point of view.
/var/log/message file has been updated:

Mar  4 10:32:01 s06 systemd: Starting Session 201029 of user root.
Mar  4 10:32:30 s06 glustershd[20355]: [2019-03-04 09:32:30.069082] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-10: server 192.168.0.52:49156 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:32:55 s06 glustershd[20355]: [2019-03-04 09:32:55.074689] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-66: server 192.168.0.54:49167 has not responded in the last 42 seconds, disconnecting.
Mar  4 10:33:01 s06 systemd: Started Session 201030 of user root.
Mar  4 10:33:01 s06 systemd: Starting Session 201030 of user root.
Mar  4 10:34:01 s06 systemd: Started Session 201031 of user root.
Mar  4 10:34:01 s06 systemd: Starting Session 201031 of user root.
Mar  4 10:35:01 s06 nrpe[162562]: Could not read request from client 192.168.1.56, bailing out...
Mar  4 10:35:01 s06 nrpe[162562]: INFO: SSL Socket Shutdown.
Mar  4 10:35:01 s06 systemd: Started Session 201032 of user root.
Mar  4 10:35:01 s06 systemd: Starting Session 201032 of user root.

Could you please help me to understand what it’s happening ?
Thank you in advance.

Rergards,
Mauro


On 1 Mar 2019, at 12:17, Mauro Tridici <[hidden email]> wrote:


Thank you, Milind.
I executed the instructions you suggested:

- grepping “blocked for” in /var/log/messages on s06 returns no output (the word “blocked” does not appear in the messages file);
- in the /var/log/messages file I can see this kind of error repeated many times:

Mar  1 08:43:01 s06 systemd: Starting Session 196071 of user root.
Mar  1 08:43:01 s06 systemd: Removed slice User Slice of root.
Mar  1 08:43:01 s06 systemd: Stopping User Slice of root.
Mar  1 08:43:02 s06 kernel: traps: check_vol_utili[57091] general protection ip:7f88e76ee66d sp:7ffe5a5bcc30 error:0 in libglusterfs.so.0.0.1[7f88e769b000+f7000]
Mar  1 08:43:02 s06 abrt-hook-ccpp: Process 57091 (python2.7) of user 0 killed by SIGSEGV - dumping core
Mar  1 08:43:02 s06 abrt-server: Generating core_backtrace
Mar  1 08:43:02 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  1 08:43:58 s06 abrt-server: Duplicate: UUID
Mar  1 08:43:58 s06 abrt-server: DUP_OF_DIR: /var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041
Mar  1 08:43:58 s06 abrt-server: Deleting problem directory ccpp-2019-03-01-08:43:02-57091 (dup of ccpp-2018-09-25-12:27:42-13041)
Mar  1 08:43:58 s06 dbus[1872]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Mar  1 08:43:58 s06 dbus[1872]: [system] Successfully activated service 'org.freedesktop.problems'
Mar  1 08:43:58 s06 abrt-server: Generating core_backtrace
Mar  1 08:43:58 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
Mar  1 08:43:58 s06 abrt-server: Cannot notify '/var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041' via uReport: Event 'report_uReport' exited with 1
Mar  1 08:44:01 s06 systemd: Created slice User Slice of root.
Mar  1 08:44:01 s06 systemd: Starting User Slice of root.
Mar  1 08:44:01 s06 systemd: Started Session 196072 of user root.
Mar  1 08:44:01 s06 systemd: Starting Session 196072 of user root.
Mar  1 08:44:01 s06 systemd: Removed slice User Slice of root.

- in the /var/log/messages file I can also see 4 errors related to other cluster servers:

Mar  1 11:05:01 s06 systemd: Starting User Slice of root.
Mar  1 11:05:01 s06 systemd: Started Session 196230 of user root.
Mar  1 11:05:01 s06 systemd: Starting Session 196230 of user root.
Mar  1 11:05:01 s06 systemd: Removed slice User Slice of root.
Mar  1 11:05:01 s06 systemd: Stopping User Slice of root.
Mar  1 11:05:59 s06 glustershd[70117]: [2019-03-01 10:05:59.347094] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-33: server 192.168.0.51:49163 has not responded in the last 42 seconds, disconnecting.
Mar  1 11:06:01 s06 systemd: Created slice User Slice of root.
Mar  1 11:06:01 s06 systemd: Starting User Slice of root.
Mar  1 11:06:01 s06 systemd: Started Session 196231 of user root.
Mar  1 11:06:01 s06 systemd: Starting Session 196231 of user root.
Mar  1 11:06:01 s06 systemd: Removed slice User Slice of root.
Mar  1 11:06:01 s06 systemd: Stopping User Slice of root.
Mar  1 11:06:12 s06 glustershd[70117]: [2019-03-01 10:06:12.351319] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-1: server 192.168.0.52:49153 has not responded in the last 42 seconds, disconnecting.
Mar  1 11:06:38 s06 glustershd[70117]: [2019-03-01 10:06:38.356920] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-7: server 192.168.0.52:49155 has not responded in the last 42 seconds, disconnecting.
Mar  1 11:07:01 s06 systemd: Created slice User Slice of root.
Mar  1 11:07:01 s06 systemd: Starting User Slice of root.
Mar  1 11:07:01 s06 systemd: Started Session 196232 of user root.
Mar  1 11:07:01 s06 systemd: Starting Session 196232 of user root.
Mar  1 11:07:01 s06 systemd: Removed slice User Slice of root.
Mar  1 11:07:01 s06 systemd: Stopping User Slice of root.
Mar  1 11:07:36 s06 glustershd[70117]: [2019-03-01 10:07:36.366259] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-0: server 192.168.0.51:49152 has not responded in the last 42 seconds, disconnecting.
Mar  1 11:08:01 s06 systemd: Created slice User Slice of root.

The word “blocked” does not appear in the /var/log/messages files on the other cluster servers.
Attached is the /var/log/messages file from the s06 server.

Thank you in advance,
Mauro

<messages.zip>


On 1 Mar 2019, at 11:47, Milind Changire <[hidden email]> wrote:

The traces of very high disk activity on the servers are often found in /var/log/messages
You might want to grep for "blocked for" in /var/log/messages on s06 and correlate the timestamps to confirm the unresponsiveness as reported in gluster client logs.
In cases of high disk activity, although the operating system continues to respond to ICMP pings, the processes writing to disk often get blocked on a large flush to disk, which can take longer than 42 seconds and hence result in ping-timer-expiry logs.
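
For reference, a minimal check along these lines (the sample match below only illustrates the kernel's hung-task message format; the task name and PID are made up, not taken from your logs):

# search for hung-task warnings, with their timestamps
grep "blocked for" /var/log/messages
# a match, if any, would look roughly like:
# Mar  4 10:24:10 s06 kernel: INFO: task glusterfsd:12345 blocked for more than 120 seconds.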

As a side note:
If you indeed find gluster processes being blocked in /var/log/messages, you might want to tweak the sysctl tunables vm.dirty_background_ratio or vm.dirty_background_bytes to a smaller value than the existing one. Please read up on those tunables before touching the settings.
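
As a rough sketch only - the numbers below are illustrative, not recommendations, and the config file name is just an example:

# check the current values
sysctl vm.dirty_background_ratio vm.dirty_ratio
# start background writeback earlier, e.g. at 5% of RAM instead of the usual 10%
sysctl -w vm.dirty_background_ratio=5
# or cap it in absolute bytes instead (setting one of the two zeroes the other)
# sysctl -w vm.dirty_background_bytes=268435456
# persist the choice across reboots
echo "vm.dirty_background_ratio = 5" >> /etc/sysctl.d/99-writeback.conf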


On Fri, Mar 1, 2019 at 4:06 PM Mauro Tridici <[hidden email]> wrote:

Hi all,

Attached is the client log captured after changing the network.ping-timeout option.
I noticed this error involving server 192.168.0.56 (s06):

[2019-03-01 09:23:36.077287] I [rpc-clnt.c:1962:rpc_clnt_reconfig] 0-tier2-client-71: changing ping timeout to 42 (from 0)
[2019-03-01 09:23:36.078213] I [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
[2019-03-01 09:23:36.078432] I [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
[2019-03-01 09:23:36.092357] I [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
[2019-03-01 09:23:36.094146] I [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
[2019-03-01 10:06:24.708082] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-50: server 192.168.0.56:49156 has not responded in the last 42 seconds, disconnecting.

I don’t know why it happens; the s06 server seems to be reachable.

[athena_login2][/users/home/sysm02/]> telnet 192.168.0.56 49156
Trying 192.168.0.56...
Connected to 192.168.0.56.
Escape character is '^]'.
^CConnection closed by foreign host.
[athena_login2][/users/home/sysm02/]> ping 192.168.0.56
PING 192.168.0.56 (192.168.0.56) 56(84) bytes of data.
64 bytes from 192.168.0.56: icmp_seq=1 ttl=64 time=0.116 ms
64 bytes from 192.168.0.56: icmp_seq=2 ttl=64 time=0.101 ms

--- 192.168.0.56 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1528ms
rtt min/avg/max/mdev = 0.101/0.108/0.116/0.012 ms

[athena_login2][/users/home/sysm02/]> telnet 192.168.0.56 49156
Trying 192.168.0.56...
Connected to 192.168.0.56.
Escape character is '^]'.

Thank you for your help,
Mauro



On 1 Mar 2019, at 10:29, Mauro Tridici <[hidden email]> wrote:

Hi all,

thank you for the explanation.
I just changed the network.ping-timeout option to the default value (network.ping-timeout=42).

I will check the logs to see whether the errors appear again.

Regards,
Mauro

On 1 Mar 2019, at 04:43, Milind Changire <[hidden email]> wrote:

network.ping-timeout should not be set to zero for non-glusterd clients.
glusterd is a special case for which ping-timeout is set to zero via /etc/glusterfs/glusterd.vol

Setting network.ping-timeout to zero disables arming of the ping timer for connections. This disables testing the connection for responsiveness and hence avoids proactive fail-over.

Please reset network.ping-timeout to a non-zero positive value, e.g. 42.
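
For example, assuming the volume is named tier2 (as the client logs in this thread suggest):

# check the current value
gluster volume get tier2 network.ping-timeout
# reset it to the default
gluster volume set tier2 network.ping-timeout 42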


On Thu, Feb 28, 2019 at 5:07 PM Nithya Balachandran <[hidden email]> wrote:
Adding Raghavendra and Milind to comment on this.

What is the effect of setting network.ping-timeout to 0 and should it be set back to 42?
Regards,
Nithya

On Thu, 28 Feb 2019 at 16:01, Mauro Tridici <[hidden email]> wrote:
Hi Nithya,

sorry for the late reply.
network.ping-timeout has been set to 0 in order to try to solve some timeout problems, but it didn’t help.
I can set it to the default value. 

Can I proceed with the change?

Thank you,
Mauro


On 28 Feb 2019, at 04:41, Nithya Balachandran <[hidden email]> wrote:

Hi Mauro,

Is network.ping-timeout still set to 0? The default value is 42. Is there a particular reason why this was changed?

Regards,
Nithya


On Wed, 27 Feb 2019 at 21:32, Mauro Tridici <[hidden email]> wrote:

Hi Xavi,

thank you for the detailed explanation and suggestions.
Yes, transport.listen-backlog option is still set to 1024.

I will check the network and connectivity status using “ping” and “telnet” as soon as the errors come back again.

Regards,
Mauro

On 27 Feb 2019, at 16:42, Xavi Hernandez <[hidden email]> wrote:

Hi Mauro,

those errors say that the mount point is not connected to some of the bricks while executing operations. I see references to the 3rd and 6th bricks of several disperse sets, which seem to map to server s06. For some reason, gluster is having trouble connecting from the client machine to that particular server. At the end of the log I see that, after a long time, a reconnect is made to both of them. However, shortly after, other bricks from s05 get disconnected and a reconnect times out.

That's really odd. It seems as if the server/communication is cut to s06 for some time, then restored, and then the same happens to the next server.

If the servers are really online and it's only a communication issue, it explains why server memory and network usage have increased: if the problem only exists between the client and the servers, any write made by the client will automatically mark the file as damaged, since some of the servers have not been updated. Since self-heal runs from the server nodes, they will probably be correctly connected to all bricks, which allows them to heal the just-damaged file, and this increases memory and network usage.
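
If useful, the amount of pending self-heal work can be checked with something like this (assuming the volume is named tier2, as in the logs):

# list entries still needing heal, per brick
gluster volume heal tier2 info
# shorter summary form, where your gluster version supports it:
gluster volume heal tier2 info summary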

I guess you still have transport.listen-backlog set to 1024, right?

Just to try to identify whether the problem really comes from the network, can you check if you lose some pings from the client to all of the servers while you are seeing those errors in the log file?

You can also check whether, during those errors, you can telnet to the port of the brick from the client.
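
Something along these lines could collect that evidence while the errors occur (the IP list is taken from the servers seen in this thread; adjust as needed):

# from the client, log timestamped pings to every server
for ip in 192.168.0.51 192.168.0.52 192.168.0.53 192.168.0.54 192.168.0.55 192.168.0.56; do
  while true; do date; ping -c 1 -W 2 $ip; sleep 5; done >> /tmp/ping_$ip.log 2>&1 &
done
# stop the loops later with: kill $(jobs -p)
# and probe a brick port directly, e.g.
telnet 192.168.0.56 49156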

Xavi



On Tue, Feb 26, 2019 at 10:17 AM Mauro Tridici <[hidden email]> wrote:
Hi Nithya,

“df -h” is not slow now, but no users are using the volume; RAM and network usage is OK on the client node.

I was worried about this kind of warnings/errors:

[2019-02-25 10:59:00.664323] W [MSGID: 122035] [ec-common.c:571:ec_child_select] 0-tier2-disperse-6: Executing operation with some subvolumes unavailable (20)

[2019-02-26 03:11:35.212603] E [rpc-clnt.c:350:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] ))))) 0-tier2-client-50: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2019-02-26 03:10:56.549903 (xid=0x106f1c5)

[2019-02-26 03:13:03.313831] E [socket.c:2376:socket_connect_finish] 0-tier2-client-50: connection to 192.168.0.56:49156 failed (Timeout della connessione); disconnecting socket

It seems that some subvolumes are not available and the 192.168.0.56 server (s06) is not reachable.
But the gluster servers are up and running and the bricks are OK.

Attached is the updated tier2.log file.


Thank you.
Regards,
Mauro

On 26 Feb 2019, at 04:03, Nithya Balachandran <[hidden email]> wrote:

Hi,

I see a lot of EC messages in the log but they don't seem very serious. Xavi, can you take a look?

The only errors I see are:
[2019-02-25 10:58:45.519871] E [rpc-clnt.c:350:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] ))))) 0-tier2-client-50: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2019-02-25 10:57:47.429969 (xid=0xd26fe7)
[2019-02-25 10:58:51.461493] E [rpc-clnt.c:350:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] ))))) 0-tier2-client-41: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2019-02-25 10:57:47.499174 (xid=0xf47d6a)
[2019-02-25 11:07:57.152874] E [socket.c:2376:socket_connect_finish] 0-tier2-client-70: connection to 192.168.0.55:49163 failed (Timeout della connessione); disconnecting socket


Is the df -h operation still slow? If yes, can you take a tcpdump of the client while running df -h and send that across?
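
For example (the interface name and output path below are placeholders):

# on the client, capture full packets to/from the gluster servers into a file
tcpdump -i eth0 -s 0 -w /tmp/df-h.pcap net 192.168.0.0/24
# then, in another terminal, run the slow command
df -h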

Regards,
Nithya

On Mon, 25 Feb 2019 at 17:27, Mauro Tridici <[hidden email]> wrote:

Sorry, some minutes after my last mail message, I noticed that the “df -h” command hung for a while before returning the prompt.
Yesterday, everything was OK in the gluster client log, but, today, I see a lot of errors (please take a look at the attached file).

On the client node, I detected significant RAM and network usage.

Do you think that the errors have been caused by the client's resource usage?

Thank you in advance,
Mauro

Re: [Gluster-users] "rpc_clnt_ping_timer_expired" errors

Raghavendra Gowdappa
In reply to this post by Raghavendra Gowdappa
Adding gluster-users and gluster-devel, as the discussion has some generic points.


On Mon, Mar 4, 2019 at 11:43 PM Raghavendra Gowdappa <[hidden email]> wrote:


On Mon, Mar 4, 2019 at 11:26 PM Yaniv Kaul <[hidden email]> wrote:
Is it so busy that it cannot reply to a simple ping for so many seconds? Is it that starved of CPU resources/threads, or are locks being held?

While I had answered with some specific examples, my previous answer missed some general observations.
The ping timer is designed with the following goals:

1. identifying ungraceful shutdown of bricks (hard power shutdown, pulling of the n/w cable, etc.)
2. identifying slow bricks and avoiding traffic to them (if the connection is shut down, clients will not use it till reconnection happens)

When we consider point 2, the ping packet is not a special packet. Instead it experiences the same delays as any other regular fs traffic. So, the question asked above doesn't apply: since ping is not treated "specially" wrt fs traffic, a brick that is overloaded processing fs traffic is too overloaded to respond to a "simple ping" as well.

While the implementation/design neatly caters to the first goal, the second goal is not completely met and is only partly implemented/designed. To fulfill goal 2 completely, I can think of at least two things that would need to be done:

1. the reconnect logic in the client should make an intelligent decision about whether to reconnect to an overloaded brick, and I guess identifying whether a brick is no longer overloaded is not a trivial task
2. As a "fix" to ping-timer expiries, we introduced a "request queue" to the glusterfs program. Event threads, which read requests from the network, only queue them to this "request queue"; processing happens in different threads. This (at least in theory) lets event threads read and respond to ping packets faster. However, it means ping responses are no longer indicative of the "load" on a brick. With hindsight, this fix looks to me more symptomatic (users no longer see ENOTCONN errors - which pushes the responsibility out of the module handling "ping" :) - but they might experience slow performance) than a fix for the underlying root cause of what made bricks unresponsive in the first place. For example, Manoj suggested that suboptimal event-threads might be one reason, and in fact that was a correct observation (at least for some setups).


It looks to be the case. I remember the following scenarios from various users/customers:

* One customer implemented the suggestion of increasing the number of event-threads (which read msgs from the network) and reported back that it fixed their ping-timer-expiry issues (see the command sketch after this list).
* In one case, Pranith and Krutika identified - through strace output - that an xattrop (basically a getxattr and a setxattr done atomically while holding a global inode lock) took more than 42 seconds (the default ping timeout value), and ping-timer expiries were correlated at around the same time. In this case, event threads were contending on the same lock and hence were blocked from reading further messages from the network (which included ping requests). This issue (of global critical sections) is fixed currently.
* Preventing event threads from executing any code from the glusterfs brick stack seems to have solved the issue. This code is present in master and is targeted for 3.4.4.
* I've heard reports saying that when rebalance/quota heal is in progress, ping-timer expiries are more likely to occur.
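
A minimal sketch of that workaround, assuming the volume is named tier2 as elsewhere in this thread (8 is just the value being tried here, not a universally correct number):

# check the current settings
gluster volume get tier2 server.event-threads
gluster volume get tier2 client.event-threads
# raise the number of threads polling the network on the bricks
gluster volume set tier2 server.event-threads 8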


_______________________________________________

Community Meeting Calendar:

APAC Schedule -
Every 2nd and 4th Tuesday at 11:30 AM IST
Bridge: https://bluejeans.com/441850968


NA/EMEA Schedule -
Every 1st and 3rd Tuesday at 01:00 PM EDT
Bridge: https://bluejeans.com/441850968

Gluster-devel mailing list
[hidden email]
https://lists.gluster.org/mailman/listinfo/gluster-devel