[nmh-workers] strace show # pauses/takes 5+ sec | network issue? ( not nmh )

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
8 messages Options
nmh
Reply | Threaded
Open this post in threaded view
|

[nmh-workers] strace show # pauses/takes 5+ sec | network issue? ( not nmh )

nmh
I do not think this is a nmh issue, but I could use help troubleshooting it.

$ mhmail -version
mhmail -- nmh-1.7.1 built 2018-05-19 10:39:15 +0000 on epjdn.zq3q.org

I show where the strace output paused for 5+ seconds in the attachments.

So what should I be asking my vps provider?  Is this some network issue?
This happened on a fedora 30 host, and a centos 7 host (vps hosts on same network).

Fortunately or unfortunately the problem just went away without any action on my part --
started on Sat night, and fixed itself about an hour(?) ago.

--
thanks/thanks for working on nmh,
Tom
Milwaukee WI

--
nmh-workers
https://lists.nongnu.org/mailman/listinfo/nmh-workers

nmh-scan-slow00 (13K) Download Attachment
nmh-show-slow00 (17K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: strace show # pauses/takes 5+ sec | network issue? ( not nmh )

Ken Hornstein-2
>So what should I be asking my vps provider?  Is this some network issue?

Almost certainly "yes".  If I had to guess I would think it was a DNS
timeout on something somewhere.

But unfortunately these system call traces are not very helpful.  This
is because you are tracing the command "command", and all that shows is
you execute a nmh command and it waits 5 seconds for it to complete.

If this happens in the future you should probably use the -f option to
trace all children of the specified process/command, and the -t option
to print the system call timestamps (that would let you see exactly
where the delay is).  I also like using -o to direct the trace output to
a file rather than having it mixed in with the output of a process.

--Ken

--
nmh-workers
https://lists.nongnu.org/mailman/listinfo/nmh-workers
Reply | Threaded
Open this post in threaded view
|

Re: strace show # pauses/takes 5+ sec | network issue? ( not nmh )

Bakul Shah
In reply to this post by nmh
On Sun, 08 Sep 2019 14:57:09 -0500 [hidden email] wrote:

>
> I do not think this is a nmh issue, but I could use help troubleshooting it.
>
> $ mhmail -version
> mhmail -- nmh-1.7.1 built 2018-05-19 10:39:15 +0000 on epjdn.zq3q.org
>
> I show where the strace output paused for 5+ seconds in the attachments.
>
> So what should I be asking my vps provider?  Is this some network issue?
> This happened on a fedora 30 host, and a centos 7 host (vps hosts on same net
> work).
>
> Fortunately or unfortunately the problem just went away without any action on
>  my part --

In addition to -f as Ken suggested, you can specify the -r
flag to strace to know which syscalls take an unusually long
time.  This can then lead you to check more things such as
network traffic etc. Though it is strange if "show" triggers
any network traffic.

--
nmh-workers
https://lists.nongnu.org/mailman/listinfo/nmh-workers
nmh
Reply | Threaded
Open this post in threaded view
|

Re: strace show # pauses/takes 5+ sec | network issue? ( not nmh )

nmh
In reply to this post by Ken Hornstein-2
Hi Ken:

On Sun 9/8/19 18:09 -0400 Ken Hornstein wrote:
>>So what should I be asking my vps provider?  Is this some network issue?
>
>Almost certainly "yes".  If I had to guess I would think it was a DNS
>timeout on something somewhere.

I thought it might be dns. Earlier in the day, twice in a row it hung near the end of
this snip:

[...]
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 4
connect(4, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("71.19.155.120")}, 16) = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1567958940, tv_nsec=908626657}) = 0
poll([{fd=4, events=POLLOUT}], 1, 0)    = 1 ([{fd=4, revents=POLLOUT}])
sendmmsg(4, [{msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\325T\1\0\0\1\0\0\0\0\0\0\5ibisbil\4mumble\3org\0\0\1\0\1", iov_len=32}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=32}, {msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="ql\1\0\0\1\0\0\0\0\0\0\5ibisbil\4mumble\3org\0\0\34\0\1", iov_len=32}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=32}], 2, MSG_NOSIGNAL) = 2

where ibisbil.mumble.com is the obfuscated hostname of the fedora vps with the problem, and 71.19.155.120
is the 1st nameserver in ibisbil.mumble.com:/etc/resolv.conf. Later in the day though the strace stalled
consistently at the spot shown in my first post.

>But unfortunately these system call traces are not very helpful.  This
>is because you are tracing the command "command", and all that shows is
>you execute a nmh command and it waits 5 seconds for it to complete.

OK/thanks. In the future I will plan to run for example:

    strace -o /tmp/foo -t -f /usr/local/nmh/bin/scan last

>If this happens in the future you should probably use the -f option to
>trace all children of the specified process/command, and the -t option
>to print the system call timestamps (that would let you see exactly
>where the delay is).  I also like using -o to direct the trace output to
>a file rather than having it mixed in with the output of a process.

Appreciate learning about the -t, -f,  and -o switches!

--
Tom

--
nmh-workers
https://lists.nongnu.org/mailman/listinfo/nmh-workers
Reply | Threaded
Open this post in threaded view
|

Re: strace show # pauses/takes 5+ sec | network issue? ( not nmh )

Michael Richardson-5

[hidden email] wrote:
    > I thought it might be dns. Earlier in the day, twice in a row it hung
    > near the end of this snip:

    > [...]  socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK,
    > IPPROTO_IP) = 4 connect(4, {sa_family=AF_INET, sin_port=htons(53),
    > sin_addr=inet_addr("71.19.155.120")}, 16) = 0
    > clock_gettime(CLOCK_REALTIME, {tv_sec=1567958940, tv_nsec=908626657}) =
    > 0 poll([{fd=4, events=POLLOUT}], 1, 0) = 1 ([{fd=4, revents=POLLOUT}])
    > sendmmsg(4, [{msg_hdr={msg_name=NULL, msg_namelen=0,
    > msg_iov=[{iov_base="\325T\1\0\0\1\0\0\0\0\0\0\5ibisbil\4mumble\3org\0\0\1\0\1",
    > iov_len=32}], msg_iovlen=1, msg_controllen=0, msg_flags=0},
    > msg_len=32}, {msg_hdr={msg_name=NULL, msg_namelen=0,
    > msg_iov=[{iov_base="ql\1\0\0\1\0\0\0\0\0\0\5ibisbil\4mumble\3org\0\0\34\0\1",
    > iov_len=32}], msg_iovlen=1, msg_controllen=0, msg_flags=0},
    > msg_len=32}], 2, MSG_NOSIGNAL) = 2

    > where ibisbil.mumble.com is the obfuscated hostname of the fedora vps
    > with the problem, and 71.19.155.120 is the 1st nameserver in
    > ibisbil.mumble.com:/etc/resolv.conf. Later in the day though the strace
    > stalled consistently at the spot shown in my first post.

    >> But unfortunately these system call traces are not very helpful.  This
    >> is because you are tracing the command "command", and all that shows
    >> is you execute a nmh command and it waits 5 seconds for it to
    >> complete.

    > OK/thanks. In the future I will plan to run for example:

    >     strace -o /tmp/foo -t -f /usr/local/nmh/bin/scan last

I suggest -ff, which will populate the files /tmp/foo.12343 by PID#
I can't see why scan is doing DNS, but what do I know.

If it's network related, we'll probably ask you to send tcdpump.
so:
        sudo tcpdump -i any -n -p -w /tmp/network.pcap
       
if you are logged in by SSH, then please add "not port 22"

--
]               Never tell me the odds!                 | ipv6 mesh networks [
]   Michael Richardson, Sandelman Software Works        | network architect  [
]     [hidden email]  http://www.sandelman.ca/        |   ruby on rails    [
       

--
nmh-workers
https://lists.nongnu.org/mailman/listinfo/nmh-workers

signature.asc (497 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: strace show # pauses/takes 5+ sec | network issue? ( not nmh )

Michael Richardson-5
Also, are you running NMH on your VPS?



--
nmh-workers
https://lists.nongnu.org/mailman/listinfo/nmh-workers
Reply | Threaded
Open this post in threaded view
|

Re: strace show # pauses/takes 5+ sec | network issue? ( not nmh )

Ken Hornstein-2
In reply to this post by Michael Richardson-5
>I suggest -ff, which will populate the files /tmp/foo.12343 by PID#

When you have a generic sort of "hang" I'm NOT a fan of that, because it's
tough to figure out which (of any) processes is causing the hang without
some (possibly a lot) cross-correlation.

>I can't see why scan is doing DNS, but what do I know.

I ran into this one time on an airplane.  If you don't have "localname"
set in mts.conf, anything that calls LocalName() (which is essentially
almost everything) ends up calling getaddrinfo(gethostname()), which may
result in a DNS lookup.  And some things will still result in a call
to getaddrinfo(gethostname()) even if you DO have localname set (scan
I don't think will, but I could be wrong).

--Ken

--
nmh-workers
https://lists.nongnu.org/mailman/listinfo/nmh-workers
Reply | Threaded
Open this post in threaded view
|

Re: strace show # pauses/takes 5+ sec | network issue? ( not nmh )

Krullen Van De Trap
In reply to this post by nmh
Set localname in your mts.conf. If this resolves the issue, the delay is
almost definitely caused by DNS. In order to avoid such pauses, I always
set localname in my mts.conf.

--
nmh-workers
https://lists.nongnu.org/mailman/listinfo/nmh-workers