SKS Performance oddity


SKS Performance oddity

Jeremy T. Bouse

        I don't know what is going on here with my cluster, but I have 3 of 4
nodes that perform exactly as I would expect. They have 2 vCPU with
4GB RAM each, along with an extra 50GB drive exclusively for SKS use
under /var/lib/sks. The three behaving fine are my sks02, sks03 and
sks04 secondary nodes. My primary node, on the other hand, is another
story. First I tried increasing it from 2 vCPU/4GB RAM like the others
to 2 vCPU/8GB RAM, and then 4 vCPU/8GB RAM, without any change. I then
built out a new physical server with a quad-core Xeon 2.4GHz processor,
4GB RAM and a dedicated 3TB RAID5 array, and I'm seeing the same
problem. SKS constantly pegs the CPU at 100% and eats up nearly all the
memory, whether it's running on a virtual or physical server. The recon
service is working and I'm ingesting keys from peers and peering with
my internal cluster nodes, but every time it goes into recon mode the
node starts failing to respond as the CPU and RAM spike, which then
leads to the node being dropped from the pool because the stats page
can't be fetched before it times out.
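
        The failure is easy to see from the outside with a quick probe. Note
the 10-second timeout here is my guess at the pool's threshold, not a
known value:

```shell
# crude stand-in for the pool's liveness check: if the stats page
# doesn't answer before the timeout, the node gets dropped
curl -sf --max-time 10 'http://localhost:11371/pks/lookup?op=stats' >/dev/null \
  && echo in-pool || echo dropped
```

On the healthy secondaries this comes back instantly; on the primary it
times out whenever recon kicks in.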

        I've been fighting with this for several days now... Is anyone else
out there seeing this behavior? If not, would anyone with similarly
resourced servers care to share details, to see if I'm missing
something here?

        The particulars are that all nodes are Debian 9.8 (Stretch) 64-bit.
Only the primary node runs NGINX, configured for load balancing the
cluster. The only other daemons running across all nodes besides SKS
are OpenSSH for remote access, SSSD for centralized authentication,
Haveged for entropy and Postfix configured for smarthost relaying.

_______________________________________________
Sks-devel mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/sks-devel

Re: SKS Performance oddity

Todd Fleisher
I've been having similar issues this week, though in my case it's mainly high I/O load/wait. Also it's not my primary nodes that recon with the outside world that are affected, but some of my secondary nodes that only peer internally. I've been restoring them by replacing the DB & PTree files/dirs from another node, and that seems to do the trick for a period of time, but I have already done it twice in the last few days, so it's not really a sustainable approach. I just haven't had time to dig deeper into it to determine why it is happening and/or how to better protect against it.

Sent from the Fleishphone

> On Mar 8, 2019, at 19:22, Jeremy T. Bouse <[hidden email]> wrote:
> [...]



Re: SKS Performance oddity

Michiel van Baak
In reply to this post by Jeremy T. Bouse
On Sat, Mar 09, 2019 at 12:22:14AM -0500, Jeremy T. Bouse wrote:

> [...]

Hey,

I have exactly the same problem.
Several times in the last month I have done the following steps:

- Stop all nodes
- Destroy the datasets (both db and ptree)
- Load in a new dump that is at most 2 days old
- Create the ptree database
- Start sks on the primary node, without peering configured (comment out
  all peers)
- Give it some time to start
- Check the stats page and run a couple of searches
# Up until here everything works fine #
- Add the outside peers on the primary node and restart it
- After 5 minutes the machine takes 100% CPU, is stuck in I/O most of
  the time and falls off the grid
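
Scripted out, the cycle above looks roughly like this. The dump path,
cache sizes and the service/user names are my assumptions from the
Debian packaging, so adjust to your own layout:

```shell
#!/bin/sh
# Sketch of the rebuild cycle described above; paths and service/user
# names are assumptions, not gospel.
set -e
service sks stop                        # stop the node
cd /var/lib/sks
rm -rf DB PTree                         # destroy both datasets
sks build dump/*.pgp -n 10 -cache 100   # load the fresh dump into the DB
sks pbuild -cache 20 -ptree_cache 70    # create the ptree database
chown -R debian-sks: .                  # Debian runs sks as debian-sks
service sks start                       # membership peers still commented out
```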

It doesn't matter if I enable peering with the internal nodes or not.
Just having 1 SKS instance running, and peering it with the network is
enough to basically render this instance unusable.

Like you, I tried in a VM first, and also on a physical machine (dual
6-core Xeon E5-2620 0 @ 2.00GHz with 96GB RAM and 2 Samsung EVO 840
Pro SSDs for storage).
I see exactly the same thing every time I follow the steps outlined above.

The systems I tried are Debian Linux and FreeBSD, and the result is the same on both.

--
Michiel van Baak
[hidden email]
GPG key: http://pgp.mit.edu/pks/lookup?op=get&search=0x6FFC75A2679ED069

NB: I have a new GPG key. Old one revoked and revoked key updated on keyservers.


Re: SKS Performance oddity

Jim Popovitch
In reply to this post by Jeremy T. Bouse

On Sat, 2019-03-09 at 00:22 -0500, Jeremy T. Bouse wrote:
> I've been fighting with this for a several days now... Anyone else
> out there seeing this behavior or if not and have similar resourced
> servers care to share details to see if I'm missing something here.
>
> The particulars are that all nodes are Debian 9.8 (Stretch) 64-bit.


I'm running a nearly identical SKS setup (but a single instance) and I
too am seeing the same behaviour. A fresh import from Matt's daily data
yields a nice experience. Once the service is active and online, the
data quickly (within hours) becomes corrupted with bogus, invalid,
unsanitized UID data (and that's putting it nicely; it's not really
UID data).

-Jim P.



Re: SKS Performance oddity

Jeremy T. Bouse
In reply to this post by Michiel van Baak

On 3/9/2019 5:29 AM, Michiel van Baak wrote:

> [...]

I've been trying to narrow it down and zero in on something to fix,
though I admittedly don't know that much about the internal functions
of the process flow. I have noticed that the issue is not the recon
service itself, despite it appearing so blatantly during recon mode.
From my observation it actually appears to be the DB service.

At this point I have 5 nodes: sks01 - sks04 are my original 4 VM nodes,
all with 2 vCPU/4GB except sks01, which is 4 vCPU/8GB, and then sks0,
which is my physical server with a quad-core Xeon and 4GB RAM.
Currently sks0 is set up to be my external peering point; originally it
was sks01. I have just finished re-importing the keydump into sks0 and
sks01 from the daily dumps from mattrude.com for 2019-03-08 and
2019-03-09 respectively.

I'm running the following command from another machine to check on things:

>  for I in $(seq 50 54); do echo .${I}; ssh 172.16.20.${I} 'uptime; ps aux| grep sks |grep -v grep; time curl -sf localhost:11371/pks/lookup?op=stats |grep keys:'; echo; done

.50
 18:14:26 up 1 day, 11:30,  7 users,  load average: 0.10, 0.69, 1.31
debian-+ 24595 17.5 13.5 605012 540968 ?       Ss   15:32  28:32
/usr/sbin/sks -stdoutlog db
debian-+ 24596  0.3  0.8  72528 32740 ?        Ss   15:32   0:37
/usr/sbin/sks -stdoutlog recon
<h2>Statistics</h2><p>Total number of keys: 5448526</p>

real    0m0.014s
user    0m0.004s
sys     0m0.004s

.51
 18:14:28 up 1 day, 14:03,  4 users,  load average: 1.30, 1.65, 1.49
debian-+  5166 32.4 36.0 3059044 2950716 ?     Ss   15:37  51:01
/usr/sbin/sks -stdoutlog db
debian-+  5167  0.5  4.0 603644 331260 ?       Ss   15:37   0:48
/usr/sbin/sks -stdoutlog recon
<h2>Statistics</h2><p>Total number of keys: 5448005</p>

real    0m0.022s
user    0m0.012s
sys     0m0.000s

.52
 18:14:30 up 7 days, 19:21,  4 users,  load average: 0.98, 0.38, 0.31
debian-+  6234  0.5 38.6 1609044 1565612 ?     Rs   Mar06  30:33
/usr/sbin/sks -stdoutlog db
debian-+  6235  0.0  3.8 356328 156708 ?       Ss   Mar06   0:51
/usr/sbin/sks -stdoutlog recon
<h2>Statistics</h2><p>Total number of keys: 5447149</p>

real    1m46.269s
user    0m0.012s
sys     0m0.000s

.53
 18:16:17 up 7 days, 19:28,  4 users,  load average: 2.01, 1.55, 0.85
debian-+  5754  0.6 13.6 590840 551360 ?       Ds   Mar05  37:20
/usr/sbin/sks -stdoutlog db
debian-+  5755  0.0  3.1 266908 126064 ?       Ss   Mar05   1:59
/usr/sbin/sks -stdoutlog recon
<h2>Statistics</h2><p>Total number of keys: 5447523</p>

real    0m46.400s
user    0m0.008s
sys     0m0.004s

.54
 18:17:05 up 7 days, 19:28,  4 users,  load average: 1.88, 0.87, 0.41
debian-+  5994  0.6 18.5 791456 752596 ?       Ss   Mar05  35:24
/usr/sbin/sks -stdoutlog db
debian-+  5995  0.0  3.0 260224 122112 ?       Ds   Mar05   1:45
/usr/sbin/sks -stdoutlog recon
<h2>Statistics</h2><p>Total number of keys: 5447788</p>

real    0m0.015s
user    0m0.008s
sys     0m0.000s


For stability's sake I'd removed sks0 and sks01 from my NGINX
upstreams. The exception to this is that I have

    location /pks/hashquery {
        proxy_method POST;
        proxy_pass http://127.0.0.1:11371;
    }

so that /pks/hashquery doesn't use the server pool but the local SKS
instance. So sks0 is only seeing all traffic on 11370/tcp, and only
traffic for the /pks/hashquery URI on 11371/tcp. All other /pks URI
requests are going to the backend and hitting sks02 - sks04.

I have found some improvement with changes to the *pagesize settings
before re-importing the keydump. Currently all my nodes have had their
data re-imported using the following settings:

pagesize:          128
keyid_pagesize:    64
meta_pagesize:     1
subkeyid_pagesize: 128
time_pagesize:     128
tqueue_pagesize:   1
ptree_pagesize:    8
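
If anyone wants to confirm what page size an existing database file
actually ended up with, the Berkeley DB db_stat utility (db-util on
Debian) reports it; the path below assumes the Debian /var/lib/sks
layout:

```shell
# print the underlying page size of the key database
db_stat -d /var/lib/sks/DB/key | grep -i 'page size'
```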

I also have the hack to short-circuit the bad actor keys that had been
mentioned on the list using:

        if ( $arg_search ~* "(0x1013D73FECAC918A0A25823986CE877469D2EAD9|0x2016349F5BC6F49340FCCAF99F9169F4B33B4659|0xB33B4659|0x69D2EAD9)" ) {
            return 444;
        }

This has resulted in me no longer seeing the flood of requests for
them in my SKS log. The key difference between my config and what I'd
seen others mention on the list is that I'm matching only the 'search'
query argument, not the full query string, so I catch it regardless of
'op', 'options' or any other argument given.
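
As a toy illustration of that difference (plain shell standing in for
the NGINX match; the real config uses a case-insensitive regex, this
sketch is exact-case), keying on the 'search' argument alone catches
the lookup no matter what else is in the query string:

```shell
# Toy stand-in for the NGINX match: pull out only the 'search' query
# argument and test it, so 'op', 'options' etc. don't matter.
match_bad_search() {
    search=$(printf '%s\n' "$1" | sed -n 's/.*[?&]search=\([^&]*\).*/\1/p')
    case "$search" in
        *0x1013D73FECAC918A0A25823986CE877469D2EAD9*|\
        *0x2016349F5BC6F49340FCCAF99F9169F4B33B4659*|\
        *0xB33B4659*|*0x69D2EAD9*) echo blocked ;;
        *) echo allowed ;;
    esac
}
match_bad_search '/pks/lookup?op=get&search=0xB33B4659'               # blocked
match_bad_search '/pks/lookup?op=index&options=mr&search=0x69D2EAD9'  # blocked
match_bad_search '/pks/lookup?op=stats'                               # allowed
```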
