SKS intermittently stalls with 100% CPU & rate-limiting

Pete Stephenson
Hi all,

My server, ams.sks.heypete.com, has been suffering from periods where
the amount of CPU used by the sks process goes to 100% for a few minutes
at a time. During this time, my Apache reverse proxy produces errors of
the following type (client IP address obfuscated for their privacy):

[Sun Jun 17 00:00:31.414596 2018] [proxy:error] [pid 4648:tid
139657505371904] [client CLIENT_IP:40327] AH00898: Error reading from
remote server returned by /pks/lookup

This happens across a range of client IP addresses, so it doesn't appear
to be a single malicious user. Rather, it seems that something is
causing the sks process to stall and connections to it time out.

After a minute or two, CPU usage drops to the normal value of a few
percent up to 15%, with queries being promptly answered until the CPU
usage spikes again and things stall out.

The server is in close sync with its peers, with no particular issues on
the recon side.

Any ideas what might be causing this? I'm running 1.1.6 on Debian, and
things have generally been working well for several years. For good
measure, I recently deleted the key database and recreated it from a
fresh dump, but that had no effect.

Potentially related: several clients, evidently corporate mail servers
that query the SKS pool for every email they send or receive, are making
dozens of queries per second to my server. Is it reasonable to impose
rate limits on such clients (e.g. no more than X queries in Y seconds)?
If so, what would reasonable values be for X and Y?

Thank you.

Cheers!
-Pete

--
Pete Stephenson

_______________________________________________
Sks-devel mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/sks-devel

Re: SKS intermittently stalls with 100% CPU & rate-limiting

Pete Stephenson
On 6/16/2018 5:18 PM, Pete Stephenson wrote:
> Hi all,
>
> My server, ams.sks.heypete.com, has been suffering from periods where
> the amount of CPU used by the sks process goes to 100% for a few minutes
> at a time. During this time, my Apache reverse proxy produces errors of
> the following type (client IP address obfuscated for their privacy):

[snip]

As a follow-up, I set a rate limit of 6 requests per 30-second window
on ports 80, 443, and 11371. The high-speed queries stopped reaching SKS
almost immediately and are now being blocked by the firewall (the clients
keep making their rapid queries, but SKS doesn't see more than 12 a minute).
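
For illustration only, here is a minimal Python sketch of the policy described above: at most 6 requests per client in any 30-second window. It is not the firewall rule actually used (that rule is not shown in this thread); the class name `SlidingWindowLimiter` and the example address are hypothetical.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each client.

    Illustrative only: the limit in this thread was enforced by a firewall,
    whose rules are not shown here.
    """

    def __init__(self, limit=6, window=30.0):
        self.limit = limit
        self.window = window
        # client identifier -> timestamps of that client's accepted requests
        self._hits = defaultdict(deque)

    def allow(self, client, now=None):
        """Return True if the request is accepted, False if it is rate limited."""
        now = time.monotonic() if now is None else now
        hits = self._hits[client]
        # Forget accepted requests that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(limit=6, window=30.0)
# A client firing 20 requests within one second gets only the first 6 through.
accepted = sum(limiter.allow("192.0.2.1", now=i * 0.05) for i in range(20))
print(accepted)  # 6
```

With these values a single client can get at most 12 requests through per minute, which matches the figure quoted above.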

More ordinary queries are seemingly not affected: only between one and
three IP addresses are being rate limited.

However, this doesn't seem to resolve the problem: SKS still pegs the
CPU meter at 100% for a minute or so every minute or two, with all
queries hanging until it sorts out what's going on.

--
Pete Stephenson

Re: SKS intermittently stalls with 100% CPU & rate-limiting

Moritz Wirth-2
In reply to this post by Pete Stephenson
Hi,

seems like that is the "problem":

https://bitbucket.org/skskeyserver/sks-keyserver/issues/60/denial-of-service-via-large-uid-packets
https://bitbucket.org/skskeyserver/sks-keyserver/issues/57/anyone-can-make-any-pgp-key-unimportable

Best regards,

Moritz

Re: SKS intermittently stalls with 100% CPU & rate-limiting

Pete Stephenson
Thanks.

I then have three more questions:

1. If this issue is affecting my server to the point of it being booted
from the pool (since it's stalling near-continuously and can't respond
to queries), why are other servers not being similarly affected? There
are lots of servers still in the pool.

2. Is there some countermeasure one can use to protect their server? I
have LimitRequestBody set to 8000000 (8MB) to prevent blatant abuse, but
clearly something is still annoying the server.

3. Any suggestions on how to deal with the unreasonably high-speed
queries from corporate mail systems? Ideally, they'd run their own
server locally to handle their huge amount of queries, but I have no
real way of communicating that with them. I'd love to slow down their
queries (tarpitting, maybe?) to minimize excess resource consumption
while still answering their queries as opposed to just cutting them off
once they hit a rate limit.
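
As a rough illustration of the tarpitting idea in question 3, the sketch below computes a delay that grows with how far a client is over its quota, so heavy clients are still answered, only more slowly. The function name `tarpit_delay` and all numeric values are hypothetical and not a recommendation from this thread. Note that a delayed response still holds a connection open on the proxy.

```python
def tarpit_delay(recent_requests, free_quota=6, step=0.5, max_delay=10.0):
    """Return the number of seconds to wait before answering a client.

    recent_requests: requests the client has made in the current window.
    Clients within `free_quota` are answered immediately; each request
    beyond it adds `step` seconds, capped at `max_delay`.
    All defaults are made-up illustration values.
    """
    excess = max(0, recent_requests - free_quota)
    return min(max_delay, excess * step)


for n in (3, 6, 10, 100):
    print(n, tarpit_delay(n))
```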

Cheers!
-Pete

On 6/16/2018 5:47 PM, Moritz Wirth wrote:

> Hi,
>
> seems like that is the "problem":
>
> https://bitbucket.org/skskeyserver/sks-keyserver/issues/60/denial-of-service-via-large-uid-packets
> https://bitbucket.org/skskeyserver/sks-keyserver/issues/57/anyone-can-make-any-pgp-key-unimportable
>
> Best regards,
>
> Moritz

--
Pete Stephenson

Re: SKS intermittently stalls with 100% CPU & rate-limiting

Paul M Furley
Hi Pete,

On 17/06/18 04:53, Pete Stephenson wrote:
> Thanks.
>
> I then have three more questions:
>
> 1. If this issue is affecting my server to the point of it being booted
> from the pool (since it's stalling near-continuously and can't respond
> to queries), why are other servers not being similarly affected? There
> are lots of servers still in the pool.

I certainly should've been booted from the pool since my server has
filled up its disk and trashed its database (twice) so it was offline
all of yesterday.

I'm bringing it back up with the `set_flags DB_LOG_AUTOREMOVE` setting
this time which will hopefully save it.

>
> 2. Is there some countermeasure one can use to protect their server? I
> have LimitRequestBody set to 8000000 (8MB) to prevent blatant abuse, but
> clearly something is still annoying the server.

It appears from Rob's previous email that our servers are failing to
synchronise a 22M key (because of settings like this) which is causing
sks to continually retry:

https://lists.nongnu.org/archive/html/sks-devel/2018-06/msg00014.html :

> The size is causing timeouts on some reverse proxies and the constant
> retries is causing the .log files to be created and growing in the DB
> directory.

>
> 3. Any suggestions on how to deal with the unreasonably high-speed
> queries from corporate mail systems? Ideally, they'd run their own
> server locally to handle their huge amount of queries, but I have no
> real way of communicating that with them. I'd love to slow down their
> queries (tarpitting, maybe?) to minimize excess resource consumption
> while still answering their queries as opposed to just cutting them off
> once they hit a rate limit.

Are you sure these users are the cause of your troubles? Or is it this
constant-retry loop caused by this large key?

I'd suggest contacting them before rate limiting them, ask them to point
at the pool or slow down their queries.

Paul

Re: SKS intermittently stalls with 100% CPU & rate-limiting

Moritz Wirth-2
In reply to this post by Pete Stephenson
I have an idea about this, though I am not sure it is still the same
problem.

The spider that checks the availability of the keyservers requests
/pks/lookup?op=get&search=0x16e0cf8d6b0b9508, which is a request for the
problematic key itself (just look it up).

I am not sure that this is the actual problem, but imagine that
requesting this key causes massive load: the request goes unanswered and
your keyserver is kicked out of the pool.
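
One way to test this hypothesis is to time the same lookup against your own server. The sketch below is a minimal example using only the Python standard library; the path and key ID come from the message above, port 11371 is the one mentioned earlier in the thread, and the host `localhost` and the helper names `lookup_url` and `time_lookup` are assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request
from urllib.parse import urlencode


def lookup_url(host, key_id, port=11371):
    """Build the /pks/lookup URL that fetches `key_id` from `host`."""
    query = urlencode({"op": "get", "search": key_id})
    return f"http://{host}:{port}/pks/lookup?{query}"


def time_lookup(url, timeout=30.0):
    """Return (seconds elapsed, HTTP status), with status None on failure."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            status = resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        status = exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure and similar.
        status = None
    return time.monotonic() - start, status


url = lookup_url("localhost", "0x16e0cf8d6b0b9508")
print(url)
# To actually probe a server you operate, call:
#   elapsed, status = time_lookup(url)
```

A lookup that regularly takes close to the timeout, or returns no status at all, would be consistent with the server being dropped from the pool for not answering.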


Re: SKS intermittently stalls with 100% CPU & rate-limiting

Pete Stephenson
In reply to this post by Paul M Furley
On 6/17/2018 12:59 AM, Paul M Furley wrote:

> Hi Pete,
>
> On 17/06/18 04:53, Pete Stephenson wrote:
>> Thanks.
>>
>> I then have three more questions:
>>
>> 1. If this issue is affecting my server to the point of it being booted
>> from the pool (since it's stalling near-continuously and can't respond
>> to queries), why are other servers not being similarly affected? There
>> are lots of servers still in the pool.
>
> I certainly should've been booted from the pool since my server has
> filled up its disk and trashed its database (twice) so it was offline
> all of yesterday.
>
> I'm bringing it back up with the `set_flags DB_LOG_AUTOREMOVE` setting
> this time which will hopefully save it.

Yeah, I added the same line. There are now just two log files rather
than dozens. It seems to work OK for controlling disk space usage, but
it doesn't seem to do anything about the spikes in CPU usage,
non-responsiveness, etc.
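
To check whether the log files are staying under control, something like the following sketch can count them and add up their size. It assumes the usual Berkeley DB naming scheme (`log.0000000001` and so on); the helper name `bdb_log_usage` is hypothetical, and the demo runs against a temporary directory with fake files rather than a real SKS database directory.

```python
import tempfile
from pathlib import Path


def bdb_log_usage(db_dir):
    """Return (file count, total bytes) of Berkeley DB log files in db_dir.

    Assumes log files are named log.NNNNNNNNNN, the usual Berkeley DB scheme.
    """
    logs = [p for p in Path(db_dir).glob("log.*") if p.is_file()]
    return len(logs), sum(p.stat().st_size for p in logs)


# Demo on a throwaway directory holding three fake 1 KiB log files.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(1, 4):
        (Path(tmp) / f"log.{i:010d}").write_bytes(b"\0" * 1024)
    # Files that are not logs, such as the config file, are ignored.
    (Path(tmp) / "DB_CONFIG").write_text("set_flags DB_LOG_AUTOREMOVE\n")
    print(bdb_log_usage(tmp))  # (3, 3072)
```

Pointing the function at the real database directory from a cron job would show whether the count stays at a handful of files or keeps growing.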

>> 2. Is there some countermeasure one can use to protect their server? I
>> have LimitRequestBody set to 8000000 (8MB) to prevent blatant abuse, but
>> clearly something is still annoying the server.
>
> It appears from Rob's previous email that our servers are failing to
> synchronise a 22M key (because of settings like this) which is causing
> sks to continually retry:
>
> https://lists.nongnu.org/archive/html/sks-devel/2018-06/msg00014.html :

The server had been running with no limits on the request body size for
several years without problems. I added that line in the hopes of
controlling things from getting worse. I've since removed it, but it
doesn't seem to have much of an effect.

Is there some way of (a) resolving the problem with this key (e.g.
locally adding it to the server, so it won't keep choking while
retrying) and (b) preventing such issues from occurring in the future
that I can take now?

>> 3. Any suggestions on how to deal with the unreasonably high-speed
>> queries from corporate mail systems? Ideally, they'd run their own
>> server locally to handle their huge amount of queries, but I have no
>> real way of communicating that with them. I'd love to slow down their
>> queries (tarpitting, maybe?) to minimize excess resource consumption
>> while still answering their queries as opposed to just cutting them off
>> once they hit a rate limit.
>
> Are you sure these users are the cause of your troubles? Or is it this
> constant-retry loop caused by this large key?

I don't know.

Regardless, I do think that the high-volume users are being a bit
unreasonable: SKS queries are relatively "heavy" compared to lightweight
queries like those to DNSbls, so making queries to the SKS pool for each
email sent or received seems excessive, but that may just be me.

Anyway, I've removed the rate limits since they didn't seem to have any
effect on the constant-retry loop or stalling.

> I'd suggest contacting them before rate limiting them, ask them to point
> at the pool or slow down their queries.

I think they already were querying the pool, and just happened to get my
server as part of the rotation. I just sent them an email inquiring
about the number of queries and encouraging them to run their own server
and have it join the pool, but in general I don't have the time or
motivation to contact every potentially abusive user. I'm just curious
if there are any recommended practices for throttling abusive users.

Cheers!
-Pete

--
Pete Stephenson

Re: SKS intermittently stalls with 100% CPU & rate-limiting

Pete Stephenson
In reply to this post by Paul M Furley
On 6/17/2018 12:59 AM, Paul M Furley wrote:

> Hi Pete,
>
>> 2. Is there some countermeasure one can use to protect their server? I
>> have LimitRequestBody set to 8000000 (8MB) to prevent blatant abuse, but
>> clearly something is still annoying the server.
>
> It appears from Rob's previous email that our servers are failing to
> synchronise a 22M key (because of settings like this) which is causing
> sks to continually retry:
>
> https://lists.nongnu.org/archive/html/sks-devel/2018-06/msg00014.html :

It's been four days and my server is still stalling and connections time
out. The server is regularly being added to and removed from the pool.

I've removed the Apache LimitRequestBody directive in my relevant
reverse proxy configuration file, but what else can I do to stop this
continuous cycle such that my server is again a stable member of the pool?

>> 3. Any suggestions on how to deal with the unreasonably high-speed
>> queries from corporate mail systems? Ideally, they'd run their own
>> server locally to handle their huge amount of queries, but I have no
>> real way of communicating that with them. I'd love to slow down their
>> queries (tarpitting, maybe?) to minimize excess resource consumption
>> while still answering their queries as opposed to just cutting them off
>> once they hit a rate limit.
>
> Are you sure these users are the cause of your troubles? Or is it this
> constant-retry loop caused by this large key?
>
> I'd suggest contacting them before rate limiting them, ask them to point
> at the pool or slow down their queries.

It turns out that contacting them was the right thing to do: they've
implemented a caching proxy on their end to minimize the load to the
pool and are looking at running their own server going forward.
Excellent. Thanks for the suggestion.

Cheers!
-Pete

--
Pete Stephenson

Re: SKS intermittently stalls with 100% CPU & rate-limiting

Moritz Wirth-2
I am afraid there is not much you can do about this right now - the pool
itself is very unstable and crashes multiple times per day.

I found over 8 key hashes which cause an Eventloop - this happens every
2-3 minutes, sometimes with the same key, sometimes with other keys. 

Best regards,


On 22.06.18 at 00:11, Pete Stephenson wrote:

> On 6/17/2018 12:59 AM, Paul M Furley wrote:
>> Hi Pete,
>>
>>> 2. Is there some countermeasure one can use to protect their server? I
>>> have LimitRequestBody set to 8000000 (8MB) to prevent blatant abuse, but
>>> clearly something is still annoying the server.
>> It appears from Rob's previous email that our servers are failing to
>> synchronise a 22M key (because of settings like this) which is causing
>> sks to continually retry:
>>
>> https://lists.nongnu.org/archive/html/sks-devel/2018-06/msg00014.html :
> It's been four days and my server is still stalling and connections time
> out. The server is regularly being added to and removed from the pool.
>
> I've removed the Apache LimitRequestBody directive in my relevant
> reverse proxy configuration file, but what else can I do to stop this
> continuous cycle such that my server is again a stable member of the pool?

Re: SKS intermittently stalls with 100% CPU & rate-limiting

Paul Fontela
In reply to this post by Pete Stephenson
Hello everyone,
without meaning to rub salt in the wound...

I have spent almost 10 days investigating the problem I see described
in different threads on the list [Sks-devel]: SKS servers going down
under abusive request loads.

I have tried almost everything, from downloading a dump and restarting
the SKS server to reinstalling the system and everything else. The
result is always the same: it works well for a while, sometimes an hour,
sometimes a little more, and then the key server suddenly freezes,
reaching 80% RAM usage, which makes it unstable and inoperable.

Of the three servers that I have, only 2 are surviving, with difficulty,
this strange problem that has appeared "suddenly". I wonder the
following:

Is there any way to solve this problem?

Checking the Nginx and SKS logs, I have seen that some clients query
nonstop for long periods.

Is it possible to block freeloaders who do not want to spend a few
dollars to set up their own key server?

What about those huge keys that clog servers?

Is it possible to limit or block scripted queries and allow queries
only through the web interface?

Given what I have seen, I am going to stop one of the servers. It is the
smallest of them and is hosted at the site that had been working best
until now: a small virtual machine with little RAM (1 GB). It is the
server causing me the most problems, and I think it is not worth having
a server running 24 hours a day if it only fulfills its mission 30
minutes a day and I have to keep watching it to restart services every
time it hangs.

I will keep the other servers until I see that they start giving me
problems too; if that happens, I will have to make a difficult decision.

What I do not want to do is have machines consuming electricity,
bandwidth and resources without fulfilling their mission.

Greetings to all and a lot of encouragement.
Paul Fontela

--

Paul Fontela
keyserver.ispfontela.es 11370 # Paul Fontela <[hidden email]> 0x31743FFC33E746C5
a.0.na.ispfontela.es 11370 # Paul Fontela Gmail <[hidden email]> 0x3D7FCDA03AAD46F1
 


Re: SKS intermittently stalls with 100% CPU & rate-limiting

Gabor Kiss
> I have tried almost everything, from downloading a dump and starting the
> server sks again to reinstall system and everything else, the result is
> always the same, it works well for a while, sometimes an hour sometimes
> a little more and suddenly it freezes the key server, reaching 80%
> RAM, which makes it unstable and inoperable.

Eeerrr... A few years ago I had a similar problem.
See thread at http://lists.nongnu.org/archive/html/sks-devel/2015-03/msg00004.html

Regards

Gabor

Re: SKS intermittently stalls with 100% CPU & rate-limiting

Phil Pennock-17
In reply to this post by Paul Fontela
On 2018-06-25 at 13:08 +0200, Paul Fontela wrote:
> I have tried almost everything, from downloading a dump and starting the
> server sks again to reinstall system and everything else, the result is
> always the same, it works well for a while, sometimes an hour sometimes
> a little more and suddenly it freezes the key server, reaching 80%
> RAM, which makes it unstable and inoperable.

That sounds like recon gone wild, normally a sign that you're peering
with someone who is very much behind on keys.  The recon system only
works if your peers are "mostly up-to-date".

This is why we introduced the template for introducing yourself to the
community, in the Peering wiki page, showing how many keys you have
loaded.  It cut down on people joining with 0 keys, expecting recon to
do all the work, and new peers complaining that their SKS was hanging.

Per <https://sks-keyservers.net/status/> the lower bound of keys to be
included is:  5105570
You have:     5109664

Using <http://keyserver.ispfontela.es:11371/pks/lookup?op=stats> as a
starting point, and skipping your in-house 11380 peers, opening all the
others up in tabs and looking (I don't have this scripted) we see:

  5109604  keys.niif.hu
  5065412  keys.sbell.io
  5107576  sks.mbk-lab.ru
  5109585  pgp.neopost.com
  5108773  pgp.uni-mainz.de
  5109639  pgpkeys.urown.net
  4825075  pgp.key-server.io
  <can't connect>  sks.funkymonkey.org
  5084241  keyserver.iseclib.ru
  5109254  keyserver.swabian.net
  5109628  sks-cmh.semperen.com
  <sks down behind proxy>  keys-02.licoho.de
  5109629  keyserver.dobrev.eu
  5109121  sks.mirror.square-r00t.net
  5109629  keyserver.escomposlinux.org
  5108778  keyserver.lohn24-datenschutz.de

If your in-house peers are way behind, fix that.

Comment out all peers with fewer than 5_100_000 keys.  Restart sks and
sks-recon.

The 284,000-key gap between you and pgp.key-server.io is pretty severe.
Since that peer isn't getting updates, it is probably hanging on peering
and causing even more problems for you.

Disable peering _at least_ with the three hosts below that threshold:
keys.sbell.io, pgp.key-server.io and keyserver.iseclib.ru.


Whenever SKS isn't performing right, the _first_ step after looking for
errors in logs should always be a Peering Hygiene Audit.  Find the peers
who are sufficiently behind that their keeping the peering up is
anti-social and likely causing _you_ problems, comment out the peering
entries, restart (for a completely clean slate) and then reach out to
those peers to ask "Hey, what's up?".
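
The audit described above was done by hand, but the comparison step is easy to sketch in Python. The key counts below are copied from the list in this message and the 5,100,000 threshold is the one suggested above. The function names are hypothetical, and `parse_total_keys` assumes the stats page contains the text "Total number of keys: N", which should be checked against your own server's output.

```python
import re

# Key counts as listed above; None marks peers that could not be queried.
PEER_KEYS = {
    "keys.niif.hu": 5109604,
    "keys.sbell.io": 5065412,
    "sks.mbk-lab.ru": 5107576,
    "pgp.neopost.com": 5109585,
    "pgp.uni-mainz.de": 5108773,
    "pgpkeys.urown.net": 5109639,
    "pgp.key-server.io": 4825075,
    "sks.funkymonkey.org": None,
    "keyserver.iseclib.ru": 5084241,
    "keyserver.swabian.net": 5109254,
    "sks-cmh.semperen.com": 5109628,
    "keys-02.licoho.de": None,
    "keyserver.dobrev.eu": 5109629,
    "sks.mirror.square-r00t.net": 5109121,
    "keyserver.escomposlinux.org": 5109629,
    "keyserver.lohn24-datenschutz.de": 5108778,
}
OWN_KEYS = 5109664
MIN_KEYS = 5_100_000


def parse_total_keys(stats_html):
    """Extract the key count from a stats page, or None if it is not found.

    Assumes the page contains the text 'Total number of keys: N'.
    """
    match = re.search(r"Total number of keys:\s*(\d+)", stats_html)
    return int(match.group(1)) if match else None


def lagging_peers(peer_keys, min_keys):
    """Return (host, count) pairs below min_keys, furthest behind first."""
    behind = [(h, n) for h, n in peer_keys.items() if n is not None and n < min_keys]
    return sorted(behind, key=lambda item: item[1])


for host, count in lagging_peers(PEER_KEYS, MIN_KEYS):
    print(f"{host}: {count} keys, {OWN_KEYS - count} behind")

unreachable = [h for h, n in PEER_KEYS.items() if n is None]
print("no key count available:", ", ".join(unreachable))
```

Run as written, this flags pgp.key-server.io (284589 keys behind), keys.sbell.io (44252 behind) and keyserver.iseclib.ru (25423 behind), and lists the two peers that could not be queried separately.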

Regards,
-Phil

Re: SKS intermittently stalls with 100% CPU & rate-limiting

Paul Fontela

Hi Phill,

Thank you very much for your interest and your answer, the server keyserver.ispfontela.es has no problems, in fact has been able to synchronize almost 200,000 keys in less than 2 hours, that computer is powerful, has a large processor and a lot of RAM, the one that has a serious problem is a.0.na.ispfontela.es, is a virtual host that only has 1Gb of RAM, has always worked well until a few days ago that suddenly has begun to suffer what other colleagues comment, including with the updated database, more than 5100000 keys, it got stuck and stopped, I asked myself then:
If nothing has been modified in the configuration of the server or in the SKS service, what has happened?
That's when I started a battery of tests:
1 - Changes to the Nginx configuration.
2 - Rebuilding the key database from scratch with a new dump.
3 - Reinstalling the system (Ubuntu).
4 - Other modifications (adding swap to the Linux system, which had none).

The result was always the same: shortly after SKS started, its RAM consumption rose to 80% and never came back down.

Could some system update have affected it?

Today it is synchronizing with only 2 peers, starting from 26,000 keys, until it reaches 5,100,000; with that I will know more or less what is happening.

I have seen that some other servers that are also hosted on Amazon datacenters are suffering from the same problem, could it be Amazon, I do not know, I can not answer that yet.

I will continue investigating, and if in the end it does not improve, I will shut down that server and leave only keyserver.ispfontela.es running, which works well for the moment.


On 25/06/2018 at 23:46, Phil Pennock wrote:
> That sounds like recon gone wild, normally a sign that you're peering
> with someone who is very much behind on keys.  The recon system only
> works if your peers are "mostly up-to-date".
>
> This is why we introduced the template for introducing yourself to the
> community, in the Peering wiki page, showing how many keys you have
> loaded.  It cut down on people joining with 0 keys, expecting recon to
> do all the work, and new peers complaining that their SKS was hanging.
>
> Per <https://sks-keyservers.net/status/> the lower bound of keys to be
> included is:  5105570
> You have:     5109664
>
> Using <http://keyserver.ispfontela.es:11371/pks/lookup?op=stats> as a
> starting point, and skipping your in-house 11380 peers, opening all the
> others up in tabs and looking (I don't have this scripted) we see:
>
>   5109604  keys.niif.hu
>   5065412  keys.sbell.io
>   5107576  sks.mbk-lab.ru
>   5109585  pgp.neopost.com
>   5108773  pgp.uni-mainz.de
>   5109639  pgpkeys.urown.net
>   4825075  pgp.key-server.io
>   <can't connect>  sks.funkymonkey.org
>   5084241  keyserver.iseclib.ru
>   5109254  keyserver.swabian.net
>   5109628  sks-cmh.semperen.com
>   <sks down behind proxy>  keys-02.licoho.de
>   5109629  keyserver.dobrev.eu
>   5109121  sks.mirror.square-r00t.net
>   5109629  keyserver.escomposlinux.org
>   5108778  keyserver.lohn24-datenschutz.de
>
> If your in-house peers are way behind, fix that.
>
> Comment out all peers with fewer than 5_100_000 keys.  Restart sks and
> sks-recon.
>
> The 284,000 key difference is pretty severe.  Since that peer isn't
> getting updates, they're probably hanging on peering and causing even
> more problems for you.
>
> Disable peering _at least_ with those three hosts.
>
>
> Whenever SKS isn't performing right, the _first_ step after looking for
> errors in logs should always be a Peering Hygiene Audit.  Find the peers
> who are sufficiently behind that their keeping the peering up is
> anti-social and likely causing _you_ problems, comment out the peering
> entries, restart (for a completely clean slate) and then reach out to
> those peers to ask "Hey, what's up?".
>
> Regards,
> -Phil

-- 

Paul Fontela
keyserver.ispfontela.es 11370	# Paul Fontela [hidden email] 0x31743FFC33E746C5
a.0.na.ispfontela.es	11370	# Paul Fontela Gmail [hidden email] 0x3D7FCDA03AAD46F1


Re: SKS intermittently stalls with 100% CPU & rate-limiting

John Zaitseff
Hi, everyone,

Paul Fontela wrote:

> If nothing has been modified in the configuration of the server or
> in the SKS service, what has happened?

As others have commented at length, could this indeed be related to
malicious or problematic keys?

> I have seen that some other servers that are also hosted on Amazon
> datacenters are suffering from the same problem, could it be
> Amazon, I do not know, I can not answer that yet.

The problem is definitely more widespread than Amazon.  I am seeing
the same issues on my physical server located in Sydney, Australia.

My server has plenty of memory and disk space, so that is not an
issue (/var/lib/sks/DB is currently 118GB), but one processor core
continually goes in and out of being 100% utilised by the
single-threaded "sks db" process.

I can confirm that I have not changed any major OS component nor the
SKS daemon itself--I'm running an up-to-date Debian installation,
uptime is currently 48 days, and the problems appeared the same time
everyone else's did, just a couple of weeks ago.

Happy to provide log files if anyone is debugging; I myself have not
spent much time on this, nor looked through the SKS source code.

By the way, I tried Phil Pennock's suggestion of removing peers that
were significantly behind mine in terms of number of keys, but that
made no difference to the situation.

Yours truly,

John Zaitseff

--
John Zaitseff                    ,--_|\    The ZAP Group
Phone:  +61 2 9643 7737         /      \   Sydney, Australia
E-mail: [hidden email]   \_,--._*   http://www.zap.org.au/
                                      v
