using --remove-older-than takes a very long time?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

using --remove-older-than takes a very long time?

Derek Atkins-2
HI,

I've been running backups on my servers for a long time using
rdiff-backup.  My backup server has been offline for about a week and I
just put it back online yesterday.  The server does the following:

foreach host (list of machines to backup); do
  rdiff-backup host local-storage-for-host
  rdiff-backup --force --remove-older-than 1Y local-storage-for-host
end

This process starts at 1am.  I noticed today (it's not 8am) that the
remove process for the first server started at 4:09am and is still
running.  In other words, the "clear history" process is taking longer
than the backup did!

Is this expected behavior?

While I keep track of how long it takes to run the backup process for
each server, I didnt separate out the backup vs cleanup stages of the
process.  If cleanup is going to take so long, I might separate it out
into a different stage!  Although it would be nice if cleanup could be
sped up!

Is there any way to profile the code to see where it is spending its
time?

Thanks,

-derek

PS: I should mention this is with 1.2.8, not a more recent version.

--
       Derek Atkins                 617-623-3745
       [hidden email]             www.ihtfp.com
       Computer and Internet Security Consultant

Reply | Threaded
Open this post in threaded view
|

Re: using --remove-older-than takes a very long time?

EricZolf
Hi Derek,

On 07/02/2020 14:46, Derek Atkins wrote:

> HI,
>
> I've been running backups on my servers for a long time using
> rdiff-backup.  My backup server has been offline for about a week and I
> just put it back online yesterday.  The server does the following:
>
> foreach host (list of machines to backup); do
>   rdiff-backup host local-storage-for-host
>   rdiff-backup --force --remove-older-than 1Y local-storage-for-host
> end
>
> This process starts at 1am.  I noticed today (it's not 8am) that the
> remove process for the first server started at 4:09am and is still
> running.  In other words, the "clear history" process is taking longer
> than the backup did!
>
> Is this expected behavior?

I don't think so but it probably depends on how many backups it has to
remove.

>
> While I keep track of how long it takes to run the backup process for
> each server, I didnt separate out the backup vs cleanup stages of the
> process.  If cleanup is going to take so long, I might separate it out
> into a different stage!  Although it would be nice if cleanup could be
> sped up!
>
> Is there any way to profile the code to see where it is spending its
> time?

A profiling as such is possible at python level but I would start with
just calling the removal option with `-v9`, it should give you/us a
first hint as the date/times for each action are logged.

KR, Eric

>
> Thanks,
>
> -derek
>
> PS: I should mention this is with 1.2.8, not a more recent version.

I wouldn't expect a big difference but you never know...

Reply | Threaded
Open this post in threaded view
|

Re: using --remove-older-than takes a very long time?

Derek Atkins-2
Hi,

On Fri, February 7, 2020 12:31 pm, Eric L. wrote:
> Hi Derek,
>
[snip]
> I don't think so but it probably depends on how many backups it has to
> remove.

Well, normally just one.   Today I suspect it is removing around 7.

>>
>> While I keep track of how long it takes to run the backup process for
>> each server, I didnt separate out the backup vs cleanup stages of the
>> process.  If cleanup is going to take so long, I might separate it out
>> into a different stage!  Although it would be nice if cleanup could be
>> sped up!
>>
>> Is there any way to profile the code to see where it is spending its
>> time?
>
> A profiling as such is possible at python level but I would start with
> just calling the removal option with `-v9`, it should give you/us a
> first hint as the date/times for each action are logged.

I can certainly change the code to do that tomorrow night; I think it's a
bit late to make that change now for today's run.

> KR, Eric

-derek
--
       Derek Atkins                 617-623-3745
       [hidden email]             www.ihtfp.com
       Computer and Internet Security Consultant


Reply | Threaded
Open this post in threaded view
|

Re: using --remove-older-than takes a very long time?

Robert Nichols-2
On 2/7/20 11:42 AM, Derek Atkins wrote:
> Hi,
>
> On Fri, February 7, 2020 12:31 pm, Eric L. wrote:
>> A profiling as such is possible at python level but I would start with
>> just calling the removal option with `-v9`, it should give you/us a
>> first hint as the date/times for each action are logged.
>
> I can certainly change the code to do that tomorrow night; I think it's a
> bit late to make that change now for today's run.

You can use the "lsof" command or look in /proc/{PID}/fd/ to see what
files the process is handling. You might get some clue from that.

--
Bob Nichols     "NOSPAM" is really part of my email address.
                 Do NOT delete it.


Reply | Threaded
Open this post in threaded view
|

Re: using --remove-older-than takes a very long time?

Derek Atkins-2

On Fri, February 7, 2020 2:05 pm, Robert Nichols wrote:

> On 2/7/20 11:42 AM, Derek Atkins wrote:
>> Hi,
>>
>> On Fri, February 7, 2020 12:31 pm, Eric L. wrote:
>>> A profiling as such is possible at python level but I would start with
>>> just calling the removal option with `-v9`, it should give you/us a
>>> first hint as the date/times for each action are logged.
>>
>> I can certainly change the code to do that tomorrow night; I think it's
>> a
>> bit late to make that change now for today's run.
>
> You can use the "lsof" command or look in /proc/{PID}/fd/ to see what
> files the process is handling. You might get some clue from that.

The process has moved on to the next server now.  But I can try that
tomorrow.  Then again, tomorrow it will be running with -v9.

-derek

--
       Derek Atkins                 617-623-3745
       [hidden email]             www.ihtfp.com
       Computer and Internet Security Consultant


Reply | Threaded
Open this post in threaded view
|

Re: using --remove-older-than takes a very long time?

Derek Atkins-2
In reply to this post by EricZolf
"Eric L." <[hidden email]> writes:

> I don't think so but it probably depends on how many backups it has to
> remove.

So that first run it had to remove 7.  Normally it's just removing 1 (as
it runs once per day).  However the difference in runtime between
removing 7 and removing 1 was definitely not 7x the runtime.  But it
definitely spends more time on the remove-old-backups than it does
backing up the day's changes.

[snip]
> A profiling as such is possible at python level but I would start with
> just calling the removal option with `-v9`, it should give you/us a
> first hint as the date/times for each action are logged.

I now have some multi-megabyte emails detailing the last few days of
processing (modulo some VM system crashes that happened over the
weekend).  There are dozens of operations per second (or more -- I
haven't looked closely), but still thousands of operations as it runs
for hours.  Frankly I'm not sure what I am looking at/for here.

I'm happy to share a log email (privately) if someone wants to look at
it with me?

-derek

--
       Derek Atkins                 617-623-3745
       [hidden email]             www.ihtfp.com
       Computer and Internet Security Consultant

Reply | Threaded
Open this post in threaded view
|

Re: using --remove-older-than takes a very long time?

Derek Atkins-2
Eric (et al),

Last night's backup session ran.  It started at 1am, finished a backup
of a server at 3:30 (which is to be expected; it has a lot of data), and
then proceeded to take the next NINE (9) HOURS to delete a single
increment.

I am happy to forward the data, but it's a relatively large email
(multiple megabytes) so I'm hesitant to send it to the list or to anyone
else without prior approval.

Anyone else willing to look at this?

I'll note that I have not completely ruled out filesystem slowness on
deletions.  The underlying file system is EncFS/Fuse over NFS over ZFS,
but raw Filesystem I/O tests show good throughput in general.

-derek

Derek Atkins <[hidden email]> writes:

> "Eric L." <[hidden email]> writes:
>
>> I don't think so but it probably depends on how many backups it has to
>> remove.
>
> So that first run it had to remove 7.  Normally it's just removing 1 (as
> it runs once per day).  However the difference in runtime between
> removing 7 and removing 1 was definitely not 7x the runtime.  But it
> definitely spends more time on the remove-old-backups than it does
> backing up the day's changes.
>
> [snip]
>> A profiling as such is possible at python level but I would start with
>> just calling the removal option with `-v9`, it should give you/us a
>> first hint as the date/times for each action are logged.
>
> I now have some multi-megabyte emails detailing the last few days of
> processing (modulo some VM system crashes that happened over the
> weekend).  There are dozens of operations per second (or more -- I
> haven't looked closely), but still thousands of operations as it runs
> for hours.  Frankly I'm not sure what I am looking at/for here.
>
> I'm happy to share a log email (privately) if someone wants to look at
> it with me?
>
> -derek

--
       Derek Atkins                 617-623-3745
       [hidden email]             www.ihtfp.com
       Computer and Internet Security Consultant

Reply | Threaded
Open this post in threaded view
|

Re: using --remove-older-than takes a very long time?

EricZolf
Hi Derek,

sorry for not reacting earlier, just overburdened with other things...
Also absolutely sorry for my first answer, it wasn't going in the right
direction.

On 21/02/2020 18:21, Derek Atkins wrote:

> Eric (et al),
>
> Last night's backup session ran.  It started at 1am, finished a backup
> of a server at 3:30 (which is to be expected; it has a lot of data), and
> then proceeded to take the next NINE (9) HOURS to delete a single
> increment.
>
> I am happy to forward the data, but it's a relatively large email
> (multiple megabytes) so I'm hesitant to send it to the list or to anyone
> else without prior approval.
>
> Anyone else willing to look at this?

Given that nothing fails but is "only" slow (and I realize that "only
slow" is also an issue), I hesitate to spend too much time on it for
now. I would prefer to spend the time on overall performance
improvements (if any is possible) once we've simplified the code.

> I'll note that I have not completely ruled out filesystem slowness on
> deletions.  The underlying file system is EncFS/Fuse over NFS over ZFS,
> but raw Filesystem I/O tests show good throughput in general.

I wouldn't expect that throughput is the main issue but latency, as it's
more a question of many files being removed than few big files being
transferred. Also the raw value isn't really interesting (assuming most
of the operations done by rdiff-backup happen in cache).

With the new version of rdiff-backup, you could try the `--no-fsync`
option, at the slight risk of losing data, if something goes wrong at
the wrong time.

Check those times on a normal HDD:

$ time for i in $(seq 1 100); do touch $i; sync; done

real 0m7.212s
user 0m0.191s
sys 0m0.435s

$ time for i in $(seq 1 100); do rm $i; sync; done
real 0m7.789s
user 0m0.209s
sys 0m0.424s

$ time for i in $(seq 1 100); do touch $i; done
real 0m0.091s
user 0m0.044s
sys 0m0.054s

$ time for i in $(seq 1 100); do rm $i; done
real 0m0.097s
user 0m0.042s
sys 0m0.062s

We talk about a factor of 70 to 80 between synced and non-synced, and
it's almost only latency we're talking about here, which I'd expect to
be even worse in your case, with user space and network involved.

Perhaps your setup would be more performant and with less risk if you'd
use the remote features of rdiff-backup instead of using NFS, i.e. have
rdiff-backup being installed and reachable over SSH where you have your
NFS server running (if at all possible, and without promise that it
improves performance).

Hope this helps, because I don't think the issue is solely in the code
but rather in your setup. We can look at improving performance but I
don't expect any quick win (beside the slightly dangerous --no-fsync
option).

KR, Eric

>
> -derek
>
> Derek Atkins <[hidden email]> writes:
>
>> "Eric L." <[hidden email]> writes:
>>
>>> I don't think so but it probably depends on how many backups it has to
>>> remove.
>>
>> So that first run it had to remove 7.  Normally it's just removing 1 (as
>> it runs once per day).  However the difference in runtime between
>> removing 7 and removing 1 was definitely not 7x the runtime.  But it
>> definitely spends more time on the remove-old-backups than it does
>> backing up the day's changes.
>>
>> [snip]
>>> A profiling as such is possible at python level but I would start with
>>> just calling the removal option with `-v9`, it should give you/us a
>>> first hint as the date/times for each action are logged.
>>
>> I now have some multi-megabyte emails detailing the last few days of
>> processing (modulo some VM system crashes that happened over the
>> weekend).  There are dozens of operations per second (or more -- I
>> haven't looked closely), but still thousands of operations as it runs
>> for hours.  Frankly I'm not sure what I am looking at/for here.
>>
>> I'm happy to share a log email (privately) if someone wants to look at
>> it with me?
>>
>> -derek
>


Reply | Threaded
Open this post in threaded view
|

Re: using --remove-older-than takes a very long time?

Derek Atkins-2
HI,

On Sat, February 22, 2020 5:37 am, EricZolf wrote:
> Hi Derek,
>
> sorry for not reacting earlier, just overburdened with other things...
> Also absolutely sorry for my first answer, it wasn't going in the right
> direction.

No worries...  Should I turn off -v9?

> On 21/02/2020 18:21, Derek Atkins wrote:
[snip]
> Given that nothing fails but is "only" slow (and I realize that "only
> slow" is also an issue), I hesitate to spend too much time on it for
> now. I would prefer to spend the time on overall performance
> improvements (if any is possible) once we've simplified the code.

I understand.  I suspect that I could use rdiff-backup 1.2.8 to talk to
the host and then rdiff-backup-2/3 to do the removals, right?  That should
work?

>> I'll note that I have not completely ruled out filesystem slowness on
>> deletions.  The underlying file system is EncFS/Fuse over NFS over ZFS,
>> but raw Filesystem I/O tests show good throughput in general.
>
> I wouldn't expect that throughput is the main issue but latency, as it's
> more a question of many files being removed than few big files being
> transferred. Also the raw value isn't really interesting (assuming most
> of the operations done by rdiff-backup happen in cache).

Yes, you're right..  Throughput and latency are both potential issues.  I
ran a bonnie++ experiment last night (which ended 4 seconds into the
rdiff-backup start) to test out the system.  The output is below.

> With the new version of rdiff-backup, you could try the `--no-fsync`
> option, at the slight risk of losing data, if something goes wrong at
> the wrong time.
>
> Check those times on a normal HDD:
>
> $ time for i in $(seq 1 100); do touch $i; sync; done
>
> real 0m7.212s
> user 0m0.191s
> sys 0m0.435s
>
> $ time for i in $(seq 1 100); do rm $i; sync; done
> real 0m7.789s
> user 0m0.209s
> sys 0m0.424s
>
> $ time for i in $(seq 1 100); do touch $i; done
> real 0m0.091s
> user 0m0.044s
> sys 0m0.054s
>
> $ time for i in $(seq 1 100); do rm $i; done
> real 0m0.097s
> user 0m0.042s
> sys 0m0.062s

I can test this out tonight once today's backups are complete.

> We talk about a factor of 70 to 80 between synced and non-synced, and
> it's almost only latency we're talking about here, which I'd expect to
> be even worse in your case, with user space and network involved.
>
> Perhaps your setup would be more performant and with less risk if you'd
> use the remote features of rdiff-backup instead of using NFS, i.e. have
> rdiff-backup being installed and reachable over SSH where you have your
> NFS server running (if at all possible, and without promise that it
> improves performance).

Here is how I run:

[FreeNAS] -- NFS -- [Backup Server] -- SSH -- [Server being Backed up]

FreeNAS uses ZFS as the underlying storage and the Backup server mounts
that as an NFS share.  Then on the backup server it uses EncFS so what
gets stored on FreeNAS is encrypted.  Then it uses rdiff-backup to connect
to the servers being backed up and "pulls" the backup from the target
server.  Finally, after the backup is done, the backup server removes any
increments older than 1 year.

> Hope this helps, because I don't think the issue is solely in the code
> but rather in your setup. We can look at improving performance but I
> don't expect any quick win (beside the slightly dangerous --no-fsync
> option).

I certainly have not ruled this out.  Older tests I did a while ago showed
good performance, but possibly not the "rm'ing thousands of files" use
case.

Thanks,

-derek

> KR, Eric

The Bonnie++ output:

date; time bonnie++ -r 8192 ; date
Fri 21 Feb 2020 09:06:21 PM EST
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.98       ------Sequential Output------ --Sequential Input-
--Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block--
--Seeks--
Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
/sec %CP
backup          16G    0k  18 3835k  17 3283k  15  273k  85 54.8m  11
482.3  33
Latency             12246ms     859ms    1633ms    1577ms   10015us     403ms
Version  1.98       ------Sequential Create------ --------Random
Create--------
backup              -Create-- --Read--- -Delete-- -Create-- --Read---
-Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
/sec %CP
                 16     0   8     0   9     0   9     0   8     0   9    
0   8
Latency             10227us    2896us   96684us   11198us    3271us   10385us
1.98,1.98,backup,1,1582413653,16G,,8192,5,0,18,3835,17,3283,15,273,85,56122,11,482.3,33,16,,,,,154,8,685,9,398,9,161,8,724,9,417,8,12246ms,859ms,1633ms,1577ms,10015us,403ms,10227us,2896us,96684us,11198us,3271us,10385us

real 233m43.383s
user 2m50.934s
sys 36m6.502s
Sat 22 Feb 2020 01:00:04 AM EST


--
       Derek Atkins                 617-623-3745
       [hidden email]             www.ihtfp.com
       Computer and Internet Security Consultant


Reply | Threaded
Open this post in threaded view
|

Re: using --remove-older-than takes a very long time?

Derek Atkins-2
In reply to this post by EricZolf
Hi Eric,

I got more tests run tonight.

Last night I ran a bonnie++ test through encfs.  Tonight I ran it directly
on NFS.  Boy, what a difference that made!  The test through encfs took
233 minutes of real time to finish whereas directly into NFS it finished
in 11 minutes.  Maybe I need to re-think my backup encryption solution?

Then I ran your sets of tests, once through encfs and once directly on
NFS.  The encfs tests were slower (of course), but there was a good 2x
speed difference when including the sync vs not.  As for encfs vs raw-NFS,
there appears to be about a 25% speed difference there.  But this test,
unlike bonnie++, was able to use the cache whereas the bonnie++ tests use
a 16G dataset when there is only 8G of RAM on the system.

All test results are included here:

TESTING THROUGH ENCFS:

date; time bonnie++ -r 8192 ; date
Fri 21 Feb 2020 09:06:21 PM EST
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.98       ------Sequential Output------ --Sequential Input-
--Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block--
--Seeks--
Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
/sec %CP
backup          16G    0k  18 3835k  17 3283k  15  273k  85 54.8m  11
482.3  33
Latency             12246ms     859ms    1633ms    1577ms   10015us     403ms
Version  1.98       ------Sequential Create------ --------Random
Create--------
backup              -Create-- --Read--- -Delete-- -Create-- --Read---
-Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
/sec %CP
                 16     0   8     0   9     0   9     0   8     0   9    
0   8
Latency             10227us    2896us   96684us   11198us    3271us   10385us
1.98,1.98,backup,1,1582413653,16G,,8192,5,0,18,3835,17,3283,15,273,85,56122,11,482.3,33,16,,,,,154,8,685,9,398,9,161,8,724,9,417,8,12246ms,859ms,1633ms,1577ms,10015us,403ms,10227us,2896us,96684us,11198us,3271us,10385us

real 233m43.383s
user 2m50.934s
sys 36m6.502s
Sat 22 Feb 2020 01:00:04 AM EST

--------------------------------------------------
TESTING DIRECTLY INTO NFS:

( cd /net/freenas/mnt/freenas-0/backups/TEST; date; time bonnie++ -r 8192
-d /net/freenas/mnt/freenas-0/backups/TEST ; date )
Sat 22 Feb 2020 08:23:20 PM EST
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.98       ------Sequential Output------ --Sequential Input-
--Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block--
--Seeks--
Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
/sec %CP
backup          16G  239k  98  106m  34 54.2m  24  247k  99  117m  23
2104  70
Latency             78483us   44241us    2014ms    2122ms     192ms    8766us
Version  1.98       ------Sequential Create------ --------Random
Create--------
backup              -Create-- --Read--- -Delete-- -Create-- --Read---
-Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
/sec %CP
                 16     0  29     0  44     0  30     0  29     0  20    
0  20
Latency              4981us   20428us    1313us   19010us    1701us    1910us
1.98,1.98,backup,1,1582341907,16G,,8192,5,239,98,108291,34,55517,24,247,99,119606,23,2104,70,16,,,,,760,29,32324,44,3098,30,1039,29,1130,20,1066,20,78483us,44241us,2014ms,2122ms,192ms,8766us,4981us,20428us,1313us,19010us,1701us,1910us

real 11m44.364s
user 0m43.998s
sys 2m43.481s
Sat 22 Feb 2020 08:35:05 PM EST

--------------------------------------------------
TESTING THROUGH ENCFS:

$ time for i in $(seq 1 100); do touch $i; sync; done

real 0m2.076s
user 0m0.281s
sys 0m1.006s
$ time for i in $(seq 1 100); do rm $i; sync; done

real 0m1.390s
user 0m0.267s
sys 0m0.802s
$ time for i in $(seq 1 100); do touch $i; done

real 0m1.349s
user 0m0.148s
sys 0m0.446s
$ time for i in $(seq 1 100); do rm $i; done

real 0m0.930s
user 0m0.160s
sys 0m0.476s

--------------------------------------------------
TESTING DIRECTLY INTO NFS:

$ time for i in $(seq 1 100); do touch $i; sync; done

real 0m1.469s
user 0m0.296s
sys 0m0.947s
$ time for i in $(seq 1 100); do rm $i; sync; done

real 0m1.007s
user 0m0.230s
sys 0m0.588s
$ time for i in $(seq 1 100); do touch $i; done

real 0m0.852s
user 0m0.151s
sys 0m0.498s
$ time for i in $(seq 1 100); do rm $i; done

real 0m0.691s
user 0m0.106s
sys 0m0.371s


--
       Derek Atkins                 617-623-3745
       [hidden email]             www.ihtfp.com
       Computer and Internet Security Consultant