serialization format

serialization format

Markus Wanner-2
Hi,

I'd like to get some feedback regarding some ideas around the
serialization format used for storage and exchange of data in monotone.
Currently, we're mostly using basic_io (for revisions, manifests, certs,
AFAIK even for automate).

Three things bug me about basic_io:

 * while well readable, it's a custom format, not used anywhere else

 * it's flat and cannot represent nested structures

 * it cannot handle binary data (therefore monotone is spending quite a
   bit of time converting between hex and raw data (mostly revision
   ids))
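
To put a number on the hex overhead, here is a minimal Python sketch. It assumes 160-bit (SHA-1) ids; the payload string is made up and only the digest sizes matter.

```python
import hashlib

# Hypothetical payload; only the resulting digest sizes are of interest.
raw_id = hashlib.sha1(b"example revision text").digest()  # raw binary id
hex_id = raw_id.hex()                                      # what a text-only format stores

print(len(raw_id))  # 20 bytes
print(len(hex_id))  # 40 characters

# Every read and write of such an id costs one conversion.
assert bytes.fromhex(hex_id) == raw_id
```

So every id stored in a hex-only format is twice as large as its binary form.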


There are plenty of alternatives when considering a binary format: good
old ASN.1, Google Protocol Buffers, MessagePack, Blink, etc...

Human readable alternatives (which would at least eliminate the first
two concerns) might be: JSON, YAML, or (bear with me) even XML. But for
hashes and such we need a canonical format. And none of those three
remains readable in any of the canonical forms I've seen so far.


At the moment, the most important question seems to be: how much do you
value the human readable representation? How about a binary format that
you can easily transform to and from a human readable one?

I appreciate your feedback.

Regards

Markus Wanner


_______________________________________________
Monotone-devel mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/monotone-devel


Re: serialization format

Stephen Leake-3
Markus Wanner <[hidden email]> writes:

> Hi,
>
> I'd like to get some feedback regarding some ideas around the
> serialization format used for storage and exchange of data in monotone.
> Currently, we're mostly using basic_io (for revisions, manifests, certs,
> AFAIK even for automate).
>
> Three things bug me about basic_io:
>
>  * while well readable, it's a custom format, not used anywhere else
>
>  * it's flat and cannot represent nested structures
>
>  * it cannot handle binary data (therefore monotone is spending quite a
>    bit of time converting between hex and raw data (mostly revision
>    ids))
>
>
> There are plenty of alternatives when considering a binary format: good
> old ASN.1, Google Protocol Buffers, MessagePack, Blink, etc...
>
> Human readable alternatives (which would at least eliminate the first
> two concerns) might be: JSON, YAML, or (bear with me) even XML. But for
> hashes and such we need a canonical format. And none of those three
> remains readable in any of the canonical forms I've seen so far.
>
>
> At the moment, the most important question seems to be: how much do you
> value the human readable representation? How about a binary format that
> you can easily transform to and from a human readable one?

Human readable makes testing and developing new features much easier. If
we use binary, we will need a separate tool that translates that to
readable, which is then another source of bugs (or the same source, just
in a different place).

Unless you are planning major work on monotone, it's not worth changing
from basic_io.

--
-- Stephe


Re: serialization format

Markus Wanner-2
Hello Stephen,

thanks for your feedback.

On 04/04/2016 06:58 PM, Stephen Leake wrote:
> Human readable makes testing and developing new features much easier. If
> we use binary, we will need a separate tool that translates that to
> readable, which is then another source of bugs (or the same source, just
> in a different place).

Yeah, that's a point. However, I'd also argue that we should target the
user and not the developer. And from a user's perspective, isn't
monotone the very tool that does that kind of translation?

Or put another way: Do *users* really care what serialization format
monotone uses underneath?

Regards

Markus Wanner




Re: serialization format

Stephen Leake-3
Markus Wanner <[hidden email]> writes:

> Hello Stephen,
>
> thanks for your feedback.
>
> On 04/04/2016 06:58 PM, Stephen Leake wrote:
>> Human readable makes testing and developing new features much easier. If
>> we use binary, we will need a separate tool that translates that to
>> readable, which is then another source of bugs (or the same source, just
>> in a different place).
>
> Yeah, that's a point. However, I'd also argue that we should target the
> user and not the developer. And from a user's perspective, isn't
> monotone the very tool that does that kind of translation?
>
> Or put another way: Do *users* really care what serialization format
> monotone uses underneath?

No, users don't care (as long as the tools work).

Caveat: if they have to compose an email with the Monotone output (to
send it somewhere), ASCII text is safer than binary.

But that means they have no opinion on basic_io vs json, either.

Unless there are _other_ tools (not provided by monotone) that users
could use with Monotone output if it were in a format other than basic_io.


--
-- Stephe


Re: serialization format

Ludovic Brenta-2
In reply to this post by Markus Wanner-2
Markus Wanner <[hidden email]> writes:

> Hello Stephen,
>
> thanks for your feedback.
>
> On 04/04/2016 06:58 PM, Stephen Leake wrote:
>> Human readable makes testing and developing new features much easier. If
>> we use binary, we will need a separate tool that translates that to
>> readable, which is then another source of bugs (or the same source, just
>> in a different place).
>
> Yeah, that's a point. However, I'd also argue that we should target the
> user and not the developer. And from a user's perspective, isn't
> monotone the very tool that does that kind of translation?
>
> Or put another way: Do *users* really care what serialization format
> monotone uses underneath?

No, but they might care about performance.  How much of monotone's time
is actually spent translating between binary and hex?  Is this really a
major performance bottleneck?

--
Ludovic Brenta.


Re: serialization format

Markus Wanner-2
On 04/04/2016 10:02 PM, Ludovic Brenta wrote:
> No, but they might care about performance.  How much of monotone's time
> is actually spent translating between binary and hex?  Is this really a
> major performance bottleneck?

Well, not the conversion between hex and binary itself, no. But the
effect the serialization format has on hashing.

Let's have a look at some perf samples gathered during a functional test
run:

> #
> # Overhead  Shared Object          Symbol                                                                                        
> # ........  .....................  ...................................................................................................................
> #
>      6.80%  libbotan-1.10.so.1.10  [.] _ZN5Botan12SHA_160_SSE210compress_nEPKhm                                                                      
>      3.74%  libc-2.21.so           [.] _int_free                                                                                                      
>      2.60%  libstdc++.so.6.0.21    [.] _ZSt18_Rb_tree_incrementPKSt18_Rb_tree_node_base                                                              
>      2.24%  libstdc++.so.6.0.21    [.] _ZSt29_Rb_tree_insert_and_rebalancebPSt18_Rb_tree_node_baseS0_RS_                                              
>      1.85%  libc-2.21.so           [.] malloc                                                                                                        
>      1.85%  mtn                    [.] _ZNSt8_Rb_treeIN6option6optionI7optionsEES3_St9_IdentityIS3_ESt4lessIS3_ESaIS3_EE7_M_copyINS9_20_Reuse_or_alloc
>      1.73%  mtn                    [.] _ZSt11__set_unionISt23_Rb_tree_const_iteratorIN6option6optionI7optionsEEES5_St15insert_iteratorISt3setIS4_St4le
>      1.66%  ld-2.21.so             [.] do_lookup_x                                                                                                    
>      1.57%  libcrypto.so.1.0.0     [.] DES_encrypt2                                                                                                  
>      1.36%  libc-2.21.so           [.] __memcmp_sse4_1                                                                                                
>      1.17%  mtn                    [.] _ZNSt8_Rb_treeIN6option6optionI7optionsEES3_St9_IdentityIS3_ESt4lessIS3_ESaIS3_EE8_M_eraseEPSt13_Rb_tree_nodeIS
>      1.04%  libc-2.21.so           [.] free                                                                                                          
>      1.03%  libc-2.21.so           [.] malloc_consolidate                                                                                            
>      0.98%  [unknown]              [k] 0xffffffff817f4ca0                                                                                            
>      0.75%  mtn                    [.] _ZNSt17_Function_handlerIFvP7optionsNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEMS0_FvS7_EE10_M_manage
>      0.71%  ld-2.21.so             [.] _dl_lookup_symbol_x                                                                                            
>      0.71%  mtn                    [.] _ZNSt17_Function_handlerIFvP7optionsEMS0_FvvEE10_M_managerERSt9_Any_dataRKS6_St18_Manager_operation            
>      0.67%  libcrypto.so.1.0.0     [.] DES_encrypt1                                                                                                  
>      0.64%  libbotan-1.10.so.1.10  [.] _ZN5Botan16MDx_HashFunction12final_resultEPh                                                                  
>      0.64%  libgmp.so.10.2.0       [.] __gmpn_redc_1                                                                                                  
>      0.62%  [unknown]              [k] 0xffffffff811b24fa                                                                                            
>      0.58%  libc-2.21.so           [.] strlen                                                                                                        
>      0.58%  libbotan-1.10.so.1.10  [.] _ZN5Botan16MDx_HashFunction8add_dataEPKhm                                                                      
>      0.57%  [unknown]              [k] 0xffffffff813d3417                                                                                            
...
>      0.06%  libbotan-1.10.so.1.10  [.] _ZN5Botan10hex_decodeEPhPKcmRmb        
...
>      0.02%  libbotan-1.10.so.1.10  [.] _ZN5Botan10hex_encodeEPcPKhmb


Hashing is probably the single most time-consuming operation here, at
about 8% of the time spent (note that the add_data and final_result
methods are within the top 25 as well).

The CPU time that's used for the actual hex encoding and decoding is
vanishingly small, below 0.1%.
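
The two figures can be checked by adding up the relevant rows. A small Python sketch, with the percentages copied from the perf output above and the symbol names written in demangled form:

```python
# Percentages copied from the perf samples above.
hashing = {
    "Botan::SHA_160_SSE2::compress_n": 6.80,
    "Botan::MDx_HashFunction::final_result": 0.64,
    "Botan::MDx_HashFunction::add_data": 0.58,
}
hex_codec = {
    "Botan::hex_decode": 0.06,
    "Botan::hex_encode": 0.02,
}

hashing_total = round(sum(hashing.values()), 2)
hex_total = round(sum(hex_codec.values()), 2)

print(hashing_total)  # 8.02
print(hex_total)      # 0.08
```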


Now, I'm clearly not into micro optimizations (but rather consider
modifications like using base58 instead of the hex encoding for hashes
presented to the user - an encoding that's certain to consume more CPU
time, not sure how much more, though.)

That said, reducing the amount of data to be hashed, cached and moved
around (in memory, over the network, etc.) sounds like a generally good
idea to me, performance-wise. It's equally clearly a bad idea from a
usability perspective, so there's a balance to strike. That's why I
started this thread.

Given the arguments so far I tend towards a binary encoding, as I think
developers should be able to handle binary data. And if users really
don't care...

Regards

Markus Wanner




Re: serialization format

J Decker
If the structures might mutate over time, something like JSON is pretty brief.
If you have high reliability, sqlite for instance will store a blob
with only \0 for the 0 and \\ for \ ...

which results in a copy or shift of data but only a simple comparison
if '\\', kinda like base 254, sorta :)
Depending on which character occurs least, you could replace <nul> with
<del> or something ...
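
One possible reading of the escaping idea above, as a minimal Python sketch. The `escape` and `unescape` helpers are hypothetical, and this is not a claim about what sqlite actually does internally.

```python
def escape(data: bytes) -> bytes:
    """Escape backslash and NUL so the result contains no NUL byte."""
    # Backslashes must be doubled first, before NUL escapes add new ones.
    return data.replace(b"\\", b"\\\\").replace(b"\x00", b"\\0")


def unescape(data: bytes) -> bytes:
    """Inverse of escape(); assumes well-formed input."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == 0x5C:  # a backslash starts a two-byte escape
            out.append(0x00 if data[i + 1] == 0x30 else 0x5C)
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


sample = bytes(range(256))
escaped = escape(sample)
assert b"\x00" not in escaped
assert unescape(escaped) == sample
```

Only the two escaped byte values grow, so the overhead stays small for typical binary data.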



On Tue, Apr 5, 2016 at 9:25 AM, Markus Wanner <[hidden email]> wrote:

> On 04/04/2016 10:02 PM, Ludovic Brenta wrote:
>> No but they might care about performance.  How much of monotone's time
>> is actually spent translating between binary and hex?  Is this really a
>> major performance bottleneck?
>
> Well, not the conversion between hex and binary itself, no. But the
> effect the serialization format has on hashing.
>
> Let's have a look at some perf samples gathered during a functional test
> run:
>
>> #
>> # Overhead  Shared Object          Symbol
>> # ........  .....................  ...................................................................................................................
>> #
>>      6.80%  libbotan-1.10.so.1.10  [.] _ZN5Botan12SHA_160_SSE210compress_nEPKhm
>>      3.74%  libc-2.21.so           [.] _int_free
>>      2.60%  libstdc++.so.6.0.21    [.] _ZSt18_Rb_tree_incrementPKSt18_Rb_tree_node_base
>>      2.24%  libstdc++.so.6.0.21    [.] _ZSt29_Rb_tree_insert_and_rebalancebPSt18_Rb_tree_node_baseS0_RS_
>>      1.85%  libc-2.21.so           [.] malloc
>>      1.85%  mtn                    [.] _ZNSt8_Rb_treeIN6option6optionI7optionsEES3_St9_IdentityIS3_ESt4lessIS3_ESaIS3_EE7_M_copyINS9_20_Reuse_or_alloc
>>      1.73%  mtn                    [.] _ZSt11__set_unionISt23_Rb_tree_const_iteratorIN6option6optionI7optionsEEES5_St15insert_iteratorISt3setIS4_St4le
>>      1.66%  ld-2.21.so             [.] do_lookup_x
>>      1.57%  libcrypto.so.1.0.0     [.] DES_encrypt2
>>      1.36%  libc-2.21.so           [.] __memcmp_sse4_1
>>      1.17%  mtn                    [.] _ZNSt8_Rb_treeIN6option6optionI7optionsEES3_St9_IdentityIS3_ESt4lessIS3_ESaIS3_EE8_M_eraseEPSt13_Rb_tree_nodeIS
>>      1.04%  libc-2.21.so           [.] free
>>      1.03%  libc-2.21.so           [.] malloc_consolidate
>>      0.98%  [unknown]              [k] 0xffffffff817f4ca0
>>      0.75%  mtn                    [.] _ZNSt17_Function_handlerIFvP7optionsNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEMS0_FvS7_EE10_M_manage
>>      0.71%  ld-2.21.so             [.] _dl_lookup_symbol_x
>>      0.71%  mtn                    [.] _ZNSt17_Function_handlerIFvP7optionsEMS0_FvvEE10_M_managerERSt9_Any_dataRKS6_St18_Manager_operation
>>      0.67%  libcrypto.so.1.0.0     [.] DES_encrypt1
>>      0.64%  libbotan-1.10.so.1.10  [.] _ZN5Botan16MDx_HashFunction12final_resultEPh
>>      0.64%  libgmp.so.10.2.0       [.] __gmpn_redc_1
>>      0.62%  [unknown]              [k] 0xffffffff811b24fa
>>      0.58%  libc-2.21.so           [.] strlen
>>      0.58%  libbotan-1.10.so.1.10  [.] _ZN5Botan16MDx_HashFunction8add_dataEPKhm
>>      0.57%  [unknown]              [k] 0xffffffff813d3417
> ...
>>      0.06%  libbotan-1.10.so.1.10  [.] _ZN5Botan10hex_decodeEPhPKcmRmb
> ...
>>      0.02%  libbotan-1.10.so.1.10  [.] _ZN5Botan10hex_encodeEPcPKhmb
>
>
> Hashing probably is the single most time consuming operation here, with
> about 8% of the time spent (note that the add_data and final_result
> methods are within the top 25 as well).
>
> The CPU time that's used for the actual hex encoding and decoding is
> vanishingly small, below 0.1%.
>
>
> Now, I'm clearly not into micro optimizations (but rather consider
> modifications like using base58 instead of the hex encoding for hashes
> presented to the user - an encoding that's certain to consume more CPU
> time, not sure how much more, though.)
>
> However, reducing the amount of data to be hashed, cached and moved
> around (in memory, network, etc..) sounds like a generally good idea to
> me (performance wise). However, it's equally clearly a bad idea from a
> usability perspective. So there's a balance. That's why I started this
> thread.
>
> Given the arguments so far I tend towards a binary encoding, as I think
> developers should be able to handle binary data. And if users really
> don't care...
>
> Regards
>
> Markus Wanner
>


Re: serialization format

J Decker
Sorry just a sidenote;
err...
encode into utf8 codepoints maybe?  which would expand 0x80-0xFF by 1
character each... and you could violate utf rules and encode a F880
that's a 0 codepoint...

(take a value, that's a codepoint, make a utf8 version of that
value... which for 0x0 to 0x7F is that character... but it provides a
way to escape \0 if that's a concern and why are you encoding it at
all?)


Re: serialization format

Markus Wanner-2
In reply to this post by J Decker
On 04/06/2016 05:26 AM, J Decker wrote:
> If the structures might mutate with time something like json is pretty brief.
> if you have high reliability, sqlite for instance will store a blob
> with only \0 for the 0  and \\ for \ ...

JSON doesn't handle binary well; it's a text format. Usually, base64 is
used for binary data inside JSON - which is neither human readable nor
space efficient.
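
A minimal Python sketch of that point, using the standard `json` and `base64` modules. The 20-byte value is a made-up stand-in for a binary id.

```python
import base64
import json

raw_id = bytes(range(20))  # stand-in for a 20-byte binary id

# json.dumps() rejects raw bytes, so they need a text encoding first.
try:
    json.dumps({"id": raw_id})
    accepted = True
except TypeError:
    accepted = False

b64_text = base64.b64encode(raw_id).decode("ascii")
b64_doc = json.dumps({"id": b64_text})

print(accepted)       # False
print(len(raw_id))    # 20
print(len(b64_text))  # 28
```

The base64 form is about a third larger than the raw bytes and tells a human reader nothing.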

> which results in a copy or shift of data but only a simple comparison
> if '\\'   kinda like base 254 sorta :)
> depending on what character happens least you could replace <nul> for
> <del> or something ...

That's nonsense: according to http://stackoverflow.com/a/1443240, the
JSON spec supports only 94 Unicode characters that can be represented as
one byte (in UTF-8).

Nor is there any canonical *and* human readable variant.

If human readable, I'd currently prefer to try something canonical
that's still valid YAML.

Kind Regards

Markus Wanner



Re: serialization format

Markus Wanner-2
In reply to this post by J Decker
On 04/06/2016 05:44 AM, J Decker wrote:
> encode into utf8 codepoints maybe?  which would expand 0x80-0xFF by 1
> character each... and you could violate utf rules and encode a F880
> that's a 0 codepoint...

You mean for hashes? Hm.. that's an interesting idea, which might get us
a whole new encoding. However, I don't quite think we can use Unicode
there, but should really stick to ASCII.

Even base64 is a bad idea, because it contains '/' and '+' chars, which
are usually treated as separators. But you don't want revision ids to
word-wrap.

With these restrictions, base58 is about as space efficient as you can get.
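
For reference, a minimal Python sketch of base58. It assumes the alphabet and leading-zero convention of the Bitcoin variant; nothing here is taken from monotone's code.

```python
# Bitcoin-style alphabet: no 0, O, I or l, and no '+' or '/'.
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58encode(data: bytes) -> str:
    n = int.from_bytes(data, "big")
    chars = []
    while n > 0:
        n, rem = divmod(n, 58)
        chars.append(B58_ALPHABET[rem])
    # Each leading zero byte is kept as a leading '1'.
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + "".join(reversed(chars))


def b58decode(text: str) -> bytes:
    n = 0
    for ch in text:
        n = n * 58 + B58_ALPHABET.index(ch)
    pad = len(text) - len(text.lstrip("1"))
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return b"\x00" * pad + body


digest = bytes(range(1, 21))          # stand-in for a 20-byte hash
encoded = b58encode(digest)
assert b58decode(encoded) == digest
assert encoded.isalnum()              # safe to double-click, no separators
```

A 20-byte hash needs at most 28 base58 characters, compared to 40 in hex.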

Kind Regards

Markus Wanner



Re: serialization format

Hendrik Boom-2
In reply to this post by Markus Wanner-2
On Wed, Apr 06, 2016 at 07:28:39AM +0200, Markus Wanner wrote:

> On 04/06/2016 05:26 AM, J Decker wrote:
> > If the structures might mutate with time something like json is pretty brief.
> > if you have high reliability, sqlite for instance will store a blob
> > with only \0 for the 0  and \\ for \ ...
>
> JSON doesn't handle binary well; it's a text format. Usually, base64 is
> used for binary data inside JSON - which is neither human readable nor
> space efficient.
>
> > which results in a copy or shift of data but only a simple comparison
> > if '\\'   kinda like base 254 sorta :)
> > depending on what character happens least you could replace <nul> for
> > <del> or something ...
>
> That's nonsense, according to http://stackoverflow.com/a/1443240, the
> JSON spec supports only 94 Unicode characters that can be represented as
> one byte (in UTF-8).
>
> Nor is there any canonical *and* human readable variant.
>
> If human readable, I'd currently prefer to try something canonical
> that's still valid YAML.

And if you want it to be human-readable you'd also want to avoid
visually confusing characters.  No using both 0 and O.  Or using 1, l,
and I.  Even , and . can be hard to distinguish in some fonts.

-- hendrik


Re: serialization format

Markus Wanner-2
On 04/06/2016 02:56 PM, Hendrik Boom wrote:
> And if you want it to be human-readable you'd also want to avoid
> visually confusing characters.  No using both 0 and O.  Or using 1, l,
> and I.  Even , and . can be hard to distinguish in some fonts.

Exactly. Did I already mention base58... ;-)

Regards

Markus Wanner




Re: serialization format

Peter Stirling
I apologise for being late to the party here: Is the goal here to reduce the barrier to entry for automate clients (by using something which has a decent chance of having a parsing library in most languages)?

On 06/04/16 14:15, Markus Wanner wrote:
> On 04/06/2016 02:56 PM, Hendrik Boom wrote:
>> And if you want it to be human-readable you'd also want to avoid
>> visually confusing characters.  No using both 0 and O.  Or using 1, l,
>> and I.  Even , and . can be hard to distinguish in some fonts.
>
> Exactly. Did I already mention base58... ;-)
>
> Regards
>
> Markus Wanner


Re: serialization format

Stephen Leake-3
Peter Stirling <[hidden email]> writes:

> I apologise for being late to the party here: Is the goal here to
> reduce the barrier to entry for automate clients (by using something
> which has a decent chance of having a parsing library in most
> languages)?

That is a reasonable goal.

But only if the current output is preserved as an option; I've got it
working with Emacs. Which is a non-starter; we don't have enough
manpower to maintain two output formats.

--
-- Stephe


Re: serialization format

Stephen Leake-3
Markus Wanner <[hidden email]> writes:

> On 04/07/2016 05:21 PM, Stephen Leake wrote:
>> Peter Stirling <[hidden email]> writes:
>>> I apologise for being late to the party here: Is the goal here to
>>> reduce the barrier to entry for automate clients (by using something
>>> which has a decent chance of having a parsing library in most
>>> languages)?
>
> Yes, that'd be one of the nice properties of a standard format, whether
> it's a binary (e.g. ASN.1) or textual (e.g. YAML) one.
>
>> But only if the current output is preserved as an option; I've got it
>> working with Emacs. Which is a non-starter; we don't have enough
>> manpower to maintain two output formats.
>
> Well, I'm willing to put some effort into it and I certainly value
> backwards compatibility as well, yes. However, there can only be one
> format that monotone uses internally (think pre vs post flag day).

There's a version number in the internal format, so we don't need a flag
day (or maybe that was on a branch; anyway, we can add one). We do need
to maintain both formats for compatibility with old databases.

--
-- Stephe


Re: serialization format

Markus Wanner-2
On 04/07/2016 11:37 PM, Stephen Leake wrote:
> There's a version number in the internal format, so we don't need a flag
> day (or maybe that was on a branch; anyway, we can add one). We do need
> to maintain both formats for compatibility with old databases.

There's a version identifier for things like certs, revs, etc.., yes.
However, in any case, there's a point in time where monotone stops using
the old format and starts to use the new one. We can soften the
migration by prolonging the time between the first release that supports
a feature and the one that activates it. (Note that past flag days had a
pretty narrow window...)

I even thought about a mtn:features attribute (on the root node),
allowing users to switch on features per branch, as they see fit. For
example, I still want atomic certs. That feature will require at least a
new version, if not a new format, for certificates. Certainly something
that monotone-1.1 doesn't understand. So 1.2 might still need to write
the old format/version, at least by default.
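
A rough Python sketch of what "read both, write the old one by default" could look like. All names here (the version tags, the `format-v2` feature flag, the toy payload layouts) are hypothetical and not taken from monotone.

```python
# Hypothetical format tags and parsers; none of these names come from monotone.
def parse_v1(payload: str) -> dict:
    key, value = payload.split(" ", 1)
    return {key: value}


def parse_v2(payload: str) -> dict:
    return dict(item.split("=", 1) for item in payload.split(";"))


PARSERS = {"1": parse_v1, "2": parse_v2}
DEFAULT_WRITE_VERSION = "1"  # keep writing the old version until v2 is activated


def read_record(text: str) -> dict:
    """Dispatch on the version line, so old and new data stay readable."""
    version, payload = text.split("\n", 1)
    if version not in PARSERS:
        raise ValueError(f"unsupported format version: {version!r}")
    return PARSERS[version](payload)


def write_record(data: dict, enabled_features: frozenset = frozenset()) -> str:
    """Write v2 only where the (hypothetical) feature flag is switched on."""
    version = "2" if "format-v2" in enabled_features else DEFAULT_WRITE_VERSION
    if version == "1":
        (key, value), = data.items()  # the toy v1 layout holds one pair
        return f"1\n{key} {value}"
    return "2\n" + ";".join(f"{k}={v}" for k, v in data.items())
```

With something along these lines, a release can understand the new version long before any branch starts writing it.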

Kind Regards

Markus Wanner




Re: serialization format

Hendrik Boom-2
On Thu, Apr 07, 2016 at 11:49:15PM +0200, Markus Wanner wrote:

> On 04/07/2016 11:37 PM, Stephen Leake wrote:
> > There's a version number in the internal format, so we don't need a flag
> > day (or maybe that was on a branch; anyway, we can add one). We do need
> > to maintain both formats for compatibility with old databases.
>
> There's a version identifier for things like certs, revs, etc.., yes.
> However, in any case, there's a point in time where monotone stops using
> the old format and starts to use the new one. We can soften the
> migration by prolonging the time between the first release that supports
> a feature and the one that activates it. (Note that past flag days had a
> pretty narrow window...)

Or perhaps we could use a new-style certificate to certify that each
old-style certificate is valid.  This might even be secure if the
new-style certificates are issued before the old-style ones are
vulnerable to attack.

-- hendrik



Re: serialization format

Markus Wanner-2
In reply to this post by Markus Wanner-2
On 04/08/2016 06:34 AM, J Decker wrote:
> 1) Hashes... once they're serialized, can't 90% of the time they just
> be compared as strings?  (The output of which fits in utf-8 as ascii
> subset esp if you're using 58)

Monotone did that, but migrated to using binary representation for
efficiency. Note that we do hash calculations quite frequently, so we
need to serialize pretty frequently, too.

I rather think we need to migrate to binary all the way and encode the
hash just before displaying it to the user. That doesn't need to scale,
because the user hardly wants to see millions of hashes at once.

> 2) hashes fed through as utf-8 code points (because any value from
> 0-4,000,000 is encodable in a general algorithm, regardless of
> arbitrary restrictions) would yes more often be outside of the 94
> characters, and be encoded characters... but since the output is just
> characters anyway...

So you could come up with some kind of base4000000 encoding, where every
code point would cost 1-4 bytes in utf-8 encoding, i.e. we're talking
about encoding twice. And lose all of the benefits of using a subset of
ASCII.... I don't see the point.
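
A quick Python illustration of the double-encoding cost, for the simplest case of mapping each byte to the code point of the same value:

```python
raw = bytes(range(256))  # every possible byte value, once

# Treat each byte as a code point (U+0000..U+00FF), then encode as UTF-8.
as_utf8 = raw.decode("latin-1").encode("utf-8")

print(len(raw))      # 256
print(len(as_utf8))  # 384: 0x00-0x7F stay one byte, 0x80-0xFF take two each

# The mapping is reversible, but the output is no longer plain ASCII.
assert as_utf8.decode("utf-8").encode("latin-1") == raw
assert not as_utf8.isascii()
```

For uniformly distributed hash bytes that is a 50% size increase on average, and the result still isn't something one could type or read aloud.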

> Yuck, YAML has keywords?
>
> {"cert":"1249123840182028934801az","Idunno":"blah"}

Something like that may be a canonical format, but without any newline,
I don't consider it human readable. I'd rather use something like:

{
  cert: "1249123840182028934801az"
  Idunno: "blah"
}
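
Why the exact layout matters for hashing, as a minimal Python sketch. It uses JSON with sorted keys and fixed separators as one possible canonical form; that choice is an assumption for illustration only, not a proposal for the actual format.

```python
import hashlib
import json

record = {"cert": "1249123840182028934801az", "Idunno": "blah"}

# One possible canonical form: sorted keys, no optional whitespace.
compact = json.dumps(record, sort_keys=True, separators=(",", ":"))
# A readable form of the same data.
pretty = json.dumps(record, sort_keys=True, indent=2)

print(compact)
print(pretty)

# Same data, different bytes, therefore different hashes.
assert json.loads(compact) == json.loads(pretty)
assert hashlib.sha1(compact.encode()).digest() != hashlib.sha1(pretty.encode()).digest()
```

Whichever layout is chosen, every writer has to reproduce it byte for byte, or the hashes won't match.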

> and that itself is in utf-8... which means any value is storable in a
> rune (to borrow a type name from Go)

Well, yes, we're already using utf-8 for commit messages and such. So
any human-readable, textual format would very likely use Unicode and be
encoded as UTF-8.

Regards

Markus Wanner



Re: serialization format

Lapo Luchini
In reply to this post by Markus Wanner-2
Markus Wanner wrote:
> There are plenty of alternatives when considering a binary format: good
> old ASN.1, Google Protocol Buffers, MessagePack, Blink, etc...

As far as internal formats go (which users don't care about) I am
strongly biased* towards ASN.1, it's really flexible and space-efficient
(even in DER, I wouldn't go as far as using the bit-packed ones) and can
be "decoded" pretty easily using Peter Gutmann's dumpasn1 or my
http://lapo.it/asn1js/
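
To give an idea of the DER overhead, a minimal hand-rolled Python sketch of three DER types. The record layout (a name plus a 20-byte value) is made up for illustration; a real implementation would use an ASN.1 library and a proper schema.

```python
def der_length(n: int) -> bytes:
    """DER definite length: short form below 128, long form otherwise."""
    if n < 0x80:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def der_octet_string(data: bytes) -> bytes:
    return b"\x04" + der_length(len(data)) + data      # tag 0x04: OCTET STRING


def der_utf8_string(text: str) -> bytes:
    raw = text.encode("utf-8")
    return b"\x0c" + der_length(len(raw)) + raw        # tag 0x0C: UTF8String


def der_sequence(*items: bytes) -> bytes:
    body = b"".join(items)
    return b"\x30" + der_length(len(body)) + body      # tag 0x30: SEQUENCE


# Hypothetical record: a name and a 20-byte binary value.
record = der_sequence(der_utf8_string("branch"), der_octet_string(bytes(20)))
print(len(record))   # 32 bytes for 26 bytes of payload
print(record.hex())  # paste into an ASN.1 viewer to inspect
```

The framing costs two bytes per element here, and the binary value is stored as is.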

What I doubt mainly is: is migrating from basic_io worth doing?

I mean: changing hash is (will be) more or less necessary, we have to
accept that. Changing the hash output from hex to something like base58
more or less too (256 bit in hex is really long).
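
The lengths in question can be worked out exactly. A small Python sketch (the helper name is made up):

```python
def encoded_length(bits: int, alphabet_size: int) -> int:
    """Maximum number of characters needed for a `bits`-bit value."""
    # Smallest n with alphabet_size**n >= 2**bits, using exact integers.
    n = 0
    capacity = 1
    while capacity < 2 ** bits:
        capacity *= alphabet_size
        n += 1
    return n


for bits in (160, 256):
    print(bits, encoded_length(bits, 16), encoded_length(bits, 58))
# 160 40 28
# 256 64 44
```

So a 256-bit hash in base58 is about as long as today's 160-bit hash in hex.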

But do we want (and do we have the manpower) to change basic_io?

(I'm not saying the answer is "no", I'm just dubious it's worth it)

> At the moment, the most important question seems to be: how much do you
> value the human readable representation? How about a binary format that
> you can easily transform to and from a human readable one?

I'm strongly for the latter; personally I hate human readable formats
because they are overly redundant… while being slower at it. ;-)

cheers,

*: well, part of it is because I'm forced to use it in any cryptographic
format out there, more or less, but once you get the gist of it it's a
much less fearsome beast than I originally thought.

--
Lapo Luchini - http://lapo.it/

