Here are the reasons for creating the BON8 data format:

Exact translation between JSON and BON8.
A canonical representation to allow signing of messages.
No extensibility, so every parser can handle every BON8 message.
Low overhead.
Fast encoding and decoding.
Self-terminating: a valid BON8 message can be decoded without knowing the buffer length.

Encoding

The idea of this encoding is that strings are encoded as UTF-8.

UTF-8 has many code units and code-unit sequences that can never appear in valid text. BON8 uses these invalid sequences to encode the other kinds of data.

All of these invalid UTF-8 start-code-units have been assigned below, which reduces the ability to extend the format and improves compression of the data.

message := value;
 
value := number | boolean | null | string | array | object;
 
number := integer | float;
 
integer := positive_integer | negative_integer | signed_integer;
 
float := binary32 | binary64;
 
boolean := true | false;
 
string := character character* | character* eos;
 
character := utf8_1 | utf8_2 | utf8_3 | utf8_4;
 
array := start_array_with_count value*
       | start_array value* eoc;
 
object := start_object_with_count (string value)*
        | start_object (string value)* eoc;

This table gives an overview on the encoding:

Code Sequence	#	Type	Bits	Min	Max
00-7f	1	UTF-8 One byte sequence	7	U+0000	U+007f
c2-df 80-bf	2	UTF-8 Two byte sequence	11	U+0080	U+07ff
e0-ef 80-bf byte*1	3	UTF-8 Three byte sequence	16	U+0800	U+ffff
f0-f7 80-bf byte*2	4	UTF-8 Four byte sequence	21	U+10000	U+10ffff
80-84		Start array with count		0	4
85		Start array			
86-8a		Start object with count		0	4
8b		Start object			
8c byte*4	5	Signed integer	32	-2^31	2^31 - 1
8d byte*8	9	Signed integer	64	-2^63	2^63 - 1
8e byte*4	5	Floating point binary32	32		
8f byte*8	9	Floating point binary64	64		
90-b7	1	Positive integer		0	39
b8-c1	1	Negative integer		-1	-10
c2-df 00-7f	2	Positive integer (30 * 128)		40	3879
e0-ef 00-7f byte*1	3	Positive integer	19	3880	528167
f0-f7 00-7f byte*2	4	Positive integer	26	528168	67637031
c2-df c0-ff	2	Negative integer (30 * 64)		-11	-1930
e0-ef c0-ff byte*1	3	Negative integer	18	-1931	-264074
f0-f7 c0-ff byte*2	4	Negative integer	25	-264075	-33818506
f8	1	False			
f9	1	True			
fa	1	null			
fb	1	-1.0			
fc	1	+0.0			
fd	1	+1.0			
fe	1	End Of Container (eoc)			
ff	1	End Of String (eos)			
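
As a rough illustration of how a decoder might dispatch on the table above, the sketch below names the kind of item that starts at a given byte. The function name `classify` and the returned labels are hypothetical and not part of BON8; for lead bytes c2-f7 the second byte decides between a UTF-8 character and a multi-byte integer, as the table describes.

```python
from typing import Optional


def classify(first: int, second: Optional[int] = None) -> str:
    """Name the kind of BON8 item that starts with the given byte(s).

    `classify` and its labels are illustrative, not part of the format.
    """
    if first <= 0x7F:
        return "string"                      # one-byte UTF-8 character
    if first <= 0x84:
        return "array with count"
    if first == 0x85:
        return "array"
    if first <= 0x8A:
        return "object with count"
    if first == 0x8B:
        return "object"
    if first <= 0x8D:
        return "signed integer"              # 8c: 32 bit, 8d: 64 bit
    if first <= 0x8F:
        return "float"                       # 8e: binary32, 8f: binary64
    if first <= 0xB7:
        return "positive integer"            # one byte, 0..39
    if first <= 0xC1:
        return "negative integer"            # one byte, -1..-10
    if first <= 0xF7:
        # c2-f7: the second byte separates text from multi-byte integers.
        if second is None:
            raise ValueError("a second byte is needed to classify this item")
        if second <= 0x7F:
            return "positive integer"
        if second <= 0xBF:
            return "string"                  # valid UTF-8 continuation byte
        return "negative integer"
    return {
        0xF8: "false", 0xF9: "true", 0xFA: "null",
        0xFB: "float", 0xFC: "float", 0xFD: "float",
        0xFE: "end of container", 0xFF: "end of string",
    }[first]
```

For example, `classify(0xC3, 0xA9)` reports a string (the UTF-8 encoding of "é"), while `classify(0xC3, 0x10)` reports a positive integer.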

Extra rules

The rules below ensure minimum message size and consistency, which allows for cryptographic signing of messages. All encoders MUST follow these rules.

A message is a single value. Most often this value is of type Object or type Array.
A string does not need to be terminated with an eos (0xff) if its end can be determined in another way.
A string MUST end with eos (0xff) when:
- the string is empty,
- the next byte in the message starts another string,
- there are no more bytes left in the message.
Strings MUST be valid UTF-8.
Strings MAY contain any Unicode code-point between U+0000 and U+10ffff.
Integers and floating point numbers are stored most significant bits first (big endian).
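
The string-termination and byte-order rules above can be sketched as follows. This is a minimal illustration, not a reference implementation: the names `encode_string`, `encode_int32` and the `END_OF_MESSAGE` sentinel are hypothetical, the caller is assumed to know which value follows the string, and the 32-bit signed integer is assumed to be two's complement.

```python
import struct

# Hypothetical sentinel meaning "nothing follows this value in the message".
END_OF_MESSAGE = object()


def encode_string(text: str, following: object) -> bytes:
    """Encode a string, adding eos (0xff) only when the rules require it.

    `following` is the next value in the message, or END_OF_MESSAGE.
    """
    data = text.encode("utf-8")
    needs_eos = (
        text == ""                      # the string is empty
        or isinstance(following, str)   # the next byte starts another string
        or following is END_OF_MESSAGE  # no more bytes left in the message
    )
    return data + b"\xff" if needs_eos else data


def encode_int32(value: int) -> bytes:
    """Encode the 8c form: code byte, then 4 bytes, most significant first.

    Assumes two's complement. Shows byte order only; a conforming encoder
    would pick a shorter form when the value fits one.
    """
    return b"\x8c" + struct.pack(">i", value)
```

With these definitions, `encode_string("ab", 1)` yields `61 62`, while `encode_string("ab", "bc")` yields `61 62 ff`.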

Canonical Representation

The rules below ensure a minimal message size and a consistent encoding across implementations. The canonical representation is consistent enough to be used for cryptographic signing of data.

Unicode strings MUST be in Normalization Form C (NFC).
Floating point negative zero, infinity and NaN MUST be encoded as binary32.
NaN should be encoded as 0x7f800001.
Floating point numbers that can be converted to binary32 without loss of precision or range MUST be encoded as binary32.
Strings, Integers, Arrays and Objects MUST be encoded with the fewest possible bytes.
The keys of an Object MUST be lexically ordered based on UTF-8 code-units.
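
A possible way to apply the floating point and key-ordering rules is sketched below. The function names are hypothetical. The sketch also assumes that the one-byte codes for -1.0, +0.0 and +1.0 from the table take precedence over binary32, which the rules above do not state explicitly.

```python
import math
import struct
import unicodedata
from typing import Iterable, List


def fits_binary32(x: float) -> bool:
    """True if x survives a round trip through binary32 unchanged."""
    try:
        return struct.unpack(">f", struct.pack(">f", x))[0] == x
    except OverflowError:               # finite but out of binary32 range
        return False


def encode_float(x: float) -> bytes:
    if math.isnan(x):
        return b"\x8e" + bytes.fromhex("7f800001")   # the fixed NaN pattern
    # Assumption: the one-byte codes are preferred where they apply.
    if x == -1.0:
        return b"\xfb"
    if x == 0.0 and math.copysign(1.0, x) > 0:       # +0.0 only, not -0.0
        return b"\xfc"
    if x == 1.0:
        return b"\xfd"
    if fits_binary32(x):                             # includes -0.0 and infinity
        return b"\x8e" + struct.pack(">f", x)
    return b"\x8f" + struct.pack(">d", x)


def canonical_keys(keys: Iterable[str]) -> List[str]:
    """Normalize keys to NFC and order them by their UTF-8 code units."""
    normalized = [unicodedata.normalize("NFC", k) for k in keys]
    return sorted(normalized, key=lambda k: k.encode("utf-8"))
```

Here `encode_float(0.5)` gives the five bytes `8e 3f 00 00 00`, whereas `encode_float(0.1)` needs the nine-byte binary64 form because 0.1 is not exactly representable in binary32.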

Examples

Single string

"ab"

A string at the end of a message needs to be terminated.

'a' 'b' eos

0x61 0x62 0xff

Array with two strings

["ab", "bc"]

The first string needs to be terminated to differentiate it from the second.
The array is not terminated because it has a count.
The second string is at the end of the message so it must be terminated.

[2 'a' 'b' eos 'b' 'c' eos

0x82 0x61 0x62 0xff 0x62 0x63 0xff

Array with five strings

["a", "b", "c", "d", "e"]

All but the last string need to be terminated to differentiate them from the next string.
The array is terminated because it does not have a count.
The last string is not terminated because the array terminator is a natural ending.

[ 'a' eos 'b' eos 'c' eos 'd' eos 'e' ]

0x85 0x61 0xff 0x62 0xff 0x63 0xff 0x64 0xff 0x65 0xfe

An object with two integer values

{"ab": 1, "bc": 2}

The strings are not terminated because an integer is a natural ending of a string.
The object is not terminated because it has a count.
The message ends because the object ends.

{2 'a' 'b' 1 'b' 'c' 2

0x88 0x61 0x62 0x91 0x62 0x63 0x92
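
The integer bytes 0x91 and 0x92 in this example follow the one-byte rows of the encoding table. A minimal decoding sketch is shown below; the function name is hypothetical, and the mapping of b8 to -1 through c1 to -10 is an assumption read from the Min and Max columns of the table.

```python
def decode_one_byte_integer(byte: int) -> int:
    """Decode the one-byte integer forms 90-b7 (0..39) and b8-c1 (-1..-10)."""
    if 0x90 <= byte <= 0xB7:
        return byte - 0x90              # 0x90 is 0, 0xb7 is 39
    if 0xB8 <= byte <= 0xC1:
        # Assumption: b8 maps to -1 and values descend to -10 at c1.
        return -(byte - 0xB8) - 1
    raise ValueError("not a one-byte integer code")
```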

Nested array with strings

{"a": ["b", "c"], "d": 1}

The strings "b" and "c" need to be terminated because the following byte is the start of a string.
Neither the object nor the array has a terminator because both have a count.
The strings "a" and "d" are followed by a non-string value, so they do not need to be terminated.

{2 'a' [2 'b' eos 'c' eos 'd' 1

0x88 0x61 0x82 0x62 0xff 0x63 0xff 0x64 0x91

Object with empty string key

{"": 1, "a": 2}

The first string is empty so it will be terminated.
The second string is followed directly by an integer, so it is naturally terminated.

{2 eos 1 'a' 2

0x88 0xff 0x91 0x61 0x92
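
The examples can be checked with a small encoder. The sketch below is illustrative only: the names `tokens` and `encode` are hypothetical, and it supports just strings, booleans, null, integers from 0 to 39, arrays and objects. It assumes that containers with at most four items use the counted form, and it decides string termination in a second pass by looking at what follows each string.

```python
from typing import Any, Iterator, Tuple

Token = Tuple[str, bytes]   # ("str", utf8 bytes) or ("raw", code bytes)


def tokens(value: Any) -> Iterator[Token]:
    """Flatten a value into tokens; string termination is decided later."""
    if isinstance(value, str):
        yield ("str", value.encode("utf-8"))
    elif value is None:
        yield ("raw", b"\xfa")
    elif isinstance(value, bool):           # must be tested before int
        yield ("raw", b"\xf9" if value else b"\xf8")
    elif isinstance(value, int) and 0 <= value <= 39:
        yield ("raw", bytes([0x90 + value]))
    elif isinstance(value, list):
        counted = len(value) <= 4
        yield ("raw", bytes([0x80 + len(value)]) if counted else b"\x85")
        for item in value:
            yield from tokens(item)
        if not counted:
            yield ("raw", b"\xfe")          # eoc
    elif isinstance(value, dict):
        # Keys are ordered by their UTF-8 code units.
        items = sorted(value.items(), key=lambda kv: kv[0].encode("utf-8"))
        counted = len(items) <= 4
        yield ("raw", bytes([0x86 + len(items)]) if counted else b"\x8b")
        for key, item in items:
            yield ("str", key.encode("utf-8"))
            yield from tokens(item)
        if not counted:
            yield ("raw", b"\xfe")          # eoc
    else:
        raise TypeError("value not supported by this sketch")


def encode(value: Any) -> bytes:
    toks = list(tokens(value))
    out = bytearray()
    for i, (kind, data) in enumerate(toks):
        out += data
        if kind == "str":
            is_last = i == len(toks) - 1
            # eos when empty, at end of message, or before another string.
            if not data or is_last or toks[i + 1][0] == "str":
                out.append(0xFF)
    return bytes(out)
```

For instance, `encode(["ab", "bc"]).hex(" ")` gives `82 61 62 ff 62 63 ff`, and `encode({"": 1, "a": 2}).hex(" ")` gives `88 ff 91 61 92`.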