BitcoinTalk

Protocol Buffers for Bitcoin

There has been a discussion going on elsewhere about using protocol buffers for bitcoin. To summarise the advantages:

-> Small encoding
-> Very fast
-> Implementations in loads of languages (so writing new clients becomes a lot simpler)
-> Forwards compatible (indeed, this is most of the point of protocol buffers)
-> Extremely simple to use in code

So initially I would suggest storing the wallet file using protocol buffers. This isn't a breaking change, and it immediately makes the wallet file easier for other programs to parse. Eventually I would hope that bitcoin could use protocol buffers for networking too.

Some people have been suggesting that protocol buffers might be larger than the custom written packet layout. I suspect that actually it would be *smaller* due to some of the clever encoding used in protocol buffers. To resolve this, I think a test is in order: I shall encode a wallet file/network packet using protocol buffers and compare its size with the packets in the current scheme. However, I have no idea what's in a packet. What data is stored in a packet, and in what format?

Re: Protocol Buffers for Bitcoin

Some people have been suggesting that protocol buffers might be larger than the custom written packet layout. I suspect that actually it would be *smaller* due to some of the clever encoding used in protocol buffers.
I agree that it could be smaller; not necessarily because of clever encoding, but because it would allow us to drop reserved bytes and the like.

To resolve this, I think a test is in order: I shall encode a wallet file/network packet using protocol buffers and compare its size with the packets in the current scheme. However, I have no idea what's in a packet. What data is stored in a packet, and in what format?
That would be the hard part, of course. If you want to test with the version packet (not really ideal, since it's only sent once per connection), I've decoded that fully:
http://bitcointalk.org/index.php?topic=231.msg6250#msg6250

Re: Protocol Buffers for Bitcoin

Some people have been suggesting that protocol buffers might be larger than the custom written packet layout. I suspect that actually it would be *smaller* due to some of the clever encoding used in protocol buffers.
I agree that it could be smaller; not necessarily because of clever encoding, but because it would allow us to drop reserved bytes and the like.

That too, although the counter argument people always make to that is that we could do away with reserved bytes anyway. No matter how impractical that would be :/

To resolve this, I think a test is in order: I shall encode a wallet file/network packet using protocol buffers and compare its size with the packets in the current scheme. However, I have no idea what's in a packet. What data is stored in a packet, and in what format?
That would be the hard part, of course. If you want to test with the version packet (not really ideal, since it's only sent once per connection), I've decoded that fully:
http://bitcointalk.org/index.php?topic=231.msg6250#msg6250

I was hoping for a transaction packet or something, but I'll give it a go with that for now. I could also test with the wallet file if anyone has decoded that?

Addendum:

OK, working from this summary of the version packet layout:

Quote
version
    * {0xf9,0xbe,0xb4,0xd9}
    * "version" (0x00 padded)
    * 4 byte message size
    * 4 byte checksum
    * 8 byte nLocalServices (always 1 if !fClient, no idea either what that means)
    * 8 byte timestamp (remember to use network byte order)
    * Remote address (the address this Node thinks he is):
          o nServices - uint64 (8b), still cryptic, don't know the meaning yet
          o pchReserved - (12b): some reserved space, apparently for later IPv6
          o ip - uint (4b)
          o port - unsigned short (2b)
    * Local address (the address this Node sees you under):
          o nServices - uint64 (8b), still cryptic, don't know the meaning yet
          o pchReserved - (12b): some reserved space, apparently for later IPv6
          o ip - uint (4b)
          o port - unsigned short (2b)
    * 8 byte nLocalHostNonce (needed for a handshake, if I'm not mistaken)
    * A subversion string ".0" in my case
    * nBestHeight - int (4b): appears to be the last block number

I created this protocol buffer definition:

Quote
message version
{
   message AddressInfo
   {
      required uint64 nServices = 1;
      required fixed32 ip = 2;
      required uint32 port = 3;
   }

   required uint32 magic = 1;         //0xf9 | 0xbe << 8 | 0xb4 << 16 | 0xd9 << 24
   required uint32 version = 2;
   required int64 checksum = 3;
   required uint64 timestamp = 4;
   required uint64 nLocalServices = 5;

   required AddressInfo Remote = 6;      //the address this node thinks he is
   required AddressInfo Local = 7;      //the address this node sees you under

   required fixed64 nLocalHostNonce = 8;
   required string SubversionString = 9;
   required uint32 nBestHeight = 10;
}

Does that look correct? The only changes I've made are that the indented things in the bullet point list are nested message types, and I've completely dropped the 12 bytes of reserved IPv6 space (since that can easily be added in later, which is the point of protocol buffers). I should point out that I probably haven't picked the best encoding types for all these fields, since that depends upon the values they're likely to store, so in practice the packet could probably be made a little smaller than my tests indicate.

Re: Protocol Buffers for Bitcoin

I used the above protocol buffer (as I said before, it's probably not optimal) and data obtained via http://www.alloscomp.com/bitcoin/version.pys as test data.

Quote
Version: 306
nLocalServices: 1
nTime: 1280487684
addrYou: #.#.#.#:#### (nServices: 1)
addrMe: #.#.#.#:#### (nServices: 1)
nLocalHostNonce: 2359069617775922941
vSubStr: ""
nBestHeight: 71137

The encoded protocol buffer is just 55 bytes, whereas the bitcoin version is 85 0x00 sets (each one representing 2 bytes, I assume). This means that my badly designed protocol buffer is less than half the size of the hand built layout!

Re: Protocol Buffers for Bitcoin

Some people have been suggesting that protocol buffers might be larger than the custom written packet layout. I suspect that actually it would be *smaller* due to some of the clever encoding used in protocol buffers.
I agree that it could be smaller; not necessarily because of clever encoding, but because it would allow us to drop reserved bytes and the like.

Not only does it allow it to drop reserved fields, but it uses ZigZag encoding and some other tricks to keep integers and the like as absolutely small as possible.  So yea, it uses clever encoding. =P  It's also blazingly fast to process!
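
For anyone curious what those tricks look like, here is a rough Python sketch of the two encodings. The function names are mine, not from any library. Note that, as far as I know, ZigZag only applies to fields declared as sint32/sint64; plain uint32/uint64 fields use the varint encoding directly.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 bits per byte)."""
    if value < 0:
        raise ValueError("varints encode non-negative integers only")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def zigzag64(value: int) -> int:
    """Map a signed 64-bit integer to unsigned so small magnitudes stay small."""
    return ((value << 1) ^ (value >> 63)) & 0xFFFFFFFFFFFFFFFF


# Values taken from the test data posted earlier in this thread.
samples = [("nLocalServices", 1), ("Version", 306),
           ("nBestHeight", 71137), ("nTime", 1280487684)]
for name, value in samples:
    print(name, len(encode_varint(value)), "byte(s)")
```

So a field like nLocalServices costs one byte of data instead of eight, while a timestamp costs five. A fixed-width field only wins when the values are usually large, which is why nLocalHostNonce is better off as fixed64.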

Re: Protocol Buffers for Bitcoin

The encoded protocol buffer is just 55 bytes, whereas the bitcoin version is 85 0x00 sets (each one representing 2 bytes, I assume). This means that my badly designed protocol buffer is half the size of the hand built layout!

I realize that you are evangelizing for protocol buffers (and you seem to be doing a very good job of it too, I might add), but I will challenge the idea that hand built data layouts are always bad.

Still, I hope this does give some food for thought and on a practical basis any improvement in the network protocol that shaves off a few bytes is always better.  This doesn't seem to sacrifice too much in terms of the overhead either.  More significantly, you are calling attention to an area of efficiency that needs to be addressed and is very helpful to the project.  Thank you for doing that.  I'm hoping to get caught up to where you are at now on this protocol business.

Re: Protocol Buffers for Bitcoin

Speaking of the network...
... is there any really robust, generic, low-latency, open source p2p network "middleware" out there?

I think using protocol buffers as the serialization format is a good idea, but I don't think just switching to protocol buffers "buys" enough to be worth the effort (at least not now, when transaction volume is low).

I'd like to see some experimenting with running bitcoin on top of a different networking layer (and use protocol buffers, too).  Is there a p2p network that is designed to be extremely highly reliable and difficult to infiltrate or attack with malicious nodes?

Re: Protocol Buffers for Bitcoin

FYI, it is pointless to make a packet smaller than 60 bytes -- the minimum size of an Ethernet packet.  Packets are padded up to 60 bytes, if they are smaller.
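
A quick tally of where that minimum bites, assuming IPv4 and TCP headers with no options (both are assumptions of mine, and the function name is just for illustration). The 60 bytes cover the whole Ethernet frame, headers included, not just the application data:

```python
ETHERNET_MIN_FRAME = 60   # bytes, excluding the 4-byte frame check sequence
ETHERNET_HEADER = 14      # destination MAC, source MAC, EtherType
IPV4_HEADER = 20          # assumes no IP options
TCP_HEADER = 20           # assumes no TCP options


def padding_bytes(payload_len: int) -> int:
    """Return how many pad bytes Ethernet adds for a given TCP payload size."""
    frame = ETHERNET_HEADER + IPV4_HEADER + TCP_HEADER + payload_len
    return max(0, ETHERNET_MIN_FRAME - frame)


for payload_len in (0, 5, 6, 55, 85):
    print(payload_len, "byte payload ->", padding_bytes(payload_len), "pad bytes")
```

Under these assumptions the headers already take 54 bytes, so padding only kicks in for payloads under 6 bytes.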

Re: Protocol Buffers for Bitcoin

The encoded protocol buffer is just 55 bytes, whereas the bitcoin version is 85 0x00 sets (each one representing 2 bytes, I assume). This means that my badly designed protocol buffer is half the size of the hand built layout!

I realize that you are evangelizing for protocol buffers (and you seem to be doing a very good job of it too, I might add), but I will challenge the idea that hand built data layouts are always bad.

Still, I hope this does give some food for thought and on a practical basis any improvement in the network protocol that shaves off a few bytes is always better.  This doesn't seem to sacrifice too much in terms of the overhead either.  More significantly, you are calling attention to an area of efficiency that needs to be addressed and is very helpful to the project.  Thank you for doing that.  I'm hoping to get caught up to where you are at now on this protocol business.

They're not always bad. However, if you put in so much effort that your hand built packet was smaller than a protocol buffer, then you're probably putting too much effort into a micro optimisation ;)

I'll be happy to help anyone catch up with the protocol buffers. If someone is willing to work with me I'd even work on a patch. I have very little C++ experience, so I can't do it alone unfortunately.

I think using protocol buffers as the serialization format is a good idea, but I don't think just switching to protocol buffers "buys" enough to be worth the effort (at least not now, when transaction volume is low).

I would disagree. Protocol buffers are smaller, which is nice, but that's not their main advantage. They're forwards compatible, which is hugely important in a p2p network, and they can easily be used from many languages, which makes implementing new clients easier. In my opinion that is vital for bitcoin.

FYI, it is pointless to make a packet smaller than 60 bytes -- the minimum size of an Ethernet packet.  Packets are padded up to 60 bytes, if they are smaller.

Indeed, but the version packet is probably the smallest packet of all the ones sent, so we'll gain more elsewhere. Also, keep an eye on the main point. The fact that protocol buffers are smaller is a nice aside to the fact that they're forwards compatible and make bitcoin portable between languages.
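
To show what forwards compatible means in practice, here is a rough Python sketch of how an old client can step over a field it has never heard of. Every field on the wire carries its number and a wire type, so the reader always knows how many bytes to skip. The function names and the "field 7" in the example are made up for illustration; a real client would use the generated protocol buffer classes instead.

```python
def read_varint(buf: bytes, pos: int):
    """Return (value, next_pos) for the varint starting at buf[pos]."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7
        if shift > 63:
            raise ValueError("varint too long")


def parse_known_fields(buf: bytes, known: set) -> dict:
    """Return the fields whose numbers are in `known`; skip everything else."""
    fields = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:            # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:          # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:          # length-delimited (strings, nested messages)
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:          # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if pos > len(buf):
            raise ValueError("truncated field")
        if field_number in known:
            fields[field_number] = value
    return fields


# Field 1 = 306 and field 2 = 71137 as varints, with a hypothetical newer
# field 7 (a 4-byte string) sitting between them.
message = bytes([0x08, 0xB2, 0x02, 0x3A, 0x04]) + b"ipv6" + bytes([0x10, 0xE1, 0xAB, 0x04])
print(parse_known_fields(message, known={1, 2}))
```

The old client gets its two fields back and never has to understand field 7. That is how something like the reserved IPv6 space could be added later without breaking anyone.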

Re: Protocol Buffers for Bitcoin

The encoded protocol buffer is just 55 bytes, whereas the bitcoin version is 85 0x00 sets (each one representing 2 bytes, I assume). This means that my badly designed protocol buffer is less than half the size of the hand built layout!
The "0x00" groups each represent one byte. The length of the standard version packet is 87 bytes plus 20 for the header. The header could be massively optimized as well:
Code:
message start "magic bytes" - 0xF9 0xBE 0xB4 0xD9
command - name of command, 0 padded to 12 bytes "version\0\0\0\0\0"
size - 4 byte int
checksum (absent for messages without data and version messages) - 4 bytes
Obviously using proto buffers here, while absolutely a breaking change, would save a fair bit of space, especially because the "I've created a transaction" packet has the name "tx" meaning that there's at least 10 bytes of overhead in every one of those packets.
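
Here is a small Python sketch of the header listing above, plus a tally of the version payload, to make the byte counts concrete. Some of this is assumed rather than taken from the thread: the little-endian size field, a 4-byte version number at the start of the payload, and a 1-byte length prefix on the subversion string. The function names are mine. The totals do at least agree with the 85 and 87 byte figures quoted here.

```python
import struct

MAGIC = bytes([0xF9, 0xBE, 0xB4, 0xD9])  # message start bytes from the listing above
COMMAND_LEN = 12


def pack_header(command: str, payload_len: int, checksum: bytes = b"") -> bytes:
    """Pack magic, zero-padded command, 4-byte size and an optional checksum."""
    name = command.encode("ascii")
    if len(name) > COMMAND_LEN:
        raise ValueError("command name longer than 12 bytes")
    # "12s" pads the name with NUL bytes; "<I" is a little-endian uint32
    # (the byte order is an assumption, the listing does not state it).
    return MAGIC + struct.pack("<12sI", name, payload_len) + checksum


def version_payload_len(subversion: str) -> int:
    """Sum the field sizes of a version payload (layout is partly assumed)."""
    address = 8 + 12 + 4 + 2          # nServices, pchReserved, ip, port
    return (4                         # version number (assumed 4-byte int)
            + 8                       # nLocalServices
            + 8                       # timestamp
            + 2 * address             # remote and local address
            + 8                       # nLocalHostNonce
            + 1 + len(subversion)     # string with an assumed 1-byte length prefix
            + 4)                      # nBestHeight


version_header = pack_header("version", version_payload_len(""))
tx_header = pack_header("tx", 0, checksum=bytes(4))  # size and checksum are dummy values

print(len(version_header))          # header length without a checksum
print(version_payload_len(""))      # payload with an empty subversion string
print(version_payload_len(".0"))    # payload with the ".0" string quoted earlier
print(tx_header[4:16].count(0))     # NUL padding bytes after "tx"
```

The last line counts the padding in the command field of a "tx" header, which is the 10 bytes of overhead mentioned above.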

Re: Protocol Buffers for Bitcoin

The "0x00" groups each represent one byte.

Oops :-[

breaking change

I think the best way to phase in protocol buffers would be to avoid breaking changes to start with. Instead, start with protocol buffers for the local files (like the wallet), which would gain us a little bit of size on disk, make the wallet file easier to read in other software, and give us some experience using protocol buffers. Then is the time to start phasing in protocol buffers for networking, in my opinion.

Does the current version of bitcoin have any handling for ignoring chunks of a packet? If so, phasing in protocol buffers could be as simple as writing the current packet AND writing the protocol buffer (as an ignored field for older clients), then getting rid of the old encoding once enough people have upgraded.

Re: Protocol Buffers for Bitcoin

Why do you consider it a breaking change? There's no reason you couldn't first try with the new protocol and then retry using the old bitcoin serialization technique. Also I think this is a change that should be made sooner rather than later, while the BitCoin community is still small. It's already been a major blocker in making new clients, and delaying it is going to hamper bitcoin's adoption.

Re: Protocol Buffers for Bitcoin

Why do you consider it a breaking change? There's no reason you couldn't first try with the new protocol and then retry using the old bitcoin serialization technique.

That's a very good idea, and I would say it's the way to go with this.

Also I think this is a change that should be made sooner rather than later, while the BitCoin community is still small. It's already been a major blocker in making new clients, and delaying it is going to hamper bitcoin's adoption.

Since it's a non-breaking change, it should be done as soon as possible in my opinion, for those very reasons.

The question remains: is anyone willing to help implement it? I'm an experienced programmer but I have no C++ experience unfortunately, so I'm gonna need a little help if I try to do this myself ;)

Re: Protocol Buffers for Bitcoin

The reason I didn't use protocol buffers or boost serialization is because they looked too complex to make absolutely airtight and secure.  Their code is too large to read and be sure that there's no way to form an input that would do something unexpected.

I hate reinventing the wheel and only resorted to writing my own serialization routines reluctantly.  The serialization format we have is as dead simple and flat as possible.  There is no extra freedom in the way the input stream is formed.  At each point, the next field in the data structure is expected.  The only choices given are those that the receiver is expecting.  There is versioning so upgrades are possible.

CAddress is about the only object with significant reserved space in it.  (about 7 bytes for flags and 12 bytes for possible future IPv6 expansion)
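
As an illustration of how flat this is, here is a minimal Python sketch of reading one address, assuming the 26-byte layout quoted earlier in the thread (8-byte nServices, 12 reserved bytes, 4-byte ip, 2-byte port). The function name and the byte orders are assumptions made for the example, not taken from the client source.

```python
import socket
import struct

ADDRESS_LEN = 8 + 12 + 4 + 2  # nServices, pchReserved, ip, port


def parse_address(buf: bytes) -> dict:
    """Read the fields in their fixed order; reject anything of the wrong length."""
    if len(buf) != ADDRESS_LEN:
        raise ValueError(f"expected {ADDRESS_LEN} bytes, got {len(buf)}")
    (services,) = struct.unpack_from("<Q", buf, 0)   # assumed little-endian
    reserved = buf[8:20]                             # space kept for IPv6
    ip = socket.inet_ntoa(buf[20:24])
    (port,) = struct.unpack_from(">H", buf, 24)      # assumed network byte order
    return {"services": services, "reserved": reserved, "ip": ip, "port": port}


# Sample values only: services = 1, localhost, port 8333.
sample = struct.pack("<Q", 1) + bytes(12) + bytes([127, 0, 0, 1]) + struct.pack(">H", 8333)
print(parse_address(sample))
```

There are no tags, no optional fields and no nesting choices for the sender to make: each field sits at a fixed offset, and an input of the wrong length is rejected outright.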

The larger things we have like blocks and transactions can't be optimized much more for size.  The bulk of their data is hashes and keys and signatures, which are uncompressible.  The serialization overhead is very small, usually 1 byte for size fields.
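
A sketch of the size prefix in Python, to show why the overhead is usually 1 byte. The thresholds below are the compact size scheme as I understand it (lengths under 253 take one byte, larger ones take a marker byte plus 2, 4 or 8 bytes). Treat them as an assumption, since they are not spelled out in this thread, and the function names are made up for the example.

```python
import struct


def encode_compact_size(n: int) -> bytes:
    """Encode a length so that values below 253 take a single byte."""
    if n < 0:
        raise ValueError("sizes are non-negative")
    if n < 0xFD:
        return bytes([n])
    if n <= 0xFFFF:
        return b"\xfd" + struct.pack("<H", n)
    if n <= 0xFFFFFFFF:
        return b"\xfe" + struct.pack("<I", n)
    return b"\xff" + struct.pack("<Q", n)


def encode_var_bytes(data: bytes) -> bytes:
    """Prefix a byte string with its compact size."""
    return encode_compact_size(len(data)) + data


signature = bytes(72)  # placeholder for a signature-sized blob
print(len(encode_var_bytes(signature)) - len(signature), "byte(s) of overhead")
```

Hashes, keys and signatures are all far below 253 bytes, so each one carries a single length byte on top of data that cannot be compressed anyway.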

On Gavin's idea about an existing P2P broadcast infrastructure, I doubt one exists.  There are few P2P systems that only need broadcast.  There are some libraries like Chord that try to provide a distributed hash table infrastructure, but that's a huge difficult problem that we don't need or want.  Those libraries are also much harder to install than Bitcoin itself.