haberman edited this page Jul 6, 2011 · 30 revisions


μpb (or more commonly, “upb”) is an implementation of the Protocol Buffers serialization format released by Google in mid-2008. The Greek letter mu (μ) is the SI prefix for “micro”, which reflects the goal of keeping upb as small as possible while providing a great deal of flexibility and functionality.

upb is written in ~5000 sloc of C, and compiles to ~30kb of object code on x86.

Why Protocol Buffers?

Protocol Buffers are an excellent platform for data processing and interchange. They take all the best parts of JSON, XML, and UNIX pipes and leave out all of the bad parts. They can be as easy to use as JSON, as versatile as UNIX pipes, as explicit as XML-Schema, while being more efficient in both CPU and memory. They can be human-readable when you want (and can even use JSON as their on-the-wire format), binary and extremely compact when you need. And all of this can be implemented in a very small library (<50kb of object code).

Using Protocol Buffers, you can define your schema using a simple syntax:

message Person {
  required uint32 age = 1;
  required uint32 birthday = 2;
  enum Gender {
    MALE = 0;
    FEMALE = 1;
  }
  optional Gender gender = 3;
}

You can then create structures of this type in any language that has protocol buffer support. These structures can be very efficiently serialized and deserialized into either binary or text formats.

  • vs JSON: the protocol buffer schema is explicit, as opposed to JSON where messages can have arbitrary keys that map to any kind of value. Protocol Buffers and JSON can interoperate nicely; you can serialize a Protocol Buffer structure to JSON and parse from JSON. Protocol Buffers (properly implemented) are more efficient in-memory, since they are stored as offset-based structures instead of hash tables. The Protocol Buffers text format is roughly comparable to JSON. The Protocol Buffers binary serialization format is significantly smaller than JSON, but is not human readable.
  • vs XML: Protocol Buffers have a data model that more cleanly maps to programming languages (no special “DOM” API is required). Protocol Buffers are significantly more efficient both on the wire and in memory. Protocol Buffers integrate data types and a schema at the lowest level, instead of layering them on top of XML using technologies like XML Schema. Protocol Buffers are significantly less complex than the XML stack.

Why another Protocol Buffers implementation?

The Google implementation of Protocol Buffers is open source, released under a liberal license (BSD). Other people have written implementations also, such as protobuf-c. Why did I write a completely new implementation from scratch? Why should anybody use my implementation?

High performance without code generation

Most protobuf implementations focus on code generation as their primary means of achieving speed. “Code generation” in this context means using a compiler to translate a .proto file to C or C++ code that is specific to those .proto types. A C or C++ compiler is then used to output machine code that can parse, serialize, or manipulate those types.

Code generation can achieve high speeds, but also has a high cost:

The generated code can be large
descriptor.proto, which can be represented as a 3.5kb protobuf, compiles to >150kb of machine code on x86. If you have a binary that processes lots of message types, this code can really add up.
You have to link in any message types you want to parse
This means you have to decide ahead of time what messages you might possibly want to process, and you pay the size and compile time hit for all of them. Whenever they change, you have to recompile.
There is an extra step in your edit/compile/run cycle
Or worse, if you didn’t have an edit/compile/run cycle before (like with interpreted languages), you do now.
The generated code is inflexible
Generated code achieves its speed by compiling for one very specific configuration. In other words, it takes all your decisions about how you want to parse and fixes them at compile time. This means that the generated code is only good for one very specific purpose. Want to change the set of fields you care about? Recompile. Want to reference the input strings instead of copying them? Recompile. Want to do callback-based parsing instead of parsing into the stock data structures? Recompile.

upb was designed with the belief that protobuf parsing without code generation could achieve speeds comparable to code generation. If this can be achieved, we can avoid the drawbacks of code generation. Programs need only compile the upb core (<50k object code), and all .proto files can be loaded at runtime as they are needed.

upb supports both interpreted, table-based parsing and a JIT compiler that generates machine code at runtime. Benchmarks indicate that the table-based parser is roughly 70% the speed of proto2’s generated code. The JIT compiler is on the order of 3x the speed of proto2’s generated code. For more details, see: upb status and preliminary performance numbers

Stream-based parsing (decoupling parsing from in-memory representations)

upb’s core parsing and serialization interfaces are streaming. You register some callbacks with the parser and the parser calls those callbacks as individual values are parsed. It’s comparable to SAX in the XML world, whereas proto2 is comparable to DOM.

This might not seem too important, since most applications aren’t truly streaming. The key point about stream parsing is that it decouples parsers/serializers from in-memory representations. This makes upb far more flexible than a DOM-based parser, because you can use it with whatever in-memory representation is most convenient. For dynamic language extensions, it might be much more convenient to put the data in a native object for that interpreter/VM and let the interpreter/VM handle synchronization and memory management. This can yield a much more “native” feeling extension than one that just wraps generated C++ classes, and has to respect their memory management and threading scheme.

Even C++ users might already have a class hierarchy that they want to serialize directly instead of having to copy it all into a protobuf first. They might want to store a map as std::map instead of an array of (key, value) pairs. All of this is possible with streaming interfaces.

Flexible, minimal design

The upb design has been refined for over two years to maximize flexibility and performance while staying as small as possible.

upb is designed to be easy to integrate with VMs and interpreters. It is as hands-off as possible about memory management and threading.

upb’s design enables several optimizations:

Skipping fields/submessages you don’t need.
The protobuf format makes it possible to skip submessages very efficiently. If you are only reading a small portion of a large, nested protobuf, you can get the fields you need in orders of magnitude less time than it would take to parse the whole thing.
Lazy parsing of submessages
A slightly different take on the previous point: it is possible to parse submessages only if/when they are accessed. This can achieve the same speeds as the previous approach without requiring you to statically analyze the set of fields you need. The downside is that parse errors surface later and unsynchronized reads are no longer thread-safe.
Referencing input string data instead of copying.
If the input contains strings, it is possible to reference them from the input string instead of paying for malloc() and memcpy(). This might be desirable in some cases but not others — a non-code-generation approach lets you decide at runtime. For high level languages, this means you can avoid creating the interpreter-level string object until it is needed.
Callback/Event-based parsing
Event-based parsing (like SAX in XML) can be much more efficient than parsing into a data structure.

Support for Dynamic Languages

The dynamic nature of upb is especially useful in the context of dynamic or interpreted languages. upb is specifically designed to be an ideal target for dynamic language extensions.

Protocol Buffers have enormous potential to be useful to users of dynamic languages. They provide a format that languages can use to exchange data very efficiently. They offer the efficiency benefits of built-in serialization formats like Python’s “Pickle”, Perl’s “Storable”, Ruby’s “Marshal”, or JavaScript’s JSON, but with a more explicit schema and greater interoperability across languages.

Despite this promise, Protocol Buffers haven’t seen much adoption in dynamic languages because the existing implementations are either inefficient (when implemented without a C extension) or experimental (e.g. PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp). upb was designed from the outset to be an ideal foundation for very fast Protocol Buffers support in dynamic languages. This is much of the reason upb keeps the runtime dynamic and configurable (i.e. no code generation), so that .proto types are easy to load at runtime and flexible in the ways you can process them.

Another important feature is developing memory-management interfaces that can integrate with the memory managers of dynamic languages. This is no easy task, because each language runtime does memory management differently. Some use reference counting, some use garbage collection, some use a combination, and the interfaces for interacting with the memory managers are different for every runtime. A key goal of upb was to design a memory management scheme that could gracefully integrate with all of these.