haberman edited this page Sep 13, 2010 · 30 revisions

Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.
—Antoine de Saint-Exupery


μpb (or more commonly, “upb”) is an implementation of the Protocol Buffers serialization format released by Google in mid-2008. The Greek letter mu (μ) is the SI prefix for “micro”, which reflects the goal of keeping upb as small as possible while providing a great deal of flexibility and functionality.

upb is written in 2300 SLOC of C and compiles to just under 30 KB of object code on x86.

Why use Protocol Buffers?

Did you know that “The Letter” by The Box Tops was a minute and 58 seconds long? Means nothing. Nil. But it takes them less than two minutes to accomplish what Jethro Tull takes hours to not accomplish!
  — Lester Bangs’s character in “Almost Famous”

Protocol Buffers are everything XML has spent the last 10 years trying to be.

XML 1.0 was released in 1998 as a 35-page specification. At its essence, XML alone lets you define trees of elements, attributes, and strings. It was designed to be a simple subset of SGML.

Soon people realized that strings were not enough — for a data language you want data types. Also they wanted something more expressive than DTDs. So in 2001 came XML Schema part 1, structures (104 pages) and part 2, datatypes (111 pages).

XML Schema could theoretically give you your document in data form (i.e., with real data types) using something called the “Post-Schema-Validation Infoset,” but this did not achieve widespread support. Everybody was using DOM, because it was supported in the browsers, so they standardized on that. That meant hundreds more pages of specifications, but unlike XML Schema there were still no datatypes: everything is a string.

Meanwhile, a perennial complaint about XML is that the documents are too big and too expensive to parse. So “Binary XML” proposals fly around from time to time, none of which really gain traction.

Tangentially, other formats like YAML and JSON start to pop up, which are attractive to developers because of their simplicity.

Protocol Buffers natively support all of the important functionality that it takes a whole stack of XML technologies to achieve. Out of the box, Protocol Buffers give you:

  • a binary format, a text format, and an in-memory format that are much more efficient in space and CPU time than XML.
  • a schema language (.proto files) that supports real data types.
  • convenient APIs for major languages.

Using protocol buffers, you can define a data structure in a .proto file, and instantly have efficient, convenient, programmatic access to your data from any popular language. And when I say “convenient”, I mean:

print person.age

not:

person.getElementsByTagName("age")->item(0)->innerText()

And you can implement Protocol Buffers in a fraction of the code that it takes to implement XML.
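
As a rough illustration of both the size claim and the "fraction of the code" claim, here is a minimal sketch of an encoder for two of the wire types in the binary format (varints and length-delimited byte strings). It is not taken from upb or from Google's implementation. The function names are invented for this example, and the Person layout (name as field 1, age as field 2) is an assumption, since this page does not give a schema.

```python
def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint (7 bits per byte)."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow: set the high bit
        else:
            out.append(byte)          # last byte: high bit clear
            return bytes(out)


def encode_varint_field(field_number, value):
    """Tag (field number, wire type 0) followed by the varint value."""
    return encode_varint(field_number << 3 | 0) + encode_varint(value)


def encode_bytes_field(field_number, payload):
    """Tag (field number, wire type 2), then the length, then the raw bytes."""
    return encode_varint(field_number << 3 | 2) + encode_varint(len(payload)) + payload


# Hypothetical Person message: name = "Al" (field 1), age = 30 (field 2).
person = encode_bytes_field(1, b"Al") + encode_varint_field(2, 30)
xml = b"<person><name>Al</name><age>30</age></person>"

print(person.hex())            # 0a02416c101e
print(len(person), len(xml))   # the binary form is 6 bytes
```

The XML string is only there for a size comparison of the same two values. A real implementation also needs signed integers, fixed-width types, and nested messages, which this sketch leaves out.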

Why another Protocol Buffers implementation?

The Google implementation of Protocol Buffers is open source, released under a liberal license (BSD). Other people have also written implementations, such as protobuf-c. Why did I write a completely new implementation from scratch? Why should anybody use my implementation?

I will give two main reasons, besides the goal of minimalism (which has either already won you over or failed to pique your interest):

Flexibility and Adaptability

upb is designed for maximum flexibility. What this means is that it gives you as a programmer more choices about how you want to store and process your data. Specifically:

  • upb is fully streaming-capable.
    This means that your serialized data doesn’t have to be in one big contiguous buffer to start parsing it. If your buffer is scattered across chunks of memory or if you are streaming data off of a disk or network, upb lets you parse as much data as you currently have in your buffer. When you have more data, you can resume parsing.
  • upb’s lowest-level parser is event-driven, like SAX.
    SAX-based parsers are a great fit for some applications. You might want to parse the Protocol Buffer data into your own custom data structure instead of the stock message classes. Or your application might be capable of processing the data in a streaming fashion, in which case you can avoid the malloc/free/memcpy overhead of saving the data into a tree structure.
  • upb’s memory management policies are adaptable.
    Memory management can make or break performance. malloc(), free(), and memcpy() are expensive when overused, especially taking into account the cache effects. Deep in upb’s design is a recognition of this fact, and interfaces that let you optimize for intelligent memory management. For example, upb is capable of making strings reference the original protobuf data (rather than copying), and upb’s memory management interface lets you reuse submessages instead of destroying and reallocating them.
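
The three ideas above (resumable parsing, SAX-style events, and strings that reference the input instead of copying it) can be sketched together in a few dozen lines. This is an illustration in Python, not upb's C API: StreamingDecoder, feed, on_varint, and on_bytes are hypothetical names, and only wire types 0 (varint) and 2 (length-delimited) are handled. One simplification to note: the sketch re-buffers the unfinished tail of each chunk, which a careful C implementation would try to avoid.

```python
def read_varint(buf, pos):
    """Return (value, next_pos), or None if buf ends in the middle of the varint."""
    value = 0
    shift = 0
    while pos < len(buf):
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:       # high bit clear: this was the last byte
            return value, pos
        shift += 7
    return None


class StreamingDecoder:
    """Hypothetical event-driven decoder for wire types 0 (varint) and 2 (bytes)."""

    def __init__(self, on_varint, on_bytes):
        self.on_varint = on_varint
        self.on_bytes = on_bytes
        self.pending = b""        # incomplete field left over from earlier chunks

    def feed(self, chunk):
        """Parse every complete field available so far; keep the rest for later."""
        buf = self.pending + chunk
        view = memoryview(buf)
        pos = 0
        while True:
            next_pos = self._parse_field(buf, view, pos)
            if next_pos is None:  # ran out of data mid-field: wait for more
                break
            pos = next_pos
        self.pending = buf[pos:]

    def _parse_field(self, buf, view, pos):
        tag = read_varint(buf, pos)
        if tag is None:
            return None
        key, pos = tag
        field_number, wire_type = key >> 3, key & 0x7
        result = read_varint(buf, pos)   # the value (type 0) or the length (type 2)
        if result is None:
            return None
        number, pos = result
        if wire_type == 0:
            self.on_varint(field_number, number)
        elif wire_type == 2:
            if pos + number > len(buf):
                return None
            # Hand out a view into buf rather than a copy of the payload.
            self.on_bytes(field_number, view[pos:pos + number])
            pos += number
        else:
            raise ValueError("wire type %d is not handled in this sketch" % wire_type)
        return pos


events = []
decoder = StreamingDecoder(
    on_varint=lambda field, value: events.append((field, value)),
    on_bytes=lambda field, data: events.append((field, bytes(data))),
)
# One message (field 1 = 150, field 2 = "testing") arriving in three pieces.
for chunk in (b"\x08\x96", b"\x01\x12\x07test", b"ing"):
    decoder.feed(chunk)
print(events)   # [(1, 150), (2, b'testing')]
```

The handlers decide what to keep. Here the on_bytes handler copies the payload with bytes(data) only because it stores it in a list; a handler that just inspects the data could work on the view directly and never copy it.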

upb is designed to be a toolbox of paradigms for manipulating protocol buffer data. upb is built in layers, and any of the layers are available for clients to use as they see fit.

In addition, there are (or will be) several different code generation strategies, for compiled languages that wish to use generated code.

Support for Dynamic Languages

Protocol Buffers have enormous potential to be useful to users of dynamic languages. They provide a format that languages can use to exchange data in a very efficient way. They offer the efficiency benefits of built-in serialization formats like Python’s “Pickle”, Perl’s “Storable”, Ruby’s “Marshal”, or JavaScript’s “JSON”, but with a more explicit schema and greater interoperability across languages.

Despite this promise, Protocol Buffers haven’t seen much adoption in dynamic languages because the existing implementations aren’t very efficient. upb was designed from the outset to be an ideal foundation for very fast Protocol Buffers implementations in dynamic languages.

One key part of this strategy was making the table-driven parsing code path (the mode of operation that doesn’t require you to generate and compile C or C++ for each message) as fast as possible. It is inconvenient for users of dynamic languages to have a compile step in their development cycle.
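
To show what "table-driven" means here, the following is a minimal sketch of a parser whose knowledge of the message lives entirely in a table built at runtime, so no per-message code has to be generated or compiled. It is not upb's implementation or API. PERSON_TABLE and parse_message are hypothetical names, the field numbers are assumed for the example, and signed integers, fixed-width types, and submessages are left out.

```python
def read_varint(data, pos):
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]          # IndexError here means the input was truncated
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


# Hypothetical schema table: field number -> (name, expected wire type, converter).
PERSON_TABLE = {
    1: ("name", 2, lambda raw: raw.decode("utf-8")),
    2: ("age", 0, int),
}


def parse_message(data, table):
    """Parse data into a dict, using only the table to interpret fields."""
    message = {}
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        field_number, wire_type = key >> 3, key & 0x7
        if wire_type == 0:                       # varint
            raw, pos = read_varint(data, pos)
        elif wire_type == 2:                     # length-delimited
            length, pos = read_varint(data, pos)
            if pos + length > len(data):
                raise ValueError("truncated length-delimited field")
            raw = data[pos:pos + length]
            pos += length
        else:
            raise ValueError("wire type %d is not handled in this sketch" % wire_type)
        entry = table.get(field_number)
        if entry is None:
            continue                             # unknown field: skip it
        name, expected_wire_type, convert = entry
        if wire_type != expected_wire_type:
            raise ValueError("field %d has unexpected wire type" % field_number)
        message[name] = convert(raw)
    return message


person = parse_message(b"\x0a\x02Al\x10\x1e", PERSON_TABLE)
print(person)   # {'name': 'Al', 'age': 30}
```

Supporting a new message type means building another table, for example from a parsed .proto description, rather than running a code generator and a compiler.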

Another important feature is developing memory-management interfaces that can integrate with the memory managers of dynamic languages. This is no easy task, because each language runtime does memory management differently. Some use reference counting, some use garbage collection, some use a combination, and the interfaces for interacting with the memory managers are different for every runtime. A key goal of upb was to design a memory management scheme that could gracefully integrate with all of these.