haberman edited this page Sep 13, 2010 · 30 revisions

Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.
—Antoine de Saint-Exupery

μpb (or more commonly, “upb”) is an implementation of the Protocol Buffers serialization format, which Google released in mid-2008. The Greek letter mu (μ) is the SI prefix for “micro”, reflecting the goal of keeping upb as small as possible while still providing a great deal of flexibility and functionality.

Why another Protocol Buffers implementation?

The Google implementation of Protocol Buffers is open source, released under a liberal license (BSD). Other people have also written implementations, such as protobuf-c. Why did I write a completely new implementation from scratch? Why should anybody use my implementation?

I will give two main reasons, besides the goal of minimalism (which has either already won you over or failed to pique your interest):

Flexibility and Adaptability

upb is designed for maximum flexibility. What this means is that it gives you as a programmer more choices about how you want to store and process your data. Specifically:

upb is fully streaming-capable.
This means that your serialized data doesn’t have to sit in one big contiguous buffer before you can start parsing it. If your data is scattered across chunks of memory, or if you are streaming it off of a disk or network, upb lets you parse as much data as you currently have in your buffer. When more data arrives, you can resume parsing where you left off.
upb’s lowest-level parser is event-driven, like SAX.
SAX-based parsers are a great fit for some applications. You might want to parse the Protocol Buffer data into your own custom data structure instead of the stock message classes. Or your application might be capable of processing the data in a streaming fashion, in which case you can avoid the malloc/free/memcpy overhead of saving the data into a tree structure.
upb’s memory management policies are adaptable.
Memory management can make or break performance. malloc(), free(), and memcpy() are expensive when overused, especially once you take cache effects into account. upb’s design recognizes this at a deep level and provides interfaces that let you manage memory intelligently. For example, upb is capable of making strings reference the original protobuf data (rather than copying it), and upb’s memory management interface lets you reuse submessages instead of destroying and reallocating them.
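
To make the streaming and event-driven points concrete, here is a minimal sketch in C. It is not upb’s API: every name in it (stream_parser, stream_parser_feed, varint_handler, value_log, log_value) is hypothetical and invented for illustration. It handles only a bare sequence of varints, the variable-length integer encoding used by the Protocol Buffers wire format. The parser keeps its state between calls, so a value split across two chunks is completed when the next chunk arrives, and each finished value is delivered to a callback instead of being stored in a tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; this is an illustration, not upb's API. */

/* Event callback: invoked once for each complete varint. */
typedef void varint_handler(uint64_t value, void *closure);

/* Parser state that survives between chunks. */
typedef struct {
  uint64_t partial; /* bits accumulated so far for the current varint */
  unsigned shift;   /* bit position of the next 7-bit group */
} stream_parser;

static void stream_parser_init(stream_parser *p) {
  p->partial = 0;
  p->shift = 0;
}

/* Feeds one chunk of input. Returns 0 on success, or -1 if a varint
 * runs past 10 bytes (the maximum for a 64-bit value). A varint that
 * is cut off at the end of the chunk is not an error: its bits stay in
 * the parser state and the next call picks up where this one stopped. */
static int stream_parser_feed(stream_parser *p, const uint8_t *buf,
                              size_t len, varint_handler *on_value,
                              void *closure) {
  assert(p != NULL && on_value != NULL);
  for (size_t i = 0; i < len; i++) {
    uint8_t byte = buf[i];
    if (p->shift > 63) return -1; /* 11th byte of one varint: malformed */
    p->partial |= (uint64_t)(byte & 0x7F) << p->shift;
    if (byte & 0x80) {
      p->shift += 7; /* continuation bit set: more bytes follow */
    } else {
      on_value(p->partial, closure); /* value complete: fire the event */
      p->partial = 0;
      p->shift = 0;
    }
  }
  return 0;
}

/* Example handler: records values into a fixed-size array owned by the
 * caller, so the parser itself never allocates. */
typedef struct {
  uint64_t values[8];
  size_t count;
} value_log;

static void log_value(uint64_t value, void *closure) {
  value_log *seen = (value_log *)closure;
  if (seen->count < 8) seen->values[seen->count++] = value;
}
```

Feeding the single byte 0xAC produces no event, because its continuation bit says the value is incomplete. Feeding 0x02 0x01 afterwards completes the first value (300) and then delivers a second one (1). Because the handler decides what to do with each value, the parser never calls malloc() or memcpy() itself.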

upb is designed to be a toolbox of paradigms for manipulating protocol buffer data. upb is built in layers, and every layer is available for clients to use as they see fit.

In addition, there are (or will be) several different code generation strategies, for compiled languages that wish to use generated code.