Skip to content
Philip (flip) Kromer edited this page Jun 17, 2012 · 4 revisions

Things in (?question blocks?) are unresolved.

Foundations

  • Simple: The thing you would draw at the whiteboard, spoken aloud, is the the thing you write.

  • Scalable: Scalability is more important than performance. Scalability is important even when there are no performance concerns -- scalable for robots means scalable for people.

  • Readable: The plot should be as simple as the story. Orthogonal concerns (schema, transformation, topology, transport, resources, and configuration) should be crisply separated.

Crucibles

Executor indpendence

Every Hanuman graph can be expressed

  • Buffered flow -- Flume
  • Event-driven flow -- cat, HTTP post, Hadoop
  • Picture -- graphviz etc
  • Explanation -- each record appends the plain-language action it takes on the record

Graph is a graph

Whereever concepts can be paired across the following, they should be, until they prove how they are distinct:

  • micro dataflow
  • macro dataflow
  • workflow
  • ----~
  • subroutine
  • system diagram

Petri Net / Graph Dual

  • every edge connects a stage to a resource or a resource to a stage
    • The resource might be implied by the edge, and elided from the declaration.

Comparables

  • Kafka, ZeroMQ, Storm, Esper, Flume, Pusher
  • Rake, Chef, Shell scripts, Thor
  • Rack
  • Hadoop

Objects

  • stages

    • resources
    • actions -- alter, augment or act on data. Some important types of actions:
      • transformers -- actions with exactly one input and one output
      • sources -- actions with zero inputs and (?one or more outputs?). Sources may be different in another way, not sure yet
      • sinks -- actions with (?one or more inputs?) and no output.
      • taps -- actions that make no changes to the data (?can a tap add metadata? can it mutate metadata?)
    • graphs -- contains other stages. A graph is also an action.
  • schema

  • record

  • control path

  • configuration

  • topic -- a filter This filter is special because the transport can

  • runner -- pairs a graph with concrete executors and resources

(?questions exist about the following?)

  • partition --
  • chain -- (?a group of sequential edges understood to be related in an important way?)
  • delivery guarantee --
  • contract

Key Decisions

Data is primary Data is represented as "Arbitrary record with sideband metadata". That is, rather than an event wrapper with a message body you access, the data is primary and the metadata is accessible through it.

Flows are understandable in isolation:

Unless a processor explicitly maintains state or mixes in external entropy, the outcome of a data stream may be completely characterized by

  • Processor
  • Configuration (static parameters delivered at run-time to the processors on the graph)
  • Data record itself
  • Metadata attached to that record

The important point is that the outcome of a processor is not affected by changes to

  • the graph as a whole, including the stages that it pulls from and writes to
  • the machine or machines it runs on
  • the transport layer handling the flow
  • spatially or temporally where the data originates from or proceeds to.

As long as it is fed the same data and the same metadata with the same configuration, a processor doesn't care what's running it, where it is, who fed it the data or who is consuming it.

Orthogonal Capabilities

  • guarantees

    • end-to-end -- ensures acknowledgement by destination
    • next-hop -- guarantees successful handoff to next stage, but nothing more
    • best effort -- fire and forget

Key Mysteries

  • naming stages -- here are some desirable features:

    • unambiguously referencable
    • name doesn't change if irrelevant changes are made to the graph
    • name does change if relevant changes are made
    • name is predictable (by looking at the graph
    • name is readable _ name is unaffected by configuration
  • fanout: when several stages consume a given stage's ourputs,

    • does the stage have one output (a resourec) and the resource has many consuming stages
    • or does the stage have multiple outputs, and the emit method needs re-think?
  • messages

    • can stages message each other directly?
    • or do they spek event a sideband that other stages can consume
    • or are they broadcase to all notds
  • amelioration: when are

    • mixin of a module (calling super)
    • wrapping a stage (before/after/around filter)
    • flow insertion (put module before / after it in the flow)
  • delivery guarantees

  • control path

  • biographers

Have / Want

processor widgets

  • choke -- emits all records it receives but not faster than a given averaged rate
  • Uniform consistent sampling
  • dashpot tap
  • channel topic
  • aggregates -- sum, avg/stdev, %ile
  • graph change stream -- an activity stream of macro graph actions (topology or configuration changes).

topology widgets

  • many-to-many -- all input records sent to all output slots. (?input partition is somehow recorded in event metadata?).
  • switch -- each record sent to exactly one out slot.

contract

  • window --

  • cogroup

  • sort

  • grouping "fences" -- indicate start/end boundaries of a group in the stream (or in metadata) rather than by combining into a rolled-up record

  • buffer -- ...

  • retry --

  • load balance --

  • barrier -- execution pauses until some condition met

  • schedule -- ...

central-ish

  • counters, timers, gauges, values
  • announce / discover

understanding

  • simulate -- sampled yet illustrative flow goes end-to-end
  • explain -- processors explain what they do, appending it to the explanation passed along the graph
  • audit -- metadata element showing full provenance of the record

demo

  • Hamming's Problem
  • Newton's method

Further reading

Clone this wiki locally