philosophy

Things in (?question blocks?) are unresolved.

Foundations

Simple: The thing you would draw at the whiteboard, spoken aloud, is the the thing you write.
Scalable: Scalability is more important than performance. Scalability is important even when there are no performance concerns -- scalable for robots means scalable for people.
Readable: The plot should be as simple as the story. Orthogonal concerns (schema, transformation, topology, transport, resources, and configuration) should be crisply separated.

Crucibles

Executor indpendence

Every Hanuman graph can be expressed

Buffered flow -- Flume
Event-driven flow -- cat, HTTP post, Hadoop
Picture -- graphviz etc
Explanation -- each record appends the plain-language action it takes on the record

Graph is a graph

Whereever concepts can be paired across the following, they should be, until they prove how they are distinct:

micro dataflow
macro dataflow
workflow
----~
subroutine
system diagram

Petri Net / Graph Dual

every edge connects a stage to a resource or a resource to a stage
- The resource might be implied by the edge, and elided from the declaration.

Comparables

Kafka, ZeroMQ, Storm, Esper, Flume, Pusher
Rake, Chef, Shell scripts, Thor
Rack
Hadoop

Objects

stages
- resources
- actions -- alter, augment or act on data. Some important types of actions:
  - transformers -- actions with exactly one input and one output
  - sources -- actions with zero inputs and (?one or more outputs?). Sources may be different in another way, not sure yet
  - sinks -- actions with (?one or more inputs?) and no output.
  - taps -- actions that make no changes to the data (?can a tap add metadata? can it mutate metadata?)
- graphs -- contains other stages. A graph is also an action.
schema
record
control path
configuration
topic -- a filter This filter is special because the transport can
runner -- pairs a graph with concrete executors and resources

(?questions exist about the following?)

partition --
chain -- (?a group of sequential edges understood to be related in an important way?)
delivery guarantee --
contract

Key Decisions

Data is primary Data is represented as "Arbitrary record with sideband metadata". That is, rather than an event wrapper with a message body you access, the data is primary and the metadata is accessible through it.

Flows are understandable in isolation:

Unless a processor explicitly maintains state or mixes in external entropy, the outcome of a data stream may be completely characterized by

Processor
Configuration (static parameters delivered at run-time to the processors on the graph)
Data record itself
Metadata attached to that record

The important point is that the outcome of a processor is not affected by changes to

the graph as a whole, including the stages that it pulls from and writes to
the machine or machines it runs on
the transport layer handling the flow
spatially or temporally where the data originates from or proceeds to.

As long as it is fed the same data and the same metadata with the same configuration, a processor doesn't care what's running it, where it is, who fed it the data or who is consuming it.

Orthogonal Capabilities

guarantees
- end-to-end -- ensures acknowledgement by destination
- next-hop -- guarantees successful handoff to next stage, but nothing more
- best effort -- fire and forget

Key Mysteries

naming stages -- here are some desirable features:
- unambiguously referencable
- name doesn't change if irrelevant changes are made to the graph
- name does change if relevant changes are made
- name is predictable (by looking at the graph
- name is readable _ name is unaffected by configuration
fanout: when several stages consume a given stage's ourputs,
- does the stage have one output (a resourec) and the resource has many consuming stages
- or does the stage have multiple outputs, and the emit method needs re-think?
messages
- can stages message each other directly?
- or do they spek event a sideband that other stages can consume
- or are they broadcase to all notds
amelioration: when are
- mixin of a module (calling super)
- wrapping a stage (before/after/around filter)
- flow insertion (put module before / after it in the flow)
delivery guarantees
control path
biographers

Have / Want

processor widgets

choke -- emits all records it receives but not faster than a given averaged rate
Uniform consistent sampling
dashpot tap
channel topic
aggregates -- sum, avg/stdev, %ile
graph change stream -- an activity stream of macro graph actions (topology or configuration changes).

topology widgets

many-to-many -- all input records sent to all output slots. (?input partition is somehow recorded in event metadata?).
switch -- each record sent to exactly one out slot.

contract

window --
cogroup
sort
grouping "fences" -- indicate start/end boundaries of a group in the stream (or in metadata) rather than by combining into a rolled-up record
buffer -- ...
retry --
load balance --
barrier -- execution pauses until some condition met
schedule -- ...

central-ish

counters, timers, gauges, values
announce / discover

understanding

simulate -- sampled yet illustrative flow goes end-to-end
explain -- processors explain what they do, appending it to the explanation passed along the graph
audit -- metadata element showing full provenance of the record

demo

Hamming's Problem
Newton's method

Provide feedback

Saved searches

Use saved searches to filter your results more quickly