philosophy
Things in (?question blocks?) are unresolved.
-
Simple: The thing you would draw at the whiteboard, spoken aloud, is the thing you write.
-
Scalable: Scalability is more important than performance. Scalability is important even when there are no performance concerns -- scalable for robots means scalable for people.
-
Readable: The plot should be as simple as the story. Orthogonal concerns (schema, transformation, topology, transport, resources, and configuration) should be crisply separated.
Every Hanuman graph can be expressed in any of these forms:
- Buffered flow -- Flume
- Event-driven flow -- cat, HTTP post, Hadoop
- Picture -- graphviz etc
- Explanation -- each stage appends to the record a plain-language description of the action it takes on it
Wherever concepts can be paired across the following, they should be, until they prove how they are distinct:
- micro dataflow
- macro dataflow
- workflow
- subroutine
- system diagram
- every edge connects a stage to a resource or a resource to a stage
- The resource might be implied by the edge, and elided from the declaration.
- Kafka, ZeroMQ, Storm, Esper, Flume, Pusher
- Rake, Chef, Shell scripts, Thor
- Rack
- Hadoop
-
stages
- resources
- actions -- alter, augment or act on data. Some important types of actions:
- transformers -- actions with exactly one input and one output
- sources -- actions with zero inputs and (?one or more outputs?). Sources may be different in another way, not sure yet
- sinks -- actions with (?one or more inputs?) and no output.
- taps -- actions that make no changes to the data (?can a tap add metadata? can it mutate metadata?)
- graphs -- contains other stages. A graph is also an action.
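The action taxonomy above can be sketched as plain Ruby classes. This is an illustration only -- the class names (`Stage`, `Upcaser`, `NumberSource`, `Collector`) are hypothetical, not Hanuman's API:

```ruby
class Stage
  def process(record) ; end
end

# transformer -- exactly one input and one output
class Upcaser < Stage
  def process(record)
    record.upcase
  end
end

# source -- zero inputs; emits records downstream
class NumberSource < Stage
  def each
    (1..3).each { |n| yield n }
  end
end

# sink -- consumes records, emits nothing
class Collector < Stage
  attr_reader :seen
  def initialize
    @seen = []
  end
  def process(record)
    @seen << record
    nil
  end
end
```

A graph, being itself an action, would expose the same `process` interface while delegating to the stages it contains.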
-
schema
-
record
-
control path
-
configuration
-
topic -- a filter. (?This filter is special because the transport can ...?)
-
runner -- pairs a graph with concrete executors and resources
(?questions exist about the following?)
- partition --
- chain -- (?a group of sequential edges understood to be related in an important way?)
- delivery guarantee --
- contract
Data is primary. Data is represented as an "arbitrary record with sideband metadata". That is, rather than an event wrapper with a message body you access, the data itself is primary and the metadata is accessible through it.
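One way to sketch "data is primary, metadata is sideband" in Ruby is a delegator: the record behaves exactly like its data, with metadata reachable through it. A minimal sketch, assuming a hypothetical `Record` class:

```ruby
require 'delegate'

# The record IS the data (all calls delegate to it); metadata rides sideband.
class Record < SimpleDelegator
  attr_reader :metadata
  def initialize(data, metadata = {})
    super(data)
    @metadata = metadata
  end
end

rec = Record.new({ user: 'joe' }, source: 'api')
rec[:user]              # accessed directly, no wrapper to unwrap
rec.metadata[:source]   # sideband, accessible through the record
```

Contrast with an event-wrapper design, where every stage would have to call something like `event.body[:user]`.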
Flows are understandable in isolation:
Unless a processor explicitly maintains state or mixes in external entropy, the outcome of a data stream may be completely characterized by
- Processor
- Configuration (static parameters delivered at run-time to the processors on the graph)
- Data record itself
- Metadata attached to that record
The important point is that the outcome of a processor is not affected by changes to
- the graph as a whole, including the stages that it pulls from and writes to
- the machine or machines it runs on
- the transport layer handling the flow
- where, spatially or temporally, the data originates or proceeds to.
As long as it is fed the same data and the same metadata with the same configuration, a processor doesn't care what's running it, where it is, who fed it the data or who is consuming it.
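In other words, a processor's outcome is a pure function of its configuration, the record, and the record's metadata. A sketch with hypothetical names:

```ruby
# Outcome depends only on config + record + metadata -- no state, no
# knowledge of the graph, machine, or transport around it.
class Scaler
  def initialize(config)
    @factor = config.fetch(:factor)  # static parameter delivered at run-time
  end

  def process(record, metadata = {})
    record * @factor
  end
end
```

Because of this, the same processor can be dropped unchanged into a Flume flow, a shell pipe, or a Hadoop job and be tested in complete isolation.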
-
guarantees
- end-to-end -- ensures acknowledgement by destination
- next-hop -- guarantees successful handoff to next stage, but nothing more
- best effort -- fire and forget
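The three guarantee levels can be illustrated as delivery strategies. A hypothetical sketch (none of these names are Hanuman's):

```ruby
class Delivery
  def initialize(next_hop, destination = nil)
    @next_hop, @destination = next_hop, destination
  end

  # best effort -- fire and forget; errors are swallowed
  def best_effort(record)
    @next_hop.call(record) rescue nil
    true
  end

  # next-hop -- succeeds only if the immediate handoff succeeds
  def next_hop(record)
    @next_hop.call(record) or raise "handoff to next stage failed"
  end

  # end-to-end -- succeeds only when the final destination acknowledges
  def end_to_end(record)
    @next_hop.call(record)
    raise "no acknowledgement from destination" unless @destination.acked?(record)
    true
  end
end
```

Note the cost gradient: end-to-end needs an acknowledgement channel all the way back from the destination, next-hop only a return value from the adjacent stage, best effort nothing at all.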
-
naming stages -- here are some desirable features:
- unambiguously referencable
- name doesn't change if irrelevant changes are made to the graph
- name does change if relevant changes are made
- name is predictable (by looking at the graph)
- name is readable
- name is unaffected by configuration
-
fanout: when several stages consume a given stage's outputs,
- does the stage have one output (a resource), with many stages consuming from that resource,
- or does the stage have multiple outputs, so that the emit method needs a re-think?
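The first option can be sketched as a resource that owns the fanout: the stage emits once, and the resource (a hypothetical `Channel` here) distributes to its subscribers.

```ruby
# The stage writes once to this resource; the resource fans out to
# however many consuming stages have subscribed.
class Channel
  def initialize
    @consumers = []
  end

  def subscribe(&blk)
    @consumers << blk
  end

  def publish(record)
    @consumers.each { |c| c.call(record) }
  end
end
```

Under this design a stage keeps exactly one output and never knows its consumer count -- the fanout question becomes a property of the resource, not the stage.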
-
messages
- can stages message each other directly?
- or do they speak events to a sideband that other stages can consume?
- or are they broadcast to all nodes?
-
amelioration: when is each of the following appropriate?
- mixin of a module (calling super)
- wrapping a stage (before/after/around filter)
- flow insertion (put module before / after it in the flow)
-
delivery guarantees
-
biographers
- choke -- emits all records it receives but not faster than a given averaged rate
- Uniform consistent sampling
- dashpot tap
- channel topic
- aggregates -- sum, avg/stdev, %ile
- graph change stream -- an activity stream of macro graph actions (topology or configuration changes).
- many-to-many -- all input records sent to all output slots. (?input partition is somehow recorded in event metadata?).
- switch -- each record sent to exactly one out slot.
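The choke biographer, for instance, could be a simple pacing transformer -- it emits every record it receives, just never faster than the configured average rate. A sketch with hypothetical names (the injectable `clock` is only there to make it testable):

```ruby
# Emits all records, but paces them to at most `rate` records per second
# on average, sleeping when records arrive too quickly.
class Choke
  def initialize(rate, clock: -> { Time.now.to_f })
    @interval = 1.0 / rate
    @clock    = clock
    @ready_at = clock.call
  end

  def process(record)
    now = @clock.call
    sleep(@ready_at - now) if @ready_at > now
    @ready_at = [@ready_at, now].max + @interval
    record
  end
end
```

Unlike a sampler, nothing is dropped; a choke trades latency for a bounded downstream rate.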
-
window --
-
cogroup
-
sort
-
grouping "fences" -- indicate start/end boundaries of a group in the stream (or in metadata) rather than combining the group into a rolled-up record
-
buffer -- ...
-
retry --
-
load balance --
-
barrier -- execution pauses until some condition met
-
schedule -- ...
- counters, timers, gauges, values
- announce / discover
- simulate -- sampled yet illustrative flow goes end-to-end
- explain -- processors explain what they do, appending it to the explanation passed along the graph
- audit -- metadata element showing full provenance of the record
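The explain behavior can be sketched as each processor appending its plain-language action to an explanation carried in the record's metadata. Hypothetical names throughout:

```ruby
# Accumulates each processor's narration in the sideband metadata.
module Explainer
  def self.narrate(record, metadata, action)
    (metadata[:explanation] ||= []) << action
    record
  end
end

# an upcasing transformer narrating itself:
def upcase_with_explanation(record, metadata)
  Explainer.narrate(record.upcase, metadata, "upcased the record")
end
```

Run a sampled record end-to-end and the accumulated `metadata[:explanation]` reads as the plot of the graph -- which is exactly what simulate and audit would build on.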
- Hamming's Problem
- Newton's method
- Kafka design
- ZeroMQ design and its pattern language
- http://en.wikipedia.org/wiki/Lucid_(programming_language)#Details
- http://en.wikipedia.org/wiki/Oz_(programming_language)