How does Nodlin manage the propagation of change?

Jun 26, 2025 · 9 min read

Nodlin models the world as one large dependency graph. The nodes in the graph represent data elements, and these nodes change in relation to each other.

Managing change is key in Nodlin because of the vast networks of nodes that need to be updated. To handle change successfully, Nodlin employs a number of techniques and heuristics.

This description is technical in nature and assumes an understanding of the core principles of distributed systems.

As a reminder, a node has some detail (a payload), relationships to other nodes (immediate connections that may or may not be interested in change), and can process some defined actions (e.g. create and delete at a minimum) and events (from related nodes).

The Nodlin scheduler of operations (aka the Nodlin opProcessor) is a Dragonfly cluster (Redis-compatible) that features a number of enhancements for concurrent processing. The opProcessor is responsible for scheduling instructions to all the Agents handling change (processing events and actions) across the network of nodes.

Key principle 1: Events do NOT have payloads

An action on a node carries a payload (new detail to apply). An event on a node carries no payload.

This may seem strange at first, as in most distributed systems an event will carry a payload detailing at least some information that the target system can reason about. Software engineers can discuss at length what such an event payload should contain (enough to avoid the 'thundering herd' problem, not exposing so much internal detail that it creates unnecessary coupling, etc.).

Nodlin would not work if events had payloads [1]. This is because we de-dupe events to handle the volume.

Example graph (tree): a simple Task node whose total effort is its own effort plus the sum of the effort spent on its child tasks.

In the example above you can see a Task node type whose total is simply the sum of the effort spent on that task and on its children. When one of the Tasks is updated (by an Action), the Task over the 'parent' relationship will receive an event and update its record, and in turn will inform its own parent, and so forth. If this tree is hundreds of levels deep, with many thousands of tasks, this can be computationally expensive. If events had a payload (say a record of the delta), each one would need to be processed, and the task at the top of the tree would be continually recomputed for every change at the bottom. If events instead have no payload and we de-dupe them, then when the task at the top receives say 100 updates in a one-second period, it only needs to compute its total once for those 100 'update' events from its children.
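
De-duplication works here precisely because events carry no payload: dropping a duplicate loses nothing. The sketch below illustrates the coalescing behaviour with a hypothetical in-memory queue; the real opProcessor runs on Dragonfly, and all names here are illustrative only.

```python
# Minimal sketch of payload-free event de-duplication: identical
# (node_id, event_type) pairs collapse into one pending entry.
from collections import deque

class DedupeQueue:
    def __init__(self):
        self._pending = set()    # (node_id, event_type) currently queued
        self._order = deque()    # FIFO of unique pending events

    def enqueue(self, node_id: str, event_type: str) -> bool:
        key = (node_id, event_type)
        if key in self._pending:
            return False         # duplicate: safe to drop, no payload lost
        self._pending.add(key)
        self._order.append(key)
        return True

    def dequeue(self) -> tuple:
        key = self._order.popleft()
        self._pending.discard(key)
        return key

    def __len__(self) -> int:
        return len(self._order)

q = DedupeQueue()
for _ in range(100):             # 100 child 'update' events in one second...
    q.enqueue("task-root", "update")
print(len(q))                    # ...collapse to 1: the root recomputes once
```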

Note that when a node is given an event to process, the delivery will contain the node's current state (payload) and the payloads of its immediate neighbour nodes. The state of a node is based on its internal state (recorded by actions or computed) and the state of its immediate dependencies. Events do not need payloads because the state of connected nodes is provided.
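
As a sketch of what that delivery might look like (field names are assumptions, not Nodlin's actual schema), an agent could receive something like:

```python
# Illustrative shape of an event delivery: the event itself carries no
# payload, but arrives with the target node's state and the payloads of
# its immediate dependencies.
from dataclasses import dataclass, field

@dataclass
class NodePayload:
    node_id: str
    version: int
    detail: dict                 # the node's internal state

@dataclass
class EventDelivery:
    event_type: str              # e.g. "update" -- no payload of its own
    target: NodePayload          # current state of the node to process
    neighbours: list[NodePayload] = field(default_factory=list)

def handle_update(delivery: EventDelivery) -> dict:
    # Derived state is recomputed purely from the supplied payloads; no
    # extra reads are needed, which is what lets events stay payload-free.
    total = delivery.target.detail.get("own_effort", 0)
    total += sum(n.detail.get("total_effort", 0) for n in delivery.neighbours)
    return {"total_effort": total}
```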

The level of de-duplication applied becomes more important the longer the dependency chain.

Key principle 2: Loops are expected

Loops are expected. We are dealing with node details that are not simple values but, most likely, collections of values. Taking a simple task example, a task may want to know whether its related sub-tasks (of the same type) are complete. The child task may want to know whether it will complete within the same constraint as its parent (say, not scheduled to complete after its parent's end date). An update to the parent task may result in an update to the sub-task, which results in a further update to the parent. These kinds of dependencies are expected in more complex node types. We allow for them up to a configurable limit. It is expected that the nodes in the graph will reach an equilibrium within a sensible period; Nodlin is not ideal for simulation-style processing.

When a loop is detected, a block is put in place on the node (for a configurable time period) and the client is informed. Without any client action the block will be removed and processing will continue.

Internally, Nodlin creates a trace ID at the start of any action on a node (e.g. a user update). Using the trace ID, Nodlin can identify duplicate updates (up to a point, accounting for cross-trace impacts). During de-duplication of events, heuristics are applied to keep the longest trace executing, as that is the trace most likely to identify a loop.
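
A minimal sketch of how trace-based loop detection could work, assuming a per-trace visit counter and a configurable limit (the actual heuristics are not published):

```python
# Hedged sketch: each derived operation inherits the trace ID of the
# action that started it, and a node reappearing too often on one trace
# is treated as a probable loop and temporarily blocked.
from collections import defaultdict

class TraceTracker:
    def __init__(self, loop_limit: int = 10):    # limit is configurable
        self.loop_limit = loop_limit
        self._visits = defaultdict(int)          # (trace_id, node_id) -> count

    def record(self, trace_id: str, node_id: str) -> bool:
        """Returns False when the node exceeds the limit on this trace,
        i.e. a probable loop; the caller would then block the node."""
        self._visits[(trace_id, node_id)] += 1
        return self._visits[(trace_id, node_id)] <= self.loop_limit
```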

While a node is blocked, a lag is applied to further operations on it until the loop is resolved.

If loops are not resolved, there will be a higher load on agent processing. Applying a lag to the longer chains will reduce overall load, but these loops will need to be addressed client-side.

Note that the longer the ‘chain’ of derived operations, the longer the lag we apply to execution. This will capture more duplicates (if they occur) and therefore reduce overall system load.

Warning

Further practical analysis is required to optimise and configure the handling of loops. In practice, loops have not created issues in the examples we have defined so far. Loops are only an issue in respect of the resource consumption that would result in system instability, and the additional lag we apply to longer chains acts as a safeguard. At a high level, loops are common in business. You may determine an average cost for employees in budget calculations; this factor may then be used in budget planning to determine a cost per project. In the manual business process this 'average' calculation may only be performed once per budget cycle, but if it is applied automatically in Nodlin you may form a loop. In Nodlin, an approval workflow node (or something of that nature) would resolve these loops.

In addition, some further improvements may be required to isolate the root cause of a loop. Currently the first node that exceeds the defined limit is the one that is blocked.

Key principle 3: Nodes that are closest to the node where the initial change was initiated take priority

From a visual perspective, the user expects related nodes to update in response to a change. The Nodlin opProcessor will apply a lag to operations beyond a specific distance from the source node of the change. Note that a lag is only applied under high system load (we monitor system load reported from Kubernetes to throttle priority behaviour across the system). All of these properties are configurable in a Nodlin cluster.
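
As an illustration only (the thresholds, radius, and load signal below are assumptions), the distance-based lag might look like:

```python
# Sketch of distance-based prioritisation: no lag near the source, and
# lag at all only under high load, using the load figure reported from
# Kubernetes. All names and thresholds here are illustrative.
def operation_lag_ms(distance: int, system_load: float,
                     free_radius: int = 3, base_ms: int = 25,
                     load_threshold: float = 0.8) -> int:
    if system_load < load_threshold or distance <= free_radius:
        return 0                               # near nodes update immediately
    return (distance - free_radius) * base_ms  # further nodes wait longer

# e.g. at 90% load, a node 10 hops from the change waits 175 ms
print(operation_lag_ms(distance=10, system_load=0.9))
```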

Key principle 4: Parallel execution of change is a necessity

Performance of change across a Nodlin network is key. As is well documented, concurrency is about the structure of independent processes; parallelism is about their execution. The Nodlin opProcessor runs on a Dragonfly cluster, which provides some concurrent and parallelised execution.

The Nodlin opProcessor employs a simple mechanism of operations (instructions to agents for actions and events on specific nodes) and checkpoints, which are dependencies between operations governing when they execute. This model is not a stack but a tree, which provides the benefit of potential parallel execution.

An example:

Operations and Checkpoints

The Nodlin opProcessor maintains a record of all current operations and checkpoints. If, for example, an agent processing an Action on a node reports 3 additional operations (other actions or events), as in the example above, those operations will be instructed by the opProcessor immediately. If these operations are picked up for processing across multiple machines, they can all execute in parallel. An agent should be stateless (it is provided with the current node state and those of its relationships), so there may be multiple instances across machines handling the same type of node. Note that Kubernetes may scale agents up and down based on load, all controlled through the Nodlin administration panel.

A checkpoint is simply a control point. Once all of its prior operations have succeeded (the operations that have this checkpoint as their target), any successor operations (those that have this checkpoint as their start condition) can be scheduled for execution.
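
A minimal sketch of that gating logic, assuming in-memory bookkeeping (the real opProcessor keeps this state in Dragonfly):

```python
# Operations gated by checkpoints: a checkpoint releases its successor
# operations once every predecessor operation targeting it has succeeded.
from collections import defaultdict

class CheckpointScheduler:
    def __init__(self):
        self._waiting = defaultdict(set)      # checkpoint -> pending op IDs
        self._successors = defaultdict(list)  # checkpoint -> ops it releases

    def add_operation(self, op_id: str, start: str | None, target: str):
        self._waiting[target].add(op_id)
        if start is not None:
            self._successors[start].append(op_id)

    def complete(self, op_id: str, target: str) -> list:
        """Report success; returns operations now ready to schedule,
        which independent agents may pick up and run in parallel."""
        self._waiting[target].discard(op_id)
        if not self._waiting[target]:
            return self._successors.pop(target, [])
        return []

s = CheckpointScheduler()
s.add_operation("op-1", None, "cp-A")
s.add_operation("op-2", None, "cp-A")
s.add_operation("op-3", "cp-A", "cp-B")  # runs only once cp-A clears
print(s.complete("op-1", "cp-A"))        # [] -- op-2 still pending
print(s.complete("op-2", "cp-A"))        # ['op-3']
```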

Key principle 5: Don’t assume any ordering of operations

Handling race conditions between operations is key to reliable execution. The opProcessor cannot expect to have received and processed the status of a successful operation before receiving any further operations requested during its execution. Therefore any communication about execution (success or otherwise) will include the operation IDs and the to/from checkpoint detail for that operation. A client-side 'helper' library for communicating with Nodlin is provided and can be used to build external agents for integration with firm systems.

Agents (script or external) can use these simple ‘checkpoint’ functions and operation helpers (for actions and events) to enact behaviour on the Nodlin network.
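The helper library's actual API is not shown in this post; the stub below is a sketch of the kind of surface it might expose, with every name assumed for illustration:

```python
# Illustrative stub of a client-side helper: every status message echoes
# the operation ID and its to/from checkpoints, since ordering of
# messages cannot be assumed. This is not the published Nodlin API.
import uuid

class NodlinHelper:
    def __init__(self):
        self.outbox: list = []     # stands in for the wire protocol

    def report_success(self, operation_id: str,
                       from_checkpoint: str, to_checkpoint: str) -> None:
        self.outbox.append({"kind": "status", "ok": True, "op": operation_id,
                            "from": from_checkpoint, "to": to_checkpoint})

    def request_event(self, node_id: str, event: str,
                      target_checkpoint: str) -> str:
        op_id = str(uuid.uuid4())  # new operation, identified up front
        self.outbox.append({"kind": "event", "op": op_id, "node": node_id,
                            "event": event, "to": target_checkpoint})
        return op_id

helper = NodlinHelper()
op = helper.request_event("task-7-parent", "update", "cp-B")
helper.report_success(op, "cp-A", "cp-B")
```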

Key principle 6: Identify ‘hot’ nodes

The opProcessor will identify nodes that update continuously. This is configurable through the administration panel and defined as more than N operations in a Y-second window. An additional lag will be applied to updates by the opProcessor to capture (and de-duplicate) more operations, thereby reducing overall load on agent processing.
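
A sliding-window counter is one simple way to implement this; the sketch below assumes values for N and Y, which in Nodlin are set through the administration panel:

```python
# Sketch of 'hot' node detection: more than n_ops operations within a
# window_s-second window marks the node as hot, triggering extra lag.
import time
from collections import defaultdict, deque

class HotNodeDetector:
    def __init__(self, n_ops: int = 20, window_s: float = 5.0):
        self.n_ops, self.window_s = n_ops, window_s
        self._times = defaultdict(deque)   # node_id -> recent op timestamps

    def record(self, node_id: str, now: float | None = None) -> bool:
        """Record one operation; returns True while the node is 'hot'."""
        now = time.monotonic() if now is None else now
        q = self._times[node_id]
        q.append(now)
        while q and now - q[0] > self.window_s:
            q.popleft()                    # drop timestamps outside the window
        return len(q) > self.n_ops
```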

Key principle 7: Allow user updates on nodes (temporary block)

In Nodlin all the nodes are versioned. A node will also record the version of the nodes on which it is dependent (so we can recognise that it is consistent). Nodes that are not consistent with a related node are visually highlighted in Nodlin with a red edge.

The state of a node depends on its own internal payload and on those of the immediate neighbours on which it is dependent.

When the user updates detail on a node (its internal payload), the client must provide the version it is updating (and optionally a new version reference). If the version to be updated does not match the one on record, the operation will fail.
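
This is standard optimistic concurrency control; a minimal sketch of the version check (names assumed):

```python
# Compare-and-set on the node version: a stale client write fails
# rather than silently overwriting newer state.
class VersionConflict(Exception):
    pass

def apply_update(node: dict, expected_version: int, new_detail: dict,
                 new_version: int | None = None) -> dict:
    if node["version"] != expected_version:
        raise VersionConflict(
            f"node is at v{node['version']}, client expected v{expected_version}")
    node["detail"].update(new_detail)
    # use the client-supplied reference if given, otherwise bump
    node["version"] = new_version if new_version is not None else node["version"] + 1
    return node

node = {"version": 4, "detail": {"own_effort": 2}}
apply_update(node, expected_version=4, new_detail={"own_effort": 3})
# a second call with expected_version=4 would now raise VersionConflict
```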

These race conditions will be more prevalent on nodes that are regularly updated (or 'hot'). To allow updates on such nodes, a client app can reserve a block for N seconds (default 30 seconds). A lag will be applied to any operation on the node during this period. The block request needs to be repeated by the client if an edit is still in progress; this should of course be transparent to the user.
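
A sketch of the block reservation, assuming the client renews by simply calling reserve again (the 30-second default is from the text; everything else is illustrative):

```python
# Edit-block reservation: operations on a reserved node are lagged, not
# dropped, and the reservation lapses unless the client repeats it.
import time

class EditBlock:
    def __init__(self, default_s: float = 30.0):   # default from the text
        self.default_s = default_s
        self._until: dict = {}                     # node_id -> expiry time

    def reserve(self, node_id: str, now: float | None = None) -> None:
        now = time.monotonic() if now is None else now
        self._until[node_id] = now + self.default_s   # repeat call renews

    def should_lag(self, node_id: str, now: float | None = None) -> bool:
        now = time.monotonic() if now is None else now
        return self._until.get(node_id, 0.0) > now
```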



[1] Payloads could be enabled on events. An event could have a payload as long as you did not mind it getting dropped (e.g. like losing packets in voice communications over a network). However, we do not currently see a use-case for this and do not want developers to mistakenly rely on event payloads.
