Proposed Learner Design

Kyle Daruwalla edited this page Nov 10, 2020 · 5 revisions

Note: This wiki has now been moved to https://github.com/FluxML/ML-Coordination-Tracker/wiki

Proposal for a Learner Design

A Learner is a utility for training ML models. It combines a model, training and validation data, and the logic that controls training. The training logic is contained in Stages and Callbacks. A Learner is composed of a sequence of Stages, and each Stage may be preceded and followed by optional Callbacks. There is exactly one of each Stage, but there may be many Callbacks. Stages may invoke sub-Stages, but Callbacks cannot invoke other Callbacks. Both Stages and Callbacks may monitor and modify the State of the Learner.

Example Learner

  • Learner
    • Epoch Stage
      • Fit Stage
        • Before Batch Callback
        • Fit Batch Stage
          • Fit Gradient/Loss Stage
            • After Gradient Loss Callback 1
            • After Gradient Loss Callback 2
          • Fit Update Model Stage
      • Validate Stage
        • Validate Batch Stage
          • Validate Loss Stage
        • After Validate Callback
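
One way to read the outline above is as a tree that is walked depth-first once per epoch. The following is a minimal sketch in Python, chosen only for illustration since the proposal does not prescribe a language. The `Node` class and `walk` function are hypothetical names, and the position of each Callback relative to its Stage is an assumption inferred from the Callback names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A Stage from the outline plus the Callbacks registered around it (hypothetical)."""
    name: str
    before: List[str] = field(default_factory=list)        # Callbacks that precede the Stage
    children: List["Node"] = field(default_factory=list)   # sub-Stages, in execution order
    after: List[str] = field(default_factory=list)         # Callbacks that follow the Stage


def walk(node: Node) -> List[str]:
    """Return the order in which Callbacks and leaf Stages would run."""
    order = list(node.before)
    if node.children:
        # A Stage with sub-Stages delegates its work to them.
        for child in node.children:
            order.extend(walk(child))
    else:
        # A leaf Stage does the actual work itself.
        order.append(node.name)
    order.extend(node.after)
    return order


example = Node("Epoch Stage", children=[
    Node("Fit Stage", children=[
        Node("Fit Batch Stage",
             before=["Before Batch Callback"],
             children=[
                 Node("Fit Gradient/Loss Stage",
                      after=["After Gradient Loss Callback 1",
                             "After Gradient Loss Callback 2"]),
                 Node("Fit Update Model Stage"),
             ]),
    ]),
    Node("Validate Stage",
         children=[Node("Validate Batch Stage",
                        children=[Node("Validate Loss Stage")])],
         after=["After Validate Callback"]),
])

for step in walk(example):
    print(step)
```

The sketch runs each Stage once. A real Fit Stage would presumably repeat its Batch Stage for every batch in the training data.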

Goals of Learner

  • Allow state-of-the-art best practices to be captured as Stages and Callbacks
  • Allow plug-in combinations of Stages and Callbacks by outside developers
  • Enable easy debugging of combinations of Stages and Callbacks
    • Allow logging of changes of state by Stages and Callbacks
    • Prevent access to invalid state
    • Prevent invalid modification of state
  • Allow Stages to run in parallel if desired
    • Allow Stages and Callbacks that implement parallel and distributed training algorithms

Learner Interfaces

  • State: value of all variables involved in training a model at a certain point in the training process
  • Stage: logic that transforms one state into another. For example, calculating the gradient of the loss is one Stage, and updating the model is another. Stages represent different parts of the training process, so the interface to each Stage is different. Stages may also invoke other Stages. For example, the Epoch Stage invokes Batch Stages, which invoke Gradient and Update Stages.
  • Callback: logic that monitors state and may also transform one state into another. Callbacks are registered to precede or follow Stages. Callbacks may not invoke other Callbacks.
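
To make the three interfaces concrete, here is a minimal Python sketch under several simplifying assumptions: State is a plain dictionary, every Stage shares one `run(state) -> state` signature (the proposal expects per-Stage interfaces to differ), and the model is a single scalar weight `w` fitted with squared error. All class and function names below are hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional

State = Dict[str, Any]                # assumption: State is a plain dict
Callback = Callable[[State], State]   # monitors and/or transforms State


class Stage:
    """Transforms one State into another, with Callbacks before and after."""

    def __init__(self, before: Optional[List[Callback]] = None,
                 after: Optional[List[Callback]] = None) -> None:
        self.before = list(before or [])
        self.after = list(after or [])

    def run(self, state: State) -> State:
        raise NotImplementedError

    def __call__(self, state: State) -> State:
        for callback in self.before:
            state = callback(state)
        state = self.run(state)
        for callback in self.after:
            state = callback(state)
        return state


class GradientStage(Stage):
    """Computes loss and gradient for the scalar model y = w * x."""

    def run(self, state: State) -> State:
        x, y = state["batch"]
        err = state["w"] * x - y
        return {**state, "loss": err ** 2, "grad": 2 * err * x}


class UpdateStage(Stage):
    """Applies one gradient-descent step to the weight."""

    def run(self, state: State) -> State:
        return {**state, "w": state["w"] - state["lr"] * state["grad"]}


class BatchStage(Stage):
    """A Stage that invokes sub-Stages in order."""

    def __init__(self, substages: List[Stage], **kwargs: Any) -> None:
        super().__init__(**kwargs)
        self.substages = substages

    def run(self, state: State) -> State:
        for stage in self.substages:
            state = stage(state)
        return state


def clip_gradient(state: State) -> State:
    """Example Callback: clamps the gradient to the range [-1, 1]."""
    return {**state, "grad": max(-1.0, min(1.0, state["grad"]))}


batch = BatchStage([GradientStage(after=[clip_gradient]), UpdateStage()])
state = batch({"w": 0.0, "lr": 0.5, "batch": (1.0, 2.0)})
print(state["w"], state["loss"])  # prints "0.5 4.0"
```

Callbacks here are plain functions that never receive a reference to other Callbacks, which is one simple way to respect the rule that Callbacks may not invoke each other.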

Open Questions

  • Should Stages be synchronous? Should multiple Stages be allowed to run in parallel?
  • Should Callbacks be synchronous? Should multiple Callbacks be allowed to run in parallel?
  • How should State be represented?
  • How to implement access/modify control of State?