Admittedly, the notion of an optimizer is somewhat fuzzy, and so is the class Optimizer.
We should attempt to clarify the definitions we are going to use in the library.
Definitions (to be added in a wiki)
Trainer: manages the optimization procedure of a model on a particular dataset.
Optimizer: optimizes a given objective function (e.g. a loss) by updating some parameters (the ones used in the computation of the objective function, i.e. the parameters of the model).
UpdateRule: something that modifies a direction (often the gradient) used to update some parameters.
BatchScheduler: manages the batches (number of examples, order of the examples) given to the learn function.
Loss: is responsible for outputting the Theano graph corresponding to the loss function to be optimized by the Optimizer. It takes as inputs a Model and a Dataset.
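The five definitions above could be sketched as minimal class interfaces. This is a hypothetical sketch: the names come from the definitions, but the constructors and attributes are illustrative, not the library's actual API.

```python
# Illustrative sketch of the five proposed concepts and how they relate.

class UpdateRule:
    """Modifies a direction (often the gradient) used to update parameters."""
    def apply(self, directions):
        raise NotImplementedError

class BatchScheduler:
    """Decides which examples, and in which order, feed the learn function."""
    def __init__(self, dataset, batch_size):
        self.dataset = dataset
        self.batch_size = batch_size

class Loss:
    """Builds the objective graph the Optimizer will minimize."""
    def __init__(self, model, dataset):
        self.model = model
        self.dataset = dataset

class Optimizer:
    """Optimizes a Loss by updating the model's parameters."""
    def __init__(self, loss, update_rules=()):
        self.loss = loss
        self.update_rules = list(update_rules)

class Trainer:
    """Manages the optimization procedure of a model on a dataset."""
    def __init__(self, optimizer, batch_scheduler):
        self.optimizer = optimizer
        self.batch_scheduler = batch_scheduler
```

Note that in this layout only the Loss knows about both the Model and the Dataset; the Optimizer sees only the Loss, and the Trainer wires everything together.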
Some questions
What should an optimizer take as input? The objective function to optimize; this is now a Loss class.
What are the different kinds of optimizers? (bold: already available in the library)
Zeroth order (needs only function evaluations; not used in practice)
Should an optimizer be agnostic to the notion of batch, batch size, batch ordering, etc.? Yes; we created a BatchScheduler for that.
What do we call ADAGRAD, Adam, Adadelta, etc.? Right now those are called UpdateRules.
Should we trivially allow multiple UpdateRules, or create a special UpdateRule that combines them as the user wants? Right now, we blindly apply them one after the other.
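The issue with blind chaining can be made concrete with a small sketch (the rule classes here are hypothetical examples, not rules from the library): applying the same two rules in a different order gives a different update.

```python
# Hypothetical update rules, each transforming a (scalar) direction.

class ScaleRule:
    """Multiplies the direction by a constant factor (e.g. a learning rate)."""
    def __init__(self, factor):
        self.factor = factor
    def apply(self, direction):
        return self.factor * direction

class ClipRule:
    """Clips the direction into [-bound, bound]."""
    def __init__(self, bound):
        self.bound = bound
    def apply(self, direction):
        return max(-self.bound, min(self.bound, direction))

def chain(update_rules, direction):
    # Blindly apply each rule to the output of the previous one,
    # as the library currently does; the result is order-dependent.
    for rule in update_rules:
        direction = rule.apply(direction)
    return direction

grad = 10.0
print(chain([ScaleRule(0.5), ClipRule(1.0)], grad))  # scale then clip -> 1.0
print(chain([ClipRule(1.0), ScaleRule(0.5)], grad))  # clip then scale -> 0.5
```

A special combining UpdateRule would make this ordering choice explicit instead of leaving it implicit in the list order.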
Is SGD really something in our framework? Yes, otherwise we would need a SMART-optim module.
Is L-BFGS simply what we call an update rule? No. It requires the current and past parameters, as well as the past gradients.
Can using the Hessian (e.g. in Newton's method) be seen as an update rule? No; using exact second-order information should be done in a dedicated subclass of Optimizer, which would then call the necessary methods of the model (e.g. hessian, or the Rop for Hessian-vector products).
Should the Optimizer be the one computing nb_updates_per_epoch? No, a BatchScheduler should do it.
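That division of labor could look like the following sketch, where the BatchScheduler alone knows the number of examples, the batch size, and the ordering (the constructor arguments and method names are assumptions for illustration):

```python
import random

class BatchScheduler:
    """Owns everything batch-related: size, ordering, and updates per epoch."""

    def __init__(self, nb_examples, batch_size, shuffle=False, seed=0):
        self.nb_examples = nb_examples
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.seed = seed

    @property
    def nb_updates_per_epoch(self):
        # Ceiling division: a partial final batch still counts as an update.
        return -(-self.nb_examples // self.batch_size)

    def batches(self):
        indices = list(range(self.nb_examples))
        if self.shuffle:
            random.Random(self.seed).shuffle(indices)
        for start in range(0, self.nb_examples, self.batch_size):
            yield indices[start:start + self.batch_size]

sched = BatchScheduler(nb_examples=10, batch_size=3)
print(sched.nb_updates_per_epoch)            # 4
print(sum(len(b) for b in sched.batches()))  # 10
```

The Optimizer then just consumes the batches it is handed, staying agnostic to batch size and ordering as proposed above.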
Suggestions
We could define a Loss class that would be provided to the optimizer. This class could know about the model and the dataset, and provide the necessary symbolic variables (maybe it should also build the givens for the Theano function).
The calls to update_rules.apply currently made in SGD should be moved inside Optimizer. The same goes for the calls to param_modifier.apply.
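The suggested refactor might look like this sketch (class and method names are illustrative, and the SGD subclass is reduced to supplying raw directions): the generic Optimizer owns the apply loops, so every subclass inherits them for free.

```python
class Optimizer:
    """Generic optimizer: owns the UpdateRule and ParamModifier loops."""

    def __init__(self, loss, update_rules=(), param_modifiers=()):
        self.loss = loss
        self.update_rules = list(update_rules)
        self.param_modifiers = list(param_modifiers)

    def _get_directions(self):
        raise NotImplementedError  # e.g. gradients for first-order methods

    def get_updates(self):
        directions = self._get_directions()
        # Previously these loops lived in SGD; here every subclass inherits them.
        for rule in self.update_rules:
            directions = rule.apply(directions)
        for modifier in self.param_modifiers:
            directions = modifier.apply(directions)
        return directions

class SGD(Optimizer):
    """An SGD subclass only has to say what the raw directions are."""

    def __init__(self, loss, gradients, **kwargs):
        super().__init__(loss, **kwargs)
        self.gradients = gradients

    def _get_directions(self):
        return self.gradients

class ClipModifier:
    """A toy param modifier clipping each direction into [-bound, bound]."""
    def __init__(self, bound):
        self.bound = bound
    def apply(self, directions):
        return [max(-self.bound, min(self.bound, d)) for d in directions]

sgd = SGD(loss=None, gradients=[5.0, -0.5], param_modifiers=[ClipModifier(1.0)])
print(sgd.get_updates())  # [1.0, -0.5]
```

With this layout, adding a second-order subclass later would not require duplicating the update-rule or param-modifier handling.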