Skip to content

Developer Guide

Codebase Structure

The codebase is organized in a modular, datatype / feature centric way so that adding a feature for a new datatype is pretty straightforward and requires isolated code changes. All the datatype specific logic lives in the corresponding feature module all of which are under ludwig/features/.

Feature classes contain raw data preprocessing logic specific to each data type. All input (output) features implement build_input (build_output) method which is used to build encodings (decode outputs). Output features also contain datatype-specific logic to compute output measures such as loss, accuracy, etc.

Encoders and decoders are modularized as well (they are under ludwig/models/modules) so that they can be used by multiple features. For example sequence encoders are shared among text, sequence, and timeseries features.

Various model architecture components which can be reused are also split into dedicated modules, for example convolutional modules, fully connected modules, etc.

Bulk of the training logic resides in ludwig/models/model.py which initializes a tensorflow session, feeds the data, and executes training.

Adding an Encoder

1. Add a new encoder class

Source code for encoders lives under ludwig/models/modules. New encoder objects should be defined in the corresponding files, for example all new sequence encoders should be added to ludwig/models/modules/sequence_encoders.py.

All the encoder parameters should be provided as arguments in the constructor with their default values set. For example RNN encoder takes the following list of arguments in its constructor:

def __init__(
    self,
    should_embed=True,
    vocab=None,
    representation='dense',
    embedding_size=256,
    embeddings_trainable=True,
    pretrained_embeddings=None,
    embeddings_on_cpu=False,
    num_layers=1,
    state_size=256,
    cell_type='rnn',
    bidirectional=False,
    dropout=False,
    initializer=None,
    regularize=True,
    reduce_output='last',
    **kwargs
):

Typically all the dependencies are initialized in the encoder's constructor (in the case of the RNN encoder these are EmbedSequence and RecurrentStack modules) so that at the end of the constructor call all the layers are fully described.

Actual creation of tensorflow variables takes place inside the __call__ method of the encoder. All encoders should have the following signature:

__call__(
    self,
    input_placeholder,
    regularizer,
    dropout,
    is_training
)

Inputs

  • input_placeholder (tf.Tensor): input tensor.
  • regularizer (A (Tensor -> Tensor or None) function): regularizer function passed to tf.get_variable method.
  • dropout (tf.Tensor(dtype: tf.float32)): dropout rate.
  • is_training (tf.Tensor(dtype: tf.bool), default: True): boolean indicating whether this is a training dataset.

Return

  • hidden (tf.Tensor(dtype: tf.float32)): feature encodings.
  • hidden_size (int): feature encodings size.

Encoders are initialized as class member variables in input feature object constructors and called inside build_input methods.

2. Add the new encoder class to the corresponding encoder registry

Mapping between encoder keywords in the model definition and encoder classes is done by encoder registries: for example sequence encoder registry is defined in ludwig/features/sequence_feature.py

sequence_encoder_registry = {
    'stacked_cnn': StackedCNN,
    'parallel_cnn': ParallelCNN,
    'stacked_parallel_cnn': StackedParallelCNN,
    'rnn': RNN,
    'cnnrnn': CNNRNN,
    'embed': EmbedEncoder
}

Adding a Decoder

1. Add a new decoder class

Souce code for decoders lives under ludwig/models/modules. New decoder objects should be defined in the corresponding files, for example all new sequence decoders should be added to ludwig/models/modules/sequence_decoders.py.

All the decoder parameters should be provided as arguments in the constructor with their default values set. For example Generator decoder takes the following list of arguments in its constructor:

__init__(
    self,
    cell_type='rnn',
    state_size=256,
    embedding_size=64,
    beam_width=1,
    num_layers=1,
    attention_mechanism=None,
    tied_embeddings=None,
    initializer=None,
    regularize=True,
    **kwargs
)

Decoders are initialized as class member variables in output feature object constructors and called inside build_output methods.

2. Add the new decoder class to the corresponding decoder registry

Mapping between decoder keywords in the model definition and decoder classes is done by decoder registries: for example sequence decoder registry is defined in ludwig/features/sequence_feature.py

sequence_decoder_registry = {
    'generator': Generator,
    'tagger': Tagger
}

Adding a new Feature Type

1. Add a new feature class

Souce code for feature classes lives under ludwig/features. Input and output feature classes are defined in the same file, for example CategoryInputFeature and CategoryOutputFeature are defined in ludwig/features/category_feature.py.

An input features inherit from the InputFeature and corresponding base feature classes, for example CategoryInputFeature inherits from CategoryBaseFeature and InputFeature.

Similarly, output features inherit from the OutputFeature and corresponding base feature classes, for example CategoryOutputFeature inherits from CategoryBaseFeature and OutputFeature.

Feature parameters are provided in a dictionary of key-value pairs as an argument to the input or output feature constructor which contains default parameter values as well.

All input and output features should implement build_input and build_output methods correspondingly with the following signatures:

build_input

build_input(
    self,
    regularizer,
    dropout_rate,
    is_training=False,
    **kwargs
)

Inputs

  • regularizer (A (Tensor -> Tensor or None) function): regularizer function passed to tf.get_variable method.
  • dropout_rate (tf.Tensor(dtype: tf.float32)): dropout rate.
  • is_training (tf.Tensor(dtype: tf.bool), default: True): boolean indicating whether this is a training dataset.

Return

  • feature_representation (dict): the following dictionary
{
    'type': self.type, # str
    'representation': feature_representation, # tf.Tensor(dtype: tf.float32)
    'size': feature_representation_size, # int
    'placeholder': placeholder # tf.Tensor(dtype: tf.float32)
}

build_output

build_output(
    self,
    hidden,
    hidden_size,
    regularizer=None,
    **kwargs
)

Inputs

  • hidden (tf.Tensor(dtype: tf.float32)): output feature representation.
  • hidden_size (int): output feature representation size.
  • regularizer (A (Tensor -> Tensor or None) function): regularizer function passed to tf.get_variable method.

Return - train_mean_loss (tf.Tensor(dtype: tf.float32)): mean loss for train dataset. - eval_loss (tf.Tensor(dtype: tf.float32)): mean loss for evaluation dataset. - output_tensors (dict): dictionary containing feature specific output tensors (predictions, probabilities, losses, etc).

2. Add the new feature class to the corresponding feature registry

Input and output feature registries are defined in ludwig/features/feature_registries.py.

Style Guidelines

We expect contributions to mimic existing patterns in the codebase and demonstrate good practices: the code should be concise, readable, PEP8-compliant, and conforming to 80 character line length limit.

Tests

We are using pytest to run tests. Current test coverage is limited to several integration tests which ensure end-to-end functionality but we are planning to expand it.

Checklist

Before running tests, make sure 1. Your environment is properly setup. 2. You have write access on the machine. Some of the tests require saving data to disk.

Running tests

To run all tests, just run python -m pytest from the ludwig root directory. Note that you don't need to have ludwig module installed and in this case code change will take effect immediately.

To run a single test, run

python -m pytest path_to_filename::test_method_name

Example

python -m pytest tests/integration_tests/test_experiment.py::test_visual_question_answering