
API

LudwigModel class

ludwig.LudwigModel(
  model_definition,
  model_definition_file=None,
  logging_level=40
)

Class that allows access to high level Ludwig functionalities.

Inputs

  • model_definition (dict): a dictionary containing information needed to build a model. Refer to the [User Guide](http://ludwig.ai/user_guide/#model-definition) for details.
  • model_definition_file (string, optional, default: None): path to a YAML file containing the model definition. If available it will be used instead of the model_definition dict.
  • logging_level (int, default: logging.ERROR): logging level to use for logging. Use logging constants like logging.DEBUG, logging.INFO and logging.ERROR. By default only errors will be printed.

Example usage:

from ludwig import LudwigModel

Train a model:

model_definition = {...}
ludwig_model = LudwigModel(model_definition)
train_stats = ludwig_model.train(data_csv=csv_file_path)

or

train_stats = ludwig_model.train(data_df=dataframe)

If you have already trained a model, you can load it and use it to predict:

ludwig_model = LudwigModel.load(model_dir)

Predict:

predictions = ludwig_model.predict(data_csv=csv_file_path)

or

predictions = ludwig_model.predict(data_df=dataframe)

Finally, in order to release resources:

ludwig_model.close()

LudwigModel methods

close

close(
)

Closes an open LudwigModel (closing the session running it). It should be called once done with the model to release resources.


initialize_model

initialize_model(
  train_set_metadata=None,
  train_set_metadata_json=None,
  gpus=None,
  gpu_fraction=1,
  random_seed=42,
  logging_level=40,
  debug=False
)

This function initializes a model. It is needed for performing online learning, so it has to be called before train_online. train initializes the model under the hood, so there is no need to call this function if you don't use train_online.

Inputs

  • train_set_metadata (dict): it contains metadata information for the input and output features the model is going to be trained on. It's the same content of the metadata json file that is created while training.
  • train_set_metadata_json (string): path to the JSON metadata file created while training. It contains metadata information for the input and output features the model is going to be trained on
  • gpus (string, default: None): list of GPUs to use (it uses the same syntax as CUDA_VISIBLE_DEVICES)
  • gpu_fraction (float, default: 1.0): fraction of GPU memory to initialize the process with
  • random_seed (int, default: 42): a random seed that is going to be used anywhere there is a call to a random number generator: data splitting, parameter initialization and training set shuffling
  • logging_level (int, default: logging.ERROR): logging level to use for logging. Use logging constants like logging.DEBUG, logging.INFO and logging.ERROR. By default only errors will be printed.
  • debug (bool, default: False): enables debugging mode
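
Example usage

A minimal sketch of the online learning workflow, assuming a metadata JSON file produced by a previous training run (the file path and the dataframe name below are hypothetical):

from ludwig import LudwigModel

model_definition = {...}
ludwig_model = LudwigModel(model_definition)

# build the model from previously saved preprocessing metadata (hypothetical path)
ludwig_model.initialize_model(train_set_metadata_json='train_set_metadata.json')

# the model can now be updated incrementally with train_online
ludwig_model.train_online(data_df=new_batch_dataframe)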

load

load(
  model_dir,
  logging_level=40
)

This function allows for loading pretrained models.

Inputs

  • model_dir (string): path to the directory containing the model. If the model was trained by the train or experiment command, the model is in results_dir/experiment_dir/model.
  • logging_level (int, default: logging.ERROR): logging level to use for logging. Use logging constants like logging.DEBUG, logging.INFO and logging.ERROR. By default only errors will be printed.

Return

  • return (LudwigModel): a LudwigModel object

Example usage

ludwig_model = LudwigModel.load(model_dir)

predict

predict(
  data_df=None,
  data_csv=None,
  data_dict=None,
  return_type=<class 'pandas.core.frame.DataFrame'>,
  batch_size=128,
  gpus=None,
  gpu_fraction=1,
  logging_level=40
)

This function is used to predict the output variables given the input variables using the trained model.

Inputs

  • data_df (DataFrame): dataframe containing data. Only the input features defined in the model definition need to be present in the dataframe.
  • data_csv (string): input data CSV file. Only the input features defined in the model definition need to be present in the CSV.
  • data_dict (dict): input data dictionary. It is expected to contain one key for each field and the values have to be lists of the same length. Each index in the lists corresponds to one datapoint. Only the input features defined in the model definition need to be present in the dictionary. For example a data set consisting of two datapoints with an input text may be provided as the following dict: {'text_field_name': ['text of the first datapoint', 'text of the second datapoint']}.
  • return_type (string or type, default: DataFrame): string describing the type of the returned prediction object. 'dataframe', 'df' and DataFrame will return a pandas DataFrame, while 'dict', 'dictionary' and dict will return a dictionary.
  • batch_size (int, default: 128): batch size
  • gpus (string, default: None): list of GPUs to use (it uses the same syntax as CUDA_VISIBLE_DEVICES)
  • gpu_fraction (float, default: 1.0): fraction of GPU memory to initialize the process with
  • logging_level (int, default: logging.ERROR): logging level to use for logging. Use logging constants like logging.DEBUG, logging.INFO and logging.ERROR. By default only errors will be printed.

Return

  • return (DataFrame or dict): a dataframe containing the predictions for each output feature and their probabilities (for types that return them). For instance, in a 3-way multiclass classification problem with a category field named class as output feature with possible values one, two and three, the dataframe will have as many rows as input datapoints and five columns: class_predictions, class_UNK_probability, class_one_probability, class_two_probability, class_three_probability. (The UNK class is always present in categorical features.) If the return_type is a dictionary, the returned object will be a dictionary containing one entry for each output feature. Each entry is itself a dictionary containing aligned arrays of predictions and probabilities / scores.
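
Example usage

A hedged sketch of calling predict on an in-memory dictionary instead of a CSV; the field name and texts below are illustrative and must match an input feature defined in the model definition:

predictions = ludwig_model.predict(
  data_dict={'text_field_name': ['text of the first datapoint',
                                 'text of the second datapoint']},
  return_type=dict
)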

save

save(
  save_path
)

This function allows for saving models.

Inputs

  • save_path (string): path to the directory where the model is going to be saved. Both a JSON file containing the model architecture hyperparameters and checkpoints files containing model weights will be saved.

Example usage

ludwig_model.save(save_path)

test

test(
  data_df=None,
  data_csv=None,
  data_dict=None,
  return_type=<class 'pandas.core.frame.DataFrame'>,
  batch_size=128,
  gpus=None,
  gpu_fraction=1,
  logging_level=40
)

This function is used to predict the output variables given the input variables using the trained model and compute test statistics like performance measures, confusion matrices and the like.

Inputs

  • data_df (DataFrame): dataframe containing data. Both input and output features defined in the model definition need to be present in the dataframe.
  • data_csv (string): input data CSV file. Both input and output features defined in the model definition need to be present in the CSV.
  • data_dict (dict): input data dictionary. It is expected to contain one key for each field and the values have to be lists of the same length. Each index in the lists corresponds to one datapoint. Both input and output features defined in the model definition need to be present in the dictionary. For example a data set consisting of two datapoints with an input text may be provided as the following dict: {'text_field_name': ['text of the first datapoint', 'text of the second datapoint']}.
  • return_type (string or type, default: DataFrame): string describing the type of the returned prediction object. 'dataframe', 'df' and DataFrame will return a pandas DataFrame, while 'dict', 'dictionary' and dict will return a dictionary.
  • batch_size (int, default: 128): batch size
  • gpus (string, default: None): list of GPUs to use (it uses the same syntax as CUDA_VISIBLE_DEVICES)
  • gpu_fraction (float, default: 1.0): fraction of GPU memory to initialize the process with
  • logging_level (int, default: logging.ERROR): logging level to use for logging. Use logging constants like logging.DEBUG, logging.INFO and logging.ERROR. By default only errors will be printed.

Return

  • return (tuple(DataFrame or dict, dict)): a tuple of a dataframe and a dictionary. The dataframe contains the predictions for each output feature and their probabilities (for types that return them). For instance, in a 3-way multiclass classification problem with a category field named class as output feature with possible values one, two and three, the dataframe will have as many rows as input datapoints and five columns: class_predictions, class_UNK_probability, class_one_probability, class_two_probability, class_three_probability. (The UNK class is always present in categorical features.) If the return_type is a dictionary, the first object of the tuple will be a dictionary containing one entry for each output feature. Each entry is itself a dictionary containing aligned arrays of predictions and probabilities / scores. The second object of the tuple is a dictionary that contains the test statistics, with each key being the name of an output feature and the values being dictionaries containing measure names and their values.
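
Example usage

A sketch of a typical call, assuming csv_file_path points to a CSV that contains both the input and the output columns defined in the model definition:

predictions, test_stats = ludwig_model.test(data_csv=csv_file_path)

test_stats is keyed by output feature name; the measures available in each inner dictionary depend on the type of the output feature.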

train

train(
  data_df=None,
  data_train_df=None,
  data_validation_df=None,
  data_test_df=None,
  data_csv=None,
  data_train_csv=None,
  data_validation_csv=None,
  data_test_csv=None,
  data_hdf5=None,
  data_train_hdf5=None,
  data_validation_hdf5=None,
  data_test_hdf5=None,
  train_set_metadata_json=None,
  dataset_type='generic',
  model_name='run',
  model_load_path=None,
  model_resume_path=None,
  skip_save_model=False,
  skip_save_progress=False,
  skip_save_log=False,
  skip_save_processed_input=False,
  output_directory='results',
  gpus=None,
  gpu_fraction=1.0,
  random_seed=42,
  logging_level=40,
  debug=False
)

This function is used to perform a full training of the model on the specified dataset.

Inputs

  • data_df (DataFrame): dataframe containing data. If it has a split column, it will be used for splitting (0: train, 1: validation, 2: test), otherwise the dataset will be randomly split
  • data_train_df (DataFrame): dataframe containing training data
  • data_validation_df (DataFrame): dataframe containing validation data
  • data_test_df (DataFrame): dataframe containing test data
  • data_csv (string): input data CSV file. If it has a split column, it will be used for splitting (0: train, 1: validation, 2: test), otherwise the dataset will be randomly split
  • data_train_csv (string): input train data CSV file
  • data_validation_csv (string): input validation data CSV file
  • data_test_csv (string): input test data CSV file
  • data_hdf5 (string): input data HDF5 file. It is an intermediate preprocessed version of the input CSV, created the first time a CSV file is used, in the same directory, with the same name and an hdf5 extension
  • data_train_hdf5 (string): input train data HDF5 file. It is an intermediate preprocessed version of the input CSV, created the first time a CSV file is used, in the same directory, with the same name and an hdf5 extension
  • data_validation_hdf5 (string): input validation data HDF5 file. It is an intermediate preprocessed version of the input CSV, created the first time a CSV file is used, in the same directory, with the same name and an hdf5 extension
  • data_test_hdf5 (string): input test data HDF5 file. It is an intermediate preprocessed version of the input CSV, created the first time a CSV file is used, in the same directory, with the same name and an hdf5 extension
  • train_set_metadata_json (string): input metadata JSON file. It is an intermediate preprocessed file containing the mappings of the input CSV, created the first time a CSV file is used, in the same directory, with the same name and a json extension
  • dataset_type (string, default: 'generic'): determines the type of preprocessing that will be applied to the data. Only generic is available at the moment
  • model_name (string): a name for the model, used for the save directory
  • model_load_path (string): path of a pretrained model to load as initialization
  • model_resume_path (string): path of the model directory to resume training of
  • skip_save_model (bool, default: False): disables saving model weights and hyperparameters each time the model improves. By default Ludwig saves model weights after each epoch the validation measure improves, but if the model is really big that can be time consuming. If you do not want to keep the weights and just want to find out what performance a model can get with a set of hyperparameters, use this parameter to skip saving, but the model will not be loadable later on.
  • skip_save_progress (bool, default: False): disables saving progress each epoch. By default Ludwig saves weights and stats after each epoch to enable resuming of training, but if the model is really big that can be time consuming and will use twice as much space; use this parameter to skip it, but training cannot be resumed later on.
  • skip_save_log (bool, default: False): disables saving TensorBoard logs. By default Ludwig saves logs for the TensorBoard, but if it is not needed turning it off can slightly increase the overall speed.
  • skip_save_processed_input (bool, default: False): skips saving intermediate HDF5 and JSON files
  • output_directory (string, default: 'results'): directory that contains the results
  • gpus (string, default: None): list of GPUs to use (it uses the same syntax as CUDA_VISIBLE_DEVICES)
  • gpu_fraction (float, default: 1.0): fraction of GPU memory to initialize the process with
  • random_seed (int, default: 42): a random seed that is going to be used anywhere there is a call to a random number generator: data splitting, parameter initialization and training set shuffling
  • debug (bool, default: False): enables debugging mode
  • logging_level (int, default: logging.ERROR): logging level to use for logging. Use logging constants like logging.DEBUG, logging.INFO and logging.ERROR. By default only errors will be printed.

There are three ways to provide data: by dataframes using the _df parameters, by CSV using the _csv parameters, and by HDF5 and JSON using the _hdf5 and _json parameters. The DataFrame approach uses data previously obtained and put in a dataframe, the CSV approach loads data from a CSV file, while HDF5 and JSON load previously preprocessed HDF5 and JSON files (they are saved in the same directory as the CSV they are obtained from). For all three approaches, either a full dataset can be provided (which will be split randomly according to the split probabilities defined in the model definition, by default 70% training, 10% validation and 20% test) or, if it contains a split column, it will be split according to that column (interpreting 0 as training, 1 as validation and 2 as test). Alternatively, separate dataframes / CSV / HDF5 files can be provided for each split.

During training the model and statistics will be saved in a directory [output_dir]/[experiment_name]_[model_name]_n where all variables are resolved to user specified ones and n is an increasing number starting from 0 used to differentiate different runs.

Return

  • return (dict): a dictionary containing training statistics for each output feature containing loss and measures values for each epoch.
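
Example usage

A sketch of training from pre-split CSV files; the file paths are illustrative:

train_stats = ludwig_model.train(
  data_train_csv='train.csv',
  data_validation_csv='validation.csv',
  data_test_csv='test.csv',
  output_directory='results'
)

Alternatively, a single data_csv (or data_df) can be passed and Ludwig will split it as described above.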

train_online

train_online(
  data_df=None,
  data_csv=None,
  data_dict=None,
  batch_size=None,
  learning_rate=None,
  regularization_lambda=None,
  dropout_rate=None,
  bucketing_field=None,
  gpus=None,
  gpu_fraction=1,
  logging_level=40
)

This function is used to perform one epoch of training of the model on the specified dataset.

Inputs

  • data_df (DataFrame): dataframe containing data.
  • data_csv (string): input data CSV file.
  • data_dict (dict): input data dictionary. It is expected to contain one key for each field and the values have to be lists of the same length. Each index in the lists corresponds to one datapoint. For example a data set consisting of two datapoints with a text and a class may be provided as the following dict: {'text_field_name': ['text of the first datapoint', 'text of the second datapoint'], 'class_field_name': ['class_datapoints_1', 'class_datapoints_2']}.
  • batch_size (int): the batch size to use for training. By default it's the one specified in the model definition.
  • learning_rate (float): the learning rate to use for training. By default the value is the one specified in the model definition.
  • regularization_lambda (float): the regularization lambda parameter to use for training. By default the value is the one specified in the model definition.
  • dropout_rate (float): the dropout rate to use for training. By default the value is the one specified in the model definition.
  • bucketing_field (string): the bucketing field to use for bucketing the data. By default the value is the one specified in the model definition.
  • gpus (string, default: None): list of GPUs to use (it uses the same syntax as CUDA_VISIBLE_DEVICES)
  • gpu_fraction (float, default: 1.0): fraction of GPU memory to initialize the process with
  • logging_level (int, default: logging.ERROR): logging level to use for logging. Use logging constants like logging.DEBUG, logging.INFO and logging.ERROR. By default only errors will be printed.

There are three ways to provide data: by dataframes using the data_df parameter, by CSV using the data_csv parameter and by dictionary, using the data_dict parameter.

The DataFrame approach uses data previously obtained and put in a dataframe, the CSV approach loads data from a CSV file, while the dict approach uses data organized by keys representing columns and values that are lists of the datapoints for each. For example a data set consisting of two datapoints with a text and a class may be provided as the following dict: {'text_field_name': ['text of the first datapoint', 'text of the second datapoint'], 'class_field_name': ['class_datapoints_1', 'class_datapoints_2']}.
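
Example usage

A sketch of a single online training step on an in-memory batch; the field names and values are illustrative and must match the features defined in the model definition:

ludwig_model.train_online(
  data_dict={
    'text_field_name': ['text of the first datapoint',
                        'text of the second datapoint'],
    'class_field_name': ['class_datapoints_1', 'class_datapoints_2']
  },
  batch_size=2
)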