Search the Community

Showing results for tags 'python'.

Found 12 results

  1. The Lucd Python Client:

     • enables authentication to the Lucd backend and data access for custom analysis and model prototyping;
     • enables defining custom feature transformation operations and uploading them to Lucd for use in the GUI;
     • provides access to the Asset, Concept, Custom Contract, Custom Operation, Explore, Model, UDS, and VDS REST APIs;
     • provides multiple working examples of individual REST calls, as well as complete models;
     • enables Dask exploration of the data on the client side;
     • enables development and testing of models using the Lucd PyTorch, TensorFlow, Keras, and XGBoost classes.
  2. This section covers the high-level tasks needed to support model training: importing and preparing data, and analyzing performance. Most of the content here pertains to the full model approach, but some (e.g., reporting model status) is still helpful for PyTorch compact modeling.

     Importing and Preparing Data

     Data can be imported into a modeling context using the Lucd Unified Dataspace (UDS) API (eda.lib.lucd_uds). This library provides functions for creating datasets of various formats (TensorFlow, PyTorch, Dask dataframe) based on Lucd virtual datasets defined in the Unity client. It also provides the capability to retrieve previously trained word embeddings.

     The Lucd UDS functions providing data retrieval are listed below. Some are used for straightforward data importing (e.g., get_dataframe), while others prepare framework-specific datasets for AI models (e.g., get_tf_dataset for defining TensorFlow datasets). Refer to the API documentation for full function descriptions, and visit the Lucd Model Shop GitLab project for examples of how to use the functions when developing AI models.

     • get_asset
     • get_dataframe
     • get_tf_dataset
     • get_tf_dataset_image
     • get_tf_dataset_text
     • train_eval_test_split_dataframe
     • train_eval_test_split_pytorch
     • train_eval_test_split_tensorflow

     Important Notes for Implementing Multi-Class Modeling

     TensorFlow offers different approaches to building multi-class models; two prominent ones are pre-made Estimators (https://www.tensorflow.org/tutorials/estimator/premade#overview_of_programming_with_estimators) and general techniques such as converting Keras models to Estimators. If one-hot encoded data labels are needed (i.e., to match the number of nodes in a neural network output layer), the num_classes parameter should be used when calling the relevant data-retrieval functions (e.g., lucd_uds.get_tf_dataset), as sketched below. Note that most (if not all) TensorFlow pre-made Estimator models (e.g., tensorflow.estimator.DNNClassifier) do not require explicitly one-hot encoded data labels for non-binary modeling, and hence the num_classes argument can be omitted. In the case of TensorFlow Estimators, developers are encouraged to understand how to shape input for the models; the same goes for modeling with PyTorch or XGBoost.
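     The following minimal sketch shows the num_classes usage described above. It is illustrative only: the surrounding variables (feature_dict, delayed_values_training, num_features, target_type) are assumed to come from the train/eval/test split functions shown later in this section, and the exact get_tf_dataset signature should be checked against the API documentation.

         from eda.lib import lucd_uds

         # Request one-hot encoded labels for a 3-class problem (e.g., to match a
         # 3-node output layer); omit num_classes for pre-made Estimators such as
         # tf.estimator.DNNClassifier, which accept integer labels directly.
         dataset = lucd_uds.get_tf_dataset(feature_dict,
                                           delayed_values_training,
                                           num_features,
                                           target_type,
                                           num_classes=3).batch(32)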
     Analyzing Model Performance

     Post-training performance analysis tasks are supported by the Lucd Machine Learning (ML) API (eda.lib.lucd_ml). This library provides functions that automatically execute and report critical performance analysis tasks (e.g., creating confusion matrices, ROC curves), removing the need to repeatedly write such code. The tables and plots created by these library functions can be viewed in the Unity client after the entire model training process has completed.

     The Lucd ML functions for performance analysis are listed below. Refer to the API documentation for full function descriptions.

     • get_predictions_classification_pt
     • get_predictions_classification_tf
     • get_predictions_regression_pt
     • get_predictions_regression_tf
     • lucd_precision_recall_curve
     • lucd_roc_curve
     • lucd_confusion_matrix
     • update_plots

     Submitting Performance Analysis Results

     Trained models and metadata can be uploaded to the Lucd backend via the eda.int.train.update function. The following example code illustrates how to use it.

         model_filename = lucd_uds.zip_model_tf(trained_classifier, serving_input_receiver_fn,
                                                model_id, graph_version, log_dir)

         # Store model graph and performance stats back to the Lucd back-end
         with open(model_filename, "rb") as graph_file:
             train.update({tid: {
                 'performance': {
                     'loss': loss,
                     'accuracy': accuracy,
                     'macro_precision': precision_macro,
                     'macro_recall': recall_macro,
                     'macro_f1': f1_macro,
                     'micro_precision': precision_micro,
                     'micro_recall': recall_micro,
                     'micro_f1': f1_micro,
                     'weighted_precision': precision_weighted,
                     'weighted_recall': recall_weighted,
                     'weighted_f1': f1_weighted,
                     'precision_recall_f1_per_label': results_string,
                 },
                 'graph_version': graph_version,
                 'graph_file': graph_file.read()
             }})

     train.update takes a Python dictionary as its argument, with the train_id (described in Table 1) as the top-level key; tid holds the train_id in the code snippet above. The secondary keys graph_version and graph_file store the graph version and trained graph file (model), respectively. The secondary key performance stores another dictionary of performance values. There is no restriction on the key-value pairs here: the developer may choose the performance values, and they will be viewable in the Unity client afterward. The values shown in the snippet above are customary for evaluating classification models. Again, see the example models in The Lucd Model Shop for more insights.

     precision_recall_f1_per_label must be formatted as a semicolon-separated string of tuples, each formatted as <label>-<precision>,<recall>,<f1>. For example:

         precision_recall_f1_per_label: setosa-1.0,0.5,1.0;virginica-1.0,1.0,0.6;versicolor-1.0,0.6666666666666666,0.7499999999999999

     An example of how to compute per-label performance statistics can be found at https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html; a sketch of building the string from that function follows.
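     The helper below is a hypothetical convenience (per_label_string and int_to_name are not part of the Lucd API) showing one way to produce the required string with scikit-learn; truths and predictions are assumed to be lists of integer labels.

         from sklearn.metrics import precision_recall_fscore_support

         def per_label_string(truths, predictions, int_to_name):
             """Build the semicolon-separated per-label string described above."""
             labels = sorted(int_to_name)
             # average=None yields one precision/recall/f1 value per label
             precision, recall, f1, _ = precision_recall_fscore_support(
                 truths, predictions, labels=labels, average=None)
             return ";".join("{}-{},{},{}".format(int_to_name[lab], p, r, f)
                             for lab, p, r, f in zip(labels, precision, recall, f1))

         results_string = per_label_string(truths, predictions,
                                           {0: 'setosa', 1: 'virginica', 2: 'versicolor'})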
     Enabling Model Explainability

     To enable a trained model to be used by the explainability tool in the Unity client, some parameters must be defined. For TensorFlow models, ordered_feature_names, ordered_class_names, input_name, and output_name must be defined.

     • ordered_feature_names (not to be confused with training data input column names) is a list of ordered names of the inputs to the trained model, commonly defined in TensorFlow models as tf.feature_column entries. For example, for a TensorFlow text classification model, the named input might be embedding_input. See the example code in The Lucd Model Shop.
     • ordered_class_names is a list in which string class names are ordered by their integer representations (the order of outputs from your model). For instance, for a binary classification model whose labels are 0 and 1, the order of strings must be negative, then positive (or whatever string labels you choose).
     • input_name is the name of the input layer in your TensorFlow model to which your ordered_feature_names data will be passed.
     • output_name is the name of the output layer in your TensorFlow model (by default these can be named things like 'dense_2' and 'scores'). It is used to retrieve your model outputs in the proper format for explanation.

     PyTorch models only require that ordered_class_names be provided.

     Plots

     Lucd allows users to plot model training metrics in the Unity client. The plots update in real time during model training, providing insight into the viability, problems, and successes of training runs.

     Users can store any information they want plotted as part of a training run using the eda.lib.lucd_ml.update_plots function. The provided dictionary must satisfy the following assumptions:

     1. Top-level keys represent individual plots.
     2. Each inner dictionary must specify a labels and a description key, where labels is ["plot x_label", "plot y_label"] and the description can be any string.
     3. All remaining keys in the inner dictionary are treated as individual lines on the plot, so in the following example "l1" is a line on the plot named "accuracy".

         {
             "accuracy": {
                 "l1": [[1, 0.10], [2, 0.15]],
                 "labels": ["epoch", "accuracy"],
                 "description": "This is a simple accuracy plot example."
             }
         }

     The individual elements of the "l1" line above represent [x_val, y_val] pairs; a common pattern, as shown above, is "l1": [[epoch, accuracy], [epoch, accuracy], ...].

     Once created, the dictionary may be stored for plotting with the lucd_ml.update_plots function:

         def update_plots(train_or_eval: bool, tid: str, dictionary: dict) -> str:
             ...

         dictionary = {
             "accuracy": {
                 "l1": [[1, 0.10], [2, 0.15]],
                 "labels": ["epoch", "accuracy"],
                 "description": "This is a simple accuracy plot example."
             }
         }

         update_plots(True, train_id, dictionary)

     Regarding the update_plots function, train_or_eval specifies whether the plot is part of the training or the evaluation (or validation) cycle of model training: train=True, eval=False.

     A TensorFlow hook is provided in lucd_ml for automatically parsing generated events files (the same files used by TensorBoard) and passing them to update_plots as part of a TensorFlow model. It can be provided as part of a TensorFlow EvalSpec or TrainSpec object as follows (stub included for reference):

         class LucdTFEstimatorHook(tf.estimator.SessionRunHook):
             def __init__(self, train_hook: bool, log_dir: str, tid: str, freq: int, last_epoch: int):
                 ...

         train_spec = tf.estimator.TrainSpec(
             input_fn=lambda: lucd_uds.get_tf_dataset_image(
                 type_dict, training_data, num_features, target_type,
                 num_classes).repeat(count=None).shuffle(30).batch(30),
             max_steps=training_steps,
             hooks=[lucd_ml.LucdTFEstimatorHook(train_hook=True, log_dir=log_dir, tid=tid,
                                                freq=10, last_epoch=training_steps)])

     train_hook specifies whether the hook provides train or eval metrics (train=True, eval=False). log_dir tells the hook where to find TensorFlow events files. freq is how often the hook should look for metrics in the events files. last_epoch tells the hook the number of epochs being run so that it can ignore freq for the last epoch.
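     For PyTorch or other training loops where the TensorFlow hook does not apply, plots can be refreshed manually. The sketch below assumes that calling update_plots repeatedly with the accumulated points for the same tid refreshes the plot; train_one_epoch is a hypothetical helper returning the epoch's accuracy.

         from eda.lib import lucd_ml

         points = []  # accumulated [epoch, accuracy] pairs
         for epoch in range(1, num_epochs + 1):
             acc = train_one_epoch(model, loader)  # hypothetical training step
             points.append([epoch, acc])
             lucd_ml.update_plots(True, train_id, {
                 "accuracy": {
                     "l1": list(points),
                     "labels": ["epoch", "accuracy"],
                     "description": "Training accuracy per epoch."
                 }
             })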
     Two final helper functions are provided as part of the Lucd plotting framework: lucd_roc_curve and lucd_precision_recall_curve. These functions generate ROC curves and precision-recall curves, respectively, and are called selectively when using the compact modeling approach (enabled by the "performance_curves" button in the Unity client). Further documentation for these functions is provided in the API.

         def lucd_roc_curve(truths: list, scores: list, class_list: list,
                            tid: str = None, write_accumulo: bool = False) -> (dict, dict, dict, dict):
             ...

         def lucd_precision_recall_curve(truths: list, scores: list, class_list: list,
                                         tid: str = None, write_accumulo: bool = False) -> (dict, dict, dict, dict):
             ...

     Confusion Matrix

     Lucd provides an in-depth, interactive confusion matrix for classification model evaluation. Users may select a square in the Unity client to see the actual records associated with that square. This is enabled by the following function:

         def lucd_confusion_matrix(test_set: list or DataLoader, predictions: list, num_classes: int,
                                   label_mapping: type(abs), tid: str, write_accumulo: bool) -> (dict, str):
             ...

     The function arguments are described below.

     • test_set: The PyTorch DataLoader or list of delayed Dask dataframes returned from the respective train_eval_test_split_pytorch/tensorflow function.
     • predictions: A list of predictions generated by your model (the list returned from the lucd_ml get_predictions_classification functions). The list must be in the same order as the test_set data.
     • num_classes: The integer number of classes for the confusion matrix to represent.
     • label_mapping: A function mapping integers to class labels, used to present predictions in a human-readable format.
     • tid: The training ID to associate the confusion matrix with.
     • write_accumulo: Boolean specifying whether to write the dictionary directly to Accumulo. Must be True to enable the Unity client confusion matrix; if False, the generated confusion matrix is only returned, not written to the database.

     Further documentation for this function exists in the API documentation. Here is an example usage:

         def _label_mapping():
             return {0: 'I. versicolor', 1: 'I. virginica', 2: 'I. setosa'}

         ...

         # Prepare VDS data for modeling
         delayed_values_training, delayed_values_evaluation, delayed_values_testing, \
             my_df_testing_label, num_features = \
             lucd_uds.train_eval_test_split_tensorflow(virtual_dataset_id,
                                                       evaluation_dataset_percent,
                                                       testing_dataset_percent)

         ...

         predictions, scores = lucd_ml.get_predictions_classification_tf(
             _estimator,
             lambda: lucd_uds.get_tf_dataset(feature_dict, delayed_values_testing,
                                             num_features, target_type).batch(1),
             classification_mode, .5)

         ...

         lucd_ml.lucd_confusion_matrix(delayed_values_testing, predictions, 3,
                                       _label_mapping, tid, True)

     Submitting Model Training Status

     Another helpful function is eda.int.train.status, which stores the status of a developer's training pipeline so that a model's status can be displayed in the Unity client. The function definition is below.

         def status(uid, code, message=None):
             """Update model status in the database.

             Args:
                 uid: Int representing a model's ID.
                 code: 0 - RETRIEVING VDS, 1 - TRAINING, 2 - ANALYZING PERFORMANCE,
                       3 - STORING MODEL, 4 - TRAINING COMPLETE, 5 - ERROR,
                       6 - QUEUED, 7 - STOPPED.
                 message: String representing optional custom message to include.

             Returns:
                 Status message.

             Raises:
                 TypeError: If code is not of type int.
                 Exception: If code is invalid.
             """
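     A minimal sketch of reporting status around a full-model pipeline follows; the status codes come from the definition above, while run_training, analyze_performance, and store_results are placeholders for your own pipeline logic.

         from eda.int import train

         train.status(tid, 0)                      # RETRIEVING VDS
         try:
             train.status(tid, 1)                  # TRAINING
             trained_model = run_training()        # placeholder: your training logic
             train.status(tid, 2)                  # ANALYZING PERFORMANCE
             analyze_performance(trained_model)    # placeholder: lucd_ml analysis calls
             train.status(tid, 3)                  # STORING MODEL
             store_results(trained_model)          # placeholder: train.update(...)
             train.status(tid, 4)                  # TRAINING COMPLETE
         except Exception as e:
             train.status(tid, 5, message=str(e))  # ERROR, with a custom message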
  3. The Lucd Modeling Framework (LMF) enables developers to build and interface custom AI models with the Lucd JedAI platform for streamlined management, experimentation, and training using data and parameters established in the JedAI Unity client (or simply, Unity client). The framework supports Python-based AI models built with TensorFlow, PyTorch, and Dask XGBoost. LMF's Python libraries support the following tasks:

     • accessing Lucd virtual datasets (VDSes) for model training and evaluation,
     • analyzing and reporting model performance metrics (e.g., with confusion matrices, ROC curves),
     • storing structures representing trained models and training checkpoints.

     Model Development Approaches

     LMF provides flexibility in the level of effort and control needed for preparing models for Lucd. The two approaches are the full and compact model approaches; their differences are illustrated in Figure 1.

     Figure 1. Conceptual illustration of full and compact model approaches.

     Full Model Approach

     In the full model approach, a developer creates an AI model and manually uses LMF Python libraries to complete the model training workflow (e.g., train, validate, test on holdout data, store results). This enables complete flexibility for more advanced use cases, such as designing complex or experimental training loops, advanced performance analysis, and custom model compression. Full models are implemented as normal Python scripts. Further details are in the Developing Full Models section of this documentation.

     Compact Model Approach

     The compact model approach enables a developer to focus most if not all effort on defining an AI model, leaving other workflow tasks like holdout data testing and storage of performance results for the LMF to execute automatically behind the scenes. In the case of TensorFlow, the developer does not even need to write training logic. The major benefits of the compact model approach are (1) significantly less coding effort and (2) fewer errors and inconsistencies from writing boilerplate performance-testing logic. These benefits are especially useful for formatting models for multi-run experiments such as k-fold cross-validation and learning curves (which will be introduced in an upcoming LMF release). Further details about compact modeling are in Developing Compact Models.

     Notable Framework Capabilities

     The LMF consists of an evolving set of capabilities. The following subsections describe notable modeling capabilities supported as of release 6.3.0 RC1.

     TensorFlow Estimator-Based Modeling

     TensorFlow supports AI modeling using either low-level APIs or easier-to-use high-level Estimator APIs. The LMF is designed to support Estimator-based model development. Keras may be used to create models, especially for enabling more customization; however, such models must be converted to Estimators for the LMF and the broader Lucd JedAI platform to manage them appropriately (see the sketch at the end of this item). See the following link for an introduction to TensorFlow Estimators: https://www.tensorflow.org/guide/estimator.

     Various Feature Types

     For TensorFlow modeling, all dataset feature column types are supported (see https://www.tensorflow.org/guide/feature_columns), enabling support for a broad range of numeric and categorical features. Note that the domain of a categorical feature must be known at training time. For example, if you choose to use a feature car_make as a categorical feature, you must know all the possible makes when you write your model.
     This requirement will be removed in a future release. Also, conversion of non-numerical data to numerical data (e.g., for encoding label/target values) based on a scan of the entire dataset is not supported in the current release; to help with this, data value replacement operations are supported in the Unity client. For TensorFlow modeling, label types are assumed to be TensorFlow int32.

     Embedding Datasets

     For TensorFlow and PyTorch modeling, LMF supports the use of embedding data, e.g., word2vec for representing free text. For PyTorch, the TorchText library is supported, but n-grams are not supported in the current release. Important note: currently, when using text input, only the text/embedding input is allowed as a feature, enabling conventional text classification. Future releases will enable the use of multiple feature inputs alongside text data.

     Image Datasets

     For TensorFlow and PyTorch modeling, the use of image data (i.e., pixel values) as model input is supported.

     Distributed XGBoost Using Dask

     Distributed training of XGBoost models using the Dask parallel data analytics framework is supported. See the following link for more information: https://ml.dask.org/xgboost.html. Support for TensorFlow and PyTorch distributed training is under development.

     Supported Libraries

     The Lucd Modeling Framework supports the following languages and machine learning-related libraries:

     • Python v3.6.5
     • TensorFlow (for Python) v2.1
     • PyTorch v1.5.1
     • TorchText
     • Dask
     • Dask-ML
     • Dask-XGBoost
     • NumPy v1.16.4
     • Scikit-learn v0.19.2
     • Pandas v0.25.1

     Since the Dask and Dask-ML libraries are included in the modeling framework, Dask-based distributed Scikit-learn modules should also be supported. However, such models have not been tested, and Dask-based Scikit-learn operation may therefore be unpredictable.

     Example Code

     While this documentation introduces the core components and best practices for developing AI models for the Lucd JedAI platform, there is rarely a replacement for sample code. The Lucd Model Shop provides a wide range of code (prepared by Lucd engineers) to help developers get started with preparing AI models. In the future, the Lucd Model Shop will also allow the larger Lucd developer community to share their code, further helping others with their AI goals.

     Python API Documentation

     The LMF Python API documentation can be found at the following Lucd GitLab Pages site: https://lucd.pages.lucd.ai/mock-product-development/lucd-eda-rest/.

     Preparing Models Using the Lucd Modeling Framework

     The following documentation contains further details and examples for developing AI models for Lucd.

     • Developing Compact Models
     • Developing Full Models
     • Working with Data and Performance Analysis
     • The Lucd Model Shop
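     As noted under TensorFlow Estimator-Based Modeling above, Keras models must be converted to Estimators. A minimal sketch using TensorFlow's standard conversion utility follows; the model architecture here is illustrative only, and log_dir is assumed to be the checkpoint path provided by the LMF.

         import tensorflow as tf

         # Illustrative Keras model; it must be compiled before conversion.
         keras_model = tf.keras.Sequential([
             tf.keras.layers.Dense(16, activation='relu', input_shape=(4,)),
             tf.keras.layers.Dense(3, activation='softmax')
         ])
         keras_model.compile(optimizer='adam',
                             loss='sparse_categorical_crossentropy',
                             metrics=['accuracy'])

         # Convert to an Estimator so the LMF and JedAI platform can manage it;
         # model_dir would typically be the log_dir supplied to the model function.
         estimator = tf.keras.estimator.model_to_estimator(keras_model=keras_model,
                                                           model_dir=log_dir)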
  4. In the compact modeling approach, the Lucd Modeling Framework (LMF) provides a lightweight way to train TensorFlow and PyTorch models in Lucd. Boilerplate tasks (e.g., creating confusion matrices, ROC curves) are automatically executed by the LMF, enabling a developer to write and maintain less code. This also supports modularity, so developers can effortlessly adopt new data loading, training, and performance measurement logic as it becomes available in the LMF. Compact modeling differs according to the framework being used for model development, i.e., TensorFlow or PyTorch. The following sections describe the separate approaches in more detail. Examples illustrating how to use the compact modeling approach are in The Lucd Model Shop.

     TensorFlow

     For TensorFlow-based modeling, compact modeling requires the developer to implement only two functions: model and label_mapping.

     Model Function

     The model function is intended to contain code for building TensorFlow Estimator models. There are plenty of examples on the web demonstrating how to build and configure Estimator models; the following link is a good starting point: https://www.tensorflow.org/guide/estimator. The LMF sends essential data (e.g., training data, training steps, pre-defined word embeddings) to this function to configure an Estimator model for training. In turn, the function must return the elements needed to configure the LMF for training, performance analysis, etc. Below is a formal description of the model function.

         def model(training_data, validation_data, num_features, training_steps, learning_rate,
                   regularization_value, log_dir, training_param_map, embedding_matrix,
                   embedding_size, word_index_mapping, max_document_length, pad_value, train_id):
             """Function used by LMF for training and analyzing TensorFlow Estimator models.

             Args:
                 training_data (list): List of delayed "chunks" of Dask dataframe representing training data.
                 validation_data (list): List of delayed "chunks" of Dask dataframe representing validation data.
                 num_features (tuple): The shape of the features input for a model.
                 training_steps (int): Number of steps for model training.
                 learning_rate (float): Model's learning rate.
                 regularization_value (float): Model's regularization value.
                 log_dir (string): Path designating where checkpoints will be written (needed for training).
                 training_param_map (dict): Dictionary containing miscellaneous parameters.
                 embedding_matrix (numpy array): 2D numpy array where each row represents an embedding for an indexed word (for text classification models).
                 embedding_size (int): Size/length of embeddings in embedding_matrix, or length of embeddings to be learned (for text classification models).
                 word_index_mapping (dict): Dict mapping string words to their int index representations (for text classification models).
                 max_document_length (int): Int defining the maximum number of tokens to be used for free text input into the model for training (for text classification models).
                 pad_value (int): Int defining index value used for padding documents for training, validation, and testing (for text classification models).
                 train_id (str): Unique identifier of the underlying training in the database.
             Returns:
                 TensorFlow Estimator object (for training).
                 TensorFlow Estimator TrainSpec object (for running training).
                 TensorFlow Estimator EvalSpec object (for running validation).
                 Dict mapping feature names to feature types (for loading data into the model).
                 Type of target/label in training/validation/testing data (for loading data into the model).
                 TensorFlow serving_input_receiver_fn suitable for use in serving (for model serving/prediction).
                 List of feature names (same order as in training data, for GUI display purposes).
                 List of class names (same order as their numerical representation in training data, for confusion matrix and GUI display purposes).
                 Number of classes to use in lucd_uds.get_tf_dataset_* and lucd_ml.get_predictions_* functions for TensorFlow.
                 String "input_name" representing the name of the model input layer for use with the TF signature def when generating predictions.
             """

     Note that the inputs to the model function are defined in the Unity client, and hence should not be altered.

     Label_mapping Function

     Return values from the label_mapping function are used by the LMF to compute the confusion matrix precision and recall statistics. For proper construction of the confusion matrix, a dict mapping the training data's label values (integers) to expressive strings should be returned (see the sketch at the end of this item).

     PyTorch

     For PyTorch-based modeling, the developer is required to implement the same functions as for TensorFlow: model and label_mapping. The use of the label_mapping function for PyTorch is exactly the same as for TensorFlow; hence, only details of the model function are described for PyTorch.

     Model Function

     As opposed to TensorFlow-based modeling, for which the model function implements a developer's AI model, model for PyTorch is primarily used for executing model training and validation logic. This is mainly because with PyTorch, model training logic is designed to be much more under a developer's control. As seen in the example code, and in keeping with traditional PyTorch style, the actual AI model can be defined as a separate class inside the Python file. Details of the model function for PyTorch are below.

         def model(training_data, validation_data, num_features, training_steps, learning_rate,
                   regularization_value, log_dir, training_param_map, embedding_matrix,
                   embedding_size, word_index_mapping, max_document_length, pad_value, train_id):
             """Function used by LMF for training and analyzing PyTorch models.

             Args:
                 training_data (torch.utils.data.Dataset): PyTorch dataset representing training data.
                 validation_data (torch.utils.data.Dataset): PyTorch dataset representing validation data.
                 num_features (tuple): The shape of the features input for a model (no use for PyTorch identified).
                 training_steps (int): Number of steps for model training.
                 learning_rate (float): Model's learning rate.
                 regularization_value (float): Model's regularization value.
                 log_dir (string): Path designating where checkpoints will be written (needed for training).
                 training_param_map (dict): Dictionary containing miscellaneous parameters.
                 embedding_matrix (numpy array): 2D numpy array where each row represents an embedding for an indexed word (for text classification models).
                 embedding_size (int): Size/length of embeddings in embedding_matrix, or length of embeddings to be learned (for text classification models).
                 word_index_mapping (dict): Dict mapping string words to their int index representations (for text classification models).
                 max_document_length (int): Int defining the maximum number of tokens to be used for free text input into the model for training (for text classification models).
                 pad_value (int): Int defining index value used for padding documents for training, validation, and testing (for text classification models).
                 train_id (str): Unique identifier of the underlying training in the database.

             Returns:
                 Trained PyTorch model.
                 List of floats representing final model performance statistics values.
                 List of class names (same order as their numerical representation in training data, for confusion matrix and GUI display purposes).
             """
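     A minimal sketch of the label_mapping function referenced above, assuming a three-class Iris-style dataset (the class names here are illustrative):

         def label_mapping():
             # Map the integer label values used in the training data to
             # human-readable class names for the confusion matrix and GUI.
             return {0: 'setosa', 1: 'virginica', 2: 'versicolor'}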
  5. In the compact modeling approach, the Lucd Modeling Framework (LMF) provides a lightweight approach to training TensorFlow and PyTorch models in Lucd. Boilerplate tasks (e.g., creating confusion matrices, ROC curves, etc.) are automatically executed by LMF, enabling a developer to write and maintain less code. Furthermore, this supports modularity so that developers can effortlessly use new data loading, training, and performance measurement logic as they become available in LMF. Compact modeling differs according the framework being used for model development, i.e., TensorFlow and PyTorch. The following sections describe the separate approaches in more detail. Examples illustrating how to use the compact modeling approach are in<>The Lucd Model Shop. TensorFlow For TensorFlow-based modeling, compact modeling requires the developer to implement only two functions:<>model<>and<>label_mapping. Model Function The<>model<>function is intended to contain code for building TensorFlow Estimator models. There are plenty of examples on the web demonstrating how to build and configure Estimator models; the following link is a good starting point,<>https://www.tensorflow.org/guide/estimator. The LMF sends essential data (e.g., training data, training steps, pre-defined word embeddings) to this function to configure an Estimator model for training. In turn, the function must return some essential elements to essentially configure the LMF for training, performance analysis, etc. Below is a formal description of the<>model<>function. def model(training_data, validation_data, num_features, training_steps, learning_rate, regularization_value, log_dir, training_param_map, embedding_matrix, embedding_size, word_index_mapping, max_document_length, pad_value, train_id): """Function used by LMF for training and analyzing TensorFlow Estimator models. Args: training_data (list): List of delayed "chunks" of Dask dataframe representing training data. validation_data (list): List of delayed "chunks" of Dask dataframe representing validation data. num_features (tuple): The shape of the features input for a model. training_steps (int): Number of steps for model training. learning_rate (float): Model's learning rate. regularization_value (float): Model's regularization value. log_dir (string): Path designating where checkpoints will be written (needed for training). training_param_map (dict): Dictionary containing miscellaneous parameters. embedding_matrix (numpy array): 2D numpy array where each row represents an embedding for an indexed word (for text classification models). embedding_size (int): Size/length of embeddings in embedding_matrix, or length of embeddings to be learned (for text classification models). word_index_mapping (dict): Dict mapping string words to their int index representations (for text classification models). max_document_length (int): Int defining the maximum number of tokens to be used for free text input into the model for training (for text classification models). pad_value (int): Int defining index value used for padding documents for training, validation, and testing (for text classification models). train_id (str): Unique identifier of the underlying training in the database. 
Returns: TensorFlow Estimator object (for training), TensorFlow Estimator TrainSpec object (for running training), TensorFlow Estimator EvalSpec object (for running validation), Dict mapping feature names to feature types (for loading data into the model), Type of target/label in training/validation/testing data (for loading data into the model), TensorFlow serving_input_receiver_fn suitable for use in serving (for model serving/prediction), List of feature names (same order as in training data, GUI display purposes), List of class names (same order as their numerical representation in training data, for confusion matrix and GUI display purposes), Number of classes to use in lucd_uds.get_tf_dataset_* and lucd_ml.get_predictions_* functions for TensorFlow. String "input_name" representing the name of the model input layer for use with TF signature def when generating predictions. """ Note that the inputs to the<>model<>function are defined in the Unity client, and hence should not be altered. Label_mapping Function Return values from the<>label_mapping<>function are used by the LMF to compute the confusion matrix precision and recall statistics. For proper construction of the confusion matrix, a dict mapping training data’s label values (integers) to expressive strings should be returned. PyTorch For PyTorch-based modeling, the developer is required to implement the same functions as for TensorFlow:<>model<>and<>lable_mapping. The use of the<>label_mapping<>function for PyTorch is exactly the same as for TensorFlow. Hence, only details for the<>model<>function will be described for PyTorch. Model Function As opposed to TensorFlow-based modeling, for which the<>model<>function implements a developer’s AI model,<>model<>for PyTorch is primarily used for executing model training and validation logic. This is mainly because with PyTorch, model training logic is designed to be much more under a developer’s control. As one can see in the<>example code, and similar to traditional PyTorch style, the actual AI model can be defined as a separate class inside python file. Details of the<>model<>function for PyTorch are below. def model(training_data, validation_data, num_features, training_steps, learning_rate, regularization_value, log_dir, training_param_map, embedding_matrix, embedding_size, word_index_mapping, max_document_length, pad_value, train_id): """Function used by LMF for training and analyzing TensorFlow Estimator models. Args: training_data (torch.utils.data.Dataset): PyTorch dataset representing training data. validation_data (torch.utils.data.Dataset): PyTorch dataset representing validation data. num_features (tuple): The shape of the features input for a model (no use for PyTorch identified). training_steps (int): Number of steps for model training. learning_rate (float): Model's learning rate. regularization_value (float): Model's regularization value. log_dir (string): Path designating where checkpoints will be written (needed for training). training_param_map (dict): Dictionary containing miscellaneous parameters. embedding_matrix (numpy array): 2D numpy array where each row represents an embedding for an indexed word (for text classification models). embedding_size (int): Size/length of embeddings in embedding_matrix, or length of embeddings to be learned (for text classification models). word_index_mapping (dict): Dict mapping string words to their int index representations (for text classification models). 
max_document_length (int): Int defining the maximum number of tokens to be used for free text input into the model for training (for text classification models). pad_value (int): Int defining index value used for padding documents for training, validation, and testing (for text classification models). train_id (str): Unique identifier of the underlying training in the database. Returns: Trained PyTorch model, List floats representing final model performance statistics values, List of class names (same order as their numerical representation in training data, for confusion matrix and GUI display purposes). """ View full record
  6. The Lucd Modeling Framework (LMF) enables developers to build and interface custom AI models with the Lucd JedAI platform for streamlined management, experimentation, and training using data and parameters established in the JudAI Unity client (or simply, Unity client). The framework supports python-based AI models built with TensorFlow, PyTorch, and Dask XGBoost. LMF’s python libraries support the following tasks: - accessing Lucd virtual datasets (VDSes) for model training and evaluation, - analyzing and reporting model performance metrics (e.g., with confusion matrices, ROC curves), - storing structures representing trained models and training checkpoints. Model Development Approaches LMF provides flexibility in the level of effort and control needed for preparing models for Lucd. The two approaches include the full and compact model approaches; their differences are illustrated in Figure 1. Figure 1. Conceptual illustration of full and compact model approaches. Full Model Approach In the full model approach, a developer creates some AI model and manually uses LMF python libraries to complete the model training workflow (e.g., train, validate, holdout data testing, store results). This enables complete flexibility for more advanced use cases which might include designing complex or experimental training loops, advanced performance analysis, custom model compression, etc. Full models are implemented using normal python scripts. Further details are in the Developing Full Models section of this documentation. Compact Model Approach The compact model approach enables a developer to focus most if not all effort on defining an AI model, leaving other workflow tasks like holdout data testing and storage of performance results for the LMF to do automatically behind the scenes. In the case of TensorFlow, the developer does not even need to write training logic. The major benefits of the compact model approach include (1) significantly less coding effort and (2) potential reduction of errors and/or inconsistencies in writing boilerplate performance-testing logic. These benefits are especially useful for formatting models for multi-run experiments such as k-fold cross validation and learning curves (which will be introduced in an upcoming LMF release). Further details about compact modeling are in Developing Compact Models. Notable Framework Capabilities The LMF consists of an evolving set of capabilities. The following subsections describe notable modeling capabilities supported as of release 6.3.0 RC1. TensorFlow Estimator-Based Modeling TensorFlow supports AI modeling using either low-level APIs or easier-to-use high-level Estimator APIs. The LMF is designed to support Estimator-based model development. Keras may be used to create models, especially for enabling more customization. However, such models must be converted to Estimators for LMF and the broader Lucd JedAI platform to manage them appropriately. See for following link for an introduction to TensorFlow Estimators, https://www.tensorflow.org/guide/estimator. Various Feature Types For TensorFlow modeling, all dataset feature column types are supported (see https://www.tensorflow.org/guide/feature_columns), enabling support for a broad range of numeric and categorical features. Regarding categorical features, the domain of such a feature must be known at training time. For example, if you choose to use a feature car_make as a categorical feature, you must know all the possible makes when you write your model. 
This requirement will be removed in a future release. Also, the conversion of non-numerical data to numerical data (e.g., for encoding label/target values) based on a scan of the entire dataset is not supported in the current release. However, to help with this, data value replacement operations are supported in the Unity client. For TensorFlow modeling, label types are assumed to be TensorFlow int32. Embedding Datasets For TensorFlow and PyTorch modeling, LMF supports the use of embedding data, e.g., word2vec for representing free text. For PyTorch, the TorchText library is supported, but n-grams are not supported in the current release. Important Note: Currently, when using text input, only the text/embedding input is allowed as a feature, enabling conventional text classification. Future releases will enable the use of multiple feature inputs alongside text data. Image Datasets For TensorFlow and PyTorch modeling, use of image data (i.e., pixel values) as model input is supported. Distributed XGBoost using Dask Distributed training of XGBoost models using the Dask parallel data analytics framework is supported. See the following link for more information, https://ml.dask.org/xgboost.html. Support for TensorFlow and PyTorch distributed training is under development. Supported Libraries The Lucd modeling framework supports the following languages and machine learning -related libraries: - Python v3.6.5 - TensorFlow (for Python) v2.1 - PyTorch v1.5.1 - TorchText - Dask - Dask-ML - Dask-XGBoost - Numpy v1.16.4 - Scikit-learn v0.19.2 - Pandas v0.25.1 Since Dask and Dask-ML libraries are included in the modeling framework, Dask-based distributed Scikit-learn modules should also be supported. However, the usage of such models has not been tested, and hence, Dask-based Scikit-learn operation may be unpredictable. Example Code While this documentation introduces all the core components and best practices for developing AI models for the Lucd JedAI platform, there is rarely a replacement for sample code. The Lucd Model Shop provides a wide range of code (prepared by Lucd engineers) to help developers get started with preparing AI models. In the future, the Lucd Model Shop will also allow for the larger Lucd developer community to share their code, further helping others with their AI goals. Python API Documentation The LMF Python API documentation can be found in the following Lucd GitLab Pages site, https://lucd.pages.lucd.ai/mock-product-development/lucd-eda-rest/. Preparing Models Using the Lucd Modeling Framework The following documentation contains further details and examples for developing AI models for Lucd. - Developing Compact Models - Developing Full Models - Working with Data and Performance Analysis - The Lucd Model Shop View full record
  7. This section covers the high-level tasks needed to support model training: importing and preparing data and analyzing performance. Most of the content here pertains to the full model approach, but some (e.g., reporting model status) is still helpful for PyTorch compact modeling. Data can be imported into a modeling context using the Lucd Unified Dataspace (UDS) API (eda.lib.lucd_uds). This library provides functions for creating datasets of various formats (TensorFlow, PyTorch, Dask dataframe) based on Lucd virtual datasets defined in the Unity client. It also provides the capability to retrieve previously trained word embeddings. The Lucd UDS functions providing data retrieval are listed below. Some are used for straightforward data importing (e.g., get_datframe) while others are used for preparing framework-specific datasets for AI models (e.g., get_tf_dataset for defining TensorFlow datasets). Refer to the API documentation for full function descriptions, and visit the Lucd Model Shop gitlab project for examples on how to use the functions for developing AI models. get_asset get_dataframe get_tf_dataset get_tf_dataset_image get_tf_dataset_text train_eval_test_split_dataframe train_eval_test_split_pytorch train_eval_test_split_tensorflow Important notes for implementing multi-class modeling TensorFlow offers different approaches to building multi-class models, two prominent ones being using pre-made Estimators (https://www.tensorflow.org/tutorials/estimator/premade#overview_of_programming_with_estimators), and using general techniques such as with Keras models and Estimators. If one-hot encoded data labels are needed (i.e., to match the number of nodes in a neural network output layer), the num_classes parameter should be used when calling relevant functions to get data (e.g., lucd_uds.get_tf_dataset). Note that most (if not all) TensorFlow pre-made Estimator models do not require explicitly one-hot encoded data labels for non-binary modeling (e.g., tensorflow.estimator.DNNClassifier) and hence, the num_classes argument can be omitted. In the case of TensorFlow Estimators, developers are encouraged to understand how to shape input for the models. The same goes for modeling with PyTorch or XGBoost. Analyzing Model Performance Post-training performance analysis tasks are supported by the Lucd Machine Learning (ML) API (eda.lib.lucd_ml). This library provides functions supporting automatic execution and reporting of critical performance analysis tasks (e.g., creating confusion matrices, ROC curves), preventing the need to repeatedly write such code. The tables and plots created from these library functions can be viewed in the Unity client after the entire model training process has completed. The Lucd ML functions for performance analysis are listed below. Refer to the API documentation for full function descriptions. get_predictions_classification_pt get_predictions_classification_tf get_predictions_regression_pt get_predictions_regression_tf lucd_precision_recall_curve lucd_roc_curve lucd_confusion_matrix update_plots Submitting Performance Analysis Results Trained models and metadata can be uploaded to the Lucd backend via the eda.int.train.update function. The following piece of example code illustrates how to use the function. 
model_filename = lucd_uds.zip_model_tf(trained_classifier, serving_input_receiver_fn, model_id, graph_version, log_dir) # Store model graph and performance stats back to Lucd back-end with open(model_filename, "rb") as graph_file: train.update({tid: { 'performance': { 'loss': loss, 'accuracy': accuracy, 'macro_precision': precision_macro, 'macro_recall': recall_macro, 'macro_f1': f1_macro, 'micro_precision': precision_micro, 'micro_recall': recall_micro, 'micro_f1': f1_micro, 'weighted_precision': precision_weighted, 'weighted_recall': recall_weighted, 'weighted_f1': f1_weighted, 'precision_recall_f1_per_label': results_string, }, 'graph_version': graph_version, 'graph_file': graph_file.read() }}) train.update takes a python dictionary as the argument, with the train_id, described in Table 1, as the top-level key (tid represents the table_id in the code snippet above). The secondary keys graph_version and graph_file store the graph version and trained graph file (model) respectively. The secondary key performance stores another dictionary for performance values. There is no restriction on the key-value pairs here. The developer is allowed to choose the performance values and they will be viewable in the Unity client afterward. The values shown in the code snippet above are customary for evaluating classification models. Again, see example models in The Lucd Model Shop for more insights. As for precision_recall_f1_per_label, an example of how to compute performance statistics per label can be found at: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html. precision_recall_f1_per_label must be formatted as a semicolon-separated string of tuples, with each tuple formatted as “-,,. An example is as follows, precision_recall_f1_per_label: setosa-1.0,0.5,1.0;virginica-1.0,1.0,0.6;versicolor-1.0,0.6666666666666666,0.7499999999999999. Enabling Model Explainability To enable a trained model to be used by the explainability tool in the Unity client, some parameters must be defined. For TensorFlow models, ordered_feature_names, ordered_class_names, input_name, and output_name must be defined. ordered_feature_names (not to be confused with training data input column names) is a list of ordered names of the inputs to the trained model, commonly defined in TensorFlow model definitions as tf.feature_column. For example, for a TensorFlow text classification model, the named input might be embedding_input. Please see example code in The Lucd Model Shop. ordered_class_names is a list formatted such that string class names are ordered by their integer representations (the order of outputs from your model). For instance, for a binary classification model for which the labels are 0 and 1, the order of strings must be negative and positive (or whatever string labels you choose). input_name is the name of the input layer in your TensorFlow model to which your ordered_feature_names data will be passed. output_name is the name of the output layer in your TensorFlow model (by default these can be named things like ‘dense_2’ and ‘scores’). The output_name is used to retrieve your model outputs in the proper format for explanation. PyTorch models only require that ordered_class_names be provided. Plots Lucd allows users to plot associated model training metrics in the Unity client. The plots will update in realtime during model training, providing insight into the viability, problems, and successes of training runs. 
Users are able to store any information they want to be plotted as part of a training run using the eda.lib.lucd_ml.update_plots function. The provided information must abide by the following assumptions: 1. The top level keys represent individual plots; 2. Each inner dictionary must specify a labels and a description key, where labels are [“plot x_label”, “plot y_label”] and the description can be any string; 3. All remaining keys in the inner dictionary will be treated as individual lines on the plot, so in the following example "accuracy" is a line on the plot. { "accuracy": { "l1": [ [1, 0.10], [2, 0.15] ], "labels": ["epoch", "accuracy"], "description": "This is a simple accuracy plot example." } } The individual elements in the “l1” line above represent [x_val, y_val]. A common example as shown above includes the following: "l1": [ [epoch, accuracy], [epoch, accuracy], ... ] Once created, the dictionary may be stored for plotting with the lucd_ml.update_plots function: def update_plots(train_or_eval: bool, tid: str, dictionary: dict) -> str: ... dictionary = { "accuracy": { "l1": [ [1, 0.10], [2, 0.15] ], "labels": ["epoch", "accuracy"], "description": "This is a simple accuracy plot example." } } update_plots(True, train_id, dictionary) Regarding the update_plots function, train_or_eval allows a user to specify whether their plot is part of the training or evaluation (or validation) cycle of the model training: train=true, eval=false. A TensorFlow hook is provided in lucd_ml for automatically parsing generated events files (the same as used by TensorBoard) and passing them to update_plots as part of a TensorFlow model. It can be provided as part of a TensorFlow EvalSpec or TrainSpec object as follows (stub included for posterity): class LucdTFEstimatorHook(tf.estimator.SessionRunHook): def __init__(self, train_hook: bool, log_dir: str, tid: str, freq: int, last_epoch: int): ... train_spec = tf.estimator.TrainSpec( input_fn=lambda: lucd_uds.get_tf_dataset_image(type_dict, training_data, num_features, target_type, num_classes).repeat(count=None).shuffle(30).batch(int(30)), max_steps=training_steps, hooks=lucd_ml.LucdTFEstimatorHook(train_hook=True, log_dir=log_dir, tid=tid, freq=10, max_epoch=training_steps)]) train_hook allows a user to specify whether the hook is providing train or eval metrics to the user (train=true, eval=false). log_dir tells the hook where to find TensorFlow events files. freq is the frequency that the hook should look for metrics in the events files. last_epoch tells the hook the number of epochs being run so the hook can ignore freq for the last epoch. Two last helper functions are provided as part of the Lucd plotting framework: lucd_roc_curve and lucd_precision_recall_curve. These functions generate ROC curves and precision-recall curves respectively and are called selectively when using the compact modeling approach (enabled by “performance_curves” button in unity). Further documentation for these functions is provided in the API. def lucd_roc_curve(truths: list, scores: list, class_list: list, tid: str = None, write_accumulo: bool = False) -> (dict, dict, dict, dict): def lucd_precision_recall_curve(truths: list, scores: list, class_list: list, tid: str = None, write_accumulo: bool = False) -> (dict, dict, dict, dict): ... Confusion Matrix Lucd provides an in-depth, interactive confusion matrix for classification model evaluation. Users may select a square in the Unity client to be shown actual records associated with the square selected. 
This may be enabled by using the following function: def lucd_confusion_matrix(test_set: list or DataLoader, predictions: list, num_classes: int, label_mapping: type(abs), tid: str, write_accumulo: bool,) -> (dict, str): The function arguments details are provided below. test_set: Users may directly pass the PyTorch DataLoader or list of delayed dask dataframes returned from the respective train_eval_test_split_pytorch/tensorflow function. predictions: This should be a list of predictions generated by your model (the list returned from lucd_ml get_predictions_classification). The list must be in the same order as the test_set data. num_classes: An integer number of classes for the confusion matrix to represent. label_mapping: A function to map integers to class labels, which is used to map predictions to a human-readable format. tid: Training id to associate confusion matrix with. write_accumulo: Boolean specifying whether to write the dictionary directly to accumulo. Must be True to enable Unity client Confusion Matrix. If False, the generated confusion matrix will only be returned & not written to the database. Further documentation for this function exists in the API documentation. Here is an example usage: def _label_mapping(): return {0: 'I. versicolor', 1: 'I. virginica', 2: 'I. setosa'} ... # Prepare vds data for modeling delayed_values_training, delayed_values_evaluation, delayed_values_testing, my_df_testing_label, num_features = \ lucd_uds.train_eval_test_split_tensorflow(virtual_dataset_id, evaluation_dataset_percent, testing_dataset_percent) ... predictions, scores = lucd_ml.get_predictions_classification_tf(_estimator, lambda: uds.get_tf_dataset( feature_dict, delayed_values_testing, num_features, target_type).batch(1), classification_mode, .5) ... lucd_ml.lucd_confusion_matrix(delayed_values_testing, predictions, 3, label_mapping(), tid, True) Submitting Model Training Status Another helpful function is eda.int.train.status, which is used for storing the status of a developer’s training pipeline. This enables a model’s status to be displayed on the Unity client. The function definition is below. def status(uid, code, message=None): """Update model status in the database. Args: uid: Int representing a model's ID. code: 0 - RETRIEVING VDS, 1 - TRAINING, 2 - ANALYZING PERFORMANCE, 3 - STORING MODEL, 4 - TRAINING COMPLETE, 5 - ERROR, 6 - QUEUED, 7 - STOPPED. message: String representing optional custom message to include. Returns: Status message. Raises: TypeError: If code is not of type int. Exception: If code is invalid. """
  8. This section covers the high-level tasks needed to support model training: importing and preparing data and analyzing performance. Most of the content here pertains to the full model approach, but some (e.g., reporting model status) is still helpful for PyTorch compact modeling. Data can be imported into a modeling context using the Lucd Unified Dataspace (UDS) API (eda.lib.lucd_uds). This library provides functions for creating datasets of various formats (TensorFlow, PyTorch, Dask dataframe) based on Lucd virtual datasets defined in the Unity client. It also provides the capability to retrieve previously trained word embeddings. The Lucd UDS functions providing data retrieval are listed below. Some are used for straightforward data importing (e.g., get_datframe) while others are used for preparing framework-specific datasets for AI models (e.g., get_tf_dataset for defining TensorFlow datasets). Refer to the API documentation for full function descriptions, and visit the Lucd Model Shop gitlab project for examples on how to use the functions for developing AI models. get_asset get_dataframe get_tf_dataset get_tf_dataset_image get_tf_dataset_text train_eval_test_split_dataframe train_eval_test_split_pytorch train_eval_test_split_tensorflow Important notes for implementing multi-class modeling TensorFlow offers different approaches to building multi-class models, two prominent ones being using pre-made Estimators (https://www.tensorflow.org/tutorials/estimator/premade#overview_of_programming_with_estimators), and using general techniques such as with Keras models and Estimators. If one-hot encoded data labels are needed (i.e., to match the number of nodes in a neural network output layer), the num_classes parameter should be used when calling relevant functions to get data (e.g., lucd_uds.get_tf_dataset). Note that most (if not all) TensorFlow pre-made Estimator models do not require explicitly one-hot encoded data labels for non-binary modeling (e.g., tensorflow.estimator.DNNClassifier) and hence, the num_classes argument can be omitted. In the case of TensorFlow Estimators, developers are encouraged to understand how to shape input for the models. The same goes for modeling with PyTorch or XGBoost. Analyzing Model Performance Post-training performance analysis tasks are supported by the Lucd Machine Learning (ML) API (eda.lib.lucd_ml). This library provides functions supporting automatic execution and reporting of critical performance analysis tasks (e.g., creating confusion matrices, ROC curves), preventing the need to repeatedly write such code. The tables and plots created from these library functions can be viewed in the Unity client after the entire model training process has completed. The Lucd ML functions for performance analysis are listed below. Refer to the API documentation for full function descriptions. get_predictions_classification_pt get_predictions_classification_tf get_predictions_regression_pt get_predictions_regression_tf lucd_precision_recall_curve lucd_roc_curve lucd_confusion_matrix update_plots Submitting Performance Analysis Results Trained models and metadata can be uploaded to the Lucd backend via the eda.int.train.update function. The following piece of example code illustrates how to use the function. 
Analyzing Model Performance

Post-training performance analysis tasks are supported by the Lucd Machine Learning (ML) API (eda.lib.lucd_ml). This library provides functions supporting automatic execution and reporting of critical performance analysis tasks (e.g., creating confusion matrices and ROC curves), preventing the need to repeatedly write such code. The tables and plots created by these library functions can be viewed in the Unity client after the entire model training process has completed.

The Lucd ML functions for performance analysis are listed below. Refer to the API documentation for full function descriptions.

get_predictions_classification_pt
get_predictions_classification_tf
get_predictions_regression_pt
get_predictions_regression_tf
lucd_precision_recall_curve
lucd_roc_curve
lucd_confusion_matrix
update_plots

Submitting Performance Analysis Results

Trained models and metadata can be uploaded to the Lucd backend via the eda.int.train.update function. The following piece of example code illustrates how to use the function.

model_filename = lucd_uds.zip_model_tf(trained_classifier, serving_input_receiver_fn,
                                       model_id, graph_version, log_dir)

# Store model graph and performance stats back to the Lucd back-end
with open(model_filename, "rb") as graph_file:
    train.update({tid: {
        'performance': {
            'loss': loss,
            'accuracy': accuracy,
            'macro_precision': precision_macro,
            'macro_recall': recall_macro,
            'macro_f1': f1_macro,
            'micro_precision': precision_micro,
            'micro_recall': recall_micro,
            'micro_f1': f1_micro,
            'weighted_precision': precision_weighted,
            'weighted_recall': recall_weighted,
            'weighted_f1': f1_weighted,
            'precision_recall_f1_per_label': results_string,
        },
        'graph_version': graph_version,
        'graph_file': graph_file.read()
    }})

train.update takes a Python dictionary as its argument, with the train_id, described in Table 1, as the top-level key (tid represents the train_id in the code snippet above). The secondary keys graph_version and graph_file store the graph version and the trained graph file (model), respectively. The secondary key performance stores another dictionary of performance values. There is no restriction on the key-value pairs here: the developer may choose which performance values to report, and they will be viewable in the Unity client afterward. The values shown in the code snippet above are customary for evaluating classification models. Again, see the example models in the Lucd Model Shop for more insights.

As for precision_recall_f1_per_label, an example of how to compute performance statistics per label can be found at https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html. precision_recall_f1_per_label must be formatted as a semicolon-separated string of tuples, with each tuple formatted as "<label>-<precision>,<recall>,<f1>". An example is as follows:

precision_recall_f1_per_label: setosa-1.0,0.5,1.0;virginica-1.0,1.0,0.6;versicolor-1.0,0.6666666666666666,0.7499999999999999
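Since the per-label statistics can be computed with scikit-learn's precision_recall_fscore_support, here is a hedged sketch of building results_string from its output; y_true and y_pred are placeholders for the test labels and predictions, and the label names and ordering are illustrative:

from sklearn.metrics import precision_recall_fscore_support

labels = ['setosa', 'virginica', 'versicolor']
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                                           labels=[0, 1, 2])

# Join "<label>-<precision>,<recall>,<f1>" tuples with semicolons.
results_string = ";".join(
    "{}-{},{},{}".format(name, p, r, f)
    for name, p, r, f in zip(labels, precision, recall, f1))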
Enabling Model Explainability

To enable a trained model to be used by the explainability tool in the Unity client, some parameters must be defined. For TensorFlow models, ordered_feature_names, ordered_class_names, input_name, and output_name must be defined.

ordered_feature_names (not to be confused with training data input column names) is a list of ordered names of the inputs to the trained model, commonly defined in TensorFlow model definitions as tf.feature_column. For example, for a TensorFlow text classification model, the named input might be embedding_input. Please see the example code in the Lucd Model Shop.

ordered_class_names is a list of string class names ordered by their integer representations (the order of outputs from your model). For instance, for a binary classification model whose labels are 0 and 1, the order of strings must be negative and positive (or whatever string labels you choose).

input_name is the name of the input layer in your TensorFlow model to which your ordered_feature_names data will be passed.

output_name is the name of the output layer in your TensorFlow model (by default these can be named things like 'dense_2' and 'scores'). The output_name is used to retrieve your model outputs in the proper format for explanation.

PyTorch models only require that ordered_class_names be provided.

Plots

Lucd allows users to plot associated model training metrics in the Unity client. The plots update in real time during model training, providing insight into the viability, problems, and successes of training runs. Users can store any information they want plotted as part of a training run using the eda.lib.lucd_ml.update_plots function. The provided information must abide by the following assumptions:

1. The top-level keys represent individual plots.
2. Each inner dictionary must specify a labels and a description key, where labels is ["plot x_label", "plot y_label"] and the description can be any string.
3. All remaining keys in the inner dictionary are treated as individual lines on the plot; in the following example, "l1" is a line on the "accuracy" plot.

dictionary = {
    "accuracy": {
        "l1": [[1, 0.10], [2, 0.15]],
        "labels": ["epoch", "accuracy"],
        "description": "This is a simple accuracy plot example."
    }
}

The individual elements of the "l1" line above represent [x_val, y_val] pairs; in this common case, each element is an [epoch, accuracy] pair. Once created, the dictionary may be stored for plotting with the lucd_ml.update_plots function:

def update_plots(train_or_eval: bool, tid: str, dictionary: dict) -> str:
    ...

update_plots(True, train_id, dictionary)

Regarding the update_plots function, train_or_eval allows a user to specify whether the plot is part of the training or the evaluation (or validation) cycle of model training: train=True, eval=False.

A TensorFlow hook is provided in lucd_ml for automatically parsing generated events files (the same files used by TensorBoard) and passing them to update_plots as part of a TensorFlow model. It can be provided as part of a TensorFlow EvalSpec or TrainSpec object as follows (stub included for reference):

class LucdTFEstimatorHook(tf.estimator.SessionRunHook):
    def __init__(self, train_hook: bool, log_dir: str, tid: str, freq: int, last_epoch: int):
        ...

train_spec = tf.estimator.TrainSpec(
    input_fn=lambda: lucd_uds.get_tf_dataset_image(
        type_dict, training_data, num_features, target_type,
        num_classes).repeat(count=None).shuffle(30).batch(30),
    max_steps=training_steps,
    hooks=[lucd_ml.LucdTFEstimatorHook(train_hook=True, log_dir=log_dir,
                                       tid=tid, freq=10, last_epoch=training_steps)])

train_hook allows a user to specify whether the hook is providing train or eval metrics (train=True, eval=False). log_dir tells the hook where to find the TensorFlow events files. freq is the frequency with which the hook should look for metrics in the events files. last_epoch tells the hook the number of epochs being run, so the hook can ignore freq for the last epoch.

Two last helper functions are provided as part of the Lucd plotting framework: lucd_roc_curve and lucd_precision_recall_curve. These functions generate ROC curves and precision-recall curves, respectively, and are called selectively when using the compact modeling approach (enabled by the "performance_curves" button in the Unity client). Further documentation for these functions is provided in the API.

def lucd_roc_curve(truths: list, scores: list, class_list: list, tid: str = None,
                   write_accumulo: bool = False) -> (dict, dict, dict, dict):
    ...

def lucd_precision_recall_curve(truths: list, scores: list, class_list: list, tid: str = None,
                                write_accumulo: bool = False) -> (dict, dict, dict, dict):
    ...
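A hedged sketch of calling the curve helpers for a hypothetical three-class model follows; truths and scores are assumed to be the ground-truth labels and per-class scores gathered during prediction, the class names are illustrative, and the contents of the four returned dicts are left to the API documentation:

# Generate and persist ROC and precision-recall curve data; passing
# write_accumulo=True stores the curves for display in the Unity client.
roc_results = lucd_ml.lucd_roc_curve(
    truths, scores, class_list=['setosa', 'virginica', 'versicolor'],
    tid=train_id, write_accumulo=True)

pr_results = lucd_ml.lucd_precision_recall_curve(
    truths, scores, class_list=['setosa', 'virginica', 'versicolor'],
    tid=train_id, write_accumulo=True)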
Confusion Matrix

Lucd provides an in-depth, interactive confusion matrix for classification model evaluation. Users may select a square in the Unity client to be shown the actual records associated with the selected square. This may be enabled by using the following function:

def lucd_confusion_matrix(test_set: list or DataLoader, predictions: list,
                          num_classes: int, label_mapping: type(abs), tid: str,
                          write_accumulo: bool) -> (dict, str):
    ...

Details of the function arguments are provided below.

test_set: Users may directly pass the PyTorch DataLoader or the list of delayed Dask dataframes returned from the respective train_eval_test_split_pytorch/tensorflow function.
predictions: A list of predictions generated by your model (the list returned from the lucd_ml get_predictions_classification functions). The list must be in the same order as the test_set data.
num_classes: An integer number of classes for the confusion matrix to represent.
label_mapping: A function mapping integers to class labels, used to map predictions to a human-readable format.
tid: Training ID to associate the confusion matrix with.
write_accumulo: Boolean specifying whether to write the dictionary directly to Accumulo. Must be True to enable the Unity client confusion matrix. If False, the generated confusion matrix is only returned and not written to the database.

Further documentation for this function exists in the API documentation. Here is an example usage:

def _label_mapping():
    return {0: 'I. versicolor', 1: 'I. virginica', 2: 'I. setosa'}

...

# Prepare VDS data for modeling
delayed_values_training, delayed_values_evaluation, delayed_values_testing, \
    my_df_testing_label, num_features = \
    lucd_uds.train_eval_test_split_tensorflow(virtual_dataset_id,
                                              evaluation_dataset_percent,
                                              testing_dataset_percent)

...

predictions, scores = lucd_ml.get_predictions_classification_tf(
    _estimator,
    lambda: lucd_uds.get_tf_dataset(feature_dict, delayed_values_testing,
                                    num_features, target_type).batch(1),
    classification_mode, .5)

...

lucd_ml.lucd_confusion_matrix(delayed_values_testing, predictions, 3,
                              _label_mapping, tid, True)

Submitting Model Training Status

Another helpful function is eda.int.train.status, which is used for storing the status of a developer's training pipeline. This enables a model's status to be displayed in the Unity client. The function definition is below.

def status(uid, code, message=None):
    """Update model status in the database.

    Args:
        uid: Int representing a model's ID.
        code: 0 - RETRIEVING VDS, 1 - TRAINING, 2 - ANALYZING PERFORMANCE,
            3 - STORING MODEL, 4 - TRAINING COMPLETE, 5 - ERROR,
            6 - QUEUED, 7 - STOPPED.
        message: String representing optional custom message to include.

    Returns:
        Status message.

    Raises:
        TypeError: If code is not of type int.
        Exception: If code is invalid.
    """
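A minimal sketch of reporting status around a training pipeline, assuming uid holds the model's ID; the code values follow the docstring above:

from eda.int import train

train.status(uid, 0)  # RETRIEVING VDS
# ... retrieve and prepare the virtual dataset ...

train.status(uid, 1)  # TRAINING
# ... run the training loop ...

train.status(uid, 2)  # ANALYZING PERFORMANCE
# ... compute metrics, confusion matrix, and curves ...

train.status(uid, 4, message="Training finished successfully.")  # TRAINING COMPLETE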
  10. This documentation describes how to use the Lucd Python libraries for creating custom feature engineering operations for processing data for model training. Custom feature engineering operations enable a Lucd user to apply feature engineering operations that are not available in the Lucd GUI. A simple example is adjusting a given attribute as a function of the values of several other attributes. Overall, this should enable near open-ended feature transformation needs.

The Custom Operation Module

The eda.int.custom_operation module is used for sending custom operations to the Lucd backend so that they are selectable in the Lucd EDA section of the GUI. The simple Python script below illustrates how one may define a custom feature engineering operation and send it to the Lucd backend using the Lucd Python libraries.

from eda.int import custom_operation
import lucd

def transform(row):
    row['flower.petal_length'] = row['flower.petal_length'] + 1
    return row

client = lucd.LucdClient(domain="domain",
                         username="username",
                         password="password",
                         login_domain="Lucd Platform URL")

data = {
    "operation_name": "simple iris attribute adder",
    "author_name": "J. Branch",
    "author_email": "email@address",
    "operation_description": "Testing custom feature op using simple transform",
    "operation_purpose": "Numeric feature scaling",
    "operation_features": ["flower.petal_length"],
    "operation_function": transform
}

response_json, rv = custom_operation.create(data)
client.close()

The create function from the eda.int.custom_operation module is used to submit the actual function and metadata to Lucd. The required attributes in the dict data are used for display in the Lucd GUI (and may at some point be used for search purposes). Table 1 describes the attributes in detail.

Table 1. Custom Operation Attribute Descriptions

operation_name: String name/label of the custom feature engineering operation.
operation_description: Longer description of how the operation might achieve its purpose, as well as other notes.
operation_purpose: Short description of what the operation achieves.
operation_features: List of strings identifying the features/facets the operation affects.
operation_function: The Python function that implements the logic of the custom operation.
author_name: Name of the developer who wrote the operation.
author_email: Author's email.

Regarding the custom transformation function, it is essential that the function be defined in the same context as the dict and the call to create (e.g., in the same Python script, as shown above). This ensures that deserialization of the function works properly on the Lucd backend.

Custom Operation Types

Applied Operations

Applied operations denote custom operations that are applied to data using conventional Dask and Pandas function application mechanisms. Custom functions can be applied to data (dataframes) via the following Dask functions: apply, map_partitions, and applymap. Note that applied operations are applied to Pandas dataframes (which act as partitions of larger Dask dataframes); the user should consider this when designing a custom operation. As shown in the example code above, the user does not identify which Dask mechanism to use for applying a function to data in the create function; this is selectable in the Lucd GUI when applying custom operations in the EDA section. A local sanity check of an applied operation is sketched below.
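The following hedged sketch exercises the transform function locally with Dask's row-wise apply mechanism before uploading it; the sample data and the choice of apply (rather than map_partitions or applymap) are illustrative only:

import dask.dataframe as dd
import pandas as pd

def transform(row):
    row['flower.petal_length'] = row['flower.petal_length'] + 1
    return row

# Build a small Dask dataframe from illustrative data.
pdf = pd.DataFrame({'flower.petal_length': [1.4, 4.7, 5.1]})
ddf = dd.from_pandas(pdf, npartitions=2)

# Row-wise application; each partition is a Pandas dataframe, matching
# how applied operations are executed on the Lucd backend.
out = ddf.apply(transform, axis=1, meta=pdf)
print(out.compute())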
Dask Operations

Dask operations denote custom operations that are applied directly to a Dask dataframe. This is useful in cases where a custom operation/algorithm cannot be correctly applied to a "partition" of data as described above; a clustering algorithm is one example. We expect these custom operations to use modules from Dask libraries such as Dask-ML (https://ml.dask.org/). An example custom operation is shown below.

def cluster_function(df):
    from dask_ml.cluster import KMeans

    X = df[['car.days_in_service']]
    kmeans = KMeans(n_clusters=4)
    k_service_days = kmeans.fit(X)
    labels = k_service_days.predict(X)
    df['cars.service_days_cluster'] = labels
    return df

When a Dask operation is to be applied (in the Unity client), the user should select apply_direct as the application type.
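Because a Dask operation receives the entire Dask dataframe, logic that depends on global statistics behaves correctly here even though it would not if run one partition at a time. A hedged sketch of locally testing such an operation, with illustrative column names and data:

import dask.dataframe as dd
import pandas as pd

def normalize_service_days(df):
    # The mean and standard deviation are global statistics, so this
    # cannot be computed correctly partition by partition.
    col = df['car.days_in_service']
    df['car.days_in_service_norm'] = (col - col.mean()) / col.std()
    return df

ddf = dd.from_pandas(
    pd.DataFrame({'car.days_in_service': [10, 200, 35, 900, 420, 15]}),
    npartitions=2)
print(normalize_service_days(ddf).compute())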
