Search the Community

Showing results for tags 'lucd'.

Found 22 results

  1. Enables authentication to the Lucd backend and access to data for custom analysis and model prototyping. Enables defining custom feature transformation operations and uploading them to Lucd for use in the GUI. Provides access to the Asset, Concept, Custom Contract, Custom Operation, Explore, Model, UDS, and VDS REST APIs. Provides multiple working examples of individual REST calls as well as complete models. Enables Dask exploration of the data on the client side. Enables development and testing of models using the Lucd PyTorch, TensorFlow, Keras, and XGBoost classes.
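A minimal sketch of the client-side Dask exploration this describes, assuming authentication has already been performed through the Python client's login mechanism; lucd_uds.get_dataframe is named in the framework documentation below, but the single-argument signature shown here is an assumption:

from eda.lib import lucd_uds

def explore_vds(vds_id):
    # Retrieve a Lucd virtual dataset as a Dask dataframe for client-side
    # exploration (the vds_id-only call is an assumption for illustration).
    df = lucd_uds.get_dataframe(vds_id)

    # Dask operations are lazy; compute() materializes results locally.
    print(df.columns)
    print(df.describe().compute())
    return df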
  2. The Lucd platform enables the user to perform model performance analysis. Easy-to-use tools allow the user to view and compare data before and after it has been trained against a selected model. The platform also tracks the critical governance and explainability information associated with the process.
  3. The Lucd platform provides necessary, easy-to-use tools for dataset query and data visualization. Query and visualization are important initial steps when preparing data for machine learning, and Lucd makes this a snap.
  4. Lucd enables users to transform large-scale data and provides an easy-to-use method for creating a "virtual dataset" from the data to be used for training. This transformed data can then be trained against the required models.
  5. This section covers the high-level tasks needed to support model training: importing and preparing data and analyzing performance. Most of the content here pertains to the full model approach, but some (e.g., reporting model status) is still helpful for PyTorch compact modeling.

Importing and Preparing Data

Data can be imported into a modeling context using the Lucd Unified Dataspace (UDS) API (eda.lib.lucd_uds). This library provides functions for creating datasets of various formats (TensorFlow, PyTorch, Dask dataframe) based on Lucd virtual datasets defined in the Unity client. It also provides the capability to retrieve previously trained word embeddings. The Lucd UDS functions providing data retrieval are listed below. Some are used for straightforward data importing (e.g., get_dataframe) while others prepare framework-specific datasets for AI models (e.g., get_tf_dataset for defining TensorFlow datasets). Refer to the API documentation for full function descriptions, and visit the Lucd Model Shop GitLab project for examples of how to use the functions when developing AI models.

  • get_asset
  • get_dataframe
  • get_tf_dataset
  • get_tf_dataset_image
  • get_tf_dataset_text
  • train_eval_test_split_dataframe
  • train_eval_test_split_pytorch
  • train_eval_test_split_tensorflow

Important notes for implementing multi-class modeling: TensorFlow offers different approaches to building multi-class models, two prominent ones being pre-made Estimators (https://www.tensorflow.org/tutorials/estimator/premade#overview_of_programming_with_estimators) and general techniques such as Keras models converted to Estimators. If one-hot encoded data labels are needed (i.e., to match the number of nodes in a neural network output layer), the num_classes parameter should be used when calling the relevant data retrieval functions (e.g., lucd_uds.get_tf_dataset). Note that most (if not all) TensorFlow pre-made Estimator models (e.g., tensorflow.estimator.DNNClassifier) do not require explicitly one-hot encoded data labels for non-binary modeling, and hence the num_classes argument can be omitted. In the case of TensorFlow Estimators, developers are encouraged to understand how to shape input for the models. The same goes for modeling with PyTorch or XGBoost.

Analyzing Model Performance

Post-training performance analysis tasks are supported by the Lucd Machine Learning (ML) API (eda.lib.lucd_ml). This library provides functions supporting automatic execution and reporting of critical performance analysis tasks (e.g., creating confusion matrices and ROC curves), removing the need to repeatedly write such code. The tables and plots created by these library functions can be viewed in the Unity client after the entire model training process has completed. The Lucd ML functions for performance analysis are listed below. Refer to the API documentation for full function descriptions.

  • get_predictions_classification_pt
  • get_predictions_classification_tf
  • get_predictions_regression_pt
  • get_predictions_regression_tf
  • lucd_precision_recall_curve
  • lucd_roc_curve
  • lucd_confusion_matrix
  • update_plots

Submitting Performance Analysis Results

Trained models and metadata can be uploaded to the Lucd backend via the eda.int.train.update function. The following example code illustrates how to use the function.
model_filename = lucd_uds.zip_model_tf(trained_classifier, serving_input_receiver_fn,
                                       model_id, graph_version, log_dir)

# Store model graph and performance stats back to the Lucd back-end
with open(model_filename, "rb") as graph_file:
    train.update({tid: {
        'performance': {
            'loss': loss,
            'accuracy': accuracy,
            'macro_precision': precision_macro,
            'macro_recall': recall_macro,
            'macro_f1': f1_macro,
            'micro_precision': precision_micro,
            'micro_recall': recall_micro,
            'micro_f1': f1_micro,
            'weighted_precision': precision_weighted,
            'weighted_recall': recall_weighted,
            'weighted_f1': f1_weighted,
            'precision_recall_f1_per_label': results_string,
        },
        'graph_version': graph_version,
        'graph_file': graph_file.read()
    }})

train.update takes a Python dictionary as its argument, with the train_id described in Table 1 as the top-level key (tid represents the train_id in the code snippet above). The secondary keys graph_version and graph_file store the graph version and the trained graph file (model), respectively. The secondary key performance stores another dictionary of performance values. There is no restriction on the key-value pairs here; the developer is free to choose the performance values, and they will be viewable in the Unity client afterward. The values shown in the code snippet above are customary for evaluating classification models. Again, see the example models in The Lucd Model Shop for more insights.

As for precision_recall_f1_per_label, an example of how to compute performance statistics per label can be found at https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html. precision_recall_f1_per_label must be formatted as a semicolon-separated string of tuples, with each tuple formatted as <label>-<precision>,<recall>,<f1>. An example is as follows: precision_recall_f1_per_label: setosa-1.0,0.5,1.0;virginica-1.0,1.0,0.6;versicolor-1.0,0.6666666666666666,0.7499999999999999.
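As a sketch of how the per-label string might be assembled with scikit-learn (the truth/prediction arrays and class names below are hypothetical; only the output format is prescribed above):

from sklearn.metrics import precision_recall_fscore_support

# Hypothetical ground-truth and predicted labels for a three-class problem.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]
class_names = ["setosa", "virginica", "versicolor"]

# Per-label precision, recall, and F1 scores (support is unused here).
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=[0, 1, 2])

# Format as the semicolon-separated <label>-<precision>,<recall>,<f1> string.
results_string = ";".join(
    f"{name}-{p},{r},{f}"
    for name, p, r, f in zip(class_names, precision, recall, f1))
# results_string -> "setosa-1.0,1.0,1.0;virginica-0.666...,1.0,0.8;versicolor-1.0,0.5,0.666..."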
Enabling Model Explainability

To enable a trained model to be used by the explainability tool in the Unity client, some parameters must be defined. For TensorFlow models, ordered_feature_names, ordered_class_names, input_name, and output_name must be defined. ordered_feature_names (not to be confused with training data input column names) is a list of ordered names of the inputs to the trained model, commonly defined in TensorFlow model definitions as tf.feature_column. For example, for a TensorFlow text classification model, the named input might be embedding_input. Please see the example code in The Lucd Model Shop. ordered_class_names is a list formatted such that string class names are ordered by their integer representations (the order of outputs from your model). For instance, for a binary classification model whose labels are 0 and 1, the order of strings must be negative and positive (or whatever string labels you choose). input_name is the name of the input layer in your TensorFlow model to which your ordered_feature_names data will be passed. output_name is the name of the output layer in your TensorFlow model (by default these can be named things like 'dense_2' and 'scores'). The output_name is used to retrieve your model outputs in the proper format for explanation. PyTorch models only require that ordered_class_names be provided.

Plots

Lucd allows users to plot associated model training metrics in the Unity client. The plots update in real time during model training, providing insight into the viability, problems, and successes of training runs.

Users can store any information they want plotted as part of a training run using the eda.lib.lucd_ml.update_plots function. The provided information must abide by the following assumptions:

  1. The top-level keys represent individual plots.
  2. Each inner dictionary must specify a labels and a description key, where labels is ["plot x_label", "plot y_label"] and the description can be any string.
  3. All remaining keys in the inner dictionary are treated as individual lines on the plot, so in the following example "l1" is a line on the "accuracy" plot.

{
    "accuracy": {
        "l1": [[1, 0.10], [2, 0.15]],
        "labels": ["epoch", "accuracy"],
        "description": "This is a simple accuracy plot example."
    }
}

The individual elements of the "l1" line above represent [x_val, y_val]. A common example, as shown above, is: "l1": [[epoch, accuracy], [epoch, accuracy], ...]. Once created, the dictionary may be stored for plotting with the lucd_ml.update_plots function:

def update_plots(train_or_eval: bool, tid: str, dictionary: dict) -> str:
    ...

dictionary = {
    "accuracy": {
        "l1": [[1, 0.10], [2, 0.15]],
        "labels": ["epoch", "accuracy"],
        "description": "This is a simple accuracy plot example."
    }
}

update_plots(True, train_id, dictionary)

Regarding the update_plots function, train_or_eval specifies whether the plot is part of the training or the evaluation (or validation) cycle of model training: train=True, eval=False.

A TensorFlow hook is provided in lucd_ml for automatically parsing generated events files (the same files used by TensorBoard) and passing them to update_plots as part of a TensorFlow model. It can be provided as part of a TensorFlow EvalSpec or TrainSpec object as follows (stub included for reference):

class LucdTFEstimatorHook(tf.estimator.SessionRunHook):
    def __init__(self, train_hook: bool, log_dir: str, tid: str, freq: int, last_epoch: int):
        ...

train_spec = tf.estimator.TrainSpec(
    input_fn=lambda: lucd_uds.get_tf_dataset_image(
        type_dict, training_data, num_features, target_type,
        num_classes).repeat(count=None).shuffle(30).batch(30),
    max_steps=training_steps,
    hooks=[lucd_ml.LucdTFEstimatorHook(train_hook=True, log_dir=log_dir, tid=tid,
                                       freq=10, last_epoch=training_steps)])

train_hook specifies whether the hook is providing train or eval metrics (train=True, eval=False). log_dir tells the hook where to find the TensorFlow events files. freq is the frequency with which the hook should look for metrics in the events files. last_epoch tells the hook the number of epochs being run so that the hook can ignore freq for the last epoch.

Two final helper functions are provided as part of the Lucd plotting framework: lucd_roc_curve and lucd_precision_recall_curve. These functions generate ROC curves and precision-recall curves, respectively, and are called selectively when using the compact modeling approach (enabled by the "performance_curves" button in Unity). Further documentation for these functions is provided in the API.

def lucd_roc_curve(truths: list, scores: list, class_list: list, tid: str = None,
                   write_accumulo: bool = False) -> (dict, dict, dict, dict):
    ...

def lucd_precision_recall_curve(truths: list, scores: list, class_list: list, tid: str = None,
                                write_accumulo: bool = False) -> (dict, dict, dict, dict):
    ...
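Putting update_plots to work in a custom (e.g., PyTorch) training loop might look like the following sketch; train_id, num_epochs, and compute_epoch_loss are hypothetical placeholders, while update_plots is the documented lucd_ml call:

from eda.lib import lucd_ml

loss_points = []
for epoch in range(1, num_epochs + 1):
    epoch_loss = compute_epoch_loss()  # placeholder for one epoch of training
    loss_points.append([epoch, epoch_loss])

    # Re-send the full "loss" plot dictionary each epoch so the Unity client
    # plot updates in real time as training progresses.
    lucd_ml.update_plots(True, train_id, {
        "loss": {
            "l1": loss_points,
            "labels": ["epoch", "loss"],
            "description": "Training loss per epoch."
        }
    })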
Confusion Matrix

Lucd provides an in-depth, interactive confusion matrix for classification model evaluation. Users may select a square in the Unity client to be shown the actual records associated with the selected square. This may be enabled by using the following function:

def lucd_confusion_matrix(test_set: list or DataLoader, predictions: list, num_classes: int,
                          label_mapping: type(abs), tid: str, write_accumulo: bool) -> (dict, str):
    ...

The function arguments are detailed below.

  • test_set: Users may directly pass the PyTorch DataLoader or list of delayed Dask dataframes returned from the respective train_eval_test_split_pytorch/tensorflow function.
  • predictions: A list of predictions generated by your model (the list returned from lucd_ml get_predictions_classification). The list must be in the same order as the test_set data.
  • num_classes: An integer number of classes for the confusion matrix to represent.
  • label_mapping: A function to map integers to class labels, used to map predictions to a human-readable format.
  • tid: Training ID to associate the confusion matrix with.
  • write_accumulo: Boolean specifying whether to write the dictionary directly to Accumulo. Must be True to enable the Unity client confusion matrix. If False, the generated confusion matrix is only returned, not written to the database.

Further documentation for this function exists in the API documentation. Here is an example usage:

def _label_mapping():
    return {0: 'I. versicolor', 1: 'I. virginica', 2: 'I. setosa'}

...

# Prepare VDS data for modeling
delayed_values_training, delayed_values_evaluation, delayed_values_testing, my_df_testing_label, num_features = \
    lucd_uds.train_eval_test_split_tensorflow(virtual_dataset_id, evaluation_dataset_percent,
                                              testing_dataset_percent)

...

predictions, scores = lucd_ml.get_predictions_classification_tf(
    _estimator,
    lambda: lucd_uds.get_tf_dataset(feature_dict, delayed_values_testing, num_features,
                                    target_type).batch(1),
    classification_mode, .5)

...

lucd_ml.lucd_confusion_matrix(delayed_values_testing, predictions, 3, _label_mapping(), tid, True)

Submitting Model Training Status

Another helpful function is eda.int.train.status, which stores the status of a developer's training pipeline. This enables a model's status to be displayed in the Unity client. The function definition is below.

def status(uid, code, message=None):
    """Update model status in the database.

    Args:
        uid: Int representing a model's ID.
        code: 0 - RETRIEVING VDS, 1 - TRAINING, 2 - ANALYZING PERFORMANCE,
            3 - STORING MODEL, 4 - TRAINING COMPLETE, 5 - ERROR, 6 - QUEUED,
            7 - STOPPED.
        message: String representing optional custom message to include.

    Returns:
        Status message.

    Raises:
        TypeError: If code is not of type int.
        Exception: If code is invalid.
    """
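A minimal sketch of wrapping a training pipeline with status reporting, using the documented status codes; train_model is a hypothetical placeholder for actual training logic:

from eda.int import train

def run_pipeline(uid):
    try:
        train.status(uid, 0)   # RETRIEVING VDS
        # ... retrieve the virtual dataset here ...
        train.status(uid, 1)   # TRAINING
        train_model()          # placeholder for your training logic
        train.status(uid, 4)   # TRAINING COMPLETE
    except Exception as e:
        train.status(uid, 5, message=str(e))  # ERROR, with a custom message
        # Full model scripts must end except blocks with raise so the model's
        # status is accurately captured in the Unity client.
        raise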
  6. The Lucd Modeling Framework (LMF) enables developers to build custom AI models and interface them with the Lucd JedAI platform for streamlined management, experimentation, and training using data and parameters established in the JedAI Unity client (or simply, Unity client). The framework supports Python-based AI models built with TensorFlow, PyTorch, and Dask XGBoost. LMF's Python libraries support the following tasks:

  • accessing Lucd virtual datasets (VDSes) for model training and evaluation,
  • analyzing and reporting model performance metrics (e.g., with confusion matrices, ROC curves),
  • storing structures representing trained models and training checkpoints.

Model Development Approaches

LMF provides flexibility in the level of effort and control needed for preparing models for Lucd. The two approaches are the full and compact model approaches; their differences are illustrated in Figure 1.

Figure 1. Conceptual illustration of full and compact model approaches.

Full Model Approach

In the full model approach, a developer creates an AI model and manually uses LMF Python libraries to complete the model training workflow (e.g., train, validate, test on holdout data, store results). This enables complete flexibility for more advanced use cases, which might include designing complex or experimental training loops, advanced performance analysis, custom model compression, etc. Full models are implemented as normal Python scripts. Further details are in the Developing Full Models section of this documentation.

Compact Model Approach

The compact model approach enables a developer to focus most if not all effort on defining an AI model, leaving other workflow tasks like holdout data testing and storage of performance results for the LMF to handle automatically behind the scenes. In the case of TensorFlow, the developer does not even need to write training logic. The major benefits of the compact model approach are (1) significantly less coding effort and (2) potential reduction of errors and/or inconsistencies in writing boilerplate performance-testing logic. These benefits are especially useful for formatting models for multi-run experiments such as k-fold cross-validation and learning curves (which will be introduced in an upcoming LMF release). Further details about compact modeling are in Developing Compact Models.

Notable Framework Capabilities

The LMF consists of an evolving set of capabilities. The following subsections describe notable modeling capabilities supported as of release 6.3.0 RC1.

TensorFlow Estimator-Based Modeling

TensorFlow supports AI modeling using either low-level APIs or easier-to-use high-level Estimator APIs. The LMF is designed to support Estimator-based model development. Keras may be used to create models, especially for enabling more customization; however, such models must be converted to Estimators for LMF and the broader Lucd JedAI platform to manage them appropriately. See the following link for an introduction to TensorFlow Estimators: https://www.tensorflow.org/guide/estimator.
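Since the LMF manages Estimator-based models, a Keras model would be converted with TensorFlow's standard model_to_estimator utility; the architecture and model_dir below are hypothetical, and the exact point of conversion in a Lucd model script is an assumption:

import tensorflow as tf

# Hypothetical Keras model for a three-class tabular classifier.
keras_model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(3, activation="softmax", name="scores"),
])
keras_model.compile(optimizer="adam",
                    loss="sparse_categorical_crossentropy",
                    metrics=["accuracy"])

# Convert to a tf.estimator.Estimator so the LMF and the broader JedAI
# platform can manage it (standard TensorFlow conversion utility).
estimator = tf.keras.estimator.model_to_estimator(
    keras_model=keras_model, model_dir="/tmp/model_dir")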
Various Feature Types

For TensorFlow modeling, all dataset feature column types are supported (see https://www.tensorflow.org/guide/feature_columns), enabling support for a broad range of numeric and categorical features. Regarding categorical features, the domain of such a feature must be known at training time. For example, if you choose to use a feature car_make as a categorical feature, you must know all the possible makes when you write your model. This requirement will be removed in a future release. Also, the conversion of non-numerical data to numerical data (e.g., for encoding label/target values) based on a scan of the entire dataset is not supported in the current release. However, to help with this, data value replacement operations are supported in the Unity client. For TensorFlow modeling, label types are assumed to be TensorFlow int32.

Embedding Datasets

For TensorFlow and PyTorch modeling, LMF supports the use of embedding data, e.g., word2vec for representing free text. For PyTorch, the TorchText library is supported, but n-grams are not supported in the current release. Important note: currently, when using text input, only the text/embedding input is allowed as a feature, enabling conventional text classification. Future releases will enable the use of multiple feature inputs alongside text data.

Image Datasets

For TensorFlow and PyTorch modeling, the use of image data (i.e., pixel values) as model input is supported.

Distributed XGBoost using Dask

Distributed training of XGBoost models using the Dask parallel data analytics framework is supported. See the following link for more information: https://ml.dask.org/xgboost.html. Support for TensorFlow and PyTorch distributed training is under development.

Supported Libraries

The Lucd Modeling Framework supports the following languages and machine-learning-related libraries:

  • Python v3.6.5
  • TensorFlow (for Python) v2.1
  • PyTorch v1.5.1
  • TorchText
  • Dask
  • Dask-ML
  • Dask-XGBoost
  • NumPy v1.16.4
  • Scikit-learn v0.19.2
  • Pandas v0.25.1

Since the Dask and Dask-ML libraries are included in the modeling framework, Dask-based distributed Scikit-learn modules should also be supported. However, the usage of such models has not been tested, and hence Dask-based Scikit-learn operation may be unpredictable.

Example Code

While this documentation introduces all the core components and best practices for developing AI models for the Lucd JedAI platform, there is rarely a replacement for sample code. The Lucd Model Shop provides a wide range of code (prepared by Lucd engineers) to help developers get started with preparing AI models. In the future, the Lucd Model Shop will also allow the larger Lucd developer community to share their code, further helping others with their AI goals.

Python API Documentation

The LMF Python API documentation can be found at the following Lucd GitLab Pages site: https://lucd.pages.lucd.ai/mock-product-development/lucd-eda-rest/.

Preparing Models Using the Lucd Modeling Framework

The following documentation contains further details and examples for developing AI models for Lucd.

  • Developing Compact Models
  • Developing Full Models
  • Working with Data and Performance Analysis
  • The Lucd Model Shop
  7. New Features

  • Added the "compact modeling" framework, enabling end-users to avoid writing boilerplate code (e.g., data retrieval, performance analysis) when preparing models. This now supports PyTorch and TensorFlow.
  • Added capability for producing customizable real-time training update graphs for TensorFlow and PyTorch modeling. Other graphs (e.g., precision-recall) have been added as well.
  • Added interactive confusion matrix capabilities, enabling a detailed view of mislabeled testing data, etc.
  • Added capability to evaluate output of trained models in the JedAI Unity client via the explainability tab. Works for TensorFlow and PyTorch tabular, image, and text classification as well as regression models.

Changes

  • Expanded use of the "model_type" parameter to include "tabular_classification," alleviating the need to perform guesswork in code to determine model type.
  • Refactored/improved retrieval code for virtual datasets and predictions to account for differences between TensorFlow and PyTorch data.
  • TensorFlow models must now return all four of ordered_class_names, ordered_feature_names, input_name, and output_name with compact modeling. PyTorch models must return ordered_class_names with compact modeling.
  • Code refactoring to fix input and output mappings with TensorFlow models.
  • Various other bugfixes.
  • Miscellaneous code refactoring to make PyTorch modeling less error-prone.
  • Enablement of Elasticsearch upgrades, plus minor related bugfixes.
  8. With the full modeling approach, developers implement their model training routines in a Python script file and directly use Python functions for defining and using PyTorch datasets, TensorFlow datasets, and Dask dataframes (for XGBoost modeling) based on Lucd virtual datasets (defined in the Unity client). Additionally, a developer must call functions for uploading trained models and metadata (e.g., model training performance metrics) to the Lucd backend. The advantage of the full model approach is that developers are free to carry over modeling and customized and/or experimental performance analysis techniques from previously written code. Full model examples are contained in The Lucd Model Shop.

Full Model Format

Full models are implemented as Python scripts. Instead of a main function, the code's entrypoint function must be called start; a skeleton start function is sketched after Table 2 below. The arguments passed to start are described in the sections below. As a further note, in full model scripts, except blocks (for handling exceptions) MUST end with the raise statement, as opposed to another terminating statement like return. This ensures that the status of the model is accurately captured in the Unity client.

TensorFlow and PyTorch

Table 1 describes the Python arguments (defined in the Unity client when starting model training) that are always passed to the start function for TensorFlow and PyTorch models.

  • args['model'] (string): Model ID, used for storing checkpoints and models to the Lucd backend
  • args['train_id'] (string): Model "training" ID, used for storing the trained model asset to the Lucd backend
  • args['vds'] (string): Lucd virtual dataset ID, used for retrieving training/validation/testing data for model training
  • args['asset'] (string): Asset (word embedding) ID, used for retrieving word embeddings for text classification model training
  • args['parameters']['steps'] (int): Number of steps for model training
  • args['parameters']['lr'] (float): Learning rate for model training
  • args['parameters']['regularization_value'] (float): Regularization value for model training
  • args['parameters']['eval_percent'] (float): Percentage of the virtual dataset to use for validation
  • args['parameters']['test_percent'] (float): Percentage of the virtual dataset to use for testing
  • args['parameters']['classification_mode'] (string): Type of classification (binary, multiclass, tf_premade_multiclass) as selected in the GUI (only applies to classification models)
  • args['parameters']['prediction_threshold'] (float): For binary classification models, minimum threshold for designating a positive decision
  • args['parameters']['max_document_length'] (int): Maximum number of tokens to be used for free-text input into the model for training (for text classification)
  • args['exportdir'] (string): Directory used for storing the trained model (for upload purposes)
  • args['graphversion'] (string): Version of the graph being trained

Table 1. Full model Python script arguments for TensorFlow and PyTorch models.

Dask XGBoost

Table 2 describes the Python arguments passed to the start function for Dask XGBoost models.
  • args['booster'] (string): XGBoost booster type
  • args['objective'] (string): Learning task and the corresponding learning objective
  • args['base_score'] (float): The initial prediction score of all instances, global bias
  • args['eval_metric'] (string): Evaluation metrics for validation data; a default metric will be assigned according to 'objective'
  • args['seed'] (int): Random number seed
  • args['eta'] (int): Step size shrinkage used in updates to prevent overfitting
  • args['gamma'] (float): Minimum loss reduction required to make a further partition on a leaf node of the tree
  • args['max_depth'] (int): Maximum depth of a tree; increasing this value will make the model more complex and more likely to overfit; beware that XGBoost aggressively consumes memory when training a deep tree
  • args['min_child_weight'] (float): Minimum sum of instance weight (hessian) needed in a child
  • args['max_delta_step'] (int): Maximum delta step we allow each tree's weight estimation to be
  • args['subsample'] (float): Subsample ratio of the training instances
  • args['colsample_bytree'] (float): Subsample ratio of columns when constructing each tree
  • args['colsample_bylevel'] (float): Subsample ratio of columns for each level
  • args['colsample_bynode'] (float): Subsample ratio of columns for each split
  • args['xgboost_lambda'] (float): L2 regularization term on weights; increasing this value will make the model more conservative
  • args['alpha'] (float): L1 regularization term on weights; increasing this value will make the model more conservative
  • args['tree_method'] (string): The tree construction algorithm used in XGBoost
  • args['scale_pos_weight'] (float): Balancing of positive and negative weights
  • args['refresh_leaf'] (int): A parameter of the refresh updater plugin; when this flag is 1, tree leaves as well as tree nodes' stats are updated; when it is 0, only node stats are updated
  • args['process_type'] (string): The type of boosting process to run
  • args['num_parallel_tree'] (int): Number of parallel trees constructed during each iteration; this option is used to support boosted random forests
  • args['sample_type'] (string): Type of sampling algorithm
  • args['normalize_type'] (string): Type of normalization algorithm
  • args['rate_drop'] (float): Dropout rate (a fraction of previous trees to drop during the dropout)
  • args['one_drop'] (string): When this flag is enabled, at least one tree is always dropped during the dropout (allows Binomial-plus-one or epsilon-dropout from the original DART paper)
  • args['skip_drop'] (float): Probability of skipping the dropout procedure during a boosting iteration
  • args['feature_selector'] (string): Feature selection and ordering method
  • args['top_k'] (int): The number of top features to select in the greedy and thrifty feature selectors; a value of 0 means using all the features
  • args['updater'] (string): A comma-separated string defining the sequence of tree updaters to run, providing a modular way to construct and modify the trees; this is an advanced parameter that is usually set automatically depending on other parameters, but it can also be set explicitly by a user

Table 2. Full model Python script arguments for Dask-XGBoost models.
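As referenced above, a minimal skeleton of the full model format; the argument keys and the except/raise requirement come from this section, while the elided training logic is a placeholder:

def start(args):
    # Documented Table 1 arguments passed in by the Unity client.
    tid = args['train_id']
    vds_id = args['vds']
    steps = args['parameters']['steps']
    lr = args['parameters']['lr']
    try:
        # ... retrieve data via lucd_uds, build and train the model,
        # analyze performance, and upload results via train.update ...
        pass
    except Exception:
        # Except blocks in full model scripts MUST end with raise so the
        # model status is accurately captured in the Unity client.
        raise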
  10. With the full modeling approach, developers implement their model training routines in a python script file and directly use python functions for defining and using PyTorch datasets, TensorFlow datasets, and Dask dataframes (for XGBoost modeling) based on Lucd virtual datasets (defined in the Unity client). Additionally, a developer must call functions for uploading trained models and metadata (e.g., model training performance metrics) to the Lucd backend. The advantage of using the full model approach is that developers are free to “carry-over” modeling and customized and/or experimental performance analysis techniques from previously written code. Full model examples are contained in The Lucd Model Shop. Full Model Format Full models are implemented using python scripts. As opposed to using a main function, the code’s entrypoint function must be called start. The arguments passed to start are described in the sections below. As a further note, in full model scripts, except blocks (for handling exceptions) MUST end with the raise statement, as opposed to another terminating statement like return. This ensures that the status of the model is accurately captured in the Unity client. TensorFlow and PyTorch Table 1 describes the python arguments (defined in the Unity client when starting mode training) which are always passed to the start function for TensorFlow and PyTorch models. Argument Description args['model'] (string) Model ID, used for storing checkpoints and models to Lucd backend args['train_id'] (string) Model “training” ID, to be used for storing trained model asset to Lucd backend args['vds'] (string) Lucd virtual dataset ID, used for retrieving training/validation/testing data for model training args['asset'] (string) Asset (word embedding) ID, used for retrieving word embeddings for text classification model training args['parameters']['steps'] (int) Number of steps for model training args['parameters']['lr'] (float) Learning rate for model training args['parameters']['regularization_value'] (float) Regularization value for model training args['parameters']['eval_percent'] (float) Percentage of the virtual dataset to use for validation args['parameters']['test_percent'] (float) Percentage of the virtual dataset to use for testing args['parameters']['classification_mode'] (string) Type of classification (binary, multiclass, tf_premade_multiclass) as selected in the GUI (only applies to classification models) args['parameters']['prediction_threshold'] (float) For binary classification models, minimum threshold for designating a positive decision args['parameters']['max_document_length'] (int) Maximum number of tokens to be used for free text input into the model for training (for text classification). args['exportdir'] (string) Directory used for storing trained model (for upload purposes) args['graphversion'] (string) Version of the graph being trained Table 1. Full model python script arguments for TensorFlow and PyTorch models. Dask XGBoost Table 2 describes the python arguments passed to the start function for Dask XGBoost models. 
Argument Description args['booster'] (string) XGBoost booster type args['objective'] (string) Learning task and the corresponding learning objective args['base_score'] (float) The initial prediction score of all instances, global bias args['eval_metric'] (string) Evaluation metrics for validation data, a default metric will be assigned according to ‘objective’ args['seed'] (int) Random number seed args['eta'] (int) Step size shrinkage used in update to prevents overfitting args['gamma'] (float) Minimum loss reduction required to make a further partition on a leaf node of the tree args['max_depth'] (int) Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit; beware that XGBoost aggressively consumes memory when training a deep tree args['min_child_weight'] (float) Minimum sum of instance weight(hessian) needed in a child args['max_delta_step'] (int) Maximum delta step we allow each tree’s weight estimation to be args['subsample'] (float) Subsample ratio of the training instance args['colsample_bytree'] (float) Subsample ratio of columns when constructing each tree args['colsample_bylevel'] (float) Subsample ratio of columns for each level args['colsample_bynode'] (float) Subsample ratio of columns for each split args['xgboost_lambda'] (float) L2 regularization term on weights; increasing this value will make model more conservative args['alpha'] (float) L1 regularization term on weights; increasing this value will make model more conservative args['tree_method'] (string) The tree construction algorithm used in XGBoost args['scale_pos_weight'] (float) Balancing of positive and negative weights args['refresh_leaf'] (int) This is a parameter of the refresh updater plugin; when this flag is 1, tree leaves as well as tree nodes’ stats are updated; when it is 0, only node stats are updated args['process_type'] (string) A type of boosting process to run args['num_parallel_tree'] (int) Number of parallel trees constructed during each iteration; this option is used to support boosted random forest args['sample_type'] (string) Type of sampling algorithm args['normalize_type'] (string) Type of normalization algorithm args['rate_drop'] (float) Dropout rate (a fraction of previous trees to drop during the dropout) args['one_drop'] (string) When this flag is enabled, at least one tree is always dropped during the dropout (allows Binomial-plus-one or epsilon-dropout from the original DART paper) args['skip_drop'] (float) Probability of skipping the dropout procedure during a boosting iteration args['feature_selector'] (string) Feature selection and ordering method args['top_k'] (int) The number of top features to select in greedy and thrifty feature selector; the value of 0 means using all the features args['updater'] (string) A comma-separated string defining the sequence of tree updaters to run, providing a modular way to construct and to modify the trees. This is an advanced parameter that is usually set automatically, depending on some other parameters. However, it could be also set explicitly by a user. Table 2. Full model python script arguments for Dask-XGBoost models. View full record
  11. The Lucd Modeling Framework (LMF) enables developers to build and interface custom AI models with the Lucd JedAI platform for streamlined management, experimentation, and training using data and parameters established in the JudAI Unity client (or simply, Unity client). The framework supports python-based AI models built with TensorFlow, PyTorch, and Dask XGBoost. LMF’s python libraries support the following tasks: - accessing Lucd virtual datasets (VDSes) for model training and evaluation, - analyzing and reporting model performance metrics (e.g., with confusion matrices, ROC curves), - storing structures representing trained models and training checkpoints. Model Development Approaches LMF provides flexibility in the level of effort and control needed for preparing models for Lucd. The two approaches include the full and compact model approaches; their differences are illustrated in Figure 1. Figure 1. Conceptual illustration of full and compact model approaches. Full Model Approach In the full model approach, a developer creates some AI model and manually uses LMF python libraries to complete the model training workflow (e.g., train, validate, holdout data testing, store results). This enables complete flexibility for more advanced use cases which might include designing complex or experimental training loops, advanced performance analysis, custom model compression, etc. Full models are implemented using normal python scripts. Further details are in the Developing Full Models section of this documentation. Compact Model Approach The compact model approach enables a developer to focus most if not all effort on defining an AI model, leaving other workflow tasks like holdout data testing and storage of performance results for the LMF to do automatically behind the scenes. In the case of TensorFlow, the developer does not even need to write training logic. The major benefits of the compact model approach include (1) significantly less coding effort and (2) potential reduction of errors and/or inconsistencies in writing boilerplate performance-testing logic. These benefits are especially useful for formatting models for multi-run experiments such as k-fold cross validation and learning curves (which will be introduced in an upcoming LMF release). Further details about compact modeling are in Developing Compact Models. Notable Framework Capabilities The LMF consists of an evolving set of capabilities. The following subsections describe notable modeling capabilities supported as of release 6.3.0 RC1. TensorFlow Estimator-Based Modeling TensorFlow supports AI modeling using either low-level APIs or easier-to-use high-level Estimator APIs. The LMF is designed to support Estimator-based model development. Keras may be used to create models, especially for enabling more customization. However, such models must be converted to Estimators for LMF and the broader Lucd JedAI platform to manage them appropriately. See for following link for an introduction to TensorFlow Estimators, https://www.tensorflow.org/guide/estimator. Various Feature Types For TensorFlow modeling, all dataset feature column types are supported (see https://www.tensorflow.org/guide/feature_columns), enabling support for a broad range of numeric and categorical features. Regarding categorical features, the domain of such a feature must be known at training time. For example, if you choose to use a feature car_make as a categorical feature, you must know all the possible makes when you write your model. 
This requirement will be removed in a future release. Also, the conversion of non-numerical data to numerical data (e.g., for encoding label/target values) based on a scan of the entire dataset is not supported in the current release. However, to help with this, data value replacement operations are supported in the Unity client. For TensorFlow modeling, label types are assumed to be TensorFlow int32. Embedding Datasets For TensorFlow and PyTorch modeling, LMF supports the use of embedding data, e.g., word2vec for representing free text. For PyTorch, the TorchText library is supported, but n-grams are not supported in the current release. Important Note: Currently, when using text input, only the text/embedding input is allowed as a feature, enabling conventional text classification. Future releases will enable the use of multiple feature inputs alongside text data. Image Datasets For TensorFlow and PyTorch modeling, use of image data (i.e., pixel values) as model input is supported. Distributed XGBoost using Dask Distributed training of XGBoost models using the Dask parallel data analytics framework is supported. See the following link for more information, https://ml.dask.org/xgboost.html. Support for TensorFlow and PyTorch distributed training is under development. Supported Libraries The Lucd modeling framework supports the following languages and machine learning -related libraries: - Python v3.6.5 - TensorFlow (for Python) v2.1 - PyTorch v1.5.1 - TorchText - Dask - Dask-ML - Dask-XGBoost - Numpy v1.16.4 - Scikit-learn v0.19.2 - Pandas v0.25.1 Since Dask and Dask-ML libraries are included in the modeling framework, Dask-based distributed Scikit-learn modules should also be supported. However, the usage of such models has not been tested, and hence, Dask-based Scikit-learn operation may be unpredictable. Example Code While this documentation introduces all the core components and best practices for developing AI models for the Lucd JedAI platform, there is rarely a replacement for sample code. The Lucd Model Shop provides a wide range of code (prepared by Lucd engineers) to help developers get started with preparing AI models. In the future, the Lucd Model Shop will also allow for the larger Lucd developer community to share their code, further helping others with their AI goals. Python API Documentation The LMF Python API documentation can be found in the following Lucd GitLab Pages site, https://lucd.pages.lucd.ai/mock-product-development/lucd-eda-rest/. Preparing Models Using the Lucd Modeling Framework The following documentation contains further details and examples for developing AI models for Lucd. - Developing Compact Models - Developing Full Models - Working with Data and Performance Analysis - The Lucd Model Shop View full record
  12. This section covers the high-level tasks needed to support model training: importing and preparing data and analyzing performance. Most of the content here pertains to the full model approach, but some (e.g., reporting model status) is still helpful for PyTorch compact modeling. Data can be imported into a modeling context using the Lucd Unified Dataspace (UDS) API (eda.lib.lucd_uds). This library provides functions for creating datasets of various formats (TensorFlow, PyTorch, Dask dataframe) based on Lucd virtual datasets defined in the Unity client. It also provides the capability to retrieve previously trained word embeddings. The Lucd UDS functions providing data retrieval are listed below. Some are used for straightforward data importing (e.g., get_datframe) while others are used for preparing framework-specific datasets for AI models (e.g., get_tf_dataset for defining TensorFlow datasets). Refer to the API documentation for full function descriptions, and visit the Lucd Model Shop gitlab project for examples on how to use the functions for developing AI models. get_asset get_dataframe get_tf_dataset get_tf_dataset_image get_tf_dataset_text train_eval_test_split_dataframe train_eval_test_split_pytorch train_eval_test_split_tensorflow Important notes for implementing multi-class modeling TensorFlow offers different approaches to building multi-class models, two prominent ones being using pre-made Estimators (https://www.tensorflow.org/tutorials/estimator/premade#overview_of_programming_with_estimators), and using general techniques such as with Keras models and Estimators. If one-hot encoded data labels are needed (i.e., to match the number of nodes in a neural network output layer), the num_classes parameter should be used when calling relevant functions to get data (e.g., lucd_uds.get_tf_dataset). Note that most (if not all) TensorFlow pre-made Estimator models do not require explicitly one-hot encoded data labels for non-binary modeling (e.g., tensorflow.estimator.DNNClassifier) and hence, the num_classes argument can be omitted. In the case of TensorFlow Estimators, developers are encouraged to understand how to shape input for the models. The same goes for modeling with PyTorch or XGBoost. Analyzing Model Performance Post-training performance analysis tasks are supported by the Lucd Machine Learning (ML) API (eda.lib.lucd_ml). This library provides functions supporting automatic execution and reporting of critical performance analysis tasks (e.g., creating confusion matrices, ROC curves), preventing the need to repeatedly write such code. The tables and plots created from these library functions can be viewed in the Unity client after the entire model training process has completed. The Lucd ML functions for performance analysis are listed below. Refer to the API documentation for full function descriptions. get_predictions_classification_pt get_predictions_classification_tf get_predictions_regression_pt get_predictions_regression_tf lucd_precision_recall_curve lucd_roc_curve lucd_confusion_matrix update_plots Submitting Performance Analysis Results Trained models and metadata can be uploaded to the Lucd backend via the eda.int.train.update function. The following piece of example code illustrates how to use the function. 
model_filename = lucd_uds.zip_model_tf(trained_classifier, serving_input_receiver_fn, model_id, graph_version, log_dir) # Store model graph and performance stats back to Lucd back-end with open(model_filename, "rb") as graph_file: train.update({tid: { 'performance': { 'loss': loss, 'accuracy': accuracy, 'macro_precision': precision_macro, 'macro_recall': recall_macro, 'macro_f1': f1_macro, 'micro_precision': precision_micro, 'micro_recall': recall_micro, 'micro_f1': f1_micro, 'weighted_precision': precision_weighted, 'weighted_recall': recall_weighted, 'weighted_f1': f1_weighted, 'precision_recall_f1_per_label': results_string, }, 'graph_version': graph_version, 'graph_file': graph_file.read() }}) train.update takes a python dictionary as the argument, with the train_id, described in Table 1, as the top-level key (tid represents the table_id in the code snippet above). The secondary keys graph_version and graph_file store the graph version and trained graph file (model) respectively. The secondary key performance stores another dictionary for performance values. There is no restriction on the key-value pairs here. The developer is allowed to choose the performance values and they will be viewable in the Unity client afterward. The values shown in the code snippet above are customary for evaluating classification models. Again, see example models in The Lucd Model Shop for more insights. As for precision_recall_f1_per_label, an example of how to compute performance statistics per label can be found at: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html. precision_recall_f1_per_label must be formatted as a semicolon-separated string of tuples, with each tuple formatted as “-,,. An example is as follows, precision_recall_f1_per_label: setosa-1.0,0.5,1.0;virginica-1.0,1.0,0.6;versicolor-1.0,0.6666666666666666,0.7499999999999999. Enabling Model Explainability To enable a trained model to be used by the explainability tool in the Unity client, some parameters must be defined. For TensorFlow models, ordered_feature_names, ordered_class_names, input_name, and output_name must be defined. ordered_feature_names (not to be confused with training data input column names) is a list of ordered names of the inputs to the trained model, commonly defined in TensorFlow model definitions as tf.feature_column. For example, for a TensorFlow text classification model, the named input might be embedding_input. Please see example code in The Lucd Model Shop. ordered_class_names is a list formatted such that string class names are ordered by their integer representations (the order of outputs from your model). For instance, for a binary classification model for which the labels are 0 and 1, the order of strings must be negative and positive (or whatever string labels you choose). input_name is the name of the input layer in your TensorFlow model to which your ordered_feature_names data will be passed. output_name is the name of the output layer in your TensorFlow model (by default these can be named things like ‘dense_2’ and ‘scores’). The output_name is used to retrieve your model outputs in the proper format for explanation. PyTorch models only require that ordered_class_names be provided. Plots Lucd allows users to plot associated model training metrics in the Unity client. The plots will update in realtime during model training, providing insight into the viability, problems, and successes of training runs. 
Users are able to store any information they want to be plotted as part of a training run using the eda.lib.lucd_ml.update_plots function. The provided information must abide by the following assumptions: 1. The top level keys represent individual plots; 2. Each inner dictionary must specify a labels and a description key, where labels are [“plot x_label”, “plot y_label”] and the description can be any string; 3. All remaining keys in the inner dictionary will be treated as individual lines on the plot, so in the following example "accuracy" is a line on the plot. { "accuracy": { "l1": [ [1, 0.10], [2, 0.15] ], "labels": ["epoch", "accuracy"], "description": "This is a simple accuracy plot example." } } The individual elements in the “l1” line above represent [x_val, y_val]. A common example as shown above includes the following: "l1": [ [epoch, accuracy], [epoch, accuracy], ... ] Once created, the dictionary may be stored for plotting with the lucd_ml.update_plots function: def update_plots(train_or_eval: bool, tid: str, dictionary: dict) -> str: ... dictionary = { "accuracy": { "l1": [ [1, 0.10], [2, 0.15] ], "labels": ["epoch", "accuracy"], "description": "This is a simple accuracy plot example." } } update_plots(True, train_id, dictionary) Regarding the update_plots function, train_or_eval allows a user to specify whether their plot is part of the training or evaluation (or validation) cycle of the model training: train=true, eval=false. A TensorFlow hook is provided in lucd_ml for automatically parsing generated events files (the same as used by TensorBoard) and passing them to update_plots as part of a TensorFlow model. It can be provided as part of a TensorFlow EvalSpec or TrainSpec object as follows (stub included for posterity): class LucdTFEstimatorHook(tf.estimator.SessionRunHook): def __init__(self, train_hook: bool, log_dir: str, tid: str, freq: int, last_epoch: int): ... train_spec = tf.estimator.TrainSpec( input_fn=lambda: lucd_uds.get_tf_dataset_image(type_dict, training_data, num_features, target_type, num_classes).repeat(count=None).shuffle(30).batch(int(30)), max_steps=training_steps, hooks=lucd_ml.LucdTFEstimatorHook(train_hook=True, log_dir=log_dir, tid=tid, freq=10, max_epoch=training_steps)]) train_hook allows a user to specify whether the hook is providing train or eval metrics to the user (train=true, eval=false). log_dir tells the hook where to find TensorFlow events files. freq is the frequency that the hook should look for metrics in the events files. last_epoch tells the hook the number of epochs being run so the hook can ignore freq for the last epoch. Two last helper functions are provided as part of the Lucd plotting framework: lucd_roc_curve and lucd_precision_recall_curve. These functions generate ROC curves and precision-recall curves respectively and are called selectively when using the compact modeling approach (enabled by “performance_curves” button in unity). Further documentation for these functions is provided in the API. def lucd_roc_curve(truths: list, scores: list, class_list: list, tid: str = None, write_accumulo: bool = False) -> (dict, dict, dict, dict): def lucd_precision_recall_curve(truths: list, scores: list, class_list: list, tid: str = None, write_accumulo: bool = False) -> (dict, dict, dict, dict): ... Confusion Matrix Lucd provides an in-depth, interactive confusion matrix for classification model evaluation. Users may select a square in the Unity client to be shown actual records associated with the square selected. 
This may be enabled by using the following function: def lucd_confusion_matrix(test_set: list or DataLoader, predictions: list, num_classes: int, label_mapping: type(abs), tid: str, write_accumulo: bool,) -> (dict, str): The function arguments details are provided below. test_set: Users may directly pass the PyTorch DataLoader or list of delayed dask dataframes returned from the respective train_eval_test_split_pytorch/tensorflow function. predictions: This should be a list of predictions generated by your model (the list returned from lucd_ml get_predictions_classification). The list must be in the same order as the test_set data. num_classes: An integer number of classes for the confusion matrix to represent. label_mapping: A function to map integers to class labels, which is used to map predictions to a human-readable format. tid: Training id to associate confusion matrix with. write_accumulo: Boolean specifying whether to write the dictionary directly to accumulo. Must be True to enable Unity client Confusion Matrix. If False, the generated confusion matrix will only be returned & not written to the database. Further documentation for this function exists in the API documentation. Here is an example usage: def _label_mapping(): return {0: 'I. versicolor', 1: 'I. virginica', 2: 'I. setosa'} ... # Prepare vds data for modeling delayed_values_training, delayed_values_evaluation, delayed_values_testing, my_df_testing_label, num_features = \ lucd_uds.train_eval_test_split_tensorflow(virtual_dataset_id, evaluation_dataset_percent, testing_dataset_percent) ... predictions, scores = lucd_ml.get_predictions_classification_tf(_estimator, lambda: uds.get_tf_dataset( feature_dict, delayed_values_testing, num_features, target_type).batch(1), classification_mode, .5) ... lucd_ml.lucd_confusion_matrix(delayed_values_testing, predictions, 3, label_mapping(), tid, True) Submitting Model Training Status Another helpful function is eda.int.train.status, which is used for storing the status of a developer’s training pipeline. This enables a model’s status to be displayed on the Unity client. The function definition is below. def status(uid, code, message=None): """Update model status in the database. Args: uid: Int representing a model's ID. code: 0 - RETRIEVING VDS, 1 - TRAINING, 2 - ANALYZING PERFORMANCE, 3 - STORING MODEL, 4 - TRAINING COMPLETE, 5 - ERROR, 6 - QUEUED, 7 - STOPPED. message: String representing optional custom message to include. Returns: Status message. Raises: TypeError: If code is not of type int. Exception: If code is invalid. """
  13. This section covers the high-level tasks needed to support model training: importing and preparing data and analyzing performance. Most of the content here pertains to the full model approach, but some (e.g., reporting model status) is still helpful for PyTorch compact modeling. Data can be imported into a modeling context using the Lucd Unified Dataspace (UDS) API (eda.lib.lucd_uds). This library provides functions for creating datasets of various formats (TensorFlow, PyTorch, Dask dataframe) based on Lucd virtual datasets defined in the Unity client. It also provides the capability to retrieve previously trained word embeddings. The Lucd UDS functions providing data retrieval are listed below. Some are used for straightforward data importing (e.g., get_datframe) while others are used for preparing framework-specific datasets for AI models (e.g., get_tf_dataset for defining TensorFlow datasets). Refer to the API documentation for full function descriptions, and visit the Lucd Model Shop gitlab project for examples on how to use the functions for developing AI models. get_asset get_dataframe get_tf_dataset get_tf_dataset_image get_tf_dataset_text train_eval_test_split_dataframe train_eval_test_split_pytorch train_eval_test_split_tensorflow Important notes for implementing multi-class modeling TensorFlow offers different approaches to building multi-class models, two prominent ones being using pre-made Estimators (https://www.tensorflow.org/tutorials/estimator/premade#overview_of_programming_with_estimators), and using general techniques such as with Keras models and Estimators. If one-hot encoded data labels are needed (i.e., to match the number of nodes in a neural network output layer), the num_classes parameter should be used when calling relevant functions to get data (e.g., lucd_uds.get_tf_dataset). Note that most (if not all) TensorFlow pre-made Estimator models do not require explicitly one-hot encoded data labels for non-binary modeling (e.g., tensorflow.estimator.DNNClassifier) and hence, the num_classes argument can be omitted. In the case of TensorFlow Estimators, developers are encouraged to understand how to shape input for the models. The same goes for modeling with PyTorch or XGBoost. Analyzing Model Performance Post-training performance analysis tasks are supported by the Lucd Machine Learning (ML) API (eda.lib.lucd_ml). This library provides functions supporting automatic execution and reporting of critical performance analysis tasks (e.g., creating confusion matrices, ROC curves), preventing the need to repeatedly write such code. The tables and plots created from these library functions can be viewed in the Unity client after the entire model training process has completed. The Lucd ML functions for performance analysis are listed below. Refer to the API documentation for full function descriptions. get_predictions_classification_pt get_predictions_classification_tf get_predictions_regression_pt get_predictions_regression_tf lucd_precision_recall_curve lucd_roc_curve lucd_confusion_matrix update_plots Submitting Performance Analysis Results Trained models and metadata can be uploaded to the Lucd backend via the eda.int.train.update function. The following piece of example code illustrates how to use the function. 
Submitting Performance Analysis Results

Trained models and metadata can be uploaded to the Lucd backend via the eda.int.train.update function. The following piece of example code illustrates how to use the function.

    model_filename = lucd_uds.zip_model_tf(trained_classifier, serving_input_receiver_fn,
                                           model_id, graph_version, log_dir)

    # Store model graph and performance stats back to the Lucd back-end
    with open(model_filename, "rb") as graph_file:
        train.update({tid: {
            'performance': {
                'loss': loss,
                'accuracy': accuracy,
                'macro_precision': precision_macro,
                'macro_recall': recall_macro,
                'macro_f1': f1_macro,
                'micro_precision': precision_micro,
                'micro_recall': recall_micro,
                'micro_f1': f1_micro,
                'weighted_precision': precision_weighted,
                'weighted_recall': recall_weighted,
                'weighted_f1': f1_weighted,
                'precision_recall_f1_per_label': results_string,
            },
            'graph_version': graph_version,
            'graph_file': graph_file.read()
        }})

train.update takes a Python dictionary as the argument, with the train_id, described in Table 1, as the top-level key (tid holds the train_id in the code snippet above). The secondary keys graph_version and graph_file store the graph version and the trained graph file (model), respectively. The secondary key performance stores another dictionary of performance values. There is no restriction on the key-value pairs here; the developer may choose which performance values to record, and they will be viewable in the Unity client afterward. The values shown in the code snippet above are customary for evaluating classification models. Again, see the example models in the Lucd Model Shop for more insights.

As for precision_recall_f1_per_label, an example of how to compute performance statistics per label can be found at https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html. precision_recall_f1_per_label must be formatted as a semicolon-separated string of tuples, with each tuple formatted as "<label>-<precision>,<recall>,<f1>". An example is as follows:

    precision_recall_f1_per_label: setosa-1.0,0.5,1.0;virginica-1.0,1.0,0.6;versicolor-1.0,0.6666666666666666,0.7499999999999999
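As an illustration of building that string, the sketch below uses scikit-learn's precision_recall_fscore_support; the label arrays and class names are placeholders, and the exact string assembly is an assumption consistent with the format described above.

    from sklearn.metrics import precision_recall_fscore_support

    # Placeholder ground truth and predictions; in practice these come from your test run.
    y_true = [0, 1, 2, 2, 0, 1]
    y_pred = [0, 1, 1, 2, 0, 2]
    labels = ['setosa', 'virginica', 'versicolor']  # hypothetical class order

    precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=[0, 1, 2])

    # Assemble the semicolon-separated "<label>-<precision>,<recall>,<f1>" string.
    results_string = ";".join(
        f"{name}-{p},{r},{f}" for name, p, r, f in zip(labels, precision, recall, f1))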
Enabling Model Explainability

To enable a trained model to be used by the explainability tool in the Unity client, some parameters must be defined. For TensorFlow models, ordered_feature_names, ordered_class_names, input_name, and output_name must be defined.

ordered_feature_names (not to be confused with training data input column names) is a list of ordered names of the inputs to the trained model, commonly defined in TensorFlow model definitions as tf.feature_column. For example, for a TensorFlow text classification model, the named input might be embedding_input. Please see the example code in the Lucd Model Shop.

ordered_class_names is a list formatted such that string class names are ordered by their integer representations (the order of outputs from your model). For instance, for a binary classification model whose labels are 0 and 1, the order of strings must be negative and then positive (or whatever string labels you choose).

input_name is the name of the input layer in your TensorFlow model to which your ordered_feature_names data will be passed.

output_name is the name of the output layer in your TensorFlow model (by default these can be named things like 'dense_2' and 'scores'). The output_name is used to retrieve your model outputs in the proper format for explanation.

PyTorch models only require that ordered_class_names be provided.

Plots

Lucd allows users to plot associated model training metrics in the Unity client. The plots update in real time during model training, providing insight into the viability, problems, and successes of training runs.

Users are able to store any information they want plotted as part of a training run using the eda.lib.lucd_ml.update_plots function. The provided information must abide by the following assumptions:

1. The top-level keys represent individual plots;
2. Each inner dictionary must specify a labels and a description key, where labels is ["plot x_label", "plot y_label"] and the description can be any string;
3. All remaining keys in the inner dictionary will be treated as individual lines on the plot, so in the following example "l1" is a line on the "accuracy" plot.

    {
        "accuracy": {
            "l1": [[1, 0.10], [2, 0.15]],
            "labels": ["epoch", "accuracy"],
            "description": "This is a simple accuracy plot example."
        }
    }

The individual elements in the "l1" line above represent [x_val, y_val]. A common example, as shown above, takes the following form:

    "l1": [[epoch, accuracy], [epoch, accuracy], ...]

Once created, the dictionary may be stored for plotting with the lucd_ml.update_plots function:

    def update_plots(train_or_eval: bool, tid: str, dictionary: dict) -> str:
        ...

    dictionary = {
        "accuracy": {
            "l1": [[1, 0.10], [2, 0.15]],
            "labels": ["epoch", "accuracy"],
            "description": "This is a simple accuracy plot example."
        }
    }

    update_plots(True, train_id, dictionary)

Regarding the update_plots function, train_or_eval allows a user to specify whether the plot is part of the training or the evaluation (or validation) cycle of model training: train=True, eval=False.

A TensorFlow hook is provided in lucd_ml for automatically parsing generated events files (the same files used by TensorBoard) and passing them to update_plots as part of a TensorFlow model. It can be provided as part of a TensorFlow EvalSpec or TrainSpec object as follows (stub included for reference):

    class LucdTFEstimatorHook(tf.estimator.SessionRunHook):
        def __init__(self, train_hook: bool, log_dir: str, tid: str, freq: int, last_epoch: int):
            ...

    train_spec = tf.estimator.TrainSpec(
        input_fn=lambda: lucd_uds.get_tf_dataset_image(type_dict, training_data, num_features,
                                                       target_type, num_classes)
            .repeat(count=None).shuffle(30).batch(int(30)),
        max_steps=training_steps,
        hooks=[lucd_ml.LucdTFEstimatorHook(train_hook=True, log_dir=log_dir, tid=tid,
                                           freq=10, last_epoch=training_steps)])

train_hook specifies whether the hook is providing train or eval metrics (train=True, eval=False). log_dir tells the hook where to find TensorFlow events files. freq is the frequency with which the hook should look for metrics in the events files. last_epoch tells the hook the number of epochs being run, so the hook can ignore freq for the last epoch.

Two last helper functions are provided as part of the Lucd plotting framework: lucd_roc_curve and lucd_precision_recall_curve. These functions generate ROC curves and precision-recall curves, respectively, and are called selectively when using the compact modeling approach (enabled by the "performance_curves" button in Unity). Further documentation for these functions is provided in the API.

    def lucd_roc_curve(truths: list, scores: list, class_list: list, tid: str = None,
                       write_accumulo: bool = False) -> (dict, dict, dict, dict):

    def lucd_precision_recall_curve(truths: list, scores: list, class_list: list, tid: str = None,
                                    write_accumulo: bool = False) -> (dict, dict, dict, dict):
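A hedged usage sketch for lucd_roc_curve follows. The truths and scores are assumed to come from the get_predictions functions above, the class list is illustrative, and the names given to the four returned dictionaries are guesses; consult the API documentation for their exact meaning.

    # truths: ground-truth labels for the test set; scores: per-class model scores.
    roc_a, roc_b, roc_c, roc_d = lucd_ml.lucd_roc_curve(
        truths=truths,
        scores=scores,
        class_list=['I. versicolor', 'I. virginica', 'I. setosa'],
        tid=tid,
        write_accumulo=True)  # writing back is assumed to mirror lucd_confusion_matrix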
Confusion Matrix

Lucd provides an in-depth, interactive confusion matrix for classification model evaluation. Users may select a square in the Unity client to be shown the actual records associated with the selected square. This may be enabled by using the following function:

    def lucd_confusion_matrix(test_set: list or DataLoader, predictions: list, num_classes: int,
                              label_mapping: type(abs), tid: str, write_accumulo: bool) -> (dict, str):

Details on the function arguments are provided below.

test_set: Pass the PyTorch DataLoader or the list of delayed Dask dataframes returned from the respective train_eval_test_split_pytorch/tensorflow function.

predictions: A list of predictions generated by your model (the list returned from the lucd_ml get_predictions_classification functions). The list must be in the same order as the test_set data.

num_classes: An integer number of classes for the confusion matrix to represent.

label_mapping: A function mapping integers to class labels, used to render predictions in a human-readable format.

tid: Training ID to associate the confusion matrix with.

write_accumulo: Boolean specifying whether to write the dictionary directly to Accumulo. Must be True to enable the Unity client confusion matrix; if False, the generated confusion matrix is only returned, not written to the database.

Further documentation for this function exists in the API documentation. Here is an example usage (note that the mapping function itself, _label_mapping, is passed without being called):

    def _label_mapping():
        return {0: 'I. versicolor', 1: 'I. virginica', 2: 'I. setosa'}

    ...

    # Prepare VDS data for modeling
    delayed_values_training, delayed_values_evaluation, delayed_values_testing, my_df_testing_label, num_features = \
        lucd_uds.train_eval_test_split_tensorflow(virtual_dataset_id,
                                                  evaluation_dataset_percent,
                                                  testing_dataset_percent)

    ...

    predictions, scores = lucd_ml.get_predictions_classification_tf(
        _estimator,
        lambda: lucd_uds.get_tf_dataset(feature_dict, delayed_values_testing,
                                        num_features, target_type).batch(1),
        classification_mode, .5)

    ...

    lucd_ml.lucd_confusion_matrix(delayed_values_testing, predictions, 3, _label_mapping, tid, True)

Submitting Model Training Status

Another helpful function is eda.int.train.status, which is used for storing the status of a developer's training pipeline. This enables a model's status to be displayed in the Unity client. The function definition is below.

    def status(uid, code, message=None):
        """Update model status in the database.

        Args:
            uid: Int representing a model's ID.
            code: 0 - RETRIEVING VDS, 1 - TRAINING, 2 - ANALYZING PERFORMANCE, 3 - STORING MODEL,
                  4 - TRAINING COMPLETE, 5 - ERROR, 6 - QUEUED, 7 - STOPPED.
            message: String representing optional custom message to include.

        Returns:
            Status message.

        Raises:
            TypeError: If code is not of type int.
            Exception: If code is invalid.
        """
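A minimal usage sketch follows; the model ID is a placeholder, and the code values map to the statuses listed in the docstring above.

    from eda.int import train

    model_uid = 12345  # hypothetical model ID

    train.status(model_uid, 1)  # 1 - TRAINING
    # ... training runs here ...
    train.status(model_uid, 4, message="training finished cleanly")  # 4 - TRAINING COMPLETE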
  15. This documentation describes how to use the Lucd Python libraries to create custom feature engineering operations for processing data for model training. Custom feature engineering operations enable a Lucd user to apply feature engineering operations that are not available in the Lucd GUI. A simple example is adjusting a given attribute as a function of the values of several other attributes. Overall, this should support nearly open-ended feature transformation needs.

The Custom Operation Module

The eda.int.custom_operation module is used for sending custom operations to the Lucd backend so that they are selectable in the Lucd EDA section of the GUI. The simple Python script below illustrates how one may define a custom feature engineering operation and send it to the Lucd backend using the Lucd Python libraries.

    from eda.int import custom_operation
    import lucd


    def transform(row):
        row['flower.petal_length'] = row['flower.petal_length'] + 1
        return row


    client = lucd.LucdClient(domain="domain",
                             username="username",
                             password="password",
                             login_domain="Lucd Platform URL")

    data = {
        "operation_name": "simple iris attribute adder",
        "author_name": "J. Branch",
        "author_email": "email@address",
        "operation_description": "Testing custom feature op using simple transform",
        "operation_purpose": "Numeric feature scaling",
        "operation_features": ["flower.petal_length"],
        "operation_function": transform
    }

    response_json, rv = custom_operation.create(data)
    client.close()

The create function from the eda.int.custom_operation module is used to submit the actual function and metadata to Lucd. The required attributes in the dict data are used for display in the Lucd GUI (and may at some point be used for search purposes). Table 1 describes the attributes in detail.

Table 1. Custom Operation Attribute Descriptions

    Attribute              Description
    operation_name         String name/label of the custom feature engineering operation
    author_name            Name of the developer who wrote the operation
    author_email           Author's email
    operation_purpose      Short description of what the operation achieves
    operation_description  Longer description of how the operation might achieve its purpose, as well as other notes
    operation_features     List of strings identifying features/facets the operation affects
    operation_function     The Python function that implements the logic of the custom operation

Regarding the custom transformation function, it is essential that the function be defined in the same context as the dict that references it and in which create is called. This ensures that de-serialization of the function works properly on the Lucd backend.

Custom Operation Types

Applied Operations

Applied operations denote custom operations that are applied to data using conventional Dask and Pandas function application mechanisms. Custom functions can be applied to data (dataframes) via the following Dask functions: apply, map_partitions, and applymap. Note that applied operations run on Pandas dataframes (which act as partitions of larger Dask dataframes); the user should consider this when designing a custom operation, as illustrated in the sketch below. As shown in the example code above, the user does not identify which Dask mechanism to use for applying a function to data in the create function; this is selectable in the Lucd GUI when applying custom operations in the EDA section.
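To make the partition caveat concrete, here is a minimal sketch of an applied operation (the column name is hypothetical) intended for the map_partitions application type. Note that any statistic it computes, such as the mean below, is calculated per partition rather than over the full dataset.

    def flag_above_partition_mean(df):
        # df is a Pandas dataframe holding one partition of the larger Dask dataframe.
        mean = df['flower.petal_length'].mean()  # per-partition mean, not a global mean
        df['flower.petal_length_above_mean'] = df['flower.petal_length'] > mean
        return df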
Dask Operations

Dask operations denote custom operations that are applied directly to a Dask dataframe. This is useful in cases where custom operations/algorithms cannot be correctly applied to a "partition" of data as described above; an example would be a clustering algorithm. These custom operations are expected to use modules from Dask libraries such as Dask-ML (https://ml.dask.org/). An example custom operation is shown below.

    def cluster_function(df):
        from dask_ml.cluster import KMeans

        X = df[['car.days_in_service']]
        kmeans = KMeans(n_clusters=4)
        k_service_days = kmeans.fit(X)
        labels = k_service_days.predict(X)
        df['car.service_days_cluster'] = labels
        return df

When a Dask operation is to be applied (in the Unity client), the user should select apply_direct as the application type.
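Uploading a Dask operation appears to use the same create pattern shown earlier for applied operations; a sketch under that assumption, reusing the placeholder credentials from the first example:

    from eda.int import custom_operation
    import lucd

    client = lucd.LucdClient(domain="domain",
                             username="username",
                             password="password",
                             login_domain="Lucd Platform URL")

    data = {
        "operation_name": "kmeans service-days clustering",
        "author_name": "J. Branch",
        "author_email": "email@address",
        "operation_description": "Clusters vehicles into four groups by days in service using Dask-ML KMeans",
        "operation_purpose": "Unsupervised clustering feature",
        "operation_features": ["car.days_in_service"],
        "operation_function": cluster_function
    }

    response_json, rv = custom_operation.create(data)
    client.close()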
  16. User Guide

The Lucd Python Client provides capabilities for data scientists and AI model developers to prototype AI model solutions before uploading them to the Lucd Unity Client for extended training and performance analysis. The Lucd Python Client provides the following features:

functions for accessing raw data and other assets in Lucd for general analysis and custom visualization;
functions for uploading user-defined feature transformation operations to Lucd, which can then be applied in the Lucd Unity Client to create a virtual dataset;
functions for ingesting data into TensorFlow and PyTorch models, which can be used for prototyping models.

Installation

The lucd-python-client package should be installed using the pip command with a Python wheel file. Instructions are as follows:

1. Download or clone the lucd-python-client package (unzip if needed) from here: Lucd Python Client Project. Open a command prompt and change to the package directory.
2. At the command prompt, type python setup.py bdist_wheel. The wheel file will appear in the dist directory.
3. Switch to the dist directory and type pip install <wheel_filename>.

Requirements

Python 3.6.5 is required for custom feature operations to work appropriately.

APIs

The Lucd Python Client provides Python and REST APIs. Code examples using both API types are available in the examples directory of the project.

Lucd Python Client API Examples

Example code illustrating how to perform tasks such as authenticating to Lucd, performing queries, obtaining virtual datasets, and training models resides in the examples directory of the project. Below are specific examples of how to access Lucd data using the client, as well as how to create and upload a custom feature transformation operation.

Accessing Data

    from lucd import LucdClient, log
    from eda.int import asset
    from eda.int import vds
    from eda.int import uds
    from eda.lib import lucd_uds


    if __name__ == "__main__":
        username = 'xxx'
        password = 'xxx'
        domain = 'xxx'

        client = LucdClient(username=username, password=password, domain=domain)
        log.info(f"Connected to Lucd platform.")

        # Queries follow the Elasticsearch API.
        # See: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/query-dsl.html
        query = {
            "query": {
                "bool": {
                    "must": [
                        {"bool": {"should": [{"match_phrase": {"source": "iris"}}]}},
                        {"bool": {"should": []}}
                    ],
                    "filter": [
                        {"bool": {"filter": []}}
                    ]
                }
            },
            "size": 2000,
            "dataset": "iris"
        }

        results, http = uds.search(query)
        print(f"Search Results ({http}):\n{results}\n")

        hits, stats = client.search_to_dataframe(results)
        print(f"Search Results:\n{hits.head()}\n")
        print(f"Search Statistics:\n{stats}\n")

        all_models, http = client.rest('lucd/model/read', {"uid": username})
        print(f"All Models ({http}):\n{all_models}\n")

        all_vds, http = vds.read({"uid": username})
        print(f"All Virtual Datasets ({http}):\n{all_vds}\n")

        all_assets, http = asset.read({"uid": username})
        print(f"All Asset Embeddings ({http}):\n{all_assets}\n")

        #
        # Lucd library calls to fetch assets and VDSes
        #

        # When limiting asset size, you could encounter issues with missing index entries.
        embeddings_index, embedding_matrix, embedding_size, word_index_mapping, word_index_mapping_padded = \
            lucd_uds.get_asset("xxx", limit=100)
        print(embeddings_index, embedding_matrix, embedding_size, word_index_mapping,
              word_index_mapping_padded)

        # When limiting data size, you may encounter delays bringing back large amounts of data
        # over the network, and possibly run the client out of memory.
        all_vds, http = vds.read({"uid": None})
        print(f"All Virtual Datasets ({http}):\n{all_vds}\n")

        df = lucd_uds.get_dataframe("xxx", limit=100)
        print(f"Dataframe Data\n{df.head(20)}")

        client.close()

Custom Feature Transformation

    from eda.int import custom_operation
    import lucd


    def create_greater_than_mean_column(df):
        column_mean = df["flower.petal_length"].mean()
        df["flower.petal_length_Mean"] = df["flower.petal_length"] > column_mean
        return df


    if __name__ == "__main__":
        client = lucd.LucdClient(domain="xxx",
                                 username="xxx",
                                 password="xxx",
                                 login_domain="xxx")

        data = {
            "operation_name": "create_greater_than_mean_column_JBstyle",
            "author_name": "J. Black",
            "author_email": "j.black@lucd.ai",
            "operation_description": "Sample operation",
            "operation_purpose": "add a new column",
            "operation_features": ["flower.petal_length"],
            "operation_function": create_greater_than_mean_column
        }

        response_json, rv = custom_operation.create(data)
        client.close()

Federated Endpoints

To support federated machine learning, much of the high-level REST functionality operates in a federated manner. This means that unless otherwise specified, these actions will be performed/created/deleted on all federates. The features impacted include:

Queries
EDA trees
Custom Ops
VDS objects
Model definitions
FATE Training objects

Users may include a block in their JSON specifying which federates to operate on, which looks like the following:

    "federation": {"federates": ["domain_name1", "domain_name2"]}
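For instance, here is a hedged sketch of scoping one of the read calls above to a single federate; combining the federation block with the request dict passed to vds.read is an assumption based on the description above.

    from eda.int import vds

    payload = {
        "uid": "username",  # hypothetical user ID
        "federation": {"federates": ["domain_name1"]}  # restrict the action to one federate
    }

    all_vds, http = vds.read(payload)
    print(f"Federate-scoped Virtual Datasets ({http}):\n{all_vds}\n")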


ABOUT US

Lucd is an AI software platform company that supports multiple industry verticals, allowing its users to build enterprise-ready AI solutions with Low Code / No Code development practices. Lucd supports the entire AI lifecycle, allowing for the secure fusing of structured and unstructured data and empowering data analysts as well as business professionals to work collaboratively, reducing the time needed to uncover new opportunities and solutions.
