Showing results for tags 'python client'.


  1. The Lucd Python Client:

  • Enables authentication to the Lucd backend and access to data for custom analysis and model prototyping.
  • Enables defining custom feature transformation operations and uploading them to Lucd for use in the GUI.
  • Provides access to the Asset, Concept, Custom Contract, Custom Operation, Explore, Model, UDS, and VDS REST APIs.
  • Provides multiple working examples of individual REST calls, as well as complete models.
  • Enables Dask exploration of the data on the client side.
  • Enables development and testing of models using the Lucd PyTorch, TensorFlow, Keras, and XGBoost classes.
  2. This section covers the high-level tasks needed to support model training: importing and preparing data, and analyzing performance. Most of the content here pertains to the full model approach, but some (e.g., reporting model status) is also helpful for PyTorch compact modeling.

Importing and Preparing Data

Data can be imported into a modeling context using the Lucd Unified Dataspace (UDS) API (eda.lib.lucd_uds). This library provides functions for creating datasets of various formats (TensorFlow, PyTorch, Dask dataframe) from Lucd virtual datasets defined in the Unity client. It also provides the capability to retrieve previously trained word embeddings. The Lucd UDS data-retrieval functions are listed below. Some are used for straightforward data importing (e.g., get_dataframe), while others prepare framework-specific datasets for AI models (e.g., get_tf_dataset for defining TensorFlow datasets). Refer to the API documentation for full function descriptions, and visit the Lucd Model Shop GitLab project for examples of how to use these functions when developing AI models.

  • get_asset
  • get_dataframe
  • get_tf_dataset
  • get_tf_dataset_image
  • get_tf_dataset_text
  • train_eval_test_split_dataframe
  • train_eval_test_split_pytorch
  • train_eval_test_split_tensorflow

Important notes for implementing multi-class modeling

TensorFlow offers different approaches to building multi-class models, two prominent ones being pre-made Estimators (https://www.tensorflow.org/tutorials/estimator/premade#overview_of_programming_with_estimators) and more general techniques such as custom Keras models and Estimators. If one-hot encoded data labels are needed (i.e., to match the number of nodes in a neural network output layer), the num_classes parameter should be used when calling the relevant data-retrieval functions (e.g., lucd_uds.get_tf_dataset). Note that most (if not all) TensorFlow pre-made Estimator models (e.g., tensorflow.estimator.DNNClassifier) do not require explicitly one-hot encoded data labels for non-binary modeling, and hence the num_classes argument can be omitted. In the case of TensorFlow Estimators, developers are encouraged to understand how to shape input for the models; the same goes for modeling with PyTorch or XGBoost. A sketch follows.
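For example, when one-hot labels are required, num_classes can be supplied while building the dataset. A hedged sketch: the argument list mirrors the get_tf_dataset_image call shown later in this post, and passing num_classes to get_tf_dataset is an assumption based on the note above (verify against the API documentation):

# num_classes requests one-hot encoding of labels to match the output layer width.
dataset = lucd_uds.get_tf_dataset(feature_dict, delayed_values_training, num_features,
                                  target_type, num_classes).batch(32)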
Analyzing Model Performance

Post-training performance analysis tasks are supported by the Lucd Machine Learning (ML) API (eda.lib.lucd_ml). This library provides functions that automatically execute and report critical performance analysis tasks (e.g., creating confusion matrices and ROC curves), removing the need to repeatedly write such code. The tables and plots created by these functions can be viewed in the Unity client after the entire model training process has completed. The Lucd ML functions for performance analysis are listed below; refer to the API documentation for full descriptions.

  • get_predictions_classification_pt
  • get_predictions_classification_tf
  • get_predictions_regression_pt
  • get_predictions_regression_tf
  • lucd_precision_recall_curve
  • lucd_roc_curve
  • lucd_confusion_matrix
  • update_plots

Submitting Performance Analysis Results

Trained models and metadata can be uploaded to the Lucd backend via the eda.int.train.update function. The following example code illustrates how to use it.

model_filename = lucd_uds.zip_model_tf(trained_classifier, serving_input_receiver_fn,
                                       model_id, graph_version, log_dir)

# Store model graph and performance stats back to the Lucd backend.
with open(model_filename, "rb") as graph_file:
    train.update({tid: {
        'performance': {
            'loss': loss,
            'accuracy': accuracy,
            'macro_precision': precision_macro,
            'macro_recall': recall_macro,
            'macro_f1': f1_macro,
            'micro_precision': precision_micro,
            'micro_recall': recall_micro,
            'micro_f1': f1_micro,
            'weighted_precision': precision_weighted,
            'weighted_recall': recall_weighted,
            'weighted_f1': f1_weighted,
            'precision_recall_f1_per_label': results_string,
        },
        'graph_version': graph_version,
        'graph_file': graph_file.read()
    }})

train.update takes a Python dictionary as its argument, with the train_id, described in Table 1, as the top-level key (tid represents the train_id in the code snippet above). The secondary keys graph_version and graph_file store the graph version and the trained graph file (model), respectively. The secondary key performance stores another dictionary of performance values. There is no restriction on the key-value pairs here; the developer may choose which performance values to store, and they will be viewable in the Unity client afterward. The values shown above are customary for evaluating classification models. Again, see the example models in The Lucd Model Shop for more insights.

As for precision_recall_f1_per_label, an example of how to compute per-label performance statistics can be found at https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html. precision_recall_f1_per_label must be formatted as a semicolon-separated string of tuples, with each tuple formatted as "<label>-<precision>,<recall>,<f1>". An example is as follows:

precision_recall_f1_per_label: setosa-1.0,0.5,1.0;virginica-1.0,1.0,0.6;versicolor-1.0,0.6666666666666666,0.7499999999999999
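That string can be assembled directly from scikit-learn's per-label output. A minimal sketch (truths, preds, and the class names are illustrative, not part of the Lucd API):

from sklearn.metrics import precision_recall_fscore_support

class_names = ["setosa", "virginica", "versicolor"]   # ordered by integer label
truths = [0, 0, 1, 2, 2]                              # illustrative ground-truth labels
preds = [0, 1, 1, 2, 1]                               # illustrative predictions

# Per-label precision/recall/f1 arrays, in the order given by labels=.
precision, recall, f1, _ = precision_recall_fscore_support(truths, preds, labels=[0, 1, 2])

# Assemble the semicolon-separated "<label>-<precision>,<recall>,<f1>" string.
results_string = ";".join(
    f"{name}-{p},{r},{f}" for name, p, r, f in zip(class_names, precision, recall, f1))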
Enabling Model Explainability

To enable a trained model to be used by the explainability tool in the Unity client, some parameters must be defined. For TensorFlow models, ordered_feature_names, ordered_class_names, input_name, and output_name must be defined.

  • ordered_feature_names (not to be confused with training data input column names) is a list of ordered names of the inputs to the trained model, commonly defined in TensorFlow model definitions as tf.feature_column. For example, for a TensorFlow text classification model, the named input might be embedding_input. Please see the example code in The Lucd Model Shop.
  • ordered_class_names is a list of string class names ordered by their integer representations (the order of outputs from your model). For instance, for a binary classification model for which the labels are 0 and 1, the order of strings must be negative then positive (or whatever string labels you choose).
  • input_name is the name of the input layer in your TensorFlow model to which your ordered_feature_names data will be passed.
  • output_name is the name of the output layer in your TensorFlow model (by default these can be named things like 'dense_2' and 'scores'). It is used to retrieve your model outputs in the proper format for explanation.

PyTorch models only require that ordered_class_names be provided.

Plots

Lucd allows users to plot associated model training metrics in the Unity client. The plots update in real time during model training, providing insight into the viability, problems, and successes of training runs.

Users can store any information they want plotted as part of a training run using the eda.lib.lucd_ml.update_plots function. The provided information must abide by the following assumptions:

  1. The top-level keys represent individual plots.
  2. Each inner dictionary must specify a labels and a description key, where labels is ["plot x_label", "plot y_label"] and the description can be any string.
  3. All remaining keys in the inner dictionary are treated as individual lines on the plot; in the following example, "l1" is a line on the "accuracy" plot.

{
    "accuracy": {
        "l1": [[1, 0.10], [2, 0.15]],
        "labels": ["epoch", "accuracy"],
        "description": "This is a simple accuracy plot example."
    }
}

The individual elements in the "l1" line above represent [x_val, y_val] pairs. A common example, as shown above, is "l1": [[epoch, accuracy], [epoch, accuracy], ...]. Once created, the dictionary may be stored for plotting with the lucd_ml.update_plots function:

def update_plots(train_or_eval: bool, tid: str, dictionary: dict) -> str:
    ...

dictionary = {
    "accuracy": {
        "l1": [[1, 0.10], [2, 0.15]],
        "labels": ["epoch", "accuracy"],
        "description": "This is a simple accuracy plot example."
    }
}

update_plots(True, train_id, dictionary)

Regarding the update_plots function, train_or_eval specifies whether the plot is part of the training or the evaluation (or validation) cycle of model training: train=True, eval=False.

A TensorFlow hook is provided in lucd_ml for automatically parsing generated events files (the same files used by TensorBoard) and passing them to update_plots as part of a TensorFlow model. It can be provided as part of a TensorFlow EvalSpec or TrainSpec object as follows (stub included for reference):

class LucdTFEstimatorHook(tf.estimator.SessionRunHook):
    def __init__(self, train_hook: bool, log_dir: str, tid: str, freq: int, last_epoch: int):
        ...

train_spec = tf.estimator.TrainSpec(
    input_fn=lambda: lucd_uds.get_tf_dataset_image(type_dict, training_data, num_features,
                                                   target_type, num_classes)
        .repeat(count=None).shuffle(30).batch(30),
    max_steps=training_steps,
    hooks=[lucd_ml.LucdTFEstimatorHook(train_hook=True, log_dir=log_dir, tid=tid, freq=10,
                                       last_epoch=training_steps)])

train_hook specifies whether the hook is providing train or eval metrics to the user (train=True, eval=False). log_dir tells the hook where to find TensorFlow events files. freq is the frequency at which the hook should look for metrics in the events files. last_epoch tells the hook the number of epochs being run so the hook can ignore freq for the last epoch.

Two last helper functions are provided as part of the Lucd plotting framework: lucd_roc_curve and lucd_precision_recall_curve. These functions generate ROC curves and precision-recall curves, respectively, and are invoked during compact modeling when enabled via the "performance_curves" button in Unity. Further documentation for these functions is provided in the API.

def lucd_roc_curve(truths: list, scores: list, class_list: list, tid: str = None,
                   write_accumulo: bool = False) -> (dict, dict, dict, dict):
    ...

def lucd_precision_recall_curve(truths: list, scores: list, class_list: list, tid: str = None,
                                write_accumulo: bool = False) -> (dict, dict, dict, dict):
    ...
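As a usage sketch (the values are illustrative; the contents of the four returned dicts are described in the API documentation, and write_accumulo=True is presumed, by analogy with lucd_confusion_matrix below, to be required for display in Unity):

truths = [0, 1, 2, 1]                        # ground-truth integer labels
scores = [[0.8, 0.1, 0.1],                   # per-class scores for each test record,
          [0.2, 0.7, 0.1],                   # e.g., from get_predictions_classification_tf
          [0.1, 0.2, 0.7],
          [0.3, 0.6, 0.1]]
class_list = [0, 1, 2]

roc_data = lucd_ml.lucd_roc_curve(truths, scores, class_list, tid=train_id,
                                  write_accumulo=True)
pr_data = lucd_ml.lucd_precision_recall_curve(truths, scores, class_list, tid=train_id,
                                              write_accumulo=True)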
Confusion Matrix

Lucd provides an in-depth, interactive confusion matrix for classification model evaluation. Users may select a square in the Unity client to see the actual records associated with that square. This is enabled by the following function:

def lucd_confusion_matrix(test_set: list or DataLoader, predictions: list, num_classes: int,
                          label_mapping: type(abs), tid: str,
                          write_accumulo: bool) -> (dict, str):
    ...

The function arguments are described below.

  • test_set: The PyTorch DataLoader or list of delayed Dask dataframes returned from the respective train_eval_test_split_pytorch/tensorflow function.
  • predictions: A list of predictions generated by your model (the list returned from the lucd_ml get_predictions_classification functions). The list must be in the same order as the test_set data.
  • num_classes: The integer number of classes for the confusion matrix to represent.
  • label_mapping: A function mapping integers to class labels, used to map predictions to a human-readable format.
  • tid: The training ID to associate the confusion matrix with.
  • write_accumulo: Boolean specifying whether to write the dictionary directly to Accumulo. Must be True to enable the Unity client confusion matrix; if False, the generated confusion matrix is only returned, not written to the database.

Further documentation for this function exists in the API documentation. Here is an example usage:

def _label_mapping():
    return {0: 'I. versicolor', 1: 'I. virginica', 2: 'I. setosa'}

...

# Prepare VDS data for modeling.
delayed_values_training, delayed_values_evaluation, delayed_values_testing, \
    my_df_testing_label, num_features = \
    lucd_uds.train_eval_test_split_tensorflow(virtual_dataset_id, evaluation_dataset_percent,
                                              testing_dataset_percent)
...

predictions, scores = lucd_ml.get_predictions_classification_tf(
    _estimator,
    lambda: lucd_uds.get_tf_dataset(feature_dict, delayed_values_testing, num_features,
                                    target_type).batch(1),
    classification_mode, .5)
...

lucd_ml.lucd_confusion_matrix(delayed_values_testing, predictions, 3, _label_mapping(), tid, True)

Submitting Model Training Status

Another helpful function is eda.int.train.status, which stores the status of a developer's training pipeline. This enables a model's status to be displayed in the Unity client. The function definition is below.

def status(uid, code, message=None):
    """Update model status in the database.

    Args:
        uid: Int representing a model's ID.
        code: 0 - RETRIEVING VDS, 1 - TRAINING, 2 - ANALYZING PERFORMANCE, 3 - STORING MODEL,
              4 - TRAINING COMPLETE, 5 - ERROR, 6 - QUEUED, 7 - STOPPED.
        message: String representing optional custom message to include.

    Returns:
        Status message.

    Raises:
        TypeError: If code is not of type int.
        Exception: If code is invalid.
    """
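A usage sketch for status reporting around a training pipeline (model_uid and the message are illustrative; the codes follow the docstring above):

from eda.int import train

train.status(model_uid, 1)                            # TRAINING
train.status(model_uid, 2, "computing ROC curves")    # ANALYZING PERFORMANCE, custom message
train.status(model_uid, 4)                            # TRAINING COMPLETE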
  3. In the compact modeling approach, the Lucd Modeling Framework (LMF) provides a lightweight approach to training TensorFlow and PyTorch models in Lucd. Boilerplate tasks (e.g., creating confusion matrices, ROC curves, etc.) are automatically executed by LMF, enabling a developer to write and maintain less code. Furthermore, this supports modularity, so developers can effortlessly adopt new data loading, training, and performance measurement logic as it becomes available in LMF. Compact modeling differs according to the framework being used for model development, i.e., TensorFlow or PyTorch. The following sections describe the separate approaches in more detail. Examples illustrating how to use the compact modeling approach are in The Lucd Model Shop.

TensorFlow

For TensorFlow-based modeling, compact modeling requires the developer to implement only two functions: model and label_mapping.

Model Function

The model function is intended to contain code for building TensorFlow Estimator models. There are plenty of examples on the web demonstrating how to build and configure Estimator models; https://www.tensorflow.org/guide/estimator is a good starting point. The LMF sends essential data (e.g., training data, training steps, pre-defined word embeddings) to this function to configure an Estimator model for training. In turn, the function must return the elements the LMF needs for training, performance analysis, etc. Below is a formal description of the model function.

def model(training_data, validation_data, num_features, training_steps, learning_rate,
          regularization_value, log_dir, training_param_map, embedding_matrix, embedding_size,
          word_index_mapping, max_document_length, pad_value, train_id):
    """Function used by LMF for training and analyzing TensorFlow Estimator models.

    Args:
        training_data (list): List of delayed "chunks" of Dask dataframe representing training data.
        validation_data (list): List of delayed "chunks" of Dask dataframe representing validation data.
        num_features (tuple): The shape of the features input for a model.
        training_steps (int): Number of steps for model training.
        learning_rate (float): Model's learning rate.
        regularization_value (float): Model's regularization value.
        log_dir (string): Path designating where checkpoints will be written (needed for training).
        training_param_map (dict): Dictionary containing miscellaneous parameters.
        embedding_matrix (numpy array): 2D numpy array where each row represents an embedding for an indexed word (for text classification models).
        embedding_size (int): Size/length of embeddings in embedding_matrix, or length of embeddings to be learned (for text classification models).
        word_index_mapping (dict): Dict mapping string words to their int index representations (for text classification models).
        max_document_length (int): Maximum number of tokens to be used for free text input into the model for training (for text classification models).
        pad_value (int): Index value used for padding documents for training, validation, and testing (for text classification models).
        train_id (str): Unique identifier of the underlying training in the database.
    Returns:
        TensorFlow Estimator object (for training),
        TensorFlow Estimator TrainSpec object (for running training),
        TensorFlow Estimator EvalSpec object (for running validation),
        Dict mapping feature names to feature types (for loading data into the model),
        Type of target/label in training/validation/testing data (for loading data into the model),
        TensorFlow serving_input_receiver_fn suitable for use in serving (for model serving/prediction),
        List of feature names (same order as in training data, for GUI display purposes),
        List of class names (same order as their numerical representation in training data, for confusion matrix and GUI display purposes),
        Number of classes to use in lucd_uds.get_tf_dataset_* and lucd_ml.get_predictions_* functions for TensorFlow,
        String "input_name" representing the name of the model input layer for use with the TF signature def when generating predictions.
    """

Note that the inputs to the model function are defined in the Unity client, and hence should not be altered.

Label_mapping Function

Return values from the label_mapping function are used by the LMF to compute the confusion matrix precision and recall statistics. For proper construction of the confusion matrix, the function should return a dict mapping the training data's label values (integers) to expressive strings.

PyTorch

For PyTorch-based modeling, the developer is required to implement the same functions as for TensorFlow: model and label_mapping. The use of the label_mapping function for PyTorch is exactly the same as for TensorFlow; hence, only the model function is described for PyTorch.

Model Function

As opposed to TensorFlow-based modeling, for which the model function implements a developer's AI model, model for PyTorch is primarily used for executing model training and validation logic. This is mainly because with PyTorch, model training logic is designed to be much more under a developer's control. As one can see in the example code, and in keeping with traditional PyTorch style, the actual AI model can be defined as a separate class inside the Python file. Details of the model function for PyTorch are below.

def model(training_data, validation_data, num_features, training_steps, learning_rate,
          regularization_value, log_dir, training_param_map, embedding_matrix, embedding_size,
          word_index_mapping, max_document_length, pad_value, train_id):
    """Function used by LMF for training and analyzing PyTorch models.

    Args:
        training_data (torch.utils.data.Dataset): PyTorch dataset representing training data.
        validation_data (torch.utils.data.Dataset): PyTorch dataset representing validation data.
        num_features (tuple): The shape of the features input for a model (no use for PyTorch identified).
        training_steps (int): Number of steps for model training.
        learning_rate (float): Model's learning rate.
        regularization_value (float): Model's regularization value.
        log_dir (string): Path designating where checkpoints will be written (needed for training).
        training_param_map (dict): Dictionary containing miscellaneous parameters.
        embedding_matrix (numpy array): 2D numpy array where each row represents an embedding for an indexed word (for text classification models).
        embedding_size (int): Size/length of embeddings in embedding_matrix, or length of embeddings to be learned (for text classification models).
        word_index_mapping (dict): Dict mapping string words to their int index representations (for text classification models).
        max_document_length (int): Maximum number of tokens to be used for free text input into the model for training (for text classification models).
        pad_value (int): Index value used for padding documents for training, validation, and testing (for text classification models).
        train_id (str): Unique identifier of the underlying training in the database.

    Returns:
        Trained PyTorch model,
        List of floats representing final model performance statistics values,
        List of class names (same order as their numerical representation in training data, for confusion matrix and GUI display purposes).
    """
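For example, a label_mapping implementation for a three-class Iris model is just a small function returning the dict described above (the class names are illustrative):

def label_mapping():
    # Integer training labels mapped to expressive strings for the confusion matrix and GUI.
    return {0: 'I. versicolor', 1: 'I. virginica', 2: 'I. setosa'}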
  4. This documentation describes how to use the Lucd Python libraries to create custom feature engineering operations for processing data for model training. Custom feature engineering operations enable a Lucd user to apply feature engineering operations that are not available in the Lucd GUI. A simple example is adjusting a given attribute as a function of the values of several other attributes. Overall, this should accommodate near open-ended feature transformation needs.

The Custom Operation Module

The eda.int.custom_operation module is used for sending custom operations to the Lucd backend so that they are selectable in the Lucd EDA section of the GUI. The simple Python script below illustrates how one may define a custom feature engineering operation and send it to the Lucd backend using the Lucd Python libraries.

from eda.int import custom_operation
import lucd


def transform(row):
    row['flower.petal_length'] = row['flower.petal_length'] + 1
    return row


client = lucd.LucdClient(domain="domain",
                         username="username",
                         password="password",
                         login_domain="Lucd Platform URL")

data = {
    "operation_name": "simple iris attribute adder",
    "author_name": "J. Branch",
    "author_email": "email@address",
    "operation_description": "Testing custom feature op using simple transform",
    "operation_purpose": "Numeric feature scaling",
    "operation_features": ["flower.petal_length"],
    "operation_function": transform
}

response_json, rv = custom_operation.create(data)
client.close()

The create function from the eda.int.custom_operation module submits the actual function and metadata to Lucd. The required attributes in the dict data are used for display in the Lucd GUI (and may at some point be used for search purposes). Table 1 describes the attributes in detail.

Table 1. Custom Operation Attribute Descriptions

  Attribute           Description
  name                String name/label of the custom feature engineering operation
  author_name         Name of the developer who wrote the operation
  author_email        Author's email
  purpose             Short description of what the operation achieves
  description         Longer description of how the operation might achieve its purpose, as well as other notes
  feature             List of strings identifying features/facets the operation affects
  transform_function  The Python function that implements the logic of the custom operation

Regarding the custom transformation function, it is essential that the function be defined in the same context as the dict passed to create. This ensures that de-serialization of the function works properly on the Lucd backend.

Custom Operation Types

Applied Operations

Applied operations denote custom operations that are applied to data using conventional Dask and Pandas function application mechanisms. Custom functions can be applied to data (dataframes) via the following Dask functions: apply, map_partitions, and applymap. Note that applied operations are applied to Pandas dataframes (which act as partitions of larger Dask dataframes), so the user should take this into account when designing a custom operation. As shown in the example code above, the user does not identify which Dask mechanism to use when calling create; this is selectable in the Lucd GUI when applying custom operations in the EDA section. A partition-level example is sketched below.
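Because map_partitions hands the custom function a whole Pandas dataframe (one partition) rather than a single row, a partition-level operation should take and return a dataframe. A minimal sketch, assuming the same Iris schema as the row-level transform above:

def transform_partition(df):
    # df is a Pandas dataframe: one partition of the larger Dask dataframe.
    df['flower.petal_length'] = df['flower.petal_length'] + 1
    return df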
Dask Operations

Dask operations are custom operations that are applied directly to a Dask dataframe. This is useful in cases where an operation or algorithm cannot be correctly applied to a "partition" of data as described above; a clustering algorithm is one example. We expect these custom operations to use modules from Dask libraries such as Dask-ML (https://ml.dask.org/). An example custom operation is shown below.

def cluster_function(df):
    from dask_ml.cluster import KMeans

    X = df[['car.days_in_service']]
    kmeans = KMeans(n_clusters=4)
    k_service_days = kmeans.fit(X)
    labels = k_service_days.predict(X)
    df['car.service_days_cluster'] = labels
    return df

When a Dask operation is to be applied (in the Unity client), the user should select apply_direct as the application type. Uploading the operation itself is sketched below.
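Uploading a Dask operation uses the same create call shown earlier for applied operations; whether the function runs per partition or via apply_direct is chosen later in the Unity client. A minimal sketch follows; the metadata values and connection parameters are illustrative only.

from eda.int import custom_operation
import lucd

# Illustrative connection values; substitute real platform credentials.
client = lucd.LucdClient(domain="domain",
                         username="username",
                         password="password",
                         login_domain="Lucd Platform URL")

data = {
    "operation_name": "k-means service-days clustering",  # illustrative metadata
    "author_name": "J. Branch",
    "author_email": "email@address",
    "operation_description": "Clusters vehicles by days in service using Dask-ML KMeans",
    "operation_purpose": "Clustering",
    "operation_features": ["car.days_in_service"],
    "operation_function": cluster_function
}

response_json, rv = custom_operation.create(data)
client.close()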
  9. User Guide

The Lucd Python Client provides capabilities for data scientists and AI model developers to prototype AI model solutions before uploading them to the Lucd Unity Client for extended training and performance analysis. The Lucd Python Client provides the following features:

- functions for accessing raw data and other assets in Lucd for general analysis and custom visualization;
- functions for uploading user-defined feature transformation operations to Lucd, which can then be applied in the Lucd Unity Client to create a virtual dataset;
- functions for ingesting data into TensorFlow and PyTorch models, which can be used for prototyping models.

Installation

The lucd-python-client package should be installed using the pip command with a Python wheel file. Instructions are as follows:

1. Download or clone the lucd-python-client package (unzip if needed) from here: Lucd Python Client Project. Open a command prompt and change to the package directory.
2. At the command prompt, type python setup.py bdist_wheel. The wheel file will appear in the dist directory.
3. Switch to the dist directory and type pip install <wheel_filename>.

Requirements

Python 3.6.5 is required for custom feature operations to work properly.

APIs

The Lucd Python Client offers Python and REST APIs. Code examples using both API types are available in the examples directory of the project.

Lucd Python Client API Examples

Example code illustrating how to perform tasks such as authenticating to Lucd, performing queries, obtaining virtual datasets, and training models resides in the examples directory of the project. Below are specific examples of how to access Lucd data using the client, as well as how to create and upload a custom feature transformation operation.

Accessing Data

from lucd import LucdClient, log
from eda.int import asset
from eda.int import vds
from eda.int import uds
from eda.lib import lucd_uds


if __name__ == "__main__":
    username = 'xxx'
    password = 'xxx'
    domain = 'xxx'

    client = LucdClient(username=username, password=password, domain=domain)
    log.info("Connected to Lucd platform.")

    # Queries follow the Elasticsearch API.
    # See: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/query-dsl.html
    query = {
        "query": {
            "bool": {
                "must": [
                    {"bool": {"should": [{"match_phrase": {"source": "iris"}}]}},
                    {"bool": {"should": []}}
                ],
                "filter": [
                    {"bool": {"filter": []}}
                ]
            }
        },
        "size": 2000,
        "dataset": "iris"
    }

    results, http = uds.search(query)
    print(f"Search Results ({http}):\n{results}\n")

    hits, stats = client.search_to_dataframe(results)
    print(f"Search Results:\n{hits.head()}\n")
    print(f"Search Statistics:\n{stats}\n")

    all_models, http = client.rest('lucd/model/read', {"uid": username})
    print(f"All Models ({http}):\n{all_models}\n")

    all_vds, http = vds.read({"uid": username})
    print(f"All Virtual Datasets ({http}):\n{all_vds}\n")

    all_assets, http = asset.read({"uid": username})
    print(f"All Asset Embeddings ({http}):\n{all_assets}\n")

    # Lucd library calls to fetch assets and VDSes.
    # When limiting asset size, you could encounter issues with missing index entries.
    embeddings_index, embedding_matrix, embedding_size, word_index_mapping, word_index_mapping_padded = \
        lucd_uds.get_asset("xxx", limit=100)
    print(embeddings_index, embedding_matrix, embedding_size, word_index_mapping, word_index_mapping_padded)

    # When limiting data size, you may encounter delays bringing back large amounts of
    # data over the network, and possibly run the client out of memory.
    all_vds, http = vds.read({"uid": None})
    print(f"All Virtual Datasets ({http}):\n{all_vds}\n")

    df = lucd_uds.get_dataframe("xxx", limit=100)
    print(f"Dataframe Data\n{df.head(20)}")

    client.close()

Custom Feature Transformation

from eda.int import custom_operation
import lucd


def create_greater_than_mean_column(df):
    column_mean = df["flower.petal_length"].mean()
    df["flower.petal_length_Mean"] = df["flower.petal_length"] > column_mean
    return df


if __name__ == "__main__":
    client = lucd.LucdClient(domain="xxx",
                             username="xxx",
                             password="xxx",
                             login_domain="xxx")

    data = {
        "operation_name": "create_greater_than_mean_column_JBstyle",
        "author_name": "J. Black",
        "author_email": "j.black@lucd.ai",
        "operation_description": "Sample operation",
        "operation_purpose": "Add a new column",
        "operation_features": ["flower.petal_length"],
        "operation_function": create_greater_than_mean_column
    }

    response_json, rv = custom_operation.create(data)
    client.close()

Federated Endpoints

To support federated machine learning, much of the high-level REST functionality operates in a federated manner. This means that unless otherwise specified, these actions are performed/created/deleted on all federates. The affected features include:

- Queries
- EDA trees
- Custom ops
- VDS objects
- Model definitions
- FATE
- Training objects

Users may include a block in their JSON specifying which federates to operate on, which looks like the following:

"federation": {"federates": ["domain_name1", "domain_name2"]}
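As a hedged sketch of how the block might be used, assuming vds.read is one of the federated endpoints and accepts the "federation" key alongside its usual request keys (the federate names are illustrative):

from eda.int import vds

# Hedged sketch: scope a VDS read to two named federates by adding the
# "federation" block to the request JSON.
request = {
    "uid": "username",
    "federation": {"federates": ["domain_name1", "domain_name2"]}
}
all_vds, http = vds.read(request)
print(f"Virtual Datasets on selected federates ({http}):\n{all_vds}\n")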
  12. This section covers the high-level tasks needed to support model training: importing and preparing data, and analyzing performance. Most of the content here pertains to the full model approach, but some (e.g., reporting model status) is still helpful for PyTorch compact modeling.

Importing and Preparing Data

Data can be imported into a modeling context using the Lucd Unified Dataspace (UDS) API (eda.lib.lucd_uds). This library provides functions for creating datasets of various formats (TensorFlow, PyTorch, Dask dataframe) based on Lucd virtual datasets defined in the Unity client. It also provides the capability to retrieve previously trained word embeddings.

The Lucd UDS functions providing data retrieval are listed below. Some are used for straightforward data importing (e.g., get_dataframe) while others prepare framework-specific datasets for AI models (e.g., get_tf_dataset for defining TensorFlow datasets). Refer to the API documentation for full function descriptions, and visit the Lucd Model Shop GitLab project for examples of how to use the functions for developing AI models.

- get_asset
- get_dataframe
- get_tf_dataset
- get_tf_dataset_image
- get_tf_dataset_text
- train_eval_test_split_dataframe
- train_eval_test_split_pytorch
- train_eval_test_split_tensorflow

Important notes for implementing multi-class modeling

TensorFlow offers different approaches to building multi-class models; two prominent ones are using pre-made Estimators (https://www.tensorflow.org/tutorials/estimator/premade#overview_of_programming_with_estimators) and using general techniques such as Keras models and custom Estimators. If one-hot encoded data labels are needed (i.e., to match the number of nodes in a neural network output layer), the num_classes parameter should be used when calling the relevant data retrieval functions (e.g., lucd_uds.get_tf_dataset), as sketched below. Note that most (if not all) TensorFlow pre-made Estimator models (e.g., tensorflow.estimator.DNNClassifier) do not require explicitly one-hot encoded data labels for non-binary modeling, and hence the num_classes argument can be omitted for them. In the case of TensorFlow Estimators, developers are encouraged to understand how to shape input for the models. The same goes for modeling with PyTorch or XGBoost.
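A minimal sketch follows, assuming the argument order shown in the examples later in this section; feature_dict, delayed_values_training, num_features, and target_type are placeholders assumed to come from train_eval_test_split_tensorflow and related setup.

from eda.lib import lucd_uds

# Hedged sketch: request one-hot encoded labels for a 3-class problem by
# passing num_classes; omit it for pre-made Estimators such as DNNClassifier.
dataset = lucd_uds.get_tf_dataset(feature_dict, delayed_values_training,
                                  num_features, target_type,
                                  num_classes=3)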
Analyzing Model Performance

Post-training performance analysis tasks are supported by the Lucd Machine Learning (ML) API (eda.lib.lucd_ml). This library provides functions supporting automatic execution and reporting of critical performance analysis tasks (e.g., creating confusion matrices and ROC curves), removing the need to repeatedly write such code. The tables and plots created by these library functions can be viewed in the Unity client after the entire model training process has completed.

The Lucd ML functions for performance analysis are listed below. Refer to the API documentation for full function descriptions.

- get_predictions_classification_pt
- get_predictions_classification_tf
- get_predictions_regression_pt
- get_predictions_regression_tf
- lucd_precision_recall_curve
- lucd_roc_curve
- lucd_confusion_matrix
- update_plots

Submitting Performance Analysis Results

Trained models and metadata can be uploaded to the Lucd backend via the eda.int.train.update function. The following example code illustrates how to use the function.

model_filename = lucd_uds.zip_model_tf(trained_classifier, serving_input_receiver_fn,
                                       model_id, graph_version, log_dir)

# Store model graph and performance stats back to the Lucd back-end.
with open(model_filename, "rb") as graph_file:
    train.update({tid: {
        'performance': {
            'loss': loss,
            'accuracy': accuracy,
            'macro_precision': precision_macro,
            'macro_recall': recall_macro,
            'macro_f1': f1_macro,
            'micro_precision': precision_micro,
            'micro_recall': recall_micro,
            'micro_f1': f1_micro,
            'weighted_precision': precision_weighted,
            'weighted_recall': recall_weighted,
            'weighted_f1': f1_weighted,
            'precision_recall_f1_per_label': results_string,
        },
        'graph_version': graph_version,
        'graph_file': graph_file.read()
    }})

train.update takes a Python dictionary as its argument, with the train_id (described in Table 1) as the top-level key; tid represents the train_id in the code snippet above. The secondary keys graph_version and graph_file store the graph version and the trained graph file (model), respectively. The secondary key performance stores another dictionary of performance values. There is no restriction on the key-value pairs here; the developer may choose which performance values to store, and they will be viewable in the Unity client afterward. The values shown in the code snippet above are customary for evaluating classification models. Again, see the example models in the Lucd Model Shop for more insights.

As for precision_recall_f1_per_label, an example of how to compute performance statistics per label can be found at https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html. precision_recall_f1_per_label must be formatted as a semicolon-separated string of tuples, with each tuple formatted as "<label>-<precision>,<recall>,<f1>". An example is as follows:

precision_recall_f1_per_label: setosa-1.0,0.5,1.0;virginica-1.0,1.0,0.6;versicolor-1.0,0.6666666666666666,0.7499999999999999

A sketch of building this string with scikit-learn follows.
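The following hedged sketch assembles the string from scikit-learn's precision_recall_fscore_support; y_true, y_pred, and the class names are placeholder values for illustration.

from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 1, 1, 2, 2]   # placeholder ground-truth labels
y_pred = [0, 1, 1, 1, 2, 0]   # placeholder model predictions
class_names = ['setosa', 'virginica', 'versicolor']

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1, 2])

# Semicolon-separated "<label>-<precision>,<recall>,<f1>" tuples.
results_string = ';'.join(
    f"{name}-{p},{r},{f}"
    for name, p, r, f in zip(class_names, precision, recall, f1))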
Enabling Model Explainability

To enable a trained model to be used by the explainability tool in the Unity client, some parameters must be defined. For TensorFlow models, ordered_feature_names, ordered_class_names, input_name, and output_name must be defined.

- ordered_feature_names (not to be confused with training data input column names) is a list of ordered names of the inputs to the trained model, commonly defined in TensorFlow model definitions as tf.feature_column. For example, for a TensorFlow text classification model, the named input might be embedding_input. Please see the example code in the Lucd Model Shop.
- ordered_class_names is a list formatted such that string class names are ordered by their integer representations (the order of outputs from your model). For instance, for a binary classification model with labels 0 and 1, the order of strings must be negative and positive (or whatever string labels you choose).
- input_name is the name of the input layer in your TensorFlow model to which your ordered_feature_names data will be passed.
- output_name is the name of the output layer in your TensorFlow model (by default these can be named things like 'dense_2' and 'scores'). The output_name is used to retrieve your model outputs in the proper format for explanation.

PyTorch models only require that ordered_class_names be provided.

Plots

Lucd allows users to plot model training metrics in the Unity client. The plots update in real time during model training, providing insight into the viability, problems, and successes of training runs.

Users can store any information they want plotted as part of a training run using the eda.lib.lucd_ml.update_plots function. The provided information must abide by the following assumptions:

1. The top-level keys represent individual plots.
2. Each inner dictionary must specify a labels and a description key, where labels is ["plot x_label", "plot y_label"] and the description can be any string.
3. All remaining keys in the inner dictionary are treated as individual lines on the plot, so in the following example "l1" is a line on the "accuracy" plot.

{
    "accuracy": {
        "l1": [[1, 0.10], [2, 0.15]],
        "labels": ["epoch", "accuracy"],
        "description": "This is a simple accuracy plot example."
    }
}

The individual elements in the "l1" line above represent [x_val, y_val]. A common example, as shown above, is "l1": [[epoch, accuracy], [epoch, accuracy], ...].

Once created, the dictionary may be stored for plotting with the lucd_ml.update_plots function:

def update_plots(train_or_eval: bool, tid: str, dictionary: dict) -> str:
    ...

dictionary = {
    "accuracy": {
        "l1": [[1, 0.10], [2, 0.15]],
        "labels": ["epoch", "accuracy"],
        "description": "This is a simple accuracy plot example."
    }
}

update_plots(True, train_id, dictionary)

Regarding the update_plots function, train_or_eval allows a user to specify whether the plot is part of the training or the evaluation (or validation) cycle of model training: train=True, eval=False.

A TensorFlow hook is provided in lucd_ml for automatically parsing generated events files (the same files used by TensorBoard) and passing them to update_plots as part of a TensorFlow model. It can be provided as part of a TensorFlow EvalSpec or TrainSpec object as follows (stub included for reference):

class LucdTFEstimatorHook(tf.estimator.SessionRunHook):
    def __init__(self, train_hook: bool, log_dir: str, tid: str, freq: int, last_epoch: int):
        ...

train_spec = tf.estimator.TrainSpec(
    input_fn=lambda: lucd_uds.get_tf_dataset_image(type_dict, training_data, num_features,
                                                   target_type, num_classes)
        .repeat(count=None).shuffle(30).batch(30),
    max_steps=training_steps,
    hooks=[lucd_ml.LucdTFEstimatorHook(train_hook=True, log_dir=log_dir, tid=tid,
                                       freq=10, last_epoch=training_steps)])

train_hook specifies whether the hook is providing train or eval metrics (train=True, eval=False). log_dir tells the hook where to find the TensorFlow events files. freq is the frequency at which the hook should look for metrics in the events files. last_epoch tells the hook the number of epochs being run, so the hook can ignore freq for the last epoch.

Two last helper functions are provided as part of the Lucd plotting framework: lucd_roc_curve and lucd_precision_recall_curve. These functions generate ROC curves and precision-recall curves, respectively, and are called selectively when using the compact modeling approach (enabled by the "performance_curves" button in the Unity client). Further documentation for these functions is provided in the API.

def lucd_roc_curve(truths: list, scores: list, class_list: list, tid: str = None,
                   write_accumulo: bool = False) -> (dict, dict, dict, dict):
    ...

def lucd_precision_recall_curve(truths: list, scores: list, class_list: list, tid: str = None,
                                write_accumulo: bool = False) -> (dict, dict, dict, dict):
    ...
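A hedged usage sketch follows; the truths, per-class scores, and class names are illustrative placeholder values, and each call returns a 4-tuple of dicts per the signatures above (see the API documentation for their contents).

from eda.lib import lucd_ml

tid = "xxx"  # placeholder training ID

truths = [0, 1, 2, 1, 0]            # integer class labels for five test records
scores = [[0.80, 0.10, 0.10],       # per-class scores produced by the model
          [0.20, 0.70, 0.10],
          [0.10, 0.20, 0.70],
          [0.30, 0.60, 0.10],
          [0.90, 0.05, 0.05]]
class_list = ['setosa', 'virginica', 'versicolor']

roc_result = lucd_ml.lucd_roc_curve(truths, scores, class_list,
                                    tid=tid, write_accumulo=True)
pr_result = lucd_ml.lucd_precision_recall_curve(truths, scores, class_list,
                                                tid=tid, write_accumulo=True)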
Confusion Matrix

Lucd provides an in-depth, interactive confusion matrix for classification model evaluation. Users may select a square in the Unity client to be shown the actual records associated with that square. This may be enabled by using the following function:

def lucd_confusion_matrix(test_set: list or DataLoader, predictions: list, num_classes: int,
                          label_mapping: type(abs), tid: str,
                          write_accumulo: bool) -> (dict, str):
    ...

Details of the function arguments are provided below.

- test_set: Users may directly pass the PyTorch DataLoader or the list of delayed Dask dataframes returned from the respective train_eval_test_split_pytorch/tensorflow function.
- predictions: A list of predictions generated by your model (the list returned from lucd_ml get_predictions_classification). The list must be in the same order as the test_set data.
- num_classes: The integer number of classes for the confusion matrix to represent.
- label_mapping: A function mapping integers to class labels, used to map predictions to a human-readable format.
- tid: The training ID to associate the confusion matrix with.
- write_accumulo: Boolean specifying whether to write the dictionary directly to Accumulo. Must be True to enable the Unity client confusion matrix. If False, the generated confusion matrix is only returned, not written to the database.

Further documentation for this function exists in the API documentation. Here is an example usage:

def _label_mapping():
    return {0: 'I. versicolor', 1: 'I. virginica', 2: 'I. setosa'}

...

# Prepare VDS data for modeling.
delayed_values_training, delayed_values_evaluation, delayed_values_testing, my_df_testing_label, num_features = \
    lucd_uds.train_eval_test_split_tensorflow(virtual_dataset_id, evaluation_dataset_percent,
                                              testing_dataset_percent)

...

predictions, scores = lucd_ml.get_predictions_classification_tf(
    _estimator,
    lambda: lucd_uds.get_tf_dataset(feature_dict, delayed_values_testing, num_features,
                                    target_type).batch(1),
    classification_mode, .5)

...

lucd_ml.lucd_confusion_matrix(delayed_values_testing, predictions, 3, _label_mapping(), tid, True)

Submitting Model Training Status

Another helpful function is eda.int.train.status, which is used for storing the status of a developer's training pipeline. This enables a model's status to be displayed in the Unity client. The function definition is below.

def status(uid, code, message=None):
    """Update model status in the database.

    Args:
        uid: Int representing a model's ID.
        code: 0 - RETRIEVING VDS, 1 - TRAINING, 2 - ANALYZING PERFORMANCE,
              3 - STORING MODEL, 4 - TRAINING COMPLETE, 5 - ERROR,
              6 - QUEUED, 7 - STOPPED.
        message: String representing optional custom message to include.

    Returns:
        Status message.

    Raises:
        TypeError: If code is not of type int.
        Exception: If code is invalid.
    """
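A hedged sketch of reporting status at each pipeline stage follows; model_uid is a placeholder for the model's ID, and the codes are those listed in the docstring above.

from eda.int import train

model_uid = 123  # placeholder model ID

train.status(model_uid, 0)                        # RETRIEVING VDS
train.status(model_uid, 1, message="epoch 1/10")  # TRAINING, with a custom message
train.status(model_uid, 2)                        # ANALYZING PERFORMANCE
train.status(model_uid, 3)                        # STORING MODEL
train.status(model_uid, 4)                        # TRAINING COMPLETE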
