Showing results for tags 'modeling'.


  1. Go to the following link: Avicenna Dashboard. Log in with your credentials, click "Dashboard", and then select Avicenna to open the Lucd Global Pandemic Prediction Model landing page. If you are familiar with Kibana dashboards, you can apply that knowledge directly; there are many web resources on interacting with Kibana dashboards, including this introduction: Kibana Dashboards. The views described here are a starting point, but there is much more you can do while exploring the Avicenna dashboard. Before jumping into specifics, set the date filter you want to look at. Avicenna projects pandemic spread into the future, so selecting the proper date window is important. For instance, open the date pull-down, change the window to the next 15 days, and select Apply; every visual in the dashboard updates to show only the date range you selected. Now, with the filter set to the next 15 days, suppose you only want the visuals for a particular state, say Alabama. In the search bar, type "State:Alabama" and the visuals are filtered to Alabama only. To narrow further to a metro area or county, select those slices on the pie chart. If you just want to see what a particular future date will look like, go back to the date field, select a 24-hour period, and select Update. And if you are unsure about this particular model, you can remove the model filter to see the other models available in the dashboard. There are many more ways to filter the dashboard; you are encouraged to explore, search on Kibana dashboards, or post questions.
  2. Model Profiling, Maintenance and Explainability. Creating a model is one thing; measuring model performance is something else altogether. Profiling model performance is necessary during model creation, which is essentially an experimental process. But once a model is placed into production, its performance can decay over time as data changes. Just as important is our ability to explain why models behave as they do: how did my model arrive at the specific prediction it did, given the data? Accommodating both the creation process and the maintenance phase, the Lucd Client gives us the visibility into model metrics needed to easily profile model performance. We can know if and when we have hit thresholds of acceptability, or whether we need to keep refining data and training. By the same token, we can know when it is time to replace a production model with something a little more resilient. Something all of us in the industry face is the need to explain how and why models made their predictions. Model review boards within large organizations are becoming more common. To more easily provide the documentation and assurances such bodies require, the Lucd Client model profiling interfaces make review and regulatory reporting far easier than would otherwise be the case. The video covers how to analyze and profile your models both during and after training.
  3. Modeling. Everything we have covered to this point has been pointing to this: models. The whole point of what we are doing is to harness the potential energy inherent in our data to create models that accurately predict. Models are prediction machines. The entire thesis of ML/AI rests on this one salient fact: when we accurately predict, we efficiently move forward. What is ML/AI if not prediction machines? Granted, we human beings have had the ability to make fact-grounded predictions for a long time. Since the dawn of the Scientific Revolution in the early 17th century, we have learned how to gather information and, from that information, reason our way to conclusions about cause and effect, and from that foundation make predictions. Our ability to make predictions has grown linearly since those humble beginnings, when Galileo had the temerity to suggest the earth was not the center of the universe. But with ML/AI, our ability to predict is no longer on a linear growth path -- it is entirely disruptive because, relative to previous eras and methods of prediction, it is orders of magnitude cheaper today than it was just a few years ago. And with that lower cost threshold for reasonable, information-based predictions, the extraordinary potential energy inherent in our data becomes clear: Prediction is at the heart of making decisions under uncertainty. All organizations are riddled with such decisions. Prediction tools increase productivity—operating machines, handling documents, communicating with stakeholders. Uncertainty constrains strategy. Better prediction creates opportunities for new business structures and strategies to improve and compete.

Open Source Software. Today, open source software is available for anyone with a computer and an Internet connection to download and use. Data itself is increasingly becoming an easy-to-acquire commodity. Notice we did not use the word "free" when describing these innovations. While there may be no- or low-cost options for applications and even data, there is nothing free about it. One way or another there are costs involved, whether it is hiring and/or training skilled programmers, data engineers, machine learning specialists, or analysts; no matter what set of tools you decide to embrace, there will always be costs. Of course it makes sense to leverage open source software when appropriate, but only if it makes sense in the context of your organization. Are you prepared to take full responsibility for the software you leverage? Are you ready to hire and maintain a crew of programmers and managers skilled in the nuances of the application in question? Or might you be better served focusing on matters more core to your mission and strategy, rather than on tools that may or may not be pertinent a few years down the road? Many organizations have made the mistake of thinking that open source software, by itself, can solve their problems. Nothing could be further from the truth. We know. How? Because we leverage open source software ourselves! And we are well aware of the organizational commitment it takes to integrate, maintain, and extend such tools for our own organization. There is no organization today that does not use open source software in some way. Whether they know it or not, somewhere in their vast array of digital servers and devices there will be some set of open source software. The Internet would not exist were it not for open source. But that is not to say it is free.

Your Data, Your Model, Your Innovation. What is your mission? Why do you exist? The answers to those questions will inform the decisions you make and the predictions you value. Where does your competitive advantage reside in a world where technology is distributed worldwide? It is not your technology, unless you are an inventor of technology. Your people? Maybe. But people are not permanent assets; people are temporal. Your buildings? Your rolodex? None of those can or will provide a sustainable competitive advantage. But you do have something that will: your data. The data you own, which no one else in the world can access without your permission, is the foundation on which you will build an advantage over competitors that will never abate. From your data, models you own will be built. Predictions from those models that benefit you will be the result. And from there, the innovations you will inherit are, in fact, unpredictable -- but the potential energy inherent in your data is far greater than you can imagine today. In this video of the Lucd Client tutorial series, we show you how to create and train your models in the Lucd client. NEXT CHAPTER: MODEL PROFILING
  4. The Lucd platform enables the user to perform model performance analysis. Easy-to-use tools allow the user to view and compare data before and after it has been trained against a selected model. The platform also tracks the critical governance and explainability information associated with the process.
  5. This section covers the high-level tasks needed to support model training: importing and preparing data, and analyzing performance. Most of the content here pertains to the full model approach, but some of it (e.g., reporting model status) is still helpful for PyTorch compact modeling.

Importing and Preparing Data

Data can be imported into a modeling context using the Lucd Unified Dataspace (UDS) API (eda.lib.lucd_uds). This library provides functions for creating datasets of various formats (TensorFlow, PyTorch, Dask dataframe) based on Lucd virtual datasets defined in the Unity client. It also provides the capability to retrieve previously trained word embeddings. The Lucd UDS data-retrieval functions are listed below. Some are used for straightforward data importing (e.g., get_dataframe), while others prepare framework-specific datasets for AI models (e.g., get_tf_dataset for defining TensorFlow datasets). Refer to the API documentation for full function descriptions, and visit the Lucd Model Shop GitLab project for examples of how to use the functions when developing AI models.

- get_asset
- get_dataframe
- get_tf_dataset
- get_tf_dataset_image
- get_tf_dataset_text
- train_eval_test_split_dataframe
- train_eval_test_split_pytorch
- train_eval_test_split_tensorflow

Important notes for implementing multi-class modeling: TensorFlow offers different approaches to building multi-class models, two prominent ones being pre-made Estimators (https://www.tensorflow.org/tutorials/estimator/premade#overview_of_programming_with_estimators) and more general techniques such as Keras models converted to Estimators. If one-hot encoded data labels are needed (i.e., to match the number of nodes in a neural network output layer), the num_classes parameter should be used when calling the relevant data-retrieval functions (e.g., lucd_uds.get_tf_dataset). Note that most (if not all) TensorFlow pre-made Estimator models do not require explicitly one-hot encoded data labels for non-binary modeling (e.g., tensorflow.estimator.DNNClassifier), and hence the num_classes argument can be omitted. In the case of TensorFlow Estimators, developers are encouraged to understand how to shape input for the models; the same goes for modeling with PyTorch or XGBoost.
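
As a minimal sketch of what data preparation can look like in a full model script, the following is based on the example usage shown later in this post (train_eval_test_split_tensorflow and get_tf_dataset). The variable values, feature_dict, and target_type are hypothetical placeholders, not a definitive reference; consult the API documentation and the Lucd Model Shop for exact signatures.

    from eda.lib import lucd_uds

    virtual_dataset_id = "my_vds"        # hypothetical VDS id created in the Unity client
    evaluation_dataset_percent = 0.2
    testing_dataset_percent = 0.1

    # Split the virtual dataset into train/eval/test partitions (delayed Dask dataframes).
    delayed_training, delayed_evaluation, delayed_testing, testing_labels, num_features = \
        lucd_uds.train_eval_test_split_tensorflow(virtual_dataset_id,
                                                  evaluation_dataset_percent,
                                                  testing_dataset_percent)

    # feature_dict and target_type follow the conventions shown in the Lucd Model Shop examples.
    # Include num_classes only when one-hot encoded labels are needed (see the notes above).
    def train_input_fn():
        return lucd_uds.get_tf_dataset(feature_dict, delayed_training, num_features,
                                       target_type).shuffle(100).batch(32)
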
Analyzing Model Performance

Post-training performance analysis tasks are supported by the Lucd Machine Learning (ML) API (eda.lib.lucd_ml). This library provides functions that automatically execute and report critical performance analysis tasks (e.g., creating confusion matrices and ROC curves), removing the need to repeatedly write such code. The tables and plots created by these library functions can be viewed in the Unity client after the entire model training process has completed. The Lucd ML functions for performance analysis are listed below. Refer to the API documentation for full function descriptions.

- get_predictions_classification_pt
- get_predictions_classification_tf
- get_predictions_regression_pt
- get_predictions_regression_tf
- lucd_precision_recall_curve
- lucd_roc_curve
- lucd_confusion_matrix
- update_plots

Submitting Performance Analysis Results

Trained models and metadata can be uploaded to the Lucd backend via the eda.int.train.update function. The following example code illustrates how to use it.

    model_filename = lucd_uds.zip_model_tf(trained_classifier, serving_input_receiver_fn,
                                           model_id, graph_version, log_dir)

    # Store model graph and performance stats back to the Lucd back-end.
    with open(model_filename, "rb") as graph_file:
        train.update({tid: {
            'performance': {
                'loss': loss,
                'accuracy': accuracy,
                'macro_precision': precision_macro,
                'macro_recall': recall_macro,
                'macro_f1': f1_macro,
                'micro_precision': precision_micro,
                'micro_recall': recall_micro,
                'micro_f1': f1_micro,
                'weighted_precision': precision_weighted,
                'weighted_recall': recall_weighted,
                'weighted_f1': f1_weighted,
                'precision_recall_f1_per_label': results_string,
            },
            'graph_version': graph_version,
            'graph_file': graph_file.read()
        }})

train.update takes a Python dictionary as its argument, with the train_id (described in Table 1) as the top-level key; tid represents the train_id in the code snippet above. The secondary keys graph_version and graph_file store the graph version and the trained graph file (model), respectively. The secondary key performance stores another dictionary of performance values. There is no restriction on the key-value pairs here: the developer may choose the performance values, and they will be viewable in the Unity client afterward. The values shown in the snippet above are customary for evaluating classification models. Again, see the example models in The Lucd Model Shop for more insight.

As for precision_recall_f1_per_label, an example of how to compute performance statistics per label can be found at https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html. precision_recall_f1_per_label must be formatted as a semicolon-separated string of tuples, with each tuple formatted as "<label>-<precision>,<recall>,<f1>". An example is: precision_recall_f1_per_label: setosa-1.0,0.5,1.0;virginica-1.0,1.0,0.6;versicolor-1.0,0.6666666666666666,0.7499999999999999.
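
As a concrete illustration of that per-label string, the following sketch computes per-label metrics with scikit-learn's precision_recall_fscore_support and joins them into the expected format; the labels and data below are purely illustrative.

    from sklearn.metrics import precision_recall_fscore_support

    labels = ["setosa", "virginica", "versicolor"]
    y_true = ["setosa", "virginica", "versicolor", "virginica"]   # illustrative ground truth
    y_pred = ["setosa", "virginica", "versicolor", "setosa"]      # illustrative predictions

    # Per-label precision, recall, and F1 (the support values are not needed here).
    precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=labels)

    # Join into "<label>-<precision>,<recall>,<f1>" tuples separated by semicolons.
    results_string = ";".join(
        "{}-{},{},{}".format(label, p, r, f)
        for label, p, r, f in zip(labels, precision, recall, f1)
    )
    # e.g., "setosa-0.5,1.0,0.6666666666666666;virginica-1.0,0.5,0.6666666666666666;versicolor-1.0,1.0,1.0"
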
Enabling Model Explainability

To enable a trained model to be used by the explainability tool in the Unity client, some parameters must be defined. For TensorFlow models, ordered_feature_names, ordered_class_names, input_name, and output_name must be defined.

- ordered_feature_names (not to be confused with training data input column names) is a list of ordered names of the inputs to the trained model, commonly defined in TensorFlow model definitions as tf.feature_column. For example, for a TensorFlow text classification model, the named input might be embedding_input. Please see the example code in The Lucd Model Shop.
- ordered_class_names is a list in which string class names are ordered by their integer representations (the order of outputs from your model). For instance, for a binary classification model whose labels are 0 and 1, the order of strings must be negative and positive (or whatever string labels you choose).
- input_name is the name of the input layer in your TensorFlow model to which your ordered_feature_names data will be passed.
- output_name is the name of the output layer in your TensorFlow model (by default these can be named things like 'dense_2' and 'scores'). The output_name is used to retrieve your model outputs in the proper format for explanation.

PyTorch models only require that ordered_class_names be provided.

Plots

Lucd allows users to plot model training metrics in the Unity client. The plots update in real time during model training, providing insight into the viability, problems, and successes of training runs. Users can store any information they want plotted as part of a training run using the eda.lib.lucd_ml.update_plots function. The provided information must abide by the following assumptions:

1. The top-level keys represent individual plots.
2. Each inner dictionary must specify a labels and a description key, where labels is ["plot x_label", "plot y_label"] and the description can be any string.
3. All remaining keys in the inner dictionary are treated as individual lines on the plot, so in the following example "l1" is a line on the "accuracy" plot.

    {
        "accuracy": {
            "l1": [[1, 0.10], [2, 0.15]],
            "labels": ["epoch", "accuracy"],
            "description": "This is a simple accuracy plot example."
        }
    }

The individual elements in the "l1" line above represent [x_val, y_val]; a common pattern, as shown above, is "l1": [[epoch, accuracy], [epoch, accuracy], ...]. Once created, the dictionary may be stored for plotting with the lucd_ml.update_plots function:

    def update_plots(train_or_eval: bool, tid: str, dictionary: dict) -> str:
        ...

    dictionary = {
        "accuracy": {
            "l1": [[1, 0.10], [2, 0.15]],
            "labels": ["epoch", "accuracy"],
            "description": "This is a simple accuracy plot example."
        }
    }
    update_plots(True, train_id, dictionary)

Regarding the update_plots function, train_or_eval specifies whether the plot is part of the training or the evaluation (or validation) cycle of model training: train=True, eval=False.

A TensorFlow hook is provided in lucd_ml for automatically parsing generated events files (the same files used by TensorBoard) and passing them to update_plots as part of a TensorFlow model. It can be provided as part of a TensorFlow EvalSpec or TrainSpec object as follows (stub included for reference):

    class LucdTFEstimatorHook(tf.estimator.SessionRunHook):
        def __init__(self, train_hook: bool, log_dir: str, tid: str, freq: int, last_epoch: int):
            ...

    train_spec = tf.estimator.TrainSpec(
        input_fn=lambda: lucd_uds.get_tf_dataset_image(type_dict, training_data, num_features,
                                                       target_type, num_classes)
                                 .repeat(count=None).shuffle(30).batch(30),
        max_steps=training_steps,
        hooks=[lucd_ml.LucdTFEstimatorHook(train_hook=True, log_dir=log_dir, tid=tid,
                                           freq=10, last_epoch=training_steps)])

- train_hook specifies whether the hook is providing train or eval metrics (train=True, eval=False).
- log_dir tells the hook where to find the TensorFlow events files.
- freq is the frequency at which the hook should look for metrics in the events files.
- last_epoch tells the hook the number of epochs being run so the hook can ignore freq for the last epoch.

Two more helper functions are provided as part of the Lucd plotting framework: lucd_roc_curve and lucd_precision_recall_curve. These functions generate ROC curves and precision-recall curves, respectively, and are called selectively when using the compact modeling approach (enabled by the "performance_curves" button in Unity). Further documentation for these functions is provided in the API.

    def lucd_roc_curve(truths: list, scores: list, class_list: list, tid: str = None,
                       write_accumulo: bool = False) -> (dict, dict, dict, dict):
        ...

    def lucd_precision_recall_curve(truths: list, scores: list, class_list: list, tid: str = None,
                                    write_accumulo: bool = False) -> (dict, dict, dict, dict):
        ...
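
Tying the plotting pieces above together, here is a minimal sketch (not taken from the Lucd documentation) of a full-model training loop that appends an [epoch, accuracy] point after every epoch and pushes the dictionary with update_plots so the Unity client plot refreshes during training. train_one_epoch is a hypothetical placeholder for your own training and evaluation logic, and tid is the training ID used elsewhere in this post.

    from eda.lib import lucd_ml

    num_epochs = 10
    accuracy_points = []                      # grows by one [epoch, accuracy] pair per epoch
    plots = {
        "accuracy": {
            "l1": accuracy_points,
            "labels": ["epoch", "accuracy"],
            "description": "Training accuracy recorded after each epoch."
        }
    }

    for epoch in range(1, num_epochs + 1):
        # Hypothetical helper: runs one epoch and returns the epoch's training accuracy.
        accuracy = train_one_epoch(model, train_loader)
        accuracy_points.append([epoch, float(accuracy)])
        # train_or_eval=True marks these points as training-cycle metrics.
        lucd_ml.update_plots(True, tid, plots)
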
Confusion Matrix

Lucd provides an in-depth, interactive confusion matrix for classification model evaluation. Users may select a square in the Unity client to be shown the actual records associated with the selected square. This is enabled by the following function:

    def lucd_confusion_matrix(test_set: list or DataLoader, predictions: list, num_classes: int,
                              label_mapping: type(abs), tid: str, write_accumulo: bool) -> (dict, str):
        ...

The function arguments are detailed below.

- test_set: The PyTorch DataLoader or list of delayed Dask dataframes returned from the respective train_eval_test_split_pytorch/tensorflow function.
- predictions: A list of predictions generated by your model (the list returned from the lucd_ml get_predictions_classification functions). The list must be in the same order as the test_set data.
- num_classes: The integer number of classes for the confusion matrix to represent.
- label_mapping: A function mapping integers to class labels, used to map predictions to a human-readable format.
- tid: The training ID to associate the confusion matrix with.
- write_accumulo: Boolean specifying whether to write the dictionary directly to Accumulo. Must be True to enable the Unity client confusion matrix; if False, the generated confusion matrix is only returned and not written to the database.

Further documentation for this function exists in the API documentation. Here is an example usage:

    def _label_mapping():
        return {0: 'I. versicolor', 1: 'I. virginica', 2: 'I. setosa'}

    ...

    # Prepare VDS data for modeling
    delayed_values_training, delayed_values_evaluation, delayed_values_testing, my_df_testing_label, num_features = \
        lucd_uds.train_eval_test_split_tensorflow(virtual_dataset_id, evaluation_dataset_percent,
                                                  testing_dataset_percent)

    ...

    predictions, scores = lucd_ml.get_predictions_classification_tf(
        _estimator,
        lambda: lucd_uds.get_tf_dataset(feature_dict, delayed_values_testing, num_features,
                                        target_type).batch(1),
        classification_mode, .5)

    ...

    lucd_ml.lucd_confusion_matrix(delayed_values_testing, predictions, 3, _label_mapping(), tid, True)

Submitting Model Training Status

Another helpful function is eda.int.train.status, which stores the status of a developer's training pipeline, enabling a model's status to be displayed in the Unity client. The function definition is below.

    def status(uid, code, message=None):
        """Update model status in the database.

        Args:
            uid: Int representing a model's ID.
            code: 0 - RETRIEVING VDS, 1 - TRAINING, 2 - ANALYZING PERFORMANCE, 3 - STORING MODEL,
                  4 - TRAINING COMPLETE, 5 - ERROR, 6 - QUEUED, 7 - STOPPED.
            message: String representing optional custom message to include.

        Returns:
            Status message.

        Raises:
            TypeError: If code is not of type int.
            Exception: If code is invalid.
        """
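
Below is a minimal sketch of reporting pipeline status with eda.int.train.status, using the status codes listed in the docstring above; the model ID and the commented pipeline steps are hypothetical placeholders.

    from eda.int import train

    model_id = 12345    # hypothetical model ID for this training run

    try:
        train.status(model_id, 0)   # RETRIEVING VDS
        # ... retrieve and split the virtual dataset ...

        train.status(model_id, 1)   # TRAINING
        # ... run the training loop ...

        train.status(model_id, 2)   # ANALYZING PERFORMANCE
        # ... confusion matrix, ROC/PR curves, plots ...

        train.status(model_id, 3)   # STORING MODEL
        # ... upload the trained model via train.update ...

        train.status(model_id, 4, message="Training complete")  # TRAINING COMPLETE
    except Exception as exc:
        train.status(model_id, 5, message=str(exc))              # ERROR
        raise
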
  6. The Lucd Modeling Framework (LMF) enables developers to build and interface custom AI models with the Lucd JedAI platform for streamlined management, experimentation, and training using data and parameters established in the JedAI Unity client (or simply, Unity client). The framework supports Python-based AI models built with TensorFlow, PyTorch, and Dask XGBoost. LMF's Python libraries support the following tasks:

- accessing Lucd virtual datasets (VDSes) for model training and evaluation,
- analyzing and reporting model performance metrics (e.g., with confusion matrices, ROC curves),
- storing structures representing trained models and training checkpoints.

Model Development Approaches

LMF provides flexibility in the level of effort and control needed for preparing models for Lucd. The two approaches are the full and compact model approaches; their differences are illustrated in Figure 1.

Figure 1. Conceptual illustration of full and compact model approaches.

Full Model Approach

In the full model approach, a developer creates an AI model and manually uses LMF Python libraries to complete the model training workflow (e.g., train, validate, holdout data testing, store results). This enables complete flexibility for more advanced use cases, which might include designing complex or experimental training loops, advanced performance analysis, custom model compression, etc. Full models are implemented as normal Python scripts. Further details are in the Developing Full Models section of this documentation.

Compact Model Approach

The compact model approach enables a developer to focus most if not all effort on defining an AI model, leaving other workflow tasks, such as holdout data testing and storage of performance results, for the LMF to handle automatically behind the scenes. In the case of TensorFlow, the developer does not even need to write training logic. The major benefits of the compact model approach are (1) significantly less coding effort and (2) a potential reduction of errors and/or inconsistencies in writing boilerplate performance-testing logic. These benefits are especially useful for formatting models for multi-run experiments such as k-fold cross validation and learning curves (which will be introduced in an upcoming LMF release). Further details about compact modeling are in Developing Compact Models.

Notable Framework Capabilities

The LMF consists of an evolving set of capabilities. The following subsections describe notable modeling capabilities supported as of release 6.3.0 RC1.

TensorFlow Estimator-Based Modeling

TensorFlow supports AI modeling using either low-level APIs or easier-to-use high-level Estimator APIs. The LMF is designed to support Estimator-based model development. Keras may be used to create models, especially for enabling more customization; however, such models must be converted to Estimators for LMF and the broader Lucd JedAI platform to manage them appropriately. See the following link for an introduction to TensorFlow Estimators: https://www.tensorflow.org/guide/estimator.

Various Feature Types

For TensorFlow modeling, all dataset feature column types are supported (see https://www.tensorflow.org/guide/feature_columns), enabling support for a broad range of numeric and categorical features. Regarding categorical features, the domain of such a feature must be known at training time. For example, if you choose to use a feature car_make as a categorical feature, you must know all the possible makes when you write your model (a short illustrative sketch appears after this item). This requirement will be removed in a future release. Also, the conversion of non-numerical data to numerical data (e.g., for encoding label/target values) based on a scan of the entire dataset is not supported in the current release; however, to help with this, data value replacement operations are supported in the Unity client. For TensorFlow modeling, label types are assumed to be TensorFlow int32.

Embedding Datasets

For TensorFlow and PyTorch modeling, LMF supports the use of embedding data, e.g., word2vec for representing free text. For PyTorch, the TorchText library is supported, but n-grams are not supported in the current release. Important note: currently, when using text input, only the text/embedding input is allowed as a feature, enabling conventional text classification. Future releases will enable the use of multiple feature inputs alongside text data.

Image Datasets

For TensorFlow and PyTorch modeling, the use of image data (i.e., pixel values) as model input is supported.

Distributed XGBoost Using Dask

Distributed training of XGBoost models using the Dask parallel data analytics framework is supported. See the following link for more information: https://ml.dask.org/xgboost.html. Support for TensorFlow and PyTorch distributed training is under development.

Supported Libraries

The Lucd modeling framework supports the following languages and machine learning-related libraries:

- Python v3.6.5
- TensorFlow (for Python) v2.1
- PyTorch v1.5.1
- TorchText
- Dask
- Dask-ML
- Dask-XGBoost
- Numpy v1.16.4
- Scikit-learn v0.19.2
- Pandas v0.25.1

Since the Dask and Dask-ML libraries are included in the modeling framework, Dask-based distributed Scikit-learn modules should also be supported. However, the usage of such models has not been tested, and hence Dask-based Scikit-learn operation may be unpredictable.

Example Code

While this documentation introduces the core components and best practices for developing AI models for the Lucd JedAI platform, there is rarely a replacement for sample code. The Lucd Model Shop provides a wide range of code (prepared by Lucd engineers) to help developers get started with preparing AI models. In the future, the Lucd Model Shop will also allow the larger Lucd developer community to share their code, further helping others with their AI goals.

Python API Documentation

The LMF Python API documentation can be found at the following Lucd GitLab Pages site: https://lucd.pages.lucd.ai/mock-product-development/lucd-eda-rest/.

Preparing Models Using the Lucd Modeling Framework

The following documentation contains further details and examples for developing AI models for Lucd.

- Developing Compact Models
- Developing Full Models
- Working with Data and Performance Analysis
- The Lucd Model Shop
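
As an illustration of the categorical-feature requirement above (the car_make example), the following sketch declares TensorFlow 2.1 feature columns with the full vocabulary known at training time and passes them to a pre-made Estimator; the feature names, vocabulary, and Estimator settings are hypothetical.

    import tensorflow as tf

    # The full set of categories must be known when the model is written.
    car_make = tf.feature_column.categorical_column_with_vocabulary_list(
        "car_make", vocabulary_list=["ford", "toyota", "bmw", "honda"])

    feature_columns = [
        tf.feature_column.numeric_column("engine_displacement"),
        tf.feature_column.indicator_column(car_make),   # one-hot encode the categorical column
    ]

    # Feature columns are then passed to a pre-made Estimator.
    estimator = tf.estimator.DNNClassifier(
        feature_columns=feature_columns,
        hidden_units=[32, 16],
        n_classes=3)
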
  7. New Features

- Added the "compact modeling" framework, enabling end users to avoid writing boilerplate code (e.g., data retrieval, performance analysis) when preparing models. This now supports PyTorch and TensorFlow.
- Added the capability to produce customizable real-time training update graphs for TensorFlow and PyTorch modeling. Other graphs (e.g., precision-recall) have been added as well.
- Added interactive confusion matrix capabilities, enabling a detailed view of mislabeled testing data, etc.
- Added the capability to evaluate the output of trained models in the JedAI Unity client via the explainability tab. Works for TensorFlow and PyTorch tabular, image, and text classification as well as regression models.

Changes

- Expanded use of the "model_type" parameter to include "tabular_classification," alleviating the need to perform guesswork in code to determine model type.
- Refactored/improved retrieval code for virtual datasets and predictions to account for differences between TensorFlow and PyTorch data.
- TensorFlow models must now return all four of ordered_class_names, ordered_feature_names, input_name, and output_name with compact modeling. PyTorch models must return ordered_class_names with compact modeling.
- Code refactoring to fix input and output mappings with TensorFlow models.
- Miscellaneous code refactoring to make PyTorch modeling less error-prone.
- Enablement of ElasticSearch upgrades and minor related bugfixes.
- Various other bugfixes.
