Showing results for tags '6.3.0'.

  1. The Lucd Python Client:

     • Enables authentication to the Lucd backend and access to data for custom analysis and model prototyping.
     • Enables definition of custom feature transformation operations and their upload to Lucd for use in the GUI.
     • Provides access to the Asset, Concept, Custom Contract, Custom Operation, Explore, Model, UDS, and VDS REST APIs.
     • Provides multiple working examples of individual REST calls, as well as complete models.
     • Enables Dask exploration of the data on the client side.
     • Enables development and testing of models using the Lucd PyTorch, TensorFlow, Keras, and XGBoost classes.
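     As a minimal sketch of client-side Dask exploration, assuming get_dataframe accepts a virtual dataset ID and returns a Dask dataframe (the exact signature may differ; see the Python API documentation):

         from eda.lib import lucd_uds

         # Hypothetical VDS ID; assumes prior authentication to the Lucd backend.
         ddf = lucd_uds.get_dataframe("my_vds_id")

         # Dask evaluates lazily; compute() triggers execution on the client side.
         print(ddf.describe().compute())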
  2. The Lucd platform enables the user to perform model performance analysis. Easy-to-use tools allow the user to view and compare the data before and after it has been trained against a selected model. The platform also tracks the critical governance and explainability information associated with the process.
  3. The Lucd platform provides the necessary, easy-to-use tools for dataset query and data visualization. Query and visualization are important initial steps when preparing data for machine learning, and Lucd makes them a snap.
  4. Lucd enables users to transform large-scale data and provides an easy-to-use method for creating a "virtual dataset" from the data selected for training. This transformed data can then be trained against the required models.
  5. This section covers the high-level tasks needed to support model training: importing and preparing data, and analyzing performance. Most of the content here pertains to the full model approach, but some (e.g., reporting model status) is still helpful for PyTorch compact modeling.

     Importing and Preparing Data

     Data can be imported into a modeling context using the Lucd Unified Dataspace (UDS) API (eda.lib.lucd_uds). This library provides functions for creating datasets of various formats (TensorFlow, PyTorch, Dask dataframe) based on Lucd virtual datasets defined in the Unity client. It also provides the capability to retrieve previously trained word embeddings. The Lucd UDS functions providing data retrieval are listed below (a usage sketch follows the list). Some are used for straightforward data importing (e.g., get_dataframe), while others prepare framework-specific datasets for AI models (e.g., get_tf_dataset for defining TensorFlow datasets). Refer to the API documentation for full function descriptions, and visit the Lucd Model Shop GitLab project for examples of how to use the functions when developing AI models.

     • get_asset
     • get_dataframe
     • get_tf_dataset
     • get_tf_dataset_image
     • get_tf_dataset_text
     • train_eval_test_split_dataframe
     • train_eval_test_split_pytorch
     • train_eval_test_split_tensorflow
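     The sketch below mirrors the train/eval/test split call used in the confusion matrix example later in this section; the VDS ID and split percentages are illustrative, so treat this as orientation rather than a definitive recipe.

         from eda.lib import lucd_uds

         # Split a Lucd virtual dataset into training, evaluation, and testing
         # partitions (TensorFlow flavor); the PyTorch and dataframe variants
         # are analogous.
         (delayed_values_training,
          delayed_values_evaluation,
          delayed_values_testing,
          my_df_testing_label,
          num_features) = lucd_uds.train_eval_test_split_tensorflow(
             "my_vds_id",  # illustrative virtual dataset ID (args['vds'] in a full model script)
             0.2,          # evaluation split percentage
             0.1)          # testing split percentage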
     Important Notes for Implementing Multi-Class Modeling

     TensorFlow offers different approaches to building multi-class models, two prominent ones being pre-made Estimators (https://www.tensorflow.org/tutorials/estimator/premade#overview_of_programming_with_estimators) and more general techniques such as Keras models converted to Estimators. If one-hot encoded data labels are needed (i.e., to match the number of nodes in a neural network output layer), the num_classes parameter should be used when calling the relevant data-retrieval functions (e.g., lucd_uds.get_tf_dataset). Note that most (if not all) TensorFlow pre-made Estimator models (e.g., tensorflow.estimator.DNNClassifier) do not require explicitly one-hot encoded data labels for non-binary modeling, and hence the num_classes argument can be omitted. In the case of TensorFlow Estimators, developers are encouraged to understand how to shape input for the models; the same goes for modeling with PyTorch or XGBoost.

     Analyzing Model Performance

     Post-training performance analysis tasks are supported by the Lucd Machine Learning (ML) API (eda.lib.lucd_ml). This library provides functions supporting automatic execution and reporting of critical performance analysis tasks (e.g., creating confusion matrices, ROC curves), preventing the need to repeatedly write such code. The tables and plots created from these library functions can be viewed in the Unity client after the entire model training process has completed. The Lucd ML functions for performance analysis are listed below. Refer to the API documentation for full function descriptions.

     • get_predictions_classification_pt
     • get_predictions_classification_tf
     • get_predictions_regression_pt
     • get_predictions_regression_tf
     • lucd_precision_recall_curve
     • lucd_roc_curve
     • lucd_confusion_matrix
     • update_plots

     Submitting Performance Analysis Results

     Trained models and metadata can be uploaded to the Lucd backend via the eda.int.train.update function. The following example code illustrates how to use the function.

         model_filename = lucd_uds.zip_model_tf(trained_classifier, serving_input_receiver_fn,
                                                model_id, graph_version, log_dir)

         # Store model graph and performance stats back to the Lucd back-end
         with open(model_filename, "rb") as graph_file:
             train.update({tid: {
                 'performance': {
                     'loss': loss,
                     'accuracy': accuracy,
                     'macro_precision': precision_macro,
                     'macro_recall': recall_macro,
                     'macro_f1': f1_macro,
                     'micro_precision': precision_micro,
                     'micro_recall': recall_micro,
                     'micro_f1': f1_micro,
                     'weighted_precision': precision_weighted,
                     'weighted_recall': recall_weighted,
                     'weighted_f1': f1_weighted,
                     'precision_recall_f1_per_label': results_string,
                 },
                 'graph_version': graph_version,
                 'graph_file': graph_file.read()
             }})

     train.update takes a Python dictionary as the argument, with the train_id, described in Table 1, as the top-level key (tid represents the train_id in the code snippet above). The secondary keys graph_version and graph_file store the graph version and trained graph file (model), respectively. The secondary key performance stores another dictionary of performance values. There is no restriction on the key-value pairs here; the developer may choose the performance values, and they will be viewable in the Unity client afterward. The values shown in the code snippet above are customary for evaluating classification models. Again, see example models in The Lucd Model Shop for more insights.

     As for precision_recall_f1_per_label, an example of how to compute performance statistics per label can be found at https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html. precision_recall_f1_per_label must be formatted as a semicolon-separated string of tuples, with each tuple formatted as "<label>-<precision>,<recall>,<f1>". An example is as follows: setosa-1.0,0.5,1.0;virginica-1.0,1.0,0.6;versicolor-1.0,0.6666666666666666,0.7499999999999999.
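     As a sketch of producing that string with scikit-learn's precision_recall_fscore_support (the function documented at the link above), with illustrative labels and predictions:

         from sklearn.metrics import precision_recall_fscore_support

         labels = ["setosa", "virginica", "versicolor"]  # illustrative class names, ordered by integer label
         y_true = [0, 1, 2, 0, 1, 2]                     # illustrative ground-truth labels
         y_pred = [0, 2, 2, 0, 1, 1]                     # illustrative model predictions

         # Per-label precision, recall, and F1 (support is unused here).
         precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred)

         # Format as the semicolon-separated "label-precision,recall,f1" string
         # expected for precision_recall_f1_per_label.
         results_string = ";".join(
             f"{name}-{p},{r},{f}" for name, p, r, f in zip(labels, precision, recall, f1))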
     Enabling Model Explainability

     To enable a trained model to be used by the explainability tool in the Unity client, some parameters must be defined. For TensorFlow models, ordered_feature_names, ordered_class_names, input_name, and output_name must be defined.

     • ordered_feature_names (not to be confused with training data input column names) is a list of ordered names of the inputs to the trained model, commonly defined in TensorFlow model definitions as tf.feature_column. For example, for a TensorFlow text classification model, the named input might be embedding_input. Please see example code in The Lucd Model Shop.
     • ordered_class_names is a list formatted such that string class names are ordered by their integer representations (the order of outputs from your model). For instance, for a binary classification model whose labels are 0 and 1, the order of strings must be negative and positive (or whatever string labels you choose).
     • input_name is the name of the input layer in your TensorFlow model to which your ordered_feature_names data will be passed.
     • output_name is the name of the output layer in your TensorFlow model (by default these can be named things like 'dense_2' and 'scores'). The output_name is used to retrieve your model outputs in the proper format for explanation.

     PyTorch models only require that ordered_class_names be provided.

     Plots

     Lucd allows users to plot associated model training metrics in the Unity client. The plots update in real time during model training, providing insight into the viability, problems, and successes of training runs.

     Users are able to store any information they want plotted as part of a training run using the eda.lib.lucd_ml.update_plots function. The provided information must abide by the following assumptions:

     1. The top-level keys represent individual plots.
     2. Each inner dictionary must specify a labels and a description key, where labels is ["plot x_label", "plot y_label"] and the description can be any string.
     3. All remaining keys in the inner dictionary will be treated as individual lines on the plot; in the following example, "l1" is a line on the "accuracy" plot.

         {
             "accuracy": {
                 "l1": [[1, 0.10], [2, 0.15]],
                 "labels": ["epoch", "accuracy"],
                 "description": "This is a simple accuracy plot example."
             }
         }

     The individual elements in the "l1" line above represent [x_val, y_val]. A common example, as shown above, is: "l1": [[epoch, accuracy], [epoch, accuracy], ...]. Once created, the dictionary may be stored for plotting with the lucd_ml.update_plots function:

         def update_plots(train_or_eval: bool, tid: str, dictionary: dict) -> str:
             ...

         dictionary = {
             "accuracy": {
                 "l1": [[1, 0.10], [2, 0.15]],
                 "labels": ["epoch", "accuracy"],
                 "description": "This is a simple accuracy plot example."
             }
         }

         update_plots(True, train_id, dictionary)

     Regarding the update_plots function, train_or_eval allows a user to specify whether the plot is part of the training or the evaluation (or validation) cycle of model training: train=True, eval=False.

     A TensorFlow hook is provided in lucd_ml for automatically parsing generated events files (the same as used by TensorBoard) and passing them to update_plots as part of a TensorFlow model. It can be provided as part of a TensorFlow EvalSpec or TrainSpec object as follows (stub included for reference):

         class LucdTFEstimatorHook(tf.estimator.SessionRunHook):
             def __init__(self, train_hook: bool, log_dir: str, tid: str, freq: int, last_epoch: int):
                 ...

         train_spec = tf.estimator.TrainSpec(
             input_fn=lambda: lucd_uds.get_tf_dataset_image(
                 type_dict, training_data, num_features, target_type,
                 num_classes).repeat(count=None).shuffle(30).batch(30),
             max_steps=training_steps,
             hooks=[lucd_ml.LucdTFEstimatorHook(train_hook=True, log_dir=log_dir,
                                                tid=tid, freq=10, last_epoch=training_steps)])

     train_hook allows a user to specify whether the hook is providing train or eval metrics (train=True, eval=False). log_dir tells the hook where to find the TensorFlow events files. freq is the frequency with which the hook should look for metrics in the events files. last_epoch tells the hook the number of epochs being run, so the hook can ignore freq for the last epoch.

     Two last helper functions are provided as part of the Lucd plotting framework: lucd_roc_curve and lucd_precision_recall_curve. These functions generate ROC curves and precision-recall curves, respectively, and are called selectively when using the compact modeling approach (enabled by the "performance_curves" button in Unity). Further documentation for these functions is provided in the API.

         def lucd_roc_curve(truths: list, scores: list, class_list: list, tid: str = None,
                            write_accumulo: bool = False) -> (dict, dict, dict, dict):
             ...

         def lucd_precision_recall_curve(truths: list, scores: list, class_list: list, tid: str = None,
                                         write_accumulo: bool = False) -> (dict, dict, dict, dict):
             ...

     Confusion Matrix

     Lucd provides an in-depth, interactive confusion matrix for classification model evaluation. Users may select a square in the Unity client to be shown the actual records associated with the selected square.
     This may be enabled by using the following function:

         def lucd_confusion_matrix(test_set: list or DataLoader, predictions: list, num_classes: int,
                                   label_mapping: type(abs), tid: str, write_accumulo: bool) -> (dict, str):
             ...

     Details on the function arguments are provided below.

     • test_set: Users may directly pass the PyTorch DataLoader or the list of delayed Dask dataframes returned from the respective train_eval_test_split_pytorch/tensorflow function.
     • predictions: A list of predictions generated by your model (the list returned from the lucd_ml get_predictions_classification functions). The list must be in the same order as the test_set data.
     • num_classes: An integer number of classes for the confusion matrix to represent.
     • label_mapping: A function mapping integers to class labels, used to map predictions to a human-readable format.
     • tid: Training ID to associate the confusion matrix with.
     • write_accumulo: Boolean specifying whether to write the dictionary directly to Accumulo. Must be True to enable the Unity client confusion matrix. If False, the generated confusion matrix will only be returned and not written to the database.

     Further documentation for this function exists in the API documentation. Here is an example usage:

         def _label_mapping():
             return {0: 'I. versicolor', 1: 'I. virginica', 2: 'I. setosa'}

         ...

         # Prepare VDS data for modeling
         delayed_values_training, delayed_values_evaluation, delayed_values_testing, \
             my_df_testing_label, num_features = lucd_uds.train_eval_test_split_tensorflow(
                 virtual_dataset_id, evaluation_dataset_percent, testing_dataset_percent)

         ...

         predictions, scores = lucd_ml.get_predictions_classification_tf(
             _estimator,
             lambda: lucd_uds.get_tf_dataset(feature_dict, delayed_values_testing,
                                             num_features, target_type).batch(1),
             classification_mode, 0.5)

         ...

         lucd_ml.lucd_confusion_matrix(delayed_values_testing, predictions, 3, _label_mapping, tid, True)

     Submitting Model Training Status

     Another helpful function is eda.int.train.status, which is used for storing the status of a developer's training pipeline. This enables a model's status to be displayed in the Unity client. The function definition is below.

         def status(uid, code, message=None):
             """Update model status in the database.

             Args:
                 uid: Int representing a model's ID.
                 code: 0 - RETRIEVING VDS, 1 - TRAINING, 2 - ANALYZING PERFORMANCE,
                       3 - STORING MODEL, 4 - TRAINING COMPLETE, 5 - ERROR,
                       6 - QUEUED, 7 - STOPPED.
                 message: String representing optional custom message to include.

             Returns:
                 Status message.

             Raises:
                 TypeError: If code is not of type int.
                 Exception: If code is invalid.
             """
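     As a hedged sketch of how a training pipeline might report status with these codes (the model ID and message are illustrative):

         from eda.int import train

         model_id = 42  # illustrative; a full model script would derive this from its args

         try:
             train.status(model_id, 0)  # RETRIEVING VDS
             # ... retrieve the virtual dataset ...
             train.status(model_id, 1)  # TRAINING
             # ... train the model ...
             train.status(model_id, 2)  # ANALYZING PERFORMANCE
             # ... compute metrics, confusion matrix, plots ...
             train.status(model_id, 3)  # STORING MODEL
             # ... upload results via train.update ...
             train.status(model_id, 4)  # TRAINING COMPLETE
         except Exception:
             train.status(model_id, 5, message="Training failed")  # ERROR
             raise  # full model except blocks must end with raise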
  6. The Lucd Modeling Framework (LMF) enables developers to build and interface custom AI models with the Lucd JedAI platform for streamlined management, experimentation, and training using data and parameters established in the JedAI Unity client (or simply, Unity client). The framework supports Python-based AI models built with TensorFlow, PyTorch, and Dask XGBoost. LMF's Python libraries support the following tasks:

     • accessing Lucd virtual datasets (VDSes) for model training and evaluation,
     • analyzing and reporting model performance metrics (e.g., with confusion matrices, ROC curves),
     • storing structures representing trained models and training checkpoints.

     Model Development Approaches

     LMF provides flexibility in the level of effort and control needed for preparing models for Lucd. The two approaches are the full and compact model approaches; their differences are illustrated in Figure 1.

     Figure 1. Conceptual illustration of full and compact model approaches.

     Full Model Approach

     In the full model approach, a developer creates an AI model and manually uses LMF Python libraries to complete the model training workflow (e.g., train, validate, holdout data testing, store results). This enables complete flexibility for more advanced use cases, which might include designing complex or experimental training loops, advanced performance analysis, custom model compression, etc. Full models are implemented as normal Python scripts. Further details are in the Developing Full Models section of this documentation.

     Compact Model Approach

     The compact model approach enables a developer to focus most if not all effort on defining an AI model, leaving other workflow tasks like holdout data testing and storage of performance results for the LMF to do automatically behind the scenes. In the case of TensorFlow, the developer does not even need to write training logic. The major benefits of the compact model approach are (1) significantly less coding effort and (2) potential reduction of errors and/or inconsistencies in writing boilerplate performance-testing logic. These benefits are especially useful for formatting models for multi-run experiments such as k-fold cross validation and learning curves (which will be introduced in an upcoming LMF release). Further details about compact modeling are in Developing Compact Models.

     Notable Framework Capabilities

     The LMF consists of an evolving set of capabilities. The following subsections describe notable modeling capabilities supported as of release 6.3.0 RC1.

     TensorFlow Estimator-Based Modeling

     TensorFlow supports AI modeling using either low-level APIs or easier-to-use high-level Estimator APIs. The LMF is designed to support Estimator-based model development. Keras may be used to create models, especially for enabling more customization; however, such models must be converted to Estimators for the LMF and the broader Lucd JedAI platform to manage them appropriately (a conversion sketch follows). See the following link for an introduction to TensorFlow Estimators: https://www.tensorflow.org/guide/estimator.
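     As a sketch of that conversion using TensorFlow's standard model_to_estimator API (the model architecture shown is illustrative, not a Lucd requirement):

         import tensorflow as tf

         # Define and compile a simple Keras model; layers and shapes are illustrative.
         keras_model = tf.keras.Sequential([
             tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
             tf.keras.layers.Dense(3, activation="softmax"),
         ])
         keras_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

         # Convert to an Estimator so the LMF and the broader JedAI platform can manage it.
         estimator = tf.keras.estimator.model_to_estimator(keras_model=keras_model)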
     Various Feature Types

     For TensorFlow modeling, all dataset feature column types are supported (see https://www.tensorflow.org/guide/feature_columns), enabling support for a broad range of numeric and categorical features. Regarding categorical features, the domain of such a feature must be known at training time. For example, if you choose to use a feature car_make as a categorical feature, you must know all the possible makes when you write your model. This requirement will be removed in a future release.

     Also, the conversion of non-numerical data to numerical data (e.g., for encoding label/target values) based on a scan of the entire dataset is not supported in the current release. However, to help with this, data value replacement operations are supported in the Unity client. For TensorFlow modeling, label types are assumed to be TensorFlow int32.

     Embedding Datasets

     For TensorFlow and PyTorch modeling, LMF supports the use of embedding data, e.g., word2vec for representing free text. For PyTorch, the TorchText library is supported, but n-grams are not supported in the current release.

     Important Note: Currently, when using text input, only the text/embedding input is allowed as a feature, enabling conventional text classification. Future releases will enable the use of multiple feature inputs alongside text data.

     Image Datasets

     For TensorFlow and PyTorch modeling, use of image data (i.e., pixel values) as model input is supported.

     Distributed XGBoost Using Dask

     Distributed training of XGBoost models using the Dask parallel data analytics framework is supported. See the following link for more information: https://ml.dask.org/xgboost.html. Support for TensorFlow and PyTorch distributed training is under development.

     Supported Libraries

     The Lucd modeling framework supports the following languages and machine learning-related libraries:

     • Python v3.6.5
     • TensorFlow (for Python) v2.1
     • PyTorch v1.5.1
     • TorchText
     • Dask
     • Dask-ML
     • Dask-XGBoost
     • Numpy v1.16.4
     • Scikit-learn v0.19.2
     • Pandas v0.25.1

     Since the Dask and Dask-ML libraries are included in the modeling framework, Dask-based distributed Scikit-learn modules should also be supported. However, the usage of such models has not been tested, and hence Dask-based Scikit-learn operation may be unpredictable.

     Example Code

     While this documentation introduces all the core components and best practices for developing AI models for the Lucd JedAI platform, there is rarely a replacement for sample code. The Lucd Model Shop provides a wide range of code (prepared by Lucd engineers) to help developers get started with preparing AI models. In the future, the Lucd Model Shop will also allow the larger Lucd developer community to share their code, further helping others with their AI goals.

     Python API Documentation

     The LMF Python API documentation can be found at the following Lucd GitLab Pages site: https://lucd.pages.lucd.ai/mock-product-development/lucd-eda-rest/.

     Preparing Models Using the Lucd Modeling Framework

     The following documentation contains further details and examples for developing AI models for Lucd.

     • Developing Compact Models
     • Developing Full Models
     • Working with Data and Performance Analysis
     • The Lucd Model Shop
  7. The Lucd JedAI Client is downloaded locally on your device and interfaces with the Lucd Platform. The client enables users to visualize, transform, and prepare data for use in modeling frameworks (TensorFlow, PyTorch, etc.). Models can be uploaded to and trained in the platform. The client is touchscreen capable, though a touchscreen is not required.

     System Requirements

     The following specifications are required in order to run the client.

     • Windows, Linux, or macOS
     • 4 GB memory
     • Modern CPU

     Recommended Configuration

     Although not required, we recommend the following specifications in order to maximize the performance of the client.

     • A GPU to support accelerated rendering
     • 1600x900 display resolution minimum

     Installation Instructions

     The client is distributed via Lucd's Steam Store. A user is required to obtain a Steam account in order to access the client download.

     Usage Instructions

     Login

     Log on to the client using the credentials provided to you (username, password, and domain), then select Login; an Exit Application button is also available. Cloud customers leave the domain field blank when logging in. Private build customers will be provided a domain to use when logging in.

     Navigation

     After successful authentication, the user is brought to the Home screen. The buttons along the left edge (Home, Data, Modeling, Assets, Governance, Epidemiology, Collapse Sidebar, Options, Logout) navigate to other 2D screen overlays. The buttons in the right corner (including Reset Perspective) manipulate camera perspective and visualization behavior.

     Home

     The Home screen has numerous features. The primary feature of the Home screen is the Sources histogram, displaying the ingested record sources, the number of records per source, and date/time information relating to the ingested records.

     • List of currently visible ingested sources.
     • Source histogram of the ingestion timeline for each visible source. The actual data of the records are not displayed in the histogram.
     • Click-and-drag date filter: to narrow the range of shown data, click and drag the date filters to expand that window of time. When browsing the sources histogram, the Lucd JedAI Client makes it easy to drill down on a time range of ingestion across all sources, down to the hour.
     • Source toggles: sometimes, a source may have had so much data ingested at a single time that it skews the histogram display scaling. In these cases, another useful function of the Home screen is the ability to hide that specific source from the histogram display by clicking its axis label. The chart will then automatically re-scale the remaining visible data in the histogram, allowing a better, proportional chart display.
     • Selectable bars: to expand a single unit of time on the graph (e.g., to see a single year of data), click on a bar to zoom in across all sources and change the axis scale to that unit of time. The Lucd JedAI Client allows scaling down to the day, so that a 24-hour period can be seen across all sources on the histogram.
     • Active filters

     Data & Visualization

     The Data and Visualization screen is where users query, visualize, perform Exploratory Data Analysis (EDA) functions, and transform the dataset into a Virtual Data Set (VDS) to be used with a machine learning model. The screen initially opens with a blank panel on the right and the Query option selected on the left.

     Query Data

     To execute a query, begin by navigating to the Query tab. The Lucd JedAI Client provides four ways of querying data: Sources, Facets, Keywords/Dates, and Concepts. These can be combined to get a very specific result set.

     • Data Sources: to narrow which sources are queried, navigate to Sources and select the boxes of the desired sources. By default, none are selected, and so all sources will be queried.
     • Facets filter: to filter by data facet, navigate to Facets, select the drop-down of a data model, and click "Add Filter" next to the desired facet. Selected facet filters will show up below the available facets list. To remove a facet filter, click its red X button. Multiple filters on the same facet are possible.
     • Keywords/Dates filter: to search by keyword/date-time range, navigate to the Keywords/Dates tab and enter values in the desired fields.
     • Concepts filter: to search by concepts, navigate to the Concepts tab and enter a keyword into the first input field. Optionally, specify a similarity threshold in the second input field; acceptable values range from 0 to 1. A list of concepts will display below the input field. Select one to see similar concepts in a list below the threshold input.
     • Lucene Query: Lucene queries can be run directly from the client.
     • Execute Search: once query parameters have been specified, click search to see a basic table of the resulting dataset.
     • Reset Parameters: click this button to reset search parameters.

     Note: Sources are OR'd together and Facets are AND'd together in their own sub-blocks before being combined inside a parent 'must' or 'must_not' block of the query with other parameters.

     Visualizations

     The Visualize tab provides numerous ways to view your data. To load a visualization, select it from the list.

     • Table: to see a table structure of all fields in the query results, use Table. This displays all fields of each record. To see more in-depth detail about a record, select it from the table.
     • Scatterplot: to see a structured plot of the query results, use Scatterplot. This can display numerical and categorical data in an interactive plot. Drag the plot to rotate it, and scroll/pinch to zoom; on a non-touchscreen device, use the mouse scroll wheel or the keyboard shortcut for zooming. The red orbs are the data points in space. The blue squares exist on all six walls of the plot and represent a projection of all data points on a given 2D plane. Each point can be selected to see its x, y, and z values; click on the dialog box to close it. Select the zoom button to focus on a point and see more of its facets. Numerical fields can have a range applied to them by dragging the filter handles. To re-center the plot, select Recenter. To remove categorical filters, select Reset. To toggle the graph rotation, select Rotate. The x, y, and z fields can be changed from their respective drop-downs. Categorical data can be filtered by selecting Filter and toggling the desired values. Click Submit to apply changes.
     • Parallel Coordinate Plot: to see trends in features across a result set, use a Parallel Coordinate Plot. Each line from end to end represents a record. Drag to rotate, and move along the length of the plot by holding shift and dragging or by dragging with two fingers. Each blue plane represents a field, with the field minimum and maximum at its edges. To re-center the plot, click the Recenter button. To add an additional field, click the [+] button. Each field can be changed via its drop-down, moved up or down in its display order, or removed.
     • Histogram: to see how data is distributed across values, use a Histogram. To add an additional field histogram, select the [+] button; at this time, only numerical fields will be automatically added. The collection of charts can be scrolled across by dragging. To change a chart's field, select it from the drop-down. To remove the chart, select the [X] button. Each chart can be filtered by dragging the yellow handles; these filters will be applied across all open charts, and the new maximum and minimum will be displayed below.
     • Box Plot: to see the statistical distribution across values, use a Box Plot. To add an additional field box plot, select the [+] button; at this time, only numerical fields will be automatically added. The collection of charts can be scrolled across by dragging. To change a chart's field, select it from the drop-down. To remove the chart, select the red [X] button.
     • 2D Scatterplot: to see the values distributed on an XY plane, use a 2D Scatterplot. To add an additional scatterplot, select the blue [+] button. The collection of charts can be scrolled across by dragging. To change a chart's field, select it from the drop-down. To remove the chart, select the red [X] button.
     • Pearson Correlation: to see how each field relates to all the other fields, use a Correlation Matrix. Only numerical fields are displayed. The matrix can be rotated by dragging. Each bar is scaled on its y-axis according to how its two contributing fields relate, on a scale of -1 (red) to 1 (blue). Select a bar to see more information about it; select it again to hide the details.

     Exploratory Data Analysis

     The Exploratory Data Analysis (EDA) tab is where data can be transformed and shaped before it is used to train a model. Once a query is run, its results can be shaped and filled using EDA.

     • Create Tree: to begin EDA on the most recent search, select the floppy disk icon.
     • Existing Trees: saved searches for EDA will appear in the scroll view.
     • Tabletop: once a saved search has been selected, it will show up on the tabletop.
     • Operations: EDA operations that have been added to the saved search will show up as white nodes.
     • Menu: clicking a node will bring up a menu of available options for that node.
     • Statistics: to see overview statistics on a selected node, choose it from the dropdown.

     Save VDS

     When saving a Virtual Dataset, complete the creation process by entering a name and description, selecting the features to include, and indicating whether the data should be persisted.

     New Op

     The Lucd JedAI Client provides flexible options for data transformation without having to leave the GUI.

     • Operation Type: when adding an operation to a saved search during EDA, choose between standard operations like Fill/Filter/Replace, NLP operations, custom defined operations, and image-specific operations.
     • Operation Selection: select the desired operation from the dropdown.
     • Operation Parameters: parameters must be specified before saving an operation.

     Preparing Text Data for Model Training

     Lucd provides special operations for easily preparing text data for model training, saving a model developer valuable time in manually coding routines for text transformation. After creating an EDA tree based on a query of a text data source, a developer can add a new operation to the tree based on NLP operations as shown above. NLP operations (e.g., stopword removal, whitespace removal, lemmatization) can be applied in any sequence. It's important to select the correct facet as the "text attribute." One can also elect to apply tokenization at the document level (i.e., create one sequence of tokens for the entire facet value per record) or the sentence level (i.e., create a token sequence per sentence in the facet for a record).

     Saving VDS with Processed Text

     When a developer wants to create a new virtual dataset including the transformed text data, they must choose the "processed_text" facet as the sole feature of the virtual dataset, as shown below. Currently, Lucd does not support text model training incorporating multiple feature columns, so only the "processed_text" facet must be selected.

     Applying Custom Operations

     Once custom operations have been defined and uploaded using the Lucd Python Client library, they are available in the GUI for usage in data transformation. As shown above, clicking on a custom operation will show further details, specifically the features the operation uses as well as the actual source code defining the op. As mentioned in the documentation for defining custom operations via the Lucd Python Client, one must select how to apply the operation based on one of the following three Dask dataframe approaches (a sketch follows this list):

     • apply
     • map_partitions
     • applymap
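     For illustration only, here is a hypothetical custom operation written for the map_partitions approach; the feature name is invented, and the actual definition and upload workflow is documented with the Lucd Python Client.

         import pandas as pd

         def min_max_scale_partition(df: pd.DataFrame) -> pd.DataFrame:
             """Hypothetical custom op: scale a numeric feature to [0, 1] within a partition.

             With the map_partitions approach, the function receives one Dask
             partition as a pandas DataFrame and returns the transformed DataFrame.
             """
             col = "my_feature"  # illustrative feature name selected in the GUI
             span = df[col].max() - df[col].min()
             if span > 0:
                 df[col] = (df[col] - df[col].min()) / span
             return df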
     Image Workflows

     The Lucd framework supports image-based workflows. Binary image data contained within fields of a record will automatically be rendered in the 3D client. The images below are from the Stanford Dogs dataset.

     Applying Image Operations

     To apply image operations, select the Image Ops tab within the New Op menu in an EDA tree. It's important to select an image facet as the "Feature." The currently provided operations are as follows:

     • Vertical and horizontal flips
     • Grayscale
     • Contrast normalization
     • Normalize (0 mean and unit variance)
     • Resize width & height
     • Color inversion
     • Crop borders
     • Gaussian blur
     • Rotate
     • Min-max scaling
     • To array (converts binary data to Numpy array)
     • Reshape dimensions

     Operations can be applied to percentages of a dataset instead of the entirety, and can also be used to augment existing data instead of operating in-place.

     Modeling

     The Lucd JedAI Client provides an intuitive and practical dashboard for data science/machine learning modeling.

     • View Select: on the Modeling screen, review available model definitions by selecting that option from the dropdown.
     • Model Upload: button to upload new Python model files (TensorFlow, PyTorch, XGBoost, etc.).
     • Refresh: select to retrieve model statuses from the backend and refresh the GUI.
     • Existing Model Definitions: model definitions are displayed in the center.
     • Status Indicator Lights: each model will indicate if it has models in training, training complete, or errors.
     • Filters: display only models matching the selected filters (TensorFlow, XGBoost, Classification, etc.).
     • Group/Sort: drop-down boxes for defining definition grouping and sorting.
     • Distribution: model library and type distribution can be seen at the bottom.
     • Model Details: the currently selected model's details can be seen on the right.
     • Train: to begin training the selected model, click "START TRAINING."
     • Performance: see all training runs for a selected model by viewing the performance analysis.

     Start Training

     Training runs require the selection of a VDS and the specification of parameters/assets.

     • Asset: to set up a training run, begin by selecting an Asset to include, if any.
     • Virtual Dataset: choose an existing VDS to train against.
     • Parameters: set the parameters for the training run.

     Trained Models

     Trained models can also be inspected within the dashboard.

     • View Select: to review a training run, first select "Trained Models" from the dropdown.
     • Training Runs: select a training run and view its details. The current status of the run is designated by the colored corner of the list item.
     • Training Artifact Files: download run artifacts.
     • Model Profile View: see more metrics about the training run in real time.
     • Governance: submit for governance approval.
     • Stop/Restart: pause a model and restart it. Can also be used to begin a new run after a training run has completed.

     Model Profile: Performance

     Trained model performance can be viewed more closely in real time.

     • View Performance: to view training performance, first select the "Performance" tab.
     • Update Interval: number of seconds between plot updates. The number of points displayed is limited to 1000 to keep updates consistent. Default is 10.
     • Available Plots Selector: dropdown showing available plots to display.
     • Selected Plot Description: hover to view the user-provided description for the plot.
     • Plot Legend: color codes individual lines in a plot for easy recognition.
     • Plot Line Toggles: turn lines on or off; adjusts the axes as well.

     Model Profile: Confusion Matrix

     Trained models can generate confusion matrices for analysis.

     • View Confusion Matrix: to view a confusion matrix, first select the "Confusion Matrix" tab.
     • Interactable Square: click a square to see details about actual and predicted values. Values are only displayed in a square if greater than 0.
     • Percentage Toggle: display prediction percentages across a row's values.

     Model Profile: Confusion Matrix Details

     Confusion matrix boxes, when clicked, show more details about the values.

     • Predicted Value: click any orange histogram bar to see sample predictions.
     • Sample Predictions: a table of sample predictions from a selected histogram bar.

     Model Profile: Predict

     Models with type tabular_classification or regression can predict output based on JSON input.

     • View Predictions: to view predictions, first select the "Predict" tab.
     • Enter JSON Input: enter values to predict on. Must be valid JSON.
     • Click Predict: run the predict operation.
     • View Results in JSON: output is formatted JSON.

     Model Profile: Explain - Tabular/Regression

     Models with type tabular_classification or regression can explain tabular data predictions.

     • View Explainability: to view explainability, first select the "Explainability" tab.
     • Enter JSON Input: enter values to predict on. Must be valid JSON, as shown above.
     • % of Training Data Used to Build Explainer: percentage of training data used to build the explainer. Must be greater than 0 and less than or equal to 1.
     • Number of Top Explanations: positive integer denoting how many class explanations to show.
     • Inputs and Corresponding Features: colored to show how each influences the top class prediction.
     • Class Probabilities: class predictions and corresponding likelihood.
     • Explanation: how each input influences a positive or negative prediction.

     Model Profile: Explain - Images

     Models with type image_classification can explain image predictions.

     • View Explainability: to view explainability, first select the "Explainability" tab.
     • Sample Image: select a local image to explain.
     • Positive Only: if True, include only regions of the image contributing to the predicted label.
     • Hide Rest: if True, make the non-explanation part of the returned image gray.
     • Class Probabilities: class predictions and corresponding likelihood.
     • Colorized Explained Image: returned image with shaded regions of positive and negative influence.

     Model Profile: Explain - Text

     Models with type text_classification can explain text predictions.

     • View Explainability: to view explainability, first select the "Explainability" tab.
     • Input Text: text the user would like to predict and explain.
     • Predicted Probabilities: class probabilities predicted.
     • Explanation: words that contribute to positive or negative correlation.
     • Output Text: output text with class probabilities highlighted in positive (orange) or negative (blue) colors.

     Assets

     The Assets page provides a singular look at all existing user "Assets" (e.g., VDS, Embeddings).

     • View Select: to see available Virtual Datasets, select them from the dropdown.
     • Usage: counters and indicator lights displaying the training run usage of an Asset.
     • Pre-Op Heatmap: heatmap before running the selected EDA operations.
     • Post-Op Heatmap: heatmap after running the selected EDA operations.
     • Operations: EDA operations applied to the Asset.
     • EDA Tree View: the VDS can be viewed in the context of its parent saved search by clicking 3D.
     • Embedding: create an embedding from the given VDS (discussed below).

     Embeddings

     The Lucd JedAI Client provides the ability to easily generate word embedding Assets for use in modeling.

     • View Select: to see available Embeddings, select them from the dropdown.
     • Download: embeddings can be downloaded locally.
     • PCA/TSNE: view PCA/TSNE charts for the selected embedding.
     • Restart: restart the embedding training here.

     PCA/TSNE

     Embeddings can be viewed using PCA/TSNE techniques for visualization.

     • Style: when viewing an embedding's PCA/TSNE, click to see terms instead of points.
     • Region Select: toggle to select a cluster of points using a bounding box.
     • Multiple Select: use to add multiple bounding boxes.
     • Word Search: search for a term. All matching terms will be highlighted, as well as shown in a list to the right, until there is only one matching term.
     • Filter: narrow the number of occurrences for a term to a range.
     • Technique Select: toggle between PCA and TSNE.

     Governance

     The Governance view illustrates what data, data transformations, and assets (e.g., VDS, word embeddings) were used as inputs to training a given model. The value is that a user can quickly gain insight into what data caused a model to yield certain performance results. The following figure shows an overview of the Governance view. The main panel in the middle illustrates, for a selected model, what data and assets were used for training the model. The top half of the view shows information about the data which was used to create a virtual dataset for training the model.

     • Submitted Models: the main panel on the left-hand side displays what models are available for viewing in the Governance view. The dropdown menus at the top allow the user to select from models based on their governance approval status (i.e., "pending approval," "approved," or "rejected") as well as sort the models based on various criteria.
     • Query: this represents the query that was used to generate the initial dataset, whether for the purposes of model training data or word embedding generation. Clicking the query will show query details at the bottom of the view.
     • Transformation: this represents the transformations performed on the initial dataset to establish either a virtual dataset (as in the case of training a model) or a word embedding. These are the same transformations that were applied in the exploratory data analysis section of the tool. Clicking the transformation box will show details at the bottom of the view, as shown in the figure below.
     • Heatmap: visualization of selected attributes (or facets) of queried or transformed data for virtual datasets or word embeddings. Dropdown selectors underneath each visual enable a user to customize the visualization ("feature 1" selects data for the y-axis and "feature 2" selects data for the x-axis). The "metric" selector chooses what statistic of the selected data to use for defining the heatmaps; in the current release, only total "counts" are available. Clicking "fetch metrics" will populate the visualization. Comparing and visualizing data heatmaps (or distributions) before and after a set of transformations is helpful for governance purposes, since it can reveal, for example, whether data biases exist and what transformation operations might have introduced them.
     • Embedding Details: illustrates the name of the asset resulting from word embedding generation. The bottom half of the view shows details about word embedding data for models which require embeddings for training.
     • Trained Model: this represents the trained model after all previous operational flows are complete.
     • Metadata & Performance Statistics: information like start/end time, model type, assets used, and training parameters is displayed here.
     • Submit Report: clicking the green button enables the user to submit a governance report, either approving or rejecting the model for usage.

     Explainability Analysis

     Lucd provides the ability to visualize "explanations" of a model's output given specific inputs. Generally, explanations take the form of computed attribute weights, indicating the significance that an attribute gave to a model's decision. This supports the ability to either debug a model or scrutinize the data fed to the model. This particular feature is supported by integration of the Lime framework. The figure below illustrates the explainability panel in the Governance view. This panel is displayed when the user clicks the model element in the Governance view. Currently, model explainability only works for text classification models; support for tabular data and image data will be available soon.

     • Input Text: for analyzing a text classification model, the user enters sample text into the input text box and clicks the "explain text" button underneath the box. The time required to run explanation analysis depends on the amount of text entered and the complexity of the model.
     • Probability Output: a simple bar chart showing the probabilities of a given model's outputs. In the figure, the classes are "negative" and "positive"; however, more classes may be displayed depending on the model. The class labels are obtained from the labels returned by a user's model, as explained in the documentation for the Lucd modeling framework.
     • Features Output: illustrates the weights of the most significant features determined to affect the model's output. For instance, referring to the figure, the tag "<UNKNOWN>" is highly indicative of a piece of text (in this case, a movie review) having a "negative" sentiment. The user is encouraged to try multiple examples to understand the explainability feature.
     • Output Text: the text on the right shows the major features (words) highlighted in the text. Note that the text shown is the text processed by the transformation operations for the embedding creation (which the user specified when using NLP operations before creating the embedding set). This is so that the user understands what is done to the text before it is input to a model, which might offer extra insight into the model's decision logic.

     Epidemiology

     Lucd provides the ability to visualize epidemics and supply chain breakdowns on a map. Trained models can predict future infection rates and supply shortages down to the census tract level.

     • Train: to start training an Epidemiology model, click "Train Model…".
     • Trained Models: a list of previously trained models can be found here.

     Train Epidemiology Model

     This view appears after selecting "Train Model" in the previous view.

     • Dataset: to finalize the training setup, a dataset to train against must be selected.
     • Parameters: enter any custom parameters for training.
     • Confirm

     3D Map View

     Selecting a trained Epidemiology model will display a 3D map view.

     • Map View: census tracts, counties, and states can all be displayed.
     • Details: information regarding a selected region on the map.
     • Disease Statistic: selecting the disease statistic changes the value used when polygons are extruded.
     • Civilian Features: selecting civilian features displays a bar chart value on each census tract.
     • Search: the map can be searched to snap to a specific location.
     • Style: the map style can be changed via a drop-down menu.
     • Extent: configuration for the extent of the map.
     • Terrain: toggle switch for map terrain.
     • Save Settings: the current zoom level and location can be saved to the model object to reload later.
     • Polygon Extrude: polygon extruding can be toggled to make the underlying map easier to read.
  8. New Features

     • Added the "compact modeling" framework, enabling end-users to avoid writing boilerplate code (e.g., data retrieval, performance analysis) when preparing models. This now supports PyTorch and TensorFlow.
     • Added capability for producing customizable real-time training update graphs for TensorFlow and PyTorch modeling. Other graphs (e.g., precision-recall) have been added as well.
     • Added interactive confusion matrix capabilities, enabling a detailed view of mis-labeled testing data, etc.
     • Added capability to evaluate output of trained models in the JedAI Unity client via the explainability tab. Works for TensorFlow and PyTorch tabular, image, and text classification as well as regression models.

     Changes

     • Expanded use of the "model_type" parameter to include "tabular_classification," alleviating the need to perform guesswork in code to determine model type.
     • Refactored/improved retrieval code for virtual datasets and predictions to consider differences between TensorFlow and PyTorch data.
     • TensorFlow models must now return all four of ordered_class_names, ordered_feature_names, input_name, and output_name with compact modeling. PyTorch models must return ordered_class_names with compact modeling.
     • Code refactoring to fix input and output mappings with TensorFlow models.
     • Various other bugfixes.
     • Miscellaneous code refactoring to make PyTorch modeling less error-prone.
     • Enablement of ElasticSearch upgrades; minor related bugfixes.
  9. With the full modeling approach, developers implement their model training routines in a Python script file and directly use Python functions for defining and using PyTorch datasets, TensorFlow datasets, and Dask dataframes (for XGBoost modeling) based on Lucd virtual datasets (defined in the Unity client). Additionally, a developer must call functions for uploading trained models and metadata (e.g., model training performance metrics) to the Lucd backend. The advantage of the full model approach is that developers are free to carry over modeling and customized and/or experimental performance analysis techniques from previously written code. Full model examples are contained in The Lucd Model Shop.

     Full Model Format

     Full models are implemented as Python scripts. Rather than using a main function, the code's entrypoint function must be called start; a minimal skeleton follows Table 1 below. The arguments passed to start are described in the sections below. As a further note, in full model scripts, except blocks (for handling exceptions) MUST end with the raise statement, as opposed to another terminating statement like return. This ensures that the status of the model is accurately captured in the Unity client.

     TensorFlow and PyTorch

     Table 1 describes the Python arguments (defined in the Unity client when starting model training) which are always passed to the start function for TensorFlow and PyTorch models.

     • args['model'] (string): Model ID, used for storing checkpoints and models to the Lucd backend
     • args['train_id'] (string): Model "training" ID, used for storing the trained model asset to the Lucd backend
     • args['vds'] (string): Lucd virtual dataset ID, used for retrieving training/validation/testing data for model training
     • args['asset'] (string): Asset (word embedding) ID, used for retrieving word embeddings for text classification model training
     • args['parameters']['steps'] (int): Number of steps for model training
     • args['parameters']['lr'] (float): Learning rate for model training
     • args['parameters']['regularization_value'] (float): Regularization value for model training
     • args['parameters']['eval_percent'] (float): Percentage of the virtual dataset to use for validation
     • args['parameters']['test_percent'] (float): Percentage of the virtual dataset to use for testing
     • args['parameters']['classification_mode'] (string): Type of classification (binary, multiclass, tf_premade_multiclass) as selected in the GUI (only applies to classification models)
     • args['parameters']['prediction_threshold'] (float): For binary classification models, the minimum threshold for designating a positive decision
     • args['parameters']['max_document_length'] (int): Maximum number of tokens to be used for free text input into the model for training (for text classification)
     • args['exportdir'] (string): Directory used for storing the trained model (for upload purposes)
     • args['graphversion'] (string): Version of the graph being trained

     Table 1. Full model Python script arguments for TensorFlow and PyTorch models.
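     A minimal skeleton of a full model script following these conventions (the argument keys come from Table 1; the training logic itself is elided):

         from eda.int import train
         from eda.lib import lucd_uds

         def start(args):
             """Entrypoint invoked by the Lucd backend; there is no main function."""
             model_id = args['model']
             tid = args['train_id']
             vds = args['vds']
             steps = args['parameters']['steps']
             try:
                 # ... retrieve data via lucd_uds, train the model, analyze
                 # performance, and upload results via train.update ...
                 pass
             except Exception:
                 # Except blocks in full model scripts MUST end with raise so the
                 # Unity client reflects the correct model status.
                 raise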
     Dask XGBoost

     Table 2 describes the Python arguments passed to the start function for Dask XGBoost models.

     • args['booster'] (string): XGBoost booster type
     • args['objective'] (string): Learning task and the corresponding learning objective
     • args['base_score'] (float): The initial prediction score of all instances, global bias
     • args['eval_metric'] (string): Evaluation metrics for validation data; a default metric will be assigned according to 'objective'
     • args['seed'] (int): Random number seed
     • args['eta'] (int): Step size shrinkage used in updates to prevent overfitting
     • args['gamma'] (float): Minimum loss reduction required to make a further partition on a leaf node of the tree
     • args['max_depth'] (int): Maximum depth of a tree; increasing this value will make the model more complex and more likely to overfit. Beware that XGBoost aggressively consumes memory when training a deep tree
     • args['min_child_weight'] (float): Minimum sum of instance weight (hessian) needed in a child
     • args['max_delta_step'] (int): Maximum delta step allowed for each tree's weight estimation
     • args['subsample'] (float): Subsample ratio of the training instances
     • args['colsample_bytree'] (float): Subsample ratio of columns when constructing each tree
     • args['colsample_bylevel'] (float): Subsample ratio of columns for each level
     • args['colsample_bynode'] (float): Subsample ratio of columns for each split
     • args['xgboost_lambda'] (float): L2 regularization term on weights; increasing this value will make the model more conservative
     • args['alpha'] (float): L1 regularization term on weights; increasing this value will make the model more conservative
     • args['tree_method'] (string): The tree construction algorithm used in XGBoost
     • args['scale_pos_weight'] (float): Balancing of positive and negative weights
     • args['refresh_leaf'] (int): A parameter of the refresh updater plugin; when this flag is 1, tree leaves as well as tree nodes' stats are updated; when it is 0, only node stats are updated
     • args['process_type'] (string): The type of boosting process to run
     • args['num_parallel_tree'] (int): Number of parallel trees constructed during each iteration; this option is used to support boosted random forest
     • args['sample_type'] (string): Type of sampling algorithm
     • args['normalize_type'] (string): Type of normalization algorithm
     • args['rate_drop'] (float): Dropout rate (a fraction of previous trees to drop during the dropout)
     • args['one_drop'] (string): When this flag is enabled, at least one tree is always dropped during the dropout (allows Binomial-plus-one or epsilon-dropout from the original DART paper)
     • args['skip_drop'] (float): Probability of skipping the dropout procedure during a boosting iteration
     • args['feature_selector'] (string): Feature selection and ordering method
     • args['top_k'] (int): The number of top features to select in the greedy and thrifty feature selectors; a value of 0 means using all the features
     • args['updater'] (string): A comma-separated string defining the sequence of tree updaters to run, providing a modular way to construct and modify the trees. This is an advanced parameter that is usually set automatically depending on other parameters, but it can also be set explicitly by a user

     Table 2. Full model Python script arguments for Dask-XGBoost models.
  12. In the compact modeling approach, the Lucd Modeling Framework (LMF) provides a lightweight approach to training TensorFlow and PyTorch models in Lucd. Boilerplate tasks (e.g., creating confusion matrices, ROC curves) are automatically executed by LMF, enabling a developer to write and maintain less code. This also supports modularity, so developers can effortlessly adopt new data loading, training, and performance measurement logic as it becomes available in LMF. Compact modeling differs according to the framework being used for model development, i.e., TensorFlow or PyTorch. The following sections describe the separate approaches in more detail. Examples illustrating how to use the compact modeling approach are in The Lucd Model Shop.

TensorFlow

For TensorFlow-based modeling, compact modeling requires the developer to implement only two functions: model and label_mapping.

Model Function

The model function is intended to contain code for building TensorFlow Estimator models. There are plenty of examples on the web demonstrating how to build and configure Estimator models; the following link is a good starting point: https://www.tensorflow.org/guide/estimator. The LMF sends essential data (e.g., training data, training steps, pre-defined word embeddings) to this function to configure an Estimator model for training. In turn, the function must return the elements needed to configure the LMF for training, performance analysis, etc. Below is a formal description of the model function.

def model(training_data, validation_data, num_features, training_steps, learning_rate,
          regularization_value, log_dir, training_param_map, embedding_matrix, embedding_size,
          word_index_mapping, max_document_length, pad_value, train_id):
    """Function used by LMF for training and analyzing TensorFlow Estimator models.

    Args:
        training_data (list): List of delayed "chunks" of Dask dataframe representing training data.
        validation_data (list): List of delayed "chunks" of Dask dataframe representing validation data.
        num_features (tuple): The shape of the features input for a model.
        training_steps (int): Number of steps for model training.
        learning_rate (float): Model's learning rate.
        regularization_value (float): Model's regularization value.
        log_dir (string): Path designating where checkpoints will be written (needed for training).
        training_param_map (dict): Dictionary containing miscellaneous parameters.
        embedding_matrix (numpy array): 2D numpy array where each row represents an embedding for an indexed word (for text classification models).
        embedding_size (int): Size/length of embeddings in embedding_matrix, or length of embeddings to be learned (for text classification models).
        word_index_mapping (dict): Dict mapping string words to their int index representations (for text classification models).
        max_document_length (int): Int defining the maximum number of tokens to be used for free text input into the model for training (for text classification models).
        pad_value (int): Int defining index value used for padding documents for training, validation, and testing (for text classification models).
        train_id (str): Unique identifier of the underlying training in the database.
    Returns:
        TensorFlow Estimator object (for training),
        TensorFlow Estimator TrainSpec object (for running training),
        TensorFlow Estimator EvalSpec object (for running validation),
        Dict mapping feature names to feature types (for loading data into the model),
        Type of target/label in training/validation/testing data (for loading data into the model),
        TensorFlow serving_input_receiver_fn suitable for use in serving (for model serving/prediction),
        List of feature names (same order as in training data, for GUI display purposes),
        List of class names (same order as their numerical representation in training data, for confusion matrix and GUI display purposes),
        Number of classes to use in lucd_uds.get_tf_dataset_* and lucd_ml.get_predictions_* functions for TensorFlow,
        String "input_name" representing the name of the model input layer for use with the TF signature def when generating predictions.
    """

Note that the inputs to the model function are defined in the Unity client and hence should not be altered.

Label_mapping Function

Return values from the label_mapping function are used by the LMF to compute the confusion matrix and the precision and recall statistics. For proper construction of the confusion matrix, a dict mapping the training data's label values (integers) to expressive strings should be returned (a minimal sketch is provided at the end of this record).

PyTorch

For PyTorch-based modeling, the developer is required to implement the same functions as for TensorFlow: model and label_mapping. The use of the label_mapping function for PyTorch is exactly the same as for TensorFlow. Hence, only details of the model function are described for PyTorch.

Model Function

As opposed to TensorFlow-based modeling, for which the model function implements a developer's AI model, model for PyTorch is primarily used for executing model training and validation logic. This is mainly because with PyTorch, model training logic is designed to be much more under a developer's control. As seen in the example code, and similar to traditional PyTorch style, the actual AI model can be defined as a separate class inside the Python file. Details of the model function for PyTorch are below.

def model(training_data, validation_data, num_features, training_steps, learning_rate,
          regularization_value, log_dir, training_param_map, embedding_matrix, embedding_size,
          word_index_mapping, max_document_length, pad_value, train_id):
    """Function used by LMF for training and analyzing PyTorch models.

    Args:
        training_data (torch.utils.data.Dataset): PyTorch dataset representing training data.
        validation_data (torch.utils.data.Dataset): PyTorch dataset representing validation data.
        num_features (tuple): The shape of the features input for a model (no use for PyTorch identified).
        training_steps (int): Number of steps for model training.
        learning_rate (float): Model's learning rate.
        regularization_value (float): Model's regularization value.
        log_dir (string): Path designating where checkpoints will be written (needed for training).
        training_param_map (dict): Dictionary containing miscellaneous parameters.
        embedding_matrix (numpy array): 2D numpy array where each row represents an embedding for an indexed word (for text classification models).
        embedding_size (int): Size/length of embeddings in embedding_matrix, or length of embeddings to be learned (for text classification models).
        word_index_mapping (dict): Dict mapping string words to their int index representations (for text classification models).
        max_document_length (int): Int defining the maximum number of tokens to be used for free text input into the model for training (for text classification models).
        pad_value (int): Int defining index value used for padding documents for training, validation, and testing (for text classification models).
        train_id (str): Unique identifier of the underlying training in the database.

    Returns:
        Trained PyTorch model,
        List of floats representing final model performance statistics values,
        List of class names (same order as their numerical representation in training data, for confusion matrix and GUI display purposes).
    """
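As referenced in the label_mapping description above, the following is a minimal sketch of that function for a hypothetical three-class problem; the class names are illustrative only.

def label_mapping():
    # Map integer label values (as encoded in the training data) to expressive strings.
    return {0: 'setosa', 1: 'versicolor', 2: 'virginica'}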
  14. The Lucd Modeling Framework (LMF) enables developers to build and interface custom AI models with the Lucd JedAI platform for streamlined management, experimentation, and training using data and parameters established in the JedAI Unity client (or simply, Unity client). The framework supports Python-based AI models built with TensorFlow, PyTorch, and Dask XGBoost. LMF's Python libraries support the following tasks:
- accessing Lucd virtual datasets (VDSes) for model training and evaluation,
- analyzing and reporting model performance metrics (e.g., with confusion matrices, ROC curves),
- storing structures representing trained models and training checkpoints.

Model Development Approaches

LMF provides flexibility in the level of effort and control needed for preparing models for Lucd. The two approaches are the full and compact model approaches; their differences are illustrated in Figure 1.

Figure 1. Conceptual illustration of full and compact model approaches.

Full Model Approach

In the full model approach, a developer creates an AI model and manually uses LMF Python libraries to complete the model training workflow (e.g., train, validate, holdout data testing, store results). This enables complete flexibility for more advanced use cases, which might include designing complex or experimental training loops, advanced performance analysis, custom model compression, etc. Full models are implemented using normal Python scripts. Further details are in the Developing Full Models section of this documentation.

Compact Model Approach

The compact model approach enables a developer to focus most if not all effort on defining an AI model, leaving other workflow tasks like holdout data testing and storage of performance results for the LMF to do automatically behind the scenes. In the case of TensorFlow, the developer does not even need to write training logic. The major benefits of the compact model approach are (1) significantly less coding effort and (2) potential reduction of errors and/or inconsistencies in writing boilerplate performance-testing logic. These benefits are especially useful for formatting models for multi-run experiments such as k-fold cross validation and learning curves (which will be introduced in an upcoming LMF release). Further details about compact modeling are in Developing Compact Models.

Notable Framework Capabilities

The LMF consists of an evolving set of capabilities. The following subsections describe notable modeling capabilities supported as of release 6.3.0 RC1.

TensorFlow Estimator-Based Modeling

TensorFlow supports AI modeling using either low-level APIs or easier-to-use high-level Estimator APIs. The LMF is designed to support Estimator-based model development. Keras may be used to create models, especially for enabling more customization; however, such models must be converted to Estimators for LMF and the broader Lucd JedAI platform to manage them appropriately (a conversion sketch appears at the end of this record). See the following link for an introduction to TensorFlow Estimators: https://www.tensorflow.org/guide/estimator.

Various Feature Types

For TensorFlow modeling, all dataset feature column types are supported (see https://www.tensorflow.org/guide/feature_columns), enabling support for a broad range of numeric and categorical features. Regarding categorical features, the domain of such a feature must be known at training time. For example, if you choose to use a feature car_make as a categorical feature, you must know all the possible makes when you write your model.
This requirement will be removed in a future release. Also, the conversion of non-numerical data to numerical data (e.g., for encoding label/target values) based on a scan of the entire dataset is not supported in the current release. However, to help with this, data value replacement operations are supported in the Unity client. For TensorFlow modeling, label types are assumed to be TensorFlow int32.

Embedding Datasets

For TensorFlow and PyTorch modeling, LMF supports the use of embedding data, e.g., word2vec for representing free text. For PyTorch, the TorchText library is supported, but n-grams are not supported in the current release. Important note: currently, when using text input, only the text/embedding input is allowed as a feature, enabling conventional text classification. Future releases will enable the use of multiple feature inputs alongside text data.

Image Datasets

For TensorFlow and PyTorch modeling, use of image data (i.e., pixel values) as model input is supported.

Distributed XGBoost Using Dask

Distributed training of XGBoost models using the Dask parallel data analytics framework is supported. See the following link for more information: https://ml.dask.org/xgboost.html. Support for TensorFlow and PyTorch distributed training is under development.

Supported Libraries

The Lucd Modeling Framework supports the following languages and machine learning-related libraries:
- Python v3.6.5
- TensorFlow (for Python) v2.1
- PyTorch v1.5.1
- TorchText
- Dask
- Dask-ML
- Dask-XGBoost
- Numpy v1.16.4
- Scikit-learn v0.19.2
- Pandas v0.25.1

Since the Dask and Dask-ML libraries are included in the modeling framework, Dask-based distributed Scikit-learn modules should also be supported. However, the usage of such models has not been tested, and hence Dask-based Scikit-learn operation may be unpredictable.

Example Code

While this documentation introduces all the core components and best practices for developing AI models for the Lucd JedAI platform, there is rarely a replacement for sample code. The Lucd Model Shop provides a wide range of code (prepared by Lucd engineers) to help developers get started with preparing AI models. In the future, the Lucd Model Shop will also allow the larger Lucd developer community to share their code, further helping others with their AI goals.

Python API Documentation

The LMF Python API documentation can be found at the following Lucd GitLab Pages site: https://lucd.pages.lucd.ai/mock-product-development/lucd-eda-rest/.

Preparing Models Using the Lucd Modeling Framework

The following documentation contains further details and examples for developing AI models for Lucd.
- Developing Compact Models
- Developing Full Models
- Working with Data and Performance Analysis
- The Lucd Model Shop
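As referenced in the Estimator discussion above, the following is a minimal sketch of converting a Keras model to an Estimator using TensorFlow's standard model_to_estimator conversion (available in TF 2.1); the architecture and model_dir are illustrative only.

import tensorflow as tf

# A small illustrative Keras classifier; the model must be compiled before conversion.
keras_model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(4,)),
    tf.keras.layers.Dense(3)
])
keras_model.compile(optimizer='adam',
                    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# Convert to an Estimator so the LMF and the broader Lucd JedAI platform can manage it.
estimator = tf.keras.estimator.model_to_estimator(keras_model=keras_model,
                                                  model_dir='/tmp/model_ckpts')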
  15. This section covers the high-level tasks needed to support model training: importing and preparing data and analyzing performance. Most of the content here pertains to the full model approach, but some (e.g., reporting model status) is still helpful for PyTorch compact modeling.

Importing and Preparing Data

Data can be imported into a modeling context using the Lucd Unified Dataspace (UDS) API (eda.lib.lucd_uds). This library provides functions for creating datasets of various formats (TensorFlow, PyTorch, Dask dataframe) based on Lucd virtual datasets defined in the Unity client. It also provides the capability to retrieve previously trained word embeddings. The Lucd UDS functions providing data retrieval are listed below. Some are used for straightforward data importing (e.g., get_dataframe) while others are used for preparing framework-specific datasets for AI models (e.g., get_tf_dataset for defining TensorFlow datasets). Refer to the API documentation for full function descriptions, and visit the Lucd Model Shop GitLab project for examples on how to use the functions for developing AI models.

get_asset
get_dataframe
get_tf_dataset
get_tf_dataset_image
get_tf_dataset_text
train_eval_test_split_dataframe
train_eval_test_split_pytorch
train_eval_test_split_tensorflow

Important Notes for Implementing Multi-Class Modeling

TensorFlow offers different approaches to building multi-class models, two prominent ones being pre-made Estimators (https://www.tensorflow.org/tutorials/estimator/premade#overview_of_programming_with_estimators) and more general techniques such as Keras models and custom Estimators. If one-hot encoded data labels are needed (i.e., to match the number of nodes in a neural network output layer), the num_classes parameter should be used when calling the relevant functions to get data (e.g., lucd_uds.get_tf_dataset). Note that most (if not all) TensorFlow pre-made Estimator models (e.g., tensorflow.estimator.DNNClassifier) do not require explicitly one-hot encoded data labels for non-binary modeling, and hence the num_classes argument can be omitted. In the case of TensorFlow Estimators, developers are encouraged to understand how to shape input for their models. The same goes for modeling with PyTorch or XGBoost.

Analyzing Model Performance

Post-training performance analysis tasks are supported by the Lucd Machine Learning (ML) API (eda.lib.lucd_ml). This library provides functions supporting automatic execution and reporting of critical performance analysis tasks (e.g., creating confusion matrices, ROC curves), preventing the need to repeatedly write such code. The tables and plots created from these library functions can be viewed in the Unity client after the entire model training process has completed. The Lucd ML functions for performance analysis are listed below. Refer to the API documentation for full function descriptions.

get_predictions_classification_pt
get_predictions_classification_tf
get_predictions_regression_pt
get_predictions_regression_tf
lucd_precision_recall_curve
lucd_roc_curve
lucd_confusion_matrix
update_plots

Submitting Performance Analysis Results

Trained models and metadata can be uploaded to the Lucd backend via the eda.int.train.update function. The following piece of example code illustrates how to use the function.
model_filename = lucd_uds.zip_model_tf(trained_classifier, serving_input_receiver_fn,
                                       model_id, graph_version, log_dir)

# Store model graph and performance stats back to the Lucd back-end
with open(model_filename, "rb") as graph_file:
    train.update({tid: {
        'performance': {
            'loss': loss,
            'accuracy': accuracy,
            'macro_precision': precision_macro,
            'macro_recall': recall_macro,
            'macro_f1': f1_macro,
            'micro_precision': precision_micro,
            'micro_recall': recall_micro,
            'micro_f1': f1_micro,
            'weighted_precision': precision_weighted,
            'weighted_recall': recall_weighted,
            'weighted_f1': f1_weighted,
            'precision_recall_f1_per_label': results_string,
        },
        'graph_version': graph_version,
        'graph_file': graph_file.read()
    }})

train.update takes a Python dictionary as the argument, with the train_id, described in Table 1, as the top-level key (tid holds the train_id in the code snippet above). The secondary keys graph_version and graph_file store the graph version and the trained graph file (model), respectively. The secondary key performance stores another dictionary of performance values. There is no restriction on the key-value pairs here; the developer may choose the performance values, and they will be viewable in the Unity client afterward. The values shown in the code snippet above are customary for evaluating classification models. Again, see the example models in The Lucd Model Shop for more insights.

As for precision_recall_f1_per_label, an example of how to compute performance statistics per label can be found at https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html. precision_recall_f1_per_label must be formatted as a semicolon-separated string of tuples, with each tuple formatted as "<label>-<precision>,<recall>,<f1>". An example is as follows: precision_recall_f1_per_label: setosa-1.0,0.5,1.0;virginica-1.0,1.0,0.6;versicolor-1.0,0.6666666666666666,0.7499999999999999.
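The following is a small sketch showing one way to build the results_string used above from scikit-learn's per-label statistics; y_true, y_pred, and class_names are illustrative variables, with class_names ordered by the labels' integer encoding.

from sklearn.metrics import precision_recall_fscore_support

# One precision/recall/F1 value per label, ordered by sorted label values.
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred)

# Semicolon-separated "<label>-<precision>,<recall>,<f1>" tuples.
results_string = ";".join(
    f"{name}-{p},{r},{f}" for name, p, r, f in zip(class_names, precision, recall, f1)
)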
Enabling Model Explainability

To enable a trained model to be used by the explainability tool in the Unity client, some parameters must be defined. For TensorFlow models, ordered_feature_names, ordered_class_names, input_name, and output_name must be defined.

ordered_feature_names (not to be confused with training data input column names) is a list of ordered names of the inputs to the trained model, commonly defined in TensorFlow model definitions as tf.feature_column. For example, for a TensorFlow text classification model, the named input might be embedding_input. Please see the example code in The Lucd Model Shop.

ordered_class_names is a list formatted such that string class names are ordered by their integer representations (the order of outputs from your model). For instance, for a binary classification model for which the labels are 0 and 1, the order of strings must be negative then positive (or whatever string labels you choose).

input_name is the name of the input layer in your TensorFlow model to which your ordered_feature_names data will be passed. output_name is the name of the output layer in your TensorFlow model (by default these can be named things like 'dense_2' and 'scores'). The output_name is used to retrieve your model outputs in the proper format for explanation. PyTorch models only require that ordered_class_names be provided.

Plots

Lucd allows users to plot model training metrics in the Unity client. The plots update in real time during model training, providing insight into the viability, problems, and successes of training runs.

Users are able to store any information they want plotted as part of a training run using the eda.lib.lucd_ml.update_plots function. The provided information must abide by the following assumptions:
1. The top-level keys represent individual plots;
2. Each inner dictionary must specify a labels and a description key, where labels is ["plot x_label", "plot y_label"] and the description can be any string;
3. All remaining keys in the inner dictionary will be treated as individual lines on the plot, so in the following example "l1" is a line on the plot.

{
    "accuracy": {
        "l1": [[1, 0.10], [2, 0.15]],
        "labels": ["epoch", "accuracy"],
        "description": "This is a simple accuracy plot example."
    }
}

The individual elements in the "l1" line above represent [x_val, y_val]. A common example, as shown above, is the following:

"l1": [[epoch, accuracy], [epoch, accuracy], ...]

Once created, the dictionary may be stored for plotting with the lucd_ml.update_plots function:

def update_plots(train_or_eval: bool, tid: str, dictionary: dict) -> str:
    ...

dictionary = {
    "accuracy": {
        "l1": [[1, 0.10], [2, 0.15]],
        "labels": ["epoch", "accuracy"],
        "description": "This is a simple accuracy plot example."
    }
}
update_plots(True, train_id, dictionary)

Regarding the update_plots function, train_or_eval allows a user to specify whether the plot is part of the training or the evaluation (or validation) cycle of model training: train=True, eval=False.

A TensorFlow hook is provided in lucd_ml for automatically parsing generated events files (the same as used by TensorBoard) and passing them to update_plots as part of a TensorFlow model. It can be provided as part of a TensorFlow EvalSpec or TrainSpec object as follows (stub included for reference):

class LucdTFEstimatorHook(tf.estimator.SessionRunHook):
    def __init__(self, train_hook: bool, log_dir: str, tid: str, freq: int, last_epoch: int):
        ...

train_spec = tf.estimator.TrainSpec(
    input_fn=lambda: lucd_uds.get_tf_dataset_image(type_dict, training_data, num_features,
                                                   target_type, num_classes)
                         .repeat(count=None).shuffle(30).batch(30),
    max_steps=training_steps,
    hooks=[lucd_ml.LucdTFEstimatorHook(train_hook=True, log_dir=log_dir, tid=tid,
                                       freq=10, last_epoch=training_steps)])

train_hook allows a user to specify whether the hook is providing train or eval metrics (train=True, eval=False). log_dir tells the hook where to find TensorFlow events files. freq is the frequency with which the hook should look for metrics in the events files. last_epoch tells the hook the number of epochs being run so that the hook can ignore freq for the last epoch.

Two final helper functions are provided as part of the Lucd plotting framework: lucd_roc_curve and lucd_precision_recall_curve. These functions generate ROC curves and precision-recall curves, respectively, and are called selectively when using the compact modeling approach (enabled by the "performance_curves" button in Unity). Further documentation for these functions is provided in the API.

def lucd_roc_curve(truths: list, scores: list, class_list: list, tid: str = None,
                   write_accumulo: bool = False) -> (dict, dict, dict, dict): ...

def lucd_precision_recall_curve(truths: list, scores: list, class_list: list, tid: str = None,
                                write_accumulo: bool = False) -> (dict, dict, dict, dict): ...

Confusion Matrix

Lucd provides an in-depth, interactive confusion matrix for classification model evaluation. Users may select a square in the Unity client to be shown the actual records associated with the selected square.
This may be enabled by using the following function:

def lucd_confusion_matrix(test_set: list or DataLoader, predictions: list, num_classes: int,
                          label_mapping: type(abs), tid: str, write_accumulo: bool) -> (dict, str):

The function argument details are provided below.

test_set: Users may directly pass the PyTorch DataLoader or the list of delayed Dask dataframes returned from the respective train_eval_test_split_pytorch/tensorflow function.
predictions: A list of predictions generated by your model (the list returned from the lucd_ml get_predictions_classification functions). The list must be in the same order as the test_set data.
num_classes: An integer number of classes for the confusion matrix to represent.
label_mapping: A function to map integers to class labels, used to map predictions to a human-readable format.
tid: Training ID to associate the confusion matrix with.
write_accumulo: Boolean specifying whether to write the dictionary directly to Accumulo. Must be True to enable the Unity client confusion matrix. If False, the generated confusion matrix will only be returned, not written to the database.

Further documentation for this function exists in the API documentation. Here is an example usage:

def _label_mapping():
    return {0: 'I. versicolor', 1: 'I. virginica', 2: 'I. setosa'}

...

# Prepare VDS data for modeling
delayed_values_training, delayed_values_evaluation, delayed_values_testing, my_df_testing_label, num_features = \
    lucd_uds.train_eval_test_split_tensorflow(virtual_dataset_id, evaluation_dataset_percent,
                                              testing_dataset_percent)

...

predictions, scores = lucd_ml.get_predictions_classification_tf(
    _estimator,
    lambda: lucd_uds.get_tf_dataset(feature_dict, delayed_values_testing, num_features,
                                    target_type).batch(1),
    classification_mode, .5)

...

lucd_ml.lucd_confusion_matrix(delayed_values_testing, predictions, 3, _label_mapping, tid, True)

Submitting Model Training Status

Another helpful function is eda.int.train.status, which is used for storing the status of a developer's training pipeline. This enables a model's status to be displayed in the Unity client. The function definition is below.

def status(uid, code, message=None):
    """Update model status in the database.

    Args:
        uid: Int representing a model's ID.
        code: 0 - RETRIEVING VDS, 1 - TRAINING, 2 - ANALYZING PERFORMANCE, 3 - STORING MODEL,
              4 - TRAINING COMPLETE, 5 - ERROR, 6 - QUEUED, 7 - STOPPED.
        message: String representing optional custom message to include.

    Returns:
        Status message.

    Raises:
        TypeError: If code is not of type int.
        Exception: If code is invalid.
    """
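The following is a brief, hypothetical usage sketch of the status function; the codes follow the mapping documented above, and model_id stands in for the model's ID described in the docstring.

from eda.int import train

# Report that training is underway, then that it has completed.
train.status(model_id, 1, message="epoch 3/10 complete")  # 1 - TRAINING
train.status(model_id, 4)                                 # 4 - TRAINING COMPLETE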
  18. This documentation describes how to use the Lucd Python libraries for creating custom feature engineering operations for processing data for model training. Custom feature engineering operations enable a Lucd user to apply feature engineering operations that are not available in the Lucd GUI. A simple example is adjusting a given attribute as a function of the values of several other attributes. Overall, this should enable near open-ended feature transformation needs.

The Custom Operation Module

The eda.int.custom_operation module is used for sending custom operations to the Lucd backend so that they are selectable in the Lucd EDA section of the GUI. The simple Python script below illustrates how one may define a custom feature engineering operation and send it to the Lucd backend using the Lucd Python libraries.

from eda.int import custom_operation
import lucd


def transform(row):
    # Shift the petal length attribute by one unit.
    row['flower.petal_length'] = row['flower.petal_length'] + 1
    return row


client = lucd.LucdClient(domain="domain",
                         username="username",
                         password="password",
                         login_domain="Lucd Platform URL")

data = {
    "operation_name": "simple iris attribute adder",
    "author_name": "J. Branch",
    "author_email": "email@address",
    "operation_description": "Testing custom feature op using simple transform",
    "operation_purpose": "Numeric feature scaling",
    "operation_features": ["flower.petal_length"],
    "operation_function": transform
}

response_json, rv = custom_operation.create(data)
client.close()

The create function from the eda.int.custom_operation module is used to submit the actual function and metadata to Lucd. The required attributes in the dict data are used for display in the Lucd GUI (and may at some point be used for search purposes). Table 1 describes the attributes in detail.

Table 1. Custom Operation Attribute Descriptions

name: String name/label of the custom feature engineering operation
author_name: Name of the developer who wrote the operation
author_email: Author's email
purpose: Short description of what the operation achieves
description: Longer description of how the operation might achieve its purpose, as well as other notes
feature: List of strings identifying the features/facets the operation affects
transform_function: The Python function that implements the logic of the custom operation

Regarding the custom transformation function, it is essential that the function be defined in the same context as the dict and the call to create. This ensures that the de-serialization of the function works properly on the Lucd backend.

Custom Operation Types

Applied Operations

Applied operations denote custom operations which are applied to data using conventional Dask and Pandas function application mechanisms. Custom functions can be applied to data (dataframes) via the following Dask functions: apply, map_partitions, and applymap. Note that applied operations are applied to Pandas dataframes (which act as partitions of larger Dask dataframes); the user should consider this when designing a custom operation. As shown in the example code above, the user does not identify which Dask mechanism to use for applying a function to data in the create call; this is selectable in the Lucd GUI when applying custom operations in the EDA section.

Dask Operations

Dask operations denote custom operations which are to be applied directly to a Dask dataframe.
This is useful in cases where custom operations/algorithms cannot be correctly applied to a *partition” of data as described above. An example would be a clustering algorithm. We expect these custom operations will use modules from Dask libraries such as Dask-ML (https://ml.dask.org/). An example custom operation is shown below. def cluster_function(df): from dask_ml.cluster import KMeans X = df[['car.days_in_service']] kmeans = KMeans(n_clusters=4) k_service_days = kmeans.fit(X) labels = k_service_days.predict(X) df['cars.service_days_cluster'] = labels return df When a Dask operation is to be applied (in the Unity client), the user should select apply_direct as the application type.
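To make the distinction between the two operation types concrete, the sketch below runs both styles locally with Dask. This is a minimal illustration, not platform code: the dataframe contents are hypothetical, and the Lucd backend performs the equivalent application automatically when an operation is selected in the GUI.

    import pandas as pd
    import dask.dataframe as dd

    def add_one(row):
        # Applied operation: sees one row of a Pandas partition at a time.
        row['flower.petal_length'] = row['flower.petal_length'] + 1
        return row

    def direct_op(df):
        # Dask operation (apply_direct): receives the whole Dask dataframe.
        df['flower.petal_length'] = df['flower.petal_length'] * 2
        return df

    # Hypothetical stand-in for data served by the platform.
    pdf = pd.DataFrame({'flower.petal_length': [1.4, 4.7, 5.1, 1.3]})
    ddf = dd.from_pandas(pdf, npartitions=2)

    # Roughly what the backend does for an applied operation of the "apply" type;
    # meta tells Dask the output schema, avoiding inference on an empty partition.
    applied = ddf.map_partitions(lambda partition: partition.apply(add_one, axis=1),
                                 meta=pdf)
    print(applied.compute())

    # Roughly what the backend does for a Dask operation of the "apply_direct" type.
    print(direct_op(ddf).compute())

The key difference is that add_one never sees the dataset as a whole, whereas direct_op can, which is what makes whole-dataset algorithms such as the clustering example possible.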
  19. User Guide

The Lucd Python Client provides capabilities for data scientists and AI model developers to prototype AI model solutions before uploading them to the Lucd Unity Client for extended training and performance analysis. The Lucd Python Client provides the following features:

- functions for accessing raw data and other assets in Lucd for general analysis and custom visualization;
- functions for uploading user-defined feature transformation operations to Lucd, which can then be applied in the Lucd Unity Client to create a virtual dataset;
- functions for ingesting data into TensorFlow and PyTorch models, which can be used for prototyping models.

Installation

The lucd-python-client package should be installed using the pip command with a Python wheel file. Instructions are as follows:

1. Download or clone the lucd-python-client package (unzip if needed) from the Lucd Python Client Project, open a command prompt, and change to the package directory.
2. At the command prompt, type python setup.py bdist_wheel. The wheel file will appear in the dist directory.
3. Switch to the dist directory and type pip install <wheel_filename>.

Requirements

Python 3.6.5 is required for custom feature operations to work appropriately.

APIs

The Lucd Python Client uses Python and REST APIs. Code examples using both API types are available in the examples directory of the project.

Lucd Python Client API Examples

Example code illustrating how to perform tasks such as authenticating to Lucd, performing queries, obtaining virtual datasets, and training models resides in the examples directory of the project. Below are specific examples of how to access Lucd data using the client, as well as how to create and upload a custom feature transformation operation.

Accessing Data

    from lucd import LucdClient, log
    from eda.int import asset
    from eda.int import vds
    from eda.int import uds
    from eda.lib import lucd_uds

    if __name__ == "__main__":
        username = 'xxx'
        password = 'xxx'
        domain = 'xxx'

        client = LucdClient(username=username, password=password, domain=domain)
        log.info(f"Connected to Lucd platform.")

        # Queries follow the Elasticsearch API.
        # See: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/query-dsl.html
        query = {
            "query": {
                "bool": {
                    "must": [
                        {"bool": {"should": [{"match_phrase": {"source": "iris"}}]}},
                        {"bool": {"should": []}}
                    ],
                    "filter": [
                        {"bool": {"filter": []}}
                    ]
                }
            },
            "size": 2000,
            "dataset": "iris"
        }

        results, http = uds.search(query)
        print(f"Search Results ({http}):\n{results}\n")

        hits, stats = client.search_to_dataframe(results)
        print(f"Search Results:\n{hits.head()}\n")
        print(f"Search Statistics:\n{stats}\n")

        all_models, http = client.rest('lucd/model/read', {"uid": username})
        print(f"All Models ({http}):\n{all_models}\n")

        all_vds, http = vds.read({"uid": username})
        print(f"All Virtual Datasets ({http}):\n{all_vds}\n")

        all_assets, http = asset.read({"uid": username})
        print(f"All Asset Embeddings ({http}):\n{all_assets}\n")

        # Lucd library calls to fetch assets and VDSes.
        # When limiting asset size, you could encounter issues with missing index entries.
        embeddings_index, embedding_matrix, embedding_size, word_index_mapping, word_index_mapping_padded = \
            lucd_uds.get_asset("xxx", limit=100)
        print(embeddings_index, embedding_matrix, embedding_size, word_index_mapping,
              word_index_mapping_padded)

        # When limiting data size, you may still encounter delays bringing back large
        # amounts of data over the network, and could run the client out of memory.
        all_vds, http = vds.read({"uid": None})
        print(f"All Virtual Datasets ({http}):\n{all_vds}\n")

        df = lucd_uds.get_dataframe("xxx", limit=100)
        print(f"Dataframe Data\n{df.head(20)}")

        client.close()

Custom Feature Transformation

    from eda.int import custom_operation
    import lucd

    def create_greater_than_mean_column(df):
        column_mean = df["flower.petal_length"].mean()
        df["flower.petal_length_Mean"] = df["flower.petal_length"] > column_mean
        return df

    if __name__ == "__main__":
        client = lucd.LucdClient(domain="xxx",
                                 username="xxx",
                                 password="xxx",
                                 login_domain="xxx")

        data = {
            "operation_name": "create_greater_than_mean_column_JBstyle",
            "author_name": "J. Black",
            "author_email": "j.black@lucd.ai",
            "operation_description": "Sample operation",
            "operation_purpose": "add a new column",
            "operation_features": ["flower.petal_length"],
            "operation_function": create_greater_than_mean_column
        }

        response_json, rv = custom_operation.create(data)
        client.close()

Federated Endpoints

To support federated machine learning, much of the high-level REST functionality operates in a federated manner. This means that unless otherwise specified, these actions will be performed/created/deleted on all federates. The features impacted include:

- Queries
- EDA trees
- Custom Ops
- VDS objects
- Model definitions
- FATE Training objects

Users may include a block in their JSON specifying which federates to operate on, which looks like the following:

    "federation": {"federates": ["domain_name1", "domain_name2"]}
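As a hypothetical sketch (it assumes the existing read endpoints accept the federation block alongside their usual parameters, which should be verified against the API documentation), restricting a VDS read to two federates might look like this:

    from eda.int import vds

    # The federation block from above; the domain names are placeholders.
    request = {
        "uid": "username",
        "federation": {"federates": ["domain_name1", "domain_name2"]}
    }
    all_vds, http = vds.read(request)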
  20. New Features

This release contains improvements to model profile features:

- Additional tooltips
- Model Profile improvements, including:
  - Tabular Data Predict accepts multiple JSON records as input and returns predictions for each
  - Tabular/Regression Explain provides further explanation of a tabular data prediction
  - Image Explain allows users to upload images and generate explanations of predictions
  - Text Explain allows users to send text input and receive an explanation of predictions
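For example, a multi-record input for Tabular Data Predict might be a JSON array such as the following (the field names are illustrative only, not a required schema):

    [
        {"flower.petal_length": 1.4, "flower.petal_width": 0.2},
        {"flower.petal_length": 5.1, "flower.petal_width": 1.8}
    ]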
  22. The Lucd JedAI Client is downloaded locally on your device and interfaces with the Lucd Platform. The client enables users to visualize, transform, and prepare data for use in modeling frameworks (TensorFlow, PyTorch, etc.). Models can be uploaded and trained in the platform. The client is touchscreen capable, though a touchscreen is not required.

System Requirements

The following specifications are required in order to run the client:

- Windows, Linux, or macOS
- 4 GB memory
- Modern CPU

Recommended Configuration

Although not required, we recommend the following specifications in order to maximize the performance of the client:

- A GPU to support accelerated rendering
- 1600x900 display resolution minimum

Installation Instructions

The client is distributed via Lucd's Steam Store. A user is required to obtain a Steam account in order to access the client download.

Usage Instructions

Login

Log on to the client using the credentials provided to you: username, password, and domain. Cloud customers leave the domain field blank when logging in; private build customers are provided a domain to use when logging in. The login screen also provides Login and Exit Application buttons.

Navigation

After successful authentication, the user is brought to the Home screen. The buttons along the left edge navigate to other 2D screen overlays: Home, Data, Modeling, Assets, Governance, Epidemiology, Collapse Sidebar, Options, and Logout. The buttons in the right corner manipulate camera perspective and visualization behavior, including Reset Perspective.

Home

The Home screen has numerous features. Its primary feature is the Sources histogram, displaying the ingested record sources, the number of records per source, and date/time information relating to the ingested records.

- Sources list: the currently visible ingested sources.
- Source histogram: the ingestion timeline for each visible source. The actual data of the records are not displayed in the histogram.
- Click-and-drag date filter: when browsing the Sources histogram, the client makes it easy to drill down on a time range of ingestion across all sources, down to the hour. To narrow the range of shown data, click and drag the date filters to expand that window of time.
- Source toggles: sometimes a source may have had so much data ingested at a single time that it skews the histogram display scaling. In these cases, the Home screen provides the ability to hide that specific source from the histogram display by clicking its axis label. The chart then automatically re-scales the remaining visible data in the histogram, allowing a better, proportional display.
- Selectable bars: to expand a single unit of time on the graph (e.g., see a single year of data), click on a bar to zoom in across all sources and change the axis scale to that unit of time. The client allows scaling down to the day, so that a 24-hour period can be seen across all sources on the histogram.
- Active filters.

Data & Visualization

The Data and Visualization screen is where users query, visualize, perform Exploratory Data Analysis (EDA) functions, and transform the dataset into a Virtual Data Set (VDS) to be used with a machine learning model. The screen initially opens with a blank panel on the right and the Query option selected on the left.

Query Data

To execute a query, begin by navigating to the Query tab. The Lucd JedAI Client provides four ways of querying data: Sources, Facets, Keywords/Dates, and Concepts. These can be combined to get a very specific result set.
- Sources: to narrow which sources are queried, navigate to Sources and select the boxes of the desired sources. By default, none are selected, and so all sources will be queried.
- Facets: to filter by data facet, navigate to Facets, select the drop-down of a data model, and click "Add Filter" next to the desired facet. Selected facet filters will show up below the available facets list. To remove a facet filter, click its red X button. Multiple filters on the same facet are possible.
- Keywords/Dates: to search by keyword or date-time range, navigate to the Keywords/Dates tab and enter values in the desired fields.
- Concepts: to search by concepts, navigate to the Concepts tab and enter a keyword into the first input field. Optionally, specify a similarity threshold in the second input field; acceptable values range from 0 to 1. A list of concepts will display below the input field. Select one to see similar concepts in a list below the threshold input.
- Lucene Query: Lucene queries can be run directly from the client.

Once query parameters have been specified, click search to see a basic table of the resulting dataset, or click the reset button to reset search parameters. Note that Sources are OR'd together and Facets are AND'd together in their own sub-blocks before being combined, along with other parameters, inside a parent 'must' or 'must_not' block of the query.

Visualizations

The Visualize tab provides numerous ways to view your data. To load a visualization, select it from the list.

Table

To see a table structure of all fields in the query results, use Table. This displays all fields of each record. To see more in-depth detail about a record, select it from the table.

Scatterplot

To see a structured plot of the query results, use Scatterplot. This can display numerical and categorical data in an interactive plot. Drag the plot to rotate it, and scroll/pinch to zoom. On a non-touchscreen device, use the mouse scroll wheel or the keyboard shortcut for zooming.

- Data points: the red orbs are the data points in space.
- Projection: the blue squares exist on all six walls of the plot and represent a projection of all data points on a given 2D plane. Each point can be selected to see the x, y, and z values of that point.
- Dialog box: click on the dialog box to close it.
- Zoom: select the zoom button to focus on a point and see more of its facets.
- Numerical range filter: numerical fields can have a range applied to them by dragging the filter handles.
- Recenter: to re-center the plot, select Recenter.
- Reset filters: to remove categorical filters, select Reset.
- Rotate: to toggle the graph rotation, select Rotate.
- Axis feature: the x, y, and z fields can be changed from their respective drop-downs.
- Categorical filter: categorical data can be filtered by selecting Filter and toggling the desired values.
- Submit: click Submit to apply changes.

Parallel Coordinate Plot

To see trends in features across a result set, use a Parallel Coordinate Plot. Each line from end to end represents a record. Drag to rotate, and move along the length of the plot by holding shift and dragging or by dragging with two fingers.

- Field: each blue plane represents a field, with the field minimum and maximum marked.
- Recenter: to re-center the plot, click the Recenter button.
- Add field: to add an additional field, click the [+] button.
- Field select: each field can be changed by its drop-down.
- Display order: a field can be moved up or down in its display order.
- Remove field: removes the field from the plot.

Histogram

To see how data is distributed across values, use a Histogram.

- Add histogram: to add an additional field histogram, select the [+] button. At this time, only numerical fields will be automatically added. The collection of charts can be scrolled across by dragging.
- Field select: to change a chart's field, select it from the drop-down.
- Remove: to remove the chart, select the [X] button.
- Filter: each chart can be filtered by dragging the yellow handles. These filters will be applied across all open charts. The new maximum and minimum will be displayed below.

Box Plot

To see the statistical distribution across values, use a Box Plot.

- Add box plot: to add an additional field box plot, select the [+] button. At this time, only numerical fields will be automatically added. The collection of charts can be scrolled across by dragging.
- Field select: to change a chart's field, select it from the drop-down.
- Remove: to remove the chart, select the red [X] button.

2D Scatterplot

To see the values distributed on an XY plane, use a 2D Scatterplot.

- Add scatterplot: to add an additional scatterplot, select the blue [+] button. The collection of charts can be scrolled across by dragging.
- Field select: to change a chart's field, select it from the drop-down.
- Remove: to remove the chart, select the red [X] button.

Pearson Correlation

To see how each field relates to all the other fields, use a Correlation Matrix. Only numerical fields are displayed. The matrix can be rotated by dragging. Each bar is scaled on its y axis according to how its two contributing fields relate, on a scale of -1 (red) to 1 (blue). Select a bar to see more information about it; select it again to hide the details.

Exploratory Data Analysis

The Exploratory Data Analysis (EDA) tab is where data can be transformed and shaped before it is used to train a model. Once a query is run, its results can be shaped and filled using EDA.

- Create tree: to begin EDA on the most recent search, select the floppy disk icon.
- Existing trees: saved searches for EDA will appear in the scroll view.
- Tabletop: once a saved search has been selected, it will show up on the tabletop.
- Operations: EDA operations that have been added to the saved search will show up as white nodes.
- Menu: clicking a node will bring up a menu of available options for that node.
- Statistics: to see overview statistics on a selected node, choose it from the dropdown.

Save VDS

When saving a Virtual Dataset, complete the creation process by entering a name and description, selecting the features to include, and indicating whether the data should be persisted.

New Op

The Lucd JedAI Client provides flexible options for data transformation without having to leave the GUI.

- Operation type: when adding an operation to a saved search during EDA, choose between standard operations like Fill/Filter/Replace, NLP operations, custom defined operations, and image-specific operations.
- Operation selection: select the desired operation from the dropdown.
- Operation parameters: parameters must be specified before saving an operation.

Preparing Text Data for Model Training

Lucd provides special operations for easily preparing text data for model training, saving a model developer valuable time in manually coding routines for text transformation. After creating an EDA tree based on a query of a text data source, a developer can add a new operation to the tree based on NLP operations as shown above.
NLP operations (e.g., stopword removal, whitespace removal, lemmatization) can be applied in any sequence. It is important to select the correct facet as the "text attribute." One can also elect to apply tokenization at the document level (i.e., create one sequence of tokens for the entire facet value per record) or the sentence level (i.e., create a token sequence per sentence in the facet for a record).

Saving VDS with Processed Text

When a developer wants to create a new virtual dataset including the transformed text data, they must choose the "processed_text" facet as the sole feature of the virtual dataset, as shown below. Currently, Lucd does not support text model training incorporating multiple feature columns; only the "processed_text" facet may be selected.

Applying Custom Operations

Once custom operations have been defined and uploaded using the Lucd Python Client library, they are available in the GUI for use in data transformation. As shown above, clicking on a custom operation will show further details, specifically the features the operation uses as well as the actual source code defining the op. As mentioned in the documentation for defining custom operations via the Lucd Python Client, one must select how to apply the operation based on one of the following three Dask dataframe approaches: apply, map_partitions, or applymap.

Image Workflows

The Lucd framework supports image-based workflows. Binary image data contained within fields of a record will automatically be rendered in the 3D client. The images below are from the Stanford Dogs dataset.

Applying Image Operations

To apply image operations, select the Image Ops tab within the New Op menu in an EDA tree. It is important to select an image facet as the "Feature." The currently provided operations are as follows:

- Vertical and horizontal flips
- Grayscale
- Contrast normalization
- Normalize (0 mean and unit variance)
- Resize width & height
- Color inversion
- Crop borders
- Gaussian blur
- Rotate
- Min-max scaling
- To array (converts binary data to a NumPy array)
- Reshape dimensions

Operations can be applied to percentages of a dataset instead of the entirety, and can also be used to augment existing data instead of operating in place.

Modeling

The Lucd JedAI Client provides an intuitive and practical dashboard for data science and machine learning modeling.

- View select: on the Modeling screen, review available model definitions by selecting that option from the dropdown.
- Model upload: button to upload new Python model files (TensorFlow, PyTorch, XGBoost, etc.).
- Refresh: select to retrieve model statuses from the backend and refresh the GUI.
- Existing model definitions: model definitions are displayed in the center.
- Status indicator lights: each model will indicate whether it has models in training, training complete, or errors.
- Filters: display only models matching the selected filters (TensorFlow, XGBoost, Classification, etc.).
- Group/sort: drop-down boxes for defining definition grouping and sorting.
- Distribution: model library and type distribution can be seen at the bottom.
- Model details: the currently selected model's details can be seen on the right.
- Train: to begin training the selected model, click "START TRAINING".
- Performance: see all training runs for a selected model by viewing the performance analysis.

Start Training

Training runs require the selection of a VDS and the specification of parameters/assets.

- Asset: to set up a training run, begin by selecting an Asset to include, if any.
- Virtual Dataset: choose an existing VDS to train against.
- Parameters: set the parameters for the training run.

Trained Models

Trained models can also be inspected within the dashboard.

- View select: to review a training run, first select "Trained Models" from the dropdown.
- Training runs: select a training run and view its details. The current status of the run is designated by the colored corner of the list item.
- Training artifact files: download run artifacts.
- Model Profile view: see more metrics about the training run in real time.
- Governance: submit for governance approval.
- Stop/Restart: pause a model and restart it. Can also be used to begin a new run after a training run has completed.

Model Profile: Performance

Trained model performance can be viewed more closely in real time.

- View performance: to view training performance, first select the "Performance" tab.
- Update interval: number of seconds between plot updates. The number of points displayed is limited to 1000 to keep updates consistent. The default is 10.
- Available plots selector: dropdown showing the available plots to display.
- Selected plot description: hover to view the user-input description for the plot.
- Plot legend: color codes the individual lines in a plot for easy recognition.
- Plot line toggles: turn lines on or off; adjusts the axes as well.

Model Profile: Confusion Matrix

Trained models can generate confusion matrices for analysis.

- View confusion matrix: to view a confusion matrix, first select the "Confusion Matrix" tab.
- Interactable square: click a square to see details about actual and predicted values. Values are only displayed in a square if greater than 0.
- Percentage toggle: display prediction percentage across a row's values.

Model Profile: Confusion Matrix Details

Confusion matrix boxes, when clicked, show more details about the values.

- Predicted value: click any orange histogram bar to see sample predictions.
- Sample predictions: a table of sample predictions from a selected histogram bar.

Model Profile: Predict

Models with type tabular_classification or regression can predict output based on JSON input.

- View predictions: to view predictions, first select the "Predict" tab.
- Enter JSON input: enter values to predict on. Must be valid JSON.
- Click Predict: runs the predict operation.
- View results in JSON: output is formatted JSON.

Model Profile: Explain - Tabular/Regression

Models with type tabular_classification or regression can explain tabular data predictions.

- View explainability: to view explainability, first select the "Explainability" tab.
- Enter JSON input: enter values to predict on. Must be valid JSON, as shown above.
- % of training data used to build explainer: percentage of training data used to build the explainer. Must be greater than 0 and less than or equal to 1.
- Number of top explanations: positive integer denoting how many class explanations to show.
- Inputs and corresponding features: colored to show how each influences the top class prediction.
- Class probabilities: class predictions and corresponding likelihood.
- Explanation: how each input influences a positive or negative prediction.

Model Profile: Explain - Images

Models with type image_classification can explain image predictions.

- View explainability: to view explainability, first select the "Explainability" tab.
- Sample image: select a local image to explain.
- Positive only: if true, include only regions of the image contributing to the predicted label.
- Hide rest: if true, the non-explanation part of the returned image is made gray.
- Class probabilities: class predictions and corresponding likelihood.
- Colorized explained image: the returned image, with shaded regions of positive and negative influence.

Model Profile: Explain - Text

Models with type text_classification can explain text predictions.

- View explainability: to view explainability, first select the "Explainability" tab.
- Input text: text the user would like to predict and explain.
- Predicted probabilities: the class probabilities predicted.
- Explanation: words that contribute to a positive or negative correlation.
- Output text: output text with class probabilities highlighted in positive (orange) or negative (blue) colors.

Assets

The Assets page provides a singular look at all existing user "Assets" (e.g., VDS, embeddings).

- View select: to see available Virtual Datasets, select it from the dropdown.
- Usage: counters and indicator lights displaying training run usage of an Asset.
- Pre-op heatmap: heatmap before running the selected EDA operations.
- Post-op heatmap: heatmap after running the selected EDA operations.
- Operations: EDA operations applied to the Asset.
- EDA tree view: the VDS can be viewed in the context of its parent saved search by clicking 3D.
- Embedding: create an embedding from the given VDS (discussed below).

Embeddings

The Lucd JedAI Client provides the ability to easily generate word embedding Assets for use in modeling.

- View select: to see available Embeddings, select it from the dropdown.
- Download: embeddings can be downloaded locally.
- PCA/TSNE: view PCA/TSNE charts for the selected embedding.
- Restart: restart the embedding training here.

PCA/TSNE

Embeddings can be viewed using PCA/TSNE techniques for visualization.

- Style: when viewing an embedding's PCA/TSNE, click to see terms instead of points.
- Region select: toggle to select a cluster of points using a bounding box.
- Multiple select: use to add multiple bounding boxes.
- Word search: search for a term. All matching terms will be highlighted, as well as shown in a list to the right, until there is only one matching term.
- Filter: narrow the number of occurrences for a term to a range using the filter.
- Technique select: toggle between PCA and TSNE.

Governance

The Governance view illustrates what data, data transformations, and assets (e.g., VDS, word embeddings) were used as inputs to training a given model. The value is that a user can quickly gain insights into what data caused a model to yield certain performance results. The following figure shows an overview of the Governance view. The main panel in the middle illustrates, for a selected model, what data and assets were used for training the model. The top half of the view shows information about the data which was used to create a virtual dataset for training the model.

- Submitted models: the main panel on the left-hand side displays what models are available for viewing in the Governance view. The dropdown menus at the top allow the user to select from models based on their governance approval status (i.e., "pending approval," "approved," or "rejected") as well as sort the models based on various criteria.
- Query: represents the query that was used to generate the initial dataset, whether for the purposes of model training data or word embedding generation. Clicking the query will show query details at the bottom of the view.
- Transformation: represents the transformations performed on the initial dataset to establish either a virtual dataset (as in the case of training a model) or a word embedding. These are the same transformations that were applied in the exploratory data analysis section of the tool. Clicking the transformation box will show details at the bottom of the view, such as shown in the figure below.
- Heatmap: visualization of selected attributes (or facets) of queried or transformed data for virtual datasets or word embeddings. Dropdown selectors underneath each visual enable a user to customize the visualization ("feature 1" selects data for the y-axis and "feature 2" selects data for the x-axis). The "metric" selector chooses what statistic of the selected data to use for defining the heatmaps; in the current release, only total "counts" are available. Clicking "fetch metrics" will populate the visualization. Comparing and visualizing data heatmaps (or distributions) before and after a set of transformations is helpful for governance purposes, since it can reveal, for example, whether data biases exist and what transformation operations might have introduced them.
- Embedding details: the name of the asset resulting from word embedding generation. The bottom half of the view shows details about word embedding data for models which require embeddings for training.
- Trained model: represents the trained model after all previous operational flows are complete.
- Metadata & performance statistics: information like start/end time, model type, assets used, and training parameters is displayed here.
- Submit report: clicking the green button enables the user to submit a governance report, either approving or rejecting the model for usage.

Explainability Analysis

Lucd provides the ability to visualize "explanations" of a model's output given specific inputs. Generally, explanations take the form of computed attribute weights, indicating the significance that an attribute gave to a model's decision. This supports the ability to either debug a model or scrutinize the data fed to the model. This feature is supported by integration of the Lime framework. The figure below illustrates the explainability panel in the Governance view; this panel is displayed when the user clicks the model element in the Governance view. Currently, model explainability only works for text classification models; support for tabular data and image data will be available soon.

- Input text: for analyzing a text classification model, the user enters sample text into the input text box and clicks the "explain text" button underneath the box. The time required to run explanation analysis depends on the amount of text entered and the complexity of the model.
- Probability output: a simple bar chart showing the probabilities of a given model's outputs. In the figure, the classes are "negative" and "positive"; however, more classes may be displayed depending on the model. The class labels are obtained from the labels returned by a user's model, as explained in the documentation for the Lucd modeling framework.
- Features output: the weights of the most significant features determined to affect the model's output. For instance, referring to the figure, the tag "<UNKNOWN>" is highly indicative of a piece of text (in this case, a movie review) having a "negative" sentiment. The user is encouraged to try multiple examples to understand the explainability feature.
- Output text: the text on the right shows the major features (words) highlighted in the text. Note that the text shown has been processed by the transformation operations for the embedding creation (which the user specified when using NLP operations before creating the embedding set). This is so that the user understands what is done to the text before it is input to a model, which might offer extra insight into the model's decision logic.

Epidemiology

Lucd provides the ability to visualize epidemics and supply chain breakdowns on a map. Trained models can predict future infection rates and supply shortages down to the census tract level.

- Train: to start training an epidemiology model, click "Train Model…"
- Trained models: a list of previously trained models can be found here.

Train Epidemiology Model

This view appears after selecting "Train Model" in the previous view.

- Dataset: to finalize the training setup, a dataset to train against must be selected.
- Parameters: enter any custom parameters for training.
- Confirm: confirms the configuration and begins training.

3D Map View

Selecting a trained epidemiology model will display a 3D map view.

- Map view: census tracts, counties, and states can all be displayed.
- Details: information regarding a selected region on the map.
- Disease statistic: selecting the disease statistic changes the value used when polygons are extruded.
- Civilian features: selecting civilian features displays a bar chart value on each census tract.
- Search: the map can be searched to snap to a specific location.
- Style: the map style can be changed via the drop-down menu.
- Extent: configuration for the extent of the map.
- Terrain: toggle switch for map terrain.
- Save settings: the current zoom level and location can be saved to the model object to reload later.
- Polygon extrude: polygon extruding can be toggled to make the underlying map easier to read.

View full record
