Categories
Misc

First-Hand Experience: Deep Learning Lets Amputee Control Prosthetic Hand, Video Games

Path-breaking work that translates an amputee’s thoughts into finger motions, and even commands in video games, holds open the possibility of humans controlling just about anything digital with their minds. Using GPUs, a group of researchers trained an AI neural decoder able to run on a compact, power-efficient NVIDIA Jetson Nano system on module (SOM) Read article >

The post First-Hand Experience: Deep Learning Lets Amputee Control Prosthetic Hand, Video Games appeared first on The Official NVIDIA Blog.

Categories
Misc

An End-to-End Blueprint for Customer Churn Modeling and Prediction-Part 3

This is the third installment in a series describing an end-to-end blueprint for predicting customer churn. In previous installments, we’ve discussed some of the challenges of machine learning systems that don’t appear until you get to production: in the first installment, we introduced our use case and described an accelerated data federation pipeline; in the second installment, we showed how advanced analytics fits with the rest of the machine learning lifecycle.

In this third installment, we finish presenting the analytics and federation components of our application and explain some best practices for getting the most out of Apache Spark and the RAPIDS Accelerator for Apache Spark.

Architecture review

Figure 1: A high-level overview of our blueprint architecture. A federation and analytics application takes five database tables and produces a single federated table plus a set of reports; a model training application consumes that table and the reports to produce a model; and a production inference application serves the model.

Recall that our blueprint application (Figure 1) includes a federation workload and a pair of analytics workloads.

  • The federation workload aggregates data spread across five normalized tables, each capturing observations about a different aspect of customers’ accounts, into a single denormalized wide table with one row per customer.
  • The first analytic workload produces a machine-readable summary report of value distributions and domains for each feature.
  • The second analytic workload produces a series of illustrative business reports about customer outcomes.

Our first installment contains additional details about the federation workload, and our second installment contains additional details about the analytics workloads.

We’ve implemented these three workloads as a single Spark application with multiple phases:

  • The app federates raw data from multiple tables in HDFS (which are stored as Parquet files) into a single wide table.
  • Because the wide table is substantially smaller than the raw data, the app then reformats the wide output by coalescing to fewer partitions and casting numeric values to types suitable for ML model training (sketched after this list). The output of this phase is the source data for ML model training.
  • The app then runs the analytics workloads against the coalesced and transformed wide table, first producing the machine-readable summary report and then producing a collection of rollup and data cube reports.
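
A rough sketch of that second phase, assuming an existing SparkSession named spark and using hypothetical paths and column names:

from pyspark.sql import functions as F

# Hypothetical paths and columns; the real blueprint derives these from the federated schema.
wide = spark.read.parquet("/data/churn/wide_raw")

training_source = (wide
    .withColumn("monthly_charges", F.col("monthly_charges").cast("float"))
    .withColumn("tenure", F.col("tenure").cast("int"))
    .coalesce(16))  # the wide table is small enough to fit in far fewer partitions

training_source.write.mode("overwrite").parquet("/data/churn/training")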

Performance considerations

Parallel execution

For over 50 years, one of the most important considerations for high performance in computer systems has been increasing the applicability of parallel execution. (We choose, somewhat arbitrarily, to identify the development of Tomasulo’s algorithm in 1967, which set the stage for ubiquitous superscalar processing, as the point at which concerns about parallelism became practical and not merely theoretical.) In the daily work of analysts, data scientists, data and ML engineers, and application developers, concerns about parallelism often manifest in one of a few ways; we’ll look at those now.

When scaling out, perform work on a cluster

If you’re using a scale-out framework, perform work on a cluster instead of on a single node whenever possible. In the case of Spark, this means executing code in Spark jobs on executors rather than in serial code on the driver.  In general, using Spark’s API rather than host-language code in the driver will get you most of the way there, but you’ll want to ensure that the Spark APIs you’re using are actually executing in parallel on executors.
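
As a minimal illustration (with made-up data), compare a driver-side loop over collected rows with the same aggregation expressed through the DataFrame API, which Spark can distribute across executors:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("driver-vs-executors").getOrCreate()
df = spark.createDataFrame([(1, 10.0), (2, 20.0), (3, 30.0)], ["id", "amount"])

# Anti-pattern: collect() pulls every row to the driver, which then loops serially
total_serial = sum(row["amount"] for row in df.select("amount").collect())

# Preferred: express the aggregation in the DataFrame API so it executes on executors
total_parallel = df.agg(F.sum("amount")).first()[0]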

Operate on collections, not elements; on columns, not rows

A general best practice to exploit parallelism and improve performance is to use specialized libraries that perform operations on a collection at a time rather than an element at a time. In the case of Spark, this means using data frames and columnar operations rather than iterating over records in partitions of RDDs; in the case of the Python data ecosystem and RAPIDS.ai, it means using vectorized operations that operate on entire arrays and matrices in a single library call rather than using explicit looping in Python. Crucially, both of these approaches are also amenable to GPU acceleration.
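
For example, a per-row Python UDF forces element-at-a-time execution and is opaque to the optimizer, while the equivalent built-in column expression operates on whole columns at once and is the kind of operation the RAPIDS Accelerator can move to the GPU; a small sketch with made-up data:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(100.0,), (250.0,), (75.0,)], ["price"])

# Element at a time: a Python UDF is applied row by row
add_tax_udf = F.udf(lambda p: p * 1.08, DoubleType())
taxed_udf = df.withColumn("price_with_tax", add_tax_udf("price"))

# Collection at a time: a built-in column expression the engine can vectorize
taxed_col = df.withColumn("price_with_tax", F.col("price") * 1.08)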

Amortize the cost of I/O and data loading

I/O and data loading are expensive, so it makes sense to amortize their cost across as many parallel operations as possible.  We can improve performance both by directly reducing the cost of data transfers and by doing as much as possible with data once it is loaded. In Spark, this means using columnar formats, filtering relations only once upon import from stable storage, and performing as much work as possible between I/O or shuffle operations.
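
A sketch of this idea with hypothetical paths and columns: read the Parquet data once, filter and project immediately on import, and reuse the loaded relation for several downstream aggregations rather than re-reading it for each report:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read from columnar storage once, keeping only the rows and columns we need
customers = (spark.read.parquet("/data/customers")  # hypothetical path
    .filter(F.col("active"))
    .select("customer_id", "segment", "monthly_charges"))
customers.cache()  # amortize the cost of loading across the reports below

by_segment = customers.groupBy("segment").agg(F.avg("monthly_charges"))
overall = customers.agg(F.count("*").alias("n"), F.avg("monthly_charges"))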

Better performance through abstraction

In general, raising the level of abstraction that analysts and developers employ in apps, queries, and reports allows runtimes and frameworks to find opportunities for parallel execution that developers didn’t (or couldn’t) anticipate.

Use Spark’s data frames

As an example, there are many benefits to using data frames in Spark and primarily developing against the high-level data frame API, including faster execution, semantics-preserving optimization of queries, reduced demand on storage and I/O, and dramatically improved memory footprint relative to using RDD based code. But beyond even these benefits lies a deeper advantage: because the data frame interface is high-level and because Spark allows plug-ins to alter the behavior of the query optimizer, it is possible for the RAPIDS Accelerator for Apache Spark to replace certain data frame operations with equivalent — but substantially faster — operations running on the GPU.

Transparently accelerate Spark queries

Replacing some of the functionality of Spark’s query planner with a plug-in is a particularly compelling example of the power of abstraction: an application written years before it was possible to run Spark queries on GPUs could nevertheless take advantage of GPU acceleration by running it with Spark 3.1 and the RAPIDS Accelerator.
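
In practice, enabling the plugin is a matter of configuration rather than code changes. A minimal sketch, assuming the RAPIDS Accelerator and cuDF jars are already on the Spark classpath and that your cluster has GPU resources configured:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("churn-blueprint")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")  # route query planning through the plugin
    .config("spark.rapids.sql.enabled", "true")             # allow supported operators to run on the GPU
    .getOrCreate())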

Maintain clear abstractions

While the potential to accelerate unmodified applications with new runtimes is a major advantage of developing against high-level abstractions, in practice, maintaining clear abstractions is rarely a higher priority for development teams than shipping working projects on time. For multiple reasons, details underlying abstractions often leak into production code; while this can introduce technical debt and have myriad engineering consequences, it can also limit the applicability of advanced runtimes to optimize programs that use abstractions cleanly.

Consider operations suitable for GPU acceleration

In order to get the most out of Spark in general, it makes sense to pay down technical debt in applications that work around Spark’s data frame abstraction (e.g., by implementing parts of queries as RDD operations). In order to make the most of advanced infrastructure, though, it often makes sense to consider details about the execution environment without breaking abstractions. To get the best possible performance from NVIDIA GPUs and the RAPIDS Accelerator for Apache Spark, start by ensuring that your code doesn’t work around abstractions, and then consider the types and operations that are more or less amenable to GPU execution so that as much of your application as possible runs on the GPU. We’ll see some examples of these next.

Types and operations

Not every operation can be accelerated by the GPU. When in doubt, it always makes sense to run your job with spark.rapids.sql.explain set to NOT_ON_GPU and examine the explanations logged to standard output. In this section, we’ll call out a few common pitfalls, including decimal arithmetic and operations that require configuration for support.

Beware of decimal arithmetic

Decimal computer arithmetic supports precise operations up to a given precision limit, can avoid and detect overflow, and rounds numbers as humans would while performing pencil-and-paper calculations. While decimal arithmetic is an important part of many data processing systems (especially for financial data), it presents a particular challenge for analytics systems. In order to avoid overflow, the results of decimal operations must widen to include every possible result; in cases in which the result would be wider than a system-specific limit, the system must detect overflow. In the case of Spark on CPUs, this involves delegating operations to the BigDecimal class in the Java standard library and precision is limited to 38 decimal digits, or 128 bits. The RAPIDS Accelerator for Apache Spark can currently accelerate calculations on decimal values of up to 18 digits, or 64 bits.
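
As a quick illustration with a hypothetical column, keeping currency amounts in a decimal type of at most 18 digits of precision keeps them within the range the plugin can accelerate:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DecimalType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", "19.99"), ("b", "250.00")], ["customer_id", "charge"])

# DecimalType(18, 2) fits in 64 bits; note that arithmetic on decimals widens the
# result type, which can push individual operations past the 18-digit limit.
charges = df.withColumn("charge", F.col("charge").cast(DecimalType(18, 2)))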

We’ve evaluated two configurations of the churn blueprint: one using floating-point values for currency amounts (as we described in the first installment) and one using decimal values for currency amounts (the configuration against which the performance numbers we’re currently reporting were collected). Because of its semantics and robustness, decimal arithmetic is more costly than floating-point arithmetic, but it can be accelerated by the RAPIDS Accelerator plugin as long as all of the decimal types involved fit within 64 bits.

Configure the RAPIDS Accelerator to enable more operations

The RAPIDS Accelerator is conservative about executing operations on the GPU that might exhibit poor performance or return slightly different results than their CPU-based counterparts. As a consequence, some operations that could be accelerated may not be accelerated by default, and many real-world applications will need to enable these to see the best possible performance. We saw an example of this phenomenon in our first installment, in which we had to explicitly enable floating-point aggregate operations in our Spark configuration by setting spark.rapids.sql.variableFloatAgg.enabled to true. Similarly, when we configured the workload to use decimal arithmetic, we needed to enable decimal acceleration by setting spark.rapids.sql.decimalType.enabled to true.
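
Putting those two settings together, the relevant portion of this workload’s session configuration looks roughly like the following sketch (the rest of the setup is omitted):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.rapids.sql.variableFloatAgg.enabled", "true")  # floating-point aggregations
    .config("spark.rapids.sql.decimalType.enabled", "true")       # 64-bit decimal support
    .getOrCreate())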

The plugin documentation lists operations that can be supported or not by configuration and the reasons why certain operations are enabled or disabled by default. In addition to floating-point aggregation and decimal support, there are several classes of operations that production Spark workloads are extremely likely to benefit from enabling:

  • Cast operations, especially from string to date or numeric types or from floating-point types to decimal types.
  • String uppercasing and lowercasing (e.g., SELECT UPPER(name) FROM EMPLOYEES) are not supported for some Unicode characters whose case conversion changes the character’s width in bytes, but many applications do not use such characters. You can enable these operations individually, or enable them along with several others, by setting spark.rapids.sql.incompatibleOps.enabled to true.
  • Reading specific types from CSV files. While reading CSV files is currently enabled by default in the plugin (spark.rapids.sql.format.csv.enabled), reading invalid values of some types (numeric types, dates, and decimals in particular) behaves differently on the GPU than on the CPU, so reading each of these types must be enabled individually.

Accelerate data ingest from CSV files

CSV reading warrants additional attention: it is expensive and accelerating it can improve the performance of many jobs. However, because the behavior of CSV reading under the RAPIDS Accelerator may diverge from Spark’s behavior while executing on CPUs and because of the huge dynamic range of real-world CSV file quality, it is particularly important to validate the results of reading CSV files on the GPU. One quick but valuable sanity check is to ensure that reading a CSV file on the GPU returns the same number of NULL values as reading the same file on the CPU. Of course, there are many benefits to using a self-documenting structured input format like Parquet or ORC instead of CSV if possible.
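
One way to run that sanity check, assuming an existing SparkSession named spark with the plugin installed and a hypothetical input path, is to read the same file with the plugin disabled and then enabled and compare per-column NULL counts:

from pyspark.sql import functions as F

def null_counts(path):
    df = (spark.read.option("header", "true")
        .option("inferSchema", "true")  # parse typed values so malformed fields surface as NULLs
        .csv(path))
    return df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c)
                      for c in df.columns]).first()

spark.conf.set("spark.rapids.sql.enabled", "false")  # force the CPU path
cpu_counts = null_counts("/data/raw/customers.csv")  # hypothetical path

spark.conf.set("spark.rapids.sql.enabled", "true")   # re-enable GPU execution
gpu_counts = null_counts("/data/raw/customers.csv")

assert cpu_counts == gpu_counts, "GPU CSV read parsed values differently than the CPU read"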

Avoid unintended consequences of query optimization

The RAPIDS Accelerator transforms a physical query plan to delegate certain operators to the GPU.  By the time Spark has generated a physical plan, though, it has already performed several transformations on the logical plan, which may involve reordering operations.  As a consequence, an operation near the end of a query or data frame operation as it was stated by the developer or analyst may get moved from a leaf of the query plan towards the root.

Figure 2: A depiction of executing a data frame query that joins two data frames and then filters the results. If the predicate is sufficiently selective, most of the output tuples will be discarded.

Figure 3: A depiction of executing a data frame query that filters two input relations before joining the results. If the predicate can be evaluated on each input relation independently, this query execution produces the same results as the query execution in Figure 2 much more efficiently.

In general, this sort of transformation can improve performance. As an example, consider a query that joins two data frames and then filters the results: when possible, it will often be more efficient to execute the filter before executing the join. Doing so reduces the cardinality of the join, eliminates comparisons that will ultimately be unnecessary, decreases memory pressure, and can even reduce the number of data frame partitions that need to be considered in the join. However, this sort of optimization can have counterintuitive consequences: aggressive query reordering may hurt performance on the GPU if the operation that is moved towards the root of the query plan is supported only on the CPU, or if it generates a value of a type that is not supported on the GPU. When this happens, a greater percentage of the query plan may execute on the CPU than is strictly necessary. You can often work around this problem and improve performance by dividing a query into two parts that execute separately, forcing CPU-only operations near the leaves of the query plan to run only after the accelerable parts of the original query have executed on the GPU.
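
One way to force that separation, sketched here with hypothetical data frames and a hypothetical CPU-only function, is to materialize the GPU-friendly portion of the query before applying the unsupported operation:

from pyspark.sql import functions as F

# Hypothetical: customers and transactions are existing data frames, and
# parse_legacy_code is a Python UDF the plugin cannot run on the GPU.
accelerable = (customers.join(transactions, "customer_id")
    .filter(F.col("amount") > 0)
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total")))

accelerable.write.mode("overwrite").parquet("/tmp/churn/accelerable")  # materialize the GPU portion
staged = spark.read.parquet("/tmp/churn/accelerable")

final = staged.withColumn("legacy", parse_legacy_code("customer_id"))  # CPU-only step runs last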

Conclusion

In this third installment, we’ve detailed some practical considerations for getting the most out of Apache Spark and the RAPIDS Accelerator for Apache Spark.  Most teams will realize the greatest benefits by focusing on using Spark’s data frame abstractions cleanly.  However, some applications may benefit from minor tweaks, in particular semantics-preserving code changes that consider the RAPIDS Accelerator’s execution model and avoid unsupported operations.  Future installments will address the rest of the data science discovery workflow and the machine learning lifecycle.

Categories
Misc

Clara Train 4.0 Upgrades to MONAI and supports FL with Homomorphic Encryption

NVIDIA recently released Clara Train 4.0, an application framework for medical imaging that includes pre-trained models, AI-Assisted Annotation, AutoML, and Federated Learning. In this 4.0 release, there are three new features to help get you started training quicker.

Clara Train has upgraded its underlying infrastructure from TensorFlow to MONAI. MONAI is an open-source, PyTorch-based framework that provides domain-optimized foundational capabilities for healthcare. By leveraging MONAI, users now have access to a comprehensive list of medical image–specific transformations and reference networks. Clara Train has also updated its DeepGrow model to work on 3D CT images. This updated model gives you the ability to segment an organ in 3D with only a few clicks across the organ.  

Expanding into Digital Pathology, Clara Train helps users navigate these new workloads by providing a Digital Pathology pipeline that includes data loading and training optimizations. These data loading optimizations involve using the new cuCIM library included in RAPIDS.  

Clara Train 4.0 continues to improve on its Federated Learning framework by adding homomorphic encryption tools. Homomorphic encryption allows you to compute on data while the data is still encrypted. It can play an important role in healthcare by ensuring that patient data stays secure at each hospital while each institution still benefits from federated learning with others.

To learn more about these new features, check out our Clara Train 4.0 feature highlights video or the latest blog posts, which include a walkthrough of how you can Bring Your Own Components to Clara Train.

Download Clara Train 4.0 from NGC, and try out the newly updated Jupyter notebooks on GitHub.

Categories
Misc

What Is Explainable AI?

Banks use AI to determine whether to extend credit, and how much, to customers. Radiology departments deploy AI to help distinguish between healthy tissue and tumors. And HR teams employ it to work out which of hundreds of resumes should be sent on to recruiters. These are just a few examples of how AI is Read article >

The post What Is Explainable AI? appeared first on The Official NVIDIA Blog.

Categories
Misc

Is it possible to create a deep neural network in tensorflow that feedbacks across multiple layers?

The regular RNN layers feed back to themselves. Instead, I want to create a network where I can connect a later layer, across multiple layers, back to an earlier one.

Something like this image: https://imgur.com/GDscR4Z

I am having trouble finding anything like this. Wondering if it is even possible.

submitted by /u/Tahoma-sans
[visit reddit] [comments]

Categories
Misc

Is there a way to incorporate other information into a segmentation model?

Hey guys! So my research professor wants me to construct a notebook that uses image data alongside genetic data in segmentation. I’ve already constructed a segmentation model that runs fine, and I have all of the genetic data in a dataframe – I was wondering how I’d go about incorporating that into the model

submitted by /u/maruchaannoodles
[visit reddit] [comments]

Categories
Misc

TFLite 1.47x slower than SavedModel

I’m benchmarking a model in a controlled environment (docker container with 1 CPU and 4GB RAM).

Running 100 inferences on the SATRN model with batch size 1 takes on average 1.26 seconds/inference using the TFLite model and 0.86 seconds/inference using the SavedModel.

Is it expected? What would explain the performance difference?

submitted by /u/BarboloBR
[visit reddit] [comments]

Categories
Misc

Sparse Forests with FIL

This post was originally published on the RAPIDS AI Blog.

Introduction

The RAPIDS Forest Inference Library, affectionately known as FIL, dramatically accelerates inference (prediction) for tree-based models, including gradient-boosted decision tree models (like those from XGBoost and LightGBM) and random forests. (For a deeper dive into the library overall, check out the original FIL blog.) Models in the original FIL are stored as dense binary trees. That is, the storage of the tree assumes that all leaf nodes occur at the same depth. This leads to a simple, runtime-efficient layout for shallow trees. But for deep trees, it also requires a lot of GPU memory: 2^(d+1) - 1 nodes for a tree of depth d. To support even the deepest forests, FIL offers sparse tree storage. If a branch of a sparse tree ends earlier than the maximum depth d, no storage will be allocated for the potential children of that branch. This can deliver significant memory savings. While a dense tree of depth 30 will always require over 2 billion nodes, the skinniest possible sparse tree of depth 30 would require only 61 nodes.
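
A quick back-of-the-envelope check of those numbers in plain Python:

d = 30
dense_nodes = 2 ** (d + 1) - 1   # every level fully populated
skinny_sparse_nodes = 2 * d + 1  # a single chain to depth d, each split keeping one leaf
print(dense_nodes)               # 2147483647 (over 2 billion)
print(skinny_sparse_nodes)       # 61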

Using Sparse Forests with FIL

Using sparse forests in FIL is no harder than using dense forests. The type of forest created is controlled by the new storage_type parameter to ForestInference.load(). Its possible values are:

  • DENSE to create a dense forest,
  • SPARSE to create a sparse forest,
  • AUTO (default) to let FIL decide, which currently always creates a dense forest.

There is no need to change the format of the input file, input data or prediction output. The initial model could be trained by scikit-learn, cuML, XGBoost, or LightGBM. Below is an example of using FIL with sparse forests.

from cuml import ForestInference
import sklearn.datasets

# Load the classifier previously saved with xgboost save_model()
model_path = 'xgb.model'
fm = ForestInference.load(model_path, output_class=True,
                          storage_type='SPARSE')

# Generate random sample data
X_test, y_test = sklearn.datasets.make_classification()

# Generate predictions (as a GPU array)
fil_preds_gpu = fm.predict(X_test.astype('float32'))

Implementation

Figure 1: Storing sparse forests in FIL.

Figure 1 depicts how sparse forests are stored in FIL. All nodes are stored in a single large nodes array. For each tree, the index of its root in the nodes array is stored in the trees array. Each sparse node, in addition to the information stored in a dense node, stores the index of its left child. As each node always has two children, left and right nodes are stored adjacently. Therefore, the index of the right child can always be obtained by adding 1 to the index of the left child. Internally, FIL continues to support dense as well as sparse nodes, with both approaches deriving from a base forest class.
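
To make that layout concrete, here is a small illustrative sketch in Python (not FIL’s actual internal definitions) of how the trees and nodes arrays relate:

from dataclasses import dataclass

@dataclass
class SparseNode:
    feature: int      # feature index to test (ignored for leaves)
    threshold: float  # split threshold (ignored for leaves)
    is_leaf: bool
    value: float      # prediction value if this node is a leaf
    left_child: int   # index of the left child in the shared nodes array

# trees[i] holds the index of tree i's root within the shared nodes array.
trees = [0, 3]
nodes = [
    SparseNode(0, 0.5, False, 0.0, 1),    # tree 0 root; children at indices 1 and 2
    SparseNode(0, 0.0, True, -1.0, -1),   # tree 0, left leaf
    SparseNode(0, 0.0, True, 1.0, -1),    # tree 0, right leaf (always left_child + 1)
    SparseNode(1, 2.0, False, 0.0, 4),    # tree 1 root; children at indices 4 and 5
    SparseNode(0, 0.0, True, -0.5, -1),
    SparseNode(0, 0.0, True, 0.5, -1),
]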

Compared to the internal changes, the changes to the Python API have been kept to a minimum. The new storage_type parameter specifies whether to create a dense or sparse forest. Additionally, a new value, 'AUTO', has been made the default for the inference algorithm parameter; it allows FIL to choose the inference algorithm itself. For sparse forests, it currently uses the 'NAIVE' algorithm, which is the only one supported; for dense forests, it uses the 'BATCH_TREE_REORG' algorithm.

Benchmarks

To benchmark the sparse trees, we train a random forest using scikit-learn, specifically sklearn.ensemble.RandomForestClassifier. We then convert the resulting model into a FIL forest and benchmark the performance of inference. The data is generated using sklearn.datasets.make_classification() and contains 2 million rows and 32 columns, split equally between a training and a validation dataset. For benchmarking, inference is performed on 1 million rows.
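
A sketch of that setup; parameter values are illustrative, and we assume a cuML release that provides ForestInference.load_from_sklearn for converting the scikit-learn model:

import sklearn.datasets
import sklearn.ensemble
from cuml import ForestInference

X, y = sklearn.datasets.make_classification(n_samples=2_000_000, n_features=32)
X_train, y_train = X[:1_000_000], y[:1_000_000]
X_valid = X[1_000_000:].astype('float32')

skl_model = sklearn.ensemble.RandomForestClassifier(
    n_estimators=100, max_leaf_nodes=2048, n_jobs=-1).fit(X_train, y_train)

# Convert the trained scikit-learn model to a FIL forest and run inference on the GPU
fm = ForestInference.load_from_sklearn(skl_model, output_class=True,
                                       storage_type='SPARSE')
fil_preds = fm.predict(X_valid)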

We use two sets of parameters for benchmarking.

  • With a depth limit set to either 10 or 20; in this case, either a dense or a sparse FIL forest fits into GPU memory.
  • Without a depth limit; in this case, the model trained by SKLearn contains very deep trees. In our benchmark runs, the trees usually have a depth between 30 and 50. Trying to create a dense FIL forest runs out of memory, but a sparse forest can be created without issue.

In both cases, the size of the forest itself remains relatively small, as the number of leaf nodes in a tree is limited to 2048, and the forest consists of 100 trees. We measure the time of the CPU inference and the GPU inference. The GPU inference was performed on V100, and the CPU inference was performed on a system with 2 sockets, each with 16 cores with 2-way hyperthreading. The benchmark results are presented in Figure 2.

Figure 2: Benchmark results for FIL (dense and sparse trees) and SKLearn.

Both sparse and dense FIL predictors (if the latter is available) are about 34–60x faster than the SKLearn CPU predictor. The sparse FIL predictor is slower than the dense one for shallow forests but can be faster for deeper forests; the exact performance difference varies. For instance, in Figure 2 with max_depth=10, the dense predictor is about 1.14x faster than the sparse predictor, but with max_depth=20, it is slower, achieving only 0.75x the speed of the sparse predictor. Therefore, the dense FIL predictor should be used for shallow forests.

For deep forests, however, the dense predictor runs out of memory, as its space requirements grow exponentially with the forest depth. The sparse predictor does not have this problem and provides fast inference on the GPU even for very deep trees.

Conclusion

With sparse forest support, FIL applies to a wider range of problems. Whether you’re building gradient-boosted decision trees with XGBoost or random forests with cuML or scikit-learn, FIL should be an easy drop-in option to accelerate your inference. As always, if you encounter any issues, feel free to file issues on GitHub or ask questions in our public Slack channel!

Categories
Misc

Epochs not running and GPU memory usage disappearing on cnn model.

I’m currently a student playing around with basic Deep Learning and tensorflow for a project.

I’ve installed and am running tensorflow on my RTX 3070, and use jupyter notebooks on anaconda for my code.

I’m currently playing around with an American Sign Language dataset, (one made up of 28×28 grayscale images of various letters in asl)

I’ve gotten simple models like:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(units=512, activation='relu', input_shape=(784,)))
model.add(Dense(units=512, activation='relu'))
model.add(Dense(units=num_classes, activation='softmax'))

working to great effect on my GPU, but if I try a convolutional neural network on the same dataset, like this:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
    Dense,
    Conv2D,
    MaxPool2D,
    Flatten,
    Dropout,
    BatchNormalization,
)

model = Sequential()
model.add(Conv2D(75, (3, 3), strides=1, padding="same", activation="relu", input_shape=(28, 28, 1)))
model.add(BatchNormalization())
model.add(MaxPool2D((2, 2), strides=2, padding="same"))
model.add(Conv2D(50, (3, 3), strides=1, padding="same", activation="relu"))
model.add(Dropout(0.2))
model.add(BatchNormalization())
model.add(MaxPool2D((2, 2), strides=2, padding="same"))
model.add(Conv2D(25, (3, 3), strides=1, padding="same", activation="relu"))
model.add(BatchNormalization())
model.add(MaxPool2D((2, 2), strides=2, padding="same"))
model.add(Flatten())
model.add(Dense(units=512, activation="relu"))
model.add(Dropout(0.3))
model.add(Dense(units=num_classes, activation="softmax"))

and then I compile using:

model.compile(loss="categorical_crossentropy", metrics=["accuracy"]) 

and train using:

model.fit(x_train, y_train, epochs=20, verbose=1, validation_data=(x_valid, y_valid)) 

But if I run the above code, all I get is:

Epoch 1/20 

as my output. When I define the model, I see that the majority of my GPU memory is in use (specifically 7.6/8 GB), but when I try training it, all of the memory just instantly disappears, as if there never was a model.

can anyone tell me what is wrong here?

submitted by /u/the_mashrur
[visit reddit] [comments]

Categories
Misc

How Diversity Drives Innovation: Catch Up on Inclusion in AI with NVIDIA On-Demand

NVIDIA’s GPU Technology Conference is a hotbed for sharing groundbreaking innovations — making it the perfect forum for developers, students and professionals from underrepresented communities to discuss the challenges and opportunities surrounding AI. Last month’s GTC brought together virtually tens of thousands of attendees from around the world, with more than 20,000 developers from emerging Read article >

The post How Diversity Drives Innovation: Catch Up on Inclusion in AI with NVIDIA On-Demand appeared first on The Official NVIDIA Blog.