NVIDIA and Snowflake announced a new partnership bringing accelerated computing to the Data Cloud with the new Snowpark Container Services (private preview), a runtime for developers to manage and deploy containerized workloads. By integrating the capabilities of GPUs and AI into the Snowflake platform, customers can enhance ML performance and efficiently fine-tune LLMs. They achieve this by leveraging the NVIDIA AI Enterprise software suite on the secure and governed Snowflake platform. With this collaboration, customers can develop cost-effective AI-powered applications using their valuable data.
About NVIDIA AI Enterprise
As AI initiatives progress, a trusted, scalable support model becomes vital to keeping enterprise AI projects on track. To support building AI applications, NVIDIA AI Enterprise includes software that streamlines the end-to-end AI pipeline, from data preparation to model training to simulation to deployment at scale.
NVIDIA AI Enterprise is the software layer of the NVIDIA AI platform and includes:
Optimizations to run on accelerated infrastructure for performance, productivity, and cost savings.
Enterprise-grade support, security, and API stability.
AI workflows and pretrained models to speed time to production.
Certifications to deploy everywhere—cloud, data center, and edge.
Enable AI workflows directly on Snowflake Data Cloud
Data is the fuel for generative AI—and the data fueling enterprise AI use cases lives in Snowflake. With the NVIDIA AI platform now available on Snowpark Container Services, customers can put their data to work without sacrificing security, performance, or ease of use.
Using the NVIDIA AI Enterprise accelerated infrastructure and computing libraries through Snowpark Container Services, developers and data scientists can build accelerated AI workflows with ease.
With Snowpark, enterprises securely deploy and process the Python code they use for AI and ML. Developers can also scale accelerated ML workloads and run sophisticated AI models such as LLMs where their data is already stored, reducing the security risks and latency that come with moving large amounts of data.
The following workflow shows how a data scientist can implement each stage, from data processing to real-time inferencing, as part of the new partnership. The technologies outlined in this workflow and example use case, including NVIDIA RAPIDS, NVIDIA Merlin, NVIDIA TensorRT, and NVIDIA Triton, are all included with NVIDIA AI Enterprise and available on Snowpark Container Services.
Example use case: training a recommender model
Session-based recommender systems are becoming increasingly important in content-rich applications aiming to provide more relevant next-item predictions. Training these models starts by loading data and conducting initial SQL and DataFrame preprocessing using the Snowpark Python library.
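The sketch below is a minimal illustration of that first step using the Snowpark Python API; the connection parameters, table, and column names are hypothetical placeholders rather than part of the announced workflow.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

# Hypothetical connection parameters -- supply your own account details.
connection_parameters = {
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# Hypothetical clickstream table and columns: initial DataFrame preprocessing runs inside Snowflake.
events = session.table("CLICKSTREAM_EVENTS")
sessions_df = (
    events.filter(col("EVENT_TYPE").isin("view", "purchase"))
          .select("SESSION_ID", "ITEM_ID", "EVENT_TYPE", "EVENT_TIMESTAMP")
          .sort(col("SESSION_ID"), col("EVENT_TIMESTAMP"))
)
sessions_df.write.mode("overwrite").save_as_table("SESSIONS_PREPROCESSED")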
Customers can use NVIDIA RAPIDS, Merlin, and other AI frameworks in a Jupyter Notebook running in Snowpark Container Services to augment data processing and train models.
NVTabular, part of Merlin, is an accelerated feature engineering library designed to generate the key features needed to train recommender models. After the data is prepared with NVTabular, other accelerated libraries in Merlin take over model training.
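As a rough sketch of what that feature engineering step can look like (the column names and file paths here are hypothetical), an NVTabular workflow chains operators over named columns and then fits and applies them to a dataset:
import nvtabular as nvt
from nvtabular import ops

# Hypothetical columns for a session-based recommender dataset.
cat_features = ["item_id", "category"] >> ops.Categorify()
cont_features = ["price"] >> ops.Normalize()

workflow = nvt.Workflow(cat_features + cont_features)

train_ds = nvt.Dataset("sessions_preprocessed.parquet")  # exported from the Snowflake table
workflow.fit(train_ds)
workflow.transform(train_ds).to_parquet("train_processed/")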
Training a model on very large datasets can be time-consuming. During training, the dataset is often copied in and out of memory in chunks while compute cores process it. Training on a GPU provides higher throughput and faster model training because GPUs combine high-bandwidth memory with a large number of compute cores for parallel processing.
Running on Snowpark Container Services with accelerated compute delivers a 20x speedup when training a predictive model. This boosts data scientist productivity during model creation and reduces the overall TCO by doing more in less time.
After training, the model is tested for accuracy with sample test data. The workflow then optimizes and retrains the model as needed. Finally, it publishes the newly trained model to a registry such as the Snowpark Model Registry (private preview).
After training, the model is optimized with NVIDIA TensorRT, included with NVIDIA AI Enterprise, for accelerated computing. At the final stage of the workflow, the model is deployed in an NVIDIA Triton Inference Server, where it consumes data in real time and provides insights.
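For illustration, a client application could query the deployed model through the Triton HTTP client as sketched below; the model name, tensor names, and shapes are hypothetical and depend on how the trained recommender was exported.
import numpy as np
import tritonclient.http as httpclient

# Hypothetical model and tensor names; shapes depend on how the model was exported.
client = httpclient.InferenceServerClient(url="localhost:8000")

session_items = np.array([[101, 42, 7, 0, 0]], dtype=np.int64)  # padded item-ID sequence
infer_input = httpclient.InferInput("item_ids", list(session_items.shape), "INT64")
infer_input.set_data_from_numpy(session_items)

response = client.infer(model_name="session_recommender", inputs=[infer_input])
scores = response.as_numpy("next_item_scores")
print(scores[0].argsort()[::-1][:10])  # indices of the top-10 predicted next items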
Request access to get started
Snowflake customers can request access to the technical preview of the Snowpark Container Services from their account team. Customers are also eligible for a free NVIDIA AI Enterprise 90-day evaluation license to make sure they have access to the full stack of NVIDIA AI software.
Single-cell sequencing has become one of the most prominent technologies used in biomedical research. Its ability to decipher changes in the transcriptome and epigenome on a cell level has enabled researchers to gain valuable new insights. As a result, single-cell experiments have grown in size and complexity by a factor of over 100, with experiments involving more than 1 million cells becoming increasingly common.
However, the resulting data must be analyzed in a highly iterative process. It is crucial that fast algorithms are used for these iterative steps to enable quick turnaround times.
For more consistent single-cell analysis using Python, developers at scverse have worked toward building an entire ecosystem to help researchers perform analyses. At the core of this ecosystem lies a data structure that maintains annotations of various transformations throughout the data processing pipeline during single-cell analysis.
AnnData, a Python package for handling annotated data matrices in memory and on disk, is the core structure used in the Scanpy library, which is the main single-cell analysis suite within the scverse ecosystem. Scanpy builds on top of other libraries common to the PyData ecosystem—such as NumPy, SciPy, Numba, and Scikit-learn—for just about all of the typical analysis steps.
However, Scanpy algorithms are mostly CPU-based and slow down significantly with larger experiments. The highly iterative nature of the single-cell analysis process only compounds this problem.
GPUs for single-cell analysis
General feasibility of running downstream single-cell RNA sequencing (scRNA-seq) analysis on the GPU with RAPIDS and Scanpy has been shown in Accelerating Single-Cell Genomic Analysis with GPUs. This work resulted in the rapids-single-cell-examples GitHub repo, which contains a series of example notebooks written by the RAPIDS and NVIDIA Parabricks teams. RAPIDS is an open-source suite of libraries for GPU-accelerated data science with Python. Parabricks is a free suite of GPU-accelerated, industry-standard genomic analysis tools based on deep learning.
While these example notebooks demonstrate a few typical single-cell RNA workflows on the GPU, they were never intended for daily use nor as a GPU-accelerated replacement for libraries like Scanpy.
Drawing inspiration from prior work on rapids-single-cell-examples is an emerging library called rapids-singlecell, a GPU-accelerated tool for scRNA analysis. This tool aims to be a daily drivable single-cell analysis suite that is compatible with the scverse ecosystem. It uses RAPIDS and CuPy to provide GPU-accelerated functions that are near-drop-in replacements for the corresponding functions in Scanpy.
rapids-singlecell follows a similar usability model as the scverse Python libraries. It is also written in Python but moves many of the performance-critical pieces to the GPU, hiding the complexities normally associated with writing CUDA code (the language typically used to write accelerated algorithms for NVIDIA GPUs).
rapids-singlecell consists of five categories, which are described in the sections that follow. Each category accelerates a different piece of the typical single-cell analysis workflow.
The AnnData, or annotated data object, is a widely used data structure for handling single-cell RNA sequencing data. As shown in Figure 1, it consists of several attributes, including a count matrix attribute, X, which represents the expression levels of genes in each cell. AnnData objects also contain annotation dataframes for cells (.obs attribute) and genes (.var attribute), which store additional information such as cell type and gene names.
In contrast, cunnData (Figure 1) is a minimized and lightweight version of the AnnData object for the GPU that replaces the scverse standard for preprocessing. Instead of storing the count matrix .X on the CPU, cunnData stores it on the GPU as CuPy sparse matrices. This makes it faster and more efficient to perform computations on the count matrix.
cunnData also includes additional features, such as the ability to store different versions of the count matrix (for example, raw integer counts) in .layers. Unlike AnnData, which keeps .layers in host (CPU) memory, cunnData stores .layers on the GPU as well, reducing the need to copy data from the host to GPU memory and enabling accelerated computations.
cunnData supports unstructured annotations in the .uns attribute, as well as multidimensional annotations of cells and genes in the .obsm and .varm attributes, which are stored in the host memory. These annotations enable users to include additional information about their data, such as spatial coordinates or principal component analysis (PCA) embeddings.
Similarly, cunnData supports slicing like AnnData. These slices, however, are always full copies of the original data, as opposed to views. Overall, cunnData enables a faster approach to preprocessing scRNA-seq data compared to the more feature-rich CPU-bound AnnData object.
The Python snippet below demonstrates the conversion of an AnnData object (a standard data structure for handling single-cell RNA sequencing data) into a cunnData object.
import scanpy as sc
import rapids_singlecell as rsc
adata = sc.read("PATH TO DATASET")
cudata = rsc.cunnData.cunnData(adata=adata)
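A few illustrative accesses on the resulting object are sketched below; the attribute names follow the description above, but the exact API can differ between rapids-singlecell versions.
# Illustrative inspection of the cunnData object; attribute names may differ by version.
print(type(cudata.X))                      # CuPy sparse matrix stored on the GPU
cudata.layers["counts"] = cudata.X.copy()  # keep a copy of the raw counts on the GPU
print(cudata.obs.head())                   # cell annotations, held in host memory
print(cudata.var.head())                   # gene annotations, held in host memory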
Preprocessing
The preprocessing functions are stored in cunnData_funcs, which provides accelerated alternatives to the Scanpy preprocessing functions. These functions work on the cunnData object and use RAPIDS cuML and CuPy to dramatically accelerate Scanpy functions based on scikit-learn, NumPy, and SciPy.
Filtering cells and genes can be accomplished with filter_cells and filter_genes functions, respectively. Quality control is handled with the calculate_qc_metrics function.
To normalize your data, cunnData_funcs provides GPU alternatives to the normalize_total, log1p, and the recently introduced normalize_pearson_residuals functions from Scanpy. Annotating highly variable genes is accelerated for all flavors supported in Scanpy (including seurat, cellranger, seurat_v3, pearson_residuals), as well as poisson_gene_selection, which is adapted from scvi-tools.
The regress_out function, used to remove unwanted sources of variation, is accelerated with the cuML linear regression estimator. It also supports multitarget regression, which was introduced in cuML in version 22.12, while staying backwards compatible with prior versions.
Principal component analysis wraps cuML PCA, Truncated SVD, and Incremental PCA to give you the same options offered by Scanpy. With the PCA version in cunnData_funcs, you can choose which layer you want to use for the analysis, an additional feature not currently supported by the scanpy PCA function.
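Putting these pieces together, a preprocessing pass might look like the sketch below. The function names follow the ones listed above, but the module path and argument names are assumptions that may differ between rapids-singlecell versions.
import rapids_singlecell as rsc

# Hedged sketch: function names are from this post; arguments are illustrative.
rsc.cunnData_funcs.calculate_qc_metrics(cudata)
rsc.cunnData_funcs.filter_cells(cudata, qc_var="n_counts", min_count=500)
rsc.cunnData_funcs.filter_genes(cudata, min_cells=3)
rsc.cunnData_funcs.normalize_total(cudata, target_sum=1e4)
rsc.cunnData_funcs.log1p(cudata)
rsc.cunnData_funcs.highly_variable_genes(cudata, n_top_genes=2000, flavor="seurat_v3")
rsc.cunnData_funcs.regress_out(cudata, keys=["n_counts"])
rsc.cunnData_funcs.pca(cudata, n_comps=50)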
cunnData_funcs can accelerate preprocessing by a factor of 10 to 20 (Tables 1-3). After preprocessing, the cunnData object is transformed back into an AnnData object.
adata_preprocessed = cudata.to_AnnData()
Tools
scanpy_gpu provides accelerated functions that work directly on the AnnData object. To keep the syntax as close as possible between Scanpy and rapids-singlecell, metadata is also written to the .uns attribute, which is useful for storing trained parameters such as the variance ratio computed during PCA. scanpy_gpu provides a PCA function for the AnnData object equivalent to the one in cunnData_funcs.
Scanpy already includes support for computing UMAP and nearest neighbors on the GPU using cuML. scanpy_gpu extends this GPU support by adding more algorithms, such as accelerated graph-based clustering with Leiden and Louvain from cuGraph, as well as the Force Atlas 2 algorithm for visually laying out graph data. scanpy_gpu also uses PCA and kernel density estimation (KDE) from cuML, and computes diffusion maps with CuPy in much the same way that Scanpy uses SciPy and NumPy for scientific computing.
For batch correction, scanpy_gpu provides a GPU port of Harmony Integration, called harmony_gpu. PyMDE (minimum distortion embedding), a function that enables embedding single-cell data while jointly learning the graph and the low-dimensional representation in a probabilistic manner, has also been adapted from scvi-tools.
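A short, hedged sketch of the downstream analysis using the function names mentioned above follows; the exact signatures, and the batch key used for Harmony, are assumptions.
import scanpy as sc
import rapids_singlecell as rsc

# Downstream analysis on the AnnData object; signatures are illustrative.
rsc.scanpy_gpu.pca(adata_preprocessed, n_comps=50)
rsc.scanpy_gpu.harmony_gpu(adata_preprocessed, key="batch")            # hypothetical batch key
sc.pp.neighbors(adata_preprocessed, n_neighbors=15, method="rapids")   # Scanpy's existing cuML path
sc.tl.umap(adata_preprocessed, method="rapids")
rsc.scanpy_gpu.leiden(adata_preprocessed, resolution=1.0)
rsc.scanpy_gpu.louvain(adata_preprocessed)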
Because rapids-singlecell is a near-drop-in replacement, it relies on Scanpy for visualization, and plotting remains intuitive directly within the scverse framework.
Decoupler
The decoupler tool uses a unified framework to implement several statistical methods focused on biological activities, such as cellular, molecular, and physiological processes in living organisms (for example, gene set and transcription factor activity). decoupler_gpu re-implements and accelerates the weighted sum (run_wsum) and multivariate linear model (run_mlm) methods. The GPU port in rapids-singlecell uses the same nets/models as decoupler. Table 1 shows a performance increase of up to 37x for wsum.
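The sketch below shows how the GPU-ported methods named above might be called with networks retrieved through decoupler; the decoupler_gpu module path and argument usage are assumptions based on this post's description.
import decoupler as dc
import rapids_singlecell as rsc

# Fetch the same nets/models used by CPU decoupler (as described above).
progeny_net = dc.get_progeny(organism="human", top=100)
dorothea_net = dc.get_dorothea(organism="human")

# Hedged sketch: run the GPU-ported methods on the preprocessed AnnData object.
rsc.decoupler_gpu.run_wsum(mat=adata_preprocessed, net=progeny_net)
rsc.decoupler_gpu.run_mlm(mat=adata_preprocessed, net=dorothea_net)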
Squidpy developments
rapids-singlecell is continually expanding with new accelerated functions for the scverse ecosystem. Comprehensive tests have been added to the library to ensure the correctness and reliability of the code. Squidpy enables detailed analysis and visualization of spatial molecular data. It facilitates understanding of complex cell interactions and spatial patterns, greatly contributing to the expansion of the scverse ecosystem.
Our benchmark results show that using GPU acceleration with the rapids-singlecell package and the decoupler functions can significantly improve the performance of scRNA-seq analysis.
For example, running a sample rapids-singlecell notebook with about 90 K cells end-to-end on a server node with two AMD EPYC Milan 7543 CPUs, 500 GB of memory, and an NVIDIA A100 80 GB GPU completed in just 51 seconds using the rapids-singlecell package, compared to 1,106 seconds with the traditional Scanpy CPU workflow.
Similarly, the decoupler functions also demonstrated significant speed improvements, with the mlm function running in just 12 seconds on the GPU compared to 83 seconds on the CPU, and the wsum method completing in just 26 seconds on the GPU compared to 16 minutes and 10 seconds on the CPU.
Overall, these results demonstrate the potential for GPU acceleration to make scRNA-seq analysis faster and more efficient. These benchmark results are summarized in Table 1.
Function | CPU | GPU | Speedup
Whole notebook (excluding decoupler functions) | 1,106 s (18.5 min) | 51 s | 21x
Preprocessing | 74 s | 8 s | 9x
Regress out | 27 s | 1.6 s | 16x
PCA | 35 s | 0.7 s | 50x
HVG (Seurat v3) | 3.2 s | 0.4 s | 8x
Harmony | 417 s | 18 s | 23x
Neighbors | 22 s | 5.1 s | 4.3x
UMAP | 36 s | 0.4 s | 90x
TSNE | 133 s | 2.4 s | 55x
Louvain | 17 s | 0.6 s | 28x
Leiden | 14 s | 0.2 s | 70x
Logistic regression | 58 s | 3.7 s | 15x
Draw graph (FA2) | 256 s | 0.3 s | 850x
run_mlm (DoRothEA) | 83 s | 12 s | 7x
run_wsum (PROGENy) | 970 s (16 min) | 26 s | 37x
Table 1. Server node benchmark for a dataset of 90,000 cells
In addition to the previous benchmark results, running a sample rapids-singlecell notebook with 500 K cells on the server node takes about 2 minutes when using rapids-singlecell. The same analysis takes about 41 minutes on the CPU.
Furthermore, pearson_residuals-based highly variable gene selection and normalization can also be accelerated on GPUs, providing additional speed improvements in scRNA-seq analysis. These benchmark results are summarized in Table 2.
rapids-singlecell is not only capable of accelerating single-cell data analysis on high-end server nodes, but also on consumer-grade hardware. Running the same notebook end-to-end with 500,000 cells on a desktop system with an AMD 5950x CPU, 64 GB of memory, and an NVIDIA RTX 3090 GPU takes around 5 minutes using rapids-singlecell. Although the system used the RAPIDS Memory Manager (RMM) and unified memory to oversubscribe the GPU's memory, it still saw a significant speedup compared to the CPU server. These benchmark results are summarized in Table 2.
Function | CPU | GPU (A100) | GPU (3090) | Speedup
Whole notebook (excluding PR functions) | 2,460 s (41 min) | 110 s | 290 s | 22x
Preprocessing | 305 s | 28 s | 169 s | 10x
HVG (Seurat v3) | 48 s | 1.5 s | 13 s | 32x
Regress out | 104 s | 5.1 s | 16 s | 20x
Scale | 8.4 s | 1.3 s | 5 s | 6.4x
PCA | 86 s | 3.7 s | 35 s | 23x
Neighbors | 74 s | 17.1 s | 18.3 s | 4.3x
UMAP | 281 s (4.6 min) | 6.7 s | 7.6 s | 60x
TSNE | 786 s (13 min) | 10 s | 12.9 s | 105x
Louvain | 283 s (4.5 min) | 4.5 s | 5.7 s | 62x
Leiden | 282 s (4.5 min) | 0.6 s | 0.9 s | 470x
Logistic regression | 452 s (7.5 min) | 33 s | 63 s | 13x
Diffusion map | 30 s | 0.75 s | 1.3 s | 40x
HVG (PR) | 104 s | 2.1 s | 15.6 s | 50x
Normalize (PR) | 22 s | 0.3 s | 1 s | 73x
Table 2. Server node and consumer system benchmark for a dataset of 500,000 cells
Running the same sample notebook (Table 1) with about 90 K cells end-to-end on the desktop system takes only 48 seconds when using rapids-singlecell. In comparison, the traditional scanpy CPU workflow takes 774 seconds. The accelerated decoupler functions also demonstrate significant speed improvements on consumer-grade hardware. These benchmark results are summarized in Table 3.
Function | CPU | GPU | Speedup
Whole notebook (excluding decoupler functions) | 774 s (13 min) | 48 s | 16x
Preprocessing | 114 s | 6 s | 19x
Regress out | 62 s | 1.6 s | 39x
PCA | 42 s | 0.7 s | 60x
HVG (Seurat v3) | 2.7 s | 0.4 s | 6.7x
Harmony | 175 s | 21.7 s | 8x
Neighbors | 14.9 s | 4.6 s | 3.2x
UMAP | 31 s | 0.3 s | 103x
TSNE | 95 s | 1.4 s | 68x
Louvain | 9.3 s | 0.5 s | 18x
Leiden | 13.2 s | 0.1 s | 130x
Logistic regression | 76 s | 3.75 s | 20x
Draw graph (FA2) | 191 s | 0.23 s | 830x
run_mlm (DoRothEA) | 55 s | 12 s | 4.5x
run_wsum (PROGENy) | 690 s (11.5 min) | 28 s | 26x
Table 3. Consumer system benchmark for a dataset of 90,000 cells
Installation
There are multiple ways to install rapids-singlecell. The easiest is to use one of the YAML files provided in the GitHub repository, which set up an entire environment with everything needed to run the example notebooks.
conda env create -f conda/rsc_rapids_23.02.yml
You can also install rapids-singlecell from PyPI into a Conda environment and install RAPIDS from Conda. The default installer does not include RAPIDS or CuPy. Scanpy is also excluded because it is technically not necessary.
pip install rapids-singlecell
Finally, you can install the entire library, including the RAPIDS dependencies, from PyPI using the new experimental PyPI packages from RAPIDS. However, this method requires the user to properly set up CUDA so that it can be found by RAPIDS and CuPy.
With the rapids-singlecell library, it is possible to run the complete analysis of 500 K cells in less time than it takes a CPU to compute only its UMAP embedding. This enables a much faster iterative process across the stages of single-cell data analysis.
rapids-singlecell also enables the bioinformatician to analyze data with a physician or biologist in real time, leading to better collaboration and interpretation. In our experience, it is possible to analyze 200 K cells without any issues, even on a consumer-class 3090 series graphics card. Even better, RMM enables the GPU memory to be oversubscribed and spilled to main memory, enabling scaling to well over 500 K cells.
With the datacenter-class NVIDIA A100 80 GB GPU, you can analyze matrices containing as many as 2^31 - 1 (approximately 2.15 billion) non-zero counts. (Note that this is the current limit imposed by CuPy's 32-bit integer-based indexing for sparse matrix calculations.) This powerful capability enables users to analyze datasets with over 1 million cells.
The upwards of 20x speedup that rapids-singlecell provides enables researchers to focus more on analyzing and interpreting their single-cell data, rather than waiting for lengthy computational processes. In the true spirit of RAPIDS, this ultimately enhances productivity and fosters new insights into cellular biology that were not possible before.
Snowflake, the Data Cloud Company, and NVIDIA today announced at Snowflake Summit 2023 that they are partnering to provide businesses of all sizes with an accelerated path to create customized generative AI applications using their own proprietary data, all securely within the Snowflake Data Cloud.
As data generation continues to increase, linear performance scaling has become an absolute requirement for scale-out storage. Storage networks are like car roadway systems: if the road is not built for speed, the potential speed of a car does not matter. Even a Ferrari is slow on an unpaved dirt road full of obstacles.
Scale-out storage performance can be hindered by the Ethernet fabric that connects the storage nodes. NVIDIA accelerated Ethernet can remove performance bottlenecks, enabling maximum storage performance for applications in general, and AI/ML in particular.
Scale-out storage requires a strong network
Every second, 54,000 pictures are taken worldwide. By the time you read this, that number will be even higher. No matter what your business is, chances are you have massive amounts of data that must be stored and analyzed, with the amount growing every day.
The old scale-up approach of using ever-larger storage filers has been replaced with a scale-out approach to deliver storage that scales linearly in both capacity and performance.
With scale-out storage, or distributed storage, several smaller nodes are configured and connected to act as one logical unit. A single file or object can be spread across many nodes.
As more scale is needed, additional storage nodes are easily added to increase both storage capacity and performance. This applies to both a traditional enterprise storage vendor solution, or a software-defined solution, with the software and hardware sourced independently.
Distributed storage enables flexible scaling and cost efficiency, but requires a high-performance network to connect the storage nodes. Many data center switches are ill-suited to the unique traffic characteristics of storage, and in fact can cripple the performance of scale-out storage solutions.
How storage traffic is different from traditional traffic
For many use cases, network traffic is consistent and homogeneous, and traditional Ethernet suffices. However, traffic generated by storage devices may cause the issues detailed below.
Network stress
Current storage solutions benefit from faster SSDs and storage interfaces, such as NVMe and PCIe Gen 4 (soon PCIe Gen 5), that are designed to deliver higher performance. As drives and interfaces get faster, the aggregate traffic they generate places increasing stress on the network fabric connecting the storage nodes.
Congestion
When the storage fabric is saturated, network congestion becomes inevitable, just like roadway congestion when too much traffic is on the highway. Network congestion is particularly problematic for scale-out storage because each storage node is expected to offer fast data delivery. But when congestion occurs, many data center switches have a fairness problem, where some nodes will be slowed much more than others. A single file or object is usually spread across many nodes, so anything that slows a single node effectively slows the whole cluster.
Bursty traffic
Most storage workloads are bursty, generating intense data transfers and repeatedly requiring large amounts of bandwidth for short periods. When that happens, the network switch must use its buffer to absorb the burst until the transient burst is over, thus preventing packet loss. Otherwise, that packet loss will require data retransmissions, significantly deteriorating application performance.
Storage jumbo frames
Traditional data center network traffic uses a maximum packet size (MTU) of 1.5 KB. Scale-out storage nodes perform better when they can use 9 KB “jumbo frames,” which increase throughput while reducing CPU processing overhead. Many data center switches built with commodity switch ASICs perform poorly or unpredictably with jumbo frames.
Low latency
One of the ways storage IOPs have improved is through the orders-of-magnitude latency reduction for the read/write operations in flash-based media. However, those costly performance improvements can be lost when the network introduces high latency—especially due to excessive buffering.
Both training and inference require adequate amounts of data with high-speed access, to make sure that GPU processors are fed quickly enough to keep them fully engaged. During training, WRITE operations are performed by all nodes to improve model accuracy. This results in a burst, making it imperative for switches to handle congestion effectively. Finally, lower storage latency enables GPUs to handle compute tasks more efficiently.
Why ASICs are suboptimal for storage traffic
Most data center switches are built using commodity switch ASICs that were cost-optimized for traditional data traffic patterns and packet sizes. To keep costs low while achieving their bandwidth targets, Ethernet switch chip vendors compromised fairness by using a split buffer architecture.
Every switch has a buffer to absorb traffic bursts and prevent packet loss when congestion occurs. The common approach is to have a buffer that is shared across many ports. However, not all shared buffers are the same—there are different buffer architectures.
Commodity switches do not have a fully shared buffer; they use either an ingress-shared buffer or an egress-shared buffer.
With an ingress-shared buffer, there is a static mapping between a group of incoming ports and a specific memory slice. These ports can use only the memory in that assigned slice and not the whole buffer, not even if the rest of the buffer is available and no one is using it.
With an egress-shared buffer, the mapping is between a group of outgoing ports and a specific buffer memory slice. Again, each group of egress ports can only use its assigned buffer slice, not the whole buffer.
With these two architectures, flows that stay within the same memory slice do not behave like flows that travel between memory slices. If many flows use ports with the same buffer, those ports will experience higher latency and lower throughput, while traffic using other slices of the buffer will enjoy higher performance.
Storage performance therefore depends on which ports the storage traffic (and other traffic) happens to use and how busy the buffer slices serving those ports are. This is why switches that use split buffers often have issues with fairness, predictability, and microburst absorption.
Why deep-buffer switches are suboptimal for storage
Deep-buffer switches usually refer to a switch that offers much more buffering (GBs rather than MBs). Deep-buffer switches are often promoted for use as routers, because they can absorb and hold large traffic bursts if there is a mismatch in network speeds or an incast situation.
But in most data center applications, including scale-out storage, the deep-buffer switches negatively impact performance for the following reasons:
Job completion time
With parallel file systems, the storage node with the slowest response dictates the time required to fetch a file. Unlike commodity switch ASICs, which have a sliced on-chip buffer, deep-buffer switches have both on-chip and off-chip buffers, and all of them are sliced rather than fully shared.
Think of how many ways flows can go before they leave the switch. They can stay within one on-chip memory slice (fastest), travel between on-chip memory slices (slower), or travel between on-chip and off-chip memory slices (very slow).
All these flows will behave differently, and hence will cause fairness and predictability issues for storage traffic. Because these issues slow down one or more nodes, they adversely impact job completion time and slow down the whole distributed storage cluster.
Latency
The larger the switch buffer, the longer the queue each packet must pass through, and the greater the latency. The tested average port-to-port latency of a deep-buffer switch is more than 500 microseconds. By comparison, NVIDIA Spectrum 1, a fully shared buffer switch from the same generation, has a latency of just 0.3 microseconds; it takes nanoseconds rather than microseconds to switch or route a packet.
Deep-buffer latency is therefore more than 1,000x higher. You may be wondering whether this happens only when congestion occurs. No. Under congestion, deep-buffer latencies climb much higher, up to 20 milliseconds, or 50,000x higher. While 500 microseconds of latency might be acceptable for a router between data centers, within a data center it spells death for flash storage performance.
Power and cost
Deep-buffer switches need hundreds of watts of power to operate even when idling, making their ongoing operational cost higher. The initial purchase cost of deep-buffer switches is also much higher. This might be justified if performance was better, but real-world testing has proven just the opposite.
Choosing an inappropriate network switch will severely slow down your storage workloads, making your expensive fast storage act like cheaper and slower storage.
With NVIDIA Spectrum, both CapEx and OpEx can be reduced, and the power saved can be put to other uses within the rack.
NVIDIA Spectrum switches are optimized for storage
With commodity switch ASICs, flows either stay within the same memory slice or travel between memory slices.
With NVIDIA Spectrum switches, all flows behave the same due to the fully shared buffer. The value of this architecture is maximum burst absorption capacity as well as optimal fair and predictable performance. All traffic flows through a switch receive the same treatment and generally enjoy the same good performance, regardless of which ingress and egress ports they use.
Benchmarking the deep-buffer switch and NVIDIA Spectrum
The first case uses FIO, a common storage benchmarking tool, for WRITE operations from two initiators to one target while background traffic is running. This is a typical storage scenario.
The team measured the time required for the FIO job to complete (shorter is better). With the deep-buffer switch, the FIO job took 87 seconds. With the NVIDIA Spectrum switch, the job ran 40% faster and completed after just 51 seconds.
Deep-buffer switches greatly increase latency, which slows down your storage and reduces your application performance. But how high can the latency go?
For the second case, the team took the deep-buffer switch and tested how latency is impacted under different congestion use cases. The maximum buffer occupancy is only around 10% of the whole buffer size.
Two meaningful insights can be derived from the graph on the left of Figure 2. First, deep-buffer switch latency is 50,000x higher than Spectrum switches (2–19 milliseconds compared to only 300 nanoseconds for Spectrum).
Second, linear dependency is apparent between buffer occupancy and latency. In other words, testing proved that the larger the occupied buffer, the greater the latency.
With that understanding, the graph on the right of Figure 2 projects the maximum latency per deep-buffer ASIC (such as Jericho 1, Jericho 2, or Ramon). These very high latency numbers are incompatible with data center applications in general and fast storage solutions in particular.
For the third case, the team used two Windows machines and simultaneously copied a file from each to the same target storage.
With the deep-buffer switch, one Windows machine received almost three times the bandwidth of the other (830 MBps compared to 290 MBps). With the Spectrum switch, each machine received 584 MBps, an even 50/50 split, as expected.
Real-world testing showed that deep-buffer switches do not deliver the benefits often claimed for them in data center applications, such as absorbing bursts and preventing packet drops.
Deep-buffer switches may be needed for long haul or WAN connections; however, they are suboptimal for data center applications and will have negative effects, particularly when the workload is scaled beyond just two nodes, as in this use case.
These three use cases demonstrate proof points for why deep-buffer switches adversely impact AI/ML and storage workloads, while Spectrum switches provide maximized performance.
Summary
NVIDIA Spectrum Ethernet switches are built for AI/ML and storage workloads, and perform better than switches with split buffers or deep buffers. They handle congestion better, prevent packet loss, and outperform with jumbo frames (preferred for storage). NVIDIA Spectrum Ethernet switches provide overall good application performance with consistently low network latency.
Posted by Ramki Gummadi, Software Engineer, Google and Kevin Chen, Software Engineer, YouTube
Caching is a ubiquitous idea in computer science that significantly improves the performance of storage and retrieval systems by storing a subset of popular items closer to the client based on request patterns. An important algorithmic piece of cache management is the decision policy used for dynamically updating the set of items being stored, which has been extensively optimized over several decades, resulting in several efficient and robust heuristics. While applying machine learning to cache policies has shown promising results in recent years (e.g., LRB, LHD, storage applications), it remains a challenge to outperform robust heuristics in a way that can generalize reliably beyond benchmarks to production settings, while maintaining competitive compute and memory overheads.
In “HALP: Heuristic Aided Learned Preference Eviction Policy for YouTube Content Delivery Network”, presented at NSDI 2023, we introduce a scalable state-of-the-art cache eviction framework that is based on learned rewards and uses preference learning with automated feedback. The Heuristic Aided Learned Preference (HALP) framework is a meta-algorithm that uses randomization to merge a lightweight heuristic baseline eviction rule with a learned reward model. The reward model is a lightweight neural network that is continuously trained with ongoing automated feedback on preference comparisons designed to mimic the offline oracle. We discuss how HALP has improved infrastructure efficiency and user video playback latency for YouTube’s content delivery network.
Learned preferences for cache eviction decisions
The HALP framework computes cache eviction decisions based on two components: (1) a neural reward model trained with automated feedback via preference learning, and (2) a meta-algorithm that combines a learned reward model with a fast heuristic. As the cache observes incoming requests, HALP continuously trains a small neural network that predicts a scalar reward for each item by formulating this as a preference learning method via pairwise preference feedback. This aspect of HALP is similar to reinforcement learning from human feedback (RLHF) systems, but with two important distinctions:
Feedback is automated and leverages well-known results about the structure of offline optimal cache eviction policies.
The model is learned continuously using a transient buffer of training examples constructed from the automated feedback process.
The eviction decisions rely on a filtering mechanism with two steps. First, a small subset of candidates is selected using a heuristic that is efficient, but suboptimal in terms of performance. Then, a re-ranking step optimizes from within the baseline candidates via the sparing use of a neural network scoring function to “boost” the quality of the final decision.
As a production ready cache policy implementation, HALP not only makes eviction decisions, but also subsumes the end-to-end process of sampling pairwise preference queries used to efficiently construct relevant feedback and update the model to power eviction decisions.
A neural reward model
HALP uses a light-weight two-layer multilayer perceptron (MLP) as its reward model to selectively score individual items in the cache. The features are constructed and managed as a metadata-only “ghost cache” (similar to classical policies like ARC). After any given lookup request, in addition to regular cache operations, HALP conducts the book-keeping (e.g., tracking and updating feature metadata in a capacity-constrained key-value store) needed to update the dynamic internal representation. This includes: (1) externally tagged features provided by the user as input, along with a cache lookup request, and (2) internally constructed dynamic features (e.g., time since last access, average time between accesses) constructed from lookup times observed on each item.
HALP learns its reward model fully online starting from a random weight initialization. This might seem like a bad idea, especially if the decisions are made exclusively for optimizing the reward model. However, the eviction decisions rely on both the learned reward model and a suboptimal but simple and robust heuristic like LRU. This allows for optimal performance when the reward model has fully generalized, while remaining robust to a temporarily uninformative reward model that is yet to generalize, or in the process of catching up to a changing environment.
Another advantage of online training is specialization. Each cache server runs in a potentially different environment (e.g., geographic location), which influences local network conditions and what content is locally popular, among other things. Online training automatically captures this information while reducing the burden of generalization, as opposed to a single offline training solution.
Scoring samples from a randomized priority queue
It can be impractical to optimize for the quality of eviction decisions with an exclusively learned objective for two reasons.
Compute efficiency constraints: Inference with a learned network can be significantly more expensive than the computations performed in practical cache policies operating at scale. This limits not only the expressivity of the network and features, but also how often these are invoked during each eviction decision.
Robustness for generalizing out-of-distribution: HALP is deployed in a setup that involves continual learning, where a quickly changing workload might generate request patterns that might be temporarily out-of-distribution with respect to previously seen data.
To address these issues, HALP first applies an inexpensive heuristic scoring rule that corresponds to an eviction priority to identify a small candidate sample. This process is based on efficient random sampling that approximates exact priority queues. The priority function for generating candidate samples is intended to be quick to compute using existing manually-tuned algorithms, e.g., LRU. However, this is configurable to approximate other cache replacement heuristics by editing a simple cost function. Unlike prior work, where the randomization was used to tradeoff approximation for efficiency, HALP also relies on the inherent randomization in the sampled candidates across time steps for providing the necessary exploratory diversity in the sampled candidates for both training and inference.
The final evicted item is chosen from among the supplied candidates, equivalent to the best-of-n reranked sample, by maximizing the predicted preference score according to the neural reward model. The same pool of candidates used for eviction decisions is also used to construct the pairwise preference queries for automated feedback, which helps minimize the training and inference skew between samples.
An overview of the two-stage process invoked for each eviction decision.
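To make the two-stage structure concrete, here is a minimal, self-contained Python sketch of the decision flow described above; the cache layout, feature set, and toy reward function are hypothetical stand-ins, not the production HALP implementation.
import random
import time

def evict_one(cache, reward_model, sample_size=16, rerank_size=4):
    """Two-stage HALP-style eviction sketch (illustrative only).

    Stage 1: a cheap heuristic randomly samples items and keeps the few with the
    oldest last-access time, approximating an LRU priority queue.
    Stage 2: a learned reward model re-ranks those candidates, and the candidate
    with the highest predicted preference score for eviction is evicted.
    """
    sampled = random.sample(list(cache.items()), min(sample_size, len(cache)))
    candidates = sorted(sampled, key=lambda kv: kv[1]["last_access"])[:rerank_size]
    scored = [(reward_model(meta["features"]), key) for key, meta in candidates]
    _, victim = max(scored)
    del cache[victim]
    return victim, candidates  # the candidate pool can also seed pairwise preference queries

# Toy usage with a dict-based cache and a stand-in for the trained two-layer MLP.
cache = {
    f"item{i}": {"last_access": time.time() - random.random() * 100,
                 "features": [random.random(), random.random()]}
    for i in range(64)
}
toy_reward = lambda feats: sum(feats)  # placeholder scoring function
print(evict_one(cache, toy_reward)[0])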
Online preference learning with automated feedback
The reward model is learned using online feedback, which is based on automatically assigned preference labels that indicate, wherever feasible, the ranked preference ordering for the time taken to receive future re-accesses, starting from a given snapshot in time among each queried sample of items. This is similar to the oracle optimal policy, which, at any given time, evicts an item with the farthest future access from all the items in the cache.
Generation of the automated feedback for learning the reward model.
To make this feedback process informative, HALP constructs pairwise preference queries that are most likely to be relevant for eviction decisions. In sync with the usual cache operations, HALP issues a small number of pairwise preference queries while making each eviction decision, and appends them to a set of pending comparisons. The labels for these pending comparisons can only be resolved at a random future time. To operate online, HALP also performs some additional book-keeping after each lookup request to process any pending comparisons that can be labeled incrementally after the current request. HALP indexes the pending comparison buffer with each element involved in the comparison, and recycles the memory consumed by stale comparisons (neither of which may ever get a re-access) to ensure that the memory overhead associated with feedback generation remains bounded over time.
Overview of all main components in HALP.
Results: Impact on the YouTube CDN
Through empirical analysis, we show that HALP compares favorably to state-of-the-art cache policies on public benchmark traces in terms of cache miss rates. However, while public benchmarks are a useful tool, they are rarely sufficient to capture all the usage patterns across the world over time, not to mention the diverse hardware configurations that we have already deployed.
Until recently, YouTube servers used an optimized LRU-variant for memory cache eviction. HALP increases YouTube’s memory egress/ingress — the ratio of the total bandwidth egress served by the CDN to that consumed for retrieval (ingress) due to cache misses — by roughly 12% and memory hit rate by 6%. This reduces latency for users, since memory reads are faster than disk reads, and also improves egressing capacity for disk-bounded machines by shielding the disks from traffic.
The figure below shows a visually compelling reduction in the byte miss ratio in the days following HALP’s final rollout on the YouTube CDN, which is now serving significantly more content from within the cache with lower latency to the end user, and without having to resort to more expensive retrieval that increases the operating costs.
Aggregate worldwide YouTube byte miss ratio before and after rollout (vertical dashed line).
An aggregated performance improvement could still hide important regressions. In addition to measuring overall impact, we also conduct an analysis in the paper to understand its impact on different racks using a machine level analysis, and find it to be overwhelmingly positive.
Conclusion
We introduced a scalable state-of-the-art cache eviction framework that is based on learned rewards and uses preference learning with automated feedback. Because of its design choices, HALP can be deployed in a manner similar to any other cache policy without the operational overhead of having to separately manage the labeled examples, training procedure and the model versions as additional offline pipelines common to most machine learning systems. Therefore, it incurs only a small extra overhead compared to other classical algorithms, but has the added benefit of being able to take advantage of additional features to make its eviction decisions and continuously adapt to changing access patterns.
This is the first large-scale deployment of a learned cache policy to a widely used and heavily trafficked CDN, and has significantly improved the CDN infrastructure efficiency while also delivering a better quality of experience to users.
Acknowledgements
Ramki Gummadi is now part of Google DeepMind. We would like to thank John Guilyard for help with the illustrations and Richard Schooler for feedback on this post.
Audio can include a wide range of sounds, from human speech to non-speech sounds like barking dogs and sirens. When designing accessible applications for people with hearing difficulties, the application should be able to recognize sounds and understand speech.
Such technology would help deaf or hard-of-hearing individuals with visualizing speech, like human conversations and non-speech sounds. Combining speech and sound AI together, you can overlay the visualizations onto AR glasses, making it possible for users to see and interpret sounds that they wouldn’t be able to hear otherwise.
According to the World Health Organization, about 1.5B people (nearly 20% of the global population) live with hearing loss. This number could rise to 2.5B by 2050.
Cochl, an NVIDIA partner based in San Jose, is a deep-tech startup that uses sound AI technology to understand any type of audio. They are also a member of the NVIDIA Inception Program, which helps startups build their solutions faster by providing access to cutting-edge technology and NVIDIA experts.
Cochl's platform, Cochl.Sense, can recognize 37 environmental sounds, and the company went one step further by adding cutting-edge speech-to-text technology. This gives a truly complete understanding of the world of sound.
AR glasses to visualize any sound
AR glasses have the potential to greatly improve the lives of people with hearing loss as an accessible tool to visualize sounds. This technology can help enhance their communication abilities and make it easier for them to navigate and participate in the world around them.
In this scenario, automatic speech recognition (ASR) is used to enable the glasses to recognize and understand human speech. This technology can be integrated into the glasses in several ways:
Using a microphone to capture the speech of a person talking to a deaf or hard-of-hearing individual and then using ASR algorithms to interpret and transcribe the speech into text. This text can then be displayed on the glasses, enabling the deaf or hard-of-hearing person to read and understand the speech.
ASR can also be used to enable the glasses to respond to voice commands so that users can control the glasses with their voice.
The glasses can also display all conversations on screen, as well as other audio such as voice directions from maps while driving and non-speech sounds like horns, sirens from emergency vehicles, and wind noise.
The technology behind the solution
Cochl used NVIDIA Riva to power its ASR capabilities within its software stack. Riva is a GPU-accelerated, fully customizable SDK for developing speech AI applications. By using Riva, the platform has been able to expand its capabilities to understand a wide range of sounds, including non-speech sounds.
“We’ve tested lots of speech recognition services, but only Riva provided exceptionally high and stable real-time performance. So now we can make our sound AI system be closer to human auditory perception,” said Yoonchang Han, co-founder and CEO at Cochl.
“As we have observed, AR glasses are most likely to be used in open spaces with noisy environments. NVIDIA Riva has helped us transcribe speech accurately even in noisy environments and has given us a seamless experience to integrate into our Cochl.Sense platform.”
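As a rough illustration of how Riva-based transcription can be wired into such a pipeline (the server URI, audio file, and configuration values below are placeholders, and the client API may vary across Riva releases):
import riva.client

# Connect to a running Riva server (URI is a placeholder).
auth = riva.client.Auth(uri="localhost:50051")
asr_service = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    sample_rate_hertz=16000,
    language_code="en-US",
    max_alternatives=1,
    enable_automatic_punctuation=True,
)

# Transcribe a short audio clip captured by the glasses' microphone (path is a placeholder).
with open("speech_sample.wav", "rb") as f:
    audio_bytes = f.read()
response = asr_service.offline_recognize(audio_bytes, config)
print(response.results[0].alternatives[0].transcript)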
Future of assistive technology
Creating a generalized AI system that perceives sounds like humans is a huge challenge. To make AR glasses more accessible, lighter wearable technology is required.
However, at this point, they are still an ideal medium for translating sounds and speech to visual information. By integrating machine listening functionality, AR glasses can bring safer, more convenient, and more enjoyable daily life to deaf or hard-of-hearing people all around the world.
Cochl is also exploring more use cases for speech AI, such as offering closed captioning for any video on AR glasses and visualizing multi-speaker transcriptions. To provide the best experience for individuals with hearing difficulties, the team is also exploring ways to analyze and visualize music, at a minimum helping users understand its genre and emotion.
They are excited to experiment with more NVIDIA solutions including Riva, NVIDIA NeMo, and NVIDIA TensorRT.
Get started with speech AI today
Interested in adding speech AI to your AR applications? Browse these resources to get started:
To learn about speech AI, from end-to-end pipeline basics to developing your first speech AI application, see the free ebooks.