Categories
Misc

Unlocking New Opportunities with AI Cloud Infrastructure for 5G vRAN

The cellular industry spends over $50 billion on radio access networks (RAN) annually, according to a recent GSMA report on the mobile economy. Dedicated and overprovisioned hardware is primarily used to provide capacity for peak demand. As a result, most RAN sites have an average utilization below 25%.

This has been the industry reality for years as technology evolved from 2G to 4G. But it is set to become even more pronounced in 5G as the push for densification, combined with the use of mmWave, leads to a near doubling of the number of cell sites by 2027 to over 17 million. The implication is that RAN capital expenditures, as a share of overall network total cost of ownership (TCO), will grow to as much as 65% in 5G, compared to 45-50% in 4G. 

A new game-changing approach turns underutilization into an opportunity: leverage the same cloud and data center infrastructure used for AI to dynamically load share with 5G virtual RAN (vRAN). This approach creates a new opportunity for cloud providers and helps reduce operational costs. 

Figure 1 shows how NVIDIA is enabling the CloudRAN solution with the NVIDIA A100X converged card, the Spectrum SN3750-SX switch, the NVIDIA Aerial SDK, and a containerized DU.

Diagram of NVIDIA CloudRAN solution.
Figure 1. The NVIDIA CloudRAN solution, offering flexible mapping and dynamic compute utilization for 5G and AI workloads

The opportunity of RAN underutilization 

By pooling baseband computing resources into a cloud-native environment, the CloudRAN solution delivers significant improvements in asset utilization, creating efficiency gains for telcos and revenue opportunities for cloud service providers (CSPs). The solution achieves this by dynamically orchestrating resources between 5G workloads and 5G RAN off-peak workloads. 

Examples of 5G RAN off-peak workloads are AI workloads such as drive mapping, federated learning, offline video analytics, predictive maintenance, factory digital twins, and many more. 5G compute capacity is mapped to changes in traffic demands from 5G radios, while the remaining compute is used for AI workloads. For telcos, this can increase RAN operational efficiency by more than 2x, with an estimated impact of 25% or more on operating margin. 

For CSPs, running 5G vRAN as a workload alongside AI workloads within their existing data center architecture is a significant opportunity. To look at a specific example, the United States wireless market includes about 420,000 cell sites. If the telcos are using Centralized-RAN (C-RAN) for 50% of their network (mostly urban areas) and running a 4:1 configuration, then they will be using about 52,000 GPUs to run their network. 

In a typical data center, GPU compute rents for about $2 per hour. By using dynamic orchestration to combine 5G and AI workloads in a C-RAN configuration, a CSP monetizing those 52,000 GPUs during off-peak periods has a revenue opportunity of roughly $500 million per year. Globally, this is a multi-billion-dollar opportunity. 
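The arithmetic behind these figures can be sanity-checked in a few lines. Note that the number of off-peak hours per day is our own assumption, chosen only to illustrate how a figure on the order of $500 million arises; it is not stated in the text above:

```python
# Back-of-the-envelope check of the US C-RAN example.
# The 13 off-peak hours per day is an illustrative assumption,
# not a figure from the article.
total_sites = 420_000
cran_share = 0.50             # 50% of the network on Centralized-RAN
sites_per_gpu = 4             # 4:1 configuration
gpu_count = int(total_sites * cran_share / sites_per_gpu)

hourly_rate = 2.0             # dollars per GPU-hour
offpeak_hours_per_year = 13 * 365
annual_revenue = gpu_count * hourly_rate * offpeak_hours_per_year

print(gpu_count)                    # 52500 (the article rounds to ~52,000)
print(round(annual_revenue / 1e6))  # ~498, i.e. roughly $500M per year
```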

Figure 2 shows the off-peak AI workloads that can dynamically share the C-RAN GPU with the 5G workload during off-peak periods. 

Diagram of AI workloads that can be run during RAN off-peak
Figure 2. Off-peak workloads that can dynamically share CloudRAN GPU with 5G

To provide an example, the combination of an in-vehicle AI supercomputer and GPU resources in the cloud enables offline processing for self-driving cars. The system incorporates deep learning algorithms to detect lanes, signs, and other landmarks using a combination of AI and visual simultaneous localization and mapping (VSLAM). 

The industry recognizes the general trend of pooling softwarized RAN resources in a few centralized hub locations. While this improves RAN TCO by more than 30% over a distributed-RAN topology, it does not solve RAN underutilization: the unused RAN compute resources still go to waste during off-peak periods. 

NVIDIA CloudRAN: Five building blocks 

To realize the CloudRAN solution, current vRANs will need to evolve in the following five key focus areas. 

First, there is a need for a software-defined fronthaul (SD-FH) with an optimal timing mechanism. Second, the RAN hardware needs to evolve from bespoke, dedicated, and (in many cases) non-cloud-native architecture, to COTS-based and cloud-native hardware. Third, the softwarized 5G RAN needs to be programmable in real time and capable of running on cloud infrastructure. Fourth, the lifecycle management (LCM) of the 5G RAN needs to be dynamic and based on open APIs. 

Finally, there is a need for an end-to-end (E2E) network and service orchestrator that can dynamically manage RAN and other off-peak workloads based on network and infrastructure utilization information. E2E service orchestration executes service intent by dynamically composing the workflow based on service models, policy, and context and using closed-loop control to automate the entire service and network.
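As a sketch of the closed-loop idea, the orchestrator can be thought of as repeatedly re-sizing the RAN slice against measured utilization and releasing the remainder to off-peak AI workloads. The function below is a purely hypothetical illustration of that control step, not part of any NVIDIA API:

```python
import math

def allocate(gpu_pool: int, ran_utilization: float) -> dict:
    """One iteration of a toy closed-loop orchestrator: reserve enough
    GPUs for current 5G RAN traffic and release the rest to AI workloads.
    Illustrative only; a real E2E orchestrator acts through SMO and
    Kubernetes APIs rather than a local function."""
    ran_gpus = min(gpu_pool, math.ceil(ran_utilization * gpu_pool))
    return {"ran": ran_gpus, "ai": gpu_pool - ran_gpus}

print(allocate(100, 0.25))  # off-peak: {'ran': 25, 'ai': 75}
print(allocate(100, 0.75))  # busy hour: {'ran': 75, 'ai': 25}
```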

NVIDIA CloudRAN offers an enabling solution for each of these five vRAN evolution needs, as shown in Figure 3.

Diagram of the building blocks of NVIDIA CloudRAN solution
Figure 3. Components of the NVIDIA CloudRAN solution
  • SD-FH switch: NVIDIA SN3750-SX is a new 200G Ethernet switch based on the Spectrum-2 ASIC. It was purpose-built to provide the network fabric for CloudRAN-converged infrastructure, where it runs AI training workloads alongside 5G networking. It has software-defined and hardware-accelerated 5G fronthaul capabilities to steer 5G traffic from available DUs to RUs based on orchestrator mapping. It offers 5G time sync protocols with PTP telco profiles, SyncE, and PPS in/out. The switch also supports the NetQ validation toolset.
  • General purpose computing hardware: The NVIDIA A100X converged card combines the power of the NVIDIA A100 Tensor Core GPU with the advanced networking capabilities of the NVIDIA BlueField-2 DPU in a single, unique platform. This convergence delivers unparalleled performance for GPU-powered, I/O-intensive workloads, such as distributed AI training in the enterprise data center and 5G vRAN processing as another workload within existing data center architecture. 
  • Programmable and cloud-native 5G software: The NVIDIA Aerial SDK provides the 5G workload in the CloudRAN solution. NVIDIA Aerial is a fully cloud-native virtual 5G RAN solution running on COTS servers. It realizes RAN functions as microservices in containers over bare-metal servers, using Kubernetes and applying DevOps principles. It provides a 5G RAN solution with inline GPU acceleration of L1 5G NR PHY processing, and supports a full-stack framework for gNB integration of L2/L3 (MAC, RLC, PDCP), along with manageability and orchestration.
  • Open API lifecycle management: The O-RAN disaggregated, software-centric approach can help automate and orchestrate RAN complexity, irrespective of multi-vendor or multi-technology networks. Ultimately, the service management orchestration (SMO) will provide open and Kubernetes cluster APIs for RAN automation.
  • E2E network and service orchestrator: E2E orchestration enables dynamic applications and services by consolidating an E2E view in real time across all technology and cloud domains. A single pane of glass automates all aspects of cross-domain services and handles the lifecycle management, optimization, and assurance of various workloads. The E2E orchestrator will also have an interface to interact with the cloud infrastructure manager. 

Delivering the CloudRAN solution

The NVIDIA CloudRAN solution delivers a compelling value proposition with an SD-FH, general purpose data center compute, cloud-native architecture, RAN domain orchestrator, and E2E service and network orchestrator. 

NVIDIA and its ecosystem partners are building a Kubernetes-based SMO and E2E service orchestrator to support dynamic workload management. With telcos, NVIDIA is working on COTS-based and cloud-native vRAN software. With CSPs, NVIDIA is working to optimize data center hardware to support 5G workloads. 

Join us for the GTC 2022 session, Using AI Infrastructure in the Cloud for 5G vRAN, to learn more about the CloudRAN solution. 

A Podcast With Teeth: How Overjet Brings AI to Dentists’ Offices

Dentists get a bad rap. Dentists also get more people out of more aggravating pain than just about anyone, which is why the more technology dentists have, the better. Overjet, a member of the NVIDIA Inception program for startups, is moving fast to bring AI to dentists’ offices, as discussed on this episode of the NVIDIA AI Podcast.


Open-Source Healthcare AI Innovation Continues to Expand with MONAI v1.0

Developing for the medical imaging AI lifecycle is a time-consuming and resource-intensive process that typically includes data acquisition, compute, and training time, and a team of experts who are knowledgeable in creating models suited to your specific challenge. Project MONAI, the medical open network for AI, is continuing to expand its capabilities to help lower each of these hurdles, no matter where developers are starting in their medical AI workflow. 

A growing open-source platform for better medical AI

MONAI is the domain-specific, open-source medical AI framework that drives research breakthroughs and accelerates AI into clinical impact. It unites doctors with data scientists to unlock the power of medical data for deep learning models and deployable applications in medical AI workflows. MONAI features domain-specific tools for data labeling, model training, and application deployment that enable you to develop, reproduce, and standardize medical AI lifecycles. 

The release of MONAI v1.0 brings a number of exciting new updates and tools for developers, including:

  • Model Zoo
  • Active Learning in MONAI Label
  • Auto-3D Segmentation
  • Federated Learning

MONAI is the fastest-growing open-source platform providing deep learning infrastructure and workflows optimized for medical imaging in a native PyTorch paradigm. Freely available and optimized for supercomputing scale, MONAI is backed by 12 of the top Academic Medical Centers (AMCs) and sees 50,000 downloads per month. From research to clinical products, the launch of MONAI v1.0 allows researchers and developers to build models and applications in a quick and standardized way. 

Jump-start training workflows with MONAI Model Zoo

Training and constructing your own AI models takes significant time, data, compute power, and knowledge of training algorithms. MONAI Model Zoo enables developers to quickly discover pretrained and openly available models specific to medical imaging. By using the MONAI Bundle Format, you can get started with these models in just a few commands.

MONAI Model Zoo offers a curated collection of medical imaging AI models. It is also a framework for you as a developer to create and publish your own models, resulting in an open-source collection of pretrained medical imaging models that can be used to speed up the development process. 

Driven by the community, Model Zoo makes cutting-edge medical AI tasks accessible and helps you get started quickly with plug-and-play documentation, examples, and bundles. Main contributors to Model Zoo include NVIDIA, KCL, Kitware, Vanderbilt, and Charité, who have contributed more than 15 models across imaging modalities such as CT, pathology, ultrasound, and endoscopy to perform segmentation, classification, annotation tasks, and more.

A GIF showing how to access models in MONAI Model Zoo.
Figure 1. Access and download models from MONAI Model Zoo with just a few clicks

Build better datasets with active learning

The process of labeling data can be time consuming, and the experts who can annotate these images may not have time to annotate every one. MONAI Label now has enhanced active learning capabilities. Active learning aims to achieve the highest possible model performance with the least amount of labeled data: choosing the data that will most influence overall model accuracy lets human annotators focus on the annotations with the highest impact on model performance.

MONAI Label provides a clinician-friendly application to expertly label data in a fraction of the time while simultaneously training a model at the push of a button. With approaches like active learning, AI-powered algorithms can intelligently select the most difficult images for clinical inputs and increase the performance of the AI model as it learns from the expert. This enables human annotators to focus on the annotations that will provide the highest gain in model performance and address areas with model uncertainty. 

Active learning serves to build better datasets in a fraction of the time it would take humans to curate. MONAI Label can now automatically review and label large datasets, flag image data that requires human input, and then query the clinician to label it before it is added back into the training data. 

Developers can see up to a 75% reduction in training costs with active learning in MONAI Label, with increased labeling and training efficiency and better model performance. With active learning, only 25% of the training dataset was needed to achieve the same 0.82 Dice score as training on 100% of the dataset.
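The selection step at the heart of active learning can be sketched with simple uncertainty sampling: score each unlabeled image by how unsure the model is, and send the most uncertain ones to the annotator first. MONAI Label's actual acquisition strategies are more sophisticated; the function and sample names below are purely illustrative:

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution:
    higher entropy means a less confident prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(predictions, k):
    """Rank unlabeled samples by prediction entropy and return the k
    most uncertain ones for the annotator to label next."""
    ranked = sorted(predictions, key=lambda name: entropy(predictions[name]),
                    reverse=True)
    return ranked[:k]

preds = {
    "scan_a": [0.98, 0.02],  # confident: low annotation value
    "scan_b": [0.55, 0.45],  # uncertain: annotate first
    "scan_c": [0.70, 0.30],
}
print(select_most_uncertain(preds, 1))  # ['scan_b']
```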

Diagram showing active learning framework on MONAI Label.
Figure 2. The six steps of the active learning framework on MONAI Label

Accelerate 3D segmentation

The process of model training to achieve a state-of-the-art 3D segmentation model takes significant time, compute, and developer and researcher expertise. To help accelerate this process, MONAI now offers a low-code 3D medical image segmentation framework that speeds up model training time without human interaction.

The MONAI Auto-3D Segmentation tool is a low-code framework that allows developers and researchers of any skill level to train models that can quickly delineate regions of interest in data from 3D imaging modalities like CT and MRI. It accelerates training time for developers from one week to two days with effective models, efficient workflows, and customizability to user needs.

Features include:

  • Data analysis tool
  • Automated configuration
  • Model training in a MONAI bundle
  • Model ensemble tool
  • Workflow manager
  • Trained model weights

Federated learning on MONAI

MONAI v1.0 includes the federated learning (FL) client algorithm APIs that are exposed as an abstract base class for defining an algorithm to be run on any federated learning platform.

NVIDIA FLARE, the federated learning platform, has already integrated with these new APIs. Using MONAI Bundle configurations with the new federated learning APIs, any bundle can be seamlessly extended to a federated paradigm. We welcome other federated learning toolkits to integrate with the MONAI FL APIs, building a common foundation for collaborative learning in medical imaging.
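In spirit, exposing the client algorithm as an abstract base class lets any server run a standard federated-averaging loop against it, as in the toy sketch below. The class and method names here are hypothetical, not MONAI's actual FL API:

```python
from abc import ABC, abstractmethod

class ClientAlgo(ABC):
    """Illustrative federated-learning client interface
    (names are hypothetical, not MONAI's actual API)."""
    @abstractmethod
    def set_weights(self, weights: list[float]) -> None: ...
    @abstractmethod
    def train_round(self) -> list[float]: ...

class ToyClient(ClientAlgo):
    """Stand-in for a site that nudges weights by a fixed local update."""
    def __init__(self, local_update):
        self.weights = [0.0] * len(local_update)
        self.local_update = local_update
    def set_weights(self, weights):
        self.weights = list(weights)
    def train_round(self):
        return [w + u for w, u in zip(self.weights, self.local_update)]

def federated_round(clients, global_weights):
    """One FedAvg-style round: broadcast, train locally, average."""
    updates = []
    for c in clients:
        c.set_weights(global_weights)
        updates.append(c.train_round())
    return [sum(ws) / len(ws) for ws in zip(*updates)]

clients = [ToyClient([1.0, 0.0]), ToyClient([0.0, 1.0])]
print(federated_round(clients, [0.0, 0.0]))  # [0.5, 0.5]
```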

Get started with MONAI

To get started with MONAI v1.0, visit the MONAI website. Access Python libraries, Jupyter notebooks, and MONAI tutorials on the Project MONAI GitHub repo.

You can also request a free hands-on lab through NVIDIA LaunchPad to get started annotating and adapting medical imaging models with MONAI Label.

Democratizing and Accelerating Genome Sequencing Analysis with NVIDIA Clara Parabricks v4.0

The field of computational biology relies on bioinformatics tools that are fast, accurate, and easy to use. As next-generation sequencing (NGS) is becoming faster and less costly, a data deluge is emerging, and there is an ever-growing need for accessible, high-throughput, industry-standard analysis.

At GTC 2022, we announced the release of NVIDIA Clara Parabricks v4.0, which brings significant improvements to how genomic researchers and bioinformaticians deploy and scale genome sequencing analysis pipelines.

  • Clara Parabricks software is now free to researchers on NGC as individual tools or as a unified container. A licensed version is available through NVIDIA AI Enterprise for customers requiring enterprise-grade support.
  • Clara Parabricks is now easily integrated into common workflow languages such as Workflow Description Language (WDL) and NextFlow, for the interweaving of GPU-accelerated and third-party tools, and scalable deployment on-premises and in the cloud. The Cromwell workflow management system from the Broad Institute is also supported. 
  • Clara Parabricks can now be deployed on the Broad Institute’s Terra SaaS platform, making it available to the 25,000+ Terra scientists. Genome analysis is reduced to just over one hour with Clara Parabricks compared to 24 hours in a CPU environment, while reducing costs by 50% for whole genome sequencing analysis.
  • Clara Parabricks continues to focus on GPU-accelerated, industry-standard, and deep-learning-based tools and now includes the latest DeepVariant v1.4 germline caller. Sequencer-agnostic tooling and deep learning approaches remain development focus areas for Clara Parabricks.
  • Clara Parabricks is now available through more cloud providers and partners, including Amazon Web Services, Google Cloud Platform, Terra, DNAnexus, Lifebit, Agilent Technologies, UK Biobank Research Analysis Platform (RAP), Oracle Cloud Infrastructure, Naver Cloud, Alibaba Cloud, and Baidu AI Cloud.

License-free use for research and development

Clara Parabricks v4.0 is now available entirely free of charge for research and development. This means fewer technical barriers than ever before, including the removal of the install scripts and the enterprise license server present in previous versions of the genomic analysis software. 

This also means significant simplification in deployment, with the ability to pull and run Clara Parabricks Docker containers quickly and easily, on any NVIDIA-certified systems, with maximum ease of use on-premises or in the cloud.

Commercial users that require enterprise-level technical and engineering support for their production workflows, or to work with NVIDIA experts on new features, applications, and performance optimizations, can now subscribe to NVIDIA AI Enterprise Support. This support will be available for Parabricks v4.0 with the upcoming release of NVIDIA AI Enterprise v3.0.

An NVIDIA AI Enterprise Support subscription comes with full-stack support (from container-level, through to full on-premises and cloud deployment), access to NVIDIA Parabricks experts, security notifications, enterprise training in areas such as IT or data science, and deep learning support for TensorFlow, PyTorch, NVIDIA TensorRT, and NVIDIA RAPIDS. Learn more about NVIDIA AI Enterprise Support Services and Training.

A table showing Clara Parabricks license options.
Figure 1. Access all the tools within Clara Parabricks at no cost, including the pipelines and workflows

Deploying in WDL and NextFlow workflows

You can now pull Clara Parabricks directly from NGC collection containers with no licensing server, meaning that it can easily be run as part of scalable and flexible bioinformatics workflows on a variety of systems and platforms.

This includes the popular bioinformatics workflow languages WDL and NextFlow, with workflows available on the new Clara-Parabricks-Workflows GitHub repo for general use by the bioinformatics community. You can find WDL and NextFlow workflows or modules for the following:

  • BWA-MEM alignment and processing with Clara Parabricks FQ2BAM
  • A germline calling workflow running accelerated HaplotypeCaller and DeepVariant, with the option to apply the GATK best practices
  • A BAM2FQ2BAM workflow to extract reads and realign to new reference genomes (such as the T2T completed human genome)
  • A somatic workflow using accelerated Mutect2, with an optional panel of normals
  • A workflow to generate a new panel of normals for somatic variant calling from VCFs
  • A workflow to build reference indexes (required for several of the workflows and tasks listed earlier)

In addition, a workflow for calling de novo mutations in trio data developed in collaboration with researchers at the National Cancer Institute will be available later this year.

These workflows bring impressive flexibility, enabling users to interweave the GPU-accelerated tools of Clara Parabricks with third-party tooling. They can specify individual compute resources for each task, before deploying at a massive scale on local clusters (on SLURM, for example) or on cloud platforms. See the Clara-Parabricks-Workflows GitHub repo for example configurations and recommended GPU instances.

A diagram showing how to pull directly from the Clara Parabricks Docker and specify gpuType and gpuCount compute requirements.
Figure 2. Pull directly from the Clara Parabricks Docker container and specify gpuType and gpuCount compute requirements

Run on-premises or in the cloud

Clara Parabricks is well-suited to cloud deployment. It is available to run on several cloud platforms, including Amazon Web Services, Google Cloud Platform, DNAnexus, Lifebit, Baidu AI Cloud, Naver Cloud, Oracle Cloud Infrastructure, Alibaba Cloud, Terra, and more.

Clara Parabricks v4.0 WDL workflows are now integrated into the Broad Institute’s Terra platform for its 25,000+ scientists to run accelerated genomic analyses. Terra’s scalable platform runs on top of Google Cloud, which hosts a fleet of NVIDIA GPUs. A FASTQ to VCF analysis on a 30x whole genome takes 24 hours in a CPU environment compared to just over one hour with Clara Parabricks in Terra. In addition, costs are reduced by over 50%, from $5 to $2 (Figure 3).
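The quoted speedup and savings follow directly from those figures; a quick, purely illustrative check:

```python
# Terra 30x whole-genome comparison quoted above: FASTQ to VCF.
cpu_hours, gpu_hours = 24.0, 1.0   # CPU environment vs Clara Parabricks
cpu_cost, gpu_cost = 5.0, 2.0      # dollars per analysis

speedup = cpu_hours / gpu_hours    # ~24x faster
savings = 1 - gpu_cost / cpu_cost  # 60%, i.e. "over 50%" cheaper
print(f"{speedup:.0f}x faster, {savings:.0%} cheaper")  # 24x faster, 60% cheaper
```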

In the Terra platform, researchers can gain access to a wealth of data much more easily than in an on-premises environment. They can access the Clara Parabricks workspace at the push of a button, rather than manually managing and configuring the hardware. Get started at the Clara Parabricks page on the Terra Community Workbench.

Graph showing time and cost comparison between CPU and GPU for 30x whole genome sequencing in Terra.
Figure 3. FASTQ to VCF runs in Terra

Runtimes and compute cost (preemptible pricing) for germline analysis of a 30x whole genome (including BWA-MEM, MarkDuplicates, BQSR, and HaplotypeCaller) are greatly reduced when using Clara Parabricks and NVIDIA GPUs.

Clara Parabricks v4.0 tools and features

Clara Parabricks v4.0 is a more focused genomic analysis toolset than previous versions, with rapid alignment, gold standard processing, and high accuracy variant calling. It offers the flexibility to freely and seamlessly intertwine GPU and CPU tasks and prioritize the GPU-acceleration of the most popular and bottlenecked tools in the genomics workflow. Clara Parabricks can also integrate cutting-edge deep learning approaches in genomics.

Diagram showing the NVIDIA Clara Parabricks v4.0 toolset.
Figure 4. The NVIDIA Clara Parabricks v4.0 toolset

The individual Clara Parabricks tools are now offered both as individual containers in the Clara Parabricks collection on NGC and as a unified container that includes all tools in one. The individual containers are leaner, and they enable the Clara Parabricks team to push more frequent, agile per-tool releases so bioinformaticians can always access the latest versions. 

The first of these releases is for DeepVariant v1.4. This latest version of DeepVariant increases accuracy across multiple genomics sequencers. There is an additional read insert size feature for Illumina whole genome and whole exome models, which reduces errors by 4-10%, and direct phasing for more accurate variant calling in PacBio sequencing runs. This means that you can now perform the high-accuracy process of phased variant calling for PacBio data directly in DeepVariant, with pipelines such as DeepVariant-WhatsHap-DeepVariant or PEPPER-Margin-DeepVariant.

DeepVariant v1.4 is also compatible with multiple custom DeepVariant models for emerging genomics sequencing instruments. The models have been GPU-accelerated in collaboration with the NVIDIA Clara Parabricks team to provide rapid and high-accuracy variant calls across sequencing instruments. DeepVariant v1.4 is now available in the Clara Parabricks collection on NGC.

Deep learning approaches to genomics and precision medicine are a big focus for Clara Parabricks, as highlighted in the GTC 2022 NVIDIA and Broad Institute announcement on further developments to the Genome Analysis Toolkit (GATK) and large language models for DNA and RNA.

Get started with Clara Parabricks v4.0 

To start using Clara Parabricks for free, visit the Clara Parabricks collection on NGC. You can also request a free Clara Parabricks NVIDIA LaunchPad lab to get hands-on experience running accelerated industry-standard tools for germline and somatic analysis for an exome and whole genome dataset.

For more information about Clara Parabricks, including technical details on the tools available, see the Clara Parabricks documentation.

New NVIDIA DGX System Software and Infrastructure Solutions Supercharge Enterprise AI

At GTC today, NVIDIA unveiled a number of updates to its DGX portfolio to power new breakthroughs in enterprise AI development. NVIDIA DGX H100 systems are now available for order. These infrastructure building blocks support NVIDIA’s full-stack enterprise AI solutions, with NVIDIA DGX H100 delivering 32 petaflops of performance at FP8 precision.


No Hang Ups With Hangul: KT Trains Smart Speakers, Customer Call Centers With NVIDIA AI

South Korea’s most popular AI voice assistant, GiGA Genie, converses with 8 million people each day. The AI-powered speaker from telecom company KT can control TVs, offer real-time traffic updates, and complete a slew of other home-assistance tasks based on voice commands. It has mastered its conversational skills in the highly complex Korean language.


New SDKs Accelerating AI Research, Computer Vision, Data Science, and More

At GTC 2022, NVIDIA revealed major updates to its suite of NVIDIA AI software for developers. The updates accelerate computing in several areas, such as machine learning research with NVIDIA JAX, AI imaging and computer vision with NVIDIA CV-CUDA, and data science workloads with RAPIDS.

To learn about the latest SDK advancements from NVIDIA, watch the keynote from CEO Jensen Huang.


JAX on NVIDIA AI

Today at GTC 2022, NVIDIA introduced JAX on NVIDIA AI, the newest addition to its GPU-accelerated deep learning frameworks. JAX is a rapidly growing library for high-performance numerical computing and machine learning research.

Highlights:

  • Efficient scaling across multi-node, multi-GPU
  • Easy workflow to train large language models on GPU with GPU-optimized T5X and GPT scripts
  • Built for all major cloud platforms

A ready-to-use JAX container will be available in early access during Q4 2022. Apply now for early access to JAX and get notified when it is available.

NVIDIA CV-CUDA

NVIDIA introduced CV-CUDA, a new open source project enabling developers to build highly efficient, GPU-accelerated pre- and post-processing pipelines in cloud-scale artificial intelligence (AI) imaging and computer vision (CV) workloads.

Highlights:

  • Specialized set of 50+ highly performant CUDA kernels as standalone operators
  • Batching support with variable shape images in one batch

For more CV-CUDA updates, see the CV-CUDA early access interest page.


NVIDIA Triton

NVIDIA announced key updates to NVIDIA Triton, open source inference-serving software that brings fast and scalable AI to every application in production. More than 50 features were added in the past 12 months.

Notable feature additions:

  • Model orchestration using the NVIDIA Triton Management Service that automates deployment and management of multiple models on Triton Inference Server instances in Kubernetes. Apply for early access.
  • Large language model inference with multi-GPU, multi-node execution with the FasterTransformer backend.
  • Model pipelines (ensembles) with advanced logic using business logic scripting.
  • Auto-generation of minimal required model configuration for fast deployment is on by default.

Kick-start your NVIDIA Triton journey with immediate, short-term access in NVIDIA LaunchPad without setting up your own environment.

You can also download NVIDIA Triton from the NGC catalog, access code and documentation on the /triton-inference-server GitHub repo, and get enterprise-grade support.

NVIDIA RAPIDS

At GTC 2022, NVIDIA announced that RAPIDS, the data science acceleration solution chosen by 25% of Fortune 100 companies, is now further breaking down adoption and usability barriers. It is making accelerated analytics accessible to nearly every organization, whether they’re using low-level C++ libraries, Windows (WSL), or cloud-based data analytics platforms. New capabilities will be available mid-October.

Highlights:

  • Support for WSL and Arm SBSA now generally available
    • Supporting Windows brings the convenience and power of RAPIDS to nine million new Python developers who use Windows.
  • Easily launch multi-node workflows on Kubernetes and Kubeflow
    • Estimating cluster resources in advance for interactive work is often prohibitively challenging. You can now conveniently launch Dask RAPIDS clusters from within your interactive Jupyter sessions and burst beyond the resources of your container for combined ETL and ML workloads.

For more information about the latest release, download and try NVIDIA RAPIDS.

NVIDIA RAPIDS Accelerator for Apache Spark

New capabilities of the NVIDIA RAPIDS accelerator for Apache Spark 3.x were announced at GTC 2022. The new capabilities bring an unprecedented level of transparency to help you speed up your Apache Spark DataFrame and SQL operations on NVIDIA GPUs, with no code changes and without leaving the Apache Spark environment. Version 22.10 will be available mid-October.

The new capabilities of this release further the mission of accelerating your existing Apache Spark workloads, no matter where you run them.

Highlights:

  • The new workload acceleration tool analyzes Apache Spark workloads and recommends optimized GPU parameters for cost savings and performance.
  • Integration with Google Cloud DataProc.
  • Integration with Delta Lake and Apache Iceberg.

For more information about the latest release, download and try NVIDIA RAPIDS Accelerator for Apache Spark.

PyTorch Geometric and DGL on NVIDIA AI

At GTC 2022, NVIDIA introduced GPU-optimized graph neural network (GNN) frameworks designed to help developers, researchers, and data scientists working on graph learning, including large heterogeneous graphs with billions of edges. With NVIDIA AI-accelerated GNN frameworks, you can achieve end-to-end performance optimization, making it the fastest solution to preprocess and build GNNs.

Highlights:

  • Ready-to-use containers for GPU-optimized PyTorch Geometric and Deep Graph Library
  • Up to 90% lower end-to-end execution time compared to CPUs for ETL, sampling, and training
  • End-to-end reference examples for GraphSAGE, R-GCN, and SE3-Transformer
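For intuition, the core computation a GNN layer performs, aggregating features from a node's neighbors, can be sketched in pure Python (illustrative only; PyTorch Geometric and DGL implement learned, GPU-accelerated versions of this idea):

```python
# Toy one-round GNN-style aggregation: each node's new feature becomes the
# mean of its neighbors' features and its own. Real GNN frameworks apply
# learned weights and run this on GPUs; this sketch only shows the pattern.

def mean_aggregate(features, edges):
    """features: {node: float}; edges: list of undirected (a, b) pairs."""
    neighbors = {n: [] for n in features}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    out = {}
    for n, nbrs in neighbors.items():
        vals = [features[m] for m in nbrs] + [features[n]]  # include self
        out[n] = sum(vals) / len(vals)
    return out

# A 3-node path graph: 0 -- 1 -- 2
print(mean_aggregate({0: 1.0, 1: 3.0, 2: 5.0}, [(0, 1), (1, 2)]))
```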

For more information about NVIDIA AI-accelerated PyTorch Geometric and DGL and their availability, see the GNN Frameworks page.


NVIDIA cuQuantum and NVIDIA QODA

At GTC 2022, NVIDIA announced the latest version of the NVIDIA cuQuantum SDK for accelerating quantum circuit simulation. cuQuantum enables the quantum computing ecosystem to solve problems at the scale of future quantum advantage, enabling the development of algorithms and the design and validation of quantum hardware. 

NVIDIA also announced ecosystem updates for NVIDIA QODA, an open, QPU-agnostic platform for hybrid quantum-classical computing. This hybrid, quantum/classical programming model is interoperable with today’s most important scientific computing applications. We are opening up the programming of quantum computers to a massive new class of domain scientists and researchers.

cuQuantum highlights:

  • Multi-node, multi-GPU support in the DGX cuQuantum appliance
  • Support for approximate tensor network methods
  • Growing adoption of cuQuantum, including among CSPs and industrial quantum groups

QODA private beta highlights:

  • Single-source C++ and Python implementations as well as a compiler toolchain for hybrid systems and a standard library of quantum algorithmic primitives
  • QPU-agnostic, partnering with quantum hardware companies across a broad range of qubit modalities
  • Delivering up to a 300X speedup over a leading Pythonic framework also running on an A100 GPU
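For intuition, the statevector simulation that cuQuantum accelerates can be sketched at single-qubit scale in pure Python (an illustration of the underlying math, not the cuQuantum API):

```python
import math

# Toy statevector simulation: apply a Hadamard gate to a single qubit in |0>.
# cuQuantum accelerates this kind of computation for circuits with many qubits,
# where the statevector grows exponentially; this sketch shows only the idea.

def apply_gate(gate, state):
    """Multiply a 2x2 gate matrix by a 2-amplitude statevector."""
    return [gate[0][0] * state[0] + gate[0][1] * state[1],
            gate[1][0] * state[0] + gate[1][1] * state[1]]

s = 1 / math.sqrt(2)
H = [[s, s],
     [s, -s]]  # Hadamard gate

state = apply_gate(H, [1.0, 0.0])   # start in |0>
probs = [a * a for a in state]       # measurement probabilities
print(probs)                         # equal superposition: ~[0.5, 0.5]
```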


New Languages, Enhanced Cybersecurity, and Medical AI Frameworks Unveiled at GTC

At GTC 2022, NVIDIA revealed major updates to its suite of NVIDIA AI frameworks for building real-time speech AI applications, designing high-performing recommenders at scale, applying AI to cybersecurity challenges, creating AI-powered medical devices, and more.

Showcased real-world, end-to-end AI frameworks highlighted the customers and partners leading the way in their industries and domains. When organizations put their AI frameworks into production, enterprise support with NVIDIA AI Enterprise ensures the success of these AI applications.

Watch the keynote from founder and CEO Jensen Huang to explore the latest AI technology advancements from NVIDIA and learn new ways to put AI into production.


NVIDIA Riva

NVIDIA announced new updates to Riva, a GPU-accelerated SDK for building speech AI applications. Build and deploy fully customizable real-time AI pipelines with world-class automatic speech recognition (ASR) and text-to-speech (TTS): in the cloud, at the edge, on-premises, or on embedded devices.

Highlights:

  • World-class ASR in two new languages: Hindi and French.
  • 530% out-of-the-box accuracy improvement for English, Spanish, Mandarin, Russian, and German.
  • Easy voice emphasis, volume, and pause control at inference time with the SSML API.
  • A single, compact TTS model with multiple synthetic voices, allowing voice selection at inference time.
  • Embedded Riva with seven ASR languages and two out-of-the-box English TTS voices (one female, one male) at below 100 ms latency.
  • Access to NVIDIA AI experts, training, and knowledge-based resources through NVIDIA Enterprise Support with the purchase of NVIDIA AI Enterprise software.
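As a sketch of what SSML control looks like, the snippet below builds an SSML string using emphasis, break, and prosody tags from the W3C SSML standard; the exact attributes Riva supports should be checked against its documentation:

```python
# Build an SSML string controlling emphasis, pauses, and volume.
# Tag names follow the W3C SSML standard; which attributes Riva's SSML API
# accepts is an assumption here -- consult the Riva docs before relying on it.

ssml = (
    "<speak>"
    'Welcome to <emphasis level="strong">speech AI</emphasis>.'
    '<break time="500ms"/>'          # half-second pause
    '<prosody volume="loud">This part is louder.</prosody>'
    "</speak>"
)
print(ssml)
```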

Get started by downloading Riva, try guided Riva labs on ready-to-run infrastructure in LaunchPad, or get support for large-scale Riva deployments with NVIDIA AI Enterprise software.


NVIDIA Morpheus

At GTC, NVIDIA announced updates to Morpheus, a GPU-accelerated cybersecurity framework that enables developers to create optimized AI pipelines for filtering, processing, and classifying large volumes of real-time data.

New data visualizations included with the latest Morpheus release help security analysts pinpoint anomalies and, with built-in explainability, quickly determine actionable next steps for remediation.

Additional new features:

  • Digital fingerprinting workflow including fine-tunable explainability and thresholding to customize for your environment.
  • A new visualization tool for digital fingerprinting provides massive data reduction for analyzed events so that security analysts can more quickly investigate and remediate.
  • Visual graph-based explainer for sensitive information detection use cases enables security analysts to identify leaked sensitive data more easily.
  • Multi-process pipeline support enables new workflows to be intelligently batched to reduce latency.
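As a toy illustration of the thresholding idea behind digital fingerprinting, the sketch below flags events that deviate from a user's baseline by more than a z-score threshold (Morpheus uses learned models and fine-tunable explainability, not this simple heuristic):

```python
import statistics

# Toy anomaly thresholding: flag events far outside a user's baseline
# behavior, measured in standard deviations (z-score). Illustrative only;
# Morpheus's digital fingerprinting uses trained models, not this heuristic.

def flag_anomalies(baseline, events, threshold=3.0):
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    return [e for e in events if abs(e - mean) / stdev > threshold]

# Baseline: typical bytes transferred per session; one event is far off.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
print(flag_anomalies(baseline, [101, 500, 99]))  # only 500 is anomalous
```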

Learn more about the latest release of NVIDIA Morpheus. Get started with NVIDIA Morpheus on the nv-morpheus/Morpheus GitHub repo or NGC, or try it on NVIDIA LaunchPad.  


NVIDIA cuOpt

NVIDIA announced the worldwide open availability of its cuOpt AI logistics software framework. cuOpt offers near real-time routing optimizations, empowering businesses to make use of larger data sets with faster processing.

Companies can now employ new capabilities such as dynamic rerouting, simulations, and subsecond solver response time for last-mile delivery, supply chain management, and warehouse picking.

Major updates:

  • New containerized server API
  • Availability on major public clouds such as GCP, AWS, Azure, and more
  • Solving 700 variants of routing problems with top accuracy and speed, scaling to 10K locations per GPU
  • Ability to connect with NVIDIA Isaac Sim to discover the speed-of-light (theoretical best) performance of your next intralogistics design
  • World-record accuracy on the Gehring & Homberger vehicle routing (CVRPTW) and Li & Lim pickup-and-delivery (PDPTW) benchmarks
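For intuition about the routing problems cuOpt solves, here is a toy single-vehicle nearest-neighbor heuristic in pure Python (illustrative only; this is not the cuOpt API, and is nowhere near the quality of its solver):

```python
# Toy nearest-neighbor heuristic for single-vehicle routing: from the current
# position, always drive to the closest unvisited stop. cuOpt solves far
# richer variants (time windows, capacities, fleets) with much better routes.

def route(depot, stops, dist):
    """Visit all stops starting from depot, greedily picking the nearest."""
    order, current, remaining = [], depot, set(stops)
    while remaining:
        nxt = min(remaining, key=lambda s: dist(current, s))
        order.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return order

# Hypothetical 2D coordinates for a depot and three delivery stops.
points = {"depot": (0, 0), "a": (1, 0), "b": (5, 0), "c": (2, 1)}
d = lambda p, q: ((points[p][0] - points[q][0]) ** 2 +
                  (points[p][1] - points[q][1]) ** 2) ** 0.5

print(route("depot", ["a", "b", "c"], d))  # ['a', 'c', 'b']
```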

For more information, download NVIDIA cuOpt on the /nvidia/cuopt-resources GitHub repo or try it on NVIDIA LaunchPad.


NVIDIA Merlin

Today, NVIDIA announced updates to NVIDIA Merlin, an end-to-end recommender systems framework designed to accelerate data processing, training, inference, and deployment at scale.

With the latest release, data scientists and machine learning engineers can quickly build and deploy models on CPU and GPU, and scale to terabyte-size (TB) model training and inference.

Notable features:

  • Easily build and compare popular retrieval and ranking models across TensorFlow, XGBoost, and Implicit.
  • Build sequential and session-based recommendations by leveraging Hugging Face transformer-based architectures.
  • Use model parallelism for distributed embeddings, optimizing TB-scale model training for scalability and flexibility.
  • Scale inference to TB-size models with low latency and high throughput.
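The retrieve-then-rank pattern that Merlin models implement at scale can be sketched with a toy co-occurrence retriever (pure Python for illustration; this is not the Merlin API, and the data is hypothetical):

```python
from collections import Counter

# Toy retrieval stage of a recommender: score candidate items by how often
# they co-occurred with items in the user's history, then return the top-k.
# Merlin replaces this with learned retrieval and ranking models at scale.

def retrieve(history, co_counts, k=3):
    """history: items the user interacted with; co_counts: {item: {other: n}}."""
    scores = Counter()
    for item in history:
        for other, count in co_counts.get(item, {}).items():
            if other not in history:          # don't recommend what they have
                scores[other] += count
    return [item for item, _ in scores.most_common(k)]

# Hypothetical co-occurrence counts mined from purchase sessions.
co = {"shoes": {"socks": 5, "hat": 1}, "socks": {"laces": 2, "hat": 3}}
print(retrieve(["shoes", "socks"], co))  # ['hat', 'laces']
```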

Ease your recommender experimentation: download Merlin from the NVIDIA-Merlin GitHub repo today.


NVIDIA Tokkio

At GTC, NVIDIA presented how NVIDIA Avatar Cloud Engine (ACE) is being used to assemble complex avatar applications, including Tokkio, a reference application for building immersive, avatar-powered customer experiences.

Try out Tokkio today on NGC.


NVIDIA Maxine

At GTC, NVIDIA announced the early-access release of Maxine audio effects microservices, as well as two new audio SDK features. NVIDIA also shared the general availability of Eye Contact, Facial Expression Estimation, and other enhanced versions of existing SDKs.

The following NVIDIA Maxine microservices are in early access:

  • Audio Super Resolution: Improves real-time audio quality by upsampling the input audio stream from 8 kHz to 16 kHz, or from 16 kHz to 48 kHz.
  • Acoustic Echo Cancellation: Cancels real-time acoustic device echo from input-audio stream, eliminating mismatched acoustic pairs and double-talk. With AI-based technology, more effective cancellation is achieved than with traditional digital signal processing.
  • Background Noise Removal: Enables noise removal on users’ microphones as well as the audio feed of counterparts to make conversations easier to understand.
  • Room Echo Removal: Removes room reverberation from the voice captured by the microphone.
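As a rough illustration of the sample-rate conversion involved, the sketch below doubles a signal's sample rate by linear interpolation; Audio Super Resolution replaces this naive filter with an AI model that reconstructs the missing high-frequency content:

```python
# Naive 2x audio upsampling by linear interpolation: insert the midpoint
# between each pair of samples (e.g., 8 kHz -> 16 kHz). This is only a
# baseline sketch; Maxine's Audio Super Resolution uses a learned model.

def upsample_2x(samples):
    out = []
    for a, b in zip(samples, samples[1:]):
        out.append(a)
        out.append((a + b) / 2)  # interpolated midpoint sample
    out.append(samples[-1])
    return out

print(upsample_2x([0.0, 1.0, 0.0]))  # [0.0, 0.5, 1.0, 0.5, 0.0]
```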

Get started with NVIDIA Maxine today. 


MONAI 1.0 

Project MONAI, the Medical Open Network for AI, is continuing to expand its capabilities to accelerate medical AI training, no matter where developers are in their medical AI journey.

The release of MONAI 1.0 brings exciting new updates:

  • Model Zoo to jumpstart AI training frameworks
  • Active Learning in MONAI Label for building better datasets
  • Auto-3D Segmentation
  • Federated Learning

For more about MONAI 1.0, see Open-Source Healthcare AI Innovation Continues to Expand with MONAI 1.0 or try it on NVIDIA LaunchPad.


NVIDIA Clara Holoscan

Announced at GTC, NVIDIA Clara Holoscan SDK 0.3 now provides a lightning-fast frame rate of 240 Hz for 4K video that enables you to combine data from more sensors and build AI applications that can provide surgical guidance. With faster data transfer enabled through high-speed, Ethernet-connected sensors, you have even more tools to build accelerated AI pipelines for medical devices.

Highlights:

  • End-to-end latency of only 10 ms on a 4K 240 Hz stream
  • Streaming 4K video at 60 Hz with under 50 ms latency on an NVIDIA RTX A6000, so teams can run 15 concurrent video streams and 30 concurrent models
  • Faster data transfer and scalable connectivity for high-bandwidth network sensors
  • Developer tools to build accelerated AI pipelines for versatile medical device applications

Seeking a hands-on lab? Try Clara Holoscan on LaunchPad.


NVIDIA Metropolis

At GTC, NVIDIA announced that Siemens, the largest industrial automation company in the world, is adopting NVIDIA Metropolis and the IGX industrial edge AI platform. Metropolis on IGX will help Siemens connect entire fleets of robots and IoT devices to work together and proactively identify safety risks on factory floors and product defects on assembly lines with higher accuracy.

Additional benefits:

  • Metropolis enables Siemens to integrate next-level perception with new and existing industrial inspection systems.
  • With Metropolis and IGX, Siemens can connect a simulated digital twin factory to the physical twin, a huge step toward self-driving factories of the future.


NVIDIA Isaac

Today, NVIDIA introduced NVIDIA Isaac Sim on the cloud, providing three options to access photo-real, physically accurate robotics simulation using the newly released Omniverse Cloud, AWS RoboMaker, or self-managed instances. NVIDIA also announced NVIDIA Isaac Nova Orin as a modular and scalable compute and sensor platform for OEMs and ISVs working on autonomous mobile robots (AMRs).

Here’s how NVIDIA Isaac Sim and NVIDIA Isaac Nova Orin are helping to advance the robotics industry:

  • Simulate robots anywhere and on any device.
  • Scale simulations to meet the most compute-intensive simulation tasks like CI/CD and synthetic data generation.
  • Accelerate AMR development with an interoperable sensor and compute platform.

Be sure to attend the two-hour, hands-on NVIDIA Isaac Sim on AWS RoboMaker workshop.


NVIDIA DRIVE Thor

Today, NVIDIA introduced NVIDIA DRIVE Thor, its next-generation centralized computer for safe and secure autonomous vehicles.

NVIDIA DRIVE Thor is a superchip that achieves 2,000 teraflops of performance and can unify intelligent functions into a single architecture for greater efficiency and lower overall system cost. Smart functions include automated and assisted driving, parking, driver and occupant monitoring, digital instrument cluster, in-vehicle infotainment (IVI), and rear-seat entertainment.

New features:

  • Multi-domain computing support
  • FP8 precision
  • Transformer engine for inference, improving DNN performance by 9x
  • NVLink-C2C chip interconnect technology

For more information about NVIDIA DRIVE Thor capabilities, see NVIDIA DRIVE Thor Strikes AI Performance Balance, Uniting AV and Cockpit on a Single Computer.


NVIDIA BioNeMo

Just as AI is learning to understand human languages with transformer models, it’s also learning the languages of biology and chemistry. At GTC, NVIDIA introduced BioNeMo, a new framework and service for training large-scale transformer models using bigger datasets. This results in better-performing neural networks.

Larger models can store more predictive information on amino acid sequences and properties of different proteins. You can connect these insights to biological properties, functions, and even human health conditions.

Highlights:

  • Pretrained large language models (LLMs) for chemistry and biology
  • Optimized inference at supercomputing scale, enabling you to deploy LLMs with billions or trillions of parameters
  • Turnkey cloud service for AI drug discovery pipelines

Try MegaMolBART, a generative chemistry model in the BioNeMo framework, with a guided lab on LaunchPad.


NVIDIA AI Enterprise

NVIDIA AI Enterprise is a powerful cloud-native software suite that streamlines the entire AI workflow and unlocks the full potential of the NVIDIA AI platform.

As the operating system of the NVIDIA AI platform, NVIDIA AI Enterprise provides production-ready support for applications built with the extensive NVIDIA library of frameworks.

Whether it’s creating more engaging chatbots and AI virtual assistants with Riva, building smarter recommenders that help consumers make better purchasing decisions with NVIDIA Merlin, or developing AI-powered medical imaging and genomics with NVIDIA Clara, NVIDIA AI Enterprise supports the domain-specific frameworks developers can use to design new business solutions.


Accelerating Ultra-Realistic Game Development with NVIDIA DLSS 3 and NVIDIA RTX Path Tracing

NVIDIA recently announced Ada Lovelace, the next generation of GPUs. Named the NVIDIA GeForce RTX 40 Series, these are the world’s most advanced graphics cards. Featuring third-generation Ray Tracing Cores and fourth-generation Tensor Cores, they accelerate games that take advantage of the latest neural graphics and ray tracing technology.

Since the introduction of the GeForce RTX 20 Series, NVIDIA has paved the way for groundbreaking graphics with novel research on how AI can enhance rendering and improve computer games. NVIDIA is also committed to pushing the industry toward real-time photorealism that matches blockbuster cinema. The latest suite of technologies multiplies performance in games while accelerating how quickly developers can create content.

A revolution in neural graphics

NVIDIA DLSS (Deep Learning Super Sampling) 3 introduces an all-new neural graphics technology. It multiplies performance using AI to generate frames all while delivering best-in-class image quality and responsiveness. The new AI network, called Optical Multi-Frame Generation, is powered by dedicated AI processors called Tensor Cores and the new Optical Flow Accelerator in GeForce RTX 40 Series GPUs.  

Image of a car showing that DLSS 3 is hardware-accelerated and AI-powered to create additional high quality frames.
Figure 1. DLSS 3 delivers up to 4x improvements in frame rate and up to 2x improvements in latency

Information is taken from sequential frames and an optical flow field to generate a new high-quality frame, boosting performance in both GPU- and CPU-bound scenarios. DLSS 3 combines DLSS Frame Generation with DLSS Super Resolution and NVIDIA Reflex low latency technology. It delivers up to 4x improvements in frame rate and up to 2x improvements in latency compared to native resolution rendering.
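The frame-generation idea can be illustrated with a toy 1D example: warp pixels partway along their motion vectors to form an intermediate frame (DLSS 3 uses a trained network and the Optical Flow Accelerator; this is only a conceptual sketch):

```python
# Toy 1D frame interpolation: move each pixel a fraction t of the way along
# its motion vector and accumulate it into the intermediate frame. DLSS 3's
# Optical Multi-Frame Generation uses an AI network, not this naive warp.

def interpolate_frame(frame, flow, t=0.5):
    """frame: list of pixel values; flow: per-pixel motion in pixels."""
    out = [0.0] * len(frame)
    for i, value in enumerate(frame):
        j = i + round(flow[i] * t)     # destination at time t
        if 0 <= j < len(out):
            out[j] += value            # accumulate warped pixel
    return out

# A bright pixel at index 1 moving 2 pixels right lands at index 2 in the
# halfway frame.
print(interpolate_frame([0.0, 1.0, 0.0, 0.0], [0, 2, 0, 0]))
```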

Over 200 games and applications have adopted DLSS, while DLSS 3 is off to a roaring start with integrations coming in more than 35 games and applications. To learn more about how it works and the performance gains in integrated titles, visit DLSS 3 RTX 40 Series.

Diagram showing how AI-powered Optical Multi Frame Generation takes the Optical Flow Field, motion vector data, and Super Resolution frames to create new intermediate frames.
Figure 2. AI-powered Optical Multi Frame Generation takes the Optical Flow Field, motion vector data, and Super Resolution frames to create new intermediate frames

Take full advantage of the next-generation neural graphics technology to bring next-level performance to your games. You can integrate the Streamline SDK now, to be ready for DLSS 3 when it is publicly available. 

Join us for Examining the Latest Deep Learning Models for Real-Time Neural Graphics at GTC 2022 to learn more. 

Accelerate lighting production with NVIDIA RTX Path Tracing 

The NVIDIA RTX Path Tracing SDK is a suite of technologies that combines years of best practices within real-time ray tracing and neural graphics development. It will provide new efficiencies during lighting production while offering ultra-quality rendering modes for higher-end GPUs.

Path tracing takes a physics-based approach to how light moves around a scene. Traditionally, artists and engineers raster or ray-trace individual effects such as shadows, reflections, or indirect lighting. The process is iterative and time consuming, as developers have to approximate what the ground truth should look like and wait for the rendering to complete.  

The RTX Path Tracing SDK addresses these issues by unifying direct and indirect lighting across various kinds of materials in real time, serving as a ground truth reference to ensure accurate lighting production. This reduction in iteration can save developers time and publishers money, allowing them to focus on creating more photorealistic content.

One of the key components of the SDK is Shader Execution Reordering (SER), a new scheduling system that reorders shading work on-the-fly for better execution efficiency. An NVIDIA API extension, it is useful for ray-traced workloads, as it achieves maximum parallelism and performance from path tracing shaders. 
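The scheduling idea can be illustrated with a toy sketch: group ray-shading work by shader so adjacent threads execute the same code (SER performs this reordering on the GPU at run time; this is only an analogy in Python):

```python
# Toy illustration of the idea behind Shader Execution Reordering: after ray
# tracing, rays hit different materials in an incoherent order; sorting the
# work by shader groups same-shader rays together, which on a GPU lets
# neighboring threads run the same code. SER does this in hardware.

def reorder_by_shader(rays):
    """rays: list of (shader_id, ray_id); stable-sort by shader."""
    return sorted(rays, key=lambda r: r[0])

incoherent = [("glass", 0), ("metal", 1), ("glass", 2), ("metal", 3)]
print(reorder_by_shader(incoherent))
# [('glass', 0), ('glass', 2), ('metal', 1), ('metal', 3)]
```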

Image showing that Shader Execution Reordering improves shader performance by up to 2x, and in-game frame rates by up to 25%.
Figure 3. Shader Execution Reordering improves shader performance by up to 2x, and in-game frame rates by up to 25%

DLSS 3, Direct Illumination (RTXDI), and Real-Time Denoisers (NRD) are other components of the NVIDIA RTX Path Tracing SDK, providing you with the flexibility to mix and match these individual components within your pipeline. Explore the new Path Tracing SDK that ensures the photorealistic lighting produced is true to life in real time. 

New graphics primitives built for the future of games

NVIDIA Micro-Mesh is a graphics primitive built from the ground up for real-time path tracing. Displaced Micro-Mesh and Opacity Micro-Map SDKs give developers the tools and sample code for the creation, compression, manipulation, and rendering of micro-meshes. From fossils to crawling creatures to nature, you can express these assets in their full richness. 

Crab image showing different micromap overlays.
Figure 4. Sample micro-mesh composed of 16K individual micro-meshes (left), expanding to two million microtriangles (right), consuming approximately one byte per microtriangle

Games include more detailed geometry than ever before, and in every scene. Developers are looking for solutions to render rasterized or path-traced assets at their full fidelity. These micro-mesh technologies provide highly efficient memory compression and performance boosts for photorealistic complex materials.
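The counting behind these figures is simple: each subdivision level splits every triangle into four, so microtriangle counts grow as powers of four. A minimal sketch of that arithmetic:

```python
# Each subdivision level splits every triangle into 4, so a base mesh of
# base_triangles at n levels yields base_triangles * 4**n microtriangles.
# This only sketches the counting, not the Displaced Micro-Mesh SDK itself.

def microtriangles(base_triangles, levels):
    return base_triangles * 4 ** levels

# A hypothetical 1,000-triangle base mesh at 5 subdivision levels:
print(microtriangles(1000, 5))  # 1024000
```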

Unlike traditional graphics primitives, which are inefficient for highly detailed organic surfaces, characters, or volumes, micro-meshes are built from the ground up for real-time path tracing, enabling up to a 50x increase in geometric detail.

Adobe is committed to helping customers create new experiences. Tamy Boubekeur, Director of Adobe Research Paris, said, “We at Adobe are excited about the NVIDIA Displaced Micro-Mesh technology with native ray tracing support, which has the potential to unlock ultra-detailed, real-time, ray-traced scenes with minimal memory cost.”

In addition, Simplygon at Microsoft, the leader in 3D games content optimization, has integrated Displaced Micro-Mesh. Magnus Isaksson of Simplygon said, “We are very excited to partner with NVIDIA to enable game creators with the Simplygon SDK, to compress super-detailed objects by an order of magnitude unmatched by other solutions. With NVIDIA Displaced Micro-Mesh technology, developers can pursue crafting environments at unprecedented levels of fidelity, density, and variety, immersing players in new stunning and beautiful worlds in games.”

To learn more, visit the Micro-Mesh page. Sign up to be notified of the release of Displaced Micro-Mesh SDK and Opacity Micro-Map SDK

Graphics development with NVIDIA Nsight Developer Tools 

The NVIDIA game development ecosystem is built to empower the most advanced and stunning work in the industry. Developers can now harness the neural graphics and photoreal revolution of the GeForce RTX 40 Series GPUs. NVIDIA Nsight Developer Tools provides unprecedented access to the visual computing processes that enable NVIDIA GPUs to be utilized to their full potential. Nsight tools will launch with new updates for use on GeForce RTX 40 Series GPUs from day one of availability.

Nsight Systems offers a deep analysis of performance markers across CPU-GPU interactions to inform how a game can be fine-tuned for optimizations. With the power of the GeForce RTX 40 Series, developers will be able to identify new performance headroom and implement higher frame rate baselines accordingly.  

Nsight Graphics offers a comprehensive set of profiling and debugging tools to ensure scene rendering is performant across the entire pipeline, from acceleration structure composition down to the generation of individual pixels.

Game crashes are frustrating for players and maybe even more frustrating for developers. Nsight Aftermath on the GeForce RTX 40 Series provides precise mini-dumps that pinpoint where and why an exception occurred.

Image of the Fuyun Court location featured in an upcoming update of the NetEase game, Justice.
Figure 5. The Fuyun Court location featured in an upcoming update of the NetEase martial arts game, Justice Online

NetEase used Nsight Developer Tools to help them prepare Justice Online, their expansive MMORPG, for the GeForce RTX 40 Series. Dinggen Zhan of NetEase said, “Nsight Systems let us identify new GPU headroom to achieve higher frame performance. Nsight Graphics allowed us to optimize ray-traced lighting and shadows to make Justice even more beautiful. And Nsight Aftermath helped us ensure transitioning into the next generation of NVIDIA GPUs was without a hitch. At NetEase, we are always eager to support cutting edge graphics technology, and Nsight tools enable us to fully realize our ambition.” 

Join us for featured GTC 2022 game development sessions to learn more. Visit NVIDIA Game Development Resources for more information on integrating NVIDIA RTX and AI technologies into games.


FindIt: Generalized Object Localization with Natural Language Queries

Natural language enables flexible descriptive queries about images. The interaction between text queries and images grounds linguistic meaning in the visual world, facilitating a better understanding of object relationships, human intentions towards objects, and interactions with the environment. The research community has studied object-level visual grounding through a range of tasks, including referring expression comprehension, text-based localization, and more broadly object detection, each of which require different skills in a model. For example, object detection seeks to find all objects from a predefined set of classes, which requires accurate localization and classification, while referring expression comprehension localizes an object from a referring text and often requires complex reasoning on prominent objects. At the intersection of the two is text-based localization, in which a simple category-based text query prompts the model to detect the objects of interest.

Due to their dissimilar task properties, referring expression comprehension, detection, and text-based localization are mostly studied through separate benchmarks with most models only dedicated to one task. As a result, existing models have not adequately synthesized information from the three tasks to achieve a more holistic visual and linguistic understanding. Referring expression comprehension models, for instance, are trained to predict one object per image, and often struggle to localize multiple objects, reject negative queries, or detect novel categories. In addition, detection models are unable to process text inputs, and text-based localization models often struggle to process complex queries that refer to one object instance, such as “Left half sandwich.” Lastly, none of the models can generalize sufficiently well beyond their training data and categories.

To address these limitations, we are presenting “FindIt: Generalized Localization with Natural Language Queries” at ECCV 2022. Here we propose a unified, general-purpose and multitask visual grounding model, called FindIt, that can flexibly answer different types of grounding and detection queries. Key to this architecture is a multi-level cross-modality fusion module that can perform complex reasoning for referring expression comprehension and simultaneously recognize small and challenging objects for text-based localization and detection. In addition, we discover that a standard object detector and detection losses are sufficient and surprisingly effective for all three tasks without the need for task-specific design and losses common in existing works. FindIt is simple, efficient, and outperforms alternative state-of-the-art models on the referring expression comprehension and text-based localization benchmarks, while being competitive on the detection benchmark.

FindIt is a unified model for referring expression comprehension (col. 1), text-based localization (col. 2), and the object detection task (col. 3). FindIt can respond accurately when tested on object types/classes not known during training, e.g. “Find the desk” (col. 4). Compared to existing baselines (MattNet and GPV), FindIt can perform these tasks well and in a single model.

Multi-level Image-Text Fusion
Different localization tasks are created with different semantic understanding objectives. For example, because the referring expression task primarily references prominent objects in the image rather than small, occluded or faraway objects, low resolution images generally suffice. In contrast, the detection task aims to detect objects with various sizes and occlusion levels in higher resolution images. Apart from these benchmarks, the general visual grounding problem is inherently multiscale, as natural queries can refer to objects of any size. This motivates the need for a multi-level image-text fusion model for efficient processing of higher resolution images over different localization tasks.

The premise of FindIt is to fuse the higher level semantic features using more expressive transformer layers, which can capture all-pair interactions between image and text. For the lower-level and higher-resolution features, we use a cheaper dot-product fusion to save computation and memory cost. We attach a detector head (e.g., Faster R-CNN) on top of the fused feature maps to predict the boxes and their classes.

FindIt accepts an image and a query text as inputs, and processes them separately in image/text backbones before applying the multi-level fusion. We feed the fused features to Faster R-CNN to predict the boxes referred to by the text. The feature fusion uses more expressive transformers at higher levels and cheaper dot-product at the lower levels.
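The cheaper dot-product fusion used at the lower levels can be sketched as gating each spatial image feature by its similarity to the text embedding (pure Python for illustration; the model operates on tensors with learned projections):

```python
# Sketch of dot-product image-text fusion: score each spatial image feature
# against the text embedding, then scale the feature by that score. FindIt
# uses this cheap fusion at high-resolution levels and transformer layers at
# the semantic levels; this toy version omits the learned projections.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dot_product_fusion(image_features, text_embedding):
    """image_features: list of per-location vectors; returns gated vectors."""
    fused = []
    for feat in image_features:
        gate = dot(feat, text_embedding)       # scalar relevance to the query
        fused.append([gate * x for x in feat])  # gate the feature
    return fused

feats = [[1.0, 0.0], [0.0, 1.0]]
text = [1.0, 0.0]  # query embedding aligned with the first location
print(dot_product_fusion(feats, text))  # second location is suppressed
```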

Multitask Learning
Apart from the multi-level fusion described above, we adapt the text-based localization and detection tasks to take the same inputs as the referring expression comprehension task. For the text-based localization task, we generate a set of queries over the categories present in the image. For any present category, the text query takes the form “Find the [object],” where [object] is the category name. The objects corresponding to that category are labeled as foreground and the other objects as background. Instead of using the aforementioned prompt, we use a static prompt for the detection task, such as “Find all the objects.” We found that the specific choice of prompts is not important for text-based localization and detection tasks.

After adaptation, all tasks in consideration share the same inputs and outputs — an image input, a text query, and a set of output bounding boxes and classes. We then combine the datasets and train on the mixture. Finally, we use the standard object detection losses for all tasks, which we found to be surprisingly simple and effective.
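The task adaptation can be sketched directly: one “Find the [object]” query per category present in the image, with the matching boxes as foreground (a pure-Python paraphrase of the paper's setup, using hypothetical annotations):

```python
# Generate text-based localization queries in FindIt's shared format: for
# each category present in the image, emit a "Find the [object]" query whose
# foreground boxes are the objects of that category. Data here is made up.

def make_queries(objects):
    """objects: list of (category, box). Returns {query: foreground boxes}."""
    queries = {}
    for category, box in objects:
        queries.setdefault(f"Find the {category}", []).append(box)
    return queries

objects = [("cat", (0, 0, 10, 10)),
           ("dog", (5, 5, 9, 9)),
           ("cat", (20, 20, 30, 30))]
print(make_queries(objects))
```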

Evaluation
We apply FindIt to the popular RefCOCO benchmark for referring expression comprehension tasks. When only the COCO and RefCOCO datasets are available, FindIt outperforms the state-of-the-art models on all tasks. In the settings where external datasets are allowed, FindIt sets a new state of the art by using COCO and all RefCOCO splits together (no other datasets). On the challenging Google and UMD splits, FindIt outperforms the state of the art by a 10% margin, demonstrating the benefits of multitask learning.

Comparison with the state of the art on the popular referring expression benchmark. FindIt is superior on both the COCO and unconstrained settings (additional training data allowed).

On the text-based localization benchmark, FindIt achieves 79.7%, higher than the GPV (73.0%) and Faster R-CNN (75.2%) baselines. Please refer to the paper for more quantitative evaluation.

We further observe that FindIt generalizes better to novel categories and super-categories in the text-based localization task compared to competitive single-task baselines on the popular COCO and Objects365 datasets, shown in the figure below.

FindIt on novel and super categories. Left: FindIt outperforms the single-task baselines especially on the novel categories. Right: FindIt outperforms the single-task baselines on the unseen super categories. “Rec-Single” is the Referring expression comprehension single task model and “Loc-Single” is the text-based localization single task model.

Efficiency
We also benchmark the inference times on the referring expression comprehension task (see Table below). FindIt is efficient and comparable with existing one-stage approaches while achieving higher accuracy. For fair comparison, all running times are measured on one GTX 1080Ti GPU.

Model            Image Size   Backbone    Runtime (ms)
MattNet          1000         R101        378
FAOA             256          DarkNet53   39
MCN              416          DarkNet53   56
TransVG          640          R50         62
FindIt (Ours)    640          R50         107
FindIt (Ours)    384          R50         57

Conclusion
We present FindIt, which unifies referring expression comprehension, text-based localization, and object detection tasks. We propose multi-scale cross-attention to unify the diverse localization requirements of these tasks. Without any task-specific design, FindIt surpasses the state of the art on referring expression and text-based localization, shows competitive performance on detection, and generalizes better to out-of-distribution data and novel classes. All of these are accomplished in a single, unified, and efficient model.

Acknowledgements
This work is conducted by Weicheng Kuo, Fred Bertsch, Wei Li, AJ Piergiovanni, Mohammad Saffar, and Anelia Angelova. We would like to thank Ashish Vaswani, Prajit Ramachandran, Niki Parmar, David Luan, Tsung-Yi Lin, and other colleagues at Google Research for their advice and helpful discussions. We would like to thank Tom Small for preparing the animation.