Categories
Misc

Turbocharging Generative AI Workloads with NVIDIA Spectrum-X Networking Platform


Large Language Models (LLMs) and AI applications such as ChatGPT and DALL-E have recently seen rapid growth. Thanks to GPUs, CPUs, DPUs, high-speed storage, and AI-optimized software innovations, AI is now widely accessible. You can even deploy AI in the cloud or on-premises.

Yet AI applications can be very taxing on the network, and this growth is burdening CPU and GPU servers, as well as the existing underlying network infrastructure that connects these systems together.  

Traditional Ethernet, while sufficient for mainstream and enterprise applications such as web, video, or audio streaming, is not optimized to support the new generation of AI workloads. It works well for loosely coupled applications and low-bandwidth flows that can tolerate high jitter, and it can carry heterogeneous traffic (such as web, video, or audio streaming; file transfers; and gaming), but it is not ideal when oversubscription occurs.

Designed from the ground up to meet the performance demands for AI applications, NVIDIA Spectrum-X networking platform is an end-to-end solution that is optimized for high-speed network performance, low latency, and scale.

NVIDIA Spectrum-X

NVIDIA Spectrum-X networking platform was developed to address traditional Ethernet network limitations. It is a network fabric designed to meet the needs of demanding AI applications and tightly coupled processes.

This NVIDIA-certified and tested end-to-end solution combines the best-in-class, AI-optimized networking hardware and software to provide a predictable, consistent, and uncompromising level of performance required by AI workloads. 

Graphic listing the benefits of NVIDIA Spectrum-X: NCCL-optimized RoCE extensions; tightly coupled switch + adapter synergy; advanced performance isolation; tuned for GPT, BERT, RetinaNet, and Spark; end-to-end provisioning for faster time to AI; NVIDIA certified and tested.
Figure 1. NVIDIA Spectrum-X networking platform combines the NVIDIA Spectrum-4 Ethernet switch with NVIDIA BlueField-3 DPU to provide optimal performance for AI workloads

NVIDIA Spectrum-X is a highly versatile technology that can be used with various AI applications. Specifically, it can significantly enhance the performance and efficiency of AI clusters in the following use cases: 

  • GPT and BERT LLMs 
  • Distributed training and parallel processing 
  • Natural language processing (NLP)
  • Computer vision  
  • High-performance simulation (NVIDIA Omniverse and NVIDIA OVX)
  • High-performance data analytics (Spark) 
  • Inference applications 

The two key elements of the NVIDIA Spectrum-X platform are the NVIDIA Spectrum-4 Ethernet switch and the NVIDIA BlueField-3 DPU.

NVIDIA Spectrum-4 Ethernet switch

NVIDIA Spectrum-4 Ethernet switch provides unprecedented application performance for AI clusters built on standards-based Ethernet. Realizing the full potential of NVIDIA Spectrum-4 requires an end-to-end, purpose-built network architecture. Only the NVIDIA Spectrum-X platform provides the hardware accelerators and offloads needed to power hyperscale AI. 

NVIDIA Spectrum-4 Ethernet switches are built on the 51.2-Tbps Spectrum-4 ASIC, with 4x the bandwidth of the previous generation. Spectrum-4 is the world’s first Ethernet switching platform designed for AI workloads, combining a specialized high-performance architecture with standard Ethernet connectivity.

NVIDIA Spectrum-4 offers:

  • RoCE extensions: unique enhancements to RoCE, including:
    • RoCE Adaptive Routing
    • RoCE Performance Isolation
    • Simplified, Automated Adaptive Routing and RoCE Configurations
    • Synchronized Collectives
    • Other RoCE enhancements for HPC
  • Highest effective bandwidth on Ethernet at scale
  • Low latency with low jitter and short tail 
  • Deterministic performance and performance isolation
  • Full stack and end-to-end optimization
  • NVIDIA Cumulus Linux or SONiC
Figure 2. NVIDIA Spectrum-4 combines specialized high-performance architecture with standard Ethernet connectivity

Key benefits of NVIDIA Spectrum-X with NVIDIA Spectrum-4 include the following:

  • Using RoCE extensions for AI and adaptive routing (AR) to achieve maximum NVIDIA Collective Communication Library (NCCL) performance.
  • Leveraging performance isolation to ensure that, in a multi-tenant and multi-job environment, one job does not impact another.
  • Ensuring that the fabric continues to deliver the highest performance if a network component fails.
  • Synchronizing with the BlueField-3 DPU to achieve optimal NCCL and AI performance.
  • Maintaining consistent, steady performance under various AI workloads, which is vital for achieving SLAs.

End-to-end optimal network performance

Building an effective AI compute fabric requires optimizing every part of the AI network, from DPUs to switches to networking software. Achieving the highest effective bandwidth at load and at scale demands techniques such as RoCE adaptive routing and advanced congestion control. Capabilities that work synchronously across NVIDIA BlueField-3 DPUs and Spectrum-4 switches are crucial to achieving the highest performance and reliability from the AI fabric.

RoCE adaptive routing

AI workloads and applications are characterized by a small number of elephant flows responsible for the large data movement between GPUs, where the tail latency highly impacts the overall application performance. Catering to such traffic patterns with traditional network routing mechanisms can lead to inconsistent and underutilized GPU performance for AI workloads.

RoCE adaptive routing is a fine-grained load balancing technology. It dynamically reroutes RDMA data to avoid congestion and provide optimal load balancing to achieve the highest effective data bandwidth. 

It is an end-to-end capability that spans Spectrum-4 switches and BlueField-3 DPUs. The Spectrum-4 switches select the least-congested port for data transmission on a per-packet basis. Because different packets of the same flow can travel through different network paths, they may arrive out of order at their destination. BlueField-3 handles any out-of-order data at the RoCE transport layer, transparently delivering in-order data to the application.

Spectrum-4 evaluates congestion based on egress queue loads, ensuring all ports are well-balanced. For every network packet, the switch selects the port with the minimal load over its egress queue. Spectrum-4 also receives status notifications from neighboring switches, which influence the routing decision. The queues evaluated are matched with the quality-of-service level. 

As a result, NVIDIA Spectrum-X enables up to 95% effective bandwidth across the hyperscale system at load, and at scale. 
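
To make the per-packet decision concrete, the following toy sketch (purely conceptual, not the actual Spectrum-4 ASIC logic) illustrates the idea of choosing the egress port whose queue for the packet's quality-of-service level is least loaded, while accounting for congestion hints reported by the neighboring switch:

from dataclasses import dataclass

@dataclass
class EgressPort:
    name: str
    queue_depth: dict             # QoS level -> queued bytes on this port
    neighbor_congestion: int = 0  # congestion hint reported by the downstream switch

def select_egress_port(ports, qos_level):
    # Pick the least-loaded port for this packet's QoS level.
    return min(ports, key=lambda p: p.queue_depth.get(qos_level, 0) + p.neighbor_congestion)

ports = [
    EgressPort("p0", {0: 12_000}, neighbor_congestion=4_000),
    EgressPort("p1", {0: 9_000}),
    EgressPort("p2", {0: 20_000}),
]
print(select_egress_port(ports, qos_level=0).name)  # "p1" has the lightest load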

Figure 3. NVIDIA Spectrum-4 typical data center deployment structure

RoCE congestion control

Applications running concurrently on hyperscale cloud systems may suffer from degraded performance and unpredictable run times due to network-level congestion. This congestion can be caused by the network traffic of the application itself or by background traffic from other applications. Its primary cause is known as many-to-one congestion, where multiple data senders transmit to a single data receiver.

Such congestion cannot be solved with adaptive routing; it requires per-endpoint metering of data flows. Congestion control is an end-to-end technology: Spectrum-4 switches provide network telemetry representing real-time congestion data. This telemetry is processed by the BlueField DPUs, which manage and control each sender’s data injection rate, maximizing the efficiency of network sharing.

Without congestion control, many-to-one scenarios cause network back-pressure, congestion spreading, and even packet drops, which dramatically degrade network and application performance.

In the congestion control process, BlueField-3 DPUs execute the congestion control algorithm. They handle millions of congestion control events per second with microsecond reaction latency and apply fine-grained rate decisions.

The Spectrum-4 switch in-band telemetry carries both queuing information for accurate congestion estimation and port utilization indications for fast recovery. NVIDIA RoCE congestion control significantly improves congestion discovery and reaction time by letting telemetry data bypass the congested flow’s queueing delay while still providing accurate and concurrent telemetry.
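
As a rough mental model only (not the actual BlueField-3 algorithm, whose details are not described in this post), a sender-side rate controller driven by switch telemetry might look like this:

def update_injection_rate(rate_gbps, queue_depth_bytes, port_utilization,
                          max_rate_gbps=400.0,
                          congestion_threshold_bytes=64_000):
    # Telemetry shows queues building up: back off this flow's injection rate.
    if queue_depth_bytes > congestion_threshold_bytes:
        return max(rate_gbps * 0.5, 1.0)
    # Path has headroom: recover toward line rate for fast convergence.
    if port_utilization < 0.8:
        return min(rate_gbps + 10.0, max_rate_gbps)
    return rate_gbps

The real implementation runs in BlueField-3 hardware at millions of events per second; the point here is only the shape of the feedback loop: telemetry in, per-flow rate decision out.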

RoCE performance isolation 

AI hyperscale and cloud infrastructures need to support a growing number of users (tenants) and parallel applications or workflows. These users and applications compete for the infrastructure’s shared resources, such as the network, and can therefore impact one another’s performance.

The NVIDIA Spectrum-X platform includes mechanisms that, combined, deliver performance isolation, ensuring that one workload cannot create network congestion that impacts the data movement or performance of another. The performance isolation mechanisms include quality-of-service isolation, RoCE adaptive routing for data-path spreading, and RoCE congestion control.

The NVIDIA Spectrum-X platform features tight integration of software and hardware, enabling deeper understanding of AI workloads and traffic patterns. Such an infrastructure provides the capabilities to test with large workloads using a dedicated Ethernet AI cluster. By leveraging telemetry from Spectrum Ethernet switches and BlueField-3 DPUs, NVIDIA NetQ can detect network issues proactively and troubleshoot network issues faster for optimal use of network capacity. 

The NVIDIA NetQ network validation and ASIC monitoring tool set provides visibility into network health and behavior. The NetQ flow telemetry analysis shows the paths that data flows take as they traverse the network, providing network latency and performance insights.

Increased energy efficiency

Power capping has become a common practice in data centers due to the growing demand for computing resources and the need to control energy costs. The Spectrum-4 ASIC and optical innovations enable simplified network designs that improve performance per watt, achieving better efficiency and delivering faster AI insights, without exceeding network power budgets. 

Summary

NVIDIA Spectrum-X networking platform is designed especially for demanding AI applications. With higher performance compared to traditional Ethernet, lower power consumption, lower TCO, full stack software-hardware integration, and massive scale, NVIDIA Spectrum-X is the ideal platform for running existing and future AI workloads. 


Categories
Misc

Develop a Multi-Robot Environment with NVIDIA Isaac Sim, ROS, and Nimbus


The need for a high-fidelity multi-robot simulation environment is growing rapidly as more and more autonomous robots are being deployed in real-world scenarios. In this post, I will review what we used in the past at Cogniteam for simulating multiple robots, our current progress with NVIDIA Isaac Sim, and how Nimbus can speed up the development and maintenance of a multi-robot simulation with Isaac Sim.   

Multi-robot simulation with Unreal Tournament game engine 

About 20 years ago, my friends at Cogniteam and I started our robotic development careers with the idea of a robotic framework for multi-robot task allocation and teamwork. The framework was originally called CogniTAO; a simplified version of it was later published as the ROS decision_making package.

At the time, use cases for multiple robots were scarce, and 3D simulation for those robots was not possible. So I wrote a mod for the Unreal Tournament 2000-2004 game engine to enable simulation for four robots. It took our small team of four programmers about 3 years to develop a simulated environment that could reliably run for 15 minutes.

Figure 1. Simulation of four robots (left) and video from the robots (right)

This environment was able to simulate four robots with a camera, Hokuyo LiDAR, odometry, and mapping on five state-of-the-art desktops, and remotely receive video feeds from each. One of our engineers wrote a C++ TCP client that would stream the data on the local network directly from the game engine and display it in fullscreen. We had to run the code in strict order to make the robots spawn on time and in the correct place.

Multi-robot simulation with Gazebo

Fast forward 10 years to 2013 when we transitioned our work to Gazebo after it became the de facto platform for robotic simulation. It took three programmers about 2 years to simulate 10 robots on two Intel Xeon machines. They used the ROS move_base navigation stack and object detection using OpenCV Hough Circle Transform—what robotics teams used for demos before TensorFlow. Igor Makhtes, our colleague at the time, built the RQT plugin to control and show data streams from multiple robots (Figure 2). It took him 6 months to complete.

Figure 2. Video feed and map view for 10 robots with RQT plugin

These robots had to communicate with each other, but also needed to operate when a connection was unavailable. To make this possible, each had to run its own ROS master and sync through a ROS multimaster network.

Multi-robot simulation with NVIDIA Isaac Sim

A few months ago, I asked Saar Moseri, a computer science student on our algorithmic team at Cogniteam, to set up a multi-robot simulation scenario using the cloud robotics ecosystem Nimbus and NVIDIA Isaac Sim. Our internal test team and I hoped to use the Nimbus agent to control our robots and view the data they generate.

It took Saar about 2 weeks to familiarize himself with the environment and configure the system. Figure 3 shows the result of this effort, running on a standard (single) desktop machine with an NVIDIA GeForce RTX 3080 in the Cogniteam lab.

Figure 3. NVIDIA Isaac Sim multi-robot default setup

Saar used the Isaac Sim documentation available through NVIDIA NGC to install and set up the environment. Using Nimbus, he installed an agent on the simulation machine and created a gateway node to receive data from the simulation through ROS.

 Figure 4. Nimbus robot editor (left) and Nimbus configuration editor (right)

We then created the node configuration shown in Figure 5.

Figure 5. Nimbus simple mission configuration with move_base navigation

The two building blocks (already containerized) are a gateway node and a node for move_base navigation. This configuration was deployed to the agent running on the simulation desktop in the Cogniteam lab. Other more complex configurations are also available (with sources) in the Nimbus hub, including nodes for GMapping, path following, and more. 

My team and I were stunned by the endless possibilities this approach enables. In the configuration described above, simulated sensory data arrives from Isaac Sim through the ROS gateway, which supports both ROS and ROS 2. View and control capabilities are enabled by Nimbus. 

Out of the box, this setup enables our team to carry out basic simulation tasks and simulate the control of a robot fleet locally in our lab, along with many more capabilities. We can now record simulated runs and sensory data from robots, remotely SSH into a simulation machine, monitor simulation data globally, and even send email and SMS notifications about simulation progress to our validation team—all from a web browser. 

Combining Isaac Sim with Nimbus results in a unified system that is similar in features to available cloud simulation offerings, but runs on a local machine and does not involve additional cloud simulation compute costs. Additionally, it opens new cutting-edge simulation flows, such as simulation with hardware in the loop. This is not possible when the simulation runs in the cloud. Figure 6 shows how the control, navigation, and mapping look in Nimbus.

Figure 6. Nimbus robot WebRTC video monitoring (left) and Nimbus map view and autonomy control (right)

To replicate the setup described, reference the Isaac Sim documentation. Then visit Nimbus to create a free account, log in, and follow the instructions to create a robot using a free license. 

After the robot agent is installed on the same desktop on which Isaac Sim is running headless, you will be able to provision the simulation through remote SSH and monitor the simulation machine from the Nimbus website.

Video 1. Nimbus and NVIDIA Isaac Sim demo video

Visit the Nimbus hub to deploy the Isaac Sim configuration. Since everything is already containerized (including Isaac Sim) and control is browser-based, you do not need to install any applications. The agent on the machine will set up everything needed to execute. 

Then, on the monitor page of that agent, add monitoring for any data that is relevant to your setup. In the agent settings, you can define notifications by adding conditions on ROS streams such as:

“if GoalStatus == ABORTED”
send sms/mail to [email protected]
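
If you want to prototype the same kind of condition outside of Nimbus, a minimal rospy sketch (assuming a standard move_base setup publishing /move_base/status) could watch for aborted goals and call whatever notification hook you have available:

import rospy
from actionlib_msgs.msg import GoalStatus, GoalStatusArray

def notify(message):
    # Placeholder hook; Nimbus provides email/SMS notifications as a built-in action.
    rospy.logwarn(message)

def on_status(msg):
    for status in msg.status_list:
        if status.status == GoalStatus.ABORTED:
            notify("Navigation goal aborted: %s" % status.text)

if __name__ == "__main__":
    rospy.init_node("goal_status_monitor")
    rospy.Subscriber("/move_base/status", GoalStatusArray, on_status)
    rospy.spin()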

Cogniteam is happy to help you in the process. You can reach us at [email protected]

Summary

For the successful deployment of autonomous robots, simulation is key. Running the same scenario multiple times is crucial for testing, but multi-robot simulations differ. Developing a high-fidelity multi-robot simulated environment is complex and takes time, but it can be simplified with NVIDIA Isaac Sim and Nimbus, as described in this post.

My team and I will be attending ICRA 2023 in London, May 29 to June 2 (Booth C22), showcasing our browser interface to robots and simulations running remotely in Israel.

To learn more about Isaac Sim, check out the NVIDIA Developer Isaac ROS Forum.

Categories
Misc

Enhancing Customer Experience in Telecom with NVIDIA Customized Speech AI


The telecom sector is transforming how communication happens. Striving to provide reliable, uninterrupted service, businesses are tackling the challenge of delivering an optimal customer experience.

This optimal customer experience is something many long-time customers of large telecom service providers do not have. Take Jack, for example. His call was on hold for 10 minutes, which made him late for work. Jill, the third agent he spoke with, read the brief note provided by the previous agent but had trouble understanding it. So, she asked Jack a few questions to clarify. With no co-workers available, Jill consulted multiple policy documents to address Jack’s concerns. Several resources later, Jill located the necessary information, but sadly, Jack had already ended the call.

Long wait times, complex service requests, and a lack of personalization are some of the common issues faced by customers, leading to dissatisfaction and churn. To overcome these challenges, the telecom sector is turning to AI—specifically conversational AI, a technology that leverages speech, translation, and natural language processing (NLP) to facilitate human-like interactions.

This post explores why conversational AI systems are essential and why it is important to have a high level of transcription accuracy for optimal performance in downstream tasks. We explain the NVIDIA Riva speech recognition customization techniques Quantiphi has used to improve transcription accuracy.

Join us on June 7 for the webinar Empower Telco Contact Center Agents with Multi-Language Speech-AI-Customized Agent Assists featuring live demos from Infosys, Quantiphi, and NVIDIA.

Accuracy in conversational AI systems

In telco contact centers, highly accurate conversational AI systems are essential for several reasons. Conversational AI systems can help agents extract valuable information from call interactions and make informed decisions, leading to improved service quality and customer experience. 

One key component in a conversational AI system is automatic speech recognition (ASR), also known as speech recognition or speech-to-text. Downstream tasks in telco contact centers heavily rely on accurate transcription provided by ASR systems. These tasks encompass a wide range of applications such as: 

  • Customer insights
  • Sentiment analysis
  • Call classification
  • Call transcription

Quick and accurate responses are vital for efficient and effective customer service. That means reducing the overall latency of individual components, including ASR, is very important. By reducing the time required to complete a task, contact center agents can provide prompt solutions, leading to enhanced customer satisfaction and loyalty.

Moreover, accurate transcription that includes punctuation enhances readability. Clear and well-punctuated transcriptions help agents better understand customer queries, facilitating clear communication and problem solving. This, in turn, improves the overall efficiency and effectiveness of customer interactions.

NVIDIA Riva automatic speech recognition pipeline

Speech-to-text receives an audio stream as input, transcribes it, and produces the transcribed text as output (Figure 1). First, the audio stream goes to an audio feature extractor and preprocessor, which filter out noise and capture audio spectral features in a spectrogram or mel spectrogram. Then, an acoustic model, together with a language model, transcribes the speech into text. Punctuation is added to the transcribed text to improve readability. 

Figure 1. Diagram of the end-to-end automatic speech recognition pipeline

Performance evaluation metrics for ASR systems

The performance of an ASR system can be measured using three metrics:

  1. Accuracy is fundamental, as it directly affects the quality and reliability of the transcriptions. By measuring accuracy through metrics like word error rate (WER), the system can be evaluated in terms of how well it transcribes spoken words. A low WER is vital in contact centers, as it ensures that customer queries and interactions are precisely captured, enabling agents to provide accurate and appropriate responses.
  2. Latency is the time taken to generate the transcript for a segment of audio. To maintain an engaging experience, a transcription system must deliver transcripts with minimal delay, ideally no more than a few hundred milliseconds. Low latency ensures a seamless customer experience, enhancing overall efficiency and customer satisfaction.
  3. Cost to develop and run a transcription service on sufficient compute infrastructure is another important measure. Although AI-based transcription is inexpensive compared to human interpreters, cost must be weighed along with other factors.

In a contact center setting, a transcription system must excel in accuracy to provide reliable transcriptions, offer low latency for prompt customer interactions, and consider cost factors to ensure a cost-effective and feasible solution for the organization. By optimizing all three metrics, the transcription system can effectively support contact center operations and enhance delivery of customer service.
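
For reference, WER is computed as the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. Here is a minimal illustration (production evaluations typically use a library such as jiwer):

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("what is 5G", "what is five g"))  # ~0.67: "5G" -> "five" plus an inserted "g"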

Methods to improve ASR accuracy 

As shown in Figure 2, there are several techniques that can be used to achieve the best possible transcription accuracy for a specific domain, the easiest of which is word boosting. ASR word boosting involves passing to the model a list of important, possibly out-of-vocabulary, domain-specific words as additional input. This enables the ASR module to recognize such words during inference.

Architecture diagram showing customization across the ASR pipeline; left to right: speech, feature extraction, acoustic model, decoder model, punctuation model, and text
Figure 2. Customization across the ASR pipeline

In most cases, certain nouns (such as the names of companies or services) are either not in the vocabulary, or are frequently mistranscribed by the ASR model. These nouns were added to the list of words to be boosted. This strategy enabled us to easily improve recognition of specific words at request time.

In addition to word boosting, the Quantiphi team used inverse text normalization, custom vocabulary, n-gram language model training, and acoustic model fine-tuning, as detailed in the following sections.

Customized speech-assisted conversational AI systems 

One of the most significant challenges faced by customer contact centers in the telecom industry is the long time it takes to resolve complex queries. Agents typically need to consult with multiple stakeholders and internal policy documents to respond to complex queries. 

Conversational AI systems provide relevant documentation, insights, and recommendations, thereby enabling contact center agents to expedite the resolution of customer queries. 

The Quantiphi solution architecture for customized speech-assisted conversational AI pipeline involves the following: 

  1. Speech recognition pipeline: Creates transcriptions by capturing spoken language and converting it into text
  2. Intent slot model: Identifies user intent 
  3. Semantic search pipeline: Retrieves answers for the agent query through the dialog manager 

Quantiphi built a semantic search engine and a question-answering solution (Figure 3). It retrieves the most relevant documents for a given query and generates a concise answer for telco contact center agents.

Diagram showing the Quantiphi question-answering solution with four components: (1) speech recognition, where the ASR system transcribes the user query to text; (2) intent identification and slot classification, which identifies user intent and entities; (3) an answer extender, which maintains context and facilitates a continuous, coherent conversation; and (4) semantic search, a pipeline that leverages NeMo with an information retrieval system for question answering.
Figure 3. Quantiphi question-answering solution with semantic search engine

ASR, in conjunction with question-answering (QnA) systems, is also used in virtual agents and avatar-based chatbots. The accuracy of ASR transcripts has a significant impact on the accuracy of agent assist, virtual agents, and avatar-based chatbots, since the transcripts are the input to responses generated by a retrieval-augmented generation (RAG) pipeline. Even a slight discrepancy in the way the query is transcribed can cause the generative model to provide incorrect responses.

The Quantiphi team tried off-the-shelf ASR models, which sometimes failed to correctly transcribe proper nouns. The quality of the ASR transcription is of paramount importance when it is used in conjunction with question-answering pipelines, as shown in the following example:

Query: What is 5G?

ASR transcript: What is five g.

Generator response: Five grand is the amount of money you can earn if you work in a factory for a month.

Correct response: 5G is the next generation of wireless technology. It will be faster, more reliable, and more secure than 4G LTE.

To overcome such issues, we used word boosting, inverse text normalization, custom vocabulary, language model training, and acoustic model fine-tuning.

Word boosting

Words (or acronyms) such as mMTC and MEC were often transcribed incorrectly. We addressed this with word boosting. Consider the following example:

Before word boosting

Multi axis edge computing, also known as MEG is a type of network architecture that provides cloud computing capabilities and an It service environment at the edge of the network.

Mtc Fis a service area that offers low bandwidth connectivity with deep coverage.

After word boosting

Multi access edge computing also known as MEC is a type of network architecture that provides cloud computing capabilities and an IT service environment at the edge of the network.

mMTC is a service area that offers low bandwidth connectivity with deep coverage.

The before and after examples show how the output changes even when there is only a slight difference in the way an n-gram is transcribed. Through inverse text normalization, the ASR model transcribes phrases such as ‘five g’ as ‘5G’, further improving the QnA pipeline’s performance.
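
As a rough sketch of how word boosting is applied at request time with the nvidia-riva-client Python package (assuming a Riva server at localhost:50051 and a WAV file named query.wav; exact parameters depend on the deployed models):

import riva.client

auth = riva.client.Auth(uri="localhost:50051")
asr = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    language_code="en-US",
    max_alternatives=1,
    enable_automatic_punctuation=True,
)
# Boost domain-specific terms so the decoder favors them during inference.
riva.client.add_word_boosting_to_config(config, ["MEC", "mMTC", "5G"], 4.0)

with open("query.wav", "rb") as f:
    response = asr.offline_recognize(f.read(), config)
print(response.results[0].alternatives[0].transcript)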

Adding customized vocabulary to ASR

Most use cases typically have certain domain-specific words and jargon associated with them. To include these words in the ASR output, we added them to the vocabulary file and rebuilt the ASR model. For more details, see the tutorial How to Customize Riva ASR Vocabulary and Pronunciation with Lexicon Mapping.

Training n-gram language models

The contexts present in QnA tasks typically form a good source of text corpus to train an n-gram language model. A customized language model results in ASR outputs that are more receptive to sequences of words that commonly appear in the domain. We used an NVIDIA NeMo script to train a KenLM model and integrated it with the ASR model at build time.

Fine-tuning acoustic models

To further improve ASR performance, we fine-tuned an ASR acoustic model with 10-100 hours of small chunks (5-15 seconds) of audio data, with their corresponding ground-truth text. This helped the acoustic model to pick up regional accents. We used the Riva Jupyter notebook and NeMo for this fine-tuning. We further converted this checkpoint to Riva format using the nemo2riva tool and built it using the riva-build command.
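
As an illustration of how such fine-tuning data is typically laid out, NeMo ASR expects a JSON-lines manifest with one entry per audio chunk and its ground-truth text (the paths and transcripts below are placeholders):

import json

samples = [
    {"audio_filepath": "/data/chunks/call_0001.wav", "duration": 8.4,
     "text": "i would like to upgrade my 5G plan"},
    {"audio_filepath": "/data/chunks/call_0002.wav", "duration": 12.1,
     "text": "the mec node at the edge site is unreachable"},
]

# Write one JSON object per line, the manifest format NeMo training scripts consume.
with open("train_manifest.json", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")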

Key takeaways

Question-answering and insights extraction make up conversational solutions that empower telecom customer service agents to provide personalized and efficient support. This improves customer satisfaction and reduces agent churn. To achieve highly accurate QnA and insights extraction solutions, it is necessary to provide high-accuracy transcriptions as an input to the rest of the pipeline. 

Quantiphi achieved the highest possible accuracy by customizing speech recognition models with NVIDIA Riva ASR word boosting, inverse text normalization, custom vocabulary, training language models and fine-tuning acoustic models. This was not possible with off-the-shelf solutions. 

What does that mean for Jack and Jill? Equipped with telco-customized speech-assisted conversational AI applications, Jill can quickly scan through the AI-generated summary of Jack’s previous conversations. Just as Jack finishes asking a question, her screen is already populated with the most relevant document to resolve Jack’s query. She swiftly conveys the information to Jack. He decides to answer the survey with positive feedback and still arrives at work on time. 

Get in touch with experts at Quantiphi to embark on a comprehensive exploration of how conversational AI can profoundly augment your organization’s customer experience. If you are interested in diving deeper into the technical aspects of constructing agent assist solutions,  join us for the webinar, Empower Telco Contact Center Agents with Multi-Language Speech-AI-Customized Agent Assists.

Categories
Misc

How Language Neutralization Is Transforming Customer Service Contact Centers


According to Gartner,® “Nearly half of digital workers struggle to find the data they need to do their jobs, and close to one-third have made a wrong business decision due to lack of information awareness.”1 To address this challenge, more and more enterprises are deploying AI in customer service, as it helps to provide more efficient and information-based personalized services.

Technologies such as speech-to-text, text-to-speech, translation, deep learning, transformer models, and generative AI have changed how businesses interact with customers. These technologies enable:

  • Real-time analysis of customer feedback
  • Automation of customer interactions
  • Accurate and personalized AI-based recommendations that assist human agents in handling customer inquiries

AI algorithms can process and analyze vast amounts of data, identify customer needs and behavior patterns, and empower the creation of engaging and satisfying customer experiences. Overall, the use of AI in customer service has significantly improved the quality and efficiency of customer interactions, benefiting both businesses and customers.

Join us on June 7 for the webinar Empower Telco Contact Center Agents with Multi-Language Speech-AI-Customized Agent Assists featuring live demos from Infosys, Quantiphi, and NVIDIA.

Language barrier challenges in contact centers

In the global economy, businesses operate across countries and serve customers with diverse linguistic and cultural backgrounds. This global language diversity presents a unique challenge for contact centers.

Effective communication is critical to providing excellent customer service, and language barriers can lead to miscommunication, misunderstandings, and frustration. This can result in dissatisfied customers and missed business opportunities.

Traditional approaches to multilingual support, such as hiring native speakers, training agents in different languages, and providing language-specific scripts are not scalable, cost effective, or efficient.

However, advances in speech AI and translation AI technology are helping contact centers overcome language barriers through language neutralization. This innovation has been crucial for contact centers catering to diverse customers.

What is language neutralization?

In the context of contact centers, language neutralization refers to the process of using transcription, translation, and speech synthesis (TTS) technologies to convert communication from a customer’s natural language to a language that an agent can understand. The agent then responds in their own language, which is again converted through transcription, translation, and speech synthesis, or a combination based on the scenario.

Language neutralization enables effective communication between parties who may not speak the same language, removing language barriers and facilitating smooth interaction. This technique involves advanced AI technologies to equip contact center agents with tools to help them understand customer queries and respond effectively.
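
Structurally, the round trip looks like the sketch below. The transcribe, translate, and synthesize functions are placeholders for the underlying speech and translation services (in the Infosys Cortex case, backed by NVIDIA Riva), not real APIs:

def transcribe(audio, language):
    ...  # ASR: customer speech -> text in the customer's language

def translate(text, source, target):
    ...  # NMT: text in one language -> text in the other

def synthesize(text, language):
    ...  # TTS: text -> audio in the target language

def neutralize_customer_turn(customer_audio, customer_lang, agent_lang):
    # Customer speech becomes agent-readable text.
    text = transcribe(customer_audio, customer_lang)
    return translate(text, source=customer_lang, target=agent_lang)

def deliver_agent_reply(agent_text, customer_lang, agent_lang):
    # The agent's reply is translated back and spoken in the customer's language.
    translated = translate(agent_text, source=agent_lang, target=customer_lang)
    return synthesize(translated, customer_lang)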

Overcoming language barriers

Language neutralization is particularly important for contact centers that provide support services to customers from diverse linguistic and cultural backgrounds. Using language neutralization techniques, contact centers can effectively communicate with non-native speakers and provide them with the same level of service as native speakers.

Infosys Cortex language neutralization powered by NVIDIA Riva

Infosys Cortex, an AI-driven customer engagement platform, transforms contact center operations through purposeful communication and smart decision-making capabilities. With greater brain power and continuous coaching, Infosys Cortex helps employees make better and faster decisions on their journey from new hire to experienced agent.

Infosys Cortex leverages NVIDIA Riva, a cutting-edge speech and translation AI SDK, to power language neutralization capabilities. The world-class accuracy of Riva automatic speech recognition (ASR), neural machine translation (NMT), and engaging speech synthesis empower accurate and natural communication. Based on NVIDIA GPUs for model fine-tuning and processing, Riva enables a high-performance solution for contact centers.

Cortex platform features

The microservices-based architecture of Infosys Cortex includes five key modules that offer the following features (Figure 1):

  1. Cortex Core: Sense, analyze, and generate actionable insights from data, and build new customer contexts along the way.
  2. Learn: Enable agent training with simulated learning features; based on historical call pipelines, training bank creations, learn-and-practice models, and follow-up actions.
  3. Empower: Provide proactive assistance to customers and agents using intelligent nudges based on transaction details, compliance, and real-time sentiment analysis to suggest the next best action.
  4. Experience: Integrate with CTI/IVR to create contact flows for self-service, virtual assistance, and intelligent routing to enhance the customer experience.
  5. Optimize: Generate insights through analyzing customer sentiment and interaction as well as agent behavior and performance.
Figure 1. Infosys Cortex, an AI-driven customer engagement platform, provides cloud-based open architecture, omnichannel integration, automation, and data-driven intelligence

Benefits and advantages

Riva services have been instrumental in addressing the key challenges Infosys has faced in relying on contact centers for customer service (Figure 2). The following are some key areas Riva addresses:

  • Accuracy: Domain-specific language and product-name customization, along with fine-tuning for different accents and pronunciations, enable future-proof solutions.
  • Language barrier: Support for 12 languages—Arabic, Chinese, English (US/UK), French, German, Italian, Japanese, Korean, Portuguese, Russian, and Spanish (LATAM/Spain)—with consistent addition of support for new languages.
  • Data privacy:  On-premises deployment enables mitigation of data privacy issues, helping to ensure that sensitive data is kept secure.
  • Cost reduction: High-performance, efficient Riva models, along with flexible licensing, enable creation of cost-effective solutions as volumes increase.
  • Control: Better means of improving Riva models with phonetic boosting and transfer learning for specific domains.
Figure 2. Seamless language neutralization powered by NVIDIA Riva speech services transforms incoming audio into transcribed, translated, and agent-ready information

Overall, the advantages Riva models offer over managed services on the cloud include data privacy, predictable pricing, and better performance. In addition, the fine-tuning capabilities of Riva models enable further improvement of the model performance.

Language neutralization requires real-time integration with the CTI audio streams, and latency negatively impacts the experience. Riva on-premises models’ low latency is crucial, as every response must deal with transcription, translation, and synthesis flows at least once.

Key takeaways

Language neutralization is a transformative approach for contact centers, providing a scalable, cost-effective, and efficient solution for multilingual support.

The powerful language neutralization offered by Infosys Cortex and based on NVIDIA Riva speech and translation enables contact center agents to communicate effectively with customers and prevent misunderstandings and ambiguities.

Smoother customer-agent interaction leads to faster handling of issues and a reduction in wait time and backlog. Overall, the reduction in communication-based barriers results in contact centers reducing costs and increasing consistency, thus leading to greater customer satisfaction.

Developers can try Riva containers and pretrained models with a 90-day free trial through NGC. For production deployments, get unlimited usage on all clouds, enterprise-grade support, security, and API stability with the purchase of Riva, a premium edition of the NVIDIA AI Enterprise platform. Learn more.

1Gartner, Quick Answer: How Should Organizations Prepare for the Addition of Generative AI to the Microsoft Stack?, G00790185, 3/16/2023. GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.

Categories
Offsites

But what is the Central Limit Theorem?

Categories
Offsites

Why π is in the normal distribution (beyond integral tricks)

Categories
Offsites

Larger language models do in-context learning differently

There have recently been tremendous advances in language models, partly because they can perform tasks with strong performance via in-context learning (ICL), a process whereby models are prompted with a few examples of input-label pairs before performing the task on an unseen evaluation example. In general, models’ success at in-context learning is enabled by:

  • Their use of semantic prior knowledge from pre-training to predict labels while following the format of in-context examples (e.g., seeing examples of movie reviews with “positive sentiment” and “negative sentiment” as labels and performing sentiment analysis using prior knowledge).
  • Learning the input-label mappings in context from the presented examples (e.g., finding a pattern that positive reviews should be mapped to one label, and negative reviews should be mapped to a different label).

In “Larger language models do in-context learning differently”, we aim to learn about how these two factors (semantic priors and input-label mappings) interact with each other in ICL settings, especially with respect to the scale of the language model that’s used. We investigate two settings to study these two factors — ICL with flipped labels (flipped-label ICL) and ICL with semantically-unrelated labels (SUL-ICL). In flipped-label ICL, labels of in-context examples are flipped so that semantic priors and input-label mappings disagree with each other. In SUL-ICL, labels of in-context examples are replaced with words that are semantically unrelated to the task presented in-context. We found that overriding prior knowledge is an emergent ability of model scale, as is the ability to learn in-context with semantically-unrelated labels. We also found that instruction tuning strengthens the use of prior knowledge more than it increases the capacity to learn input-label mappings.

An overview of flipped-label ICL and semantically-unrelated label ICL (SUL-ICL), compared with regular ICL, for a sentiment analysis task. Flipped-label ICL uses flipped labels, forcing the model to override semantic priors in order to follow the in-context examples. SUL-ICL uses labels that are not semantically related to the task, which means that models must learn input-label mappings in order to perform the task because they can no longer rely on the semantics of natural language labels.
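
To make the three settings concrete, here is an illustrative way to construct the prompts for a toy sentiment task (the exemplar texts are invented for this example; the paper's actual prompt formats may differ):

exemplars = [
    ("This movie was fantastic", "positive"),
    ("I regret watching this", "negative"),
]

def build_prompt(exemplars, query, label_map):
    blocks = [f"Input: {text}\nLabel: {label_map[label]}" for text, label in exemplars]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)

# Regular ICL: natural language labels agree with semantic priors.
regular = build_prompt(exemplars, "A truly moving film",
                       {"positive": "positive", "negative": "negative"})
# Flipped-label ICL: in-context labels contradict semantic priors.
flipped = build_prompt(exemplars, "A truly moving film",
                       {"positive": "negative", "negative": "positive"})
# SUL-ICL: labels carry no semantic information about the task.
sul_icl = build_prompt(exemplars, "A truly moving film",
                       {"positive": "foo", "negative": "bar"})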

Experiment design

For a diverse dataset mixture, we experiment on seven natural language processing (NLP) tasks that have been widely used: sentiment analysis, subjective/objective classification, question classification, duplicated-question recognition, entailment recognition, financial sentiment analysis, and hate speech detection. We test five language model families: PaLM, Flan-PaLM, GPT-3, InstructGPT, and Codex.

Flipped labels

In this experiment, labels of in-context examples are flipped, meaning that prior knowledge and input-label mappings disagree (e.g., sentences containing positive sentiment labeled as “negative sentiment”), thereby allowing us to study whether models can override their priors. In this setting, models that are able to override prior knowledge and learn input-label mappings in-context should experience a decrease in performance (since ground-truth evaluation labels are not flipped).

The ability to override semantic priors when presented with flipped in-context example labels emerges with model scale. Smaller models cannot flip predictions to follow flipped labels (performance only decreases slightly), while larger models can do so (performance decreases to well below 50%).

We found that when no labels are flipped, larger models have better performance than smaller models (as expected). But as we flip more and more labels, the performance of small models stays relatively flat, while large models experience large performance drops to well below random guessing (e.g., 90% → 22.5% for code-davinci-002).

These results indicate that large models can override prior knowledge from pre-training when contradicting input-label mappings are presented in-context. Small models can’t do this, making this ability an emergent phenomenon of model scale.

Semantically-unrelated labels

In this experiment, we replace labels with semantically-irrelevant ones (e.g., for sentiment analysis, we use “foo/bar” instead of “negative/positive”), which means that the model can only perform ICL by learning from input-label mappings. If a model mostly relies on prior knowledge for ICL, then its performance should decrease after this change since it will no longer be able to use semantic meanings of labels to make predictions. A model that can learn input–label mappings in-context, on the other hand, would be able to learn these semantically-unrelated mappings and should not experience a major drop in performance.

Small models rely more on semantic priors than large models do, as indicated by the greater decrease in performance for small models than for large models when using semantically-unrelated labels (i.e., targets) instead of natural language labels. For each plot, models are shown in order of increasing model size (e.g., for GPT-3 models, a is smaller than b, which is smaller than c).

Indeed, we see that using semantically-unrelated labels results in a greater performance drop for small models. This suggests that smaller models primarily rely on their semantic priors for ICL rather than learning from the presented input-label mappings. Large models, on the other hand, have the ability to learn input-label mappings in-context when the semantic nature of labels is removed.

We also find that including more in-context examples (i.e., exemplars) results in a greater performance improvement for large models than it does for small models, indicating that large models are better at learning from in-context examples than small models are.

In the SUL-ICL setup, larger models benefit more from additional examples than smaller models do.

Instruction tuning

Instruction tuning is a popular technique for improving model performance, which involves tuning models on various NLP tasks that are phrased as instructions (e.g., “Question: What is the sentiment of the following sentence, ‘This movie is great.’ Answer: Positive”). Since the process uses natural language labels, however, an open question is whether it improves the ability to learn input-label mappings or whether it strengthens the ability to recognize and apply semantic prior knowledge. Both of these would lead to an improvement in performance on standard ICL tasks, so it’s unclear which of these occur.

We study this question by running the same two setups as before, only this time we focus on comparing standard language models (specifically, PaLM) with their instruction-tuned variants (Flan-PaLM).

First, we find that Flan-PaLM is better than PaLM when we use semantically-unrelated labels. This effect is very prominent in small models, as Flan-PaLM-8B outperforms PaLM-8B by 9.6% and almost catches up to PaLM-62B. This trend suggests that instruction tuning strengthens the ability to learn input-label mappings, which isn’t particularly surprising.

Instruction-tuned language models are better at learning input–label mappings than pre-training–only language models are.

More interestingly, we saw that Flan-PaLM is actually worse than PaLM at following flipped labels, meaning that the instruction tuned models were unable to override their prior knowledge (Flan-PaLM models don’t reach below random guessing with 100% flipped labels, but PaLM models without instruction tuning can reach 31% accuracy in the same setting). These results indicate that instruction tuning must increase the extent to which models rely on semantic priors when they’re available.

Instruction-tuned models are worse than pre-training–only models at learning to override semantic priors when presented with flipped labels in-context.

Combined with the previous result, we conclude that although instruction tuning improves the ability to learn input-label mappings, it strengthens the usage of semantic prior knowledge more.

Conclusion

We examined the extent to which language models learn in-context by utilizing prior knowledge learned during pre-training versus input-label mappings presented in-context.

We first showed that large language models can learn to override prior knowledge when presented with enough flipped labels, and that this ability emerges with model scale. We then found that successfully doing ICL using semantically-unrelated labels is another emergent ability of model scale. Finally, we analyzed instruction-tuned language models and saw that instruction tuning improves the capacity to learn input-label mappings but also strengthens the use of semantic prior knowledge even more.

Future work

These results underscore how the ICL behavior of language models can change depending on their scale, and that larger language models have an emergent ability to map inputs to many types of labels, a form of reasoning in which input-label mappings can potentially be learned for arbitrary symbols. Future research could help provide insights on why these phenomena occur with respect to model scale.

Acknowledgements

This work was conducted by Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, and Tengyu Ma. We would like to thank Sewon Min and our fellow collaborators at Google Research for their advice and helpful discussions.

Categories
Offsites

Using reinforcement learning for dynamic planning in open-ended conversations

As virtual assistants become ubiquitous, users increasingly interact with them to learn about new topics or obtain recommendations and expect them to deliver capabilities beyond narrow dialogues of one or two turns. Dynamic planning, namely the capability to look ahead and replan based on the flow of the conversation, is an essential ingredient for the making of engaging conversations with the deeper, open-ended interactions that users expect.

While large language models (LLMs) are now beating state-of-the-art approaches in many natural language processing benchmarks, they are typically trained to output the next best response, rather than planning ahead, which is required for multi-turn interactions. However, in the past few years, reinforcement learning (RL) has delivered incredible results addressing specific problems that involve dynamic planning, such as winning games and protein folding.

Today, we are sharing our recent advances in dynamic planning for human-to-assistant conversations, in which we enable an assistant to plan a multi-turn conversation towards a goal and adapt that plan in real-time by adopting an RL-based approach. Here we look at how to improve long interactions by applying RL to compose answers based on information extracted from reputable sources, rather than relying on content generated by a language model. We expect that future versions of this work could combine LLMs and RL in multi-turn dialogues. The deployment of RL “in the wild” in a large-scale dialogue system proved a formidable challenge due to the modeling complexity, tremendously large state and action spaces, and significant subtlety in designing reward functions.

What is dynamic planning?

Many types of conversations, from gathering information to offering recommendations, require a flexible approach and the ability to modify the original plan for the conversation based on its flow. This ability to shift gears in the middle of a conversation is known as dynamic planning, as opposed to static planning, which refers to a more fixed approach. In the conversation below, for example, the goal is to engage the user by sharing interesting facts about cool animals. To begin, the assistant steers the conversation to sharks via a sound quiz. Given the user’s lack of interest in sharks, the assistant then develops an updated plan and pivots the conversation to sea lions, lions, and then cheetahs.

The assistant dynamically modifies its original plan to talk about sharks and shares facts about other animals.

Dynamic composition

To cope with the challenge of conversational exploration, we separate the generation of assistant responses into two parts: 1) content generation, which extracts relevant information from reputable sources, and 2) flexible composition of such content into assistant responses. We refer to this two-part approach as dynamic composition. Unlike LLM methods, this approach gives the assistant the ability to fully control the source, correctness, and quality of the content that it may offer. At the same time, it can achieve flexibility via a learned dialogue manager that selects and combines the most appropriate content.

In an earlier paper, “Dynamic Composition for Conversational Domain Exploration”, we describe a novel approach which consists of: (1) a collection of content providers, which offer candidates from different sources, such as news snippets, knowledge graph facts, and questions; (2) a dialogue manager; and (3) a sentence fusion module. Each assistant response is incrementally constructed by the dialogue manager, which selects candidates proposed by the content providers. The selected sequence of utterances is then fused into a cohesive response.

Dynamic planning using RL

At the core of the assistant response composition loop is a dialogue manager trained using off-policy RL, namely an algorithm that evaluates and improves a policy that is different from the policy used by the agent (in our case, the latter is based on a supervised model). Applying RL to dialogue management presents several challenges, including a large state space (as the state represents the conversation state, which needs to account for the whole conversation history) and an effectively unbounded action space (that may include all existing words or sentences in natural language).

We address these challenges using a novel RL construction. First, we leverage powerful supervised models — specifically, recurrent neural networks (RNNs) and transformers — to provide a succinct and effective dialogue state representation. These state encoders are fed with the dialogue history, composed of a sequence of user and assistant turns, and output a representation of the dialogue state in the form of a latent vector.

Second, we use the fact that a relatively small set of reasonable candidate utterances or actions can be generated by content providers at each conversation turn, and limit the action space to these. Whereas the action space is typically fixed in RL settings, because all states share the same action space, ours is a non-standard space in which the candidate actions may differ with each state, since content providers generate different actions depending on the dialogue context. This puts us in the realm of stochastic action sets, a framework that formalizes cases where the set of actions available in each state is governed by an exogenous stochastic process, which we address using Stochastic Action Q-Learning, a variant of the Q-learning approach. Q-learning is a popular off-policy RL algorithm, which does not require a model of the environment to evaluate and improve the policy. We trained our model on a corpus of crowd-compute–rated conversations obtained using a supervised dialogue manager.
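
A toy sketch of the idea follows (state encodings, candidate generation, and rewards are placeholders; the production system uses supervised neural state encoders and crowd-rated conversations):

import random

def q_value(q_table, state, action):
    return q_table.get((state, action), 0.0)

def select_action(q_table, state, candidates, epsilon=0.1):
    # Epsilon-greedy over the candidate utterances available in this state only.
    if random.random() < epsilon:
        return random.choice(candidates)
    return max(candidates, key=lambda a: q_value(q_table, state, a))

def q_update(q_table, state, action, reward, next_state, next_candidates,
             alpha=0.1, gamma=0.9):
    # The bootstrap target maximizes only over the next state's candidate set,
    # which differs from turn to turn (stochastic action sets).
    best_next = max((q_value(q_table, next_state, a) for a in next_candidates), default=0.0)
    target = reward + gamma * best_next
    old = q_value(q_table, state, action)
    q_table[(state, action)] = old + alpha * (target - old)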

Given the current dialogue history and a new user query, content providers generate candidates from which the assistant selects one. This process runs in a loop, and at the end the selected utterances are fused into a cohesive response.

Reinforcement learning model evaluation

We compared our RL dialogue manager with a launched supervised transformer model in an experiment using Google Assistant, which conversed with users about animals. A conversation starts when a user triggers the experience by asking an animal-related query (e.g., “How does a lion sound?”). The experiment was conducted using an A/B testing protocol, in which a small percentage of Assistant users were randomly sampled to interact with our RL-based assistant while other users interacted with the standard assistant.

We found that the RL dialogue manager conducts longer, more engaging conversations. It increases conversation length by 30% while improving user engagement metrics. We see an increase of 8% in cooperative responses to the assistant’s questions — e.g., “Tell me about lions,” in response to “Which animal do you want to hear about next?” Although there is also a large increase in nominally “non-cooperative” responses (e.g., “No,” as a reply to a question proposing additional content, such as “Do you want to hear more?”), this is expected as the RL agent takes more risks by asking pivoting questions. While a user may not be interested in the conversational direction proposed by the assistant (e.g., pivoting to another animal), the user will often continue to engage in a dialogue about animals.

From the non-cooperative user response in the 3rd turn (“No.”) and the query “Make a dog sound,” in the 5th turn, the assistant recognizes that the user is mostly interested in animal sounds and modifies its plan, providing sounds and sound quizzes.

In addition, some user queries contain explicit positive (e.g., “Thank you, Google,” or “I’m happy.”) or negative (e.g., “Shut up,” or “Stop.”) feedback. While an order of magnitude fewer than other queries, they offer a direct measure of user (dis)satisfaction. The RL model increases explicit positive feedback by 32% and reduces negative feedback by 18%.

Learned dynamic planning characteristics and strategies

We observe several characteristics of the (unseen) learned RL plan that improve user engagement while conducting longer conversations. First, the RL-based assistant ends 20% more turns in questions, prompting the user to choose additional content. It also better harnesses content diversity, including facts, sounds, quizzes, yes/no questions, open questions, etc. On average, the RL assistant uses 26% more distinct content providers per conversation than the supervised model.

Two observed RL planning strategies are related to the existence of sub-dialogues with different characteristics. Sub-dialogues about animal sounds are poorer in content and exhibit entity pivoting at every turn (i.e., after playing the sound of a given animal, we can either suggest the sound of a different animal or quiz the user about other animal sounds). In contrast, sub-dialogues involving animal facts typically contain richer content and have greater conversation depth. We observe that RL favors the richer experience of the latter, selecting 31% more fact-related content. Lastly, when restricting analysis to fact-related dialogues, the RL assistant exhibits 60% more focus-pivoting turns, that is, conversational turns that change the focus of the dialogue.

Below, we show two example conversations, one conducted by the supervised model (left) and the second by the RL model (right), in which the first three user turns are identical. With a supervised dialogue manager, after the user declined to hear about “today’s animal”, the assistant pivots back to animal sounds to maximize the immediate user satisfaction. While the conversation conducted by the RL model begins identically, it exhibits a different planning strategy to optimize the overall user engagement, introducing more diverse content, such as fun facts.

In the left conversation, conducted by the supervised model, the assistant maximizes the immediate user satisfaction. The right conversation, conducted by the RL model, shows different planning strategies to optimize the overall user engagement.

Future research and challenges

In the past few years, LLMs trained for language understanding and generation have demonstrated impressive results across multiple tasks, including dialogue. We are now exploring the use of an RL framework to empower LLMs with the capability of dynamic planning so that they can dynamically plan ahead and delight users with a more engaging experience.

Acknowledgements

The work described is co-authored by: Moonkyung Ryu, Yinlam Chow, Orgad Keller, Ido Greenberg, Avinatan Hassidim, Michael Fink, Yossi Matias, Idan Szpektor and Gal Elidan. We would like to thank: Roee Aharoni, Moran Ambar, John Anderson, Ido Cohn, Mohammad Ghavamzadeh, Lotem Golany, Ziv Hodak, Adva Levin, Fernando Pereira, Shimi Salant, Shachar Shimoni, Ronit Slyper, Ariel Stolovich, Hagai Taitelbaum, Noam Velan, Avital Zipori and the CrowdCompute team led by Ashwin Kakarla. We thank Sophie Allweis for her feedback on this blogpost and Tom Small for the visualization.

Responsible AI at Google Research: PAIR

PAIR (People + AI Research) first launched in 2017 with the belief that “AI can go much further — and be more useful to all of us — if we build systems with people in mind at the start of the process.” We continue to focus on making AI more understandable, interpretable, fun, and usable by more people around the world. It’s a mission that is particularly timely given the emergence of generative AI and chatbots.

Today, PAIR is part of the Responsible AI and Human-Centered Technology team within Google Research, and our work spans this larger research space: We advance foundational research on human-AI interaction (HAI) and machine learning (ML); we publish educational materials, including the PAIR Guidebook and Explorables (such as the recent Explorable looking at how and why models sometimes make incorrect predictions confidently); and we develop software tools like the Learning Interpretability Tool to help people understand and debug ML behaviors. Our inspiration this year is “changing the way people think about what THEY can do with AI.” This vision is inspired by the rapid emergence of generative AI technologies, such as large language models (LLMs) that power chatbots like Bard, and new generative media models like Google’s Imagen, Parti, and MusicLM. In this blog post, we review recent PAIR work that is changing the way we engage with AI.

Generative AI research

Generative AI is creating a lot of excitement, and PAIR is involved in a range of related research, from using language models to create generative agents to studying how artists adopted generative image models like Imagen and Parti. These latter “text-to-image” models let a person input a text-based description of an image for the model to generate (e.g., “a gingerbread house in a forest in a cartoony style”). In a forthcoming paper titled “The Prompt Artists” (to appear in Creativity and Cognition 2023), we found that users of generative image models strive not only to create beautiful images, but also to create unique, innovative styles. To help achieve these styles, some would even seek unique vocabulary to help develop their visual style. For example, they may visit architectural blogs to learn what domain-specific vocabulary they can adopt to help produce distinctive images of buildings.

We are also researching solutions to challenges faced by prompt creators who, with generative AI, are essentially programming without using a programming language. As an example, we developed new methods for extracting semantically meaningful structure from natural language prompts. We have applied these structures to prompt editors to provide features similar to those found in other programming environments, such as semantic highlighting, autosuggest, and structured data views.

The growth of generative LLMs has also opened up new techniques to solve important long-standing problems. Agile classifiers are one approach we’re taking to leverage the semantic and syntactic strengths of LLMs to solve classification problems related to safer online discourse, such as nimbly blocking newer types of toxic language as quickly as it evolves online. The big advance here is the ability to develop high-quality classifiers from very small datasets, as small as 80 examples. This suggests a positive future for online discourse and better moderation of it: instead of collecting millions of examples over months or years in an attempt to create universal safety classifiers for all use cases, more agile classifiers can be created by individuals or small organizations, tailored to their specific use cases, and iterated on and adapted in the span of a day (e.g., to block a new kind of harassment being received or to correct unintended biases in models). As an example of their utility, these methods recently won a SemEval competition to identify and explain sexism.
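As a rough illustration of the agile-classifier workflow (not the LLM-based method itself), the sketch below trains a lightweight text classifier from a handful of labeled examples. In practice the published approach builds on LLM representations and roughly 80 examples; the bag-of-words features here are only a stand-in so the snippet runs anywhere.

```python
# Hedged sketch of the "agile classifier" workflow: a policy owner labels a
# few dozen examples and trains a lightweight classifier the same day.
# The published work builds on LLM representations; the TF-IDF features
# below are only a stand-in so the example is self-contained.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# ~80 labeled examples in practice; a handful shown here for illustration.
texts = [
    "you are brilliant, thanks for the help",     # ok
    "what a thoughtful reply",                    # ok
    "nobody wants you here, get lost",            # violates policy
    "people like you ruin every discussion",      # violates policy
]
labels = [0, 0, 1, 1]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["get lost, nobody asked you"]))  # likely flagged: [1]
```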

We’ve also developed new state-of-the-art explainability methods to identify the role of training data in model behaviors and misbehaviors. By combining training data attribution methods with agile classifiers, we also found that we can identify mislabeled training examples. This makes it possible to reduce the noise in training data, leading to significant improvements in model accuracy.
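One common flavor of training-data attribution is a gradient dot-product (TracIn-style) score, in which a training example’s “self-influence” tends to be unusually large for mislabeled points. The sketch below shows that idea on a toy logistic-regression problem; it is a generic illustration under those assumptions, not the specific attribution method referenced above.

```python
# Hedged sketch: gradient-based self-influence as a mislabeled-example detector.
# Mislabeled points tend to have large loss gradients at the trained weights.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = (X @ true_w > 0).astype(float)
y[:5] = 1.0 - y[:5]                      # deliberately flip 5 labels ("noise")

w = np.zeros(d)
for _ in range(500):                     # plain logistic-regression training
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (p - y) / n

p = 1.0 / (1.0 + np.exp(-X @ w))
per_example_grads = (p - y)[:, None] * X          # d(loss_i)/dw for each example i
self_influence = (per_example_grads ** 2).sum(axis=1)

# The flipped indices 0-4 typically rank near the top of this list.
suspects = np.argsort(-self_influence)[:5]
print("flagged as possibly mislabeled:", sorted(suspects.tolist()))
```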

Collectively, these methods are critical to help the scientific community improve generative models. They provide techniques for fast and effective content moderation and dialogue safety methods that help support creators whose content is the basis for generative models’ amazing outcomes. In addition, they provide direct tools to help debug model misbehavior which leads to better generation.

Visualization and education

To lower barriers in understanding ML-related work, we regularly design and publish highly visual, interactive online essays, called AI Explorables, that provide accessible, hands-on ways to learn about key ideas in ML. For example, we recently published new AI Explorables on the topics of model confidence and unintended biases. In our latest Explorable, “From Confidently Incorrect Models to Humble Ensembles,” we discuss the problem with model confidence: models can sometimes be very confident in their predictions… and yet completely incorrect. Why does this happen and what can be done about it? Our Explorable walks through these issues with interactive examples and shows how we can build models that have more appropriate confidence in their predictions by using a technique called ensembling, which works by averaging the outputs of multiple models. Another Explorable, “Searching for Unintended Biases with Saliency”, shows how spurious correlations can lead to unintended biases — and how techniques such as saliency maps can detect some biases in datasets, with the caveat that it can be difficult to see bias when it’s more subtle and sporadic in a training set.
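For readers who want to see the ensembling idea in code, the minimal sketch below averages the class probabilities of a few hypothetical models; the averaged distribution is typically less confidently wrong than any single overconfident member.

```python
# Minimal illustration of ensembling for better-calibrated confidence:
# averaging the predicted class probabilities of several independently
# trained models tempers overconfident single-model predictions.
import numpy as np

# Hypothetical class-probability outputs of three models on one input.
model_probs = np.array([
    [0.98, 0.01, 0.01],   # very confident in class 0
    [0.30, 0.60, 0.10],   # disagrees: prefers class 1
    [0.40, 0.45, 0.15],
])

ensemble = model_probs.mean(axis=0)
print(ensemble)                       # [0.56, 0.3533, 0.0867] (approximately)
print("top class:", ensemble.argmax(), "confidence:", round(ensemble.max(), 2))
```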

PAIR designs and publishes AI Explorables, interactive essays on timely topics and new methods in ML research, such as “From Confidently Incorrect Models to Humble Ensembles,” which looks at how and why models offer incorrect predictions with high confidence, and how “ensembling” the outputs of many models can help avoid this.

Transparency and the Data Cards Playbook

Continuing to advance our goal of helping people to understand ML, we promote transparent documentation. In the past, PAIR and Google Cloud developed model cards. Most recently, we presented our work on Data Cards at ACM FAccT’22 and open-sourced the Data Cards Playbook, a joint effort with the Technology, AI, Society, and Culture team (TASC). The Data Cards Playbook is a toolkit of participatory activities and frameworks to help teams and organizations overcome obstacles when setting up a transparency effort. It was created using an iterative, multidisciplinary approach rooted in the experiences of over 20 teams at Google, and comes with four modules: Ask, Inspect, Answer and Audit. These modules contain a variety of resources that can help you customize Data Cards to your organization’s needs:

  • 18 Foundations: Scalable frameworks that anyone can use on any dataset type
  • 19 Transparency Patterns: Evidence-based guidance to produce high-quality Data Cards at scale
  • 33 Participatory Activities: Cross-functional workshops to navigate transparency challenges for teams
  • Interactive Lab: Generate interactive Data Cards from markdown in the browser

The Data Cards Playbook is accessible as a learning pathway for startups, universities, and other research groups.

Software Tools

Our team thrives on creating tools, toolkits, libraries, and visualizations that expand access and improve understanding of ML models. One such resource is Know Your Data, which lets researchers interactively and qualitatively explore datasets, test a model’s performance across scenarios, and find and fix unintended dataset biases.

Recently, PAIR released a new version of the Learning Interpretability Tool (LIT) for model debugging and understanding. LIT v0.5 provides support for image and tabular data, new interpreters for tabular feature attribution, a “Dive” visualization for faceted data exploration, and performance improvements that allow LIT to scale to 100k dataset entries. You can find the release notes and code on GitHub.

PAIR’s Learning Interpretability Tool (LIT), an open-source platform for visualization and understanding of ML models.

PAIR has also contributed to MakerSuite, a tool for rapid prototyping with LLMs using prompt programming. MakerSuite builds on our earlier research on PromptMaker, which won an honorable mention at CHI 2022. MakerSuite lowers the barrier to prototyping ML applications by broadening the types of people who can author these prototypes and by shortening the time spent prototyping models from months to minutes. 

A screenshot of MakerSuite, a tool for rapidly prototyping new ML models using prompt-based programming, which grew out of PAIR’s prompt programming research.

Ongoing work

As the world of AI moves quickly ahead, PAIR is excited to continue to develop new tools, research, and educational materials to help change the way people think about what THEY can do with AI.

For example, we recently conducted an exploratory study with five designers (presented at CHI this year) that looks at how people with no ML programming experience or training can use prompt programming to quickly prototype functional user interface mock-ups. This prototyping speed can help inform designers on how to integrate ML models into products, and enables them to conduct user research sooner in the product design process.

Based on this study, PAIR’s researchers built PromptInfuser, a design tool plugin for authoring LLM-infused mock-ups. The plugin introduces two novel LLM interactions: input-output, which makes content interactive and dynamic, and frame-change, which directs users to different frames depending on their natural language input. The result is more tightly integrated UI and ML prototyping, all within a single interface.

Recent advances in AI represent a significant shift in how easy it is for researchers to customize and control models for their research objectives and goals. These capabilities are transforming the way we think about interacting with AI, and they create many new opportunities for the research community. PAIR is excited about how we can leverage these capabilities to make AI easier to use for more people.

Acknowledgements

Thanks to everyone in PAIR, to Reena Jana and to all of our collaborators.

Sparse video tubes for joint video and image vision transformers

Video understanding is a challenging problem that requires reasoning about both spatial information (e.g., for objects in a scene, including their locations and relations) and temporal information for activities or events shown in a video. There are many video understanding applications and tasks, such as understanding the semantic content of web videos and robot perception. However, current works, such as ViViT and TimeSFormer, densely process the video and require significant compute, especially as model size, video length, and resolution increase.

In “Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning”, to be presented at CVPR 2023, we introduce a simple technique that turns a Vision Transformer (ViT) model image encoder into an efficient video backbone using sparse video tubes (learnable visual representations of samples from the video) to reduce the model’s compute needs. This approach can seamlessly process both images and videos, which allows it to leverage both image and video data sources during training. This training further enables our sparse tubes ViT model to coalesce image and video backbones together to serve a dual role as either an image or video backbone (or both), depending on the input. We demonstrate that this model is scalable, can be adapted to large pre-trained ViTs without requiring full fine-tuning, and achieves state-of-the-art results across many video classification benchmarks.

Using sparse video tubes to sample a video, combined with a standard ViT encoder, leads to an efficient visual representation that can be seamlessly shared with image inputs.

Building a joint image-video backbone

Our sparse tube ViT uses a standard ViT backbone, consisting of a stack of Transformer layers, that processes video information. Previous methods, such as ViViT, densely tokenize the video and then apply factorized attention, i.e., the attention weights for each token are computed separately for the temporal and spatial dimensions. In the standard ViT architecture, self-attention is computed over the whole token sequence. When using videos as input, token sequences become quite long, which can make this computation slow. Instead, in the method we propose, the video is sparsely sampled using video tubes, which are 3D learnable visual representations of various shapes and sizes (described in more detail below) sampled from the video. These tubes are used to sparsely sample the video with a large temporal stride, i.e., a tube kernel is applied only at a few locations in the video, rather than at every pixel.

By sparsely sampling the video tubes, we can use the same global self-attention module, rather than factorized attention like ViViT. We experimentally show that adding factorized attention layers can harm performance due to their uninitialized weights. This single stack of transformer layers in the ViT backbone also enables better weight sharing and improves performance. Sparse video tube sampling is done by using large spatial and temporal strides that select tokens on a fixed grid. The large stride reduces the number of tokens in the full network, while still capturing both spatial and temporal information and enabling efficient processing of all tokens.
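A minimal sketch of the sparse sampling idea follows, with illustrative (not paper-exact) tube sizes and strides: a 3D kernel is applied only at coarse grid locations, so the resulting token count depends on the grid rather than on every pixel.

```python
# Hedged sketch of sparse tube sampling: a 3D kernel is applied only at
# locations on a coarse (time, height, width) grid. Shapes and strides
# here are illustrative, not the exact configuration from the paper.
import numpy as np

rng = np.random.default_rng(0)
T, H, W, C = 32, 64, 64, 3                 # toy video: frames x height x width x RGB
video = rng.normal(size=(T, H, W, C))

kt, kh, kw = 8, 8, 8                       # tube kernel size
st, sh, sw = 16, 32, 32                    # large strides -> sparse sampling
dim = 64                                   # token (embedding) dimension
kernel = rng.normal(size=(kt * kh * kw * C, dim)) * 0.02   # linear projection

tokens = []
for t in range(0, T - kt + 1, st):
    for y in range(0, H - kh + 1, sh):
        for x in range(0, W - kw + 1, sw):
            tube = video[t:t + kt, y:y + kh, x:x + kw, :]    # one 3D tube
            tokens.append(tube.reshape(-1) @ kernel)         # project to a token

tokens = np.stack(tokens)
print(tokens.shape)   # (8, 64): far fewer tokens than dense tokenization
```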

Sparse video tubes

Video tubes are 3D, grid-based cuboids that can have different shapes and capture different kinds of information; their strides and starting locations can overlap. In the model, we use three distinct tube shapes that capture: (1) only spatial information (resulting in a set of 2D image patches), (2) long temporal information (over a small spatial area), and (3) both spatial and temporal information equally. Tubes that capture only spatial information can be applied to both image and video inputs. Tubes that capture long temporal information or both temporal and spatial information equally are only applied to video inputs. Depending on the input video size, the three tube shapes are applied to the model multiple times to generate tokens.

A fixed position embedding, which captures the global location of each tube (including any strides, offsets, etc.) relative to all the other tubes, is applied to the video tubes. Unlike previously used learned position embeddings, this fixed one better enables sparse, overlapping sampling. Capturing the global location of the tube helps the model know where each tube came from, which is especially helpful when tubes overlap or are sampled from distant video locations. Next, the tube features are concatenated together to form a set of N tokens. These tokens are processed by a standard ViT encoder. Finally, we apply attention pooling to compress all the tokens into a single representation, which is input to a fully connected (FC) layer to make the classification (e.g., playing soccer, swimming, etc.).
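Putting the pieces together, the sketch below runs tube tokens through a fixed (sinusoidal) position embedding, a placeholder encoder standing in for the ViT stack, attention pooling, and a final fully connected classifier. All shapes, the simplistic position mixing, and the identity encoder are assumptions for illustration only.

```python
# Hedged end-to-end sketch of the pipeline described above. The real model
# uses a standard ViT transformer stack; `encoder` below is a placeholder.
import numpy as np

rng = np.random.default_rng(0)
num_tokens, dim, num_classes = 8, 64, 400      # e.g., 400 Kinetics-400 classes

def fixed_position_embedding(positions, dim):
    """Sinusoidal embedding of each tube's global (t, y, x) start location."""
    freqs = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    angles = positions.sum(axis=1, keepdims=True) * freqs   # simplistic mixing
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

tokens = rng.normal(size=(num_tokens, dim))            # from the tube sampling step
positions = rng.integers(0, 64, size=(num_tokens, 3))  # global (t, y, x) offsets
tokens = tokens + fixed_position_embedding(positions, dim)

encoder = lambda x: x                                   # stand-in for the ViT stack

encoded = encoder(tokens)
query = rng.normal(size=dim)                            # learned pooling query
attn = np.exp(encoded @ query); attn /= attn.sum()      # attention pooling weights
pooled = attn @ encoded                                 # single clip representation

fc = rng.normal(size=(dim, num_classes)) * 0.02
logits = pooled @ fc
print(logits.shape)   # (400,) class scores, e.g., "playing soccer"
```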

Our video ViT model works by sampling sparse video tubes from the video (shown at the bottom) to enable either or both image or video inputs to be seamlessly processed. These tubes have different shapes and capture different video features. Tube 1 (yellow) only captures spatial information, resulting in a set of 2D patches that can be applied to image inputs. Tube 2 (red) captures temporal information and some spatial information and tube 3 (green) equally captures both temporal and spatial information (i.e., the spatial size of the tube x and y are the same as the number of frames t). Tubes 2 and 3 can only be applied to video inputs. The position embedding is added to all the tube features.

Scaling video ViTs

The process of building video backbones is computationally intensive, but our sparse tube ViT model enables computationally efficient scaling of video models by leveraging previously trained image backbones. Since an image backbone can be adapted into a video backbone, large image backbones can be turned into large video backbones. More specifically, one can transfer the learned video feature representations from a small tube ViT to a large pre-trained image ViT and train the resulting model with video data for only a few steps, as opposed to a full training from scratch.

Our approach enables scaling a sparse tube ViT in a more efficient way. Specifically, the video features from a small video ViT (top network) can be transferred to a large, pre-trained image ViT (bottom network), and further fine-tuned. This requires fewer training steps to achieve strong performance with the large model. This is beneficial as large video models might be prohibitively expensive to train from scratch.

Results

We evaluate our sparse tube ViT approach using the Kinetics-400 (shown below), Kinetics-600, and Kinetics-700 datasets and compare its performance to a long list of prior methods. We find that our approach outperforms all prior methods. Importantly, it outperforms all state-of-the-art methods trained jointly on image+video datasets.

Performance compared to several prior works on the popular Kinetics-400 video dataset. Our sparse tube ViT outperforms state-of-the-art methods.

Furthermore, we test our sparse tube ViT model on the Something-Something V2 dataset, which is commonly used to evaluate more dynamic activities, and also report that it outperforms all prior state-of-the-art approaches.

Performance on the Something-Something V2 video dataset.

Visualizing some learned kernels

It is interesting to understand what kind of rudimentary features are being learned by the proposed model. We visualize them below, showing both the 2D patches, which are shared for both images and videos, and video tubes. These visualizations show the 2D or 3D information being captured by the projection layer. For example, in the 2D patches, various common features, like edges and colors, are detected, while the 3D tubes capture basic shapes and how they may change over time.

Visualizations of patches and tubes learned by the sparse tube ViT model. The top row shows the 2D patches, and the remaining two rows are snapshots from the learned video tubes. The tubes show each patch for the 8 or 4 frames to which they are applied.

Conclusions

We have presented a new sparse tube ViT, which can turn a ViT encoder into an efficient video model, and can seamlessly work with both image and video inputs. We also showed that large video encoders can be bootstrapped from small video encoders and image-only ViTs. Our approach outperforms prior methods across several popular video understanding benchmarks. We believe that this simple representation can facilitate much more efficient learning with input videos, seamlessly incorporate either image or video inputs and effectively eliminate the bifurcation of image and video models for future multimodal understanding.

Acknowledgements

This work is conducted by AJ Piergiovanni, Weicheng Kuo and Anelia Angelova, who are now at Google DeepMind. We thank Abhijit Ogale, Luowei Zhou, Claire Cui and our colleagues in Google Research for their helpful discussions, comments, and support.