I’m currently developing an image classifier, and to get started I’ve been following the tutorial on tensorflow.org.
I’m trying to train a CNN, but as soon as I call model.fit I get an OOM error. I’ve searched the internet, but none of the answers I’ve found have helped me.
To give you a bit of context, I’m using a mobile GTX 1050, and I suspect the problem is that it doesn’t have enough memory, so I’m asking for advice on how to successfully train my models.
Posted by Harsh Mehta, Software Engineer, Google Research
Matrix factorization is one of the oldest, yet still widely used, techniques for learning how to recommend items such as songs or movies from user ratings. In its basic form, it approximates a large, sparse (i.e., mostly empty) matrix of user-item interactions with a product of two smaller, denser matrices representing learned item and user features. These dense matrices, in turn, can be used to recommend items to a user with which they haven’t interacted before.
Despite its algorithmic simplicity, matrix factorization can still achieve competitive performance in recommender benchmarks. Alternating least squares (ALS), and especially its implicit variation, is a fundamental algorithm to learn the parameters of matrix factorization. ALS is known for its high efficiency because it scales linearly in the number of rows, columns and non-zeros. Hence, this algorithm is very well suited for large-scale challenges. But, for very large real-world matrix factorization datasets, a single machine implementation would not suffice, and so, it would require a large distributed system. Most of the distributed implementations of matrix factorization that employ ALS leverage off-the-shelf CPU devices, and rightfully so, due to the inherently sparse nature of the problem (the input matrix is mostly empty).
On the other hand, recent success of deep learning, which has exhibited growing computational capacity, has spurred a new wave of research and progress on hardware accelerators such as Tensor Processing Units (TPUs). TPUs afford domain specific hardware speedups, especially for use cases like deep learning, which involves a large number of dense matrix multiplications. In particular, they allow significant speedups for traditional data-parallel workloads, such as training models with Stochastic Gradient Descent (SGD) in SPMD (single program multiple data) fashion. The SPMD approach has gained popularity in computations like training neural networks with gradient descent algorithms, and can be used for both data-parallel and model-parallel computations, where we distribute parameters of the model across available devices. Nevertheless, while TPUs have been enormously attractive for methods based on SGD, it is not immediately clear if a high performance implementation of ALS, which requires a large number of distributed sparse matrix multiplies, can be developed for a large-scale cluster of TPU devices.
In “ALX: Large Scale Matrix Factorization on TPUs”, we explore a distributed ALS design that makes efficient use of the TPU architecture and can scale well to matrix factorization problems of the order of billions of rows and columns by scaling the number of available TPU cores. The approach we propose leverages a combination of model and data parallelism, where each TPU core both stores a portion of the embedding table and trains over a unique slice of data, grouped in mini-batches. In order to spur future research on large-scale matrix factorization methods and to illustrate the scalability properties of our own implementation, we also built and released a real world web link prediction dataset called WebGraph.
The figure shows the flow of data and computation through the ALX framework on TPU devices. Similar to SGD-based training procedures, each TPU core performs identical computation for its own batch of data in SPMD fashion, which allows for synchronous computation in parallel on multiple TPU cores. Each TPU core starts by gathering all the relevant item embeddings in the Sharded Gather stage. These materialized embeddings are used to solve for user embeddings, which are scattered to the relevant shard of the embedding table in the Sharded Scatter stage.
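For readers who want to see the shape of this flow in code, below is a minimal single-process sketch in Python/NumPy (not the actual ALX implementation, which runs in SPMD fashion across TPU cores) of one ALS half-iteration: gather the item embeddings a user has interacted with from their shards, solve the regularized least-squares problem for that user, and scatter the updated user embedding back to its shard. The sharding rule, helper names, and regularization value are assumptions made for illustration.

```python
import numpy as np

def shard_of(global_id, rows_per_shard):
    """Toy uniform sharding rule: contiguous blocks of rows per core (an assumption)."""
    return global_id // rows_per_shard, global_id % rows_per_shard

def als_half_iteration(batch, ratings, item_shards, user_shards, rows_per_shard, lam=0.1):
    """Single-process mimic of ALX's Sharded Gather -> solve -> Sharded Scatter flow.

    batch: list of (user_id, item_ids) pairs, i.e. one user's observed items.
    ratings: dict mapping (user_id, item_id) -> observed value.
    item_shards / user_shards: lists of np.ndarray shards of the embedding tables.
    """
    emb_dim = item_shards[0].shape[1]
    for user_id, item_ids in batch:
        # "Sharded Gather": pull the referenced item embeddings from their shards.
        H = np.stack([item_shards[s][r] for s, r in
                      (shard_of(i, rows_per_shard) for i in item_ids)])
        y = np.array([ratings[(user_id, i)] for i in item_ids])

        # Solve the regularized normal equations (H^T H + lam*I) w = H^T y.
        A = H.T @ H + lam * np.eye(emb_dim)
        w = np.linalg.solve(A, H.T @ y)

        # "Sharded Scatter": write the updated user embedding back to its shard.
        s, r = shard_of(user_id, rows_per_shard)
        user_shards[s][r] = w

# Tiny usage example: 2 shards of 4 rows each, embedding dimension 3.
rng = np.random.default_rng(0)
item_shards = [rng.normal(size=(4, 3)) for _ in range(2)]
user_shards = [np.zeros((4, 3)) for _ in range(2)]
batch = [(1, [0, 3, 5]), (6, [2, 7])]
ratings = {(u, i): 1.0 for u, items in batch for i in items}
als_half_iteration(batch, ratings, item_shards, user_shards, rows_per_shard=4)
```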
Dense Batching for Improved Efficiency We designed ALX specifically for TPUs, exploiting unique properties of TPU architecture while overcoming a few interesting limitations. For instance, each TPU core has limited memory and restricts all tensors to have a static shape, but each example in a mini-batch can have a wildly varying number of items (i.e., inputs can be long and sparse). To resolve this, we break exceedingly long examples into multiple smaller examples of the same shape, a process called dense batching. More details about dense batching can be found in our paper.
Illustrative example of how sparse batches are densified to increase efficiency on TPUs.
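The following toy snippet (Python/NumPy, not the ALX code) illustrates the basic idea: one long, sparse example is broken into several rows of a fixed length, with padding and a mask so every tensor keeps a static shape. The row length and padding value are arbitrary choices for the sketch.

```python
import numpy as np

def densify_example(item_ids, row_len, pad_id=0):
    """Break one long example into fixed-shape rows (a toy take on dense batching)."""
    rows, mask = [], []
    for start in range(0, len(item_ids), row_len):
        chunk = item_ids[start:start + row_len]
        pad = row_len - len(chunk)
        rows.append(np.pad(chunk, (0, pad), constant_values=pad_id))
        mask.append(np.pad(np.ones(len(chunk), bool), (0, pad)))
    return np.stack(rows), np.stack(mask)   # static shapes: (n_rows, row_len)

# Example: a user with 7 interactions, packed into rows of length 3.
rows, mask = densify_example(np.array([11, 42, 7, 99, 5, 23, 8]), row_len=3)
```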
Uniform Sharding of Embedding Tables With the batching problem solved, we next want to factorize a sparse matrix into two dense embedding matrices (e.g., user and item embeddings) such that the resulting dot product of embeddings approximate the original sparse matrix — this helps us infer predictions for all the positions from the original matrix, including those that were empty, which can be used to recommend items with which users haven’t interacted. Both the resulting embedding tables (W and H in the figure below) can potentially be too large to fit in a single TPU core, thus requiring a distributed training setup for most large-scale use cases.
Most previous attempts of distributed matrix factorization use a parameter server architecture where the model parameters are stored on highly available servers, and the training data is processed in parallel by workers that are solely responsible for the learning task. In our case, since each TPU core has identical compute and memory, it’s wasteful to only use either memory for storing model parameters or compute for training. Thus, we designed our system such that each core is used to do both.
Illustrative example of factorizing a sparse matrix Y into two dense embedding matrices W and H.
In ALX, we uniformly divide both embedding tables, thus fully exploiting both the size of distributed memory available and the dedicated low-latency interconnects between TPUs. This is highly efficient for very large embedding tables and results in good performance for distributed gather and scatter operations.
Uniform sharding of both embedding tables (W and H) across TPU cores (in blue).
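A minimal sketch of this layout, assuming a simple row-wise split into equally sized shards (the exact partitioning used by ALX may differ):

```python
import numpy as np

def uniformly_shard(table, num_cores):
    """Split an embedding table row-wise into equal shards, one per TPU core.

    Assumes the number of rows is a multiple of num_cores, as identical
    shapes on every core would require.
    """
    return np.split(table, num_cores, axis=0)

W_shards = uniformly_shard(np.zeros((1024, 128), np.float32), num_cores=8)
H_shards = uniformly_shard(np.zeros((2048, 128), np.float32), num_cores=8)
```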
WebGraph Since potential applications may involve very large data sets, scalability is potentially an important opportunity for advancement in matrix factorization. To that end, we also release a large real-world web link prediction dataset called WebGraph. This dataset can be easily modeled as a matrix factorization problem where rows and columns are source and destination links, respectively, and the task is to predict destination links from each source link. We use WebGraph to illustrate the scaling properties of ALX.
The WebGraph dataset was generated from a single crawl performed by CommonCrawl in 2021 where we strip everything and keep only the link->outlinks data. Since the performance of a factorization method depends on the properties of the underlying graph, we created six versions of WebGraph, each varying in the sparsity pattern and locale, to study how well ALS performs on each.
To study locale-specific graphs, we filter based on two top level domains: ‘de’ and ‘in’, each producing a graph with an order of magnitude fewer nodes.
These graphs can still have arbitrary sparsity patterns and dangling links. Thus we further filter the nodes in each graph to have a minimum of either 10 or 50 inlinks and outlinks.
For easy access, we have made these available as a Tensorflow Dataset package. For reference, the biggest version, WebGraph-sparse, has more than 365M nodes and 30B edges. We create and publish both training and testing splits for evaluation purposes.
Results We carefully tuned the system and quality parameters of ALX, guided by our observations related to precision and the choice of linear solvers. We observed that by carefully selecting the precision for storage of the embedding tables (bfloat16) and for the input to the linear solvers (float32), we were able to halve the memory required for the embeddings while still avoiding problems arising from lower precision values during the solve stage. For our linear solvers, we selected conjugate gradients, which we found to be the fastest across the board on TPUs. We use embeddings of dimension 128 and train the model for 16 epochs. In our experience, hyperparameter tuning over both the norm penalty (λ) and the unobserved weight (α) has been indispensable for good recall metrics, as shown in the table below.
Results obtained by running ALX on all versions of WebGraph dataset. Recall values of 1.0 denote perfect recall.
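To make the solver choice above concrete, here is a plain NumPy conjugate-gradient routine for the regularized normal equations that arise in each ALS solve. It is a sketch, not the ALX code: the low-precision storage is only suggested by a float32 cast, and the unobserved weight α of implicit ALS is omitted to keep it short.

```python
import numpy as np

def als_solve_cg(H_stored, y, lam, iters=25, tol=1e-6):
    """Conjugate-gradient solve of (H^T H + lam*I) w = H^T y for one ALS update.

    H_stored stands in for the low-precision (e.g., bfloat16) embedding block;
    it is cast to float32 before the solve, mirroring the precision split above.
    """
    H = H_stored.astype(np.float32)
    y = y.astype(np.float32)
    A = H.T @ H + lam * np.eye(H.shape[1], dtype=np.float32)
    b = H.T @ y

    w = np.zeros_like(b)
    r = b - A @ w
    p = r.copy()
    rs = float(r @ r)
    if rs == 0.0:
        return w
    for _ in range(iters):
        Ap = A @ p
        step = rs / float(p @ Ap)
        w += step * p
        r -= step * Ap
        rs_new = float(r @ r)
        if rs_new ** 0.5 < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return w

# Usage with float32 stand-ins for a bfloat16 embedding block:
H_block = np.random.rand(1000, 128).astype(np.float32)
y = np.random.rand(1000)
w = als_solve_cg(H_block, y, lam=0.5)
```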
Scaling Analysis Since the input data are processed in parallel across TPU cores, increasing the number of cores decreases training time, ideally in a linear fashion. But at the same time, a larger number of cores requires more network communication (due to the sharded embedding tables). Thanks to high-speed interconnects, this overhead can be negligible for a small number of cores, but as the number of cores increases, the overhead eventually slows down the ideal linear scaling.
In order to confirm our hypothesis, we analyze scaling properties of the four biggest WebGraph variants in terms of training time as we increase the number of available TPU cores. As shown below, even empirically, we do observe the predicted linear decrease in training time up to a sweet spot, after which the network overhead slows the decline.
Scaling analysis of running time as the number of TPU cores is increased. Each figure plots the time taken to train for one epoch in seconds.
Conclusion For easy access and reproducibility, the ALX code is open-sourced and can be easily run on Google Cloud. In fact, we illustrate that a sparse matrix like WebGraph-dense of size 135M x 135M (with 22B edges) can be factorized in a Colab connected to 8 TPU cores in less than a day. We have designed the ALX framework with scalability in mind. With 256 TPU cores, one epoch of the largest WebGraph variant, WebGraph-sparse (365M x 365M sparse matrix), takes around 20 minutes to finish (5.5 hours for the whole training run). The final model has around 100B parameters. We hope that ALX and WebGraph will be useful to both researchers and practitioners working in these fields. The code for ALX can be found on GitHub!
Acknowledgements The core team includes Steffen Rendle, Walid Krichene and Li Zhang. We thank many Google colleagues for helping at various stages of this project. In particular, we are grateful to the JAX team for numerous discussions, especially James Bradbury and Skye Wanderman-Milne; Blake Hechtman for help with XLA and Rasmus Larsen for useful discussions about performance of linear solvers on TPUs. Finally, we’re also grateful to Nicolas Mayoraz and John Anderson for providing useful feedback.
We present a new generation of neural operators, named U-FNO, that empowers a novel technology for solving multiphase flow problems with superior accuracy, speed, and data efficiency.
Climate change mitigation is about reducing greenhouse gas (GHG) emissions. The worldwide goal is to reach net zero, which means balancing the amount of GHG emissions produced and the amount removed from the atmosphere.
On the one hand, this implies reducing emissions by using low-carbon technologies and energy efficiency. On the other hand, it implies deploying negative emission technologies such as carbon storage, which is the subject of this post.
Carbon capture and storage (CCS) refers to a group of technologies that contribute to directly reducing emissions at their source in key power sectors such as coal and gas power plants and industrial plants. For emissions that cannot be reduced directly either because they are technically difficult or prohibitively expensive to eliminate, CCS underpins an important net-negative technological approach for removing carbon from the atmosphere.
If not being used on site, CO2 can be compressed and transported by pipeline, ship, rail, or truck. It can be used in a range of applications or injected into deep geological formations (including depleted oil and gas reservoirs or saline formations) that trap the CO2 for permanent storage. This unique dual ability of CCS makes it an essential solution among the energy transition technologies that mitigate climate change.
In addition to the role CCS plays in the energy transition, it is a solution for challenging emissions in heavy industries and addresses deep emissions reductions from hard-to-abate sectors like steel, fertilizer, and cement production. It can also support a cost-effective pathway for low-carbon blue hydrogen production.
In numbers, CCS facilities currently in operation can capture and permanently store around 40 Mt of CO2 every year. According to the International Energy Agency (IEA), achieving a climate outcome consistent with the Paris Agreement requires storing around 1150 Mt of CO2 per year by 2030. That is roughly a 30-fold increase (1150 / 40 ≈ 29) in storage capacity by 2030 to reduce the emissions of the power and industrial sectors.
Global macro-trends such as the rise of environment social governance (ESG) criteria are stimulating the implementation of the broadest portfolio of technologies, including CCS to achieve net-zero emissions at the lowest possible risk and cost. Investment incentives are therefore building unprecedented momentum behind CCS with plans for more than 100 new facilities already announced in 2021.
CO2 injection problem
Carbon must be stored somewhere. It is most often stored underground in a process called geological sequestration. A geological formation is only selected as a storage site under certain conditions to make sure there’s no significant risk of leakage and no significant environmental or health risk.
This involves injecting carbon dioxide into underground rock formations. It is stored as a supercritical fluid, meaning that it has properties between those of a gas and a liquid.
When CO2 is injected at depth into a reservoir, it remains in this supercritical condition as long as it stays in excess of 31.1° C and at a pressure in excess of 73.86 bar. This is true whether the reservoir is a saline formation or depleted oil and gas fields.
CO2 must be sealed under a capillary barrier so that carbon remains stored for hundreds of years or even indefinitely in a safe way. Otherwise, if CO2 leaks out in large quantities, it could potentially contaminate a nearby aquifer. If it leaks to ground surface, it can cause safety hazards to nearby humans or animals.
The overall performance of such storage can be predicted numerically by solving a multiphase flow problem. However, this requires solving highly nonlinear PDEs due to multiscale heterogeneities and complex thermodynamics.
The numerical simulation methodology to achieve this usually consists of several steps:
Collecting data and information about the subsurface geology and properties.
Building a geological model of the storage formation and its surroundings.
Building a dynamic model of the reservoir, which is used to simulate CO2 injection and CO2 evolution inside the reservoir. These dynamic simulations are used to evaluate and optimize key performance indicators related to reservoir conditions.
Traditional simulators can accurately simulate this complex problem but are expensive at sufficiently refined grid resolution. Machine learning models trained with numerical simulation data can provide a faster alternative to traditional simulators.
In this post, we highlight the results using the newly developed U-FNO machine learning model and show its superiority for CO2-water multiphase flow problems required for understanding and scaling CCS applications.
Simulation setup
We consider modeling gas saturation and pressure over 30 years of CO2 injection into a deep saline formation at a constant rate ranging from 0.2 to 2 Mt/year. The x-axis and y-axis are the reservoir thickness and reservoir radius in meters, respectively.
The setup is a realistic reservoir located at least 800 m below the ground surface (Figure 2). It enables reservoir simulation across various realistic depths, temperatures, formation thicknesses, injection patterns, rock properties, and formation geologies.
The numerical simulator Schlumberger ECLIPSE (e300) is used to develop the multiphase flow dataset for CO2 geological storage. Supercritical CO2 is injected through a vertical injection well, with various perforation interval designs, into a radially symmetrical system (r, z).
A novel Fourier neural operator
In a recent paper published in Advances in Water Resources, four types of machine learning model architectures are studied, including the original FNO, a state-of-the-art CNN benchmark, and the newly proposed U-FNO.
The goal of a neural operator is to learn an infinite-dimensional-space mapping from a finite collection of input-output observations. In contrast to the original Fourier layer in FNO, the U-FNO architecture proposed here appends a U-Net path in each U-Fourier layer. The U-Net path applies local convolutions to enrich the representation power of U-FNO for higher-frequency information.
The newly proposed U-FNO model architecture uses both Fourier and U-Fourier layers (Figure 3).
In the Fourier and U-Fourier layers (a):
a is the input.
P and Q are fully connected neural networks.
u is the output.
Inside the Fourier layer (b):
F denotes the Fourier transform.
R is the parameterization in Fourier space.
F⁻¹ is the inverse Fourier transform.
W is a linear bias term.
σ is the activation function.
Inside the U-FNO layer (c):
U denotes a two-step U-Net.
The other notations have identical meanings as in the Fourier layer.
The number of Fourier layers and the number of U-Fourier layers are hyperparameters, optimized for the specific problem.
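Following this notation, here is a heavily simplified NumPy forward pass of one U-Fourier layer in 2D (the published layers operate on 3D space-time tensors with learned channel mixing): a spectral convolution F⁻¹(R · F(x)) on a truncated set of Fourier modes, a linear bias term W, a toy stand-in for the two-step U-Net path U, and the activation σ. The mode count, shapes, and the toy U branch are illustrative assumptions, not the published implementation.

```python
import numpy as np

def spectral_conv2d(x, R, modes):
    """F^-1( R * F(x) ), keeping only the lowest `modes` Fourier modes per axis."""
    x_hat = np.fft.rfft2(x)                                 # F
    out_hat = np.zeros_like(x_hat)
    out_hat[:modes, :modes] = R * x_hat[:modes, :modes]     # multiply by R
    return np.fft.irfft2(out_hat, s=x.shape)                # F^-1

def toy_unet(x):
    """Stand-in for the two-step U-Net path U: crude downsample, then upsample."""
    down = x[::2, ::2]
    return np.kron(down, np.ones((2, 2)))[:x.shape[0], :x.shape[1]]

def u_fourier_layer(x, R, W, modes=12):
    """One toy U-Fourier layer: sigma( spectral_conv(x) + U(x) + W*x )."""
    spectral = spectral_conv2d(x, R, modes)
    unet = toy_unet(x)
    bias = W * x                                   # pointwise linear term
    return np.maximum(spectral + unet + bias, 0)   # ReLU as the activation sigma

# Single-channel 64x64 field with random spectral weights R:
x = np.random.rand(64, 64)
R = np.random.rand(12, 12) + 1j * np.random.rand(12, 12)
y = u_fourier_layer(x, R, W=0.1)
```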
Comparisons with the original FNO architecture and a state-of-the-art CNN benchmark show that the newly proposed U-FNO architecture provides the best performance for both gas saturation and pressure buildup predictions.
The results of CO2 storage predictions using NVIDIA GPUs show the following:
U-FNO predictions are accurate, with only 1.6% plume error on gas saturation and 0.68% relative error on pressure buildup.
U-FNO has superior performance on both training and testing sets compared to CNNs and the original FNO.
Gas saturation and pressure buildup prediction using U-FNO is 46% and 24% more accurate than state-of-the-art CNNs.
U-FNO requires only 33% of the training data to achieve the equivalent accuracy as CNNs.
Running a 30-year case on GPUs using U-FNO takes 0.01 s compared to 600 s using traditional finite-difference methods (FDM).
U-FNO is 6 × 10⁴x faster than the “ground truth” conventional FDM solver; FNO is 10⁵x faster.
The training and testing times are both evaluated on an NVIDIA A100-SXM GPU and compared to Schlumberger ECLIPSE simulations on an Intel Xeon Processor E5-2670 CPU.
For the CO2-water multiphase flow application described here, the goal was to optimize for the accuracy of gas saturation and pressure fields, for which the U-FNO provides the highest performance. The trained U-FNO model can therefore serve as an alternative to traditional numerical simulators in probabilistic assessment, inversion, and CCS site selection.
Web application
The trained U-FNO models are hosted on an openly accessible web application, CCSNet: a deep learning modeling suite for CO2 storage. The web application provides real-time predictions and lowers the technical barriers for governments, companies, and researchers to obtain reliable simulation results for CO2 storage projects.
Scaling FNO to 3D problem sizes using NVIDIA Tensor Core GPUs
Due to the high dimensionality of input data in CO2 storage problems, machine learning applications have been limited to two-dimensional or small-to-medium-scale three-dimensional problems.
Domain decomposition methods that minimize communication when parallelizing multidimensional fast Fourier transforms (FFTs) have been studied extensively in the literature for many applications.
It is well known that you can efficiently compute a multidimensional FFT with a sequence of lower-dimensional FFTs. The main idea is to use an iterative repartition pattern. The full mathematical derivation of the required components of distributed FNOs, and their implementation, is provided in the previously mentioned paper.
The following figure illustrates this concept: a distributed FFT using pencil decompositions acting on an input initially distributed over a 2×2 partition. Repartition operators ensure that each worker has the full data it needs to compute the sequential FFT in each dimension.
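The core idea can be demonstrated without any parallel machinery: a 2D FFT is a 1D FFT along one axis, a repartition (a transpose in this single-process sketch), and a 1D FFT along the other axis. In the distributed setting the transpose becomes the all-to-all exchange between pencil decompositions.

```python
import numpy as np

x = np.random.rand(8, 8) + 1j * np.random.rand(8, 8)

# Stage 1: 1D FFT along the last axis (each "worker" holds full rows).
stage1 = np.fft.fft(x, axis=-1)

# Repartition: in a pencil decomposition this is an all-to-all exchange;
# in this single-process sketch it is just a transpose.
repartitioned = stage1.T

# Stage 2: 1D FFT along what is now the last axis (originally the first axis).
stage2 = np.fft.fft(repartitioned, axis=-1)

# Undo the repartition and compare with the direct 2D FFT.
result = stage2.T
assert np.allclose(result, np.fft.fft2(x))
```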
The authors demonstrated that this implementation offers a different set of features when solving the 3D time-varying two-phase flow equations. In this case, the model-parallel FNO can predict time-varying PDE solutions of over 3.2 billion variables using up to 768 GPUs (128 nodes) on Summit.
Looking ahead to next steps, we can train larger 3D models following the Grady approach based on domain decomposition, and drastically increase the data sizes we can handle. With this technique, it is possible to scale up FNO-type models to solve 3D basin/reservoir CO2 storage problems.
Summary
Simulations are required to optimize the CO2 injection location and verify that CO2 does not leak from the storage site. We have shown that U-FNO, an enhanced deep Fourier neural operator, is 2x more accurate and 3x more data efficient than a state-of-the-art CNN, and four orders of magnitude faster than a numerical simulator.
Using NVIDIA GPUs, the trained U-FNO models generate gas saturation and pressure buildup predictions 6 × 10⁴x faster than a traditional numerical solver. Distributed operator learning and the ability to scale FNOs with domain decomposition to large problem sizes open up new possibilities to scale our studies to realistically sized data.
In its 2022 contribution on mitigation pathways to the sixth assessment report (AR6) of the Intergovernmental Panel on Climate Change (IPCC), Working Group III highlighted the need for and potential of carbon capture and storage (CCS) to limit global warming to 1.5° C or 2° C and avoid the worst outcomes from climate change.
Admittedly, there are current limitations to this set of technologies, especially related to economic and sociocultural barriers. But its deployment to counterbalance hard-to-abate residual emissions is considered unavoidable if net zero CO2 or greenhouse gas emissions are to be achieved.
We believe that building powerful AI tools for climate action and resilience can curb emissions at scale. In this work, we deployed novel artificial intelligence techniques to accelerate the simulation of CO2 flow in porous media, which plays an important role in CCS applications and the path forward toward mitigating climate change.
Posted by Tal Remez, Software Engineer, Google Research and Micheal Hassid, Software Engineer Intern, Google Research
Recent years have seen a tremendous increase in the creation and serving of video content to users across the world in a variety of languages and over numerous platforms. The process of creating high quality content can include several stages from video capturing and captioning to video and audio editing. In some cases dialogue is re-recorded (referred to as dialog replacement, post-sync or dubbing) in a studio in order to achieve high quality and replace original audio that might have been recorded in noisy conditions. However, the dialog replacement process can be difficult and tedious because the newly recorded audio needs to be well synced with the video, requiring several edits to match the exact timing of mouth movements.
In “More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech”, we present a proof-of-concept visually-driven text-to-speech model, called VDTTS, that automates the dialog replacement process. Given a text and the original video frames of the speaker, VDTTS is trained to generate the corresponding speech. As opposed to standard visual speech recognition models, which focus on the mouth region, we detect and crop full faces using MediaPipe to avoid potentially excluding information pertinent to the speaker’s delivery. This gives the VDTTS model enough information to generate speech that matches the video while also recovering aspects of prosody, such as timing and emotion. Despite not being explicitly trained to generate speech that is synchronized to the input video, the learned model still does so.
Given a text and video frames of a speaker, VDTTS generates speech with prosody that matches the video signal.
VDTTS Model The VDTTS model resembles Tacotron at its core and has four main components: (1) text and video encoders that process the inputs; (2) a multi-source attention mechanism that connects encoders to a decoder; (3) a spectrogram decoder that incorporates the speaker embedding (similarly to VoiceFilter), and produces mel-spectrograms (which are a form of compressed representation in the frequency domain); and (4) a frozen, pretrained neural vocoder that produces waveforms from the mel-spectrograms.
The overall architecture of VDTTS. Text and video encoders process the inputs and then a multisource attention mechanism connects these to a decoder that produces mel-spectrograms. A vocoder then produces waveforms from the mel-spectrograms to generate speech as an output.
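For readers who think in code, here is a purely schematic sketch of how those four components connect during inference. It is plain Python with random placeholder outputs; none of the names, shapes, or internals come from the actual VDTTS implementation.

```python
import numpy as np

# Placeholder components with random outputs; in VDTTS each is a trained network.
def text_encoder(text):                               # (1) text encoder
    return np.random.rand(len(text), 256)

def video_encoder(frames):                            # (1) video (face-crop) encoder
    return np.random.rand(len(frames), 256)

def multi_source_attention(txt, vid, step):           # (2) attends over both encoders
    return np.concatenate([txt.mean(axis=0), vid.mean(axis=0)])

def spectrogram_decoder(context, speaker_embedding):  # (3) one mel-spectrogram frame
    return np.random.rand(80)

def vocoder(mel):                                     # (4) frozen, pretrained vocoder
    return np.random.rand(mel.shape[0] * 256)         # waveform samples

def vdtts_forward(text, frames, speaker_embedding, num_mel_frames=100):
    txt = text_encoder(text)
    vid = video_encoder(frames)
    mel = np.stack([spectrogram_decoder(multi_source_attention(txt, vid, t),
                                        speaker_embedding)
                    for t in range(num_mel_frames)])
    return vocoder(mel)

waveform = vdtts_forward("hello world", frames=[None] * 75,
                         speaker_embedding=np.zeros(64))
```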
We train VDTTS using video and text pairs from LSVSR in which the text corresponds to the exact words spoken by a person in a video. Throughout our testing, we have determined that VDTTS cannot generate arbitrary text, thus limiting its potential for misuse (e.g., the generation of fake content).
Quality To showcase the unique strength of VDTTS in this post, we have selected two inference examples from the VoxCeleb2 test dataset and compare the performance of VDTTS to a standard text-to-speech (TTS) model. In both examples, the video frames provide prosody and word timing clues, visual information that is not available to the TTS model.
In the first example, the speaker talks at a particular pace that can be seen as periodic gaps in the ground-truth mel-spectrogram (shown below). VDTTS preserves this characteristic and generates audio that is much closer to the ground-truth than the audio generated by standard TTS without access to the video.
Similarly, in the second example, the speaker takes long pauses between some of the words. These pauses are captured by VDTTS and are reflected in the video below, whereas the TTS does not capture this aspect of the speaker’s rhythm.
We also plot fundamental frequency (F0) charts to compare the pitch generated by each model to the ground-truth pitch. In both examples, the F0 curve of VDTTS fits the ground-truth much better than the TTS curve, both in the alignment of speech and silence, and also in how the pitch changes over time. See more original videos and VDTTS generated videos.
We present two examples, (a) and (b), from the VoxCeleb2 test set. From top to bottom: input face images, ground-truth (GT) mel-spectrogram, mel-spectrogram output of VDTTS, mel-spectrogram output of a standard TTS model, and two plots showing the normalized F0 (normalized by mean non-zero pitch, i.e., mean is only over voiced periods) of VDTTS and TTS compared to the ground-truth signal.
Video Samples
Original
VDTTS
VDTTS video-only
TTS
Original displays the original video clip. VDTTS displays the audio predicted using both the video frames and the text as input. VDTTS video-only displays audio predictions using video frames only. TTS displays audio predictions using text only. Top transcript: “of space for people to make their own judgments and to come to their own”. Bottom transcript: “absolutely love dancing I have no dance experience whatsoever but as that”.
Model Performance We’ve measured the VDTTS model’s performance using the VoxCeleb2 dataset and compared it to TTS and the TTS with length hint (a TTS that receives the scene length) models. We demonstrate that VDTTS outperforms both models by large margins in most of the aspects we measured: higher sync-to-video quality (measured by SyncNet Distance) and better speech quality as measured by mel cepstral distance (MCD), and lower Gross Pitch Error (GPE), which measures the percentage of frames where pitch differed by more than 20% on frames for which voice was present on both the predicted and reference audio.
SyncNet distance comparison between VDTTS, TTS and the TTS with Length hint (a lower metric is better).
Mel cepstral distance comparison between VDTTS, TTS and the TTS with Length hint (a lower metric is better).
Gross Pitch Error comparison between VDTTS, TTS and the TTS with Length hint (a lower metric is better).
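As a concrete reading of the GPE definition above, here is a small NumPy helper that computes Gross Pitch Error from frame-aligned pitch tracks, counting only frames that are voiced in both the reference and the prediction. The array names and the convention that pitch > 0 marks a voiced frame are assumptions of this sketch.

```python
import numpy as np

def gross_pitch_error(f0_ref, f0_pred, rel_threshold=0.2):
    """Percentage of mutually voiced frames whose pitch differs by more than 20%."""
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_pred = np.asarray(f0_pred, dtype=float)
    voiced = (f0_ref > 0) & (f0_pred > 0)      # frames voiced in both signals
    if not voiced.any():
        return 0.0
    rel_err = np.abs(f0_pred[voiced] - f0_ref[voiced]) / f0_ref[voiced]
    return 100.0 * np.mean(rel_err > rel_threshold)

# Example: two 5-frame pitch tracks (0 marks unvoiced frames).
print(gross_pitch_error([110, 0, 220, 230, 0], [100, 120, 300, 228, 0]))
```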
Discussion and Future Work One thing to note is that, intriguingly, VDTTS can produce video-synchronized speech without any explicit losses or constraints to promote this, suggesting that complexities such as synchronization losses or explicit modeling are unnecessary.
While this is a proof-of-concept demonstration, we believe that in the future, VDTTS can be upgraded to be used in scenarios where the input text differs from the original video signal. This kind of a model would be a valuable tool for tasks such as translation dubbing.
Acknowledgements We would like to thank the co-authors of this research: Michelle Tadmor Ramanovich, Ye Jia, Brendan Shillingford, and Miaosen Wang. We are also grateful to the valued contributions, discussions, and feedback from Nadav Bar, Jay Tenenbaum, Zach Gleicher, Paul McCartney, Marco Tagliasacchi, and Yoni Tzafir.
The recent DOCA Hackathon in Europe revealed streamlined innovation in video processing, storage solutions, and switching protocols using the BlueField DPU and DOCA SDK.
The third in a series of global NVIDIA DOCA Hackathons took place on March 21, during NVIDIA GTC 2022. Competing in the event were 10 teams from a variety of universities, enterprises, and technology partners across Europe and the Middle East. As part of GTC, NVIDIA CEO Jensen Huang gave a powerful keynote highlighting efforts in AI to supercharge industries, including advances in DPUs and switching.
The recent Hackathon in Europe focused on BlueField DPU innovations that leverage the DOCA software framework to streamline the development process. Participants continue to find new ways to utilize the DPU for offloading, accelerating, and isolating a broad range of services. With DOCA, NVIDIA brings together APIs, drivers, libraries, sample code, documentation, services, and prepackaged containers for developers to speed up application development and deployment.
“The NVIDIA Hackathon series enables participants to take a giant leap forward in their DPU application development, with direct access to DOCA training, mentors, preconfigured setups, documentation, and a working environment. These Hackathons help to accelerate application development that would have otherwise taken months for many organizations and play a significant role in establishing a strong DOCA developer community,” said Dror Goldenberg, SVP of Software Architecture at NVIDIA. “We continue to be very impressed with the creativity and ingenuity of all of our hackathon contestants and this competition was no exception!”
First Place
Team Thales from Theresis Utilization of DPU for Storage Security
The Team Thales solution successfully used the BlueField DPU to create cyberdefense for files transmitted over the network. They used a combination of networking and storage security rules that delivered an overall performance improvement. The goal was to build upon the DPI acceleration to enable Yara rules, which are used for inspection of files downloaded from the network to identify malware and potential threats. To implement this, Team Thales used a Yara Parser to transform public Yara rules into DPI rules in a Suricata community-based format supported by the DOCA DPI lib. This solution leveraged DOCA DPI functionality to scan the files on the fly as the packets flow through the device.
Second Place
Team RARE/FreeRTR from GÉANT Router for Academia Research and Education Hardware accelerated MPLS
FreeRTR is a Swiss army knife router meant to be used as a primary router, but it can also be used as a specific appliance. The team evaluated accelerated DPDK and DOCA FLOW and enabled routing functionality with multiple routing protocols on the DPU, leveraging its large and programmable flow tables. As part of their planned innovation, the team also evaluated added services by linking DOCA libraries to provide additional functionality, including a firewall, RegEx scanning, MACsec encryption, and a DPI engine at line rate. This allowed Team RARE to create one control plane to rule all data planes.
Third Place
Team DOCA Seville from the University of Seville Video processing at the edge
Team DOCA Seville used the BlueField DPU to filter out empty frames from CCTV streams, reducing the load on the CPU/GPU and improving physical security by providing detection of firearms. The intention is to prevent mass shootings by detecting the weapon. Removing images with no people significantly improves overall system performance. Team DOCA Seville leveraged the DPU to offload the image processing and used the DOCA gRPC infrastructure to stream filtered data for further analysis.
Congratulations to the winners and thanks to all of the teams that participated, making this round of NVIDIA DOCA Hackathon such a wonderful success!
NVIDIA is building a broad community of DOCA developers to create innovative applications and services on top of BlueField DPUs to secure and accelerate modern, efficient data centers. To learn more about joining the community, visit the DOCA developer web page or register to download DOCA today.
Up next is the NVIDIA DPU Hackathon in China. Check out our corporate calendar to stay informed of future events, and take part in our journey to reshape the data center of tomorrow.
I went around my area and took a bunch of pictures of litter. And it caused my model to go to shit
I can’t think why??
I took decent photos, converted them to JPG, and labeled everything appropriately. The photos are larger in dimensions and memory-wise, so training was slower, which is understandable 🙃
GeForce NOW is about bringing new experiences to gamers. This GFN Thursday introduces game demos to GeForce NOW. Members can now try out some of the hit games streaming on the service before purchasing the full PC version — including some finalists from the 2021 Epic MegaJam. Plus, look for six games ready to stream.
NVIDIA delivered leading results for MLPerf Inference 2.0, including 5x more performance for NVIDIA Jetson AGX Orin, an SoC platform built for edge devices and robotics.
Models like Megatron 530B are expanding the range of problems AI can address. However, as models continue to grow in complexity, they pose a twofold challenge for AI compute platforms:
These models must be trained in a reasonable amount of time.
They must be able to do inference work in real time.
What’s needed is a versatile AI platform that can deliver the needed performance on a wide variety of models for both training and inference.
To evaluate that performance, MLPerf is the only industry-standard AI benchmark that tests data center and edge platforms across a half-dozen applications, measuring throughput, latency, and energy efficiency.
In MLPerf Inference 2.0, NVIDIA delivered leading results across all workloads and scenarios with both data center GPUs and the newest entrant, the NVIDIA Jetson AGX Orin SoC platform built for edge devices and robotics.
Beyond the hardware, it takes great software and optimization work to get the most out of these platforms. The results of MLPerf Inference 2.0 demonstrate how to get the kind of performance needed to tackle today’s increasingly large and complex AI models.
Here’s a look at the performance seen on MLPerf Inference 2.0, as well as some of those optimizations and how they got built.
Do the numbers
Figure 1 shows the latest entrant, NVIDIA Jetson AGX Orin.
MLPerf v2.0 Inference Edge Closed and Edge Closed Power; Performance/Watt from MLPerf results for respective submissions for Data Center and Edge, Offline Throughput, and Power. NVIDIA Jetson AGX Xavier: 1.1-110 and 1.1-111 | Jetson AGX Orin: 2.0-140 and 2.0-141. The MLPerf name and logo are trademarks. Source: http://www.mlcommons.org/en.
Figure 1 shows that Jetson AGX Orin delivers up to 5x more performance compared to the previous-generation Jetson AGX Xavier. It brings, on average, about 3.4x more performance across the full breadth of usages tested. In addition, Jetson AGX Orin delivers up to 2.3x more energy efficiency.
Jetson AGX Orin is an SoC that brings up to 275 TOPS of AI compute for multiple concurrent AI inference pipelines, plus high-speed interface support for multiple sensors. The NVIDIA Jetson AGX Orin Developer Kit enables you to create advanced robotics and edge AI applications for manufacturing, logistics, retail, service, agriculture, smart city, healthcare, and life sciences.
In the Data Center category, NVIDIA continues to deliver across-the-board AI inference performance leadership across all usages.
MLPerf v2.0 Inference Closed; Per-accelerator performance derived from the best MLPerf results for respective submissions using reported accelerator count in Data Center Offline and Server. Qualcomm AI 100: 2.0-130, Intel Xeon 8380 from MLPerf v.1.1 submission: 1.1-023 and 1.1-024, Intel Xeon 8380H 1.1-026, NVIDIA A30: 2.0-090, NVIDIA A100 (Arm): 2.0-077, NVIDIA A100 (x86): 2.0-094. MLPerf name and logo are trademarks. Source: http://www.mlcommons.org/en.
NVIDIA A100 delivers the best per-accelerator performance across all tests in both the Offline and Server scenarios.
We submitted the A100 in the following configurations:
A100 SXM paired with x86 CPUs (AMD Epyc 7742)
A100 PCIe paired with x86 CPUs (AMD Epyc 7742)
A100 SXM paired with an Arm CPU (NVIDIA Ampere Architecture Altra Q80-30)
Microsoft Azure also submitted using its A100 instance, which we also show in this data.
All configurations deliver about the same inference performance, which is a testament to the readiness of our Arm software stack, as well as the overall performance of A100 on-premises and in the cloud.
A100 also delivers up to 105x more performance than a CPU-only submission (RNN-T, Server scenario). A30 also delivered leadership-level performance on all but one workload. Like the A100, it ran every Data Center category test.
Key optimizations
Delivering great inference performance requires a full stack approach, where great hardware is combined with optimized and versatile software. NVIDIA TensorRT and NVIDIA Triton Inference Server both play pivotal roles in delivering great inference performance across the diverse set of workloads.
Jetson AGX Orin optimizations
The new NVIDIA Ampere architecture iGPU in NVIDIA Orin is supported by NVIDIA TensorRT 8.4 and is the most important component of the SoC for MLPerf performance. The extensive TensorRT library of optimized GPU kernels was extended to support the new architecture, and the TensorRT builder picks up these kernels automatically.
Furthermore, the plug-ins used in MLPerf networks have all been ported to NVIDIA Orin and added to TensorRT 8.4, including the res2 plug-in (resnet50) and qkv to context plug-in (BERT). Unlike systems with discrete GPU accelerators, inputs are not copied from host memory to device memory, because the SoC DRAM is shared by the CPU and iGPU.
In addition to the iGPU, NVIDIA uses two deep learning accelerators (DLAs) for maximum system performance on CV networks (resnet50, ssd-mobilenet, ssd-resnet34) in the offline scenario.
NVIDIA Orin incorporates a new generation of DLA hardware. To take advantage of these hardware improvements, the DLA compiler added the following NVIDIA Orin features, which are available automatically when you upgrade to a future release of TensorRT, without modifying any application source code:
SRAM chaining: Keeps intermediate tensors in local SRAM to avoid reads from/writes to DRAM, which reduces latency as well as platform DRAM usage. It also reduces interference with GPU inference.
Convolution + pooling fusion: INT8 convolution + bias + scale + ReLU can be fused with a subsequent pooling node.
Convolution + element-wise fusion: INT8 convolution + element-wise sum can be fused with a subsequent ReLU node.
The batch sizes of the two DLA accelerators were finely tuned to obtain the right balance of aggregated GPU+DLA performance. The tuning balanced the need to minimize scheduling conflicts for the DLA engines' GPU fallback kernels against the risk of starving the SoC's shared DRAM bandwidth.
3D-UNet medical imaging
While most of the workloads were largely unchanged from MLPerf Inference v1.1, the 3D-UNet medical imaging workload was enhanced with the KiTS19 data set. This new data set of kidney tumor images has much larger images of varying sizes and requires a lot more processing per sample.
The KiTS19 data set introduces new challenges in achieving performant, energy-efficient inference. More specifically:
Input tensors used in KiTS19 vary in shape from 128x192x320 to 320x448x448; the biggest input tensor is 8.17x larger than the smallest input tensor.
Tensors larger than 2 GB are needed during inference.
There is a sliding window on the specific region of interest (ROI) shape (128x128x128), with a large overlap factor (50%).
To address this, we developed a sliding window method to process these images:
Slice each input tensor into ROI shape, abiding by the overlap factor.
Use a loop to process all the sliding window slices for a given input tensor.
Weight and normalize the inference result of each sliding window.
Get the final segmentation output by the ArgMax of the aggregated results of sliding window inferences.
In Figure 3, each input tensor is sliced to the ROI-shape (128x128x128) with overlap-factor (50%) and fed into the pretrained network. Each sliding window output is then optimally weighted for capturing features with a Gaussian kernel of normalized sigma = 0.125.
The inference results are aggregated according to the original input tensor shape and normalized by those weighting factors. An ArgMax operation then carves out the segmentation information, marking the background, normal kidney cells, and tumors.
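A simplified NumPy version of this sliding-window procedure is sketched below, under several assumptions: a dummy model stands in for the 3D-UNet engine, the Gaussian importance map uses the normalized sigma of 0.125 mentioned above, and the input size is chosen so that the 50%-overlap windows tile it exactly.

```python
import numpy as np

ROI = (128, 128, 128)                   # sliding-window shape
STRIDE = tuple(r // 2 for r in ROI)     # 50% overlap
NUM_CLASSES = 3                         # background, normal kidney, tumor

def gaussian_patch(shape, sigma_scale=0.125):
    """Gaussian importance weights over one ROI patch (peak at the center)."""
    grids = np.meshgrid(*[np.arange(s) for s in shape], indexing="ij")
    weight = np.ones(shape, dtype=np.float32)
    for g, s in zip(grids, shape):
        weight *= np.exp(-0.5 * ((g - (s - 1) / 2) / (sigma_scale * s)) ** 2)
    return weight / weight.max()

def dummy_model(patch):
    """Stand-in for the 3D-UNet engine: returns per-class logits for one patch."""
    return np.random.rand(NUM_CLASSES, *patch.shape).astype(np.float32)

def sliding_window_inference(volume):
    acc = np.zeros((NUM_CLASSES, *volume.shape), dtype=np.float32)
    norm = np.zeros(volume.shape, dtype=np.float32)
    w = gaussian_patch(ROI)
    for z in range(0, volume.shape[0] - ROI[0] + 1, STRIDE[0]):
        for y in range(0, volume.shape[1] - ROI[1] + 1, STRIDE[1]):
            for x in range(0, volume.shape[2] - ROI[2] + 1, STRIDE[2]):
                sl = (slice(z, z + ROI[0]), slice(y, y + ROI[1]), slice(x, x + ROI[2]))
                logits = dummy_model(volume[sl])
                acc[(slice(None),) + sl] += logits * w   # weight each window
                norm[sl] += w
    acc /= norm                                          # normalize overlapping regions
    return np.argmax(acc, axis=0)                        # final segmentation

segmentation = sliding_window_inference(np.random.rand(128, 192, 320).astype(np.float32))
```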
This implementation compares the segmentation against the ground truth and calculates a DICE score to determine benchmarking accuracy. You can also check the outcome visually (Figure 4).
We’ve supported INT8 precision in our data center GPUs for over 5 years, and this precision brings significant speedups on many models with near-zero loss in accuracy compared to FP16 and FP32 precision levels.
For 3D-UNet, we used INT8 by calibrating the images from the calibration set using the TensorRT IInt8MinMaxCalibrator. This implementation matches the accuracy of the FP32 reference model, thus enabling both the high and low accuracy modes of the benchmark.
In Figure 5, green boxes are executed on the device (GPU) and yellow boxes are executed on the host (CPU). Some operations required for sliding window inference are optimized as fused operations.
By exploiting GPUDirect RDMA and GPUDirect Storage, host-to-device and device-to-host data movement can be minimized or eliminated. The latency of each stage is measured on a DGX A100 system for one input sample whose size is close to the average input size. The latency of the slicing kernel and the ArgMax kernel varies proportionally with the input image size.
Here are some of the specific optimizations implemented:
The Gaussian kernel patches used for weighting are now precalculated and stored on disk and are loaded into GPU memory before the timed portion of the benchmark starts.
Weighting and normalizing are optimized as a fused operation, using 27 precomputed patches required for the sliding window of 50% overlap on the 3D input tensor.
Custom CUDA kernels handling the slicing, weighting, and ArgMax are written so that all these operations are done in the GPU, eliminating the need of H2D/D2H data transfer.
Storing the input tensor in the INT8 LINEAR memory layout minimizes the amount of data in the H2D transfer, because the KiTS19 input set has a single channel.
TensorRT requires INT8 input in the NC/32DHW32 format. We use a custom CUDA kernel that performs slicing to zero-pad and reformat the INT8 LINEAR input tensor slice to INT8 NC/32DHW32 format in a contiguous memory region in GPU global memory.
Zero-padding and reformatting the tensor within the GPU is much faster than the otherwise expensive H2D transfer of 32x more data. This optimization improves overall performance significantly and frees up precious system resources.
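As a rough illustration of what that reformat does (the layout details here are an assumption, and this is NumPy rather than the actual CUDA kernel): NC/32DHW32 is read as channels blocked in groups of 32 with the 32-channel vector stored innermost, so a single-channel INT8 LINEAR slice is zero-padded from 1 to 32 channels and rearranged as follows.

```python
import numpy as np

def linear_to_nc32dhw32(x_lin):
    """Reformat an INT8 tensor from LINEAR (N, C, D, H, W) to an assumed NC/32DHW32 layout.

    Assumed layout: channels padded to a multiple of 32 and blocked, with the
    32-channel vector as the innermost dimension: (N, C//32, D, H, W, 32).
    """
    n, c, d, h, w = x_lin.shape
    c_pad = (32 - c % 32) % 32
    x = np.pad(x_lin, ((0, 0), (0, c_pad), (0, 0), (0, 0), (0, 0)))  # zero-pad channels
    c_blocks = (c + c_pad) // 32
    # (N, C, D, H, W) -> (N, C//32, 32, D, H, W) -> (N, C//32, D, H, W, 32)
    return x.reshape(n, c_blocks, 32, d, h, w).transpose(0, 1, 3, 4, 5, 2).copy()

# A single-channel KiTS19-style slice is padded from 1 to 32 channels:
slice_lin = np.zeros((1, 1, 128, 128, 128), dtype=np.int8)
print(linear_to_nc32dhw32(slice_lin).shape)   # (1, 1, 128, 128, 128, 32)
```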
The TensorRT engine is built for running inference on each of the sliding window slices. Because 3D-UNet is dense, we found that increasing batch size proportionally increases the runtime of the engine.
NVIDIA Triton optimizations
The NVIDIA submission continues to show the versatility of Triton Inference Server. This round, NVIDIA Triton also supports running on AWS Inferentia, using the Python backend to run Inferentia-optimized PyTorch and TensorFlow models.
Using NVIDIA Triton and torch-neuron, the NVIDIA submissions achieved 85% to 100% of the native inference performance on Inferentia.
MLPerf v1.1 Inference Closed; Per-accelerator performance derived from the best MLPerf results for respective submissions using reported accelerator count in Data Center Offline. Geomean of all submitted workloads shown. CPU comparison based on Intel submission data from MLPerf Inference 1.1 to compare configurations with same CPU, submissions 1.0-16, 1.0-17, 1.0-19. NVIDIA Triton on CPU: 2.0-100 and 2.0-101. A2: 2.0-060 and 2.0-061. A30: 2.0-091 and 2.0-092. A100: 2.0-094 and 2.0-096. MLPerf name and logo are trademarks. Source: http://www.mlcommons.org/en.
NVIDIA Triton now supports the AWS Inferentia inference processor and delivers nearly equal performance to running on the AWS Neuron SDK alone.
It takes a platform
NVIDIA inference leadership comes from building the most performant AI accelerators, both for training and inference. But great hardware is just the beginning.
NVIDIA TensorRT and Triton Inference Server software play pivotal roles in delivering our great inference performance across this diverse set of workloads. They are available at NGC, the NVIDIA hub, along with other GPU-optimized software for deep learning, machine learning, and HPC.
The NGC containerized software makes it much easier to get accelerated platforms up and running so you can focus on building real applications, and speeding time-to-value. NGC is freely available through the marketplace of your preferred cloud provider.
For more information, see Inference Technical Overview. This paper covers the trends in AI inference, challenges around deploying AI applications, and details of inference development tools and application frameworks.