Yang Yang

I work as a Sr. Staff Deep Learning Engineer at Qualcomm AI Research in San Diego. My research interests include deep generative models, neural data compression (image/video/speech compression), and machine learning for combinatorial optimization (e.g., graph-level compiler optimization). I am passionate about designing deep learning solutions to challenging problems and deploying them to edge devices.

I received my Ph.D. in wireless networking from The Ohio State University in 2015. Before joining AI Research, I worked on wireless physical layer design (reference signal, channel estimation, tracking loop), channel coding design, and standardization for 5G NR (patent).

news

Mar 18, 2022	Completed Chapter 2 Normalizing Flows of our deep generative models book
Mar 5, 2022	Post "Quantization for Neural Networks" is up!
Aug 1, 2021	My research focus has transitioned from neural data compression to machine learning for combinatorial optimization (MLCO).
Jun 19, 2021	Check out my team’s demo: Real-time on-device neural video decoding (CVPR 2021)
May 7, 2021	Co-organized ICLR 2021 Neural Compression Workshop
Apr 3, 2021	Post on how to enforce a Lipschitz constant in neural networks is up!

selected publications

ICLR

Transformer-based Transform Coding

Zhu, Yinhao*, Yang, Yang*, and Cohen, Taco

In International Conference on Learning Representations (ICLR) 2022

Neural data compression based on nonlinear transform coding has made great progress over the last few years, mainly due to improvements in prior models, quantization methods and nonlinear transforms. A general trend in many recent works pushing the limit of rate-distortion performance is to use ever more expensive prior models that can lead to prohibitively slow decoding. Instead, we focus on more expressive transforms that result in a better rate-distortion-computation trade-off. Specifically, we show that nonlinear transforms built on Swin-transformers can achieve better compression efficiency than transforms built on convolutional neural networks (ConvNets), while requiring fewer parameters and shorter decoding time. Paired with a compute-efficient Channel-wise Auto-Regressive Model prior, our SwinT-ChARM model outperforms VTM-12.1 by 3.68% in BD-rate on Kodak with comparable decoding speed. In the P-frame video compression setting, we are able to outperform the popular ConvNet-based scale-space-flow model by 12.35% in BD-rate on UVG. We provide model scaling studies to verify the computational efficiency of the proposed solutions and conduct several analyses to reveal the source of coding gain of transformers over ConvNets, including better spatial decorrelation, flexible effective receptive field, and more localized response of latent pixels during progressive decoding.

A Swin-transformer-based autoencoder structure is proposed that achieves a SOTA rate-distortion-complexity trade-off in image compression. It also outperforms SSF in video compression. As far as I know, SwinT-ChARM is the first neural image codec that outperforms VTM in rate-distortion performance with comparable decoding time on GPU.
ICIP

Progressive Neural Image Compression With Nested Quantization And Latent Ordering

Lu, Yadong*, Zhu, Yinhao*, Yang, Yang*, Said, Amir, and Cohen, Taco S

In IEEE International Conference on Image Processing (ICIP) 2021

We present PLONQ, a progressive neural image compression scheme which pushes the boundary of variable bitrate compression by allowing quality scalable coding with a single bitstream. In contrast to existing learned variable bitrate solutions which produce separate bitstreams for each quality, it enables easier rate-control and requires less storage. Leveraging the latent scaling based variable bitrate solution, we introduce nested quantization, a method that defines multiple quantization levels with nested quantization grids, and progressively refines all latents from the coarsest to the finest quantization level. To achieve finer progressiveness in between any two quantization levels, latent elements are incrementally refined with an importance ordering defined in the rate-distortion sense. To the best of our knowledge, PLONQ is the first learning-based progressive image coding scheme and it outperforms SPIHT, a well-known wavelet-based progressive image codec.

A prefix-decodable bitstream (the lower-bitrate stream is embedded as a prefix of the higher-bitrate stream) is obtained with nested quantization and per-element sorting by prior standard deviation, based on the hyperprior model.
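
As a rough illustration of the coarse-to-fine idea in this summary, here is a toy Python sketch. It is not the PLONQ implementation: the function names, the step sizes, and the use of plain rounding are assumptions made for illustration, and entropy coding and the hyperprior network are omitted entirely. Elements are refined in order of decreasing prior standard deviation, and the grid points of each level are contained in the next finer grid.

```python
def quantize(x, step):
    """Round x to the nearest point of a uniform grid with spacing `step`."""
    return step * round(x / step)


def progressive_refinement(latents, prior_stddevs, steps=(4.0, 2.0, 1.0)):
    """Yield reconstructions as latents are refined one element at a time.

    `steps` lists hypothetical quantization step sizes, coarsest first.
    Each step divides the previous one by an integer, so every coarse grid
    point also lies on the finer grids (the grids are nested).
    """
    # Hypothetical importance ordering: largest prior stddev first.
    order = sorted(range(len(latents)), key=lambda i: -prior_stddevs[i])
    recon = [0.0] * len(latents)
    for step in steps:
        for i in order:
            recon[i] = quantize(latents[i], step)
            yield list(recon)  # a decoder could stop after any of these


if __name__ == "__main__":
    latents = [3.3, -0.4, 7.9]
    stddevs = [2.0, 0.5, 5.0]
    for k, recon in enumerate(progressive_refinement(latents, stddevs)):
        err = sum((a - b) ** 2 for a, b in zip(latents, recon))
        print(k, recon, round(err, 3))
```

Stopping the generator early plays the role of decoding only a prefix of the bitstream: fewer refinements are applied and the reconstruction is coarser.
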
JSP

Transform Network Architectures for Deep Learning Based End-to-End Image/Video Coding in Subsampled Color Spaces

Egilmez, Hilmi, Singh, Ankitesh Kumar, Coban, Muhammed, Karczewicz, Marta, Zhu, Yinhao, Yang, Yang, Said, Amir, and Cohen, Taco

In IEEE Open Journal of Signal Processing 2021

Most of the existing deep learning based end-to-end image/video coding (DLEC) architectures are designed for non-subsampled RGB color format. However, in order to achieve a superior coding performance, many state-of-the-art block-based compression standards such as High Efficiency Video Coding (HEVC/H.265) and Versatile Video Coding (VVC/H.266) are designed primarily for YUV 4:2:0 format, where U and V components are subsampled by considering the human visual system. This paper investigates various DLEC designs to support YUV 4:2:0 format by comparing their performance against the main profiles of HEVC and VVC standards under a common evaluation framework. Moreover, a new transform network architecture is proposed to improve the efficiency of coding YUV 4:2:0 data. The experimental results on YUV 4:2:0 datasets show that the proposed architecture significantly outperforms naive extensions of existing architectures designed for RGB format and achieves about 10% average BD-rate improvement over the intra-frame coding in HEVC.

PReLU can replace GDN in the hyperprior model to compress YUV (and RGB!) images without loss of coding gain.
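
For reference, the two nonlinearities compared here can be sketched in a few lines of Python. This is a minimal single-pixel sketch, not the paper's code: GDN is written in its commonly cited form, y_i = x_i / sqrt(beta_i + sum_j gamma_ij * x_j^2), and all parameter values below are made up for illustration rather than learned.

```python
import math


def prelu(x, slopes):
    """PReLU across channels: identity for x >= 0, per-channel slope otherwise."""
    return [xi if xi >= 0 else a * xi for xi, a in zip(x, slopes)]


def gdn(x, beta, gamma):
    """GDN across channels at a single pixel.

    y_i = x_i / sqrt(beta_i + sum_j gamma[i][j] * x_j**2)
    """
    out = []
    for i, xi in enumerate(x):
        norm = beta[i] + sum(g * xj * xj for g, xj in zip(gamma[i], x))
        out.append(xi / math.sqrt(norm))
    return out


if __name__ == "__main__":
    x = [3.0, -4.0]  # two channels at one pixel (illustrative values)
    print(prelu(x, slopes=[0.25, 0.25]))
    print(gdn(x, beta=[1.0, 1.0], gamma=[[0.1, 0.0], [0.0, 0.1]]))
```

PReLU acts on each element independently, whereas GDN couples all channels at a pixel through a shared normalization term.
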
ICASSP

Feedback Recurrent Autoencoder

Yang, Yang, Sautière, Guillaume, Ryu, J. Jon, and Cohen, Taco S

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020

In this work, we propose a new recurrent autoencoder architecture, termed Feedback Recurrent AutoEncoder (FRAE), for online compression of sequential data with temporal dependency. The recurrent structure of FRAE is designed to efficiently extract the redundancy along the time dimension and allows a compact discrete representation of the data to be learned. We demonstrate its effectiveness in speech spectrogram compression. Specifically, we show that the FRAE, paired with a powerful neural vocoder, can produce high-quality speech waveforms at a low, fixed bitrate. We further show that by adding a learned prior for the latent space and using an entropy coder, we can achieve an even lower variable bitrate.

Extends the VAE to the sequential setting for efficient representation learning of sequential data. In the application of speech compression, when paired with a neural vocoder, the proposed FRAE significantly outperforms Opus, an open-source speech codec.
ACCV

Feedback Recurrent Autoencoder for Video Compression

Golinski, Adam*, Pourreza, Reza*, Yang, Yang*, Sautière, Guillaume, and Cohen, Taco S

In Asian Conference on Computer Vision (ACCV) 2020

Recent advances in deep generative modeling have enabled efficient modeling of high dimensional data distributions and opened up a new horizon for solving data compression problems. Specifically, autoencoder based learned image or video compression solutions are emerging as strong competitors to traditional approaches. In this work, we propose a new network architecture, based on common and well studied components, for learned video compression operating in low latency mode. Our method yields competitive MS-SSIM/rate performance on the high-resolution UVG dataset, among both learned video compression approaches and classical video compression methods (H.265 and H.264) in the rate range of interest for streaming applications. Additionally, we provide an analysis of existing approaches through the lens of their underlying probabilistic graphical models. Finally, we point out issues with temporal consistency and color shift observed in empirical evaluation, and suggest directions forward to alleviate those.

One of the early works proposing neural-network-based video codecs. A recurrent layer is introduced in the decoder to capture long-range redundancy in video content, and its state is fed back to the encoder.
CVPR

Guided Variational Autoencoder for Disentanglement Learning

Ding, Zheng*, Xu, Yifan*, Xu, Weijian, Parmar, Gaurav, Yang, Yang, Welling, Max, and Tu, Zhuowen

In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020

We propose an algorithm, guided variational autoencoder (Guided-VAE), that is able to learn a controllable generative model by performing latent representation disentanglement learning. The learning objective is achieved by providing signal to the latent encoding/embedding in VAE without changing its main backbone architecture, hence retaining the desirable properties of the VAE. We design an unsupervised and a supervised strategy in Guided-VAE and observe enhanced modeling and controlling capability over the vanilla VAE. In the unsupervised strategy, we guide the VAE learning by introducing a lightweight decoder that learns latent geometric transformation and principal components; in the supervised strategy, we use an adversarial excitation and inhibition mechanism to encourage the disentanglement of the latent variables. Guided-VAE enjoys its transparency and simplicity for the general representation learning task, as well as disentanglement learning. On a number of experiments for representation learning, improved synthesis/sampling, better disentanglement for classification, and reduced classification errors in meta learning have been observed.

An auxiliary decoder is used to improve the disentanglement of the latent variables of a VAE.
ICASSP

Automatic Grammar Augmentation for Robust Voice Command Recognition

Yang, Yang, Lalitha, Anusha, Lee, Jinwon, and Lott, Chris

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019

This paper proposes a novel pipeline for automatic grammar augmentation that provides a significant improvement in the voice command recognition accuracy for systems with a small-footprint acoustic model (AM). The improvement is achieved by augmenting the user-defined voice command set, also called grammar set, with alternate grammar expressions. For a given grammar set, a set of potential grammar expressions (candidate set) for augmentation is constructed from an AM-specific statistical pronunciation dictionary that captures the consistent patterns and errors in the decoding of the AM induced by variations in pronunciation, pitch, tempo, accent, ambiguous spellings, and noise conditions. Using this candidate set, greedy optimization based and cross-entropy-method (CEM) based algorithms are considered to search for an augmented grammar set with improved recognition accuracy utilizing a command-specific dataset. Our experiments show that the proposed pipeline along with algorithms considered in this paper significantly reduce the mis-detection and mis-classification rate without increasing the false-alarm rate. Experiments also demonstrate the consistent superior performance of the CEM method over greedy-based algorithms.

Improves the voice command recognition of a lightweight acoustic model by augmenting the target command set to capture variations in the input.
CVPR-W

Phase Selective Convolution

Lin, Jamie Menjay, Noorzad, Parham, Yang, Yang, Kwak, Nojun, and Porikli, Fatih

In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops on Embedded Vision 2021

This paper introduces Phase Selective Convolution (PSC), an enhanced convolution for more deliberate utilization of activations in convolutional networks. Unlike conventional use of convolutions with activation functions, PSC preserves the full space of activations while supporting desirable model nonlinearity. Similar to several other network operations, e.g., the ReLU operation, at the time of their introduction, PSC may not execute as efficiently on platforms without hardware specialization support. As a first step in addressing the need for optimization, we propose a hardware acceleration scheme to enable the intended efficiency for PSC execution. Moreover, we propose a PSC deployment strategy, with which PSC is applied only to selected layers of the networks, to avoid excessive increase in the total model size. To evaluate the results, we apply PSC as a drop-in replacement for selected convolution layers in several networks without affecting their macro network architectures. In particular, PSC-enhanced ResNets achieve higher accuracies by 1.0-2.0% and 0.7-1.0% on CIFAR-100 and ImageNet, respectively, in Pareto efficiency. PSC-enhanced MobileNets (V2 and V3 Large) and MobileNetV3 (Small) achieve 0.9-1.0% and 1.8% accuracy gains, respectively, on ImageNet at little (0.2-0.7%) total model size increase.

A simple and effective extension of Conv+ReLU that achieves better Pareto efficiency in image classification when applied to ResNets and MobileNets.