Introduction

In sparse dictionary learning, sparse coding and dictionary update are solved in an alternating manner (Aharon et al., 2006). In the sparse coding stage, the following problem is solved given the dictionary $D \in \mathbb{R}^{d \times n}$ and signals $y_j \in \mathbb{R}^d$:

$$
\min_{x_j}\ \lVert y_j - D x_j \rVert_2^2 \quad \text{s.t.}\ \lVert x_j \rVert_0 \le K, \tag{1}
$$

where $K$ is the sparsity. Sometimes there's an additional constraint $x_j \ge 0$ if non-negative sparse coding is required. Since the $\ell_0$ constraint is intractable to optimize exactly, either approximate greedy algorithms like (non-negative) orthogonal matching pursuit (Cai & Wang, 2011; Yaghoobi et al., 2015; Nguyen et al., 2019), or relaxations of $\ell_0$ to $\ell_1$ sparsity as in (non-negative) basis pursuit (Chen & Donoho, 1994; Gregor & LeCun, 2010; Zhang et al., 2018; Tolooshams & Ba, 2022), are regarded as the idiomatic solutions.
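For concreteness, here is a minimal sketch of the greedy route for problem (1) using scikit-learn's OMP; the dimensions, the random dictionary, and the signals below are placeholder assumptions.

```python
# Sparse coding stage (1) solved greedily with OMP, one signal at a time.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

d, n, m, K = 64, 256, 1000, 8            # signal dim, #atoms, #signals, sparsity
rng = np.random.default_rng(0)
D = rng.standard_normal((d, n))
D /= np.linalg.norm(D, axis=0)            # unit-norm atoms, as is conventional
Y = rng.standard_normal((d, m))           # placeholder signals

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=K, fit_intercept=False)
X = np.stack([omp.fit(D, Y[:, j]).coef_ for j in range(m)], axis=1)  # (n, m) codes
```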

Proposed method

(Louizos et al., 2018) suggest a novel approach to handle the intractability of the $\ell_0$ constraint. Instead of tackling the $\ell_0$ constraint directly, the authors constrain the expectation of the $\ell_0$ norm by introducing Bernoulli random variables. In the parlance of the sparse coding problem (1),

$$
\min_{x_j, \pi_j}\ \mathbb{E}_{q(z_j \mid \pi_j)}\big[\lVert y_j - D(x_j \odot z_j) \rVert_2^2\big] \quad \text{s.t.}\ \mathbf{1}^\top \pi_j \le K, \tag{2}
$$

where $x_j$ has been reparameterized as $x_j \odot z_j$, with $z_{ji} \sim \mathrm{Bernoulli}(\pi_{ji})$ and $x_{ji} \in \mathbb{R}$ for each $i$; the symbol $\odot$ denotes the elementwise product. Note that Equation (2) can be trivially extended to the non-negative sparse coding case by the reparameterization $x_j := \exp(x_j) \odot z_j$ or $x_j := \operatorname{softplus}(x_j) \odot z_j$, where $\operatorname{softplus}(\cdot) = \log(1 + \exp(\cdot))$. (Louizos et al., 2018) further introduce a smoother on the discrete random variable $z_j$ to allow for the reparameterization trick (Kingma & Welling, 2014; Rezende et al., 2014), so that the expectation in Equation (2) can be estimated by Monte Carlo sampling.
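As a concrete illustration, here is a minimal PyTorch sketch of such a smoother, assuming the hard-concrete relaxation of (Louizos et al., 2018); the gates are parameterized through logits, and the temperature and stretch constants below are common choices rather than values prescribed by this post.

```python
# Hard-concrete relaxation of z ~ Bernoulli(pi): a differentiable surrogate gate.
import math
import torch

def sample_hard_concrete(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    """Draw a relaxed gate z in [0, 1] from logits log_alpha."""
    u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)
    s = torch.sigmoid((torch.log(u) - torch.log(1 - u) + log_alpha) / beta)
    s_bar = s * (zeta - gamma) + gamma        # stretch to (gamma, zeta)
    return s_bar.clamp(0.0, 1.0)              # hard-clip back to [0, 1]

def expected_l0(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    """E[number of non-zero gates], the surrogate for 1^T pi_j."""
    return torch.sigmoid(log_alpha - beta * math.log(-gamma / zeta)).sum(-1)
```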

To solve the constrained minimization in Equation (2), it's natural to proceed with a Lagrange multiplier and optimize under only the bound constraint $\lambda_j \ge 0$:

$$
\min_{x_j, \pi_j} \max_{\lambda_j \ge 0}\ \mathbb{E}_{q(z_j \mid \pi_j)}\big[\lVert y_j - D(x_j \odot z_j) \rVert_2^2\big] + \lambda_j \big(\mathbf{1}^\top \pi_j - K\big). \tag{3}
$$

On the one hand, one may optimize $x_j, \pi_j, \lambda_j$ jointly via gradient descent. It's worth noting, however, that one must perform gradient ascent on $\lambda_j$, which can be achieved by negating its gradient before the descent step. On the other hand, dual gradient ascent can be adopted: given fixed $\lambda_j$, the objective (3) is minimized to a critical point; then, given fixed $x_j$ and $\pi_j$, $\lambda_j$ is updated with a one-step gradient ascent; finally, iterate. The joint route is sketched below.
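This sketch handles a single signal and reuses sample_hard_concrete and expected_l0 from the previous snippet; the data, learning rate, and step count are placeholder assumptions.

```python
# Joint descent on (x_j, gate logits) and ascent on lambda_j for objective (3).
import torch

d, n, K, lr = 64, 256, 8, 1e-2
D = torch.randn(d, n); D = D / D.norm(dim=0)      # placeholder dictionary
y = torch.randn(d)                                 # placeholder signal

x = torch.zeros(n, requires_grad=True)             # code magnitudes x_j
log_alpha = torch.zeros(n, requires_grad=True)     # gate logits (reparameterized pi_j)
lam = torch.zeros(1, requires_grad=True)           # Lagrange multiplier lambda_j

opt = torch.optim.SGD([x, log_alpha, lam], lr=lr)
for step in range(1000):
    z = sample_hard_concrete(log_alpha)            # one Monte Carlo sample
    recon = ((y - D @ (x * z)) ** 2).sum()
    constraint = expected_l0(log_alpha) - K        # surrogate for 1^T pi_j - K
    loss = recon + (lam * constraint).sum()
    opt.zero_grad()
    loss.backward()
    lam.grad.neg_()                                # ascent on lambda: negate its gradient
    opt.step()
    with torch.no_grad():
        lam.clamp_(min=0.0)                        # keep the multiplier non-negative
```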

In practice, a potentially large number of signals need to be sparse coded against the same dictionary:

$$
\min_{x, \pi} \max_{\lambda \ge 0}\ \sum_{j=1}^m \Big\{ \mathbb{E}_{q(z_j \mid \pi_j)}\big[\lVert y_j - D(x_j \odot z_j) \rVert_2^2\big] + \lambda_j \big(\mathbf{1}^\top \pi_j - K\big) \Big\}. \tag{4}
$$

It's not uncommon that all the variables to optimize, especially $\{x_j, \pi_j\}_{j=1}^m$, cannot fit into memory, making plain gradient descent infeasible. Notice that for each $j$, the optimal solution $(x_j, \pi_j)$ is determined by $(y_j, \lambda_j)$; that is, $x_j = x(y_j, \lambda_j)$ and $\pi_j = \pi(y_j, \lambda_j)$. Therefore, I propose to perform amortized inference: use a neural network $f$ parameterized by $\phi$ that takes $(y_j, \lambda_j)$ as input and predicts $x_j$ and $\pi_j$. I found that ReLU activations in such a network promote training the most. The objective (4) now becomes:

$$
\min_{\phi} \max_{\lambda \ge 0}\ \sum_{j=1}^m \Big\{ \mathbb{E}_{q(z_j \mid \phi)}\big[\lVert y_j - D\big(f_x(y_j, \lambda_j; \phi) \odot z_j\big) \rVert_2^2\big] + \lambda_j \big(\mathbf{1}^\top f_\pi(y_j, \lambda_j; \phi) - K\big) \Big\}. \tag{5}
$$
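One possible form of $f$ is sketched below: a small ReLU MLP with two heads, one for $f_x$ and one for the gate logits standing in for $f_\pi$; the width, depth, and the simple concatenation of $\lambda_j$ with $y_j$ are arbitrary assumptions.

```python
# Amortized encoder f(y_j, lambda_j; phi) with a magnitude head and a gate-logit head.
import torch
import torch.nn as nn

class AmortizedCoder(nn.Module):
    def __init__(self, d, n, hidden=512):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(d + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head_x = nn.Linear(hidden, n)        # f_x: code magnitudes
        self.head_logits = nn.Linear(hidden, n)   # f_pi: gate logits

    def forward(self, y, lam):
        # y: (batch, d), lam: (batch, 1); the multiplier is simply concatenated.
        h = self.trunk(torch.cat([y, lam], dim=-1))
        return self.head_x(h), self.head_logits(h)
```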

For dictionary learning, the dictionary $D$ itself also needs to be learned. Using the objective (5), I found it preferable to optimize with the procedure below (a code sketch follows):

  1. Given $\lambda$, reinitialize $\phi$, and jointly learn $\phi$ and $D$ until a stationary point is reached.
  2. Given $\phi$ and $D$, perform a one-step gradient ascent on $\lambda$.
  3. Iterate.

I found the reinitialization of the amortized network critically important. Without it, the network tends to predict all zeros and eventually learns nothing. The dictionary, however, needs to be initialized only at the very beginning.
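The sketch below puts the pieces together, reusing AmortizedCoder, sample_hard_concrete, and expected_l0 from the earlier snippets; the fixed inner-step count stands in for "until a stationary point", and all sizes and step sizes are placeholder assumptions.

```python
# Outer loop: re-initialize phi, jointly train (phi, D), then one ascent step on lambda.
import torch

def train_dictionary(Y, d, n, K, outer_steps=20, inner_steps=500, lr=1e-3, lam_lr=1e-2):
    """Y: (d, m) tensor of signals to be sparse coded while learning D."""
    m = Y.shape[1]
    D = torch.randn(d, n, requires_grad=True)      # dictionary: initialized once only
    lam = torch.zeros(m)                           # one multiplier per signal
    for _ in range(outer_steps):
        coder = AmortizedCoder(d, n)               # step 1: re-initialize phi
        opt = torch.optim.Adam(list(coder.parameters()) + [D], lr=lr)
        for _ in range(inner_steps):               # jointly learn phi and D
            x, log_alpha = coder(Y.t(), lam.unsqueeze(1))
            z = sample_hard_concrete(log_alpha)
            recon = ((Y.t() - (x * z) @ D.t()) ** 2).sum(dim=1)
            constraint = expected_l0(log_alpha) - K
            loss = (recon + lam * constraint).sum()
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():                      # step 2: one-step ascent on lambda
            _, log_alpha = coder(Y.t(), lam.unsqueeze(1))
            lam += lam_lr * (expected_l0(log_alpha) - K)
            lam.clamp_(min=0.0)                    # keep the multipliers non-negative
    return D.detach(), coder
```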

Experiments

For dictionary learning without the non-negativity constraint on sparse coding, I compared against (Rubinstein et al., 2008) on image denoising. My fully differentiable solution converges more slowly and denoises worse than K-SVD backed by batch OMP.

For dictionary learning with the non-negativity constraint on sparse coding, I compared against (Nguyen et al., 2019) on the recovery of atoms of discourse, a problem known to admit a non-negative sparse coding formulation (Arora et al., 2018). While faster, my proposed method still performs worse than non-negative OMP, in that the learned dictionary atoms are mostly not atoms of discourse.

This is the main reason why I record my attempt here as a post rather than a paper. The proposed method may be promising, but it's not ready yet.