Arxiv trawling, May 19

Tensor recovery and completion — a novel approach to recovering low-rank tensors that does not involve matrix unfolding, and has reasonable sample complexities (better than square deal)

Partial sharing of latent factors in multiview data — explicitly take advantage of the fact that not only should information be shared across modes, but also each mode should have its own unique information

Analysis of a nonconvex low-rank matrix estimator — not sure what the regularizer is, but looks readable, analysis perhaps uses the restricted isometry assumption

Exponential families on Manifolds — fun light reading. natural adaptation of exponential families to (Lie?) manifolds; not many experimental results

Exponential families in high dimensions — learning convergence rate under sparsity assumptions. useful for anything I do? looks like fun/informative reading, though

Graph partitioning for parallel machine learning — uses submodularity. how does it compare to the sampling scheme from Cyclades?

Column selection from matrices with missing data — looks like a long read, but worthwhile, and well written. READ

Minkowski symmetrization of star-shaped sets — looks readable; applicable to using Householder rotations to do range space estimation?

Distributed preconditioning for sparse linear systems — A Saad paper. Is this useful for us anywhere? Nice to read to soak in some knowledge on preconditioning

Another preconditioner for sparse systems — Another Saad paper. Nice to read to get some preconditioner knowledge

Using bootstrap to test equality of covariance matrices — relevant to the science problems we’re working on (e.g. diagnostics)? might be worth reading just for general statistical sophistication; they apply the method to a biology problem

A second-order active set algorithm for l1-regularized optimization — with a short, sweet convergence analysis. looks like a fun, informative optimization read

Boosting in linear regression is related to subgradient optimization — how is this related to “AdaBoost and Forward Stagewise Regression are First-Order Convex Optimization Methods” by the same authors?

And an oldie:
Duality between subgradient and conditional gradient methods — exactly as stated. Looks like a nice piece of work in terms of convergence analyses and ideas to repurpose, as well as general pedagogical value. A Bach paper.

Random Kitchen Sinks

I’ve been away from this blog for years now(!); this is a post that’s been sitting in the pipeline. More to come.

In their 2007 NIPS paper, “Random Features for Large-Scale Kernel Machines”, Rahimi and Recht propose using random feature maps to facilitate dealing with nonlinear kernel methods over large training sets. The problem is that if \(n\), the number of training examples, is large, then the optimization problem involves dealing with an \(n \times n\) kernel matrix. Also, the resulting classifier (or regression equation, or what have you) requires the computation of \(k(x, x_i)\) for all the training points \(x_i\).

One way to mitigate these concerns is to use sampling methods like the Nystrom extension. The Nystrom extension and its ilk work by sampling and then interpolating to approximate the kernel. Rahimi and Recht’s idea is to use the fact that PSD kernels can be written in terms of feature maps, \(k(x,y) = \langle \phi(x), \phi(y) \rangle\). This implies that one could approximate the kernel as \(k(x,y) \approx z(x)^T z(y)\) if \(z(\cdot)\) were chosen appropriately. Note that for general kernels, the feature map \(\phi\) may be infinite-dimensional, so if \(z\) could be chosen to be a finite-dimensional mapping, we’d arguably have a simplification of the kernel representation. More practically interesting: although we’d still be dealing with a large \(n \times n\) kernel matrix, it would now be a linear kernel, for which much more efficient optimization methods exist. Furthermore, applying the trained classifier (or regression equation, or what have you) would require only the computation of \(w^T z(x)\), where \(w\) is a vector in the span of the \(z(x_i)\). Thus, finding an appropriate low-dimensional approximate feature map \(z\) would increase the efficiency with which kernel methods are both trained and applied.

Rahimi and Recht’s key idea was to take \(z(x)\) to be a sum of unbiased estimators of \(\phi(x)\). Then concentration of measure ensures that, uniformly over a compact set \(\Omega\), for all \(x, y \in \Omega\),
\[
k(x,y) = \langle \phi(x), \phi(y) \rangle \approx z(x)^T z(y).
\]
In particular, they proposed sampling from the Fourier transform of shift invariant kernels to form such estimators. In more detail, if
\[
k(x,y) = k(x-y) = \int_{\mathbb{R}^d} \exp(i \omega^T (x-y)) \rho(\omega) \, d\omega,
\]
then one can approximate \(k\) with a finite sum:
\[
k(x,y) \approx \frac{1}{D} \sum_{j=1}^{D} \exp(i \omega_j^T x) \, \overline{\exp(i \omega_j^T y)}
\]
when the \(\omega_j\) are sampled i.i.d. according to \(\rho\).
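To make this concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper) that approximates a Gaussian kernel with real-valued random Fourier features, using the standard cosine-with-random-phase form of the complex exponentials above; the kernel choice, bandwidth, and sizes are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, D, sigma = 200, 5, 2000, 1.0   # D = number of random features (assumed sizes)

X = rng.standard_normal((n, d))

# Exact Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K_exact = np.exp(-sq_dists / (2 * sigma ** 2))

# Random Fourier features: omega_j sampled from the kernel's Fourier transform
# (a Gaussian with covariance sigma^{-2} I), plus a random phase.
omega = rng.standard_normal((d, D)) / sigma
b = rng.uniform(0, 2 * np.pi, size=D)
Z = np.sqrt(2.0 / D) * np.cos(X @ omega + b)

# The linear kernel of the features approximates the nonlinear kernel.
K_approx = Z @ Z.T
print("max abs error:", np.abs(K_exact - K_approx).max())
```

Increasing D drives the approximation error down at the usual Monte Carlo rate, which is exactly the concentration argument above.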

This is a neat idea, but one is left with the question: now we know how to approximate the nonlinear kernel matrix with a linear kernel matrix, but what can we say about how this approximation affects our downstream application? Recht and Rahimi’s second paper, “Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning”, answers this question for Lipschitz multiplicative loss functions of the form
\[
\ell(c, c^\prime) = \ell(cc^\prime), \quad\text{with}\quad \|\ell(x) - \ell(y)\| \leq L \|x - y\|
\]
and hypothesis space of the form
\[
\left\{ f(x) = \int_{\Omega} \alpha(\omega) \phi(x; \omega) d\omega \,\big|\, |\alpha(\omega)| \leq Cp(\omega) \right\},
\]
where the atoms \(\phi(x; \omega)\) are uniformly bounded w.r.t. both arguments, and \(p\) is a distribution over \(\Omega\).
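As a concrete instance (my gloss; the paper states this more generally), the random Fourier setup above fits this template by taking
\[
\phi(x; \omega) = \cos(w^T x + b), \qquad \omega = (w, b), \qquad p(\omega) = \rho(w) \cdot \mathrm{Unif}[0, 2\pi](b),
\]
so that functions in the hypothesis space are (continuous) mixtures of random cosine features with coefficients dominated by \(C p(\omega)\).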

The idea is that, since the contribution of each atom is bounded (by \(C \sup_{x,\omega} |\phi(x; \omega)|\)), the approximation error incurred by replacing the loss minimizer over the true hypothesis space with the loss minimizer over the finite-dimensional approximation space
\[
\left\{ f(x) = \sum_{k=1}^K \alpha_k \phi(x; \omega_k) \,\big|\, |\alpha_k| \leq \frac{C}{K} \right\}
\]
is small with high probability over the draw of the \(\omega_k\), which are selected i.i.d. according to \(p\). Concisely, the Galerkin method works when the basis elements are selected randomly from an appropriate distribution.
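Here is a minimal sketch of the resulting procedure (again my own illustration; I substitute a ridge penalty for the paper's \(|\alpha_k| \leq C/K\) constraint, and the data and sizes are made up): sample the \(\omega_k\) i.i.d. from \(p\), freeze the random features, and fit the \(\alpha_k\) alone by convex optimization.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, K = 500, 3, 300   # K = number of sampled atoms (assumed sizes)

# Toy regression data with a nonlinear target (purely illustrative).
X = rng.uniform(-1.0, 1.0, size=(n, d))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(n)

# Sample the atom parameters omega_k = (w_k, b_k) i.i.d. from p and freeze them.
W = rng.standard_normal((d, K))
b = rng.uniform(0.0, 2.0 * np.pi, size=K)
Phi = np.cos(X @ W + b)              # Phi[i, k] = phi(x_i; omega_k)

# Fit only the weights alpha_k; a ridge penalty stands in for the
# |alpha_k| <= C/K constraint analyzed in the paper.
lam = 1e-2
alpha = np.linalg.solve(Phi.T @ Phi + lam * np.eye(K), Phi.T @ y)

y_hat = Phi @ alpha
print("train RMSE:", np.sqrt(np.mean((y - y_hat) ** 2)))
```

The point of the theorem is that this two-step recipe (randomize the atoms, then minimize over the weights) is nearly as good as minimizing over the full hypothesis space.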

“Some estimates of norms of random matrices” and “Tensor sparsification via a bound on the spectral norm of random tensors”

Every now and again, I get it in my head to revisit Latala’s paper “Some estimates of norms of random matrices” where he gives a sharp bound on the expected spectral norm of a mean zero random matrix with independent entries.
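For reference, the bound in question (stated from memory, so check the paper for the precise conditions) says that for a random matrix \(A\) with independent mean-zero entries \(a_{ij}\),
\[
\mathbb{E}\, \|A\| \leq C \left( \max_i \sqrt{\sum_j \mathbb{E}\, a_{ij}^2} + \max_j \sqrt{\sum_i \mathbb{E}\, a_{ij}^2} + \Big( \sum_{i,j} \mathbb{E}\, a_{ij}^4 \Big)^{1/4} \right)
\]
for some universal constant \(C\).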

His proof is damn near incomprehensible (certainly so for me), results in a bound with an unspecified universal constant, and doesn’t give moment bounds. But it’s clear that all of these problems can be gotten around. It looks like he’s using an entropy-concentration tradeoff, as Rudelson and Vershynin call it, but his proof is so convoluted that it’s hard to pin down exactly what’s going on.

But whenever I get it in my head to attempt to clarify and extend his proof, I remember that Nguyen, Drineas, and Tran have already done so. They’ve actually extended it to bound the expected norm of mean-zero random tensors with independent entries. Unfortunately, they don’t seem to give Latala his full amount of credit, since they claim they’re using a new approach (viz., the entropy-concentration tradeoff) due to Rudelson and Vershynin, when it seems to me that they’re revising Latala’s proof to make it more readable and making obvious extensions. This isn’t to say that the NDT paper isn’t worthwhile: I think it’s a great example of the power of Gaussian symmetrization techniques and a very approachable demonstration of the entropy-concentration tradeoff. It’s just that I think Latala should be given more credit for observing this entropy-concentration tradeoff.

This case serves as a supporting point in my argument that it’s not enough to just publish results. Your exposition should be clear about the intuition and techniques that are used, not simply a chain of technical arguments. If Latala had spent more time refining his exposition, the NDT result would be clearly seen as a descendant of his work.

Initio

This site’s purpose is to track and encourage my mathematical reading while giving me a central location to record my observations and comments on whatever I’m reading. I hope that others will be motivated to share their own thoughts and comments, or suggest relevant readings.

For now the blog’s name is “Readings in modern applied math.” Modern as opposed to classical: I’m NOT interested in classical applied math (PDEs and related numerical methods). I’m interested in the more modern ‘informatic aspects’ of applied math and the associated theory. What I mean by that should become clear as I post (actually, here’s a precis of the research I’ve done so far, and some things I’m interested in working on in the future), but essentially, for now, think of the type of work being done by the likes of:

(applied people) Christos Boutsidis, Petros Drineas, Michael Mahoney, Benjamin Recht, Joel Tropp, Emmanuel Candes, Daniel Hsu, Mark Tygert, Nam Nguyen

(more theoretical) Daniel Spielman, Mark Rudelson, Roman Vershynin, Pascal Massart, Michel Ledoux, Michel Talagrand, Sara van de Geer

So, if you’re interested, subscribe to the blog to get updates, and comment to let me know of any work or researchers you think I may find interesting.