Some SMART Slides

I just returned from the University of Wisconsin, Madison, where I spent the last few days talking with faculty, students, and postdocs about their research interests and the SMART algorithm. During that trip, I presented SMART at the SILO seminar, which draws a varied audience of mathematicians, engineers, and computer scientists. I’ve uploaded the slides here; their most notable feature is probably their length: just 30 slides, which I presented in roughly 30-35 minutes. In technical talks, paying attention for more than 20 minutes at a stretch is difficult, so lately I’ve resolved to limit all my talks to 30 minutes, even when I’m allotted 50. The shorter format went over well with everyone I spoke to after the talk.

Some good questions came up during the talk, mostly about the trigger graph, a minibatching interface built into SMART, and about how its spectral properties affect SMART’s convergence rates. The SMART paper gives only a coarse answer to this question, in the second half of Section 5.5. But clearly more minibatching buys more speed, as long as the per-iteration cost does not rise too high.
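
To make that tradeoff concrete, here is a toy sketch in Python. To be clear: this is plain uniform minibatching, not the trigger graph itself, and the operators and batch sizes are invented for illustration. Averaging b independent operator evaluations shrinks the variance of the stochastic estimate by roughly a factor of b, while the per-iteration cost grows by that same factor.

```python
import numpy as np

# Plain uniform minibatching (NOT the paper's trigger graph): average
# the evaluations of b randomly chosen operators. The cost per call
# grows like b, while the variance of the estimate shrinks like 1/b.
def minibatch_estimate(S, x, b, rng):
    idx = rng.choice(len(S), size=b, replace=False)
    return np.mean([S[i](x) for i in idx], axis=0)

rng = np.random.default_rng(0)
# Toy operators S_i(x) = x - c_i; their average has its root at mean(c).
S = [lambda x, c=c: x - c for c in rng.normal(size=100)]
for b in (1, 10, 100):
    samples = [minibatch_estimate(S, 0.0, b, rng) for _ in range(500)]
    # Variance falls roughly like 1/b (exactly 0 at the full batch,
    # since the sampling is without replacement).
    print(b, np.var(samples))
```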

SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm

Last night, I uploaded a new paper on the Stochastic Monotone Aggregated Root-Finding (SMART) algorithm. The algorithm excites me for a few reasons:

  1. SMART extends SAGA, SVRG, Finito, and SDCA, all of which solve problems like

    \displaystyle \text{minimize}_{x \in \mathbb{R}^d}\; \frac{1}{N} \sum_{i=1}^N f_i(x),

    to allow asynchronous parallel implementations, arbitrary block-coordinate updates, minibatching, and importance sampling.

  2. SMART replaces function gradients, {\nabla f_i}, with black boxes, {S_i}, called operators, and arrives at the root-finding problem:

    \displaystyle \text{Find }x^\ast \in \mathbb{R}^d\text{ such that }\frac{1}{N} \sum_{i=1}^N S_i(x^\ast) = 0.

    For SMART to converge, these operators need only satisfy a weak property, which we call the coherence condition. (A toy sketch of the resulting iteration appears just after this list.)

  3. Because SMART works with operators, it generates some new algorithms for large-scale optimization problems like

    \displaystyle \text{minimize}_{x \in \mathbb{R}^d}\; \frac{1}{M}\sum_{j=1}^M g_j(A_j x) + \frac{1}{N} \sum_{i=1}^N f_i(x),

    where the functions {g_j} are proximable (see the soft-thresholding example after this list) and the maps {A_j} are linear—these problems are hot in machine learning right now.
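
To give a feel for the kind of iteration involved, here is a toy Python sketch of the SAGA-style variance-reduced update, written for black-box operators {S_i} rather than gradients. The uniform sampling, the fixed step size, and the serial loop are my own simplifying choices; the scheme in the paper is far more general (asynchrony, block-coordinate updates, the trigger graph, and so on).

```python
import numpy as np

def variance_reduced_roots(S, x0, n_iters=20000, step=0.01, seed=0):
    """Toy SAGA-style iteration seeking x with (1/N) sum_i S[i](x) = 0.
    A table of past operator evaluations keeps the stochastic estimate
    of the mean operator unbiased while shrinking its variance."""
    rng = np.random.default_rng(seed)
    N = len(S)
    x = x0.copy()
    table = np.array([Si(x) for Si in S])  # stored evaluations S_i(x_old)
    avg = table.mean(axis=0)               # running average of the table
    for _ in range(n_iters):
        i = rng.integers(N)                # sample one operator uniformly
        fresh = S[i](x)
        # Unbiased, variance-reduced estimate of (1/N) sum_i S_i(x):
        x = x - step * (fresh - table[i] + avg)
        avg += (fresh - table[i]) / N      # keep the average in sync
        table[i] = fresh
    return x
```

As a sanity check, take {S_i(x) = a_i (a_i^T x - b_i)}, the gradient of a least-squares term, so the root of the averaged operator is the least-squares solution:

```python
rng = np.random.default_rng(1)
A, b = rng.normal(size=(50, 5)), rng.normal(size=50)
S = [lambda x, a=A[i], bi=b[i]: a * (a @ x - bi) for i in range(50)]
x = variance_reduced_roots(S, np.zeros(5))
# Should print True: the iterate matches the least-squares solution.
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0], atol=1e-3))
```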
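And since “proximable” is doing real work in that last item: a function is proximable when its proximal map can be evaluated cheaply, often in closed form. The standard textbook example (not anything specific to the paper) is the {\ell_1} norm, whose prox is componentwise soft-thresholding:

```python
import numpy as np

def prox_l1(v, t):
    """Proximal map of t * ||.||_1, i.e., the minimizer over x of
    t * ||x||_1 + (1/2) * ||x - v||^2: soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

print(prox_l1(np.array([3.0, -0.5, 1.2]), t=1.0))  # ~[ 2. -0.  0.2]
```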

In the coming weeks, I’ll devote some blog posts to implementations of SMART on problems like logistic regression, support vector machines, collaborative filtering, feasibility problems, and more. In the meantime, check out the paper; comments are welcome.

This material is based upon work supported by the National Science Foundation under Award No. 1502405. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.