Some SMART Slides

I just returned from University of Wisconsin, Madison, where I spent the last few days talking with faculty, students, and postdocs about their research interests and the SMART algorithm. During that trip, I presented SMART to the SILO seminar, which has a varied audience of mathematicians, engineers, and computer scientists. I’ve uploaded the set of slides here, and probably its most notable feature is its length: just 30 slides, which I was able to present in roughly 30-35 minutes. In technical talks, paying attention for more than 20 minutes at a time can be difficult, so lately I’ve resolved to limit all my talks to 30 minutes, even if I’m allotted 50. The length of my talk went over well with everyone whom I spoke to after the talk.

Some good questions came up during the talk, mostly about the trigger graph, which is a minibatching interface built into SMART, and how its spectral properties affect SMART’s convergence rates. Only a coarse answer to this problem appears in the SMART paper, in particular, in the second half of Section 5.5. But clearly more minibatching results in more speed, as long as the per iteration cost does not rise too high.

Day 3 at IMA Workshop on Optimization and Parsimonious Modeling

It’s not the last day of the workshop, but it’s my last day here in snowy Minneapolis. I’ve had a fun time, and my trip is ending on a good note because of today’s good talks.

Michael Overton: Crouzeix’s Conjecture

Michael told us about Crouzeix’s conjecture which is an inequality, conjectured to be true by Michel Crouziex, relating the spectral norm of a polynomially transformed matrix to the maximum of that same polynomial over the field of values of that matrix, which is the set of complex numbers v^TAv ranging over all unit complex vectors v (here, I use T to denote conjugate transpose). This conjecture has little bearing on parsimonious modeling, but it’s an example where, to a surprising degree of success, nonsmooth analysis is used to prove properties of a function, called Crouzeix’s ratio, which suggest the verity of Crouzeix’s Conjecture and where nonsmooth BFGS seems to work quite well.

Lek-Heng Lim: Grothendieck Constant and Structured Matrix Computations

Lek-Heng wants to determine the minimal number of operations needed to, for example, multiply two matrices together, and after he finds the minimal number of operations, he’d like to give us an algorithm that achieves that number. We still lack an understanding of general matrix-matrix multiplication, but for structured matrices, for example, circulant matrices or Toeplitz matrices, Lek-Heng has the answer. The intriguing processes by which he arrives at these answers takes two steps: (1) represent the matrix-matrix multiplication rule as a bilinear map, which itself can be viewed as a tensor, and (2) compute the rank of that tensor, which allows you to both measure the complexity of the multiplication and provides an algorithm to compute the multiplication. You see, computing the rank of a tensor means simplifying it into a linear combination of simpler tensors, which suggests an algorithm. And the fewer the tensors used in this decomposition, the fewer the operations needed to compute that tensor.

Laura Balzano: Low-rank Matrix Completion under Monotonic Transformation

Laura presented an issue with matrix completion, which, briefly stated, is the task of partly observing a matrix, and then, filling in the unobserved entries of the matrix. We don’t want just any completion of the observed matrix entries, we want a low-rank one because…Occam’s Razor. But there is a problem with this approach: when we apply matrix completion in practice, our solution must be a matrix of integers, for example, each entry may correspond to a ranking (1 to 4 stars) that some user assigned some product in Netflix’s movie catalogue; but, Laura says, matrices of integers are not low-rank! So she changes the model. Laura’s new model assumes that a Lipschitz transformation of the matrix, not the original matrix, is low rank. Unfortunately, we do not often know the Lipschitz mapping in advance, so Laura proposes the scariest optimization problem of the workshop, one that requires optimizing over a set of Lipschitz monotonic functions. She provides a few heuristics, which don’t actually solve the problem under consideration, but provide interesting experimental results nonetheless.

Michael Friedlander: Level-set Methods for Convex Optimization

Michael showed us how to solve constrained optimization problems, for example, one in which we minimize a function f subject to another function g being less than sigma, by approximately solving a series of flipped optimization problems in which we minimize g subject to f being less than tau; the series of optimization problems result from varying tau, and ultimately, we’re searching for the tau for which the minimum value of g subject to f being smaller than tau is sigma, i.e., we want to tau to be the optimal value of the original optimization problem—that’s a root finding problem, hence the title of the talk. The novelty of this new approach, which has been building up in several of Michael’s and his coauthor’s older papers, is to make the root-finding algorithm practical, so they propose inexact newton and secant algorithms for finding the root.

Rachel Ward: Globally Optimal and Robust k-means Clustering via Convex Optimization

In her talk, Rachel revisited the k-means clustering problem. She works from the assumption that the minimizers of the k-means clustering problem are good clusterings. And from this assumption, she proposes a generative model of clustering, which neatly captures the types of clustering that those k-means minimizers should cluster appropriately, namely translates of isotropic distributions with sufficiently spread out cluster centers. Then, generative model in hand, she proposes a convex relaxation of the nonconvex k-means objective function, that has some nice theoretical properties, in particular, solutions to convex problem can provide certificates of optimality for the nonconvex k-means clustering problem. Rachel emphasizes that more work is left to be done, though, namely because her relaxation is a huge semi-definite optimization problem which will take a lot of time to solve.

René Vidal: Global Optimality in Matrix and Tensor Factorization, Deep Learning, and Beyond

René explained to us how deep learning and matrix factorization are a lot alike. In matrix factorization, we want to factor a matrix X approximately into product of matrices UV, but in (admittedly shallow) deep learning, we want to factor a matrix X into psi(U) V, where psi is some nonlinear function. And with this analogy, René proposed several deep learning models in which all local minimizers are global minimizers, by associating a convex problem to the notoriously nonconvex deep learning problem such that both problems have the same local minimizers. The key to the association is the analogy with matrix factorization; indeed, a similar correspondences is already known there. I think we optimizers have yet to assess the implications of René’s talk, who admits he doesn’t have the background to completely tackle the problem, but for sure, everyone sat on the edge of seats throughout the talk, frequently interrupting René with deep interest.

Day 2 at IMA Workshop on Optimization and Parsimonious Modeling

Another day, another series of talks, and even a poster session, where I watched many people awkwardly stop from afar, read my poster title, and amble off into the distance, never to be seen again.

Jim Renegar: Applying Subgradient Methods — and Accelerated Gradient Methods — to Efficiently Solve General Convex, Conic Optimization Problems

Fellow Cornellian Jim presented a very creative paper, on an elementary idea for converting any conic optimization problem into one with a simple constraint and a 1-Lipschitz objective. The key idea, according to Jim, is the radial projection map, which is used transfer the solution from the harder optimization problem, to the simpler one. Once you have the simpler problem that Jim defines, you’re free to apply any algorithm, but as a first step, Jim applies some supgradient methods, optimal or otherwise, and uses a clever trick to remove the dependence of the convergence rate on the optimal objective value. As we all know, but ask me if you don’t, any convex optimization problem can be transformed into a cone problem, and Jim uses this conversion to develop new first order methods for general convex programs. Jim’s talk was good because for every answer he proposed, one hundred questions were raised, which means work on this project is far from complete and in the future, we’ll likely see his paper cited many times.

Didier Henrion: Optimal Experimental Design with Polynomial Optimization

This talk sits far from my area of expertise, and I don’t have a good memory of the main points, so I’ll let the slides speak for themselves.

Hongchao Zhang: Optimal Gradient Methods for Nonlinear Optimization

Hongchao is a nonlinear programming expert, and the peeps from his community are serious about testing their algorithms. They have huge libraries of problems, such as CUTEr, and they aggressively throw these problems at their even more aggressive algorithms. So, in general, Hongchao does not know if the problems he receives are convex, or locally convex, or completely nonconvex. This motivates him to introduce an accelerated gradient method for continuously differentiable functions with Lipschitz gradients—one that has a fast convergence rate regardless of convexity, and recovers the existing bounds in the case of convexity. One notable difference between his method and, for example, Nesterov’s accelerated gradient method, is that his iterative steps are not in the direction of a gradient; instead they are in the direction of any descent direction, which allows him more flexibly in combining his accelerated methods with nonlinear conjugate gradient and quasi-Newton methods.

Caroline Uhler: Your Dreams May Come True with MTP2…

My dreams have come true
because I’ve understood you
that MTP2
provides a natural structural constraint on inverse covariance matrices that can be used in place of ell_1 in maximum likelihood estimates of sparse graphical models. Forgetting all the theory, an MTP2 is a type of probability distribution, and a rather rare one at that: for example, in three dimensional space, only 5% of Gaussians are of type MTP2, and that percentage gets lower in higher dimensions. However, some datasets naturally have sample distributions that are MTP2, and Caroline showed us a few in her presentation. For optimizers, the most interesting portion of the talk is alluded to in the first sentence of this paragraph: for maximum likelihood estimation of sparse gaussian graphical models, sparse inverse covariance matrix estimation, in which you minimize the negative log determinant of the covariance + an ell_1 norm on the entries of the covariance matrix subject to agreement on the observed entries, can be improved upon by enforcing that the covariance matrix comes from an MTP2 gaussian; by enforcing MTP2, you naturally enforce sparsity, the problem becomes a simpler convex program, and you only need two measurements to ensure that the maximum likelihood estimate exists. You couldn’t ask for more good properties.

Mihailo Jovanovic: Stochastic Dynamical Modeling: Structured Matrix Completion of Partially Available Statistics

Unfortunately, I had a meeting during this talk, and I missed it.

Day 1 at IMA Workshop on Optimization and Parsimonious Modeling

This week, I’m participating in the IMA workshop on Optimization and Parsimonious Modeling in Minneapolis. This is my first time at IMA and in Minneapolis, and I’ve been pretty impressed by the university and the surrounding area, which is laid out like a vast brick fortress of restaurants—many of which are good or at least familiar to me. The IMA, in particular, seems to be really good at hosting workshops; all events are perfectly organized, and you get the feeling that the workshop organizers are living up to IMA standards, not the other way around.

Bernd Sturmfels: Eigenvectors of Tensors

Today we learned from Bernd how to define the rank, eigenvectors, and singular values of tensors, those multi dimensional arrays, which, in the 3x3x3 case, Bernd delightfully calls Rubik’s cubes. He mostly restricted himself to symmetric tensors, which are like symmetric matrices, in that all the dimensions are equal and the entries of the tensor are invariant under permutation of the indices. I found the definition of eigenvalues to be interesting because they’re connected to the stationary points of certain polynomials when minimized over a sphere. But to Bernd, the point of the talk was to convince all of us that the word variety is not scary.

Aleksandr Aravkin: Conjugate Interior Point Method for Large-Scale Problems

Alex told us about interior point methods for minimizing piecewise linear-quadratic (PLQ) functions—a class of functions that behaves well under several operations, for example, the sum of two PLQs is a PLQ, conjugates of PLQs are PLQs, and smoothed PLQs (e.g., Moreau envelopes of PLQs) are PLQs. Alex and many other collaborators implemented a general interior point solver for PLQs, called ipSolve that exploits properties, for example, sparsity, of the matrices and vectors involved. Because they exploit this structure, Alex can solve fairly large problems (e.g., with 4 million unknowns) on his laptop. Not content with current methods for choosing model parameters, which many times comes down to trial and error, Alex and his collaborators add a final cherry on top, as he says, which fuses their ipSolver with a hyperparameter selection tool.

Discussion 1

Rene Vidal’s flight was cancelled due to weather, so in place of his presentation, which was moved to Wednesday, all 60 or so participants at IMA introduced themselves, including their name and their interests, and if they were on the job market, they were given some extra time to advertise their work. My favorite introduction came from a professor (I didn’t catch his name) who said “I’m from Korea. Obviously I mean South Korea,” at which the room erupted in laughter, but on second thought, it actually made me reflect on how awful things might be in the Northern half of Korea.

Wotao Yin: Asynchronous Parallel Computing in Signal Processing and Machine Learning

Wotao Yin told us about how, in the 1990s, cpu clock speed determined the price and quality of computers—and in fact, that that such speed used to appear in advertisements. He contrasted this fact with several current computer advertisements, in particular for the Microsoft surface, which do not contain clock speed or any hardware specs whatsoever. He told us that Moore’s law has led Intel to manufacture, almost exclusively, multicore processors because single core processing power has effectively stagnated. We can only make faster algorithms, he says, if we learn to use parallel computing effectively. But, he says, parallel computing is most effective when there is low overhead in the software and the hardware layers.

One source of overhead is synchronization; I like to visual synchronization with a game called average time relay: imagine two teams of five people running on a ten lane track; lanes ones through five hold team one, and lanes six through ten hold team two. The team that wins is the one in which the average of their individual times is smallest. Team one is superstitious and has decided to place ten barriers on the track, and at each of these barriers, the whole team must stop and wait for the slowest runner; they call this system synchronization. In contrast, all members of team two just run straight for the finish line. Who do you think will win?

Wotao says that if the team members are parallel subproblems of a larger parallelizable optimization algorithm, he’d put his money on the team without the barriers. And this is what he and others do in their paper on the ARock algorithm. It appears to work quite well in practice, and it motivated me to introduce SMART.

Sebastian Bubeck: Revisiting Nesterov’s Acceleration

Nesterov’s acceleration of the gradient descent algorithm is mysterious to a lot of people, particularly because there appears to be little geometric intuition behind it. So Sebastian made a new accelerated gradient method based entirely on a geometric principle which involves intersecting closed balls. This new algorithm nearly achieves the rate of convergence of Nesterov’s method (due to an exact line search, there is a log^2(1/eps) instead of a log(1/eps)). I’ll leave it to Sebastian to explain the proof to you.

Discussion 2

Most of the discussion surrounded Bernd’s talk on eigenvectors of tensors. The most interesting question came from Bubeck who asked whether it is generally hard to find eigenvectors of tensors in which each entry of the tensor is gaussian. The answer appears to be unknown.

SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm

Last night, I uploaded a new paper on the Stochastic Monotone Aggregated Root-Finding (SMART) algorithm. The algorithm excites me for a few reasons:

  1. SMART extends SAGA, SVRG, Finito, and SDCA, all of which solve problems like

    \displaystyle \text{minimize}_{x \in \mathbb{R}^d}\; \frac{1}{N} \sum_{i=1}^N f_i(x),

    to allow asynchronous parallel implementations, arbitrary block-coordinate updates, mini batching, and importance sampling.

  2. SMART replaces function gradients, {\nabla f_i}, with black-boxes, {S_i}, called operators, and arrives at the root-finding problem:

    \displaystyle \text{Find }x^\ast \in \mathbb{R}^d\text{ such that }\frac{1}{N} \sum_{i=1}^N S_i(x^\ast) = 0.

    For SMART to converge, these operators need only satisfy a weak property, which we call the coherence condition.

  3. Because SMART works with operators, it generates some new algorithms for large-scale optimization problems like

    \displaystyle \text{minimize}_{x \in \mathbb{R}^d}\; \frac{1}{M}\sum_{j=1}^M g_j(A_j x) + \frac{1}{n} \sum_{i=1}^N f_i(x),

    where the functions {g_j} are proximable and the maps {A_j} are linear—these problems are hot in machine learning right now.

In the coming weeks, I’ll devote some blog posts to implementations of SMART on problems like logistic regression, support vector machines, collaborative filtering, feasibility problems, and more. In the meantime, check out the paper; comments are welcome.

This material is based upon work supported by the National Science Foundation under Award No. 1502405. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

There is no infinite dimensional Lebesgue measure

Imagine that there was a countably additive and translation invariant measure, {m}, on a Hilbert Space {\mathcal{H}} such that every ball {B(v, \varepsilon)} has positive measure, finite measure. Because {\mathcal{H}} is infinite dimensional, there exists {\{e_{i} | i = 1, \cdots, \infty\}} such that {\langle e_{i}, e_{j}\rangle = \delta_{ij}}. Note that the points {e_{i}} and {e_{j}} are far apart: {\|e_{i} - e_{j}\|^2 = \|e_{i}\|^2 + \|e_{j}\|^2 = 2}, if {i \neq j}. Now, the ball, {B(0, 2)}, contains the countable collection of disjoint balls: {\{B(e_{i}, \frac{1}{2}) | i = 1,\cdots, \infty\}} (you can see where this is going). Thus, because all balls, {B(e_{i}, \frac{1}{2})} have the same measure (by translation invariance), and {m} is countable additive, it follows that {\infty = \sum_{i=1}^\infty m(B(e_{i}, \frac{1}{2})) < B(0, 2) < \infty}. Thus, we’ve reached a contradiction.

In some cases, it’s desirable to have an analogue of a measure on a Hilbert space {\mathcal{H}}. The most common application is to make statements such as “Property X is true for a.e. function.” One way to do make these statements is through the concept of Prevalence. Another is the definition of a Gaussian like measure called the Abstract Wiener Space.

Note that I wrote this article before realizing that Wikipedia has a similar article.

My First Two Years

During the past one and a half years I’ve spent as a graduate student, I’ve developed a few principles. I have decided to post them here, and update them as I learn more.

1. Don’t be afraid to branch out and learn subjects in other disciplines. Broadening your interests is easy in graduate school, and you’re paid to do it. Further, you may discover that you’ve gone to grad school for the wrong subject. You may be studying number theory, topology, etc., because you’re already good at it, and you don’t know enough about other subjects. It’s easy to get lost in the achievement factor in math, and just because you can solve problems with ease, doesn’t mean you’re truly motivated by the subject’s goals. The best way to discover your true motivations is to think about the guiding problems in the field and decide if they’re interesting to you in their most isolated form (e.g. do you “care” about distinguishing smooth structures on manifolds?). If the subjects no longer seem interesting, then move on to something you really enjoy.

2. Talk to everybody: professors, graduate students, undergrads, etc. The best way to learn is from other people, but don’t depend on them. Everybody has a tool they’re fond of — learn it. Find out what professors are interested in and ask for a research project or a reading course in that topic. Professors generally want to speak with you after you pass your quals. It’s part of their job to graduate students. So don’t be discouraged by professors that turn you down, five no’s and a yes is still a yes.

3. Don’t be afraid to say you don’t understand. Once you realize you don’t understand, you’re in the best position to learn. Further, many students may give the impression that they truly understand a subject, but will be forced to reevaluate if you ask enough questions.

4. Don’t be afraid of the computer. There is a massive amount of data in the world, and you’ll be at a supreme advantage if you know how to compute with it.

5. Go to seminars and colloquiums in your field and other nearby fields. Learning what other people are working on is likely to give you research ideas. Sign up to give talks in front of these groups, even if you don’t know a thing about the topic. Put a lot of effort into explaining concepts clearly and accurately.

6. Think long term. I know that many PhD students are only thinking about the next five years of graduate school. Assuming that everything will “work out” will put you at a severe disadvantage. Before every major choice you make in grad school, ask yourself why you want a PhD, and how will this choice benefit your goal. No matter what field you plan to go into, ask yourself the following question: who are the experts, and am I good enough to take their job? There isn’t a lot of room at the top, so you’re likely going to have to push someone out of the way to make room for yourself.

7. Schedule enough down time so that you can be fresh every day. Make sure you attend social functions, go to the gym, or play a sport, etc.

8. Learn to manage your time effectively. How many projects are you working on? If you get side tracked on a particular problem, move on and come back to it. Try to skim a book before you do the problems. Then, go back and read carefully when you need a deep understanding.

9. Learn how to measure success. Set benchmarks for yourself. Did you achieve more today than you do on average? Did you master a difficult concept? Try to achieve more every day, but don’t be discouraged if you can’t. You can’t learn everything in finite time. Some things are long term projects and even reading a few pages can give you something to be proud of.

10. Apply for fellowships. Teaching eats up a lot of time, but it does give you valuable experience. If you can’t get an RA, having your own funding can make your research a lot more productive. Having your own funding also increases the chances of a professor working with you.