This reading course will explore modern challenges at the interface of continuous optimization and machine learning. Our inquiry will be guided by two motivating questions:
- Can we use classical ideas in optimization to better understand (and improve) algorithms for challenging problems in machine learning?
- How can modern insights in machine learning guide the design of new and improved methods for optimization?
Topics may include low rank optimization, generalization in deep learning, regularization (implicit and explicit) for deep learning, connections between control theory and modern reinforcement learning, and optimization for trustworthy machine learning (including fair, causal, or interpretable models). Topics may change according to student interest.
The format will consist of student-led discussion of papers. The course will culminate in a final research project, which will constitute a substantial portion of the grade. Students will deliver short presentations on their projects in class in addition to written reports.
- When: 11:40 - 12:55 TTh
- Where: Hollister Hall 372 or via Zoom
- Quiz (same link for all quizzes)
- CMT for paper reviews and peer reviews
- Slack (use your cornell.edu address) for general questions and comments.
Office hours will be chosen via an in-class poll. You can also talk with Prof. Udell after class or contact her by email at udell@cornell.edu to set up an appointment.
Students are required to attend class, with at most two absences. (Up to two classes may also be attended remotely via Zoom by prior arrangement with the instructor. Students at Cornell Tech may attend all classes remotely.)
Course grades have four components (weights in parentheses for the 3-credit option):
- Reviews (.3): Students will write reviews of some (about half) of the papers we read for class and upload them to CMT.
- Quiz (.1): Most classes will begin with a two-question quiz. The questions will be easy to answer if you read the paper. Grading on this portion of the course will be nonlinear (e.g., 0 points if too few quiz questions are answered correctly).
- Present a paper (.3): Students will each lead the discussion twice (or so), possibly in teams depending on course enrollment. The student leading the discussion will prepare a 30-minute presentation using slides, two true/false or multiple-choice questions on the paper, and a class activity or questions for discussion. The presentations will be graded by peer review.
- Research project (.3): The final research project should be on a topic with connections to both continuous optimization and machine learning. Projects can be completed in teams of up to two students; larger teams require special advance permission from the instructor. Students will prepare an initial project proposal, a midterm report, and a final report on the project, and will make an in-class presentation. Research projects can (and should!) advance your PhD research.
Students taking the class for 1 credit will write fewer reviews, present only one paper, and will not be required to complete the research project. The quiz requirement will be the same.
This Google Doc serves as our schedule. Sign up for presentation slots on the Google Doc by adding your names and a link to the paper you'll present. (Make sure not to choose a paper someone else has already picked!) We may spend more or less time on a topic depending on student interest.
The course readings will be selected from the following, or other papers suggested by students in the class.
Our tour begins with low rank optimization. Many of the ideas developed (rigorously) for low rank optimization carry over (heuristically) to deep learning. Here we see two ways to formulate the problem: as a biconvex problem, or as a convex problem with a low rank constraint (or, sometimes, side information). Here, we know how to provably find the solution to the problem (using convex methods); we can sometimes prove that the nonconvex method finds the same solution, and can always evaluate the quality of any putative solution relative to the true optimal solution.
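To make the two formulations concrete, here is a minimal sketch (mine, not taken from any of the papers below) that compares them on a synthetic matrix completion instance: singular value thresholding stands in for the convex (nuclear norm) route, and alternating least squares on the factors stands in for the biconvex route. All sizes, thresholds, and iteration counts are made up for illustration.

```python
# Two formulations of low rank matrix completion on synthetic data:
# (1) convex-style: iterative singular value thresholding (a nuclear norm proxy)
# (2) biconvex: alternating least squares on the factors of X = A B^T
import numpy as np

rng = np.random.default_rng(0)
n, r = 50, 3
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # true low rank matrix
mask = rng.random((n, n)) < 0.5                                 # observed entries

def svt_complete(M, mask, tau=5.0, iters=200):
    """Convex-style completion: fill in, threshold singular values, repeat."""
    X = np.zeros_like(M)
    for _ in range(iters):
        Z = np.where(mask, M, X)                        # keep observed entries
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        X = U @ np.diag(np.maximum(s - tau, 0)) @ Vt    # shrink singular values
    return X

def als_complete(M, mask, r=3, iters=50, lam=1e-3):
    """Biconvex (factored) completion: alternate least squares in A and B."""
    A = rng.standard_normal((M.shape[0], r))
    B = rng.standard_normal((M.shape[1], r))
    for _ in range(iters):
        for i in range(M.shape[0]):                     # update row i of A
            idx = mask[i]
            A[i] = np.linalg.solve(B[idx].T @ B[idx] + lam * np.eye(r),
                                   B[idx].T @ M[i, idx])
        for j in range(M.shape[1]):                     # update row j of B
            idx = mask[:, j]
            B[j] = np.linalg.solve(A[idx].T @ A[idx] + lam * np.eye(r),
                                   A[idx].T @ M[idx, j])
    return A @ B.T

for name, Xhat in [("SVT", svt_complete(M, mask)), ("ALS", als_complete(M, mask))]:
    print(name, "relative error:", np.linalg.norm(Xhat - M) / np.linalg.norm(M))
```

Note the tradeoff many of the papers below study: the factored method must be told the target rank up front, while the convex method controls the rank only implicitly, through the shrinkage threshold.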
- Barvinok 1995, Pataki 1998 show that the rank of exact solutions to an SDP with m linear constraints grows as O(sqrt(m)); a rough statement of the bound is sketched after this list
- Barvinok 2002 p. 140: the rank of an epsilon-approximation to a PSD matrix satisfying m linear constraints grows as log(m / epsilon)
- So, Ye, and Zhang 2008 Similar to the low rank approximation idea above, but more algorithmic. Still requires solving the SDP first.
- Hazan 2008 suggests using Frank-Wolfe for semidefinite programming
- Garber and Hazan 2011 integrates the above idea into a method to produce approximate solutions of SDPs
- Jaggi 2013 Solving matrix completion with Frank-Wolfe ensures the rank is bounded by the number of iterations
- Yurtsever et al. 2017 The first storage-optimal convex method for matrix completion
- Burer and Monteiro 2003 Original paper proposing factored formulation for solving SDPs
- Journée et al. 2010 How to escape a saddle point with gradient information
- Boumal et al. 2014
- Boumal, Voroninski, and Bandeira 2016 The non-convex Burer-Monteiro approach works on smooth semidefinite programs
- Bhojanapalli et al. 2018 Smoothed analysis of BM factorization
- Finding local minima efficiently: Lee et al. 2016 shows GD only converges to minimizers (not saddle points), and Lee et al. 2017 extends the result to other first-order methods
- Gunasekar et al. 2017 Gradient descent on the factored formulation of matrix completion implicitly regularizes the nuclear norm
- Bhojanapalli, Neyshabur, and Srebro 2016 and Ge, Lee, and Ma 2016 The factored formulation of matrix completion from incoherent measurements has no spurious local minima
- Waldspurger and Waters 2018 ...but only if the rank is high enough
- Ma et al. 2017 Gradient descent (implicitly) imposes regularization for a wide variety of statistical models
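For reference, here is a rough statement of the rank bound behind the Barvinok and Pataki entries above, paraphrased from memory; see the papers for the precise hypotheses.

```latex
% Barvinok--Pataki rank bound (informal paraphrase)
Every extreme point $X$ of the feasible set
\[
  S = \{\, X \succeq 0 \;:\; \langle A_i, X \rangle = b_i,\ i = 1,\dots,m \,\}
\]
has rank $r$ satisfying
\[
  \frac{r(r+1)}{2} \le m,
  \qquad\text{so}\qquad
  r \le \frac{\sqrt{8m+1}-1}{2} = O(\sqrt{m}).
\]
% If the SDP attains its optimum over a bounded feasible set, a linear objective
% is minimized at an extreme point, so some optimal solution has rank O(sqrt(m)).
```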
What properties of deep nets make them easy or hard to optimize?
- Dauphin et al. 2014 Is the quality of a critical point proportional to the number of directions of positive curvature?
- Zhang et al. 2016 (Not very) deep nets can exactly fit noise. So why do they generalize well for real data?
- Hao Li et al. 2017 Visualizing the loss landscape of neural nets. Certain architectures are dramatically more "convex" than others!
- D Alberti et al. 2018 How do the depth and width of a deep neural net (and the associated difficulties with optimization) affect the bias-variance tradeoff?
- Behnam Neyshabur et al. 2019 Maybe deep nets aren't as overparametrized as they seem
- Ian Goodfellow, Oriol Vinyals, and Andrew Saxe 2015 The loss surface looks convex on the path from the initial to final SGD iterate (a toy illustration follows this list)
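As a toy version of the Goodfellow, Vinyals, and Saxe experiment referenced above, the sketch below trains a small one-hidden-layer network on made-up data and then evaluates the loss along the straight line between the initial and final parameter vectors. The data, architecture, and step sizes are invented for illustration and are not from the paper.

```python
# Loss along the linear path theta(alpha) = (1 - alpha) * theta0 + alpha * theta_final
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = np.tanh(X @ rng.standard_normal(5))       # synthetic regression target

def unpack(theta, d=5, h=16):
    W1 = theta[:d * h].reshape(d, h)
    W2 = theta[d * h:]
    return W1, W2

def loss_and_grad(theta):
    W1, W2 = unpack(theta)
    H = np.tanh(X @ W1)                       # hidden activations
    err = H @ W2 - y
    loss = 0.5 * np.mean(err ** 2)
    gW2 = H.T @ err / len(y)
    gW1 = X.T @ (np.outer(err, W2) * (1 - H ** 2)) / len(y)  # backprop through tanh
    return loss, np.concatenate([gW1.ravel(), gW2])

theta0 = 0.1 * rng.standard_normal(5 * 16 + 16)
theta = theta0.copy()
for _ in range(5000):                          # plain gradient descent
    _, g = loss_and_grad(theta)
    theta -= 0.1 * g

for alpha in np.linspace(0, 1, 11):            # evaluate loss along the segment
    loss, _ = loss_and_grad((1 - alpha) * theta0 + alpha * theta)
    print(f"alpha={alpha:.1f}  loss={loss:.4f}")
```

If the run behaves like the experiments in the paper, the interpolated losses decrease roughly monotonically from the initial point to the trained solution.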
Is optimization enough? Is SGD necessary to provide implicit regularization?
- Hardt, Recht, and Singer 2015 Stability of SGD means that good solutions found quickly should generalize well
- Keskar et al. 2016 Large batch training and sharp minima
- Hoffer, Hubara, and Soudry 2017 A method to decrease the generalization gap when using large batch sizes, with a different explanation than Keskar et al. 2016
- Rahimi and Recht 2009 + Rahimi Test of Time Talk at NeurIPS 2017
- Kawaguchi, Kaelbling, and Bengio 2017 Generalization in deep learning
- Neyshabur et al. 2017 Do (particular) optimization algorithms implicitly regularize the solution found?
- Gunasekar et al. 2018 Studies the same question as above, with new theoretical results
- Nacson et al. 2018 Studies the same question as above in a simpler case
Can we improve deep learning using more sophisticated ideas from optimization?
- Gradient Science blog posts 1, 2, and 3 expound the thesis that GANs are better fit by robust optimization
- Sinha, Namkoong, and Duchi 2017 Distributionally robust optimization foils adversarial examples
- Gradient Science blog posts on batch normalization attempt to explain its success as a change of geometry
- Neyshabur, Salakhutdinov, and Srebro 2015 Why not optimize in a geometry invariant to rescalings of the parameters that don't affect the output?
- Wilson et al. 2017 Adaptive optimization methods (which learn the geometry from the iterates) can be worse than useless
- Can we use second order methods for training deep nets? Proposals include Raghu Bollapragada, Richard H. Byrd, and Jorge Nocedal 2016 and Zhewei Yao et al. 2018
Learn better with fewer parameters
- Chen et al. 2018 Treat the number of layers as a continuous parameter (!)
- Jeremy Bernstein et al. 2018 Train a NN with compressed (one-bit) gradients
- Majid Janzamin, Hanie Sedghi, and Anima Anandkumar 2015 Train a two-layer NN provably using tensor methods
- Kossaifi et al. 2017 Tensor contraction layers to reduce parameters in NN
Other interesting ideas connecting optimization and deep learning
- Lin et al. 2018 Mode collapse in GANs implies network does not generalize; this paper proposes a fix
- Armin Askari et al. 2018 Reformulate DNNs as a biconvex problem and use ADMM
- Bora et al. 2017 Compressed sensing using generative models
- Chang et al. 2017 Solving all inverse problems
- Mardani et al. 2017 Application to MRI
- Mardani et al. 2018 Use a DNN to learn (a proximal map for) the best regularizer and denoise via PGD (a minimal sketch of this style of PGD appears after this list)
- Lucas et al. 2018 IEEE review on deep learning for inverse problems in imaging
- Krull, Buchholz, and Jug 2018 Learn to denoise without clean target images
- Recht 2018 Reinforcement Learning: connections with continuous control.
- Also consider the blog posts on this topic, which are even nicer to read
- See also Bertsekas 2018 book on Reinforcement Learning and Optimal Control
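Returning to the inverse-problem items above: as promised at the Mardani et al. 2018 entry, here is a minimal sketch of the "denoise via PGD" idea. Proximal gradient descent alternates a gradient step on the data-fit term with a denoising step; in the paper the denoiser (proximal map) is a learned DNN, while here it is replaced by plain soft thresholding (a sparsity prior) so the example stays self-contained. The sensing matrix, signal, and step size are made up.

```python
# Proximal gradient descent on 0.5 * ||A x - b||^2 + R(x),
# with the proximal map of R acting as a denoiser (here: soft thresholding).
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 200, 80, 10
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)  # sparse signal
A = rng.standard_normal((m, n)) / np.sqrt(m)                      # sensing matrix
b = A @ x_true                                                    # measurements

def soft_threshold(z, lam):
    """Stand-in denoiser: the proximal map of lam * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def pgd(A, b, prox, lam=0.02, iters=500):
    """Gradient step on the data fit, then denoise (proximal step)."""
    x = np.zeros(A.shape[1])
    t = 1.0 / np.linalg.norm(A, 2) ** 2        # step size 1/L for the smooth part
    for _ in range(iters):
        x = prox(x - t * A.T @ (A @ x - b), lam * t)
    return x

x_hat = pgd(A, b, soft_threshold)
print("relative error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```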
Trustworthy machine learning is an umbrella term that includes
- Privacy-preserving Statistics and Machine Learning
- Fairness
- Interpretability
- Robust Statistics and Machine Learning
- Causality
- Adversarial Examples
Some nice recent papers on the topic that are fair game for this class:
- Hardt et al. 2016 Some (optimization-capable) definitions of fairness
- Hashimoto et al. 2018 Fairness (without group labels!) via robust optimization
- Goh et al. 2016 Optimization with Dataset Constraints
- Blum et al. 2018 Can we preserve fairness when combining classifiers?
- Liu et al. 2018 Fairness criteria, when naively applied, can result in unfair behavior over multiple rounds of decisions
Talks include
- quiz
- discussion questions
Before the talk
- submit a PR to the class git repo with your slides
Bring
- power
- dongle (for HDMI or VGA)
Setup
- projector
- camera
- Zoom
- share screen