{"title": "Better Mini-Batch Algorithms via Accelerated Gradient Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 1647, "page_last": 1655, "abstract": "Mini-batch algorithms have recently received significant attention as a way to speed up stochastic convex optimization problems. In this paper, we study how such algorithms can be improved using accelerated gradient methods. We provide a novel analysis, which shows how standard gradient methods may sometimes be insufficient to obtain a significant speed-up. We propose a novel accelerated gradient algorithm, which deals with this deficiency and enjoys a uniformly superior guarantee. We conclude our paper with experiments on real-world datasets, which validate our algorithm and substantiate our theoretical insights.", "full_text": "Better Mini-Batch Algorithms\n\nvia Accelerated Gradient Methods\n\nAndrew Cotter\n\nToyota Technological Institute at Chicago\n\ncotter@ttic.edu\n\nOhad Shamir\n\nMicrosoft Research, NE\n\nohadsh@microsoft.com\n\nNathan Srebro\n\nToyota Technological Institute at Chicago\n\nnati@ttic.edu\n\nKarthik Sridharan\n\nToyota Technological Institute at Chicago\n\nkarthik@ttic.edu\n\nAbstract\n\nMini-batch algorithms have been proposed as a way to speed up stochastic convex optimization problems. We study how such algorithms can be improved using accelerated gradient methods. We provide a novel analysis, which shows how standard gradient methods may sometimes be insufficient to obtain a significant speed-up, and propose a novel accelerated gradient algorithm, which deals with this deficiency, enjoys a uniformly superior guarantee, and works well in practice.\n\n1 Introduction\n\nWe consider a stochastic convex optimization problem of the form min_{w ∈ W} L(w), where L(w) = E_z[ℓ(w, z)], based on an empirical sample of instances z1, . . . , zm. 
We assume that W is a convex subset of some Hilbert space (which in this paper we will take to be Euclidean space), and ℓ is non-negative, convex and smooth in its first argument (i.e., it has a Lipschitz-continuous gradient). The classical learning application is when z = (x, y) and ℓ(w, (x, y)) is a prediction loss. In recent years, there has been much interest in developing efficient first-order stochastic optimization methods for these problems, such as mirror descent [2, 6] and dual averaging [9, 16]. These methods are characterized by incremental updates based on subgradients ∂ℓ(w, zi) of individual instances, and enjoy the advantages of being highly scalable and simple to implement.\n\nAn important limitation of these methods is that they are inherently sequential, and so problematic to parallelize. A popular way to speed up these algorithms, especially in a parallel setting, is via mini-batching, where the incremental update is performed on an average of the subgradients with respect to several instances at a time, rather than a single instance (i.e., (1/b) Σ_{j=1}^{b} ∂ℓ(w, z_{i+j})). The gradient computations for each mini-batch can be parallelized, allowing these methods to perform faster in a distributed framework (see for instance [11]). Recently, [10] has shown that a mini-batching distributed framework is capable of attaining asymptotically optimal speed-up in general (see also [1]).\n\nA parallel development has been the popularization of accelerated gradient descent methods [7, 8, 15, 5]. In a deterministic optimization setting and for general smooth convex functions, these methods enjoy a rate of O(1/n^2) (where n is the number of iterations), as opposed to O(1/n) using standard methods. 
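The deterministic O(1/n) vs. O(1/n^2) gap is easy to observe numerically. A minimal sketch, assuming a toy ill-conditioned quadratic objective (not from the paper): plain gradient descent vs. Nesterov's accelerated method with the standard k/(k+3) momentum schedule.

```python
# Plain GD vs. Nesterov's accelerated method on a toy ill-conditioned
# quadratic f(w) = 0.5 * sum(a_i * w_i^2). Hypothetical example, not the
# paper's algorithm; it only illustrates the O(1/n) vs. O(1/n^2) gap.

def grad(w, a):
    return [ai * wi for ai, wi in zip(a, w)]

def f(w, a):
    return 0.5 * sum(ai * wi * wi for ai, wi in zip(a, w))

def gd(w0, a, steps, lr):
    w = list(w0)
    for _ in range(steps):
        g = grad(w, a)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

def nesterov(w0, a, steps, lr):
    w, y = list(w0), list(w0)
    for k in range(steps):
        g = grad(y, a)
        w_next = [yi - lr * gi for yi, gi in zip(y, g)]
        m = k / (k + 3.0)  # standard momentum schedule for smooth convex f
        y = [wn + m * (wn - wi) for wn, wi in zip(w_next, w)]
        w = w_next
    return w

a = [1.0, 0.01]      # smoothness constant H = max(a) = 1
w0 = [1.0, 1.0]
lr = 1.0             # step size 1/H
f_gd = f(gd(w0, a, 100, lr), a)
f_acc = f(nesterov(w0, a, 100, lr), a)
assert f_acc < f_gd  # acceleration converges markedly faster here
```

The final assertion reflects the rates quoted above: after n = 100 steps the accelerated iterate is guaranteed suboptimality O(1/n^2) while plain gradient descent only gets O(1/n).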
However, in a stochastic setting (which is the relevant one for learning problems), the rates of both approaches have an O(1/√n) dominant term in general, so the benefit of using accelerated methods for learning problems is not obvious.\n\nAlgorithm 1 Stochastic Gradient Descent with Mini-Batching (SGD)\nParameters: Step size η, mini-batch size b.\nInput: Sample z1, . . . , zm\nw1 = 0\nfor i = 1 to n = m/b do\n  Let ℓi(wi) = (1/b) Σ_{t=b(i−1)+1}^{bi} ℓ(wi, zt)\n  w′_{i+1} := wi − η∇ℓi(wi)\n  w_{i+1} := P_W(w′_{i+1})\nend for\nReturn w̄ = (1/n) Σ_{i=1}^{n} wi\n\nAlgorithm 2 Accelerated Gradient Method (AG)\nParameters: Step sizes (γi, βi), mini-batch size b\nInput: Sample z1, . . . , zm\nw1 = w^{ag}_1 = 0\nfor i = 1 to n = m/b do\n  w^{md}_i := βi^{−1} wi + (1 − βi^{−1}) w^{ag}_i\n  Let ℓi(w^{md}_i) := (1/b) Σ_{t=b(i−1)+1}^{bi} ℓ(w^{md}_i, zt)\n  w′_{i+1} := w^{md}_i − γi ∇ℓi(w^{md}_i)\n  w_{i+1} := P_W(w′_{i+1})\n  w^{ag}_{i+1} := βi^{−1} w_{i+1} + (1 − βi^{−1}) w^{ag}_i\nend for\nReturn w^{ag}_n\n\nIn this paper, we study the application of accelerated methods to mini-batch algorithms, and provide theoretical results, a novel algorithm, and empirical experiments. The main resulting message is that by using an appropriate accelerated method, we obtain significantly better stochastic optimization algorithms in terms of convergence speed. Moreover, in certain regimes acceleration is actually necessary in order to allow significant speed-ups. The potential benefit of acceleration to mini-batching has been briefly noted in [4], but here we study this issue in much more depth. 
In particular, we make the following contributions:\n\n• We develop novel convergence bounds for the standard gradient method, which refine the results of [10, 4] by being dependent on L(w*), the expected loss of the best predictor in our class. For example, we show that in the regime where the desired suboptimality is comparable to or larger than L(w*), including the separable case L(w*) = 0, mini-batching does not lead to significant speed-ups with standard gradient methods.\n\n• We develop a novel variant of the stochastic accelerated gradient method [5], which is optimized for a mini-batch framework and implicitly adaptive to L(w*).\n\n• We provide an analysis of our accelerated algorithm, refining the analysis of [5] by being dependent on L(w*), and show how it always allows for significant speed-ups via mini-batching, in contrast to standard gradient methods. Moreover, its performance is uniformly superior, at least in terms of theoretical upper bounds.\n\n• We provide an empirical study, validating our theoretical observations and the efficacy of our new method.\n\n2 Preliminaries\n\nAs discussed in the introduction, we focus on a stochastic convex optimization problem, where we wish to minimize L(w) = E_z[ℓ(w, z)] over some convex domain W, using an i.i.d. sample z1, . . . , zm. Throughout this paper we assume that the instantaneous loss ℓ(·, z) is convex, non-negative and H-smooth for each z ∈ Z. Also, we take W to be the ball W = {w : ‖w‖ ≤ D}, although our results can be generalized.\n\nWe discuss two stochastic optimization approaches to deal with this problem: stochastic gradient descent (SGD) and accelerated gradient methods (AG). In a mini-batch setting, both approaches iteratively average subgradients with respect to several instances, and use this average to update the predictor. 
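This shared step, averaging per-instance gradients over a batch and projecting back onto the ball of radius D, can be sketched directly. A minimal sketch of one such update, assuming a squared loss on linear predictors as a hypothetical stand-in for the paper's smooth loss ℓ:

```python
# One mini-batch update of the kind both methods share: average the
# per-instance gradients over a batch, take a step, then project onto the
# ball of radius D by rescaling. The squared loss below is a hypothetical
# stand-in for ℓ; variable names are ours, not the paper's code.
import math

def loss_grad(w, z):
    x, y = z                       # z = (x, y), linear predictor <w, x>
    pred = sum(wi * xi for wi, xi in zip(w, x))
    return [2.0 * (pred - y) * xi for xi in x]   # gradient of (pred - y)^2

def minibatch_step(w, batch, eta, D):
    b = len(batch)
    g = [0.0] * len(w)
    for z in batch:                # average subgradients over the batch
        for j, gj in enumerate(loss_grad(w, z)):
            g[j] += gj / b
    w = [wi - eta * gj for wi, gj in zip(w, g)]
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm > D:                   # P_W: rescale w to norm at most D
        w = [wi * D / norm for wi in w]
    return w

batch = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]
w = minibatch_step([0.0, 0.0], batch, eta=0.25, D=1.0)
```

The inner loop over the batch is exactly the part that parallelizes in a distributed setting, which is the source of the speed-ups discussed below.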
However, the update is done in different ways.\n\nThe stochastic gradient descent algorithm (which in more general settings is known as mirror descent, e.g. [6]) is summarized as Algorithm 1. In the pseudocode, P_W refers to the projection onto the ball W, which amounts to rescaling w to have norm at most D. The accelerated gradient method (e.g., [5]) is summarized as Algorithm 2.\n\nIn terms of existing results, for the SGD algorithm we have [4, Section 5.1]\n\n  E[L(w̄)] − L(w*) ≤ O( 1/√m + b/m ),\n\nwhereas for an accelerated gradient algorithm, we have [5]\n\n  E[L(w^{ag}_n)] − L(w*) ≤ O( 1/√m + b^2/m^2 ).\n\nThus, as long as b = o(√m), both methods allow us to use a large mini-batch size b without significantly degrading the performance of either method. This allows the number of iterations n = m/b to be smaller, potentially resulting in faster convergence speed. However, these bounds do not show that accelerated methods have a significant advantage over the SGD algorithm, at least when b = o(√m), since both have the same first-order term 1/√m. To understand the differences between these two methods better, we will need a more refined analysis, to which we now turn.\n\n3 Convergence Guarantees\n\nThe following theorems provide refined convergence guarantees for the SGD algorithm and the AG algorithm, which improve on the analyses of [10, 4, 5] by being explicitly dependent on L(w*), the expected loss of the best predictor w* in W. Due to lack of space, the proofs are only sketched; the full proofs are deferred to the supplementary material.\n\nTheorem 1. 
For the Stochastic Gradient Descent algorithm with\n\n  η = min{ 1/(2H), √( bD^2/(L(w*)Hn) ) / ( 1 + √( HD^2/(L(w*)bn) ) ) },\n\nassuming L(0) ≤ HD^2, we get that\n\n  E[L(w̄)] − L(w*) ≤ √( 2HD^2 L(w*)/(bn) ) + 2HD^2/n + 9HD^2/(bn).    (1)\n\nTheorem 2. For the Accelerated Gradient algorithm with βi = (i+1)/2 and γi = γ i^p, where\n\n  γ = min{ 1/(4H), √( bD^2/(412 H L(w*) (n−1)^{2p+1}) ), ( b/(1044 H (n−1)^{2p}) )^{(p+1)/(2p+1)} ( D^2/(4HD^2 + √(4HD^2 L(w*))) )^{p/(2p+1)} }    (2)\n\nand\n\n  p = min{ max{ log(b) / ( 2(log(b(n−1)) − log log(n)) ), log log(n) / ( 2 log(n−1) ) }, 1 },\n\nas long as n ≥ 904, we have that\n\n  E[L(w^{ag}_n)] − L(w*) ≤ 358 √( HD^2 L(w*)/(b(n−1)) ) + 1545 HD^2/( √b (n−1) ) + 1428 HD^2 √(log n)/( b(n−1) ) + 4HD^2/(n−1)^2.\n\nWe emphasize that Theorem 2 gives more than a theoretical bound: it actually specifies a novel accelerated gradient strategy, where the step size γi scales polynomially in i, in a way dependent on the mini-batch size b and L(w*). While L(w*) may not be known in advance, this does have the practical implication that choosing γi ∝ i^p for some p < 1, as opposed to just choosing γi ∝ i as in [5], might yield superior results.\n\nThe key observation used for analyzing the dependence on L(w*) is that for any non-negative H-smooth convex function f : W → R, we have [13]\n\n  ‖∇f(w)‖ ≤ √( 4H f(w) ).\n\nThis self-bounding property tells us that the norm of the gradient is small at a point if the loss is itself small at that point. It has been used in [14] in the online setting and in [13] in the stochastic setting to get better (faster) rates of convergence for non-negative smooth losses. 
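The self-bounding property is easy to sanity-check numerically. A toy sketch, assuming the hypothetical non-negative smooth function f(w) = w^2 (which is H-smooth with H = 2); this is our illustration, not a computation from the paper:

```python
# Numerical check of the self-bounding property ||f'(w)|| <= sqrt(4*H*f(w))
# for a simple non-negative H-smooth convex function, f(w) = w^2 (H = 2).
# Toy verification only; not from the paper.
import math

H = 2.0
f = lambda w: w * w
fprime = lambda w: 2.0 * w

# Check the inequality on a grid of points in [-5, 5].
ok = all(abs(fprime(w)) <= math.sqrt(4.0 * H * f(w)) + 1e-12
         for w in [x / 10.0 for x in range(-50, 51)])
assert ok
```

For this f the inequality holds with slack: |f'(w)| = 2|w| while √(4·H·f(w)) = 2√2·|w|, so a small gradient at a point indeed certifies a small loss there.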
The implication of this observation is that for any w ∈ W,\n\n  ‖∇L(w)‖ ≤ √( 4H L(w) )  and  ∀z ∈ Z, ‖∇ℓ(w, z)‖ ≤ √( 4H ℓ(w, z) ).    (3)\n\nProof sketch for Theorem 1. The proof for the stochastic gradient descent bound is mainly based on the proof techniques in [5] and their extension to the mini-batch case in [10]. Following the line of analysis in [5], one can show that\n\n  E[ (1/n) Σ_{i=1}^{n} L(wi) ] − L(w*) ≤ ( η/(n−1) ) Σ_{i=1}^{n−1} E[ ‖∇L(wi) − ∇ℓi(wi)‖^2 ] + D^2/( 2η(n−1) ).\n\nIn the case of [5], E[‖∇L(wi) − ∇ℓi(wi)‖^2] is bounded by the variance, and that leads to the final bound provided in [5] (by setting η appropriately). As noticed in [10], in the mini-batch setting we have ∇ℓi(wi) = (1/b) Σ_{t=b(i−1)+1}^{bi} ∇ℓ(wi, zt), and so one can further show that\n\n  E[ (1/n) Σ_{i=1}^{n} L(wi) ] − L(w*) ≤ ( η/(b^2(n−1)) ) Σ_{i=1}^{n−1} Σ_{t=(i−1)b+1}^{ib} E[ ‖∇L(wi) − ∇ℓ(wi, zt)‖^2 ] + D^2/( 2η(n−1) ).    (4)\n\nIn [10], each ‖∇L(wi) − ∇ℓ(wi, zt)‖ is bounded by σ0, and setting η accordingly yields the mini-batch bound provided there. In our analysis we instead apply the self-bounding property to (4) and get that\n\n  E[ (1/n) Σ_{i=1}^{n} L(wi) ] − L(w*) ≤ ( 16Hη/(b(n−1)) ) Σ_{i=1}^{n−1} E[L(wi)] + D^2/( 2η(n−1) ).\n\nRearranging and setting η appropriately gives the final bound.\n\nProof sketch for Theorem 2. The proof for the accelerated method starts in a similar way as in [5]. 
For the γi's and βi's mentioned in the theorem, following similar lines of analysis as in [5] we get the preliminary bound\n\n  E[L(w^{ag}_n)] − L(w*) ≤ ( 2γ/(n−1)^{p+1} ) Σ_{i=1}^{n−1} i^{2p} E[ ‖∇L(w^{md}_i) − ∇ℓi(w^{md}_i)‖^2 ] + D^2/( γ(n−1)^{p+1} ).\n\nIn [5], the step sizes are γi = γ(i+1)/2 and βi = (i+1)/2, which effectively amounts to p = 1. Furthermore, similarly to the stochastic gradient descent analysis, each E[‖∇L(w^{md}_i) − ∇ℓi(w^{md}_i)‖^2] is assumed to be bounded by some constant, and this leads to the final bound provided in [5] by setting γ appropriately. On the other hand, we first notice that due to the mini-batch setting, just as in the proof for stochastic gradient descent,\n\n  E[L(w^{ag}_n)] − L(w*) ≤ ( 2γ/(b^2(n−1)^{p+1}) ) Σ_{i=1}^{n−1} i^{2p} Σ_{t=b(i−1)+1}^{ib} E[ ‖∇L(w^{md}_i) − ∇ℓ(w^{md}_i, zt)‖^2 ] + D^2/( γ(n−1)^{p+1} ).\n\nUsing smoothness, the self-bounding property, and some manipulations, we can further get the bound\n\n  E[L(w^{ag}_n)] − L(w*) ≤ ( 64Hγ/(b(n−1)^{1−p}) ) Σ_{i=1}^{n−1} ( E[L(w^{ag}_i)] − L(w*) ) + 64HγL(w*)(n−1)^p/b + D^2/( γ(n−1)^{p+1} ) + 32HD^2/( b(n−1) ).\n\nNotice that the above recursively bounds E[L(w^{ag}_n)] − L(w*) in terms of Σ_{i=1}^{n−1} ( E[L(w^{ag}_i)] − L(w*) ). While unrolling the recursion all the way down does not help, we notice that for any w ∈ W, L(w) − L(w*) ≤ 12HD^2 + 3L(w*). Hence we unroll the recursion for M steps and use this inequality for the remaining sum. 
Optimizing over the number of steps up to which we unroll, and also over the choice of γ, we get the bound\n\n  E[L(w^{ag}_n)] − L(w*) ≤ √( 1648 HD^2 L(w*)/(b(n−1)) ) + 348 (6HD^2 + 2L(w*))/( (b(n−1))^{p/(p+1)} ) + 32HD^2/( b(n−1) ) + 36HD^2 log(n)/( (b(n−1))^{p/(2p+1)} ) + 4HD^2/( (n−1)^{p+1} ).\n\nUsing the p given in the theorem statement, and a few simple manipulations, gives the final bound.\n\n4 Optimizing with Mini-Batches\n\nTo compare our two theorems and understand their implications, it will be convenient to treat H and D as constants, and focus on the more interesting parameters: the sample size m, the mini-batch size b, and the optimal expected loss L(w*). Also, we will ignore the logarithmic factor in Theorem 2, since we will mostly be interested in significant (i.e. polynomial) differences between the two algorithms, and it is quite possible that this logarithmic factor is merely an artifact of our analysis. Using m = nb, we get that the bound for the SGD algorithm is\n\n  E[L(w̄)] − L(w*) ≤ Õ( √( L(w*)/(bn) ) + 1/n ) = Õ( √( L(w*)/m ) + b/m ),    (5)\n\nand the bound for the accelerated gradient method we propose is\n\n  E[L(w^{ag}_n)] − L(w*) ≤ Õ( √( L(w*)/(bn) ) + 1/(√b n) + 1/n^2 ) = Õ( √( L(w*)/m ) + √b/m + b^2/m^2 ).    (6)\n\nTo understand the implications of these bounds, we follow the approach described in [3, 12] for analyzing large-scale learning algorithms. First, we fix a desired suboptimality parameter ε, which measures how close to L(w*) we want to get. Then, we assume that both algorithms are run until the suboptimality of their outputs is at most ε. 
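Inverting the simplified bounds (5) and (6) term by term gives the required sample size m for a target suboptimality. A minimal sketch, with constants and logarithmic factors ignored as in the text (our illustration, not the paper's exact statement):

```python
# Required sample size m to reach suboptimality eps, obtained by inverting
# the simplified bounds (5) and (6) term by term. Constants and log factors
# ignored; a sketch, not the paper's exact statement.
import math

def m_sgd(eps, L, b):
    # from (5): sqrt(L/m) <= eps and b/m <= eps
    return (1.0 / eps) * (L / eps + b)

def m_ag(eps, L, b):
    # from (6): sqrt(L/m) <= eps, sqrt(b)/m <= eps, b^2/m^2 <= eps
    return (1.0 / eps) * (L / eps + math.sqrt(b) + b * math.sqrt(eps))

# With a large mini-batch, AG needs far fewer examples for the same eps:
eps, L, b = 0.01, 0.1, 10000
req_sgd = m_sgd(eps, L, b)   # ~1.0e6
req_ag = m_ag(eps, L, b)     # ~1.1e5
assert req_ag < req_sgd
```

The b-dependent terms are exactly what separates the two methods: SGD pays b/m while AG pays only √b/m, which is the source of the larger linear-speedup regime discussed next.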
Our goal is to understand the runtime each algorithm needs to attain suboptimality ε, as a function of L(w*), ε, and b.\n\nTo measure this runtime, we need to discern two settings: a parallel setting, where we assume that the mini-batch gradient computations are performed in parallel, and a serial setting, where the gradient computations are performed one after the other. In a parallel setting, we can take the number of iterations n as a rough measure of the runtime (note that in both algorithms, the runtime of a single iteration is comparable). In a serial setting, the relevant parameter is m, the number of data accesses.\n\nTo analyze the dependence on m and n, we upper bound (5) and (6) by ε, and invert them to get the bounds on m and n. Ignoring logarithmic factors, for the SGD algorithm we get\n\n  n ≤ (1/ε) ( L(w*)/(bε) + 1 ),  m ≤ (1/ε) ( L(w*)/ε + b ),    (7)\n\nand for the AG algorithm we get\n\n  n ≤ (1/ε) ( L(w*)/(bε) + 1/√b + √ε ),  m ≤ (1/ε) ( L(w*)/ε + √b + b√ε ).    (8)\n\nFirst, let us compare the performance of these two algorithms in the parallel setting, where the relevant parameter to measure runtime is n. Analyzing which of the terms in each bound dominates, we get that for the SGD algorithm there are two regimes, while for the AG algorithm there are two to three regimes, depending on the relationship between L(w*) and ε. 
The following two tables summarize the situation (again, ignoring constants):\n\nSGD Algorithm:\n  Regime             n\n  b ≤ √(L(w*)m)      L(w*)/(ε^2 b)\n  b ≥ √(L(w*)m)      1/ε\n\nAG Algorithm (ε ≤ L(w*)^2):\n  Regime                    n\n  b ≤ L(w*)^{1/4} m^{3/4}   L(w*)/(ε^2 b)\n  b ≥ L(w*)^{1/4} m^{3/4}   1/√ε\n\nAG Algorithm (ε ≥ L(w*)^2):\n  Regime                    n\n  b ≤ L(w*)m                L(w*)/(ε^2 b)\n  L(w*)m ≤ b ≤ m^{2/3}      1/(ε√b)\n  b ≥ m^{2/3}               1/√ε\n\nFrom the tables, we see that for both methods there is an initial linear speedup as a function of the mini-batch size b. However, for the AG algorithm, this linear speedup regime holds for much larger mini-batch sizes.¹ Even beyond the linear speedup regime, the AG algorithm still maintains a √b speedup, in the reasonable case where ε ≥ L(w*)^2. Finally, in all regimes, the runtime bound of the AG algorithm is equal to or significantly smaller than that of the SGD algorithm.\n\nWe now turn to discuss the serial setting, where the runtime is measured in terms of m. Inspecting (7) and (8), we see that a larger b actually requires m to increase for both algorithms. This is to be expected, since mini-batching does not lead to large gains in a serial setting. However, using mini-batching in a serial setting might still be beneficial for implementation reasons, resulting in constant-factor improvements in runtime (e.g. saving overhead and loop control, and via pipelining, concurrent memory accesses, etc.). In that case, we can at least ask what is the largest mini-batch size that won't degrade the runtime guarantee by more than a constant. Using our bounds, the mini-batch size b for the SGD algorithm can scale as much as L(w*)/ε, vs. 
a larger value of L(w*)/ε^{3/2} for the AG algorithm.\n\nFinally, an interesting point is that the AG algorithm is sometimes actually necessary in order to obtain significant speed-ups via a mini-batch framework (according to our bounds). Based on the tables above, this happens when the desired suboptimality ε is not much bigger than L(w*), i.e. ε = Ω(L(w*)). This includes the "separable" case, L(w*) = 0, and in general a regime where the "estimation error" ε and "approximation error" L(w*) are roughly the same, an arguably very relevant regime in machine learning. For the SGD algorithm, the critical mini-batch value √(L(w*)m) can be shown to equal L(w*)/ε, which is O(1) in our case. So with SGD we get no non-constant parallel speedup. However, with AG, we still enjoy a speedup of at least Θ(√b), all the way up to mini-batch size b = m^{2/3}.\n\n5 Experiments\n\nWe implemented both the SGD algorithm (Algorithm 1) and the AG algorithm (Algorithm 2, using step sizes of the form γi = γ i^p as suggested by Theorem 2) on two publicly-available binary classification problems, astro-physics and CCAT. 
We used the smoothed hinge loss ℓ(w; x, y), defined as 0.5 − yw⊤x if yw⊤x ≤ 0; 0.5(1 − yw⊤x)^2 if 0 < yw⊤x ≤ 1; and 0 if yw⊤x > 1. While both datasets are relatively easy to classify, we also wished to understand the algorithms' performance in the "separable" case L(w*) = 0, to see if the theory in Section 4 holds in practice. To this end, we created an additional version of each dataset, where L(w*) = 0, by training a classifier on the entire dataset and removing margin violations.\n\n¹Since it is easily verified that √(L(w*)m) is generally smaller than both L(w*)^{1/4}m^{3/4} and L(w*)m.\n\nFigure 1: Left: Test smoothed hinge loss, as a function of p, after training using the AG algorithm on 6361 examples from astro-physics, for various batch sizes (b = 4, 16, 64). Right: the same, for 18578 examples from CCAT (b = 16, 64, 256). In both datasets, margin violations were removed before training so that L(w*) = 0. The circled points are the theoretically-derived values p = ln b/(2 ln(n − 1)) (see Theorem 2).\n\nIn all of our experiments, we used up to half of the data for training, and one-quarter each for validation and testing. The validation set was used to determine the step sizes η and γi. We justify this by noting that our goal is to compare the performance of the SGD and AG algorithms, independently of the difficulties in choosing their step sizes. 
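The smoothed hinge loss above can be written out directly as a function of the margin s = y·⟨w, x⟩. A minimal sketch (variable names are ours, not the paper's code):

```python
# The smoothed hinge loss used in the experiments, written directly from
# its definition as a function of the margin s = y * <w, x>.
def smoothed_hinge(s):
    if s <= 0.0:
        return 0.5 - s               # linear region
    if s <= 1.0:
        return 0.5 * (1.0 - s) ** 2  # quadratic smoothing region
    return 0.0                       # margin larger than 1: no loss
```

The loss is non-negative and 1-smooth, so it satisfies the assumptions of Theorems 1 and 2 (with H = 1), and removing margin violations makes it exactly zero on every remaining training example, giving L(w*) = 0.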
In the implementation, we neglected the projection step, as we found it does not significantly affect performance when the step sizes are properly selected.\n\nIn our first set of experiments, we attempted to determine the relationship between the performance of the AG algorithm and the parameter p, which determines the rate of increase of the step sizes γi. These experiments are summarized in Figure 1. Perhaps the most important conclusion to draw from these plots is that neither the "traditional" choice p = 1, nor the constant-step-size choice p = 0, gives the best performance in all circumstances. Instead, there is a complicated data-dependent relationship between p and the final classifier's performance. Furthermore, there appears to be a weak trend towards higher p performing better for larger mini-batch sizes b, which corresponds neatly with our theoretical predictions.\n\nIn our next experiment, we directly compared the performance of SGD and AG. To do so, we varied the mini-batch size b while holding the total amount of training data (m = nb) fixed. When L(w*) > 0 (top row of Figure 2), the total sample size m is high, and the suboptimality ε is low (red and black plots), we see that for small mini-batch sizes both methods do not degrade as we increase b, corresponding to a linear parallel speedup. In fact, SGD is actually overall better, but as b increases, its performance degrades more quickly, eventually performing worse than AG. That is, even in the least favorable scenario for AG (high L(w*) and small ε; see the tables in Sec. 4), it does give benefits with large enough mini-batch sizes. 
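The AG step-size schedule used in these experiments can be sketched in a few lines, with the theoretically-derived exponent p = ln b/(2 ln(n − 1)) from Theorem 2 (capped at 1, as in the theorem); the base step size γ is a placeholder to be tuned on validation data, as described above:

```python
# Step-size schedule gamma_i = gamma * i**p for the AG algorithm, using the
# theoretically-derived exponent p = ln(b) / (2 ln(n - 1)), capped at 1.
# gamma is a hypothetical placeholder, tuned on validation data in practice.
import math

def ag_schedule(gamma, b, n):
    p = min(math.log(b) / (2.0 * math.log(n - 1)), 1.0)
    return p, [gamma * i ** p for i in range(1, n + 1)]

p, gammas = ag_schedule(gamma=0.1, b=64, n=100)
# p lies strictly between 0 (constant steps) and 1 (the schedule of [5]),
# and the resulting step sizes grow polynomially with the iteration index.
```

For b = 64 and n = 100 this gives p ≈ 0.45, an intermediate exponent of exactly the kind Figure 1 shows outperforming both extremes.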
Further, we see that once the suboptimality ε is roughly equal to L(w*), AG significantly outperforms SGD, even with small mini-batches, agreeing with the theory. Turning to the case L(w*) = 0 (bottom two rows of Figure 2), which is theoretically more favorable to AG, we see that it is indeed mostly better, in terms of retaining linear parallel speedups for larger mini-batch sizes, even for large dataset sizes corresponding to small suboptimality values, and might even be advantageous with small mini-batch sizes.\n\n6 Summary\n\nIn this paper, we presented novel contributions to the theory of first-order stochastic convex optimization (Theorems 1 and 2, generalizing the results of [4] and [5] to be sensitive to L(w*)),\n\n[Figure 2 panels: astro-physics (left column, curves for T = 402207, 25137, 1571) and CCAT (right column, curves for T = 31884, 7796, 1949).]\n\nFigure 2: Test loss on astro-physics and CCAT as a function of mini-batch size b (in 
log-scale), where the total amount of training data m = nb is held fixed. Solid lines and dashed lines are for SGD and AG respectively (for AG, we used p = ln b/(2 ln(n − 1)) as in Theorem 2). The upper row shows the smoothed hinge loss on the test set, using the original (uncensored) data. The bottom rows show the smoothed hinge loss and misclassification rate on the test set, using the modified data where L(w*) = 0. All curves are averaged over three runs.\n\ndeveloped a novel step-size strategy for the accelerated method, which we used in order to obtain our results and which we saw works well in practice, and provided a more refined analysis of the effects of mini-batching, which paints a different picture than previous analyses [4, 1] and highlights the benefit of accelerated methods.\n\nA remaining open practical and theoretical question is whether the bound of Theorem 2 is tight. Following [5], the bound is tight for b = 1 and b → ∞, i.e. the first and third terms are tight, but it is not clear whether the 1/(√b n) dependence is indeed necessary. It would be interesting to understand whether with a more refined analysis, or perhaps different step sizes, we can avoid this term; whether an altogether different algorithm is needed; or whether this term does represent the optimal behavior for any method based on b-aggregated stochastic gradient estimates.\n\nReferences\n\n[1] A. Agarwal and J. Duchi. Distributed delayed stochastic optimization. Technical report, arXiv, 2011.\n\n[2] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.\n\n[3] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In NIPS, 2007.\n\n[4] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. 
Technical report, arXiv, 2010.\n\n[5] G. Lan. An optimal method for stochastic convex optimization. Technical report, Georgia Institute of Technology, 2009.\n\n[6] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.\n\n[7] Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Doklady AN SSSR, 269:543–547, 1983.\n\n[8] Y. Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103(1):127–152, 2005.\n\n[9] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, August 2009.\n\n[10] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction. In ICML, 2011.\n\n[11] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: primal estimated sub-gradient solver for SVM. Math. Program., 127(1):3–30, 2011.\n\n[12] S. Shalev-Shwartz and N. Srebro. SVM optimization: inverse dependence on training set size. In ICML, 2008.\n\n[13] N. Srebro, K. Sridharan, and A. Tewari. Smoothness, low noise and fast rates. In NIPS, 2010.\n\n[14] S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, Hebrew University of Jerusalem, 2007.\n\n[15] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. Submitted to SIAM Journal on Optimization, 2008.\n\n[16] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. 
Journal of Machine Learning Research, 11:2543–2596, 2010.", "award": [], "sourceid": 942, "authors": [{"given_name": "Andrew", "family_name": "Cotter", "institution": null}, {"given_name": "Ohad", "family_name": "Shamir", "institution": null}, {"given_name": "Nati", "family_name": "Srebro", "institution": null}, {"given_name": "Karthik", "family_name": "Sridharan", "institution": null}]}