Important paper from Google on large batch optimization. They do impressively careful experiments measuring # iterations needed to achieve target validation error at various batch sizes. The main "surprise" is the lack of surprises. [thread]

https://t.co/7QIx5CFdfJ

The paper is a showcase of good experimental design. They validate their metric by showing that lots of variants of it give consistent results. They tune hyperparameters separately for each condition, check that the optimum isn't at the endpoints of the search range, and measure sensitivity.
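A minimal sketch of that endpoint check, with a made-up stand-in for a full training run (the `opt_lr` heuristic and the grid here are illustrative, nothing is taken from the paper):

```python
import numpy as np

def steps_to_target(batch_size, lr, rng):
    # Hypothetical stand-in for a full training run: returns a noisy
    # "steps to reach the target validation error" for this configuration.
    opt_lr = 0.1 * min(batch_size, 256) / 256      # toy heuristic, not from the paper
    penalty = np.log10(lr / opt_lr) ** 2           # a mistuned lr costs extra steps
    return 1000.0 * (1.0 + penalty) * (1.0 + 0.05 * rng.standard_normal())

rng = np.random.default_rng(0)
lr_grid = np.logspace(-4, 0, 9)
for batch_size in [64, 256, 1024]:
    costs = [steps_to_target(batch_size, lr, rng) for lr in lr_grid]
    best = int(np.argmin(costs))
    # The endpoint check: if the best lr sits on the boundary of the grid,
    # the true optimum may lie outside it, and the sweep must be widened.
    assert 0 < best < len(lr_grid) - 1, f"widen the lr grid for batch size {batch_size}"
    print(f"batch size {batch_size:4d}: best lr = {lr_grid[best]:.1e}")
```

If the assert fires, the sweep range was too narrow and the reported "best" hyperparameter can't be trusted.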
They have separate experiments where they hold fixed the # of iterations and the # of epochs, which (as they explain) measure very different things. They avoid confounds, such as batch norm inducing an artificial dependence between batch size and regularization strength.
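To see why the two budgets differ, a quick illustration (dataset size and budgets are made-up numbers):

```python
# Illustrative numbers only -- dataset size and budgets are made up.
DATASET_SIZE = 1_000_000

for batch_size in [64, 512, 4096]:
    fixed_steps = 10_000                                     # budget A: fixed # of iterations
    examples_seen = fixed_steps * batch_size                 # grows linearly with batch size
    fixed_epochs = 10                                        # budget B: fixed # of epochs
    steps_taken = fixed_epochs * DATASET_SIZE // batch_size  # shrinks linearly with batch size
    print(f"B={batch_size:5d}: {fixed_steps:,} steps -> {examples_seen:,} examples; "
          f"{fixed_epochs} epochs -> {steps_taken:,} steps")
```

Fixing iterations hands larger batches strictly more data; fixing epochs hands them strictly fewer updates. The two protocols answer different questions.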
When the experiments are done carefully enough, the results are remarkably consistent between different datasets and architectures. Qualitatively, MNIST behaves just like ImageNet.
Importantly, they don't find any evidence for a "sharp/flat optima" effect whereby better optimization leads to worse final results. They have a good discussion of experimental artifacts/confounds in past papers where such effects were reported.
The time to reach target validation error is explained purely by optimization considerations. There's a regime where gradient variance dominates, and you get linear speedups w/ batch size. Then there's a regime where curvature dominates and larger batches don't help. As theory would predict.
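Both regimes fall out of even a toy simulation. A sketch on a 2-D noisy quadratic (a standard toy model, not the paper's workloads; the curvatures, noise scale, and target are arbitrary):

```python
import numpy as np

H = np.array([1.0, 0.1])  # curvatures of a 2-D quadratic, f(x) = 0.5 * sum(H * x**2)

def steps_to_target(batch_size, lr, sigma=1.0, target=1e-3, max_steps=50_000):
    # SGD with gradient noise of std sigma/sqrt(batch_size), mimicking
    # how minibatching shrinks gradient variance as the batch grows.
    rng = np.random.default_rng(0)
    x = np.array([5.0, 5.0])
    for step in range(1, max_steps + 1):
        grad = H * x + sigma / np.sqrt(batch_size) * rng.standard_normal(2)
        x = x - lr * grad
        if 0.5 * np.sum(H * x**2) < target:
            return step
    return max_steps

for B in [1, 4, 16, 64, 256, 1024, 4096]:
    # retune the learning rate for every batch size, as the paper stresses
    best = min(steps_to_target(B, lr) for lr in np.logspace(-3, 0, 10))
    print(f"batch size {B:5d}: {best:6d} steps to target")
```

Steps to target drop roughly in proportion to batch size at first, then flatten out once the badly-conditioned (curvature-limited) direction, rather than gradient noise, sets the pace.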
Incidentally, this paper must have been absurdly expensive, even by Google's standards. Doing careful empirical work on optimizers requires many, many runs of the algorithm. (I think surprising phenomena on ImageNet are often due to the difficulty of running proper experiments.)
