This post is pretty bizarre, but it manages to hit on so many false beliefs that I've seen hurt junior data scientists that it deserves some explicit
More from Data science
I have always emphasized the importance of mathematics in machine learning.
Here is a compilation of resources (books, videos & papers) to get you going.
(Note: It's not an exhaustive list but I have carefully curated it based on my experience and observations)
📘 Mathematics for Machine Learning
by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong
https://t.co/zSpp67kJSg
Note: This is probably the place you want to start. Start slowly and work through some examples. Pay close attention to the notation and get comfortable with it.
📘 Pattern Recognition and Machine Learning
by Christopher Bishop
Note: Before the book above, this is the book I used to recommend for getting familiar with the math-related concepts used in machine learning. A very solid book in my view, and it's heavily referenced in academia.
📘 The Elements of Statistical Learning
by Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie
Note: Machine learning deals with data, and in turn uncertainty, which is what statistics teaches. Get comfortable with topics like estimators, statistical significance,...
📘 Probability Theory: The Logic of Science
by E. T. Jaynes
Note: In machine learning, we are interested in building probabilistic models and thus you will come across concepts from probability theory like conditional probability and different probability distributions.
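As a quick refresher on the first of those concepts, here is conditional probability and Bayes' rule written out for generic events A and B (a worked reminder, not a result specific to any of the books above):

```latex
% Conditional probability: probability of A given that B occurred
P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B) > 0

% Bayes' rule: apply the definition twice to swap the conditioning
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```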
1/ An ∞-wide NN of *any architecture* is a Gaussian process (GP) at init. The NN in fact evolves linearly in function space under SGD, so it is a GP at *any time* during training. https://t.co/v1b6kndqCk With Tensor Programs, we can calculate this time-evolving GP w/o training any NN
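To spell out what "evolves linearly in function space" means, here is a sketch of the infinite-width dynamics under gradient flow on a squared loss, with training data (X, Y) and learning rate η (a standard simplification of SGD; constants depend on the loss normalization):

```latex
% In the infinite-width limit the output f_t follows a LINEAR ODE in function space,
% driven by the fixed Neural Tangent Kernel \Theta (defined in tweet 3 below)
\frac{\partial f_t(x)}{\partial t} \;=\; -\,\eta\, \Theta(x, X)\, \big( f_t(X) - Y \big)
```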
2/ In this gif, narrow relu networks have a high probability of initializing near the 0 function (because of relu) and getting stuck. This causes the function distribution to become multi-modal over time. However, for wide relu networks this is not an issue.
3/ This time-evolving GP depends on two kernels: the kernel describing the GP at init, and the kernel describing the linear evolution of this GP. The former is the NNGP kernel, and the latter is the Neural Tangent Kernel (NTK).
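Written out, both kernels are expectations over the random initialization θ of the network f_θ (the standard definitions, sketched for a scalar output):

```latex
% NNGP kernel: covariance of network outputs at initialization
K(x, x') = \mathbb{E}_{\theta}\!\left[ f_{\theta}(x)\, f_{\theta}(x') \right]

% Neural Tangent Kernel: inner product of parameter gradients of the output;
% in the infinite-width limit it concentrates and stays constant during training
\Theta(x, x') = \mathbb{E}_{\theta}\!\left[ \left\langle \frac{\partial f_{\theta}(x)}{\partial \theta},\; \frac{\partial f_{\theta}(x')}{\partial \theta} \right\rangle \right]
```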
4/ Once we have these two kernels, we can derive the GP mean and covariance at any time t via straightforward linear algebra.
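For example, assuming gradient flow on the squared loss with learning rate η and a zero-mean GP at init (the usual simplifying assumptions in this literature), the mean on a test point x at training time t comes out as:

```latex
% \Theta = \Theta(X, X) is the train-train NTK matrix, Y the training targets
\mu_t(x) = \Theta(x, X)\, \Theta^{-1} \big( I - e^{-\eta \Theta t} \big)\, Y

% As t \to \infty this tends to the kernel-regression predictor \Theta(x, X)\,\Theta^{-1} Y;
% the covariance has an analogous (longer) closed form mixing K and \Theta.
```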
5/ So it remains to calculate the NNGP kernel and NT kernel for any given architecture. The first is described in https://t.co/cFWfNC5ALC and in this thread
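If you want to play with these kernels without deriving them by hand, the open-source neural-tangents library (built on JAX) computes both, plus the resulting time-t GP predictions, in closed form; a minimal sketch on toy data (this is a convenience library, not the Tensor Programs machinery the thread describes):

```python
import numpy as np
import neural_tangents as nt
from neural_tangents import stax

# Toy 1-D regression data, purely for illustration
x_train = np.random.randn(20, 1)
y_train = np.sin(x_train)
x_test = np.linspace(-3.0, 3.0, 50).reshape(-1, 1)

# An infinite-width fully-connected ReLU network; kernel_fn gives its limiting kernels
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1),
)

# NNGP kernel (GP at init) and NTK (governs the linear training dynamics)
kernels = kernel_fn(x_train, x_test, ('nngp', 'ntk'))

# Closed-form GP mean and covariance at any training time t, no network ever trained
predict_fn = nt.predict.gradient_descent_mse_ensemble(kernel_fn, x_train, y_train)
mean, cov = predict_fn(t=1.0, x_test=x_test, get='ntk', compute_cov=True)
```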
You May Also Like
The entire discussion around Facebook’s disclosures of what happened in 2016 is very frustrating. No exec stopped any investigations, but there were a lot of heated discussions about what to publish and when.
In the spring and summer of 2016, as reported by the Times, activity we traced to GRU was reported to the FBI. This was the standard model of interaction companies used for nation-state attacks against likely US targets.
In the spring of 2017, after a deep dive into the fake-news phenomenon, the security team wanted to publish an update that covered what we had learned. At this point, we didn't have any advertising content or the big IRA cluster, but we did know about the GRU model.
This report went through dozens of edits as different equities were represented. I did not have any meetings with Sheryl on the paper, but I can't speak to whether she was in the loop with my higher-ups.
In the end, the difficult question of attribution was settled by us pointing to the DNI report instead of saying Russia or GRU directly. In my pre-briefs with members of Congress, I made it clear that we believed this action was GRU.
The story doesn't say you were told not to... it says you did so without approval and they tried to obfuscate what you found. Is that true?
— Sarah Frier (@sarahfrier) November 15, 2018