Data science | Buzz Chronicles

This post is pretty bizarre, but it manages to hit on so many false beliefs that I've seen hurt junior data scientists that it deserves some explicit

(1) The notion that R is well-suited to "building web applications" seems totally out of left field. I don't feel like most R loyalists think this is a good idea, but it's worth calling out that no normal company will be glad you wrote your entire web app in R.

(2) It is true that Python had some issues historically with the 2-to-3 transition, but it's not such a big deal these days. On the flip side, I have found interesting R code that doesn't run in modern R interpreters because of changes in core operations (e.g. assignment syntax).

(3) "Most of the time we only need a latest, working interpreter with the latest packages to run the code" -- this is where things get real and reveal some things that hurt data scientists. If this sentence is true, it's likely because you don't share code with coworkers.

(3) really is a broader issue in data science: people think only about what they would need to do their work if no one else existed and code were never maintained. Junior data scientists almost always operate on projects they start from scratch and don't have to maintain for long.

Data Professor
@thedataprof

Cheat sheet that summarizes #DataScience in 10 pages
(Links in the comments below 👇)

2/ Link to the cheatsheet by Maverick

Jeremy Howard
@jeremyphoward

An amazing new project from @bearpelican was just released: https://t.co/DBov6sZTVS . A beautiful design; you can auto-generate a melody from chords, chords from a melody, and more.

It's technically brilliant, combining BERT, seq2seq, and Transformer XL
https://t.co/jF3mO5aXiu

It's also a wonderful example of leveraging and customizing the fastai framework in a deep & thoughtful way.

Here's the full set of blog posts diving into this

Emil Wallner
@EmilWallner

Tips for AI writers:

1. Spend 30% of your effort on skimming all student ML papers (e.g. Stanford NLP CS224n) from the past 3 years and prototype your favorites

The idea is everything. Pick an area you are interested in and ideally something that has a visual aspect to it

Most of my 'on the top of my mind' ideas were bad in retrospect. Skimming 100s of student papers will give you an overview of what's interesting.

Student papers are overlooked, easy to understand, and have good compute constraints.

2. Spend 30% of your effort on coding

Create an edge to the project. Apply it to something new and use FastAI or Keras to improve the accuracy by 5-30%.

3. Spend 30% writing an in-depth article

Have a north star article in terms of structure and quality. Find something that stretches you to your utmost capability. I used @copingbear’s Style transfer article:

4. Spend 10% marketing your project

Invest a week in studying the strategies to rank on sites like HN and Reddit, then use them. If you have an interesting result and a great article, you've done the hard work.

Simon DeDeo
@SimonDeDeo

On Bayesianism, the Many Worlds Interpretation, and personal identity.

Some thoughts worked out in a letter to a friend, which is the kind of thing you do when off Twitter for a glorious week. (🧵)

“Chance is ignorance”—the Bayesian story; all probabilities represent states of mind, not states of the world. One *could* put (some) chances “in the world”, but let’s take Occam’s Razor seriously...

That the probability of a fair coin coming up heads is 50% simply means that marginalizing (tracing, as the physicists say) over the hidden facts leaves you, nearly, maximally ignorant of the outcome.

Quantum uncertainty (access below!) poses an apparent challenge to this story. There seems to be nothing to be ignorant about when it comes to (say) electron spin—there is nothing “inside” the

The electron is a simple object, in other words. So where does the uncertainty come from? One could follow David Wallace’s wonderful interpretation in terms of chaotic dynamics and decoherence, but let’s see if we can take another route...

Sophie Hill
@sophie_e_hill

This is a more wonky thread about how I made this visualization in #Rstats using the awesome visNetwork

First step is to create the underlying network data. We need one file of "nodes" - i.e. the people and organizations. And one file of "edges" - i.e. the connections between them.

I created these by hand, based on excellent investigative journalism:

Now we can pull these together to create a network visualization!

You'll notice that I included a column for "type" in the nodes file. This allows me to use different icons for people vs firms vs political organizations.
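A minimal sketch of the two tables described above, written in Python rather than the thread's R. The column names (`id`, `label`, `type`, `from`, `to`), the placeholder rows, and the icon labels are all assumptions for illustration; they are not the author's data and not the visNetwork API.

```python
# Hypothetical node table: one row per person or organization.
nodes = [
    {"id": 1, "label": "Person A", "type": "person"},
    {"id": 2, "label": "Firm B", "type": "firm"},
    {"id": 3, "label": "Organization C", "type": "political"},
]

# Hypothetical edge table: one row per connection between two node ids.
edges = [
    {"from": 1, "to": 2},
    {"from": 2, "to": 3},
]

# Placeholder icon labels keyed by node type (not real icon codes).
ICON_BY_TYPE = {
    "person": "icon-person",
    "firm": "icon-firm",
    "political": "icon-political",
}


def check_edges(nodes, edges):
    """Return the edges whose endpoints are missing from the node table."""
    known = {n["id"] for n in nodes}
    return [e for e in edges if e["from"] not in known or e["to"] not in known]


def attach_icons(nodes, icon_by_type):
    """Return copies of the nodes with an 'icon' field chosen by type."""
    return [{**n, "icon": icon_by_type[n["type"]]} for n in nodes]


styled_nodes = attach_icons(nodes, ICON_BY_TYPE)
dangling = check_edges(nodes, edges)
```

Checking for dangling edges before plotting is worth the extra step when both files are maintained by hand, since a mistyped id would otherwise silently drop a connection.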

All the icons are taken from @fontawesome. I *think* the visNetwork 📦 currently only works with fontawesome version 4.7, which is a bit limited – e.g. I decided to use a book icon to represent the fringe Evangelical Christian sect "Exclusive Brethren"! 😂

I very much enjoyed getting to use the "incognito" icon to represent all the unknown donors that have funded Tory MP Owen Paterson's overseas jaunts!

Ryan J. Gallagher
@ryanjgallag

Tired of word clouds? Want to do better sentiment analysis? Not sure how to look at the words underneath your measures?

Our long overdue paper on generalized word shift graphs is finally here!
https://t.co/lIBXvbMJWX
https://t.co/vSL1REYT8V

So what are they?

1/n

If we have two texts, there are many ways we can compare them. Weighted averages are a particularly useful measure because they're flexible and interpretable

Proportions, Shannon entropy, the KLD, the JSD, and dictionary methods can all be written as weighted averages

2/n
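As a rough illustration of the point above, Shannon entropy and the KLD can both be written as a weighted average where the weights are the probabilities of one text. The helper names below are hypothetical, and the sketch assumes each distribution sums to 1 and that `q` is nonzero wherever `p` is.

```python
import math


def weighted_average(weights, values):
    """Plain weighted average: sum(w * v) / sum(w)."""
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def shannon_entropy(p):
    """Entropy in bits: the average surprisal -log2(p_i), weighted by p_i."""
    probs = [x for x in p if x > 0]
    return weighted_average(probs, [-math.log2(x) for x in probs])


def kl_divergence(p, q):
    """KLD in bits: the average of log2(p_i / q_i), weighted by p_i."""
    pairs = [(pi, qi) for pi, qi in zip(p, q) if pi > 0]
    weights = [pi for pi, _ in pairs]
    values = [math.log2(pi / qi) for pi, qi in pairs]
    return weighted_average(weights, values)
```

For example, `shannon_entropy([0.25, 0.25, 0.25, 0.25])` is the average of four surprisals of 2 bits each, and `kl_divergence(p, p)` averages a list of zeros.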

But weighted avgs are also slippery. When we try to compress complex phenomena like happiness, surprise, divergence, or diversity into a single number, it can be unclear what we're measuring

If the measure goes up, what does that mean? Why did it do that? Can we trust it?

3/n

Very often, that's the end of the line and we're left with an uneasy feeling in the pit of our stomach that our weighted avg is actually picking up a data artifact or some other unintended peculiarity

Word shift graphs help us address those concerns

4/n

First, word shifts look under the hood of weighted averages to see what's going on

All weighted averages are a sum of contributions from individual words. We can pull out those words, and rank which ones contribute the most to the difference between two texts

5/n
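The decomposition into per-word contributions can be sketched for the simplest case, a dictionary-based average score. This is only the basic idea and not the generalized word shift from the paper; the toy word scores, the example texts, and the function names are assumptions for illustration.

```python
from collections import Counter


def relative_freqs(tokens, scores):
    """Relative frequency of each scored word (unscored words are dropped)."""
    counts = Counter(t for t in tokens if t in scores)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def average_score(tokens, scores):
    """Weighted average of word scores, weighted by relative frequency."""
    freqs = relative_freqs(tokens, scores)
    return sum(scores[w] * p for w, p in freqs.items())


def word_contributions(tokens_a, tokens_b, scores):
    """Per-word contribution to average_score(b) - average_score(a),
    ranked by absolute size."""
    p_a = relative_freqs(tokens_a, scores)
    p_b = relative_freqs(tokens_b, scores)
    vocab = set(p_a) | set(p_b)
    contribs = {
        w: scores[w] * (p_b.get(w, 0.0) - p_a.get(w, 0.0)) for w in vocab
    }
    return sorted(contribs.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Toy scores and texts, made up for this example.
scores = {"happy": 8.0, "sad": 2.0, "day": 6.0}
text_a = "sad sad day".split()
text_b = "happy happy day day".split()
ranked = word_contributions(text_a, text_b, scores)
```

The contributions sum to the difference between the two averages, so the ranked list accounts for the whole change in the measure rather than approximating it.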

Goku Mohandas
@GokuMohandas

🔥 Putting ML in Production! We're going to publicly develop @madewithml's first ML service. Here is the broad curriculum:

- 📦 Product
- 🔢 Data
- 🤖 Modeling
- 📝 Scripting
- 🛠 API
- 🚀 Production

More details (lessons, task, etc.) here: https://t.co/xmMm9XGK9j

Thread 👇

Questions that this thread will answer:

- What is it?
- Who is this course for?
- What is the format?
- What makes this course unique?
- Why constrain to open source tools?
- What are my qualifications?
- Why is this free?
- What are the

What is it?

Putting ML in Production: a guide and code-driven case study on MLOps. We will be developing and deploying Made With ML's first ML service, from Product → ML → Production, with open source tools.

This ML service will act as a foundation for all future ML features and subsequent iterations. The first feature is tagifai - multilabel classification of tags for a project. We'll discuss the need and utility of this feature in the first lesson.

Who is this course for?

- ML developers looking to become end-to-end ML developers.
- Software engineers looking to learn how to responsibly deploy and monitor ML systems.
- Product managers who want to have a comprehensive understanding of the different stages of ML dev.

Maria Khalusova
@mariaKhalusova

To my JVM friends looking to explore Machine Learning techniques - you don’t necessarily have to learn Python to do that. There are libraries you can use from the comfort of your JVM environment. 🧵👇

https://t.co/EwwOzgfDca : Deep Learning framework in Java that supports the whole cycle: from data loading and preprocessing to building and tuning a variety of deep learning networks.

https://t.co/J4qMzPAZ6u Framework for defining machine learning models, including feature generation and transformations, as directed acyclic graphs (DAGs).

https://t.co/9IgKkSxPCq a machine learning library in Java that provides multi-class classification, regression, clustering, anomaly detection and multi-label classification.

https://t.co/EAqn2YngIE : TensorFlow Java API (experimental)

elvis
@omarsar0

I have always emphasized the importance of mathematics in machine learning.

Here is a compilation of resources (books, videos & papers) to get you going.

(Note: It's not an exhaustive list but I have carefully curated it based on my experience and observations)

📘 Mathematics for Machine Learning

by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong

https://t.co/zSpp67kJSg

Note: this is probably the place you want to start. Start slowly and work on some examples. Pay close attention to the notation and get comfortable with it.

📘 Pattern Recognition and Machine Learning

by Christopher Bishop

Note: Prior to the book above, this is the book that I used to recommend to get familiar with math-related concepts used in machine learning. A very solid book in my view and it's heavily referenced in academia.

📘 The Elements of Statistical Learning

by Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie

Note: machine learning deals with data and in turn uncertainty, which is what statistics teaches. Get comfortable with topics like estimators, statistical significance,...

📘 Probability Theory: The Logic of Science

by E. T. Jaynes

Note: In machine learning, we are interested in building probabilistic models and thus you will come across concepts from probability theory like conditional probability and different probability distributions.

Sunrit Jana 🚀
@JanaSunrise

Pandas is an amazing data analysis and manipulation library for Python. It's really popular for ML, data science, and more.

It has a robust data structure, the DataFrame, for manipulating and analyzing data.

Here are some tips to help you work better with pandas. Let's go! ↓

If you're not aware of what a DataFrame is, it's an optimized data structure for loading data, analysing it, manipulating it, and mostly gathering insights.

It uses a Cython backend, which compiles to C for optimized code.

Here's what a DataFrame looks like:

Before we start, you need to ensure you have pandas installed. If you don't, do that before moving ahead!

Here are the tips. Let's go!

1/ Convert a pandas Series to a DataFrame

We have all struggled when dealing with pandas Series. It's always easier to work with DataFrames than with Series. Here is how you can convert a Series to a DataFrame easily.
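The thread's original snippet is not preserved here, so the following is one way to do the conversion, using `Series.to_frame` and `Series.reset_index`. The values, index labels, and names are made up for illustration.

```python
import pandas as pd

# A small Series; the data and the name "score" are illustrative only.
s = pd.Series([10, 20, 30], index=["a", "b", "c"], name="score")

# Option 1: keep the index and turn the values into a single column
# named after the Series.
df = s.to_frame()

# Option 2: also move the index into a regular column called "key".
df_flat = s.rename_axis("key").reset_index()
```

`to_frame` is the lighter choice when the index is meaningful; `reset_index` helps when the index labels are really data you want to filter or join on.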

2/ How to create a dummy DataFrame for testing

We often need DataFrames for testing and analysis when we do not have data ready. Here is how you can use the pandas API to generate different types of data.
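The snippet from the original thread is not preserved here. A minimal sketch of one approach, assuming NumPy is available alongside pandas, builds each column from a seeded random generator so the dummy data is reproducible. The column names and sizes are arbitrary choices.

```python
import numpy as np
import pandas as pd

# Seeded generator so the same dummy data comes back on every run.
rng = np.random.default_rng(seed=0)
n_rows = 5

dummy = pd.DataFrame(
    {
        # Daily timestamps starting from an arbitrary date.
        "date": pd.date_range("2021-01-01", periods=n_rows, freq="D"),
        # Floats drawn from a standard normal distribution.
        "value": rng.normal(size=n_rows),
        # Integers in the half-open range [0, 100).
        "count": rng.integers(0, 100, size=n_rows),
        # Categorical-style labels.
        "group": rng.choice(["a", "b", "c"], size=n_rows),
    }
)
```

Mixing dates, floats, integers, and labels in one frame makes it easier to exercise groupby, resampling, and dtype handling in the same test.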

Pat Schloss
@PatSchloss

Wellll... A few weeks back I started working on a tutorial for our lab's Code Club on how to make shitty graphs. It was too dispiriting and I balked. A twitter workshop with figures and code:

When are you doing pie charts?
— #BlackLivesMatter (@surt_lab) October 13, 2020

Here's the code to generate the data frame. You can get the "raw" data from https://t.co/jcTE5t0uBT

Obligatory stacked bar chart that hides any sense of variation in the data

Obligatory stacked bar chart that shows all the things and yet shows absolutely nothing at the same time

STACKED Donut plot. Who doesn't want a donut? Who wouldn't want a stack of them!?! This took forever to render and looked worse than it should because coord_polar doesn't do scales="free_x".

Greg Yang
@TheGregYang

1/ An ∞-wide NN of *any architecture* is a Gaussian process (GP) at init. The NN in fact evolves linearly in function space under SGD, so is a GP at *any time* during training. https://t.co/v1b6kndqCk With Tensor Programs, we can calculate this time-evolving GP w/o training any NN

2/ In this gif, narrow relu networks have high probability of initializing near the 0 function (because of relu) and getting stuck. This causes the function distribution to become multi-modal over time. However, for wide relu networks this is not an issue.

3/ This time-evolving GP depends on two kernels: the kernel describing the GP at init, and the kernel describing the linear evolution of this GP. The former is the NNGP kernel, and the latter is the Neural Tangent Kernel (NTK).

4/ Once we have these two kernels, we can derive the GP mean and covariance at any time t via straightforward linear algebra.
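A hedged sketch of that linear algebra, assuming squared loss and continuous-time gradient descent with learning rate `lr`, which is the setting where the linearized-network formulas are usually stated. The linked paper may use different conventions. The function name, argument names, and toy kernel values are all hypothetical; real kernels would come from the architecture.

```python
import numpy as np


def gp_at_time(K_tt, K_tx, K_xx, T_tx, T_xx, Y, t, lr=1.0):
    """Mean and covariance of the network outputs on test points at time t.

    K_* are NNGP kernel blocks and T_* are NTK blocks; the suffix 't' means
    test inputs and 'x' means training inputs. Y holds the training targets.
    """
    # Theta^{-1} (I - exp(-lr * Theta * t)) via the eigendecomposition of
    # the symmetric, positive definite training NTK.
    evals, evecs = np.linalg.eigh(T_xx)
    decay = (1.0 - np.exp(-lr * evals * t)) / evals
    M = (evecs * decay) @ evecs.T

    A = T_tx @ M
    mean = A @ Y
    cov = K_tt - A @ K_tx.T - K_tx @ A.T + A @ K_xx @ A.T
    return mean, cov


# Toy kernel blocks for two training points and one test point (made up).
K_xx = np.array([[1.0, 0.2], [0.2, 1.0]])
K_tx = np.array([[0.5, 0.3]])
K_tt = np.array([[1.0]])
T_xx = np.array([[2.0, 0.5], [0.5, 1.0]])
T_tx = np.array([[0.8, 0.4]])
Y = np.array([1.0, -1.0])

mean_0, cov_0 = gp_at_time(K_tt, K_tx, K_xx, T_tx, T_xx, Y, t=0.0)
mean_late, cov_late = gp_at_time(K_tt, K_tx, K_xx, T_tx, T_xx, Y, t=1e3)
```

At `t=0` the result reduces to the GP at init (zero mean, NNGP covariance), and for large `t` the mean approaches the kernel regression prediction with the NTK, which gives two easy sanity checks.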

5/ So it remains to calculate the NNGP kernel and NT kernel for any given architecture. The first is described in https://t.co/cFWfNC5ALC and in this thread

Research Engineering a...
@turinghut23

✨✨ BIG NEWS: We are hiring!! ✨✨
Amazing Research Software Engineer / Research Data Scientist positions within the @turinghut23 group at the @turinginst, at Standard (permanent) and Junior levels 🤩

👇 Here below a thread on who we are and what we

We are a highly diverse and interdisciplinary group of around 30 research software engineers and data scientists 😎💻 👉 https://t.co/KcSVMb89yx #RSEng

We value expertise across many domains - members of our group have backgrounds in psychology, mathematics, digital humanities, biology, astrophysics and many other areas 🧬📖🧪📈🗺️⚕️🪐
https://t.co/zjoQDGxKHq
/ @DavidBeavan @LivingwMachines

In our everyday job we turn cutting edge research into professionally usable software tools. Check out @evelgab's #LambdaDays 👩‍💻 presentation for some examples:

We create software packages to analyse data in a readable, reliable and reproducible fashion and contribute to the #opensource community, as @drsarahlgibson highlights in her contributions to @mybinderteam and @turingway: https://t.co/pRqXtFpYXq #ResearchSoftwareHour
