If you're interested in DB internals, stop what you're doing and watch the @CMUDB Quarantine Talk from Nico and Cesar about @SQLServer's Cascades query optimizer: https://t.co/FCdsbHHEaD

Many talks this semester were good. This one is the best. This thread has my key takeaways:

[6:50] Microsoft hired Goetz Graefe in the 1990s to help them rewrite the original Sybase optimizer into Cascades. This framework is now used across all MSFT DB products (@SQLServer, @cosmosdb, @Azure_Synapse).
[14:43] The optimizer checks whether it has the stats it will need before starting the cost-based search. If not, it blocks planning until the DBMS generates them. This is different from the other approaches we saw this semester, where the DBMS says "we'll do it live!" with whatever stats are available.
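To make that concrete, here is a minimal Python sketch of "block planning on missing stats". This is not SQL Server's actual code; every name in it (StatsCatalog, build_histogram, etc.) is hypothetical.

```python
# Hypothetical sketch: gate the cost-based search on complete statistics.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Predicate:
    table: str
    column: str

@dataclass
class Query:
    predicates: list

@dataclass
class StatsCatalog:
    histograms: dict = field(default_factory=dict)

    def has(self, table, column):
        return (table, column) in self.histograms

    def build_histogram(self, table, column):
        # A real system would scan (a sample of) the table here.
        self.histograms[(table, column)] = f"hist({table}.{column})"

def optimize(query, catalog):
    # Collect every statistic the cost model will need for this query...
    needed = {(p.table, p.column) for p in query.predicates}
    # ...and block planning until all missing ones are generated, rather
    # than searching with whatever stats happen to be available.
    for table, column in sorted(needed):
        if not catalog.has(table, column):
            catalog.build_histogram(table, column)
    return cost_based_search(query, catalog)

def cost_based_search(query, catalog):
    # Stand-in for the Cascades search; it now sees complete statistics.
    return f"plan built with {sorted(catalog.histograms)}"

q = Query([Predicate("orders", "o_date"), Predicate("lineitem", "l_qty")])
print(optimize(q, StatsCatalog()))
```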
[21:05] Their Cascades search starts small/simple, and then the optimizer decides on the fly whether to expand the search based on the expected query runtime and the performance benefit of more search.
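A rough sketch of that staged escalation, with invented stage names and thresholds (the real stages and cutoffs are SQL Server internals):

```python
# Sketch: escalate to a wider search only when the best plan so far is
# expected to run long enough that extra optimization time can pay off.
from dataclasses import dataclass

@dataclass
class Plan:
    description: str
    expected_runtime_ms: float

def search(query, effort):
    # Stand-in for one Cascades search pass; more effort = better plan.
    quality = {"transaction": 1.0, "quick": 0.5, "full": 0.3}[effort]
    return Plan(f"{effort}-stage plan for {query!r}", 1_000 * quality)

# Invented thresholds: expand only past these expected runtimes.
STAGES_MS = [("quick", 500), ("full", 2_000)]

def optimize_staged(query):
    plan = search(query, "transaction")     # start small and simple
    for effort, threshold_ms in STAGES_MS:
        if plan.expected_runtime_ms < threshold_ms:
            break                           # good enough; stop searching
        plan = search(query, effort)
    return plan

print(optimize_staged("SELECT ..."))
```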
[26:33] They explicitly have a property for the Halloween Problem. Operators specify whether they protect against it, and the optimizer then ensures the property is satisfied. This is mindblowing. I had never thought about using the optimizer for this, but it makes sense. https://t.co/hjjoGCwyvl
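Here is a toy model of how an optimizer property can enforce Halloween protection: operators declare whether they break the read/write pipeline, and the optimizer inserts a blocking spool when a required subtree doesn't. The class and flag names are made up.

```python
# Toy model of Halloween protection as an enforced plan property.
from dataclasses import dataclass, field

@dataclass
class Operator:
    name: str
    children: list = field(default_factory=list)
    # True if this operator fully materializes its input (sort, spool,
    # ...), separating the reading phase from the writing phase.
    blocks_pipeline: bool = False

def protected(op):
    # A blocking operator anywhere below shields the update above it.
    return op.blocks_pipeline or any(protected(c) for c in op.children)

def require_halloween_protection(child):
    # The update side *requires* the property; if the chosen subtree
    # doesn't deliver it, enforce it by inserting a spool.
    if protected(child):
        return child
    return Operator("Spool", [child], blocks_pipeline=True)

scan = Operator("IndexScan(idx_salary)")   # could re-see updated rows
update = Operator("Update(emp.salary)",
                  children=[require_halloween_protection(scan)])
print(update)   # the scan gets wrapped in a blocking Spool
```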
[33:16] This is the menu of all the stats that they maintain for tables. Again, the latest research shows @SQLServer has the most accurate stats: https://t.co/d1btkxmsYf
[39:05] @SQLServer uses a general-purpose cardinality estimation framework. This allows them to programmatically select the best data structure to use per expression type. They rank the choices by "quality of estimation". This deserves further research.
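A sketch of what such a framework could look like: each estimator says whether it applies to an expression and how trustworthy its estimate would be, and the framework picks the highest-quality one. The estimators and quality scores below are invented for illustration.

```python
# Invented estimators ranked by a made-up "quality of estimation" score.
class HistogramEstimator:
    def applies_to(self, expr): return expr["kind"] == "range"
    def quality(self, expr): return 0.9   # backed by a real histogram
    def estimate(self, expr, stats): return stats["rows"] * 0.10

class DistinctCountEstimator:
    def applies_to(self, expr): return expr["kind"] == "equality"
    def quality(self, expr): return 0.7   # uses distinct value counts
    def estimate(self, expr, stats): return stats["rows"] / stats["ndv"]

class GuessEstimator:
    def applies_to(self, expr): return True
    def quality(self, expr): return 0.1   # last-resort magic constant
    def estimate(self, expr, stats): return stats["rows"] / 3

ESTIMATORS = [HistogramEstimator(), DistinctCountEstimator(), GuessEstimator()]

def cardinality(expr, stats):
    # Rank every estimator that can handle this expression by its
    # quality of estimation and use the best one.
    best = max((e for e in ESTIMATORS if e.applies_to(expr)),
               key=lambda e: e.quality(expr))
    return best.estimate(expr, stats)

print(cardinality({"kind": "equality"}, {"rows": 10_000, "ndv": 50}))  # 200.0
```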
[44:16] Question from @Lin_Ma_: Are you using ML for cardinality estimation?
Answer: @cosmosdb is using it. @SQLServer is more conservative and uses a limited form of it.
[53:14] They use heuristics to pre-seed Cascades' memoization table with plans that they think will be good. This lets the search start from a local optimum instead of a random location in the search space.
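A simplified picture of pre-seeding, with a stripped-down memo structure that is far simpler than the real one:

```python
# Simplified memo: heuristic plans go in before the search begins.
INF = float("inf")

class Memo:
    def __init__(self):
        self.exprs = {}                  # group -> list of expressions
        self.best = {}                   # group -> (cost, expression)

    def insert(self, group, expr, cost=INF):
        self.exprs.setdefault(group, []).append(expr)
        if cost < self.best.get(group, (INF, None))[0]:
            self.best[group] = (cost, expr)

def pre_seed(memo, heuristic_plans):
    # Each heuristic plan enters the memo with its estimated cost, so
    # exploration starts next to a known-good plan and has an upper
    # bound to prune against from the very first step.
    for group, expr, cost in heuristic_plans:
        memo.insert(group, expr, cost)

memo = Memo()
pre_seed(memo, [("root", "HashJoin(A,B) -> Filter -> Project", 120.0)])
print(memo.best["root"])   # the search now only keeps plans under 120.0
```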
[54:48] The optimizer uses logical timeouts (number of plans considered) instead of a physical timeout (wallclock time). This ensures that the DBMS always produces plans of the same quality, even under high load. The timeouts are hand-tuned for each optimization stage.
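A sketch of the difference, with made-up per-stage budgets: the cutoff is a count of alternatives considered, so the result is deterministic regardless of machine load.

```python
# Made-up per-stage budgets, counted in plan alternatives considered.
STAGE_BUDGETS = {"transaction": 500, "quick": 5_000, "full": 100_000}

def search_with_logical_timeout(alternatives, stage):
    budget = STAGE_BUDGETS[stage]        # hand-tuned per stage
    best, best_cost = None, float("inf")
    for considered, (plan, cost) in enumerate(alternatives, start=1):
        if cost < best_cost:
            best, best_cost = plan, cost
        if considered >= budget:         # same cutoff no matter how slow
            break                        # the machine is running today
        # A wallclock timeout here would instead cut the search short on
        # a loaded box and silently hand back a worse plan.
    return best

alts = ((f"plan{i}", 1_000 / (i + 1)) for i in range(10_000))
print(search_with_logical_timeout(alts, "quick"))   # always 'plan4999'
```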
[1:00:45] They also use pre-seeding to support DBA-provided query plan hints! This is another genius idea that seems obvious once somebody shows it to you.
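Reusing the same pre-seeding idea for hints might look like this sketch (simplified structures, not SQL Server's actual mechanism):

```python
# Sketch: a hinted plan seeded with cost 0 wins every comparison, so the
# ordinary search machinery "discovers" exactly the plan the DBA forced.
memo_best = {}                            # group -> (cost, expression)

def seed(group, expr, cost):
    # Same pre-seeding entry point the heuristics use.
    if cost < memo_best.get(group, (float("inf"), None))[0]:
        memo_best[group] = (cost, expr)

def apply_plan_hint(hinted_plan):
    # Cost 0 guarantees no alternative can beat the hinted plan.
    seed("root", hinted_plan, 0.0)

apply_plan_hint("NestedLoops(A, IndexSeek(B.pk))")
seed("root", "HashJoin(A, B)", 87.5)      # found later by search; loses
print(memo_best["root"])                  # (0.0, 'NestedLoops(...)')
```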
[1:03:32] This example shows the limitations of Cascades' tree-based plan search. For some optimizations, the DBMS must also consider hypergraphs. See Neumann's SIGMOD '08 paper: https://t.co/s825mXPMqK
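To see why a hyperedge breaks binary-edge reasoning, consider a predicate over three relations. This toy connectivity check is loosely inspired by the paper's hypergraph formulation, not taken from it:

```python
# Toy join connectivity test: a 3-way predicate is a hyperedge that no
# single binary edge can represent.
simple_edges = {("R", "S")}               # binary predicate R.x = S.x
hyperedges = [({"R", "S"}, {"T"})]        # 3-way predicate R.a + S.b = T.c

def can_join(left, right):
    # Two sub-plans can be joined if a simple edge or a hyperedge spans
    # the two sets of relations.
    simple = any((a in left and b in right) or (a in right and b in left)
                 for a, b in simple_edges)
    hyper = any((lhs <= left and rhs <= right) or
                (lhs <= right and rhs <= left)
                for lhs, rhs in hyperedges)
    return simple or hyper

print(can_join({"R"}, {"S"}))             # True: ordinary edge
print(can_join({"R"}, {"T"}))             # False: T needs both R and S
print(can_join({"R", "S"}, {"T"}))        # True: the hyperedge applies
```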

More from the Internet

We’ve spent the last ten months building #CitizenBrowser, a project that aims to peek inside the black box of social media algorithms by building a nationwide panel that shares data with us. Today, we are publishing our first story from the project. /1

.@corintxt crunched the numbers and found that after Facebook flipped the switch for political ads, partisan content elbowed out reputable news outlets in our panelists’ news feeds.
https://t.co/Z0kibSBeQZ /2

You can learn more in our methodology, where we describe how we did this and what steps we took to ensure that we preserved the panelists' privacy. https://t.co/UYbTXAjy5i /3

For me, this project is the culmination of years of experiments in figuring out how to collect data from social media platforms in a way that can lead to meaningful reporting. I’ve described a couple of highlights below 👇 /4

My first attempt was in 2016 at ProPublica, when I was working with @JuliaAngwin. We were interested in seeing whether there was a difference between the ad interests FB disclosed to users in their settings and the interests they showed to marketers. /5

You May Also Like

Recently, the @CNIL issued a decision regarding the GDPR compliance of a little-known French adtech company named "Vectaury". It may seem like small fry, but the decision has potentially wide-ranging impacts for Google, the IAB framework, and today's adtech. It's thread time! 👇

It's all in French, but if you're up for it you can read:
• Their blog post (lacks the most interesting details):
https://t.co/PHkDcOT1hy
• Their high-level legal decision: https://t.co/hwpiEvjodt
• The full notification: https://t.co/QQB7rfynha

I've read it so you needn't!

Vectaury was collecting geolocation data in order to create profiles (e.g., people who often go to this or that type of shop) to power ad targeting. They operate through embedded SDKs and ad bidding, which makes them invisible to users.

The @CNIL notes that profiling based on geolocation presents particular risks since it reveals people's movements and habits. Because the processing is this risky, it requires consent; this will be the heart of their assessment.

Interesting point: they justify the decision in part by how many people COULD be targeted in this way (rather than how many actually have been, though they note that too). Because it happens on phones, and most people have phones, it is considered large-scale processing no matter what.