If you're interested in DB internals, stop what you're doing and watch the @CMUDB Quarantine Talk from Nico + Cesar about @SQLServer's Cascades query optimizer: https://t.co/FCdsbHHEaD

Many talks this semester were good. This one is the best. My thread provides the key takeaways:

[6:50] Microsoft hired Goetz Graefe in the 1990s to help them rewrite the original Sybase optimizer into Cascades. This framework is now used across all MSFT DB products (@SQLServer, @cosmosdb, @Azure_Synapse).
[14:43] The optimizer checks whether it has the stats it will need before starting cost-based search. If not, it blocks planning until the DBMS generates them. This is different from the other approaches we saw this semester, where the DBMS says "we'll do it live!" with whatever stats are available.
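
A minimal sketch of what that gate might look like, purely illustrative (StatsCatalog, build_histogram, and cost_based_search are hypothetical names, not SQL Server internals):

```python
# Hypothetical sketch of the "stats gate" described above.

class StatsCatalog:
    def __init__(self):
        self._histograms = {}                     # (table, column) -> histogram

    def has_stats(self, table, column):
        return (table, column) in self._histograms

    def build_histogram(self, table, column):
        # Stand-in for a (potentially expensive) synchronous scan or sample.
        self._histograms[(table, column)] = f"histogram({table}.{column})"

def cost_based_search(query, catalog):
    return f"plan({query['sql']}) using {len(catalog._histograms)} stats"

def optimize(query, catalog):
    # 1. Block planning until every statistic the search will need exists.
    for table, column in query["predicate_columns"]:
        if not catalog.has_stats(table, column):
            catalog.build_histogram(table, column)    # the query waits here
    # 2. Only now enter cost-based search, so estimation never falls back
    #    to "we'll do it live" with whatever stats happen to exist.
    return cost_based_search(query, catalog)

catalog = StatsCatalog()
query = {"sql": "SELECT * FROM t WHERE t.a = 5",
         "predicate_columns": [("t", "a")]}
print(optimize(query, catalog))    # builds histogram(t.a) first, then plans
```
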
[21:05] Their Cascades search starts small/simple, and then they decide on the fly whether to expand the search based on the expected query runtime and the performance benefit of more search.
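
A sketch of that staged expansion, assuming a hypothetical run_stage callback that searches under a bounded transformation budget and returns (plan, estimated_cost). The stage names follow SQL Server's publicly documented optimization stages; the budgets and thresholds are made up:

```python
STAGES = [
    # (stage name,             transformation budget, "good enough" cost)
    ("transaction_processing", 128,                   10.0),
    ("quick_plan",             2_048,                 100.0),
    ("full_optimization",      65_536,                float("inf")),
]

def staged_optimize(query, run_stage):
    best_plan, best_cost = None, float("inf")
    for stage, budget, good_enough in STAGES:
        plan, cost = run_stage(query, stage, budget)
        if cost < best_cost:
            best_plan, best_cost = plan, cost
        # Expand to the next stage only if the best plan is still expensive
        # enough that the extra search effort could plausibly pay for itself.
        if best_cost <= good_enough:
            break
    return best_plan
```
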
[26:33] They explicitly have a plan property for the Halloween Problem. Operators specify whether they protect against it, and the optimizer then ensures the property is satisfied. This is mind-blowing. I had never thought about using the optimizer for this, but it makes sense. https://t.co/hjjoGCwyvl
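
To make the idea concrete, here is a toy sketch of Halloween protection as a plan property (all names are mine, not SQL Server's): blocking operators like Sort or Spool deliver the property, an UPDATE that reads the index it writes requires it, and the optimizer adds a spool enforcer only when nothing below already provides it.

```python
class Op:
    def __init__(self, name, children=(), provides_hp=False):
        self.name = name
        self.children = list(children)
        self.provides_hp = provides_hp    # True for blocking ops (Sort, Spool)

    def delivers_hp(self):
        # The read side is protected if some operator in it fully
        # materializes its input before rows flow upward (a phase separator).
        return self.provides_hp or any(c.delivers_hp() for c in self.children)

def enforce_halloween_protection(update_op):
    read_side = update_op.children[0]
    if not read_side.delivers_hp():
        # Enforcer: an eager spool reads ALL input rows before the first
        # write, so updated rows can't be re-read by the same scan.
        update_op.children[0] = Op("EagerSpool", [read_side], provides_hp=True)
    return update_op

# UPDATE driven by a scan of the very index being updated -> spool added:
plan = Op("Update(salary)", [Op("IndexScan(idx_salary)")])
enforce_halloween_protection(plan)
print(plan.children[0].name)    # EagerSpool

# A Sort on the read side already separates the phases -> plan unchanged:
plan2 = Op("Update(salary)",
           [Op("Sort", [Op("IndexScan(idx_salary)")], provides_hp=True)])
enforce_halloween_protection(plan2)
print(plan2.children[0].name)   # Sort
```
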
[33:16] This is the menu of all the stats that they maintain for tables. Again, the latest research shows @SQLServer has the most accurate stats: https://t.co/d1btkxmsYf
[39:05] @SQLServer uses a general-purpose cardinality estimation framework. This allows them to programmatically select the best data structure to use per expression type. They rank choices by "quality of estimation". This needs further research.
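
A sketch of what such a framework could look like (the estimators, quality scores, and stats layout are invented for illustration): each estimator reports whether it applies to an expression and how trustworthy its estimate is, and the framework keeps the highest-quality applicable one.

```python
from dataclasses import dataclass

@dataclass
class Estimate:
    rows: float
    quality: float                    # higher = more trustworthy

def histogram_estimator(expr, stats):
    hist = stats["histograms"].get(expr["column"])
    if expr["op"] == "=" and hist is not None:
        return Estimate(rows=hist.get(expr["value"], 1.0), quality=0.9)
    return None

def distinct_count_estimator(expr, stats):
    ndv = stats["ndv"].get(expr["column"])
    if expr["op"] == "=" and ndv:
        return Estimate(rows=stats["table_rows"] / ndv, quality=0.6)
    return None

def magic_constant_estimator(expr, stats):
    # Last resort: a fixed selectivity guess.
    return Estimate(rows=stats["table_rows"] * 0.1, quality=0.1)

ESTIMATORS = [histogram_estimator, distinct_count_estimator,
              magic_constant_estimator]

def estimate(expr, stats):
    candidates = []
    for estimator in ESTIMATORS:
        est = estimator(expr, stats)
        if est is not None:
            candidates.append(est)
    # Rank applicable estimates by "quality of estimation", keep the best.
    return max(candidates, key=lambda e: e.quality)

stats = {"table_rows": 10_000,
         "histograms": {"country": {"US": 4_000.0, "NL": 500.0}},
         "ndv": {"country": 50, "user_id": 9_500}}
print(estimate({"op": "=", "column": "country", "value": "NL"}, stats))
print(estimate({"op": "=", "column": "user_id", "value": 42}, stats))
```
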
[44:16] Question from @Lin_Ma_: Are you using ML for cardinality estimation?
Answer: @cosmosdb is using it. @SQLServer is more conservative and using a minor form of it.
[53:14] They use heuristics to pre-seed Cascades' memoization table with plans that they think will be good. This allows the search to start from a local optimum instead of a random location in the search space.
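
A toy sketch of memo pre-seeding (Memo, pre_seed, and the group layout are all illustrative): installing a heuristic plan before search both gives the search a good starting point and supplies an immediate cost upper bound for pruning.

```python
class Memo:
    def __init__(self):
        self.groups = {}                 # group_id -> set of expressions
        self.best_cost = float("inf")    # upper bound for branch-and-bound

    def insert(self, group_id, expr):
        self.groups.setdefault(group_id, set()).add(expr)

def pre_seed(memo, heuristic_plan, plan_cost):
    # heuristic_plan: (group_id, physical expression) pairs produced by a
    # rule-based planner, e.g. the syntactic join order.
    for group_id, expr in heuristic_plan:
        memo.insert(group_id, expr)
    memo.best_cost = min(memo.best_cost, plan_cost)

memo = Memo()
pre_seed(memo,
         [(0, "HashJoin(g1, g2)"),
          (1, "Scan(orders)"),
          (2, "Scan(lineitem)")],
         plan_cost=220.0)
# Search now explores transformations of a known-good plan, and any
# subplan costing more than 220.0 can be pruned immediately.
```
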
[54:48] The optimizer uses logical timeouts (number of plans considered) instead of physical timeouts (wall-clock time). This ensures the DBMS always produces plans of the same quality, even under high load. The timeouts are hand-tuned for the different optimization stages.
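
A sketch of a logical timeout (the budgets are made-up numbers): the budget is a deterministic count of plan alternatives considered, so the cut-off point, and therefore the final plan, is the same whether the machine is idle or saturated.

```python
STAGE_BUDGETS = {                  # illustrative, hand-tuned per stage
    "transaction_processing": 128,
    "quick_plan": 2_048,
    "full_optimization": 65_536,
}

class LogicalTimeout(Exception):
    pass

class OptimizerBudget:
    def __init__(self, stage):
        self.remaining = STAGE_BUDGETS[stage]

    def charge(self):
        # One unit = one plan alternative considered; no clocks involved.
        self.remaining -= 1
        if self.remaining <= 0:
            raise LogicalTimeout

def search(candidate_plans, budget):
    best = None
    try:
        for cost, plan in candidate_plans:
            budget.charge()
            if best is None or cost < best[0]:
                best = (cost, plan)
    except LogicalTimeout:
        pass    # cut-off happens at the same plan count every time,
                # so plan quality doesn't degrade under CPU load
    return best

budget = OptimizerBudget("quick_plan")
print(search([(90.0, "p1"), (40.0, "p2"), (70.0, "p3")], budget))  # (40.0, 'p2')
```
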
[1:00:45] They also use pre-seeding to support DBA-provided query plan hints! This is another genius idea that seems obvious once somebody shows it to you.
[1:03:32] This example shows the limitations of Cascades' tree-based plan search. For some optimizations, the DBMS must also consider hypergraphs. See Moerkotte & Neumann SIGMOD'08: https://t.co/s825mXPMqK
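
A tiny sketch of why hypergraphs come up (after Moerkotte & Neumann's "Dynamic Programming Strikes Back"; the code is illustrative): a predicate like R.a + S.b = T.c connects three relations at once, so it can't be a binary edge in an ordinary join graph, and the enumerator must track which relation sets cover each hyperedge.

```python
# Ordinary join graph: every edge connects exactly two relations.
binary_edges = [({"R"}, {"S"}),        # R.x = S.x
                ({"S"}, {"T"})]        # S.y = T.y

# Hypergraph: an edge connects two *sets* of relations.
hyper_edges = [({"R", "S"}, {"T"})]    # R.a + S.b = T.c

def edge_applicable(left_set, right_set, edge):
    # A (hyper)edge can be evaluated at a join of left_set x right_set
    # only if one endpoint is fully contained in each side.
    a, b = edge
    return (a <= left_set and b <= right_set) or \
           (a <= right_set and b <= left_set)

# {R} x {T} cannot use the hyperedge (S is missing), so the enumerator
# must not price that predicate as a join predicate there:
print(edge_applicable({"R"}, {"T"}, hyper_edges[0]))        # False
print(edge_applicable({"R", "S"}, {"T"}, hyper_edges[0]))   # True
```
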

More from the Internet

We’ve spent the last ten months building #CitizenBrowser, a project that aims to peek inside the Black Box of social media algorithms, by building a nationwide panel to share data with us. Today, we are publishing our first story from the project. /1

.@corintxt crunched the numbers and found that after Facebook flipped the switch for political ads, partisan content elbowed out reputable news outlets in our panelists’ news feeds.
https://t.co/Z0kibSBeQZ /2

You can learn more in our methodology, where we describe how we did this and what steps we took to ensure that we preserved the panelists' privacy. https://t.co/UYbTXAjy5i /3

Personally, this project is the culmination of years of experiments trying to figure out how to collect data from social media platforms in a way that can lead to meaningful reporting. I’ve described a couple of highlights below 👇 /4

My first attempt was in 2016 at ProPublica, when I was working with @JuliaAngwin. We were interested in seeing if there was a difference between the ad interests FB disclosed to users in their settings and the interests they showed to marketers. /5
Well, this should be a depressing read -- notably because the UK and the US are both terrible when it comes to data protection, but the UK appears to be getting a pass. So much for 'adequacy'.

A few initial thoughts on the Draft Decision on UK Adequacy: https://t.co/ncAqc93UFm

The decision goes into great detail about the state of the UK surveillance system, and notably, "bulk acquisition" of data, and I think I get their argument. /1

For one, while the UK allows similar "bulk powers," it differs from the US regime in terms of proportionality, oversight, and even notice. Some of this came about after the Privacy International case in 2019 (R (Privacy International) v Investigatory Powers Tribunal [2019]). /2

Whereas other bits were already baked in by virtue of the fact that the Human Rights Act is a thing (this concept doesn't exist in the US; rather, we hand-wave about the Constitution and Bill of Rights, and then selectively apply them). /3

For example, UK bulk surveillance (I'm keeping this broad, but the draft policy breaks it down) substantially limits collection to three agencies: MI5, MI6, and GCHQ. By contrast, it's a bit of a free-for-all in the US, where varying policies /4
