Last up in Privacy Tech for #enigma2021, @xchatty speaking about "IMPLEMENTING DIFFERENTIAL PRIVACY FOR THE 2020

Differential privacy was invented in 2006. That seems like a long time ago, but it's not long for a fundamental scientific invention. It took longer than that between the invention of public key cryptography and even the first version of SSL.
But even in 2020, we still can't meet user expectations.
* Data users expect consistent data releases
* Some people call synthetic data "fake data", like "fake news"
* It's not clear what "quality assurance" and "data exploration" mean in a DP framework
We just did the 2020 US census
* required to collect it by the constitution
* but required to maintain privacy by law
But that's hard! What if there were 10 people on a block, all the same sex and age? If you published that count, anyone would know the sex and age of everyone on the block.
Previously used a method called "swapping" with secret parameters
* differential privacy is open and we can talk about privacy loss/accuracy tradeoff
* swapping assumed limitations of the attackers (e.g. limited computational power)
Needed to design the algorithms to get the accuracy we need and then tune the privacy loss based on that.
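To make the privacy loss/accuracy tradeoff concrete, here is a minimal sketch of the textbook Laplace mechanism applied to a single count, such as the 10-person block above. This is not the Census Bureau's production algorithm; the function names, the trial count, and the epsilon values are all assumptions chosen for illustration.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng):
    # A counting query changes by at most 1 when one person is added or
    # removed, so the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(true_count, epsilon, trials=20000, seed=0):
    # Estimate the typical error by repeating the release many times.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += abs(noisy_count(true_count, epsilon, rng) - true_count)
    return total / trials


if __name__ == "__main__":
    # Hypothetical epsilon values: smaller epsilon means less privacy
    # loss but a noisier published count.
    for eps in (0.1, 1.0, 10.0):
        print(eps, round(mean_abs_error(10, eps), 2))
```

The average error is roughly 1/epsilon, so each tenfold reduction in privacy loss costs about a tenfold increase in error on this one count. Even this toy version needs many repeated runs to estimate the error, which hints at why tuning a full-scale system takes so much compute.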

Change in the meaning of "privacy" as relative -- it requires a lot of explanation and overcoming organizational barriers.
By 2017 they thought they had a good understanding of how differential privacy would fit -- just use the new algorithm where the old one was used, to create the "microdata detail file".
* different groups at the Census thought that meant different things
* before, states were processed as they came in. Differential privacy requires everything to be computed at once
* required a lot more computing power
* differential privacy system has to be developed with real data; can't use simulated data to do this because the algorithms in the literature weren't designed for data anything like as complex as the real data (multiracial people, different kinds of households, etc)
* to understand the privacy/accuracy trade-off requires a lot of runs, representing a *lot* of computer time
Census bureau was 100% behind the move
* initial implementation was by Dan Kifer, who took a sabbatical
* expanded team with Simson and others
* 2018 end to end test
* original development was on an on-prem Linux cluster
* then got to move to AWS Elastic compute... but the monitoring wasn't good enough and had to create their own dashboard to track execution
* it wasn't a small amount of compute
* republished the 2010 census data using the differentially private algorithm and then had a conference to talk about it
* ... it wasn't well-received by the data users who thought there was too much error
For example: if we add a random value to a child's age, we might get a negative value, which probably won't happen to a child's age.

If you avoid that, you might add bias to the data. How to avoid that? Let some data users get access to the measurement files [I don't follow]
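The bias from avoiding negative values can be seen in a small simulation. This is a sketch under stated assumptions: a hypothetical infant of true age 0, Laplace noise with scale 1, and the simplest possible fix of clamping negative results to 0. It is not the post-processing the Census Bureau actually uses.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def average_reported_age(true_age, scale, clamp, trials=50000, seed=1):
    # Average the released value over many hypothetical releases.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        noisy = true_age + laplace_noise(scale, rng)
        if clamp:
            # Naive fix: never report a negative age.
            noisy = max(0.0, noisy)
        total += noisy
    return total / trials


if __name__ == "__main__":
    print("unclamped:", round(average_reported_age(0, 1.0, clamp=False), 2))
    print("clamped:  ", round(average_reported_age(0, 1.0, clamp=True), 2))
```

Without clamping, the noise averages out and the mean stays close to the true age. With clamping, every negative draw is pushed up to 0 while the positive draws are left alone, so the mean drifts upward (to about half the noise scale in this setup). That upward drift is the bias.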
In summary, this is retrofitting the longest-running statistical program in the country with differential privacy. Data users have had some concerns, but the belief is that it will all work out.
Code is up on github and papers are up online. (@xchatty have some links?)

[end of talk]

