Interesting dataset. Data like this can be used to identify strong links between...

kasey_junk · on Oct 21, 2018

I'm not sure I could disagree with you more. At least in the US there is a very strong expectation that communications between governmental employees are non-private except in very special circumstances. You'll note Matt says that the Police and Human Services departments have not responded. I'd guess that's not an accident, because police records and personnel/medical records are largely exempt from FOIA requests.

Further, the idea that sleuths (amateur seems pejorative in this case) are working with city governments to battle corruption does not hold up to Matt's (or other journalists') experience. By and large the governments only provide the data because they are required to, and we have made sure they are required to by representative legislative action.

Had the IT dept. in Seattle not made an obvious mistake this would likely have not been a story at all and the data would have been an interesting data set for informed democratic functions.

danso · on Oct 21, 2018

> Further, the idea that sleuths (amateur seems pejorative in this case) are working with city governments to battle corruption does not hold up to Matt's (or other journalists') experience.

Yes, exactly this. History has shown that government employees and officials, being human, are reluctant to dig into, never mind snitch about, matters that may impugn their own colleagues. That a random citizen, or even a journalist, could convince law enforcement to go to a judge based on trends found via analysis of anonymized hashes is unrealistic. Never mind that it effectively denies the benefits of transparency to anyone who isn't trained in data science.

jacquesm · on Oct 21, 2018

> between governmental employees

The request contains the names of private individuals through BCC and CC headers requested and exactly when they communicated with which government officials.

kasey_junk · on Oct 21, 2018

Yes. When I say “between governmental employees” I mean between them & anyone.

This is a normal expectation to the point where there are lots of rules about what public servants can even use private email addresses for. This was a big deal of course in the 2016 campaign.

This is likely a cultural difference. I view the data that Matt requested as mine because that governmental agent is acting in my name.

anigbrowl · on Oct 21, 2018

Fine by me.

anigbrowl · on Oct 21, 2018

> release only anonymized data to protect the identities of the people working for the city

No thanks, that's an invitation to further corruption. Let me put that in context: recently a group of people were pushing a local authority to adjust its policy on the public release of arrestees' photos (aka mugshots). The group felt the photos were being selectively used for political purposes, which the police and city council denied.

FOIA requests revealed that the authorities were unhappy with their public image as it related to law enforcement and decided to use mugshots as part of a social media campaign to change public perception. Can't have a social media campaign without any content, so people were arrested on nonsense charges that were subsequently dropped, so as to generate mugshots that could be publicized.

Thanks to the FOIA requests, members of the public were able to confront the city council with specific names and dates of the people who drafted, approved, and implemented this policy. That information will also be central to future litigation on the topic.

Jakob · on Oct 21, 2018

The linked Kaggle dataset https://www.kaggle.com/foiachap/seattle-email-metadata/ shows that the final returned data are Excel tables with content which looks like this:

  Sender or Created by: "Herring, Kaya" <Kaya.Herring@seattle.gov>
  Recipients in To line: Ortiz, Piper; Jones, Raphael
  Recipients in Cc line: Valdez, Khloe
  Recipients in Bcc line: 
  Sent: 3/23/17 18:08

(I changed the names to random ones)

jacquesm · on Oct 21, 2018

I suspected as much. So the names are out there. Pretty sloppy.

coderintherye · on Oct 21, 2018

It's not sloppy, it's what the law allows for. You seem to have a misunderstanding in stating "the metadata should have only contained anonymized entries for the email addresses of the counterparties." That's simply incorrect in terms of what is allowed under law and for the request, though there is some variation state-to-state as to their FOIA.

darksaints · on Oct 21, 2018

But why would we need to protect identities of public servants? I disagree with the entire presumption that the information requested is private. It falls squarely within the principles of the laws on the books and is consistent with the ideals of open government that led to those laws being instituted in the first place.

More importantly, information that can be used to blackmail a public servant is IMO information that should be kept public. Blackmail is only useful if that information is kept private between a blackmailer and victim. Put it all on WikiLeaks and suddenly blackmail holds no weight because the blackmailer lost his leverage. If the information is of public interest and is severe enough that it is blackmail worthy, the entire public deserves to know it.

rocqua · on Oct 21, 2018

hash@hash alone isn't enough. Keyed hashes with a secret key might work.

The issue with hash@hash is that it is still possible to see whether a given person sent an email. Moreover, there are probably similar issues as with hashed_known_hosts as described in [1]. In short, the space of possible emails might be small enough to just brute-force search for all e-mails.

[1] https://news.ycombinator.com/item?id=18082033
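
To make the brute-force concern concrete, here is a minimal sketch. It assumes a hash@hash scheme built from unkeyed SHA-256 over the local part and the domain; the thread does not say which hash was meant. The helper names, the `example.gov` domain, and the candidate name lists are all hypothetical.

```python
import hashlib
from itertools import product


def h(text: str) -> str:
    """Unkeyed SHA-256 of a lowercased string (assumed hash function)."""
    return hashlib.sha256(text.lower().encode("utf-8")).hexdigest()


def hash_at_hash(address: str) -> str:
    """Hash the local part and the domain separately: hash@hash."""
    local, _, domain = address.partition("@")
    return f"{h(local)}@{h(domain)}"


# Hypothetical candidate space: a handful of names and one known domain.
first_names = ["kaya", "piper", "raphael", "khloe"]
last_names = ["herring", "ortiz", "jones", "valdez"]
domain = "example.gov"

# Pretend this is the published data: pseudonyms only, no addresses.
published = {
    hash_at_hash("piper.ortiz@example.gov"),
    hash_at_hash("khloe.valdez@example.gov"),
}

# Enumerate every candidate address and keep the ones whose hash was published.
recovered = {}
for first, last in product(first_names, last_names):
    candidate = f"{first}.{last}@{domain}"
    pseudonym = hash_at_hash(candidate)
    if pseudonym in published:
        recovered[pseudonym] = candidate

print(sorted(recovered.values()))
```

With a staff directory or a list of common names, the candidate space for a single city domain is small, so the same loop scales to the whole dataset.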

jacquesm · on Oct 21, 2018

> The issue with hash@hash is that it is still possible to see whether a given person sent an email.

In that case you already have their email, and you know what hash and salt were used. It's game over at that point afaic, nothing will stop you from reversing all of the email addresses.

Even just seeing the graph laid out would allow you to infer who some of the players are. In general, to release such information on the assumption that it will be impossible to reverse it is irresponsible, and I would have loved for the city to recognize this and to get a judge to sign off on the release.

rocqua · on Oct 21, 2018

> In that case you already have their email, and you know what hash and salt were used. It's game over at that point afaic, nothing will stop you from reversing all of the email addresses.

Agreed, hence the need for more than a plain hash. Note that technically, a 'salt' is unique per user and generally doesn't need to be kept secret. It really only applies to storing passwords.

What I suggested is more like a pepper [1], but in this use case you could use the same pepper for every address. Alternatively, you could just generate UUIDs for each address and publish those, but that requires a lookup in the UUID table for every e-mail (just as salted hashes would require a lookup of the salt for every e-mail).

[1] https://en.wikipedia.org/wiki/Pepper_(cryptography)
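
A rough sketch of both options, assuming HMAC-SHA256 as the keyed hash (the comment above doesn't name a specific construction) and a plain in-memory dict as the UUID table. The function and class names and the sample addresses are hypothetical.

```python
import hashlib
import hmac
import secrets
import uuid


def keyed_pseudonym(pepper: bytes, address: str) -> str:
    """HMAC-SHA256 of the normalized address under a secret pepper."""
    normalized = address.strip().lower().encode("utf-8")
    return hmac.new(pepper, normalized, hashlib.sha256).hexdigest()


class UuidTable:
    """Alternative: random UUID per address, kept in a private lookup table."""

    def __init__(self) -> None:
        self._table: dict[str, str] = {}

    def pseudonym(self, address: str) -> str:
        key = address.strip().lower()
        if key not in self._table:
            self._table[key] = str(uuid.uuid4())
        return self._table[key]


# One secret 128-bit pepper shared by every address; it is never published.
pepper = secrets.token_bytes(16)

a1 = keyed_pseudonym(pepper, "Kaya.Herring@example.gov")
a2 = keyed_pseudonym(pepper, "kaya.herring@example.gov")
b = keyed_pseudonym(pepper, "piper.ortiz@example.gov")

# The same address always maps to the same pseudonym, so the graph survives.
print(a1 == a2, a1 == b)

table = UuidTable()
u1 = table.pseudonym("kaya.herring@example.gov")
u2 = table.pseudonym("kaya.herring@example.gov")
print(u1 == u2)
```

The keyed hash needs no per-address state, only the one secret. The UUID table has no key to guess at all, but it has to be stored and consulted for every e-mail.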

jacquesm · on Oct 21, 2018

I don't think you could use the same 'pepper' for every address. After all, if you know at least one address in the database (for instance, your own) and what time you sent the email (which you do) then you could use that to recover the pepper that was used for the hashing. So I really do believe the salt should be unique per address used.

rocqua · on Oct 21, 2018

As per the Wikipedia article:

"Where the salt only has to be long enough to be unique, a pepper has to be secure to remain secret (at least 112 bits is recommended by NIST), otherwise an attacker only needs one known entry to crack the pepper."

If you use e.g. a 128-bit pepper, anyone trying to brute-force it from a known email-hash combination would need to search a 128-bit key space.
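
For a sense of scale, a back-of-the-envelope estimate. The guess rate of 10^12 per second is purely an assumption for illustration, and the function name is hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def expected_years(pepper_bits: int, guesses_per_second: float) -> float:
    """Average time to find a pepper by exhaustive search.

    On average the attacker finds the key after trying half the key space.
    """
    expected_guesses = 2 ** (pepper_bits - 1)
    return expected_guesses / guesses_per_second / SECONDS_PER_YEAR


RATE = 1e12  # assumed guesses per second, not a measured figure

for bits in (32, 64, 128):
    print(f"{bits:>3}-bit pepper: {expected_years(bits, RATE):.2e} years")
```

At that assumed rate a 32-bit pepper falls in a fraction of a second, while a 128-bit pepper works out to something on the order of 10^18 years, which is why knowing one address-hash pair doesn't help much as long as the pepper is long and random.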