
Interesting dataset. Data like this can be used to identify strong links between contractors and government officials.

One problem is that the metadata should have only contained anonymized entries for the email addresses of the counterparties of the Seattle.gov addresses; the article leaves this unclear.

Another potential problem: if a case of corruption or nepotism is identified that has not been passed to the authorities for review, the author suddenly finds himself in possession of data that can be used to blackmail some fairly powerful people. In fact, there might be fish from higher than city-level government in the trawl, because there have to be links between Seattle officials and state officials.

Yet another problem is that the addresses most likely contain the names of private individuals (including employees) as well, and I am not quite sure what to think of that but feel that the city has no business releasing that in cleartext.

A better way for amateur sleuths and the city government to work together to battle corruption would be to release only anonymized data to protect the identities of the people working for the city, for instance by releasing only hashes of the email addresses in a hash@hash format, where the hash for all Seattle domains is released to the requester. All the relevant analysis could still be done, and if something interesting was found it could be passed to law enforcement, who in turn could ask a judge to order de-anonymization of the entries they are interested in.
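A minimal sketch of what such a hash@hash release could look like. The choice of unkeyed SHA-256, the truncation to 12 hex characters, the lowercasing, and the function names are all illustrative assumptions, not anything the city or the article specifies:

```python
import hashlib


def short_hash(text: str) -> str:
    """Unkeyed SHA-256 of the lowercased text, truncated to 12 hex characters.

    Algorithm and truncation length are illustrative assumptions.
    """
    return hashlib.sha256(text.lower().encode("utf-8")).hexdigest()[:12]


def anonymize(address: str) -> str:
    """Map local@domain to hash(local)@hash(domain)."""
    local, _, domain = address.rpartition("@")
    return f"{short_hash(local)}@{short_hash(domain)}"


# The single domain hash that would be disclosed to the requester.
SEATTLE_DOMAIN_HASH = short_hash("seattle.gov")


def is_city_address(pseudonym: str) -> bool:
    """True if the pseudonym's domain part is the disclosed city hash."""
    return pseudonym.endswith("@" + SEATTLE_DOMAIN_HASH)
```

With this, a requester could still tell city addresses from outside ones and count how often two pseudonyms exchange mail, without seeing names in cleartext.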




I'm not sure I could disagree with you more. At least in the US there is a very strong expectation that communications between governmental employees are non-private except in very special circumstances. You'll note Matt says that the Police and Human Services departments have not responded. I'd guess that's not an accident, because police records and personnel/medical records are largely exempt from FOIA requests.

Further, the idea that sleuths (amateur seems pejorative in this case) are working with city governments to battle corruption does not hold up to Matt's (or other journalists') experience. By and large the governments only provide the data because they are required to, and we have made sure they are required to by representative legislative action.

Had the IT dept. in Seattle not made an obvious mistake, this would likely not have been a story at all, and the data would have been an interesting data set for informed democratic functions.


> Further, the idea that sleuths (amateur seems pejorative in this case) are working with city governments to battle corruption does not hold up to Matt's (or other journalists') experience.

Yes, exactly this. History has shown that government employees and officials, being human, are reluctant to dig into, never mind snitch about, matters that may impugn their own colleagues. The idea that a random citizen, or even a journalist, could convince law enforcement to go to a judge based on trends found via analysis of anonymized hashes is unrealistic. Never mind that it effectively denies the benefits of transparency to anyone who isn't trained in data science.


> between governmental employees

The requested data contains the names of private individuals, via the requested BCC and CC headers, along with exactly when they communicated with which government officials.


Yes. When I say “between governmental employees” I mean between them & anyone.

This is a normal expectation to the point where there are lots of rules about what public servants can even use private email addresses for. This was a big deal of course in the 2016 campaign.

This is likely a cultural difference. I view the data that Matt requested as mine because that governmental agent is acting in my name.


Fine by me.


> release only anonymized data to protect the identities of the people working for the city

No thanks, that's an invitation to further corruption. Let me put that in context: recently a group of people were pushing a local authority to adjust its policy on the public release of arrestees' photos (aka mugshots). The group felt it was being selectively used for political purposes, which the police and city council denied.

FOIA requests revealed that the authorities were unhappy with their public image as it related to law enforcement and decided to use mugshots as part of a social media campaign to change public perception. Can't have a social media campaign without any content, so people were arrested on nonsense charges that were subsequently dropped, so as to generate mugshots that could be publicized.

Thanks to the FOIA requests, members of the public were able to confront the city council with the specific names and dates of the people who drafted, approved, and implemented this policy; that information will also be central to future litigation on the topic.


The linked Kaggle dataset https://www.kaggle.com/foiachap/seattle-email-metadata/ shows that the final returned data are Excel tables with content which looks like this:

  Sender or Created by: "Herring, Kaya" <Kaya.Herring@seattle.gov>
  Recipients in To line: Ortiz, Piper; Jones, Raphael
  Recipients in Cc line: Valdez, Khloe
  Recipients in Bcc line: 
  Sent: 3/23/17 18:08
(I changed the names to random ones)
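For anyone who wants to turn records like this into sender/recipient pairs, here is a rough sketch. It assumes the plain-text layout shown above (one "label: value" pair per line, recipients separated by semicolons, timestamps as month/day/two-digit-year); the real files are Excel tables, so the function names and parsing details here are hypothetical:

```python
from datetime import datetime
from email.utils import parseaddr

RECIPIENT_FIELDS = (
    "Recipients in To line",
    "Recipients in Cc line",
    "Recipients in Bcc line",
)


def parse_record(text: str) -> dict:
    """Split each line at its first colon into a label and a value."""
    fields = {}
    for line in text.strip().splitlines():
        label, _, value = line.partition(":")
        fields[label.strip()] = value.strip()
    return fields


def edges(record: dict) -> list:
    """Return (sender, recipient, field, sent) tuples for one record."""
    sender_name, _sender_address = parseaddr(record["Sender or Created by"])
    sent = datetime.strptime(record["Sent"], "%m/%d/%y %H:%M")
    result = []
    for field in RECIPIENT_FIELDS:
        # An empty Bcc line yields no recipients.
        for name in record.get(field, "").split(";"):
            if name.strip():
                result.append((sender_name, name.strip(), field, sent))
    return result


SAMPLE = """
  Sender or Created by: "Herring, Kaya" <Kaya.Herring@seattle.gov>
  Recipients in To line: Ortiz, Piper; Jones, Raphael
  Recipients in Cc line: Valdez, Khloe
  Recipients in Bcc line:
  Sent: 3/23/17 18:08
"""
```

Calling `edges(parse_record(SAMPLE))` gives one tuple per recipient, which is the edge list you would feed into any graph analysis.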


I suspected as much. So the names are out there. Pretty sloppy.


It's not sloppy, it's what the law allows for. You seem to have a misunderstanding in stating "the metadata should have only contained anonymized entries for the email addresses of the counterparties." That's simply incorrect in terms of what is allowed under law and for the request, though FOIA laws do vary somewhat from state to state.


But why would we need to protect identities of public servants? I disagree with the entire presumption that the information requested is private. It falls squarely within the principles of the laws on the books and is consistent with the ideals of open government that led to those laws being instituted in the first place.

More importantly, information that can be used to blackmail a public servant is IMO information that should be kept public. Blackmail is only useful if that information is kept private between a blackmailer and victim. Put it all on WikiLeaks and suddenly blackmail holds no weight because the blackmailer lost his leverage. If the information is of public interest and is severe enough that it is blackmail-worthy, the entire public deserves to know it.


hash@hash alone isn't enough. Keyed hashes with a secret key might work.

The issue with hash@hash is that it is still possible to see whether a given person sent an email. Moreover, there are probably similar issues as with hashed_known_hosts as described in [1]. In short, the space of possible emails might be small enough to just brute-force search for all e-mails.

[1] https://news.ycombinator.com/item?id=18082033
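To make the brute-force concern concrete, here is a rough sketch. It assumes the published values are unkeyed, truncated SHA-256 hashes of the local part, and that city addresses follow a First.Last pattern; the hash details and the name lists are assumptions for illustration only:

```python
import hashlib
from itertools import product


def short_hash(text: str) -> str:
    """Unkeyed, truncated SHA-256 (assumed anonymization scheme)."""
    return hashlib.sha256(text.lower().encode("utf-8")).hexdigest()[:12]


def reverse_hashes(published, first_names, last_names):
    """Dictionary attack: hash every First.Last candidate and look for matches."""
    table = {
        short_hash(f"{first}.{last}"): f"{first}.{last}"
        for first, last in product(first_names, last_names)
    }
    return {h: table[h] for h in published if h in table}


# Hypothetical published hashes and guessed name lists.
published = {short_hash("Kaya.Herring"), short_hash("Piper.Ortiz")}
recovered = reverse_hashes(
    published,
    first_names=["Kaya", "Piper", "Raphael", "Khloe"],
    last_names=["Herring", "Ortiz", "Jones", "Valdez"],
)
```

Even with realistic lists of a few thousand first and last names, that is only millions of candidates, which a laptop can hash in seconds.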


> The issue with hash@hash is that it is still possible to see whether a given person sent an email.

In that case you already have their email, and you know what hash and salt were used. It's game over at that point afaic, nothing will stop you from reversing all of the email addresses.

Even just seeing the graph laid out would allow you to infer who some of the players are. In general, to release such information on the assumption that it will be impossible to reverse it is irresponsible, and I would have loved for the city to recognize this and to get a judge to sign off on the release.


> In that case you already have their email, and you know what hash and salt were used. It's game over at that point afaic, nothing will stop you from reversing all of the email addresses.

Agreed, hence the need for more than a plain hash. Note that technically, a 'salt' is unique per user and generally doesn't need to be kept secret. It really only applies to storing passwords.

What I suggested is more like a pepper [1], but in this use-case, you could use the same pepper for every address. Alternatively, you could just generate UUIDs for each address and publish those, but that requires a lookup in the UUID table for every e-mail (just like salted hashes would require a lookup of the salt for every e-mail).

[1] https://en.wikipedia.org/wiki/Pepper_(cryptography)
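A sketch of both options, under some assumptions: HMAC-SHA-256 stands in for the "keyed hash" (my choice, nobody above named a specific construction), the pepper is 128 random bits kept by the releasing party, and all names here are hypothetical:

```python
import hashlib
import hmac
import secrets
import uuid

# Secret 128-bit pepper; it stays with the city and is never published.
PEPPER = secrets.token_bytes(16)


def keyed_pseudonym(address: str, pepper: bytes = PEPPER) -> str:
    """HMAC-SHA-256 of the lowercased address under the secret pepper."""
    digest = hmac.new(pepper, address.lower().encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


class UuidTable:
    """Alternative: random UUID per address, kept in a private lookup table."""

    def __init__(self):
        self._table = {}

    def pseudonym(self, address: str) -> str:
        key = address.lower()
        if key not in self._table:
            self._table[key] = str(uuid.uuid4())
        return self._table[key]
```

The keyed version needs no per-address state, only the one secret; the UUID version carries no cryptographic assumptions at all but requires the table to be stored and consulted for every row.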


I don't think you could use the same 'pepper' for every address. After all, if you know at least one address in the database (for instance, your own) and what time you sent the email (which you do) then you could use that to recover the pepper that was used for the hashing. So I really do believe the salt should be unique per address used.


As per the wikipedia article:

"Where the salt only has to be long enough to be unique, a pepper has to be secure to remain secret (at least 112 bits is recommended by NIST), otherwise an attacker only needs one known entry to crack the pepper."

If you use e.g. a 128-bit pepper, anyone trying to brute-force that based on a known email-hash combination would need to brute force 128 bits.



