Kafka, GDPR and Event Sourcing (danlebrero.com)
184 points by delebe on April 12, 2018 | 116 comments



> A suggestion from Michiel Rook’s blog is that maybe is enough to remove the data from the projections/read models, and there is no need to touch the data in the event store.

No. This is very easy. The right to be forgotten means: if I’m done with your web service and I want to have my account deleted, you have to delete everything you have on me in a reasonable timeframe and which is not required by you for other laws (such as keeping receipts of my purchases for 6-10 years).

If someone can replay that log step by step (and I think this is the idea of Event Sourcing), and my data shows up in there for no particular reason, it’s illegal.

There is no need for you to archive data, if I don’t want you to create that archive or if I want you to delete that archive.

Also, if I haven’t used your web service after X years, you should send a friendly reminder that my account and all data will be erased soon. If I don’t reactivate, you should delete all information that helps you or anyone else to identify me as a person.

Edit: arguing with the purpose of “archiving” is exactly this fucked up weaseling around which companies do to circumvent the law. Take responsibility for your users’ data and delete everything after a while or on request.


Yes. Forget about just not rendering the data in projections. Article 5.1e states: “Personal data shall be kept in a form which permits identification of data subjects for no longer than is necessary for the purposes for which the personal data are processed [...]”

Article 5 contains concrete rules and also sets out the basic principles.

In short, you must be able to remove all information that may enable you to “single out” an individual (a physical individual by eg name or an online identity such as a particular web visitor by eg a cookie id).


So, if you're building your infrastructure in a way that you need to access a separate datastore to discover which individual is linked to which event in a stream, then leaving the stream of events as it is, but removing all referred keys in the joined datastore, would be "GDPR-compliant"?


It’s easy, but poses massive problems in the public sector or finance. Say you’re terrible at paying your debt, but manage to do so 10 years late.

Now your transaction is complete and you have the right to be forgotten, but I wouldn’t want to ever lend you money again. How do I keep that record?


You take Art 6 (f) of the GDPR. You write down that your processing is necessary for the purposes of the legitimate interests pursued by you. E.g. I don't want to lend money to you ever again.

But to identify you and not mistake you for someone else, I need to keep information like name, passport number and a reason why I put you on the list in the first place.

Now you can keep this information.


But the important point: you can only use that information for that purpose. No "oh well, now that we can keep it, let's use it for marketing and such".


Good point!

Looking at Recital 50, there may be a possibility to use information for something other than the original purpose if the new purpose is compatible.

But another interesting thought: Can Marketing hold as a legitimate interest?

Let's look at Recital 47: The processing of personal data for direct marketing purposes may be regarded as carried out for a legitimate interest.

Interesting finding, you think. Reading this again, the word "may" is striking. OK, this is not a carte blanche. And there is Recital 70, which states:

"Where personal data are processed for the purposes of direct marketing, the data subject should have the right to object to such processing, including profiling to the extent that it is related to such direct marketing, whether with regard to initial or further processing, at any time and free of charge. That right should be explicitly brought to the attention of the data subject and presented clearly and separately from any other information."

So where does this leave us: Yeah, we can use data for marketing, and it may well be a legitimate interest of a company. But we will have to inform the data subject that they can object to the usage.


Yeah, not really. Recital 47 says "may provide a legal basis for processing, provided that the interests or the fundamental rights and freedoms of the data subject are not overriding".

If I ask to be forgotten, my interest is that my data is no longer used, and your legitimate interest in holding on to it to avoid lending to me in the future does not override that for any other purpose.


That is right, and I guess it would count as objection to marketing.


Well, what about stuff like this: http://www.bbc.com/news/technology-43752344


Aren't there provisions for things like back-ups and other stuff that you can reasonably expect not to turn up later? I would think data in the event store but not in projections could be considered as such, to some extent.


I should disclaim this with: I've only spoken to people who have looked into this. I can't guarantee it's correct, but it's the situation as I know it currently. It's also assuming cold backups that are not processed in any way that accesses the should-be-removed data.

You don't need to remove data from backups in realtime but you do need a system in place to re-delete data after a backup restore that previously had a removal request made against it.

From what I've heard so far the best method is some form of GUID that means nothing when there's no data attached to it but can be used to identify user records that need removing if they re-appear during a backup restore

Naturally, you'll want to make the please-delete-these-GUIDs table more redundant than the rest of what you've backed up
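
A minimal sketch of that restore-time re-deletion, assuming in-memory dictionaries as stand-ins for the live store and the cold backup. The names `erased_ids`, `erase_user` and `restore_from_backup` are hypothetical, and a real system would persist the GUID list separately and more redundantly than the data itself:

```python
import uuid

# The "please-delete-these-GUIDs" list. It holds only opaque GUIDs, no personal data.
erased_ids = set()


def erase_user(store, user_id):
    """Delete a user's record from the live store and remember the opaque GUID."""
    store.pop(user_id, None)
    erased_ids.add(user_id)


def restore_from_backup(backup):
    """Restore a backup, then re-apply every erasure recorded so far."""
    restored = dict(backup)
    for user_id in erased_ids:
        restored.pop(user_id, None)
    return restored


alice_id = str(uuid.uuid4())
bob_id = str(uuid.uuid4())
live = {alice_id: {"name": "Alice"}, bob_id: {"name": "Bob"}}
backup = dict(live)                  # cold backup taken before the erasure request

erase_user(live, alice_id)           # Alice asks to be forgotten
live = restore_from_backup(backup)   # later: disaster recovery from the old backup
```

After the restore, Alice's record is removed again because her GUID is on the list, while Bob's record comes back untouched.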


There’s no distinction between storage and backup. Most large backup companies are developing fixes for removing data in backups.


Makes me wonder about block chain data that can't be deleted


That's quite simple - as a controller holding my data, you're responsible for safeguarding the data and fulfilling the requirements, so if putting that data on a public blockchain makes it impossible to do your duty, then you're not allowed to put that data on a public blockchain.

If I put my data there, that's my problem, but you're not allowed to do that.


> If I put my data there, that's my problem, but you're not allowed to do that.

is it really? that's a part I did not really understand about the regulation and can't find anything about voluntarily relinquishing control of personal data.

i.e: I leave my personal name and email address on a public forum. Google later adds that post to its index, or another random company like Web Archive or archive.is scrapes it.

What are my rights? What are their obligations? is the forum owner liable for my action if I didn't explicitly agree for my data to be shared by him to all unforeseeable future scrapers?

> as a controller holding my data, you're responsible for safeguarding the data

this confuses me even more. say I'm running an analytics system. I track users through a cookie. tracking cookies are, for some reason, personal information. if a user deletes the cookie from their browser and then asks me to fulfill my obligations about erasing his data from my system, how do I identify him? who's liable then?


> I leave my personal name and email address on a public forum. Google later adds that post to its index, or another random company like Web Archive or archive.is scrapes it.

I wish there was an official guide on how to run a forum properly, because all forums suffer from the same problems.

What I figured out so far:

You as a forum administrator must delete that personal data on request. This could mean digging through all posts of a user and deleting all PII, although you don’t have to delete all posts if they contain no PII.

You as a forum administrator must get your privacy policy right and possibly make it harder for third parties to index PII. This depends on your intentions and whether or not people know that your forum is public.

Say, your forum is about car parts. You could set a subforum that asks members to introduce themselves (where PII are most likely) to noindex or hide from the public.

This way, you’ve put in reasonable effort to protect your users and your obligation is done. Indexing by third parties is now out of your control.

But say you run a medical forum where people post health data (considered super sensitive) and are expected to post a lot of PII, you might have to set the entire board to ”members only“.

Although I’m not sure about any of this, and to some extent most forums provide value by being visible to visitors and indexed by Google. Quite sad.


IMHO this is not talked about much because that's not a new GDPR issue; GDPR introduces a bunch of new things about e.g. consent of processing data, but the "right to be forgotten" and requests to delete my PII that someone else posted to your forum is pretty much unchanged, it was a thing in EU legislation for quite some time already (might be a full decade?) so all the "here's what to do now" articles don't touch this.


dug a little into this.

the relevant excerpt of art 17 seems to be this:

> the controller, taking account of available technology and the cost of implementation, shall take reasonable steps, including technical measures, to inform controllers which are processing the personal data that the data subject has requested the erasure

so it seems a "disallow" in robots would solve the issue with crawlers and public data, but that works at page levels, not content level. pushing a crawl request after information erasure may not be enough, since we need to inform the controllers on why the data has changed, not just that it has changed.


The last part of your question is easy to answer. Before you start processing personal information, you need to ask the affected persons for consent. So now you know how to identify them.

If you want to get away without consent, you need to design your system in such a way that it doesn't hold personal information. E.g., maybe randomize the raw tracking data as first processing step, and only keep the randomized version.


What if your data was put there before the law was passed?


Will you be doing any processing on it after the law is passed? If you do, you certainly fall under the law. If you're not doing any processing and you don't have a legal obligation to hold the data, why are you?

(Also you probably need to demonstrate that you have consent to hold that data in compliance with the law. But I'm not entirely sure on how that applies to pre-existing holdings.)


If you put PII on the blockchain then God have mercy on your soul when the regulators find out...


Can a malicious actor put PII (or copyrighted/otherwise illegal works) onto a blockchain to "kill" it?


There's apparently all sorts of illegal stuff embedded in bitcoin metadata


An uncensorable, global, distributed data store opens up all kinds of possibilities for weaponizing data. If you're intending to blackmail* someone by threat of doxxing, threatening to put their information on the blockchain is much worse than simply putting it on pastebin.

People have also attempted to use blockchain data for coercion. For example some folks were worried about the centralization of Bitcoin mining in China, so they posted information about Tiananmen Square on the blockchain, hoping the Chinese government would simply ban Bitcoin.

*blockmailing them, if you will.


The general approach for using a blockchain with personally identifiable information seems to be storing only that something happened, not exactly what. Effectively, this means storing on the blockchain only a resource-local identifier, some metadata (e.g., 'UPDATED RECORD'), and a reference to the database where the actual data is stored.


Your right to be forgotten is balanced against a business’s right to make money, and society’s right to remember. I imagine that a court would consider a whole range of factors in weighing that balance.

How much real harm was done to you? How expensive would it be for us to delete the data? What steps have we taken to protect your privacy?

I highly doubt that the courts will see it as black and white as you suggest.


That is nonsense. The right to be forgotten is EU law, enacted by the EU parliament, enforceable by EU court. Article 17 of the GDPR.

There is no "right to make money" or "right to remember" in this sense at all. You can argue that there is a public interest in allowing people to make money, and a public interest in remembering, but the place where these interests are weighed is in parliament. Not in courts.


> There is no ... "right to remember"

To quote Article 17.3.d:

> for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes in accordance with Article 89(1) in so far as the right referred to in paragraph 1 is likely to render impossible or seriously impair the achievement of the objectives of that processing

In fact the whole of Article 17 contains exceptions to the right to be forgotten.

Meanwhile Article 16 of the EU Charter of Fundamental Rights (enacted 2009) is the "right to do business", and some of it conflicts with GDPR.

All legislation is interpreted by the courts. It's very rarely as simple as "the law says X, therefore you are guilty. Case closed." Instead, lawyers argue about conflicting legislation, and previous cases. A judge weighs all those arguments, and gives a decision, which then affects other future cases.

I am not a lawyer, but I have spent quite a bit of time working with lawyers to comply with GDPR. What I have said here interprets their words.


>A judge weighs all those arguments, and gives a decision, which then affects other future cases.

Careful. This is not common law jurisdiction. Case law doesn't apply here, and the ECJ rulings themselves have the primary interpretative function, not the arguments of lawyers.


Thank you, I'm out of my depth in explaining the difference.


Yes, so there is no "right to remember" and the right to do business is not the "right to make money" doing business. So you're at the very least misquoting your lawyers.

17(3), and in more detail 89(1) closely circumscribes the scope of exceptions. Specifically in the context of the post you replied to: There is obviously no public interest exception that would allow you to keep the data around because a specific technical implementation becomes more convenient/cheaper if you do.


I suspect you're going to be very disappointed by the outcome of the GDPR.


Would it be sufficient to scrub the event log of PII so you don’t have to break as much of the object graph in the read model?


Okay, this approach is really cool, but... just hear me out: what if... you don't store personally identifiable data in the event store? What if you only store references/ids that point to services which can resolve those ids to data, and that data needs not to be immutable. In the event of erasure request you shred/anonymize that data only, without touching kafka. I mean it's pretty obvious to me, to the point i feel like i'm missing some huge point and grossly misunderstanding the whole problem


I think that's exactly what any real system would do. This sort of widespread event encryption and forgetting is crazy. It is a very bad idea to delete events from the immutable event log. There is no telling how downstream consumers will react but it probably won't be good. It's much simpler to only hold references to a user (UserID) and then if you need to forget a user you simply burn whatever is behind that UserID.
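
A minimal sketch of that layout, assuming an in-memory list as the append-only log and a plain dictionary as the mutable profile store. All names here (`event_log`, `profiles`, `forget`, `describe`) are hypothetical and not taken from the article or from Kafka:

```python
# Append-only event log: events carry an opaque user id, never the personal data.
event_log = []
# Mutable side store that resolves user ids to personal data.
profiles = {}


def register(user_id, name, email):
    profiles[user_id] = {"name": name, "email": email}
    event_log.append({"type": "UserRegistered", "user_id": user_id})


def place_order(user_id, sku):
    event_log.append({"type": "OrderPlaced", "user_id": user_id, "sku": sku})


def forget(user_id):
    """Burn whatever is behind the id. The event log itself is not touched."""
    profiles.pop(user_id, None)


def describe(event):
    """Project an event for display, resolving the id if it still resolves."""
    profile = profiles.get(event["user_id"])
    who = profile["name"] if profile else "<forgotten user>"
    return f'{event["type"]} by {who}'


register("u-42", "Jane Doe", "jane@example.com")
place_order("u-42", "sku-1")
forget("u-42")
summary = describe(event_log[1])
```

The log keeps both events after `forget`, so downstream consumers see an unchanged history, but the id no longer resolves to a person. The events remain linked to each other through the shared id, which is the weakness the replies below discuss.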

This is actually how the GDPR is designed: it's not about user data, it's about personal data, that is, data that can be associated or related back to an "identifiable person." The problem is not that you need to delete all the user's data across all your systems; the problem is that you need to break any associations that would allow you to identify a natural person who has asked to be "forgotten." The funny thing is that while some people make lots of noise about GDPR being some huge "burden," the reality is that any architect worth her salt should've been designing systems like this from the very start rather than letting personal data be replicated en masse from one system to another. It is a basic normalization of data. All the GDPR is doing, like most regulations, is requiring that businesses follow best practices and not cut corners that might harm end users. The great thing about this in the long run is that this fixes a huge problem with the internet. Today users are reluctant to sign up to a service precisely because they don't want to surrender their data, because once it's gone it's gone forever. I suspect users will be more open to trying new services if they can be assured that it's possible to "un-sign up", that is, be "forgotten" by a service.


GDPR is troublesome not because it requires something particular that is hard to implement, but because it requires doing something poorly defined while promising heavy fines for straying from a path one can't clearly see. This minefield situation, in fact, makes SMEs much more vulnerable compared to big corporations, and I suppose it's not what most people here would like to see.


> if you need to forget that a user you simply burn whatever is behind that UserID [...] you need to break any associations that would allow you to identify a natural person

The problem is that as you collect more "impersonal" data, the probability that your collective data can still be used to identify someone approaches 100%.

"We don't know their name, but Deleted User 4510 was friends with every single member of the John Doe family except for John Doe himself..."


Surely the implication is that when you 'burn whatever is behind that UserID' you would delete the records of who DeletedUser4510 was friends with (and who was friends with DeletedUser4510).

If all you do is rename someone to DeletedUser4510, you pretty obviously haven't deleted all the data you hold on them.


Right, and what if those connections are burned into an immutable data source through the connection event in your event sourcing?

That is what this article is about, and methods to deal with that.


> you would delete the records of who DeletedUser4510 was friends with

You erase the friend-linkages...

... Then you find that Jane Doe made a post in response to a blanked DeletedUser4510 post, and responses can only be made by friends, so therefore Jane was at some point friends with DeletedUser4510.

So you put in tombstones for all posts and all post-to-post causation links.

... Then you find that the entire Doe family is tagged in a photo by Jebediah Doe, except for one guy, and the comments are every family member and somebody named DeletedUser4510...

Anyway, my point is that it's really easy to get caught in such a fog of relational data that gaps merely change a certainty of identification down to an extremely high probability, and an event-sourced system -- by design -- makes it extra difficult to remove data or to even plan for its removal.


Does the GDPR actually identify what information is considered personal? I haven't followed it at all.

For instance, if someone posted on a message board, is it enough to rename their user to anonymous. Or do you have to go back and delete their user, leaving orphaned records? Or do you have to delete all of their postings, which could leave discussion history in disarray.

What about something like a phone service? Erasing a line's recent history is easy enough, but going back years to delete records from archival systems that weren't designed to handle it could be problematic. For instance, in very large data tables, deletes can be very slow. Call records are often stored in compressed flat files, which would mean searching through tens of thousands of flat files for lines to delete. And some of that data would have been processed through a system like Splunk or Logstash that isn't particularly friendly to deletes and would require a massive re-indexing operation to flush the necessary records. And some of those systems probably have tiered storage that includes offline, slow-to-recover archives built with cost assessments that did not account for frequent data removal. (Think about how much it would cost to download 500 TB from Glacier, decompress it, decrypt it, put it on an active system, remove a few records, re-encrypt, re-compress and re-upload it 4 or 5 times a month.) And think of how that cost compares with "we generally need to reference maybe 1 GB of data every three or four months".


The law is not defined in a "technical" way. It's definitely not about renaming users or deleting their postings. GDPR is declarative: can you identify someone through the records in your possession, whether database or not, and connect that to any protected information? (their political views, sexual preferences, name and location, economic status, health issues…).

The law is quite broad and fuzzy, which is what makes technical people uneasy:

‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person;

It is entirely possible that in your message board example, you've deleted the user, you've deleted their postings, and are still in violation. For one, the person requesting deletion doesn't even have to be your user. Someone else may have posted such information about them on your board.

People misconstrue GDPR to be only about databases and unlinking users' records from tables (mainly because it's the easiest thing to do). But it really is about all and any personal and sensitive information of natural people, full stop.


> For one, the person requesting deletion doesn't even have to be your user.

If what you wrote is true, then organizations are liable for data that someone may, possibly even just theoretically, be able to use to identify someone else.

Given that the law is not 'technical', maybe it'll be interpreted much more leniently than a straightforward reading would lead one to expect.


Yes, it's almost certain it will be interpreted more leniently. At least to start with.

Elizabeth Denham, UK's information commissioner in charge of data protection enforcement, had this to say:

"Having larger fines is useful but I think fundamentally what I'm saying is it's scaremongering to suggest that we're going to be making early examples of organisations that breach the law or that fining a top whack is going to become the norm. Our office will be more lenient on companies that have shown awareness of the GDPR and tried to implement it, when compared to those that haven't made any effort."

In reality, nobody knows how GDPR will pan out exactly (including the authorities).


In that case, individuals are data controllers and not data processors. Everyone posting on an online forum controls the data they post. Adding a 'delete post' button fulfils the requirement for a data processor.


> It's much simpler to only hold references to a user (UserID) and then if you need to forget that a user you simply burn whatever is behind that UserID.

But if someone was a user of your service and your services included, say, photo or video hosting, and you delete their name, their phone number and such, but all of the data you don’t delete still references a common user ID, then all of that data can still be related back to an identifiable person.

If, instead of holding a reference to the user ID on the data themselves, you have a separate table for data ownership, that is one step better.

I.e., all data is stored separately and each piece of data has a GUID; the user profile is stored separately with its own GUID; and separate tables tie data GUIDs to user GUIDs. When you delete a profile, you delete both the user profile and the table entries that tie data to that profile's GUID.
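
A minimal sketch of that ownership-table layout, using an in-memory SQLite database. The table and column names (`profiles`, `items`, `ownership`) are hypothetical and chosen only for illustration:

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE profiles  (user_guid TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE items     (item_guid TEXT PRIMARY KEY, body TEXT);
    CREATE TABLE ownership (item_guid TEXT, user_guid TEXT);
""")


def add_user(name):
    user_guid = str(uuid.uuid4())
    conn.execute("INSERT INTO profiles VALUES (?, ?)", (user_guid, name))
    return user_guid


def add_item(user_guid, body):
    # The item row itself carries no user reference, only its own GUID.
    item_guid = str(uuid.uuid4())
    conn.execute("INSERT INTO items VALUES (?, ?)", (item_guid, body))
    conn.execute("INSERT INTO ownership VALUES (?, ?)", (item_guid, user_guid))
    return item_guid


def delete_profile(user_guid):
    # Remove the profile and every link from data to that profile.
    conn.execute("DELETE FROM ownership WHERE user_guid = ?", (user_guid,))
    conn.execute("DELETE FROM profiles WHERE user_guid = ?", (user_guid,))


user = add_user("Jane")
add_item(user, "holiday photo")
delete_profile(user)
remaining_items = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
remaining_links = conn.execute("SELECT COUNT(*) FROM ownership").fetchone()[0]
remaining_profiles = conn.execute("SELECT COUNT(*) FROM profiles").fetchone()[0]
```

After `delete_profile`, the item row survives with no link back to any user. The sketch only removes the explicit link; it does nothing about identification through the content of the item itself.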

However I think even that is not enough.

Yes it’s hard to implement full deletion, but you are the one that chose to accept data in the first place. If you can’t implement a system where the data can be deleted you shouldn’t be accepting that data in the first place IMO.

Because even if you delete the references to the data, the individual pieces of data can still be tied back together with data analysis. For example by looking at meta data in pictures, or looking for artifacts that come from lens scratches.

Also, even if you delete references of ownership of an image, probably you aren’t deleting records of comments that other people made on the photo, because if you went to that extent you probably could have just deleted the data proper, so then with access to the image and knowing what other people commented it will likely be possible to determine who originally uploaded that photo.


Not to derail, but rather share: I have worked on a healthcare solution which did just that, every medical record had a guid, and by itself it didn't carry a direct relation to a patient; it was done via an additional table. This was done to ensure data security, ability to use the medical records for scientific purposes (records themselves were anonymous) and to manage access to these records (i.e. patient was the owner of his/her records by default, but could choose to delegate read and modification permissions to e.g. their spouse or physician, or emergency personnel). Come to think of it, it was a pretty awesome system, even though it ran on asp.net (it was some 7 years ago i think). This design was necessary because data privacy regulations for medical records have always been very strict in the EU; come to think of it, people worried about GDPR should take a look at the way medical systems handle data.


One important thing not to forget about: make sure your event log doesn't leak metadata. You may have deleted PII like email or username, but what will you do if certain user activity logged under some user id provides sufficient information to identify the user (e.g. a friend tags the user in a group photo: the event contains just an id, but it allows linking a specific person in a still-available image to that id)?


That is an interesting point - a photo by itself is personally identifiable information[1] - so you should blur/distort/remove that particular face - or, to stay on the safe side - all faces, just like Google Maps does to be GDPR compliant, I guess. Jesus, this regulation is such a can of worms...

[1] https://www.quora.com/Is-a-photo-of-a-person-considered-PII-...


If you upload personally identifiable information about me, surely I should have the right to get it removed? You already have no right to upload the data in the first place without my explicit permission, so the problem in your scenario is not GDPR compliance unless I gave you permission, and if I did, I already have the right to withdraw that permission at any point in time if I so choose. I don't see how GDPR changes anything in that scenario.


The user who posts the photo is a data controller, and it is her obligation under GDPR to remove the photo when requested. All data processors in the chain (the photo app, cache, host, etc.) have to provide the means for the data controller to remove the image. That's my understanding for user-generated content which may or may not contain personal information.


No, the users aren't data controllers, the GDPR does not apply at all to any acts of natural persons in the course of purely personal activity, and GDPR does not apply any obligation to that user (other laws, however, may do that). It does, however, apply to the organization storing that photo - since the default case is that they're forbidden to do that, and as article 6 point 1 (https://gdpr-info.eu/art-6-gdpr/) states, processing is lawful only if one of the listed conditions apply.

It does seem that interpretations handling user-generated content will be tricky, and likely would require additional legislation to clarify that, which is a common scenario for such laws.


https://gdpr-info.eu/art-4-gdpr/

7. ‘controller’ means the natural or legal person, public authority, agency or other body which, alone or jointly with others, determines the purposes and means of the processing of personal data; where the purposes and means of such processing are determined by Union or Member State law, the controller or the specific criteria for its nomination may be provided for by Union or Member State law;

8. ‘processor’ means a natural or legal person, public authority, agency or other body which processes personal data on behalf of the controller;

You are right though that GDPR doesn't apply to non-business individuals, but that doesn't make the image host the data controller. Maybe the user and the host are data controllers 'jointly'? Dropbox is quite clear that at least in their Business and Education accounts, the account owners are data controllers for all their data.


This logic would effectively kill all photo-sharing services. I guess no one wants to blur people's faces in their Instagram posts, so a regulation following the old Islamic hadith not to depict humans is a really terrible prospect.


The whole point of event sourcing is the ability to manage state effectively. When you move state out of the event sourcing model you are losing the state management benefits of event sourcing for that data. Maybe that’s a reasonable trade-off for GDPR, but there is a trade-off.


There are ways to put some state outside the event stream that play nicely with the model.

For example you could store the body in a content addressed store (a key value store whose keys are hashes of the value). This preserves the ideal model of immutable events as you cannot accidentally change the content of the event since that would violate the property of the CAS. But nothing prevents you from deleting an entry in the CAS.
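
A minimal sketch of such a content-addressed store, assuming SHA-256 as the hash and in-memory structures. The names `cas`, `put`, `get` and `delete_body` are hypothetical:

```python
import hashlib

cas = {}      # content-addressed store: sha256 hex digest -> payload bytes
events = []   # append-only event stream holding only references into the CAS


def put(payload: bytes) -> str:
    key = hashlib.sha256(payload).hexdigest()
    cas[key] = payload
    return key


def get(key: str):
    """Return the payload, or None if it has been deleted from the CAS."""
    payload = cas.get(key)
    if payload is not None and hashlib.sha256(payload).hexdigest() != key:
        # A changed payload no longer matches its address, so tampering is detectable.
        raise ValueError("payload does not match its content address")
    return payload


def append_event(event_type: str, payload: bytes):
    events.append({"type": event_type, "body_ref": put(payload)})


def delete_body(key: str):
    cas.pop(key, None)


append_event("ProfileUpdated", b'{"email": "jane@example.com"}')
ref = events[0]["body_ref"]
before = get(ref)
delete_body(ref)
after = get(ref)
```

The event keeps its `body_ref` after deletion, so the stream stays immutable while the body is gone. One caveat to keep in mind: a hash of a short or guessable value can be brute-forced, so the reference itself may still need care.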

However you still have to find all the records you need to delete!

The trick with forgetting the encryption key makes this step an O(1) operation.

On the other hand it forces you to decide upfront the granularity of ownership, i.e. if you treat payload deletion as a batch process you can deal with mistakes, change your mind about which data should or should not be deleted. While with the encryption technique, changing your mind would require you to perform a full history replay and reencrypt data in a new stream (and throw away the old one) with the new granularity model.


This is actually a recommended approach within the GDPR legislation itself: Article 25 (data protection by design and by default) names pseudonymisation as an example measure.

https://blog.varonis.com/gdpr-requirements-list-in-plain-eng...


I'm not a Kafka user. However, I am a long-time keen implementer of event sourcing and log-derived data stores. Many of these involve per-user or per-individual sqlite databases.

I cannot understand why Kafka is causing people problems. I am surprised that this seems to be being accepted as a problem with event sourcing in general.

The most obvious solution I came up with for the GDPR right to be forgotten was to have per-user logs. User gone? Delete their log.

Then you just need to handle how you share the projected data from the individual event logs. Messages are unlikely to be spread over multiple log events and can be just sourced directly. It's clear who owns that data: it's the user who wrote the message as they can at any time delete the only copy.

Aggregation is more difficult but can be done if you structure the things you are aggregating such that you only need a few lookups per user. Again you can cache references to certain events that can be efficiently looked up using a (user_id, event_id) pair. There's nothing to stop you storing the latest projection in a separate table of the per-user database file.
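
Roughly what that looks like with one sqlite file per user. This is only a sketch: the directory layout, table schema, and function names are assumptions for the example, not anyone's production code.

```python
import os
import sqlite3
import tempfile

DATA_DIR = tempfile.mkdtemp()  # assumption: one directory holds all user logs


def log_path(user_id: int) -> str:
    return os.path.join(DATA_DIR, f"user-{user_id}.sqlite")


def append_event(user_id: int, event_type: str, payload: str) -> int:
    """Append to the user's own log and return the new event_id."""
    conn = sqlite3.connect(log_path(user_id))
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS events ("
                "event_id INTEGER PRIMARY KEY AUTOINCREMENT, "
                "type TEXT NOT NULL, payload TEXT NOT NULL)"
            )
            cur = conn.execute(
                "INSERT INTO events (type, payload) VALUES (?, ?)",
                (event_type, payload),
            )
            return cur.lastrowid
    finally:
        conn.close()


def read_event(user_id: int, event_id: int):
    """Look up one event by its (user_id, event_id) pair."""
    if not os.path.exists(log_path(user_id)):
        return None  # the user, and with them the only copy, is gone
    conn = sqlite3.connect(log_path(user_id))
    try:
        return conn.execute(
            "SELECT type, payload FROM events WHERE event_id = ?",
            (event_id,),
        ).fetchone()
    finally:
        conn.close()


def forget_user(user_id: int) -> None:
    """Right to be forgotten: delete the user's whole log file."""
    try:
        os.remove(log_path(user_id))
    except FileNotFoundError:
        pass


# Other parts of the system keep only the reference, never the content.
message_ref = (42, append_event(42, "MessagePosted", "hello"))
message_before = read_event(*message_ref)
forget_user(42)
message_after = read_event(*message_ref)
```

Anything that cached the projected data still has to be invalidated separately; the sketch only covers the source of truth.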

Is there some technical reason why so many Kafka users seem to be having such difficulty with this?


With Kafka you get topics and partitions. A topic is a grouping of data, normally with a common schema. Partitions allow parallelism by spreading the data across multiple brokers.

Data on a topic can be partitioned by a key, so you can partition by user id, resulting in all of a user's events going to the same partition. However, there’s no functionality to delete from a partition by key.
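
To illustrate why keyed partitioning doesn't give you per-user deletion, here is a toy model of it. The partition count is made up, and `zlib.crc32` is only a stand-in for the hash (Kafka's default partitioner uses murmur2 on the key bytes).

```python
import zlib

NUM_PARTITIONS = 4  # assumption for the example


def partition_for(key: str) -> int:
    # Stand-in hash, not the one Kafka actually uses.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


partitions = {p: [] for p in range(NUM_PARTITIONS)}
users = ["user-1", "user-2", "user-3", "user-4", "user-5"]

for round_number in range(2):
    for user in users:
        record = (user, f"event-{round_number}")
        partitions[partition_for(user)].append(record)


def partitions_holding(user: str) -> set:
    return {
        p for p, records in partitions.items()
        if any(key == user for key, _ in records)
    }


# Partitions that interleave records of more than one user.
shared = [
    p for p, records in partitions.items()
    if len({key for key, _ in records}) > 1
]
```

Every user lands in exactly one partition, but with more users than partitions at least one partition is shared, so dropping "the user's partition" would take other users' events with it.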

If each user has their own topic that you can delete you would end up with a lot of topics and Kafka isn’t great with that.

You also have the issue that, if you are event sourcing or doing DDD, your events are based around domains/aggregates with clear context bounds. The domain is very likely not a user but may involve a user. It’s a tricky modeling issue.

Event sourcing is about immutable events. You cannot change an event in the past as it’s already happened; accountants don’t use rubbers. The ability to go back and change past events or delete them breaks the laws and guarantees of an immutable event system.

It’s a difficult problem. I like event sourcing due to the properties it gives but it may not be suitable for some things due to GDPR.

One thing I have heard talked about is encrypting personally identifiable information in an event with a user-specific key. When the user requests deletion, you delete the key. The events don’t change; they are still there, but anything which identifies a user is unreadable.
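
A rough sketch of that key-deletion idea. To keep it standard-library only, the cipher here is a toy HMAC-SHA256 keystream with no authentication; that is an assumption for illustration, and a real system would use a vetted AEAD such as AES-GCM from a crypto library. All names are hypothetical.

```python
import hashlib
import hmac
import secrets


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # Toy keystream: HMAC-SHA256 in counter mode. Illustration only.
    out = bytearray()
    counter = 0
    while len(out) < length:
        block = nonce + counter.to_bytes(8, "big")
        out += hmac.new(key, block, hashlib.sha256).digest()
        counter += 1
    return bytes(out[:length])


def encrypt(key: bytes, plaintext: bytes) -> bytes:
    nonce = secrets.token_bytes(16)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(a ^ b for a, b in zip(plaintext, stream))


def decrypt(key: bytes, blob: bytes) -> bytes:
    nonce, body = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))


user_keys = {}  # mutable key store, kept outside the event log
event_log = []  # append-only, never rewritten


def record(user_id: int, event_type: str, pii: str) -> None:
    key = user_keys.setdefault(user_id, secrets.token_bytes(32))
    event_log.append(
        {"type": event_type, "user_id": user_id,
         "pii": encrypt(key, pii.encode())}
    )


def read_pii(event: dict):
    key = user_keys.get(event["user_id"])
    if key is None:
        return None  # key was deleted: the payload is unreadable
    return decrypt(key, event["pii"]).decode()


def forget(user_id: int) -> None:
    user_keys.pop(user_id, None)  # one delete, however many events exist


record(42, "UserRegistered", "alice@example.com")
record(42, "EmailChanged", "alice@example.org")
readable_before = [read_pii(e) for e in event_log]
forget(42)
readable_after = [read_pii(e) for e in event_log]
```

The log is untouched by `forget`; only the key store changes, which is why the key store (and its backups) becomes the thing you have to guard.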


Indeed - we designed ours in a similar way as you describe and whilst GDPR causes a small bit of work for us it is not like we face a total design flaw type issue that it appears people are experiencing with Kafka.

We are looking at adding a Nuke command + Nuked event that will serve to destroy all data in the aggregate root snapshot and leave behind (effectively) a tombstone snapshot. The command handler will then purge all prior events leaving behind just a Nuked event. This means the event stream doesn't just disappear entirely, it can still be synchronised out to dependent systems, and indeed those systems will learn of the nukage just like any other event.
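
In rough outline, and with every name here made up for the example (this is a sketch of the idea, not our actual implementation), it would look something like this:

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    aggregate_id: str
    type: str
    data: dict = field(default_factory=dict)


class EventStore:
    def __init__(self):
        self.streams = {}      # aggregate_id -> list of events
        self.snapshots = {}    # aggregate_id -> latest snapshot
        self.subscribers = []  # callables notified of every new event

    def append(self, event: Event) -> None:
        self.streams.setdefault(event.aggregate_id, []).append(event)
        for notify in self.subscribers:
            notify(event)

    def nuke(self, aggregate_id: str) -> None:
        """Handle the Nuke command for one aggregate root."""
        # Publish Nuked like any other event so dependents hear about it.
        self.append(Event(aggregate_id, "Nuked"))
        # Purge all prior events, leaving just the Nuked event.
        self.streams[aggregate_id] = self.streams[aggregate_id][-1:]
        # Replace the snapshot with a tombstone.
        self.snapshots[aggregate_id] = {"nuked": True}


read_model = {}  # a dependent system's projection


def on_event(event: Event) -> None:
    if event.type == "Nuked":
        read_model.pop(event.aggregate_id, None)
    else:
        read_model.setdefault(event.aggregate_id, {}).update(event.data)


store = EventStore()
store.subscribers.append(on_event)
store.append(Event("customer-42", "Registered", {"email": "alice@example.com"}))
store.append(Event("customer-42", "AddressChanged", {"town": "Utrecht"}))
store.nuke("customer-42")
```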


What (if any) is your favourite book on event sourcing?

I've seen tech meetup lectures (one particularly fascinating one was a guy who worked for the NHS) but I'd like to understand it better than I do.


To me the biggest thing that made event sourcing click was when I was working on an internal app that needed full auditing. With event sourcing the audit log is the authoritative data source.

Learning that there was a name for this and that others had worked it out better than I had was very handy. I didn't find a need for a book because I found reading Martin Fowler's article[0] on it to give me a good enough understanding.

I've found that it's a less well-defined pattern than some implementations make it seem. There are many trade-offs to be made. At the one extreme you can treat events as physical events that can be easily understood. Let's say we have a user that wants to add a phone number. Well, we could have an AddPhoneNumber event. But this leads to a RemovePhoneNumber event as well.

But you can also take the other extreme and say that each event is a set of collected information -- perhaps a different element of the root of a json object -- and that you can just diff to see what changed and only need to look at the latest such event.

I've found there's a middle ground that doesn't require too much thinking: related data goes together if it's not going to change very often. So I will generally put all contact information in together as an UpdateContacts event. This way we don't need to implement all the list operations for phone numbers and can skip looking for such events after the latest has been found. You also still have something simple enough to work out what changed and that is unlikely to fill your storage.
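
As a sketch of that middle ground (the event shapes and helper names are invented for the example):

```python
def current_contacts(events: list) -> dict:
    """Scan backwards and stop at the latest UpdateContacts event."""
    for event in reversed(events):
        if event["type"] == "UpdateContacts":
            return event["contacts"]
    return {}


def changed_fields(old: dict, new: dict) -> list:
    """Work out what changed by diffing two whole-contact payloads."""
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


events = [
    {"type": "UpdateContacts",
     "contacts": {"email": "a@example.com", "phones": ["555-0100"]}},
    {"type": "UpdateProfile", "display_name": "A."},
    {"type": "UpdateContacts",
     "contacts": {"email": "a@example.com",
                  "phones": ["555-0100", "555-0199"]}},
]

latest = current_contacts(events)
diff = changed_fields(events[0]["contacts"], events[2]["contacts"])
```

No AddPhoneNumber/RemovePhoneNumber pair is needed; the list operations collapse into replacing the whole contact block.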

If you are implementing event sourcing, I would like to point out one thing that I didn't see written anywhere: set a reasonable upper limit on per-user storage. Because you will find somebody who, even not maliciously, just keeps changing things back and forth and it's better to stop them adding events than to stop all users by running out of disk space.
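
The limit can be as simple as a byte budget checked on append. A sketch, with an arbitrary limit and made-up names:

```python
MAX_BYTES_PER_USER = 1024  # arbitrary limit for the example


class QuotaExceeded(Exception):
    pass


class UserLog:
    def __init__(self, max_bytes: int = MAX_BYTES_PER_USER):
        self.events = []
        self.used = 0
        self.max_bytes = max_bytes

    def append(self, payload: bytes) -> None:
        # Refuse the write before it is stored, so one user cannot
        # exhaust the storage shared with everyone else.
        if self.used + len(payload) > self.max_bytes:
            raise QuotaExceeded(
                f"{self.used + len(payload)} bytes would exceed {self.max_bytes}"
            )
        self.events.append(payload)
        self.used += len(payload)


log = UserLog(max_bytes=10)
log.append(b"12345")
log.append(b"12345")
try:
    log.append(b"x")
    rejected = False
except QuotaExceeded:
    rejected = True
```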

[0] https://martinfowler.com/eaaDev/EventSourcing.html


Thank you for the thoughtful response.

I was hoping for a good book of the "here be dragons" kind.

Often books on an architecture gloss over the rough bits and then you find them yourself when you already committed to it.

I've become incredibly conservative when it comes to application architecture the older I get.


If you're going to dive into CQRS/ES, I'd recommend:

* Enterprise Integration Patterns (basically an entire book about messaging architectures) [1]
* Vaughn Vernon's books and online writing [2]
* Domain Driven Design by Eric Evans [3]
* and most of what Greg Young, Udi Dahan, and that constellation of folks have done online (lots of talks and blog articles)

Depending on your platform of choice, there may be others worth reading. For my 2¢, the dragons are mostly in the design phase, not the implementation phase. The mechanics of ES are pretty straightforward—there are a few things to look out for, like detection of dropped messages, but they're primarily the risks you see with any distributed system, and you have a collection of tradeoffs to weigh against each other.

In design, however, your boundaries become very important, because you have to live with them for a long time and evolving them takes planning. If you create highly coupled bounded contexts, you're in for a lot of pain over the years you maintain a system. However, if you do a pretty good job with them, there's a lot of benefits completely aside from ES.

[1] https://www.amazon.com/Enterprise-Integration-Patterns-Desig...

[2] https://vaughnvernon.co

[3] https://www.amazon.com/Domain-Driven-Design-Tackling-Complex...


By default Kafka has no mechanism to delete records from the event store, apart from a TTL.

You can enable topic compaction which can be used to delete records by overwriting them, but compacted topics just don't make sense for a lot of Kafka use cases.
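
A toy model of what compaction does to a keyed log, to show both how the overwrite-to-delete trick works and why it clashes with event sourcing. This simulates the semantics in plain Python; it is not Kafka client code, and real compaction runs in the background and removes tombstones only after a retention delay.

```python
def compact(log: list) -> list:
    """Keep only the latest record per key; a None value is a tombstone."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)
    survivors = sorted(
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None  # tombstoned keys disappear entirely
    )
    return [(key, value) for _, key, value in survivors]


log = [
    ("user-1", "a@example.com"),
    ("user-2", "b@example.com"),
    ("user-1", "a2@example.com"),
    ("user-2", None),  # tombstone: "delete everything for user-2"
]

compacted = compact(log)
```

The tombstone does remove user-2, but user-1's earlier record is gone as well: a compacted topic keeps the latest state per key, not the history, which is the part event sourcing cares about.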

So the only option left is the one suggested in this article: encrypt everything with individual keys and delete the key (rather than the data) to comply with GDPR.


Your idea is right, but it really depends on the event-store implementation. For example, Kafka supports only a limited number of topics (10K if I remember well), so unless you have a very limited audience, I don't think having one topic per user is the solution over there.


Right. This makes sense as to why it is a problem for Kafka users. I must admit I was in a bit of a hurry when I looked into Kafka to see if it would be a better fit, but it seemed to be more about the scaling than I cared for.


As someone who is debating how to handle the right to erasure, this is very interesting. I've also been struggling with how to automate erasure within the 3rd party SaaS tools that we use.

I think, by my count, we have 34 SaaS products, of which something like half contain our customers' PII.

Does the regulation state that we must guarantee right to erasure or that we must make a reasonable effort to erase customer data on request?

Are people generally automating this fractal process or manually deleting from systems that only offer a manual process (such as Google Analytics)?


You’re not allowed to store PII in Google Analytics already as per the terms of service:

You will not and will not assist or permit any third party to, pass information to Google that Google could use or recognise as personally identifiable information.

https://www.google.com/analytics/terms/us.html , section 7.


Even if you don't send personal data to Google Analytics, they store personal data automatically. At least Google Analytics for Firebase stores the following identifiers:

- Mobile ad IDs
- IDFVs/Android IDs
- Instance IDs
- Analytics App Instance

https://firebase.google.com/support/privacy/


But those can change, and if you don't store personal information you can't relate them to a specific individual.


> Does the regulation state that we must guarantee right to erasure or that we must make a reasonable effort to erase customer data on request?

The regulation does not say that this has to be automatic or instant - but if the request to erase comes in, you must be able to somehow do it. If it means a person going through admin interfaces of all 34 SaaS tools, that's fine. But in the end you have to erase all of it, "it would take unreasonable effort" is not accepted as a reason to refuse or skip some third parties where the data has been sent.

If the SaaS products don't offer permanent deletion options, then you can't send personally identifiable data to them in the first place.


A reasonable effort is to have an architecture in place that allows you to comply with the removal process. If that isn't the case, rebuild until it is.


> The main concern with this approach is that the event store is no longer immutable

I think what happens when you delete all events of a (deleted) aggregate root (such as a customer who requested to be forgotten) can be interpreted more charitably, in a way that lets the event store still be called immutable.

If you look at a functional programming language, you cannot force a data structure to be removed from memory. However, that obviously doesn't mean that your program consumes an infinite amount of memory. If a data structure is not referenced anymore, it'll be removed from memory (by the GC). Your program itself didn't mutate the state of the data structure, so from that point of view everything is still immutable.

Now, let's apply the same principle to an event store: A deleted aggregate root (the to-be-forgotten customer) should have been removed from all projections (as required per GDPR). If you replay the events, it shouldn't matter to the final state of a read projection whether it processed the events belonging to this aggregate root, or not.

Therefore, one could interpret that removing the events of a deleted aggregate root in a GC-like fashion leaves the event store immutable, in the sense that my program(s) can't mutate the state (themselves), and their output doesn't change.
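
To make the argument concrete, here is a small sketch with invented event types: replaying the full log and replaying the "garbage collected" log give the same projection.

```python
def project(events: list) -> dict:
    """Read projection: customer id -> current name."""
    customers = {}
    for event in events:
        if event["type"] in ("CustomerRegistered", "CustomerRenamed"):
            customers[event["id"]] = event["name"]
        elif event["type"] == "CustomerForgotten":
            customers.pop(event["id"], None)
    return customers


def collect_garbage(events: list) -> list:
    """Drop every event of aggregate roots that have been forgotten."""
    forgotten = {e["id"] for e in events if e["type"] == "CustomerForgotten"}
    return [e for e in events if e["id"] not in forgotten]


events = [
    {"type": "CustomerRegistered", "id": 1, "name": "Alice"},
    {"type": "CustomerRegistered", "id": 2, "name": "Bob"},
    {"type": "CustomerRenamed", "id": 1, "name": "Alice B."},
    {"type": "CustomerForgotten", "id": 1},
]

full_replay = project(events)
collected_replay = project(collect_garbage(events))
```

This only holds for projections that really do drop the forgotten aggregate; a projection that counts "registrations ever" would see a different result after collection.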


Going forward no one would consider Kafka a suitable system for any kind of production systems. Immutable persistent data already caused issues for HIPAA compliance, but now it is virtually illegal. Using event sourcing for only foreign keys is way too hard to enforce engineering-discipline wise, and is just not worth the risk.

Back to SQL/NoSql/Memory stores and their mutability.


The better way seems to be to make sure all events carry only references to users, and move user data off the event sourcing system. That way the records that something happened to user no. 42 can remain immutable, but if user no. 42 leaves, their account can be tombstoned and all their personal data effectively deleted.
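
Something like the following, as a sketch with made-up names and an in-memory dict standing in for the mutable user store:

```python
# Immutable log: events mention users only by id.
event_log = [
    {"type": "OrderPlaced", "user_id": 42, "order_id": "o-1"},
    {"type": "OrderShipped", "user_id": 42, "order_id": "o-1"},
]

# Mutable store, outside the event sourcing system.
profiles = {42: {"name": "Alice", "email": "alice@example.com"}}


def forget(user_id: int) -> None:
    # Replace the profile with a tombstone instead of touching any event.
    profiles[user_id] = {"deleted": True}


def describe(event: dict) -> str:
    profile = profiles.get(event["user_id"], {"deleted": True})
    who = "[deleted user]" if profile.get("deleted") else profile["name"]
    return f"{event['type']} by {who}"


before = describe(event_log[0])
forget(42)
after = describe(event_log[0])
```

The numeric id still sits in the log after `forget`, so this relies on the id alone not being enough to single someone out.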


I wonder how soon we’ll see a company accidentally forget everyone’s data due to a poorly architected/implemented solution to “right to be forgotten”.


That's a big win over the everyday occurrence of companies leaking everyone's data.


It can also mean bankruptcy because they also deleted the data from their backups...


This is vastly preferable to accidentally leaking customer information. If this is the price we have to pay for having good guarantees that we can be forgotten, I think it is a worthwhile tradeoff.


Can you really not imagine even a single case where this isn't true?


I'd imagine that this isn't generally true. Surely it's better, for at least some data, that it be preserved even if it's leaked to the public.


The forget the encryption key approach is clever.


I used to think so too (that's what iPhones and the like do on "erase", btw) but doesn't that just push the problem into the future, when computers can more easily crack today's encryption?


>the future, when computers can more easily crack today's encryption

No. Don't confuse symmetric with asymmetric (current public key) encryption. They aren't subject to the same potential attacks. Even with a theoretical fully scalable general purpose quantum computer, the best quantum attack vs a symmetric cypher is brute forcing with Grover's Algorithm, which provides a quadratic rather than exponential speed up. I.e., an n-bit key could be attacked in around 2^(n/2) operations. This is trivially countered by doubling key length: a 256-bit key would still take 2^128, which would still be effectively impossible, and a 512-bit key would take 2^256. There is no future with any foreseen technology that would be able to brute force that, so when it comes to AES and the like using at least 256-bit keys it can be reasonably assumed that destroying the key means the data is lost (anything legacy running off 128-bit is reasonable to watch out for though, 2^64 is potentially tractable).

Present asymmetric crypto systems can theoretically [1] be attacked with Shor's Algorithm, which may be what you're kind of thinking of if you've heard about "today's encryption getting cracked" in the general media or scifi. And that would in fact be a big deal, it covers how most data is moved around in communications and the Internet at present. But QC isn't magic, and it doesn't just break anything. FDE and the like that just use symmetric crypto are safe.

1: "Theoretically" because that's if (big if) an ideal quantum computer that could be scaled to a sufficient number of qubits is created.
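
For a feel of the numbers, here is the arithmetic as a small script. The rate of 1e12 operations per second is an arbitrary assumption picked only to give a sense of scale; it says nothing about what a real quantum computer could do.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def grover_work_bits(key_bits: int) -> int:
    """Exponent of the roughly 2^(n/2) operations Grover needs for an n-bit key."""
    return key_bits // 2


def years_to_search(work_bits: int, ops_per_second: float = 1e12) -> float:
    # ops_per_second is a made-up figure, used only for scale.
    return 2 ** work_bits / ops_per_second / SECONDS_PER_YEAR


for key_bits in (128, 256, 512):
    work = grover_work_bits(key_bits)
    print(
        f"{key_bits}-bit key: about 2^{work} operations, "
        f"{years_to_search(work):.3g} years at 1e12 ops/s"
    )
```

Under that assumed rate, 2^64 is within reach of a sustained effort, while 2^128 is not on any sensible timescale, which is the gap between the 128-bit and 256-bit cases above.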


> There is no future with any foreseen technology that would be able to brute force that, so when it comes to AES and the like using at least 256-bit keys it can be reasonably assumed that destroying the key means the data is lost (anything legacy running off 128-bit is reasonable to watch out for though, 2^64 is potentially tractable).

But we can’t know for sure that AES or any other encryption algorithm doesn’t have some as-of-yet-unknown fatal flaw that would make it breakable in some way not necessarily even having anything to do with quantum computers?


Of course not. But now you're getting philosophical :) Can you really know anything?


Ha ha yeah that is true. My point though was that it’s important to keep in mind if we decide to use throwing away the encryption key as our way of protecting the data.


> Don't confuse symmetric with asymmetric (current public key) encryption

After reading this multiple times, and looking up QC (Quantum Computing) and FDE (Full Disk Encryption), I got the following out of it:

- symmetric encryption is safe for AES-256 and up
- asymmetric encryption isn't getting cracked because QC isn't magic

Is this a correct TL;DR ?


The computations used to do symmetric and asymmetric encryption are completely different. Asymmetric encryption is just about factoring big numbers into primes, if we simplify things a bit. Modern computers aren't very good at it, but a Quantum Computer happens to be, and can break it. Read a bit about Shor's Algorithm[0].

On the other hand, symmetric encryption can be seen as a super convoluted and costly shift cipher. And it seems that Quantum Computing does not help much with dumb and costly mathematics like this.

[0]: https://en.wikipedia.org/wiki/Shor%27s_algorithm


It'll be fine as long as the decryption time-horizon is beyond the lifetime of the company, the user, and the regulatory-regime :p


Not even this as it could have negative impact on the next at least 2 generations depending on politics in the future. Some people judge you based on who your ancestors were and what they did. There are countries today where you have to fear for your life if your father was gay, for example.

As we cannot foresee the future, neither politics nor any future deciphering capacities, I highly doubt deleting the key is a viable option.

Note: Edited for clarity.


Yes very cool! But I wonder if courts will decide that throwing away the key to the lockbox is the same as burning the contents.


Crypto-shredding, as it is known, is already well established as a valid process and covers the requirements.

The point is to make the data unusable. Obviously when you delete a file from your recycle bin, the data is all still there, but it's acceptable under EU data protection reqs.


>but it's acceptable under EU data protection reqs

No it isn't. This is what got Facebook in hot water a while back. Marking a picture as deleted isn't the same as deleting it. Some other mechanism could easily point to where the data is actually located and not just its descriptor.


Yes, it is. The legal definition of deletion under EU data protection laws is to make the data unusable and inaccessible. Marking for deletion (as deleting from a recycling bin does) is sufficient under those rules.

Deleting the image ID and leaving the content on a CDN is not acceptable - but that's not what I said.


How does this work in practice? If each customers data has a unique encryption key, then how do you do SQL joins ?


It’s pretty straightforward: you don’t encrypt the data necessary for a join.

Data for a join, ie foreign keys, are used to inform your structure. They aren’t really data a person owns... they just describe how that data (theirs) fits into your database (yours).


You'd still need to decrypt the whole table to do e.g. aggregation.

For example, a customer's address is private information; statistics about how many customers come from each town/region is not, but calculating a simple "select town, count(*) from x group by town" requires decrypting every address.


you can store whatever you want unencrypted on the read model. That’s the place where you do joins/aggregation anyway.


> you can store whatever you want unencrypted on the read model.

So to recap, correct me if I’m wrong:

   You store everything encrypted.
   You have an api on it which hides the encryption and shows everything unencrypted.
   On the unencrypted api you do joins/aggregations.
   RightToBeForgotten then means delete its decryption key and remove it from the in-memory api.
Right?


You can always join on decrypt(key, a.column) = decrypt(key,b.column).


This is clever, but it requires strict following of best practices (nobody has copies of production data, and backups and the like are guarded with proper access controls. Keys are never written to log files, etc).

Make sure to get any DB backups as well, otherwise the key could still be out there somewhere. If you store the key under version control as well (bad practice but we've all seen it done) you'll need to rewrite git history to make sure the key isn't recoverable at all.

There could be an infinite number of ways a key could unintentionally persist, so something to keep in mind with this approach. Good reason to be disciplined with keys and other secrets to make sure they aren't leaving the environment.


Why is storing junk data for all eternity considered clever?


Because it beats getting a multimillion dollar fine from the EU?


I don't really see what encryption gets me over deleting the events. The end result is the same - the event log is unusable for replaying this particular projection. Why take on the extra complexity of adding encryption? Immutability is just a means to an end. It has no value on its own.


If your messages are not guaranteed to contain info about at most a single individual, wouldn't you need an encryption key to cover tuples of persons? If so, would the existence of a key for a given tuple be a potential privacy sensitive piece of info that would need to be masked?


As ever, it turns out that however brilliant your design spec, data structure, or algorithm is, it breaks as soon as it comes in contact with the realities of human nature, business processes, legal requirements, and the universe itself. This is an insight huge swaths of tech culture fail to grasp, and it explains the massive problems of crypto currency fraud, Facebook privacy gaffes, and the continuing delusion that self driving cars and AI assistants are just over the horizon. If only the world would match up to my assumptions, then my code can save the world! But it never will. You have to build that mismatch into your plans or you’ll eventually fail.


Missed opportunity. Should have named the post "Kafka, GDPR and Kafka".


Indeed. The post seems to assume that all event sourced systems cannot tolerate the deletion of events on a per aggregate root basis.


Could you explain a bit more? Why does using Kafka imply that you cannot delete events on a per aggregate root basis?


Interesting article, but there’s this architect type out there I’ve been encountering who are like, Kafka maximalists. In this case, the overall model is ok, but the insistence to put the keys into Kafka instead of a normal data store, and then to rely on hot access and the Kafka streams state stuff for potentially millions of keys seems misguided. It’s ok to use relational db’s when they make sense. It almost feels like with Kafka we’re going through the whole NoSql or die thing all over again, just so in a few years we realize Postgres isn’t that bad.


Backups should also be stored encrypted using the same technique, so that deleting the key also makes the data in the backups inaccessible.


Frankly, I'd rather go for removing from projections.



