My problem is that regexes are write-only, unreadable once written (to me anyway). And sometimes they do more than you intended. You may have tested on a few inputs and declared it fit for purpose, but there might be more inputs upon which it has unintended effects. I don't mind simple, straightforward regexes. But when they become more complex, I tend to prefer to write out the procedural code, even if it is (much) longer in terms of lines. I find that generally I can read code better than regexes, and that code I write is more predictable than regexes I write.
> I tend to prefer to write out the procedural code, even if it is (much) longer in terms of lines.
This might work for you, but in general the number of bugs is proportional to the amount of code. The regex engine is already thoroughly tested by someone else, while a custom implementation in procedural code will probably have bugs and be a lot more work to maintain if the pattern changes.
That is quite a generalization. The regex engine is tested, but my specific regular expression isn't. My ability to write correct regular expressions is weak, so there can be many bugs in the one line of regular expression.
If you have made a bug in the specification of the pattern to match, then you will have the same bug in the hand-rolled implementation of the matching. It will just be more difficult to find the bug since the pattern is not explicitly specified anymore.
Surely complexity is a factor? A procedural implementation will necessarily have the same essential complexity as the regex it replaces, but then it will additionally have a bunch of incidental complexity in matching and looping and backtracking.
Regexes can certainly be hard to read - the solution is to use formatting and comments to make them easier to understand - not to drown the logic in reams of boilerplate code.
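As a sketch of the formatting-and-comments approach, Python's re.VERBOSE flag ignores whitespace and allows # comments inside the pattern; the ISO-date matcher here is just an illustration, not from the original discussion:

```python
import re

# Hypothetical example: an ISO-date matcher laid out with re.VERBOSE
# so each piece of the pattern carries its own comment.
date_re = re.compile(r"""
    (?P<year>\d{4})    # four-digit year
    -
    (?P<month>\d{2})   # two-digit month
    -
    (?P<day>\d{2})     # two-digit day
""", re.VERBOSE)

m = date_re.match("2024-03-15")
print(m.group("year"), m.group("month"), m.group("day"))  # 2024 03 15
```

The named groups double as documentation at the call site, which is most of what the procedural rewrite would have bought you.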
> A procedural implementation will necessarily have the same essential complexity as the regex it replaces
I don't think I fully agree with this, and I don't see a basis for why this should be true. If I have a very specific implementation, it could have very little incidental complexity, it could be fully targeted to the use case. Whereas with regular expressions there is incidental complexity of the regex engine itself by definition.
Complexity in the standard library is not that relevant. If you make your own custom dictionary implementation, you increase the complexity of your code base compared to just using the one in the standard library, even if your own implementation is simpler.
The relevant complexity for using a regex is the complexity of the pattern itself and the complexity of invoking the regex. Any custom procedural solution will be more complex unless it is literally something as simple as checking whether a string contains a given literal string.
> unreadable once written (to me anyway). (…) there might be more inputs upon which it has unintended effects.
https://regex101.com can explain your regex back to you, and allows you to test it with more inputs.
Though I’m not trying to convince you to always use regular expressions, I agree with GP:
> Obviously regexes aren't the right tool for every job, and they can certainly be done poorly; but in the right place at the right time they're the simplest, most robust, easiest to understand solution to the problem.
but their shortcomings aren’t an excuse to avoid regexes. Similarly, git has many warts, but there’s no getting around it.
Same with CSS
If you want to run with the herd though you need to know these things, even enjoy them.
You can rely on tooling and training wheels like Python VERBOSE but you’re never going to get away from the fact that the “rump” of the population works with them.
Easier to bite the bullet and get practised. I’ve no doubt you have the intellect - you only need be convinced it’s a good use of your time.
I don't incorporate a lot of regular expressions into my code. But where I do like them is for search and replace. So I do treat them as mostly disposable.
I agree, there are only superficial similarities. Like they're all 3 C-based. And Go and Rust both compile to machine code. I believe one of the creators of Go once mentioned that it felt to some users "like a faster Python". But I have no clue how Python relates to Rust in any sense, I fail to see any similarities. In fact, I would almost be inclined to say that Python and Rust have more differences than similarities.
> I would almost be inclined to say that Python and Rust have more differences than similarities.
This is somewhat of a stretch: dyn Traits in Rust are sort of like compile time duck typing. OTOH, interfaces in Go and virtual functions in C++ are the same thing.
It really depends on what kind of axis you’re talking about. Under the hood, Go’s interfaces and Rust’s “dyn types” (the new nomenclature for trait objects) are the same, and C++’s virtual functions are different.
(Though you can emulate them with unsafe in Rust, like anyhow)
Could you explain how the repository pattern is a "huge overkill that adds complexity with very little benefit"? I find it a very lightweight pattern and would recommend always using it when database access is needed, to clearly separate concerns.
In the end, it's just making sure that all database access for a specific entity all goes through one point (the repository for that entity). Inside the repository, you can do whatever you want (run queries yourself, use ORM, etc).
A lot of the stuff written in the article under the section Repository pattern has very little to do with the pattern, and much more to do with all sorts of Python, Django, and SQLAlchemy details.
In theory it's a nice abstraction, and the benefit is clear. In practice, your repository likely ends up forwarding its arguments one-for-one to SQLAlchemy's select() or session.query().
That's aside from their particular example of SQLAlchemy sessions, which is extra weird because a Session is already a repository, more or less.
I mean, sure, there's a difference between your repository for your things and types you might consider foreign, in theory, but how theoretical are we going to get? For what actual gain? How big of an app are we talking?
You could alias Repository = Session, or define a simple protocol with stubs for some of Session's methods, just for typing, and you'd get the same amount of theoretical decoupling with no extra layer. If you want to test without a database, don't bind your models to a session. If you want to use a session anyway but still not touch the database, replace your Session's scopefunc and your tested code will never know the difference.
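The "simple protocol with stubs" idea might look like this sketch. The method subset and all names here are illustrative; Session.get(entity, ident) and Session.add() do exist on SQLAlchemy 1.4+ Sessions, so a real Session satisfies this Protocol structurally:

```python
from typing import Any, Protocol

class Repository(Protocol):
    # A structural subset of SQLAlchemy's Session API. A real Session
    # matches this at type-check time with no extra wrapper layer.
    def get(self, entity: type, ident: Any) -> Any: ...
    def add(self, instance: Any) -> None: ...

def rename_thing(repo: Repository, model: type, ident: Any, name: str) -> None:
    # Code under test depends only on the Protocol, so a test can pass
    # any object with matching methods instead of a session bound to a DB.
    obj = repo.get(model, ident)
    obj.name = name
    repo.add(obj)
```

Tests then hand in a dict-backed fake; production hands in the Session itself.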
It's not a convincing example.
Building your repository layer over theirs, admittedly you stop the Query type from leaking out. But then you implement essentially the Query interface in little bits for use in different layers, just probably worse, and lacking twenty years of testing.
Thanks, that makes a lot of sense. I don't have a whole bunch of experience with SQLAlchemy itself. In general, I prefer not to use ORMs but just write queries and map the results into value objects. That work I would put into a Repository.
Also in my opinion it's important to decouple the database structure from the domain model in the code. One might have a Person type which is constructed by getting data from 3 tables. A Repository class could do that nicely: maybe run a join query and a separate query, combine the results together, and return the Person object. ORMs usually tightly couple with the DB schema, which might create the risk of coupling the rest of the application as well (again, I don't know how flexible SQLAlchemy is in this).
There could be some value in hiding SQLAlchemy, in case one would ever like to replace it with a better alternative. I don't have enough experience with Python to understand if that ever will be the case though.
All in all, trade-offs are always important to consider. A tiny microservice consisting of a few functions: just do whatever. A growing modulith with various evolving domains which have not been fully settled yet: put some effort into decoupling and separating concerns.
I've used SqlAlchemy in a biggish project. Had many problems, the worst ones were around session scoping and the DB hitting session limits, but we had issues around the models too.
The argument for hiding SqlAlchemy is nothing to do with "what if we change the DB"; that's done approximately never, and, even if so, you have some work to do, so do it at the time. YAGNI
The argument is that SA models are funky things with lazy loading. IIRC, that's the library where the metaclasses have metaclasses! It's possible to accidentally call the DB just by accessing a property.
It can be a debugging nightmare. You can have data races. I remember shouting at the code, "I've refreshed the session you stupid @#£*"
The responsible thing to do is flatten them to, say, a pydantic DTO. Then you can chuck them about willy-nilly. Your type checker will highlight a DTO problem that an SA model would have slipped underneath your nose.
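The flattening step might look like this sketch. I'm using a frozen dataclass as a stand-in for the pydantic DTO (pydantic works the same way, field by field), and the Author/field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AuthorDTO:
    # Plain data: no session, no lazy loading. Accessing a field here
    # can never fire off a SELECT behind your back.
    id: int
    name: str

def to_dto(author) -> AuthorDTO:
    # Touch every ORM attribute exactly once, while the session is
    # still live, then hand back an inert object.
    return AuthorDTO(id=author.id, name=author.name)
```

Past this boundary the type checker sees a closed set of fields, so the "DTO problem" gets flagged statically instead of surfacing as a surprise query.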
The difficulty you have following that is that, when you have nested models, you need to know in advance what fields you want so you don't overfetch. I guess you're thinking "duh, I handcraft my queries" and my goodness I see the value of that approach now. However, SA still offers benefits even if you're doing this more tightly-circumscribed fetch-then-translate approach.
This is partly how I got from the eager junior code golf attitude to my current view, which is, DO repeat yourself, copy-paste a million fields if you need, don't sweat brevity, just make a bunch of very boring data classes.
Just a heads-up if you haven't seen it: Overriding lazy-loading options at query time can help with overfetching.
    class Author(Model):
        books = relationship(..., lazy='select')

    fetch_authors = select(Author).options(raiseload(Author.books))
Anything that gets its Authors with fetch_authors will get instances that raise instead of doing a SELECT for the books. You can throw that in a smoke test and see if there's anything sneaking a query. Or if you know you never want to lazy-load, relationship(..., lazy='raise') will stop it at the source.
SQLModel is supposed to be the best of both Pydantic and SQLAlchemy, but by design
an SQLModel entity backed by a database table doesn't validate its fields on creation, which is the point of Pydantic.
I can't take a position without looking under the hood, but what concerns me is "SqlModel is both a pydantic model and an SA model", which makes me think it may still have the dynamic unintended-query characteristics that I'm warning about.
I seem to recall using SqlModel in a pet project and having difficulty expressing many-to-many relationships, but that's buried in some branch somewhere. I recall liking the syntax more than plain SA. I suspect the benefits of SqlModel are syntactical rather than systemic?
"Spaghetti" is an unrelated problem. My problem codebase was spaghetti, and that likely increased the problem surface, but sensible code doesn't eliminate the danger
I mean that from the point of view of YAGNI for a small app. For a big one, absolutely, you will find the places where the theoretical distinctions suddenly turn real. Decoupling your data model from your storage is a real concern and Session on its own won't give you that advantage of a real repository layer.
SQLAlchemy is flexible, though. You can map a Person from three tables if you need to. It's a data mapper, then a separate query builder on top, then a separate ORM on top of that, and then Declarative which ties them all together with an ActiveRecord-ish approach.
> I prefer not to use ORMs but just write queries and map the results into value objects. That work I would put into a Repository.
Yep, I hear ya. Maybe if they'd built on top of something lower-level like stdlib sqlite3, it wouldn't be so tempting to dismiss as YAGNI. I think my comment sounded more dismissive than I really meant.
SQLAlchemy Session is actually a unit of work (UoW), which they also build on top of. By the end of the book they are using their UoW to collect and dispatch events emitted by the services. How would they have done that if they just used SQLAlchemy directly?
You might argue that they should have waited until they wanted their own UoW behaviour before actually implementing it, but that means by the time they need it they need to go and modify potentially hundreds of bits of calling code to swap out SQLAlchemy for their own wrapper. Why not just build it first? The worst that happens is it sits there being mostly redundant. There have been far worse things.
The tricks you mention for the tests might work for SQLAlchemy, but what if we're not using SQLAlchemy? The repository pattern works for everything. That's what makes it a pattern.
I understand not everyone agrees on what "repository" means. The session is a UoW (at two or three levels) and also a repository (in the sense of object-scoped persistence) and also like four other things.
I'm sort of tolerant of bits of Session leaking into things. I'd argue that its leaking pieces are the application-level things you'd implement, not versions of them from the lower layers that you need to wrap.
When users filter data and their filters go from POST submissions to some high-level Filter thing I'd pass to a repository query, what does that construct look like? Pretty much Query.filter(). When I pick how many things I want from the repository, it's Query.first() or Query.one(), or Query.filter().filter().filter().all().
Yes, it's tied to SQL, but only in a literal sense. The API would look like that no matter what, even if it wasn't. When the benefit outweighs the cost, I choose to treat it like it is the thing I should have written.
It isn't ideal or ideally correct, but it's fine, and it's simple.
You seem to have stopped reading my comment after the first sentence. I asked some specific questions about how you would do what they did if you just use SQLAlchemy as your repository/UoW.
Repository pattern is useful if you really feel like you're going to need to switch out your database layer for something else at some point in the future, but I've literally never seen this happen in my career ever. Otherwise, it's just duplicate code you have to write.
What is the alternative that you use, how do you provide data access in a clean, separated, maintainable way?
I have seen it a lot in my career, and have used it a lot. I've never used it in any situation to switch out a database layer for something else. It seems like we have very different careers.
I also don't really see how it duplicates code. At the basic level, it's practically nothing more than putting database access code in one place rather than all over the place.
What we are talking about is a "transformation" or "mapper" layer isolating your domain entities from the persistence. If this is what we call "Repository" then yes, I absolutely agree with you -- this is the right approach to this problem. But if the "Repository pattern" means a complex structure of abstract and concrete classes and inheritance trees -- as I have usually seen it implemented -- then it is usually an overkill and rarely a good idea.
Thanks. In my mind, anything about complex structures of (abstract) classes and/or inheritance trees has nothing to do with a Repository pattern.
As I understand it, Repository pattern is basically a generalization of the Data Access Object (DAO) pattern, and sometimes treated synonymously.
The way I mean it and implement it, is basically for each entity have a separate class to provide the database access. E.g. you have a Person (not complex at all, simply a value object) and a PersonRepository to get, update, and delete Person objects.
Then based on the complexity and scope of the project, Person either 1-to-1 maps to a e.g. a database table or stored object/document, or it is a somewhat more complex object in the business domain and the repository could be doing a little bit more work to fetch and construct it (e.g. perhaps some joins or more than 1 query for some data).
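A sketch of that second case, assuming a hypothetical schema where a Person's data spans two tables (stdlib sqlite3 just to keep the example self-contained; the class and table names are illustrative):

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class Person:
    # A plain value object; it knows nothing about the database.
    id: int
    name: str
    emails: list

class PersonRepository:
    def __init__(self, conn: sqlite3.Connection):
        self._conn = conn

    def get(self, person_id: int):
        # First query: the core row.
        row = self._conn.execute(
            "SELECT id, name FROM person WHERE id = ?", (person_id,)
        ).fetchone()
        if row is None:
            return None
        # Second query: related rows, folded into the same domain object.
        emails = [addr for (addr,) in self._conn.execute(
            "SELECT address FROM email WHERE person_id = ?", (person_id,)
        )]
        return Person(row[0], row[1], emails)
```

The caller just sees get() returning a Person; whether that took one join or two queries stays inside the repository.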
> for each entity have a separate class to provide the database access
Let me correct you: for each entity that needs database access. This is why I'm talking about layers here: sometimes entities are never persisted directly, but only as "parts" or "relations" of other entities; in other cases you might have a very complex persistence implementation (e.g. some entities are stored in a RDB, while others in a filesystem) and there is no clear mapping.
I recommend approaching this from the perspective of each domain entity individually; "persistability" is essentially just another property which might or might not apply in each case.
Naturally, Repository is a pattern for data(base) access, so it should have nothing to do with objects that are not persisted. I used "entity" as meaning a persisted object. That was not very clear, sorry.
Well, again, that is not completely straightforward - what exactly is a "persisted object"? We have two things here that are usually called entities:
1. The domain entities, which are normally represented as native objects in our codebase. They have no idea whether they need to be persisted and how.
2. The database entities, which are - in RDBs at least - represented by tables.
It is not uncommon that our entities of the first type can easily be mapped 1:1 to our entities of the second type - but that is far from guaranteed. Even if this is the case, the entities will be different because of the differences between the two "worlds": for example, Python's integer type doesn't have a direct equivalent in, say, PostgreSQL (it has to be converted into smallint, integer, bigint or numeric).
In my "correction" above I was talking about the domain entities, and my phrasing that they "need database access" is not fully correct; it should have been "need to be persisted", to be pedantic.
I’ve seen it, but of course there was no strict enforcement of the pattern so it was a nightmare of leakage and the change got stuck half implemented, with two databases in use.
In my experience, both SQL and real-world database schema are each complex enough beasts that to ensure everything is fetched reasonably optimally, you either need tons of entity-specific (i.e. not easily interface-able) methods for every little use case, or you need to expose some sort of builder, at which point why not just use the query builder you're almost certainly already calling underneath?
Repository patterns are fine for CRUD but don't really stretch to those endpoints where you really need the query with the two CTEs and the four joins onto a query selecting from another query based on the output of a window function.
I rarely mock a repository. Mocking the database is nice for unit-testing, and it's also a lot faster than using a real DB, but the DB and the DB-application interface are some of the hottest spots for bugs: using a real DB (same engine as prod) gives me a whole lot more confidence that my code actually works. It's probably the thing I'm least likely to mock out, despite it making tests more difficult to write and quite a bit slower.
I had a former boss who strongly pushed my team to use the repository pattern for a microservice. The team wanted to try it out since it was new to us and, like the other commenters are saying, it worked but we never actually needed it. So it just sat there as another layer of abstraction, more code, more tests, and nothing benefited from it.
Anecdotally, the project was stopped after nine months because it took too long. The decision to use the repository pattern wasn't the straw that broke the camel's back, but I think using patterns that were more complicated than the usecase required was at the heart of it.
Could you give me some insights what the possible alternative was that you would have rather seen?
I am either now learning that the Repository pattern is something different than what I understand it to be, or there is misunderstanding here.
I cannot understand how (basically) tucking away database access code in a repository can lead to complicated code, long development times, and the entire project failing.
Your understanding of the repository pattern is correct. It's the other people in this thread that seem to have misunderstood it and/or implemented it incorrectly. I use the repository pattern in virtually every service (when appropriate) and it's incredibly simple, easy to test and document, and easy to teach to coworkers. Because most of our services use the repository pattern, we can jump into any project we're not familiar with and immediately have the lay of the land, knowing where to go to find business logic or make modifications.
One thing to note -- you stated in another comment that the repository pattern is just for database access, but this isn't really true. You can use the repository pattern for any type of service that requires fetching data from some other location or multiple locations -- whether that's a database, another HTTP API, a plain old file system, a gRPC server, an ftp server, a message queue, an email service... whatever.
This has been hugely helpful for me as one of the things my company does is aggregate data from a lot of other APIs (whois records, stuff of that nature). Multiple times we've had to switch providers due to contract issues or because we found something better/cheaper. Being able to swap out implementations was incredibly helpful because the business logic layer and its unit tests didn't need to be touched at all.
Before I started my current role, we had been using kafka for message queues. There was a huge initiative to switch over to rabbit and it was extremely painful ripping out all the kafka stuff and replacing it with rabbit stuff and it took forever and we still have issues with how the switch was executed to this day, years later. If we'd been using the repository pattern, the switch would've been a piece of cake.
Thanks. I was starting to get pretty insecure about it. I don't actually know why in my brain it was tightly linked to only database access. It makes perfect sense to apply it to other types of data retrieval too. Thanks for the insights!
I was so in love with Delphi 7 (didn't care much for Delphi 8, didn't try any version after). I don't think I've ever been more excited about a programming language/environment as when I was easily and quickly creating desktop applications with Delphi 7.
One of the things I find challenging is understanding the meaning of the word "scales". It is sometimes used differently in different contexts.
Can it be performant in high load situations? Certainly. Can it elastically scale up and down based on demand? As far as I'm aware it cannot.
What I'm most interested in is how operations are handled. For example, if it's deployed in a cloud environment and you need more CPU and/or memory, you have to eat the downtime to scale it up. What if it's deployed to bare metal and it cannot handle the increasing load anymore? How costly (in terms of both time and money) is it to migrate it to bigger hardware?
When it "scales", it usually means "scales up". A scalable solution is such that can withstand a large and increasing load, past the usual limitations of obvious solutions.
Being elastic is nice, but not always needed. In most cases of database usage, downsizing never happens, nor is it expected to: logically, data are only added, and any packaging and archiving only exists to keep the size manageable.
You’re conflating things. The question was about scaling compute and memory up and down based on load and you’re commenting about never needing to downsize on storage.
a database scaling dramatically up and down /under load/ and expecting it to perform the same as steady state seems a bit weird, vs a single, beefy database with a beefy ingest job and a bunch of read only clients searching it?
like you're more likely to encounter two phases (building the DB in heavy growth mode, and using the DB in light growth heavy read mode).
A business that doesn't quite yet know what size the DB needs to be has a frightening RDS bill incoming.