jack_squat's comments

Like those of most online forums, HN discussions are dominated by people who like to spend all their time arguing online. It's the 80:20 principle, or the 90-9-1 principle. Largely these are early-20s, politically ignorant men, with a long tail of socially stunted "greybeards" who have not managed to find greater fulfillment in life than what online discussion forums provide. The technical topics here are frequently recycled and the quality of discussion is not that notable. I'm embarrassed it took me over a decade to reach this conclusion.

I don't think the readership of HN leans RW, but the commentary and moderation do, largely because RW commenters are quick to flag views they disagree with, in contravention of the stated moderation standards, while "normal people" can't be bothered to engage with this kind of platform, and the moderation team is content to let this dynamic persist. Witness the flagging on this very article as an example.


Surely talking down to people will get you somewhere. It got you this far.


My prediction is it'll be Cruise in red states and Waymo in blue states. I think they are done in SF, but this will endear them to TX.


I've done the Postgres SKIP LOCKED thing at least three times and I'm currently doing it, but IMO it is actually more maintenance and overhead, not less -- at least when compared with the queues made available by the major cloud providers. Compared with Pub/Sub or SQS you need to handle the following (the basic claim-a-job query is sketched at the end of this comment):

* Metrics, monitoring, alarming on depth, message age

* Autoscaling on your custom metrics

* Managing retries, dead lettering, backoff

* Managing the DB workload: it's update-heavy and may easily be more intensive than the rest of your app. You may need to repeatedly scale your tiny, startup-scale DB, causing customer disruptions, because of your homemade queue.

The arguments for it are either avoidance of lock-in, or the hand-wavy one in this article/many comments: "we're a small startup, we need to minimize complexity, and make things as simple as possible, maintenance will kill us!".

Lock-in makes sense (though other queue alternatives should still be considered), but the latter argument gets used to justify all kinds of harebrained, superficially-simple but actually-complicated engineering schemes. I generally put the PG SKIP LOCKED approach in that bucket, particularly when the alternative on hand is Pub/Sub or SQS. If it's between a Postgres table and ActiveMQ I might feel more conflicted.
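
For reference, the claim-a-job query at the core of this pattern looks roughly like this (a minimal sketch; the jobs table and its columns are illustrative, not from any particular codebase):

    -- Claim the oldest queued job; SKIP LOCKED makes concurrent workers
    -- pass over rows that another worker has already locked.
    WITH next_job AS (
        SELECT id
        FROM jobs
        WHERE status = 'queued'
        ORDER BY created_at
        LIMIT 1
        FOR UPDATE SKIP LOCKED
    )
    UPDATE jobs
    SET status = 'running', started_at = now()
    FROM next_job
    WHERE jobs.id = next_job.id
    RETURNING jobs.id, jobs.payload;

The query itself is the easy part; everything in the list above is what you end up building around it.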


That's what Temporal is for


Typically you try to avoid closing a connection on each request; instead, you hand it back to a connection pool. The underlying session stays open and associated with the connection (if you have your pooler set up right), so subsequent requests still use the cache.

I agree with the original commenter about ANY as well: using IN for dynamic lists of parameters makes viewing useful information in e.g. pg_stat_statements impossible, though it's possible there's been some recent work on normalizing these.
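
To illustrate with a hypothetical users table:

    -- Every distinct list length yields a distinct entry in pg_stat_statements:
    SELECT * FROM users WHERE id IN (1, 2, 3);
    SELECT * FROM users WHERE id IN (1, 2, 3, 4);

    -- A single array parameter keeps it to one normalized entry:
    SELECT * FROM users WHERE id = ANY('{1,2,3,4}'::int[]);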


The author's site is primarily poorly-researched and hyperbolic BS, mostly to provide cover for his political hot takes.

Example: https://scottlocklin.wordpress.com/2021/11/20/us-public-heal...


No piece of software has influenced my career and my thinking on design, user interfaces, and software quality as much as Vim. Vim made learning to code as fun as playing a game. Thank you, Bram, you changed my life.


When you use a surrogate key, you should still put a unique constraint on an alternate key. Multiple records for the same thing is a fundamental problem in a database.
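
A minimal sketch of what that looks like (table and column names are illustrative):

    CREATE TABLE users (
        id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate key
        email text NOT NULL UNIQUE  -- alternate (natural) key, still enforced
    );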


I also tend toward surrogate keys, and would point out that overhead concerns frequently weigh in their favor as well (though not always).

If your natural key is bigger than the appropriately sized surrogate key, and it will be referenced across the database in other tables, the overhead of referencing it everywhere (and indexing it in those places -- potentially in much larger tables) can easily outweigh the overhead of the extra bytes on the original table.

Additionally, surrogate keys can realize much better performance WRT cache locality.
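
A sketch of the size argument, with a made-up schema -- the composite natural key gets copied into every referencing row and index, while the surrogate is a single 8-byte column:

    -- Natural key: every order_line row and index repeats both text columns.
    CREATE TABLE product_nat (
        manufacturer text NOT NULL,
        model        text NOT NULL,
        PRIMARY KEY (manufacturer, model)
    );
    CREATE TABLE order_line_nat (
        order_id     bigint NOT NULL,
        manufacturer text NOT NULL,
        model        text NOT NULL,
        FOREIGN KEY (manufacturer, model) REFERENCES product_nat
    );

    -- Surrogate key: referencing rows carry only an 8-byte id.
    CREATE TABLE product (
        id           bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        manufacturer text NOT NULL,
        model        text NOT NULL,
        UNIQUE (manufacturer, model)  -- alternate key still enforced
    );
    CREATE TABLE order_line (
        order_id   bigint NOT NULL,
        product_id bigint NOT NULL REFERENCES product
    );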


1NF bans relation-valued attributes, not repetition of an attribute value across tuples in a relation. Mainstream SQL databases don't support relation-valued attributes, so any table you make in a relational database is in 1NF.

You can push back on this a little - for instance maybe you consider an array-valued attribute to be enough like a relation to argue array-valued attributes violate 1NF. But if you do that you must also explain what makes arrays different from strings, since strings are pretty similar to arrays of characters and can be treated the same way in most respects (for instance characters in a string can be addressed by index or split into substrings).
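
To make the distinction concrete (hypothetical tables):

    -- Array-valued attribute: the arguably-non-1NF case.
    CREATE TABLE person_arr (
        id     int PRIMARY KEY,
        phones text[]  -- a repeating group stored inline
    );

    -- The same data with the repeating group as its own relation.
    CREATE TABLE person (
        id int PRIMARY KEY
    );
    CREATE TABLE person_phone (
        person_id int REFERENCES person,
        phone     text NOT NULL,
        PRIMARY KEY (person_id, phone)
    );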


Arguably it’s a structured vs unstructured data issue


From the Postgres docs,

ON CONFLICT DO UPDATE guarantees an atomic INSERT or UPDATE outcome; provided there is no independent error, one of those two outcomes is guaranteed, even under high concurrency. This is also known as UPSERT — “UPDATE or INSERT”.

https://www.postgresql.org/docs/current/sql-insert.html

What are you referring to?


If I recall correctly (and it has been a while, so I'm not saying I am), the issue was with concurrent transactions inserting a record into tableA and another into tableB which has a foreign key constraint to tableA. The issue was likely specific to `ON CONFLICT DO NOTHING` and not `ON CONFLICT DO UPDATE`.

For example, let's say you're building an index of open source packages and have two tables: package_type(id, name) and package(id, type_id, namespace, name).

If you receive two concurrent requests for `maven://log4j:log4j` and `maven://io.quarkus:quarkus`, a naive implementation to insert both "maven" and the packages if they don't exist might look something like this:

   WITH type AS (
     INSERT INTO package_type (name)
     VALUES (:type)
     ON CONFLICT DO NOTHING
     RETURNING id
   )
   INSERT INTO package (type_id, namespace, name)
   SELECT id, :namespace, :name
   FROM type
   ON CONFLICT DO NOTHING;
However, one or both inserts can fail intermittently because the primary key for `package_type` will be auto-incremented and thus the foreign key won't be valid. Also, as mentioned in another comment[0] this won't work if `maven` already exists in the `package_type` table.

[0]: https://news.ycombinator.com/item?id=36031790


Atomicity doesn't mean "doesn't fail", it means "either fails or succeeds, but does not succeed halfway".

There is nothing about what you are describing that is different from the behavior you'd get from a regular insert or update. If two transactions conflict, a rollback will occur. That isn't violating atomicity. In fact, it is the way by which atomicity is guaranteed.

The behavior of sequence values getting incremented and not committed, resulting in gaps in the sequence, is a separate matter, not specific to Postgres or to upsert.


I don't think this has to do with concurrency; this query is fundamentally broken, in that the CTE won't return anything if the package_type already exists.

You have two options:

(1) ON CONFLICT DO UPDATE, with dummy update:

    WITH
      type AS (
        INSERT INTO package_type (name)
        VALUES ($1)
        ON CONFLICT (name) DO UPDATE SET name = excluded.name
        RETURNING id
      )
    INSERT INTO package (type_id, namespace, name)
    SELECT id, $2, $3
    FROM type
    ON CONFLICT (type_id, namespace, name) DO UPDATE SET name = excluded.name
    RETURNING id;
(2) Separate statements with ON CONFLICT DO NOTHING (could be in a UDF if desired):

    INSERT INTO package_type (name)
    VALUES ($1)
    ON CONFLICT DO NOTHING;

    INSERT INTO package (type_id, namespace, name)
    SELECT id, $2, $3
    FROM package_type
    WHERE name = $1
    ON CONFLICT DO NOTHING;

    SELECT p.id
    FROM package p
    JOIN package_type pt ON pt.id = p.type_id
    WHERE (pt.name, p.namespace, p.name) = ($1, $2, $3);


I think what's actually closer to the root of the problem, as other folks have noted, is that `ON CONFLICT DO NOTHING` means the RETURNING clause returns no rows if there is a conflict, which in my experience is rarely what people want. So instead people do `ON CONFLICT DO UPDATE` with a no-op update, which has performance/locking implications; otherwise they need to do a complicated query (search Stack Overflow).

I wish that Postgres would add some sort of backwards-compatible option like `ON CONFLICT DO NOOP` or `ON CONFLICT DO RETURN` so that you got the semantics of `DO NOTHING` except that the conflicted rows are returned.
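
For reference, the usual workaround (with an illustrative table t; note it can still miss a row inserted by a concurrent transaction that hasn't committed yet) is something like:

    WITH ins AS (
        INSERT INTO t (name)
        VALUES ($1)
        ON CONFLICT (name) DO NOTHING
        RETURNING id
    )
    SELECT id FROM ins
    UNION ALL
    SELECT id FROM t WHERE name = $1
    LIMIT 1;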


It increments the sequence associated with a "serial" or "bigserial" field, usually used for primary keys. People are often surprised by this because they expect their primary keys to be sequential, but sequences are allowed to leave gaps in order to avoid a lot of locking.
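
A quick demonstration of the gap (assuming a fresh table):

    CREATE TABLE t (id serial PRIMARY KEY, name text UNIQUE);
    INSERT INTO t (name) VALUES ('a');  -- id = 1
    INSERT INTO t (name) VALUES ('a')
        ON CONFLICT DO NOTHING;         -- no row inserted, but nextval() already ran
    INSERT INTO t (name) VALUES ('b');  -- id = 3, not 2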


That just seems like a fundamental misunderstanding of sequences to me. They are guaranteed to be increasing but not necessarily sequential.


I think it's more confusion of how `ON CONFLICT DO NOTHING` works. Incrementing sequences is not "doing nothing", even if it's valid and sensible behaviour, which leads to confusion.


In fairness to Postgres, “DO NOTHING” is accurate within the SQL domain if you squint. It’s reasonable for queries to have side-effects on the underlying implementation as long as the invariants defined by SQL are met. Surprising, yes, but the alternative would be leaking implementation details in the syntax.

On the other hand, it’s so easy to use and rely on autoincrement IDs and to assume they are monotonic and predictable based on naive testing. If we could do it all again I’d fuzz the IDs even during normal operation (occasionally increment by a random extra amount) so that they don’t become such a foot gun for developers.

