
Logical replication does exist in pgsql, which is great. What it still lacks, however (and I am sure they will catch up on this quickly), is a user-facing process for fixing or re-syncing a broken node without a rebuild. I'm also pretty sure pgsql logical replication apply is single threaded? Things like pg_rewind are layered-on fixes that other database users don't have to depend on or learn. Except Oracle users (because it's a mess).
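
For reference, the basic setup looks something like this (table names and the connection string are made up), and the pain point is that there is no built-in "resync this subscriber" step when a node breaks:

    -- on the publisher
    CREATE PUBLICATION app_pub FOR TABLE orders, customers;

    -- on the subscriber
    CREATE SUBSCRIPTION app_sub
        CONNECTION 'host=primary dbname=app user=repl password=secret'
        PUBLICATION app_pub;

    -- recovering a broken subscriber today is roughly "start over":
    DROP SUBSCRIPTION app_sub;
    TRUNCATE orders, customers;
    CREATE SUBSCRIPTION app_sub
        CONNECTION 'host=primary dbname=app user=repl password=secret'
        PUBLICATION app_pub;  -- copy_data defaults to true, so the tables are re-copied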


postgres has table partitions now; mariadb, however, can partition a table over multiple servers or shards using engines like Spider and CONNECT, or proxies like MaxScale and ProxySQL. Local or remote, read your database manual about the fun and caveats that come with partitions.
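
A rough sketch of the Spider flavour of this, with made-up hosts, credentials and table (each partition's COMMENT points it at a remote backend server):

    CREATE SERVER shard1 FOREIGN DATA WRAPPER mysql
        OPTIONS (HOST '10.0.0.11', DATABASE 'app', USER 'spider', PASSWORD 'secret', PORT 3306);
    CREATE SERVER shard2 FOREIGN DATA WRAPPER mysql
        OPTIONS (HOST '10.0.0.12', DATABASE 'app', USER 'spider', PASSWORD 'secret', PORT 3306);

    CREATE TABLE orders (
        id BIGINT NOT NULL,
        customer_id BIGINT,
        total DECIMAL(10,2),
        PRIMARY KEY (id)
    ) ENGINE=SPIDER
      COMMENT='wrapper "mysql", table "orders"'
      PARTITION BY HASH (id) (
          PARTITION p1 COMMENT = 'srv "shard1"',
          PARTITION p2 COMMENT = 'srv "shard2"'
      );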


I would say the main advantage is scale and uptime. If you need to replicate, duplicate, or maintain state beyond one server, there are very few good RDBMS options, let alone open source ones. The MariaDB ecosystem competes with IBM pureScale and Oracle RAC; it's hard to appreciate that until you really need it.


From my POV, it's a supplier that is no longer beholden to shareholders' demands for profit. A platform and service that can operate on customer-sourced revenue and respond to market demand without a profit-driven board is a big win for everyone, while still funding development and maintenance of a great open source project.


I think that's exactly the wrong impression. Private equity is going to push them toward better profitability quickly, through price increases, layoffs, and reduced R&D.


> it's a supplier that is no longer beholden to shareholders' demands for profit.

That doesn't seem correct. It's now owned by a single shareholder, and that shareholder is literally the kind of financial organisation likely to turn the screws until the last drop of blood is gone.


Ooh ouch, I need to lay down. While I am healing I shall come up with one of those trendy JavaScript libraries I keep hearing about; like an aged rocker dating a young model I shall become nuanced, taken up, revered, established, lived and then hated…repeat.


Nope. Even way back then, I was using iTunes on Mac and Windows to rip and organise my music collection. A quick rsync or an SMB mount from a Linux machine made it easy to access my media in VLC or Rhythmbox. The Winamp/foobar aesthetics were really cool, but overall offered nothing for the practicality or ease of actually buying/ripping/playing your music.

But you know, everyone is different, and some folks had memorised a sequence of characters that went something like "FCKGW-..." and installed LimeWire, just to play that live acoustic version of Everlong.


This. With alsa-plugins and any console music player (cmus or mocp; cmus is more collection-oriented, mocp forces you to just use directories and files) and an -rt kernel, it was more than enough (if not better) for playing huge music collections under Linux.


I will raise you: desktop.ini and Thumbs.db


Windows is polite enough to not write them on network shares, unlike .DS_Store.


Now, yes. It used to be a really irritating problem there too.


That's still a weaker hand. macOS also has a ton of ._ files. Would have been better to have folded than raised.


No, macOS does not.

The issue is the file system.

Apple file systems allow a file to have extended attributes or resource forks. Thus a file is not a simple stream of bytes.

When you copy a file to a file system (e.g. FAT) that does not understand these attributes, macOS copies them into a ._ file. (I think if the file system were NTFS you could probably convert them, but I don't think anyone does.)

Copying a file out of an Apple environment loses data (OK, the data is metadata, and usually no one cares).


Back in the Windows XP days, yes; it's pretty much never a problem nowadays. For the past almost two decades, actually - thumbnails have been stored in the user profile folder since Vista (though it is different for network folders and may still be a problem there). And desktop.ini files - you'd only ever encounter them in predefined system folders (like Pictures, etc.) or if you manually customize a folder via its Properties > Customize tab (like changing the folder "type" to one of those predefined folders, or changing its icon; not the same as changing size/thumbnail size/columns/etc., that's stored elsewhere too).


Without interacting with it myself, none of this is surprising.

I have used Excel in the past, and I am a long-term Python user. But if you asked me today what I really want, to make my life easier and ultimately a product or business better using only Excel? I would ask for Lua or Scheme. I don't need a batteries-included environment embedded into a spreadsheet. I just want sane syntax for common functionality which does not require arcane knowledge and long-forgotten wisdom.


Your personal use-case might prefer Lua or Scheme, but most casual Excel (or SQL) users are non-programmers so they won't. They'll want the equivalent of decently-documented macros or boilerplate they can easily and quickly use without modification. (One common Excel use-case will clearly be "import/munge lots of data from various sources, then pass it into some AI model, then process the output". Can't see people writing that in Lua.) The real target customers for this one are commercial/enterprise non-programmer Windows-stack users whose legacy workflow/data is built around/glued to Excel and are already locked into paying $$ monthly/annual subscription. From looking at Reddit, I don't see much other uptake of Python in Excel.

I don't get your "shouldn't need batteries-included environment" objection; MSFT is bundling Anaconda distribution libraries with Excel. I'd expect it works seamlessly online and offline, as far as everything supported by Python stdlibs. (Can you actually point to any real problem with the batteries?) Really the only part I see you can quibble is things that are currently only implemented in uncommon third-party libraries, i.e. not stdlibs/numpy/scipy/scikit-learn/pandas/polars and the main plotting, data-science, ML, DB and web libraries.

> I just want sane syntax for common functionality which does not require arcane knowledge and long forgotten wisdom.

Show us some Python syntax for common functionality in Excel which does require arcane knowledge and long forgotten wisdom. Otherwise, this is purely your conjecture.

(If anything, bundling Python with Excel will stimulate healthy discussion towards which Python stdlibs need to be added/enhanced/changed, and which third-party libraries should be upgraded to stdlibs.)


Python in Excel is a feature I would only expect to be used by power users - someone who spends a lot of time in Excel. Calling these people "non-programmers" isn't accurate; Excel itself is a pretty esoteric programming language.

I personally don’t think python is the problem here, but if their users can learn python they can certainly learn lua.


That's exactly what I said above: most Excel users are non-programmers. Hence Python in Excel would only be used by a subset of Excel power users.

Moreover, having to pay $$ recurring subscriptions for that stack to run open-source software (Python) they could run for free elsewhere means it'll only be used by commercial/enterprise Windows-stack users who are already locked into some legacy workflow/data built around/glued to Excel. For example, financial users, or users who have an expensive license seat for some enterprise product(s). That means an even smaller subset of users.

We're saying the same thing.


Are we? I’m arguing that lua would have been a better choice than python.

Any traditional programming language that you put in Excel is going to be a feature mostly for power users, and I think they could pick up Lua just as easily as Python.


They could, but a lot more people already know Python than Lua.


Most Excel users (not the power users, just the 1.1 billion everyday ones, including many of the enterprise ones) don't know how to program in any language. You're coming at this with a HN mindset.

"Python vs Lua" is not even on their radar. And even if it was, their criteria would be dominated by platform lockin and compatibility with other licenses (e.g. commercial SQL, Tableau, MSFT, etc.). Not by "which open-source language?"


IMO you're the one coming in with an HN mindset. Python has massive mindshare even among people who have never programmed. It is the numeric computing language du jour. In any given financial company there are definitely already Python users. Lua, a language primarily known for plugin scripting, with no numeric computing libraries, that has zero mindshare among non-career programmers, is not even in the conversation.


Nobody here has made a case for Lua in Excel. I wrote "Python vs Lua" is not even on the radar of most Excel users, not even the subset that are programmers.

(Why are people here aggressively misreading everything I type, today?)

> Python... is the numeric computing language du jour. In any given financial company there are definitely already python users.

The original post didn't say "financial Excel users". Not all Excel users are financial; most aren't. I've worked with legal informatics users, e-commerce users, bioinformatic users, among others. Those sectors never use Excel for numeric computing, IME (drawing the occasional chart isn't numeric computing). They are more familiar with SQL, SQL macros, SQL query generators, importing/exporting to/from SaaS, etc. Like I said.


> excel is too bleh
> python is too blah
> myLanguageOfChoice is just right

We can't all be special snowflakes; Python and Excel are lingua francas.


Lua is pretty uncontroversial as the embedded language of choice and is actually made specifically to be embedded and play nice with the surrounding application.

I get why they chose Python for this, and it's not all that hard to embed (well, the interpreter anyway; compiled modules are another story).


Yeah, if what you are making is video games.

If you are doing grown-up stuff, you use a grown-up language.


+1. I’ve had to fire guys like the OP. Smart guys often, but nearly impossible to work with productively.


> I would ask for lua or scheme.

Scheme? Did somebody say... Scheme?

https://apexdatasolutions.com/home2/acce%CE%BBerate/

It's a paid addon, but still...


The likely target users for this are analyst/quant types. Very likely they are already using Python for the same tasks.


All of that is much easier in Python, where you have access to a lot of other people's data-wrangling utilities.


Here's a take nobody asked for: this costs a little less than 2 Real Dolls[1].

1. No way am I linking that.


And if you use MariaDB, just enable ColumnStore. Why not treat yourself to S3-backed storage while you are at it?

It is extremely cost effective when you can scale a different workload without migrating.
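
Roughly, assuming a hypothetical events table, the analytics copy is just another engine (the S3 backing itself is configured in ColumnStore's storagemanager settings rather than in SQL):

    CREATE TABLE events_cs (
        event_time DATETIME,
        user_id BIGINT,
        event_type VARCHAR(32),
        payload_size INT
    ) ENGINE=ColumnStore;

    INSERT INTO events_cs
    SELECT event_time, user_id, event_type, payload_size FROM events;

    -- OLAP-style queries then hit the columnar copy
    SELECT event_type, COUNT(*) AS hits, AVG(payload_size) AS avg_size
    FROM events_cs
    GROUP BY event_type;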


This is no shade to postgres or maria, but they don’t hold a candle to the simplicity, speed, and cost efficiency of clickhouse for olap needs.


I have had tons of OOMs with ClickHouse on larger-than-RAM OLAP queries.

While Postgres works fine (it is slower, but it actually returns results).


There are various knobs in ClickHouse that allow you to trade memory usage for performance. ( https://clickhouse.com/docs/en/operations/settings/query-com... e.g.)

But yes, I've seen similar issues - running out of memory during query processing is the price you pay for higher performance. You need to know what's happening under the hood and do more work to make sure your queries will work well. I think Postgres can be a thousand or more times slower, and it doesn't have the horizontal scalability, so if you need to do complex queries/aggregations over billions of records then just "returning results" doesn't cut it. If Postgres addresses your needs, then great - you don't need to use ClickHouse...
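
For example, the per-query route looks roughly like this (table/column names and values are only illustrative):

    SELECT user_id, count(DISTINCT session_id) AS sessions
    FROM events
    GROUP BY user_id
    ORDER BY sessions DESC
    SETTINGS
        max_memory_usage = 8000000000,                    -- hard per-query memory cap
        max_bytes_before_external_group_by = 4000000000,  -- spill aggregation to disk past this
        max_bytes_before_external_sort = 4000000000;      -- spill sorts to disk past this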


> There are various knobs in ClickHouse that allow you to trade memory usage for performance.

But which knobs should you use, and with what values, in each specific case? The query usually just fails with a generic OOM message without much information.


It's not actually so esoteric. The two main knobs are

- max_concurrent_queries, since each query uses a certain amount of memory

- max_memory_usage, which is the max per-query memory usage

Here's my full config for running clickhouse on a 2GiB server without OOMs. Some stuff in here is likely irrelevant, but it's a starting point.

    diff --git a/clickhouse-config.xml b/clickhouse-config.xml
    index f8213b65..7d7459cb 100644
    --- a/clickhouse-config.xml
    +++ b/clickhouse-config.xml
    @@ -197,7 +197,7 @@
     
         <!-- <listen_backlog>4096</listen_backlog> -->
     
    -    <max_connections>4096</max_connections>
    +    <max_connections>2000</max_connections>
     
         <!-- For 'Connection: keep-alive' in HTTP 1.1 -->
         <keep_alive_timeout>3</keep_alive_timeout>
    @@ -270,7 +270,7 @@
         -->
     
         <!-- Maximum number of concurrent queries. -->
    -    <max_concurrent_queries>100</max_concurrent_queries>
    +    <max_concurrent_queries>4</max_concurrent_queries>
     
         <!-- Maximum memory usage (resident set size) for server process.
              Zero value or unset means default. Default is "max_server_memory_usage_to_ram_ratio" of available physical RAM.
    @@ -335,7 +335,7 @@
              In bytes. Cache is single for server. Memory is allocated only on demand.
              You should not lower this value.
           -->
    -    <mark_cache_size>5368709120</mark_cache_size>
    +    <mark_cache_size>805306368</mark_cache_size>
     
     
         <!-- If you enable the `min_bytes_to_use_mmap_io` setting,
    @@ -981,11 +980,11 @@
         </distributed_ddl>
     
         <!-- Settings to fine tune MergeTree tables. See documentation in source code, in MergeTreeSettings.h -->
    -    <!--
         <merge_tree>
    -        <max_suspicious_broken_parts>5</max_suspicious_broken_parts>
    +        <merge_max_block_size>2048</merge_max_block_size>
    +        <max_bytes_to_merge_at_max_space_in_pool>1073741824</max_bytes_to_merge_at_max_space_in_pool>
    +        <number_of_free_entries_in_pool_to_lower_max_size_of_merge>0</number_of_free_entries_in_pool_to_lower_max_size_of_merge>
         </merge_tree>
    -    -->
     
         <!-- Protection from accidental DROP.
              If size of a MergeTree table is greater than max_table_size_to_drop (in bytes) than table could not be dropped with any DROP query.
    diff --git a/clickhouse-users.xml b/clickhouse-users.xml
    index f1856207..bbd4ced6 100644
    --- a/clickhouse-users.xml
    +++ b/clickhouse-users.xml
    @@ -7,7 +7,12 @@
             <!-- Default settings. -->
             <default>
                 <!-- Maximum memory usage for processing single query, in bytes. -->
    -            <max_memory_usage>10000000000</max_memory_usage>
    +            <max_memory_usage>536870912</max_memory_usage>
    +
    +            <queue_max_wait_ms>1000</queue_max_wait_ms>
    +            <max_execution_time>30</max_execution_time>
    +            <background_pool_size>4</background_pool_size>
    +
     
                 <!-- How to choose between replicas during distributed query processing.
                      random - choose random replica from set of replicas with minimum number of errors


> The two main knobs are

My experience is that those are not enough; multiple algorithms will just fail saying you hit the max memory limit. There are many other knobs, for example when to start external aggregation or sorting. For some cases I couldn't figure out the setup, and the query just hits OOM without any indication of how to fix it.


How is your table set up? It's plausible the on-disk/index layout is not amenable to the kinds of queries you're trying to run.

What kind of queries are you trying to do? Also, what kind of machine are you running on?


A trivial example would be running select count(distinct ...) on a large table with high-cardinality values: https://github.com/ClickHouse/ClickHouse/issues/47520
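
For reference, the common mitigation there (schema names are illustrative) is to trade exactness for bounded memory, since count(DISTINCT x) maps to uniqExact and keeps every distinct value in memory:

    SELECT count(DISTINCT user_id) FROM events;  -- uniqExact under the hood, can OOM on high cardinality

    SELECT uniq(user_id) FROM events;            -- approximate, bounded memory
    SELECT uniqCombined(user_id) FROM events;    -- another approximate variant, tunable precision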


And I mean why should they? They work great for what they are made for and that is all that matters!


As a caveat, I'd probably say 'at large volumes.'

For a lot of what people may want to do, they'd probably notice very little difference between the three.


That's true, but we're trying to change that at ParadeDB. Postgres is still way ahead of ClickHouse in terms of operational simplicity, ease of hiring DBAs who are used to operating it at scale, ecosystem tooling, etc. If you can bring the speed and cost efficiency of Postgres for analytics to a level comparable to ClickHouse, then you get the best of both worlds.


> Postgres is still way ahead of ClickHouse in terms of operational simplicity

Having served as both ClickHouse and Postgres SRE, I don't agree with this statement.

- Minimal-downtime major version upgrades in PostgreSQL are very challenging.

- A glibc version upgrade can break Postgres indexes (collation changes), which basically blocks upgrading the Linux OS underneath.

And there are other things which make Postgres operationally difficult.

Any database with primary-replica architecture is operationally difficult IMO.


For multi-TB or PB needs I would not stray from MariaDB, especially when using ColumnStore. I have taken the Pepsi challenge, even after trying Vertica and Netezza. Not HANA though; one has had enough of SAP.

