ramraj07's comments | Hacker News

How much of a slowdown did you estimate this bug caused?

SQLite only knows nested loop joins and the bloom filter can just tell us "no need to do a join, there is definitely no matching entry".

If it has a false positive all the time (the worst case) then the performance is the same as before the bloom filter optimization was implemented (besides the small bloom filter overhead).

As the bloom filter size in SQLite directly depends on the table size I estimated a false positive rate of 63.2% due to this bug, while it could have been just 11.75%.
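
For reference, a minimal sketch of the standard single-hash Bloom filter approximation these figures come from, FP ≈ 1 - e^(-n/m). Reading the bug as the filter ending up with n bits instead of n bytes (8n bits) is my inference from the numbers above, not something stated here:

    import math

    def bloom_fp_rate(n_keys, m_bits, k_hashes=1):
        # Standard approximation: (1 - e^(-k*n/m))^k. SQLite's join filter uses a single hash.
        return (1.0 - math.exp(-k_hashes * n_keys / m_bits)) ** k_hashes

    n = 1_000_000                           # rows hashed into the filter (arbitrary)
    print(bloom_fp_rate(n, m_bits=n))       # ~0.632 -> filter of n bits (the buggy sizing)
    print(bloom_fp_rate(n, m_bits=8 * n))   # ~0.1175 -> filter of n bytes = 8n bits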


It actually would have performed faster, but the false positive rate drastically increased.

I guess the person is asking how much of a slowdown the whole query received.

90%?

Have you experienced Waymo in SF? It actually drives faster than regular folks and brakes much harder because of that. The speed limit doesn’t apply to the streets of San Francisco, and it typically accelerates to the limit as fast as possible (especially since it’s electric).


Not in SF, only in Phoenix. My rides seemed much less erratic than a typical Uber ride.


Got my PhD from a lab that works on antibody drugs, they eventually even released one to the market.

I’d argue that our current system is broken. There’s no reliable metric of drug effectiveness in any of our pre-clinical models, and thus we end up going into clinical trials quite blind indeed. And more often than not, what drug gets into trials has more to do with ego and politics than actual scientific merit. And the folks involved in these types of activities are (IMO) the most unoriginal types I’ve ever seen.

There’s a lot we can do to improve our drug development process. It really doesn’t need to cost billions to bring a drug to the market. But the odds are stacked against anyone with a contrarian hypothesis and I just figured I’d save my sweat and leave this field instead.


It doesn’t (have to) cost billions to bring a (successful) drug to market.

And if you pick a single successful example that was discovered in academia, was spun out into a small focussed biotech, and was in a disease area that didn’t require large or multiple studies to make it to market, you’ll have your anecdote to prove your point.

Except… you’d be ignoring the costs of the 90% of drugs that fail in phase 1. You’d be ignoring the huge amount spent on discovery across the industry that never leads to a successful candidate.

Drug discovery and development is difficult because, for all of our clever science, it’s still essentially serendipitous and random. And we’ve not yet figured out how to make a production line out of something that’s random, try as we might. And it’s expensive because of the failures as well as the cost associated with success.


I am perfectly aware that this is the reason they give for the insane costs - “we have to test so many drugs!”

Yet you seem to have assumed I’m oblivious to the reality, when I’ve already stressed that I’ve been in the deep end and am aware.

I’ve already given an explanation of why I don’t agree with “it’s still serendipitous and random” - the people working on it are not smart enough and are more interested in stoking egos and careers than doing real science, even if they’re capable of doing so.

“90% of the drugs fail in phase I” - why are you telling me that when I’ve already explained why that is so - we don’t have good preclinical models that correlate with drug effectiveness? Is it that you didn’t understand what I wrote, or are you so neck-deep in this cultural quagmire that you refuse to acknowledge it?


No need for anger - I think we're mostly in (violent) agreement here. :)

Maybe the one area we would significantly disagree is that I don't think it's simply that "the people working on it are not smart enough and are more interested in stoking egos and careers than doing real science".

Honestly, I've had discussions with so many mid-level smart trusted colleagues who always think that the higher-ups are making stupid decisions, and they'd do better. They're right that some of the decisions might be stupid (and you're probably right about "stoking egos and careers") some of the time, but people are promoted, decision-makers come and go, and the decisions (and failure rates) don't really improve. I (think I) see it for what it is, and agree that we lack meaningfully informative pre-clinical models, but I'm also comfortable to acknowledge the weaknesses of the system and be honest that I don't have all of the answers. At the moment, it's a heinously inefficient crap-shoot, but it's the best we've so far come up with.

But, prove me wrong. There are likely countless molecules that have been discarded that have therapeutic benefit waiting to be realised. (I don't mean to sound facetious here but) If you can do better, and are smarter than and will make better decisions than everyone else in the industry, you'll be a billionaire in short order, as this is literally the golden ticket in this industry that everyone else is missing.


I would love to prove my point by actually becoming a billionaire, but my point is that the system is stacked against folks like me. Gotta have a Nature paper to get a job at Genentech fresh out of a PhD. Who gets Nature papers? People who join labs that already publish Nature papers. Who gets to join those labs? Valedictorians who did their undergrad at top schools. Apparently the odds are stacked the moment you slack off in eighth grade lol.

I have done my PhD; now I need to take a break to actually take care of my family and immigration. I hope to get back to this field at some point, on my own terms, and see if I can succeed. If it works, it works! If not, who cares, right? Let's see.


Right. You can't just choose to run the successful clinical trials any more than you can choose to only buy stocks that will go up on Wall Street. You have to run various clinical trials for a drug, and they fail. A lot. That's very, very expensive with no payoff. The successes have to be so phenomenally profitable that they cover the costs of all the failures. So real change would come from making the costs of those failures go away, without being able to cheat the system. The amount of medicine that is believed to work, but is unpatentable and thus doesn't have the profit motive to be pushed through clinical trials, is a huge black mark on the American version of capitalism as the best way we can organize society for the advancement of science and technology.


All of these things also apply to startups, and they create a VC groupthink of "portfolio theory" that necessitates huge (10,000x) returns, which costs the public a lot of viable small/medium enterprises that are not victims of the perverse incentives.

I wonder if the "optimal" theory is portfolio in this case, or if there is a new generation of VC/pharma investors who want a higher probability at a lower return.


> It doesn’t (have to) cost billions to bring a (successful) drug to market

> you’d be ignoring the costs of the 90% of drugs that fail in phase 1

It depends on what you call "bringing a drug to market".

_________________

* Phase I costs little, around $1M during the trial, and involves only a small group of participants (one or two dozen people), so it's not multi-center and it is manageable by a few people at a biotech. The problem is that most phase I trials fail, but this is not an issue of cost; it's an issue of how these trials get chosen, as explained by ramraj07, another commenter.

Too often a trial is started on a hunch without solid pre-clinical data. Sometimes it's because the drug was tested and failed in another disease and the managers "pivoted" to a new disease, since it then costs little to try again. Sometimes it's just a "weird IP/financial trick" where you combine an existing drug with an unrelated drug: you already know you have a relatively efficacious drug, there's no need for toxicity studies, and you can patent the combination.

On the contrary, many trials could be done on drugs with good pre-clinical data, but that does not happen because it would be hard to patent.

_________________

* A phase III costs around $25M for one or two hundred participants during the trial [0]. It lasts 6 months at most.

Some publications cite much higher numbers (~$1 billion), but this does not make sense, as drugs are often developed by biotechs (startups, in other words) with only a few million in their pockets.

Another source of cost inflation is subcontracting to CROs, as most biotechs do not have the manpower, knowledge, and business connections to conduct the trial themselves.

_________________

* Once a drug receives commercialization authorization, a major company usually buys the rights and then starts the marketing phase. This starts with teaching doctors how to prescribe and administer the drug. It means publishing articles in the mainstream medical press, inviting doctors to conferences and workshops, and paying medical sales representatives.

It is costly - this is probably where the ~$500M is spent - but for me these are not drug development costs, just marketing costs.

[0] http://idei.fr/sites/default/files/medias/doc/conf/pha/conf_...


I'm sorry, but for industry-sponsored trials your figures are off by up to an order of magnitude, despite the numbers in the (18 year old) reference.

Phase I: a small biotech I know of in oncology has phase I costs on the order of $500,000 per patient; this is a higher-end cost, due to their sites being in the US (more expensive than Europe) and because, as a small biotech, they've had to outsource virtually every aspect of running the trial. In big pharma, per-patient costs were more like $70-100k, but this was just the pure money paid out per patient (to the site, plus external costs like drug supply and shipping) and ignored the cost of laboratory, clinical, operations, and data management work that was being done in house. All told, it would typically be hard to get even a phase I study completed for less than 10x your estimate, and this is before you consider any additional recruitment needed between dose escalation and phase III.

Phase III: again it depends on many factors, but in big pharma a trial cost of $100-200k per patient was again not unreasonable, and typical phase III trials where you're comparing to a meaningful established medicine are larger than 100-200 patients. A biotech I know of is unable to run a phase III for a promising drug without finding a partner to support the majority of the cost (which is >100m EUR in oncology) and they're not wasting money.

---

A less anecdotal approach is to consider the total R&D costs of companies across a given timescale and divide by the number of successes. It's a pretty old reference too, but Matthew Herper did this in 2013. [0] Yes, there were some outliers with low costs, but you'd have to understand the details for context. The typical costs were in the hundreds of millions to billions per successful drug.

[0] https://www.forbes.com/sites/matthewherper/2013/08/11/the-co...


I'd agree with a lot of that, in terms of both many drugs being 'discovered' in clinical trials as opposed to earlier (a lot of it is about choosing the right patients and dose), and the differences in mindset between researchers and those involved in the clinical trial side.

One of the things you've missed is the strong restrictions put on pharma in terms of promoting use of existing drugs beyond the existing approval (which makes sense), and the almost complete freedom doctors have to do what they want - they can just decide to prescribe something off-label if they think it might help.

It can take a very long time for new ideas to become new products - and a lot of that is inertia (nobody else is doing it).


I think the combination of restrictions on pharma and greater freedom for doctors is quite helpful. There are some cases where this freedom has been abused, but overall that isn't a problem in my opinion.

Clinical trials are long and expensive, and the medical advisory board wants compensation as well. But even startups can theoretically fund new therapies if they and their medical advisory boards get subsidies. It is a lot of risk though, because for most drugs or medical devices the real effectiveness can only be determined later, in the trial itself.


The current system is like Churchill's description of democracy: the worst system, except for all the others.

Biology is extremely complex. There's no substitute for actually trying things out on subjects in vivo. For many diseases we don't even know the cause (Alzheimer's for example). Drug companies have all the incentive in the world to improve the system to get better odds; it's not like they want drug discovery to be such a crapshoot.


It’s ironic that you brought up Alzheimer’s as an example, since it exactly proves your point - drug companies pushed a therapy that targets a highly questionable _symptom_ of the disease, even though every single step of the process gave negative or inconclusive results. It was all about ego and desperate attempts to make profits using iffy drug candidates.

And “biology is complex” is the type of truism I hinted at. You can always say that whenever you fail. Biology is complex, and Alzheimer’s is the most complex of them all, to be sure, but I hope you’re aware of the Alzheimer’s “cabal” allegations - that the entire field was manipulated by a bunch of people into believing and pursuing the wrong hypotheses for decades.


We also don't understand how some drugs work, either (e.g. Tylenol).


I'd say we have a rather good idea about the mechanisms for pain relief from paracetamol. Even Wikipedia has a decent summary: https://en.wikipedia.org/wiki/Paracetamol#Pharmacodynamics


A utility-maximizing drug discovery system would, I think, devote some effort to biological experimentation on healthy humans, giving them chemical probes to see how that affected their biology. As is, ethics requires we get this information accidentally, for example from that famous recreational drug chemist who gave himself Parkinson's Disease with a botched synthesis that made a highly neurotoxic chemical. And some of the information comes from drug trials. A useful drug is not the only value obtained from a drug trial -- each trial is also a test of a hypothesis about the mechanisms of a disease.

One of the books of the "Colossus" trilogy (about a computer that takes over the world) had the computer doing this sort of medical experimentation on randomly selected drafted subjects, with the idea of maximizing overall utility. It shows the problem with utility maximization as a goal, similar to the requirement that people give up a healthy kidney if someone else needs a transplant.


Many thanks for saying what I suspected when looking at the research publications and clinical trials on neurodegenerative diseases. I was starting to think I was an unproductive perpetual malcontent.

For example, memantine has been tested 5 times in ALS, yet there aren't even pre-clinical studies that show any positive effect of memantine in animal models. This seems so bizarre to me.


Well to be fair, big pharma doesn't release preclinical results the same way that academia does. There might be no published work to support the hypothesis, but that doesn't mean they haven't done preclinical work.


How good are our animal models of ALS? Are they predictive of effectiveness in humans?


I am not an expert (I am a retired R&D telecom engineer) but here is my take:

* As with cancer, there are several (many?) ALS variants. The first gene to be associated with ALS was SOD1 (the G93A allele), in 1993. It remained the only known ALS gene until 2006. That was a curse for research, as ALS of SOD1 origin accounts for less than 2% of total cases, and even for SOD1 there are dozens of mutations associated with ALS, some with 6 months of life expectancy, others with 20 years.

* Most commercial animal models are SOD1 G93A mice [0]. The G93A mutation represents roughly only 0.4-1.4% of all ALS cases worldwide, yet it is the most used animal model!

SOD1 G93A ALS models are also the least costly animal models.

* I think another important thing is that ALS often starts in the hands (the split hand phenomenon) and targets skeletal muscles. But the human nervous system for the hands is very special, shared only with other higher primates. Other mammals like mice have an interneuron between the upper and lower motor neurons for the hands. We do not: there is a direct connection between upper and lower motor neurons, reflecting the importance of manipulation for humans. Therefore, for me, we can't prove with mice at the pre-clinical stage whether a drug is efficacious or not (many drugs have some efficacy in animal models, but none in humans).

* Some publications claim they can use individual cells, fish, or nematodes as animal models. That's laughable; it ignores the importance of anatomy and physiology. We are complex animals: our hormones, our immune system, and our metabolism are important to understanding ALS. The proof of that is that the ALS patients with the best life expectancy have a BMI of 27.

* Other publications claim to make their own animal models with some chemical, like BMAA, a neurotoxin found in certain cyanobacteria. Those publications smell like bad practice to me.

If you want to buy a mouse model of ALS:

[0] https://www.jax.org/jax-mice-and-services/preclinical-resear...


Is this a market that can be disrupted? It sounds like if you know how to save a few billion and introduce more science-based drugs, it's ripe for a takeover.


In the same way Uber disrupted licensed taxis - or the big internet firms disrupted ad supported media.

ie totally ignoring existing regulations, pretending they don't apply to you and just hoping you can push through.

A lot of the 'problems' are the regulations (which are double-edged and tricky to get right) - and pharma companies are just following the rules.

I think governments might be less lax about letting there be a new wild west in drug development.


Pointing the finger at regulation is misleading IMO. The regulations for bringing a drug to market are essentially quite simple: prove that it’s better than what currently exists.

What makes it difficult is the word “prove”.

It turns out it’s obscenely hard to make a drug that’s good, and even harder to prove that it’s good.


> prove that it’s better than what currently exists.

So how do you do that ethically? How do you justify taking someone off something that you know works to some extent to try something completely new, or worse, a placebo? I.e., don't you have to construct the trial in the context of existing treatments, etc.?

These are the kinds of challenges that make drug development slow - in the end you don't do one trial, but a series of trials, slowly building confidence and making the case.

Often that's what takes the time during the clinical phase.

Of course it would be much faster to go straight to a big trial that would show how well your treatment works in conditions optimal to it - however that kind of 'move-fast break-things' approach involves potentially breaking things which happen to be people.

Regulation just reflects the cautious 'first do no harm' philosophy.

Now let's be honest - big pharma will simultaneously complain about regulation and the cost of development while knowing it creates barriers to entry - there is always some frustration about the slowness of regulatory authorities in adopting new methods - however you wouldn't want your regulator to be gung-ho.


> or worse placebo

Just to be clear, most drug trials for anything where we have an effective treatment are not “new drug vs placebo”, but instead “new drug vs standard of care”. Thus the goal being to prove it’s better than what already exists.


Sure - it rather depends on how good the 'standard of care' is or how much consensus there is on what that should actually be.

If the standard of care is already good and you don't need a placebo, then you have another problem - you are probably going to have to do quite a big trial to get the stats to show a significant difference, and you are going to find it harder to persuade people to participate in a trial of an experimental treatment if there already is a fairly good treatment.

The whole point about the challenges with clinical trials is that it's not an intellectual exercise in designing the perfect experiment and 'just doing it'.

It's about persuading yourself, the regulators, the doctors and ultimately the patients that it's something you should try - and before you've done your first trial you don't have any human data to show it's safe and effective - all a bit chicken and egg - the solution is often to move slowly in stages.


This is particularly difficult for drugs that affect the brain, like MDMA for PTSD in veterans. What do you use as the control group for that, when patients and clinicians can tell who got the real thing and who did not? I call this the bridge problem. In order to do science, you have to have a control group, but if I built a bridge across a ravine, we don't have to have cars drive off a cliff and fall into the ravine in order to scientifically prove that the bridge works and exists. We engineered a bridge and put it there, and obviously if there were no bridge, cars would just fall into the ravine, so we don't need to test that the bridge exists. We design the bridge, we rate it up to a certain capacity, we don't test it until it fails; we simply prohibit really heavy trucks from driving on smaller bridges that can't take their weight.

We can't do any of that for drugs that affect emotions and consciousness because we're barely in the stone age of our understanding of the brain and the technology we have to affect it.


That's a good explanation with the bridge. There's also the parachute clinical trial, which gets used to illustrate the same futility:

https://www.bmj.com/content/363/bmj.k5094


[flagged]


Sorry, are you having difficulty with the concept that human prisoners should have more rights than mice?


Purdue Pharma, fentanyl, and doctors abrogating responsibility for patient safety are examples of 'go wild'.

On your second point - I'd agree that a lot of animal experiments are not that informative - but let's be clear: 'clinical trials' are simply experiments on people.

I'm not sure I'd want to give Musk, Zuckerberg or Bezos free rein to experiment on desperate people in the medical space.

Depends on whether you treat people as just grist to your money making mill - or perhaps you think the ends justify the means?


the minds and qualia of incarcerated human beings and of rodents are very unequal in import and value.

what’s more, establishing legal precedent that incarcerated human beings may be freely experimented on is a recipe for ethical catastrophe.


Uber disrupted taxis because taxis were a sleazy experience, with dirty old cars, “broken” meters and rude drivers that tried to get you to pay extortionate prices if they knew you were in a pinch.

Stop trying to venerate the taxi industry, they’re horrible.


I think that depends on what part of the world you live in.

My experience of taxi companies in the UK is that they are generally safe, reliable and operate based on reputation.

My experience of taxis in the US is that they often appear to be operated by desperate people living on the edge of existence.


Isn't that every service in the US? It takes pride in pushing the people at the bottom to the brink of death.


There's no hoping you can push through. The US Government has complete top-down control over the sale of prescription drugs in the US, from clinicals to approval to distribution & sale.

The sole reason Uber pulled off what they did, is there's no national authority governing taxi style services for all states and cities, it's a state and local effort. So Uber counted on navigating around zillions of slow local governments long enough to get big, and it worked very well. You can't do that in prescription drugs, the feds have a big hammer and can (and will) use it anytime they like.


Absolutely, and if you recall, even YC tried to get in on this idea.

Except they made the same mistake anyone who comes up with this disruption plan commits (including Google with Calico, or Zuck with CZI) - they recruit existing academics to do the disruption. Unfortunately this just fails miserably, because they’re culturally corrupted into thinking in terms of standard dogmas (like the idea that there can never be a single cure for cancer). I remember a time when other such dogmas existed (remember how it was considered impossible to de-differentiate somatic cells?).

The other mistake tech bros make in biology is they think they can make any cool idea work if they are smart enough. Because this is actually true in tech. But biology is restricted by laws of nature. If a drug doesn’t work, it can’t be made to work. There’s no room for wishful thinking.

Third mistake I see often is individual bias towards fields that they come from. Someone who has an RNA background will only try to use RNA to solve everything, likewise with antibodies, or imaging, etc. The current research funding system incentivizes such thinking and it becomes entrenched in anyone already in this field. There’s never a thought of “which is the exact technology and approach I should use to solve this problem independent of what I’m an expert at?” So a lot of projects are doomed from the start.

As long as you’re cognizant of these three facts, I think it’s very possible to disrupt this field.


Is there any plausible biological reason to think that there could ever be a single cure for cancer?


Perhaps immune-based therapies like CAR-T are based on the premise that there are many cancerous cells in your body all the time, but your immune system deals with them, and it’s only when it fails to do so that you end up in the pathological state. So the “single cure” is the normally-functioning immune system?


That might be part of it. And yet sometimes people with normally-functioning immune systems also get cancer. So while that might be an effective treatment for some patients it's not going to be a universal cure.


Human "normal" may not be enough.

Bat "normal" might be. Of course, now we are crossing the threshold from medicine to bio-augmentation.


There is no free lunch in biology. Augmenting the immune system to better attack cancer is going to cause other problems. It's so naive to think there is some simple solution that will improve on a billion years of evolution. I mean it's not impossible but realistically what are the odds?

There won't be any magic for cancer. It's just going to be slow grind to solve one hard problem after another.


There is no free lunch outside biology either. The problems that come with stronger immune systems may be more tractable or at least less unpleasant than cancer.

Also, you seem to be very pessimistic. Many interventions in the history of medicine, like washing hands or the first vaccine against smallpox, were almost "magical" in their efficiency: they addressed a lot of problems through a relatively trivial intervention.

It is likely that a lot of this low-hanging fruit has been picked up, but you insinuate that there isn't any low-hanging fruit to begin with, only an endless slog of attacking hard problems. That is way too negative.


Some mammal species like bats, whales and naked mole rats seem to be extremely unlikely to get cancer. Which may be an indication that a very efficient immune system can keep cancer in check indefinitely.


Some drugs not being able to make it into phase 1 clinical trials sounds like a functioning regulatory system to me. The bar isn't astronomically high for a phase 1. Like, sure, you can't just do it in your garage like a web startup, but there are reasons for that. If anything, there are way too many drugs floating around in LDTs right now, hence why those are being phased out.


There are companies trying to address this, right? Have you seen biorce and other new ventures? Hopefully they can bring some innovation and reform to old processes.

That being said, we're talking about human lives either way, so it needs to be thought through to avoid unintended disasters caused by lack of care.


Author list, author affiliation and past history all suggest this might be a green card paper.


How does that work? In the US, do they pump papers out as evidence of exceptional ability or something along those lines? I thought that mattered for entry visas, not for residency.


Brilliant single line that is better than every other description above. Kudos.


No one owes anyone open source. If they can make the business case work or if it works in their favor, sure.


OP and the people who reply to you are perfect examples of engineers being clueless about how the rest of the world operates. I know engineers who don’t know Claude, and I know many, many regular folk who pay for ChatGPT (basically anyone who’s smart and has money pays for it). And yet the engineers think they understand the world when in reality they just understand how they themselves work best.


I think they’ll acknowledge these models are truly intelligent only when the LLMs also irrationally go in circles around logic to insist that LLMs are statistical parrots.


Acknowledging an LLM is intelligent requires a general agreement of what intelligence is and how to measure it. I'd also argue that it requires a way of understanding how an LLM comes to its answer rather than just inputs and outputs.

To me that doesn't seem unreasonable and has nothing to do with irrationally going in circles, curious if you disagree though.


Humans judge if other humans are intelligent without going into philosophical circles.

How well they learn completely novel tasks (fail in conversation, pass with training). How well they do complex tasks (debated - just look at this thread). How generally knowledgeable they are (pass). How often they do nonsensical things (fail).

So IMO it really comes down to whether you’re judging by peak performance or minimum standards. If I had an employee that performed as well as an LLM, I’d call them an idiot because they needed constant supervision for even trivial tasks, but that’s not the standard everyone is using.


> Humans judge if other humans are intelligent without going into philosophical circles

That's totally fair. I expect that to continue to work well when kept in the context of something/someone else that is roughly as intelligent as you are. Bonus points for the fact that one human understands what it means to be human and we all have roughly similar experiences of reality.

I'm not so sure that kind of judging intelligence by feel works when you are judging something that is (a) totally different from you or (b) massively more (or less) intelligent than you are.

For example, I could see something much smarter than me as acting irrationally when in reality they may be working with a much larger or complex set of facts and context that don't make sense to me.


I am just using DuckDB on a 3TB dataset on a beefy EC2 instance, and am pleasantly surprised at its performance on such a large table. I had to do some sharding, to be sure, but I am able to match the performance of Snowflake and other cluster-based systems using this single machine.

To clarify, ClickHouse will likely match this performance as well, but doing things on a single machine looks sexier to me than it has in decades.


Where does your data reside, is it on an attached EBS volume, or in S3, or somewhere else?

I had some spare time and tinkered with DuckDB on a 70GB dataset, but just getting the 70GB onto the EC2 instance took hours. It would be pretty rocking if the DuckDB team could somehow set up a ~1TB demo that anyone can set up and try for themselves in, say, under an hour.


Local drives. DON’T USE EBS! You’ll incur a huge IO charge. You have to choose instances with attached NVMe storage, which means one of the storage-optimized instances.

Reading the data off S3 will mean you will be slower than offerings like Snowflake. Snowflake has optimized the crap out of doing analytics on S3, so you can’t beat it with something as simple as DuckDB.

Importantly, you need the data in some splittable format like Parquet or split CSVs. Otherwise DuckDB can’t read it in parallel.
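
A minimal sketch of that kind of parallel scan with the Python client (the paths, column names, and thread count are made up):

    import duckdb

    con = duckdb.connect()
    con.execute("SET threads TO 96")  # e.g. a big storage-optimized instance
    # A glob over many Parquet files is what lets DuckDB fan the scan out
    # across threads (roughly one per file / row group).
    top = con.execute("""
        SELECT user_id, sum(amount) AS total
        FROM read_parquet('/mnt/nvme/events/*.parquet')
        GROUP BY user_id
        ORDER BY total DESC
        LIMIT 10
    """).df()
    print(top)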


Hi – DuckDB Labs devrel here. It's great that you find DuckDB useful!

On the setup side, I agree that local (instance-attached) disks should be preferred, but does EBS incur an IO fee? It incurs significant latency for sure, but it doesn't have per-operation pricing:

> I/O is included in the price of the volumes, so you pay only for each GB of storage you provision.

(https://aws.amazon.com/ebs/pricing/)


Can’t remember anymore, but it’s either that (a) the gp2 volumes were way too slow for the ops or (b) the IOPS charges made it bad. To be clear, I didn’t do this with DuckDB; I was hosting a Postgres instance. I moved to Lightsail instead and was happy with it (you don’t get attached SSDs on EC2 until you go to instances that are super large).


Also, I learned that Hive-partitioned Parquet on S3 is much slower than on disk.

S3 is high latency unless you use S3 Express One Zone (the low-latency version).

We used EFS (not EBS) and it was much faster.


Test out the NVMe drives though. They’re blazing.


I tried spreading a large dataset into thousands of files on S3 and using Step Functions Distributed Map to launch thousands of Lambda instances to process those files in parallel, using DuckDB (or other libs) in Lambda. The parallel loading and processing is way faster than doing it on a single big EC2 instance.
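
A rough sketch of what one such per-file Lambda handler could look like (the bucket name, event field, and schema are hypothetical, and S3 credential/extension setup is omitted since it varies by DuckDB version):

    import duckdb

    def handler(event, context):
        # Step Functions Distributed Map passes one S3 object key per invocation.
        key = event["Key"]  # hypothetical field name
        con = duckdb.connect()
        con.execute("INSTALL httpfs; LOAD httpfs;")  # needs network access or a pre-bundled extension
        path = f"s3://my-analytics-bucket/{key}"     # hypothetical bucket
        rows, total = con.execute(
            f"SELECT count(*), sum(amount) FROM read_parquet('{path}')"
        ).fetchone()
        return {"key": key, "rows": rows, "total": total}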


Lambda isn’t infinitely parallel. I thought it doesn’t do more than 100 parallel runners? I4i.metal has 96 cores and can be faster than that.


As per AWS said in https://aws.amazon.com/cn/blogs/aws/aws-lambda-functions-now...

> Each synchronously invoked Lambda function now scales by 1,000 concurrent executions every 10 seconds.


I’ve tried reading streamed parquet via PyArrow with Duck, and it’s been pretty promising. Depending on the query, you won’t need to download everything off HTTP.
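
A minimal sketch of that pattern (the S3 path and columns are placeholders): register a PyArrow dataset with DuckDB so projections and filters get pushed into the Arrow scan and only the needed column chunks are fetched.

    import duckdb
    import pyarrow.dataset as ds

    # Lazily describes the remote Parquet files; nothing is downloaded yet.
    dataset = ds.dataset("s3://my-bucket/events/", format="parquet")

    con = duckdb.connect()
    con.register("events", dataset)  # expose the Arrow dataset as a DuckDB view
    # Only the 'day' and 'amount' columns (and matching row groups) should be read.
    print(con.execute(
        "SELECT sum(amount) FROM events WHERE day = '2024-01-01'"
    ).fetchone())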


we use partitioned parquet files in s3. we use a csv in the bucket root to track the files. i’m sure there’s a better way but for now the 2tb of data are stored cheaply and we get fast reads by only reading the partitions we need to read.


I'm curious how much simpler to build, manage, and run (versus the cost) it would be to simply run a database on a large Vultr/DO instance and pay for 2TB of storage?

I feel like you'd get away with the whole thing for around $500/mo depending on how much compute was needed?


You just need to try it once to see the issue. Merely loading this amount of data onto a Postgres db will be hell.


well that's not the infrastructure we have. we are primarily an aws shop so we use the resources available to us in the context of our infrastructure decisions. it would be a hard sell to buy something outside of that ecosystem.


I understand that's the infrastructure you have. But that's more describing vendor lock-in haha.

Most of my work is with clients that don't have any set infrastructure yet, so was curious if anyone had any anecdotes.


Huge fan of Clickhouse, but the minute you have to deal with somebody else's CSV is when Duck wins over Clickhouse.


It’s more accurate to say we are piecing together a thousand page book with just 20-50 sentences at random.


Each new piece adds to the picture, but it reminds us of just how much remains undiscovered

