
We have a code base of roughly 200,000 lines of Haskell code, dealing with high-performance SQL query parsing, compilation, and optimization.

I only remember one situation over the past five years in which we had a performance issue with Haskell, and it was solved by using GHC's profiling capabilities.

I disagree that performance is hard to figure out. It could be better, yes - but it's not that different from what you'd get with other programming languages.




> but it's not that different from what you'd get with other programming languages.

Until you have to solve a (maddeningly common) space leak issue. That's a problem unique to lazily evaluated languages (basically Haskell) and is godawful to debug.

It reminds me of solving git problems... you suddenly find yourself having to tear back the abstraction layer that is the language and compiler and start thinking about thunks and so forth.

It's jarring and incredibly frustrating.


I find this response a little ironic, because we don't really see complaints about having to know the C runtime and C compiler when performance becomes a problem, which is also jarring and frustrating. But, ultimately, sometimes languages have failings and we need to peek under the hood to figure out how to address them - we're just more comfortable with the relatively common and direct mapping of the C runtime.

I am not experienced enough with Haskell to know whether peeking under its hood involves a more complicated model or not. It might be more frustrating. But it's certainly not a unique experience - its costs are just less distributed across other runtimes.


C is very straightforward.

Maybe you meant C++? You see no complaints because no one uses it anymore. Anything that can be done with an easier language is done with an easier language. The hardcore, performance-critical C++ code is left to a few veterans, who don't complain.


I don't consider the combination of the C runtime + the machine model straightforward - just less arcane than C++. Consider pipelines, branch prediction, and cache lines, and it quickly becomes difficult. Granted, those typically become relevant later in the optimization stage than other things.


Pipelines, branch prediction and caching are not part of the C runtime. And unlike Haskell, C makes it easy to look at the assembly for a piece of code, evaluate it for the machine it's running on, and fix these problems when they come up. C is not generally adding additional burdens, especially not ones that a higher-level language like Haskell won't also be adding to a far greater degree.


"we're just more comfortable with the relatively common and direct mapping of the C runtime"

That's a funny way to put it. It's more like the difference between getting results or abandoning the thing altogether due to exploding cost of required effort.


When have you ever had the C runtime be the cause of a performance problem?


I have not because I don't write C professionally. We generally have things that require algorithmic improvements due to the scale - language doesn't matter.

C's model requires you to understand the machine model. Haskell presumably requires you to understand the machine model (though less thoroughly), but also the compiler's model. That's a little more, but comparable. So complaining only about the Haskell runtime just seems ironic to me.


Writing a custom replacement for malloc is relatively common. Does that count?


I don't see why not, though I wouldn't consider that to be an example of a difficult to diagnose problem in the same vein as lazy evaluation.


> Writing a custom replacement for malloc is relatively common.

You... can't be serious. That's common to you?


Yes? From Wikipedia:

"Because malloc and its relatives can have a strong impact on the performance of a program, it is not uncommon to override the functions for a specific application by custom implementations that are optimized for application's allocation patterns."


"not uncommon" is far far from "common". If you're writing your own malloc replacement you're pretty deep into the weeds of high performance computing. Heck even just deciding to replace the stock implementation with an off-the-shelf replacement puts you in fairly rarified company. I'd wager the vast majority of software written for Linux makes use of the standard glibc allocator.

I expect high performance games are the most common exception, but they represent a fraction of the C/C++ in the wild.

Space leaks in Haskell, on the other hand, are unfortunately relatively easy to introduce.


There is this really great paper about space leaks and arrows: https://pdfs.semanticscholar.org/cab9/4da6e4b88b01848747df82...


> Until you have to solve a (maddeningly common) space leak issue.

Hm, I've been making useful things with Haskell for a couple years including quite a few freelance projects and haven't encountered many space leaks.

Definitely not enough to say they are maddeningly common, or even enough to say they are common.


My experience tracks yours.


Here is a method that can help debug space leaks: http://neilmitchell.blogspot.com/2015/09/detecting-space-lea...


I wish the problem was simply detecting and isolating space leaks.

My experience is that actually fixing them can be incredibly difficult, hence my comment about needing to understand the gory details about how the runtime evaluates lazy expressions.

Heck, that post even uses the phrase "Attempt to fix the space leak" as it's often not an obvious slam dunk. Sometimes it even devolves to peppering !'s around until the problem goes away.


> My experience is that actually fixing them can be incredibly difficult

My experience differs, FWIW. If you know where you're creating too many thunks, and you force them as you create them, they don't accumulate.

Making sure you actually are forcing them, and not simply suspending a "force this", is probably the trickiest bit until you're used to the evaluation model.
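To illustrate that last point, here's a minimal sketch (hypothetical function names; GHC's strictness analyser may rescue the first version under -O, but it shows the shape of the bug):

    {-# LANGUAGE BangPatterns #-}

    -- Pitfall: the `seq` is itself wrapped in a lazy binding, so it only
    -- *schedules* the forcing; acc' is a thunk and the chain still grows.
    sumLeaky :: [Int] -> Int
    sumLeaky = go 0
      where
        go acc []     = acc
        go acc (x:xs) = let acc' = acc `seq` acc + x  -- a suspended "force this"
                        in go acc' xs

    -- Forcing at creation time: the bang pattern evaluates the accumulator
    -- on every call, so no thunk chain accumulates.
    sumStrict :: [Int] -> Int
    sumStrict = go 0
      where
        go !acc []     = acc
        go !acc (x:xs) = go (acc + x) xs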


> If you know where you're creating too many thunks, and you force them as you create them, they don't accumulate.

Translation: if you've internalized the way Haskell code is compiled and executes, so that you can easily reason about how lazy evaluation is actually implemented, you can solve these problems.

If not, it devolves to throwing !'s in and praying.

Which is basically my point.

If I don't have a hope of solving common space leaks without deeply understanding how Haskell is evaluated, that's a real challenge for anyone trying to learn the language.


This sounds a bit FUD-ish. Programming in any language involves understanding the evaluation strategy. You seem to be advocating languages that can be used with a level of ignorance or innocence which in practice just isn't possible.

https://en.wikipedia.org/wiki/Evaluation_strategy


I disagree. I contend that most of the time folks ignore the evaluation strategy for eagerly evaluated languages because they're simply easier to reason about. That strikes me as objectively true on its face and I believe most Haskellers would concede that point.

The only time I can think of where that's not the case is when micro-optimizing for performance, where the exact instructions being produced and their impact on pipelining and cache behaviour matter. But that's in general far rarer than encountering space leaks in idiomatic Haskell.

Heck, one just needs to read about two of the most common Haskell functions to hit concerns about space leaks: foldl and foldr. It's just part of the way of life for a Haskeller.
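The canonical illustration (a sketch; the list size is arbitrary):

    import Data.List (foldl')

    -- foldl is lazy in its accumulator: this builds the giant thunk
    -- ((...((0+1)+2)...)+10000000) on the heap before adding anything.
    leaky :: Int
    leaky = foldl (+) 0 [1 .. 10000000]

    -- foldl' forces the accumulator at each step and runs in constant space.
    fine :: Int
    fine = foldl' (+) 0 [1 .. 10000000]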

There's simply no analog that I can think of in the world of eagerly evaluated languages that a) requires as much in-depth knowledge of the language implementation, and b) is so commonly encountered.

The closest thing I can come up with in a common, eager language might be space leaks in GC'd languages, but they're pretty rare unless you're doing something odd (e.g. keeping references to objects from a static variable).


You're taking issue with material that is covered in every Haskell 101 course worth its salt (how to understand the properties of foldl and foldr).

We typically don't evaluate the efficacy of a language based on someone who is missing fundamental, well-known concepts of the language.

Also, I don't think folks "ignore the evaluation strategy for eagerly evaluated languages". They simply learn it early on in their programming experience.


> You're taking issue with material that is covered in every Haskell 101 course worth its salt (how to understand the properties of foldl and foldr).

Oh, I'm not "taking issue". This isn't personal. It's just my observations.

And yes, the fact that one needs to explain the consequences of lazy evaluation and the potential for space leaks to a complete neophyte just to justify foldr/foldl is literally exactly what I'm talking about! :)

Space leaks are complicated. And they're nearly unavoidable. I doubt even the best Haskeller has avoided introducing space leaks in their code.

That's a problem.

Are you saying it's not? Because that would honestly surprise me.

Furthermore, are you saying eager languages have analogous challenges? If so, I'm curious what you think those are! It's possible I'm missing them because I take them for granted, but nothing honestly springs to mind.


I didn't claim space leaks aren't a problem. But one has to size the magnitude of the problem appropriately. And one should also cross-reference that with experience reports from companies using Haskell in production.


I think there are different types of space leaks:

- Reference is kept alive so that a computation can't be streamed. Easy to figure out with profiling, but fixing it might make the code more complex. Also, if you have a giant static local variable, GHC might decide to share it between calls, so it won't be garbage collected when you'd expect it to be.

- Program lacks strictness so you build a giant thunk on the heap. This is probably what you think of when talking about dropping !'s everywhere. I don't find it that difficult to fix once the location is known but figuring that much out can be seriously annoying.

- Lazy pattern matching means the whole data has to be kept in memory even if you only use parts of it. I don't think I have ever really run into this but it is worth keeping in mind.

- I have seen code like `loop = doStuff >> loop >> return ()` several times from people learning Haskell, including me. Super easy to track down, but still worth noting I guess (sketched below, after this list).

Building giant thunks is the only one where you really need some understanding of the execution model to fix it. 99% of the time it is enough to switch to a strict library function like foldl', though.
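The loop pattern from the last bullet, sketched (doStuff stands in for any IO action):

    -- Each unfolding leaves a pending (>> return ()) behind, so the chain
    -- of continuations grows with every iteration and is never freed.
    leakyLoop :: IO ()
    leakyLoop = doStuff >> leakyLoop >> return ()

    -- Tail version (or Control.Monad.forever): the recursive call is the
    -- final action, so nothing accumulates.
    fineLoop :: IO ()
    fineLoop = doStuff >> fineLoop

    doStuff :: IO ()
    doStuff = return ()  -- placeholder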


> I don't find it that difficult to fix once the location is known but figuring that much out can be seriously annoying.

To be clear, I agree with this. Easy to fix once you know where it is, if you're competent in the language. Occasionally very hard to know that.


> unique to lazily evaluated languages

More precisely, unique to significant use of laziness, which is (obviously) going to be more common in lazily evaluated languages - but laziness is supported elsewhere too.


(OP here)

I do say it's not that big of a deal in the end: it's almost always OK, and at the end you optimize the inner loops by looking at profiles, like in any other project.

But when using GHC, I have indeed sometimes run into situations where I expect something to be fast when it is not (e.g., `ByteString.map (+ value)` is incredibly slow compared to a pseudo-C loop).

I also did find a bona fide performance bug in GHC: https://ghc.haskell.org/trac/ghc/ticket/11783


GHC isn't magical. It has bugs too.

We had an issue with the Data.Text package and OverloadedStrings in 7.10.2 which caused extremely slow compilation times; we filed a bug report for that, and it was fixed in 7.10.3.


> (e.g., `ByteString.map (+ value)` is incredibly slow compared to a pseudo-C loop).

That situation sounds like you needed a ByteString Builder to get comparable performance.
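For what it's worth, a minimal sketch of that suggestion (whether it actually beats a hand-written loop is an empirical question):

    import Data.Word (Word8)
    import qualified Data.ByteString as B
    import qualified Data.ByteString.Lazy as BL
    import Data.ByteString.Builder (toLazyByteString, word8)

    -- Map (+ value) over each byte via a Builder rather than ByteString.map,
    -- producing a (lazy) ByteString from buffered chunks.
    addValue :: Word8 -> B.ByteString -> BL.ByteString
    addValue value =
      toLazyByteString . B.foldr (\w rest -> word8 (w + value) <> rest) mempty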


Maybe, but pseudo-C actually worked very well.

I even have some "real C" in my Haskell code to handle some inner loop stuff.
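That pattern looks roughly like this (a sketch; add_byte is a stand-in for whatever the actual C helper is, compiled and linked separately):

    {-# LANGUAGE ForeignFunctionInterface #-}

    import Data.Word (Word8)
    import Foreign.Ptr (Ptr, castPtr)
    import Foreign.C.Types (CSize(..))
    import qualified Data.ByteString as B
    import qualified Data.ByteString.Internal as BI
    import qualified Data.ByteString.Unsafe as BU

    -- C side, e.g. in inner_loops.c:
    --   void add_byte(const unsigned char *src, unsigned char *dst,
    --                 size_t len, unsigned char value)
    --   { for (size_t i = 0; i < len; i++) dst[i] = src[i] + value; }
    foreign import ccall unsafe "add_byte"
      c_add_byte :: Ptr Word8 -> Ptr Word8 -> CSize -> Word8 -> IO ()

    -- Run the C inner loop over a ByteString, filling a freshly
    -- allocated buffer that becomes the result.
    addByte :: Word8 -> B.ByteString -> B.ByteString
    addByte value bs =
      BI.unsafeCreate len $ \dst ->
        BU.unsafeUseAsCString bs $ \src ->
          c_add_byte (castPtr src) dst (fromIntegral len) value
      where
        len = B.length bs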


I find that incredibly hard to believe. I wonder whether you are looking at history through rose-tinted glasses, or whether you simply don't know because other people quietly solved those problems. I ran into such problems several times in 50-line programs, and not because my code was non-idiomatic or wrong. #haskell agreed it was non-trivial to get the code to perform well.


Hi, can you tell us a bit more about this SQL-related project? I have a rather practical interest.


I work for www.sqream.com, and our main product is SQream DB.

SQream DB is a GPU SQL database for analytics. Everything was written in-house - from the SQL parser all the way down to the storage layer. We're designed to deal with sizes from a few terabytes to hundreds of terabytes, and we use the GPU to do most of the work. It is the compiler (written in Haskell), however, that decides on specific optimizations and generates the query plan.

Our SQL parser and typechecker are actually BSD3-licensed, if that's of interest: https://github.com/jakewheat/hssqlppp


> www.sqream.com

I don't like having to be that guy, but your landing page hijacks scrolling (which always causes problems; I noticed first because it broke ctrl-scroll zooming), downloads a 30MB video in a loop, forever, takes a significant amount of CPU even when the video isn't in view and the tab is not active, and despite having almost no content manages to take far more memory than any other non-email tab I have open.


Our website is... Yeah... I know... :/

I've passed on your comments in any case, but know that we're in the process of rebuilding the website from scratch.


Any chance SQream DB will be open-sourced, or that a demo version will be released? It seems every column-based GPU DB I find is behind a paywall and targeting the enterprise level.


I think it's unlikely at this point...

We too are currently in the middle of some large projects for enterprises...

We will be releasing a community version on AWS and Azure in the near future which should be cheap or free, other than the instance cost on the respective cloud provider.


If you want more generic SQL tooling in Haskell, my project was just recently open sourced (https://github.com/uber/queryparser). It currently has support for Vertica, Hive, and Presto; adding support for more dialects isn't complicated. I'm working on cleaning it up for hackage.


Very interesting! Do you do any sort of typechecking in this?

FWIW, HsSqlPpp also supports various dialects. In SQream DB we use a custom dialect, while HsSqlPpp is mostly Postgres.


Typechecking of the SQL statements? There was an internal branch where I was experimenting with that, but it didn't get landed before I left, and isn't in the squashed version.

It was relatively simplistic - along the lines of "things tested for equality must share a type". It was also focused on domain-relevant types, not SQL types (/representations).
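Something like this toy sketch of such a rule, presumably (all names hypothetical, not from the actual branch):

    -- A toy rendering of "things tested for equality must share a type".
    data SqlType = SqlInt | SqlText | SqlBool
      deriving (Eq, Show)

    checkEq :: SqlType -> SqlType -> Either String SqlType
    checkEq l r
      | l == r    = Right SqlBool   -- a well-typed comparison yields a boolean
      | otherwise = Left ("type mismatch in equality: "
                          ++ show l ++ " vs " ++ show r)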

> FWIW, HsSqlPpp also supports various dialects. In SQream DB we use a custom dialect, while HsSqlPpp is mostly Postgres.

Nice. It'll be interesting to see where our implementation choices differed, and what we can learn from each other :)


> but it's not that different from what you'd get with other programming languages

Does your list of other programming languages include C/C++/asm?

;)


Incidentally, yes.



