>But if our performance knowledge is outdated, how do we know which cases to separate the code into?
Case separation is less of a commitment than deciding on a specific optimization yourself. You know empirically which cases your application uses, and you can separate those out to see if there's a benefit without any manual optimization knowledge, like I did in the image scaling example, without having much of a theory in mind as to why it might be faster.
The "tiny math function" is there for expository purposes, to fit a trivial example of the general approach on one page. Obviously it's not how you would literally write a function that transforms elements in a vector.
>GCC is actually good at that, -O3 has no issue recognizing it.
imo, if you have to go to O3 or enable a pragma to get an "obvious" optimization then this is undesirable and probably something the programmer still needs to be conscious of, especially since we spend so much time doing development builds that are not at the maximum release optimization level.
"Misaligned" would be a better word than "obvious". It is well known and understood that a higher optimization level means more compilation time, potential speed gains, and more subtle breakage in the presence of UB; being the maximum level, -O3 also implies a potential binary size increase. In that light I believe `-funswitch-loops` is correctly placed at -O3. But ideally we want less compilation time, potential speed gains, less subtle breakage, and an insignificant binary size increase all at once, and you can argue that all existing optimization levels are far from that ideal.
Addressing the aliasing concern would be the easiest improvement. I observed in the assembly that the source pixel is being re-read all four times it is used, which could be fixed.
Writing an optimal composite function is of course not really the goal, nor of much educational/entertainment value. For any additional speed I already have a function which slices up compositing tasks into chunks and puts them on a thread pool.
This is possible if the call site can see the implementation, but you can't count on it for separate translation units or larger functions.
My goal was to not rely on site-specific optimization and instead have one separately compiled function body that can be improved for common cases. Certainly, once the compiler has a full view of everything it can take advantage of information as it pleases but this is less controllable. If I were really picky about optimizing for each use I would make it a template.
>Doesn't even need the `inline` keyword for this at `-O2`
The inline keyword means little in terms of actually causing inlining to happen. I would expect that most of the inlining compilers do happens automatically, on functions that lack the "inline" keyword. Conversely, programmers probably add "inline" as an incantation all over the place, not knowing that compilers often ignore it.
>Conversely, programmers probably add "inline" as an incantation all over the place not knowing that compilers often ignore it.
Funnily enough, the inline keyword actually has a use, but that use isn't to tell the compiler to inline a function. The use is to allow a function (or variable) to be defined in multiple translation units without being an ODR violation.
MSVC treats inline as a hint [0], GCC's documentation is ambiguous [1] but I read it as utilising the hint. My understanding of clang [2] is that it tries to match GCC, which would imply it treats it as a hint too.
What the OP is saying is that it has a use in satisfying the one-definition rule (https://en.cppreference.com/w/cpp/language/definition), and in that case it is not optional and not ignored by the compiler. Whether the compiler will "actually" inline the function is another matter; that part is optional.
Would it have made a difference if the function were static? The compiler would then be able to deduce that it isn't used anywhere else, and thus could do this inlining optimization.
TL;DR: If you copy paste the same implementation code in different branches, you give the compiler opportunities to generate faster code for each case it wouldn't have otherwise generated, without you having to do any manual optimization work.
This is really an excellent programming technique article.
I think what makes it so great is that your examples hit a sweet spot of simplicity while still being motivating and you show how to get a "good enough" solution very quickly.
You need to use a compiler specific "always inline" directive if you want macro-like functionality of actually inlining code.
On its own, the C++ "inline" keyword does not cause inlining to happen, although compilers may treat it as a hint depending on optimization level. GCC does not inline an "inline" function at -O0.
"Inline" in the C++ standard means "multiple definitions permitted", so the same entity can exist in multiple translation units without upsetting the linker. This is why C++17 added "inline" variables, which can be initialized in a header that's included in multiple places, even though the inlining concept has no applicability to a variable. The keyword was reused for this purpose because its practical effect had always been on linkage behavior rather than on inlining itself.
This isn’t an optimization you should consider without insight into your actual bottlenecks, but compilers can be even more aggressive with code that’s strictly local to one translation unit (i.e. inside an anonymous namespace in a cpp file) than they typically would be when seeing an inline hint elsewhere.
It’s not quite the same.
Plus, another benefit of duplication is that you can more freely hand-tune your implementation once you’ve decided its private. Memory alignment, pointer aliasing hints, clever loop structures, SSE stuff, etc can all be used more freely when you know nothing else needs to use this version.
The article is a good teaser around how unintuitive optimization can be, but it only scratches the surface.
With SQLite by default nobody else can even read the database while it's being written, so your comment would be better directed to a conventional database server.
>Parsing the very simple SQL doesn't seem like it would account for much time, so is the extra time spent redoing query planning or something else?
If you're inserting one million rows, even 5 microseconds of parse and planning time per query is five extra seconds on a job that could be done in half a second.
Depends on the complexity of the query, but... typically planning (speaking for MSSQL; I'm not familiar with SQLite).
Parsing is linear, planning gets exponential very quickly. It's got to consider data distribution, presence or not of indexes, output ordering, presence of foreign keys, uniqueness, and lots more I can't think of right now.
So, planning is much heavier than parsing (for any non-trivial query).
A very simple statement may not be as simple as it looks to execute.
The table here is unconstrained (from the article "CREATE TABLE integers (value INTEGER);") but suppose that table had a primary key and a foreign key – a simple insertion of a single value into such a table would look trivial but consider what the query plan would look like as it verifies the PK and FK aren't violated. And maybe there are a couple of indexes to be updated as well. And a check constraint or three. Suddenly a simple INSERT of a literal value becomes quite involved under the skin.
(edit: and you can add possible triggers on the table)