>But if our performance knowledge is outdated, how do we know which cases to separate the code into?
Case separation is less of a commitment than deciding on a specific optimization yourself. You know empirically which cases your application uses, and you can separate those out to see if there's a benefit without any manual optimization knowledge, like I did in the image scaling example, without having much of a theory in mind as to why it might be faster.
The "tiny math function" is there for expository purposes, to fit a trivial example of the general approach on one page. Obviously it's not how you would literally write a function that transforms elements in a vector.
>GCC is actually good at that, -O3 has no issue recognizing it.
imo, if you have to go to O3 or enable a pragma to get an "obvious" optimization then this is undesirable and probably something the programmer still needs to be conscious of, especially since we spend so much time doing development builds that are not at the maximum release optimization level.
"Misaligned" would be a better word than "obvious". It is well known and understood that a higher optimization level means more compilation time, potential speed gains, and more subtle breakage in the presence of UB; being the maximum level, -O3 also implies a potential binary size increase. In that light I believe `-funswitch-loops` is correctly placed at -O3. But ideally we want less compilation time, potential speed gains, less subtle breakage, and an insignificant binary size increase all at once, and you can argue that all existing optimization levels are far from that ideal.
Addressing the aliasing concern would be the easiest improvement. I observed in the assembly that the source pixel is being re-read all four times it is used, which could be fixed.
Writing an optimal composite function is of course not really the goal, nor of much educational/entertainment value. For any additional speed I already have a function which slices up compositing tasks into chunks and puts them on a thread pool.
This is possible if the call site can see the implementation, but you can't count on it for separate translation units or larger functions.
My goal was to not rely on site-specific optimization and instead have one separately compiled function body that can be improved for common cases. Certainly, once the compiler has a full view of everything it can take advantage of information as it pleases but this is less controllable. If I were really picky about optimizing for each use I would make it a template.
>Doesn't even need the `inline` keyword for this at `-O2`
The inline keyword means little in terms of actually causing inlining to happen. I would expect that most of the inlining compilers do happens automatically, on functions that lack the "inline" keyword. Conversely, programmers probably add "inline" as an incantation all over the place, not knowing that compilers often ignore it.
>Conversely, programmers probably add "inline" as an incantation all over the place not knowing that compilers often ignore it.
Funnily enough, the inline keyword actually has a use, but that use isn't to tell the compiler to inline a function. The use is to allow a function (or variable) to be defined in multiple translation units without being an ODR violation.
MSVC treats inline as a hint [0], GCC's documentation is ambiguous [1] but I read it as utilising the hint. My understanding of clang [2] is that it tries to match GCC, which would imply it treats it as a hint too.
What the OP is saying is that it has a use in satisfying the one-definition rule (https://en.cppreference.com/w/cpp/language/definition), and in that case it is not optional and not ignored by the compiler. Whether the compiler will "actually" inline the function is another matter; that part is optional.
Would it have made a difference if the function were static? The compiler would then be able to deduce that it isn't used anywhere else, and thus could do this inlining optimization.
TL;DR: If you copy paste the same implementation code in different branches, you give the compiler opportunities to generate faster code for each case it wouldn't have otherwise generated, without you having to do any manual optimization work.
This is really an excellent programming technique article.
I think what makes it so great is that your examples hit a sweet spot of simplicity while still being motivating and you show how to get a "good enough" solution very quickly.
You need to use a compiler specific "always inline" directive if you want macro-like functionality of actually inlining code.
On its own, the C++ "inline" keyword does not cause inlining to happen, although compilers may treat it as a hint depending on optimization level. GCC does not inline an "inline" function at -O0.
"Inline" in the C++ standard means "multiple definitions permitted", so the same entity can exist in multiple translation units without upsetting the linker. This is why C++17 added "inline" variables, which can be initialized in a header that's included in multiple places, even though the inlining concept has no applicability to a variable. The keyword was reused for this purpose because its practical effect had always been on linkage behavior rather than on inlining itself.
This isn’t an optimization you should consider without insight into your actual bottlenecks, but compilers can be even more aggressive with code that’s strictly local to one translation unit (i.e. inside an anonymous namespace in a cpp file) than they typically would be when seeing an inline hint elsewhere.
It’s not quite the same.
Plus, another benefit of duplication is that you can more freely hand-tune your implementation once you’ve decided its private. Memory alignment, pointer aliasing hints, clever loop structures, SSE stuff, etc can all be used more freely when you know nothing else needs to use this version.
The article is a good teaser around how unintuitive optimization can be, but it only scratches the surface.
With SQLite by default nobody else can even read the database while it's being written, so your comment would be better directed to a conventional database server.
>Parsing the very simple SQL doesn't seem like it would account for much time, so is the extra time spent redoing query planning or something else?
If you're inserting one million rows, even 5 microseconds of parse and planning time per query is five extra seconds on a job that could be done in half a second.
Depends on the complexity of the query, but... typically planning (speaking for MSSQL; I'm not familiar with SQLite).
Parsing is linear, planning gets exponential very quickly. It's got to consider data distribution, presence or not of indexes, output ordering, presence of foreign keys, uniqueness, and lots more I can't think of right now.
So, planning is much heavier than parsing (for any non-trivial query).
A very simple statement may not be as simple as it looks to execute.
The table here is unconstrained (from the article "CREATE TABLE integers (value INTEGER);") but suppose that table had a primary key and a foreign key – a simple insertion of a single value into such a table would look trivial but consider what the query plan would look like as it verifies the PK and FK aren't violated. And maybe there are a couple of indexes to be updated as well. And a check constraint or three. Suddenly a simple INSERT of a literal value becomes quite involved under the skin.
(edit: and you can add possible triggers on the table)