Hacker News
ClangQL: A tool to run SQL-like query on C/C++ Code (github.com/amrdeveloper)
134 points by bubblehack3r 5 months ago | hide | past | favorite | 34 comments



I made a similar tool with the same name a couple of years ago :D https://github.com/frabert/ClangQL


Yours looks much better to be honest.


Yours actually looks useful for my use case (remote debugging wrappers)

The schema of the linked post doesn't look useful to me at all haha


I'm not too familiar with Rust, but took a look at the Cargo.lock file of this project.

It depends on half the universe for some reason, sometimes multiple versions of the same library. And that doesn't even include the main dependency which is libclang.

Is the Rust ecosystem just dependency hell?


There are a couple of aspects that can make lockfiles deceptive to look at:

- Lockfiles include dependencies for all platforms, not just the ones you build with, including wasm and Hermit kernel

- Due to some quirks, lockfiles include more dependencies than will ever be used (related to a feature called "weak dependency features").

For a more accurate picture, look at `cargo tree` and `cargo tree --target=all`.

In this case, a lot of it is that the GitQL SDK depends on gix (gitoxide), a Rust re-implementation of git / libgit, which is made up of a lot of crates and depends on a lot of crates. This dependency should likely not be there and should be fixed. Even with `default-features = false`, gix pulls in a lot. Gix should not be relevant to a more generic SDK.

The other thing that stood out to me as unnecessary is that comfy-table pulls in crossterm, which has far too large a dependency tree for what it is doing.

I am surprised at how much `chrono` pulls in but almost all of that is for specific platforms and `chrono` shrinks dramatically when looking only at the native platform.

I do see some stray dependencies that have built-in equivalents (lazy_static -> OnceCell, atty -> IsTerminal). Some of that is due to compatibility with old Rust versions. There were only a couple of these.
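For what it's worth, the std replacements mentioned above look roughly like this (a minimal sketch; `OnceLock` is the thread-safe sibling of `OnceCell`, and both it and `IsTerminal` are stable since Rust 1.70):

```rust
// Replacing lazy_static and atty with std equivalents.
use std::io::{stdout, IsTerminal};
use std::sync::OnceLock;

// lazy_static! { static ref CONFIG: String = ...; } becomes:
static CONFIG: OnceLock<String> = OnceLock::new();

fn config() -> &'static str {
    CONFIG.get_or_init(|| "default".to_string())
}

fn main() {
    // atty::is(atty::Stream::Stdout) becomes:
    let interactive = stdout().is_terminal();
    println!("config = {}, tty = {}", config(), interactive);
}
```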


> Is the Rust ecosystem just dependency hell?

Dependency hell is a consequence of the developer's choices, not the ecosystem. The most foundational packages in any ecosystem for any language almost always have a trivial number of dependencies, and not because those packages have no needs.


> Is the Rust ecosystem just dependency hell?

Not quite to the extent of the JS ecosystem, but yes. Especially for a purported systems language, there's a leftpad-esque problem of people making tiny and somewhat useless libraries to learn or pad their resume, which then get depended on by the world.


> sometimes multiple versions of the same library

If foo needs bar at ^1, baz needs bar at ^2, and you need both foo and baz, then your compile would fail if multiple versions weren't allowed. In other words, this is a feature, not a bug.
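The ^1 vs ^2 situation can be made concrete with a toy caret-compatibility check (a simplified sketch that only compares major versions, ignoring semver's special rules for 0.x):

```rust
// Simplified caret semver: "^N" accepts any version whose major is N.
// (Real semver treats 0.x requirements more strictly; ignored here.)
fn caret_matches(req_major: u32, version: (u32, u32, u32)) -> bool {
    version.0 == req_major
}

fn main() {
    let bar_v1 = (1, 4, 2); // a bar 1.x release
    let bar_v2 = (2, 0, 1); // a bar 2.x release

    // foo requires bar ^1, baz requires bar ^2: no single version
    // satisfies both, so Cargo compiles both into the build.
    assert!(caret_matches(1, bar_v1) && !caret_matches(2, bar_v1));
    assert!(caret_matches(2, bar_v2) && !caret_matches(1, bar_v2));
    println!("two incompatible majors -> two copies in the lockfile");
}
```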

> Is the Rust ecosystem just dependency hell?

No, because "dependency hell" refers to being unable to resolve dependencies at all, which happens in ecosystems that can't handle simple situations like the one above.

Also, the compilation unit in Rust is currently a single crate. If you had a lock file in C that listed every .o file generated by all your dependencies, it would be massive too!


Dependency hell means having many dependencies that are related to each other in non-trivial ways, making it difficult to ensure a compatible set of versions is maintained.


My understanding is that a "dependency hell" first has to be a hell.


I think it's an inevitable outcome of a build system that makes it so easy to pull in packages - and so easy to create and publish them.

This is why C++'s weakness - difficulty of consuming third party libraries - is actually a strength. If you have to work hard to get third party code in there, you tend to make much better choices and keep your dependencies to a minimum.

In Rust, like JS and to a lesser extent Python, there is no pressure to reduce dependencies. So you end up with this kind of problem. Good luck upgrading one of those packages when it's found to contain a bug.


It's not difficult to consume third party C++ libraries -- what's difficult is to ensure third party code is maintained up to your standards and compatible with all the other choices you've made about your environment.


It does appear to be heading slowly down the JS npm route.


Cool to see another query language for source code! Yours is definitely closer to SQL than GritQL is.[0] I particularly like the count semantics.

[0] https://github.com/getgrit/gritql


Bit of a tangent, but if we commonly stored source in a standard, fully attributed AST instead of caveman text, caching/indexing and deterministic search would be a breeze. So would a million other useful applications such as every other IDE feature and source control.

As the decades go by it confounds me that we are still messing with plain text, and with virtually no resistance to it. Maybe it's to keep the tabs vs. spaces flame war alive. _shrug_


I see this comment all the time, but it falls apart when you try to actually implement IDE features.

Fundamentally, an abstract syntax tree is a canonical representation of a program in one grammar. You cannot standardize it, because different languages have different grammars, and therefore different abstract syntax trees.

You also don't want an abstract syntax tree for IDE features at all, you want a concrete (or full) syntax tree - since ASTs are lossy and don't represent the text. Another barrier is that constructing an AST usually requires the input to be correct and complete, but most of the time, source code is incomplete and invalid. It doesn't make sense to cache the AST because it is invalidated nearly every key stroke - you want to be able to construct an AST for performing the queries you need as fast as possible in the context of invalid input.

And finally, you still need to have the source text to display anything useful to the user. So you're not saving any work by using a cached AST, you're not saving memory or disk, and you still need the plaintext anyway.

All that said, there is such a thing as the language server index format (LSIF) which LSPs can provide as a cache to clients that support LSP for performing queries on rarely invalidated source (eg dependencies) but it's not super useful for code in flight.
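The lossiness point above can be illustrated with a toy example (an assumed mini-language of integer literals and addition, not any real compiler's AST): once parsed, comments and spacing are gone, so the original text cannot be reconstructed from the tree.

```rust
// Toy AST: comments and whitespace are discarded during parsing,
// so pretty-printing the tree cannot reproduce the original source.
#[derive(Debug)]
enum Ast {
    Num(i64),
    Add(Box<Ast>, Box<Ast>),
}

fn pretty(ast: &Ast) -> String {
    match ast {
        Ast::Num(n) => n.to_string(),
        Ast::Add(l, r) => format!("{} + {}", pretty(l), pretty(r)),
    }
}

fn main() {
    // Original source was: "1   +  2 /* answer */"
    // After parsing, only the structure survives:
    let ast = Ast::Add(Box::new(Ast::Num(1)), Box::new(Ast::Num(2)));
    // The comment and the exact spacing are lost for good.
    println!("{}", pretty(&ast));
}
```

This is why editors that need to preserve formatting work with concrete syntax trees instead.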


> different languages have different grammars

Obviously. It should be understood that the idea is to have a standard AST API/format _per language_. As such it would be standard practice to provide this information as part of the parser/compiler library.


This is such an obvious and old idea that it would have killed text as source code by now if it were actually a *good* idea ;) (IIRC one attempt was an IDE from IBM in the 90s).


One reason it hasn’t gotten traction is due to a “herding cats” problem - the breadth of existing languages and SCMs in use is too big to manage. But mostly it’s the business model, who’s going to invest the vast amount of time for this when there is virtually nothing in it for them, it’s a giant onion. There’s little money in tooling, there’s far less in metatooling.

There’s also a bit of “squeaky wheel” resistance. Still a significant minority clinging to text editors as opposed to modern IDEs. But given the demographics here I imagine this segment will begin to shrink dramatically within a decade or so. Shrug.


tbh the on-disk storage format is pretty irrelevant, it's all just bits and bytes on disk anyway. For many cases you can consider source text a serialization of the AST; no matter how you store it, you still need to parse the data somehow. Sure, some formats are easier to parse than others (hello sexprs), but that is just a minor perf aspect; conceptually it's all the same.

This closely relates to the missing-type discussion: there is no universal "tree" type, and as such it is also difficult to construct any universal "ast" type; different uses need trees structured in different ways: https://news.ycombinator.com/item?id=39592444


> tbh the on-disk storage format is pretty irrelevant, it’s all just bits…

Nonsense. A completely attributed AST is the result of a full parse. The information in the AST is vastly richer than source.

> there is no universal tree type

Obviously. It should be understood that the idea is to have a standard AST API/format _per language_. As such it would be standard practice to provide this information as part of the parser/compiler library.


> As the decades go by it confounds me that we are still messing with plain text

As the decades go by, it confounds me that people have yet to understand the concept of simplicity.


I understand your position that raw source is a simpler format. But for professional IDEs, is it simpler to throw out the parsed information, save only raw text changes, and force static tooling/IDEs to maintain separate caches and indexes of AST models? Not to mention the out-of-sync issues caused by this separation.

Is it simpler for source control managers like git to only understand text and make everyday changes such as rename refactors cascade across entire codebases and cause unnecessary, costly merge conflicts?

There are countless examples where AST-as-source _minimizes_ the complexities currently foisted on developers both directly and indirectly through tooling inefficiencies and other issues rooted in having multiple sources of truth.


As cool as it looks, I have to ask: why?


How else are you supposed to fluff up your performance self review with real numbers when you can't use lines of code?


One obvious example would be refactoring. Many patterns are not (easily) expressible as regexes.



Wouldn't you use tools like clang-tidy for that use case? Wouldn't it be more flexible and general? Also, does this project let you rewrite code, or only run queries?


Ah, so this is useful for IDE tooling, and I guess for someone grokking a codebase to see how much impact a change has. Yeah, makes sense.


Maaaan I've wanted this for a while for Java and Dart (for flutter apps). Nice job!


Why no line:col info?


Do you have any binaries? Stock Ubuntu 22.04 does not have new enough versions of things to build this.


I wonder if we could train a foundation model on this data, which would eventually let us search the codebase semantically?


Wonder how it deals with templates



