
I've used both Code Search and Livegrep. No, Livegrep does not even come close to what Code Search can do.

Sourcegraph is the closest thing I know of.




Agreed. There are some public building blocks available (e.g. Kythe or Meta's Glean), but having something generic that produces the kind of experience you can get on cs.chromium.org seems impossible. You need such bespoke build integration across an entire organization to get there.

Basic text search, as opposed to navigation, is all you'll get from anything out of the box.


In a past job I built a code search clone on top of Kythe, Zoekt and LSP (for languages that didn't have bazel integration). I got help from another colleague to make the UI based on Monaco. We created a demo that many people loved, but we didn't productionize it for a few reasons (it was an unfunded hackathon project, and the company was considering another solution when it already had Livegrep).

Producing the Kythe graph from the bazel artifacts was the most expensive part.

Working with Kythe is also not easy as there is no documentation on how to run it at scale.


Very cool. I tried to do things with Kythe at $JOB in the past, but gave up because the build (really, the many many independent builds) precluded any really useful integration.

I did end up making a nice UI for vanilla Zoekt, as I mentioned elsewhere: https://github.com/isker/neogrok.


I see most replies here mentioning that the build integration is what is mainly missing in the public tools. I wonder if Nix and nixpkgs could be used here? Nix is a language-agnostic build system, and with nixpkgs it has build instructions for a massive number of packages. Artifacts for all packages are also available via Hydra.

Nix should also have enough context so that for any project it can get the source code of all dependencies and (optionally) all build-time dependencies.


Build integration is not the main thing that is missing between Livegrep and Code Search. The main thing that is missing is the semantic index. Kythe knows the difference between this::fn(int) and this::fn(double) and that::fn(double) and so on. So you can find all the callers of the nullary constructor of some class, without false positives of the callers of the copy constructor or the move constructor. Livegrep simply doesn't have that ability at all. Livegrep is what it says it is on the box: grep.
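To make that concrete, here is a minimal C++ sketch (hypothetical class and function names, not taken from any of these tools) of the ambiguity a plain text index can't resolve:

    // sketch.cc (hypothetical example)
    #include <utility>

    class Widget {
     public:
      Widget() {}               // nullary constructor
      Widget(const Widget&) {}  // copy constructor
      Widget(Widget&&) {}       // move constructor
      void fn(int) {}           // Widget::fn(int)
      void fn(double) {}        // Widget::fn(double)
    };

    void use(Widget& w, Widget other) {
      Widget a;                    // call site of the nullary constructor
      Widget b(other);             // call site of the copy constructor
      Widget c(std::move(other));  // call site of the move constructor
      w.fn(1);                     // binds to Widget::fn(int)
      w.fn(1.0);                   // binds to Widget::fn(double)
    }

Grepping for "Widget" or "fn(" matches all of those lines indiscriminately; a compiler-backed index records which declaration each call site actually resolves to, so "find callers of the nullary constructor" returns only the first line of use().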


The build system coherence provided by a monorepo with a single build system is what makes you understand this::fn(double) as a single thing. Otherwise, you will get N different mostly compatible but subtly different flavors of entities depending on the build flavor, combinations of versioned dependencies, and other things.
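A tiny sketch (hypothetical code, assuming a flag-controlled build) of what those subtly different flavors look like:

    // request.h (hypothetical example)
    struct Request {
      int id;
    #ifdef ENABLE_TRACING
      const char* trace_id;  // only present in the tracing build flavor
    #endif
    };

    // Under -DENABLE_TRACING, Request is a different entity (different
    // members, different layout) than in the default build. Without one
    // coherent build configuration, an indexer has to pick a flavor, or
    // index several, and cross-references stop being unambiguous.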


Sure. Also, if you eat a bunch of glass, you will get a stomach ache. I have no idea why anyone uses a polyrepo.


The problem with monorepos is that they're so great that everyone has a few.


God that is good.


Nix builds suck for development because there is no incrementality there. Any source file changes in any way, and your typical nix flake will rebuild the project from scratch. At best, you get to reuse builds of dependencies.


Is there like a summary of what's missing from public attempts and what makes it so much better?


The short answer is context. The reason Google's internal code search is so good is that it is tied into their build system. This means that when you search, you know exactly what files to consider. Without context, you are making an educated guess about which files to consider.


How exactly does integration with the build system help Google? Maybe you could give a specific example?


Try clicking around https://source.chromium.org/chromium/chromium/src, which is built with Kythe (I believe, or perhaps it's using something internal to Google that Kythe is the open source version of).

By hooking into C++ compilation, Kythe is giving you things like _macro-aware_ navigation. Instead of trying to process raw source text off to the side, it's using the same data the compiler used to compile the code in the first place. So things like cross-references are "perfect", with no false positives in the results: Kythe knows the difference between two symbols in two different source files with the same name, whereas a search engine naively indexing source text, or even something with limited semantic knowledge like tree sitter, cannot perfectly make the distinction.
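A small, hypothetical example of why macro awareness matters:

    // handlers.cc (hypothetical example)
    #define DEFINE_HANDLER(name) void Handle##name(int code)

    DEFINE_HANDLER(Timeout) { /* ... */ }  // defines HandleTimeout(int)
    DEFINE_HANDLER(Refused) { /* ... */ }  // defines HandleRefused(int)

    void Dispatch(int code) {
      HandleTimeout(code);  // grep finds this call site, but there is no
                            // textual "HandleTimeout" at the definition; an
                            // index built from the compiler's view maps it
                            // back to DEFINE_HANDLER(Timeout) above.
    }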


Yes, the semantic links you click around on source.chromium.org are served off an index built by the Kythe team at Google.

The internal Kythe has some interesting bits (mostly around scaling) that aren't open sourced, but it's probably doable to run something at Chromium scale without too much of that.

The grep/search box up top is a different index, maintained by a different team.


If you want to build a product with a build system, you need to tell it what source to include. With this information, you know what files to consider, and if you are dealing with a statically typed language like C or C++, you have build artifacts that can tell you where the implementation was defined. All of this takes the guesswork out of answering questions like "What foo() implementation was used?" (sketched at the end of this comment).

If all you know are repo branches, the best you can do is return matches from different repo branches with the hopes that one of them is right.

Edit: I should also add that with a build system, you know what version of a file to use.
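For a sense of what that buys you, a minimal sketch with hypothetical file names:

    // foo.h
    int foo();

    // foo_portable.cc: compiled into the default target
    int foo() { return 1; }

    // foo_simd.cc: compiled instead when a build flag selects the SIMD target
    int foo() { return 2; }

    // main.cc
    #include <cstdio>
    #include "foo.h"
    int main() { std::printf("%d\n", foo()); }

A repo-wide grep for "int foo()" returns both definitions; only the build graph knows which .cc file actually gets compiled and linked into this binary, so only a build-aware index can answer "jump to the foo() that main() calls" without false positives.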


Google builds all the code in its monorepo continuously, and the built artifacts are available to the search. Open source tools are never going to incur the cost of actually building all the code they index.


The short summary is: It's a suite of stuff that someone actually thought about making work together well, instead of a random assortment of pieces that, with tons of work, might be able to be cobbled together into a working system.

All the answers about the technical details, or which is better or worse, mostly miss the point entirely: the public stuff doesn't work as well because it's 1000 providers who produce 1000 pieces that trade product coherence for integration flexibility. On purpose, mind you, because it's hard to survive in business (or attract open source users, if that's your thing) otherwise.

If you are trying to do something like make "code review" and "code search" work together well, it's a lot easier to build a coherent, easy-to-use system that feels good to a user when you only have two things total to make work together, and the product management teams talk directly to each other.

Most open source doesn't have product management to begin with, and the corporate stuff often does but that's just one provider.

They also have a matrix of, generously, 10-20 tools with meaningful marketshare they might need to try to work with.

So if you are a code search provider trying to make a code search tool integrate well with any of the top 20 code review tools, well, good luck.

Sometimes people come along and do a good enough job abstracting a problem that you can make this work (LSP is a good example), but it's pretty rare.

Now try it with "discover, search, edit, build, test, release, deploy, debug", etc. Once you are talking about 10x10x10x10x10x10x10x10 combinations of possible tools, with nobody who gets to decide which combinations are the well lit path, ...

Also, when you work somewhere like Google or Amazon, it's not just that someone made those specific things work really well together, but often, they have both data and insight into where you get stuck overall in the dev process and why (so they can fix it).

At a place like Google, I can actually tell you all the paths that people take when trying to complete a journey. That means I know all the loops (counts, times, etc.) through development tools that start with something like "user opens their editor". Whether that's "open editor, make change, build, test, review, submit" or "open editor, make change, go to lunch", or "open editor, go look at docs, go back to editor, go back to docs, etc."

So I have real answers to something like "how often do people start in their IDE, discover they can't figure out how to do X, leave the IDE to go find the answer, not find it, give up, and go to lunch". I can tell you the top Xs where that happens, and how much time is or is not wasted through this path, etc.

Just as an example. I can then use all of this to improve the tooling so users can get more done.

You will not find this in most public tooling, and to the degree telemetry exists that you could generate for your own use, nobody thinks about how all that telemetry works together.

Now, mind you, all the above is meant as an explanation: I'm trying to explain why the public attempts don't end up as "good". But for me, good/bad is all about what you value.

Most tradeoffs here were deliberate.

But they are tradeoffs.

Some people value the flexibility more than the coherence, or whatever. I'm not gonna judge them, but I can explain why you can't have it all :)



