
I've used both Code Search and Livegrep. No, Livegrep does not even come close to what Code Search can do.

Sourcegraph is the closest thing I know of.




Agreed. There are some public building blocks available (e.g. Kythe or Meta's Glean), but having something generic that produces the kind of experience you can get on cs.chromium.org seems impossible. You need such bespoke build integration across an entire organization to get there.

Basic text search, as opposed to navigation, is all you'll get from anything out of the box.


In a past job I built a code search clone on top of Kythe, Zoekt and LSP (for languages that didn't have bazel integration). I got help from another colleague to make the UI based on Monaco. We created a demo that many people loved, but we didn't productionize it for a few reasons (it was an unfunded hackathon project, and the company was considering another solution when it already had Livegrep).

Producing the Kythe graph from the bazel artifacts was the most expensive part.

Working with Kythe is also not easy as there is no documentation on how to run it at scale.


Very cool. I tried to do things with Kythe at $JOB in the past, but gave up because the build (really, the many many independent builds) precluded any really useful integration.

I did end up making a nice UI for vanilla Zoekt, as I mentioned elsewhere: https://github.com/isker/neogrok.


I see most replies here mentioning that the build integration is what is mainly missing in the public tools. I wonder if Nix and nixpkgs could be used here? Nix is a language-agnostic build system, and with nixpkgs it has build instructions for a massive number of packages. Artifacts for all packages are also available via Hydra.

Nix should also have enough context so that for any project it can get the source code of all dependencies and (optionally) all build-time dependencies.


Build integration is not the main thing that is missing between Livegrep and Code Search. The main thing that is missing is the semantic index. Kythe knows the difference between this::fn(int) and this::fn(double) and that::fn(double) and so on. So you can find all the callers of the nullary constructor of some class, without false positives of the callers of the copy constructor or the move constructor. Livegrep simply doesn't have that ability at all. Livegrep is what it says it is on the box: grep.
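To make that concrete, here is a minimal C++ sketch (hypothetical class and function names, not taken from any of these tools) of the ambiguity a plain text index can't resolve:

    // sketch.cc (hypothetical example)
    #include <utility>

    class Widget {
     public:
      Widget() {}               // nullary constructor
      Widget(const Widget&) {}  // copy constructor
      Widget(Widget&&) {}       // move constructor
      void fn(int) {}           // Widget::fn(int)
      void fn(double) {}        // Widget::fn(double)
    };

    void use(Widget& w, Widget other) {
      Widget a;                    // call site of the nullary constructor
      Widget b(other);             // call site of the copy constructor
      Widget c(std::move(other));  // call site of the move constructor
      w.fn(1);                     // binds to Widget::fn(int)
      w.fn(1.0);                   // binds to Widget::fn(double)
    }

Grepping for "Widget" or "fn(" matches all of those lines indiscriminately; a compiler-backed index records which declaration each call site actually resolves to, so "find callers of the nullary constructor" returns only the first line of use().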


The build system coherence provided by a monorepo with a single build system is what makes you understand this::fn(double) as a single thing. Otherwise, you will get N different mostly compatible but subtly different flavors of entities depending on the build flavor, combinations of versioned dependencies, and other things.
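A tiny sketch (hypothetical code, assuming a flag-controlled build) of what those subtly different flavors look like:

    // request.h (hypothetical example)
    struct Request {
      int id;
    #ifdef ENABLE_TRACING
      const char* trace_id;  // only present in the tracing build flavor
    #endif
    };

    // Under -DENABLE_TRACING, Request is a different entity (different
    // members, different layout) than in the default build. Without one
    // coherent build configuration, an indexer has to pick a flavor, or
    // index several, and cross-references stop being unambiguous.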


Sure. Also, if you eat a bunch of glass, you will get a stomach ache. I have no idea why anyone uses a polyrepo.


The problem with monorepos is that they're so great that everyone has a few.


God that is good.


Nix builds suck for development because there is no incrementality there. Any source file changes in any way, and your typical nix flake will rebuild the project from scratch. At best, you get to reuse builds of dependencies.


Is there like a summary of what's missing from public attempts and what makes it so much better?


The short answer is context. The reason Google's internal code search is so good is that it is tied into their build system. This means that when you search, you know exactly what files to consider. Without context, you are making an educated guess about which files to consider.


How exactly does integration with the build system help Google? Maybe you could give a specific example?


Try clicking around https://source.chromium.org/chromium/chromium/src, which is built with Kythe (I believe, or perhaps it's using something internal to Google that Kythe is the open source version of).

By hooking into C++ compilation, Kythe is giving you things like _macro-aware_ navigation. Instead of trying to process raw source text off to the side, it's using the same data the compiler used to compile the code in the first place. So things like cross-references are "perfect", with no false positives in the results: Kythe knows the difference between two symbols in two different source files with the same name, whereas a search engine naively indexing source text, or even something with limited semantic knowledge like tree sitter, cannot perfectly make the distinction.
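A small, hypothetical example of why macro awareness matters:

    // handlers.cc (hypothetical example)
    #define DEFINE_HANDLER(name) void Handle##name(int code)

    DEFINE_HANDLER(Timeout) { /* ... */ }  // defines HandleTimeout(int)
    DEFINE_HANDLER(Refused) { /* ... */ }  // defines HandleRefused(int)

    void Dispatch(int code) {
      HandleTimeout(code);  // grep finds this call site, but there is no
                            // textual "HandleTimeout" at the definition; an
                            // index built from the compiler's view maps it
                            // back to DEFINE_HANDLER(Timeout) above.
    }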


Yes, the semantic links you click around on source.chromium.org are served off an index built by the Kythe team at Google.

The internal Kythe has some interesting bits (mostly around scaling) that aren't open sourced, but it's probably doable to run something at Chromium scale without too much of that.

The grep/search box up top is a different index, maintained by a different team.


If you want to build a product with a build system, you need to tell it what source to include. With this information, you know what files to consider, and if you are dealing with a statically typed language like C or C++, you have build artifacts that can tell you where the implementation was defined. All of this takes the guesswork out of answering questions like "What foo() implementation was used?" (sketched at the end of this comment).

If all you know are repo branches, the best you can do is return matches from different repo branches with the hopes that one of them is right.

Edit: I should also add that with a build system, you know what version of a file to use.
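For a sense of what that buys you, a minimal sketch with hypothetical file names:

    // foo.h
    int foo();

    // foo_portable.cc: compiled into the default target
    int foo() { return 1; }

    // foo_simd.cc: compiled instead when a build flag selects the SIMD target
    int foo() { return 2; }

    // main.cc
    #include <cstdio>
    #include "foo.h"
    int main() { std::printf("%d\n", foo()); }

A repo-wide grep for "int foo()" returns both definitions; only the build graph knows which .cc file actually gets compiled and linked into this binary, so only a build-aware index can answer "jump to the foo() that main() calls" without false positives.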


Google builds all the code in its monorepo continuously, and the built artifacts are available to the search. Open source tools are never going to incur the cost of actually building all the code they index.


The short summary is: It's a suite of stuff that someone actually thought about making work together well, instead of a random assortment of pieces that, with tons of work, might be able to be cobbled together into a working system.

All the answers about the technical details, or which is better or worse, mostly miss the point entirely: the public stuff doesn't work as well because it's 1000 providers who produce 1000 pieces that trade product coherence for integration flexibility. On purpose, mind you, because it's hard to survive in business (or attract open source users, if that's your thing) otherwise.

If you are trying to do something like make "code review" and "code search" work together well, it's a lot easier to build a coherent, easy-to-use system that feels good to a user when you only have two things total to make work together, and the product management teams talk directly to each other.

Most open source doesn't have product management to begin with, and the corporate stuff often does but that's just one provider.

They also have a matrix of, generously, 10-20 tools with meaningful marketshare they might need to try to work with.

So if you are a code search provider trying to make a code search tool integrate well with any of the top 20 code review tools, well, good luck.

Sometimes people come along and do a good enough job abstracting a problem that you can make this work (LSP is a good example), but it's pretty rare.

Now try it with "discover, search, edit, build, test, release, deploy, debug", etc. Once you are talking about 10x10x10x10x10x10x10x10 combinations of possible tools, with nobody who gets to decide which combinations are the well lit path, ...

Also, when you work somewhere like Google or Amazon, it's not just that someone made those specific things work really well together, but often, they have both data and insight into where you get stuck overall in the dev process and why (so they can fix it).

At a place like Google, I can actually tell you all the paths that people take when trying to complete a journey. That means I know all the loops (counts, times, etc.) through development tools that start with something like "user opens their editor". Whether that's "open editor, make change, build, test, review, submit" or "open editor, make change, go to lunch", or "open editor, go look at docs, go back to editor, go back to docs, etc."

So I have real answers to something like "how often do people start in their IDE, discover they can't figure out how to do X, leave the IDE to go find the answer, not find it, give up, and go to lunch". I can tell you the top Xs where that happens, and how much time is or is not wasted through this path, etc.

Just as an example. I can then use all of this to improve the tooling so users can get more done.

You will not find this in most public tooling, and to the degree telemetry exists that you could generate for your own use, nobody thinks about how all that telemetry works together.

Now, mind you, all the above is meant as an explanation: I'm trying to explain why the public attempts don't end up as "good". But for me, good/bad is all about what you value.

Most tradeoffs here were deliberate.

But they are tradeoffs.

Some people value the flexibility more than the coherence, or whatever. I'm not gonna judge them, but I can explain why you can't have it all :)



