Show HN: DontBeEvil.rip: Search, for developers (API, expressions, CLI)
269 points by alangibson on March 3, 2022 | 121 comments
I'd like to invite everyone to try out DontBeEvil.rip, an experimental search engine for developers.

tl;dr

$ alias rip="curl -G -H 'Accept: text/plain' --url https://dontbeevil.rip/search --data-urlencode "

$ rip 'q=Heartbleed bug'

DontBeEvil.rip is a year-long experiment to see if a small team can build a developer-focused search engine that is self-sustaining on $10 monthly subscriptions.

It works by only indexing high-quality resources that are relevant to developers. You won't get useless listicles because we'll never crawl them. Relevant urls are harvested from HN, StackOverflow, programmer Reddit, and a few others. Page content comes mostly from the Common Crawl project.

The limited, but awesome, features in this first release are:

- Expressions! Experience the power of Elasticsearch’s Simple Query Strings (see the example just after this list).

- REST API. Just change 'text/plain' to `application/json` in the above alias.

- CLI. Just use curl in the terminal. Simple as.
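
For instance, simple query string syntax supports quoted phrases plus + (must) and - (must not) operators, so, assuming the sanitizer leaves those operators intact, queries like these should work; the second one switches to the JSON API:

  $ rip 'q="memory leak" +rust -python'

  $ curl -G -H 'Accept: application/json' --url https://dontbeevil.rip/search --data-urlencode 'q=heartbleed' | jq .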

HackerNews, StackOverflow, Arxiv abstracts, 2M Github repos, and programmer Reddit (up to 2020) are being indexed right now. There's much more to come in the next few months.

I'd love to hear your questions, comments and suggestions in the comments below.



I love this idea.

This is mostly just raw data; it isn't that useful (yet).

Using curl directly in my terminal feels a bit dangerous from a security standpoint. I'd rather use my browser and be able to see the results as a JSON tree. Providing raw access to ES has a high risk-to-reward ratio.

The search results for https://dontbeevil.rip/search?q=python%20context%20manager%2... are non-topical hits on SO records. I even put the name of the python package (from stdlib no less) in the query string.

I was able to find what I needed on devdocs.io in less than 10 seconds.

https://devdocs.io/python~3.10/library/contextlib#contextlib...

In no way am I trying to discourage you, but until the basics are in place, search over arxiv abstracts is way less useful than just SO and docs (language and libraries).

I would recommend returning text/plain by default and .json if someone asks for it (in the url); not everyone can set headers. I'd also put an about page at the root of your site, plain text is fine.


Thanks for the feedback. I'm considering making searching for language docs a special use case, maybe even with its own index.

> Providing raw access to ES has a high risk-to-reward ratio.

I'm not quite that foolhardy :). The only thing that gets passed to ES is a sanitized simple_query_string.query. Should be relatively safe. I'm sure someone will prove me wrong though.
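
As a rough sketch, the backend builds nothing more than something along these lines (the index and field names here are illustrative, not the real schema):

  curl -s -H 'Content-Type: application/json' 'http://localhost:9200/pages/_search' -d '
  {
    "query": {
      "simple_query_string": {
        "query": "<sanitized user input>",
        "fields": ["title", "text"]
      }
    }
  }'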


All this curl script does is print strings in the terminal. Are you saying that printing strings is dangerous? I think you may be confusing it with executing the output of curl which is not what this script is doing.


Escape codes could be leveraged to RCE the terminal. That said, every CLI you install on your computer can do code execution and could potentially couple that with remote instructions. There are two vectors there: one where you don't trust the server and another where you don't trust the connection between the client and the server.


Which escape codes?


> one where you don't trust the server

And you can't trust the server here because it's publishing third-party content in the form of search results.

The risk/reward ratio mentioned above is too far toward risk for my liking.


What security implications does running curl have that wouldn't be present in a browser?


There have been instances of terminal vulnerabilities via terminal escape codes, as bad as an RCE in iterm2: https://blog.mozilla.org/security/2019/10/09/iterm2-critical.... I suppose the OP is thinking of something like that.


Yea, I was wondering about that, but the risk feels similar to a browser RCE to me. Maybe it's higher because browsers are more widely used/analyzed; but then again, a browser RCE has a much wider range of targets with more opportunities to exploit.


Even just having the potential for the terminal to interpret escape codes is frustrating. Always pipe remote output to `less` or `less -R` (not `less -r`).
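
For example, with the alias from the post (plain `less` renders escape sequences as visible caret notation rather than passing them to the terminal):

  rip 'q=Heartbleed bug' | less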


And this is exactly why I’m always playing the damp squid when people advocate for more features being supported via shell escape codes.


I’m wondering the same. You’re not piping them into a shell.


> Using curl directly in my terminal feels a bit dangerous from a security standpoint.

What?


Then don't use a privileged terminal to run curl.


I don't think that's a satisfactory mitigation. For example, there is a terminal escape code to change the title of your terminal. Your windowing system, which is very privileged, then displays that.
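
A harmless demo of that class of escape code on an xterm-compatible terminal (OSC 0, which sets the window/icon title):

  printf '\033]0;title set by remote text\007'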

I think it's fine to be paranoid here, the attack surface is massive.


Best of luck to you. I think there is a targetable niche that could utilize this.

Having thought quite a bit about the search space, I think a whitelist approach is going to be the next big search thing, because advertising and bs sites have corrupted SEO too far.

I'm reminded of the site indexer websites in the early days of the internet. Curation, if done properly (based on quality of content and not certain other factors that currently play too heavy a role in the SEO algo black boxes), seems to be how we adapt to the current information tsunami we are all dealing with.

I think a long time ago I decided I would even pay for such a service, just like I am willing to pay for a good news source (FT for me, not cheap, but worth it). I'm not positive the $10 mark is low enough, but I hope for the general landscape it is.

Just don't forget to keep dontbeevil more than a catchphrase. In particular, please be transparent about what user data you collect and how you use it.


> I think a whitelist approach is going to be the next big search thing

It almost has to be. Spammers, growth hackers, et al. are just too numerous and too good.

> I'm not positive the $10 mark is low enough, but I hope for the general landscape it is.

I saw enough people mention $10 that I decided to go for it. To be honest, $10 is already probably too low to be sustainable because of the huge amount of resources search consumes and the high cost of development.

My gut feeling is that it's economically impossible to build a good search engine that isn't loaded with ads and spyware. But I spent so long complaining about G that I decided to prove it to myself one way or the other.


> because of the huge amount of resources search consumes

I’ve been intermittently working on much the same idea as the OP, and I suspect this is actually a lot less of a problem than it seems, since they’re focusing on a niche. Indexing everything the way Google does requires a lot of resources, but indexing the majority of useful material in a specific domain takes a lot less. (My ElasticSearch index for the entirety of StackOverflow is a mere 40 GB, for example.)

By far the more expensive part is likely to be paying market rates for a developer (you need a decent number of users paying $10/mo to hit a mid-market salary), but in theory this scales relatively independently of userbase.

Edit: I’ve just noticed I’m replying to the OP, who’s mentioned downthread that they’re using BigQuery and spending $200/week. I’ve gone the marginalia.nu route and run everything on a computer in my living room, which changes the calculus somewhat—it’s a lot cheaper, but probably involves more development time.

For me it’s mainly about the learning experience but I’d be interested to hear your thoughts on the tradeoff.


I've tried it out. The limited number of crawled sites is quite obvious when searching for anything obscure or one step outside of programming.

Even 'javascript reverse string', for which I expected some docs or Stack Overflow pages, seems to give me a HN thread, someone's GitHub repo, and a not very related SO thread.

Are MDN, MSDN, and more dev docs documentation on the roadmap?

It's definitely an interesting technique. Do you have anything in place to detect the garbage, substanceless articles like those that have started popping up on Google?

I've seen the occasional one using GitHub repositories or pages. Looking at the current list, you're broadly reliant on moderators and communities, and as the search engine you moderate which sites are indexed.


> one step outside of programming.

Indeed, it's explicitly for programming only (for now).

> Even 'javascript reverse string', for which I expected some docs or Stack Overflow pages, seems to give me a HN thread,

Next up is indexing language documentation. At this point I'm relying heavily on Q&A and community sites since they have their own built in quality rankings.

> Are MDN, MSDN, and more dev docs documentation on the roadmap?

Most definitely. Feel free to dump a list of urls of your favorite doc sites. I'm building a whitelist now.

> Do you have anything in place to detect the garbage, substanceless articles like those that have started popping up on Google?

My strategy is to not index spam in the first place. That's why I started by extracting links from community sites that have their own moderation in place. The next step is to whitelist high-quality sites. That is potentially a huge list to maintain, which is why I am narrowly focused on software development.

Everything old is new again...
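
As a concrete sketch of what harvesting from a community site looks like, here's roughly how upvoted URLs can be pulled from HN's official Firebase API (the endpoints are real; the story count and score cutoff are arbitrary):

  #!/bin/sh
  # Take the current top stories, keep the well-upvoted ones, and emit their URLs.
  for id in $(curl -s https://hacker-news.firebaseio.com/v0/topstories.json | jq '.[:30] | .[]'); do
    curl -s "https://hacker-news.firebaseio.com/v0/item/${id}.json" \
      | jq -r 'select(.score > 50) | .url // empty'
  done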


>At this point I'm relying heavily on Q&A and community sites since they have their own built in quality rankings. //

How are you using the "built in quality rankings", could you give some examples?

On Reddit, say, except in a few groups like AskHistorians you can still get very high ranking for a meme post and very low ranking for a list that has high informational value. StackOverflow is extraordinarily prone to killing off reasonably good contributions and giving very high ranking to out of date answers (the latter is the biggest problem with SO sites at present IMO).


I use this documentation aggregator/search in the browser to access most language docs. It might serve as a whitelist starting point! https://devdocs.io/


Nice. Thanks!


If you haven't already, please crawl <https://developer.apple.com> and <https://swift.org>.


Ok, site request (aside from MDN): pkg.go.dev

Many of these are linked to GitHub/GitLab repos, so I'm not sure how you'll deduplicate that.


I guess blogs that are linked-to in non-killed HN comments should probably be crawled a bit. Have you considered using social user karma (this could be a 1-10 score uniquely calculated for users of each of HN, Twitter, Reddit as long as it's built in a modular way) as a weight in a PageRank style schema?

Here's how I am going to evaluate your search engine. Yesterday I searched Google for "get dynamodb table row count" and found this URL, https://bobbyhadz.com/blog/aws-dynamodb-count-items, which provides a terrible recommendation involving a full table scan.

With DontBeEvil, I didn't find the correct answer, which is to use the describe-table API.
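
For reference, the describe-table route looks roughly like this (the table name is a placeholder, and ItemCount is only refreshed by DynamoDB roughly every six hours, so it's an approximate count):

  aws dynamodb describe-table --table-name MyTable --query 'Table.ItemCount'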

If you really plan to dedicate a year to this, I would strongly encourage you to re-post as soon as you have a strong update. Right now this has potential to provide value but really does not. So update us when you have confidence that you might be providing value! But we think you're on to a great opportunity.


> I guess blogs that are linked-to in non-killed HN comments should probably be crawled a bit

They are, but there are relatively few of them because my only page content source is the Common Crawl. The hit rate vs the total urls I'm interested in is not great. I expect to fix this soon.

I'm also not indexing entire sites, only specific upvoted urls. This will change as well.

> Have you considered using social user karma (this could be a 1-10 score uniquely calculated for users of each of HN, Twitter, Reddit as long as it's built in a modular way) as a weight in a PageRank style schema?

Definitely. I've already started in on calculating a rank coefficient for submitters, but it's not completely clear how to best use it yet.

> Here's how I am going to evaluate your search engine

Feel free to dump more of these. Some solid test cases would be very helpful.


Here's a simple PowerShell wrapper for your lovely tool:

  function rip {
    param (
      [Parameter(Mandatory, ValueFromRemainingArguments)]
      [String]
      $Query
    )
    $RequestParameters = @{
      URI = "https://dontbeevil.rip/search?q=$Query"
      Headers = @{ Accept = "text/plain" }
    }
    $Request = Invoke-RestMethod @RequestParameters
    return $Request
  }
Usage:

  > rip heartbleed bug

  Heartbleed Bug
  <https://heartbleed.com/>    
  Heartbleed Bug The Heartbleed Bug The Heartbleed Bug is a 
  serious vulnerability in the popular OpenSSL cryptographic 
  software library....


Why not make a website for this? Why just limit it to the terminal (hard to use on mobile for example)?

Edit: obviously you can query it from a browser, but it would take like a couple hours to have a view that parsed the json and put it in a google-style layout with a search bar.


> Why not make a website for this?

It's coming. I decided to get it out there as soon as I had an index that could theoretically be useful. The feedback I'm getting will drive the next chunk of work that gets done. For instance, I'll probably bring in language docs next as a lot of people have asked for them already.


This is an interesting approach to the general problem, the general problem being that whatever data source you use is inevitably going to be polluted by players who wish to be at the top of the rankings if your engine gets used. Maybe this solution of serving a very small niche will work, but I'd be really interested to know if you guys have spent any time trying to SEO your own search engine. Hire an intern whose sole task is to get a page to the top of a fairly common search query, like replacing some common Python package with your own?


Seems he's relying on the communities/owners behind the sources to moderate and keep bad content away.

I think it's a good bet. I'd expect it to be a LOT harder for a "growth hacker" to make his way up in HN points as opposed to Google rank.


My shell-fu isn't the greatest. I thought I could be clever and do

    >alias rip="curl -G -H 'Accept: text/plain' \
    --url https://dontbeevil.rip/search --data-urlencode q="

    >rip hello
    {"message": "Missing required request parameters: [q]"}
    curl: (6) Could not resolve host: hello


Turn it into a function:

    rip() {
        curl -G -H 'Accept: text/plain' \
        --url https://dontbeevil.rip/search --data-urlencode "q=$@"
    }
Now `rip hello` works.


This only works with 1-word arguments. You can change $@ to $* to fix that.

(I'm acting all wise, but I learned that today as well :) )


Switch to Zsh, where there'll be no difference!


You could also put a shell script on your `PATH` instead of creating an alias:

  #! /bin/sh
  query_string="q=$@"
  curl --get \
    --header 'Accept: text/plain' \
    --url https://dontbeevil.rip/search \
    --data-urlencode "${query_string}"


I'd do it almost the same but without the variable. Note: the long shebang is for using on Termux, PC users should change it to something like #!/use/bin/env sh.

   #!/data/data/com.termux/files/use/bin/env sh
  curl -G -H "Accept: text/plain" \
  --url "https://dontbeevil.rip/search" \
  --data-urlenconde "q=$*"


Didn't know about "$*", thanks!

Edit: typo in your version: "urlenconde"


Also every instance of ‘usr’ has been autocorrected to ‘use’.

Autocorrect does make me laugh some days :D


That's what I get for phone posting. Thanks for pointing it out you two!


  function rip() {
    printf ">>"
    read query
    curl -G -H 'Accept: text/plain' --url https://dontbeevil.rip/search \
      --data-urlencode q="$query"
  }
:)


> programmer Reddit

Is it just /r/programmer, or many other programming-related subreddits?

In general though, this is great. I would similarly love a solution where we could submit sites to be indexed. A way to have a search engine for all the websites I want specifically would be awesome. You could probably add some sort of popular filter on top of it so that only sites popular enough get indexed. I don't know. Just an idea.

I love the fact that it's accessible from the terminal. That's fantastic. Although, would be nice to have a very simple HTML front-end. Think very early Google or go very brutalistic.

Anyway, excited to hear about it.

Edit:

Doing the following gives me an Internal Server Error for some reason.

  curl -G -H 'Accept: text/plain' --url https://dontbeevil.rip/search --data-urlencode 'q=Notes'
  {"message": "Internal server error"}


> Is it just /r/programmer, or many other programming-related subreddits?

There are about 30 programming-focused subreddits.

> I would similarly love a solution where we could submit sites to be indexed.

This is part of the plan. I want to allow common interest groups to maintain their own search verticals. I also want to allow individual users to add everything from bookmarks to notes (privately only, of course) to act as a sort of external memory. That's very long term though.

> Doing the following gives me an Internal Server Error for some reason.

Should be fixed now


Couple of thoughts:

Make it available through a web page instead of a raw search dump?

Hide the internals of your search engine? In case you want to switch to meilisearch, algolia, ... (for cost reasons)

Preferably use your own search DSL so that users don't have to learn Elasticsearch queries (goes hand in hand with hiding the internals of the search engine)

Good luck! :)


That'll all happen (though ES simple search expressions are quite OK). The reason it is the way it is today is to enable me to get it out into the world as fast as possible. It puts the M in MVP.


Why reinvent the wheel? It's a selling point that Elasticsearch queries can be used.

If he changes the engine, he might as well implement the Elasticsearch query language for the new engine.


You can try with a function to simplify the cli:

$ rip() { curl -G -H 'Accept: text/plain' --url https://dontbeevil.rip/search --data-urlencode 'q=$@'}

$ rip heartbleed bug


The single quotes probably need to be double ones in the last argument to permit parameter expansion, and the $@ (separately quote every argument if quoted) probably wants to be a $* (quote the entire space-separated argument array if quoted)? There’s also the grammar quirk where the last command inside braces (but not parens) needs a semicolon or newline to separate it from the brace itself. Thus:

  rip() { curl -G -H 'Accept: text/plain' --url https://dontbeevil.rip/search --data-urlencode "q=$*"; }
(tested).

I still support the point that there is no reason for this to be a (grammar-defying) alias rather than a (tame) shell function or even a separate script.


This is great, even if I'm not getting as good results as from google for now.

Can you expose the filtering features of ES? I'd love to query e.g. "+python lists" and get results related only to Python, with no Lisp results, say. For Stack Overflow you could use the question's tags as filter keys, and for other sites you'd add them manually (so e.g. the PHP docs get the PHP key).
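
Something like an Elasticsearch bool query with a term filter on a tags field would presumably do it; the index and field names below are just guesses at what the backend might look like:

  curl -s -H 'Content-Type: application/json' 'http://localhost:9200/pages/_search' -d '
  {
    "query": {
      "bool": {
        "must":   { "simple_query_string": { "query": "lists" } },
        "filter": { "term": { "tags": "python" } }
      }
    }
  }'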

If you're thinking of monetizing this, I'll tell you what I tell all the small, useful services that I'd like to pay for. There are too many small, useful services that I'd like to pay for. I'll gladly pay $1 for such a service, but you'll have a hard time convincing me to pay more.


I really like it and it already gives some useful results. A rise of curated search engines like yours would be lovely.

It would be nice if the main page linked to your blog or anything really, because I would like to know where I can follow this project!


Thanks!

I'm giving this project a year to build up momentum. If it looks promising, I plan on having other STEM verticals. Maybe even fix recipe searches one day :)

A real homepage is coming. Feel free to subscribe to my blog's RSS feed for now: https://landshark.io/feed.xml


I just tried a few queries related to Rust and its library Rocket. I got only useless results on the first page and didn't check further.

I'm guessing that's because it doesn't index docs.rs and the rust forum. Both incredibly important for Rust development.

So as long as this engine doesn't also index most programming related forums, I won't be able to use it effectively, even though I really would like to.

The concept of limiting the scope to just a few websites sounds really interesting, though. I think I will take this idea and build a little thing on top of google to implement that site filtering on my queries.


Thanks for the feedback. Language docs sites are the current weak point. They'll be the next big addition to the index.


If you search for "Reddit" the first result is "Google Search is Dying" on Hacker News.

https://dontbeevil.rip/search?q=reddit


The reason is that there was a long discussion about Reddit as a search engine in that thread. reddit.com will likely never be indexed. Many of the subreddits already are, but I haven't exposed the ability to do something like Google's `site:reddit.com/r/*` yet. That's coming though.


> StackOverflow

Does this include the entire StackExchange network, or only StackOverflow? Because some SE sites (in particular, UnixSE and ServerFault) also produce highly relevant results.


> Does this include the entire StackExchange network

Not yet. I'm focused on explicitly developer-oriented resources right now. Those you mentioned are on the TODO list though.


Congrats on the launch! Over the past 6 months or so I’ve been intermittently working on building pretty much exactly the same thing, but with a lot of procrastinating on fiddling with the internals rather than just putting something out there. Your API-first approach is a clever way to get around the desire to keep fiddling around with the page design!


I find that plain text is a very effective anti-procrastination tool. That's why the "API" was actually text-first. Limiting your options can be very liberating.


Can we drop the q= and the quotes from the shell command somehow? That would make it so much nicer, and `rip` is a great command name.


There's a real command line coming. If you're on a Debian Linux and feel like testing it out, just do:

  apt install curl jq
  pip3 install jtbl
  curl -O https://raw.githubusercontent.com/alangibson/dontbeevil.rip/...
  chmod u+x rip
  ./rip 'what is a monad'


With long options, JSON output, and no extra Python dependencies:

  rip() {
      curl \
          --data-urlencode "q=${1}" \
          --get \
          --header 'Content-Type: application/json' --header 'Accept: application/json' \
          --silent \
          'https://dontbeevil.rip/search' \
          | jq '[ .hits.hits[] | { title: .fields.title[0], url: .fields.url[0], highlight: .highlight.text[0] } ]'
  }


This function should work for you:

$ rip() { curl -G -H "Accept: text/plain" --url https://dontbeevil.rip/search --data-urlencode "q=$*"; }

$ rip Heartbleed bug

edit: alangibson's solution in this thread is better :)


This looks really cool! It would be neat to have a proper CLI with a more fully fleshed-out UI, with things like shortcuts to quickly open links. Is there any way I can be kept up to date with the state of this project?

Also, am I correct in assuming it's not open source?


The repo is over here: https://github.com/alangibson/dontbeevil.rip

You'll be disappointed though as most of the important stuff only lives as BigQuery queries. I will be updating it in the near future though.


Like an ultra powerful goosh.org UI, with AI command synthesis, image uploads, crawling, search, opening pages, etc.


CLI in the browser. I love it.


I would recommend adding technical blogs. Not by hand, but if you can automate identifying some. Many are small but have good content.

Edit: also, some corporate technical documentation; Mozilla, Microsoft, IBM, etc. have many such developer pages.


I automate it by pulling urls out of HN, programmer Reddit, etc. Right now my only source of page content is the Common Crawl, which is why there are relatively few web pages indexed. That will change.

A next step is to index entire sites, not just individual pages, based on the positive votes their links get.


It's powered by ElasticSearch, is it? So I can use all of its query parameters?


Indeed it is. You can use simple query strings for the q parameter. See https://www.elastic.co/guide/en/elasticsearch/reference/curr...

I'm considering opening up full ES query support for paying customers, but it's too dangerous to expose it to the Internet unrestricted.


I think it's dangerous to expect that a malicious actor would not pay $10 to screw with your service.


"Risk management" is often not the same as "risk elimination"


Indeed it is. Presumably I'll have had time to build up some safeguards and run beefier servers by then, though.


As a word of warning: when HN discovered my search engine, I was hit hard by a botnet within a few days. I saw about 30-40k queries/hour from some 10k IP addresses. I'm self-hosted, so the worst that happened is my search engine was a bit slow, but if I were cloud-hosted I'd have a very sizable bill to pay.

If you do not already have a global rate limit, implement one ASAP. Better to have one and not need it, than to need it and not have it.


I can't wait for the bots to show up. Setting a rate limit was one of the first things I did :)


Do you have a reverse proxy like HAProxy or Nginx in front of your API? Most bots will hit you by IP only, so filtering and rejecting requests without a domain will eliminate most of them.


This was a directed attack, not some random drive-by.


Going to "https://dontbeevil.rip/" results in a JSON error in the browser:

{"message":"Missing Authentication Token"}


There's nothing there yet. I'm 100% focused on building the index and tuning the master search query. There will, however, soon be a blog post that goes into more detail on how to do things like pagination.

tl;dr: https://dontbeevil.rip/search?q=monads&from=10
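
Or, with the curl idiom from the post:

  curl -G -H 'Accept: text/plain' --url https://dontbeevil.rip/search --data-urlencode 'q=monads' --data-urlencode 'from=10'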


Apparently developers need no homepages, just APIs :)


Jup. No time for fancy things like HTML yet :)


I run a few dozen internet and web services with FQDNs, and only 1 of them has something if you type http://example.com - no homepages, but there's a webserver listening on ~80% of the domains.

Stop attacking me!


You need to access the sub path /search?q= :)


Getting an internal server error for many ordinary requests. I'm not able to discern a pattern. An example is `rip 'q=zelda'`.


Thanks for the report. I'll get this fixed.

In the mean time you can use application/json:

curl -G -H 'Accept: application/json' https://dontbeevil.rip/search?q=zelda


It should be fixed now.


Nice! Do any more advanced query strings work right now? Like looking for recent pages or only searching titles?


Are you using something similar to the original pagerank algorithm that uses eigen-analysis of the link graph?


It's not even that sophisticated yet. I'm ranking urls based on their normalized score on the various community sites I find them on. My next TODO is to roll up those ranks to get a rank for the site, then index the whole site.

I will also be using the PageRank calculated by Common Crawl as soon as they release the next data set.
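
A minimal sketch of that normalization with jq, assuming a hypothetical items.json holding {url, score} records pulled from one community site:

  # Min-max style normalization: rank = score / max score on that site.
  jq '(map(.score) | max) as $max | map({url, rank: (.score / $max)})' items.json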


Did this idea spark from PG's old talk on new ideas (https://youtu.be/R9ITLdmfdLI)? One of them is literally "search engine for developers/hackers".


Maybe use Gopher? Lynx supports it and there are a few other newish clients out there.


I saw your blog post a couple of days ago. This looks really promising!


Thanks! I'll be updating that post today. I changed quite a few things getting ready for this Show HN, so it's now out of date.


Are you not putting github (+issues and PRs) in the indexed set?


Not yet. That's an astonishing amount of data, and I want to make sure that people genuinely want it first. I'm considering an index specifically for this, actually.

I'll put you down as a +1


I'm getting:

> {"message": "Internal server error"}


Give it another try. I fixed a flaw in the json to text translation.


Thanks, it works now


What query are you running?


Love the URL most of all but will be trying this out!


How's ranking done? I searched for xslt and saw a lot of HN results at the top; it seemed weird that HN would rank highly for that.


Results are heavily weighted to HN and Stackoverflow right now because they are the easiest resources to access and rank. Since posts have a score on both platforms, it's easy to pull out some 'authority' signal.

There are many more web pages coming. They are much more difficult to get ahold of and rank, though, because I need to run my own crawler to fill in what Common Crawl doesn't have and then calculate my own site authority rankings.


Nice. Some very different results returned.


Now do recipes


StackOverflow but for food. RecipeOverflow. StackFood.


I want to. So much.


  1. Stand up ElasticSearch instance
  2. Have it index SO and HN
  3. Charge $10 per month
  4. Profit!!


Maybe you should read the other posts about future plans and how this is extremely alpha. Or maybe go anywhere else and do literally anything else but be obnoxious in this thread.

I'm spending over $200 per week just to stand up the service as it is. $10 for a fully functional search engine will likely not be even close to PROFIT!!!!


> I'm spending over $200 per week just to stand up the service as it is

What are you using, if you don't mind me asking? Not trying to criticize or anything. I have a Hetzner box that gives me 1TB SSD in RAID 0 mode and 64 GB of RAM for about 80 CAD a month.


I have 3 sizeable EC2 instances running an ElasticSearch cluster, plus a beefy box for data preprocessing and crawling.

A big chunk actually goes to BigQuery. There are publicly available datasets for HN, Stackoverflow and a few others there. I've also loaded up the Common Crawl index. The query and storage fees really add up.
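
For the curious, pulling candidate URLs out of the public HN dataset looks roughly like this with the bq CLI (the dataset is the real public one; the score threshold and limit are arbitrary):

  bq query --use_legacy_sql=false '
    SELECT url, score
    FROM `bigquery-public-data.hacker_news.full`
    WHERE type = "story" AND score > 100 AND url IS NOT NULL
    LIMIT 1000'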

I'm hopefully done with huge BigQuery queries, so that $200 will probably drop for a while.


I'm probably wrong about my assumptions, but I presume you are open to any kind of constructive feedback, so here it goes...

Maybe you're overdoing it with the infra stack.

I would simplify until having a mature product, especially if I'm bootstrapping, which I think is your case.

Right now, you're still a bit far from an MVP, from my point of view. That $200 can probably be reduced by 50%-75% if you compromise on stuff only important to non-alpha services (e.g. 99.99% availability). A single EC2 box should be enough. Maybe look into Postgres or another FOSS option instead of BigQuery.

These $100-$150 savings per week can go into promoting your service, getting as much attention as possible to maximize feedback.

Good luck!


> Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something.

Source: https://news.ycombinator.com/newsguidelines.html


You're right, that was dickish. I could have asked what's different without the snark.


The purpose of this project is to see if it's possible to build a highly targeted, privacy-respecting search engine that people will pay for. I've given myself a year to build the index and tune for relevance. If at the end of that year it's not on a path to sustainability, I'll shut it down, secure in the knowledge that, despite what they say, people really won't pay for search. If it is, then I'll start scaling into other STEM subjects.

So the difference is, it has the things folks on HN say they want:

- search expressions

- REST API

- no tracking

- users are buyers, not products


Would you consider allowing users to host instances/nodes of the engine in return for free or reduced monthly rates? I wouldn't mind making that kind of trade.


How would one go about ensuring these nodes are not malicious?


Just query two at random; if they don't match, hit an API endpoint with something like `diff` output. If the API endpoint gets enough complaints about a node from those comparisons, then blacklist it from the round robin/haproxy/whatever.


> if the API endpoint with the two 'nodes' gets enough complaints about a node(s) then blacklist it

Great, now you just found a way for malicious actors to create reputation bombs, remove honest nodes from the pool and make it even easier for them to spam/poison the results.


Make them run in a blockchain! ;D



