Show HN: DontBeEvil.rip: Search, for developers (API, expressions, CLI)
269 points by alangibson on March 3, 2022 | 121 comments
I'd like to invite everyone to try out DontBeEvil.rip, an experimental search engine for developers.

tl;dr

$ alias rip="curl -G -H 'Accept: text/plain' --url https://dontbeevil.rip/search --data-urlencode "

$ rip 'q=Heartbleed bug'

DontBeEvil.rip is a year-long experiment to see if a small team can build a developer-focused search engine that is self-sustaining on $10 monthly subscriptions.

It works by only indexing high-quality resources that are relevant to developers. You won't get useless listicles because we'll never crawl them. Relevant urls are harvested from HN, StackOverflow, programmer Reddit, and a few others. Page content comes mostly from the Common Crawl project.

The limited, but awesome, features in this first release are:

- Expressions! Experience the power of Elasticsearch’s Simple Query Strings (see the example just after this list).

- REST API. Just change 'text/plain' to `application/json` in the above alias.

- CLI. Just use curl in the terminal. Simple as.
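
For instance, simple query string syntax supports quoted phrases plus + (must) and - (must not) operators, so, assuming the sanitizer leaves those operators intact, queries like these should work; the second one switches to the JSON API:

  $ rip 'q="memory leak" +rust -python'

  $ curl -G -H 'Accept: application/json' --url https://dontbeevil.rip/search --data-urlencode 'q=heartbleed' | jq .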

HackerNews, StackOverflow, Arxiv abstracts, 2M Github repos, and programmer Reddit (up to 2020) are being indexed right now. There's much more to come in the next few months.

I'd love to hear your questions, comments and suggestions in the comments below.



I love this idea.

This is mostly just raw data; it isn't that useful (yet).

Using curl directly in my terminal feels a bit dangerous from a security standpoint. I'd rather use my browser and be able to see the results as a JSON tree. Providing raw access to ES has a high risk-to-reward ratio.

The search results for https://dontbeevil.rip/search?q=python%20context%20manager%2... are non-topical hits on SO records. I even put the name of the python package (from stdlib no less) in the query string.

I was able to find what I needed on devdocs.io in less than 10 seconds.

https://devdocs.io/python~3.10/library/contextlib#contextlib...

In no way am I trying to discourage you, but until the basics are in place, search over arxiv abstracts is way less useful than just SO and docs (language and libraries).

I would recommend returning text/plain by default and .json if someone asks for it (in the url); not everyone can set headers. I'd also put an about page at the root of your site, plain text is fine.


Thanks for the feedback. I'm considering making searching for language docs a special use case, maybe even with its own index.

> Providing raw access to ES has a high risk-to-reward ratio.

I'm not quite that foolhardy :). The only thing that gets passed to ES is a sanitized simple_query_string.query. Should be relatively safe. I'm sure someone will prove me wrong though.
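
As a rough sketch, the backend builds nothing more than something along these lines (the index and field names here are illustrative, not the real schema):

  curl -s -H 'Content-Type: application/json' 'http://localhost:9200/pages/_search' -d '
  {
    "query": {
      "simple_query_string": {
        "query": "<sanitized user input>",
        "fields": ["title", "text"]
      }
    }
  }'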


All this curl script does is print strings in the terminal. Are you saying that printing strings is dangerous? I think you may be confusing it with executing the output of curl which is not what this script is doing.


Escape codes could be leveraged to RCE the terminal. That said, every CLI you install on your computer can do code execution and could potentially couple that with remote instructions. There are two vectors there: one where you don't trust the server and another where you don't trust the connection between the client and the server.


Which escape codes?


> one where you don't trust the server

And you can't trust the server here because it's publishing third-party content in the form of search results.

The risk/reward ratio mentioned above is too far toward risk for my liking.


What security implications does running curl have that wouldn't be present in a browser?


There have been instances of terminal vulnerabilities via terminal escape codes, as bad as an RCE in iterm2: https://blog.mozilla.org/security/2019/10/09/iterm2-critical.... I suppose the OP is thinking of something like that.


Yea, I was wondering about that, but the risk feels similar to a browser RCE to me. Maybe it's higher because browsers are more widely used/analyzed; but then again, a browser RCE has a much wider range of targets with more opportunities to exploit.


Even just having the potential for the terminal to interpret escape codes is frustrating. Always pipe remote output to `less` or `less -R` (not `less -r`).
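
For example, with the alias from the post (plain `less` renders escape sequences as visible caret notation rather than passing them to the terminal):

  rip 'q=Heartbleed bug' | less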


And this is exactly why I’m always playing the damp squid when people advocate for more features being supported via shell escape codes.


I’m wondering the same. You’re not piping them into a shell.


> Using curl directly in my terminal feels a bit dangerous from a security standpoint.

What?


Then don't use a privileged terminal to run curl.


I don't think that's a satisfactory mitigation. For example, there is a terminal escape code to change the title of your terminal. Your windowing system, which is very privileged, then displays that.
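
A harmless demo of that class of escape code on an xterm-compatible terminal (OSC 0, which sets the window/icon title):

  printf '\033]0;title set by remote text\007'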

I think it's fine to be paranoid here, the attack surface is massive.


Best of luck to you. I think there is a targetable niche that could utilize this.

Having thought quite a bit about the search space, I think a whitelist approach is going to be the next big search thing, because advertising and bs sites have corrupted SEO too far.

I'm reminded of the site indexer websites in the early days of the internet. Curation, if done properly (based on quality of content and not certain other factors that currently play too heavy a role in the SEO algo black boxes), seems to be how we adapt to the current information tsunami we are all dealing with.

I think a long time ago I decided I would even pay for such a service, just like I am willing to pay for a good news source (FT for me, not cheap, but worth it). I'm not positive the $10 mark is low enough, but I hope for the general landscape it is.

Just don't forget to keep dontbeevil more than a catchphrase. In particular, please be transparent about what user data you collect and how you use it.


> I think a whitelist approach is going to be the next big search thing

It almost has to be. Spammers, growth hackers, et al. are just too numerous and too good.

> I'm not positive the $10 mark is low enough, but I hope for the general landscape it is.

I saw enough people mention $10 that I decided to go for it. To be honest, $10 is already probably too low to be sustainable because of the huge amount of resources search consumes and the high cost of development.

My gut feeling is that it's economically impossible to build a good search engine that isn't loaded with ads and spyware. But I spent so long complaining about G that I decided to prove it to myself one way or the other.


> because of the huge amount of resources search consumes

I’ve been intermittently working on much the same idea as the OP, and I suspect this is actually a lot less of a problem than it seems, since they’re focusing on a niche. Indexing everything the way Google does requires a lot of resources, but indexing the majority of useful material in a specific domain takes a lot less. (My ElasticSearch index for the entirety of StackOverflow is a mere 40 GB, for example.)

By far the more expensive part is likely to be paying market rates for a developer (you need a decent number of users paying $10/mo to hit a mid-market salary), but in theory this scales relatively independently of userbase.

Edit: I’ve just noticed I’m replying to the OP, who’s mentioned downthread that they’re using BigQuery and spending $200/week. I’ve gone the marginalia.nu route and run everything on a computer in my living room, which changes the calculus somewhat—it’s a lot cheaper, but probably involves more development time.

For me it’s mainly about the learning experience but I’d be interested to hear your thoughts on the tradeoff.


I've tried it out. The limited number of crawled sites is quite obvious when searching for anything obscure or one step outside of programming.

Even 'javascript reverse string', for which I expected some docs or Stack Overflow pages, seems to give me a HN thread, someone's GitHub repo, and a not very related SO thread.

Are MDN, MSDN, and more dev docs documentation on the roadmap?

It's definitely an interesting technique. Do you have anything in place to detect the garbage, substanceless articles like those that have started popping up on Google?

I've seen the occasional one using GitHub repositories or pages. Looking at the current list, you're broadly reliant on moderators and communities, and as the search engine you moderate which sites are indexed.


> one step outside of programming.

Indeed, it's explicitly for programming only (for now).

> Even 'javascript reverse string', for which I expected some docs or Stack Overflow pages, seems to give me a HN thread,

Next up is indexing language documentation. At this point I'm relying heavily on Q&A and community sites since they have their own built in quality rankings.

> Are MDN, MSDN, and more dev docs documentation on the roadmap?

Most definitely. Feel free to dump a list of urls of your favorite doc sites. I'm building a whitelist now.

> Do you have anything in place to detect the garbage, substanceless articles like those that have started popping up on Google?

My strategy is to not index spam in the first place. That's why I started by extracting links from community sites that have their own moderation in place. The next step is to whitelist high-quality sites. That is potentially a huge list to maintain, which is why I am narrowly focused on software development.

Everything old is new again...
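
As a concrete sketch of what harvesting from a community site looks like, here's roughly how upvoted URLs can be pulled from HN's official Firebase API (the endpoints are real; the story count and score cutoff are arbitrary):

  #!/bin/sh
  # Take the current top stories, keep the well-upvoted ones, and emit their URLs.
  for id in $(curl -s https://hacker-news.firebaseio.com/v0/topstories.json | jq '.[:30] | .[]'); do
    curl -s "https://hacker-news.firebaseio.com/v0/item/${id}.json" \
      | jq -r 'select(.score > 50) | .url // empty'
  done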


>At this point I'm relying heavily on Q&A and community sites since they have their own built in quality rankings. //

How are you using the "built in quality rankings", could you give some examples?

On Reddit, say, except in a few groups like AskHistorians you can still get very high ranking for a meme post and very low ranking for a list that has high informational value. StackOverflow is extraordinarily prone to killing off reasonably good contributions and giving very high ranking to out of date answers (the latter is the biggest problem with SO sites at present IMO).


I use this documentation aggregator/search in the browser to access most language docs. It might serve as a whitelist starting point! https://devdocs.io/


Nice. Thanks!


If you haven't already, please crawl <https://developer.apple.com> and <https://swift.org>.


Ok, site request (aside from MDN): pkg.go.dev

Many of these are linked to GitHub/GitLab repos, so I'm not sure how you'll deduplicate that.


I guess blogs that are linked-to in non-killed HN comments should probably be crawled a bit. Have you considered using social user karma (this could be a 1-10 score uniquely calculated for users of each of HN, Twitter, Reddit as long as it's built in a modular way) as a weight in a PageRank style schema?

Here's how I am going to evaluate your search engine. Yesterday I searched Google for "get dynamodb table row count" and found this URL, https://bobbyhadz.com/blog/aws-dynamodb-count-items, which provides a terrible recommendation involving a full table scan.

With DontBeEvil, I didn't find the correct answer, which is to use the describe-table API.
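
For reference, the describe-table route looks roughly like this (the table name is a placeholder, and ItemCount is only refreshed by DynamoDB roughly every six hours, so it's an approximate count):

  aws dynamodb describe-table --table-name MyTable --query 'Table.ItemCount'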

If you really plan to dedicate a year to this, I would strongly encourage you to re-post as soon as you have a strong update. Right now this has potential to provide value but really does not. So update us when you have confidence that you might be providing value! But we think you're on to a great opportunity.


> I guess blogs that are linked-to in non-killed HN comments should probably be crawled a bit

They are, but there are relatively few of them because my only page content source is the Common Crawl. The hit rate vs the total urls I'm interested in is not great. I expect to fix this soon.

I'm also not indexing entire sites, only specific upvoted urls. This will change as well.

> Have you considered using social user karma (this could be a 1-10 score uniquely calculated for users of each of HN, Twitter, Reddit as long as it's built in a modular way) as a weight in a PageRank style schema?

Definitely. I've already started in on calculating a rank coefficient for submitters, but it's not completely clear how to best use it yet.

> Here's how I am going to evaluate your search engine

Feel free to dump more of these. Some solid test cases would be very helpful.


Here's a simple PowerShell wrapper for your lovely tool:

  function rip {
    param (
      [Parameter(Mandatory, ValueFromRemainingArguments)]
      [String]
      $Query
    )
    $RequestParameters = @{
      URI = "https://dontbeevil.rip/search?q=$Query"
      Headers = @{ Accept = "text/plain" }
    }
    $Request = Invoke-RestMethod @RequestParameters
    return $Request
  }
Usage:

  > rip heartbleed bug

  Heartbleed Bug
  <https://heartbleed.com/>    
  Heartbleed Bug The Heartbleed Bug The Heartbleed Bug is a 
  serious vulnerability in the popular OpenSSL cryptographic 
  software library....


Why not make a website for this? Why just limit it to the terminal (hard to use on mobile for example)?

Edit: obviously you can query it from a browser, but it would take like a couple hours to have a view that parsed the json and put it in a google-style layout with a search bar.


> Why not make a website for this?

It's coming. I decided to get it out there as soon as I had an index that could theoretically be useful. The feedback I'm getting will drive the next chunk of work that gets done. For instance, I'll probably bring in language docs next as a lot of people have asked for them already.


This is an interesting approach to the general problem, the general problem being that whatever data source you use is inevitably going to be polluted by players who wish to be at the top of the rankings if your engine gets used. Maybe this solution of serving a very small niche will work, but I'd be really interested to know if you guys have spent any time trying to SEO your own search engine. Hire an intern whose sole task is to get a page to the top of a fairly common search query, like replacing some common Python package with your own?


Seems he's relying on the communities/owners behind the sources to moderate and keep bad content away.

I think it's a good bet. I'd expect it to be a LOT harder for a "growth hacker" to make his way up in HN points as opposed to Google rank.


My shell-fu isn't the greatest. I thought I could be clever and do

    >alias rip="curl -G -H 'Accept: text/plain' \
    --url https://dontbeevil.rip/search --data-urlencode q="

    >rip hello
    {"message": "Missing required request parameters: [q]"}
    curl: (6) Could not resolve host: hello


Turn it into a function:

    rip() {
        curl -G -H 'Accept: text/plain' \
        --url https://dontbeevil.rip/search --data-urlencode "q=$@"
    }
Now `rip hello` works.


This only works with 1-word arguments. You can change $@ to $* to fix that.

(I'm acting all wise, but I learned that today as well :) )


Switch to Zsh, where there'll be no difference!


You could also put a shell script on your `PATH` instead of creating an alias:

  #! /bin/sh
  query_string="q=$@"
  curl --get \
    --header 'Accept: text/plain' \
    --url https://dontbeevil.rip/search \
    --data-urlencode "${query_string}"


I'd do it almost the same but without the variable. Note: the long shebang is for using on Termux, PC users should change it to something like #!/use/bin/env sh.

   #!/data/data/com.termux/files/use/bin/env sh
  curl -G -H "Accept: text/plain" \
  --url "https://dontbeevil.rip/search" \
  --data-urlenconde "q=$*"


Didn't know about "$*", thanks!

Edit: typo in your version: "urlenconde"


Also every instance of ‘usr’ has been autocorrected to ‘use’.

Autocorrect does make me laugh some days :D


That's what I get for phone posting. Thanks for pointing it out you two!


  function rip() {
    printf ">>"
    read query
    curl -G -H 'Accept: text/plain' --url https://dontbeevil.rip/search \
      --data-urlencode q="$query"
  }
:)


> programmer Reddit

Is it just /r/programmer, or many other programming-related subreddits?

In general though, this is great. I would similarly love a solution where we could submit sites to be indexed. A way to have a search engine for all the websites I want specifically would be awesome. You could probably add some sort of popular filter on top of it so that only sites popular enough get indexed. I don't know. Just an idea.

I love the fact that it's accessible from the terminal. That's fantastic. Although, would be nice to have a very simple HTML front-end. Think very early Google or go very brutalistic.

Anyway, excited to hear about it.

Edit:

Doing the following gives me an Internal Server Error for some reason.

  curl -G -H 'Accept: text/plain' --url https://dontbeevil.rip/search --data-urlencode 'q=Notes'
  {"message": "Internal server error"}


> Is it just /r/programmer, or many other programming-related subreddits?

There are about 30 programming-focused subreddits.

> I would similarly love a solution where we could submit sites to be indexed.

This is part of the plan. I want to allow common interest groups to maintain their own search verticals. I also want to allow individual users to add everything from bookmarks to notes (privately only, of course) to act as a sort of external memory. That's very long term though.

> Doing the following gives me an Internal Server Error for some reason.

Should be fixed now


Couple of thoughts:

Make it available through a web page instead of a raw search dump?

Hide the internals of your search engine? In case you want to switch to meilisearch, algolia, ... (for cost reasons)

Preferably use your own search DSL so that users don't have to learn Elasticsearch queries (goes hand in hand with hiding the internals of the search engine)

Good luck! :)


That'll all happen (though ES simple search expressions are quite OK). The reason it is the way it is today is to enable me to get it out into the world as fast as possible. It puts the M in MVP.


Why reinvent the wheel? It's a selling point that Elasticsearch queries can be used.

If he changes the engine, he might as well implement the Elasticsearch query language for the new engine.


You can try with a function to simplify the cli:

$ rip() { curl -G -H 'Accept: text/plain' --url https://dontbeevil.rip/search --data-urlencode 'q=$@'}

$ rip heartbleed bug


The single quotes probably need to be double ones in the last argument to permit parameter expansion, and the $@ (separately quote every argument if quoted) probably wants to be a $* (quote the entire space-separated argument array if quoted)? There’s also the grammar quirk where the last command inside braces (but not parens) needs a semicolon or newline to separate it from the brace itself. Thus:

  rip() { curl -G -H 'Accept: text/plain' --url https://dontbeevil.rip/search --data-urlencode "q=$*"; }
(tested).

I still support the point that there is no reason for this to be a (grammar-defying) alias rather than a (tame) shell function or even a separate script.


This is great, even if I'm not getting as good results as from google for now.

Can you expose the filtering features of ES? I'd love to query e.g. "+python lists" and get results related only to Python, with no Lisp results, say. For Stack Overflow you could use the question's tags as filter keys, and for other sites you'd add them manually (so e.g. the PHP docs get the PHP key).
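
Something like an Elasticsearch bool query with a term filter on a tags field would presumably do it; the index and field names below are just guesses at what the backend might look like:

  curl -s -H 'Content-Type: application/json' 'http://localhost:9200/pages/_search' -d '
  {
    "query": {
      "bool": {
        "must":   { "simple_query_string": { "query": "lists" } },
        "filter": { "term": { "tags": "python" } }
      }
    }
  }'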

If you're thinking of monetizing this, I'll tell you what I tell all the small, useful services that I'd like to pay for. There are too many small, useful services that I'd like to pay for. I'll gladly pay $1 for such a service, but you'll have a hard time convincing me to pay more.


I really like it and it already gives some useful results. A rise of curated search engines like yours would be lovely.

It would be nice if the main page linked to your blog or anything really, because I would like to know where I can follow this project!


Thanks!

I'm giving this project a year to build up momentum. If it looks promising, I plan on having other STEM verticals. Maybe even fix recipe searches one day :)

A real homepage is coming. Feel free to subscribe to my blog's RSS feed for now: https://landshark.io/feed.xml


I just tried a few queries related to Rust and its library Rocket. I got only useless results on the first page and didn't check further.

I'm guessing that's because it doesn't index docs.rs and the rust forum. Both incredibly important for Rust development.

So as long as this engine doesn't also index most programming related forums, I won't be able to use it effectively, even though I really would like to.

The concept of limiting the scope to just a few websites sounds really interesting, though. I think I will take this idea and build a little thing on top of google to implement that site filtering on my queries.


Thanks for the feedback. Language docs sites are the current weak point. They'll be the next big addition to the index.


If you search for "Reddit" the first result is "Google Search is Dying" on Hacker News.

https://dontbeevil.rip/search?q=reddit


The reason is that there was a long discussion about Reddit as a search engine in that thread. reddit.com will likely never be indexed. Many of the subreddits already are, but I haven't exposed the ability to do something like Google's `site:reddit.com/r/*` yet. That's coming though.


> StackOverflow

Does this include the entire StackExchange network, or only StackOverflow? Because some SE sites (in particular, UnixSE and ServerFault) also produce highly relevant results.


> Does this include the entire StackExchange network

Not yet. I'm focused on explicitly developer-oriented resources right now. Those you mentioned are on the TODO list though.


Congrats on the launch! Over the past 6 months or so I’ve been intermittently working on building pretty much exactly the same thing, but with a lot of procrastinating on fiddling with the internals rather than just putting something out there. Your API-first approach is a clever way to get around the desire to keep fiddling around with the page design!


I find that plain text is a very effective anti-procrastination tool. That's why the "API" was actually text-first. Limiting your options can be very liberating.


Can we drop the q= and the quotes from the shell command somehow? That would make it so much nicer, and `rip` is a great command name.


There's a real command line coming. If you're on a Debian Linux and feel like testing it out, just do:

  apt install curl jq
  pip3 install jtbl
  curl -O https://raw.githubusercontent.com/alangibson/dontbeevil.rip/...
  chmod u+x rip
  ./rip 'what is a monad'


With long options, JSON output, and no extra Python dependencies:

  rip() {
      curl \
          --data-urlencode "q=${1}" \
          --get \
          --header 'Content-Type: application/json' --header 'Accept: application/json' \
          --silent \
          'https://dontbeevil.rip/search' \
          | jq '[ .hits.hits[] | { title: .fields.title[0], url: .fields.url[0], highlight: .highlight.text[0] } ]'
  }


This function should work for you:

$ rip() { curl -G -H "Accept: text/plain" --url https://dontbeevil.rip/search --data-urlencode "q=$*"; }

$ rip Heartbleed bug

edit: alangibson's solution in this thread is better :)


This looks really cool! It would be neat to have a proper CLI with a more fully fleshed-out UI, with things like shortcuts to quickly open links. Is there any way I can be kept up to date with the state of this project?

Also, am I correct in assuming it's not open source?


The repo is over here: https://github.com/alangibson/dontbeevil.rip

You'll be disappointed though as most of the important stuff only lives as BigQuery queries. I will be updating it in the near future though.


Like an ultra powerful goosh.org UI, with AI command synthesis, image uploads, crawling, search, opening pages, etc.


CLI in the browser. I love it.


I would recommend adding technical blogs. Not by hand, but if you can automate identifying some. Many are small but have good content.

Edit: also, some corporate technical documentation; Mozilla, Microsoft, IBM, etc. have many such developer pages.


I automate it by pulling urls out of HN, programmer Reddit, etc. Right now my only source of page content is the Common Crawl, which is why there are relatively few web pages indexed. That will change.

A next step is to index entire sites, not just individual pages, based on the positive votes their links get.


It's powered by ElasticSearch, is it? So I can use all of its query parameters?


Indeed it is. You can use simple query strings for the q parameter. See https://www.elastic.co/guide/en/elasticsearch/reference/curr...

I'm considering opening up full ES query support for paying customers, but it's too dangerous to expose it to the Internet unrestricted.


I think it's dangerous to expect that a malicious actor would not pay $10 to screw with your service.


"Risk management" is often not the same as "risk elimination"


Indeed it is. Presumably I'll have had time to build up some safeguards and run beefier servers by then, though.


As a word of warning: when HN discovered my search engine, I was hit hard by a botnet within a few days. I saw about 30-40k queries/hour from some 10k IP addresses. I'm self-hosted, so the worst that happened is my search engine was a bit slow, but if I were cloud-hosted I'd have a very sizable bill to pay.

If you do not already have a global rate limit, implement one ASAP. Better to have one and not need it, than to need it and not have it.


I can't wait for the bots to show up. Setting a rate limit was one of the first things I did :)


Do you have a reverse proxy like HAProxy or Nginx in front of your API? Most bots will hit you by IP only, so filtering and rejecting requests without a domain will eliminate most of them.


This was a directed attack, not some random drive-by.


Going to "https://dontbeevil.rip/" results in a JSON error in the browser:

{"message":"Missing Authentication Token"}


There's nothing there yet. I'm 100% focused on building the index and tuning the master search query. There will, however, soon be a blog post that goes into more detail on how to do things like pagination.

tl;dr: https://dontbeevil.rip/search?q=monads&from=10
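
Or, with the curl idiom from the post:

  curl -G -H 'Accept: text/plain' --url https://dontbeevil.rip/search --data-urlencode 'q=monads' --data-urlencode 'from=10'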


Apparently developers need no homepages, just APIs :)


Jup. No time for fancy things like HTML yet :)


I run a few dozen internet and web services with FQDNs, and only 1 of them has something if you type http://example.com - no homepages, but there's a webserver listening on ~80% of the domains.

Stop attacking me!


You need to access the sub path /search?q= :)


Getting an internal server error for many ordinary requests. I'm not able to discern a pattern. An example is `rip 'q=zelda'`.


Thanks for the report. I'll get this fixed.

In the mean time you can use application/json:

curl -G -H 'Accept: application/json' https://dontbeevil.rip/search?q=zelda


It should be fixed now.


Nice! Do any more advanced query strings work right now? Like looking for recent pages or only searching titles?


Are you using something similar to the original pagerank algorithm that uses eigen-analysis of the link graph?


It's not even that sophisticated yet. I'm ranking urls based on their normalized score on the various community sites I find them on. My next TODO is to roll up those ranks to get a rank for the site, then index the whole site.

I will also be using the PageRank calculated by Common Crawl as soon as they release the next data set.
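
A minimal sketch of that normalization with jq, assuming a hypothetical items.json holding {url, score} records pulled from one community site:

  # Min-max style normalization: rank = score / max score on that site.
  jq '(map(.score) | max) as $max | map({url, rank: (.score / $max)})' items.json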


Did this idea spark from PG's old talk on new ideas (https://youtu.be/R9ITLdmfdLI)? One of them is literally "search engine for developers/hackers".


Maybe use Gopher? Lynx supports it and there are a few other newish clients out there.


I saw your blog post a couple of days ago. This looks really promising!


Thanks! I'll be updating that post today. I changed quite a few things getting ready for this Show HN, so it's now out of date.


Are you not putting github (+issues and PRs) in the indexed set?


Not yet. That's an astonishing amount of data, and I want to make sure that people genuinely want it first. I'm considering an index specifically for this, actually.

I'll put you down as a +1


I'm getting:

> {"message": "Internal server error"}


Give it another try. I fixed a flaw in the json to text translation.


Thanks, it works now


What query are you running?


Love the URL most of all but will be trying this out!


How's ranking done? I searched for xslt and saw a lot of HN results at the top; it seemed weird that HN would rank highly for that.


Results are heavily weighted to HN and Stackoverflow right now because they are the easiest resources to access and rank. Since posts have a score on both platforms, it's easy to pull out some 'authority' signal.

There are many more web pages coming. They are much more difficult to get ahold of and rank, though, because I need to run my own crawler to fill in what Common Crawl doesn't have and then calculate my own site authority rankings.


Nice. Some very different results returned.


Now do recipes


StackOverflow but for food. RecipeOverflow. StackFood.


I want to. So much.


  1. Stand up ElasticSearch instance
  2. Have it index SO and HN
  3. Charge $10 per month
  4. Profit!!


Maybe you should read the other posts about future plans and how this is extremely alpha. Or maybe go anywhere else and do literally anything else but be obnoxious in this thread.

I'm spending over $200 per week just to stand up the service as it is. $10 for a fully functional search engine will likely not be even close to PROFIT!!!!


> I'm spending over $200 per week just to stand up the service as it is

What are you using, if you don't mind me asking? Not trying to criticize or anything. I have a Hetzner box that gives me 1TB SSD in RAID 0 mode and 64 GB of RAM for about 80 CAD a month.


I have 3 sizeable EC2 instances running an ElasticSearch cluster, plus a beefy box for data preprocessing and crawling.

A big chunk actually goes to BigQuery. There are publicly available datasets for HN, Stackoverflow and a few others there. I've also loaded up the Common Crawl index. The query and storage fees really add up.
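
For the curious, pulling candidate URLs out of the public HN dataset looks roughly like this with the bq CLI (the dataset is the real public one; the score threshold and limit are arbitrary):

  bq query --use_legacy_sql=false '
    SELECT url, score
    FROM `bigquery-public-data.hacker_news.full`
    WHERE type = "story" AND score > 100 AND url IS NOT NULL
    LIMIT 1000'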

I'm hopefully done with huge BigQuery queries, so that $200 will probably drop for a while.


I'm probably wrong about my assumptions, but I presume you are open to any kind of constructive feedback, so here it goes...

Maybe you're overdoing it with the infra stack.

I would simplify until having a mature product, especially if I'm bootstrapping, which I think is your case.

Right now, you're still a bit far from an MVP, from my point of view. That $200 can probably be reduced by 50%-75% if you compromise on stuff only important to non-alpha services (e.g. 99.99% availability). A single EC2 box should be enough. Maybe look into Postgres or another FOSS option instead of BigQuery.

These $100-$150 savings per week can go into promoting your service, getting as much attention as possible to maximize feedback.

Good luck!


> Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something.

Source: https://news.ycombinator.com/newsguidelines.html


You're right, that was dickish. I could have asked what's different without the snark.


The purpose of this project is to see if it's possible to build a highly targeted, privacy-respecting search engine that people will pay for. I've given myself a year to build the index and tune for relevance. If at the end of that year it's not on a path to sustainability, I'll shut it down, secure in the knowledge that, despite what they say, people really won't pay for search. If it is, then I'll start scaling into other STEM subjects.

So the difference is, it has the things folks on HN say they want:

- search expressions

- REST API

- no tracking

- users are buyers, not products


Would you consider allowing users to host instances/nodes of the engine in return for free or reduced monthly rates? I wouldn't mind making that kind of trade.


How would one go about ensuring these nodes are not malicious?


Just query two at random; if they don't match, hit an API endpoint with something like `diff` output. If the API endpoint gets enough complaints about a node from those comparisons, then blacklist it from the round robin/haproxy/whatever.


> if the API endpoint with the two 'nodes' gets enough complaints about a node(s) then blacklist it

Great, now you just found a way for malicious actors to create reputation bombs, remove honest nodes from the pool and make it even easier for them to spam/poison the results.


Make them run in a blockchain! ;D



