We can do better than DuckDuckGo (drewdevault.com)
563 points by als0 on Nov 17, 2020 | 364 comments



> The search results suck

Do they really though, for normal people that is? Some of my searches from today are below; I can't remember the exact terms I used. A mix of DDG and Google.

1) Walt Whitman, I wanted a basic overview of his work to satisfy some idle curiosity. DDG gave me his wikipedia page. Bingo

2) EAN-13 check digit. First result wikipedia telling me how to calculate it (a quick sketch of the calculation follows this list). I see it is simple and I have a long list in Excel to check. I can't be bothered to think so...

3) EAN-13 Excel. First result has an example that I copied and pasted.

4) Timezone [niche cloud system]. Said system didn't do what we expected, seems to be timezone issue. First article is discussing this niche issue and offers solutions

5) Does Shopify support x payments. Yes it does

6) Coronavirus test. Got straight to government site.

7) MacOS version numbers. First hit...

8) How come my Microsoft x platform is showing as being at y level of service when my buddy's is not. Straight in

Am I just a perfect search customer? I don't seem to be getting the problems Drew is.
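For anyone curious, here is a minimal sketch (in Python rather than Excel) of the EAN-13 check digit calculation the Wikipedia result describes; the sample code number is only an illustration:

    def ean13_check_digit(first12: str) -> int:
        """Compute the EAN-13 check digit from the first 12 digits.
        Odd positions (1st, 3rd, ...) are weighted 1, even positions 3;
        the check digit tops the weighted sum up to a multiple of 10."""
        weighted = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
        return (10 - weighted % 10) % 10

    def is_valid_ean13(code: str) -> bool:
        """Validate a full 13-digit EAN-13 code."""
        return len(code) == 13 and code.isdigit() and ean13_check_digit(code[:12]) == int(code[-1])

    print(ean13_check_digit("400638133393"))  # -> 1
    print(is_valid_ean13("4006381333931"))    # -> True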


I suspect that anyone who claims that DuckDuckGo "just works" only does English searches. I do both English and mother-tongue searches all day, and every time I need to remember to toggle the regional button, otherwise I get atrocious results. Google, by contrast, simply understands that if I'm searching in English it should prioritize English results, while if I'm searching in another language it should prioritize that language instead.

It gets tiring quickly, and I find it easier to append !g instead of clicking the regional toggle button.


For me (German) it’s different. With DDG, I can easily choose to search for German content (by using !ddgde), with google I have to hope that they search for what I want. Sometimes google does, sometimes it does not. And if it doesn’t I’m out of luck unless I go into the settings and look for a way to tell it what to do.

Google automates, DDG leaves me to choose. I prefer the 2nd approach every time.


> Google automates, DDG leaves me to choose. I prefer the 2nd approach every time.

This is exactly why I like DDG way more than Google, and why I love to use Alfred instead of Spotlight on my Mac. With DDG you have !bangs, and with Alfred you can also tell it what you're looking for. 99.9% of the time I know whether I'm looking for a file, a folder, or the definition of a word, or want to open an app, or want to search the web, etc. With Spotlight you're stuck with the order Apple designed the results to show up in.


It's also very useful to have that control when you live in another country. I'm in Spain now, but most of the time I want to search in English or even French. Google only gives you local results.


Alternatively, you can go to the settings [0] and create a special URL: https://duckduckgo.com/?kl=de-de.

[0] https://duckduckgo.com/settings
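If you want to build such region-pinned URLs programmatically (say, for a browser keyword bookmark), here is a small sketch; the kl codes are the ones already mentioned in this thread:

    from urllib.parse import urlencode

    def ddg_url(query, region=None):
        """Build a DuckDuckGo search URL, optionally pinned to a region
        via the kl parameter (e.g. "de-de" for Germany, "jp-jp" for Japan)."""
        params = {"q": query}
        if region:
            params["kl"] = region
        return "https://duckduckgo.com/?" + urlencode(params)

    print(ddg_url("impressum pflicht", region="de-de"))
    # https://duckduckgo.com/?q=impressum+pflicht&kl=de-de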


Thank you for the !ddgde bang - I face exactly the same problem as you.


I really wish Google would prioritize English results for English searches consistently. I'm living in Japan as a native English speaker, and have my OS, browser and logged in Google account all configured for English only. Despite that, Google search results always prioritize Japanese language content. Every now and then (though not consistently) it gives me a yellow popup asking if I'd like English results instead, which is a bit disappointing given they already have all the information they should need to make a judgement call about that. Maybe the individual experience here depends on the languages and regions involved.


There was a time, a long time ago, where google had this:

www.google.com/ncr

'ncr' here stands for no country recognition. It allowed many expats to do technical searches without the noise of regionalization results.

Of course someone clever at google figured out that was probably too useful and now it just redirects you back to google.com because screw all those niche use-cases.


That's not what it was. "ncr" was "No Country Redirect".

When you were in a different country (e.g., India), and you typed in google.com out of habit, it would recognize your IP-geo and redirect you to the country-specific domain (e.g., google.co.in).

If you really just wanted google.com for whatever reason, then you'd type google.com/ncr. It then wouldn't redirect you based on your IP-geo, and you'd stay on google.com.

In other words, google.com/ncr _always_ redirected you back to google.com. Then, and now.


Thanks for correcting the acronym, TIL.

However you can see from the comments in both android police [0] and reddit [1] that, irrespective of your assertiveness, the behaviour did indeed change at least in 2017 if not more times before.

At the very least, it used to preserve the suffix and fully respect the no-regional-results behaviour. It's the same as the old boolean operators: Google claims the behaviour is unchanged but will silently ignore them.

[0] https://www.androidpolice.com/2017/10/27/changing-googles-do...

[1] https://www.reddit.com/r/google/comments/4xda1p/googlecomncr...


Was going to post the exact same thing. This was my experience while living in Japan too. To me, the takeaway is that you simply cannot catch everyone with your defaults. Google and DDG have made different prioritization defaults and the result of that is what we see in anecdotes in this thread.


You don't need to catch everyone with your defaults. You just need to make it possible to not use the default.

Sure, give me local results if I don't specify anything. But let me tell you if I want results in English now.


> To me, the takeaway is that you simply cannot catch everyone with your defaults.

You don't have to. DuckDuckGo allows you [0] to create a link with settings: https://duckduckgo.com/?kl=jp-jp.

[0] https://duckduckgo.com/settings


On the contrary, I like the explicit language toggle because some search terms have better results with a specific language. I get annoyed when I enter a programming related search term and get non-English results.


I do a lot of searches in French, where a lot of the words are identical to English (English being heavily influenced by French), especially if one leaves out the accents.

I also search in Spanish (Castilian), but sometimes I want results from Latin America, sometimes from Spain.

Being able to set the language/region is of incredible help in both cases. There is no way to automatically detect this.


I also find that even with the regional toggle off, my results are still skewed towards my location or the native language of it. This is true for both DDG and Google. I want results completely agnostic of where my IP happens to be positioned.


And your account history as well! It sucks (more than sucks) when Google only lets me find what I want from someone else's browser.


I actually prefer the toggle button for regional search. Also, when I search in my native tongue, Google sometimes (like when using brand names, models of devices, etc.) gives me 2-4 pages of advertisements and store links. It's hard to find a large company's homepage.

In DDG it's usually the first page.

I still do a lot of !g when I search for technical stuff, as it supports +word and -word, and DDG sometimes doesn't find a lot of weird GitHub issue pages, old forums, and Usenet posts.


DDG works well for Swedish. As for Finnish, DDG doesn't suck more than Google does.


Yeah, all of these are quite DDG-friendly searches. It is my default engine and, yes, some results do suck quite consistently.

I'm a bit too lazy right now to remember all the problems it has, but some of the most obvious are looking up news on recent events (especially something small, stuff that doesn't appear in Reuters and those sorts of media) and trying to find out some basic stuff about local shops and such (of course, I only know how it feels in my location, not worldwide). In both cases I pretty much always use "!g ..." right away, because DDG is just clueless about this shit. Google does this just fine (in fact, sometimes it's even impressive: there are thousands of cities like mine, yet Google can often tell me where I can buy some stuff I'd have no idea where to look for).


> Yeah, all of these are quite DDG-friendly searches...

This is exactly correct. Excluding poor local search results (which is understandable because of the privacy aspect), Bing/DDG has trouble with long-tail search query relevance (5+ word queries), and also with finding results from small or obscure sites. The latter is simply because Bing's organic index is not as large as Google's.

Bing/DDG's organic results are still very good, but they are not as good as Google's in the above specific circumstances.


Compared to Google, Bing has a huge problem with paid search results, at least in some non-English languages.

My mother wanted to access Amazon last week and typed "my amazon account" in French in the Windows search box, which searched for those terms in Bing. One of the first (paid) results was a scam site triggering alarm sounds, fake virus notifications, and asking her to call a scam hotline.

At least DDG filters out the ads but the problem in this case is Bing's OS integration.


> Bing/DDG has trouble with long-tail search query relevance

The last time I did a comparison, Bing did better (I don’t know what DDG does with the Bing results exactly, everyone says they just show Bing results, but no one knows and it simply doesn’t mesh with my experience).

Because Bing does not randomly filter out half my terms while DDG does even for "-forced terms. This is my #1 problem with DDG and I complain about it in pretty much every DDG thread (while otherwise loving DDG).

For searches with few results, DDG shows essentially random stuff even when it has the result I want (which can be tested by searching for an exact sentence from the result page). On the other hand, doing the search on Bing gives me the result without neutering my query.


I am typing this from India. DDG never provides satisfactory results for anything country-specific. As an example, point 6 above is a failure. I used to have DDG as my default, but my workflow got so convoluted that I would search first on DDG, see that the results were not good, then open Google and search again. It is so frustrating that I switched back to Google even though I didn't want to.

Edit: typos.


I'm sure others have commented this elsewhere, but DDG has bang operators.

For your use case, simply append !g to the DDG search and it will do a Google search instead.


Right, but Google supports Google search without bang operators.


No way!


Did you use localized search or the general search? For me 6 works great with !ddgde


Thank you, I didn't know about that feature. I almost always used !g to switch to Google when searching for country-specific terms; I guess I can change that now.


I don’t find a !ddgde equivalent operator for India(en).


In consumer search there is a really long tail of questions (in 2017, 15% of Google's daily queries had never been seen before [1]), and performance on this is very important.

I just searched for "lockdown rules for SA" (I'm in South Australia and we just had a new 20 person cluster, so we are going back into lockdown).

On DDG the first result was a Guardian article, which was good, but the rest were a mix of South African articles and blog spam. There were no SA Gov pages on the first page of results.

On Google the first result was the South Australian gov site with the rules, the second was the Guardian article, then more SA Gov pages and at result 8 I got a South African result.

[1] https://searchengineland.com/google-reaffirms-15-searches-ne...


Hrm, I think it's extremely iffy to abbreviate South Australia like that in a search query. You don't need the "for" either.

BTW, when I perform the same search, Google's first result is "What Are the Lockdown Rules for South Africa? A Guide for ..." and all the other results on the first page are about South Africa too. (Note: I'm in Japan)


> Hrm, I think it's extremely iffy to abbreviate South Australia like that in a search query.

Everyone in Australia uses "SA" - this is one of the reasons why location based context is important.

> You don't need the "for" either.

I worked on consumer search for a few years, and in text-based search, words like "for" are helpful for getting an exact match. Even if the term frequency of "for" on its own isn't particularly useful, "for SA" absolutely is. (And these days, with neural ranking using sub-word parts, it is even more useful.)


SA is how all Australians would word that.


Interesting. Good example. :)

Searching for "lockdown rules" "sa" (together) just now gave a bunch of South Australian specific results with the "Australia" localisation setting enabled.

With the localisation setting disabled, all the results were indeed about South Africa instead.



Which is about South Africa - which might be a good result for you, depending on where you live.

So at least they are trying to do location-based results.


Well, I am in PDX, so....


They do. Something fundamentally changed at some point during the past couple of years. It used to be that DDG was the best for verbatim search (meaning I want only results where the exact words I search for are included).

Now, even with quotes, I routinely get a whole first page of results where my terms are not included anywhere. Google generally respects the quotes.


I have noticed the same problem with DuckDuckGo searches recently.

I hope that a verbatim search function will be restored in the future; I think it's an essential basic tool for a search engine, and without it the user can be left with the impression that the engine either doesn't understand what it is being asked to do, or that it is wilfully disregarding instructions because it thinks — often wrongly — that it has a better idea of what the user is searching for than the user does.


Agreed. Getting results which don't include the quoted terms is mind bogglingly useless.

I thought the whole point was to improve over time, not get worse. :(


This was exactly my problem. I tried to love DDG, I really did, but this behavior was so annoying that I turned back to Google a few months ago.

(No, I do not consider typing "!g" before any search that contains quotes a solution to the problem.)


I use ddg often myself.

Google does infer purpose better, and if someone is looking to buy something, it does well there too.

Ddg is very good at info queries and the more one uses it, the better it is.

What they could do is exactly what google did and that's to review those uses and improve.

But what they have right now is solid, given just a tiny bit of work.


The biggest habit I had to break moving from Google to DDG was phrasing everything as a question.

If anyone is thinking of making the switch, you can always redirect your searches to Google by throwing a !g in the query.


> phrasing everything as a question.

I wonder if this is generational or cultural?

Personally, I dislike trying to interface with a machine using natural language, because I know it can’t really understand me, and I’d rather read and interpret the results for myself than have an algorithm pick the “best”.

I actually find speaking to machines (e.g. automated phone systems, Siri etc) using natural language quite embarrassing, as if we were pretending that real life was like Star Trek.


I also hate talking to machines, minus some exceptional circumstances. I find it especially annoying when phone menu systems insist that I talk to them. Some of them don't even respond to mashing zero.


This. We do not have machines able to sort meaning out yet, so why bother?


For most information retrieval purposes, Google's NLP algorithms can effectively determine the meaning of your query thanks to BERT.


In a simple probabilistic sense, sure. Shove enough data at the problem and the easy cases work out. Those are only a small subset of the problem space.

Until we address meaning, my statement remains solid. And it can often be easier to treat the tool like what it is rather than figure out how to best pretend it is something it is clearly not.


I think it's less about trying to talk to Google, and more about phrasing your search the way somebody would ask it on some random forum. That is often the benefit of phrasing it as a question.

Although such queries are habit-forming, and now Google does a decent job of understanding the actual question.


Funny, I never used questions with Google, until very recently, and only with some queries.

Got good at including words for context early on and never stopped.


Out of curiosity — what made you start?


Seeing a few queries others did. Now, if it seems like an obvious question, I may try it.

My default remains word searches. Frankly, better operators would benefit me more than questions would.

I really do not always want to formulate a question. Doing that makes sense sometimes.

Often, I want to see relevant info, then formulate other queries.


It sucks for me when I search anything outside technical/science and daily life.

It also sucks at retrieving very new information.

And I say this as someone who set DDG as default.

I mean, you do seem to be DDG’s ideal user. You searched for mostly technical issues, and a hot political issue.


Apart from that, it's nice that it has no commercial bias. For instance, if I search for a thing that is both a real thing and a product, I get the real thing returned.

Still, for really niche topics, if search is needed, there is no way around Google. On the other hand, Google is not the only way to explore the web, let alone to auto-complete a URL...


I also have no issues with DDG but most of my searches are pretty specific or I'll just end up at the Wikipedia page anyway.

Other than that, I've been using [Runnaroo](https://www.runnaroo.com) as well.


I’ve tried searching for stuff related to Alexa Presentation Language (APL) a bunch of times. It never finds anything useful; I throw “!g” on the query string and what I’m looking for is typically the first or second result.


Yep, DDG gives me much better results than Google. Even if I'm searching in Spanish.


> Instead, it should crawl a whitelist of domains, or “tier 1” domains. These would be the limited mainly to authoritative or high-quality sources for their respective specializations, and would be weighed upwards in search results.

Not a big fan of this conclusion. Who chooses the whitelist, and why should I trust them? Is it democratically chosen? Just because a site is popular very clearly does not mean it's trustworthy. Does it get vetted? By whom? Also, whose definition of trustworthy are we trusting?

If I want my blog to show up on your search engine, do I have to get it linked by one of those sites, or can I register with you? Will I be tier 1, or


> If I want my blog to show up on your search engine, do I have to get it linked by one of those sites, or can I register with you? Will I be tier 1, or

I think what I'd say in defense is that we've misunderstood what search engines are useful for. They're really bad at helping us discover new things. Your blog might be awesome, but it's not going to be easy for a search engine to tell that it's awesome. It's going to have to compete with other blogs that also want views, some of whom are going to be better than yours at SEO, and so on.

What a search engine might be able to tell is that it's useful. Because what search engines are at least potentially good at is answering questions. You do that by having a list of known good sites to answer specific types of questions, and looking at the sites they link to. It's when you try to do both (index everything on the web and provide accurate answers to specific questions) that you end up failing to do either. For example this is the #2 result for "python f strings" on DDG[1]. It's total garbage, and, quoting the blog, "we can do better". (This result is also on page 1 for the same query on Google.)

What I believe ddevault is suggesting is that we make a search engine that does the only thing search engines are really good at, answering questions. You throw away the idea of indexing everything on the web, and therefore the possibility of "discovery". What that means is that in 2020 you need some other mechanism for discovering new sites, bloggers, and so on. Fortunately we do have some alternatives in that space.

To be clear, I don't know if I 100% buy this argument, but I think it's the general idea behind what's being suggested in this blog post.

[1] https://careerkarma.com/blog/python-f-string/


As an experiment, I searched “tech news aggregator” on both google and DDG. Neither listed Hacker News. Instead, apart from a few actual sites, most of the links were articles saying “top ten tech news sites” or links to quora q&a threads.

It definitely seems that search engines can’t find new websites for people. Now they are just aggregating Q&A.


Yeah - this is sort of my thinking. For better or worse if you enter "discovery" terms, what you will get back is not the results of a "discovery" search, but rather an answer to the question "what are some websites that will help me discover X".


> For example this is the #2 result for "python f strings" on DDG[1]. It's total garbage, and, quoting the blog, "we can do better". (This result is also on page 1 for the same query on Google.)

There are other options that may be better, but in general, very few people are looking for them.

Here's your "python f strings" query on Runnaroo:

https://www.runnaroo.com/search?term=python+f+strings

I'm the creator of Runnaroo.


I tried this out real quick just to see for myself. First, it was slow for me; if I hadn't been wanting to try it out I might have closed the tab. But I did a search for a question I recently googled (one that was an example of an annoying Google search). I searched "Python arcade sprite opacity", and while the first three results were the same as Google's, the fourth was a GitHub link to the project itself and brought me to the answer (which wasn't on page 1 of Google for me, although it was on page two).

So you need to speed it up, but it does look good for searching documentation at least.


> You do that by having a list of known good sites to answer specific types of questions, and looking at the sites they link to.

I mean, that's basically the core of the original Google PageRank, right? A "good" site linking to another site is what makes that other site some amount of "good" too, and links from better sites carry more 'juice'. "Good" is of course not just binary, but a quantitative weight.
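For reference, a minimal sketch of that original PageRank idea on a toy link graph: each page's 'juice' is split across its outlinks every iteration, so pages linked from well-regarded pages accumulate more weight. Real implementations handle scale, personalization, and spam resistance far more carefully.

    def pagerank(links, damping=0.85, iterations=50):
        """links maps page -> list of pages it links to."""
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1 - damping) / len(pages) for p in pages}
            for page, outlinks in links.items():
                targets = outlinks or pages          # dangling page: spread evenly
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            rank = new_rank
        return rank

    toy_web = {
        "blog.example": ["docs.python.org"],
        "docs.python.org": ["python.org"],
        "python.org": ["docs.python.org", "blog.example"],
    }
    print(pagerank(toy_web))  # pages linked from "good" pages end up weighted higher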

I don't know to what extent that's still at the core of their relevancy rankings. I don't know how all those annoying spammy recipe blogs or content farms get to the top of the results either. I don't think it's because Google's engineers believe they are "good" results.

Relevancy ranking on web search is clearly a hard problem, mainly because so many authors are trying to game it, it's a feedback loop.

If Google can only do as well as it does despite pouring a whole lot of money into it, I don't see a reason to bet that something better will come from what's basically an over-simplified description of how Google started out (and then evolved, because it wasn't good enough).


That was exactly my thought on reading that part of the article: that the author was describing Pagerank, but substituting Pagerank's design objective of algorithmically outsourcing quality judgments to the global community of website publishers with the programmer's prejudices instead.

I hope the project of improving on DuckDuckGo is successful though, and some of the other proposals in the article sound promising to me.

In the future I would like to see an open source search engine 'paid for' using some combination of homomorphic encryption, blockchain and Tor-style technologies to trade bandwidth and processing power with its userbase in exchange for search results and other services, but I don't have the expertise to assess how feasible that might be.


The money really is the crux of the issue, too. If you take as a given that consumers won't pay for a service they can get for free, then you're kind of in a bind creating an unopinionated search engine: if you're truly objective, and reflecting the underlying value of each link faithfully, who do you charge?


This is not about a single search engine instance replacing all your search engine needs.

There could be a community of software developers running one instance of the open-source search engine that focuses on programmers' needs: documentation, VCS hosters, dev blogs, tech news websites, and on-topic blogs. Great if you need to search for how to solve a software issue, terrible if you need to figure out how long to cook spaghetti.

The lists of crawled pages, I guess, would be visible/searchable too, at least on instances that you could trust.

If you don't like the list, and the maintainers for whatever reason don't want to change it to your liking, feel free to use another instance or set one up yourself.


This is the single suggestion that got me excited, instead of a behemoth do-it-all engine, have focused search engines. Exciting idea.


You may like 'boardreader.com'.

I find it helps when looking for obscure info on random topics.


Wow, pretty cool! Thank you.


Then who makes the search engine to search for those search engines?


Bring back link rings :)


Things actually sort of ran that way once. The DMOZ directory system was a canonical list of sites by subject, organized top-down. It was maintained by a community of volunteers in the fashion of Wikipedia. I believe it was used as one reference by Google and other search engines at one time. I don't know if such an "objective" system could be rebuilt, however.

Still, it's good to remember that it was once uncertain whether people should access the web using something like a table of contents (portal/directory) or something like an index (search engine). It seems the search engines won.

See: https://en.wikipedia.org/wiki/DMOZ


> Who chooses the white list, and why should I trust them? Is it democratically chosen?

You could have user compiled lists of sites to show in search results.

Let the users pick the lists they want to see, and communities can create and distribute lists within themselves.


Great idea, but why build a search engine at all in this case? You can use DDG + your filter and see only the results from your whitelist.

Could easily be implemented for any current search engine.

To a large extent, this is what you already do when you view a page of search results: filter them based on your understanding of what sites and results hold value.


> why build a search engine at all in this case?

On a public scale, you could make an argument for tighter integration/better privacy with the lists. For example:

    Browser -----Request-to-SE-----> Search Engine
      ^                                   |
      |                      Unfiltered Results (In YAML/JSON)
      |                                   |
      |                                   V
      |--Desired Results------ Local Filtering/Rendering
On a private scale, if you are only crawling sites on the allow list, then you have the possibility of being able to better maintain a local database of sites to show up in the search.

Edit: Possibly this could be easier to use to set up distributed search as well, as each node could index a given list, and then distribute that list similarly to DNS. Don't really know how well that would work though, just an idea.
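A minimal sketch of the local-filtering step in the diagram above, assuming a hypothetical search endpoint that returns unfiltered results as JSON (the endpoint, response shape, and allow list are all made up for illustration):

    import json
    from urllib.parse import quote, urlparse
    from urllib.request import urlopen

    # Hypothetical user-maintained allow list, distributed like an adblock list.
    ALLOW_LIST = {"docs.python.org", "stackoverflow.com", "en.wikipedia.org"}

    def search_filtered(query, endpoint="https://search.example/api"):
        """Fetch unfiltered results from a (hypothetical) search API and keep
        only the results whose host is on the local allow list."""
        with urlopen(endpoint + "?q=" + quote(query)) as resp:
            results = json.load(resp)   # assumed shape: [{"url": ..., "title": ...}, ...]
        return [r for r in results if urlparse(r["url"]).netloc in ALLOW_LIST]

    # for r in search_filtered("python f strings"):
    #     print(r["title"], r["url"])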


Aside from this, a big reason to build this is that it seems a lot simpler than writing a giant web crawler a la Google, and thus is a good target for an open-source solution, which is the biggest problem with DuckDuckGo.


Do you have thoughts on implementing the distributed search?

I'm thinking about playing around with this in my spare time, but that part seems the hardest to do.


Don't start from scratch; take a look at YaCy (https://yacy.net/), which already does most of what is discussed here.


I would think everyone would run their own “crawler” but maybe you could use a ledger and delegate sites to different workers. If you’re whitelisting sites you could maybe only crawl a link or two deep.

It’d take a lot of cycles while small but if you get a network growing you could even have sub-networks with their own whitelist additions (and every user has a blacklist.)


That’s what directory sites offered once upon a time. It was a pretty good way to discover new content back then. I spent a lot of time on dmoz when I wanted to find information about various topics.


So basically lock out any new site, regardless of content. Great idea /s


But email is already like this. It's the inbox providers who choose which domains are legit, and new domains start from a negative rating. Treating the web the same way doesn't sound too unnatural.

It would be bad if those in the positions profit by "authorizing" who is good though.


I'm not sure why email should be an example of the correct way to do it. And with email I can check my spam folder and see exactly what has been rejected. So unless the search engine has a list of sites that aren't deemed worthy included with every search (which probably wouldn't happen), I think this solution has some pretty big flaws. It should be noted that the current system also has these flaws, as Google and DDG can show you whatever they want based on whatever criteria they see fit.


I like this idea! Have the usual official results... then have an option to go to level 2, level 3, level 4 etc (lvl 1 is not included in lvl 2)

You can have really biased, technically terrible filters that, for example, put a site on level 4 because it is too new, too small, and any number of other dumb SEO-nonsense arguments. (The topic was not in the URL! There was a poor choice of text color!)

I think wikipedia has a lot of research to offer on what to do but also what not to do. Try getting to tier 2 edits on a popular article? It would take days to sort out the edits and construct by hand a tier 2 article.


Per your 2nd para Google used to have some options to tailor the results more, like allinurl or inurl or title or link (IIRC the word had to be in a link pointing to that page) or whatever.

I expected that to evolve to offer more specificity, but things went completely the other way, and we can't even reliably specify that a term is on a page with Google now.

Similarly, I was all in on xhtml and semantics (like microformats) where you'd be able to search for "address: high street AND item:beer with price:<2" to find a cheap drink.


I used to use inauthor: a lot.

I imagine for a FOSS solution we would have to make every separable ranking algorithm configurable, with the option to toggle them in groups, as well as build CLI-like queries around them (with a GUI).

I'm starting to see a picture now. Instead of wondering how to build a search engine, we should just build things that are compatible. A bit like: the output of your database is the input of my filter.

Take site search: it is easy to write specs for, with tons of optional features, and it can easily outperform any crawler. Meta site search can produce similar output. Distributed DIY crawlers can provide similar data.

Arguably top websites should not be indexed at all. They should provide their own search api.

The end user puts in a query and gets a bunch of results. They go into a table with a column for each unique property. The properties show up in the sidebar to refine results (sorted by how many results have the property). Clicking on one, filling out the field, or setting a min/max displays the results and sends out a new, more specific query looking for those specific properties. New properties are obtained that way.


Yes, I was thinking along similar lines, IIUC: a sort of federated search using common DB schemas and search APIs, so that I could crawl pages and they could be dragged into your SERP by a meta-search engine. I think the main thing you lose is popularity and ranking from other people's past searches. That could be built in, but it relies on trusting the individual indices, which would be distributed and so could be modified to fake popularity or return results that were not wanted (though then one could just cut off that part of one's search network).


I wonder if the correct answer is a blacklist for known spammy sites and the ability to turn the list off.

If I never see a pinterest link, or one of those sites that just republishes stackoverflow answers unedited, I'd be fine with it.

Of course, these systems always get abused and some political or news site will end up on it.


Choose your list of favorites, or subscribe to someone whose curation you trust. No worse than trusting Google/Twitter/Facebook/etc.

In other words, this is precisely how a market functions.


This is exactly where he lost me. I don't think it is hard at all to find results in "tier 1" domains with DDG. I would argue that we have almost entirely the opposite problem: besides blogspam/internet cancer and tier-1 sites, you hardly get any results. It's incapable of finding the actually useful communities or blogs for your query.


> Who chooses the white list, and why should I trust them?

That's part of the point. There could be different search engines that run the same code, but have different sets of tier 1 domains that cater to different audiences. And if you have the resources, you could set up your own engine with a set of tier-1 domains that you chose.


Instead of having many bots do inefficient crawling, web sites should publish their own index. Intermediate parties can combine the sites' indexes. Sites that do not provide indexes get fewer visitors.


That's the best way to fill your search engine with spam. There needs to be a third party that verifies that the site-provided index is in line with the actual content. At which point said third party can be a crawler.


This is the idea behind sitemaps, which have existed forever.


Sitemaps are lists of urls on a site. They are not a text index.


Private.sh anonymizes search results by proxying requests only after they are also encrypted client side.

It uses Gigablast, which has a much fairer search result set, more akin to search engines of the past!


SEO is crushing the utility of Google. It is pretty telling when you need to add things like site:reddit.com to get anything of value. Harnessing real user experiences (blogs, etc.) is the key to a better search engine. This model unfortunately crumbles under walled gardens, which are increasingly the preferred location of user activity.


That’s where blogs were at, but now a massive portion of them are content farms / splogs.

You’re right that the walled gardens have hurt this. So often I search something specific, or a topic, and find very little. But I know there are communities on Facebook for this, I know there would be peoples posts out there on Instagram which 100% answer my question. But they may as well not exist. Unless I was “following” then when it was said, and mentally indexed it, these things are mostly unfindable, and that’s if I even have an account for said service (which I don’t for Facebook)

It’s sad, more people than ever using the internet, more content & knowledge being created than ever before, yet it’s no longer possible to find the great answers.


Combined with disinformation campaigns and post-truth phenomena, what we defined as "the information age" seems to have been short lived.


> Harnessing real user experiences (blogs, etc)

This is what we need more than anything. More independent blogs. The ability to search events now, or 10 years ago, mass indexing of RSS feeds, etc.

A general search engine is kinda way out of the ballpark for now. But you could specialize for long form blogs, from all sides, hard-left, hard-right, women in tech, white supremacists, all the extremes and moderates.

I'd love to have an interface to search a topic and see what all kinds of people have posted long form, without commentary or Twitter/Facebook bullshit "fact checking" notices. I want to see what real writers across the spectrum are saying on a given topic for the week or month.


> More independent blogs.

The problem is that content farms have mastered the art of writing like an ostensibly independent blog. This is most visible in recipe blogs, where for example the site will look independent, the blog owner’s "About Me" page will say that she is a young woman born and raised in Louisiana and passionate about her home region’s cooking, but the English is replete with the sort of mistakes that non-native speakers make. You can tell that the content writing was farmed out to someone from Eastern Europe or Southeast Asia, and basically the whole blog and its owner are fake. (Even all the recipes were drawn from other blogs, but someone was paid to rewrite them slightly.)


I wonder how viable it would be to just exclude all sites with ads.


> This is what we need more than anything. More independent blogs. The ability to search events now, or 10 years ago, mass indexing of RSS feeds, etc.

Thought experiment: what would a search engine look like if it only indexed RSS and Atom feeds?
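A toy sketch of that thought experiment: pull a few RSS feeds and build a tiny inverted index over the entry titles. The feed URL is a placeholder, only plain RSS 2.0 <item> elements are handled (Atom would need namespace handling), and a real engine would also fetch and index the linked pages:

    import xml.etree.ElementTree as ET
    from collections import defaultdict
    from urllib.request import urlopen

    FEEDS = ["https://blog.example/feed.xml"]  # placeholder feed list

    def build_index(feeds):
        """Tiny inverted index: word -> set of (title, link) pairs."""
        index = defaultdict(set)
        for feed in feeds:
            with urlopen(feed) as resp:
                root = ET.parse(resp).getroot()
            for item in root.iter("item"):
                title = (item.findtext("title") or "").strip()
                link = (item.findtext("link") or "").strip()
                for word in title.lower().split():
                    index[word].add((title, link))
        return index

    # idx = build_index(FEEDS)
    # print(idx.get("duckduckgo", "no matches"))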


It's hard to get readership writing blogs these days. That's pretty demotivating.


Also difficult to distinguish a blog from a content farm if you are just crawling the web. Any content pattern you select for would likely be quickly adopted by SEOs.


I've found a direct correlation between the chance of a content farm and the number of ads on the blog. With 0 ads, the likelihood of a content farm is 0%.


You could use machine learning instead of a hard-coded heuristic.
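A hedged sketch of what that could look like with scikit-learn; the features (ad count, affiliate links, length of the about page) and the labels are entirely made up, and the only point is that the weighting is learned rather than hard-coded:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Made-up features per site: [ads on page, affiliate links, words on the about page]
    X = np.array([
        [0, 0, 450],   # independent blog
        [1, 0, 800],   # independent blog
        [12, 30, 60],  # content farm
        [9, 18, 40],   # content farm
    ])
    y = np.array([0, 0, 1, 1])  # 1 = content farm (hand-labelled)

    clf = LogisticRegression().fit(X, y)

    candidate = np.array([[7, 10, 90]])
    print(clf.predict_proba(candidate)[0, 1])  # estimated probability of content farm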


I think this person actually means "We can imagine doing better than DuckDuckGo".


Well, ideas are much easier than implementations.


Cliqz in Germany was one such implementation, funded in part by Mozilla but completely independent.

They wrote their own search engine.

They closed shop earlier this year.


It's kind of amazing how many people think an idea is the biggest part of a viable product.


So you want to build a team and organize finances first? That doesn't seem like a bad idea... wait...


The right question is: How to do search using open source tools?

If your goal is "to make something better than the Duck" and you succeed, the Duck dies... what is your goal now?


My main problem with DDG is that there's no way to be sure they actually respect their users' privacy as they claim to.

Ideally, services like theirs would be continuously audited by respectable, trusted organizations like the EFF.. multiple such organizations even.

Then I'd have at least some reason to believe their claims of not collecting data about me.

As it stands, I only have their word for it.. which in this day and age is pretty worthless.

That said, I'd still much rather use DDG, who at least pay lip service to privacy, than sites like Google or Facebook, who are openly contemptuous of it.

At the very least it sends a message to these organizations that privacy is still valued, and they'd lose out by not trying to accommodate the privacy needs of their users to some extent.


I don’t even care about the privacy. (Well, I do, but in this context I have no reasonable way to ensure it)

What I do care about is trust-building and monopolistic practices.

That, to me, is a great reason to use DDG instead of Google or even Bing.


I also prefer DDG's user interface over Google's. And DDG's !bang search shortcuts.

DDG has been my default search engine for years and its results are good enough for me 95% of the time. I only need to use Google as a fallback when searching for niche technical information or "needles in haystacks".


Even then, the Google results are usually terrible. I haven't used Google as a fallback in about a year because every time I tried it they couldn't find what I was looking for either. Or they did something atrocious like changing my search terms for me.


Facebook and Google are huge, global companies whose main product is free, and yet they aren't charities. The only way to be mega-rich and offer something free is to be shady and manipulative with users' data. Exploiting privacy is their business model. They aren't gonna respect it.

Being super financially successful off free products and services is not a recipe for an honest, citizen respecting company.


DDG search costs the same as google search.


DDG does operate their own crawler[1], though they also do still rely on third parties[2].

[1] https://help.duckduckgo.com/duckduckgo-help-pages/results/du...

[2] https://help.duckduckgo.com/duckduckgo-help-pages/results/so...


Their own crawler is only used to fetch things for the widgets, not the search index.


Ahh, that does seem like a more correct reading of the page. Do we have a source that's unambiguous? Seems strange to have a bot that's able to parse pages for their instant answers but then not use those same results for regular search.


They literally use the Bing API for search results, which is well known.


It's well-known that they use Bing. They also say that they use other sources. In this case I'm looking for an explicit disambiguation of what sources they use for what; my first read led me to interpret it as them also using their crawler to return search links (as opposed to just being used for their instant answers).


"We also of course have more traditional links in the search results, which we also source from multiple partners, though most commonly from Bing (and none from Google)."

This refers to the actual search results, and if they used their bot for that, I don't see why they would say "multiple partners" instead of "multiple partners and our own crawler". The fact that they don't have their own index is such a common "complaint", and this page is often referred to, so if they really used their own bot they should have added that a long time ago.

And it's not just a legacy page they have forgotten to update. It keeps being updated. In 2019 it said:

"We also of course have more traditional links in the search results, which we also source from a variety of partners, including Oath (formerly Yahoo) and Bing."

It's interesting to note that back in 2014 the page looked like this: https://web.archive.org/web/20131202065705/https://duck.co/h...

Here they talk about their own indexes getting bigger but at the same time admitting that "it seems silly to compete on crawling and, besides, we do not have the money to do so". Completely understandable but also interesting that the current page doesn't mention their own index at all. Maybe they used to have a goal to build their own independent index that has now been dropped?

All in all, I think it's safe to presume that their own crawler is only used for Instant Answers etc since that's the part of the sources where it's mentioned. Or at the very least used to such a small extent in the actual search results that it would be disingenuous to even mention it as a source.


Author didn't even DDG to find this out?


Clearly neither did you, as he is correct. DDG's crawler is not a crawler like Googlebot.


[flagged]


No personal attacks on HN, please.

https://news.ycombinator.com/newsguidelines.html

Digging up past internet history as ammunition in an argument isn't cool in general. It's not that such details are necessarily wrong or irrelevant, but doing this has a systemically degrading effect and we don't want to be that sort of community.

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...


I don't like to criticize the author. We all have good takes and bad takes and really for a single post, you should address the argument. Digging up the past is part of what's making the world worse.

That being said, I do see a valid reason for bringing up his history of bad takes. I used to respect DeVault. He banned me on the Fediverse because he disagreed with me being against defunding the police and against critical race theory.

I find some of his stuff interesting, and I agree with more AGPL and more real open-source development. I'd even say I'm jealous that he can actually fund himself off of his FOSS projects and do what he loves.

But I do agree, he does have a lot of questionable takes. He seems to love Go and hate Rust, hate threads for some reason, and has a lot of RMS style takes. Not all of them are bad, and hardcore people can help you think.

As far as this post goes, I do think search is pretty broken. I think a better solution is more specialized search. Have a web tool just for tech searching that does StackExchange sites, github, blogs, forums, bug trackers and other things specialized to development.

Another idea would be an index that just did blogs, so you can look up any topic and see what people are writing about long form for the current month. Add features to easily see what people were saying 5 or 10 years ago too. There is a ton of specialized work there, in filtering blog spam, making sure you get topics from all sides (including "banned" blogs), etc.

You used to have to go to Lycos, Yahoo, HotBot, or Excite, and you'd get different results and find lots of different helpful things. We need that back. It will take some good, specialized tools to break people from Google search.


[flagged]


Quite apart from this type of attack breaking the site guidelines and not being allowed on HN (about which see https://news.ycombinator.com/item?id=25130908)... that was 9 years ago. Imagine being publicly shamed for the worst thing you've done in the past decade. I don't think anyone is going to pass that test.

Is this the kind of world you (or any of us) really want to be part of? Surely not. Therefore please don't help to create it here.


Wow. I'm honestly not surprised. That's ... that's pretty shitty.


Knowing a bit of his personal history I can kind of understand why he acts the way he does, and has the opinions he does. Doesn't excuse some of it, but at least I kinda get why.

I just wish his name would stop coming up for me tied to opinion pieces like this. I'd rather just see things about how some project he's working on is doing great and being widely adopted.


How would anybody ever know what the server is running and/or doing with the data you send it, regardless of if it is running open or closed source code?

A service, running on somebody else's machine, is essentially closed.

I think the only way to have an 'open' service is to have it managed like a co-op, where the users all have access to deployment logs or other such transparency.

Even then, it requires implicit trust in whomever has the authorization to access the servers.


That sounds a bit like YaCy.[1] It is a program that apparently lets you host a search engine on your own machine, or have it run as a P2P node.

I think the next step forward should be to have indices that can be shared/sold for use with local mode. So you might buy specialised indices for particular fields, or general ones like what Google has. The size of Google's index is measured in petabytes, so a normal person would still not have the capability to run something like that locally.

Edit: In another thread, ddorian43 has pointed out the existence of Common Crawl,[2] which provides Web crawl data for free. I have no idea if it can be integrated with YaCy, but it is there.

1. https://yacy.net/

2. https://commoncrawl.org/


In theory, this is the kind of thing that the GPL v3 was trying to address: roughly speaking, if you host & run a service that is derived from GPL-v3'd software, you are obliged to publish your modifications.

But, I agree with you - and I don't think the author had really thought through what they were demanding, they made no mention of licensing other than singing happy praises of FOSS as if that would magically mean you could trust what a search engine was doing.


> In theory, this is the kind of thing that the GPL v3 was trying to address: roughly speaking, if you host & run a service that is derived from GPL-v3'd software, you are obliged to publish your modifications.

You mean AGPL https://en.m.wikipedia.org/wiki/Affero_General_Public_Licens...


You're right... I'm misremembering the GPL, wikipedia says that it was only 'Early drafts of GPLv3 also let licensors add an Affero-like requirement that would have plugged the ASP loophole in the GPL' - I hadn't realised it never made it into the final version.


> In theory, this is the kind of thing that the GPL v3 was trying to address: roughly speaking, if you host & run a service that is derived from GPL-v3'd software, you are obliged to publish your modifications.

Why would I trust someone to do that, though?


> How would anybody ever know what the server is running and/or doing with the data you send it, regardless of if it is running open or closed source code?

https://en.wikipedia.org/wiki/Homomorphic_encryption


Hah, this gave me a picture of a base plate onto which one can click an infinite number of hard-disk enclosures on top, or additional base plates on the four sides.

You get a subscription and the index updates (enclosure + preloaded drive) are sent to you periodically.

The front of the enclosure says: 2020 4th quarter


Am I the only person who just doesn't have problems with DDG search results?

What am I doing wrong (or right), here? I put a thing in and find it. I just don't use Google any more.

Genuinely curious why it's working for me and such garbage for everyone else.


You're probably searching for English language articles and are being explicit about what you want.

For example, you might search for `vue js on show` whereas `vue on show` will show you (in the UK) results for what is on at Vue cinemas.

With Google, I expect it would understand that you are probably searching for JS related vue questions and rank those higher.


I've also been using DDG exclusively for many years. I usually find what I need in the first couple of results or in the box on the right, which usually goes directly to the authoritative source anyway.


I'm mostly getting Norwegian results when searching for Danish subjects from a Danish IP address. It also seems it just hasn't indexed as many websites as Google.


I re-search almost everything technical with google after ddg showed me crap. I still use ddg by default tho', it works for most things, just not for work.


Do Google search results work for you? If yes, then I'd say the reason is that you don't see, or don't agree with, how bad results are today (as others have posted about extensively). I for one find DDG to be the search engine that returns the worst results. Qwant is a better Bing-using engine IMO, but it is still bad.


Yeah, I'm with you.

I can think of some improvements (better forum/mailing list coverage), but it's generally pretty good. Lately if I don't find it on DDG I probably won't have much luck anywhere else, either.


I sometimes come across inappropriate results - for example, I search for a hex error code and the results are for other numbers - and sometimes the adverts are misleading, but neither is prevalent enough to harm the experience in general.

I always send feedback when I come across incorrect results and also try to when I get a really easy find.

I have not had to resort to any other search engine for at least five years.


I've tried DDG for a while, around a couple of years ago, and I had lower-quality results particularly for technical subjects (which are the vast majority of my searches). I will give DDG another shot, though.


For generic stuff, DDG is mostly OK. But for local results, even though it has a switch for local results, it REALLY REALLY REALLY sucks and often doesn't get any of the expected places anywhere in the first few pages for New Zealand, which makes it somewhat useless.


I'd say about 50% of the time I'm good with DDG. About 1/3 of the time I add !g, usually for weird error messages and tech stuff.

Honestly, we shouldn't be using Google for everything. Why not just search StackExchange or GitHub issues directly for known bug problems? If you need a movie, !imdb or !rt forwards you to exactly where you really want to search.

If DDG or Google also included independent small blogs for movie results, I could see the value in that. I'd prefer someone's review on their own site or video channel, but it doesn't. We've kinda lost that part of the Internet.


Just this weekend I had the "weird error message" situation. DDG gave me one page with no relevant results, while Google had no problem finding proper matches.

Would be nice if they could do better on queries like that... though the funny thing is, if they didn't respect privacy, they probably could: log any searches where a user looked for something and then tried the same thing prefixed with !g, and use those for figuring out where to focus efforts and what to test with.


Why couldn't several coordinating specialized search engines share their data via something like "charge the downloader" S3 buckets? Then you get an org like StackExchange, which could provide indexed data from its site along with the algorithms to search that data most efficiently; GitHub can do the same for its specific zone of specialty; Amazon, etc.

Then anyone who wants to use the data can either copy it to their own S3 buckets to pay just once, or can use it with some sort of pay-as-you-go method. Anyone who runs a search engine can use the algorithms as a guide for the specific searches they are interested in for their site, or can just make their own.

You could trust the other indexers not to give you bad data, because you'd have some sort of legal agreement and technical standards that would ensure that they couldn't/wouldn't "poison the well" somehow with the data they provide. Further, if a bad actor was providing faulty data, the other actors would notice and kick them out of the group or just stop using their data.

It would have to be fully open source, I agree with the other parts of Drew's essay here, but I think we could share the index/data somehow if we got together and tried to think about it. We just need a standard for how we share the data.
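For what it's worth, the "charge the downloader" part already exists as S3 requester-pays buckets; a rough sketch of how one party might pull another's published index shard with boto3 (the bucket and key names are hypothetical):

    import boto3

    s3 = boto3.client("s3")

    def fetch_index_shard(bucket, key, dest):
        """Download a shared index shard from a requester-pays bucket;
        the downloader's AWS account is billed for the transfer."""
        resp = s3.get_object(Bucket=bucket, Key=key, RequestPayer="requester")
        with open(dest, "wb") as f:
            f.write(resp["Body"].read())

    # Hypothetical example: StackExchange publishing its own indexed data.
    # fetch_index_shard("stackexchange-search-index", "2020-11/posts.idx", "posts.idx")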


There's Common Crawl for the crawling aspect, about 3.2 billion pages last time I looked. One of the issues with that kind of detachment of jobs is crawl data freshness.


I'm thinking more like search indexers with like 100k pages, in a specialized category like 3d printing or basketball. I can index things like those on my home PC, practically.


I think your idea has merit.

Though it would require development of the "charge to download" S3 buckets and infrastructure to support payments.

There is also an economic issue where you have to calculate the download cost to also cover storage costs.


I think you can already set an S3 bucket to charge the downloader for bandwidth; that's what I was talking about. The storage part is harder. The storage costs could be borne via sharing agreements between the commercial interests and the paying customers, where infrastructure (storage, compute) could be provided for smaller indexers and, in exchange, the provider can use the data freely for their own services. Or we could have a micropayments system that aggregates towards the end of the month, something open source and free to use, maybe a blockchain; this is actually one place where tokens on a blockchain could work as a semi-decentralized payment system between the parties in agreement, in my "vision" or whatever you want to call it.


So you're proposing Snowflake for search?


It appears to be the case in a technical workflow sense, from the little I just read of Snowflake, but my proposal would be a much more open system than one under control of a single vendor. It'd be more like a set of standards for interoperation and a common data center so that the data is accessible under one roof. Maybe each entity could do specialized search as a service and the search aggregators would pay by providing some infrastructure to run the crawlers.

I don't personally think any system specification is impossible unless it goes against some mathematical law, so a really fast, distributed query system where there are a few hundred specialized providers for a single query is feasible. Imagine the aggregator does initial analysis to determine the category of the search, like programming, news, or restaurant reviews, then sends the user's query to a set of specialized providers that supply an index for that category, then fuses the results with some further analysis of the metadata returned. Then the user can also include or exclude the specialized providers at will.

You could also eliminate the aggregator as a service and simply make it a user application on the desktop, allowing for even more user control and maybe caching or something.
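
A rough sketch of that fan-out-and-fuse flow, assuming entirely hypothetical provider endpoints that return JSON hits with their own scores, and a toy keyword check standing in for the "initial analysis" step:

    import asyncio
    import aiohttp

    # Hypothetical specialized providers per category.
    PROVIDERS = {
        "programming": ["https://docs-index.example/search",
                        "https://qna-index.example/search"],
        "news": ["https://news-index.example/search"],
    }

    def classify(query):
        # Stand-in for the "initial analysis"; a real aggregator might use
        # a small text classifier here.
        return "programming" if any(w in query.lower()
                                    for w in ("python", "api", "compile")) else "news"

    async def ask(session, url, query):
        async with session.get(url, params={"q": query}) as resp:
            return await resp.json()   # assume [{"url":..., "title":..., "score":...}]

    async def search(query, exclude=()):
        urls = [u for u in PROVIDERS[classify(query)] if u not in exclude]
        async with aiohttp.ClientSession() as session:
            batches = await asyncio.gather(*(ask(session, u, query) for u in urls))
        hits = [h for batch in batches for h in batch]
        # Naive fusion: trust the providers' own scores; further metadata
        # analysis would go here.
        return sorted(hits, key=lambda h: h["score"], reverse=True)

    results = asyncio.run(search("python importerror help"))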


> they’ve demonstrated gross incompetence in privacy

Not sure I buy the example that is given here.

1. It's an issue in their browser app, not their search service.

2. It's not completely indefensible: it allows fetching favicons (potentially) much faster, since they're cached, and they promise that the favicon service is 100% anonymous anyway.

3. They responded to user feedback and switched to fetching favicons locally, so this is no longer an issue. https://github.com/duckduckgo/Android/issues/527#issuecommen...

> The search results suck! The authoritative sources for anything I want to find are almost always buried beneath 2-5 results from content scrapers and blogspam. This is also true of other search engines like Google.

This part is kinda funny because "DuckDuckGo sucks, it's just as bad as Google" is ... not the sort of complaint you normally hear about an alternative search engine, nor does it really connect with any of the normal reasons people consider alternative search engines.

That said, I agree with this point. Both DDG and Google seem to be losing the spam war, from what I can tell. And the diagnosis is a good one too: the problem with modern search engines is that they're not opinionated / biased enough!

> Crucially, I would not have it crawling the entire web from the outset. Instead, it should crawl a whitelist of domains, or “tier 1” domains. These would be limited mainly to authoritative or high-quality sources for their respective specializations, and would be weighed upwards in search results. Pages that these sites link to would be crawled as well, and given tier 2 status, recursively up to an arbitrary N tiers.

This is, obviously, very different from the modern search engine paradigm where domains are treated neutrally at the outset, and then they "learn" weights from how often they get linked and so on. (I'm not sure whether it's possible to make these opinionated decisions in an open source way, but it seems like obviously the right way to go for higher quality results.) Some kind of logic like "For Python programming queries, docs.python.org and then StackExchange are the tier 1 sources" seems to be the kind of hard-coded information that would vastly improve my experience trying to look things up on DuckDuckGo.
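
A small sketch of what that tiered crawl could look like; fetch_links() is a stand-in for a real fetcher plus link extractor, and the seed list and weighting scheme are just examples:

    from collections import deque
    from urllib.parse import urlparse

    TIER1 = ["https://docs.python.org/", "https://stackexchange.com/"]
    MAX_TIER = 3

    def crawl(fetch_links, seeds=TIER1, max_tier=MAX_TIER):
        # For brevity this tracks tiers per domain; a real crawler would
        # track individual URLs and politeness limits per host.
        domain_tier = {}
        queue = deque((url, 1) for url in seeds)
        while queue:
            url, tier = queue.popleft()
            domain = urlparse(url).netloc
            if domain_tier.get(domain, max_tier + 1) <= tier:
                continue
            domain_tier[domain] = tier
            if tier < max_tier:
                queue.extend((link, tier + 1) for link in fetch_links(url))
        return domain_tier

    def rank_weight(domain, domain_tier):
        # One possible scheme: each tier counts half as much as the one above.
        return 2.0 ** -(domain_tier.get(domain, MAX_TIER + 1) - 1)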


Maybe instead of hard-coding these preferences in the search engine, or having it try to guess for you based on your search history, you can opt-in to download and apply such lists of ranking modifiers to your user profile. Those lists would be maintained by 3rd parties and users, just like eg. adblock blacklists and whitelists. For example, Python devs might maintain a list of search terms and associated urls that get boosted, including stack exchange and their own docs. "Learn python" tutorials would recommend you set up your search preferences for efficient python work, just like they recommend you set up the rest of your workflow. Japanese python devs might have their own list that boosts the official python docs and also whatever the popular local equivalent of stackexchange is in Japan, which gets recommended by the Japanese tutorials. People really into 3D printing can compile their own list for 3D printing hobbyists. You can apply and remove any number of these to your profile at a time.
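
A sketch of what one of those shareable modifier lists, and a client applying it, might look like; the format, names, and boost values here are invented for illustration:

    # A downloadable list, distributed like an adblock list (as JSON, say).
    PYTHON_DEV_PREFS = {
        "name": "python_dev_prefs",
        "boost": {"docs.python.org": 3.0, "stackoverflow.com": 2.0},
        "bury": {"spammy-tutorials.example": 0.1},
    }

    def apply_prefs(results, pref_lists):
        """Re-rank results (dicts with 'domain' and 'score') by the user's
        active preference lists."""
        def adjusted(hit):
            factor = 1.0
            for prefs in pref_lists:
                factor *= prefs["boost"].get(hit["domain"], 1.0)
                factor *= prefs["bury"].get(hit["domain"], 1.0)
            return hit["score"] * factor
        return sorted(results, key=adjusted, reverse=True)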


I like this idea! I think the biggest difficulty with it - which is also probably the most important reason that engines like Google and DDG are currently struggling to return good results - is that the search space is just so enormously large now. The advantage of the suggestion in the blog post is that you trim down the possible results to a handful of "known good" sources.

As I understand it, you'd want to continue to search the whole "unbiased" web, then apply different filters / weights on every search. I really do like the idea, but I imagine we'd be talking about an increase in compute requirements of several orders of magnitude for each search as a result.

Maybe something like this could be made a paid feature, with a certain set of reasonable filters / weights made the default.


I disagree; the search space is shrinking as more and more stuff moves to walled gardens like Facebook and Twitter.


This may be a very dumb question, but could the filtering be done client-side? As in, DDG's servers do their thing as normal and return the results, then code is executed on your machine to weight/prune the results according to your preferences.

Maybe this would require too much data to be sent to the client, compared to the usual case where it only needs a page of results at a time. If so, would a compromise be viable, whereby the client receives the top X results and filters those?


This would work if you had a blacklist of domains you didn't want to see. But the idea in the post is closer to a whitelist: the highest priority sites (tier 1) should be set manually, and anything after that should be weighted by how often it's referenced by the tier 1 sites. For a lot of searches you're going to have to pull many results to fill a page with stuff from a small handful of domains, and in fact you might not be able to get them at all. And that's before you start dealing with the weighting issue, which would require quite a bit of metadata to be sent with each request.


Back in the day you'd have webrings - groups of sites that linked each other in clear association.


I have had a similar idea, what you're proposing is essentially a ranking/filtering customisation. The internet is a big scene, and on this scene we have companies and their products, political parties, ad agencies and regular users. Everyone is fighting for attention, clicks. Google has control over a ranking and filtering system that covers most searches on the internet. FB and Twitter hold another ranking/filtering sweet spot for social networks.

The problem is that we have no say in ranking and filtering. I think it should be customisable both on a personal and community level. We need a way to filter out the crap and surface the good parts on all these sites. I am sure Google wouldn't like to lose control of ranking and filtering, but we can't trust a single company with such an essential function of our society, and we can't force a single editorial view on everyone.

As we have many newspapers, each with its own editorial views, we need multiple search engine curators as well.


Would you pay $10/yr for this feature?


Unfortunately I suspect that if it were a premium feature, not enough groups would volunteer the requisite time into compiling and maintaining the site ranking lists. This sort of thing really has to become a community effort in order to scale, I think.


This is a great idea. It's like a modern reboot of the old concept of curated "link lists", maintained by everyone from bloggers to Yahoo. Doing it at a meta level for search-engine domains is a really cool thought.


This would be awesome! I'm so tired of google ignoring what I tell it, and trying to 'guess' what I want.

I'd also love to be able to specify that I want results from the last year without having to set it every time.


This doesn't really seem immune from spam.

I got signed up for Goodreads (a book review site), and I get tons of spam. It's not quite the same as your idea, but it is a curated list. I don't know how you stop spammers from adding bogus links in the Python interest list (to use an example).

This is a hard problem.

EDIT: Clarified goodreads reference!


Like any other list, it depends on who maintains it. You basically want to find the correct BDFL to maintain a list, much like many awesome-* repositories operate.


This is actually a great idea and something I can see working rather well.


As a hack until then, I've found Google's Custom Search Engine feature to work well enough for my use cases. I just add the URLs that are "tier 1" for me. https://programmablesearchengine.google.com/cse/all


> to guess for you based on your search history, you can opt-in to download and apply such lists of ranking modifiers to your user profile

pro-privacy does not sit well with terms such as search history and user profile


You might have misread. My proposal is an alternative to inferring user preferences based on their search history.


any type of profiling, opt-in or not, may be used to identify users


I mean, hacker news can probably also identify users based on which articles they click on, and how often they jump straight to the comments. I hope they don't.

But a system such as I'm describing is probably the only one that can be entirely consistent with the two disparate requirements of fully anonymizing users, and being useful to both programmers and ophiologists studying different things called "python".


what you are describing is relevance based on user input (be that cookies, search history, interests, a preference for x over y) that may be used as identifying information, which vastly de-anonymises the service. if a search query is too ambiguous then it can be refined. if the user knows they want a programming language and not a snake, they can let the search engine know themselves. don't sacrifice their anonymity for perceived usefulness


Presumably, the most common search preference lists would be used by very large numbers of people -- for example, almost all programmers would rather see Python (language) queries over Python (snake) queries and would probably all be using whichever search preferences become the most popular and well-maintained, like "mit_cs_club.json". A subset of those would also be into anime and enable their anime search preferences (probably more particular), and some of them will also like mountaineering, pottery, and baking, and will have such preferences configured as well. Yes that might be enough to identify you (just like searching for your own name would be) but those preferences don't need to be attached to you, just your query, and you could disable or enable any of them at any time.

It would basically be like sending a search query in this form:

"Python importerror help --prefs={mit_cs_club, studioghiblifans_new, britains_best_baking_prefs, AlpineMountaineersIntl}"

If you like baking, anime, and mountaineering, it's probably convenient to leave all those active for your searches, even your purely programming-focused searches. But you could toggle some of them off if articles about "helping to protect imported mountain pythons" are interfering with your search results, or if you want to be more anonymous. If you're especially paranoid you could even throw in a bunch of random preferences that don't affect your query but do throw off attempts to profile you. You could pretty easily write a script that salts every search with a few extra random preference lists, for privacy or just for fun, and make that an additional feature. The tool doesn't need to maintain any history of your past activity to cater to your search, so I think it would be a good thing for privacy overall.
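
A toy version of that salting script, reusing the made-up preference names and query format from above:

    import random

    MY_PREFS = ["mit_cs_club", "AlpineMountaineersIntl"]
    PUBLIC_DIRECTORY = ["studioghiblifans_new", "britains_best_baking_prefs",
                        "vintage_synth_collectors", "birdwatchers_eu"]

    def salted_query(query, decoys=2):
        # Mix real preferences with random decoys so the combination is a
        # weaker fingerprint (whether that actually helps is debatable).
        prefs = MY_PREFS + random.sample(PUBLIC_DIRECTORY, decoys)
        random.shuffle(prefs)
        return f"{query} --prefs={{{', '.join(prefs)}}}"

    print(salted_query("Python importerror help"))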


> Presumably, the most common search preference lists would be used by very large numbers of people

the more anonymous among us tend to opt for common IP addresses and common user agents to become the tree among the forest. adding a profile to that would, well, only add to a digital fingerprinting profile

> those preferences don't need to be attached to you, just your query

that's not how it works. preferences are by their nature personal. every transaction would have your interests and hobbies embedded, on top of metadata

> mit_cs_club, studioghiblifans_new, britains_best_baking_prefs, AlpineMountaineersIntl

you have voluntarily made yourself the birch among the ebony

> If you're especially paranoid you could even throw in a bunch of random preferences that don't affect your query but do throw off attempts to profile you

how would they not affect the query? they complement the query. or rather, unnecessarily accompany the query. your results depend on your input. it doesn't matter what colour glove you wear to pull the trigger if you bury the gun with the body

> write a script that salts every search with a few extra random preference lists, for privacy or just for fun

just the latter. fuzzing would be pointless since the engine will have already identified you by now

it sounds like an annoying browser extension at best. to label it a pro-privacy tool would be ludicrous


Agreed. I think the key point here is that the web is a radically different place than it was in 1998 (when Google launched and established the search engine paradigm as we know it). Back then the quality-to-spam ratio was probably much higher, the overall size of the web was certainly much smaller (making scraping the entire thing more tractable), and there were many more self-hosted sources rather than platforms (meaning it was more necessary to rely on inter-linking, and "authoritative domains" weren't as much of a thing). The naive scraping approach was both more crucial and more effective. And in the decades since, it's been a constant war of attrition to keep that model working under more and more adversarial conditions.

So I think that stepping back and re-thinking what a search engine fundamentally is, is a great starting point for disruption.

Additionally, something the OP didn't mention is that ML technologies have progressed dramatically since 1998, and that much of that progress has been done in the open. I can't imagine that not being a force-multiplier for any upstart in this domain.


But the situation with authoritative domains hasn't changed much. And which "platforms" are actually strong at answering questions? As in 1998, there are a few very good places for getting answers to certain kinds of questions. They are not facebook or twitter, ever.


I think Google sort of takes into account "votes", in that they look at the last thing you clicked on from that search, and consider that the "right answer", which they then feed back into their results.

As such, they effectively have a list of "tier 1" domains.


I kind of hope they don't, or there is more to it than just that -- for example, a user coming back and clicking on something else counts as a downvote for the first item.

Any system that ranks things purely based on votes or view counts can have a feedback loop that amplifies "bad" results that happen to get near the top for whatever reason. For web search, this would encourage results that look right from the results page, even if they're not actually a good answer to what the user is looking for.

An example of this would be when you're trying to find an answer to a specific question like "How do I do X when Y?". The best result I'd hope for is a page that answers the question (or a close enough question to be applicable), while the promising-looking-but-actually-bad result is a page where someone asks the exact same question but there are no answers.


> Any system that ranks things purely based on votes or view counts can have a feedback loop that can amplify "bad" results that happen to get near the top for whatever reason.

I think this is a place where Google has pretty obvious algorithm problems. For example, I’m building a personal website for the first time in many years, and obviously that means I’m doing a fair bit of looking up new or forgotten webdev stuffs. It’s widely known that W3Schools is low quality/high clickbait/has a long history of gaming the SEO system. They’ve been penalized by Google’s algorithm rule changes but continue to get the top result (or even the top 3-5 results!), even with Google having a profile of my browsing habits, and knowing that I intentionally spend longer on these searches to pick a result from MDN or whatever. It seems pretty likely that W3Schools is just riding click rate to stay at the top. And it’s pathological.


Is w3schools that bad?

For some languages, W3Schools is as good a reference as, or better than, the official documentation.

And they're definitely better than most seospam.


W3Schools is awful. The official documentation is hard to navigate, but W3Schools is notorious for misleading and poor quality examples and advice. MDN, caniuse, CSS Tricks and such are much better resources.

Edit: I semi-intentionally forgot to mention Stack Overflow because it’s so unpredictable in terms of quality.


I don't know if DDG does that exactly, but their help page does say this:

> Second, we measure engagement of specific events on the page (e.g. when a misspelling message is displayed, and when it is clicked). This allows us to run experiments where we can test different misspelling messages and use CTR (click through rate) to determine the message's efficacy. If you are looking at network requests, these are the ones going to the one-pixel image at improving.duckduckgo.com. These requests are anonymous and the information is used only by us to improve our products.

The Firefox network logger does show requests to this domain when I click on a link in the search results, before the page navigates away. This suggests to me they might be logging this information. To be clear, this is speculation on my part, because I haven't examined the URL parameters in detail.

In any case, I'm not sure how much this manages to improve the results, since usually I can get help with my Python query (for example) using whatever crappy blog post is first in the results, but results from the official docs or StackExchange are still probably better and should be prioritized.


I thought DDG already crawled their own curated list of sites?

There is a DuckDuckGoBot, and I think it was in an interview or podcast a while back that Gabriel mentioned they use it to fill gaps in the Bing API data and to provide the instant answers and favicons. Their preference for the instant answers was for authoritative references such as docs.python.org. This would have been a while back, though.


If memory serves, those crawls are only used for Instant Answers. My interpretation of the blog post is that it would be nice to have a search engine that's sort of a hybrid approach based on Instant Answers for the whole web.


Some kind of logic like "For Python programming queries, docs.python.org and then StackExchange are the tier 1 sources" seems to be the kind of hard-coded information that would vastly improve my experience trying to look things up on DuckDuckGo.

The problem with this strategy is always going to be that different users will regard different sources as most desirable.

For example, it's enormously frustrating that searching for almost anything Python-related on DDG seems to return lots of random blog posts but hardly ever shows the official Python docs near the top. I don't personally think the official Python docs are ideally presented, but they're almost certainly more useful to me at that time than some random blog that happens to mention an API call I'm looking up.

On the other hand, I would gladly have an option in a search engine to hide the entire Stack Exchange network by default. The signal/noise ratio has been so bad for a long time that I would prefer to remove them from my search experience entirely rather than prioritise them. YMMV, of course. (Which is my point.)


> and they promise that the favicon service is 100% anonymous anyway.

With that logic, Apple’s OCSP server is also 100% anonymous (which I legitimately can believe it is).


I miss Cliqz. It was a new search engine, with its own crawler, almost completely from scratch. It even had a dev blog where they wrote articles on how to build your own search engine: https://0x65.dev/blog/2019-12-06/building-a-search-engine-fr...


I tried to build something like this in 2007, together with a small band of nerds and geeks and Linux enthusiasts. It was called Beeseek. [0]

I knew close to nothing about building a company or a project, or how a proper business model would have helped it. I was the leader (SABDFL) of the group, and unfortunately I didn't lead it well enough to succeed. We had some good ideas, but ultimately we failed at building more than the initial prototype.

The idea behind it was simple: WorkerBee nodes (users' computers) would crawl the web, and provide the computational power to run Beeseek. Users could upvote pages (using "trackers" that anonymously "spy" on the user in order to find new pages - repeat: anonymously). The entire DB would be hosted across multiple nodes. Auth and other functionalities would be provided by "higher level" nodes (QueenBee nodes). Everything was going to be open source.

Well, it didn't work.

Thankfully, because of Beeseek, I met a few very smart people that I am in touch with to this day.

Life is strange and beautiful in its own way.

Weird, though, that today I still believe that Beeseek could have been the right thing to build. Who knows?

[0]: https://launchpad.net/beeseek


In what ways does what OP describes remind you of your project? Just that it was an open source web search?

One difference from what you describe is that the OP is specifically recommending against decentralization/federation, where it seems to have been the core differentiator of your effort. I don't think what OP is describing is quite what you are describing.


I think that DDV was arguing against decentralization/federation for searching the index, not necessarily against anything related to building the index (if the distributed nodes all just forward results back to a central hub).


As in, to build a different kind of search engine.


Thinking you can design a better search engine by yourself is either egotism or ignorance. Assuming you based it on state of the art search engine research, and could somehow avoid patent encumbrance, it'd still take you 5 years to match Google's results (and even then not likely) sans all the SEO bullshit.

Most people still believe that it's possible for one search engine to help anyone find anything without it knowing anything about them, which is just ridiculous. To get good search results you practically have to read someone's mind. Google basically does this (along with their e-mails, and voicemails, and texts, and web searches, and AMP links, and PageRanked crawls, and context-aware filters) and they still don't always get it right.

There is no magic algorithm that replaces statistical analysis of a large corpus along with a massive database of customized rulesets.

> We should also prepare the software to boldly lead the way on new internet standards. Crawling and indexing non-HTTP data sources (Gemini? Man pages? Linux distribution repositories?), supporting non-traditional network stacks (Tor? Yggdrasil? cjdns?) and third-party name systems (OpenNIC?), and anything else we could leverage our influence to give a leg up on.

Oh, great, so become the Devil himself, then. Count me out.


Yes, we can do better than DDG. But if you are expecting to fund a real search engine with a few hundred thousand dollars you are insane. It will take a ton of development and a ton of hardware to create an index that isn't a pile of garbage. This isn't 2000 anymore. You need to index >100 billion pages and you need it updated and you need great crawling and parsing and you need great algorithms and probably an entirely proprietary engine and you need to CONSTANTLY refine all the above until it isn't garbage. Maybe you could muster something passable for $1B over 5 years with a strong core team that attracts great talent. If Apple actually does this, as they are rumored to, I bet they dump $10b into it just for the initial version.


Agreed, it's going to require significant investment in hardware and software.

The recent UK Competition and Markets Authority report evaluating Google and the UK search market came to the conclusion that a new entrant would require about 18 billion GBP in capital to become a credible alternative search engine, in terms of size, quality, hardware, and the man hours making it.

Remember Cuil? Had the size, the fanfare but unfortunately not the quality.


OpenStreetMap is a nice analogy for what could work. Aside from the open source maintenance of the map, there's also tons of corporate help in the background. Companies that deliver OSM as a service or rely on it for their own services have an interest in making it better. MapBox, for example, apparently pays many people a salary to contribute upstream to OpenStreetMap. If we can get an Apple/Microsoft/other players collab, maybe a viable alternative can actually be built.


> If Apple actually does this, as they are rumored to, I bet they dump $10b into it just for the initial version.

Google pays Apple more than that every year just to set Google as the default search engine on iPhones.

In a way, Google is funding its future competitor.


I agree and I have been hoping Apple builds a serious competitor. I welcome any competition at this point. Let's be real, not many people are using bing. People _would_ actually use apple search.


>I welcome any competition at this point.

Microsoft tried and failed to build a competitor and it's not like they have shallow pockets.

They grossly underestimated a number of aspects:

- The huge number of man-years invested in hand-tuning Google's search quality stack, and what it would take to replicate it.

- The fact that the machine learning field was simply not ready to tackle the search quality problem.

- The infrastructure required to build a crawler / indexer stack as good as Google's.

I think in 2020, the second problem is within reach of many companies technically. It's mostly a matter of throwing enough money at optimized infrastructure.

However, replicating the search quality stack is going to be very hard, unless someone makes a huge breakthrough in machine learning / language modeling / language understanding at a thousandth of the cost it currently takes to run something like GPT-3.

The most likely candidate to execute properly on that last bit is - unfortunately - Google.


Sure, but Microsoft tried and failed to build a competitor to Google, not to DDG.

I don't see it as that hopeless. I feel it's kind of like starting OpenStreetMap. It won't be perfect for a long time, but there will be people who'd prefer it and help out.


> People would actually use apple search.

Of course they would - it would be set as the default search on their iPhones with no clear-cut way to change it. You know, "security". The users don't know what's best for them, etc. as Apple seems to think.


If you want a good engine there is no need to index 100B pages, since 99% of the pages are blogspam.


How are you going to identify what's blogspam and what's legitimate without indexing it all in the first place?


You load, detect and discard it?

We’re already talking about building a search engine, might as well talk about a model to convincingly detect blogspam too.


if(has_ads) { spam = likely; }


I think that is why the idea would be to have the tier 1 sites so you don't have to index as much.


I said ISN'T a pile of garbage :)


Check out the serious difficulties the Common Crawl had with crawling 1% of the public internet on donated money and then get back to me with a plan. This is really, really hard to do for free. Maybe talk to Gates :)


TFA specifically mentions not crawling the web, but using a curated list.


I don't follow. A curated list of what? You need to crawl the entire web to have the content of the entire web to index and search.


I don't fully understand something about the general tech industry discourse around search and would love to hear if I'm wrong.

Here's my brief and slightly made up history of search engines:

In the beginning of time, search engines took a Boolean query (duck AND pond) and found all the documents which contained both words using an inverted index, and then returned them in something like descending date order. But for queries with big result sets this order wasn't very useful, so search engines began letting users enter more "natural language" queries (duck pond) and sorting documents based on the number of terms that overlap with the query. They came up with a bunch of relevance formulas (tf-idf, BM25) that tried to model the query overlap.

But it turns out this is tricky, because user intent is a really hard problem, and so modern-day search engines just declare that relevance is whatever users click on. Specifically, they model the probability that you're going to click on a link (or something like it) using a DNN fed with things like the individual term overlap, the number of users that have clicked on this link, the probability it's spam, the PageRank, etc. Some search engines like Google also include personalized features like the number of times you have clicked on this particular domain, because, for instance, as a programmer your query of (Java) might have different intent than your grandmother's. This score then gets used to sort the results into a ranked list. This is why search engines (DDG included) collect all this data: it makes the relevance problem tractable at web scale.
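
To make the relevance-formula step concrete, here is a minimal BM25 scorer over a toy corpus (the corpus is made up, and k1 and b are just the usual illustrative defaults):

    import math
    from collections import Counter

    docs = ["the duck swam across the pond",
            "duck pond maintenance tips",
            "python list comprehension examples"]
    tokenized = [d.split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    df = Counter(term for d in tokenized for term in set(d))   # document frequency

    def bm25(query, doc, k1=1.5, b=0.75):
        tf = Counter(doc)
        score = 0.0
        for term in query.split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        return score

    # Rank the toy corpus for a query.
    ranked = sorted(range(N), key=lambda i: bm25("duck pond", tokenized[i]), reverse=True)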

Maybe it's just my perspective, but I really don't understand why the OP would want to build an index: it's hard, boring, and expensive, and it isn't even the part that violates data privacy. And I don't think people grasp that, at least to some extent, data privacy and relevance are in direct conflict.


I wonder if you could start small on something like this. Build a proof of concept, a search engine for programmers that indexes only programming sites/material. See if you can technically do it, & if you can figure out governance mechanisms for the project. Sort of like Amazon starting with just selling books.


I wonder if instead of another search engine we would benefit from a directory, like DMOZ, or perhaps something tag-based or non-hierarchical. Sometimes I find better results by first finding a good website in the space of my query, and then searching within that site, as opposed to applying a specific query over all websites. One example would be recipes: if you search for "bean burger recipe" you will get lots of results across many websites, but some may not be very good, whereas if you already know of recipe websites that you consider high-quality or that match your preferences, then you'll find the best (subjectively) recipe by visiting that site and searching for bean burgers.


Yeah. Exactly my thoughts. I really liked the concept of del.icio.us, where humans could bookmark and tag web sites.

dmoz looks pretty great but the categories look limiting.


I've recently been /tinkering/ with exactly such an idea! In my case, it's even more specific and scoped: A search engine with only allow-listed domains about software engineering/tech/product blogs that I trust.

https://github.com/jmqd/folklore.dev

It's not even really at the POC stage yet, but I hope to host it with a simple web frontend sometime soon. Primarily, this is just for myself... I just want a good way to search the sources that I myself trust.


It's still pretty new and I'm working on it in my spare time, but my side-project https://searchmysite.net/ seems pretty close to what the author is after:

- "100% of the software would be free software, and third parties would be encouraged to set up their own installations" - I'm planning on open sourcing it under AGPL soon, once I've got documentation, testing etc. ready. Plus it's easy to set up your own installation (git clone; mkdirs for data; docker-compose up -d).

- "I would not have it crawling the entire web from the outset" - That's one of the key features of my approach, only crawling submitted domains. I'm focussing on personal websites and independent websites at the moment, primarily because I don't currently have the money for infra to crawl big but useful sites like wikipedia, but there's nothing to stop people setting up their own instances for other types of site.

- "who’s going to pay for it? Advertisements or paid results are not going to fly" - A tough anti-advert stance is another key differentiating feature to try to keep out spam, e.g. I detect adverts on indexed pages and make sure those pages are heavily downranked. Planning to pay running costs via a listing fee, which gives access to additional features like greater control over indexing (e.g. being able to trigger reindexing on demand).


What about a curated search engine? Allow anyone to curate the results and define a custom list of allowed URLs. Then, others can use their list.

For example, I decide Google is terrible when I'm searching for product reviews, and all I get are results to Amazon referral websites and spam blogs that never owned the products to begin with. So, I find 200 sites or forums that actually have quality reviews and I create a whitelist of those URLs, and I name it "John Doe's product reviews list".

Other people visit the search engine and they can see my list, favorite it, and apply it to their results. I maintain the list, so they continue to get updates as it's refined.

The idea is you visit the search engine, type your query, then select from a drop down one of your favorite curated lists to apply. Maybe you like to use "Mike's favorite free stock photo websites" when searching for free photos for your projects. Maybe you like to apply "Jane's vegan friendly results" when searching recipes or face creams. Maybe you want to buy local, so you use the "Handmade in X" list when searching for your next belt. Maybe you use a list that only shows results from forums, or another for tracking/ad free websites.

Keep track of list changes. So, if someone gets paid off to allow certain sites on their popular list, others can easily fork a past version of the list.


I really like your idea and had similar thoughts - curious how you would build this, and if you planned to? I saw in your comments you ran a website that sounded like an earlier version of Reddit - could you share the name of the site? My email is orangegummy@gmail.com if you aren't interested in publicly posting that info ( I would have emailed you but there's no contact info I could see)


It's almost impossible to build a decent web search engine from scratch today (i.e., build your own index, fight SEO spam, tweak search result relevance...). The web is already so big and so complex. Otherwise Google wouldn't need to hire so many people to work on search alone.

If you didn't start at the very early stage of tiny web (e.g., Google in 1996 as a research project) and grew with the web over the past 20+ years, or you don't have super deep pocket (e.g., Microsoft Bing in mid 2000s), then it's almost impossible to build a decent web search engine within a few years.

It's possible to build vertical search engines on a far smaller scale, for far less complex, far less lucrative things that Google/Microsoft have little interest in today (e.g., recipes [2], podcasts [3], gifs [4]...)

It's also possible to come up with a different discovery mechanism for web (or a small portion of web), other than a traditional complete web search engine. Essentially you don't cross moat to attack a huge castle (e.g., Google). Instead, you bypass the castle [1], as it becomes irrelevant.

[1] https://twitter.com/benedictevans/status/1038538688232226817...

[2] https://www.yummly.com/

[3] https://www.listennotes.com/

[4] https://giphy.com/


You are probably right. But still... the suggested approach kind of makes sense. A curated list of trusted sites as a kind of seed, not the entire web. This can be as small or as large as is useful; it does not need to cover the entire web. How big is the "useful" blogosphere, for example? Couldn't an open-source project that gathers momentum somehow create a curated list of, let's say, 10,000 trusted blogs and index those? Index all the mailing lists that can be found, index all of Reddit, index Hacker News, index Wikipedia, the 100 most well-regarded news sites in each country, etc. Wouldn't such an index be a good start and better than Google in many cases?


Is somebody aware of a project where the end-user browser acts as a crawler? It has already spent the energy to render the content. Readability.js extracts the main page content, does some processing for keywords, hashes anchor links, signs it, and sends it off. Cache-Control response headers indicate whether the page is public or private. Of course, wherever it is sent will have an electricity bill to pay to index the submissions.


Yes. Both PeARS[1] and Cliqz[2] tried to do that. Both got direct support from Mozilla[3][4] but it looks like neither really kicked off.

PeARS was meant to be installed voluntarily by users who would then choose to share their indexes only to those they personally trusted, so the idea is very privacy conscious but also very hard to scale.

Cliqz, on the other hand, apparently tried to work around that issue by having their add-on bundled by default in some Firefox installations[5] which was obviously very controversial because of its privacy and user consent implications.

I still think the idea has potential, though, even if it's in a more limited scope.

[1] https://github.com/PeARSearch/PeARS-orchard [2] https://cliqz.com/en/whycliqz/human-web [3] https://blog.mozilla.org/press-uk/2016/06/22/mozilla-gives-3... [4] https://blog.mozilla.org/press-uk/2016/08/23/mozilla-makes-s... [5] https://www.zdnet.com/article/firefox-tests-cliqz-engine-whi...


thanks for the pointer to PeARS, this was wholly new and I'll read into it.

I was aware Mozilla had some involvement with Cliqz, but didn't really pay attention; I remembered the company became the owner of the Ghostery browser add-on some years ago. They closed shop mid-2020, but their tech blog 0x65.dev is still up. There are a lot of posts from last December that explain its inner workings.

User agents really do contribute their history, recording which search terms led to which pages. From this data (they named it the "human web") the search engine built a model of which search terms most frequently led to a given page. Related search terms were normalized. Only later did a "fetcher" really index high-frequency content and consider it again in a later stage of search. Interesting bootstrap, and maybe more energy efficient, as it can run on less information.

Sending the search and browsing history offsite needs explaining and trust. But ultimately, any centralized search engine will see the search data too. Cliqz's approach was trying to piggy-back on the result sets for search terms from established search engines, with the search term + chosen result combination being a donation from the user. Not any less invasive than other search engines. Would I send off the whole corpus of my browsing history? This raises good questions. Thanks for the links!


That's an interesting point...I wouldn't trust the `Cache-Control`, unfortunately, but a distributed indexing model might be interesting...

I know there have been talks of set-ups that essentially take a web archive of your entire history to search back through...


The idealist in me fantasizes this is possible with a browser-based P2P zettelkasten.


> If SourceHut eventually grows in revenue — at least 5-10× its present revenue — I intend to sponsor this as a public benefit project, with no plans for generating revenue.

I like this attitude. Makes me happy to be a paying member of SourceHut.


I’m pro-privacy, but I don't have a problem with AdWords, outside of Google's implementation.

If AdWords targeting was purely based on the search term, I don’t mind.

The search engine has to generate revenue somehow, and the revenue generated on “Saas crm” with a single click is likely to be larger than any users annual subscription. (10 - 100+ per click)

I’m unclear on the ethical / privacy concerns of “AdWords” style advertising.


Here is the problem (and it isn't privacy):

A search engine's job is to present you with the best possible results for any given query.

An ad is either A) the best possible result or B) not the best possible result. If the ad is the best possible result, then the search engine must display it anyway in order to fulfill its mission. If it is not the best possible result, the search engine must violate its mission in order to display it. To put it bluntly, advertising is paying to decrease the quality of search results.


FWIW it's no longer called AdWords, just Google Ads.

Agreed: keyword (and location; a lot of searches are for "X near me") targeting for the most part offers a way of delivering relevant ads.

Google are able to generate more income per search because of their critical mass of searches and advertisers, as well as having more data on searchers based on search history to maximise that revenue per search.


Low-quality information is more profitable to produce than high-quality information (thanks to Google and their ads). So all the incentives currently push people to produce content that is nearly indistinguishable from spam. It is much more profitable to take one hour to write generic content than to take weeks to really think through everything. This problem will continue to get worse with AI-generated text, and I don't see how Google can fight that. This is why 90% of my time is now spent on YouTube instead, which has the advantage that focusing on quality has a much higher ROI with video than with text. That doesn't mean it's not filled with garbage, just that it's easier to sift through it. Product search is also something I now do on Amazon because Google results are essentially spam. The results there are also not ideal, but at least I can distinguish the bad results faster.

One way to make search better would be to embrace bias and give people what they want. Just accept that most information is biased and bring it to the forefront. Initially there would be some default domain whitelisting, but users can request sites to add or remove from their bubble. Maybe they can "share" these lists among each other and certain clusters would form, which you can then use to recommend more nodes. Users would also be clearly told that results are biased based on their preferences, and that they can choose to view results from other points of reference. It would also have different "modes" for things like products, information, and news. Maybe some users only want to see results from independent or ad-free sources, and they can choose those clusters. Maybe they want only far-right or far-left sources, and then they can choose those. At least people would be aware of their bias, which I think will help fight it more than acting like it doesn't exist. Essentially there would be an explorable graph so I can see different realities. It would be a combination of search and social media.


I don't want privacy, I want competition. Which is why I use Bing, and honestly it works virtually all the time.

Maybe there could be some purely anonymous, ad-free search engine, but it's more realistic to have an alternative commercial one. I really don't care that people are looking at my searches for how to resize an array or cheap hotels in Florida.


This "nothing to hide" argument has been pretty analized, and in my opinion, it's really dangerous. https://spreadprivacy.com/three-reasons-why-the-nothing-to-h...


This is why search is hard: 15% of Google searches are new each day. [1] And, with over 1.7 billion websites, [2] it would take a gargantuan open source effort to put something like this together.

Not to mention the cost; I'm not sure something like this could be sustained with a Wikipedia-esque "please donate $15" fundraising model.

[1] https://searchengineland.com/google-reaffirms-15-searches-ne...

[2] https://www.weforum.org/agenda/2019/09/chart-of-the-day-how-...


Why is the ratio of new queries relevant at all?

The open alternative to Google doesn't need to have the same capacity for download and indexing.


I agree DDG isn't perfect or great but it's _good_ 80% of the time.

I always start with DDG and fall back to Google if it doesn't help, or when I feel "there's got to be a better way".

That said, talk is cheap, show us your engine.


> DuckDuckGo is not a search engine. It’s more aptly described as a search engine frontend. They do handle features like bangs and instant answers internally, but their actual search results come from third-parties like Bing. They don’t operate a crawler for their search results, and are not independent.

I didn't know this. I tried to use DDG for a while after I switched to Brave, but the results were just not very good and completely missing results I was looking for at times. This would explain it if it was coming from Bing.


I spent seven years working at Bing, and I can tell you that this guy is massively, hugely underestimating the difficulty of this problem. His repeated "it's easy! You just have to..." suggestions are absurd. This is typical HN content where someone with no domain expertise swaggers in and assumes everyone in the space must be idiots, and that only he can save the day.

Trust me, there is not a ton of potential "just sitting on the floor" in web search.


Given your experience, what's your opinion about DDG result quality?


Findx[1] was an attempt to make an open-source search engine. Today it's just another Bing wrapper, but their code[2] is still available, waiting to be used as a starting point for another project.

1. https://www.findx.com/

2. https://github.com/privacore/open-source-search-engine


> The main problem is: who’s going to pay for it? Advertisements or paid results are not going to fly — conflict of interest. Private, paid access to search APIs or index internals is one opportunity, but it’s kind of shit and I think that preferring open data access and open APIs would be exceptionally valuable for the community.

There's no reason you couldn't allow the first N API hits to be free, then charge for higher tiers of access.


For me, one of the weakest parts of DDG/Google is finding niche content. Getting any results from a non-mainstream blog for anything but a direct quote is extremely hard. I always have to type HN/Reddit to get authentic recommendations or opinions from people who have actual experience with the subject matter. Otherwise 90% of the results are from SEO-optimised sites that barely introduce the subject matter.


How about a crowdsourced search engine like wikipedia or stackoverflow? Like:

When you search for "kittens" you get the links that are most upvoted by the community.

If nobody has ever submitted links for the search term "kittens", you get a link to selected generic search engines. And "kittens" ends up in a list of words someone has searched for but nobody has yet added a good result link for.


I hate to be so negative but that's just another sort of SEO problem. Someone will pay a large group of people to sit and click upvotes for their clients nonstop.


Of course there are going to be some highly debated search terms. But I think that applies to Wikipedia as well, and they have managed to pull it off so that it works reasonably well.

I mean, you could always put a big red badge on top of the results that says something along the lines of "this search term seems to be troublesome; you may want to check Qwant/DDG or maybe even Google."


I agree 100% with the fact that we can do better than DuckDuckGo

A) DDG is already better than Google for some search queries

B) I use Epic Search and that is good, though it just uses Yahoo or Bing results (not sure which)

C) We should have a lot of search engines

It should not be Google paying $8 billion a year to Apple and $X a year to Firefox/Mozilla, leveraging Google Search as the default in Chrome, and monopolizing search.


One of the biggest problems the article points out more than anything is "Who's Going To Pay For It?"

You have one of two options. The crowd-funded approach would have to come with the understanding that you're trying to build a better, more private search engine; the payers end up covering costs for everyone who doesn't pay, and no payer gets a special say in anything, since we'd want search results to be flat and even across the board. That means if I search for coronavirus, not only do I get the government and "official" sources, I should get everything I'm looking for and refine as needed.

The second approach is obviously big money, but if you have big money coming in, big money will give you one of two options: do as they say, or they withdraw funding, leaving you back at option one and having to downsize.

Rocks and hard places, people. Not much else you can do there. Unless you take a Pilled.net approach.


I'd love a truly open source, world-class search engine. I'm curious how both the crawler and the search index / search are done by the likes of Google/Bing/DDG. Eventually someone will make an OSS version that can compete.

The beauty of such an OSS solution may be the custom heuristics that can be created based on the crawled data.


The challenges to OSS developers are numerous. First of all, many popular sites on the internet block crawlers other than Google and Bing, because only those ones seem to matter to their business, and any small upstart would be assumed to be a dodgy bot. Secondly, Google amasses the database it has only with vast data centers, incredible amounts of bandwidth, and power requirements unavailable to a startup.


Just use the string 'googlebot' in your user agent.

After all, Google uses 'Mozilla' in their Google Bot user agent String for similar reasons - because sites might expect it.



That's interesting. I know a few companies that don't run that check but instead just go off the user agent.


How would anyone block a crawler? A crawler is just a headless browser.



Note that robots.txt is a hint to well-behaved crawlers, not blocking them in any regard.

You can block crawlers if you can identify them, but reliably identifying them is hard.


We should probably classify the crawler-identifying problem as impossible and move along. Fewer resources wasted and easier automation for everyone. Assuming a crawler is malicious is narrow-minded.



This helps to verify that a bot that announces itself as google bot is indeed a google bot. It doesn’t help identify a bot that pretends to be a user/browser.
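
Presumably the check being referred to is Google's documented reverse-then-forward DNS verification of Googlebot IPs; a rough sketch (no caching or rate limiting):

    import socket

    def is_real_googlebot(ip: str) -> bool:
        try:
            host, _, _ = socket.gethostbyaddr(ip)            # reverse DNS
            if not host.endswith((".googlebot.com", ".google.com")):
                return False
            # Forward-confirm: the claimed hostname must resolve back to the IP.
            return ip in socket.gethostbyname_ex(host)[2]
        except (socket.herror, socket.gaierror):
            return False

    # A crawler that merely puts "googlebot" in its User-Agent fails this check.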


There's no need to compete. People who want things like this just do it themselves. Check out YaCy: https://github.com/yacy/yacy_search_server


Good luck with that mate. Check out https://commoncrawl.org/


While I agree about its lack of organization, I don't think YaCy being intolerably slow is necessarily an argument against it. If you are looking for a complete set of pages on a specific topic, time is sort of irrelevant. Google, for example, has alerts for new results. That these pages are not available sooner (before publication) is not intolerable. You can also throw hardware at YaCy and adjust the settings, which improves it a lot. The challenge with a distributed approach is sorting the results. Other crawlers have the same problem, but in a distributed system it is even harder.

Running an instance for websites related to your occupation or hobby, YaCy is quite wonderful. You don't want Google removing a bunch of pages that might cover exactly the sub-topic you are looking for. Of course, the smaller the number of pages in your niche, the better it works.


> Crucially, I would not have it crawling the entire web from the outset. Instead, it should crawl a whitelist of domains, or “tier 1” domains. These would be limited mainly to authoritative or high-quality sources for their respective specializations, and would be weighed upwards in search results. Pages that these sites link to would be crawled as well, and given tier 2 status, recursively up to an arbitrary N tiers.

I like this idea. It would be interesting to see the domains of every search query that I have clicked on and see what the distribution is like. I suspect there would be a long tail, but I wonder how many domains actually need to be indexed for 99% of my personal search needs. Does anyone have data like this?


I think the only way to get an open search engine going is to specifically target a niche first, build profit from ad results, and then expand.

My suggestion? Older folks don't have a well-known search engine targeted at them, and there are features that could make a search engine helpful for them (a high-contrast mode or built-in screen reading for the vision-impaired, anti-fraud results for common scams that target seniors based on search results, links to places to watch old shows when people search for character names), and they are a lucrative demographic, both in terms of having money to buy things and for political ads.

I'm sure a focus group of seniors would have more detailed thoughts.


>We need a real, working FOSS search engine, complete with its own crawler.

How would an open-source search engine stand against abusive SEO-optimization? If anyone can understand how the ranking algorithm works then anyone can game it.


Not as much if you have user-curated tier 1 sites. If those start to become spammy, they get removed.


One thing I always wished for is a way to use DuckDuckGo bang searches in my browser without sending them through DDG. But apparently it's harder to implement than it sounds.


You absolutely can, at least in Firefox: you can right-click on a search field and select "Add a Keyword for this Search...". Then save it as a bookmark and enter the keyword (you don't have to use !, but it is an option if you choose to).

You can also create such bookmark manually and use %s in the url as a placeholder where search query should be placed.

The manual configuration can be useful when there's no direct search field. For example freshports.org allows querying freebsd.org. I can add a bookmark with search keyword "fp" to point to https://freshports.org/%S

After that I can type in address bar: fp lang/python39 to land on https://freshports.org/lang/python39 (the capital %S doesn't escape special characters like /)


In Firefox you can right click on a search field and add a keyword bookmark. Once saved, you can type 'kw search query', where kw is your defined key word, in the address bar to directly search the relevant site


I'm aware of that. The problem is that you have to manually add all the keywords yourself. AFAIK, there isn't an easy way to import a large list of curated keywords like the DDG bang list.


They are bookmarks you can export/import from Firefox, so someone could easily make a Firefox bookmark file for a large set of them.
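
If memory serves, Firefox stores a bookmark's keyword in the SHORTCUTURL attribute of its Netscape-format bookmark export, so generating an importable file from a bang list is only a few lines (worth verifying against a file exported from your own browser first):

    # A tiny, illustrative bang list; DDG's real list is much larger.
    BANGS = {
        "g":  ("Google", "https://www.google.com/search?q=%s"),
        "w":  ("Wikipedia", "https://en.wikipedia.org/wiki/Special:Search?search=%s"),
        "gh": ("GitHub", "https://github.com/search?q=%s"),
    }

    lines = ["<!DOCTYPE NETSCAPE-Bookmark-file-1>", "<DL><p>"]
    for keyword, (title, url) in BANGS.items():
        lines.append(f'    <DT><A HREF="{url}" SHORTCUTURL="{keyword}">{title}</A>')
    lines.append("</DL><p>")

    with open("bang-bookmarks.html", "w") as f:
        f.write("\n".join(lines))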


Love this feature. I've got basically all the bang keywords but instead of say `!g query` it just becomes `g <query>`.


FYI, searx implements the DDG bang searches (1). There is no autocompletion (for now), but at least you can use all of them in one meta search engine. The syntax is slightly different: "!!g <text>" instead of "!g <text>".

(1) actually based on https://github.com/jivesearch/jivesearch/tree/master/bangs


Chromium implements this as a feature by default. Visiting a website with an OpenSearch tag in its `head` or searching on a website without one lets you later search the website itself from the urlbar by pressing "tab". It's history-based and works very well.

https://www.chromium.org/tab-to-search


> If SourceHut eventually grows in revenue — at least 5-10× its present revenue — I intend to sponsor this as a public benefit project, with no plans for generating revenue. I am not aware of any monetization approach for a search engine which squares with my ethics and doesn’t fundamentally undermine the mission. So, if no one else has figured it out by the time we have the resources to take it on, we’ll do it.

Now _that_ is putting your money where your mouth is!

Glad to see a technology leader taking this important issue head-on.


> Here’s how I would design it.

The harder question is "who and why is going to pay for development?"

Open source means you can't sell ads; any sane person would search for a fork without ads.


The whitelist approach reminds me of Yahoo's internet directory or DMOZ. The internet directories all ended up closing.

That approach will not scale in the general case, not even with the search engine following links from said sites. Too many areas of interest, too many languages, nowhere near enough people to categorize 'high-quality' sites all the time.

It could work for domain-specific searches for expert communities. SourceHut should start with a good code search engine...


The way I'd code a better search engine is to train an ML model to recognize handwritten HTML like this, and only add those pages to the index. It'd be cheap to crawl, probably needing only a single computer to run the whole search engine. It'd resurrect The Old Web, which still exists but got buried beneath the spammy, SEO-optimized grifter web over the years as normies flooded the scene.
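
A rough sketch of the kind of surface signals such a classifier might key on. This is a hand-rolled heuristic standing in for the trained model, and the features and thresholds are invented:

    # crude stand-in for a "handwritten HTML" detector; the features and
    # thresholds below are guesses, not a trained model
    import re

    GENERATOR_HINTS = ("wp-content", "data-reactroot", "__NEXT_DATA__", "gtag(")

    def looks_handwritten(html: str) -> bool:
        lines = html.splitlines() or [""]
        avg_line_len = sum(len(l) for l in lines) / len(lines)
        minified = avg_line_len > 500                  # one giant line => generated
        framework = any(hint in html for hint in GENERATOR_HINTS)
        script_tags = len(re.findall(r"<script\b", html, re.I))
        return not minified and not framework and script_tags < 5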


I hope to never use your search engine. I love hand-written HTML as much as the next guy, but search engines are made to find things. And useful information exists on websites that use generated and/or minified HTML.


Thanks for that buzzkill. I guess the lesson is if you can't do everything Google does, don't even try.


> I guess the lesson is if you can't do everything Google does, don't even try

I don't agree with that at all. But if your goal is to make "a better search engine" as you said, it does actually have to be "better" and not just different.


A major challenge with search in 2020 is that it's adversarial. Any open source search engine that gets popular is going to be analyzed by black hat SEO people and explicitly targeted by spam networks. Competently indexing and searching content is really only a small part of the problem now, with the adversarial "red queen's race" against black hat SEO and spam being the more significant issue.


I'm an old-time user of DDG. I agree with the 3 points. I feel like I'm using a shitty car in a world where everyone drives a Ferrari.

The one that strikes me most is the results. I feel like DDG doesn't search the entire internet, like there are zillions of pages out there waiting to be indexed, even old websites, but the results I get are so poor.

Even with this handicap, I still use it over INSERT-YOUR-AD-HERE Google Search.


A few months ago Infinity Search was featured here on HN.

https://infinitysearch.co


I think it makes sense to separate the notion of bangs, what a lot of people use DDG for, from that of a search engine. I think projects like my own (very primitive!) bangin are a step in the right direction here: https://github.com/samhh/bangin


Aren't the secrets of the algorithm what prevent people from gaming the results? While I love the idea of search becoming fully open source I'm skeptical it could be done. I hope I'm wrong and I'd love to dedicate time to an open source project with this goal if anyone presents a convincing plan.


The algorithms, code and configuration can be public if the ranking isn't determined by those alone, but also by many participants in the project all over the world and by the personal preferences of each client. That would be hard to game.
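
A toy example of what "personal preferences of each client" could look like: a public base score from the shared index, re-weighted locally by the user's own domain preferences. The weights here are made up for illustration:

    # toy client-side re-ranking over a public base score; the preference
    # weights are per-user and never leave the client (numbers invented)
    from urllib.parse import urlparse

    my_prefs = {"en.wikipedia.org": 2.0, "pinterest.com": 0.1}

    def rerank(results):
        """results: list of (url, base_score) pairs from the shared index."""
        def personal_score(item):
            url, base = item
            return base * my_prefs.get(urlparse(url).netloc, 1.0)
        return sorted(results, key=personal_score, reverse=True)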


For all its charms, DDG seems to be tailored to answering the kinds of questions that are most likely to be asked. IME, it's not well suited to detailed dives into a narrow (likely unpopular) topic.

If the first page doesn't satisfy, just prefix the search terms with '!b ' and Bing usually nails it.


The three problems mentioned are not the only ones. Most people are unaware of the censorship DuckDuckGo also does, as do Google and the other major search engines.

For example: https://news.ycombinator.com/item?id=25093009


My biggest pet peeve with DDG at the moment is that whenever I search for something on my phone the first two results are ads, and those two results actually take up my whole screen. I mean sure, those are probably not privacy invading, but I literally don't care as I wasn't looking for them.


I actually heard a DDG ad on the radio in the Los Angeles area (KNX 1070 24 hour news). Still love the bangs.


The tiering system is dumb, really really dumb.

It would basically make already-famous domains shine and dump lesser-known domains into 20th-page oblivion.

It saddens me, because Google search results actually used to help you discover new sites and new people, but that changed years ago.


I'd love to participate in a project like this. Does anyone know how to contact Drew DeVault?


Dude has an email on his website.


I think DDG is mostly fine. It does sometimes give results for the wrong location or older results, so if you want specific new information for your location you should use Google (easy enough, just use !g in the query and it sends you to Google).


I'm curious what happened with the Cliqz (https://en.wikipedia.org/wiki/Cliqz) search engine source code. The quality of their search was OK.


You need money and dedicated resources to run and manage the service, which at some point is just going to require trust. Trusting nobody is smart, but expecting a service to compete and win the long game without trusting it is pointless.


The only place I've really found duckduckgo lacking is its meme game and maybe esoteric tech problems. Even with the techy stuff if I narrow my search to github and stackoverflow I usually find what I'm looking for.


Am I wrong, or are there simply not enough parties out there interested in revolutionizing search? Is Google good enough despite the misgivings around privacy? All we get are skins of existing search engines, or proxies.


> Crawling and indexing non-HTTP data sources (Gemini? Man pages? Linux distribution repositories?)

> ⇒ This article is also available on gemini.

Does anyone know what Gemini is? I couldn't find relevant results in any of the search engines.



I'm surprised no one has rented Ahrefs' database, whipped up an algorithm and called it a search engine. Besides Google and Microsoft, who has a bigger snapshot of the entire web (NSA not included)? Majestic, maybe?


I prefer DDG over Google not because of the search results, but because they don't block the Tor IP range like Google does. So Google is unusable for me (constantly solving captchas is very annoying).


Crawlers are a top-down approach. A distributed list that people pay digital money to be listed in would both incentivize nodes to stay online and turn Sybil attacks into paid advertising.


DDG is a search engine that respects your privacy. DDG doesn't put you in a « bubble » based on your location, search history, browsing history, physical movements, time of day, country of residence, IP, and billions of other parameters I don't know about. Therefore, there are things it cannot do the way Google or Bing do.

Google is like an oracle because it exploits all of this. And it works, at a cost to your privacy (but maybe you're OK with paying it). DDG is more of a "meta-search-engine" with limited capabilities. But you get the flexibility of accessing the search engines of thousands of websites.

If you don't care about your privacy, don't use DDG and stay with Google.


I can sympathize with the author. DDG is a terrible choice for a search engine. They have managed to somehow underplay the importance of their cause, which is why nobody uses DDG.


Quote: "They don’t operate a crawler for their search results, and are not independent."

They have their own crawler too, besides aggregating results from Bing and others.


Making an open source search engine is asking for it to be spammed.

Every spammer will have a test harness for the SE stack to ensure their spam is ranked as highly as possible.


By that logic, result quality should be relatively stable, because it’s unlikely that one entity will discover an effective way to game the system.

If it can be made to work, it should continue to work.


DuckDuckGo Sourcehut Philadelphia

No on-point search results from anyone, as far as I can tell. I suppose DuckDuckGo is in a suburb called Paoli.

devault weinberg philadelphia

Nothing useful for that either.


as a recent ddg convert, I've noticed little difference from google

(might be because google's results these days are so bad though... can't really tell)


Searx is a partially viable FOSS meta-search engine.


I am all for open source and free software, but would a truly open source and free search engine that is as good as Google be possible?


Great. Write a better one then come back here.


Yahoo did open source theirs, https://vespa.ai/


Maybe we just need a paid search engine, like there are paid email providers - as an alternative to paying with privacy.


Mojeek is a good example of an independent search engine. But it isn't distributed or open source.


We have to make sure to include highly relevant advertisements in the search results, at least 50% of the results should be ads. So there needs to be a marketplace for buying/selling ads.

We can't have a search engine that is only useful for finding the most relevant web pages for a given query. People love highly relevant advertisement in their search results.


>We can do better than DuckDuckGo

Proceeds to give many reasons why it would be very difficult to do better.


DDG will be as good as Bing is, since it uses Bing's API to fetch and render results.


Great, so when do you start?


OT but I love the simple, easy to read design of this site!


FYI, this post looks horrible at the bottom left on mobile.


Drew's blog post talks about DuckDuckGo privacy issues, but his commercial startup SourceHut does not offer the privacy basics:

1. Account deletion

2. GDPR data request

3. Option to unsubscribe from emails

So right now his blog reminds me of one famous US politician's Twitter account: never fix your own problems, just blame others more often.


I normally would not reply to someone who equates my blog posts with the ravings of a megalomaniacal fascist, but I will at least clarify for the benefit of onlookers that all three of these points are false. I handle account deletion and GDPR requests all the time, and every email you get from sr.ht (1) is not a marketing email and (2) can be trivially unsubscribed from, with the exception of payment notifications - which is not only allowed per the GDPR, but a lot better than silently charging you a recurring payment forever.


Trump is not a fascist. He's not far-right. Mere nationalist tendencies do not make one a fascist.

From Wikipedia:

"Fascism (/ˈfæʃɪzəm/) is a form of far-right, authoritarian ultranationalism characterized by dictatorial power, forcible suppression of opposition and strong regimentation of society and of the economy which came to prominence in early 20th-century Europe. The first fascist movements emerged in Italy during World War I, before spreading to other European countries. Opposed to liberalism, Marxism, and anarchism, fascism is placed on the far right within the traditional left–right spectrum."


Refusing to concede a fair election, using gerrymandering and electoral fraud to keep a conservative minority party in power, obstructing the political process, packing the courts as a political weapon, not to mention open racism, a cult of personality, fervent nationalism, and tacit endorsement of political violence? If you compare the rise of European fascism to contemporary trends in US politics, the parallels are pretty stark.

If it walks like a duck, and quacks like a duck... it is, at the least, a precursor to fascism. It doesn't happen all at once, and if we wait until it's plainly obvious to call it like it is, then it'll probably be too late. And even if you don't want to call it fascist, it's out of line to compare this shit to my personal blog. That's a baseless character attack, which is a dick move, not to mention against the HN guidelines.


Hmm, I have been thinking about this more lately.

How to get quality results, and a sustainable, community-led search engine?

=== Contexts ===

A "search engine" such as Google is good at many things, and extremely bad at others. The main issue with it in my view is that it lacks context about what you are looking for. The main context you can ask for is "Videos", "Pictures", etc.

* Specifying context takes time, so it's OK for the long searches Google sucks at (find this specific article I read a while back). Take your time to specify language, exact/fuzzy match, publication date, background color, author name or any number of things you know about your search.

* Some requests can be processed with instant answers, that's good news as it fits the open source model quite well.

* Lastly, the other requests. Some are asked like a question and might require NLP to sort through. Quite hard IMO; it might get better, but will still require compute power if done server-side. It's mostly: parse the question to find the context, and perform a contextual keyword search/instant answer.

* And those that aren't questions: "regular", keyword-based requests, that "just" require a big index and a big infrastructure to search it.

=== Hardware ===

Now, we are left with the cost centers: hardware. IMO, the only way to scale is to rely on the community and distribute things.

* Databases: if this is a community project, and not too latency-sensitive, the community can help by distributing them over a p2p network, even with a single source of trust.

* Queries, walking the database: delegating processing to untrusted third parties is a bit more dangerous. Maybe allow each user to specify a list of trusted servers? It can be centralized, with clients asking the network, though that might leak part of their search, depending on the index method. Or it could be client-side?

* Processing the answers: client-side, or trough any number of frontends (like searx).

* Crawlers: crawling the net isn't cheap. You could use one or multiple sources of trust. Domain-specific crawlers, like hinted at in Drew's post. Maybe crawl on demand or through the user's computer (a web extension that indexes as the user browses and lets them full-text search their history, shared or not).

=== Content ===

For some measure of quality, I find that websites that do not have advertisements offer better-quality content. That's likely due to conflicting interests. It would be great if the semantic web mandated disclosing revenue sources. You could downrank or avoid crawling sites with ads and/or Google Analytics, for instance. This could be abused to an extent if the service ever becomes popular, but heh https://xkcd.com/810/
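
A minimal sketch of that downranking idea, assuming the crawler keeps the raw HTML around; the fingerprint list and the penalty factor are placeholders:

    # penalize crawled pages that embed known ad/analytics scripts;
    # the hint list and the halving penalty are assumptions
    AD_TRACKER_HINTS = (
        "google-analytics.com", "googletagmanager.com",
        "doubleclick.net", "adsbygoogle",
    )

    def quality_factor(html: str) -> float:
        hits = sum(1 for hint in AD_TRACKER_HINTS if hint in html)
        return 0.5 ** hits        # each detected tracker halves the score

    def adjusted_score(base_score: float, html: str) -> float:
        return base_score * quality_factor(html)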

Domain-specific crawlers would be nice as well.

=== Added value ===

To be adopted, the service needs to be better than the original in some ways. I think that a new search engine should not try to conquer the masses at first. Instead, find some people that are not satisfied with the current offering and court them. Currently, I think one need that isn't met is advanced search: exclude websites protected by reCAPTCHA, only include websites that are less than X years old, no ads, etc.

Allow users to create their own contexts and easily switch them: bangs, tabs, date/time/geoip, etc. Have them create contexts dedicated to their activities: programming is an obvious one, but so is cooking, gardening, encyclopedic search, language usage/dictionaries, etc.

=== Monetization ===

At that point, I am not sure it can ever turn a profit. EU grants? Consulting? Helping webmasters set up search on their own websites? Selling desktop indexing software?

Well, I do have some ideas around content curation, but I am a tad reticent to share them here, and not sure they are more useful than the above.


Is your reticence due to a fear of others implementing your ideas?


Well, yes, we can. Just use Google with a non-Chrome browser that has built-in ad blocking, like Edge or Brave.


This is severely hampered by the availability of Google. After a few false starts I ended up switching to DDG as my default search engine. My usage pattern is as follows: for things that are predictably easy to find (e.g. the queries "normal" people would issue), I use DDG. That's about 90-95% of all my queries. For obscure stuff, I add "!g" at the end of the query and go to Google. That's the remaining 5-10%, and I doubt anyone can do better than Google there. That way I get both the privacy and the long tail.


Just do it.


I suspect Drew has his hands full with the SourceHut project.


Perhaps it would have been better for him to say "There's a better way to do it than DDG" rather than "We can do better than DDG", as if he's about to do it when in fact he's waiting for his revenue to go up.


The last thing we need is a search engine with pictures of anime girls.


Privacy or not, I'm starting to find things on DDG that Google has been filtering.

I found out through comments on HN that 8chan was back up under a new name: 8kun.

Typing it into Google, I get articles about it but no link in the results.

In DuckDuckGo it's the first link.

Made me think: what else am I missing?


Google's filtering is so weird. I have a bad habit of buying old hardware without checking whether the documentation is available on the web.

Recently I found myself desperate for any information on a piece of hardware I had gotten. I was swapping out all sorts of queries with different keywords hoping to find a manual. I was able to find some marketing material, which was helpful, albeit barely. Eventually I had exhausted the search results for most of my queries, gave up, and assumed that it was simply lost to time and I was out of luck.

Eventually I went back to the sales paper I had found and went to the site it was hosted on, a Lithuanian reseller. I translated the page, eventually finding a direct link to a user manual on the exact same page as the sales paper. The document was in English and contained important words from my queries (such as the product name, the company, "user manual", etc.). The document was at the same path as the sales paper too. I have no idea why Google found the sales paper but not the manual.

Unfortunately the manual still wasn't exactly what I was looking for, but it was a hell of a lot better than what I could get from Google's results.


DuckDuckGo is a mirage and should not be used by privacy-conscious folks. Take a look at its terms of service, information collected section:

"We also save searches, but again, not in a personally identifiable way, as we do not store IP addresses or unique User agent strings. We use aggregate, non-personal search data to improve things like misspellings."

So they save your web searches and claim that they do so in a non-personally identifiable way. The privacy problems with this claim are many, even if one accepts it at face value (good luck verifying that this is the case).


I don't see why you'd both nitpick their terms of service, and then also claim that it's a pack of lies and can't be trusted. Why do the former and then the latter? If your complaint is just "I can't verify anything about their privacy" then that would've made sense.


> DuckDuckGo is a mirage ... The privacy problems with this claim are many ... good luck verifying ...

Okay, can you list just a few?

If you're going to make counter-claims like this, you're going to have to provide evidence.

Statements like these are not conducive to gaining popular support for increased privacy.


How do you save a search in a non-personally identifiable way? Do you have a human verify the data belonging to each and every search? Not saving IPs and/or browser data doesn't solve the problem, since the search terms themselves can be personally identifiable.

How do you verify that DuckDuckGo does -the minimal and ineffective- things they claim to do? They offer no proof.

How do you verify that DuckDuckGo does not secretly cooperate with more powerful coercive actors?

How do you verify that DuckDuckGo, offering a single point of compromise, has not been thoroughly compromised by more powerful actors?


"How do you save a search in a non-personally identifiable way?"

To a first approximation, you just... do it.

Granted, if you search "{jerf's realname here} {embarrassing disease} cure" or something, in the pathological case, you could at least guess that maybe it was me, though even then my real name is far from unique, and nothing stops anyone else from running such a search.

But otherwise, if all you have is a pile of a few billion searches, you don't have any information about any of the specific searchers. Even if you search for your own specific address, you don't really get anything out of it; there's no guarantee it was you, or a friend of yours, or an automated address scraper. There isn't much you can get out of a search string without more information connected to it.

The rest of your criticisms are too powerful for the topic at hand; they don't prove we shouldn't use DDG, they prove we shouldn't use the internet at all.


At the very least, your example is PII, which you cannot save and also claim to be private.


The mere existence of someone is not really PII. You don't know that I did that search, nor can you connect it to anything else... and a constructed example in which I try to jam some sort of PII into a single search is itself a bizarre case that probably corresponds to fewer than 1 in 100,000 or 1 in 1,000,000 searches, if that. When's the last time you stuck your own PII into a search box and connected it to something of significance? It's a very small edge case.

A search history can reveal many things about a person. The mere fact that someone, somewhere searched for "star wars harry potter crossover slash", unconnected to any other search item, doesn't reveal anything about anybody.


> How do you save a search in a non-personally identifiable way?

Save a sha256 hash of every search for 24 hours. If you see the same hash from >10 distinct IP addresses in a 24 hour period, save the search terms.

That's just off the top of my head; I have no reason to think they're doing it exactly like that. The point is that you're claiming we shouldn't trust DuckDuckGo because you can't think of a way that they could securely and privately do what they do -- but that's just your intuitions, for whatever they may be worth.
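
A minimal sketch of that thresholding idea (not DDG's actual implementation; the cutoff is just the number above, and the 24-hour expiry is left out):

    # sketch of hash-then-threshold: only keep a query once enough
    # distinct IPs have issued it (24h expiry of `seen_ips` not shown)
    import hashlib
    from collections import defaultdict

    THRESHOLD = 10
    seen_ips = defaultdict(set)      # sha256(query) -> distinct client IPs
    saved_terms = set()              # queries deemed common enough to keep

    def record_search(query: str, ip: str) -> None:
        h = hashlib.sha256(query.encode("utf-8")).hexdigest()
        seen_ips[h].add(ip)
        if len(seen_ips[h]) > THRESHOLD:
            saved_terms.add(query)   # e.g. feed into misspelling stats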

I also don't really buy the worries you have with the last two questions, e.g.:

> How do you verify that DuckDuckGo does not secretly cooperate with more powerful coercive actors?

How would you verify that for any centralized service, open source or not? I think your security concerns go a bit beyond what most people interested in critiquing / improving DDG can reasonably expect to achieve.


>How would you verify that for any centralized service, open source or not?

Other centralized (search) services don't have their entire existence depending on this one factor. What is DDG if not alleged privacy? Just use Bing directly.


I don't understand that argument at all. What's the threat model?

I think it's entirely reasonable to be in the following posture: I want as much privacy for my web searches as I can reasonably achieve without having to run a search engine myself. I'm willing to trust that search providers are not saving personally identifiable information or passively turning over search data to law enforcement if they claim that they are not in their terms of service.

That's pretty much the use case for DDG. With Bing you know they are violating your privacy. With DDG you have a promise in writing that they are not. It's hard to see how that's not strictly better than what you get from Bing if privacy is among your core desiderata.


I think we're on the same page. I was saying that if it were to be discovered that DDG lacks privacy then there would be no reason to use it over Bing since that is its raison d'etre.

>I'm willing to trust that search providers are not saving personally identifiable information or passively turning over search data to law enforcement if they claim that they are not in their terms of service.

Do other search companies disclose that they share data with the FBI, NSA, etc in their ToS? Genuinely don't know.


> How would you verify that for any centralized service, open source or not?

I think, technically, some sort of honeypot verification could prove a compromise (i.e. if information that has very little chance of existing naturally in two systems, say a string of GUIDs, shows up in both).

But... I agree with your point. I don't think this is actually feasible or realistic, just technically possible.


Do you have a search engine that you prefer to use that claims not to store said information that I might try?


The only solution I see is fully distributed/decentralized search. Run your own crawler or be part of a network that distributes this out to each participating node.

Every centralized search engine has immensely hard-to-resist and powerful incentives to play "The Eye of Sauron" with your data. Additionally, they offer single points of compromise to other, far more powerful actors. Whatever guarantees DuckDuckGo gives you -and right now they don't give any- don't mean much, if they've been thoroughly (willingly or unwillingly) compromised.

Which doesn't mean one should always steer well clear, just that one should at least be aware of the tradeoffs one makes when using a centralized search engine. And with DuckDuckGo's misleading marketing, I feel that this point is lost on significant chunks of its userbase.


Such search engines have been around for many years, and they suck donkey balls. Pardon my French. Install YaCy and tell me how you like it.

It wouldn't matter anyway, because decentralization doesn't really solve privacy any better than centralized search, besides the fact that it could theoretically provide more choices.

No matter what you use, privacy ultimately depends on trust. The reason I trust DDG more than Google is that, unlike Google, its primary audience is privacy-minded folks. If it came out that DDG was tracking users and selling that data, DDG would be immediately done as a brand. They at least have some incentive to do what they say. Decentralization provides no such benefit, because a search "node" is unlikely to have any sort of meaningful brand to keep up.

> And with DuckDuckGo's misleading marketing, I feel that this point is lost on significant chunks of its userbase.

How is it misleading? My understanding from their marketing is that they don't create profiles of their users based on searches. Until we have evidence to the contrary, it's not outrageous to assume they are being truthful.


Yeah, now you're just saying "Nothing centralized can ever be trusted". So just say that rather than nitpicking their ToS. You weren't going to care what they said anyway.


I can hand on heart tell you that Mojeek doesn’t and never has. I know this because I work for Mojeek.


Hi. I took a look at Mojeek (first time I've heard about it) and since you mentioned the site and you work there -

On your Privacy page (Data Usage section) there is a mention of stored "Browser Data": "These logs contain the time of visit, page requested, possibly referral data, and located in a separate log browser information." and "We may also use aggregate, non-personal search data to improve our results".

This is an honest question: how is that not exactly what the parent stated was the issue?

    So they save your web searches and claim that they do so in an non-personally identifiable way.


An often-referred-to issue with DDG is that its favicon service was informing DDG of the sites you visit, rather than the searches you make.

But agreed that all search engines have to be trusted on their word about anonymising data and not retaining PII when it comes to searches specifically. There's nothing any front end user can do to verify it.



