Do they really though, for normal people that is? Some of my searches from today are below; I can't remember the exact terms I used. Mix of DDG and Google.
1) Walt Whitman. I wanted a basic overview of his work to satisfy some idle curiosity. DDG gave me his Wikipedia page. Bingo.
2) EAN-13 check digit. First result was Wikipedia telling me how to calculate it. I see it is simple, and I have a long list in Excel to check. I can't be bothered to think, so... (there's a sketch of the calculation below this list)
3) EAN-13 Excel. First result had an example formula that I copied and pasted.
4) Timezone [niche cloud system]. Said system didn't do what we expected; seems to be a timezone issue. First article discusses this niche issue and offers solutions.
5) Does Shopify support x payments? Yes it does.
6) Coronavirus test. Got straight to the government site.
7) MacOS version numbers. First hit...
8) How come my Microsoft x platform is showing as being at y level of service when my buddy's is not? Straight in.
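For anyone curious about point 2, the calculation really is simple. Here is a minimal Python sketch of it (the function name is mine, not from any library; the Excel formula I copied was presumably the same idea expressed with SUMPRODUCT and MID):

    def ean13_check_digit(first12: str) -> int:
        # Weight the 12 data digits alternately 1, 3, 1, 3, ...
        # The check digit is whatever tops the sum up to a multiple of 10.
        total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
        return (10 - total % 10) % 10

    print(ean13_check_digit("400638133393"))  # -> 1, so the full code is 4006381333931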
Am I just a perfect search customer? I don't seem to be getting the problems Drew is.
I suspect that anyone who claims that DuckDuckGo "just works" only does English searches. I switch between English and my mother tongue all day. Every time, I need to remember to toggle the regional button, otherwise I get atrocious results. Google, by contrast, simply understands that if I'm searching in English it should prioritize English results, while if I'm searching in another language it should prioritize that language instead.
It gets tiring quickly, and I find it easier to append !g instead of clicking the regional toggle button.
For me (German) it’s different. With DDG, I can easily choose to search for German content (by using !ddgde); with Google I have to hope that it searches for what I want. Sometimes Google does, sometimes it does not. And if it doesn’t, I’m out of luck unless I go into the settings and look for a way to tell it what to do.
Google automates, DDG leaves me to choose. I prefer the 2nd approach every time.
> Google automates, DDG leaves me to choose. I prefer the 2nd approach every time.
This is exactly why I like DDG way more than Google and why I love to use Alfred instead of Spotlight on my Mac. With DDG you have !bangs, and with Alfred you can also tell it what you’re looking for. 99.9% of the time I know I’m looking for a file or a folder or a definition of a word, or want to open an app, or want to search the web, etc. With Spotlight you’re stuck with the order Apple designed the results to show up in.
It's also very useful to have that control when you live in another country. I'm in Spain now, but most of the time I want to search in English or even French. Google only gives you local results.
I really wish Google would prioritize English results for English searches consistently. I'm living in Japan as a native English speaker, and have my OS, browser and logged in Google account all configured for English only. Despite that, Google search results always prioritize Japanese language content. Every now and then (though not consistently) it gives me a yellow popup asking if I'd like English results instead, which is a bit disappointing given they already have all the information they should need to make a judgement call about that. Maybe the individual experience here depends on the languages and regions involved.
There was a time, a long time ago, where google had this:
www.google.com/ncr
'ncr' here stands for no country recognition. It allowed many expats to do technical searches without the noise of regionalization results.
Of course someone clever at google figured out that was probably too useful and now it just redirects you back to google.com because screw all those niche use-cases.
That's not what it was. "ncr" was "No Country Redirect".
When you were in a different country (e.g., India), and you typed in google.com out of habit, it would recognize your IP-geo and redirect you to the country-specific domain (e.g., google.co.in).
If you really just wanted google.com for whatever reason, then you'd type google.com/ncr. It then wouldn't redirect you based on your IP-geo, and you'd stay on google.com.
In other words, google.com/ncr _always_ redirected you back to google.com. Then, and now.
However, you can see from the comments on both Android Police [0] and Reddit [1] that, irrespective of your assertiveness, the behaviour did indeed change at least once in 2017, if not more times before.
It at the very least used to preserve the suffix and absolutely respected the no-regional-results behaviour. It's the same as the old boolean operators: Google claims the behaviour is unchanged but will silently ignore them.
Was going to post the exact same thing. This was my experience while living in Japan too. To me, the takeaway is that you simply cannot catch everyone with your defaults. Google and DDG have made different prioritization defaults and the result of that is what we see in anecdotes in this thread.
On the contrary, I like the explicit language toggle because some search terms have better results with a specific language. I get annoyed when I enter a programming related search term and get non-English results.
I do a lot of searches in French, where a lot of the words are identical to English (English being heavily influenced by French), especially if one leaves out the accents.
I also search in Spanish (Castilian), but sometimes I want results from Latin America, sometimes from Spain.
Being able to set the language/region is of incredible help in both cases. There is no way to automatically detect this.
I also find that even with the regional toggle off, my results are still skewed towards my location or the native language of it. This is true for both DDG and Google. I want results completely agnostic of where my IP happens to be positioned.
I actually prefer the toggle button for regional search.
Also, when I search in my native tongue, Google sometimes (like when using brand names, models of devices, etc.) gives me 2-4 pages of advertisements and store links. It's hard to find even a large company's homepage.
In DDG it's usually the first page.
I still do a lot of !g when I search technical stuff, as it lets me use +word and -word, and DDG sometimes doesn't find a lot of weird GitHub issue pages, old forums, and Usenet posts.
Yeah, all of these are quite DDG-friendly searches. It is my default engine and, yes, some results do suck quite consistently.
I'm a bit too lazy right now to remember all the problems it has, but some of the most obvious are looking up news on recent events (especially something small, stuff that doesn't appear on Reuters and those sorts of media) and trying to find out some basic stuff about local shops and such (of course, I only know how it feels in my location, not worldwide). On both occasions I pretty much always use "!g ..." right away, because DDG is just clueless about this shit. Google does this just fine (in fact, sometimes it's even impressive: there are thousands of cities like mine, yet Google can often tell me where I can buy some stuff I'd have no idea where to look for).
> Yeah, all of these are quite DDG-friendly searches...
This is exactly correct. Excluding poor local search results (which is understandable because of the privacy aspect), Bing/DDG has trouble with long-tail search query relevance (5+ word queries), and also with finding results from small or obscure sites. The latter is simply because Bing's organic index is not as large as Google's.
Bing/DDG's organic results are still very good, but they are not as good as Google's in the above specific circumstances.
Compared to Google, Bing has a huge problem with paid search results, at least in some non-English languages.
My mother wanted to access Amazon last week and typed "my amazon account" in French in the Windows search box, which searched for those terms on Bing. One of the first (paid) results was a scam site triggering alarm sounds and fake virus notifications, and asking her to call a scam hotline.
At least DDG filters out the ads but the problem in this case is Bing's OS integration.
> Bing/DDG has trouble with long-tail search query relevance
The last time I did a comparison, Bing did better (I don’t know what DDG does with the Bing results exactly, everyone says they just show Bing results, but no one knows and it simply doesn’t mesh with my experience).
Because Bing does not randomly filter out half my terms, while DDG does, even for quote-"forced" terms. This is my #1 problem with DDG and I complain about it in pretty much every DDG thread (while otherwise loving DDG).
For searches with few results, DDG shows you essentially random stuff even when it has the result I want (which can be tested by searching for an exact sentence from the result page). Doing the same search on Bing, on the other hand, gives me the result without neutering my query.
I am typing this from India. DDG never provides satisfactory results for anything country-specific. As an example, point 6 above is a failure. I used to have DDG as my default, but my workflow got so convoluted that I would search first on DDG, see that the results were not good, then open Google and search again. It is so frustrating that I switched back to Google even though I didn't want to.
Thank you, I didn't know about that feature. I almost always used !g to switch to Google when searching for country-specific terms; guess I can change that now.
In consumer search there is a really long tail of questions (in 2017, 15% of Google's daily queries had never been seen before[1]) and performance on this is very important.
I just searched for "lockdown rules for SA" (I'm in South Australia and we just had a new 20 person cluster, so we are going back into lockdown).
On DDG the first result was a Guardian article, which was good, but then the rest were a mix of South African articles and blog spam. There were no SA Gov pages on the first page of results.
On Google the first result was the South Australian gov site with the rules, the second was the Guardian article, then more SA Gov pages and at result 8 I got a South African result.
Hrm, I think it's extremely iffy to abbreviate South Australia like that in a search query. You don't need the "for" either.
BTW, when I perform the same search, Google's first result is
"What Are the Lockdown Rules for South Africa? A Guide for ..." and all the other results on the first page are about South Africa too. (Note: I'm in Japan)
> Hrm, I think it's extremely iffy to abbreviate South Australia like that in a search query.
Everyone in Australia uses "SA" - this is one of the reasons why location based context is important.
> You don't need the "for" either.
I worked on consumer search for a few years, and in text-based search, words like "for" are helpful for getting an exact match. Even if the term frequency of "for" on its own isn't particularly useful, "for SA" absolutely is. (And these days, with neural ranking using sub-word parts, it is even more useful.)
Searching for "lockdown rules" "sa" (together) just now gave a bunch of South Australian specific results with the "Australia" localisation setting enabled.
With the localisation setting disabled, all the results were indeed about South Africa instead.
They do. Something fundamentally changed at some point during the past couple of years. It used to be that DDG was the best for verbatim search (meaning I only want results where the exact words I search for are included).
Now, even with quotes, I routinely get a whole first page of results where my terms are not included anywhere. Google generally respects the quotes.
I have noticed the same problem with DuckDuckGo searches recently.
I hope that a verbatim search function will be restored in the future; I think it's an essential basic tool for a search engine, and without it the user can be left with the impression that the engine either doesn't understand what it is being asked to do, or that it is wilfully disregarding instructions because it thinks — often wrongly — that it has a better idea of what the user is searching for than the user does.
Personally, I dislike trying to interface with a machine using natural language, because I know it can’t really understand me, and I’d rather read and interpret the results for myself than have an algorithm pick the “best”.
I actually find speaking to machines (e.g. automated phone systems, Siri etc) using natural language quite embarrassing, as if we were pretending that real life was like Star Trek.
I also hate talking to machines, minus some exceptional circumstances. I find it especially annoying when phone menu systems insist that I talk to them. Some of them don't even respond to mashing zero.
In a simple probabilistic sense, sure. Shove enough data at the problem and the easy cases work out. Those are only a small subset of the problem space.
Until we address meaning, my statement remains solid. And it can often be easier to treat the tool like what it is rather than figure out how to best pretend it is something it is clearly not.
I think it's less about trying to talk to Google, and more about phrasing your search the way somebody would ask it on some random forum. That is often the benefit of asking it as a question.
Although such queries are habit-forming, and now Google does a decent job of understanding the actual question.
Apart from that, it's nice that it has no commercial bias. For instance, if I search for a thing that is both a real thing and a product, I get the real thing returned.
Still, for really niche topics, if search is needed, there is no way around Google. On the other hand, Google is not the only way to explore the web, let alone auto-complete a URL...
I’ve tried searching for stuff related to Alexa Presentation Language (APL) a bunch of times. It never finds anything useful; I throw “!g” on the query string and what I’m looking for is typically the first or second result.
> Instead, it should crawl a whitelist of domains, or “tier 1” domains. These would be the limited mainly to authoritative or high-quality sources for their respective specializations, and would be weighed upwards in search results.
Not a big fan of this conclusion. Who chooses the whitelist, and why should I trust them? Is it democratically chosen? Just because a site is popular very clearly does not mean it's trustworthy. Does it get vetted? By whom? Also, whose definition of trustworthy are we trusting?
If I want my blog to show up on your search engine, do I have to get it linked by one of those sites, or can I register with you? Will I be tier 1, or
> If I want my blog to show up on your search engine, do I have to get it linked by one of those sites, or can I register with you? Will I be tier 1, or
I think what I'd say in defense is that we've misunderstood what search engines are useful for. They're really bad at helping us discover new things. Your blog might be awesome, but it's not going to be easy for a search engine to tell that it's awesome. It's going to have to compete with other blogs that also want views, some of whom are going to be better than yours at SEO, and so on.
What a search engine might be able to tell is that it's useful. Because what search engines are at least potentially good at is answering questions. You do that by having a list of known good sites to answer specific types of questions, and looking at the sites they link to. It's when you try to do both (index everything on the web and provide accurate answers to specific questions) that you end up failing to do either. For example this is the #2 result for "python f strings" on DDG[1]. It's total garbage, and, quoting the blog, "we can do better". (This result is also on page 1 for the same query on Google.)
What I believe ddevault is suggesting is that we make a search engine that does the only thing search engines are really good at, answering questions. You throw away the idea of indexing everything on the web, and therefore the possibility of "discovery". What that means is that in 2020 you need some other mechanism for discovering new sites, bloggers, and so on. Fortunately we do have some alternatives in that space.
To be clear, I don't know if I 100% buy this argument, but I think it's the general idea behind what's being suggested in this blog post.
As an experiment, I searched “tech news aggregator” on both google and DDG. Neither listed Hacker News. Instead, apart from a few actual sites, most of the links were articles saying “top ten tech news sites” or links to quora q&a threads.
It definitely seems that search engines can’t find new websites for people. Now they are just aggregating Q&A.
Yeah - this is sort of my thinking. For better or worse if you enter "discovery" terms, what you will get back is not the results of a "discovery" search, but rather an answer to the question "what are some websites that will help me discover X".
> For example this is the #2 result for "python f strings" on DDG[1]. It's total garbage, and, quoting the blog, "we can do better". (This result is also on page 1 for the same query on Google.)
There are other options that may be better, but in general, very few people are looking for them.
I tried this out real quick just to see for myself. First, it was slow for me; if I hadn't been wanting to try it out I might have closed the tab. But I did a search for a question I recently googled (one that was an example of an annoying Google search). I searched "Python arcade sprite opacity", and while the first three results were the same as Google's, the fourth was a GitHub link to the project itself and brought me to the answer (which wasn't on page 1 of Google for me, although it was on page two).
So you need to speed it up, but it does look good for searching documentation at least.
> You do that by having a list of known good sites to answer specific types of questions, and looking at the sites they link to.
I mean, that's basically the core of the original Google PageRank, right? A "good" site linking to another site is what makes that other site some amount of "good" too, and links from better sites carry more 'juice'. "Good" is of course not just binary, but a quantitative weight.
I don't know to what extent that's still at the core of their relevancy rankings. I don't know how all those annoying spammy recipe blogs or content farms get to the top of the results either. I don't think it's because Google's engineers believe they are "good" results.
Relevancy ranking on web search is clearly a hard problem, mainly because so many authors are trying to game it, it's a feedback loop.
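For reference, here is a toy version of that original idea as plain power iteration; this is nowhere near whatever Google actually runs today, and the graph and function names are made up for illustration:

    def pagerank(links, damping=0.85, iters=50):
        # links[i] = indices of the pages that page i links to
        n = len(links)
        rank = [1.0 / n] * n
        for _ in range(iters):
            new = [(1.0 - damping) / n] * n
            for i, outs in enumerate(links):
                if outs:
                    share = damping * rank[i] / len(outs)
                    for j in outs:
                        new[j] += share                       # 'juice' flows along links
                else:
                    for j in range(n):
                        new[j] += damping * rank[i] / n       # dangling page spreads evenly
            rank = new
        return rank

    # Pages 0 and 1 both link to page 2, which links back to both,
    # so page 2 ends up with the highest score.
    print(pagerank([[2], [2], [0, 1]]))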
If Google can only do as Google does despite pouring a whole lot of money into it, I don't see a reason to bet that something better will come out of what's basically an over-simplified description of how Google started out doing it (and then evolved it because it wasn't good enough).
That was exactly my thought on reading that part of the article: that the author was describing Pagerank, but substituting Pagerank's design objective of algorithmically outsourcing quality judgments to the global community of website publishers with the programmer's prejudices instead.
I hope the project of improving on DuckDuckGo is successful though, and some of the other proposals in the article sound promising to me.
In the future I would like to see an open source search engine 'paid for' using some combination of homomorphic encryption, blockchain and Tor-style technologies to trade bandwidth and processing power with its userbase in exchange for search results and other services, but I don't have the expertise to assess how feasible that might be.
the money really is the crux of the issue, too. if you take as a given that consumers won't pay for a service they can get for free, then you're kind of in a bind creating an unopinionated search - if you're truly objective, and reflecting the underlying value of each link faithfully, who do you charge?
This is not about a single search engine instance replacing all your search engine needs.
There could be a community of software developers running one instance of the open-source search engine that focuses on programmers' needs: documentation, VCS hosts, dev blogs, tech news websites and on-topic blogs.
Great if you need to search for how to solve a software issue, terrible if you need to figure out how long to cook spaghetti.
The lists of crawled pages i guess would be visible/searchable too, at least on instances that you could trust.
If you don't like the list, and the maintainers for whatever reason don't want to change it to your liking, feel free to use another instance or set one up yourself.
Things actually sort of ran that way once. The DMOZ directory system was a canonical list of sites by subject, organized top-down. It was maintained by a community of volunteers in the fashion of Wikipedia. I believe it was used as one reference for Google and other search engines at one time. I don't know if such an "objective" system could be rebuilt, however.
Still, it's good to remember that it was once uncertain whether people should access the web using something like a table of contents (portal/directory) or something like an index (search engine). It seems the search engines won.
Great idea, but why build a search engine at all in this case? You can use DDG + your filter and see only the results from your whitelist.
Could easily be implemented for any current search engine.
To a large extent, this is what you already do when you view a page of search results: filter them based on your understanding of which sites / results hold value.
On a public scale, you could make an argument for tighter integration/better privacy with the lists. For example:
    Browser -----Request-to-SE-----> Search Engine
       ^                                   |
       |                                   | Unfiltered Results (in YAML/JSON)
       |                                   v
       +-----Desired Results------ Local Filtering/Rendering
On a private scale, if you are only crawling sites on the allow list, then you have the possibility of being able to better maintain a local database of sites to show up in the search.
Edit: Possibly this could be easier to use to set up distributed search as well, as each node could index a given list, and then distribute that list similarly to DNS. Don't really know how well that would work though, just an idea.
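Roughly, the local filtering step could be as simple as the sketch below, assuming the engine were willing to return unfiltered results as JSON (the field names and allow list are made up for illustration):

    import json
    from urllib.parse import urlparse

    ALLOW = {"docs.python.org", "stackoverflow.com", "github.com"}   # my local allow list

    def filter_results(raw_json: str):
        # Keep only results whose domain is on my local allow list.
        results = json.loads(raw_json)            # assumed: a list of {"url", "title"} objects
        kept = []
        for r in results:
            domain = urlparse(r["url"]).netloc.lower()
            if domain.startswith("www."):
                domain = domain[4:]
            if domain in ALLOW:
                kept.append(r)
        return kept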
Aside from this, a big reason to build this is that it seems a lot simpler than writing a giant web crawler à la Google, and thus is a good target for an open source solution. Which is the biggest problem with DuckDuckGo.
I would think everyone would run their own “crawler” but maybe you could use a ledger and delegate sites to different workers. If you’re whitelisting sites you could maybe only crawl a link or two deep.
It’d take a lot of cycles while small but if you get a network growing you could even have sub-networks with their own whitelist additions (and every user has a blacklist.)
That’s what directory sites offered once upon a time. It was a pretty good way to discover new content back then. I spent a lot of time on dmoz when I wanted to find information about various topics.
But email is already like this. It's the inbox providers who choose which domains are legit, and new domains start from a negative rating. Treating the web the same way doesn't sound too unnatural.
It would be bad if those in such positions profited by "authorizing" who is good, though.
I'm not sure why email should be an example of the correct way to do it. And with email I can check my spam folder and see exactly what has been rejected. So unless the search engine includes a list of sites that aren't deemed worthy with every search (which probably wouldn't happen), I think this solution has some pretty big flaws. It should be noted that the current system also has these flaws, as Google and DDG can show you whatever they want based on whatever criteria they see fit.
I like this idea! Have the usual official results... then have an option to go to level 2, level 3, level 4 etc (lvl 1 is not included in lvl 2)
You can have really biased, technically terrible filters that, for example, put a site on level 4 because it is too new, too small, and any number of other dumb SEO nonsense arguments. (The topic was not in the URL! There was a poor choice of text color!)
I think wikipedia has a lot of research to offer on what to do but also what not to do. Try getting to tier 2 edits on a popular article? It would take days to sort out the edits and construct by hand a tier 2 article.
Per your 2nd para Google used to have some options to tailor the results more, like allinurl or inurl or title or link (IIRC the word had to be in a link pointing to that page) or whatever.
I expected that to evolve to get more specificity, but things went completely the other way, and we can't even reliably specify that a term is on a page with Google now.
Similarly, I was all in on xhtml and semantics (like microformats) where you'd be able to search for "address: high street AND item:beer with price:<2" to find a cheap drink.
I imagine for a FOSS solution we would have to make every separable ranking algo configurable, with the option to toggle them in groups, as well as build CLI-like queries around them (with a GUI).
I'm starting to see a picture now. Instead of wondering how to build a search engine, we should just build things that are compatible. A bit like: the output of your database is the input of my filter.
Take site search: it is easy to write specs for, with tons of optional features, and it can easily outperform any crawler. Meta site search can produce similar output. Distributed DIY crawlers can provide similar data.
Arguably top websites should not be indexed at all. They should provide their own search api.
The end user puts in a query and gets a bunch of results. They go into a table with a column for each unique property. The properties show up in the sidebar to refine results (sorted by how many results have the property). Clicking on one / filling out the field / setting a min-max displays the results and sends out a new, more specific query looking for those specific properties. New properties are obtained that way.
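A rough sketch of that refinement loop, assuming results arrive as dicts of properties (all the property names and values here are illustrative):

    from collections import Counter

    def facet_counts(results):
        # Sidebar: which properties appear across the results, sorted by coverage.
        counts = Counter(prop for r in results for prop in r)
        return counts.most_common()

    def refine(results, prop, value):
        # Clicking a property / filling the field narrows the result set;
        # the same (prop, value) pair would also go out as a more specific query.
        return [r for r in results if r.get(prop) == value]

    results = [
        {"title": "Pilsner", "price": 1.8, "address": "High Street"},
        {"title": "Stout",   "price": 2.4},
    ]
    print(facet_counts(results))                    # [('title', 2), ('price', 2), ('address', 1)]
    print(refine(results, "address", "High Street"))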
Yes, I was thinking along similar lines IIUC of a sort of federated search using common db schemas and search apis so that I could crawl pages and they could be dragged in to your SERP by a meta-search engine. I think the main thing you lose is popularity and ranking from other people's past searches - that could be built in but it relies on trusting the individual indices, which would be distributed and so could be modified to fake popularity or return results that were not wanted (though then one could just cut off that part of one's search network).
This is exactly where he lost me. I don't think it is hard at all to find results in "tier 1" domains with DDG. I would argue that we have almost entirely the opposite problem. Besides blogspam / internet cancer and tier-1 sites you hardly get any results. It's incapable of finding the actually useful communities or blogs for your query.
> Who chooses the white list, and why should I trust them?
That's part of the point. There could be different search engines that run the same code, but have different sets of tier 1 domains that cater to different audiences. And if you have the resources, you could set up your own engine with a set of tier-1 domains that you chose.
Instead of having many bots do inefficient crawling, web sites should publish their own index. Intermediate parties can combine indexes of the sites. Sites that do not provide indexes get less visitors.
That's the best way to fill your search engine with spam. There needs to be a third party that verifies that the site-provided index is in line with the actual content. At which point said third party can be a crawler.
SEO is crushing the utility of Google. It is pretty telling when you need to add things like site:reddit.com to get anything of value. Harnessing real user experiences (blogs, etc) is the key to a better search engine. This model unfortunately crumbles under walled gardens which is increasingly the preferred location of user activity.
That’s where blogs were at, but now a massive portion of them are content farms / splogs.
You’re right that the walled gardens have hurt this. So often I search something specific, or a topic, and find very little. But I know there are communities on Facebook for this, and I know there would be people's posts out there on Instagram which 100% answer my question. But they may as well not exist. Unless I was “following” them when it was said, and mentally indexed it, these things are mostly unfindable, and that’s if I even have an account for said service (which I don’t for Facebook).
It’s sad, more people than ever using the internet, more content & knowledge being created than ever before, yet it’s no longer possible to find the great answers.
This is what we need more than anything. More independent blogs. The ability to search events now, or 10 years ago, mass indexing of RSS feeds, etc.
A general search engine is kinda way out of the ballpark for now. But you could specialize for long form blogs, from all sides, hard-left, hard-right, women in tech, white supremacists, all the extremes and moderates.
I'd love to have an interface to search a topic and see what all kinds of people have posted long form, without commentary or Twitter/Facebook bullshit "fact checking" notices. I want to see what real writers are saying across the spectrum on a given topic for the week or month.
The problem is that content farms have mastered the art of writing like an ostensibly independent blog. This is most visible in recipe blogs, where for example the site will look independent, the blog owner’s "About Me" page will say that she is a young woman born and raised in Louisiana and passionate about her home region’s cooking, but the English is replete with the sort of mistakes that non-native speakers make. You can tell that the content writing was farmed out to someone from Eastern Europe or Southeast Asia, and basically the whole blog and its owner are fake. (Even all the recipes were drawn from other blogs, but someone was paid to rewrite them slightly.)
It's also difficult to distinguish a blog from a content farm if you are just crawling the web. Any content pattern you select for would likely be quickly adopted by SEOs.
I've found a direct correlation between the chance of a site being a content farm and the number of ads on the blog. With 0 ads, the likelihood of a content farm is 0%.
My main problem with DDG is that there's no way to be sure they actually respect their users' privacy as they claim to.
Ideally, services like theirs would be continuously audited by respectable, trusted organizations like the EFF; multiple such organizations, even.
Then I'd have at least some reason to believe their claims of not collecting data about me.
As it stands, I only have their word for it, which in this day and age is pretty worthless.
That said, I'd still much rather use DDG, who at least pay lip service to privacy, than sites like Google or Facebook, who are openly contemptuous of it.
At the very least it sends a message to these organizations that privacy is still valued, and they'd lose out by not trying to accommodate the privacy needs of their users to some extent.
I also prefer DDG's user interface over Google's. And DDG's !bang search shortcuts.
DDG has been my default search engine for years and its results are good enough for me 95% of the time. I only need to use Google as a fallback when searching for niche technical information or "needles in haystacks".
Even then, the Google results are usually terrible. I haven't used Google as a fallback in about a year because every time I tried it they couldn't find what I was looking for either. Or they did something atrocious like changing my search terms for me.
Facebook and Google are huge, global companies whose main product is free, and yet they aren't a charity. The only way to be mega-rich and offer something free is to be shady and manipulative with users' data. Exploiting privacy is their business model. They aren't gonna respect it.
Being super financially successful off free products and services is not a recipe for an honest, citizen respecting company.
Ahh, that does seem like a more correct reading of the page. Do we have a source that's unambiguous? Seems strange to have a bot that's able to parse pages for their instant answers but then not use those same results for regular search.
It's well-known that they use Bing. They also say that they use other sources. In this case I'm looking for an explicit disambiguation of what sources they use for what; my first read led me to interpret it as them also using their crawler to return search links (as opposed to just being used for their instant answers).
"We also of course have more traditional links in the search results, which we also source from multiple partners, though most commonly from Bing (and none from Google)."
This refers to the actual search results and if they used their bot for that I don't see why they would say "multiple partners" instead of "multiple partners and our own crawler". The fact that they don't have their own index is such a common "complaint", and this page is often referred, so if they really used their own bot they should have added that a long time ago.
And it's not just a legacy page they have forgot to update. It keeps being updated. In 2019 it said:
"We also of course have more traditional links in the search results, which we also source from a variety of partners, including Oath (formerly Yahoo) and Bing."
Here they talk about their own indexes getting bigger but at the same time admitting that "it seems silly to compete on crawling and, besides, we do not have the money to do so". Completely understandable but also interesting that the current page doesn't mention their own index at all. Maybe they used to have a goal to build their own independent index that has now been dropped?
All in all, I think it's safe to presume that their own crawler is only used for Instant Answers etc since that's the part of the sources where it's mentioned. Or at the very least used to such a small extent in the actual search results that it would be disingenuous to even mention it as a source.
Digging up past internet history as ammunition in an argument isn't cool in general. It's not that such details are necessarily wrong or irrelevant, but doing this has a systemically degrading effect and we don't want to be that sort of community.
I don't like to criticize the author. We all have good takes and bad takes and really for a single post, you should address the argument. Digging up the past is part of what's making the world worse.
That being said, I do see a valid reason for bringing up his history of bad takes. I used to respect DeVault. He banned me on the Fediverse because he disagreed with me being against defunding the police and against critical race theory.
I find some of his stuff interesting, and I agree with more AGPL and more real open source development. I'd even say I'm jealous that he can actually fund himself off of his FOSS projects and do what he loves.
But I do agree, he does have a lot of questionable takes. He seems to love Go and hate Rust, hate threads for some reason, and has a lot of RMS style takes. Not all of them are bad, and hardcore people can help you think.
As far as this post goes, I do think search is pretty broken. I think a better solution is more specialized search. Have a web tool just for tech searching that does StackExchange sites, github, blogs, forums, bug trackers and other things specialized to development.
Another idea would be an index that just did blogs, do you can look up any topic and see what people are writing about long form for the current month. Add features to easily see what people were saying 5 or 10 years ago too. There is a ton of specialized work there, in filtering blog spam, making sure you get topics from all sides (including "banned" blogs), etc.
You used to have to go to Lycos, Yahoo, HotBot, and Excite, and you'd get different results and find lots of different helpful things. We need that back. It will take some good, specialized tools to break people from Google search.
Quite apart from this type of attack breaking the site guidelines and not being allowed on HN (about which see https://news.ycombinator.com/item?id=25130908)... that was 9 years ago. Imagine being publicly shamed for the worst thing you've done in the past decade. I don't think anyone is going to pass that test.
Is this the kind of world you (or any of us) really want to be part of? Surely not. Therefore please don't help to create it here.
Knowing a bit of his personal history I can kind of understand why he acts the way he does, and has the opinions he does. Doesn't excuse some of it, but at least I kinda get why.
I just wish his name would stop coming up for me tied to opinion pieces like this. I'd rather just see things about how some project he's working on is doing great and being widely adopted.
How would anybody ever know what the server is running and/or doing with the data you send it, regardless of if it is running open or closed source code?
A service, running on somebody else's machine, is essentially closed.
I think the only way to have an 'open' service is to have it managed like a co-op, where the users all have access to deployment logs or other such transparency.
Even then, it requires implicit trust in whomever has the authorization to access the servers.
That sounds a bit like YaCy.[1] It is a program that apparently lets you host a search engine on your own machine, or have it run as a P2P node.
I think the next step forward should be to have indices that can be shared/sold for use with local mode. So you might buy specialised indices for particular fields, or general ones like what Google has. The size of Google's index is measured in petabytes, so a normal person would still not have the capability to run something like that locally.
Edit: In another thread, ddorian43 has pointed out the existence of Common Crawl,[2] which provides Web crawl data for free. I have no idea if it can be integrated with YaCy, but it is there.
In theory, this is the kind of thing that the GPL v3 was trying to address: roughly speaking, if you host & run a service that is derived from GPL-v3'd software, you are obliged to publish your modifications.
But, I agree with you - and I don't think the author had really thought through what they were demanding, they made no mention of licensing other than singing happy praises of FOSS as if that would magically mean you could trust what a search engine was doing.
> In theory, this is the kind of thing that the GPL v3 was trying to address: roughly speaking, if you host & run a service that is derived from GPL-v3'd software, you are obliged to publish your modifications.
You're right... I'm misremembering the GPL, wikipedia says that it was only 'Early drafts of GPLv3 also let licensors add an Affero-like requirement that would have plugged the ASP loophole in the GPL' - I hadn't realised it never made it into the final version.
> In theory, this is the kind of thing that the GPL v3 was trying to address: roughly speaking, if you host & run a service that is derived from GPL-v3'd software, you are obliged to publish your modifications.
> How would anybody ever know what the server is running and/or doing with the data you send it, regardless of if it is running open or closed source code?
Hah, that gave me a picture of a base plate onto which one can click an infinite number of hard disk enclosures, or additional base plates, on top and to the four sides.
You get a subscription, and the index updates (enclosure + preloaded drive) are sent to you periodically.
I've also been using DDG exclusively for many years. I usually find what I need in the first couple of results or in the box on the right, which usually goes directly to the authoritative source anyway.
I'm mostly getting Norwegian results, when searching for Danish subjects from a Danish IP address. It also seems it just hasn't indexed as many websites as Google.
I re-search almost everything technical with google after ddg showed me crap. I still use ddg by default tho', it works for most things, just not for work.
Do Google search results work for you? If yes, then I'd say the reason is you don't see, or don't agree with, how bad results are today (as others have posted extensively about). I for one find DDG to be the search engine that returns the worst results. Qwant is a better Bing-using engine IMO, but it is still bad.
I can think of some improvements (better forum/mailing list coverage), but it's generally pretty good. Lately if I don't find it on DDG I probably won't have much luck anywhere else, either.
I sometimes come across inappropriate results - for example I search for a hex error code and the results are for other numbers - and sometimes the adverts are misleading, but neither are so prevalent enough that it harms the experience in general.
I always send feedback when I come across incorrect results and also try to when I get a really easy find.
I have not had to resort to any other search engine for at least five years.
I've tried DDG for a while, around a couple of years ago, and I had lower-quality results particularly for technical subjects (which are the vast majority of my searches). I will give DDG another shot, though.
For generic stuff DDG is mostly OK. But for local results, even though it has a switch for local results, it REALLY REALLY REALLY sucks and often doesn't get any of the expected places anywhere in the first few pages for New Zealand, which makes it somewhat useless.
I'd say about 50% of the time I'm good with DDG. About 1/3 of the time I add !g, usually for weird error messages and tech stuff.
Honestly we shouldn't be using Google for everything. Why not just search StackExchange or Github issues directly for known bug problems? If you need a movie, !imdb or !rt forward you to exactly where you want to really search on.
If DDG or Google also included independent small blogs for movie results, I could see the value in that. I'd prefer someone's review on their own site or video channel, but it doesn't. We've kinda lost that part of the Internet.
Just this weekend I had the "weird error message" situation. DDG gave me one page with no relevant results while Google had no problem finding proper matches.
Would be nice if they could do better on queries like that... though funny thing is if they didn't respect privacy they probably could. Log any searches where a user looked for something and then tried the same thing prefixed with !g. Use those for figuring out where to focus efforts and what to test with.
Why couldn't several coordinating specialized search engines share their data via something like "charge the downloader" S3 buckets? Then you get an org like StackExchange who could provide indexed data from their site and the algorithms to search the data the most efficiently, GitHub can do the same for their specific zone of speciality, Amazon, etc.
Then anyone who wants to use the data can either copy it to their own S3 buckets to pay just once, or can use it with some sort of pay-as-you-go method. Anyone who runs a search engine can use the algorithms as a guide for the specific searches they are interested in for their site, or can just make their own.
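The "charge the downloader" part already exists as S3's requester-pays option; a sketch of what pulling a shared index shard could look like (the bucket and key names are invented, only the RequestPayer flag is real):

    import boto3

    s3 = boto3.client("s3")

    # RequestPayer="requester" means the party downloading the shard pays the
    # transfer costs, not the org publishing the index.
    resp = s3.get_object(
        Bucket="stackexchange-shared-index",   # hypothetical bucket
        Key="shards/2020-12/python.idx",       # hypothetical shard
        RequestPayer="requester",
    )
    shard_bytes = resp["Body"].read()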
You could trust the other indexers not to give you bad data, because you'd have some sort of legal agreement and technical standards that would ensure that they couldn't/wouldn't "poison the well" somehow with the data they provide. Further, if a bad actor was providing faulty data, the other actors would notice and kick them out of the group or just stop using their data.
It would have to be fully open source, I agree with the other parts of Drew's essay here, but I think we could share the index/data somehow if we got together and tried to think about it. We just need a standard for how we share the data.
There's Common Crawl for the crawling aspect, about 3.2 billion pages last time I looked. One of the issues with that kind of detachment of jobs is crawl data freshness.
I'm thinking more like search indexers with like 100k pages, in a specialized category like 3d printing or basketball. I can index things like those on my home PC, practically.
I think you can already set an S3 bucket to charge the downloader for bandwidth, that's what I was talking about, the storage part is harder. The storage costs could be borne via some sharing agreements between the commercial interests and the pay customers where the infrastructure could be provided for smaller indexers (storage, compute) and in exchange the provider can use the data freely for their own services. Or, we could have a micropayments system that aggregates towards the end of the month, something open source and free to use, maybe a blockchain, this is actually one place that tokens on a blockchain could work as a semi-decentralized payment system between the parties in agreement in my "vision" or whatever you want to call it.
It appears to be the case in a technical workflow sense, from the little I just read of Snowflake, but my proposal would be a much more open system than one under control of a single vendor. It'd be more like a set of standards for interoperation and a common data center so that the data is accessible under one roof. Maybe each entity could do specialized search as a service and the search aggregators would pay by providing some infrastructure to run the crawlers.
I don't personally think any system specification is impossible unless it goes against some mathematical law, so a really fast, distributed query system where there are a few hundred specialized providers for a single query is feasible. Imagine the aggregator does initial analysis to determine the category of the search, like programming, news, or restaurant reviews, then sends the user's query to a set of specialized providers that supply an index for that category, then fuses the results with some further analysis of the metadata returned. Then the user can also include or exclude the specialized providers at will.
You could also eliminate the aggregator as a service and simply make it a user application on the desktop, allowing for even more user control and maybe caching or something.
> they’ve demonstrated gross incompetence in privacy
Not sure I buy the example that is given here.
1. It's an issue in their browser app, not their search service.
2. It's not completely indefensible: it allows fetching favicons (potentially) much faster, since they're cached, and they promise that the favicon service is 100% anonymous anyway.
> The search results suck! The authoritative sources for anything I want to find are almost always buried beneath 2-5 results from content scrapers and blogspam. This is also true of other search engines like Google.
This part is kinda funny because "DuckDuckGo sucks, it's just as bad as Google" is ... not the sort of complaint you normally hear about an alternative search engine, nor does it really connect with any of the normal reasons people consider alternative search engines.
That said, I agree with this point. Both DDG and Google seem to be losing the spam war, from what I can tell. And the diagnosis is a good one too: the problem with modern search engines is that they're not opinionated / biased enough!
> Crucially, I would not have it crawling the entire web from the outset. Instead, it should crawl a whitelist of domains, or “tier 1” domains. These would be the limited mainly to authoritative or high-quality sources for their respective specializations, and would be weighed upwards in search results. Pages that these sites link to would be crawled as well, and given tier 2 status, recursively up to an arbitrary N tiers.
This is, obviously, very different from the modern search engine paradigm where domains are treated neutrally at the outset, and then they "learn" weights from how often they get linked and so on. (I'm not sure whether it's possible to make these opinionated decisions in an open source way, but it seems like obviously the right way to go for higher quality results.) Some kind of logic like "For Python programming queries, docs.python.org and then StackExchange are the tier 1 sources" seems to be the kind of hard-coded information that would vastly improve my experience trying to look things up on DuckDuckGo.
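Mechanically, the tiering the article describes is just a bounded breadth-first walk outward from the whitelist. A minimal sketch of that idea, where get_outbound_domains is a stand-in for whatever the crawler extracts, not a real API:

    from collections import deque

    def assign_tiers(tier1_domains, get_outbound_domains, max_tier=3):
        # Tier 1 = the hand-picked whitelist; anything a tier-N site links to
        # becomes tier N+1, up to max_tier, and could be down-weighted per tier.
        tier = {d: 1 for d in tier1_domains}
        queue = deque(tier1_domains)
        while queue:
            domain = queue.popleft()
            if tier[domain] >= max_tier:
                continue
            for linked in get_outbound_domains(domain):
                if linked not in tier:          # keep the lowest (best) tier seen
                    tier[linked] = tier[domain] + 1
                    queue.append(linked)
        return tier

    # Toy link graph standing in for the crawler.
    toy_links = {
        "docs.python.org": ["github.com"],
        "stackoverflow.com": ["github.com", "example-blog.net"],
    }
    print(assign_tiers(set(toy_links), lambda d: toy_links.get(d, [])))
    # github.com and example-blog.net land in tier 2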
Maybe instead of hard-coding these preferences in the search engine, or having it try to guess for you based on your search history, you can opt-in to download and apply such lists of ranking modifiers to your user profile. Those lists would be maintained by 3rd parties and users, just like eg. adblock blacklists and whitelists. For example, Python devs might maintain a list of search terms and associated urls that get boosted, including stack exchange and their own docs. "Learn python" tutorials would recommend you set up your search preferences for efficient python work, just like they recommend you set up the rest of your workflow. Japanese python devs might have their own list that boosts the official python docs and also whatever the popular local equivalent of stackexchange is in Japan, which gets recommended by the Japanese tutorials. People really into 3D printing can compile their own list for 3D printing hobbyists. You can apply and remove any number of these to your profile at a time.
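A sketch of how those opt-in lists could be applied at ranking time: each list is just a bag of domain boosts, adblock-style. The list names, contents, and field names below are all invented for illustration:

    from urllib.parse import urlparse

    # Hypothetical downloadable preference lists (domain -> boost factor).
    PREF_LISTS = {
        "python_devs":    {"docs.python.org": 3.0, "stackoverflow.com": 2.0},
        "jp_python_devs": {"docs.python.org": 3.0, "teratail.com": 2.5},
    }

    def rerank(results, enabled_lists):
        # results: [{"url": ..., "score": ...}] as returned by the engine;
        # multiply in the boost from every enabled list, then re-sort.
        boosts = {}
        for name in enabled_lists:
            for domain, factor in PREF_LISTS.get(name, {}).items():
                boosts[domain] = max(boosts.get(domain, 1.0), factor)
        for r in results:
            domain = urlparse(r["url"]).netloc.lower()
            r["score"] *= boosts.get(domain, 1.0)
        return sorted(results, key=lambda r: r["score"], reverse=True)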
I like this idea! I think the biggest difficulty with it - which is also probably the most important reason that engines like Google and DDG are currently struggling to return good results - is that the search space is just so enormously large now. The advantage of the suggestion in the blog post is that you trim down the possible results to a handful of "known good" sources.
As I understand it, you'd want to continue to search the whole "unbiased" web, then apply different filters / weights on every search. I really do like the idea, but I imagine we'd be talking about an increase in compute requirements of several orders of magnitude for each search as a result.
Maybe something like this could be made a paid feature, with a certain set of reasonable filters / weights made the default.
This may be a very dumb question, but could the filtering be done client-side? As in, DDG's servers do their thing as normal and return the results, then code is executed on your machine to weight/prune the results according to your preferences.
Maybe this would require too much data to be sent to the client, compared to the usual case where it only needs a page of results at a time. If so, would a compromise be viable, whereby the client receives the top X results and filters those?
This would work if you had a blacklist of domains you didn't want to see. But the idea in the post is closer to a whitelist: the highest priority sites (tier 1) should be set manually, and anything after that should be weighted by how often it's referenced by the tier 1 sites. For a lot of searches you're going to have to pull many results to fill a page with stuff from a small handful of domains, and in fact you might not be able to get them at all. And that's before you start dealing with the weighting issue, which would require quite a bit of metadata to be sent with each request.
I have had a similar idea, what you're proposing is essentially a ranking/filtering customisation. The internet is a big scene, and on this scene we have companies and their products, political parties, ad agencies and regular users. Everyone is fighting for attention, clicks. Google has control over a ranking and filtering system that covers most searches on the internet. FB and Twitter hold another ranking/filtering sweet spot for social networks.
The problem is that we have no say in ranking and filtering. I think it should be customisable both on a personal and community level. We need a way to filter out the crap and surface the good parts on all these sites. I am sure Google wouldn't like to lose control of ranking and filtering, but we can't trust a single company with such an essential function of our society, and we can't force a single editorial view on everyone.
As we have many newspapers, each with its own editorial views, we need multiple search engine curators as well.
Unfortunately I suspect that if it were a premium feature, not enough groups would volunteer the requisite time into compiling and maintaining the site ranking lists. This sort of thing really has to become a community effort in order to scale, I think.
This is a great idea. It's like a modern reboot of the old concept of curated "link lists", maintained by everyone from bloggers to Yahoo. Doing it at a meta level for search-engine domains is a really cool thought.
I got signed up for Goodreads (a book review site), and I get tons of spam. It's not quite the same as your idea, but it is a curated list. I don't know how you stop spammers from adding bogus links to the Python interest list (to use an example).
Like any other list, it depends on who maintains it. You basically want to find the correct BDFL to maintain a list, much like many awesome-* repositories operate.
As a hack until then, I've found Google's Custom Search Engine feature to work well enough for my use cases. I just add the URLs that are "tier 1" for me. https://programmablesearchengine.google.com/cse/all
I mean, hacker news can probably also identify users based on which articles they click on, and how often they jump straight to the comments. I hope they don't.
But a system such as I'm describing is probably the only one that can be entirely consistent with the two disparate requirements of fully anonymizing users, and being useful to both programmers and ophiologists studying different things called "python".
what you are describing is relevance based on user input (be that cookies, search history, interests, a preference for x over y) that may be used as identifying information, which vastly de-anonymises the service. if a search query is too ambiguous then it can be refined. if the user knows they want a programming language and not a snake, they can let the search engine know themselves. don't sacrifice their anonymity for perceived usefulness
Presumably, the most common search preference lists would be used by very large numbers of people -- for example, almost all programmers would rather see Python (language) queries over Python (snake) queries and would probably all be using whichever search preferences become the most popular and well-maintained, like "mit_cs_club.json". A subset of those would also be into anime and enable their anime search preferences (probably more particular), and some of them will also like mountaineering, pottery, and baking, and will have such preferences configured as well. Yes that might be enough to identify you (just like searching for your own name would be) but those preferences don't need to be attached to you, just your query, and you could disable or enable any of them at any time.
It would basically be like sending a search query in this form:
"Python importerror help --prefs={mit_cs_club, studioghiblifans_new, britains_best_baking_prefs, AlpineMountaineersIntl}"
If you like baking, anime, and mountaineering, it's probably convenient to leave all those active for your searches, even your purely programming-focused searches. But you could toggle some of them off if articles about "helping to protect imported mountain pythons" are interfering with your search results, or if you want to be more anonymous. If you're especially paranoid you could even throw in a bunch of random preferences that don't affect your query but do throw off attempts to profile you. You could pretty easily write a script that salts every search with a few extra random preference lists, for privacy or just for fun, and make that an additional feature. The tool doesn't need to maintain any history of your past activity to cater to your search, so I think it would be a good thing for privacy overall.
> Presumably, the most common search preference lists would be used by very large numbers of people
the more anonymous among us tend to opt for common IP addresses and common user agents to become the tree among the forest. adding a profile to that would, well, only add to a digital fingerprinting profile
> those preferences don't need to be attached to you, just your query
that's not how it works. preferences are by their nature personal. every transaction would have your interests and hobbies embedded, on top of metadata
you have voluntarily made yourself the birch among the ebony
> If you're especially paranoid you could even throw in a bunch of random preferences that don't affect your query but do throw off attempts to profile you
how would they not affect the query? they complement the query. or rather, unnecessarily accompany the query. your results depend on your input. it doesn't matter what colour glove you wear to pull the trigger if you bury the gun with the body
> write a script that salts every search with a few extra random preference lists, for privacy or just for fun
just the latter. fuzzing would be pointless since the engine will have already identified you by now
it sounds like an annoying browser extension at best. to label it a pro-privacy tool would be ludicrous
Agreed. I think the key point here is that the web is a radically different place than it was in 1998 (when Google launched and established the search engine paradigm as we know it). Back then the quality-to-spam ratio was probably much higher, the overall size of the web was certainly much smaller (making scraping the entire thing more tractable), and there were many more self-hosted sources rather than platforms (meaning it was more necessary to rely on inter-linking, and "authoritative domains" weren't as much of a thing). The naive scraping approach was both more crucial and more effective. And in the decades since, it's been a constant war of attrition to keep that model working under more and more adversarial conditions.
So I think that stepping back and re-thinking what a search engine fundamentally is, is a great starting point for disruption.
Additionally, something the OP didn't mention is that ML technologies have progressed dramatically since 1998, and that much of that progress has been done in the open. I can't imagine that not being a force-multiplier for any upstart in this domain.
But the situation with authoritative domains hasn't changed much, and what "platforms" tend to be strong for answering questions? As in 1998, there are a few very good places for getting answers to certain kinds of questions. They are not facebook or twitter, ever.
I think Google sort of takes into account "votes", in that they look at the last thing you clicked on from that search, and consider that the "right answer", which they then feed back into their results.
As such, they effectively have a list of "tier 1" domains.
I kind of hope they don't, or there is more to it than just that -- for example, a user coming back and clicking on something else counts as a downvote for the first item.
Any system that ranks things purely based on votes or view counts can have a feedback loop that amplifies "bad" results that happen to get near the top for whatever reason. For web search, this would encourage results that look right on the results page, even if they're not actually a good match for what the user is looking for.
An example of this would be when you're trying to find an answer to a specific question like "How do I do X when Y?". The best result I'd hope for is a page that answers the question (or a close enough question to be applicable), while the promising-looking-but-actually-bad result is a page where someone asks the exact same question but there are no answers.
> Any system that ranks things purely based on votes or view counts can have a feedback loop that can amplify "bad" results that happen to get near the top for whatever reason.
I think this is a place where Google has pretty obvious algorithm problems. For example, I’m building a personal website for the first time in many years, and obviously that means I’m doing a fair bit of looking up new or forgotten webdev stuffs. It’s widely known that W3Schools is low quality/high clickbait/has a long history of gaming the SEO system. They’ve been penalized by Google’s algorithm rule changes but continue to get the top result (or even the top 3-5 results!), even with Google having a profile of my browsing habits, and knowing that I intentionally spend longer on these searches to pick a result from MDN or whatever. It seems pretty likely that W3Schools is just riding click rate to stay at the top. And it’s pathological.
W3Schools is awful. The official documentation is hard to navigate, but W3Schools is notorious for misleading and poor quality examples and advice. MDN, caniuse, CSS Tricks and such are much better resources.
Edit: I semi-intentionally forgot to mention Stack Overflow because it’s so unpredictable in terms of quality.
I don't know if DDG does that exactly, but their help page does say this:
> Second, we measure engagement of specific events on the page (e.g. when a misspelling message is displayed, and when it is clicked). This allows us to run experiments where we can test different misspelling messages and use CTR (click through rate) to determine the message's efficacy. If you are looking at network requests, these are the ones going to the one-pixel image at improving.duckduckgo.com. These requests are anonymous and the information is used only by us to improve our products.
The Firefox network logger does show requests to this domain when I click on a link in the search results, before the page navigates away. This suggests to me they might be logging this information. To be clear, this is speculation on my part, because I haven't examined the URL parameters in detail.
In any case, I'm not sure how much this manages to improve the results, since usually I can get help with my Python query (for example) using whatever crappy blog post is first in the results, but results from the official docs or StackExchange are still probably better and should be prioritized.
I thought DDG already crawled their own curated list of sites?
There is a DuckDuckGoBot, and I think it was in an interview or podcast a while back that Gabriel mentioned they use it to fill gaps in the Bing API data and to provide the instant answers and favicons. Their preference for instant answers was authoritative references such as docs.python.org. This would have been a while back though.
If memory serves, those crawls are only used for Instant Answers. My interpretation of the blog post is that it would be nice to have a search engine that's sort of a hybrid approach based on Instant Answers for the whole web.
Some kind of logic like "For Python programming queries, docs.python.org and then StackExchange are the tier 1 sources" seems to be the kind of hard-coded information that would vastly improve my experience trying to look things up on DuckDuckGo.
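Even a crude version of that logic would help. Something like this sketch (the domains and weights are just made up to illustrate the shape of it):

    from urllib.parse import urlparse

    # Hypothetical tier weights for Python programming queries.
    TIER_BOOSTS = {
        "docs.python.org": 3.0,     # tier 1: official docs
        "stackoverflow.com": 2.0,   # tier 1: Q&A
        # everything else gets 1.0
    }

    def rerank(results):
        """results: list of (url, base_score); multiply each score by a per-domain boost."""
        def boosted(item):
            url, score = item
            return score * TIER_BOOSTS.get(urlparse(url).netloc, 1.0)
        return sorted(results, key=boosted, reverse=True)

    print(rerank([("https://random-blog.example/importerror", 1.2),
                  ("https://docs.python.org/3/library/importlib.html", 0.9)]))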
The problem with this strategy is always going to be that different users will regard different sources as most desirable.
For example, it's enormously frustrating that searching for almost anything Python-related on DDG seems to return lots of random blog posts but hardly ever shows the official Python docs near the top. I don't personally think the official Python docs are ideally presented, but they're almost certainly more useful to me at that time than some random blog that happens to mention an API call I'm looking up.
On the other hand, I would gladly have an option in a search engine to hide the entire Stack Exchange network by default. The signal/noise ratio has been so bad for a long time that I would prefer to remove them from my search experience entirely rather than prioritise them. YMMV, of course. (Which is my point.)
I tried to build something like this in 2007, together with a small band of nerds and geeks and Linux enthusiasts. It was called Beeseek. [0]
I knew close to nothing about building a company or a project, or how a proper business model would have helped it. I was the leader (SABDFL) of the group, and unfortunately I didn't lead it well enough to succeed. We had some good ideas, but ultimately we failed at building more than the initial prototype.
The idea behind it was simple: WorkerBee nodes (users' computers) would crawl the web and provide the computational power to run Beeseek. Users could upvote pages (using "trackers" that anonymously "spy" on the user in order to find new pages - repeat: anonymously). The entire DB would be hosted across multiple nodes. Auth and other functionalities would be provided by "higher level" nodes (QueenBee nodes).
Everything was going to be open source.
Well, it didn't work.
Thankfully, because of Beeseek, I met a few very smart people that I am in touch with to this day.
Life is strange and beautiful in its own way.
Weird, though, that today I still believe that Beeseek could have been the right thing to build. Who knows?
In what ways does what OP describes remind you of your project? Just that it was an open source web search?
One difference from what you describe is that the OP is specifically recommending against decentralization/federation, where it seems to have been the core differentiator of your effort. I don't think what OP is describing is quite what you are describing.
I think that DDV was arguing against decentralization/federation for searching the index. Not necessarily related in any way to building the index (if the distributed nodes all just forward back results to central hub).
Thinking you can design a better search engine by yourself is either egotism or ignorance. Assuming you based it on state of the art search engine research, and could somehow avoid patent encumbrance, it'd still take you 5 years to match Google's results (and even then not likely) sans all the SEO bullshit.
Most people still believe that it's possible for one search engine to help anyone find anything without it knowing anything about them, which is just ridiculous. To get good search results you practically have to read someone's mind. Google basically does this (along with their e-mails, and voicemails, and texts, and web searches, and AMP links, and PageRanked crawls, and context-aware filters) and they still don't always get it right.
There is no magic algorithm that replaces statistical analysis of a large corpus along with a massive database of customized rulesets.
> We should also prepare the software to boldly lead the way on new internet standards. Crawling and indexing non-HTTP data sources (Gemini? Man pages? Linux distribution repositories?), supporting non-traditional network stacks (Tor? Yggdrasil? cjdns?) and third-party name systems (OpenNIC?), and anything else we could leverage our influence to give a leg up on.
Oh, great, so become the Devil himself, then. Count me out.
Yes, we can do better than DDG. But if you are expecting to fund a real search engine with a few hundred thousand dollars you are insane. It will take a ton of development and a ton of hardware to create an index that isn't a pile of garbage. This isn't 2000 anymore. You need to index >100 billion pages and you need it updated and you need great crawling and parsing and you need great algorithms and probably an entirely proprietary engine and you need to CONSTANTLY refine all the above until it isn't garbage. Maybe you could muster something passable for $1B over 5 years with a strong core team that attracts great talent. If Apple actually does this, as they are rumored to, I bet they dump $10b into it just for the initial version.
Agreed, it's going to require significant investment in hardware and software.
The recent UK Competition and Markets Authority report evaluating Google and the UK search market came to the conclusion that a new entrant would require about 18 billion GBP in capital to become a credible alternative search engine, in terms of size, quality, hardware, and man hours making it.
Remember Cuil? It had the size and the fanfare, but unfortunately not the quality.
Open Street Map is a nice analogy for what could work. Aside from the open source maintenance of the map, there's also tons of corporate help in the background. Companies that deliver OSM as a service or rely on it for their own services have an interest in making it better. Mapbox, for example, apparently pays a salary to plenty of people who contribute upstream to Open Street Map. If we can get an Apple/Microsoft/other players collab, maybe a viable alternative can actually be built.
I agree and I have been hoping Apple builds a serious competitor. I welcome any competition at this point. Let's be real, not many people are using bing. People _would_ actually use apple search.
Microsoft tried and failed to build a competitor and it's not like they have shallow pockets.
They grossly underestimated a number of aspects:
- The huge number of man-years invested in hand-tuning Google's search quality stack, and what it would take to replicate it.
- The fact that the machine learning field was simply not ready to tackle the search quality problem.
- The infrastructure required to build a crawler / indexer stack as good as Google's
I think in 2020, the second problem is within reach of many companies technically. It's mostly a matter of throwing enough money at optimized infrastructure.
However, replicating the search quality stack is going to be very hard, unless someone makes a huge breakthrough in machine learning / language modeling / language understanding at a thousandth of the cost it currently takes to run something like GPT-3.
The most likely candidate to execute properly on that last bit is - unfortunately - Google.
Sure but Microsoft has tried and failed to build a competitor to google not to DDG.
I don't see it that hopeless. I feel it kinda is like starting Open Street Maps. It won't be perfect for a long time but there will be people who'd prefer it and help out.
Of course they would - it would be set as the default search on their iPhones with no clear-cut way to change it. You know, "security". The users don't know what's best for them, etc. as Apple seems to think.
Check out the serious difficulties the Common Crawl had with crawling 1% of the public internet on donated money and then get back to me with a plan. This is really, really hard to do for free. Maybe talk to Gates :)
I don't fully understand something about the general tech industry discourse around search and would love to hear if I'm wrong.
Here's my brief and slightly made up history of search engines:
In the beginning of time, search engines took a Boolean query (duck AND pond) and found all the documents that contained both words using an inverted index, then returned them in something like descending date order. But for queries with big result sets this order wasn't very useful, so search engines began letting users enter more "natural language" queries (duck pond) and sorting documents based on the number of terms that overlap with the query. They came up with a bunch of relevance formulas - tf-idf, BM25 - that tried to model the query overlap. But it turns out this is tricky because inferring user intent is a really hard problem, and so modern-day search engines just declare that relevance is whatever users click on. Specifically, they model the probability that you're going to click on a link (or something like it) using a DNN with features like the individual term overlap, the number of users who have clicked on this link, the probability it's spam, the PageRank, etc. Some search engines like Google also include personalized features like the number of times you have clicked on this particular domain - because, for instance, as a programmer your query of (Java) might have different intent than your grandmother's. This score then gets used to sort the results into a ranked list. This is why search engines (DDG included) collect all this data - because it makes the relevance problem tractable at web scale.
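For anyone unfamiliar, the classic term-overlap scoring looks roughly like this (a toy BM25 sketch using the standard formula, not any particular engine's implementation):

    import math
    from collections import Counter

    def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
        """Score one document against a query with the standard BM25 formula."""
        N = len(corpus)
        avgdl = sum(len(d) for d in corpus) / N
        tf = Counter(doc_terms)
        score = 0.0
        for term in query_terms:
            df = sum(1 for d in corpus if term in d)   # document frequency
            idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
            f = tf[term]
            score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc_terms) / avgdl))
        return score

    corpus = [["duck", "pond", "park"], ["duck", "recipe"], ["pond", "maintenance"]]
    print(bm25_score(["duck", "pond"], corpus[0], corpus))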
Maybe it's just my perspective, but I really don't understand why OP would want to build an index - it's hard, boring, and expensive, and the index isn't what violates data privacy - and I don't think people grasp that, at least to some extent, data privacy and relevance are in direct conflict.
I wonder if you could start small on something like this. Build a proof of concept, a search engine for programmers that indexes only programming sites/material. See if you can technically do it, & if you can figure out governance mechanisms for the project. Sort of like Amazon starting with just selling books.
I wonder if instead of another search engine we would benefit from a directory, like DMOZ, or perhaps something tag based or non-hierarchical. Sometimes I find better results by first finding a good website in the space of my query, and then searching within that site, as opposed to applying a specific query over all websites. One example would be recipes: if you search for "bean burger recipe" you will get lots of results across many websites, but some may not be very good, whereas if you already know of recipe websites that you consider high-quality or that match your preferences, then you'll find the best (subjectively) recipe by visiting that site and then searching for bean burgers.
I've recently been /tinkering/ with exactly such an idea! In my case, it's even more specific and scoped: A search engine with only allow-listed domains about software engineering/tech/product blogs that I trust.
It's not even really at the POC stage yet, but I hope to host it with a simple web frontend sometime soon. Primarily, this is just for myself... I just want a good way to search the sources that I myself trust.
It's still pretty new and I'm working on it in my spare time, but my side-project https://searchmysite.net/ seems pretty close to what the author is after:
- "100% of the software would be free software, and third parties would be encouraged to set up their own installations" - I'm planning on open sourcing it under AGPL soon, once I've got documentation, testing etc. ready. Plus it's easy to set up your own installation (git clone; mkdirs for data; docker-compose up -d).
- "I would not have it crawling the entire web from the outset" - That's one of the key features of my approach, only crawling submitted domains. I'm focussing on personal websites and independent websites at the moment, primarily because I don't currently have the money for infra to crawl big but useful sites like wikipedia, but there's nothing to stop people setting up their own instances for other types of site.
- "who’s going to pay for it? Advertisements or paid results are not going to fly" - A tough anti-advert stance is another key differentiating feature to try to keep out spam, e.g. I detect adverts on indexed pages and make sure those pages are heavily downranked. Planning to pay running costs via a listing fee, which gives access to additional features like greater control over indexing (e.g. being able to trigger reindexing on demand).
What about a curated search engine? Allow anyone to curate the results and define a custom list of allowed URLs. Then, others can use their list.
For example, I decide Google is terrible when I'm searching for product reviews, and all I get are results to Amazon referral websites and spam blogs that never owned the products to begin with. So, I find 200 sites or forums that actually have quality reviews and I create a whitelist of those URLs, and I name it "John Doe's product reviews list".
Other people visit the search engine and they can see my list, favorite it, and apply it to their results. I maintain the list, so they continue to get updates as it's refined.
The idea is you visit the search engine, type your query, then select from a drop down one of your favorite curated lists to apply. Maybe you like to use "Mike's favorite free stock photo websites" when searching for free photos for your projects. Maybe you like to apply "Jane's vegan friendly results" when searching recipes or face creams. Maybe you want to buy local, so you use the "Handmade in X" list when searching for your next belt. Maybe you use a list that only shows results from forums, or another for tracking/ad free websites.
Keep track of list changes. So, if someone gets paid off to allow certain sites on their popular list, others can easily fork a past version of the list.
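The list format itself could be dead simple, something like this sketch (the structure and domains are made up):

    from urllib.parse import urlparse

    # Hypothetical curated list; versioned so others can fork an older revision
    # if the maintainer ever gets paid off.
    CURATED_LIST = {
        "name": "John Doe's product reviews list",
        "version": 14,
        "allowed_domains": ["reviews.example.com", "forum.example.org"],
    }

    def apply_list(result_urls, curated):
        """Keep only results whose domain is on the curated allow-list."""
        allowed = set(curated["allowed_domains"])
        return [u for u in result_urls if urlparse(u).netloc in allowed]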
I really like your idea and had similar thoughts - curious how you would build this, and if you planned to? I saw in your comments you ran a website that sounded like an earlier version of Reddit - could you share the name of the site? My email is orangegummy@gmail.com if you aren't interested in publicly posting that info ( I would have emailed you but there's no contact info I could see)
It's almost impossible to build a decent web search engine from scratch today (i.e., build your own index, fight SEO spam, tweak search result relevance...). The web is already so big and so complex. Otherwise Google wouldn't need to hire so many people to work on search alone.
If you didn't start at the very early stage of tiny web (e.g., Google in 1996 as a research project) and grew with the web over the past 20+ years, or you don't have super deep pocket (e.g., Microsoft Bing in mid 2000s), then it's almost impossible to build a decent web search engine within a few years.
It's possible to build vertical search engines on far smaller scale, far less complex, far less lucrative things that Google/Microsoft have little interest in today (e.g., recipes [2], podcasts [3], gifs [4]...)
It's also possible to come up with a different discovery mechanism for web (or a small portion of web), other than a traditional complete web search engine. Essentially you don't cross moat to attack a huge castle (e.g., Google). Instead, you bypass the castle [1], as it becomes irrelevant.
You are probably right. But still... the suggested approach kind of makes sense. A curated list of trusted sites as a kind of seed. Not the entire web. This can be as small or as large as is useful. It does not need to cover the entire web. How big is the "useful" blogosphere, for example? Couldn't an open source project that gathers momentum somehow create a curated list of, let's say, 10,000 trusted blogs and index those? Index all mailing lists that can be found, index all of Reddit, index Hacker News, index Wikipedia, the 100 most well regarded news sites in each country, etc. Wouldn't such an index be a good start and better than Google in many cases?
Is somebody aware of a project where the end-user browser acts as a crawler? It has already spent the energy to render the content. Readability.js extracts page sections, does some processing for keywords, hashes anchor links, signs it and sends it off. Cache-Control response headers indicate if the page is public or private. Of course, wherever it is sent will have an electricity bill to pay to index the submissions.
Yes. Both PeARS[1] and Cliqz[2] tried to do that. Both got direct support from Mozilla[3][4] but it looks like neither really kicked off.
PeARS was meant to be installed voluntarily by users who would then choose to share their indexes only to those they personally trusted, so the idea is very privacy conscious but also very hard to scale.
Cliqz, on the other hand, apparently tried to work around that issue by having their add-on bundled by default in some Firefox installations[5] which was obviously very controversial because of its privacy and user consent implications.
I still think the idea has potential, though, even if it's in a more limited scope.
thanks for the pointer to PeARS, this was wholly new and I'll read into it.
I was aware Mozilla had some involvement with Cliqz, but didn't really pay attention, I remembered the company became owner of the Firefox Addon Ghostery some years ago. They closed shop mid 2020, but their tech-blog 0x65.dev is still up. There are a lot of posts from last December that explain its inner workings.
User agents really do contribute their history, including which search terms led to which pages. From this data (they named it the "human web") the search engine built a per-page model of which search terms most frequently led to that page. Related search terms were normalized. Only later did a "fetcher" actually index high-frequency content and consider it again in a later stage of search. Interesting bootstrap, and maybe more energy efficient as it can run on less information.
Sending the search and browsing history offsite needs explaining and trust. But ultimately, any centralized search engine will see the search data too. Cliqz's approach was to piggy-back on the result sets of established search engines, with each search term + chosen result combination being a donation from the user. Not any less invasive than other search engines. Would I send off the whole corpus of my browsing history? This raises good questions. Thanks for the links!
> If SourceHut eventually grows in revenue — at least 5-10× its present revenue — I intend to sponsor this as a public benefit project, with no plans for generating revenue.
I like this attitude. Makes me happy to be a paying member of SourceHut.
I'm pro privacy, but I don't have a problem with AdWords, outside of Google's implementation.
If AdWords targeting were purely based on the search term, I wouldn't mind.
The search engine has to generate revenue somehow, and the revenue generated on "SaaS CRM" with a single click is likely to be larger than any user's annual subscription (10 - 100+ per click).
I'm unclear on the ethical / privacy concerns of "AdWords"-style advertising.
A search engine's job is to present you with the best possible results for any given query.
An ad is either A) the best possible result or B) not the best possible result. If the ad is the best possible result, then the search engine must display it anyway in order to fulfill its mission. If it is not the best possible result, the search engine must violate its mission in order to display it. To put it bluntly, advertising is paying to decrease the quality of search results.
FWIW it's no longer called Adwords, just Google Ads
Agree, keyword (and location, a lot of searches are for X near me) for the most part offers a way of delivering relevant ads.
Google are able to generate more income per search because of their critical mass of searches and advertisers, as well as having more data on searchers based on search history to maximise that revenue per search.
Low quality information is more profitable to produce than high quality information (thanks to Google and their ads). So all the incentives for people currently are to produce content that is nearly indistinguishable from spam. It is much more profitable to take 1 hour to write generic content than to take weeks to really think through everything. This problem will continue to get worse with AI generated text, and I don't see how Google can fight that. This is why 90% of my time is now spent on YouTube instead, which has the advantage that focusing on quality has a much higher ROI with video than with text. That doesn't mean it's not filled with garbage, just that it's easier to sift through it. Product search is also something I now do on Amazon because Google results are essentially spam. The results there are also not ideal, but at least I can distinguish the bad results faster.
One way to make search better would be to embrace bias and give people what they want. Just accept that most information is biased and bring it to the forefront. Initially there would be some default domain whitelisting, but users can request sites to add or remove from their bubble. Maybe they can "share" these lists among each other and certain clusters would form, which you can then use to recommend more nodes. Users would also be clearly told that results are biased based on their preferences, and that they can choose to view results from other points of reference. It would also have different "modes" for things like products, information, and news. Maybe some users only want to see results from independent or ad-free sources, and they can choose those clusters. Maybe they want only far-right or far-left sources, and then they can choose those. At least people would be aware of their bias, which I think will help fight it more than acting like it doesn't exist. Essentially there would be an explorable graph so I can see different realities. It would be a combination of search and social media.
I don't want privacy, I want competition. Which is why I use Bing, and honestly it works virtually all the time.
Maybe there could be some purely anonymous, ad-free search engine, but it's more realistic to have an alternative commercial one. I really don't care that people are looking at my searches for how to resize an array or cheap hotels in Florida.
This is why search is hard: 15% of Google searches are new each day. [1] And, with 1.7+ billion websites, [2] it would take a gargantuan open source effort to put something like this together.
Not to mention the cost, not sure something like this could be sustained with a Wikipedia-esque "please donate $15" fundraising model.
> DuckDuckGo is not a search engine. It’s more aptly described as a search engine frontend. They do handle features like bangs and instant answers internally, but their actual search results come from third-parties like Bing. They don’t operate a crawler for their search results, and are not independent.
I didn't know this. I tried to use DDG for a while after I switched to Brave, but the results were just not very good and completely missing results I was looking for at times. This would explain it if it was coming from Bing.
I spent seven years working at Bing, and I can tell you that this guy is massively, hugely underestimating the difficulty of this problem. His repeated "it's easy! You just have to..." suggestions are absurd. This is typical HN content where someone with no domain expertise swaggers in and assumes everyone in the space must be idiots, and that only he can save the day.
Trust me, there is not a ton of potential "just sitting on the floor" in web search.
Findx[1] was an attempt to make an open source search engine. Today it's just another Bing wrapper, but their code[2] is still available, waiting to be used as a starting point for another project.
> The main problem is: who’s going to pay for it? Advertisements or paid results are not going to fly — conflict of interest. Private, paid access to search APIs or index internals is one opportunity, but it’s kind of shit and I think that preferring open data access and open APIs would be exceptionally valuable for the community.
There's no reason you couldn't allow the first N number of api hits to be free, then charge for higher tiers of access.
For me, one of the weakest parts of DDG/Google is finding niche content. Getting any results from a non-mainstream blog for anything but a direct quote is extremely hard. I always have to type HN/Reddit to get authentic recommendations or opinions from people who have actual experience with the subject matter. Otherwise 90% of the results are from SEO-optimised sites that barely introduce the subject matter.
How about a crowdsourced search engine like wikipedia or stackoverflow? Like:
When you search for "kittens" you get the links that are most upvoted by the community.
If nobody has ever submitted links for the search term "kittens", you get a link to selected generic search engines. And "kittens" ends up on a list of terms someone has searched for but nobody has yet added a good result link for.
I hate to be so negative but that's just another sort of SEO problem. Someone will pay a large group of people to sit and click upvotes for their clients nonstop.
Of course there are going to be some highly debated search terms. But I think that applies to Wikipedia as well, and they have managed to pull it off so that it works reasonably well.
I mean, you could always put a big red badge on top of the results that says something along the lines of "this search term seems to be troublesome. You may want to check Qwant/DDG or maybe even Google."
One of the biggest problems the article points out more than anything is "Who's Going To Pay For It?"
You have one of two options. The crowd-funded approach would have to come with an understanding that you're trying to build a better, more private search engine, which will leave the payer paying for all those who don't, and the payer won't be able to have a say in anything, as we'd like search results to be flat and even across the board. That means if I search for Coronavirus results, not only do I get the government and "Official" sources, I should get everything that I'm looking for and refine as needed.
The second approach is obviously big money, but if you have big money coming in, big money will give you one of two options: do as they say, or they withdraw funding, leaving you back at option one and having to downsize.
Rocks and hard places people. Not much else you can do there. Unless you take a Pilled.net approach.
I'd love a truly open source, world class search engine. Curious how both the crawler and the search index / search are done by the likes of Google/Bing/DDG. Eventually someone will make an OSS version that can compete.
The beauty of such an OSS solution may be the custom heuristics that can be created based on the crawled data.
The challenges to OSS developers are numerous. First of all, many popular sites on the internet block crawlers other than Google and Bing, because only those ones seem to matter to their business, and any small upstart would be assumed to be a dodgy bot. Secondly, Google amasses the database it has only with vast data centers, incredible amounts of bandwidth, and power requirements unavailable to a startup.
We should probably classify the problem of identifying crawlers as impossible and move along. Fewer resources wasted and easier automation for everyone. Assuming a crawler is malicious is narrow-minded.
This helps to verify that a bot that announces itself as google bot is indeed a google bot. It doesn’t help identify a bot that pretends to be a user/browser.
While I agree with its lack of organization, I don't think YaCy being intolerably slow is necessarily an argument. If you are looking for a complete set of pages on a specific topic, time is sort of irrelevant. Google, for example, has alerts for new results. That these pages are not available sooner (before publication) is not intolerable. You can also throw hardware at YaCy and adjust the settings, which improves it a lot. The challenge with a distributed approach is sorting the results. Other crawlers have the same problem, but in a distributed system it is even harder.
Running an instance for websites related to your occupation or hobby, YaCy is quite wonderful. You don't want Google removing a bunch of pages that might cover exactly the sub-topic you are looking for. Of course, the smaller the number of pages in your niche, the better it works.
> Crucially, I would not have it crawling the entire web from the outset. Instead, it should crawl a whitelist of domains, or “tier 1” domains. These would be limited mainly to authoritative or high-quality sources for their respective specializations, and would be weighed upwards in search results. Pages that these sites link to would be crawled as well, and given tier 2 status, recursively up to an arbitrary N tiers.
I like this idea. It would be interesting to see the domain of every search result that I have clicked on and see what the distribution is like. I suspect there would be a long tail, but I wonder how many domains actually need to be indexed for 99% of my personal search needs. Does anyone have data like this?
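For reference, the quoted tiering approach is basically a bounded breadth-first crawl. A simplified sketch (fetch_links is a stand-in for a real fetcher, and a real implementation would probably assign tiers per domain rather than per page):

    from collections import deque
    from urllib.parse import urljoin

    def tiered_crawl(tier1_domains, fetch_links, max_tier=3):
        """Bounded BFS: pages linked from tier N get tier N+1, stopping at max_tier.
        fetch_links(url) -> list of outgoing links (hypothetical fetcher)."""
        queue = deque(("https://" + d + "/", 1) for d in tier1_domains)
        tier_of = {}
        while queue:
            url, tier = queue.popleft()
            if url in tier_of or tier > max_tier:
                continue
            tier_of[url] = tier
            for link in fetch_links(url):
                queue.append((urljoin(url, link), tier + 1))
        return tier_of  # ranking could then weight results by, say, 1 / tier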
I think the only way to get an open search engine going is to specifically target a niche first, build profit from ad results, and then expand.
My suggestion? Older folks don't have a well known search engine targeted at them, and there are features that could make a search engine helpful for them (High contrast mode or built in screen reading for the vision impaired, anti fraud results for common scams that target seniors based on search results, links to places to watch old shows when people search for character names), and they are a lucrative demographic, both in terms of having money to buy things but also for political ads.
I'm sure a focus group of seniors would have more detailed thoughts.
>We need a real, working FOSS search engine, complete with its own crawler.
How would an open-source search engine stand against abusive SEO-optimization? If anyone can understand how the ranking algorithm works then anyone can game it.
One thing I always wished for is if there were a way to use duckduckgo bang searches in my browser without sending them through DDG. But apparently it's harder to implement than it sounds.
You absolutely can, at least in Firefox: you can right click on a search field, select "Add a Keyword for this Search...", then save it as a bookmark and enter the keyword (you don't have to use !, but it is an option if you choose to).
You can also create such a bookmark manually and use %s in the URL as a placeholder where the search query should be placed.
The manual configuration can be useful when there's no direct search field. For example freshports.org allows querying freebsd.org. I can add a bookmark with search keyword "fp" to point to https://freshports.org/%S
After that I can type in the address bar: fp lang/python39 to land on https://freshports.org/lang/python39 (the capital %S doesn't escape special characters like /)
In Firefox you can right click on a search field and add a keyword bookmark. Once saved, you can type 'kw search query', where kw is your defined key word, in the address bar to directly search the relevant site
I'm aware of that. The problem is that you have to manually add all the keywords yourself. AFAIK, there isn't an easy way to import a large list of curated keywords like the DDG bang list.
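The expansion logic itself is trivial to script if you maintain your own small mapping; it's replicating DDG's huge curated list that's the hard part. A toy local resolver might look like this (the bang table is a tiny made-up subset):

    from urllib.parse import quote_plus

    # Made-up subset of bangs; DDG curates thousands of these.
    BANGS = {
        "g":  "https://www.google.com/search?q={}",
        "w":  "https://en.wikipedia.org/wiki/Special:Search?search={}",
        "gh": "https://github.com/search?q={}",
    }

    def resolve(query, default="https://duckduckgo.com/?q={}"):
        """Expand a leading '!bang' locally instead of routing it through DDG."""
        if query.startswith("!"):
            bang, _, rest = query[1:].partition(" ")
            template = BANGS.get(bang)
            if template and rest:
                return template.format(quote_plus(rest))
        return default.format(quote_plus(query))

    print(resolve("!w Walt Whitman"))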
FYI, searx implements the DDG bang searches (1). There is no autocompletion (for now), but at least you can use all of them in one meta search engine. The syntax is slightly different: "!!g <text>" instead of "!g <text>".
Chromium implements this as a feature by default. Visiting a website with an OpenSearch tag in its `head` or searching on a website without one lets you later search the website itself from the urlbar by pressing "tab". It's history-based and works very well.
> If SourceHut eventually grows in revenue — at least 5-10× its present revenue — I intend to sponsor this as a public benefit project, with no plans for generating revenue. I am not aware of any monetization approach for a search engine which squares with my ethics and doesn’t fundamentally undermine the mission. So, if no one else has figured it out by the time we have the resources to take it on, we’ll do it.
Now _that_ is putting your money where your mouth is!
Glad to see a technology leader taking this important issue head-on.
The whitelist approach reminds me of Yahoo's internet directory or DMOZ. The internet directories all ended up closing.
That approach will not scale in the general case, not even with the search engine following links from said sites. Too many areas of interest, too many languages, nowhere near enough people to categorize 'high-quality' sites all the time.
It could work for domain-specific searches for expert communities. SourceHut should start with a good code search engine...
The way I'd code a better search engine is I'd design an ML model that's trained to recognize handwritten HTML like this, and only add those pages to the index. It'd be cheap to crawl, probably only needing a single computer to run the whole search engine. It'd resurrect The Old Web, which still exists but just got buried beneath the spammy, SEO-optimized grifter web over the years as normies flooded the scene.
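You might not even need ML to get started; a few crude signals would probably catch most of it. A sketch with made-up thresholds:

    import re

    def handmade_score(html):
        """Crude signals that tend to separate hand-written pages from SEO/CMS output."""
        longest_line = max((len(l) for l in html.splitlines()), default=0)
        features = {
            "no_tracking": "google-analytics.com" not in html and "gtag(" not in html,
            "not_minified": longest_line < 2000,
            "few_scripts": html.lower().count("<script") <= 2,
            "no_framework_root": not re.search(r'id="(root|__next|app)"', html),
            "small_page": len(html) < 100_000,
        }
        return sum(features.values()) / len(features), features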
I hope to never use your search engine. I love hand-written HTML as much as the next guy, but search engines are made to find things. And useful information exists on web sites that use generated and/or minified HTML.
> I guess the lesson is if you can't do everything Google does, don't even try
I don't agree with that at all. But if your goal is to make "a better search engine" as you said, it does actually have to be "better" and not just different.
A major challenge with search in 2020 is that it's adversarial. Any open source search engine that gets popular is going to be analyzed by black hat SEO people and explicitly targeted by spam networks. Competently indexing and searching content is really only a small part of the problem now, with the adversarial "red queen's race" against black hat SEO and spam being the more significant issue.
I'm an old-time user of DDG. I agree with the 3 points. I feel like I'm using a shitty car in a world where everyone drives a Ferrari.
The one that strikes me most is the results. I feel like DDG doesn't search the entire internet; there are zillions of pages out there waiting to be indexed, even old websites, but the results I get are so poor.
Even with this handicap, I still use it over INSERT YOUR AD HERE Google Search.
I think it makes sense to separate the notion of bangs, what a lot of people use DDG for, from that of a search engine. I think projects like my own (very primitive!) bangin are a step in the right direction here: https://github.com/samhh/bangin
Aren't the secrets of the algorithm what prevent people from gaming the results? While I love the idea of search becoming fully open source I'm skeptical it could be done. I hope I'm wrong and I'd love to dedicate time to an open source project with this goal if anyone presents a convincing plan.
The algorithms, code and configuration can be public, if the ranking isn't done just by those, but instead by many participants in the project all over the world and also by personal preferences of each client. That would be hard to game.
For all its charms, DDG seems to be tailored to answering the kinds of questions that are most likely to be asked. IME, it's not well-suited to detailed dives into a narrow (likely unpopular) topic.
If the first page doesn't satisfy, just prefix the search terms with '!b ' and Bing usually nails it.
The three problems mentioned are not the only ones. Most people are unaware of the censorship DuckDuckGo also does, as do Google and other major search engines.
My biggest pet peeve with DDG at the moment is that whenever I search for something on my phone the first two results are ads, and those two results actually take up my whole screen. I mean sure, those are probably not privacy invading, but I literally don't care as I wasn't looking for them.
I think DDG is mostly fine. It does sometimes give results for the wrong location or older results, so if you want specific new information for your location you should use Google (easy enough, just use !g in the query and it sends you to Google).
You need money and dedicated resources to run and manage the service, which at some point is just going to require trust. Trusting nobody is smart, but expecting a service to compete and win the long game without trusting it is pointless.
The only place I've really found duckduckgo lacking is its meme game and maybe esoteric tech problems. Even with the techy stuff if I narrow my search to github and stackoverflow I usually find what I'm looking for.
Am I wrong, or are there simply not enough parties out there interested in revolutionizing search? Is Google good enough despite the misgivings around privacy? All we get are skins of existing search engines, or proxies.
I'm surprised no one has rented Ahrefs' database, whipped up an algorithm and called it a search engine. Besides Google and Microsoft, who has a bigger snapshot of the entire web (NSA not included)? Majestic maybe?
I prefer DDG over Google not because of the search results, but because they don't block the Tor IP range like Google does.
So Google is unusable for me (constantly solving captchas is very annoying).
Crawlers are a top-down approach; a distributed list that people pay digital money to be listed in would both incentivize nodes to stay online and transform sybil attacks into paid advertising.
DDG is a search engine that respects your privacy. DDG doesn't put you in a "bubble" based on your location, search history, browsing history, physical movements, time of day, country of residence, IP, and the billions of parameters I don't know about. Therefore, there are things that cannot be done the way Google or Bing do them.
Google is like an oracle because it exploits all of this. And it works, at a cost to your privacy (but maybe you are OK with paying it). DDG is more of a "meta-search-engine" with limited capabilities. But you get the flexibility of accessing the search engines of thousands of websites.
If you don't care about your privacy, don't use DDG and stay with Google.
I can sympathize with the author. DDG is a terrible choice for a search engine. They have managed to somehow underplay the importance of their cause, which is why nobody uses DDG.
We have to make sure to include highly relevant advertisements in the search results, at least 50% of the results should be ads. So there needs to be a marketplace for buying/selling ads.
We can't have a search engine that is only useful for finding the most relevant web pages for a given query. People love highly relevant advertisement in their search results.
I normally would not reply to someone who equates my blog posts with the ravings of a megalomaniacal fascist, but I will at least clarify for the benefit of onlookers that all three of these points are false. I handle account deletion and GDPR requests all the time, and every email you get from sr.ht (1) is not a marketing email and (2) can be trivially unsubscribed from, with the exception of payment notifications - which is not only allowed per the GDPR, but a lot better than silently charging you a recurring payment forever.
Trump is not a fascist. He's not far-right. Mere nationalist tendencies do not make one a fascist.
From wikipedia:-
"Fascism (/ˈfæʃɪzəm/) is a form of far-right, authoritarian ultranationalism characterized by dictatorial power, forcible suppression of opposition and strong regimentation of society and of the economy which came to prominence in early 20th-century Europe. The first fascist movements emerged in Italy during World War I, before spreading to other European countries. Opposed to liberalism, Marxism, and anarchism, fascism is placed on the far right within the traditional left–right spectrum."
Refusing to concede a fair election, using gerrymandering and electoral fraud to keep a conservative minority party in power, obstructing the political process, packing the courts as a political weapon, not to mention open racism, a cult of personality, fervent nationalism, and tacit endorsement of political violence? If you compare the rise of European fascism to contemporary trends in US politics, the parallels are pretty stark.
If it walks like a duck, and quacks like a duck... it is, at the least, a precursor to fascism. It doesn't happen all at once, and if we wait until it's plainly obvious to call it like it is, then it'll probably be too late. And even if you don't want to call it fascist, it's out of line to compare this shit to my personal blog. That's a baseless character attack, which is a dick move, not to mention against the HN guidelines.
How to get quality results, and a sustainable, community-led search engine?
=== Contexts ===
A "search engine" such as Google is good at many things, and extremely bad at others. The main issue with it in my view is that it lacks context about what you are looking for. The main context you can ask for is "Videos", "Pictures", etc.
* Specifying context takes time, so it's OK for long searches Google sucks at (find this specific article I read a while back). Take your time while you specify language, exact/fuzzy match, publication date, background color, author name or any number of things you know about your search.
* Some requests can be processed with instant answers, that's good news as it fits the open source model quite well.
* Lastly, the other requests. Some are asked like a question and might require NLP to sort through. Quite hard IMO; it might get better but will still require compute power if done server-side. It's mostly: parse the question to find the context, and perform a contextual keyword search/instant answer.
* And those that aren't questions: "regular", keyword-based requests, that "just" require a big index and a big infrastructure to search it.
=== Hardware ===
Now, we are left with the cost centers: hardware. IMO, the only way to scale is to rely on the community and distribute things.
* Databases: if this is a community project, and not too latency-sensitive, the community can help by distributing them over a p2p network, even with a single source of trust.
* Queries, walking the database: delegating processing to untrusted third-parties is a bit more dangerous. Maybe allow each user to specify a list of trusted servers? Can be centralized and clients ask the network, though it might leak part of their search, depending on the index method. Could be client-side?
* Processing the answers: client-side, or through any number of frontends (like searx).
* Crawlers: crawling the net isn't cheap. You could use one or multiple sources of trust. Domain-specific crawlers, like hinted at in Drew's post. Maybe crawl on demand or through the user's computer (a web extension that indexes as the user browses, and allows them to full-text search their history; share it or not).
=== Content ===
For some measure of quality, I find that websites that do not have advertisements offer better-quality content. That's likely due to conflicting interests. It would be great if the semantic web mandated disclosing revenue sources. You could downrank or avoid crawling sites with ads and/or Google Analytics, for instance. This could be abused to an extent if the service ever becomes popular, but heh https://xkcd.com/810/
Domain-specific crawlers would be nice as well.
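For the no-ads signal above, even a crude marker check would be a start (a sketch; the marker list and penalty values are just a guess at the shape of it):

    # Crude heuristic: penalize pages that embed well-known ad / analytics scripts.
    AD_MARKERS = ["googletagmanager.com", "google-analytics.com",
                  "doubleclick.net", "adsbygoogle"]

    def ad_penalty(html, per_marker=0.2):
        """Return a multiplicative ranking penalty in (0, 1]."""
        hits = sum(1 for marker in AD_MARKERS if marker in html)
        return max(0.1, 1.0 - per_marker * hits)

    def rank_score(base_score, html):
        return base_score * ad_penalty(html)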
=== Added value ===
To be adopted, the service needs to be better than the original in some ways. I think that a new search engine should not try to conquer the masses at first. Instead, find some people who are not satisfied with the current offering and court them. Currently, I think advanced search is one such unmet need: exclude websites protected by reCAPTCHA, only include websites that are less than X years old, no ads, etc.
Allow users to create their own contexts and easily switch them: bangs, tabs, date/time/geoip, etc. Have them create contexts dedicated to their activities: programming is an obvious one, but so is cooking, gardening, encyclopedic search, language usage/dictionaries, etc.
=== Monetization ===
At that point, I am not sure it can ever turn a profit? EU grants? Consulting? Help webmaster set up search on their own website? Sell desktop indexing software?
Well, I do have some ideas around content curation, but I am a tad reticent to share them here, and not sure they are more useful than the above.
This is severely hampered by the availability of Google. After a few false starts I ended up switching to DDG as my default search engine. My usage pattern is as follows: for things that are predictably easy to find (e.g. the queries "normal" people would issue), I use DDG. That's about 90-95% of all my queries. For obscure stuff, I add "!g" at the end of the query and go to Google. That's the remaining 5-10%, and I doubt anyone can do better than Google there. That way I get both the privacy and the long tail.
Perhaps it was better for him to say, "There's a better way to do it than DDG" than "We can do better than DDG" as if he's about to do it when in fact he's waiting for his revenue to go up.
Google's filtering is so weird. I have a bad habit of buying old hardware without checking if the documentation has been made available on the web.
Recently I found myself desperate for any information on a piece of hardware I had gotten. I was swapping out all sorts of queries with different keywords hoping to find a manual. I was able to find some marketing material which was helpful, albeit barely. Eventually I had exhausted the search results for most of my queries, gave up and assumed that it was simply lost to time and I was out of luck.
Eventually I went back to the sales paper I had found and went to the site it was hosted on, a Lithuanian reseller. I translated the page, eventually finding a direct link to a user manual on the exact same page as the sales paper. The document was in English and contained important words from my queries (such as the product name, the company, "user manual", etc.). The document was at the same path as the sales paper too. I have no idea why Google found the sales paper but not the manual.
Unfortunately the manual still wasn't what I was looking for exactly but it was a hell of a lot better than what I could get from Google's results.
DuckDuckGo is a mirage and should not be used by privacy-conscious folks. Take a look at its terms of service, information collected section:
"We also save searches, but again, not in a personally identifiable way, as we do not store IP addresses or unique User agent strings. We use aggregate, non-personal search data to improve things like misspellings."
So they save your web searches and claim that they do so in an non-personally identifiable way. The privacy problems with this claim are many, even if one accepts it at face value (good luck verifying that this is the case).
I don't see why you'd both nitpick their terms of service, and then also claim that it's a pack of lies and can't be trusted. Why do the former and then the latter? If your complaint is just "I can't verify anything about their privacy" then that would've made sense.
How do you save a search in a non-personally identifiable way? Do you have a human verify the data belonging to each and every search ? Not saving IPs and/or browser data doesn't solve the problem since the search terms themselves can be personally identifiable.
How do you verify that DuckDuckGo does -the minimal and ineffective- things they claim to do? They offer no proof.
How do you verify that DuckDuckGo does not secretly cooperate with more powerful coercive actors?
How do you verify that DuckDuckGo, offering a single point of compromise, has not been thoroughly compromised by more powerful actors?
"How do you save a search in a non-personally identifiable way?"
To a first approximation, you just... do it.
Granted, if you search "{jerf's realname here} {embarrassing disease} cure" or something, in the pathological case, you could at least guess that maybe it was me, though even then my real name is far from unique, and nothing stops anyone else from running such a search.
But otherwise, if all you have is a pile of a few billion searches, you don't have any information about any of the specific searchers. Even if you search for your own specific address, you don't really get anything out of it; there's no guarantee it was you, or a friend of yours, or an automated address scraper. There isn't much you can get out of a search string without more information connected to it.
The rest of your criticisms are too powerful for the topic at hand; they don't prove we shouldn't use DDG, they prove we shouldn't use the internet at all.
The mere existence of someone is not really PII. You don't know that I did that search, nor can you connect it to anything else... and this constructed example, in which I try to jam some sort of PII into a single search, is itself a bizarre case that probably corresponds to fewer than 1 in 100,000 or 1 in 1,000,000 searches, if that. When's the last time you stuck your own PII into a search box and connected it to something of some sort of significance? It's a very small edge case.
A search history can reveal many things about a person. The mere fact that someone, somewhere searched for "star wars harry potter crossover slash", unconnected to any other search item, doesn't reveal anything about anybody.
> How do you save a search in a non-personally identifiable way?
Save a sha256 hash of every search for 24 hours. If you see the same hash from >10 distinct IP addresses in a 24 hour period, save the search terms.
That's just off the top of my head, I have no reason to think they're doing it exactly like that. The point is that you're claiming that we shouldn't trust DuckDuckGo because you can't think of a way that they could securely and privately do what they do -- but that's just your intuitions, for whatever they may be worth.
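In sketch form, that idea could look something like this (the threshold and the hashing of IPs are my assumptions, not a claim about what DDG actually does):

    import hashlib
    from collections import defaultdict

    THRESHOLD = 10
    seen = defaultdict(set)   # query hash -> hashed client IPs; flush this every 24h
    saved_terms = set()       # queries that crossed the anonymity threshold

    def record_search(query, client_ip):
        """Only persist the query text once enough distinct IPs have issued it."""
        h = hashlib.sha256(query.encode("utf-8")).hexdigest()
        seen[h].add(hashlib.sha256(client_ip.encode("utf-8")).hexdigest())
        if len(seen[h]) > THRESHOLD:
            saved_terms.add(query)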
I also don't really buy the worries you have with the last two questions, e.g.:
> How do you verify that DuckDuckGo does not secretly cooperate with more powerful coercive actors?
How would you verify that for any centralized service, open source or not? I think your security concerns go a bit beyond what most people interested in critiquing / improving DDG can reasonably expect to achieve.
>How would you verify that for any centralized service, open source or not?
Other centralized (search) services don't have their entire existence depending on this one factor. What is DDG if not alleged privacy? Just use Bing directly.
I don't understand that argument at all. What's the threat model?
I think it's entirely reasonable to be in the following posture: I want as much privacy for my web searches as I can reasonably achieve without having to run a search engine myself. I'm willing to trust that search providers are not saving personally identifiable information or passively turning over search data to law enforcement if they claim that they are not in their terms of service.
That's pretty much the use case for DDG. With Bing you know they are violating your privacy. With DDG you have a promise in writing that they are not. It's hard to see how that's not strictly better than what you get from Bing if privacy is among your core desiderata.
I think we're on the same page. I was saying that if it were to be discovered that DDG lacks privacy then there would be no reason to use it over Bing since that is its raison d'etre.
>I'm willing to trust that search providers are not saving personally identifiable information or passively turning over search data to law enforcement if they claim that they are not in their terms of service.
Do other search companies disclose that they share data with the FBI, NSA, etc in their ToS? Genuinely don't know.
> How would you verify that for any centralized service, open source or not?
I think, technically, some sort of honeypot verification could prove a compromise (i.e. if information that has very little chance of existing naturally in two systems, say a string of GUIDs, turned up in both).
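For example, a toy version of that canary idea (names and steps are purely illustrative, and it hand-waves the hard part, which is getting any visibility into the "other system" at all):

    import uuid

    # Canary strings with essentially zero chance of occurring naturally anywhere.
    def make_canaries(n=5):
        return [str(uuid.uuid4()) for _ in range(n)]

    canaries = make_canaries()

    # Step 1 (out of band): submit each canary as a search query to the service
    # under test, and nowhere else.
    #
    # Step 2 (much later, and the infeasible part): obtain some view of the
    # "other system" -- another index, an ad network, a data broker dump -- and
    # see whether any canary shows up there.
    def check_for_leak(observed_strings):
        # Any hit is strong evidence the original service shared or leaked queries.
        return [c for c in canaries if c in observed_strings]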
But... I agree with your point. I don't think this is actually feasible or realistic, just technically possible.
The only solution I see is fully distributed/decentralized search. Run your own crawler or be part of a network that distributes this out to each participating node.
Every centralized search engine has immensely hard-to-resist and powerful incentives to play "The Eye of Sauron" with your data. Additionally, they offer single points of compromise to other, far more powerful actors. Whatever guarantees DuckDuckGo gives you -and right now they don't give any- don't mean much, if they've been thoroughly (willingly or unwillingly) compromised.
Which doesn't mean one should always steer well clear, just that one should at least be aware of the tradeoffs one makes when using a centralized search engine. And with DuckDuckGo's misleading marketing, I feel that this point is lost on significant chunks of its userbase.
Such search engines have been around for many years, and they suck donkey balls. Pardon my French. Install YaCy and tell me how you like it.
It wouldn't matter anyway, because decentralization doesn't really solve privacy any better than centralized search, besides the fact that it could theoretically provide more choices.
No matter what you use, privacy ultimately depends on trust. The reason that I have more trust for DDG than I do Google is, unlike Google, its primary audience is privacy-minded folks. If it came out that DDG was tracking users and selling that data, DDG would be immediately done as a brand. They at least have some incentive to do what they say. Decentralization provides no such benefit because a search "node" is unlikely to have any sort of meaningful brand to keep up.
> And with DuckDuckGo's misleading marketing, I feel that this point is lost on significant chunks of its userbase.
How is it misleading? My understanding from their marketing is that they don't create profiles of their users based on searches. Until we have evidence to the contrary, it's not outrageous to assume they are being truthful.
Yeah, now you're just saying "Nothing centralized can ever be trusted". So just say that rather than nitpicking their ToS. You weren't going to care what they said anyway.
Hi. I took a look at Mojeek (first time I've heard about it) and since you mentioned the site and you work there -
In your Privacy page (Data Usage section) there is a mention of stored "Browser Data": "These logs contain the time of visit, page requested, possibly referral data, and located in a separate log browser information." & "We may also use aggregate, non-personal search data to improve our results".
This is an honest question - How is that not exactly what the Parent stated was the issue?
So they save your web searches and claim that they do so in a non-personally identifiable way.
An often-cited issue with DDG is that its favicon service was informing DDG of sites you visit, rather than searches you make.
But agreed that all search engines have to be trusted on their word about anonymising data and not retaining PII when it comes to searches specifically. There's nothing any front end user can do to verify it.