BeautifulSoup and its clones do parsing pretty well. Just extracting the text out of HTML isn't incredibly hard, and metadata is too unreliable to ever be much use.
If navigation clutter is a problem, you can calculate anchor tag density across the DOM tree and prune branches that exceed a certain threshold. That removes navigational elements with reasonable accuracy.
It's not going to be perfect, but even Google messes this up every once in a while. I wouldn't consider it a major hurdle.
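For anyone who wants to see the pruning idea concretely, here is a minimal sketch using only Python's standard-library `html.parser`. It is not the code from any search engine mentioned in this thread. The definition of density (anchor text characters divided by all text characters in a branch), the 0.5 threshold, the `CONTAINER_TAGS` set, and every name in the block (`Node`, `TreeBuilder`, `prune_link_heavy`, `main_text`, `SAMPLE`) are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumptions for this sketch; none of these values come from the thread.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}  # never counted as visible text
CONTAINER_TAGS = {"div", "nav", "ul", "ol", "table", "section",
                  "aside", "header", "footer"}  # branches eligible for pruning


class Node:
    """Tiny DOM node; children holds Node objects and text strings in order."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a rough tree; tolerant of unclosed and stray tags."""
    def __init__(self):
        super().__init__()
        self.root = Node("[document]")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray closers.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def text_len(node):
    """Number of non-whitespace-trimmed text characters under node."""
    if node.tag in SKIP_TAGS:
        return 0
    return sum(len(c.strip()) if isinstance(c, str) else text_len(c)
               for c in node.children)


def anchor_text_len(node):
    """Number of text characters under node that sit inside <a> elements."""
    if node.tag == "a":
        return text_len(node)
    return sum(anchor_text_len(c) for c in node.children if isinstance(c, Node))


def prune_link_heavy(node, threshold=0.5):
    """Drop container branches whose anchor text share exceeds threshold."""
    kept = []
    for child in node.children:
        if isinstance(child, Node):
            if child.tag in CONTAINER_TAGS:
                total = text_len(child)
                if total > 0 and anchor_text_len(child) / total > threshold:
                    continue  # mostly links: treat as navigation or promotion
            prune_link_heavy(child, threshold)
        kept.append(child)
    node.children = kept


def extract_text(node):
    """Join the remaining visible text fragments with single spaces."""
    parts = []
    for child in node.children:
        if isinstance(child, str):
            parts.append(child.strip())
        elif child.tag not in SKIP_TAGS:
            parts.append(extract_text(child))
    return " ".join(p for p in parts if p)


def main_text(html, threshold=0.5):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune_link_heavy(builder.root, threshold)
    return extract_text(builder.root)


# Hypothetical page: a nav bar, one real paragraph, and a "related" box.
SAMPLE = """<html><body>
<nav><ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul></nav>
<div id="post"><p>Grilled mackerel needs little more than salt, lemon,
and a hot pan. See the <a href="/notes">notes</a>.</p></div>
<div class="related"><a href="/a">Ten more fish recipes</a>
<a href="/b">Best pans of the year</a></div>
</body></html>"""

if __name__ == "__main__":
    print(main_text(SAMPLE))
```

On the sample page the nav bar and the "related" box are dropped, while the paragraph with one inline link survives. Only container-like tags are candidates for pruning, so a link inside a sentence is not removed on its own. A real crawler would need to tune the threshold and the tag set per corpus.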
The hard part is understanding which parts are the content versus navigation or promotions of other content. I’ve written a couple of search engines. Have you tried making one with Beautiful Soup?
Why does it matter?
No, I use JSoup for my search engine.
I don't presume the source is available... Unbelievably cool project that I'm sure a lot of people have imagined themselves doing.
Yeah, it depends on what you want to prioritize and value in your search engine. I’m coming at it from the angle that if you want to make a good, new, and different kind of search engine, you need to do something fundamentally different from Google. No one is going to beat Google at their own game. Leveraging metadata is a very easy way to make something new and different, but it won’t be as comprehensive as Google. I doubt that someone doing what you described over a few months or year could make a search engine that anyone wanted to use.
> I doubt that someone doing what you described over a few months or year could make a search engine that anyone wanted to use.
Dunno. Not only are people sending me money to develop my search engine (not enough to live off, but still), I also get emails and tweets almost weekly from people who say they love it.
I think attempting to be as comprehensive as Google (or more so) is a trap. The better move is to fly under them. Be cheaper and better at something. Recipes are a great example of something Google is just miserable at and that is easy to do much better. There are plenty of such niches.
You love seafood, so just literally run grep on the entire page, and if it contains the word, include it as a correct result.
In reality, you will miss a lot of real seafood pages, because they don't need to mention "seafood" and context matters. So what? Chances are that the one website where a person randomly added "I love seafood" to the top of the page will be the only page you ever wanted to see anyway.
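To make the trade-off concrete, here is a minimal sketch of that grep-style matching in Python. The page names and texts in `PAGES` are made up for illustration, and `mentions` and `search` are hypothetical helpers that behave roughly like a case-insensitive whole-word grep, not anyone's actual ranking code.

```python
import re


def mentions(page_text, keyword):
    """True if keyword occurs as a whole word, ignoring case (like grep -iw)."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, page_text, re.IGNORECASE) is not None


def search(pages, keyword):
    """Return the names of all pages whose text mentions keyword."""
    return [name for name, text in pages.items() if mentions(text, keyword)]


# Hypothetical corpus: only two of these pages use the literal word.
PAGES = {
    "fish-market": "Fresh oysters, mussels, and line-caught cod delivered daily.",
    "personal-blog": "I love seafood. Anyway, here are my thoughts on tax law.",
    "chowder": "A seafood chowder with clams, shrimp, and smoked haddock.",
}

if __name__ == "__main__":
    print(search(PAGES, "seafood"))  # ['personal-blog', 'chowder']
```

The fish market page is plainly about seafood but never uses the word, so it is missed, while the tax-law blog matches on a single throwaway sentence. Whether that is good enough is exactly what this thread is arguing about.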
There's too much data for you to go through in an entire lifetime in any case, so why worry about it as long as you can get something that's good enough? You will never get the best data; if that were possible, Google would be giving you the best data already.
How do I know? Well, looking up my real name shows where I grew up, which school I went to and graduated from, and even which exam I scored 100 on... and even some places I used to work at. While that part is going to make most people paranoid, I wish ALL results were as detailed as this one, but there's little you can do.
That’s how you make a worse search engine than Google. If you are serious about competing in that space, I think you need to do something fundamentally different from Google. Treating pages as a bag of words leads to a shitty search engine. Like I said, I’ve built a few search engines, and I have tried this.