Hacker News

BeautifulSoup and its clones do parsing pretty well. Just extracting the text out of HTML isn't incredibly hard, and metadata is too unreliable to ever be of much use.



The hard part is understanding which parts are the content versus navigation or promotions of other content.

I’ve written a couple search engines. Have you tried making one with beautiful soup?


No, I use JSoup for my search engine.

You can calculate anchor tag density across the DOM tree and prune branches that exceed a certain threshold to remove navigational elements with reasonable accuracy if that is a problem.

It's not going to be perfect, but even Google messes this up every once in a while. I wouldn't consider it a major hurdle.
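The anchor-density heuristic described above can be sketched without a full HTML parser. This is a toy illustration, not the code from the search engine: the `Node` record is a hypothetical stand-in for a real DOM node (jsoup's `Node` in the author's case), and the 0.5 threshold is just an example value.

```java
import java.util.List;

public class AnchorDensity {
    // Hypothetical minimal node type standing in for a real DOM node;
    // tag "a" marks an anchor element.
    record Node(String tag, String text, List<Node> children) {
        int textLength() {
            int n = text.length();
            for (Node c : children) n += c.textLength();
            return n;
        }

        int anchorTextLength() {
            if ("a".equals(tag)) return textLength();
            int n = 0;
            for (Node c : children) n += c.anchorTextLength();
            return n;
        }

        // Fraction of the subtree's text that sits inside <a> tags;
        // branches above a chosen threshold are likely navigation.
        double anchorDensity() {
            int total = textLength();
            return total == 0 ? 0.0 : anchorTextLength() / (double) total;
        }
    }

    public static void main(String[] args) {
        Node nav = new Node("ul", "", List.of(
                new Node("a", "Home", List.of()),
                new Node("a", "About", List.of())));
        Node article = new Node("p",
                "A long paragraph of actual body text with ", List.of(
                new Node("a", "one link", List.of())));

        System.out.println(nav.anchorDensity() > 0.5);      // true  -> prune
        System.out.println(article.anchorDensity() > 0.5);  // false -> keep
    }
}
```

A real implementation would also weigh subtree size and structural tags, as the jsoup-based code later in the thread does.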


I don't presume the source is available... unbelievably cool project that I'm sure a lot of people have imagined themselves doing.

edit: https://git.marginalia.nu/marginalia/marginalia.nu !!!


The actual feature I described is not in that repo though. It's something I've been working on. Here's the code for that (AGPL):

    // Snippet from a larger class; needs jsoup (org.jsoup) on the classpath.
    import java.util.HashMap;
    import java.util.Map;

    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.nodes.Node;
    import org.jsoup.nodes.TextNode;
    import org.jsoup.select.NodeVisitor;

    private static final double PRUNE_THRESHOLD = .5;

    public void prune(Document document) {
        PruningVisitor pruningVisitor = new PruningVisitor();
        document.traverse(pruningVisitor);

        pruningVisitor.data.forEach((node, data) -> {
            if (data.depth <= 1) {
                return;
            }
            if (data.signalNodeSize == 0) {
                node.remove();
            }
            else if (data.noiseNodeSize > 0
                    && data.signalRate() < PRUNE_THRESHOLD
                    && data.treeSize > 2) {
                node.remove();
            }
        });
    }



    private static class PruningVisitor implements NodeVisitor {

        private final Map<Node, NodeData> data = new HashMap<>();
        private final NodeData dummy = new NodeData(Integer.MAX_VALUE, 1, 0);

        @Override
        public void head(Node node, int depth) {}

        @Override
        public void tail(Node node, int depth) {
            final NodeData dataForNode;

            if (node instanceof TextNode tn) {
                dataForNode = new NodeData(depth, tn.text().length(), 0);
            }
            else if (isSignal(node)) {
                dataForNode = new NodeData(depth, 0, 0);
                for (var childNode : node.childNodes()) {
                    dataForNode.add(data.getOrDefault(childNode, dummy));
                }
            }
            else {
                dataForNode = new NodeData(depth, 0, 0);
                for (var childNode : node.childNodes()) {
                    dataForNode.addAsNoise(data.getOrDefault(childNode, dummy));
                }
            }



            data.put(node, dataForNode);
        }

        public boolean isSignal(Node node) {

            if (node instanceof Element e) {
                if ("a".equalsIgnoreCase(e.tagName()))
                    return false;
                if ("nav".equalsIgnoreCase(e.tagName()))
                    return false;
                if ("footer".equalsIgnoreCase(e.tagName()))
                    return false;
                if ("header".equalsIgnoreCase(e.tagName()))
                    return false;
            }

            return true;
        }
    }

    private static class NodeData {
        int signalNodeSize = 0;
        int noiseNodeSize = 0;
        int treeSize = 1;
        int depth = 0;


        private NodeData(int depth, int signalNodeSize, int noiseNodeSize) {
            this.depth = depth;
            this.signalNodeSize = signalNodeSize;
            this.noiseNodeSize = noiseNodeSize;
        }

        public void add(NodeData other) {
            signalNodeSize += other.signalNodeSize;
            noiseNodeSize += other.noiseNodeSize;
            treeSize += other.treeSize;
        }

        public void addAsNoise(NodeData other) {
            noiseNodeSize += other.noiseNodeSize + other.signalNodeSize;
            treeSize += other.treeSize;
        }

        public double signalRate() {
            return signalNodeSize / (double)(signalNodeSize + noiseNodeSize);
        }
    }

It renders the text of this link (at present): https://news.ycombinator.com/item?id=32594821

Into this search-engine friendly text:

The hard part is understanding which parts are the content versus navigation or promotions of other content. I’ve written a couple search engines. Have you tried making one with beautiful soup? Why does it matter? You love seafood, so just literally run grep on the entire page and if it contains the word then include it as a correct. In reality, you will miss a lot of real seafood pages because they don't really need to mention "seafood" and context matters, so what? Chances are that that one website where person randomly added "I love seafood" to the top of the page will be the only page that you've ever wanted to see anyway. There's too much data for you to go through in entire life in any case, so why worry about it as long as you can get something that's good enough? You will never get best data, if it was possible, google would be giving you best data already. How do I know? Well, looking up my real name shows where I grew up, what school I went to, graduated, and even which exam I scored 100 on... And even some places I used to work for in the past, and while that part is going to make most people paranoid, I wish ALL results were as detailed as this one, but there's little you can do. No I use JSoup for my search engine. You can calculate anchor tag density across the DOM tree and prune branches that exceed a certain threshold to remove navigational elements with reasonable accuracy if that is a problem. It's not going to be perfect, but even Google messes this up every once in a while. I wouldn't consider it a major hurdle. I don't presume the source is available... unbelievably cool project that I'm sure a lot of people have imagined themselves doing.


Thanks for sharing! I’ll try this after work on some URLs from the search engine I’m working on as my hobby project.


Yeah, almost certainly could do with some tweaking and tuning, but the basic idea works remarkably well in many cases.


Do you have a fully functioning code example? I didn't realize it was just a snippet when I looked at it earlier.



Yeah, it depends on what you want to prioritize and value in your search engine. I’m coming at it from the angle that if you want to make a good, new, and different kind of search engine, you need to do something fundamentally different than Google. No one is going to beat Google at their own game. Leveraging metadata is a very easy way to make something new and different, but it won’t be as comprehensive as Google. I doubt that someone doing what you described over a few months or a year could make a search engine that anyone wanted to use.


> I doubt that someone doing what you described over a few months or a year could make a search engine that anyone wanted to use.

Dunno. Not only are people sending me money to develop my search engine (not enough to live off, but still), I also get emails and tweets from people who say they love it on an almost weekly basis.

I think attempting to be as comprehensive as (or more comprehensive than) Google is a trap. The better move is to fly under them: be cheaper and better at something. Recipes are a great example of something Google is just miserable at that is easy to do much better. There are plenty of such niches.


Why does it matter?

You love seafood, so just literally run grep on the entire page, and if it contains the word then include it as a correct result.

In reality, you will miss a lot of real seafood pages because they don't really need to mention "seafood" and context matters, so what? Chances are that the one website where a person randomly added "I love seafood" to the top of the page will be the only page you've ever wanted to see anyway.

There's too much data for you to go through in an entire life in any case, so why worry about it as long as you can get something that's good enough? You will never get the best data; if it were possible, Google would be giving you the best data already.

How do I know? Well, looking up my real name shows where I grew up, what school I went to, graduated, and even which exam I scored 100 on... And even some places I used to work for in the past, and while that part is going to make most people paranoid, I wish ALL results were as detailed as this one, but there's little you can do.


That’s how you make a worse search engine than Google. If you are serious about competing in that space I think you need to do something fundamentally different than Google. Treating pages as a bag of words leads to a shitty search engine. Like I said, I’ve built a few search engines, and I have tried this.

Edit: https://en.wikipedia.org/wiki/Bag-of-words_model
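For reference, the linked bag-of-words model reduces a page to unordered term counts, which is exactly why context gets lost. A minimal sketch (the `bag` helper is hypothetical, not from any library):

```java
import java.util.Map;
import java.util.TreeMap;

public class BagOfWords {
    // Reduce text to unordered term counts: all word order and
    // surrounding context is discarded.
    static Map<String, Integer> bag(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String tok : text.toLowerCase().split("\\W+")) {
            if (!tok.isEmpty()) counts.merge(tok, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two sentences with opposite meanings produce the same bag,
        // so a bag-of-words engine cannot tell them apart.
        System.out.println(bag("I love seafood, not sushi")
                .equals(bag("I love sushi, not seafood")));
    }
}
```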



