Imagine if, instead of entering the web browser space and kneecapping XHTML and the semantic web properties it had baked in, Google had stayed out. We might be able to mark articles up with `<article>`, and set their subject tags to the URNs of the people, places, and things involved. We could give things a published and revised date with change logs. Mark up questions, solutions, code and language metadata. All of that is extremely computer friendly for ingestion and remixing. It would not only have turned search into a problem we could all solve, but given us rails to start linking disparate content into a graph of meaningful relationships.
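To make "computer friendly for ingestion" concrete: with markup like that, a consumer needs only stock XML parsing to pull out subjects and dates, no ML or scraping heuristics. A minimal sketch, where the tag and attribute names (`subject`, `urn`, `published`, `revised`) are invented for illustration rather than taken from any real standard:

```python
# Sketch: ingesting a hypothetical semantic article document.
# Tag and attribute names are invented for illustration only.
import xml.etree.ElementTree as ET

doc = """
<article>
  <subject urn="urn:example:person:ada-lovelace"/>
  <subject urn="urn:example:place:london"/>
  <published>2009-04-01</published>
  <revised>2010-06-15</revised>
  <title>Notes on the Analytical Engine</title>
</article>
"""

root = ET.fromstring(doc)
subjects = [s.get("urn") for s in root.findall("subject")]
published = root.findtext("published")
revised = root.findtext("revised")

print(subjects)             # ['urn:example:person:ada-lovelace', 'urn:example:place:london']
print(published, revised)   # 2009-04-01 2010-06-15
```

Linking documents into a graph then becomes a matter of joining on shared URNs, which is exactly the kind of remixing the markup would enable.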
But instead Google wanted to make things less strict, less semantic, harder to search, and easier to author whatever the hell you wanted. I'm sure it has nothing to do with making it difficult for other entrants to find their way into search space or take away ad-viewing eyeballs. It was all about making HTML easy and forgiving.
It's a good thing they like other machine-friendly semantic formats like RSS and Atom...
"Human friendly authorship" was on the other end of the axis from "easy for machines to consume". I can't believe we trusted the search monopoly to choose the winner of that race.
I think in this case the semantic web would not work, unless there was some way to weed out spam. There are currently multiple competing microdata formats out there that enable you to specify any kind of metadata, but they still won't help if spammers fill those in too.
Maybe some sort of webring of trust where trusted people can endorse other sites and the chain breaks if somebody is found endorsing crap? (as in, you lose trust, and so does everybody under you)
> I think in this case semantic web would not work, unless there was some way to weed out spam.
That's not so hard. It's one of the first problems Google solved.
PageRank, web of trust, pubkey signing articles... I'd much rather tackle this problem in isolation than the search problem we have now.
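For reference, the core of PageRank really is tractable in isolation: it's a power iteration over the link graph. A minimal sketch on a made-up three-page graph, using the 0.85 damping factor from the original paper:

```python
# Minimal PageRank power iteration. The link graph is invented for illustration.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}

damping = 0.85
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}  # start uniform

for _ in range(50):
    # Base teleport probability, then redistribute each page's rank
    # across its outlinks, damped.
    new = {p: (1 - damping) / len(pages) for p in pages}
    for page, outlinks in links.items():
        share = rank[page] / len(outlinks)
        for target in outlinks:
            new[target] += damping * share
    rank = new

# Ranks sum to ~1; "c" comes out on top since both "a" and "b" link to it.
print(rank)
```

The same machinery generalizes to endorsement edges instead of hyperlinks, which is why tackling trust as its own graph problem isn't far-fetched.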
The trust graph is different from the core problem of extracting meaning from documents. Semantic tags make it easy to derive this from structure, which is a hard problem we're currently trying to use ML and NLP to solve.
>Semantic tags make it easy to derive this from structure
HTML has a lot of structure already (for example, all levels of heading are easy to pick out, as are lists), and Google does encourage use of semantic tags (for example for review scores, author details, or hotel details). For most searches I don't think the problem lies with being able to read meaning - the problem is you can't trust the page author to tell you what the page is about, or to link to the right pages, because spammers lie. Semantic tags don't help with that at all, and differentiating spam from good content for a given reader is a hard problem - the reader might not even know the difference.
In the interests of not causing a crisis when Top Level Trust Domain endorses the wrong site and the algorithm goes, "Uh uh," (or the endorsement is falsely labeled spam by malicious actors, or whatever), maybe the effect decreases the closer you are to that top level.
But that's hierarchical in a very un-web-y way... Hm.
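The decay-plus-revocation scheme sketched above fits in a few lines: trust flows outward along endorsement edges, attenuating with each hop from the root, and revoking an endorser cuts off everything reachable only through them. The graph, names, and decay factor below are all invented for illustration:

```python
# Sketch of the "webring of trust" idea: trust decays per hop, and revoking
# an endorser drops everyone under them. All names/values are made up.
from collections import deque

endorsements = {
    "root": ["alice", "bob"],
    "alice": ["carol"],
    "bob": ["spammy-site"],
}

def trust_scores(graph, root="root", decay=0.5, revoked=frozenset()):
    """BFS from the root; each hop multiplies trust by `decay`,
    and revoked nodes break the chain below them."""
    scores = {root: 1.0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child in revoked or child in scores:
                continue
            scores[child] = scores[node] * decay
            queue.append(child)
    return scores

# Normally bob's endorsement lends spammy-site some trust...
print(trust_scores(endorsements))
# ...but revoking bob also drops everyone reachable only through him.
print(trust_scores(endorsements, revoked={"bob"}))
```

The decay means a mistake far from the root only dents scores slightly, while a bad endorsement near the top is costly - which is the crisis-avoidance property described above, at the price of the hierarchy being objected to.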
The internet is still kind of a hierarchy though, having "changed" "ownership" from the US government (DARPA) to the non-profit ICANN.
And that has worked... quite fine. I have no objections (maybe they're a bit too liberal with the new TLDs).
Most of the stuff that makes the hierarchies seem bad are actually faults of for-profit organizations (or other unsuited people/entities) being at the top, and not just that someone is at the top per se. In fact, in my experience, and contrary to popular expectation, when a hierarchy works well, an outsider shouldn't actually be able to immediately recognize it as such.
> Imagine if instead of kneecapping XHTML and the semantic web properties it had baked in, Google had not entered into the web browser space. We might be able to mark articles up with `<article>`, and set their subject tags to the URN of the people, places, and things involved. We could give things a published and revised date with change logs. Mark up questions, solutions, code and language metadata.
Can you explain in technical details what you think was lost by Google launching a browser or what properties were unique to XHTML?
Everything you listed above is possible with HTML5 (see e.g. schema.org) and has been for many years so I think it would be better to look at the failure to have market incentives which support that outcome.
Good machine-readable ("semantic") information will only be provided if incentives aren't misaligned against it, as they are on much of the commercial (as opposed to academic, hobbyist, etc.) Web. Given misaligned incentives, these features will be subverted and abused, as we saw back in the 1990s with `<meta name="keywords">` and `<meta name="description">` tags and the like.
I don't think there's any reason to think google was responsible for the semantic web not taking off. People just didn't care that much. It may have been a generally useful idea, but it didn't solve anyone's problem directly enough to matter.
If WordPress output semantic markup, that would instantly give you a lot more than 0.0001%. The rest would follow as soon as it improved the discoverability of their content.
WordPress can't magically infer semantic meaning from user input any better than Google can. The whole point of the semantic web is to have humans specifically mark their intention. A better UI for semantic tagging would help with that, but it would still be reliant on the user clicking the right buttons rather than just using whichever thing results in the correct visual appearance.