Fundamentally, it suffers from the same problems as the semantic web itself; top of my list of concerns is link rot, and any scenario in which at least one link in those examples becomes inaccessible or unrecoverable.
For the Semantic Web / JSON-LD to work, the infrastructure behind it needs to permanently keep all of the sources a file relies on (i.e. a permaweb). Otherwise, the entire knowledge graph suffers from a left-pad style situation, wherein the removal of a critical piece of definitions makes every document that relies on it for those definitions unparsable.
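To make the failure mode concrete: a typical JSON-LD document delegates the meaning of all of its keys to a remote context. The sketch below (made-up values, modeled on the familiar json-ld.org "person" example) stays machine-interpretable only for as long as that context URL keeps resolving; if it rots, a conforming processor can no longer expand the document.

    {
      "@context": "https://json-ld.org/contexts/person.jsonld",
      "name": "John Lennon",
      "born": "1940-10-09",
      "spouse": "http://dbpedia.org/resource/Cynthia_Lennon"
    }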
Yes, JSON-LD is just a different serialization format for RDF.
Anyone looking for a solution to the link rot problem (as well as versioning and dependency management) might be interested in Plow[0].
We built it out of frustration with exactly those drawbacks of the semantic web/data space. At its core it builds on a model very similar to Cargo's, with a public index and checksummed artifacts, which allows easy mirroring for use cases where all transitive links always need to be resolvable.
We are also looking to build out tooling in the future that makes upholding those properties easier, such as a linter that ensures that e.g. all referenced concepts in the dependency tree have a definition. (We already have a primitive version of that, but only at the single-package level.)
IPFS alone only gives you content addressing, so you'll need at minimum some higher-level IPLD structure that allows you to express structures such as an ontology. That can be based on the Plow model, if you want.
If you want a workable ecosystem that enables many decentralized parties to collaborate, the model of not-too-big-not-too-small packages interconnected via abstract dependencies (embodied by SemVer ranges) is still the gold standard in my opinion. And breaking the mold of traditional monolithic, slow-moving ontologies with high technical barriers to publishing is one of the main motivating factors for building Plow (together with building a stable ontological layer that you can build other software on).
I actually have worked on bridging the gap between semantic data tech and IPFS in the past[0] (both on a fine-granular per-concept level and a more coarse-grained ontology level), and I can just say that there are a ton of additional challenges if you want to do it right (and semtech is already challenging enough as it stands today).
Conceptually, the infrastructure parts that make up Plow (the index and artifact store) are also flexible enough that you could distribute them via IPFS.
It also comes from the exact same line of development at W3C[1]. LD, or “linked data”, is W3C’s attempt at rebranding the “semantic web” (and, despite my general dislike for rebranding attempts, it seems more descriptive). This is the first time I’ve seen LD expanded as “linking data”, though.
[1] Manu Sporny [JSON-LD 1.0 editor and CG chair], “JSON-LD and Why I Hate the Semantic Web” (2014), http://manu.sporny.org/2014/json-ld-origins-2/ (down right now but that looks temporary?). Unfortunately, he did not participate in JSON-LD 1.1 (as much?), and the spec once again returned to its RDF-jargon-filled equilibrium.
Yep it’s just hypermedia. And as we know, cool URLs don’t change. Of course there are hypermedia solutions to even changing URLs, but that would require investing in using the actual platform rather than constantly rediscovering parts of it.
Is "left-pad style situation" (aka the Azer/Kik/NPM drama) the formal name for this class/type/definition of system fragility? What would this be an example of, formally, in math or systems design or whatever?
The reason I ask is that I see it literally all the time, sometimes as a design goal, which drives me absolutely up the wall. "Let's make a centralized repository (CIR) of all acronyms, and then everyone can reference the CIR". Ah yes, sure, great idea, and the first time someone changes Air Conditioning to Alternating Current, ALL YOUR DOCUMENTS CHANGE[1].
If "left-pad style" has a formal name, I might be able to offer a counter-argument in the form of math.
[1] Now, obviously, the CIR problem does have a solution, I'm aware of that, but it's not a solution that lies in the domain of the document markup - it's a change-process problem, which is in another domain, which means extra eyeballs and red tape. So I guess the broader question I'm looking for is: what is wrong with these sorts of functionalities that break out of their domain like this? What do you call that?
Agreed, there's an overly-optimistic assumption that other people will go out of their way to maintain referential integrity for applications that they don't know about.
I think it would be better if it were done in a content-addressed way, so that you can trivially host the bits you're relying on without having to also be an authoritative source for them.
Once we have that figured out we can socialize it onto the users. Like either pay $1 for access or pin whichever nodes in the app's dependent knowledge graph hash to your user ID mod 1024. Having helped with your 1/1024th share of the hosting, you get access for free.
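A rough Rust sketch of that sharding rule (should_pin and SHARDS are made-up names, purely for illustration of the idea above):

    // Each user pins the graph nodes whose hash lands in their shard.
    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    const SHARDS: u64 = 1024;

    fn shard_of(node_id: &str) -> u64 {
        let mut h = DefaultHasher::new();
        node_id.hash(&mut h);
        h.finish() % SHARDS
    }

    // True if this user is responsible for hosting this node (by IRI or CID).
    fn should_pin(node_id: &str, user_id: u64) -> bool {
        shard_of(node_id) == user_id % SHARDS
    }

    fn main() {
        println!("{}", should_pin("https://schema.org/spouse", 42));
    }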
Maybe it's not "link rot," it's just information that's no longer relevant. Design your systems to be able to handle that. And you'll be able to interop and reason with all kinds of data.
Designing a system to operate with partial data (or backing it up everywhere) has pretty hard tradeoffs when you are dealing with a system the size of the internet, where any piece of content can range from a few bytes to terabytes.
So we add a DHT and use magnet links. The permanence problem is then at least distributed and the semantic graph is no longer a collection of single points of failure.
Why isn’t semantic web more popular inside companies?
This expandable graph seems like the purest representation of all data in a company.
Instead we have several different databases each with their own schemas. All with entities related to one another, but without these relations defined anywhere, and impossible to query.
RDBMSes are implementation details of this graph that involve a heap of manual work. Wouldn't it be great if the RDBMS were generated from this graph?
There are so many interconnections between the data involved in the design and implementation of a product and the product database itself.
Imagine hovering over a UI element and seeing who implemented it and when, what project it was part of, why the project was initiated, and what kpis and goals it contributes to.
At the point you create the data, you don't have link rot. At the point you create the link rot, you don't notice it. At the point you notice it, you don't have time to fix it. At the point you try to fix it, you don't have a full list of rotten links. And those who keep starting the cycle by creating links don't care that ends up this way because there's no immediate impact on their silo.
Thus to fix it, you need an organisation-level project, but it's hard to identify what business value you get from it.
I have had this question for years myself, and I still think it has a lot of potential for companies.
For new tech to take off, multiple things need to be true and often the social factors are the most important. The semantic web tech seems mostly driven by scientists and highly specialist companies. The barrier for developers is still quite high and a lot of concerns have been poorly addressed. Probably the most important merit of json-ld is improving the usability for developers a bit, making it less difficult to take advantage of the semantic web.
Furthermore, most organizations subdivide the work into silos (teams) and have a clear subdivision of goals. Anything you are doing that isn't directly contributing to the goals assigned to you or your team is potentially damaging your standing and rewards, whether those are promotional, monetary or simply recognition and praise. After all, this effort cannot be spent on your main focus.
So this comes together with my previous point: it takes a lot of 'extra' time adapting the data in your silo to work with the semantic web of your company, and there are usually zero incentives for doing so. This is true even if it would be quite valuable to the company as a whole, because individuals and teams don't act from that perspective. Thus, it simply doesn't get done, since it only really works if (nearly) everybody is onboard.
It's like the famous Jeff Bezos decree demanding that everything be available via an API: it takes a single-minded visionary (or dictator) to push everyone to do this 'extra' work and get to the point where the investment pays off.
I think the only hope comes from going graph-first and using a graph DB as the source of truth. I think programming as a whole would be in a much better place if graph DBs had beaten out RDBMSes and attracted more R&D for perf optimization.
> Imagine hovering over a UI element and seeing who implemented it and when, what project it was part of, why the project was initiated, and what kpis and goals it contributes to.
That's exactly what we are building at Field 33[0] with a package manager for ontologies (Plow[1]) as an underpinning to get a good level of flexibility/reusability/collaboration on all the concepts that go into your graph.
------
> Why isn’t semantic web more popular inside companies?
As part of building Field 33 we obviously also asked ourselves that question.
My rough hypothesis would be that ~10 years ago semantic tech didn't provide tangible enough benefits, and has since been left in the dust by non-semantic tech.
That created a tech chasm that widened and widened: the non-semantic side became a lot more accessible with quasi-standards (REST) and new methods of querying data for frontend usage (GraphQL), while the status quo of the semantic web space is still SPARQL (a query language full of footguns). The same goes for triple stores (the prevalent databases in the space), which roughly go through the same advancements as RDBMSes, just at a much slower pace.
It also doesn't help that most work being done in the space comes from academia rather than companies that utilize it in production scenarios.
There is quite a nice curated list of problems/papercuts about the semantic web/RDF space[2].
Overall, despite the current status quo, I'm quite optimistic that the space can have a revival.
I think a big reason for the lack of popularity is unfamiliarity with graphs in general. Most people wouldn't know of a good graph editor, I would guess. Everyone knows documents and spreadsheets/tables/lists/folders, but it's rare to see a graph anywhere, online or offline. If I have to think of a real-life example of a graph, what comes to mind is a corkboard with strings running between pinned-up photos to track a criminal in a TV show, or a subway network map. In VSCode/IntelliJ there is not one graph I look at frequently, even though pretty much all code is a data-flow and dependency graph.
I think part of this is that graphs always appear like a complicated mess, and we prefer hierarchies and categories.
I would really like a tool like Airtable for graphs. You start with spreadsheets with columns relating to other columns, and then you view the graph next to it as you go. I don't know of a popular tool that does this. It's funny because behind the scenes of a spreadsheet there is always a big dependency graph that updates cells as changes come through.
All the specs feel overly-complex too. Like a relic from the XML/SOAP days. For such a simple base concept (subject-predicate-object / entity-attribute-value) it feels like overkill. It's interesting though thinking about how JSON won, while being extremely inferior to XML. Although I think this ability to move fast has left us with a ton of untyped data lying around, and plenty of ad-hoc data transformations.
I'm interested to read into the EasierRDF doc you sent - looks very interesting.
> You start with spreadsheets with columns relating to other columns, and then you view the graph next to it as you go.
We are actually starting to crystalize something like that in our app. It's currently more read- than write-oriented but I think we are getting there. :)
> It's funny because behind-the-scenes of spreadsheets there is always big dependency graph that updates cells as changes come through.
Yup! Actually, one dream I have for our platform is that we'll build an Excel importer that will import a fully fledged spreadsheet representation, including formulas that continue working (VBA macros excluded). Our platform already supports the core pieces required for this to work; there are just a ton of nitty-gritty details to work out about how this would nicely integrate into the product in a way that isn't too cumbersome for our end users.
> All the specs feel overly-complex too. Like a relic from the XML/SOAP days.
Oh, I could rant about that for days... RDF itself is already a bit weird in that respect: a literal can carry either a datatype or a language tag, but not both at the same time, so you can't express e.g. "German Markdown" (in Turtle, "Hallo"@de or "Hallo"^^ex:markdown with some hypothetical markdown datatype, but never both). I don't think any comparable standard today would bake in localization at such a fundamental layer.
> Why isn’t semantic web more popular inside companies?
Because SemWeb doesn't solve any real problem while creating a ton of new ones, and it uses a complex data model. Any company that would benefit from using graphs would be better served by a simpler and more general graph model and database.
SemWeb is nothing more than stringly typed data with a URI fetish.
>Why isn’t semantic web more popular inside companies?
Because it offers no protection against some team inside the company breaking the whole web by moving to a different URI or refactoring their domain model in incompatible ways. A department pays for some subgraph tailored to their needs and they are not interested in financing this for the whole org.
Industrial companies use master data management systems, they are centralized and are considered the single source of truth, everyone else builds on them.
Because it's not what you think and doesn't solve the problems most companies have.
Firstly, JSON-LD is only a format for serializing semantic metadata, nothing more, so it only specifies how you can attach that metadata, but not what exactly can be attached, in what way, and with what subtle meanings.
Secondly, it's one of those very generic tools that "can solve everything" but in practice often only makes things more complicated, unless you have enough very sophisticated (and mature!) tooling around it.
And that's where the problem lies: the availability and maturity of this tooling is limited, and awareness of (or experience with) which tooling is good and mature and which isn't is often missing too.
So it's easy to end up adding a lot of complexity for very little gain, which is why most avoid it. Though some companies have had success with it, they are often the size of SAP or similar.
> This expandable graph seems like the purest representation of all data in a company.
Yes, but that requires you to have all the data in the right format, correctly annotated, correctly maintained as it changes, and available. Often none of this is true.
> Imagine hovering over a UI element and seeing who implemented it and when
This would need proper integration of JSON-LD into the version control system and development flow, the project manager, and probably more. There is a good chance that whatever tooling you use has no integration for any of these points, not even considering that you then still need to bind the data together (query it) in a usable, performant way, which might mean having a non-graph DB for caching common queries etc. Each of these points is likely a non-trivial sub-project, one which most companies wouldn't want to afford for a minor benefit like having such a tooltip in a UI builder.
Now if everyone always supported the semantic web, agreed on common annotations for all kinds of metadata (far beyond the scope of the JSON-LD spec), added accessible APIs based on this to their products, etc., _then yes, it would be great_.
Yeah, I think it's way too complex to shoehorn in later for minimal benefit.
But new databases are being built at new companies every day. A lot of new companies I see build out their first MVPs, CRMs, etc. ad hoc in Airtable. Then some mockups in Figma. Then they bring in the devs to build an RDBMS.
Now if all these low-code tools worked on manipulating a single graph instead of building a bunch of disparate relational databases...that would be cool. And then you just need a good graph database to build web apps with.
I think a bunch of tools need to be re-invented with this in mind.
> would agree on common annotations for all kind of metadata
Thinking about a project/task manager, for example: they all pretty much have similar schemas at this point. There is also a huge industry in connecting tools together. Zapier/IFTTT/Unito/etc. Everything is ad hoc or proprietary, though. Standardization is slow and boring.
I think the best thing would be if someone made a schema for this that gained wide adoption, and then the transformers from these existing applications fed into this graph. Basically using a graph db instead of relational or key-value.
Same reason wikis aren't: it takes too much effort to maintain and keep it semantic instead of a copy-pasted pile of text.
You'd need the CMS, CRM, knowledge base, documentation, source control, chat/forum, inventory, point of sale, customer support channels, issue ticketing system, and every other thing interconnected with each other.
do you know of anyone offering this as a complex turn-key integrated solution at a reasonable price?
In every category you mentioned there are new products released and widely adopted all the time. Look at the rise of ClickUp for example in such a crowded space as project management.
I don't think it's too far-fetched to one day see a new entrant offering, as a key selling point, that their API is just one big interoperable graph you can easily plug your company into... and that's not just GraphQL.
I know of several ERP vendors offering such products at a price that is reasonable for such complexity. "Reasonable" is not the same as "cheap" though.
JSON-LD is the reason ActivityPub hasn't taken off even more. It's just such a pain to deal with values that could be URLs or entire object trees. Especially for strongly typed languages, it's a nightmare.
It isn't hard: I've built several Serde impls for Rust that handle JSON:API and JSON-LD setups (for ActivityPub, among others). It's just an Enum and a type with some logic to parse one or the other. An if/else isn't hard.
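For illustration, a minimal sketch of that enum approach using serde's untagged representation (ObjectOrLink and ApObject are made-up names, not from any real AP library; assumes serde with the derive feature plus serde_json):

    use serde::Deserialize;

    // A JSON-LD value that may be a bare IRI string or a full object tree.
    #[derive(Debug, Deserialize)]
    #[serde(untagged)]
    enum ObjectOrLink {
        Link(String),
        Object(ApObject),
    }

    #[derive(Debug, Deserialize)]
    struct ApObject {
        #[serde(rename = "type")]
        kind: String,
        id: Option<String>,
    }

    fn main() {
        let a: ObjectOrLink =
            serde_json::from_str(r#""https://example.org/notes/1""#).unwrap();
        let b: ObjectOrLink =
            serde_json::from_str(r#"{"type":"Note","id":"https://example.org/notes/1"}"#).unwrap();
        println!("{:?}\n{:?}", a, b);
    }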
And it certainly is not the reason AP hasn't taken off. For one, because AP has taken off, and secondly because of all the reasons it would potentially not have taken off, the underlying dialect of JSON is the least of all concerns.
You're right about serializing/deserializing the data not being difficult, but the problem lies more in how you handle that data down the pipeline after you've deserialized it.
Maybe Rust has a better type system that can handle the union of MaybeSliceOfStrings|MaybeSliceOfObjects|MaybeString|MaybeObject|MaybeNull[1] types better than Go does, but I can tell you - after about 4-5 years of trying to implement a clean Go API for processing ActivityPub payloads - that the complexity is quite high, at least if you want to build something that can work with more than one client/service.
[1] I believe that a good implementation should handle even mixed cases, where you have a slice of both objects and strings.
It sounds like the issue is not Json-* formats but Go's type system.
And I'm not familiar enough with large go setups, but generally -in any language- anything that must handle data "down the pipeline" should ideally not be bothered about whether this link (or meta, or paginator, or whatever) is there, is of Type-T and so on: that's for middleware/controllers etc to ensure, and to transform. In other words: your business-logic should ideally never have to bother about a link being either a string or a struct or a null etc: it should be able to assume it's a MyLink.
I'm not sure I understand correctly what you mean by assuming everything to be a "MyLink", but that's not always possible. It depends, first, on how deep down the rabbit hole you want to go with dereferencing URLs - what logic will you employ to decide at what level of dereferencing to full objects you stop - and, more importantly, on being authorized to access the objects pointed at by said URLs. Sometimes ActivityPub activities come with properties that are IRIs which, when dereferenced, give you a 404 or 401. What does your middleware do then: retry, abort, fail?
Also, I'm not sure I agree with your thesis that business logic needs to worry only about a single type of object, because the ActivityPub specification recognizes all of the types I mentioned in my first reply as being "valid", and if you consider their superset to be the "canonical type", you'll be wasting quite a lot of storage (memory- and disk-wise) when all you have access to is the plain URL.
That's one of my root annoyances with Go's type system: it lacks sum types. Rust has sum types in the form of enums (tagged unions). People don't appreciate how useful sum types are until they need them, and then they discover how much easier they make things.
Couldn’t you just have a RDF database do the heavy lifting and use go for the application logic?
I did some deep diving into RDF stuff not too long ago and all the FLOSS databases I found didn’t do json-ld but it also didn’t look all that hard to add. Just adapt the existing xml code methinks.
Erm, I don't want an rdf database in my application, thank you. :D
The main purpose for which I chose Go for the development of this project was simplicity of the build/deploy pipelines. My targets are small communities and single enthusiasts which shouldn't need a dedicated SRE to deploy a service in the Fediverse.
JSON-LD is an open standard for expressing RDF data as JSON. RDF is the most fundamental part of the W3C's Semantic Web and Linked Data projects, which began at the end of the 1990s to make the Web more machine-readable and continue to make steady progress.
If you're not familiar with RDF, I would suggest starting by reading the RDF 1.1 Primer[1], using the RDF 1.1 Concepts and Abstract Syntax as a reference[2] if something is confusing. I don't think you'll regret spending the time; RDF is a fascinating field!
> I don't think you'll regret spending the time; RDF is a fascinating field!
I love RDF. I'm using it for a project of my own, which has nothing to do with semantic web. I'm not using any of the standard ontologies, or OWL, or anything like that. I just wanted an extensible, schemaless data model, because I have no idea how folks will want to correlate data with each other. Being able to hang anything off of anything is important.
I'm also not processing trillions of triples for my work (if you ever look at triple stores, they like to tout how fast they can import vast troves of data).
Mind, I'm no doubt reinventing some stuff that I could probably be using a standard vocabulary for, but since global interoperability is not a goal, I'm not that concerned about it. I also have some concept of structure that I'm capturing and encoding, which I perhaps could be using something like OWL for, but this is ad hoc, doing things as I go, and while there may well be gems within those spaces I can leverage, I'd rather make progress on my own path at the moment than fall into those large maws of time trying to suss them out to see if I could adapt them to my project.
SPARQL is an odd duck I'm still wrapping my head around (30 years of SQL warps one's point of view), but at least it exists, and I can use it.
I want to live in an alternative timeline where RDF was never adopted by Wikidata and it instead created something that solved its specific problems in a human-friendly manner. People always point to Wikidata as a successful semantic web project but fail to imagine how much more awesome it could have been. First off, Wikidata has little use for ontologies outside of its own domain because all the types are modeled as dynamic second-order concepts. Meaning, people organise knowledge using web pages, and those web pages are used to structure other knowledge.
A simple SPARQL query for Wikidata would look something like this:

    SELECT * WHERE {
      ?item wdt:P31 wd:Q146 .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
This incomprehensible nonsense is the result of two choices:
1) It had to be language independent because you know the world is bigger than just the anglosphere.
2) It had to be "machine readable", because then the computer can work together in harmony.
Putting the cart before the horse, or in this case the machine before the man.
Wikidata was consciously designed without consideration for the semantic web/rdf. I remember being dismayed by this, but they added some facilities later. It is designed around a purpose built data model. https://www.mediawiki.org/wiki/Wikibase/DataModel
It's used to provide structured data to Google, e.g. to describe the fake reviews you host on your site, which Google parses and rewards by showing a star rating next to your result. It won't die as long as Google continues to use it.
Because of Google, json-ld is very common on recipe web sites on the web. (Recipes for cooking.) It's typically found in a script tag and usually contains the full recipe, even if the page itself is paywalled.
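The embedded data typically looks something like this (a trimmed, made-up sketch of schema.org Recipe markup, not taken from any particular site):

    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "Recipe",
      "name": "Simple Pancakes",
      "recipeIngredient": ["2 cups flour", "2 eggs", "1.5 cups milk"],
      "recipeInstructions": [{ "@type": "HowToStep", "text": "Mix and fry." }]
    }
    </script>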
This can neatly solve the problem of extracting the recipe to save, and the complaints about having to dig through a page of text to read a recipe. I wrote a little browser plugin that pops up the recipe in a window and offers to save it to Obsidian (using a custom url and hammerspoon).
I don't know if the schema.org stuff is in wide use outside of that. It would perhaps be useful to enable a browser to pull things like appointments and contacts out of a web page, but I suspect the world will just go the way of data detection instead.
I was just thinking this looks similar to a DDI (Data Documentation Initiative) citation element, but the advantage of the DDI XML schema is that one can annotate any existing element in an XML document by importing the DDI namespace, which shouldn't change how the document is otherwise interpreted by applications. This JSON schema, as far as I can tell, is a stand-alone document?
JSON-LD is definitely designed, at least partially, to do the "annotate a preexisting document" thing. You can add the "context" and have it specify the RDF semantics that should apply to each preexisting key.
This is a 'context map', which is what links the short name "spouse" used as the JSON key to a particular term defined in an RDF (Resource Description Framework) ontology. Here, that's <https://schema.org/spouse>. The term definitions in the context are what make it unambiguous to the parser that the value is an IRI/URL rather than a literal string.
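For example, a context-mapped document might look roughly like this (illustrative, made-up identifiers):

    {
      "@context": {
        "spouse": { "@id": "https://schema.org/spouse", "@type": "@id" }
      },
      "@id": "https://example.org/people/john",
      "spouse": "https://example.org/people/jane"
    }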
Alternatively, one can write JSON-LD without a context map by using the full IRI of the term as the key:
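Roughly like this (same made-up identifiers, in expanded form):

    {
      "@id": "https://example.org/people/john",
      "https://schema.org/spouse": { "@id": "https://example.org/people/jane" }
    }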
Bit of a tangent, but it'll be interesting to see how standards designed to make the web more machine-readable are viewed/treated as more and more content on the web is ultimately consumed and provided to users higher up in the funnel via LLMs
Anyone know if JSON-LD is being used more in very recent years for AI? I saw a couple of comments elsewhere that many large companies are using it now. If so, how is it (and knowledge graphs / linked data in general) used in these situations?
Haven't come across this in the wild. My impression is that most of the semantic web people that were obsessing about ontologies twenty years ago have moved on.
As for AI, we now have deep learning approaches that don't require an investment in a lot of machine-readable, disambiguated data but are as good as or better than we are at making sense of unstructured data. That probably explains the low interest in semantic web type stuff at this point.
The ones that have moved on are being replaced by newcomers though - as evidenced by the multiplicity of my comments on this post, I'm one of them ;)
I would agree with your appraisal of AI - at the moment, there is a much greater return-on-investment for processing vast quantities of unstructured data than there is for meticulously curating knowledge graphs. However, I think that, in time, the balance will shift to vindicate the semantic web as people desire more trustworthiness from automated systems.
Why? What's changed to switch the balance from machine learned back in favor of painstakingly manually curated? Sounds a bit like wishful thinking to me. I don't think that cat will jump back into the bag and zip the bag up behind itself. Machines are only going to continue to outpace humans when it comes to pattern recognition, classifying things, or making sense of unstructured data. I don't see that turning around.
Maybe there will be a market for "artisanal ontologies". But I wouldn't get my hopes up for that one.
I was pondering the other day if someone could convince one of the proprietary LLMs to spit out something like json-ld of its internal knowledge base to train other AIs.
It's just an example, but it does highlight another problem. If your model has a single spouse, it's going to be wrong. It needs a list, with time periods. But then, why stop at spouses? Surely you want to include relationships that were not sanctified. But why stop at single partnership? Some people have multiple partners. And people have more than romantic/partner-type relationships: parents, step-parents, children, friends, bosses, colleagues.
People are going to extend the model, and not update the records, because that's a hassle. And there will be multiple versions of a record. What you'd really want is an entity that takes care of all of this, in one place. And not machine readable, because who is going to program a machine to find out the marriages of John Lennon? Something like an encyclopedia, perhaps.
It's actually a problem with the semantic web generally: it tends to assume that the world is compartmentalizable into neat categories. Instead, we live in a world where
(a) membership in categories is partial
(b) membership in categories is probabilistic
(c) membership in categories is contested (see Pluto for an innocuous example)
(d) definition, legitimacy, etc. of categories is contested
(e) category boundaries are vague, e.g. Sorites paradox:
Base step: A one day old human being is a child.
Induction step: If an n day old human being is a child, then that human being is also a child when it is n+1 days old.
Conclusion: Therefore, a 36,500 day old human being is a child.
(f) membership in categories changes with time
etc...
There are hot things, and there are cold things, and then there are things that are neither hot nor cold.
I used the JSON-LD spec because it is preferred by Google, but it requires duplicating a lot of the actual page content. It's almost as if it is designed for sites that use heavy JavaScript frameworks and not for plain HTML/CSS sites.
Usually the dynamic data would be generated by the server. E.g. if you have an eCommerce site, it might update prices, stock remaining, applicable regions, etc. That would then be pulled into Google Shopping or whatever other service might consume it.
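For example, the kind of schema.org Product/Offer block a server might emit (a made-up sketch):

    {
      "@context": "https://schema.org",
      "@type": "Product",
      "name": "Example Widget",
      "offers": {
        "@type": "Offer",
        "price": "19.99",
        "priceCurrency": "USD",
        "availability": "https://schema.org/InStock"
      }
    }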
https://en.wikipedia.org/wiki/Semantic_Web#Example