Tagsistant: semantic filesystem for Linux (tagsistant.net)
119 points by CarolineW on June 12, 2017 | 66 comments



Hierarchical file systems lack expressiveness and are awkward in places. In my day-to-day computing this has become more apparent and problematic with each passing year. For example, I dislike that you are forced to give unique names for each file, that you can classify each file in only one way, and that you can't tell how many copies you own of a piece of data.

The smallest step up from a hierarchical file system is to allow any file to have any number of tags, where each tag is just a simple string (no hidden IDs or anything). I believe most tag proposals implement this concept and not much further. I briefly looked at Tagsistant and at numerous other software projects and papers.

I sketched some of my own ideas about identifying immutable files by hash and creating arbitrary tags that reference such files. It turns out that this way of organizing files goes really deep, and I haven't explored all the implications yet. It yields a completely different landscape than the file system that we are used to today - the concepts of path, mutability, attributes, etc. are replaced with different mechanisms.
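
To make the hash-plus-tags idea concrete, here is a minimal in-memory sketch. It is only an illustration under my own assumptions: the `TagStore` class and its method names are hypothetical and are not taken from the article or from Tagsistant, and SHA-256 is just one possible choice of hash.

```python
import hashlib
from collections import defaultdict


class TagStore:
    """Toy in-memory store: immutable blobs keyed by SHA-256, tags pointing at hashes."""

    def __init__(self):
        self.blobs = {}                  # hex digest -> bytes
        self.tags = defaultdict(set)     # tag -> set of hex digests

    def put(self, data: bytes, *tags: str) -> str:
        """Store data (if not already present) and attach the given tags to it."""
        digest = hashlib.sha256(data).hexdigest()
        self.blobs[digest] = data        # identical content always maps to the same key
        for tag in tags:
            self.tags[tag].add(digest)
        return digest

    def find(self, *tags: str) -> set:
        """Return the digests that carry every one of the given tags."""
        if not tags:
            return set(self.blobs)
        sets = [self.tags.get(t, set()) for t in tags]
        return set.intersection(*sets)
```

Storing the same bytes twice yields the same digest, so the question of how many copies you own answers itself, and a file name becomes just one more tag rather than an identity.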

The article is long, but I would appreciate hearing if the concepts resonate with anybody else: https://www.nayuki.io/page/designing-better-file-organizatio...


I remember for a while Google Drive worked on a tags metaphor and it was just too much friction for users. The UI for the traditional file tree is just plain cleaner. Trees provide clear delineations of ownership and categorization. They suffer from the limits of hierarchies, but you can break them down and get a nice tree-view of them, for example.

Having spent time categorizing my photos, adding a tag layer on top of the existing filesystem is a great idea, but using it as a replacement for the traditional directory structure isn't.

Simple operations like "delete everything in here" become complicated under a tag structure, now that we no longer have a strict concept of A being within B.


FWIW you can still work with Drive like this, though it's somewhat hidden functionality. "Shift+Z" after selecting an item will bring up the otherwise hidden "Add to" dialogue, allowing you to add an object to multiple parent folders/tags.


I think it's way too early to say whether these kinds of models are good replacements for traditional file systems, since we are so used to working with hierarchies everywhere and the tooling is very young, if it exists at all.

I think simple operations remain quite simple, but we need to alter them a bit. For example, "delete everything in here" may not be a very clear thing, but "delete these files globally" and "unlink these tags" are still simple concepts. Which tags are offered for unlinking from which files is just a UX decision with multiple reasonable answers. For example, if we browse files as a tag stack, we may remove the first one or ask how deep into the stack we want to clear tags.


I wasn't aware of the Google Drive thing; that's a shame it was found to be confusing to users. I'm glad Gmail still has tagging, because I do label my messages in multiple ways, and would find it agonizing to use a hierarchy.

Regarding photos, I fully intend to have a tag-only system to organize my own collection. I see file names and folders as counterproductive.

I agree with you that tagging becomes more complicated, and the semantics of "delete everything in here" is different. Gmail provides some insight into this - you have to distinguish between untagging a bunch of files versus actually deleting the messages (and any tags they carry). This isn't a dealbreaker in my opinion; it is a consequence of being more expressive.


As long as the tags themselves can be hierarchical, can't you recover all the benefits of a tree-like file structure?


This. I believe a hierarchy of tags is a great solution.


But then don't you have the same problem of the limits of taxonomies, and you start thinking "well is this tag a member of A or B? It should be a member of both" and then you need to tag the tags. Turtles all the way down.


That's the advantage of tags vs true directories. A tag could have multiple parent tags.

Of course, that raises the question of what happens when you encounter a cycle in the tag graph.
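
One possible answer is to refuse any new parent link that would close a loop. A small sketch, assuming the tag graph is kept as a plain dict from each tag to the set of its parent tags (the function name and the data layout are my own, purely for illustration):

```python
def would_create_cycle(parents: dict, child: str, new_parent: str) -> bool:
    """True if making new_parent a parent of child would close a loop.

    parents maps each tag to the set of its parent tags.
    """
    stack, seen = [new_parent], set()
    while stack:
        tag = stack.pop()
        if tag == child:
            return True              # child is already an ancestor of new_parent
        if tag in seen:
            continue
        seen.add(tag)
        stack.extend(parents.get(tag, ()))
    return False
```

With that check in front of every "add parent" operation, the tag graph stays a DAG and traversals are guaranteed to terminate.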


You can have preferential ordering of tags, which, basically, gives you hierarchy.

Or, put another way, the hierarchical file name can be seen as an ordered n-tuple of tags.

Both approaches give you new ways to view, find and manipulate FS content. What about viewing all files which are sources (belong to "src" directory in any part of a path)? Something like that.
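
A rough sketch of that "src" query, treating each directory component of a path as a tag. The helper names are made up for the example, and the paths are plain strings rather than a real directory walk.

```python
from pathlib import PurePosixPath


def path_tags(path: str) -> tuple:
    """Directory components of a path, in order, treated as tags."""
    parts = PurePosixPath(path).parent.parts
    return tuple(part for part in parts if part != "/")


def files_tagged(paths, tag: str) -> list:
    """Paths that have the given tag as any directory component."""
    return [p for p in paths if tag in path_tags(p)]
```

For instance, `files_tagged(["/home/a/proj/src/main.c", "/home/a/docs/src.txt"], "src")` keeps only the first path, because in the second one "src" is part of the file name rather than a directory.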


I too have been thinking about alternatives to hierarchical filesystems. Originally I thought that I wanted a tag-based system, but currently I believe that tagging adds too much overhead and that it might not scale well for the user either: the more tags you have, the harder they are to remember, and then there is the issue of relationships between tags.

On macOS they have a view that shows you all your files in the order that they were last modified. That has value I believe. Combine that with optional categorization for filtering, so tags but with the tags not being at the very center, and store files by their hash but keep filename for display. Also keeping track of origin of files - user authored vs made by others. Documents with hyperlinks are used for organizing files of current concern. Snapshots of all files are kept so you can delete, replace or update "indexes" over time but still being able to retrieve them later. Append-only except for the pointers that show what indexes are active.

Hope any of that made sense. It's a bit late and besides I didn't want to write "a whole book" in this comment.


I can't understand why msbob-like interfaces failed. The idea is simple and natural: you don't have hierarchical folders with names, you have visually recognizable places. Even very non-technical users can understand this idiom when you just put papers somewhere to get back to it later. People are very good at locations, very bad at names. One of the best remembering techniques is based on locations.

That's actually how I organize my physical desktop: there is a stack of blanks ("new text file..." menu item), operative drawings in the center, non-operative at the far right, and a lot of drawers in my table, each with its own geometry, cabinet, shelves, working/entertainment rooms or areas. If I start to collect something new, like DVD discs or project documents, I simply "create" a new shelf for archiving or empty an old space by moving its contents to a less useful location.

Given that it is electronic representation, you may even put document into additional location, like dragging it via alt key, so semi-transparent version of it is created. You can take documents to your hands, go to places, drop some here and some there, instead of that cut-paste and "clipboard" idiom. Why would one really move folders via cutting them to the clipboard? Did you ever see what it looks like?

Even if hierarchical folder/file systems could be good, today these are shit. You can edit a file, but you can't edit a folder. Attach a photo or text preview to a file/folder via paperclip? Make it bigger, so that it stands out among other icons? Create an in-folder heading like GitHub's README.md? Colorize them quickly? Nope. Even when possible it is a real pain. All files and folders look the same (or random) and if it is not a jpeg, you're on your own to search for it by low-perceptible metadata like ctime or name/type.


The article mentions a few technologies like IPFS, so I'm just mentioning Camlistore for completeness because it's in this space too.

https://camlistore.org/

I'm reading "The Science of Managing Our Digital Stuff (MIT Press)". They seem to prefer hierarchies (I'm not far through the book yet).

https://www.amazon.com/Science-Managing-Digital-Stuff-Press/...


> They seem to prefer hierarchies...

We might not be able to get away from hierarchies, if experience with "memory palaces" [1] gives an indication of how most of us remember (would be interesting if we could identify the memory palace equivalents for those who remember using auditory, visual, and tactile forms [2]). Most memory palaces' dendritic structure bears a striking resemblance to a hierarchical system.

I still find an overall hierarchical structure, combined with indexing and tagging, as the most flexible system with today's technology. I'm looking for ways to implement automated tagging using auto-summarization, voice commands, and automatic environment-contextual cue gathering, and expanding the indexing power with automated ontology extraction. Primitive example: I pick up an incoming call from a client, the system automatically transcribes the conversation, identifies the client, files the recording and transcription to a project folder, analyzes the content of the discussion, and auto-links relevant emails, chats and documents with bi-directional hyperlinks based upon concepts vocalized and conceptual relationship maps extracted based upon a crude initial morphological analysis of the conversation. The accuracy doesn't need to be astounding for this to have use to me; just a crude approximation is sufficient for me to start with.

[1] https://en.wikipedia.org/wiki/Method_of_loci

[2] https://en.wikipedia.org/wiki/Storage_(memory)#Short-term_me...


I second the suggestion to give a look at camlistore. It's designed to be a content storage for everything with automatic feature extraction; the first use case implemented by the authors is to extract all the metadata from photos as soon as they are uploaded, so you can search for, say, panoramas in Lisbon in the last 3 months. Other document types are already implemented and you can have your own depending on your use case. Coming from a Googler, it's no surprise that the primary interface is "tag stuff, and use search".


I took your suggestion and explored these two new references. They were quite relevant, thanks.

Thoughts on Camlistore: I agree with their high-level https://camlistore.org/doc/principles and https://camlistore.org/doc/uses . Their presentation slides and videos gave a helpful explanation and demonstration of their functionality. The query string format and showing of live search results were very cool. I have doubts about the rich JSON metadata format, their model of mutable files, and whether I can represent and query my kind of metadata in their system.

Thoughts on "Science of Managing Our Digital Stuff": I sat down at the public library and read part of the book. Your early warning about hierarchies proved correct. The authors seem to be very focused on conducting user studies and measuring people's time and recall performance. All of their text points toward the superiority of hierarchical organization due to the efficiency of human spatial navigation / folder traversal. The book has some interesting perspectives to offer (e.g. human behavior, group information management, motivations behind keeping data), but I don't expect it to contribute to any of my technical design decisions.


> Hierarchical file systems lack expressiveness and are awkward in places.

Hierarchical file systems made the most sense back in the days of the spatial desktop metaphor [0] of the pre-OS X classic Mac OS Finder. The ability to organize your files in a spatial manner and have the system preserve the one-to-one relationship between a file and its (virtual) physical location within the system was what made it work.

As soon as the browser metaphor (or navigational) file manager [1] took over (owing much of its success to the web browser), this relationship was lost and the system became unwieldy.

[0] https://arstechnica.com/apple/2003/04/finder/3/

[1] https://en.wikipedia.org/wiki/File_manager#Navigational_file...


The problem with any other model is there are secondary superpowers required to make it work: forget what deleting a file means, what does editing it mean when it can appear in multiple places?

Tagging files IMO depends on having a robust deduplication system under the hood.


Do these questions require superpowers? I think not, but they do need a somewhat different mental model. For example, the thought that a "file is somewhere" is already a mismatch for the model, since the point of hash-based file systems is usually to make location irrelevant.

Traditional deletion should be split to real delete and unlink (see: https://news.ycombinator.com/item?id=14541776).

When editing a mutable file, you just mutate it everywhere, since it's a single file after all. Editing immutable content is the same as creating a new file.

I think the system in the article doesn't suffer from a duplication problem (due to tagging; due to immutability, maybe). There is no problem showing a file on multiple paths or as the "result of multiple queries" without data deduplication, unless we somehow try to brute-force a tag FS on top of a traditional HFS. If I understood your concern correctly...
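
To illustrate the delete/unlink split, here is a tiny sketch over two plain dicts, one from content hash to bytes and one from tag to the set of hashes carrying it. Both the layout and the function names are assumptions made for the example, not how the article or Tagsistant stores things.

```python
def unlink_tag(tags: dict, tag: str, digest: str) -> None:
    """Remove one tag from one file; the file itself is untouched."""
    tags.get(tag, set()).discard(digest)


def delete_globally(blobs: dict, tags: dict, digest: str) -> None:
    """Remove the content and every tag reference to it."""
    blobs.pop(digest, None)
    for members in tags.values():
        members.discard(digest)
```

The two operations never need to guess what "here" means: one takes a (tag, file) pair, the other takes only a file.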


Just in case you are not aware, check out RDF/RDFS/OWL. The use cases are quite different, but model-wise there are similarities (no immutability or hash-based addressing, though).

For example, how do you manage the details of relationships between tags? How do you handle the situation where you have accidentally created a duplicate tag?

Spoiler: There is not-so-small complexity creep there.

I hope sane, conscious trade-offs can be made.


Well, since you ask, here's Hans Reiser's old stuff:

https://reiser4.wiki.kernel.org/index.php/Future_Vision

https://reiser4.wiki.kernel.org/index.php/V4

(and http://lwn.net/2001/1108/a/reiser4-transaction.php3 )

And here are some emails etc. I wrote in response:

https://web.archive.org/web/20040728044342/http://www.st-and...

https://www.mail-archive.com/reiserfs-list@namesys.com/msg09...

https://www.mail-archive.com/reiserfs-list@namesys.com/msg20...

https://www.mail-archive.com/reiserfs-list@namesys.com/msg20...

https://www.mail-archive.com/reiserfs-list@namesys.com/msg20...

https://www.mail-archive.com/reiserfs-list@namesys.com/msg20...

Plus some of the discussion threaded from those posts. (Sorry, my stuff needs rewriting and updating but I'm not in the position to do it at present. If there's anything you would like to ask about please do. https://news.ycombinator.com/item?id=9809041 and https://news.ycombinator.com/item?id=10548477 touch on things that are a bit further down the line, but related, in particular to the handling of "internal metadata" and files with a compound internal structure.)


I forgot to include this email of mine https://marc.info/?l=linux-kernel&m=111624697710426 , probably the most important of mine.


Yes, the first thing that sprang to my mind was Reiser's design documents from 15+ years ago. Pity...


I imagine you are familiar with BFS (https://en.m.wikipedia.org/wiki/Be_File_System) already, but if you aren't I'm sure you will find it interesting.


Thanks for pointing out BeFS. I only learned about it in the past few months, whereas I was thinking about alternatives to hierarchies for a decade. These related links were quite helpful to my initial understanding of BeFS's unique features:

* https://systemswe.love/archive/minneapolis-2017/ivan-richwal...

* https://arstechnica.com/information-technology/2010/06/the-b...

I am a fan of their dynamic queries. Being able to search all over the file system for certain attributes, instead of merely browsing a pre-canned hierarchy, is a powerful feature. I'm not a fan of their extended attributes though; it seems brittle currently because we have reduced files down to the lowest common denominator of being a finite sequence of bytes, with very little metadata on the side (if you're lucky, you might get a file name and MIME type attached).


Awesome writeup! This goes right to the printer :)


Printing it out because it's too long to read on screen? (Maybe I have failed as a writer to be concise)


Great project!

I feel that the problem of archiving files is not well served by the POSIX file system, and deserves attention. Gaps are in data safety and backup capabilities (Dropbox is a huge leap forward here) and document retrieval. Usually external indices (which go out of sync) are used to query file names. Also there is no way to attach semantic metadata to files (apart from date stamps and permissions).

This tool provides an interesting stab at the latter problem. I have always thought that semantics would be layered on top of a POSIX file system. This project flips the logic and implements tagging within the fs.

I wonder how compatible this is with NFS/Dropbox/git. Can I use this to tag files on a Mac via Dropbox Sync? digging...

I have been using some home-grown tools that allow me to put semantics into filenames for a while now. And it has served me quite well. Files look like this:

  2017-04-04 #S4907 #Choir List of names.pdf
  2017-04-05 #EXCITE #S5005 Notes on Data Repositories.pdf
  2017-04-06 #ARAG #S5031 $Amount=14.2EUR Invoice.pdf
And I have command line + CGI scripts that allow me to manage and query folders which contain properly formatted files. I have just begun writing a second version in Python:

https://github.com/HeinrichHartmann/pile

(waaay too early to use yet, but the README elaborates a little). Is anyone aware of similar approaches?

E.g. using JSON documents as file names seems another obvious way to layer semantics on top of POSIX/fs. Is anyone doing this?
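
For illustration, names in the format shown above can be split into date, `#tags`, `$key=value` attributes and a title with a few lines of Python. This is a sketch written for this comment, not code from the pile repository; the function name and the returned dict layout are made up, and it assumes every name ends in a file extension.

```python
import os
import re

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_name(filename: str) -> dict:
    """Split 'DATE #tag ... $key=value ... Title.ext' into its parts."""
    stem, ext = os.path.splitext(filename)
    tokens = stem.split()
    # An optional leading ISO date.
    date = tokens.pop(0) if tokens and DATE_RE.fullmatch(tokens[0]) else None
    tags, attrs, title = [], {}, []
    for tok in tokens:
        if tok.startswith("#") and len(tok) > 1:
            tags.append(tok[1:])
        elif tok.startswith("$") and "=" in tok:
            key, _, value = tok[1:].partition("=")
            attrs[key] = value
        else:
            title.append(tok)        # everything else is part of the title
    return {"date": date, "tags": tags, "attrs": attrs,
            "title": " ".join(title), "ext": ext}
```

Attribute values are kept as raw strings (e.g. `14.2EUR`); turning them into typed amounts would need a convention the format above does not spell out.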

EDIT: Formatting fixes.


FUSE has extended (and extensible) file attributes, which seem of interest.

https://en.m.wikipedia.org/wiki/Extended_file_attributes


I wanted to second this. I've got a similar project [1] for photos, videos and audio files. It encodes much of the metadata into the filename itself [2].

[1] https://github.com/jmathai/elodie

[2] https://medium.com/@jmathai/introducing-elodie-your-personal...


Thanks for sharing your system!

I'd like to give you some HN formatting tips, if I may:

Add two new lines for a new line. Otherwise the formatter will put your text on the same line (eg. with your list of files, it looks like one big filename now)

italics work by putting asterisks (*) around the word or phrase. Though maybe you meant to use underscores in this case :)

Best wishes!


Thx. Fixed now.


Nice! An equivalent for comparison: https://tmsu.org/


This should be higher up, I think TMSU is the best known product in this space, and a comparison between the two would be useful.

For one thing I think TMSU's Achilles' heel is handling renaming of files gracefully, I wonder how Tagsistant does it?



And if you're having trouble accessing the page (like I am), Google's cache can help: http://webcache.googleusercontent.com/search?q=cache:http://...

Usually I prefer to link archive.is, but it didn't manage to capture the page before it went down this time around.


Thanks, website (tagsistant.net) seems reachable but unresponsive


I used this years ago. I had a few scripts that would copy files to the appropriate tags, but it ended up being super slow. Eventually I wrote my own tools around the sqlite database it created (it's a pretty simple schema) but I've since lost most of those and ended up rewriting a bunch in a system I've been working on.

This was all back for the 0.2 release and I think the author changed a bunch of this stuff in later releases. I wish there were more options for tagging file systems in Linux. If I ever get my stuff out of the "thrown together" phase, I'll probably publish them.



I've been thinking increasingly of something along these lines myself, with Tagsistant being among the systems which have come up in my own research.

The problems, variously, are that fixed-name hierarchical-storage filesystems meet the needs of document-based storage, projects, workflows, sharing, and lifecycle exceedingly poorly.

The problem is coming up with a better option.

A filesystem-based approach has the advantage that it's low-level, not tied in to a single application or toolsuite, and may be extensible.

The questions I've turned up include identifying what specific problems this is trying to solve, distinguishing between public and private information, and what levels of standardisation might apply. There are also some very significant questions about privacy and data leakage.

My current thoughts are largely grouped around a documents-based system (provisionally, "/docfs"), and an online or Web-oriented system (provisionally, "/webfs"), both under an umbrella context system, KFC (for KFC's Fine Context). Mostly considering the domain space, workflows, and possible solution-shaped objects. Part of that (largely focused on Web access) discussed here:

https://www.reddit.com/r/dredmorbius/comments/6bgowu/what_if...


Humans have to tag the files, though. This is the same problem which kept the "semantic web" from going anywhere.


That's where I see a possible out.

There are extensive collections of works which are systematically categorised. We call them "libraries". One in particular, the U.S. Library of Congress, has a corpus of 24 million catalogued works, using an open categorisation scheme, which 1) can be used to apply classifications to extant works and 2) can serve as a training set by which unclassified works might be classified.

Distinguishing between expert, automatic, and local classification of content might be useful.

There are other corpora, including scientific and commercially-published article indices (of varying degrees of public accessibility), and more.

It turns out that the process of classifying information has been going on for a while.

Of possibly related interest:

http://historyofinformation.com

(The bits applying to encyclopaedists and library catalog classifications are of particular interest here.)


> Humans have to tag the files, though.

Only once, if done properly. Look at music tagging. A good system ought to have canonical tags for everything.

User-created files could also have a lot of auto-generated tags too. I'm thinking along the lines of email address/URL origin, Exif metadata, source code tags (ctags/etags), keyword extraction from prose text (via machine learning models)...

Beyond all that, though, would be your standard date and timestamps, your document name (which could be non-unique), and a project-based tagging scheme.

Take a look at the Library of Congress's MARC project [0][1]. It's the most ambitious tagging project I'm aware of.

[0] https://www.loc.gov/marc/

[1] https://en.wikipedia.org/wiki/MARC_standards


> Look at music tagging.

That has a high ratio of fans to content. That's when reputation systems work. Outside of popular culture, it doesn't scale.


Could you clarify what you mean by "canonical tags for everything"? Do you mean an online database like freedb, MusicBrainz, and such?


> Do you mean an online database like freedb, MusicBrainz, and such?

Yes, as well as the Library of Congress and doi.org.


Humans already have to pick a path and name for the file in a hierarchical filesystem. I don't see how this is any more difficult, and it might even be easier.


One of the problems might be that humans tend to be not very consistent over longer periods of time. So choosing a set of tags which will stay useful for, say, 10 years is hard. Because of this "tag rot" is likely to happen. After a long time, when you're looking for a certain file you might not remember the tagging scheme you applied at the moment you tagged the file you're looking for.


I often wanted a tag system, but this is not what I was picturing, I think. I'd rather have a database and an entry in the context menu to tag files without moving them. It should also be able to identify the file if it is being moved or renamed.


It must be possible to get the same effect without copying the files by using links. If it is all on the same file system then hardlinks would work and be very space efficient.
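
A minimal sketch of the hardlink variant, assuming a layout of one directory per tag under some tag root (the function name and the layout are hypothetical):

```python
import os


def tag_with_hardlink(src: str, tag_root: str, tag: str) -> str:
    """Expose src under tag_root/tag/ as a hard link; no data is copied."""
    tag_dir = os.path.join(tag_root, tag)
    os.makedirs(tag_dir, exist_ok=True)
    dst = os.path.join(tag_dir, os.path.basename(src))
    if not os.path.exists(dst):
        os.link(src, dst)            # raises OSError across filesystems
    return dst
```

Two caveats with this approach: files with the same base name collide inside a tag directory, and editors that save by writing a new file and renaming it over the old one will silently break the link.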


Extended attributes can do this for most unixy operating systems. See fsetxattr and fgetxattr. It's a bit tricky in that the standard tools, like find, don't support them directly. But, the tags stay with the files, no separate DB needed. You do have to pay attention to copying, backups, etc, to make sure they are preserved, and no separate DB means all "queries" are full scans.
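
As a rough sketch of what that can look like from Python: the standard library exposes `os.setxattr` and `os.getxattr`, but only on Linux, and the underlying filesystem has to support user attributes. The attribute name `user.tags` and the comma-separated encoding are my own assumptions, not an established convention.

```python
import os

TAG_ATTR = "user.tags"   # hypothetical name; Linux expects a namespace prefix such as "user."


def encode_tags(tags) -> bytes:
    """Serialize a collection of tags as sorted, comma-separated UTF-8."""
    return ",".join(sorted(tags)).encode("utf-8")


def decode_tags(raw: bytes) -> set:
    """Inverse of encode_tags; empty bytes means no tags."""
    return set(raw.decode("utf-8").split(",")) if raw else set()


def set_tags(path: str, tags) -> None:
    """Store the tags on the file itself (Linux only)."""
    os.setxattr(path, TAG_ATTR, encode_tags(tags))


def get_tags(path: str) -> set:
    """Read the tags back; a missing attribute is treated as no tags."""
    try:
        return decode_tags(os.getxattr(path, TAG_ATTR))
    except OSError:
        return set()     # attribute missing or xattrs unsupported here
```

Finding every file with a given tag still means walking the tree and calling `get_tags` on each file, which is the full scan mentioned above.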


I've never been able to figure out what people mean by "tags" and what the benefits are, perhaps someone can enlighten me?

I feel like the only problem with standard hierarchical file systems is that sometimes you want files to show up in multiple places. I think this is only a subset of files, typically "media." A photograph or a song often has multiple categorizations. However, a lot of things, like my tax documents, source code, Word documents, and notes typically only have one place they need to go to. It seems like symlinks or hard links are an already-existing solution; other than lack of great options on managing them, how do tags improve this?

As far as I can tell, tagsistant does have hierarchical tags, so it isn't doing away with hierarchy, which is good. One problem with some tag systems (like Gmail's original) is that under "notes" I have "lectures," "stuff I want to remember," "things like list of books I want to read," "temporary." If I'm looking for photos, I really don't want to see a bunch of notes tags.


I think tags only prove beneficial when you have a very well defined purpose. More concretely, I think the number of tags has to be fairly limited and the logic behind the tagging scheme should stay consistent forever.

One example of tags being superior to links is that you can search files with logical queries. For instance, a family member asks you to send them a good portrait of you for whatever reason. You may want to make sure that it is somewhat recent. However, you're willing to take one that is two years old if it looks better. Then you could search for: #portrait & #me & (#2017 | #2016 | #2015)
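
A sketch of how such a query could be evaluated with plain sets, reading the year tags as alternatives (any one of them is enough) and the other tags as requirements. The index layout (tag to set of file names) and the function name are assumptions for the example.

```python
def query(index: dict, all_of=(), any_of=()) -> set:
    """Files that carry every tag in all_of and at least one tag in any_of."""
    result = None
    for tag in all_of:
        members = index.get(tag, set())
        result = set(members) if result is None else result & members
    if any_of:
        union = set().union(*(index.get(t, set()) for t in any_of))
        result = union if result is None else result & union
    return result if result is not None else set()
```

Called as `query(index, all_of=("portrait", "me"), any_of=("2017", "2016", "2015"))`, it intersects the required tags first and then keeps only the files from one of the three years. Links alone can't express this without materializing every combination as a directory.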


I suspect the site is suffering the HN "Hug of Death".

As linked elsewhere[0], on github:

https://github.com/StrumentiResistenti/Tagsistant

[0] https://news.ycombinator.com/item?id=14537805


Great work OP! Had done something similar as my master thesis in 2005 based on one of Hans Reiser's papers. It was more of a quick hack, having the code written in three days, so not much has survived to this day, but you can see some of it in action on an Adobe Flash[1] video at https://adam.kruszewski.name/assets/static/mtfs-demo.htm

[1] don't judge me, mp4 in the browser wasn't main-stream back then ;)

edit: for those without flash like me -- even as a quick hack it had manual and automatic file tagging (based on file metadata) and you could query it using logical expressions. It also had a pretty nasty memory leak I didn't care to find out :) Still, without first-class, built-in support from file managers like BeOS had for its filesystem, the idea is not fully realized I think.


It isn't explicit about this on its main page description. This project depends on FUSE to work.


My mind read this as tagistant - which is a cool and awful name at the same time.


Now we just need a redesign of Unix tools and principles/practice to accommodate these :-S


That's actually a fairly reasonable conclusion.

You'd want to have tools which are aware of a semantic filesystem, or new tools which can make use of them. Probably a mix of both.

Wrappers around extant tools might be a reasonable migration path toward the former.


There was a concept of "db-fs", whereby rather than files in a hierarchical folder structure, you just had blobs/collections of data you could search via the usual search-queries. I suppose similar structure to the average no-sql/JSON db, but as a file-system.

The problem is, it isn't even compatible with the usual interface for a fs driver (read/write, permissions etc). There is no "search" concept in the fs conceptual model. The model assumes, to some extent, a finite-size mount; duplication/multi-reference is only poorly supported/emulated via hard/soft links.

Plus, existing tools would, as you mention, have no support for what would now be built into the fs - functionality such as provided by 'find' would be now built in, such that you would need a shell syntax/dsl to utilize it.

My best guess would be a 'special' query command that would result in a virtual folder popping up on a special virtual fs mount, e.g:

    > mk_qdir *some-search-query*
    /proc/srch/023013/
but if ls'ing the folder kicked off a proc in the background that initiated the query, you would have to be careful no process (a mount counter for 'df', for example) crawled the partition.


Good point on mount counts.

Doesn't that problem already exist to an extent for remote FS mounts (NFS, etc), especially over automount?


Yep, if the external mount is large.

However, it's a little different for an fs system that can quickly (w/o network-speed limits) generate recursive fs structures.

For example, what if I created a vfs that created a 'foo' folder, with a 'foo' folder inside, and so on. The system would crawl an infinite descent of /foo/foo/foo/foo... and so on, which would eventually fill some cache or another.
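
One defensive option on the crawler side is to cap the depth. A small sketch, where `list_children` stands in for whatever lists a directory (a hypothetical callback, so the example can run against a fake infinite tree without touching a real mount):

```python
def bounded_walk(root: str, list_children, max_depth: int = 4):
    """Yield paths depth-first, never descending more than max_depth levels."""
    stack = [(root, 0)]
    while stack:
        path, depth = stack.pop()
        yield path
        if depth < max_depth:
            for name in list_children(path):
                stack.append((path.rstrip("/") + "/" + name, depth + 1))
```

With `list_children = lambda path: ["foo"]` and `max_depth=3`, the walk stops at /foo/foo/foo instead of descending forever. It doesn't fix the underlying problem, it just keeps a naive crawler from filling a cache.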


Good thoughts, and I'm thinking of possible hazards and pitfalls of the approach.

There's already a concept of setting flags to avoid traversal of certain filesystems, and given the proliferation of virtual and networked filesystems, this seems useful: /proc, /sys, /dev, /udev, and a few others (it's getting to the point I don't recognise a full mounts listing readily anymore under Linux).

With the concepts I'm considering, in particular, of the tree as being essentially search traversal rather than a static filesystem (not even one with lots of symlinks all over the place), the potential to create some deep dives or recursive tangles is pretty high.

Another example: tools such as 'locate' or Spotlight shouldn't attempt to traverse this tree when generating indices. Instead, they should query it when requests are made.

Information and metadata leakage is another key consideration when moving data off local host and/or caching among hosts.


Yep, in fact I think a lot of vfs are like this - doesn't /proc/ query the kernel when returning parts of its tree (e.g. representing processes)?

wrt the fs representing queries: one of the tag systems is like this, it represents tags as folders, and the search for term 'X' and 'Y' can be found under either folder ./X/Y/ , or ./Y/X/ ; obviously, every new tag combinatorially expands the virtual space.


Right. My vision is of a /docfs under which you might traverse /au/stephenson/ti/snowcrash or /ti/snowcrash/au/stephenson, as an example. Either way works.

If a search terminates in multiple (or no) results, it's a directory, if in a single result, it's that file. Plus a few twists (virtual / dynamically generated file formats, summaries, synopses, metadata).
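
A toy version of that resolution rule, assuming paths are made of field/value pairs and the catalog is a dict from document name to its metadata (both the layout and the names are placeholders, nothing here is implemented in any real /docfs):

```python
def resolve(catalog: dict, path: str):
    """Resolve a path such as 'au/stephenson/ti/snowcrash' against catalog.

    catalog maps a document name to its {field: value} metadata.
    Returns ("file", name) for exactly one match, else ("dir", sorted names).
    """
    parts = [p for p in path.split("/") if p]
    if len(parts) % 2:
        raise ValueError("expected field/value pairs")
    wanted = dict(zip(parts[0::2], parts[1::2]))
    matches = sorted(
        name for name, meta in catalog.items()
        if all(meta.get(k) == v for k, v in wanted.items())
    )
    return ("file", matches[0]) if len(matches) == 1 else ("dir", matches)
```

Because the pairs are collected into a dict before matching, the order of the path segments doesn't matter, which is the "either way works" property.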


Of possible interest is the new article:

https://news.ycombinator.com/item?id=14550060

and http://www.sqlite.org/src/doc/trunk/src/test_onefile.c for an idea of turning SQLite into an actual vfs.



