Designing better file organization around tags, not hierarchies (2017) (nayuki.io)
43 points by Tomte 6 months ago | 50 comments


I'm going to admit that I didn't read the whole article.

I remember about 10-15 years ago that there were a lot of photo organizers that supported tagging. Tag the events as birthday or the people's names. I saw a lot of people who would then post "My photo organizer has been discontinued and I spent years organizing by tags. How do I migrate all the tags to another photo editor?"

The hierarchical file system is kind of a lowest common denominator. I can copy a directory structure from ext4 to btrfs to zfs to ntfs to exfat to macOS or whatever and have the same organization.


This is exactly why I cannot get on board with tagging. Without a universal standard, I am one deprecated app away from losing my time investment.


> Without a universal standard

The universal standard is Dublin Core (IETF RFC 2413 / ISO 15836) `dc:subject` “Unordered array of Text” stored in an XMP (ISO 16684) sidecar file:

- https://en.wikipedia.org/wiki/Dublin_Core#Dublin_Core_Metada...

- https://en.wikipedia.org/wiki/Extensible_Metadata_Platform

I like to put the DC stuff directly in my filesystem xattrs, but using a sidecar file will get you compatibility with any old filesystem and with any tooling that might destroy or disregard an xattr.
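
For anyone who wants to see what that looks like, here is a minimal sketch of writing and reading `dc:subject` tags in an XMP-style sidecar using only the Python standard library. The function names are mine, the `<?xpacket ...?>` wrapper that many tools emit is left out, and how the sidecar gets named next to the original file varies by tool, so treat all of that as assumption rather than the one true layout.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by XMP, RDF, and Dublin Core.
NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, name):
    """Build a namespace-qualified tag name for ElementTree."""
    return f"{{{NS[prefix]}}}{name}"


def build_sidecar(tags):
    """Return XMP text holding `tags` in dc:subject as an unordered rdf:Bag."""
    xmpmeta = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})
    subject = ET.SubElement(desc, q("dc", "subject"))
    bag = ET.SubElement(subject, q("rdf", "Bag"))
    for tag in tags:
        ET.SubElement(bag, q("rdf", "li")).text = tag
    return ET.tostring(xmpmeta, encoding="unicode")


def read_tags(xmp_text):
    """Return the dc:subject entries found in XMP text, in document order."""
    root = ET.fromstring(xmp_text)
    return [li.text for li in root.findall(".//dc:subject/rdf:Bag/rdf:li", NS)]
```

Writing the result of `build_sidecar(["birthday", "Alice"])` to a file beside the photo is the whole trick; any tool that reads `dc:subject` from sidecars should then be able to pick the tags up without caring which app wrote them.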


This is an amazing bit already, but it would be a 10/10 joke if you could work "year of the Linux desktop" into it somewhere.

If you were serious, then I would point out that even nerd file browsers like Nautilus do not support xattr metadata editing, searching, or even viewing, and Linux metadata indexing systems such as Beagle are all long-dead projects. The whole idea has no traction.


“My software of choice has not prioritized supporting the universal standard” != “There is no universal standard”


Which software does support the universal standard?


GP is right that XMP-supporting FOSS software is heavily biased toward media-management software, especially for photos, and that there is a distinct lack of a good Free Software file manager that supports XMP generically: https://en.wikipedia.org/wiki/Extensible_Metadata_Platform#S...

I currently use a combination of my own filetype-detection-and-tagging software “CHECKING YOU OUT” (insert Dropbox meme here), Windows Explorer with “XMP IFilter Desktop Edition”, and voidtools' “Everything” 1.5:

https://www.ifiltershop.com/downloads/xmpfilter/readme.html#...

https://www.voidtools.com/forum/download/file.php?id=1376


> The hierarchical file system is kind of a lowest common denominator.

Which, in the form of folder and file names, can be your first layer of tags. But familiarity with file systems is going out of the window as well, I've been made aware.


I have some file names that are really long. If I can use 255 characters then why not name the files with lots of words so that I can use "find" or some other file index utility.

I found the article from 2022 about younger people not understanding them.

Gen Z Kids Don't Understand How File Systems Work

https://news.ycombinator.com/item?id=30253526


I have multiple friends who are college engineering professors, and they've been dumbfounded by the lack of basic file system knowledge. I had a hard time believing it, but I've seen it firsthand in high-school-age relatives.


DAM is the word! (Digital asset management.)


(HFS here means “hierarchical file system”, i.e. one with nested folders.)

I tire of the either-or dichotomy. Yes, search is great. But a couple years ago, I adopted Johnny Decimal[0] for storing my archive files and I couldn’t be happier. Our 2023 tax filings are in 22.23. Physics articles are in 34.04. Her CME records are in 43.03. Go into one of those and you’ll find every file I have related to those things.

And they’re still searchable. You can still do tag-based queries, indexed text searches, and all that, but you don’t have to. When I want all my 2025 tax deduction receipts come tax filing time next year, I’ll send the contents of 22.25 to my accountant and be done with it, confident we found 100% of the related files.

I don’t do those for everything. Music files are in their own directories. Movies live with movies. I don’t try to shoehorn Obsidian into it. But all my long-term storage files go straight to their home in JD.

This has the huge benefit that I can instantly locate them when I’m accessing the archive from my phone, my wife can find things without learning special tooling or the vagaries of search, and there’s zero need for any additional apps or software beyond Finder.

I love using great search software. I used Devonthink daily for years. But I also adore not having to.

[0] https://johnnydecimal.com/


I can only assume this is at least a spiritual ode to Dewey Decimal? Looks neat.

I had hoped you meant that not being "either-or" would be that you should be able to get "realized hierarchies" on things. Where you could basically ask the computer to build a hierarchy on the fly.
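
A rough sketch of what I mean, assuming each file carries a few facet values (the `year` and `kind` facets and the file names are made up for illustration): the same flat set of files gets grouped into whatever nesting you ask for at that moment.

```python
from collections import defaultdict


def realize(files, order):
    """Group `files` (name -> {facet: value}) into nested dicts,
    one nesting level per facet listed in `order`."""
    if not order:
        # No facets left: this level is just the sorted file names.
        return sorted(files)
    facet, rest = order[0], order[1:]
    groups = defaultdict(dict)
    for name, facets in files.items():
        # Files missing the facet land in a "(none)" bucket.
        groups[facets.get(facet, "(none)")][name] = facets
    return {value: realize(members, rest) for value, members in sorted(groups.items())}


# Hypothetical files and facets.
files = {
    "goodwill.pdf": {"year": "2023", "kind": "receipt"},
    "visa-june.pdf": {"year": "2023", "kind": "statement"},
    "w2.pdf": {"year": "2024", "kind": "tax"},
}

by_year = realize(files, ["year", "kind"])
by_kind = realize(files, ["kind", "year"])
```

Nothing is moved on disk here; `by_year` and `by_kind` are two views over the same files, which is why no single hierarchy has to win.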


Huh, that’s an interesting idea. “Hey AI, here are 10,000 files. Make a hierarchical categorization of them” sounds like fun.

In my case, I got a lot of value in coming up with that on my own. The result may not be rigorously perfect. In fact, I know it isn’t: some folders have lots of files, and some have few. It maps really well to my brain, though. That makes it easier for me to use, because I’m not memorizing a hierarchy someone else invented and pushed upon me. I’m using the system that sounded about right to me in the first place.


In a real sense, this is how random forest classifiers work? I think the main problem is that the original hierarchy can sometimes get lost. I would love a system that remembered the different organizations I had used and could let me move between them.


Kind of, yeah. And now I wish I had a visualization showing how that map evolved over time.


May I also suggest trying this for the individual files? For instance, a receipt or invoice could be named with its date and value, like `2025-06-04 $42 Description of the File.EXT`.

That way, for a tax year `2024` containing expense receipts, the accountant or us can just look at the file name to confirm. If you still need to look, then opening the file works.
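
A side benefit is that names in the `YYYY-MM-DD $amount Description.EXT` shape are easy for a script to read back. A minimal sketch, assuming exactly that layout (the regex and the `parse_name` helper are my own illustration, not part of any tool):

```python
import re
from datetime import date
from decimal import Decimal

# Assumed layout: "YYYY-MM-DD $amount Description.EXT"
PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) "
    r"\$(?P<amount>\d+(?:\.\d{2})?) "
    r"(?P<description>.+)\.(?P<ext>[^.]+)$"
)


def parse_name(filename):
    """Return the date, amount, description, and extension encoded in a
    file name, or None if the name does not follow the convention."""
    m = PATTERN.match(filename)
    if m is None:
        return None
    return {
        "date": date.fromisoformat(m["date"]),
        "amount": Decimal(m["amount"]),
        "description": m["description"],
        "ext": m["ext"],
    }
```

Summing `parse_name(name)["amount"]` over a tax-year folder then gives a quick sanity check against whatever total the accountant comes up with.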


Absolutely! I have a document scanner that names inbound files with the (ISO, of course) date and title of the doc (as gleaned from OCR), but once a week I process its inbox to fix up name problems and move files to their forever home.

I use Hazel on my Mac to auto-rename bank statements etc like “$closing_date Visa statement.pdf” and move them to the right place automatically as soon as I save them to the ~/Downloads folder.

All my new files look exactly like you’re describing. I give my accountant a set of files like “2023-06-08 Goodwill donation receipt.pdf”

It’s slightly more effort to do this work up front, but pays off in spades when I need to actually retrieve something. Even without any fancy search software, just looking for files named like “2023*receipt*” locates them.


Just before the Pandemic, I consulted for a company. I had to keep the receipts while on work travel. I gave them the zipped folder and when it expanded, they loved the organization and the naming convention enough to adopt it for their small team. :-)


I’m sure! Even if I abandoned Johnny Decimal and went all in on search vs sorting, I’m keeping the file naming convention. It’s clearly superior in practice.


I wouldn't bother; the time for this is past. It is not worth curating tags/ontologies. Let the computer do it for you, or don't: just search. Google was right.

The problem is acute in organizations, where the question of ontology ownership crops up. The last place I worked at had a dedicated engineer on the job. And marvel at the elaborate systems Wikipedia and StackOverflow have to manage their ontologies. It is best to avoid that stuff and improve search instead, which is likely why an ontology was sought in the first place.


One major difference is that tags/folders better enable discovery. Search can get you to “similar to thing I found”, but it can’t really do “related to thing I found” à la Wikipedia.

AI might be intelligent enough to fix that, but in the end it’d just be automating the curation of tags/ontologies (ideally in a manner that’s user-visible and editable).


>Search can get you to “similar to thing I found”, but it can’t really do “related to thing I found” ala Wikipedia

LLM powered search can.

>but in the end it’d just be automating the curation of tags/ontologies

Automating those is the whole point though.


> Let the computer do it for you, or don't: just search.

Only for information that has no real value.

Search misses things. Hierarchy or tag ensures that all you have marked can be found.


In real use, hierarchy or tag ensures that you'll miss stuff, and whatever you didn't mistag, forget to tag, or skip tagging will still be difficult to find among the noise of a huge taxonomy or hierarchy...


Like what, domain-specific information? If so, that's a question of fine tuning the (machine learning) model. Thereafter, it can automatically tag as it indexes.


Let's say I look up the company benefits and search returns a document from 2015. Is that still relevant, or has it been replaced by a newer policy that search isn't showing? How would I know?


The computer can figure it out if the new policy is indexed. Alternatively, users can flag bad search results.


The problem with the "just search" approach is that it creates "dark corners", that is clusters of documents/images ignored by the search engine for whatever reason. I guess it could be addressed with some kind of "what I've been missing" query executed once in a while.


What about the "dark corners" of tags you forgot to add? That corner seems like it would be far larger.


> I wouldn't bother; the time for this is past. It is not worth curating tags/ontologies. Let the computer do it for you, or don't: just search.

The only mechanism that will ever be properly able to contextualize the relevance of my ingested information when it counts (as in while I am alive) is me; the only way to do this is by curation. As someone else who understood this once said: "The tool shapes the master; the master shapes the tool".

Manually assigning tags is an excellent way of doing this, as long as one doesn't overthink it to the point where the curation process meets the harsh realities of diminishing returns. Sadly, many people, constantly jumping from one fad to the next, already fail here.


> the time for this is past

A few nights ago I was looking through photos with my daughter, and she asked to see photos of her. I looked for them using the AI way and found some, but not a whole lot. That's because it missed the vast majority of them, which I then found when looking through files with the same date stamp as some of the AI discovered photos of her.

I understand we're on our way to automatically labeled content, but we are a long way off, and the time for manually organizing important information has not yet passed.


It’s a good observation. Counterpoint - the organization labor is drudge work, maybe with cheap AI labor you can have your KB kept in great shape, and likely this makes it easier for RAG to locate the correct document.

Maybe there is room for both, though I am in agreement that it seems dubious a human should be curating.


I mean that's basically what all these image search indexes do. They look at an image and add a bunch of tags for whatever they see in the photo that can be queried later. It works really well and armed with one of these models and word2vec you can build your own search quite easily.

TensorFlow object detection is one such means. A hosted solution would be Amazon Rekognition.


Labels in Gmail is what makes it better (for me) than other email systems.


This argument's been coming up every now and then since at least 2010, but it never goes anywhere. macOS has had tags in Finder for years, and few people actually use them; I like them and still fail to use them consistently.

I wonder if there's a counter-argument to be made that humans organize their knowledge of the world in ontological hierarchies, so a hierarchical file system is intuitive.


I think the main counter-argument is that people like to organize in static systems? Even if you use tags, you almost certainly want to restrict the specific tags that you allow so that you don't have a data science/cleanup task of normalizing things later.

Tags also fail because people then want to categorize their tags. Which tag is the author's name, versus the editor's? Publish year, versus year I read it? I suppose we could say that people want "slots" not "tags"?


And mayhem ensues if other people want to have a say in the ontology.


Many years ago I tried Tabbles (https://tabbles.net/en/), apparently it still exists, but it just didn't work out so well. The main concern is compatibility. I tag up my stuff meticulously and then the project is killed and I'm left with wasted effort.

I think the idea will go away, similar to how ontologies and Semantic Web, manual knowledge graphs etc gave way to processing unstructured data with LLMs. Instead of tags, we should have source/creation-context-based metadata and embeddings computed with language models. Then you can do natural fuzzy search.


Tags only work when it's a restricted domain. Like your music files, or ebooks, or your pictures library. So the best way is to construct multiple volumes, one for each type of repository. But it's not truly universal because of hardware devices you can't program as we don't have standards for tags.


Actually I don't care for any kind of organization for personal files. I just search.

I have my _really_ important files that are cumbersome to recreate like a <50kb jpg of my signature for those annoying forms, scans of my identity cards, etc, in ~/Documents/important

I have my to-read stuff in ~/Documents/books

All my other files are slammed together in ~/Downloads. I search through the folder using everything.exe (on Windows) or FSearch (on Linux).

That already covers the 90% case, and for the rest recoll maintains a search index that even stores file contents (including zip, pdf, and xml-based stuff like office files). It works well. These days I hear everything.exe itself supports searching within files.

On the other hand, I'm really picky about properly hierarchically organizing code files in any SW project I do, and proper hierarchical organizing of the app folder of any app I make. As for tags, the only reason I see myself using them is for pictures. It can enable a deterministic version of Google photos search. But I'm happy with Google photos search for now.


From what I understand, the author's proposal for a tag-based file system reminds me of how Microsoft Azure Blob Storage works.

Azure blobs support multiple queryable tags per file, which aligns with the idea of organizing by metadata instead of folders.

It’s not as deeply integrated or user-centric as the article suggests, but it offers a practical usage, especially for large-scale, developer-driven storage.


The primary problems with tag-based organization are that 1) tags are difficult to organize hierarchically and 2) discoverability of existing tags tends to be poor.

However, if you have multiple overlapping hierarchical organization, common operations such as "moving" and "deleting" become overly complicated.

IMO, there's no best way, only different pros/cons and compromises.
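
To make the trade-off concrete, here is a toy in-memory tag index (the `TagIndex` class and its method names are hypothetical, a sketch rather than how any real tagging tool works). Even in this tiny version, deleting a file means touching every tag it carried, and tag discoverability has to be built as a separate listing.

```python
from collections import defaultdict


class TagIndex:
    """Toy two-way index between file paths and flat tags."""

    def __init__(self):
        self._files_by_tag = defaultdict(set)
        self._tags_by_file = defaultdict(set)

    def tag(self, path, *tags):
        """Attach each of `tags` to `path`."""
        for t in tags:
            self._files_by_tag[t].add(path)
            self._tags_by_file[path].add(t)

    def find(self, *tags):
        """Return the files that carry every one of `tags`."""
        sets = [self._files_by_tag.get(t, set()) for t in tags]
        return set.intersection(*sets) if sets else set()

    def known_tags(self):
        """List existing tags with usage counts, most used first."""
        counts = ((t, len(paths)) for t, paths in self._files_by_tag.items())
        return sorted(counts, key=lambda pair: (-pair[1], pair[0]))

    def delete(self, path):
        """Remove a file from every tag it carried, dropping emptied tags."""
        for t in self._tags_by_file.pop(path, set()):
            self._files_by_tag[t].discard(path)
            if not self._files_by_tag[t]:
                del self._files_by_tag[t]
```

With a plain folder tree, `delete` is one unlink and "which categories exist" is one directory listing; here both need bookkeeping, which is roughly the compromise being described.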


Are there good example tools for tagging files? I've long been interested, but haven't found anything good.


There's a standard used in content management systems that's been around for, I think, over twenty years - JCR, or Java Content Repository.

I know this is about an OS and not a CMS, but it seems storing your documents in a repository like this would be useful - not sure about the rest of the OS though.


It could still be a good idea. Wrote about almost the same thing in 2008 https://wouter.info/tag_based_file_management.pdf


My files are mostly org-mode/org-attach-ed stuff, linked in notes so they physically reside in a cache-like structure, accessed via search&narrow.

Classic taxonomies are a thing of the past.



The one thing about hierarchies that keeps them alive for me is that I am forced to assign one when I save a file, and they already exist. Sure, you could have a dumping ground like a "downloads" folder, but that's temporary or expendable.

It's probably easier to just throw everything into a clustering algorithm and autotag than going back and tagging tens of thousands of files.


I still want hierarchies. Tags would be a great addition, though.



