What Happened to Tagging? (2019) (jstor.org)
125 points by Tomte on May 21, 2022 | 95 comments



Besides what the article goes into about auto-curation of social feeds reducing self-curation, the counterintuitive answer is that decentralized tagging requires strong centralization to work.

You need:

- agreement on what should be and what should not be tagged in a given domain

- standardized terminology (no multiple variants of tags)

- consistent grammar and formatting across all tags

- software support for tag editing that makes it easy to adhere to established tagging rules

- mechanisms to explain tagging rules to new users, at scale

- mechanisms to punish malicious/spam tagging (e.g. user history/reputation + bans)

Usually, all of these conditions together are only found in highly niche and specialized forums that care a lot about the quality of their content. While most large social platforms today do have some kind of tagging system (e.g. hash tags on Twitter/Instagram), the usefulness of these systems is generally limited due to the inherent difficulties of co-ordinating so many diverse users who have varying interests.


These are very nearly the exact opposite of the tagging ideas/motives on del.icio.us, an early popular system with tagging. There were lots of people who made similar arguments at the time as well! I thought they were wrong then and nothing has really appeared along the lines of what you're describing to convince me otherwise but it's probably worth taking more whacks at.


Ah, interesting! I never used del.icio.us myself, but from what I understand, it's fairly similar to Instagram (for example) in how tags work, optimizing for ease of use rather than ease of finding specific content. In my opinion, this is almost certainly the right decision for any platform where "absurdly detailed search" is not job #1, and I'm pretty sure I would have argued the same way as you did.

That said, having seen some of the centralized, intricate tagging systems out there, that let you filter down from Earth to one specific ant in the blink of an eye, that's what I think of when I think of "tagging" that's really effective. YMMV, but I would argue that if you can't type in 10 different tags and get 1 result that's exactly what you're looking for, tags aren't really delivering on their promise.


no. tags were a separate field, and you were shown the tags you used as prominently as the things themselves.

note that it was built as a memory aid so that you had a chance of finding something again later. your idea that it needs be exact, perfect, and precise or it won’t work is silly.


Ok, after reading this and also your top-level comment, seems we're talking about the two different types of tagging ("personal curation" vs "collaborative") which you list in that comment. That makes more sense.

Definitely agree that if you're mainly tagging for your future self, you can derive tons of value without constraints. Whereas if you have people tagging for others, that needs more support.

Although, if you're so inclined, you could also view these as a continuum --- with "personal" tagging being the apex of centralized consensus, and having to increasingly labor for consensus with every added user in the system.


We did suggest tags as the intersection of "tags used by others for this url" and "tags you have used in the past" to increase cohesiveness.
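A minimal sketch of that suggestion rule, purely illustrative: the function name, the sample data, and the case-folding step are my assumptions, not how del.icio.us actually implemented it.

```python
def suggest_tags(tags_by_others, my_past_tags):
    """Suggest tags that others applied to this URL AND that I have used before."""
    # Assumption: tags are compared case-insensitively.
    others = {t.lower() for t in tags_by_others}
    mine = {t.lower() for t in my_past_tags}
    # The suggestion list is simply the set intersection.
    return sorted(others & mine)


# Hypothetical data for one URL and one user.
others_for_url = ["python", "Programming", "tutorial", "webdev"]
my_history = ["programming", "recipes", "python", "travel"]
print(suggest_tags(others_for_url, my_history))
```

Because only tags already in the user's history are offered, the suggestions nudge toward reuse without introducing vocabulary the user never chose.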


Hmm yeah, kinda, or at least I remember it slightly differently - that the sort of 'winning point' for it was that it was useful for individual users because it helps you pick out tags you already have or add a new tag you didn't have, plus it tells you something about the url. The purposes overlap, of course and it's been a while.

The thing above is much closer to what some of the librarians were really into.


oh definitely. lots of librarians campaigning for strict rules around tag use etc


image boards like danbooru and similar websites are an example of the things mentioned by the parent comment. and from my personal experience they are the best implementation of tags I've seen on the internet. they are not perfect and still have a lot of room to improve, but they are way better than what's used and available elsewhere.

they have their own description, they get moderated, other people can add tags, they can report them, you can alias tags, see related tags, and get feedback on them.

disclaimer: most of these image boards are NSFW.


Commonly called a folksonomy.

These are easy to establish, but can be quite difficult to maintain and rationalise.

https://en.wikipedia.org/wiki/Folksonomy


> Commonly called a folksonomy.

Yeah, it's not, really. And thankfully.


How so not?

del.icio.us is specifically listed as an example.

(I'm aware that you're more than casually familiar with del.icio.us. I'm just confused by your response.)


It is, but nobody involved in making del.icio.us used it (del predates it, too). The term was popular for a bit at the time and has now mostly gone out of use. Unsurprisingly, being a bit of a clunker.


So ... del.icio.us's tagging wasn't used? Or the make-up-your-own tagging wasn't used?

Are you doing anything related these days?


It's two things, really. One is that 'folksonomy' was a term of its time, like, I dunno, 'blogosphere' or 'microformats', and is similarly obsolete. The other is that there were lots of people who thought or hoped that del.icio.us-style tags would lead to some sort of useful or interesting taxonomy or ontology, either emergent or by more prescriptive means (as in the initiating comment above), and that didn't really happen, perhaps because it wasn't (for the most part) what the tags were for. 'Folksonomy' is terminology that came from that line of thinking.


Fair enough, thanks.

I've been generally inclined toward adopting an existing taxonomy (or at least a usefully-sized portion of it). Unfortunately, many of the more common ones are copyright-encumbered (e.g., Dewey classification). Library of Congress classification and subject headings are available. If somewhat unwieldy.

A few of my tilted windmills...


If you haven't come across it before, this is still a fun read, a kind of manifesto of the 'emergent ontology is going to be better than designed ontology' notion.

https://web.archive.org/web/20050601013309/http://shirky.com...

It made librarians pretty mad. At the end of the day, though, putting things like 'ontology' and 'keywords you type so you can find your bookmarks again later' in tension was a category error.


Thanks, excellent reference. I have seen that before, and Shirky hits it on the head here:

> What's being optimized is number of books on the shelf.

Libraries organise their content. And in a print-and-paper world, that content has locational specificity.

You don't have to organise materials by topic, but in an open-stacks model that's almost always preferable, and it has utility with closed stacks as well.[1] The indexing system can provide a space-transgressing capability through cross-references, but even the index was originally bound printed volumes or journals; as the 19th century progressed, it increasingly became the more random-access, but still locationalised, index card within cabinets.

I'd recently submitted the Hathi Trust archives of the Annual Report of the Librarian of Congress, 1866--2007 to HN: https://catalog.hathitrust.org/Record/000072049 (https://news.ycombinator.com/item?id=31421398)

Having read A.R. Spofford's reports, the physicality of the archive dominates --- during his tenure the collection was housed in the North wing of the US Capitol, adjacent to what is now the Old Senate Chambers (directly west, best I can tell, with three floors and presumably some basement space). The collection had burnt shortly previously (1851), and beginning in the 1870s, Spofford was urging Congress to dedicate a building to the archive, and complained incessantly of the challenges in even enumerating the holdings due to crowding of books and other materials --- 800,000 volumes in a space meant for 200,000.

The new building opened the year of Spofford's retirement, in 1897. (Spofford himself lived to 1908, remaining as Chief Assistant Librarian, and presumably enjoying the main fruit of his labours in the Jefferson Building.)

Following Spofford, attention seems to turn to cataloguing. I'm reading those reports now, which expand greatly from the earlier brief form (about 6 pages during Spofford's term). I'm interested to see how that discussion develops.

It is made clear that the organisation borrows heavily from Francis Bacon's trinary distinction of history (or memory), poetry (or creative works), and philosophy (or reason), acquired through the Thomas Jefferson collection's organisation. (Jefferson's personal library re-established the Library after an earlier fire of British origin in 1814, during the War of 1812.)

I've also looked at other ontologies --- Diderot, the Encyclopaedia Britannica, Wikipedia, and several library classifications (Dewey, LoC, Colon, ...).

As with principles of truth, classifications should be useful, serving a purpose. Organising and using a collection should presumably be key amongst those purposes.

Hierarchical classifications, like metaphors, melt if pushed loudly enough.

I've been kicking around some ideas for an information management ... thing ... which I variously call KFC (Krell Functional/Fucking Context), docfs, and/or webfs (latter two should be self-evident, see Plan 9's 9P for strong precedent). One notion is that search affords identity, in the sense that a search which is sufficiently defined to result in a single work is an identity function for that work. That's mated with the notion that a search can return a different number of results: 0, 1, a few, or many. "Few" and "many" are relative to the ability to work with the results; they're not a fixed quantity, and will vary by characteristics of the system and its user. For a skilled researcher and good tools, I'd suggest "few" might range over orders of magnitude from 10 to 1,000, capable of being winnowed further, with effort. "Many" might be 100 and above (there's overlap, yes), to millions, billions, or more.

For purposes of this discussion, values from 2 -- 9 equal 10 ;-)

A search result is nothing (empty set), one (identity), or > 1 (list). Where a list is presented, further subdivisions might be suggested: publication dates, authors, subjects, publishers, concepts, statistically significant words or phrases, titles... At some point, those subdivisions would likely provide an identity.

My thinking is that a filesystem-like expression of qualifiers might identify a given work, or set of works. Or the empty set. Something like:

  /docfs/au:fitzgerald/ti:gatsby

Or:

  /docfs/dt:600-799/kw:transubstantiation

If you're interested in texts of a medieval theological concept.

There might be more paths to Gatsby:

  /docfs/dt:1915-1930/kw:west egg/kw:daisy

Again: so long as you can come up with constructs which usefully winnow down the possible results set, you can find what you're looking for.

  /docfs/dt:1980-1989/su:machine learning

The advantage of a controlled vocabulary is not that it is a strict hierarchy, which seems to be what many people get bogged down in, but that it is a useful and defined vocabulary.

And amongst the reasons why the US LoC classification and subject headings are useful is not that they are perfect and a single authority, but because over more than a century of use and adaptation they've acquired the institutional tools and processes to manage change and ambiguity reasonably well.

That, and the fact that they're freely available. Albeit in inconvenient forms.

(Another project I've been working on in fits and starts.)
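To make the path idea concrete, here's a rough sketch of how such qualifier paths could be parsed and applied, and how the result could be classed as nothing, identity, or list. Everything here is hypothetical: docfs doesn't exist as code, the three-item catalog is made up for illustration, substring matching for `au:`/`ti:` is my assumption, and only `au:`, `ti:`, `kw:` and `dt:` are handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Work:
    author: str
    title: str
    year: int
    keywords: frozenset


# Hypothetical toy catalog, just enough to exercise the paths.
CATALOG = [
    Work("F. Scott Fitzgerald", "The Great Gatsby", 1925, frozenset({"west egg", "daisy"})),
    Work("F. Scott Fitzgerald", "Tender Is the Night", 1934, frozenset({"riviera"})),
    Work("Ernest Hemingway", "The Sun Also Rises", 1926, frozenset({"pamplona"})),
]


def parse_path(path):
    """Split '/docfs/au:x/ti:y' into [('au', 'x'), ('ti', 'y')]."""
    parts = [p for p in path.strip("/").split("/") if p]
    if not parts or parts[0] != "docfs":
        raise ValueError("path must start with /docfs")
    qualifiers = []
    for part in parts[1:]:
        field, sep, value = part.partition(":")
        if not sep:
            raise ValueError(f"qualifier without ':': {part!r}")
        qualifiers.append((field, value.lower()))
    return qualifiers


def matches(work, field, value):
    """Test one qualifier against one work."""
    if field == "au":
        return value in work.author.lower()
    if field == "ti":
        return value in work.title.lower()
    if field == "kw":
        return value in work.keywords
    if field == "dt":  # 'start-end' range, or a single year
        start, _, end = value.partition("-")
        return int(start) <= work.year <= int(end or start)
    raise ValueError(f"unknown qualifier: {field}")


def search(path, catalog=CATALOG):
    """Apply each qualifier in turn, winnowing the result set."""
    results = list(catalog)
    for field, value in parse_path(path):
        results = [w for w in results if matches(w, field, value)]
    return results


def describe(results):
    """Class a result set as nothing (empty), identity (one), or list (> 1)."""
    if not results:
        return "nothing"
    return "identity" if len(results) == 1 else "list"


print(describe(search("/docfs/au:fitzgerald/ti:gatsby")))
print(describe(search("/docfs/au:fitzgerald")))
```

Each path segment only ever narrows the set, so two quite different paths can arrive at the same single work.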

________________________________

Notes:

1. Among alternatives I'm aware of is the SuDoc classification, used by the U.S. Superintendent of Documents, which is arranged by government department and date. Which turns out to be a useful way for grouping that corpus physically.


> Usually, all of these conditions together are only found in highly niche and specialized forums that care a lot about the quality of their content.

Ooh do any of these still exist? If you know of any I'd love the links to look at how they're doing.

I was an inveterate tagger, debating taxonomies and ontologies late into the night (I have now forgotten the difference between the two!) and tried to run a curated forum. Eventually I gave up for most of the reasons you highlight - but mainly because I realised no one was as OCD about classification as I was.

In another life I would have run and catalogued a university library.


Stack Overflow exhibits (or exhibited) all the points that parent mentioned. If you look at [the discussion of tags on the Meta site][0], and especially what's called ["burnination"][1] you'll see these issues being hashed out over time.

To sustain a tagging system like that it takes dedicated and invested individuals, and the corollary of that is that such people tend to generate a lot of discussion.

[0]: https://meta.stackoverflow.com/questions/tagged/tags

[1]: https://meta.stackoverflow.com/tags/burninate-request/info


The social cataloging site Rate Your Music has a very in-depth genre tagging system. For each album and track, users debate and vote on which primary genres and secondary genres apply. For example Radiohead's OK Computer has Alternative Rock and Art Rock primary genres and a highly controversial Space Rock Revival secondary.

Each genre has a lineage of parent genres so each release tagged with a genre must also be a part of each parent genre. For example: Electronic > Electronic Dance Music > House > Tribal House. Also: Rock > Metal > Thrash Metal > Technical Thrash Metal.
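As a sketch of how that implication could work (not Rate Your Music's actual code): the parent table below holds only the two lineages quoted above, and it assumes each genre has a single parent, which may not hold on the real site.

```python
# Child -> parent links, taken from the two example lineages above.
# Assumption: one parent per genre.
PARENT = {
    "Electronic Dance Music": "Electronic",
    "House": "Electronic Dance Music",
    "Tribal House": "House",
    "Metal": "Rock",
    "Thrash Metal": "Metal",
    "Technical Thrash Metal": "Thrash Metal",
}


def lineage(genre):
    """Return the genre followed by all of its ancestors, most specific first."""
    chain = [genre]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def implied_genres(tagged):
    """Every genre a release belongs to, given the genres it was tagged with."""
    result = set()
    for genre in tagged:
        result.update(lineage(genre))
    return result


print(lineage("Tribal House"))
print(sorted(implied_genres({"Technical Thrash Metal"})))
```

Storing only the most specific genre and deriving the ancestors keeps the parent tags from drifting out of sync with the hierarchy.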

There's a queue for submitting proposals for new genres and modifying the definitions of existing ones. There's also a complex chart system for filtering releases by genres, types, and descriptors. I think I last heard there were ~1300 music genres on the site.


Some good examples have been posted prior to my reply here --- I'll reiterate Archive of our Own (fanfiction) and Danbooru (anime porn) as two fairly big sites with well-maintained tagging systems.

Both sites have abundant guides and documentation about their systems and it's very interesting to see how they manage the real-world complexity of their domains.

Here are some good entry points if you're interested:

Archive of Our Own:

https://archiveofourown.org/faq/tags?language_id=en

https://archiveofourown.org/wrangling_guidelines/2

Danbooru: (linked pages are text-only, but individual tag pages, as well as the rest of the site, can be highly NSFW)

https://danbooru.donmai.us/wiki_pages/howto%3Atag

https://danbooru.donmai.us/forum_topics?search%5Bcategory_id...


Building an effective tagging system can be much harder than people realize. I once worked on a tagging system for a collection of math problems. I thought I could code a simple tagging model, and let users tag their own math problems, and it would become much easier to find the problems you're most interested in.

Then I realized that tags like algebra 1, Algebra 1, Algebra I, Alg I, and all other variations should mean the same thing. So I started to develop a closed set of tags. That led to a fascinating rabbit hole about taxonomies that I don't even remember how to speak about clearly at this point.
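For illustration, a small sketch of collapsing those variants into one canonical tag. The abbreviation and Roman-numeral tables are assumptions chosen to cover exactly the examples above; a real closed vocabulary would need far more rules than this.

```python
import re

# Hypothetical lookup tables, just enough for the variants listed above.
ABBREVIATIONS = {"alg": "algebra"}
ROMAN = {"i": "1", "ii": "2", "iii": "3"}


def canonical_tag(raw):
    """Map spelling variants of a tag onto one canonical lowercase form."""
    # Lowercase and split on whitespace, underscores, or hyphens.
    words = re.split(r"[\s_\-]+", raw.strip().lower())
    out = []
    for position, word in enumerate(words):
        word = word.rstrip(".")               # "Alg." -> "alg"
        word = ABBREVIATIONS.get(word, word)  # "alg"  -> "algebra"
        # Only convert Roman numerals after the first word.
        if position > 0 and word in ROMAN:
            word = ROMAN[word]
        out.append(word)
    return " ".join(out)


for variant in ["algebra 1", "Algebra 1", "Algebra I", "Alg I"]:
    print(variant, "->", canonical_tag(variant))
```

Rules like these only get you so far; anything they miss still needs a human-maintained alias list, which is where the taxonomy rabbit hole starts.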

That project is still a work in progress, and it's left me with immense respect for people who build well-structured systems that involve tagging.


Two impressive site-wide systems I've seen are the categories of Wikimedia Commons (multimedia) and tags of Archive of Our Own (fanfiction). The Commons guideline[0] elucidates its system and interesting ontological theory well. Its scope is extremely broad, aiming to simultaneously include any possibly useful categorization scheme,[1] and overall is a fairly freeform (ideally) directed acyclic graph. Variations are handled with redirects and disambiguation pages in a typical wiki manner, with the limitation that individual category uses must have the canonical name. Ao3, in contrast, has a schema of sorts, and synonyms are made equivalent during resolution (its tags FAQ[2] is also an interesting read).

I tried to write a more thorough comment but also struggled with being coherent. Thus, some ideas, only briefly:

- At an even higher level, the web itself and the overlapping userbases/communities ('intersectionality', without the discrimination--the original set-theory kind?) of individual sites can also be considered a way of organizing content

- Thus, analogously: Search engines replaced directories and webrings as algorithms did tags. The present SEO meta, though...

- Generalizing from Commons, all Wikimedia wikis (Wikipedia, etc.) have parallel category structures, only less developed due to the greater reliance on links. So do most wikis in general, though Wikimedia also unifies categorization and structured data with Wikidata. From there are knowledge graphs and databases in general, wrapping back around to Google trying to determine the Knowledge Graph item that each query refers to.

[0] https://commons.wikimedia.org/wiki/Commons:Categories

[1] all the typical keying on depicted people, things, times, and places, plus the ways that we categorize those. Niches from 'horizontal bicolor blue and white flags' to 'Luxembourgish pronunciation by gender', 'trams on route 709', 'ships with 6 funnels'. There's a tool (now called vCat) to visualize categories, some outputs here: https://commons.wikimedia.org/wiki/Category:Wikimedia_catgra...

[2] https://archiveofourown.org/faq/tags

Edit: specific examples


> tags of Archive of Our Own (fanfiction)

On a similar note, Danbooru-style image boards often have highly developed tagging systems, ranging from tags for specific characters or artists to tags for art styles, poses, or even specific features which happen to appear in the artwork (like "hat bow" or "blue eyes").


Just for fun, here are your examples applied to Commons (and a conjecture that tag systems naturally converge as they become more fine-grained):

https://commons.wikimedia.org/wiki/Cat:Wikipe-tan

(NSFW-ish[0]) https://commons.wikimedia.org/wiki/Cat:Drawings_by_User:Seed...

https://commons.wikimedia.org/wiki/Cat:Demoscene

https://commons.wikimedia.org/wiki/Cat:Paintings_of_couples,...

https://commons.wikimedia.org/wiki/Cat:Blue_eyes

https://commons.wikimedia.org/wiki/Cat:Bow_hats

There's also a tool to intersect or subtract categories hidden in the dropdown of the 'Good pictures' button at the top right.

[0] (NSFW-ish) https://en.wikipedia.org/wiki/Seedfeeder


> I tried to write a more thorough comment but also struggled with being coherent.

how fitting.

I guess it's always about neighbourhoods. In your street, in your pew, in your bookshelf, inside your brain, in your zettelkasten.


I just used synonyms and a tag hierarchy (nested sets).

Works pretty well.
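For anyone unfamiliar with nested sets, here's a rough sketch of the general technique, not the commenter's actual code: every tag gets a (left, right) pair from a depth-first walk, and a tag's descendants are exactly the tags whose pair falls strictly inside its own. The tree and synonym table are made-up examples.

```python
def number_tree(tree, root):
    """Assign (left, right) bounds to every tag by depth-first traversal."""
    bounds = {}
    counter = 0

    def visit(tag):
        nonlocal counter
        counter += 1
        left = counter
        for child in tree.get(tag, []):
            visit(child)
        counter += 1
        bounds[tag] = (left, counter)

    visit(root)
    return bounds


def descendants(bounds, tag):
    """All tags nested strictly inside the given tag's bounds."""
    left, right = bounds[tag]
    return sorted(t for t, (lo, hi) in bounds.items() if left < lo and hi < right)


# Hypothetical hierarchy and synonyms.
TREE = {"math": ["algebra", "geometry"], "algebra": ["algebra 1", "algebra 2"]}
SYNONYMS = {"alg 1": "algebra 1", "maths": "math"}
BOUNDS = number_tree(TREE, "math")


def expand(tag):
    """Resolve a synonym, then return the tag plus everything beneath it."""
    canonical = SYNONYMS.get(tag, tag)
    return [canonical] + descendants(BOUNDS, canonical)


print(BOUNDS["algebra"])
print(expand("maths"))
```

In a database the same bounds turn "everything under algebra" into a single range comparison, at the cost of renumbering when the hierarchy changes.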


I ended up building out a hierarchy as well. But figuring out the structure of that hierarchy was not trivial at all. How does the name of a repeatable class (Algebra 1) fit with the name of a specific class (Algebra 1 Fall 2020 Section 2)? How does that relate to an area of math like algebra, geometry, number theory? How does that relate to things like context (ie problems about Minecraft, Lego, Physics, etc.)

I developed a closed system of tags, and then gave people the ability to define aliases.



Tagging doesn't work because there is an incentive to falsely tag to sell stuff. Tag sites with ads for cat products as "dog", "bird", "groceries", "boots", "cars" on the off chance that you'll get your ad for cat products in front of some random customer's eye balls.

It's exactly the same incentive as spam email. It takes zero effort, costs nothing, and if you get 1 hit in a million you still make a profit.

You can see this issue in play easily on soundcloud where people will tag their music with whatever tags they think will get their tracks played. You can also see it on all the porn sites where people re-upload porn with ads inserted or overlayed for their pay-for-porn site and then tag the porn with whatever they think will get click throughs.

You might claim that with enough tags you'll be able to tell the accurate tags from the inaccurate tags, but I've seen no evidence that that actually works. My guess is it's partly that the only people who have an incentive to tag are the content creators (or content re-uploaders) and they have no incentive to follow any rules.

Further, agreeing on tags is nearly impossible. Consider "man" vs "woman" and all the political discussion around that. There's a conflict between those that want the tag for their identity, and those that want the tag to be useful for filtering. As much as I respect people's identities, it's more useful for filtering if I can search for "brunette" and only see brunettes. And, if you find some other tag to use for the filtering then it's only a matter of time before someone demands they be identified with that new tag.


Seems like this would be easily countered by weighting each tag by something like log(n tags) or something.

Basically have just a few tags? They count a lot.

Have a bazillion? They count next to zero
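One possible reading of that idea, as a sketch: each tag's weight shrinks as the item's tag count grows. The exact formula `1 / (1 + log(n))` is my assumption (the comment only says "something like log(n tags)"), and the function names are made up.

```python
import math


def tag_weight(n_tags):
    """Weight of each tag on an item that carries n_tags tags (n_tags >= 1)."""
    # Assumed formula: a single tag counts fully, and weight decays
    # logarithmically as more tags are piled on.
    return 1.0 / (1.0 + math.log(n_tags))


def match_score(item_tags, query_tag):
    """Score an item for a one-tag query: zero if untagged, else the tag's weight."""
    if query_tag not in item_tags:
        return 0.0
    return tag_weight(len(item_tags))


honest = {"cats", "pet supplies"}
spammy = {"cats"} | {f"keyword{i}" for i in range(500)}
print(match_score(honest, "cats"), match_score(spammy, "cats"))
```

Note that a logarithm decays slowly, so a heavily stuffed item still keeps a noticeable weight; something like `1 / n_tags` would push it much closer to zero.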


Also let users report bad tags, vote up/down on tags, assign a vanguard to protect specific tags, basically do anything about the problem.


What's to keep your competition from flagging your own valid tags?

Or you your competition's?


Users are not given enough power to penalize bad actors. What do you expect?

I suspect if tags were implemented properly, companies would make less money.


> You can see this issue in play easily on soundcloud where people will tag their music with whatever tags they think will get their tracks played.

When a Lo-Fi Chillwave stream also includes Grindcore Death Metal tracks, it’s an especially annoying taxonomy misapplication.

Most tag spam is less obvious to people, but still makes for dirty data.


See the list of issues Cory Doctorow identified back in 2001:

https://people.well.com/user/doctorow/metacrap.htm

(Posted by @andyback below.)


My time to shine, I guess.

When I invented tagging it was as a personal curation process; it was designed for people to recall things back to themselves later. It was an organizational schema. It has mostly disappeared.

What we have now is mostly tagging so OTHER people can find it. Which leads to all sorts of bad incentives.

Nobody really built collaborative tagging but it needs a bunch more support than hashtag-this and hashtag-that to really work. For example, we showed people tags that they have used before so they were gently pushed to reuse tags for more organizational cohesiveness.


Hey, and thanks for bringing the concept of tagging to the world!

The idea that tags were meant for the individual makes a lot of sense. That’s how I used delicious. It was like my bookmarks folder, but links could be in multiple folders.

When it comes to collaborative tagging, are there any successful examples that you’ve seen? Or sites you feel are using tags in interesting or surprisingly useful ways?


Oh, I miss del.icio.us - UI / execution / social discovery, all those fantastic finds... :)

I love tagging a lot, but the problem with tagging is that only a handful of software/services use look-up for tag1 AND tag2 (AND tag3...) filtering. It's such a simple concept to filter all used tags based on already selected if I'm making a query using tags. I can not understand how people don't get that without this tagging is more-or-less useless.
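To show how small the mechanism is, here's a sketch of AND filtering plus narrowing the tag list to what still co-occurs with the current selection. The bookmark data and function names are invented for the example; this isn't how del.icio.us or any particular tool implements it.

```python
# Hypothetical bookmark store: URL -> set of tags.
BOOKMARKS = {
    "https://example.com/a": {"python", "tutorial", "web"},
    "https://example.com/b": {"python", "web", "flask"},
    "https://example.com/c": {"python", "data"},
    "https://example.com/d": {"recipes"},
}


def filter_by_tags(bookmarks, selected):
    """Keep only bookmarks that carry ALL selected tags (tag1 AND tag2 AND ...)."""
    selected = set(selected)
    return {url: tags for url, tags in bookmarks.items() if selected <= tags}


def remaining_tags(bookmarks, selected):
    """Tags that can still narrow the result, i.e. co-occur with the selection."""
    matches = filter_by_tags(bookmarks, selected)
    return sorted(set().union(*matches.values()) - set(selected))


print(sorted(filter_by_tags(BOOKMARKS, ["python", "web"])))
print(remaining_tags(BOOKMARKS, ["python", "web"]))
```

Showing only the remaining tags after each click is what lets a user drill down without ever reaching an empty result.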

A few months ago I discovered Bibsonomy [0,1], which is open-source, written in Java, but far beyond my abilities to deploy.

I've been using Obsidian for almost two years for my notes exclusively and it's a life changer, but devs do not seem to be interested in implementing this simple filtering mechanism in Tags Pane when working with tags [2], which defies the purpose of using tags extensively (like Del.icio.us allowed).

[0] https://bibsonomy.org

[1] https://bitbucket.org/bibsonomy/bibsonomy/src/master/

[2] https://forum.obsidian.md/t/filter-tags-list-in-tag-pane-whe...


Thank you for making the concept of tags! Even though it's not perfect, I'm sure you've probably saved anywhere between thousands to perhaps millions of man hours globally (if not more)! That is an achievement very few people can claim.


>Nobody really built collaborative tagging

Why don't you build it now, or at least finance somebody who does? Is it too risky to build because the big social networks will copy it and thus destroy any successful exit?


I was on a call with Science to potentially take over delicious but the momentum was super low back then and IIRC 99% of it was porn. Should have taken that poison pill in hindsight.


I loved del.icio.us. It was a core part of my browsing experience. Any useful link I stumbled upon got tagged and saved. The popular links were very useful, too. Then it got sold to people who couldn't figure out how to make money off it without ruining it.

I still miss the functionality of being able to quickly find every interesting webpage I've ever seen (using tags). A way to supply that functionality in the modern world would be a visited pages search feature on Google or Chrome. Or a search feature for the content of pages I've bookmarked.


Pinboard (https://pinboard.in/) is still a thing, and the developer bought the del.icio.us domain too I think.

In any case, it has at least replicated the del.icio.us functionality, and then added more, such as archiving page contents. The tags are still there too, and it prompts with other users' tags when you add a bookmark. Oh, and an API, which is very useful for programmatic use of the data once saved.


Pinboard doesn’t have the social features that del.icio.us had—you can’t see the list of others who bookmarked a link, for example.


In my opinion, social sharing usually requires free accounts. If the account costs money, then the social sharing features are probably less important. Pinboard says it is for introverts.

If proper free accounts with all functions are supported, then advertising is a likely revenue model.

To combat this problem, I've added sharing over email, SMS and Slack to my project, which is a personal search engine and document manager. There's no reason to not allow a very basic and free account function to read these shared links, without a lot of auth and registration getting in the way. The service already has the guest's email address, phone number or Slack account from the other user. This can be used to send a token for easy login. Limits would be added so these accounts can only store a few URLs themselves unless they upgrade.

This keeps privacy protected, hopefully.


Yes, you can see people who bookmarked a link (if they haven't made it private).


Wait really? How? I haven’t been able to figure out how to do this, but I’d love to know.


OK, so this seemed like something plausible to do, thought I'd take a look.

- Looked at "popular" bookmarks, e.g. https://pinboard.in/popular/

- At the far right end of each page title, there's a little number. For example, beside "Lotus 1-2-3 For Linux", there's currently the number 21, which links to https://pinboard.in/url:61d59935774d1affe22713a7423c78bd17ef....

- Following that link, you see who bookmarked that link, with name, tags, and some related other links. Of course, clicking on a user name will show you other things they've linked, at least using the default public setting.


I'm building a personal search engine/document management system that uses tags similar to how del.icio.us worked. URLs and screenshots can be saved via the browser, or by instructing the system to crawl it (which gets done with Firefox/webdriver). It's like a split-brained version of the Grub crawler. It also supports uploading PDFs and images.

Tags, objects, labels, synthesized commentary, etc. are provided by machine learning models and GPT3. Eventually the pipelines will be customizable, so running a plant identification model will be possible. Full text search and analytics is provided via a customized Solr deployment manager. I've built a unique UI for it based on my original cut of a simple timeseries interface at Loggly. Love using it, but have no idea if others will want to pay for it. I seriously hate ads, trackers and user privacy violations.

  merry-zebra|> !crawl https://news.ycombinator.com/item?id=31459103
  merry-zebra|> Please wait while I index https://news.ycombinator.com/item?id=31459103.
  merry-zebra|> Site has been indexed. An image of the site will be added in ~10 seconds.
  merry-zebra|> ...
  merry-zebra|> updated 2022-05-21T18:55:06Z
  merry-zebra|> ID UmXyyk3tZJdGZW4uv
  merry-zebra|> title What Happened to Tagging? (2019) | Hacker News
  merry-zebra|> description The article discusses the potential reasons why "tagging" (i.e. adding labels to content for organizational purposes) has declined in popularity in recent years, despite its usefulness.
  merry-zebra|> URL https://news.ycombinator.com/item?id=31459103
  merry-zebra|> Tags #What, #Happened, #Tagging, #2019, #HackerNews, #News
  merry-zebra|> ...
  merry-zebra|> To search me for the document, click on one of the action links.
  system=> Do you have any comments about this webpage, @merry-zebra?
  merry-zebra|> I find tagging to be extremely useful for organizing content. I think the decline in popularity is likely due to the fact that it can be time consuming to tag everything, and people are often lazy. However, I think it is worth the effort to tag things, as it makes it much easier to find what you're looking for later on.


I've built something similar for my crawler at biztoc.com — It does OpenGraph extraction, finding the body, summarizing it, tag & entity extraction, sentiment analysis, Oembed, stock symbol detection, screenshot & favicon, etc.


Great job on this! It looks fantastic and I think you'll do well. I like how you moved to token use for logins. Passwords are dumb.

I thought about these types of features for Mitta.us (which is NOT done, but operational), but it was too much work. Glad I put it off, because you did a much better job.

I'm adding a !biztoc command to Mitta for search, but it would be cool to be able to add some post parameters like https://biztoc.com/post?title=foo&url=https://zombo.com to post as well.


Nice! That's already implemented: https://biztoc.com/s/bm

E.g: https://biztoc.com/post?bms=mitta&bmu=https%3A%2F%2Fnews.yco...

Let me know if you need more. Cheers


What's the best way to reach you? I am building something with a lot of the same ideas and would love to talk shop. Trying to network with others in the collaborative search/organization/knowledge space.

My email is also in my bio.


Nice work. How do you crawl dynamic websites that barely use links, or those which have scraping countermeasures like Amazon?


Thanks! I use GPT3 to synthesize a title and description from the URL and also use it to generate a description if the site simply lacks one. I use webdriver running Firefox to image the site. Some DOM information can be pulled that typically isn't blocked, but it isn't implemented yet.

My argument for these companies to allow a "scraper" like mine, is that I'm adding their full URL and tags for the user, on the user's behalf. I'm not scraping URLs or doing breadth/depth crawls. I ask for a single page the user gives me, then take an image only that user can see, unless they chose to share it with someone over email or Slack.

For sites that block "crawlers" from certain IP ranges, I've written an extension for Chrome/Firefox which allows the user to image the screen and upload it. This adds the site to the index just as if they had asked the system to crawl it. I gave up on scrolling the window, however. Chrome now limits screen grabs to one per 0.5 seconds.

It also supports image uploads, so if the user wants to just use their own screenshotting method, they can just upload the image. Extraction of text and synthesis of titles and descriptions can be handled by GPT3 (as well as URL synthesis from keywords, command translation and Solr query synthesis).

I'm working on training a model to tell me whether an upload is a plain image, a web page screenshot, or a desktop shot.


Is this a project you intend only for yourself? Or is it going to be a product?


It's a hosted service that will be available as well as an on premise deployment for companies.


Completely agree. Delicious was like the perfect bookmark manager. Then it went to complete shit and ever since then I’ve barely bookmarked anything.

Honestly though I don’t think bookmarks serve much of a purpose anymore. Like I’ll just search my history if I need something specific. Or maybe I’ve just forgotten how useful they are.


Am I missing something because I use Firefox? I bookmark and tag every interesting site I come across. Is tagging not a thing in Chrome?


> A way to supply that functionality in the modern world would be a visited pages search feature on Google or Chrome

I've been wondering about a plugin that does that. Maybe built over this? https://lunrjs.com/

I am ABSOLUTELY CERTAIN this does not yet exist.

(Easier to type that last sentence than actually Google for it).


Why couldn't such a thing be a local browser extension or similar?


For some time there were also "machine tags", basically a triple tag invented (I think) at Flickr[0]. It was an interesting concept: you could automate relationships between different contexts, for example between Flickr and Last.fm[1].

I used it for a while, then I always wondered why nothing similar has ever emerged, maybe because after the first wave of "social sharing" excitement of web 2.0, every walled garden has basically double locked their gates. And this is maybe what happened to tagging in general.

[0]: http://tagaholic.me/2009/03/26/what-are-machine-tags.html

[1]: https://code.flickr.net/2008/08/28/machine-tags-lastfm-and-r...
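
As a rough illustration of the triple structure, a machine tag has the shape `namespace:predicate=value`, for example `geo:lat=57.64`. The Python sketch below splits one into its three parts. The `MachineTag` type, the `parse_machine_tag` helper, and the exact characters the regex allows are assumptions for illustration, not Flickr's actual parsing rules.

```python
import re
from typing import NamedTuple, Optional


class MachineTag(NamedTuple):
    namespace: str
    predicate: str
    value: str


# Assumed grammar: namespace and predicate start with a letter and contain
# only letters, digits, or underscores; the value is everything after '='.
_MACHINE_TAG = re.compile(r"^([A-Za-z][A-Za-z0-9_]*):([A-Za-z][A-Za-z0-9_]*)=(.+)$")


def parse_machine_tag(raw: str) -> Optional[MachineTag]:
    """Return the triple for a machine tag, or None for an ordinary tag."""
    match = _MACHINE_TAG.match(raw.strip())
    if match is None:
        return None
    return MachineTag(*match.groups())


if __name__ == "__main__":
    for tag in ["geo:lat=57.64", "lastfm:event=12345", "sunset"]:
        print(tag, "->", parse_machine_tag(tag))
```

Once tags are split like this, a second service can look up the namespace it owns (here the hypothetical `lastfm` one) and link the photo to its own record, which is the cross-site relationship described above.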


The concept of machine tags is the core premise of RDF. RDF is essentially the standardized way of describing relationships in a structured way (in XML). In fact, an early version of RSS was based on RDF (RSS 0.9 stood for "RDF Site Summary").

One of the downsides is that it's pretty hard for "average" folks to produce these feeds. There's a steep learning curve for modeling the relationships. Getting other sites to agree on a format, use it, and maintain it without breaking compatibility was hard.


I’m not sure I’m following, how is the machine tag format setting up automatic relationships and contexts? How is it helpful and how might you see it being effective today?

Edit: only saw the first link. Seems second link breaks it down but can’t review it yet. Got pulled away. Thanks!


Machine tags aren't tags at all. Not all metadata is tagging.


Right around the time the author was celebrating Tagsgiving, I was in Library School, and tagging was a hot topic around those parts. The consensus there was: "this is great and all, but there's a reason we have controlled vocabularies and classification systems. We'll see, we'll see."

I was all in on the possibilities for "folksonomies" and user tagging. However I have to admit that I have not seen many examples of where uncontrolled tagging was all that useful at scale.

To organize information, you need experts, with training, time, and a reason to get it right. Or, you can do it with an arbitrarily sophisticated, mostly theoretical ML system. But neither of these solutions benefits from having user tags.


Add enough tags and then you have a gawdawful mess and you need tags to organize your tags.


I think this is a great use case for some algorithm to help you combine tags (by recognizing synonyms/plurals, text summarization, crowd-sourcing, something else?). Then it could keep you "on the rails" when tagging and periodically ask if you want to combine tags that seem similar.
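
A minimal sketch of the "ask to combine similar tags" idea, in Python. It uses a deliberately naive plural rule and the standard library's `difflib.SequenceMatcher` for string similarity. The `normalize` and `suggest_merges` names and the 0.85 threshold are assumptions, and real synonym detection would need more than string distance.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(tag: str) -> str:
    """Lowercase, unify separators, and strip a trailing plural 's' (naive)."""
    t = tag.strip().lower().replace("_", "-").replace(" ", "-")
    if len(t) > 3 and t.endswith("s") and not t.endswith("ss"):
        t = t[:-1]
    return t


def suggest_merges(tags, threshold=0.85):
    """Return (tag_a, tag_b, score) pairs that look like the same tag."""
    suggestions = []
    for a, b in combinations(sorted(set(tags)), 2):
        na, nb = normalize(a), normalize(b)
        # Identical after normalization counts as a perfect match;
        # otherwise fall back to a character-level similarity ratio.
        score = 1.0 if na == nb else SequenceMatcher(None, na, nb).ratio()
        if score >= threshold:
            suggestions.append((a, b, round(score, 2)))
    return suggestions


if __name__ == "__main__":
    for pair in suggest_merges(["landscape", "landscapes", "crowdsourcing",
                                "crowd-sourcing", "python"]):
        print(pair)
```

The output is only a list of candidates; the point of the suggestion above is that the user confirms each merge rather than having the tool rewrite tags on its own.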


Perhaps, but after even just a few relatively short attempts to start organizing some of my files with tags I don't think this would be sufficient. I found the meaning of tags frequently started to drift. What I cared about and why just wasn't that consistent. Never mind being consistent with hair-splitting judgement calls in categorization.

And the more you tag, the more difficult it is to fix. Either you retag everything to fit the new standard or you accept that trying to retrieve things by tag will return some weird set defined by the intersection of your changing definition over time and the time at which you applied the tag.

I don't doubt a more structured and principled approach would help, but I found it just ended up soaking up tons of time, and thought, without actually providing much back.


I’ve always thought that Gmail’s hierarchical tagging (‘folders’) is a great solution to the organization problem.


Same. I've been making and remaking a bookmarking/notetaking site for my personal use over the years, and this is the solution I landed on. They look like and can be organized like folders, but you can quickly add items to multiple folders. I think it's working well for me so far.
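
One way to picture "folders that behave like tags" is a store where labels are slash-separated paths and an item may carry several of them. The Python sketch below is an illustration under that assumption; `LabelStore` and its methods are hypothetical names, not how Gmail or the site described above is implemented.

```python
from collections import defaultdict


class LabelStore:
    """Items can live under several hierarchical labels at once."""

    def __init__(self):
        self._items_by_label = defaultdict(set)

    def add(self, item, *labels):
        # The same item is recorded under every label it is given.
        for label in labels:
            self._items_by_label[label.strip("/")].add(item)

    def find(self, label):
        """Return items under the label or any of its sub-labels."""
        prefix = label.strip("/")
        found = set()
        for name, items in self._items_by_label.items():
            if name == prefix or name.startswith(prefix + "/"):
                found |= items
        return found


if __name__ == "__main__":
    store = LabelStore()
    store.add("https://lunrjs.com/", "dev/search", "reading-list")
    store.add("https://archiveofourown.org/", "reading-list/fiction")
    print(sorted(store.find("dev")))
    print(sorted(store.find("reading-list")))
```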


That’s awesome. Just this past week I’ve started looking for a Chrome extension that does the same.


Doesn’t that always happen? Some people will push the tools so far that they lose all their initial usefulness.


I’m not sure tags died. TikTok certainly seems to be built around tags and it has over a billion monthly users. They are also key to Instagram discovery but feel a little less important there, though I don’t care much for that platform and could be wrong.


Curated tags, including canonizing one variation of a tag and making all the others with the same meaning synonyms: https://archiveofourown.org/wrangling_guidelines/16

Of course, where a single word has two or more meanings, synonyms don't make sense, so go with Wikipedia-style disambiguation.

Also be aware if your community has specialized jargon, uses multiple human languages, patois, creole, dialects, or pidgin.

Allow multi-word tags, but settle on a single casing/separation and enforce it: camelCase, snake_case, and kebab-case are some choices.

Prefer plurals "landscapes" over singulars "landscape".

See also https://web.archive.org/web/20050426210018/http://ideant.typ...
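
The rules above (one canonical form per meaning, one enforced casing, plurals preferred) can be sketched in a few lines of Python. The synonym table and the `to_kebab` and `canonicalize` helpers are made up for illustration, and kebab-case is just one of the options listed above. This is not the Archive of Our Own wrangling implementation.

```python
import re

# Hypothetical synonym table: every known variant maps to one canonical tag.
# Singular forms point at their plural, following the "prefer plurals" rule.
SYNONYMS = {
    "landscape": "landscapes",
    "photo": "photographs",
    "photos": "photographs",
    "pics": "photographs",
}


def to_kebab(tag: str) -> str:
    """Convert camelCase, snake_case, or spaced tags to kebab-case."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", tag)  # split camelCase
    return re.sub(r"[\s_\-]+", "-", spaced.strip()).lower()


def canonicalize(tag: str, synonyms=SYNONYMS) -> str:
    """Apply the casing rule first, then collapse known synonyms."""
    key = to_kebab(tag)
    return synonyms.get(key, key)


if __name__ == "__main__":
    for raw in ["blackAndWhite", "black_and_white", "Pics", "landscape"]:
        print(raw, "->", canonicalize(raw))
```

Words with several meanings are not handled here; those would need distinct canonical tags, along the lines of the Wikipedia-style disambiguation mentioned above.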


Tagging requires mental energy and some level of abstraction prowess - and might still be misleading. Social media is geared towards making the user expend as little mental energy as possible - and then organize the information they provide anyway for the advertisers using behavioral patterns or some variant of AI. This is probably considered by the industry to provide more reliable information - it's like the difference between asking people to explain ethical behavior compared to recording what people actually do in reality.

So, we have a "tagging" model driven by advertising needs, that discourages our own need to tag (intellectually categorize) the content we consume. Instead of moving forwards, towards a more accurate tagging system that supports reflection and concept organization, it seems to me (in my pessimistic moments) that we are moving backwards into an online world where the only ones that know what we are doing are the machines.


Tagging never died.

“Tags” became known as “Labels”.

Labels are core functionality of Gmail, GitHub Issues, and more today.


I don't think tags died, they just evolved to more user friendly forms. For example, Reddit is basically tags. You post something to one subreddit, or cross-post it to another, and other users who are subscribed will all see it. Perhaps the UX was the main issue in those early days.


Also, reddit encourages specific niche subreddits to proliferate, and in doing so it inadvertently showcases (from the tag side rather than the tagged-content side) one use case where community tagging is necessary and canonical tags are inconsistent or nonexistent: the vast arena of porn. That involves attempts to categorize massive amounts of content that is frequently untagged, or tagged minimally or even erroneously by studios. It can also involve specific preferences that nobody considers worth tagging until someone starts a subreddit on some really obscure aspect of a clip, and then it suddenly turns out there's a community and demand for such a tag.

Some automated tagging solutions do work on some aspects of the tag deficiencies - performers using different stage names, for example. However, just as obscenity is defined on an "I'll know it when I see it" basis, individual and perhaps previously unnamed categorizations pop up frequently enough that there's no realistic way to anticipate every future community tag that may come about. There are also inconsistencies as to how currently-used tags are defined, and even in the generally more centralized and almost over-specific niches of the industry in Japan, with consistent and unique product codes for reference, you still don't get a single consistent studio tag system, even for their domestic mainstream market. And language certainly factors into all this, as well as culture. It's already evident that some tags translate and some simply don't, either because there is no word for it in the other language, or the categorization loses some aspect of cultural significance in the translation process that makes the end result valid but also nonsensical. Some degree of community curation is needed to augment even a relatively consistent, centralized, and comprehensive canonical source.

There are a few projects on Github that are hashing out compatible systems for at least the English language (and it appears that projects in Korean and Chinese exist too, also on Github). This is definitely an arena that is organic, disorganized, and even if in the future can mostly be automated, will always have room for community curation, and is actually actively being worked out and evolving in real time. Tags, or community curation at large, will likely persist as content, the market, culture, CV/classification tech, and mores change as time goes on. Definitely not dying.


Tagging has enormous labor costs associated with it and people don't want to do it. Creation, accuracy confirmation, maintenance/updates, removal.

Your users do not want to do work, and they certainly do not want to do your work for you. They're tired, they want to relax, do not make them work.


Killed (at least for me) by browser history (and then the web of course) being so easily searchable


Although that might have been the case some time ago, nowadays browser history is definitely not easily searchable unless you only need results from the last 90 days. Chrome purges them after that, and Firefox has a similar limit.


I just have my history config set to:

Firefox will "remember history"

and my history goes back to 2019 when i started this new firefox profile.


All you need is links. Tags are just links to pages which don't exist. https://news.ycombinator.com/item?id=30915520


If that's the approach you want to take, tags are links to a set of pages you've specifically defined, whether they exist or not being largely irrelevant.

The term in library science is a controlled vocabulary.

Search allows a document (or its author(s)) to define itself. A controlled vocabulary allows another party --- yourself or a third party --- to define a set of terms which describe the contents.

There are various reasons why you might not want the document's author themselves to rule over such terms as "brilliant", "fascist", or the like. Some degree of distance may be required.

Authors are also famously unreliable narrators.


People who tell you that social tagging doesn't work would have told you that Wikipedia can't work.

Social tagging can probably work, but we haven't found a way yet.


Also, what happened to the "semantic web"?

(what, you cynical person, are you insinuating a profit-addled industry would stoop to misleading tags? oh, ye of little faith ...)


Apple Notes tagging is pretty awful.


OTOH, Federico Viticci (MacStories) just switched to using Reminders and tags with smart lists. I haven't attempted this yet, but I think the gist is that Reminders/Notes require tags in order to get smart folders.

https://club.macstories.net/posts/going-all-in-with-reminder...



That's really shallow. A nostalgia dive, not an analysis. It isn't that "tagging worked, but now it doesn't", it just never really worked for her use cases, and it still does for use cases it can be used for. Maybe I would be more sympathetic, because I also feel that "it was better back then" (who doesn't?), if it wasn't for people like her using phrases like "exploring Web 2.0" and so obviously not understanding what really happened.

And what really happened is this. A bunch of somewhat similar people (a tiny group, really, compared to the population of the world, anyway) were playing a cute tabletop game of "Web 2.0" at the table set up out in the open fields, pretending they were eager for more people to join. But they weren't exploring this game, weren't really thinking about all the nuances and implications its rules had, and weren't actually preparing for more people to join. Instead, they implicitly agreed to use their own impromptu simplified version of the rules, which wasn't that hard to do for this tiny group of like-minded people. And this naturally broke when more people and corporations and governments joined the game, and started playing it being limited only by the actual rules of the game, which are akin to the laws of physics in that they really have to be explored and have an immense number of implications.

So, what role does tagging have in all of this? What is its true nature? Sadly, its true nature is "a hack", an artificial construct that gives you a one-inch leeway to barely slide across your problem. The problem is that it's a bit too hard to search for content using natural language descriptions of the content, because there is a lot of content, and there are a lot of words in human languages, and they all intertwine in all sorts of ways, and people use the same words to mean something other than what you meant, and they might use some other synonymous words to describe what you meant, and so on. It is alright as long as there are a hundred people and thousands of posts, and it's all text without pictures, but when it grows it becomes just unmanageable and the full-text search doesn't work anymore. And you have no idea how to approach it other than full-text search. So you agree to use a simplified version of language in addition to the original language to mark what your content really is about, so that it's a bit less ambiguous and you can continue to use full-text search on all of the contents of the internet. That gives you just enough leeway to reduce the number of search results to something manageable and you are happy. And you can even automatically feed the search results to your RSS.

And 15 years later it all grows more still, and the simplified tagging-language gets more complex and more abused (or underused), so that full-text search stops working for everyone again. People stop using your tagging-language because it doesn't help them any more than the original content does; it's useless and unnecessary, so they switch to other approaches to find content. Those are also imperfect, by the way, but at least they are actually different approaches, not just some hacks on top of full-text search.

The real lesson here is: "tagging doesn't work out in the open fields; use it in small communities with curated content, like, for example, a blogging channel of your own, so that your readers can find your other posts on the same subject". And (surprise!) people do use it for that quite successfully.



Thanks, that's a vastly better article than the submitted one.

I'd read it before, it's well worth revisiting.



