
The article's cherry-picked examples hardly support its conclusion. The core reason these results don't meet the author's expectations, though, is that Google's AI does not understand the content of webpages well enough to identify the publication date accurately (or at least nowhere near as accurately as a human can). Google's publication date is based either on whether it found changes to the HTML on its own crawl date (a very noisy signal, given today's dynamically generated websites) or on schema.org/microdata, which, as other commenters point out, is gameable for SEO purposes or simply missing on most sites.
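To make the metadata point concrete, here is a rough sketch of what trusting declared markup looks like. It is not Google's actual pipeline; the class name, function name, and sample page are all made up for illustration. It only reads the schema.org `datePublished` field from JSON-LD blocks, using the Python standard library:

```python
import json
from html.parser import HTMLParser


class JsonLdDateParser(HTMLParser):
    """Hypothetical helper: collects datePublished values from JSON-LD scripts."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.dates = []

    def handle_starttag(self, tag, attrs):
        # Only look inside <script type="application/ld+json"> blocks.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                doc = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # malformed metadata is skipped, not repaired
            items = doc if isinstance(doc, list) else [doc]
            for item in items:
                if isinstance(item, dict) and "datePublished" in item:
                    self.dates.append(item["datePublished"])


def declared_publication_dates(html):
    """Return whatever dates the page declares about itself, unverified."""
    parser = JsonLdDateParser()
    parser.feed(html)
    parser.close()
    return parser.dates


# Made-up page: the markup claims 2019, the visible text says 2012.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@type": "Article", "headline": "Old post", "datePublished": "2019-05-01"}
</script>
</head><body><p>Posted on 3 March 2012</p></body></html>
"""

print(declared_publication_dates(SAMPLE_PAGE))  # ['2019-05-01']
```

A reader like this reports whatever the site declares and never looks at the date a human would see on the rendered page, which is why the field is easy to game and why pages without the markup yield nothing at all.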

By contrast, take a look at how Diffbot, an AI system that understands the content of a page by applying computer vision and NLP techniques to it, interprets the page in question:

https://www.diffbot.com/testdrive/?url=https://www.reddit.co...

It can reliably extract the publication date of each post without resorting to site-specific rules. (You can try it on other discussion threads and article pages that have a visible publication date.)


