Any question about the letters of words is bound to underwhelm, because GPT-3 is trained on sub-word tokens, so it does not have random access to individual letters. The word "prime" is tokenized as a single token, not as five characters:
GPT: There is no word "smorgle" in the English language.
Me: I never said it was English.
GPT: I apologize for assuming that "smorgle" was a word in the English language. Without any further context or information about the word "smorgle," I am unable to determine how many letters it has or provide any additional information about it.
Me: Count the letters
GPT: If you would like me to count the letters in the word "smorgle," it has 7 letters.
My mental model is that if you give it real words, it uses roughly one token per word, and it may or may not know how many letters are in the word: it will have learned that only if the information was in its training data, just like any other fact it learns about words. It is not counting the letters.
If you give it a gibberish word, it will be broken into many short tokens, often just a letter or two each, and the model can more or less count those tokens in order to figure out how many letters there are.
So this ends up looking like it can count letters in most words, real and fake. Perhaps it would do poorly with real but uncommon words.
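You can see the split for yourself with a tokenizer library. This is just a sketch of how one might check it, assuming OpenAI's open-source tiktoken package is installed; the choice of the GPT-3-era "r50k_base" encoding is my assumption, and the exact splits differ per model.

    # Inspect how a real word vs. a gibberish word is tokenized.
    # Requires `pip install tiktoken`; encoding name is an assumption.
    import tiktoken

    enc = tiktoken.get_encoding("r50k_base")

    for word in [" prime", " smorgle"]:
        ids = enc.encode(word)
        pieces = [enc.decode([i]) for i in ids]
        print(f"{word!r}: {len(ids)} token(s) -> {pieces}")

    # Illustrative output (not a verified run):
    #   ' prime':   1 token, [' prime']
    #   ' smorgle': several short tokens, e.g. [' sm', 'org', 'le']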
> it does not have random access to individual letters
This presumes it works by understanding the components of the question and reasoning about them. But it doesn't go down to that level; it just guesses the most likely next word based on statistical patterns, so it doesn't need to "know" about letters to generate a reasonable response involving letters.
Hi, founder of Diffbot here. We are an AI research company spun out of Stanford that generates the world's largest knowledge graph by crawling the whole web. I didn't want to comment, but I see a lot of misunderstanding here about knowledge graphs, abstract representations of language, and the extent to which this project uses ML.
First of all, having a machine-readable database of knowledge (i.e. Wikidata) is no doubt a great thing. It's maintained by a large community of human curators and is always growing. However, generating natural language from an abstract representation that is actually useful, rivaling the value you get from reading a Wikipedia page, is problematic.
If you look at the walkthrough of how this would work (https://github.com/google/abstracttext/blob/master/eneyj/doc...), this project does not use machine learning; it uses CFG-like production rules to generate natural-language sentences. That works great for generating toy sentences like "X is a Y".
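To make "CFG-like production rules" concrete, here is a tiny illustration of the flavor of rule-based generation being described. The grammar and lexicon are made up by me; this is not the eneyj ruleset.

    # Minimal sketch of production-rule sentence generation ("X is a Y").
    # Grammar, symbols, and vocabulary are all illustrative.
    import random

    GRAMMAR = {
        "S":  [["NP", "is a", "N"]],
        "NP": [["Berlin"], ["Helsinki"]],
        "N":  [["city"], ["capital"]],
    }

    def expand(symbol: str) -> str:
        """Recursively expand a symbol using a randomly chosen production."""
        if symbol not in GRAMMAR:          # terminal string
            return symbol
        production = random.choice(GRAMMAR[symbol])
        return " ".join(expand(s) for s in production)

    print(expand("S") + ".")   # e.g. "Berlin is a city."

This looks tidy until you need case, gender, or agreement: then every template needs per-language morphology hooks, which is exactly where the rule count explodes.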
However, human languages are not programming languages. Many natural languages, like German and Finnish, are so syntactically and morphologically complex that there is no compact ruleset that can describe them. (Anyone who has taken a grammar class can relate to the number of exceptions to any such ruleset.)
Additionally, not every sentence in a typical Wikipedia article can be easily represented in a machine-readable factual format. Plenty of text is opinion, is subjective, or describes notions that don't map to a proper entity. Of course there are ways to engineer around this, but they will exponentially grow the complexity of your ontology and the number of properties, and make for a terrible user experience for the annotators.
A much better and more direct approach to the stated goal of making knowledge accessible to more readers is to advance the state of machine translation, which would capture the nuance and non-factual content present in the original article. Additionally, exploring ML-based approaches to natural-language generation from the dataset this will produce would have academic impact.
> Many natural languages, like German and Finnish, are so syntactically and morphologically complex that there is no compact ruleset that can describe them. (...)
> Additionally, not every sentence in a typical Wikipedia article can be easily represented in a machine-readable factual format.
It doesn't seem like the goal of this project is to describe those languages in full, or to represent every sentence in a typical Wikipedia article. The goal doesn't seem to be to have all Wikipedia articles generated from Wikidata, but rather to have a couple of templates along the lines of "if I have this data available about this type of subject, generate this stub article about it". That would allow the smaller Wikipedia language editions to automatically generate many baseline articles they might not currently have.
For example, the Dutch Wikipedia is one of the largest editions mainly because a large percentage of its articles were created by bots [1] that generated a lot of articles on small towns ("x is a town in the municipality of y, founded in z. It is near m, n and o.") and obscure species of plants. This just seems like a more structured plan to apply that approach to many of the smaller Wikipedias that may be missing a lot of basic articles and are thus not exposing many basic facts. A rough sketch of that kind of bot workflow is below.
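As a sketch of what those bots do, here is one way stub articles could be filled from structured records for more than one language edition. The record fields, templates, and example values are hypothetical, not taken from any actual Wikipedia bot.

    # Rough sketch of bot-style stub generation from structured data,
    # in the spirit of the Dutch town articles. Everything here is
    # illustrative: field names, templates, and the example record.
    TEMPLATES = {
        "en": "{name} is a town in the municipality of {municipality}, founded in {founded}.",
        "nl": "{name} is een plaats in de gemeente {municipality}, gesticht in {founded}.",
    }

    record = {"name": "Exampletown", "municipality": "Examplegemeente", "founded": 1900}

    def stub(lang: str, data: dict) -> str:
        """Render a baseline article for one language edition."""
        return TEMPLATES[lang].format(**data)

    for lang in TEMPLATES:
        print(stub(lang, record))

The same structured record yields a baseline article in every language that has a template, which is the whole appeal for small editions.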
This is addressed in the white paper describing the project's architecture:
10.2 Machine translation

Another widely used approach (mostly for readers, much less for contributors) is the use of automatic translation services like Google Translate. A reader finds an article they are interested in and then asks the service to translate it into a language they understand. Google Translate currently supports about a hundred languages, about a third of the languages Wikipedia supports. Also, the quality of these translations can vary widely, and almost never achieves the quality a reader expects from an encyclopedia [33, 86].

Unfortunately, the quality of the translations often correlates with the availability of content in the given language [1], which leads to a Matthew effect: languages that already have larger amounts of content also feature better results in translation. This is an inherent problem with the way Machine Translation is currently trained, using large corpora. Whereas further breakthroughs in Machine Translation are expected [43], these are hard to plan for.

In short, relying on Machine Translation may delay the achievement of the Wikipedia mission by a rather unpredictable time frame.

One advantage Abstract Wikipedia would lead to is that Machine Translation systems can use the natural language generation system available in Wikilambda to generate high-quality and high-fidelity parallel corpora for even more languages, which can be used to train Machine Translation systems which then can resolve the brittleness a symbolic system will undoubtedly encounter. So Abstract Wikipedia will increase the speed at which Machine Translation becomes better and covers more languages.
Additionally, of course, Google Translate is a proprietary service from Google, and Wikimedia projects can't integrate it in any way without abandoning their principles. It's left to the reader to enter pages into Google Translate themselves, and that will only work as long as Google keeps providing the service.
What is the quality of open source translation these days?
>> Many natural languages, like German and Finnish, are so syntactically and morphologically complex that there is no compact ruleset that can describe them.
Is that really true? If natural languages have rules, then there exists a ruleset that can describe any natural language: the set of all rules in that language. Of course, a "rule" is a compact representation of a set of strings, so if natural languages don't have such rules it's difficult to see how any automated system could represent a natural language "compactly". A system without any kind of "rules" would have to store every grammatical string in the language. That must be impossible, in theory and in practice.
If I may offer a personal perspective, I think the goal of the plan is to produce better automated translations than is currently possible with machine translation between language pairs for which there are very few parallel texts. My personal perspective is that I'm Greek, and I am sad to report that translation from basically any language to Greek by e.g. Google Translate (which I use occasionally) is laughably, cringe-inducingly bad. From what I understand, the reason is not only the morphology of the Greek language, which is something of a linguistic isolate (as opposed to, say, the Romance languages), but also that, because there are not many parallel texts between most languages and Greek (on Google Translate), the translation goes through English, which results in completely distorted syntax and meaning. Any project that can improve on this sorry state of affairs (and not just for Greek; there are languages with far fewer speakers and no parallel texts at all, not even with English) is worth every second of its time.
To put it plainly, if you don't have enough data to train a machine learning model, what, exactly, are your options? There is only one option: to do the work by hand. Wikipedia, with its army of volunteers, has a much better shot at getting results this way than any previous effort.
> To put it plainly, if you don't have enough data to train a machine learning model, what, exactly, are your options? There is only one option: to do the work by hand. Wikipedia, with its army of volunteers, has a much better shot at getting results this way than any previous effort.
The training data for machine translation models is also human-created. Given some fixed number of human hours, would you rather they be spent annotating text that can train a translation system that can be used for many things, or a system that can only be used for this project? It all depends on the yield you get per man-hour.
As the white paper quoted above says, the system that would result from this project could be re-used in many other tasks, one of which is generating data for machine translation algorithms.
I think this makes sense. The project aims to create a program, basically ("a set of functions"). There are, intuitively, more uses for a program than for a set of labelled data.
> Of course there are ways to engineer around this, but they will exponentially grow the complexity of your ontology and the number of properties, and make for a terrible user experience for the annotators.
So, the obvious solution is to create robo-annotators, and that's what your company is supposedly trying to do?
The article hardly supports its conclusion with these cherry-picked examples; however, the core reason these results don't meet the author's expectations is that Google's AI does not understand the content of webpages well enough to identify the publication date accurately (at least not nearly as accurately as a human can). Google's publication date is based either on whether it found changes to the HTML between its own crawls (which is very noisy given today's dynamically generated websites) or on schema.org microdata, which, as other commenters point out, is game-able for SEO purposes, or simply missing on most sites.
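To show why the metadata route is fragile, here is a minimal sketch of pulling schema.org's datePublished out of a page's JSON-LD block. This is not Google's or Diffbot's pipeline, just an illustration of a field that the publisher controls and that most pages simply don't declare.

    # Extract schema.org datePublished from embedded JSON-LD, if present.
    # Purely illustrative; real extraction pipelines are far more involved.
    import json
    import re

    def date_published(html: str):
        """Return the first datePublished found in a JSON-LD block, or None."""
        for block in re.findall(
            r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
            html, flags=re.DOTALL,
        ):
            try:
                data = json.loads(block)
            except json.JSONDecodeError:
                continue
            items = data if isinstance(data, list) else [data]
            for item in items:
                if isinstance(item, dict) and "datePublished" in item:
                    return item["datePublished"]
        return None  # missing on most sites

    page = '<script type="application/ld+json">{"@type":"Article","datePublished":"2015-01-01"}</script>'
    print(date_published(page))  # "2015-01-01", whatever the publisher chose to claim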
As a contrast, take a look at how Diffbot, an AI system that understands the content of the page by applying computer vision and NLP techniques to it, interprets the page in question.
It can reliably extract the publication date of each post without resorting to site-specific rules. (You can try it on other discussion threads and article pages that have a visible publication date.)