
Open source LLM that outperforms ChatGPT search mode, Gemini, and Perplexity


Here's a good way to identify how entity-dense your text is: https://demo.nl.diffbot.com/


Much easier to exit than standard vim!


Any question asking about the letters of words is bound to underwhelm because GPT3 is trained on sub-word tokens, so it does not have random access to individual letters. The word "prime" is tokenized as a single token, instead of 5 characters:

https://beta.openai.com/tokenizer
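
A quick way to see this locally (my own sketch, not from the linked page) is the open-source tiktoken library, which exposes the GPT-2-era BPE vocabulary:

    # pip install tiktoken
    import tiktoken

    enc = tiktoken.get_encoding("gpt2")  # GPT-3-era byte-pair encoding

    # Leading spaces, because that's how words usually appear mid-sentence.
    for word in [" prime", " smorgle"]:
        tokens = enc.encode(word)
        pieces = [enc.decode([t]) for t in tokens]
        print(f"{word!r}: {len(tokens)} token(s) -> {pieces}")

On my understanding of that vocabulary, a common word like "prime" comes out as a single token, while a made-up word gets chopped into a few sub-word pieces, which is exactly why letter-level questions are hard for the model.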


I’ve seen this said before, but it’s quite clear from the responses that it understands that prime has 5 letters.


Now try a gibberish word. A human could count the number of letters but ChatGPT can only approximate it.


Me: How many letters are in the word “smorgle” ?

GPT: There is no word "smorgle" in the English language.

Me: I never said it was English.

GPT: I apologize for assuming that "smorgle" was a word in the English language. Without any further context or information about the word "smorgle," I am unable to determine how many letters it has or provide any additional information about it.

Me: Count the letters

GPT: If you would like me to count the letters in the word "smorgle," it has 7 letters.


I’m so trained to double-check anything it says that I quickly confirmed that smorgle does indeed have 7 letters.


Me: I’ve made up a word called “mlaizapf”. Can you count the number of letters?

ChatGPT: Sure! "Mlaizapf" has 8 letters.

--

Me: How about this word: “spoofydoofaplixader”

ChatGPT: "Spoofydoofaplixader" has 18 letters.

--

Me: How about the following words: "ghottiaopepinopepotorzey", "zazzamataz", and "maaq"

GPT: Here are the number of letters in each word:

* "Ghottiaopepinopepotorzey" has 23 letters.

* "Zazzamataz" has 10 letters.

* "Maaq" has 4 letters.

Is there anything else I can help with?

------

It got all of them correct except for "ghottiaopepinopepotorzey" which has 24 letters, not 23.

Very interesting... It seems similar to its math abilities, where it struggles with bigger numbers or more complex problems.


I asked it a bunch of gibberish words and it got them all correct.


My mental model is that if you give it real words, it uses approximately one token per word, and it may or may not know how many letters the word has - it will have learned that only if the information was in its training data, just like any other fact it learns about words. It is not counting the letters.

If you give it a gibberish word, it will represent it with roughly one letter (or a few) per token and actually be able to more or less count tokens in order to figure out how many letters there are.

So this ends up looking like it can count letters in most words, real and fake. Perhaps it would do poorly with real but uncommon words.
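
If you want to poke at this mental model yourself, here's a small sketch (mine; the word list is arbitrary) that compares letter counts with token counts using the open-source tiktoken library:

    # pip install tiktoken
    import tiktoken

    enc = tiktoken.get_encoding("gpt2")

    # A common word, an uncommon real word, and two gibberish words.
    for word in [" prime", " borborygmus", " smorgle", " mlaizapf"]:
        pieces = [enc.decode([t]) for t in enc.encode(word)]
        print(f"{word.strip()!r}: {len(word.strip())} letters, "
              f"{len(pieces)} token(s) {pieces}")

Comparing the two counts shows how indirect the model's view of individual characters is: even for gibberish it usually sees fewer tokens than letters, so any letter count it gives is an inference rather than a character-by-character tally.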


>more or less count tokens

Which is what I meant by saying "approximate" because it can "count" the number of tokens.


> it does not have random access to individual letters

This presumes it works by understanding the components of the question and reasoning about them. But it doesn't access things at that level; it just guesses the most likely next word based on statistical patterns. So it doesn't need to "know" about letters to generate a reasonable response involving letters.


What do you think hidden layers do?


Not familiar with that - what is it?





MediaWiki, the software that Wikipedia uses, is open source, and Wikibase, the software that Wikidata uses, is also open source.


Hi, founder of Diffbot here. We are an AI research company spun out of Stanford that generates the world's largest knowledge graph by crawling the whole web. I didn't want to comment, but I see a lot of misunderstandings here about knowledge graphs, abstract representations of language, and the extent to which this project uses ML.

First of all, having a machine-readable database of knowledge (i.e. Wikidata) is no doubt a great thing. It's maintained by a large community of human curators and is always growing. However, generating genuinely useful natural language from an abstract representation - language that rivals the value you get from reading a Wikipedia page - is problematic.

If you look at the walkthrough for how this would work (https://github.com/google/abstracttext/blob/master/eneyj/doc...), this project does not use machine learning; it uses CFG-like production rules to generate natural sentences. That works great for generating toy sentences like "X is a Y".
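
To make concrete what I mean by production rules, here is a toy sketch of my own (not code from the project; the entity record and rule are made up):

    from dataclasses import dataclass

    @dataclass
    class Entity:
        label: str
        instance_of: str
        country: str

    def describe(e: Entity) -> str:
        # A production rule of the form "X is a Y in Z".
        article = "an" if e.instance_of[0].lower() in "aeiou" else "a"
        return f"{e.label} is {article} {e.instance_of} in {e.country}."

    print(describe(Entity("Jyväskylä", "city", "Finland")))
    # -> Jyväskylä is a city in Finland.

Rules like this are fine for stub sentences, but they break down as soon as the target language needs case marking, agreement, or word-order changes that depend on the values being filled in.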

However, human languages are not programming languages. Many natural languages, like German and Finnish, are so syntactically and morphologically complex that there is no compact ruleset that can describe them. (Those who have taken a grammar class can relate to the number of exceptions to any such ruleset.)

Additionally, not every sentence in a typical Wikipedia article can be easily represented in a machine-readable factual format. Plenty of text is opinion, is subjective, or describes notions that don't map to a proper entity. Of course there are ways to engineer around this, but they will exponentially grow the complexity of your ontology and number of properties, and make for a terrible user experience for the annotators.

A much better and more direct approach to the stated intention of making this knowledge accessible to more readers is to advance the state of machine translation, which would capture the nuance and non-factual content present in the original article. Additionally, exploring ML-based approaches to natural language generation from the dataset this project produces would have academic impact.


> Many natural languages, like German and Finnish, are so syntactically and morphologically complex that there is no compact ruleset that can describe them. (...)

> Additionally, not every sentence in a typical Wikipedia article can be easily represented in a machine-readable factual format.

It doesn't seem like the goal of this project is to describe those languages in full, or to represent every sentence in a typical Wikipedia article. The goal doesn't seem to be to have all Wikipedia articles generated from Wikidata, but rather to have a couple of templates along the lines of "if I have this data available about this type of subject, generate this stub article about it". That would allow the smaller Wikipedia language editions to automatically generate many baseline articles that they might not currently have.

For example, the Dutch Wikipedia is one of the largest editions mainly because a large percentage of its articles were created by bots [1] that produced a lot of articles on small towns ("x is a town in the municipality of y, founded in z. It is near m, n and o.") and obscure species of plants. This just seems like a more structured plan to apply that approach to many of the smaller Wikipedias that may be missing a lot of basic articles and are thus not exposing many basic facts.

[1] https://en.wikipedia.org/wiki/Dutch_Wikipedia#Internet_bots


This is addressed in the white paper describing the project's architecture:

10.2 Machine translation

Another widely used approach —mostly for readers, much less for contributors— is the use of automatic translation services like Google Translate. A reader finds an article they are interested in and then asks the service to translate it into a language they understand. Google Translate currently supports about a hundred languages — about a third of the languages Wikipedia supports. Also the quality of these translations can vary widely — and almost never achieves the quality a reader expects from an encyclopedia [33, 86].

Unfortunately, the quality of the translations often correlates with the availability of content in the given language [1], which leads to a Matthew effect: languages that already have larger amounts of content also feature better results in translation. This is an inherent problem with the way Machine Translation is currently trained, using large corpora. Whereas further breakthroughs in Machine Translation are expected [43], these are hard to plan for.

In short, relying on Machine Translation may delay the achievement of the Wikipedia mission by a rather unpredictable time frame.

One advantage Abstract Wikipedia would lead to is that Machine Translation systems can use the natural language generation system available in Wikilambda to generate high-quality and high-fidelity parallel corpora for even more languages, which can be used to train Machine Translation systems which then can resolve the brittleness a symbolic system will undoubtedly encounter. So Abstract Wikipedia will increase the speed at which Machine Translation will become better and cover more languages.

https://arxiv.org/abs/2004.04733

(There's more discussion of machine learning in the paper, but I'm quoting the section on machine translation in particular.)


Additionally, of course, Google Translate is a proprietary service from Google, and Wikimedia projects can't integrate it in any way without abandoning their principles. It's left to the reader to enter pages into Google Translate themselves, and that only works as long as Google keeps providing the service.

What is the quality of open source translation these days?


State of the art is always open source in MT.


>> Many natural languages, like German and Finnish, are so syntactically and morphologically complex that there is no compact ruleset that can describe them.

Is that really true? If natural languages have rules, then there exists a ruleset that can describe any natural language: the set of all rules in that language. Of course, a "rule" is a compact representation of a set of strings, so if natural languages don't have such rules it's difficult to see how any automated system could represent a natural language "compactly". A system without any kind of "rules" would have to store every grammatical string in the language, which must be impossible both in theory and in practice.

If I may offer a personal perspective, I think that the goal of the plan is to produce better automated translations than are currently possible with machine translation between language pairs for which there are very few parallel texts. My personal perspective is that I'm Greek, and I am sad to report that translation from basically any language to Greek by e.g. Google Translate (which I use occasionally) is laughably, cringe-inducingly bad. From what I understand, the reason for that is not only the morphology of the Greek language, which is something of a linguistic isolate (as opposed to, say, the Romance languages), but also that, because there are not many parallel texts between most languages (on Google Translate) and Greek, the translation goes through English, which results in completely distorted syntax and meaning. Any project that can improve on this sorry state of affairs (and not just for Greek - there are languages with many fewer speakers and no parallel texts at all, not even with English) is worth every second of its time.

To put it plainly, if you don't have enough data to train a machine learning model, what, exactly, are your options? There is only one option: to do the work by hand. Wikipedia, with its army of volunteers, has a much better shot at getting results this way than any previous effort.


> To put it plainly, if you don't have enough data to train a machine learning model, what, exactly, are your options? There is only one option: to do the work by hand. Wikipedia, with its army of volunteers, has a much better shot at getting results this way than any previous effort.

The training data for machine translation models is also human-created. Given some fixed amount of human hours, would you rather have them spent annotating text that can train a translation system usable for many things, or building a system that can only be used for this project? It all depends on the yield you get per man-hour.


As the paper I quote below says, the system that would result from this project could be re-used in many other tasks, one of which is generating data for machine translation algorithms.

I think this makes sense. The project aims to create a program, basically ("a set of functions"). There are, intuitively, more uses for a program than for a set of labelled data.


> Of course there are ways to engineer around this, but they will exponentially grow the complexity of your ontology and number of properties, and make for a terrible user experience for the annotators.

So, the obvious solution is to create robo-annotators, and that's what your company is supposedly trying to do?


The article hardly supports its conclusion with these cherry-picked examples; still, the core reason these results don't meet the author's expectations is that Google's AI does not understand the content of webpages well enough to identify the publication date accurately (at least not anywhere near as accurately as a human can). Google's publication date is based either on whether it found changes to the HTML between its own crawl dates (which is very noisy given today's dynamically generated websites) or on schema.org/microdata, which, as other commenters point out, is game-able for SEO purposes or simply missing on most sites.
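
To illustrate the kind of signal Google falls back on, here is a rough standard-library sketch of my own (the sample HTML is made up) that pulls datePublished out of a page's schema.org JSON-LD - a value the publisher is free to set to whatever date suits them:

    import json
    import re

    def schema_org_dates(html: str) -> list[str]:
        """Collect datePublished values from application/ld+json blocks."""
        dates = []
        for match in re.finditer(
            r'<script[^>]*application/ld\+json[^>]*>(.*?)</script>',
            html, re.DOTALL | re.IGNORECASE,
        ):
            try:
                data = json.loads(match.group(1))
            except json.JSONDecodeError:
                continue
            items = data if isinstance(data, list) else [data]
            for item in items:
                if isinstance(item, dict) and "datePublished" in item:
                    dates.append(item["datePublished"])
        return dates

    sample = """<script type="application/ld+json">
    {"@type": "NewsArticle", "datePublished": "2019-05-06T08:00:00Z"}
    </script>"""
    print(schema_org_dates(sample))  # ['2019-05-06T08:00:00Z']

Since this metadata is self-reported (and often absent entirely), a crawler that relies on it, or on diffing its own crawl snapshots, will keep producing the kinds of wrong dates the article complains about.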

As a contrast, take a look at how Diffbot, an AI system that understands the content of the page using computer vision and NLP techniques, interprets the page in question:

https://www.diffbot.com/testdrive/?url=https://www.reddit.co...

It can reliably extract the publication date of each post without resorting to site-specific rules. (You can try it on other discussion threads and article pages that have a visible publication date.)

