Mistral Saba (mistral.ai)
145 points by stephen37 5 months ago | 21 comments


I wonder why they grouped languages from the Middle East and South Asia together. Arabic and Hebrew are Semitic languages; no language from that family tree is native to the subcontinent. It would make sense if northern languages like Hindi, Urdu, Bengali, Nepali, etc. were grouped with Persian, French, Russian, etc., since those are all from the Indo-European family. South Indian languages like Telugu and Tamil are from a completely different family (Dravidian).

Why not either train the model exclusively on Semitic languages for further performance for those languages or on a wider set of languages for better multilingual performance overall? I don't understand the logic here.


There are far more speakers of Arabic in Germany (1.4M) [1] than in Afghanistan (420K) [2].

So properly speaking, they should be advertising the target region as Europe, Middle East and Africa. [3]

[1] https://en.wikipedia.org/wiki/Languages_of_Germany [2] https://en.wikipedia.org/wiki/Languages_of_Afghanistan [3] https://en.wikipedia.org/wiki/List_of_countries_and_territor...


There are a lot of Indian laborers in the Middle East, so it’s not that Tamil and Arabic are related, but a model used for that region should be fluent in both.


Not sure if you've been to the Middle East, but there's no way the labourers will have access to the internet besides their phones. And those phones can only be used to communicate with their loved ones back home, using WhatsApp.

They don't (or rather CAN'T) care about anything else in the world.

They have a lot more problems than "this model doesn't convert Urdu to Arabic well".


WhatsApp is a gateway to many things. Meta's AI is on WhatsApp. OpenAI is on WhatsApp https://www.hindustantimes.com/technology/whatsapp-users-can... and I would really expect this model to have a WhatsApp gateway as well.

> They have a lot more problems than "this model doesn't convert Urdu to Arabic well".

I get what you mean, but I'm not sure what point you're trying to make. That they're a lost cause with too many problems and we shouldn't care about that use case? Why wouldn't we want to create models that provide more capabilities and information regardless?


They still need to communicate with local administration via a "citizen app" (whatever it is called) to access any service, pay their fines, etc. (and be tracked). I'm guessing the stakeholder in this project is the government of Qatar.


I'm surprised Malayalam (from Kerala state) was not first if cultural exchange was the metric. Middle East natives often ask if somebody is from Kerala or (the rest of) India. Malabar traded with the Middle East for millennia (now cash crops, trades and skilled laborers, medical tourism), Malayalam borrows many words from Arabic, and there is even an Arabic script for Malayalam.


There are going to be diminishing returns in splitting up the languages: you get less information related to the region or concept just because you're avoiding mixing languages. Language was not the only aspect: "cultural background, and in-depth regional knowledge". There's going to be lots of information shared between the southern and northern languages just because of the (relatively) close geographic distance.

I mean, you wouldn't want to split a model into three separate ones, where one contains Austrian German, another Slovak, and another Hungarian, since there's going to be lots of cultural overlap.


I agree that it makes sense to group the Indic languages together due to cultural proximity, but why would you group the Indic languages with Middle Eastern ones? Might as well group them with European or African or Sinitic languages at that point.


> I wonder why they grouped languages from the Middle East and South Asia together

Geography



That makes a lot more sense than the Saba I initially thought of [1]. The "specific geographies" seemed a bit overly specific.

[1]: https://en.wikipedia.org/wiki/Saba_(island)


> Mistral Saba is a 24B parameter model trained on meticulously curated datasets from across the Middle East and South Asia.


It's great to see a proliferation of models in other languages!

Shoutout to Alif, a finetune of Llama 3 8B on Urdu datasets: https://huggingface.co/large-traversaal/Alif-1.0-8B-Instruct

It'd be great to see a comparison.
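
If anyone wants to run that comparison, loading the finetune through the standard transformers chat API is enough to get started. A minimal sketch, assuming the repo ships a chat template (the model id is from the link above, the Urdu prompt is just illustrative, and device_map="auto" assumes accelerate is installed):

    # Hypothetical starting point for a side-by-side comparison: load the
    # Alif finetune with the standard transformers chat API.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "large-traversaal/Alif-1.0-8B-Instruct"  # from the link above
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    # Illustrative Urdu question: "What is the capital of Pakistan?"
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": "پاکستان کا دارالحکومت کیا ہے؟"}],
        tokenize=False, add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))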


That's interesting. It would be interesting to compare how this fares against Fanar (Arabic-oriented models) [1]. I got access to their API last week but still haven't played with it. I think they did a pretty good job with Arabic dialects [2]. I don't know if they have any plans to release weights, though. There are two models: one trained from scratch and the other a fine-tune of Google's Gemma.

Saba vs. Fanar. I like the names too.

[1] https://fanar.qa/en

[2] https://arxiv.org/abs/2501.13944


Considering they don't talk about licensing, one can assume this is proprietary?

~2 years ago (Sep 27, 2023), Mistral AI said:

> we believe that an open approach to generative AI is necessary. Community-backed model development is the surest path to fight censorship and bias in a technology shaping our future. We strongly believe that by training our own models, releasing them openly, and fostering community contributions, we can build a credible alternative to the emerging AI oligopoly. Open-weight generative models will play a pivotal role in the upcoming AI revolution.

> Mistral AI’s mission is to spearhead the revolution of open models.

https://mistral.ai/en/news/about-mistral-ai

Did something change since then, or why did they have a change of heart? Are they just doing an "OpenAI" and appearing to believe in something in order to further their own cause, or is there some particular reason behind it?


> One of the many custom-trained models to serve specific geographies, markets, and customers

> Mistral Saba is a result of working closely with strategic regional customers to address very specific challenges in addressing bespoke use cases.

It seems like a customer paid them to train this model, so presumably that customer gets to decide on licensing terms.

Isn’t this Mistral’s business model? Make general-purpose models available as open source and train more specific models for their customers?


It's available on their API service, so I'd assume it's not one of the private models others pay them to create; otherwise it would be fully private to the customer. Or, if the customer wanted it released/open, then the customer would release it, not Mistral. But I might misunderstand how they operate; happy to be corrected.

Edit: Actually, it is outlined in the bottom of the post:

> we have also begun to train models for strategic customers with the power of their deep and proprietary enterprise context. These models stay exclusive and private to the respective customers. If you would like to explore custom training with Mistral AI or our applied AI offerings, please contact us.

So this is not one of those, as then it would be exclusive and private to the customer. This (Saba) is one of the models that I understood they would have released as at least "open-weights", if following their initial goals according to the early blog posts.
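
For reference, it is queryable on La Plateforme like any other hosted model. A minimal sketch against the plain chat-completions endpoint; note the model id "mistral-saba-latest" is my assumption (the post doesn't name one), so check their model docs before relying on it:

    # Minimal sketch: querying Saba via Mistral's chat-completions endpoint.
    import os
    import requests

    resp = requests.post(
        "https://api.mistral.ai/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
        json={
            "model": "mistral-saba-latest",  # assumed id, not confirmed by the post
            # Illustrative Arabic prompt: "Hello! Introduce yourself briefly."
            "messages": [{"role": "user", "content": "مرحبا! عرّف عن نفسك باختصار."}],
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])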


It says South Asia but the blog post is about Arabic. Where are the numbers on Urdu?


GPT-4o mini keeps quietly demonstrating value per cost.


Care to elaborate?

Saba: input $0.20/M tokens, output $0.60/M tokens

GPT-4o mini: input $0.15/M tokens, cached input $0.075/M tokens, output $0.60/M tokens

Sources: https://openai.com/api/pricing/ and https://mistral.ai/en/products/la-plateforme#pricing
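
To make the difference concrete, a back-of-the-envelope comparison for a made-up monthly workload (per-million-token prices from the sources above; the token volumes are purely illustrative):

    # Back-of-the-envelope cost comparison using the prices quoted above.
    PRICES = {  # (input $/M tokens, output $/M tokens)
        "mistral-saba": (0.20, 0.60),
        "gpt-4o-mini": (0.15, 0.60),
    }

    input_tokens = 10_000_000   # hypothetical monthly input volume
    output_tokens = 2_000_000   # hypothetical monthly output volume

    for model, (p_in, p_out) in PRICES.items():
        cost = input_tokens / 1e6 * p_in + output_tokens / 1e6 * p_out
        print(f"{model}: ${cost:.2f}/month")
    # mistral-saba: $3.20/month
    # gpt-4o-mini: $2.70/month

So the gap only matters at input-heavy volumes, and 4o-mini's cached-input rate widens it further for repeated prompts.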



