Fluent 1.0: a localization system for natural-sounding translations (hacks.mozilla.org)
319 points by feross on April 17, 2019 | 115 comments



One thing that I've always struggled with w.r.t. i18n is having to split messages up into often less coherent chunks in order to add things like links or tooltips or styling elements in the middle of the text, which makes it more difficult to localize messages as a holistic piece independent of the source language.

For a slightly contrived example to demonstrate this, let's say you have a string like this:

"Please click here 7 times to confirm"

Where you want to make the "click here 7 times" part look like a link by wrapping it in an <a> tag, or just style it differently using a styled <span>.

Using something like react-intl, which is what I've used in the past, you'd have to do something like this:

  <FormattedMessage
    id="confirm"
    defaultMessage={`Please { confirmLink } to confirm`}
    values={{
      confirmLink: 
        <a>
          <FormattedMessage 
            id="confirm-link"
            defaultMessage={`click here {clickCount, number} {clickCount, plural,
              one {time}
              other {times}
            }`}
            values={{ clickCount: 7 }}
          />
        </a>
    }}
  />
If some language then happens to require a completely different sentence structure, such that the "to confirm" part needs to be interleaved somewhere in the middle of the "click here 7 times" message to sound fluent, this approach would not be able to accommodate that.

I'm wondering how people generally deal with this, in React and elsewhere.


This is a great point and something that we've seen come up very often in building UIs. The good practice which we recommend to developers at Mozilla is to avoid splitting or nesting messages, because it makes it harder for translators to see the entire translation at once.

We've taken a layered approach to designing Fluent: what we're announcing today is the 1.0 of the syntax and file format specification. The implementations are still maturing towards 1.0 quality, but let me quickly describe what our current thinking is.

For JavaScript, we're working on a low-level library which implements a parser for Fluent files and offers an agnostic API for formatting translations. On top of it, we hope to see an ecosystem of glue-code libraries, or bindings, each satisfying the needs of a different use case or framework.

I've been working on one such binding library called fluent-react. It's still in its 0.x days, but it's already used in a number of Mozilla projects (e.g. in Firefox DevTools). In fluent-react translations can contain limited markup. During rendering, the markup is sanitized and then matched against props defined by the developer in the source code, in a way that overlays the translation onto the source's structure. Hence, this feature is called Overlays. See https://github.com/projectfluent/fluent.js/wiki/React-Overla....

Here's how you could re-implement your example using fluent-react. Note that the <a>'s href is only defined in the prop to the Localized component.

    <Localized
        id="confirm"
        $clickCount={7}
        a={<a href="..."></a>}
    >
        {"Please <a>click here {$clickCount ->
            [one] 1 time
           *[other] {$clickCount} times
        }</a> to confirm."}
    </Localized>
I'd love to get more feedback on ideas in fluent-react. Please feel free to reach out if you have more questions!


I've seen a few libraries that use a similar approach of parsing strings for pseudo-elements and then matching them with React elements to avoid splitting up messages, but I've always felt a lot of resistance towards adopting something like that because it means incurring the runtime cost of parsing a string for elements when you can easily have hundreds or thousands of messages being rendered at once. (Call it a premature optimization if you must, but I've been bitten enough times in the past by adopting libraries/approaches that scaled poorly performance-wise and had to pay the cost in untimely, painful refactors.)

I feel there's a fundamental impedance mismatch here because we're defining messages as strings but the rest of our UI as React components. I described here a potentially different component-oriented approach as an attempt to get rid of this impedance mismatch: https://news.ycombinator.com/item?id=19681129

I'd love to hear some thoughts on that approach from folks with more real-world experience working with i18n than I do (which is not a whole lot sadly, given the nature of the kinds of projects I've worked on in the past).


I’ve been trying to solve this problem in the resource4j library for Java, which can cache rendered strings, but in the end it’s always a memory vs. performance trade-off. I didn’t publish any artificial benchmarks, but in a couple of real projects (resource4j + Thymeleaf) the performance impact was usually negligible.


> `[one] 1 time *[other] {$clickCount} times`

How does this work for languages that have more complex pluralization rules?

E.g. in Russian it's "1 раз", "2 раза", "11 раз", "12 раз", "22 раза" and "55 раз" - the case depends on the number ending, with exceptions for 11, 12, 13 and 14.


That's a great question!

Fluent relies on Unicode Plural Rules [0] which allow us to handle all (as far as Unicode knows) pluralization rules for cardinal and ordinal (and range) categories :)

[0] http://cldr.unicode.org/index/cldr-spec/plural-rules


It's up to the localizer to define variants corresponding to the language's plural categories. For Russian, that's (one, few, many). Interestingly, this particular example could be simplified to (few, *), because "раз" is good for both 1, 5, 11, 55, etc. See https://projectfluent.org/play/?id=7d22f87c04b23b86d9f9149d5... for an example of this in action.
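For illustration, the (few, *) simplification might look roughly like this in Fluent syntax (a sketch; the message id and Russian wording are mine, not the exact playground example):

    confirm = Нажмите { $clickCount ->
        [few] { $clickCount } раза
       *[other] { $clickCount } раз
    } для подтверждения.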

Authoring tools can help here, too. Pontoon, Mozilla's translation management system, pre-populates plural variants based on the number of plural categories defined in Unicode's CLDR.



Would this also allow for translating e.g. the <a>'s `title` attribute, or e.g. an `aria-label`?


It's something that I definitely plan to add. There's even an open issue about it! https://github.com/projectfluent/fluent.js/issues/185


Great! It sounds like a very cool project to work on :)


What's stopping you from doing that now? Just add a new translation for those labels in the source document and bind it to the attribute tag.


You should take a look at js-lingui. Child components are automatically converted to symbols by way of a babel macro (no runtime parsing of complex translations).


Why fluent-react and not (in addition to) fluent-web-components? :(


One advantage I can imagine is that you can prerender the React components, outputting e.g. plain HTML. With tools like e.g. React Static, that means you can somewhat ergonomically generate different static websites for different languages, avoiding the runtime costs of looking up the correct strings.


Using the Svelte JS library (https://svelte.technology) you can have both: server-side rendered components [1] and compilation to web components (custom elements) [2].

Another advantage is that the components compile to vanilla JavaScript, so we don't rely on a runtime library to run the application.

[1] https://svelte.technology/guide#server-side-rendering

[2] https://svelte.technology/guide#custom-elements


More and more, I'm starting to think maybe plain text isn't always the best abstraction to be using for defining messages for i18n. If messages were to be defined in terms of whatever primitive you're using to build your UI (i.e. React components if you're using React, and raw html template nodes if you're working with plain html), then all of this impedance mismatch might disappear.

In the React case, a component oriented approach to i18n could maybe look something like this:

  const DefaultMessage = ({ clickCount }) => 
    <span>
      Please <a>click here {clickCount} 
      {pluralize(clickCount, {one: "time", other: "times"})}
      </a> to confirm
    </span>

  // some theoretical language that requires putting "to confirm" between 
  // "click here" and "7 times", using English for clarity
  const MessageInSomeOtherLanguage = ({ clickCount }) => 
    <span>
      Please <a>click here to confirm {clickCount}
      {pluralize(clickCount, {one: "time", other: "times"})}
      </a> 
    </span>

  // some theoretical component that renders different components 
  // based on the language and passes through props
  <FormattedComponent
    id="confirm"
    defaultMessage={DefaultMessage}
    props={{ clickCount: 7 }}
  />
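The pluralize helper in these sketches is hypothetical; a minimal version could be built on the standard Intl.PluralRules API, something like:

    // Hypothetical helper used in the sketches above: picks the form
    // matching the CLDR plural category for the given count and locale.
    function pluralize(count, forms, locale = "en") {
      const category = new Intl.PluralRules(locale).select(count);
      return forms[category] || forms.other;
    }

    pluralize(1, { one: "time", other: "times" }); // "time"
    pluralize(7, { one: "time", other: "times" }); // "times"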
This feels a lot more elegant and flexible to me. Though it would make it more difficult for non-technical folks to contribute to translations, which might not be much of a concern if your company has the resources to support localization teams in-house. Am I overlooking any other obvious downsides to this approach? Does anyone know of any libraries that offer a similar API, or have experience using a similar approach?


The problem here is tooling and workflows. Often you'll be using a SaaS product to manage translations, like Lokalise or similar products. At the end of the day, these just give you a whole bunch of strings to put in your app.

I've found this really hard. At my last place we just ate the cost (and ugliness) of including HTML in these strings and dangerously inserting them into the page.


That's always a problem, even with templates. Angular does sanitization, which is of course not ideal from a runtime performance point of view, but they probably weighed the cost-benefit and found it to be an okay trade-off.

Though if the i18n string files are present at build time, then this sanitization step could be done there.


Definitely a problem. I think what will really make Fluent take off is if someone can provide good translation tools and workflow. The syntax looks great, but for most projects, I suspect it's impractical to expect translators to write Fluent manually.


Our goal was to design a simple DSL which is easy to read and make small edits to. Copying and pasting is a powerful learning method :)

We're also working on creating a richer and more streamlined authoring experience in Pontoon, Mozilla's translation management system. You can read about the current state of Fluent support in Pontoon in my colleague's post at https://blog.mozilla.org/l10n/2019/04/11/implementing-fluent....


Cool! That looks like a great start :) I don't really do any l10n work myself, so this was really just a bystander's perspective.


It will be way easier to maintain if the l10n resources aren't mixed with the markup or code. Check this for example: https://github.com/resource4j/resource4j


This is an interesting approach, but it's not tooling independent. If you can rely on your translations only seeing use in React components, then it may be just what you need.


That's a great point. Although maybe one way to make the approach more generic is to treat them as functions that happen to return React Nodes as opposed to stateless functional components.

Then you could write an alternative set of functions that returns say Vue components or raw html templates.

It still doesn't make individual translations _always_ reusable across paradigms, but I'm not sure the impedance mismatch of working with raw strings in modern UI frameworks is worth the translation portability of that de-facto approach. And at the end of the day, the only translation functions you'd have to duplicate are the ones that return more than raw strings, so it's not the end of the world.


From an accessibility point of view, it's also recommended to avoid links that span only such half-sentences.

Screen reader users will often navigate your page by cycling through the links that are on the page and then they'll get only the link-text read out, not the surrounding text.


This is how I handle those cases in Vue:

    clickHere: {
      one: 'Please',
      two: {
        singular: ' click here {0} time ',
        plural: ' click here {0} times '
      },
      three: 'to continue',
      link: 'google.com/en/'
    }

    <span>
      {{ this.$i18n('clickHere.one') }}
      <a href="{{ this.$i18n('clickHere.link') }}">
        {{ this.count > 1
          ? this.$i18n('clickHere.two.plural', this.count)
          : this.$i18n('clickHere.two.singular', this.count) }}
      </a>
      {{ this.$i18n('clickHere.three') }}
    </span>

Don't forget to localize your links! English might not need it, but many languages will eventually point to a different URL.


I'm currently fighting (again) the same kind of battles with react-intl. I'd also be keen to hear others' ideas to make this easier / better.

Another example is when you need to inject an image as a text decoration that includes text, or only makes sense in a particular part of the sentence.

One workaround I've considered is to add the decoration directly to the font you're using, so you can literally translate the decoration as text, but that doesn't usually feel like a reasonable solution, and still might not solve the problem for all languages.


GNU gettext is a framework for i18n that works well.

For the sentence ordering, we include all variables in the translations but split translations on styles. We then give the translator the text in order of HTML appearance in the source code, for context. Translators can then rearrange everything but the variables across the string, and also leave parts blank when necessary. It's not perfect, but it works in most cases.

And in the end, developers and designers must consider the "translatability" of the UI. It's always possible to create an untranslatable UI.


Hi! We worked with and evaluated Gettext when we started Fluent.

Our opinion is similar to Unicode's: Gettext is a fundamentally flawed design for internationalization purposes.

You can find a more detailed explanation of our position here: https://github.com/projectfluent/fluent/wiki/Fluent-vs-gette...

Please don't take it as criticism of your using it. We just don't think it scales, and we don't think it's possible to produce high-quality, sophisticated multilingual UIs with it, but if it works for you, don't touch it :)


You make a lot of good points in the linked article, but you lose some credibility right from the start

> Secondly, it makes it impossible to introduce multiple messages with the same source string which should be translated differently.

This is false. The gettext message format uses msgctxt to deal with this; it's a fundamental part of the format. The unique identifier is the combination of msgctxt and the singular string. I wonder how you could miss that? We actually use an automatically generated msgctxt for some parts of our app to avoid accidentally translating the same source text incorrectly in different contexts.

Also I couldn't quite follow the point about interpolation of fluent vs gettext (probably because I don't know fluent). Message interpolation in gettext works and can be absolutely readable. E.g. "You have {count} items". The big drawback is that you can't move this variable across strings. Can you do that with fluent?


> but you lose some credibility right from the start

Thank you for the feedback! I updated the article to include a mention of `msgctxt`.

Personally, in my experience, many project environments end up with partial support for this feature (for example, many React/Angular extractors don't support it), which leads to limited use and requires the localizer to ask the developer to add a context.

I did not include that since it's just my personal experience, and I assume more mature projects tend to recognize the feature and, hopefully, use it extensively :)

> Message interpolation in gettext works and can be absolutely readable. E.g. "You have {count} items".

As far as I understand, this is not part of the system (gettext) but of its bindings, and as a result it is underspecified and differs between implementations. For example, [0] uses `%{ count }` while [1] uses `{{ count }}`. If I'm mistaken here, please point me to the spec :)

Since it is a higher-level replacement, this approach likely suffers from multiple limitations. First of all, I highly doubt that there is any BiDi isolation between interpolated arguments and the string, leading to a common bug where RTL text (say, Arabic) contains an LTR variable (say, a Latin-based name of a person). Fluent resolves this by wrapping all interpolated placeables in BiDi isolation marks.

Secondly, I must assume that any internationalization, such as number formatting, date formatting, etc., is also not done from within the resolver in gettext. That, in turn, means it may be tricky to verify that a number is formatted using eastern arabic numerals in an Arabic translation and western arabic numerals in an English translation. Fluent formats all placeables using Unicode-backed intl formatters (for example, in JS we use ECMA402), allowing for consistency and high-quality translations where placeables get formatted together with the message.

For example, in your example, will the `You have { count } items` be translated to `لديك 5 عناصر` or `لديك ٥ عناصر`? And what will happen if instead of `count`, you'd have `name: "John"`? Will it be RTL or LTR?
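For reference, the digit difference is visible with the plain ECMA-402 formatters that Fluent builds on (this is the standard Intl API, nothing Fluent-specific):

    // ar-EG defaults to eastern arabic numerals in CLDR, en-US to western ones.
    new Intl.NumberFormat("ar-EG").format(5); // "٥"
    new Intl.NumberFormat("en-US").format(5); // "5"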

[0] https://hexdocs.pm/gettext/Gettext.html#content [1] https://angular-gettext.rocketeer.be/dev-guide/api/angular-g...


Yes, I agree. It only solves a subset of the problems. Formatting and RTL/LTR is difficult to solve with gettext.


> Our opinion is similar to Unicode's - Gettext is fundamentally flawed design for internationalization purposes.

Did the Unicode consortium express criticism of gettext? Could you provide some reference for this?


I don't know if there's any public statement about this. I base my position on experience at the Unicode Conference and work on CLDR and ICU. I understand that this diminishes the value of my claim.

I can also point to ICU MessageFormat, which was designed long after Gettext and, I'd dare to say on purpose, bears no resemblance to it.


I agree that CLDR plural forms and ICU MessageFormat are somewhat of an implicit critique of gettext's design :-)


Hahaha, thank you! I still feel ashamed of making a strong claim based on informal conversations, but I feel a bit vindicated by your agreement! :)


OMG this is so cool to a person who lives in the CJK world (to be specific, I’m Korean) where the order of noun/verb/adj is reversed and always gets to see programs that display text something like ‘Site is news reader HN’, ‘Button press confirm to’.

It’s a pity that the programming world is still super bad at i18n :-(


I18n is a bit like security: it takes a lot of effort to get right, people often don’t know that their assumptions are subtly wrong, and getting it right often requires using a more complicated solution. No wonder programmers just cop out and settle for ASCII, naïve string comparisons, first name & surname fields, etc.

And the problems go quite deep, all the way down to standard libraries and programming languages. The Swift debate about correct and performant Unicode string processing was very interesting – people are very hard to convince that a string is not a collection of characters randomly accessible by integer indexes.


I'm an English native speaker that dabbles in other languages occasionally, and I know enough to recognize possible problem areas, but not enough to actually do anything about them.

The complexity of this Fluent library shows what a massive problem it is. I'm not surprised that we continue to be this bad at it.


It won't help much with Chinese localization, where the persistent problem is that developers assume every language has the words "yes" and "no".


Hi! This is also a problem in Welsh [0]. In the Fluent world we solve that on the level of design principles (do not reuse the string `yes` because it can be translated to a different `yes` in different buttons) and bindings (compound messages allow us to localize a whole widget with a main message and its yes/no options).

[0] http://psychopixi.com/languages/yes-and-no/
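A compound message with attributes might look roughly like this (a sketch; the message id and copy are made up):

    save-dialog = Do you want to save this document?
        .confirm = Save
        .cancel = Don't save
This way the prompt and its button labels travel together as a single unit for the translator.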


Doesn't Chinese have understandable localizations for affirmative/negative responses like "是" and "没有"? I don't see the problem.


That's roughly like putting [OK] and [Cancel] into a dialog box that asks "Do you want to save this document?". Perfectly understandable but awkward, somewhat confusing to non-tech people--in a word, unfluent. Answering such questions in Chinese (and I imagine, in Welsh) requires the verb that was used in the question.


How much money is a company losing because its Korean string is nonsense?


Think about how often the English documentation that comes with cheap products from China is mocked. Do you really want your professional output to be treated the same way?


No, but if it comes down to writing better translation code, or adding features and bug fixes, I know where my clients typically land.


If you're writing translation code then you're (probably) doing it wrong.

If your clients are happy to only ever have a product in one language that will never have to deal with anything outside the 7-bit ASCII range, then ignoring the complexity required to do it right is (probably) fine.

As soon as you hit a requirement that violates the above, and you haven't considered how this might affect you, you're likely in for some horrible problems when you suddenly need to handle these things.


The comparison with gettext is really interesting: https://github.com/projectfluent/fluent/wiki/Fluent-vs-gette...

Especially the advantages and drawbacks of using the source string as a message identifier, compared to a developer provided ID.

I'm wondering if fluent has something similar to xgettext, to extract the IDs from the source code?

Edit: Looks like there is some discussion about extraction here: https://github.com/projectfluent/fluent.js/wiki/React-Bindin...


This comparison mentions that in gettext, using the source string as a message identifier "makes it impossible to introduce multiple messages with the same source string which should be translated differently."

It's rather disingenuous to say this without mentioning that in gettext, the actual identifier is a combination of the source string and the string context, which is blank by default. But the context can and should be provided in cases where disambiguation is required.

https://www.gnu.org/software/gettext/manual/html_node/Contex...


Thank you! That's good feedback. In our experience working with gettext-based localizations, we noticed that developers almost always omit the context; as a result, localizers can't distinguish the message variants and have to ask the developer to introduce a context.

This, in turn, breaks the design principle #1 of Fluent - https://github.com/projectfluent/fluent/wiki/Design-Principl...

I'll update the Wiki to reflect that!


Don't take the comparison at its face value, it's clear to me that whoever wrote it isn't really familiar with gettext, or deliberately talking it down. Yes, it's sort of ancient, but the problems mentioned can be solved.

And using the source string as ID is a pretty clever trick. Of course, there are some downsides, but there are certainly also downsides with separate IDs.

Having said that, Fluent looks interesting.


Hi! Thank you for your feedback. I'm one of the authors of Fluent, and this wiki article.

I have, in fact, been using Gettext for quite a few years, but of course, as you pointed out, I am also biased.

If you have suggestions on how to improve the article to better represent the reality, please, file an issue and provide a PR! Our goal is to express our design differences, but we don't want to mislead anyone!


The downside of using separate IDs is that the developer has to "name" each string shown in the user interface, instead of just using the source string as an ID. And as you know, naming things is hard ;-)


Yes, naming is hard, but, to quote the previous commenter, "this can be worked around": you can `slug` any string if you want to. We prefer to think of the ID as the basis of the social contract between the dev and the localizer. This enables a lot of fine-tuned control over string invalidation.


Considering the translation ID as a "slug" is a good tip to ease choosing the ID. Thanks!


There's a tool written by a Udacity developer which can be used to autoextract copy from React source code: https://github.com/udacity/fluent-react-utils/tree/master/pa...


I was thinking of something along this line. Thanks!


Nice! This honestly sounds really great. Coincidentally, as a Czech person, the multiple plural forms have sort of been the bane of i18n for me; a surprising number of solutions don't take this into account at all (?!) or require dumb workarounds.

With Mozilla's experience and adherence to ideals of interoperability and openness, I can see Fluent as a solid "golden standard" solution for a great chunk of i18n needs :)


Ahoj! :) This is the perfect article that always comes to mind when talking about pluralization:

https://metacpan.org/pod/distribution/Locale-Maketext/lib/Lo...

We Czechs have it easy!


Great article! I'm always impressed by the thought going into Perl libraries.

There was a presentation years ago about how Perl handled Unicode right and every other programming language didn't (with Python 3 pretty close, IIRC).

Does anyone remember the URL?


I think the presentation you're thinking of is Tom Christiansen's Unicode: The Good, the Bad, and the (mostly) Ugly: https://www.azabani.com/pages/gbu/


I think that's the one. Thank you!


If it's about Perl 6, perhaps: http://jnthn.net/papers/2015-spw-nfg.pdf ?


In the Czech example, how is it obvious that `few` stands for 2, 3, 4? Is that just how the concept of "few" (note the English term) is defined and understood by all Czech speakers, and thus this language-specific meaning is encoded by Mozilla to map to the range 2-4?

My point is that while there might be a concept of "few" that does map uniquely to that range, I am not sure naming the keyword "few" is the right name for this.

Quite honestly it would be easier to understand if it were explicitly referring to the range. After all these strings are provided specific to a language anyway. As such why not encode the rules in them explicitly instead of relying on keywords?

Or are those merely user defined abstractions to accomplish reuse? I guess it would help, but I'm still not sure why this needs a whole new framework.


(Author of the blog post here.) Great question, thanks! Unicode defines six categories of plural forms: zero, one, two, few, many, and other. The names of these categories always appear in English. Unicode also maintains a collection of all mappings of numerical rules to these categories, for all languages supported by the CLDR. See http://www.unicode.org/cldr/charts/latest/supplemental/langu... for the mapping corresponding to the Czech grammar.
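You can also query the same CLDR data from JavaScript via Intl.PluralRules; for Czech, for example (results assume the runtime ships standard CLDR data):

    const czech = new Intl.PluralRules("cs");
    czech.select(1);   // "one"
    czech.select(3);   // "few"
    czech.select(12);  // "other"
    czech.select(1.5); // "many" (fractions map to "many" in Czech)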


Care to comment about how it might handle one bold word or a link mid sentence?


When Fluent formats translations, it returns simple strings (in the sense of primitive computer types). They can include markup which is parsed by a higher-level abstraction responsible for actually showing the translations somewhere in the UI. Take a look at https://github.com/projectfluent/fluent.js/wiki/DOM-Overlays in the experimental fluent-dom package, and their React equivalent, https://github.com/projectfluent/fluent.js/wiki/React-Overla....


Ah, if this is building on top of existing rules / standards that makes total sense.


Czech speaker here. When stating a plural in the nominative form, counts of 2-4 items always use the nominative plural and counts of 5+ items use the genitive plural. This concept is understood by Czech speakers natively and doesn't have any description. "Few" is therefore just a descriptive name chosen by the Fluent developers, and probably covers other languages with a similar concept.


"Few" is language-specific, so the logic for it should live with the language. Another language might have a plural form for 3-6, and you would then need another way to describe it. I've mentioned this in another comment, but I think a range would be better, such as 2..4 or 3..6.


The rules for cardinal numbers in Slavic languages are quite complex. "few" is not just 2, 3, 4, but any number less than 100 ending in 2, 3, 4. You have to use a function to handle the situation. "few" is basically the function name, which is as good as any.


It's not "less than 100 ending in 2, 3, 4". It's "2, 3, or 4 mod 10, but not 12, 13, 14 mod 100".

So for example "1002" is "few" in this sense (even though it's bigger than 100) but "14" is not, even though it's less than 100 and ends in 4.


For y'all digging into the list of rules - feel free to use the Unicode Plural Rules list as a reference point - http://www.unicode.org/cldr/charts/33/supplemental/language_...

It's a great abstraction that makes a lot of Fluent easier :)


You're right. I forgot about the teens. I don't know about Czech, but in Polish the nominative plural isn't used for numbers higher than 100. "1024 bajtów" and not "1024 bajty". If I remember the rules correctly.


That's a good point. The rules I described are for Russian, and other Slavic languages may differ. In particular, looks like Polish is pretty similar to Russian here at least in terms of where the category boundaries are, but Czech is different.

As far as the cases used go, 1024 would take the genitive singular in Russian. 1025 would take the genitive plural. 1021 would take the nominative singular. Nominative plural is not used at all when counting things in Russian.


This confused me too. To take the example, I'd prefer the following:

  tabs-close-warning-multiple = {$count ->
      [2..4] Chystáte se zavřít {$count} panely. Opravdu chcete pokračovat?
     *[] Chystáte se zavřít {$count} panelů. Opravdu chcete pokračovat?
  }
Specify a range (2..4). The second option shouldn't need to match as it is the default value anyway (signified by the *)


The (2..4) range would work for Czech in this example (if I'm reading the CLDR right), but I'm afraid it wouldn't be sufficient for languages with more complex plural rules. Take the rule that returns "one" in Latvian, for example:

    n % 10 = 1 and
      n % 100 != 11 or
    v = 2 and
      f % 10 = 1 and
      f % 100 != 11 or
    v != 2 and
      f % 10 = 1
…where n is the absolute value of the number, f is the visible fractional digits with trailing zeros, and v is _the number_ of visible fraction digits with trailing zeros. Some rules can get even more complex than that; see [0] and [1].

It's safer and more robust to rely on the plural categories defined by Unicode (zero | one | two | few | many | other) and on the APIs provided by the platform (ICU, Intl.PluralRules, etc.).

[0] http://www.unicode.org/cldr/charts/latest/supplemental/langu...

[1] http://unicode.org/reports/tr35/tr35-numbers.html#Operands


> The second option shouldn't need to match as it is the default value anyway (signified by the *)

That's an interesting suggestion! Right now, the identifier between the brackets is required, but we could relax this in the future. In 1.0, we erred on the side of more conservative and explicit design, to improve the readability and discoverability of the syntax for translators.


Handing control of inflections etc. over to the translator rather than the developer is one of those great ideas that make so much sense when you first see them, that you start to wonder why we didn't do this before.

Great work by Mozilla; it's clear there's a lot of experience in the organisation feeding into the design of this system, and it's great that they're sharing it with the world.


Isn't it amazing that even with these "apparently solved and very basic" problems like i18n, there is still so much low-hanging fruit, and an open-source project can do better than many companies?

I'm German and I disabled spell checking almost everywhere, because most implementations are extremely poor in German. Word lists are a poor solution to capture different word forms, and I find it surprising that even in 2019, only very few programs get that right (for instance Microsoft Word; it understands some but not all grammar rules). This is another thing where I think a modern (OSS) spell checker could make a difference.


I guess my biggest pet peeve with German spellcheckers and autocomplete solutions other than the nonexistent support for compound words is that most of them don't understand capitalisation rules.


Not just spellcheckers, virtual keyboards too. Writing German on a smartphone can be slightly infuriating


Re: the gendered pluralized example in the article. How will the system deal with the fact that in some languages (Czech, too):

    ($count -> Jana added {n} {apples|apple}) ($gender -> to {his|her} profile)
the $gender will affect what form the word "added" should take. You're suddenly dealing with possibilities($count) * possibilities($gender) variants of the sentence.


I would assume that the translator could provide a translation along the lines of:

  ($count -> Jana ($gender -> {addedHis|addedHer}) {n} {apples|apple}) ($gender -> to {his|her} profile)
(obviously with addedHis and addedHer substituted for the correct words.)


It's possible to build this message in Fluent with nested selectors, or with adjacent selectors. I built an example using Polish, since that's a language I know best. (To be 100% correct in Polish, I'd need to use a different possessive pronoun, but doing so would actually remove the double use of gender from your question.)

https://projectfluent.org/play/?id=2d7ab4b7ed1c4d9656475614f...

It's a complex piece of UI and consequently, the resulting Fluent message is also quite complex. But possible to build :)
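For those who don't want to click through, the general shape of the nested selectors is roughly this (an English-language sketch, not the Polish example from the playground):

    added-photos = {$userGender ->
        [female] {$photoCount ->
            [one] {$userName} added a new photo to her profile.
           *[other] {$userName} added {$photoCount} new photos to her profile.
        }
       *[male] {$photoCount ->
            [one] {$userName} added a new photo to his profile.
           *[other] {$userName} added {$photoCount} new photos to his profile.
        }
    }
Every combination is spelled out, which is verbose, but it keeps each complete sentence visible to the translator.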


Wouldn’t you just wrap that word in another selector?


Will the system still be sanely usable for translators?


Yeah this is where these things fall apart. I haven't used Mozilla's Fluent but I used a very similar closed source system at another company some years back. Some failure modes:

- Gender agreement is not trivial. In French, in "Mary bought it" the verb needs to match the gender and number of the object not the subject: "Marie l'a acheté" vs "Marie l'a achetée" vs "Marie les a achetés" depending on the gender/number of the "it" object. But in most other cases the verb needs to match the subject in gender and number, in Polish "Maria kupila" vs "Stas kupil" vs "Oni kupili".

- In many languages nouns need to agree in case, gender, and number with the phrase they're in, even in English we see this with pronouns: "this is he" vs "this is his".

- And not to mention number agreement between pronoun and either subject or object, depending on context: "this is the button" vs "these are the buttons" - but also "this hovers over the button" vs "these hover over the button" etc. Pronouns in general are a world of hurt, as are copulae (is/are/etc)

- So once we want more complex sentences, simple word tagging like $gender becomes insufficient, because now there's multi-party agreement to worry about, we have to worry about $gender_subject and $object_gender_number_case, etc.

This becomes completely untenable for all but the most technical translators. Maybe those are easy to find for a world famous project like Firefox. Unfortunately, not so for a run of the mill commercial project.


It's certainly complicated but I think there is no way around it if you want to have correct translations. The same kind of system is used for translating MediaWiki and it seems to work great there. Example message: https://translatewiki.net/w/i.php?title=MediaWiki:Logentry-b...


This is how you pass variables to a localized React component. The API is elegant. Bravo to the team!

https://github.com/projectfluent/fluent.js/blob/master/fluen...


Does it handle the "x of y" in Slavic languages correctly?

For example in Polish:

    "Page 3 of 4" is "Strona 3 z 4"
    "Page 3 of 100" is "Strona 3 ze 100"


As far as I can tell, Fluent has no built-in support for attaching phonemic information to messages or arguments (it only supports plural rules and number formatting, largely derived from the CLDR). You can probably specify special cases with selectors (but it will quickly get absurd; for Polish I guess that applies to 6, 7, 16, 17, 60..79, 100..199, 600..699 and so on?) or use an external function.

Context: the Polish preposition "z(e)" is spelled "ze" when the following word starts with an awkward s/z-like consonant cluster (problematically enough, the exact rule is not systematic) and "z" otherwise. Korean has a similar case with the postpositions "은(는)" and "이(가)", where the former in each pair is for words ending with a consonant and the latter for words ending with a vowel, and the ko-KR localization of Firefox seems to completely ignore and/or sidestep this; the last letter is assumed (e.g. "{-brand-name}는") or a static word is inserted (e.g. "{$user} 사용자는" instead of "{$user}은(는)").


Funnily enough "Strona 3 z 6" and "Strona 3 z 7" sounds correct but "Strona 3 z 100" doesn't.

So I think it's only words starting with "s", not "sz" nor "si". And only 0 starts with "z" so it's not a problem (you never have "X out of 0"). So the only special case is for 100-199, 100 000-199 999, etc.


A filler vowel is needed with certain consonant clusters. 100 is "stu" and "st" requires it, whereas "dwustu" (200), "trzystu" (300), "czterystu" (400) don't. I don't remember the list. "mn" is another combo that needs the vowel: "ze mnóstwa".


Yeah, it is clear that I don't speak Polish ;-) What is a common workaround there? "Strona X z(e) Y"?


> What is a common workaround there?

Ignoring the issue altogether :) Or, if you're pedantic, implementing the special cases in the source code. But that's unmaintainable if you have lots of languages.


This is an excellent question and a very good use-case. I'm a Polish speaker myself, so I can definitely relate. It's also a good excuse for me to talk a little bit more about the advanced features of Fluent.

The Fluent Syntax is a simple declarative DSL. By design, it doesn't allow translators to build complex conditionals or use arithmetic. There is, however, an escape hatch. The problem you described can be solved in Fluent with a little bit of one-time help from the developer of the source code, through a feature of Fluent called custom functions.

Translations in Fluent can use functions to format values or decide between variants. There exist built-in functions like NUMBER and DATETIME. They are rarely used because the Fluent runtime calls them on numeric and temporal values implicitly, but they can be helpful when localizers wish to use custom formatting options.

    weekday-today = Today is {DATETIME($today, weekday: "long")}.
See https://projectfluent.org/play/?id=a3540d4f02c104a634adbfc0e... for a live example of DATETIME.

There can also be custom functions, defined during the initialization of the runtime. In Firefox, we use one such function called PLATFORM: https://searchfox.org/mozilla-central/rev/d33d470140ce3f9426.... It can be used as follows:

    open-preferences = {PLATFORM() ->  
        [windows] Open Options
       *[other] Open Preferences
    }
The logic of custom functions is entirely up to developers and the localization needs of the UI. In https://github.com/projectfluent/fluent/issues/228#issuecomm..., for instance, I suggested using a custom function to handle negative and positive floor numbers.

A custom function can also cater to the use-case you described. A simple and possibly naive implementation in JavaScript could look like the following one:

    function NUMBER_HEAD(num) {
        while (num > 999) {
            num /= 1e3;
        }
        let first = num.toString()[0];
        return num < 10 ? first
            : num < 100 ? first + "x"
            : first + "xx";
    }
I wrote this with Polish in mind, but it could be useful to other languages in which numerals are named after the first thousand-triple, in a left to right order. Depending on the exact product requirements, the function could be called NUMBER_HEAD_POLISH, or perhaps NUMBER_HEAD_TRIPLE_FIRST_DIGIT :)

Once defined, the function can be used as follows:

    # The Polish copy can take advantage of the custom function.
    page-of = {NUMBER_HEAD($pageTotal) ->
        [1xx] Strona {$pageCurrent} ze {$pageTotal}
       *[other] Strona {$pageCurrent} z {$pageTotal}
    }
This method still requires some work from developers, but it only needs to happen once and in a single place in the code: where the Fluent runtime is initialized. Because they are code, custom functions can be reviewed and tested just like any other code in the code base, to help ensure that they do what they claim to :)
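In fluent.js, for instance, the registration could look roughly like this (a sketch using the @fluent/bundle names; exact API details differ between versions, and polishFtlSource stands in for the translations loaded from the .ftl file):

    import { FluentBundle, FluentResource } from "@fluent/bundle";

    // Register the custom function once, where the Polish bundle is created.
    const bundle = new FluentBundle("pl", {
        functions: { NUMBER_HEAD },
    });
    bundle.addResource(new FluentResource(polishFtlSource));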

Importantly, the use of the custom function is completely opt-in.

    # The English copy doesn't need any special handling.
    page-of = Page {$pageCurrent} of {$pageTotal}
All localization callsites remain unchanged, and all existing translations remain functional.


Thanks for the detailed answer.

Right now I'm using qt translation system, it handles nicely various plural forms, just like Fluent.

But it requires special code in each message that has "X of Y" to handle the "ze 100" correctly in Polish. It might be done for Polish, because we have Polish developers, but many languages have similar quirks and it's not done for them. And it would result in combinatorial explosion of translation message versions if source code had to add special case for each quirk in each language.

This seems to be a much better solution.


Thanks!

> And it would result in combinatorial explosion of translation message versions if source code had to add special case for each quirk in each language.

This is the exact problem we designed Fluent to solve. If you get a chance to try it out, feel free to reach out to me if you have questions. I'll be more than happy to help and to hear your feedback.


This looks really good! As someone in Ruby-land, I wish there were an easy way to get notified if/when a fluent-ruby gem becomes available.


Looking at the examples, it looks like something like messageformat (https://messageformat.github.io/messageformat/) would have been a good solution to them. We've been using this at my job and we are very happy with the flexibility it provides. The hard part comes when a third party has to do the translations because someone from tech needs to be involved.


From the article: "Many key ideas in Fluent have also been inspired by XLIFF and ICU’s MessageFormat."


Thank you for pointing that out. I found a comparison wiki article. I think this is an improvement over messageformat. They have tackled many of the issues with MF that I didn't even know I had :P


My take on how to solve the natural-sounding translation problem: https://www.lokalized.com/#a-more-complex-example

The magic is a tiny expression language which understands plural cardinalities, ordinals, etc. so a translator can encode all required logic in a JSON file - the application code can be "dumb".


Sounds more like a programming language for translations (than like a message format).


What's the advantage over the FormatJS suite? https://formatjs.io/


Hi! FormatJS is very similar to some of our bindings, and is powered by MessageFormat on the lower level.

Here's our take on the differences between MessageFormat and Fluent - https://github.com/projectfluent/fluent/wiki/Fluent-and-ICU-...


It'd be great to see syntax highlighting for FTL files in the popular text editors, but I guess they would have to be unofficial unless a member of the Mozilla/Fluent team wants to maintain them...


I see that the playground has syntax highlighting and uses `ace`, whose syntax definitions are defined in JavaScript [0] and look fairly usable. I guess it could even be converted into a `.sublime-syntax` file without too much trouble :)

[0]: https://github.com/projectfluent/play/blob/a4f49a4a7eeb93535...


You can get decent highlighting for basic Fluent messages by setting the editor to a mode for Properties files. For instance, whenever I type Fluent examples in GitHub, I use the following markdown:

    ```properties
    # A comment
    hello = Hello, world!
    ```
This is in fact by design. Properties files are quite nice for simple things. Fluent builds on top of them, and provides modern features like multiline text blocks (as long as it's indented, it's considered text continuation) and the micro-syntax for expressions: {$var}, {$var -> ...} etc.
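For example, a multiline value is just indented continuation (a trivial sketch):

    about = Fluent values can span
        multiple lines, as long as each
        continuation line is indented.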

So far, I haven't had much time to invest in building proper highlighting modes for popular editors. There are ACE and Vim modes mentioned in other comments here, and also a slightly outdated https://atom.io/packages/language-ftl20n written by a contributor. I'd love to see more such contributions, and I'll be more than happy to help by reviewing code!


I started https://github.com/projectfluent/fluent.vim but my vimscript is terrible and I will take all help I can get :)


I joke that, prima facie, Fluent is to gettext and other i18n frameworks what Rust is to C/C++: well designed and thoughtful.


Hope someday somebody invents .pot files, plurals and gettext.


No need to be sarcastic. The documentation explains the advantages over gettext.

Most notably, using the source-language string as the identifier a) discourages changes (and improvements) to the source-language strings and b) makes it hard to handle strings that appear the same in the source language but need different translations.



