Twitter needs manual language selection from hiddedevries.nl Blog RSS feed.
Twitter needs manual language selection
Lots of Twitterers speak languages that are not English. For people who read tweets that are not in English, it is important that these tweets are marked as such. I feel Twitter needs a feature for this.
It would be nice if, when writing a tweet, we could manually select which language the tweet is in, and that Twitter would use that information to set the appropriate lang attribute on our content:
Sharing a controversial opinion on CSS frameworks in the Dutch language
Twitter is an authoring tool, for which the Authoring Tool Accessibility Guidelines recommend that “accessible content production is possible” (Guideline B.2.1).
The lang attribute
Language attributes identify which language some web content is in. They are usually set on a page level, added to the HTML element:
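A minimal sketch of such a page-level declaration, assuming an English-language page (the title and body text are placeholders):

```html
<!DOCTYPE html>
<!-- lang="en" declares English as the main language of the whole page -->
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Example page</title>
  </head>
  <body>
    <p>Everything in here is treated as English unless marked otherwise.</p>
  </body>
</html>
```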
Most developers don’t write these attributes often; the code usually lives somewhere in a template that we don’t touch every day, or ever. But it’s an important attribute. Setting it correctly gets your page to pass one whole WCAG criterion (3.1.1 Language of page).
In some cases, we have to set language attributes on individual elements, too, like if some of our content is not in the page’s main language. On the website I built for the British-Taiwanese band Transition, we combine content in Mandarin with content in English on one page:
The Transition “Music” page
We picked en as the main language and set it on the <html> element. This meant we had to mark all Chinese content as zh, in this case zh-TW as it is specifically Mandarin as spoken in Taiwan. Of course, we could have written this the other way around, too. Usually we want to pick the language that’s most common on the page as the page’s language.
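As a rough sketch of that pattern (not the actual markup of the Transition site; the headings and sentences are placeholder content), a mostly English page with Taiwanese Mandarin parts could look like this:

```html
<!DOCTYPE html>
<!-- English is the most common language on the page, so it goes on <html> -->
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Music</title>
  </head>
  <body>
    <!-- A single phrase in another language can be wrapped in a span -->
    <h1>Music <span lang="zh-TW">音樂</span></h1>

    <p>Listen to our latest album.</p>

    <!-- A whole paragraph in another language carries lang itself -->
    <p lang="zh-TW">歡迎收聽我們的最新專輯。</p>
  </body>
</html>
```

Elements without their own lang attribute inherit the language of their nearest ancestor that has one, so only the exceptions need marking.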
Setting a lang attribute on parts of a page is its own WCAG criterion, too (3.1.2 Language of parts), by the way.
The user need
Setting the language is important for end users, like:
- people who use a screenreader to read out content on a page
- people who use a braille display
- people who end up seeing a default font (browsers can select these based on language)
- people who use software to translate content
- people who want to right click a word in our content to look it up in a dictionary
- people who use user stylesheets
The author need
There is also an author need, both for people who write content and for web developers.
Content editors
People who write content may get browser-provided spellcheckers. They will work better if they know what the content’s language is. I think Twitter.com has somehow turned browser spellcheck off, but there may be Twitter clients or indeed other authoring tools where this is relevant.
Web developers
Language attributes are important for web developers, too, as they allow them to use the :lang() pseudo-class in CSS more effectively.
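For illustration, here is a small sketch of :lang() in use. The font names are only examples of fonts that may or may not be installed on a given system, and the text is placeholder content:

```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>:lang() demo</title>
    <style>
      /* Matches any element whose language is zh-TW, whether declared on
         the element itself or inherited from an ancestor. */
      :lang(zh-TW) {
        font-family: "PingFang TC", "Noto Sans TC", sans-serif;
      }

      /* Everything that is English, including the page default. */
      :lang(en) {
        font-family: Georgia, serif;
      }
    </style>
  </head>
  <body>
    <p>Listen to our latest album.</p>
    <p lang="zh-TW">歡迎收聽我們的最新專輯。</p>
  </body>
</html>
```

Unlike an attribute selector such as [lang="zh-TW"], :lang() also matches descendants that inherit the language, which is why it depends on the attributes being set correctly in the first place.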
Some CSS will behave differently based on languages. When you use hyphens: auto, the browser needs to look up words in a dictionary to apply hyphenation correctly. It has to know the language for this.
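A minimal sketch of this, assuming a narrow column so that long words are likely to need breaking (the class name and the sample sentences are made up for this example):

```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Hyphenation demo</title>
    <style>
      /* A narrow box, so long words have to break across lines */
      .narrow {
        width: 8em;
        hyphens: auto;
      }
    </style>
  </head>
  <body>
    <!-- Inherits lang="en": English hyphenation rules apply -->
    <p class="narrow">Internationalisation considerations for authoring tools.</p>

    <!-- Declared as Dutch: Dutch hyphenation rules apply instead -->
    <p class="narrow" lang="nl">Toegankelijkheidsrichtlijnen voor webcontent.</p>
  </body>
</html>
```

Whether and where words actually get hyphenated depends on the browser having a hyphenation dictionary for the declared language, so results will vary.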
With appropriate language attributes, you can also use CSS features like writing modes and typographic properties more effectively. See Hui Jing Chen’s deep dive into CSS for internationalisation for more details.
Automating and lang-maybe
Identifying languages can be automated. In fact, Twitter does this. When they recognise a tweet’s language, they add the relevant lang attribute proactively. See for instance the European Commission chair’s multilingual tweets:
Twitter’s auto-added lang attributes in action
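In simplified form, the result is roughly the following. This is a sketch only: the element and class names are hypothetical rather than Twitter’s actual markup, and the sentences are placeholders, not real tweets.

```html
<!-- Each tweet's text container gets the lang that was detected for it -->
<article class="tweet">
  <div class="tweet-text" lang="en">Europe stands together.</div>
</article>

<article class="tweet">
  <div class="tweet-text" lang="fr">L’Europe est unie.</div>
</article>

<article class="tweet">
  <div class="tweet-text" lang="de">Europa steht zusammen.</div>
</article>
```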
Yay! I think this is very cool (thanks ThainBBdl for pointing this out). The advances in natural language processing are really impressive.
Having said that, any automated system makes mistakes. Vadim Makeev shared:
Yes, sometimes they take my Russian tweets and render them as Bulgarian. It’s not just the lang, they also use some Cyrillic font variation that makes them harder to read.
It is safe to assume such mistakes will skew towards minority languages and miss subtleties that matter a lot to individual people, especially in areas where language is political.
On the one hand, I think it makes sense to deploy automated language identification. As there are a lot of users, Twitter can safely assume not everyone would set a language for all of their tweets. People might not know or care (insert sad face here); a fallback helps with that. On the other hand, if this tech exists, might it make more sense for a browser to deploy it rather than an individual website? Why not have the browser guess the content’s language, for every website and not just Twitter?
If browsers did this, Twitter’s lang attributes might get in the way. They kind of give the impression that this information is author-provided. This makes me wonder: should there be a way for Twitter to say their declaration is a guess? lang-maybe?
Manual selection
Automated language detection probably works best if it complements manual selection. It could help provide a default choice or suggestion for manual selection, and work as a fallback. So, I’m still going to make the case for a method for users to specify a language manually.
A per-tweet manual language picker would be great as it can:
- give willing authors more control to avoid issues
- ensure the benefits of language identification are not limited to users of the majority languages that AI models are best trained for
- let authors express their specific intent
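To make the idea concrete, here is a rough sketch of what such a picker could look like in a compose form. Everything in it is an assumption made for illustration: the /tweets endpoint, the field names, the ids and the short list of languages are all hypothetical.

```html
<form action="/tweets" method="post">
  <label for="tweet-text">Tweet</label>
  <textarea id="tweet-text" name="text"></textarea>

  <label for="tweet-lang">Language</label>
  <select id="tweet-lang" name="lang">
    <!-- Empty value: fall back to automated detection -->
    <option value="">Detect automatically</option>
    <option value="en" lang="en">English</option>
    <option value="nl" lang="nl">Nederlands</option>
    <option value="ru" lang="ru">Русский</option>
    <option value="zh-TW" lang="zh-TW">中文（台灣）</option>
  </select>

  <button type="submit">Tweet</button>
</form>

<script>
  // Mirror the chosen language onto the text field while the author types,
  // so browser features such as spellcheck can take it into account.
  const picker = document.getElementById("tweet-lang");
  const field = document.getElementById("tweet-text");

  picker.addEventListener("change", () => {
    if (picker.value) {
      field.lang = picker.value;
    } else {
      field.removeAttribute("lang");
    }
  });
</script>
```

The submitted lang value could then be used as the lang attribute on the published tweet, with automated detection only stepping in when the author leaves the picker on its default.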
Summing up
For non-English tweets to meet WCAG, they need to have their language declared with a lang attribute. Twitter currently guesses languages, which is a great step in the right direction, but is likely of little help to speakers of minority languages. A manual selector would be a great way to complement the automation.
Originally posted as Twitter needs manual language selection on Hidde's blog.