HTML and Specifying Language
Post from August 25, 2014 (↻ April 6, 2019), filed under Web Development.
I question the importance and ways of marking up language in HTML documents, in particular changes in language.
More specifically, I question the officially recommended practice of using the lang (and xml:lang) attribute to describe the document language (as in “<html lang=en>”) as well as any subsequent changes (as in “football <i lang=fr>à la</i> Germany”).
As for motivation, I deem this mostly an efficiency matter. At this point I doubt it to be an accessibility concern, because the problem—determining language—affects everyone. (I believe we can credit Joe Clark for the clear distinction that problems that affect everyone, or at least include non-disabled users, are not accessibility issues.)
I’ll now go over marking up document language and marking up changes in language separately, for they present different problems and ask for different solutions.
Specifying Document Language
To specify the language of documents (thinking websites, not standalone pages), the first question must be: What is the most efficient way to do so? The answer, I believe, is not an attribute in every document but the Content-Language HTTP header, set once at the server level.
This method is so efficient because it uses the least code, is easiest to update, and carries more weight (though I couldn’t find a reference confirming that HTTP headers take the usual precedence here).
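To make “set once, on the server” concrete, here is a minimal sketch in Python: a small WSGI middleware that adds a Content-Language header to every response. The names with_content_language and hello_app are made up for illustration, and a real site would more likely set the header in its web server configuration than in application code.

```python
def with_content_language(app, language="en"):
    """Wrap a WSGI app so every response carries one Content-Language header."""
    def wrapped(environ, start_response):
        def start(status, headers, exc_info=None):
            # Drop any header the app set itself, then add the site-wide value.
            kept = [(k, v) for k, v in headers if k.lower() != "content-language"]
            kept.append(("Content-Language", language))
            return start_response(status, kept, exc_info)
        return app(environ, start)
    return wrapped


def hello_app(environ, start_response):
    """Hypothetical application that knows nothing about language."""
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [b"<!DOCTYPE html><title>Hello</title><p>Hello, world."]


# One line of configuration covers every document the app serves.
site = with_content_language(hello_app, "en")
```

For a quick local try, site can be handed to wsgiref.simple_server.make_server from the standard library. The point of the sketch is only that the language is declared in one place and not in each template.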
Still, the W3C I18N Activity advises against using HTTP headers, at least alone: “Use language attributes rather than HTTP to declare the default language for text processing.” (There seem to be no reasons given, then, as the language declarations document referenced is rather neutral about HTTP headers.)
Question 1: Given that HTTP headers are generally more efficient and maintainable for setting document language, could it be that current advice against them (or, respectively, for @lang) is at least lacking balance?
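On the precedence question, the HTML Standard's rules for determining an element's language, as I read them, put the lang attribute first and use information from a higher-level protocol such as HTTP only as the final fallback. Below is a rough sketch of that lookup order in Python. document_language and _RootLang are hypothetical helpers, the <meta http-equiv=content-language> step that sits between the two is left out, and taking the first entry of a multi-valued header is my own simplification.

```python
from html.parser import HTMLParser


class _RootLang(HTMLParser):
    """Record the lang attribute of the first <html> start tag, if any."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self.seen_root = False

    def handle_starttag(self, tag, attrs):
        if tag == "html" and not self.seen_root:
            self.seen_root = True
            self.lang = dict(attrs).get("lang") or None


def document_language(markup, headers):
    """Return (language, source); markup is consulted before HTTP."""
    parser = _RootLang()
    parser.feed(markup)
    if parser.lang:
        return parser.lang, "lang attribute"
    for name, value in headers.items():
        if name.lower() == "content-language" and value.strip():
            # Assumption: with several listed languages, use the first one.
            return value.split(",")[0].strip(), "Content-Language header"
    return None, "unknown"
```

Under this reading, a document served with “Content-Language: de” but marked up with <html lang=en> counts as English, so the header alone only helps when the markup is silent.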
Specifying Changes in Language
If specifying the language of every document is inefficient and thus costly, then highlighting changes in language is even costlier. Language changes occur in more places, and sometimes require extra markup. (For those who don’t know my definition of cost, I understand basically any negative consequence, for example more effort, as a cost.)
Changes in language also occur primarily in the “actual” document contents. That means that the requirement to mark them up affects more people (not just document and template developers, but all authors), which increases the burden and cost of the requirement. Changes in language appear often enough, then, to affect even users who don’t know HTML: for example, people who write and edit their copy in content management systems that use some abstraction to translate contents into HTML, but also people who use conversion software like Markdown.
Next, and here it gets more interesting, it is completely unclear what tools actually use the information about inner-document language changes. Granted, this may be a knowledge gap on my end (being corrected is one reason why I write all of this down), but judging from what I’ve seen so far, from what I understand services like Google not to be doing, and even from my fading memories of testing assistive tools, there is not great value in marking up changes in language.
And even if all of this were incomplete and subject to betterment, the most important question still follows: What role can, and should, software play in this, that is, software that detects changes in language? While software may never be perfect at detecting language, it may be, or become, better than what we have now. To my mind, and reflecting Google doctrine, language detection should be automated. It should be a software responsibility. That a good number of authors cannot (the ones without training) and never will (the ones that use abstraction tools) mark up changes in language only seems to strengthen this thinking.
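To illustrate what “software that detects changes in language” could mean at its very simplest, here is a toy sketch in Python. It counts common function words per sentence and reports sentences that seem to differ from the document language. The word lists, the threshold, and the names guess_language and language_changes are all invented for this example; it is not how Google or any assistive technology works, and real detectors use far richer models.

```python
import re

# Tiny, hand-picked function-word lists; purely illustrative.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it", "with", "for"},
    "fr": {"le", "la", "les", "et", "est", "de", "des", "un", "une", "à", "que"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "eine", "mit", "zu"},
}


def guess_language(text, minimum_hits=2):
    """Return the best-matching language code, or None when evidence is thin."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {lang: sum(w in vocab for w in words) for lang, vocab in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= minimum_hits else None


def language_changes(sentences, document_language="en"):
    """Yield (sentence, language) for sentences that seem to differ from the document."""
    for sentence in sentences:
        guess = guess_language(sentence)
        if guess is not None and guess != document_language:
            yield sentence, guess
```

Even this toy shows both sides of the question: full sentences are easy to tell apart, while a short phrase like “status quo” gives the heuristic nothing to work with, so it returns None.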
Question 2: Could it be that requiring authors to mark up changes in language is not only comparatively expensive and pointless, even, with current implementations—but could also, overall, be done better by tools in the first place?
❧ For emphasis I framed both key questions a bit provocatively †. The suspicion that current industry practices and current expert advice are both a little off is strong. The way I see it, a more appropriate approach to specifying language is to
- prefer indicating document language through Content-Language headers, and
- require user agents and assistive technology to detect changes in language, where that is relevant.
Any existing provisions that mandate using @lang when important to do so should remain in place. Similarly, if documents are likely to be served under conditions where no Content-Language can be set, @lang should still be preferred.
That is my view as well as I can lay it out. With this move towards Content-Language headers I wish to make sure I didn’t overestimate their effectiveness; nor do I want to appear negligent when it comes to the actual use, and useful use at that, of information that indicates changes in language. Hence, what data and evidence did I miss? And what else could, or should, we do?
The Content-Language header defines the “natural language of the intended audience.” I simplify matters in this post by treating document language and audience language synonymously. In cases in which there is a difference that also matters, the original definitions apply, from what I can tell with marginal consequences for the issues raised.
† …and I can only present this case because of all the work that has been done elsewhere, especially in W3C Working Groups. And so I’d like to particularly thank Richard Ishida, from whom I’ve learned much when it comes to L10N and I18N best practice. (As both a former Google tech lead and W3C translator, I’ve become very familiar with, and grateful for, Richard’s and the W3C I18N Activity’s work.)
Update (August 27, 2014)
For additional clarification see the discussion below, with WebAIM’s Jared Smith. I’ve also clarified my concerns and position on the list of the WCAG WG. Pardon the piecemeal approach; the topic is not trivial.
So far, group and individual feedback have been rather destructive. I believe we’re missing an opportunity to, if not adjust, then at least clarify existing guidelines. Bite reflexes, as I put it in one of the rare direct responses on Twitter, don’t help.
However, I’ve reviewed the case and will, effective immediately, stop marking up changes in language (and even rinse existing markup). I do this primarily on grounds that it’s not an accessibility problem (it’s one for all of us). Other issues listed, most notably expense of all the extra markup, have contributed to the decision. It’s not a secret that I’m a markup purist who has on other occasions taken controversial steps as they served quality and efficiency.
My recommendation to the working groups is to at least review the guidelines in place, like H57 and H58. To vendors, I recommend turning it up when it comes to meaningful support for language markup, for everyone: detect languages and changes in language, provide translations, impress us. To other developers: I think you would benefit from being more critical, too.
Update (April 6, 2019)
I renewed the argument.
About the Author
Jens Oliver Meiert is a tech lead and author (sum.cumo, W3C, O’Reilly). He loves to try things, particularly in the realms of philosophy, art, and adventure. Here on meiert.com he shares and generalizes and exaggerates some of his thoughts and experiences.
If you have any thoughts or questions (or recommendations) about what he writes, leave a comment or a message.
I think it’s rather a stretch to suggest that adding a single attribute and value is “expensive” or “costly”. Certainly typing 10 characters is among the most effortless of accessibility requirements.
If authors are too burdened to add a single attribute, do you really think they’ll take the time to define HTTP headers at the server or back-end scripting level?
(for example) is NOT less code than
and it certainly is not “easiest to update” for those without knowledge of or access to server-side scripting.
As for “pointless”, this is incorrect. Screen readers do currently support document language and language switching (at least if defined in markup - not sure on HTTP headers - if they don’t, they should). Try listening to a page with an incorrect (or missing if the user agent primary language is different) language definition. Those 10 characters often mean the difference between accessible and utterly incomprehensible. To suggest that this affects sighted readers equally is just plain wrong.
Automated language detection simply won’t work - at least today. It would be incredibly processor intensive for internal page content. And would obviously not work for short phrases or sections. Even Google, arguably the leader in language detection, frequently mis-identifies the language of entire pages.
I can’t support your suggestion that we replace a simple solution that (generally) works with a significantly more complex one that is not yet supported.
IMO it seems rather ignorant to specify the language of a document as part of the transport protocol.
There are many other ways that HTML documents are transported, e.g., via SMTP, IMAP, FTP, or through file-based systems, which do not allow or support setting the Content-Language header, thus losing that information.
I agree that server-side language definition can be more efficient for single-language sites. In our evaluation work, we find that many (or perhaps most) sites that have multiple language content generally have the document language defined *incorrectly* for one or more content languages (and a mismatch between document lang and HTTP headers is even more prevalent). Moving this entirely to the server level where it is invisible to authors will aggravate this issue. Evaluation of the language would require a more burdensome HTTP header analysis.
I also agree that computer-detection of language of portions of a page could be of great value when the language of that portion is not explicitly identified, but I don’t think we should toss out the currently functional lang attribute to rely on automated language detection (which does not yet exist in any meaningful way for page portions). This would be like suggesting that we drop form labeling in markup because computers can sometimes guess the label correctly based on proximity. It works OK, until the computer guesses incorrectly - then it is entirely inaccessible (actually worse than inaccessible, because the computer reads the INCORRECT label/language).
“Any existing provisions that mandate using @lang when important to do so should remain in place.” What do you mean by “when important to do so”? Do you mean where automated translation is not sufficient? If that’s the case, then all language changes are currently “important”. Educating authors to always define language changes would certainly be easier than educating them on when explicitly defining language changes is “important” (whatever that means) or not.
We can never expect all changes in language to be marked up.
I think this is generally your argument for moving away from @lang to automated language detection. I think the same could be said of alt text, form labeling, or pretty much anything else in accessibility. While automated processes can fill the gaps where accessibility is not defined (such as form auto-labeling, or perhaps image analysis for images without alt), should we also throw out these techniques because they are not always used and because computers might someday be able to sometimes do it automatically?
How often is there a true, insurmountable barrier with lack of markup that indicates changes in language?
Never. Because adding an attribute to an element will never be an “insurmountable barrier”.
Take this post, what is the consequence of “status quo” not being marked up properly, which it isn’t?
It should not be marked up at all. “Status quo” is English (just like salsa, Los Angeles, feng shui, lager, etc.), even if its roots are not.
You’re right that the need to identify language changes happens less often than most people think (see examples above). But when it is necessary, it’s *absolutely vital* to get correct. Until we can fully trust automated processes to get this right, we shouldn’t be having discussions of removing the one foolproof method (@lang) for authors to do so.
Since I had already intervened there, I put some comments on the WAI mailing list.