HTML and Specifying Language

Published on Aug 25, 2014 (updated Feb 5, 2024), filed under development, html.


My concerns about requiring lang to be set on the html start tag had first been based on an insufficient differentiation between (and missing reconciliation of) text-processing language and language(s) of the intended audience. While not meant to be the same, in reality, they end up being used the same way. Under that premise, the argument made should be more understandable.

I question the importance and ways of marking up language in HTML documents, in particular changes in language.

More specifically, I question the officially recommended practice of using the lang (and xml:lang) attribute to describe the document language (as in “<html lang=en>”) as well as any subsequent changes (as in “football <i lang=fr>à la</i> Germany”).
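To make concrete what such markup offers to tools, here is a minimal sketch in Python that lists every language declaration in a piece of markup. It uses the standard library's html.parser; the names LangCollector and collect_lang are my own, hypothetical helpers, and the sample combines the two snippets quoted above.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Collect (tag, language) pairs for start tags that carry lang or xml:lang."""

    def __init__(self):
        super().__init__()
        self.declarations = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        for name, value in attrs:
            if name in ("lang", "xml:lang") and value:
                self.declarations.append((tag, value))


def collect_lang(markup):
    """Return all language declarations found in markup, in document order."""
    parser = LangCollector()
    parser.feed(markup)
    parser.close()
    return parser.declarations


# Sample built from the two snippets quoted in the text.
sample = "<html lang=en><p>football <i lang=fr>à la</i> Germany</p></html>"
print(collect_lang(sample))
```

For the sample, this reports one declaration for the document (en) and one for the change in language (fr), which is exactly the information authors are asked to provide by hand.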

As for motivation, I deem this mostly an efficiency matter. At this point I doubt it to be an accessibility concern, because the problem—determining language—affects everyone. (I believe we can credit Joe Clark for the clear distinction that problems that affect everyone, or at least include non-disabled users, are not accessibility issues.)

I’ll now go over marking up document language and marking up changes in language separately, for they present different problems and ask for different solutions.

Specifying Language

To specify the language of documents—thinking websites, not standalone pages—what first springs to mind must be: What is the most efficient way to do so?

The answer appears to be via HTTP header. That is, to set HTTP’s Content-Language header ^*. (In Apache, this is done through the Header or DefaultLanguage directives.)

This method is so efficient because it uses the least code, is easiest to update, and carries more weight (though I couldn’t find a reference to support that HTTP headers take the usual precedence here).
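As an illustration of the "least code" point, here is a minimal sketch of setting the header once for a whole site, written as Python WSGI middleware rather than Apache configuration. The names with_content_language, site, and application are hypothetical, and the language value "en" is only an example.

```python
def with_content_language(app, language):
    """Wrap a WSGI app so every response declares language, unless it already does."""

    def wrapped(environ, start_response):
        def start(status, headers, exc_info=None):
            # Leave responses alone that already set their own Content-Language.
            if not any(name.lower() == "content-language" for name, _ in headers):
                headers = list(headers) + [("Content-Language", language)]
            return start_response(status, headers, exc_info)

        return app(environ, start)

    return wrapped


def site(environ, start_response):
    """Stand-in for an existing site; note that its markup carries no lang attribute."""
    body = b"<!doctype html><title>Example</title><p>Hello."
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [body]


# One line covers every document the site serves.
application = with_content_language(site, "en")
```

To try this locally, application can be passed to wsgiref.simple_server.make_server from the standard library. Changing the language then means changing one argument, not every template.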

Still, the W3C I18N Activity advises against using HTTP headers, at least alone: “Use language attributes rather than HTTP to declare the default language for text processing.” (There seem to be no reasons given, then, as the language declarations document referenced is rather neutral about HTTP headers.)

Question 1: Given that HTTP headers are generally a more efficient and maintainable way to set document language, could it be that current advice against them (or for @lang, respectively) is at least lacking balance?

Specifying Changes in Language

If specifying the language of every document is inefficient and thus costly, then highlighting changes in language is even costlier. Language changes occur in more places, and sometimes require extra markup. (For those who don’t know my definition of cost, I understand basically any negative consequence, for example more effort, as a cost.)

Changes in language also occur primarily in the “actual” document contents. That means that the requirement to mark them up affects more people (not just document and template developers, but all authors), which increases the burden and cost of the requirement. Changes in language appear often enough, then, to even affect users who don’t know HTML. This is the case, for example, with users who write and edit their copy in content management systems that use some abstraction to translate contents into HTML, but also with people who use conversion software like Markdown.

Next, and here it gets more interesting, it is completely unclear what tools actually use the information of inner-document language changes. Granted, this may be a knowledge gap on my end (being corrected is one reason why I write all of this down), but from what I’ve seen so far, from what I specifically understand some services like Google not to be doing, and even from my fading memories of testing assistive tools, there’s not much value in marking up changes in language.

And even if all of this were incomplete and subject to improvement, the most important question is still to come: What role can, and should, software play in this, that is, software that detects changes in language? While software may never be perfect in detecting language, it may be or become better than what we have now. To my mind (and reflecting Google doctrine), language detection should be automated. It should be a software responsibility. That a good number of authors cannot (the ones without training) and never will (the ones who use abstraction tools) mark up changes in language only seems to strengthen this thinking.
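To give a rough idea of what software-side detection can look like, here is a deliberately naive sketch in Python that guesses the language of a text fragment by counting common words. It is a toy under stated assumptions, not how any real product works: the stopword lists are tiny and hand-picked, and the names STOPWORDS and guess_language are hypothetical.

```python
import re

# Tiny, hand-picked stopword lists; purely illustrative, not a real language model.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
    "fr": {"le", "la", "les", "et", "est", "une", "des", "que"},
}


def guess_language(text):
    """Return the code whose stopwords occur most often in text, or None if none occur."""
    # Split into lowercase runs of letters (Unicode-aware, so "müde" stays one word).
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(word in stops for word in words)
        for lang, stops in STOPWORDS.items()
    }
    # On a tie, max() keeps the first language listed in STOPWORDS.
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


fragments = [
    "The cat is in the garden and it is happy",
    "Der Hund ist nicht im Garten und die Katze ist müde",
    "Le chat est dans le jardin et la maison est grande",
]
print([guess_language(fragment) for fragment in fragments])
```

Applied fragment by fragment, even this crude approach notices a change in language between sentences without any markup. Short phrases like “à la” are where it would fail, and where better software would have to prove itself.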

Question 2: Could it be that requiring authors to mark up changes in language is not only comparatively expensive and, with current implementations, even pointless, but that the job could also, overall, be done better by tools in the first place?

❧ For emphasis I framed both key questions a bit provocatively ^†. The suspicion that current industry practices and current expert advice are both a little off is strong. The way I see it, a more appropriate approach to specifying language is to

prefer indicating document language through Content-Language, and
require user agents and assistive technology to detect changes in language, where that is relevant.

Any existing provisions that mandate using @lang when important to do so should remain in place. Similarly, if documents are likely to be served under conditions where no Content-Language can be set, @lang should still be preferred.

That is my view as well as I can lay it out. With this move towards Content-Language headers, I want to make sure I haven’t overestimated their effectiveness; nor do I want to appear negligent when it comes to the actual use, and useful use at that, of information that indicates changes in language. Hence, what data and evidence did I miss? And what else could, or should, we do?

This post supersedes what I wrote to w3c-wai-gl@w3.org and help@lists.whatwg.org. The intention was the same, but the message less clear.

^* The Content-Language header defines the “natural language of the intended audience.” I simplify matters in this post by treating document language and audience language synonymously. In cases in which there is a difference that also matters, the original definitions apply, from what I can tell with marginal consequences for the issues raised.

^† …and I can only present this case because of all the work that has been done elsewhere, especially in W3C Working Groups. And so I’d particularly like to thank Richard Ishida, from whom I’ve learned much when it comes to L10N and I18N best practice. (As both a former Google tech lead and a W3C translator, I’ve become familiar with and grateful for Richard’s and the W3C I18N Activity’s work.)

Update (August 27, 2014)

For additional clarification see the discussion below, with WebAIM’s Jared Smith. I’ve also clarified my concerns and position on the list of the WCAG WG. Pardon the piecemeal approach; the topic is not trivial.

So far, group and individual feedback has been rather destructive. I believe we’re missing an opportunity to, if not adjust, then at least clarify previous guidelines. Bite reflexes, as I put it in one of the rare direct responses on Twitter, don’t help.

However, I’ve reviewed the case and will, effective immediately, stop marking up changes in language (and even remove existing markup). I do this primarily on the grounds that it’s not an accessibility problem (it’s one for all of us). Other issues listed, most notably the expense of all the extra markup, have contributed to the decision. It’s not a secret that I’m a markup purist who has on other occasions taken controversial steps when they served quality and efficiency.

My recommendation to the working groups is to at least review the guidelines in place, like H57 and H58. To vendors I recommend turning it up when it comes to meaningful support for language markup, for everyone; detect languages and changes in language, provide translations, impress us. To other developers, I think you benefit from being more critical, too.

Update (April 6, 2019)

I renewed the argument.

About Me

I’m Jens (long: Jens Oliver Meiert), and I’m a web developer, manager, and author. I’ve been working as a technical lead and engineering manager for companies you’ve never heard of and companies you use every day, I’m an occasional contributor to web standards (like HTML, CSS, WCAG), and I write and review books for O’Reilly and Frontend Dogma.

I love trying things, not only in web development and engineering management, but also in other areas like philosophy. Here on meiert.com I share some of my experiences and views. (I value you being critical, interpreting charitably, and giving feedback.)