HTML and Specifying Language

Published on August 25, 2014 (↻ February 5, 2024), filed under Development.

My concerns about requiring lang to be set on the html start tag had first been based on an insufficient differentiation between (and missing reconciliation of) text-processing language and language(s) of the intended audience. While not meant to be the same, in reality, they end up being used the same way. Under that premise, the argument made should be more understandable.

I question the importance and ways of marking up language in HTML documents, in particular changes in language.

More specifically, I question the officially recommended practice of using the lang (and xml:lang) attribute to describe the document language (as in “<html lang=en>”) as well as any subsequent changes (as in “football <i lang=fr>à la</i> Germany”).

As for motivation, I deem this mostly an efficiency matter. At this point I doubt it to be an accessibility concern, because the problem—determining language—affects everyone. (I believe we can credit Joe Clark for the clear distinction that problems that affect everyone, or at least include non-disabled users, are not accessibility issues.)

I’ll now go over marking up document language and marking up changes in language separately, for they present different problems and ask for different solutions.

Specifying Language

To specify the language of documents—thinking websites, not standalone pages—what first springs to mind must be: What is the most efficient way to do so?

The answer appears to be via HTTP header. That is, to set HTTP’s Content-Language header ^*. (In Apache, this is done through the Header or DefaultLanguage directives.)
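
As a rough illustration of why a header is cheap to maintain, here is a minimal sketch in Python of setting Content-Language once for every response a site sends. The names `add_content_language` and `demo_app` are made up for this example, and the snippet only stands in for what a server-level setting would do; it is not a recommendation for a particular stack.

```python
from typing import Callable, Iterable, List, Tuple

Headers = List[Tuple[str, str]]


def add_content_language(app: Callable, language: str) -> Callable:
    """Wrap a WSGI app so every response carries a Content-Language header.

    `language` is a language tag such as "en" or "de" (assumed valid here).
    """

    def wrapped(environ: dict, start_response: Callable) -> Iterable[bytes]:
        def start_with_language(status: str, headers: Headers, exc_info=None):
            # Drop any value the app set itself, then add the site-wide one.
            kept = [(k, v) for k, v in headers if k.lower() != "content-language"]
            kept.append(("Content-Language", language))
            return start_response(status, kept, exc_info)

        return app(environ, start_with_language)

    return wrapped


def demo_app(environ: dict, start_response: Callable) -> Iterable[bytes]:
    """Hypothetical page handler that knows nothing about language."""
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [b"<!DOCTYPE html><title>Example</title><p>Hello."]


# One wrapper covers every page the app serves.
site = add_content_language(demo_app, "en")
```

Changing the site language later means changing one argument rather than touching every template, which is the maintainability point made here.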

This method is efficient because it uses the least code, is easiest to update, and carries more weight (though I couldn’t find a reference to support that HTTP headers take the usual precedence here).

Still, the W3C I18N Activity advises against using HTTP headers, at least alone: “Use language attributes rather than HTTP to declare the default language for text processing.” (There seem to be no reasons given, then, as the language declarations document referenced is rather neutral about HTTP headers.)

Question 1: Given that HTTP headers are generally a more efficient and maintainable way to set document language, could it be that current advice against them (or for @lang, respectively) is at least lacking balance?

Specifying Changes in Language

If specifying the language of every document is inefficient and thus costly, then highlighting changes in language is even costlier. Language changes occur in more places, and sometimes require extra markup. (For those who don’t know my definition of cost, I understand basically any negative consequence, for example more effort, as a cost.)

Changes in language also occur primarily in the “actual” document contents. That means that the requirement to mark them up affects more people (not just document and template developers, but all authors), which increases the burden and cost of the requirement. Changes in language appear often enough, then, to even affect users who don’t know HTML. This includes, for example, users who write and edit their copy in content management systems that use some abstraction to translate contents into HTML, but also people who use conversion software like Markdown.

Next, and here it gets more interesting, it is completely unclear what tools actually use the information of inner-document language changes. Granted, this may be a knowledge gap on my end (being corrected is one reason why I write all of this down), but from what I’ve seen so far, from what I specifically understand some services like Google not to be doing, and even from my fading memories of testing assistive tools, there’s not great value in marking up changes in language.

And even if all of this were incomplete and subject to improvement, the most important question is still to come: What role can, and should, software play in this, that is, software that detects changes in language? While software may never be perfect at detecting language, it may be or become better than what we have now. To my mind (and reflecting Google doctrine), language detection should be automated. It should be a software responsibility. That a good number of authors cannot (the ones without training) and never will (the ones that use abstraction tools) mark up changes in language only seems to strengthen this thinking.
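
To make the idea of software-side detection a little more concrete, here is a deliberately naive sketch in Python. It flags words that look like they belong to another language by checking them against tiny hand-picked word lists. The word lists, the function name `flag_language_changes`, and the approach itself are assumptions chosen for illustration; real detectors rely on statistical models and far more data, and nothing here reflects how Google or any assistive tool actually works.

```python
import re
from typing import Dict, List, Set, Tuple

# Tiny, hand-picked marker words per language (illustrative only, not real data).
MARKERS: Dict[str, Set[str]] = {
    "fr": {"à", "la", "le", "les", "et", "est", "une"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein"},
}


def flag_language_changes(text: str, document_lang: str = "en") -> List[Tuple[str, str]]:
    """Return (phrase, guessed_language) pairs for runs of marker words.

    Consecutive words matching the same non-document language are merged
    into one phrase. Words matching nothing are assumed to be in the
    document language and are skipped.
    """
    runs: List[Tuple[str, str]] = []
    current_words: List[str] = []
    current_lang = document_lang

    for word in re.findall(r"\w+", text):
        guess = document_lang
        for lang, words in MARKERS.items():
            if lang != document_lang and word.lower() in words:
                guess = lang
                break
        if guess != current_lang and current_words:
            # The previous run has ended; record it.
            runs.append((" ".join(current_words), current_lang))
            current_words = []
        current_lang = guess
        if guess != document_lang:
            current_words.append(word)

    if current_words:
        runs.append((" ".join(current_words), current_lang))
    return runs
```

With these word lists, `flag_language_changes("football à la Germany")` returns `[("à la", "fr")]`. The same toy would also flag the English verb “die” as German, which shows how much better a real detector would need to be before it could carry this responsibility.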

Question 2: Could it be that requiring authors to mark up changes in language is not only comparatively expensive and pointless, even, with current implementations—but could also, overall, be done better by tools in the first place?

❧ For emphasis I framed both key questions a bit provocatively ^†. The suspicion that current industry practices and current expert advice are both a little off is strong. The way I see it, a more appropriate approach to specifying language is to

prefer indicating document language through Content-Language, and
require user agents and assistive technology to detect changes in language, where that is relevant.

Any existing provisions that mandate using @lang when important to do so should remain in place. Similarly, if documents are likely to be served under conditions where no Content-Language can be set, @lang should still be preferred.
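
For documents that may lose their HTTP header (saved files, email, and so on), it helps to know which declarations a page actually carries. The following Python sketch, using only the standard library, reads the `lang` attribute from the `html` start tag and compares it with a Content-Language value passed in. The function name, the returned dictionary shape, and the comparison on the primary subtag only are assumptions made for this example, not rules taken from any specification.

```python
from html.parser import HTMLParser
from typing import Dict, Optional


class _HtmlLangFinder(HTMLParser):
    """Records the lang attribute of the first <html> start tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.lang: Optional[str] = None
        self._seen_html = False

    def handle_starttag(self, tag, attrs):
        if tag == "html" and not self._seen_html:
            self._seen_html = True
            self.lang = dict(attrs).get("lang")


def declared_languages(markup: str, content_language: Optional[str]) -> Dict[str, object]:
    """Report both language declarations and whether they agree.

    Assumes `content_language` holds a single language tag, not a list.
    """
    finder = _HtmlLangFinder()
    finder.feed(markup)

    def primary(tag: Optional[str]) -> Optional[str]:
        # "en-GB" -> "en"; None or an empty string becomes None.
        return tag.split("-")[0].strip().lower() if tag else None

    header, attribute = primary(content_language), primary(finder.lang)
    return {
        "header": content_language,
        "attribute": finder.lang,
        # Only meaningful when both declarations are present.
        "consistent": header == attribute if header and attribute else None,
    }
```

A page served without a header and without `lang` would come back with both values as `None`, which is the case where @lang should still be preferred.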

That is my view as well as I can lay it out. With this move towards Content-Language I want to make sure I haven’t overestimated the effectiveness of HTTP headers; nor do I want to appear negligent when it comes to the actual use, and useful use at that, of information that indicates changes in language. Hence, what data and evidence did I miss? And what else could, or should, we do?

This post supersedes what I wrote to w3c-wai-gl@w3.org and help@lists.whatwg.org. The intention was the same, but the message less clear.

^* The Content-Language header defines the “natural language of the intended audience.” I simplify matters in this post by treating document language and audience language synonymously. In cases in which there is a difference that also matters, the original definitions apply, from what I can tell with marginal consequences for the issues raised.

^† …and I can only present this case because of all the work that has been done elsewhere, especially in W3C Working Groups. And so I’d like to particularly thank Richard Ishida, from whom I’ve learned much when it comes to best practice L10N and I18N. (As both a former Google tech lead and a W3C translator, I’ve become familiar with and grateful for Richard’s and the W3C I18N Activity’s work.)

Update (August 27, 2014)

For additional clarification see the discussion below, with WebAIM’s Jared Smith. I’ve also clarified my concerns and position on the list of the WCAG WG. Pardon the piecemeal approach; the topic is not trivial.

So far, group and individual feedback have been rather destructive. I believe we’re missing an opportunity to, if not adjust, then at least clarify previous guidelines. Bite reflexes, as I put it in one of the rare direct responses on Twitter, don’t help.

However, I’ve reviewed the case and will, effective immediately, stop marking up changes in language (and even rinse existing markup). I do this primarily on grounds that it’s not an accessibility problem (it’s one for all of us). Other issues listed, most notably expense of all the extra markup, have contributed to the decision. It’s not a secret that I’m a markup purist who has on other occasions taken controversial steps as they served quality and efficiency.

My recommendation to the working groups is to at least review the guidelines in place, like H57 and H58. To vendors I recommend turning it up when it comes to meaningful support for language markup, for everyone; detect languages and changes in language, provide translations, impress us. To other developers, I think you benefit from being more critical, too.

Update (April 6, 2019)

I renewed the argument.

About Me

I’m Jens (long: Jens Oliver Meiert), and I’m a web developer, manager, and author. I’ve worked as a technical lead and engineering manager for a few companies, I’m a contributor to several web standards, and I write and review books for O’Reilly and Frontend Dogma.

I love trying things, not only in web development and engineering management, but also in other areas like philosophy. Here on meiert.com I share some of my experiences and views. (I value you being critical, interpreting charitably, and giving feedback.)

Comments (Closed)

On August 25, 2014, 17:48 CEST, Jared Smith said:
I think it’s rather a stretch to suggest that adding a single attribute and value is “expensive” or “costly”. Certainly typing 10 characters is among the most effortless of accessibility requirements.

If authors are too burdened to add a single attribute, do you really think they’ll take the time to define HTTP headers at the server or back-end scripting level?

header('Content-language: de');
(for example) is NOT less code than
lang="de"
and it certainly is not “easiest to update” for those without knowledge of or access to server-side scripting.

As for “pointless”, this is incorrect. Screen readers do currently support document language and language switching (at least if defined in markup - not sure on HTTP headers - if they don’t, they should). Try listening to a page with an incorrect (or missing if the user agent primary language is different) language definition. Those 10 characters often mean the difference between accessible and utterly incomprehensible. To suggest that this affects sighted readers equally is just plain wrong.

Automated language detection simply won’t work - at least today. It would be incredibly processor intensive for internal page content. And would obviously not work for short phrases or sections. Even Google, arguably the leader in language detection, frequently mis-identifies the language of entire pages.

I can’t support your suggestion that we replace a simple solution that (generally) works with a significantly more complex one that is not yet supported.

On August 25, 2014, 18:25 CEST, mark said:
IMO it seems rather ignorant to specify the language of a document as part of the transport protocol.

There are many other ways that HTML marked-up documents are transported, e.g., via SMTP, IMAP, FTP, or through file-based systems which do not allow or support setting the Content-Language header, thus losing that information.

On August 25, 2014, 18:56 CEST, Jens Oliver Meiert said:
Jared, the typical website consists of more than one page, and Content-Language pays off early. Once you think websites you notice how beneficial HTTP headers are in comparison.

As for changes in language, these are old arguments and I’m exactly not buying them anymore. We can never expect all changes in language to be marked up. It’s, comparatively, a lot of dumb work, too. I contend that software can do better detecting changes in language than we’d ever be able to with all the guidelines in the world mandating to declare those changes. No, it won’t be perfect. But it will probably work better. And cheaper.

On August 25, 2014, 19:23 CEST, Jared Smith said:
I agree that server-side language definition can be more efficient for single-language sites. In our evaluation work, we find that many (or perhaps most) sites that have multiple language content generally have the document language defined *incorrectly* for one or more content languages (and a mismatch between document lang and HTTP headers is even more prevalent). Moving this entirely to the server level where it is invisible to authors will aggravate this issue. Evaluation of the language would require a more burdensome HTTP header analysis.

I also agree that computer-detection of language of portions of a page could be of great value when the language of that portion is not explicitly identified, but I don’t think we should toss out the currently functional lang attribute to rely on automated language detection (which does not yet exist in any meaningful way for page portions). This would be like suggesting that we drop form labeling in markup because computers can sometimes guess the label correctly based on proximity. It works OK, until the computer guesses incorrectly - then it is entirely inaccessible (actually worse than inaccessible, because the computer reads the INCORRECT label/language).

“Any existing provisions that mandate using @lang when important to do so should remain in place.” What do you mean by “when important to do so”? Do you mean where automated translation is not sufficient? If that’s the case, then all language changes are currently “important”. Educating authors to always define language changes would certainly be easier than educating them on when explicitly defining language changes is “important” (whatever that means) or not.

On August 25, 2014, 19:38 CEST, Jens Oliver Meiert said:
I think we’re really not that far apart. And to be clear, I’m still looking for more data.

My beef is mainly around the overall situation with changes in language. If I get you right then it wouldn’t be ideal if we had to teach when to mark those changes up and when these could be a tool job. I agree with that. But presented with a black and white, all or nothing decision between requiring to mark up all those changes manually, and to task tools to do so, I’d, again considering the overall situation, choose the latter.

Saying that, I also think that we haven’t tapped any potential yet. But that’s a problem with the status quo, not the proposal. If that makes sense. So I’d be excited to see guidelines (UAAG, perhaps) to require tools to go to reasonable lengths to tell changes apart. This brings me to:

“Any existing provisions that mandate using @lang when important to do so should remain in place.” What do you mean by “when important to do so”?

I still wonder, but maybe you can help, about the problem itself. How often is there a true, insurmountable barrier with lack of markup that indicates changes in language? (Note that I don’t question that there could be such issues, I know there can, but how often.) Take this post, what is the consequence of “status quo” not being marked up properly, which it isn’t? And so I say “when important,” because I believe that it doesn’t really matter as often as we (as experts) think it does. (And again, I’m not saying it never does, I’m just saying not as often as we think 😊)

On August 25, 2014, 21:53 CEST, Jared Smith said:
“We can never expect all changes in language to be marked up.”

I think this is generally your argument for moving away from @lang to automated language detection. I think the same could be said of alt text, form labeling, or pretty much anything else in accessibility. While automated processes can fill the gaps where accessibility is not defined (such as form auto-labeling, or perhaps image analysis for images without alt), should we also throw out these techniques because they are not always used and because computers might someday be able to sometimes do it automatically?

“How often is there a true, insurmountable barrier with lack of markup that indicates changes in language?”

Never. Because adding an attribute to an element will never be an “insurmountable barrier”.

“Take this post, what is the consequence of ‘status quo’ not being marked up properly, which it isn’t?”

It should not be marked up at all. “Status quo” is English (just like salsa, Los Angeles, feng shui, lager, etc.), even if its roots are not.

You’re right that the need to identify language changes happens less often than most people think (see examples above). But when it is necessary, it’s *absolutely vital* to get correct. Until we can fully trust automated processes to get this right, we shouldn’t be having discussions of removing the one foolproof method (@lang) for authors to do so.

On August 26, 2014, 18:44 CEST, Jens Oliver Meiert said:
I argue that determining language affects everyone and that it is not, per se, an accessibility problem. We all have to correctly identify language and changes therein, and we all struggle with it at times. The idea of questioning current use of @lang, then, has nothing to do with removing actually accessibility-related markup. One cannot compare the need for and situation around @lang with, for example, @alt. (That this needs explaining doesn’t stand for healthy debate.)

I believe my words get twisted here. Responses, including prior list ones, appear reflexive, evasive, even offensive. The community seems set on one course and not open to ideas, not even questions—let alone concessions.

Anyone interested in a constructive discussion, I believe we should look at data. How useful is marking up changes in language (or, in how many instances does not doing so cause problems that users can’t cope with), how many authors actually mark changes up, what is the current error rate in software determining language, where can that rate realistically be brought, what can we do in our guidelines to support, &c. pp.

On August 29, 2014, 12:42 CEST, r12a said:
Since I had already intervened there, I put some comments on the WAI mailing list.