HTML and Specifying Language
Published on August 25, 2014 (updated February 5, 2024), filed under Development.
My concerns about requiring lang to be set on the html start tag were initially based on an insufficient differentiation between (and missing reconciliation of) the text-processing language and the language(s) of the intended audience. While the two are not meant to be the same, in practice they end up being used the same way. Under that premise, the argument made below should be easier to follow.
I question the importance and ways of marking up language in HTML documents, in particular changes in language.
More specifically, I question the officially recommended practice of using the lang (and xml:lang) attribute to describe the document language (as in "<html lang=en>") as well as any subsequent changes (as in "football <i lang=fr>à la</i> Germany").
As for motivation, I deem this mostly an efficiency matter. At this point I doubt it to be an accessibility concern, because the problem (determining language) affects everyone. (I believe we can credit Joe Clark for the clear distinction that problems that affect everyone, or at least include non-disabled users, are not accessibility issues.)
I'll now go over marking up document language and marking up changes in language separately, for they present different problems and ask for different solutions.
Specifying Language
To specify the language of documents (thinking websites, not standalone pages), what first springs to mind must be: What is the most efficient way to do so?
The answer appears to be via HTTP header. That is, to set HTTP's Content-Language header *. (In Apache, this is done through the Header or DefaultLanguage directives.)
This method is so efficient because it uses the least code, is easiest to update, and carries more weight (though I couldn't find a reference to support that HTTP headers take the usual precedence here).
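To make the site-wide idea concrete, here is a minimal sketch in Python rather than Apache configuration. It assumes a WSGI application; the names with_content_language, page_app, and call_app are hypothetical and not part of the original post. The wrapper declares the language once, and every response that passes through it carries the header.

```python
from wsgiref.util import setup_testing_defaults


def page_app(environ, start_response):
    """Hypothetical page handler that knows nothing about language."""
    body = b"<!DOCTYPE html><title>Hello</title><p>Hello, world."
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [body]


def with_content_language(application, language):
    """Wrap a WSGI app so that every response declares Content-Language."""

    def wrapped(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Drop any existing value so the site-wide setting is the only one.
            headers = [(k, v) for k, v in headers if k.lower() != "content-language"]
            headers.append(("Content-Language", language))
            return start_response(status, headers, exc_info)

        return application(environ, patched_start_response)

    return wrapped


def call_app(application):
    """Invoke a WSGI app once, without a socket; return (status, headers, body)."""
    environ = {}
    setup_testing_defaults(environ)
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(application(environ, start_response))
    return captured["status"], captured["headers"], body


if __name__ == "__main__":
    status, headers, _ = call_app(with_content_language(page_app, "en"))
    print(status, headers["Content-Language"])
```

The point of the wrapper form is that individual pages stay untouched: changing the site language means changing one argument.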
Still, the W3C I18N Activity advises against using HTTP headers, at least alone: "Use language attributes rather than HTTP to declare the default language for text processing." (There seem to be no reasons given, then, as the language declarations document referenced is rather neutral about HTTP headers.)
Question 1: Given that HTTP headers are generally a more efficient and maintainable way to set document language, could it be that current advice against them (or for @lang, respectively) is at least lacking balance?
Specifying Changes in Language
If specifying the language of every document is inefficient and thus costly, then highlighting changes in language is even costlier. Language changes occur in more places, and sometimes require extra markup. (For those who don't know my definition of cost: I understand basically any negative consequence, for example more effort, as a cost.)
Changes in language also occur primarily in the "actual" document contents. That means that the requirement to mark them up affects more people (not just document and template developers, but all authors), which increases the burden and cost of the requirement. Changes in language appear often enough, then, to even affect users who don't know HTML. That is the case, for example, for users who write and edit their copy in content management systems that use some abstraction to translate contents into HTML, but also for people who use conversion tools like Markdown.
Next, and here it gets more interesting, it is completely unclear what tools actually use the information about inner-document language changes. Granted, this may be a knowledge gap on my end (being corrected is one reason why I write all of this down), but from what I've seen so far, from what I specifically understand some services like Google not to be doing, and even from my fading memories of testing assistive tools, there's not great value in marking up changes in language.
And if this was all incomplete and subject to betterment, the most important question only follows: What role can, and should, software play in this, that is, software that detects changes in language? While software may never be perfect in detecting language, it may be or become better than what we have now. To my mind (and reflecting Google doctrine), language detection should be automated. It should be a software responsibility. That a good number of authors cannot (the ones without training) and will never (the ones that use abstraction tools) mark up changes in language only seems to strengthen this thinking.
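What "software that detects changes in language" might mean can be sketched in a few lines. The following is a deliberately naive illustration, not how any real product works: it assumes tiny hand-picked stopword lists (STOPWORDS) and the hypothetical helpers guess_language and language_runs, and it guesses per sentence by counting stopword hits.

```python
import re

# Assumption: tiny, hand-picked stopword samples for illustration only.
# Real detectors rely on far richer statistical models.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it", "with"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "mit", "ein", "zu"},
    "fr": {"le", "la", "les", "et", "est", "des", "une", "dans", "pas"},
}


def guess_language(text):
    """Return the language with the most stopword hits, or None without a clear winner."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in STOPWORDS.items()
    }
    ranked = sorted(scores.values(), reverse=True)
    if ranked[0] == 0 or ranked[0] == ranked[1]:
        return None  # no evidence, or a tie between languages
    return max(scores, key=scores.get)


def language_runs(text):
    """Split text into sentences and pair each one with its guessed language."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [(guess_language(s), s) for s in sentences]


if __name__ == "__main__":
    sample = "The cat is in the house. Der Hund ist nicht mit der Katze."
    for lang, sentence in language_runs(sample):
        print(lang, "|", sentence)
```

A sketch like this does reasonably on full sentences and has nothing to go on for a short phrase such as "status quo", for which it returns None. That is the trade-off the question turns on.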
Question 2: Could it be that requiring authors to mark up changes in language is not only comparatively expensive and, with current implementations, even pointless, but could also, overall, be done better by tools in the first place?
For emphasis I framed both key questions a bit provocatively †. The suspicion that current industry practices and current expert advice are both a little off is strong. The way I see it, a more appropriate approach to specifying language is to
- prefer indicating document language through Content-Language, and
- require user agents and assistive technology to detect changes in language, where that is relevant.
Any existing provisions that mandate using @lang when important to do so should remain in place. Similarly, if documents are likely to be served under conditions where no Content-Language can be set, @lang should still be preferred.
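Where both signals may be present, a small check can show which ones a document actually carries and whether they agree. This is a minimal sketch using only Python's standard library; the helper names (markup_language, header_languages, language_report) are hypothetical, and the sketch deliberately does not decide which signal takes precedence, since the post leaves that open.

```python
from html.parser import HTMLParser


class _HtmlLangParser(HTMLParser):
    """Record the lang attribute of the first <html> start tag, if any."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self.seen_html = False

    def handle_starttag(self, tag, attrs):
        if tag == "html" and not self.seen_html:
            self.seen_html = True
            self.lang = dict(attrs).get("lang")


def markup_language(document):
    """Return the value of <html lang=...>, or None if it is not set."""
    parser = _HtmlLangParser()
    parser.feed(document)
    return parser.lang


def header_languages(content_language):
    """Split a Content-Language value such as 'de, en' into lowercase tags."""
    if not content_language:
        return []
    return [tag.strip().lower() for tag in content_language.split(",") if tag.strip()]


def language_report(document, content_language=None):
    """Summarize both signals; 'consistent' is False only if both exist and disagree."""
    markup = markup_language(document)
    header = header_languages(content_language)
    consistent = not (markup and header) or markup.lower() in header
    return {"markup": markup, "header": header, "consistent": consistent}


if __name__ == "__main__":
    page = "<html lang=en><p>football <i lang=fr>à la</i> Germany"
    print(language_report(page, "en"))
```

Such a report only compares exact tags (so "en" and "en-US" count as different); a stricter or looser comparison is a design choice left open here.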
That is my view as well as I can lay it out. With this move towards Content-Language I wish to make sure I didn't overestimate the effectiveness of HTTP headers; and neither do I want to appear negligent when it comes to the actual use, and useful use at that, of information that indicates changes in language. Hence, what data and evidence did I miss? And what else could, or should, we do?
This post supersedes what I wrote to w3c-wai-gl@w3.org and help@lists.whatwg.org. The intention was the same, but the message less clear.
* The Content-Language header defines the "natural language of the intended audience." I simplify matters in this post by treating document language and audience language synonymously. In cases in which there is a difference that also matters, the original definitions apply, from what I can tell with marginal consequences for the issues raised.
† …and I can only present this case because of all the work that has been done elsewhere, especially in W3C Working Groups. And so I like to particularly thank Richard Ishida, from whom I've learned much when it comes to best practice L10N and I18N. (As both a former Google tech lead and a W3C translator, I've become familiar with and grateful for Richard's and the W3C I18N Activity's work.)
Update (August 27, 2014)
For additional clarification see the discussion below, with WebAIM's Jared Smith. I've also clarified my concerns and position on the list of the WCAG WG. Pardon the piecemeal approach; the topic is not trivial.
So far, group and individual feedback have been rather destructive. I believe we're missing an opportunity to, if not adjust, then clarify previous guidelines. Bite reflexes, as I put it in one of the rare direct responses on Twitter, don't help.
However, I've reviewed the case and will, effective immediately, stop marking up changes in language (and even rinse existing markup). I do this primarily on grounds that it's not an accessibility problem (it's one for all of us). Other issues listed, most notably the expense of all the extra markup, have contributed to the decision. It's not a secret that I'm a markup purist who has on other occasions taken controversial steps as they served quality and efficiency.
My recommendation to the working groups is to at least review the guidelines in place, like H57 and H58. To vendors I recommend turning it up when it comes to meaningful support for language markup, for everyone; detect languages and changes in language, provide translations, impress us. To other developers, I think you benefit from being more critical, too.
Update (April 6, 2019)
I renewed the argument.
About Me
I'm Jens (long: Jens Oliver Meiert), and I'm a frontend engineering leader and tech author/publisher. I've worked as a technical lead for companies like Google and as an engineering manager for companies like Miro, I'm a contributor to several web standards, and I write and review books for O'Reilly and Frontend Dogma.
I love trying things, not only in web development (and engineering management), but also in other areas like philosophy. Here on meiert.com I share some of my experiences and views. (Please be critical, interpret charitably, and give feedback.)
Comments (Closed)
-
On August 25, 2014, 17:48 CEST, Jared Smith said:
I think it's rather a stretch to suggest that adding a single attribute and value is "expensive" or "costly". Certainly typing 10 characters is among the most effortless of accessibility requirements.
If authors are too burdened to add a single attribute, do you really think they'll take the time to define HTTP headers at the server or back-end scripting level?
header("Content-language: de");
(for example) is NOT less code than
lang="de"
and it certainly is not "easiest to update" for those without knowledge of or access to server-side scripting.
As for "pointless", this is incorrect. Screen readers do currently support document language and language switching (at least if defined in markup - not sure on HTTP headers - if they don't, they should). Try listening to a page with an incorrect (or missing if the user agent primary language is different) language definition. Those 10 characters often mean the difference between accessible and utterly incomprehensible. To suggest that this affects sighted readers equally is just plain wrong.
Automated language detection simply won't work - at least today. It would be incredibly processor intensive for internal page content. And would obviously not work for short phrases or sections. Even Google, arguably the leader in language detection, frequently mis-identifies the language of entire pages.
I can't support your suggestion that we replace a simple solution that (generally) works with a significantly more complex one that is not yet supported.
-
On August 25, 2014, 18:25 CEST, mark said:
IMO it seems rather ignorant to specify the language of a document as part of the transport protocol.
There are many other ways that HTML documents are transported, e.g. via SMTP, IMAP, FTP, or through file-based systems, which do not allow or support setting the Content-Language header, thus losing that information.
-
On August 25, 2014, 18:56 CEST, Jens Oliver Meiert said:
Jared, the typical website consists of more than one page, and Content-Language pays off early. Once you think websites you notice how beneficial HTTP headers are in comparison.
As for changes in language, these are old arguments and I'm exactly not buying them anymore. We can never expect all changes in language to be marked up. It's, comparatively, a lot of dumb work, too. I contend that software can do better detecting changes in language than we'd ever be able to with all the guidelines in the world mandating to declare those changes. No, it won't be perfect. But it will probably work better. And cheaper.
-
On August 25, 2014, 19:23 CEST, Jared Smith said:
I agree that server-side language definition can be more efficient for single-language sites. In our evaluation work, we find that many (or perhaps most) sites that have multiple language content generally have the document language defined *incorrectly* for one or more content languages (and a mismatch between document lang and HTTP headers is even more prevalent). Moving this entirely to the server level where it is invisible to authors will aggravate this issue. Evaluation of the language would require a more burdensome HTTP header analysis.
I also agree that computer-detection of language of portions of a page could be of great value when the language of that portion is not explicitly identified, but I don't think we should toss out the currently functional lang attribute to rely on automated language detection (which does not yet exist in any meaningful way for page portions). This would be like suggesting that we drop form labeling in markup because computers can sometimes guess the label correctly based on proximity. It works OK, until the computer guesses incorrectly - then it is entirely inaccessible (actually worse than inaccessible, because the computer reads the INCORRECT label/language).
"Any existing provisions that mandate using @lang when important to do so should remain in place." What do you mean by "when important to do so"? Do you mean where automated translation is not sufficient? If that's the case, then all language changes are currently "important". Educating authors to always define language changes would certainly be easier than educating them on when explicitly defining language changes is "important" (whatever that means) or not.
-
On August 25, 2014, 19:38 CEST, Jens Oliver Meiert said:
I think we're really not that far apart. And to be clear, I'm still looking for more data.
My beef is mainly around the overall situation with changes in language. If I get you right then it wouldn't be ideal if we had to teach when to mark those changes up and when these could be a tool job. I agree with that. But presented with a black and white, all or nothing decision between requiring to mark up all those changes manually, and to task tools to do so, I'd, again considering the overall situation, choose the latter.
Saying that, I also think that we haven't tapped any potential yet. But that's a problem with the status quo, not the proposal. If that makes sense. So I'd be excited to see guidelines (UAAG, perhaps) to require tools to go to reasonable lengths to tell changes apart. This brings me to:
"Any existing provisions that mandate using @lang when important to do so should remain in place." What do you mean by "when important to do so"?
I still wonder, but maybe you can help, about the problem itself. How often is there a true, insurmountable barrier with lack of markup that indicates changes in language? (Note that I don't question that there could be such issues, I know there can, but how often.) Take this post, what is the consequence of "status quo" not being marked up properly, which it isn't? And so I say "when important," because I believe that it doesn't really matter as often as we (as experts) think it does. (And again, I'm not saying it never does, I'm just saying not as often as we think.)
-
On August 25, 2014, 21:53 CEST, Jared Smith said:
We can never expect all changes in language to be marked up.
I think this is generally your argument for moving away from @lang to automated language detection. I think the same could be said of alt text, form labeling, or pretty much anything else in accessibility. While automated processes can fill the gaps where accessibility is not defined (such as form auto-labeling, or perhaps image analysis for images without alt), should we also throw out these techniques because they are not always used and because computers might someday be able to sometimes do it automatically?
How often is there a true, insurmountable barrier with lack of markup that indicates changes in language?
Never. Because adding an attribute to an element will never be an "insurmountable barrier".
Take this post, what is the consequence of "status quo" not being marked up properly, which it isn't?
It should not be marked up at all. "Status quo" is English (just like salsa, Los Angeles, feng shui, lager, etc.), even if its roots are not.
You're right that the need to identify language changes happens less often than most people think (see examples above). But when it is necessary, it's *absolutely vital* to get correct. Until we can fully trust automated processes to get this right, we shouldn't be having discussions of removing the one foolproof method (@lang) for authors to do so.
-
On August 26, 2014, 18:44 CEST, Jens Oliver Meiert said:
I argue that determining language affects everyone and that it is not, per se, an accessibility problem. We all have to correctly identify language and changes therein, and we all struggle with it at times. The idea of questioning current use of @lang, then, has nothing to do with removing actually accessibility-related markup. One cannot compare the need for and situation around @lang with, for example, @alt. (That this needs explaining doesn't stand for healthy debate.)
I believe my words get twisted here. Responses, including prior list ones, appear reflexive, evasive, even offensive. The community seems set on one course and not open to ideas, not even questions, let alone concessions.
Anyone interested in a constructive discussion, I believe we should look at data. How useful is marking up changes in language (or, in how many instances does not doing so cause problems that users can't cope with), how many authors actually mark changes up, what is the current error rate in software determining language, where can that rate realistically be brought, what can we do in our guidelines to support, &c. pp.
-
On August 29, 2014, 12:42 CEST, r12a said:
Since I had already intervened there, I put some comments on the WAI mailing list.