2012-01-06

Tomcat v6.0.35 and UTF-8 Parameters

Update 1 (2012-01-07): I don't have access to the problematic system right now, so I can't confirm this, but when I reduced the test case to a single JSP, it worked fine. It also worked fine with a single mayaa file. This leads me to think that the issue is specific to that system.

The recent release of Apache Tomcat, v6.0.35, seems to break the handling of parameters encoded in UTF-8. For example, if I pass "%E6%97%A5%E6%9C%AC" (the URL-escaped UTF-8 bytes for "日本"), it is decoded incorrectly. Both URIEncoding="UTF-8" and useBodyEncodingForURI="true" are set on the relevant Connectors in server.xml, and the same configuration works as expected in versions prior to v6.0.35.

Expected:
$ cat nippon && cat $_ | hexdump -C
日本
00000000  e6 97 a5 e6 9c ac 0a                              |.......|
00000007

Actual:
$ cat tomcat-bug && cat $_ | hexdump -C
æ¥æ¬
00000000  c3 a6 c2 97 c2 a5 c3 a6  c2 9c c2 ac 0a           |.............|
0000000d

I cloned the GitHub mirror of tomcat60 and did a quick git-bisect. The offending commit is 1ef4156 (r1200601 in SVN), which corresponds to the last two items of the Catalina changelog for unreleased version 6.0.34.

In other words, Tomcat interprets the parameters correctly before 1ef4156 and incorrectly from 1ef4156 onward.

It is hard to tell exactly what the problem is, though, because 1ef4156 is such a large commit. My best guess, without digging into the code, is that ISO-8859-1 is being used instead of UTF-8 in the decoding process; that is, the configured charset does not seem to be passed correctly to the parameter processor.

The same "mistaken" decoding can be reproduced with iconv by treating the UTF-8 bytes as if they were ISO-8859-1:
$ cat nippon | iconv -f ISO-8859-1 -t UTF-8 | hexdump -C
00000000  c3 a6 c2 97 c2 a5 c3 a6  c2 9c c2 ac 0a           |.............|
0000000d
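
The same effect can also be reproduced starting from the URL-escaped parameter itself. The following is a minimal sketch in Python, not Tomcat's actual code: it only illustrates my guess that the percent-decoded bytes are being read as ISO-8859-1 instead of UTF-8. The names RAW_PARAM, decode_param and utf8_hex are made up for this illustration.

```python
from urllib.parse import unquote

# The URL-escaped UTF-8 bytes for "日本", as passed in the request.
RAW_PARAM = "%E6%97%A5%E6%9C%AC"


def decode_param(raw: str, charset: str) -> str:
    """Percent-decode raw, interpreting the resulting bytes as charset."""
    return unquote(raw, encoding=charset)


def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated hex, like hexdump."""
    return text.encode("utf-8").hex(" ")


if __name__ == "__main__":
    # Assumption: the broken build behaves as if charset were "iso-8859-1".
    for charset in ("utf-8", "iso-8859-1"):
        decoded = decode_param(RAW_PARAM, charset)
        print(f"{charset:>10}: {utf8_hex(decoded)}")
```

With "utf-8" the bytes should match the "Expected" dump above, and with "iso-8859-1" they should match the "Actual" dump, in both cases without the trailing 0a newline that cat adds from the file.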

Maybe I'll have a look later and try to fix the problem.