[MAGNOLIA-3008] Changes in the handling of url decoding in 4.2 break non-ascii urls that were working properly in previous versions Created: 14/Jan/10 Updated: 23/Jan/13 Resolved: 15/Mar/10 |
|
| Status: | Closed |
| Project: | Magnolia |
| Component/s: | None |
| Affects Version/s: | 4.2, 4.2.1, 4.2.2, 4.1.4, 4.2.3 |
| Fix Version/s: | 4.3 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Fabrizio Giustina | Assignee: | Fabrizio Giustina |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Template: |
|
| Acceptance criteria: |
Empty
|
| Task DoD: |
[ ]*
Doc/release notes changes? Comment present?
[ ]*
Downstream builds green?
[ ]*
Solution information and context easily available?
[ ]*
Tests
[ ]*
FixVersion filled and not yet released
[ ] 
Architecture Decision Record (ADR)
|
| Bug DoR: |
[ ]*
Steps to reproduce, expected, and actual results filled
[ ]*
Affected version filled
|
| Date of First Response: |
| Description |
|
After upgrading from 4.1 to 4.2 (4.2.3) we saw several problems related to url encoding, and looks like the changes in Magnolia 4.2 introduced some regressions. The related jiras are: The problems: after 4.2 we were happily using virtual uris that were working properly on non-ascii url, for example: url: http://www.myserver.com/path/{test} VirtualURIMapping code: public MappingResult mapURI(String uri) { if (StringUtils.contains(uri, "{test}")) { // do something everything was used to work properly (and as expected) in magnolia < 4.2. The above expression was correctly evaluated to : StringUtils.contains("http://www.myserver.com/path/${test}", "{test}")
After upgrading to 4.2 the url given to the virtualURI is And of course this break everything: StringUtils.contains("http://www.myserver.com/path/%7Btest%7D", "{test}")
This also happens with accented characters (àèìòù) which are very common in italian. Also the path received by the cms filter is now undecoded: in the past the path "/già" was ending in the request of the content "/già" in the website repo (apart from the fact that magnolia strips accented characters when creating pages from the admin interface this is perfectly legal in jcr, and content loaded from xml with such paths was handled property by magnolia)... now the HierarchyManager looks for /gi%something which is definitively not good. Note that this also happen after tweaking the tomcat connector to use UTF8 (see http://wiki.magnolia-cms.com/display/DEV/URI+encoding+in+Tomcat ), which is something that was needed in the past to get parameters in get correctly decoded. Unfortunately this looks like a blocker for the upgrade from 4.1 to 4.2, I couldn't find any configuration that could restore the old behaviour and any non-ascii url (for example we used a lot of SEO friendly virtualURIs which mapped italian words...). |
| Comments |
| Comment by Jan Haderka [ 15/Jan/10 ] |
Indeed, and if you read javadoc for the method that was removed from aggregation state that was used to decode the uri you would see that it also was meant for the parameters en/decoding not for URIs. If something like that was ever needed, we should have used org.apache.catalina.util.RequestUtil.URLDecode or one described at http://www.w3.org/International/O-URL-code.html instead. IMHO, there are 2 issues which we are mixing together:
we can deal with the second part of it in the as for the char encoding of the URI (not the query string!!!) , the RFC-2396 states allowed chars in section 1.6 and defines rules for remaining character escaping in section 2. Subsection 2.4.2 also warns about un-escaping the URI multiple times, which is what was happening in Anyway coming back to the original issue - if % encoded character is contained in the set of characters it is decoded by app server (tomcat in this case) as expected. (it really is done by the server not by the browser - verified by looking at the protocol level: GET /demo-fe%61tures.html HTTP/1.1 was issues). So the only time when the character is left in % notation is when it is outside of the ascii range, In that case we are anyway hitting the restrictions of the node names later when passed to HM. If/when we reintroduce the decoding in AggregationState I would prefer to make this configurable and also to keep the value of decodedURI in separate property rather then instrumenting the value of originalURI directly. And we definitively need to handle it since javadoc for HttpServletRequest.getRequestURI() says "The web container does not decode this String.", but as shown above with Tomcat it is true only to the certain extent (non ascii chars) and other app servers have similar quirks. some more reading: http://www.whirlycott.com/phil/2005/05/11/building-j2ee-web-applications-with-utf-8-support/ |
| Comment by Fabrizio Giustina [ 15/Mar/10 ] |
|
fixed with |