[MAGNOLIA-3008] Changes in the handling of url decoding in 4.2 break non-ascii urls that were working properly in previous versions Created: 14/Jan/10  Updated: 23/Jan/13  Resolved: 15/Mar/10

Status: Closed
Project: Magnolia
Component/s: None
Affects Version/s: 4.2, 4.2.1, 4.2.2, 4.1.4, 4.2.3
Fix Version/s: 4.3

Type: Bug Priority: Critical
Reporter: Fabrizio Giustina Assignee: Fabrizio Giustina
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Template:
Acceptance criteria:
Empty
Task DoD:
[ ]* Doc/release notes changes? Comment present?
[ ]* Downstream builds green?
[ ]* Solution information and context easily available?
[ ]* Tests
[ ]* FixVersion filled and not yet released
[ ]  Architecture Decision Record (ADR)
Bug DoR:
[ ]* Steps to reproduce, expected, and actual results filled
[ ]* Affected version filled
Date of First Response:

 Description   

After upgrading from 4.1 to 4.2 (4.2.3) we saw several problems related to url encoding, and looks like the changes in Magnolia 4.2 introduced some regressions.

The related jiras are:
MAGNOLIA-2524 - AggregationState.decodeURI is wrong - review
MAGNOLIA-2899 - Make ServletDispatchingFilter i18n aware

The problems: after 4.2 we were happily using virtual uris that were working properly on non-ascii url, for example:

url: http://www.myserver.com/path/{test}


VirtualURIMapping code:

public MappingResult mapURI(String uri)
    {
          if (StringUtils.contains(uri, "{test}"))
            {
                // do something

everything was used to work properly (and as expected) in magnolia < 4.2. The above expression was correctly evaluated to :

          StringUtils.contains("http://www.myserver.com/path/${test}", "{test}")

After upgrading to 4.2 the url given to the virtualURI is
http://www.myserver.com/path/%7Btest%7D

And of course this break everything:

          StringUtils.contains("http://www.myserver.com/path/%7Btest%7D", "{test}")

This also happens with accented characters (àèìòù) which are very common in italian. Also the path received by the cms filter is now undecoded: in the past the path "/già" was ending in the request of the content "/già" in the website repo (apart from the fact that magnolia strips accented characters when creating pages from the admin interface this is perfectly legal in jcr, and content loaded from xml with such paths was handled property by magnolia)... now the HierarchyManager looks for /gi%something which is definitively not good.

Note that this also happen after tweaking the tomcat connector to use UTF8 (see http://wiki.magnolia-cms.com/display/DEV/URI+encoding+in+Tomcat ), which is something that was needed in the past to get parameters in get correctly decoded.

Unfortunately this looks like a blocker for the upgrade from 4.1 to 4.2, I couldn't find any configuration that could restore the old behaviour and any non-ascii url (for example we used a lot of SEO friendly virtualURIs which mapped italian words...).



 Comments   
Comment by Jan Haderka [ 15/Jan/10 ]

...which is something that was needed in the past to get parameters in get correctly decoded.

Indeed, and if you read javadoc for the method that was removed from aggregation state that was used to decode the uri you would see that it also was meant for the parameters en/decoding not for URIs. If something like that was ever needed, we should have used org.apache.catalina.util.RequestUtil.URLDecode or one described at http://www.w3.org/International/O-URL-code.html instead.

IMHO, there are 2 issues which we are mixing together:

  • encoding of national chars in the URI (i.e. % notation to real chars) - that should always be a web server responsibility
  • usage of the national chars in node names - limited by the JCR spec restrictions for node names

we can deal with the second part of it in the MAGNOLIA-3009

as for the char encoding of the URI (not the query string!!!) , the RFC-2396 states allowed chars in section 1.6 and defines rules for remaining character escaping in section 2. Subsection 2.4.2 also warns about un-escaping the URI multiple times, which is what was happening in MAGNOLIA-2524 and was causing issues linked from there.
The fact that it is working properly now is visible when requesting something from say DMS where the name of the document can be anything and is not limited by the JCR restrictions for the node names, e.g. requesting http://localhost:8080/dms/untitled/august-100--200-/august-100%E2%82%AC-200%E2%82%AC.pdf will properly retrieve document named august-100€-200€.pdf (but OK, this is because DMSDownloadServlet does try to decode the name itself as well, which it should not).

Anyway coming back to the original issue - if % encoded character is contained in the set of characters it is decoded by app server (tomcat in this case) as expected. (it really is done by the server not by the browser - verified by looking at the protocol level: GET /demo-fe%61tures.html HTTP/1.1 was issues).

So the only time when the character is left in % notation is when it is outside of the ascii range, In that case we are anyway hitting the restrictions of the node names later when passed to HM.
In case of the VirtualURIMapping you can make it aware and decode the URIs as necessary. You still can't create a page with name già so you need to map it to something else either way and you can either decode the URI or encode the test string before attempting to do the match.

If/when we reintroduce the decoding in AggregationState I would prefer to make this configurable and also to keep the value of decodedURI in separate property rather then instrumenting the value of originalURI directly. And we definitively need to handle it since javadoc for HttpServletRequest.getRequestURI() says "The web container does not decode this String.", but as shown above with Tomcat it is true only to the certain extent (non ascii chars) and other app servers have similar quirks.

some more reading: http://www.whirlycott.com/phil/2005/05/11/building-j2ee-web-applications-with-utf-8-support/

Comment by Fabrizio Giustina [ 15/Mar/10 ]

fixed with MAGNOLIA-3009

Generated at Mon Feb 12 03:42:17 CET 2024 using Jira 9.4.2#940002-sha1:46d1a51de284217efdcb32434eab47a99af2938b.