[MAGNOLIA-3333] cache: improve logging on blocking timeouts and unlock in the cache filter on exceptions Created: 22/Oct/10  Updated: 17/Aug/12  Resolved: 19/Jan/11

Status: Closed
Project: Magnolia
Component/s: cache
Affects Version/s: None
Fix Version/s: 4.4.2, 4.5

Type: Bug Priority: Major
Reporter: Magnolia International Assignee: Philipp Bärfuss
Resolution: Fixed Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File MAGNOLIA-3333-exception.txt     Text File magnolia-module-cache.patch    
Issue Links:
relation
is related to MAGNOLIA-3167 cache: single blocking request can bl... Closed
Template:
Acceptance criteria:
Empty
Task DoD:
[ ]* Doc/release notes changes? Comment present?
[ ]* Downstream builds green?
[ ]* Solution information and context easily available?
[ ]* Tests
[ ]* FixVersion filled and not yet released
[ ]  Architecture Decision Record (ADR)
Bug DoR:
[ ]* Steps to reproduce, expected, and actual results filled
[ ]* Affected version filled
Date of First Response:

 Description   

After a longer discussion we have done the following:

  • log when a requests takes longer to produce the content than the lock timeout for the waiting threads are
  • log when a request was blocked longer than the timeout period (because the producer has not finished yet)
  • introduced a blocking cache interface to expose the timeout configuration an to wrap the unlock method
  • it is now the cache filter which will unblock on exceptions (before it was the store executor)


 Comments   
Comment by orlando warren [ 02/Dec/10 ]

We're having the same problem in production (and had to rollback because of it). Here's a stack trace:

Caused by: net.sf.ehcache.constructs.blocking.LockTimeoutException: Lock timeout. Waited more than 10000ms t
o acquire lock for key /efax/content/login.efbfbd7e13efbfbdefbfbddaa02cefbfbd243472efbfbd3defbfbd on blockin
g cache default
at net.sf.ehcache.constructs.blocking.BlockingCache.get(BlockingCache.java:532)
at info.magnolia.module.cache.ehcache.EhCacheWrapper.get(EhCacheWrapper.java:63)
at info.magnolia.module.cache.cachepolicy.Default.shouldCache(Default.java:96)
at info.magnolia.module.cache.filter.CacheFilter.doFilter(CacheFilter.java:111)
at info.magnolia.cms.filters.OncePerRequestAbstractMgnlFilter.doFilter(OncePerRequestAbstractMgnlFil
ter.java:62)
at info.magnolia.cms.filters.MgnlFilterChain.doFilter(MgnlFilterChain.java:82)
at info.magnolia.cms.i18n.I18nContentSupportFilter.doFilter(I18nContentSupportFilter.java:75)
at info.magnolia.cms.filters.AbstractMgnlFilter.doFilter(AbstractMgnlFilter.java:70)
at info.magnolia.cms.filters.MgnlFilterChain.doFilter(MgnlFilterChain.java:82)
at info.magnolia.cms.filters.VirtualUriFilter.doFilter(VirtualUriFilter.java:133)
at info.magnolia.cms.filters.AbstractMgnlFilter.doFilter(AbstractMgnlFilter.java:70)
at info.magnolia.cms.filters.MgnlFilterChain.doFilter(MgnlFilterChain.java:82)
at info.magnolia.cms.filters.MgnlFilterChain.doFilter(MgnlFilterChain.java:76)
at info.magnolia.cms.filters.CompositeFilter.doFilter(CompositeFilter.java:64)
at info.magnolia.cms.filters.AbstractMgnlFilter.doFilter(AbstractMgnlFilter.java:70)
at info.magnolia.cms.filters.MgnlFilterChain.doFilter(MgnlFilterChain.java:82)
at info.magnolia.cms.filters.MgnlFilterChain.doFilter(MgnlFilterChain.java:76)
at j2.module.j2core.cookie.CookieParamCacheKeyManagementFilter.doFilter(CookieParamCacheKeyManagemen
tFilter.java:183)
at info.magnolia.cms.filters.AbstractMgnlFilter.doFilter(AbstractMgnlFilter.java:70)
at info.magnolia.cms.filters.MgnlFilterChain.doFilter(MgnlFilterChain.java:82)
at j2.module.j2core.akamai.AkamaiCacheKeyManagementFilter.doFilter(AkamaiCacheKeyManagementFilter.ja
va:115)
at info.magnolia.cms.filters.AbstractMgnlFilter.doFilter(AbstractMgnlFilter.java:70)
at info.magnolia.cms.filters.MgnlFilterChain.doFilter(MgnlFilterChain.java:82)
at info.magnolia.cms.filters.MgnlFilterChain.doFilter(MgnlFilterChain.java:84)
at j2.module.efax.filters.offercode.EfaxOfferCodeFilter.doFilter(EfaxOfferCodeFilter.java:356)
... 54 more

Comment by orlando warren [ 02/Dec/10 ]

btw, we're using Magnolia 4.3.8

Comment by Philipp Bärfuss [ 02/Dec/10 ]

What happens is the following:

  • a request cannot be served by the cache (because it is not yet cached) and so it starts to process the request. The lock for the cache key is acquired
  • followup requests to the same key are blocked (as one request is already caching the same data)
  • to avoid having all threads waiting we have introduced a timeout for the waiting requests
  • if the first request never returns (or if it is to slow) a exception is thrown to give the thread free for other requests

The main question is why is the first request not returning. You can best find out by making a thread dump (feel free to attach it here):

-> http://expertodev.wordpress.com/2009/05/30/how-to-take-java-thread-dump/

Could it be that the request is just too slow (below 10 seconds?)

Comment by orlando warren [ 03/Dec/10 ]

We're having a hard time reproducing the issue on our QA environment. Hopefully when we do I can send it along.
I doubt that the request was taking more than 10s, because then we would have seen a degradation lasting 10's of seconds followed by normal performance again. The site really did die.
We're thinking of uncommenting out in your CacheFilter the finally clause in the doFilter method:

public void doFilter(HttpServletRequest request, HttpServletResponse response, FilterChain chain) throws IOException, ServletException {
final AggregationState aggregationState = MgnlContext.getAggregationState();
final CachePolicyResult cachePolicyResult = cacheConfig.getCachePolicy().shouldCache(cache, aggregationState, cacheConfig.getFlushPolicy());

log.debug("Cache policy result: {}", cachePolicyResult);

final CachePolicyResult.CachePolicyBehaviour behaviour = cachePolicyResult.getBehaviour();
monitor.logBehavior(behaviour.getName());
monitor.logAccess(cachePolicyResult.getCacheKey());
final CachePolicyExecutor executor = cacheConfig.getExecutor(behaviour);
if (executor == null)

{ throw new IllegalStateException("Unexpected cache policy result: " + cachePolicyResult); }

executor.processCacheRequest(request, response, chain, cache, cachePolicyResult);

// TODO if the cache blocks we will have to add this again.
/*
finally{
Object key = cachePolicyResult.getCacheKey();
if (!cachePolicyResult.getBehaviour().equals(CachePolicyResult.bypass) && (
((EhCacheWrapper)cache).getWrappedEhcache().getQuiet(key) == null)) {
log.warn("Cache nearly blocked for key: {}, removed entry", key);
cache.put(key, null);
cache.remove(key);
}
}
*/
}

The comment seems to indicate that was what the finally was for. What do you think? Why did you comment out the code?

Thanks.

Comment by Craig Blake [ 06/Dec/10 ]

From looking at the code more, it appears that the finally clause (which use to be in the CacheFilter to handle the case where EhCache throws an exception and needs to have the lock explicitly released) was moved into the Store where the main cache get happens.

However, there are other methods which call get() on the underlying blocking cache and which need to have non-timeout exceptions handled to prevent a deadlock, including the method Default.shouldCache() which is called on line 111 of CacheFilter. From what I can see when Default.shouldCache() calls get() on line 96, if a non-timeout exception is thrown, it is not handled and will result in deadlock on the cache key.

Are we missing something?

Comment by Philipp Bärfuss [ 24/Dec/10 ]

attaching an untested [patch|^magnolia-module-ache.patch] which:

  • introduces a BlockingCache interface
    • int getBlockingTimeout()
    • unlock(key)
  • the unlocking is done in the CacheFilter
  • Store
    • the bigger part of the diff in Store is due to indentation
    • we are not catching exceptions as they are handled by the filter
    • we are not taking any measures to un-block the cache

Attention: the patch is based on Magnolia 4.3

Comment by Jan Haderka [ 03/Jan/11 ]

Can you add junit or integration test that would reexecute scenario described by Philipp and ensure that it is really solved by the patch?
Thx.

Comment by Philipp Bärfuss [ 18/Jan/11 ]

Changes are implemented but not back-ported to 4.3. This mostly improves the logs of such occasions.

Comment by Jan Haderka [ 19/Jan/11 ]

Fix versions? Patch was also applied on 4.3 branch and not reverted from there (According to the log comments).

Comment by Philipp Bärfuss [ 19/Jan/11 ]

I updated the title and description.

The original description was:

I've noticed a lot of exceptions like the attached on our documentation website. I have currently no idea why this happens. Restarting the instance "solves" the issue.

There is a secondary issue to this: the exception appears both in the logs and on screen. I believe this type of exception should never been rendered in the client of a public instance (it shows some information that might be considered sensitive). (perhaps a good generic solution to this problem would be a filter intercepting all exceptions and rendering a default uninformative error page on public instances)

Comment by Philipp Bärfuss [ 19/Jan/11 ]

The revert on 4.3 was done in revition 41282. Sorry for the lousy commit message.

Generated at Mon Feb 12 03:45:29 CET 2024 using Jira 9.4.2#940002-sha1:46d1a51de284217efdcb32434eab47a99af2938b.