[MAGNOLIA-3167] cache: single blocking request can block all other requests to cached content Created: 26/Mar/10  Updated: 23/Jan/13  Resolved: 07/May/10

Status: Closed
Project: Magnolia
Component/s: cache
Affects Version/s: 3.6.8, 4.1.4, 4.2.3, 4.3.1
Fix Version/s: 4.3.2

Type: Bug Priority: Major
Reporter: Philipp Bärfuss Assignee: Jan Haderka
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File cacheBlock.png     PNG File cacheBlockOnflush.png    
Issue Links:
dependency
is depended upon by MGNLADVCACHE-15 Remove synchronized blocks Closed
relation
is related to MAGNOLIA-3333 cache: improve logging on blocking ti... Closed
is related to MGNLCACHE-36 Failure to generate response first ti... Closed
supersession
Template:
Patch included:
Yes
Acceptance criteria:
Empty
Task DoD:
[ ]* Doc/release notes changes? Comment present?
[ ]* Downstream builds green?
[ ]* Solution information and context easily available?
[ ]* Tests
[ ]* FixVersion filled and not yet released
[ ]  Architecture Decision Record (ADR)
Bug DoR:
[ ]* Steps to reproduce, expected, and actual results filled
[ ]* Affected version filled
Date of First Response:

 Description   

The cache mechanism can block all requests:

  • cache.get() will block if another request is caching the same key (this is a feature of the BlockingCache)
    • a mutex per key is kept until cache.put() is called (either with an entry or a null value)
  • this code is additionally wrapped in a synchronized block which synchronizes on the cache object itself
    • this blocks all other requests trying to enter the synchronized block
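The per-key mutex behavior described above can be sketched roughly as follows. This is a simplified, hypothetical model of EhCache's BlockingCache contract, not Magnolia's actual code; the class and method names are illustrative:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;

// Simplified model of the BlockingCache contract: a get() miss keeps a
// per-key lock until put() is called, so only one thread renders the entry.
public class BlockingCacheSketch {
    private final ConcurrentMap<Object, Object> store = new ConcurrentHashMap<>();
    private final ConcurrentMap<Object, ReentrantLock> locks = new ConcurrentHashMap<>();

    private ReentrantLock lockFor(Object key) {
        return locks.computeIfAbsent(key, k -> new ReentrantLock());
    }

    // Blocks if another thread is currently populating the same key.
    public Object get(Object key) {
        ReentrantLock lock = lockFor(key);
        lock.lock();
        Object value = store.get(key);
        if (value != null) {
            lock.unlock(); // hit: let other threads in immediately
        }
        // miss: the lock stays held until put() is called for this key
        return value;
    }

    // Stores the rendered entry (or null on failure) and releases the mutex.
    public void put(Object key, Object value) {
        if (value != null) {
            store.put(key, value);
        }
        ReentrantLock lock = locks.get(key);
        if (lock != null && lock.isHeldByCurrentThread()) {
            lock.unlock();
        }
    }
}
```

Note that if a caller never follows a miss with a put(), the per-key lock is held forever, which is exactly the failure mode this issue is about.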

The critical scenario, which can prevent Magnolia from responding to any request (all threads blocked), is the following:
1) first request to a resource which is slow or never returns (request to a service, poor database connection, ...)
2) second request to the same resource: --> the thread is blocked at EhCacheWrapper.get(key):56, but also keeps the lock on the cache
3) all other caching requests are blocked (no matter which URL) due to the synchronized block at Default.shouldCache(Cache, AggregationState, FlushPolicy):89
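A minimal, hypothetical model of this lock-up, with plain java.util.concurrent locks standing in for the synchronized block and the per-key EhCache mutex (this is not the actual Magnolia code):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Model of steps 1-3: request 1 holds the per-key mutex while rendering
// slowly; request 2 enters the cache-wide lock and then blocks on that
// mutex; request 3 (any URL) can no longer even enter the cache-wide lock.
public class CacheBlockDemo {
    static final ReentrantLock cacheLock = new ReentrantLock(); // stands in for synchronized (cache)
    static final ReentrantLock keyLock = new ReentrantLock();   // stands in for the per-key mutex

    public static boolean otherRequestsBlocked() {
        try {
            keyLock.lock(); // request 1: slow rendering, mutex held
            CountDownLatch entered = new CountDownLatch(1);
            Thread request2 = new Thread(() -> {
                cacheLock.lock();   // request 2: inside the synchronized block...
                entered.countDown();
                keyLock.lock();     // ...now blocked on the mutex, cacheLock still held
                keyLock.unlock();
                cacheLock.unlock();
            });
            request2.start();
            entered.await();
            // request 3: cannot enter the synchronized block at all
            boolean blocked = !cacheLock.tryLock(200, TimeUnit.MILLISECONDS);
            keyLock.unlock(); // let request 1 "finish" so the demo terminates
            request2.join();
            return blocked;
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
}
```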

Solution:

  • don't synchronize on the cache (why are we doing this???)
  • allow configuration of BlockingCache.timeoutMillis
    • throw an exception if a request waits for too long
  • uncomment the finally block at doFilter(HttpServletRequest, HttpServletResponse, FilterChain); this is a safety net and should log a FATAL ERROR message
    • only relevant if the result is a store request
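The proposed timeout could look roughly like this. It is a sketch using java.util.concurrent rather than EhCache's own BlockingCache.timeoutMillis setting, and the names are illustrative:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the timeout idea: wait at most timeoutMillis for the per-key
// mutex and fail fast instead of blocking the request thread forever.
public class TimedLockSketch {
    public static void acquireOrFail(ReentrantLock lock, long timeoutMillis) {
        boolean acquired;
        try {
            acquired = lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("Interrupted while waiting for cache lock", e);
        }
        if (!acquired) {
            // corresponds to EhCache throwing a LockTimeoutException
            throw new RuntimeException("Could not acquire cache lock within " + timeoutMillis + " ms");
        }
    }

    // Tiny demonstration: a free lock is acquired well within the timeout.
    public static boolean demo(long timeoutMillis) {
        ReentrantLock lock = new ReentrantLock();
        acquireOrFail(lock, timeoutMillis);
        boolean held = lock.isHeldByCurrentThread();
        lock.unlock();
        return held;
    }
}
```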


 Comments   
Comment by Philipp Bärfuss [ 26/Mar/10 ]

Patch against 3.6.8 (draft) can be found at SUPPORT-538

Comment by Magnolia International [ 26/Mar/10 ]

I think the main reason we synchronize on the cache instance is to avoid relying on specifics of EhCache's implementation.

Given the above, we could consider not doing it, but rather change our Cache interface to enforce locking at key level like EhCache's BlockingCache does: by naming methods appropriately, by providing extra facilities to do said locking ourselves (should be doable with Java 5's java.util.concurrent.locks?), or by defining such contracts in the javadoc - or of course all of the above.
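Such a facility built on java.util.concurrent.locks could be as small as the following sketch (not a proposed Magnolia API; the names are illustrative):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;

// One lock per cache key: contention is limited to requests for the same
// key, instead of serializing every request on the whole cache instance.
public class KeyLocks {
    private final ConcurrentMap<Object, ReentrantLock> locks = new ConcurrentHashMap<>();

    // Returns the same lock instance for the same key, atomically creating
    // it on first use.
    public ReentrantLock lockFor(Object key) {
        return locks.computeIfAbsent(key, k -> new ReentrantLock());
    }
}
```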

Comment by Jan Haderka [ 26/Mar/10 ]

Actually, the reason we synchronize is related to the fact that we use the blocking cache by default. If I'm not mistaken, the scenario that requires synchronization is this:

  1. Request A calls hasElement(), which in turn calls get(key). Since the key is not yet in the cache, the get() method returns null and places a mutex on the key. At this point hasElement() returns false, the cache policy returns "store", and the request proceeds to generate the cache entry.
  2. While the cache entry is being generated, request B arrives for the same resource, and hasElement() blocks in the subsequent call to cache.get(key) (since we use the blocking cache).
  3. Still while the cache entry requested by A is being generated, request C arrives and gets stuck in exactly the same place as request B.
  4. Now, finally, the cache entry has been generated and placed in the cache, at which point request A returns.
  5. Since the cache entry has been put in the cache, it removes the mutex and request B proceeds, finds the entry in the cache, and goes on with the "useCache" behavior.
  6. Unfortunately, request C will still be blocked and will stay that way forever, since removing the mutex by A released only the first blocked get() call, but not all of them. And what applies to C applies to any number of requests that arrive in that very same time window.

The behavior above was the reason why we introduced the synchronization on the cache in 3.6. I might have explained it in even more detail on the user list in the past.

Now, considering this issue, the timeout is probably the best solution to avoid being locked forever if the thread handling request A gets stuck forever.

As for serving multiple entries from the cache, we should lock on the key rather than on the whole cache, so that only requests for a given key ever get serialized access, as Greg already suggested above. ... Alternatively, we can drop the whole policy/executor-based implementation and implement the cache in a way that never allows calls to the cache to escape.

Comment by Philipp Bärfuss [ 26/Mar/10 ]

unfortunately, request C will still be blocked and will stay that way forever, since removing the mutex

If I understand that right, the following will happen if the synchronized block is removed:
1) A gets the mutex
2) B tries to acquire the mutex
3) C tries to acquire the mutex
4) A releases (calls put)
5) B succeeds but never releases the mutex
--> C is blocked forever

Well, then we have to find a better solution; just blocking all requests for that reason is not an option. But look at the following code from net.sf.ehcache.constructs.blocking.BlockingCache.get(Object) (version 1.5):

            if (element != null) {
                //ok let the other threads in
                lock.release();
                return element;
            } else {

It seems as if they handle the case in which the cache was populated while that thread was blocked. Either something is very fishy, or that extra synchronized block is wrong/obsolete.

Comment by Jan Haderka [ 26/Mar/10 ]

Yes, that is correct.
AFAIK this scenario was not re-tested after upgrading to ehcache 1.5, so it might indeed be obsolete.

OTOH, looking at the current trunk of ehcache, I'm not so sure you were looking at the right code; there, the BlockingCache.get(key) method looks like this:

    /**
     * Looks up an entry.  Blocks if the entry is null until a call to {@link #put} is done
     * to put an Element in.
     * <p/>
     * If a put is not done, the lock is never released.
     * <p/>
     * If this method throws an exception, it is the responsibility of the caller to catch that exception and call
     * <code>put(new Element(key, null));</code> to release the lock acquired. See {@link net.sf.ehcache.constructs.blocking.SelfPopulatingCache}
     * for an example.
     * <p/>
     * Note. If a LockTimeoutException is thrown while doing a <code>get</code> it means the lock was never acquired,
     * therefore it is a threading error to call {@link #put}
     *
     * @throws LockTimeoutException if timeout millis is non zero and this method has been unable to
     *                              acquire a lock in that time
     * @throws RuntimeException     if thrown the lock will not released. Catch and call<code>put(new Element(key, null));</code>
     *                              to release the lock acquired.
     */
    public Element get(final Object key) throws RuntimeException, LockTimeoutException {

        Sync lock = getLockForKey(key);
        Element element;
        acquiredLockForKey(key, lock, LockType.READ);
        element = cache.get(key);
        lock.unlock(LockType.READ);
        if (element == null) {
            acquiredLockForKey(key, lock, LockType.WRITE);
            element = cache.get(key);
            if (element != null) {
                lock.unlock(LockType.WRITE);
            }
        }
        return element;
    }

and browsing through the code I'm not convinced that this issue is solved. There is something fishy in that and related code, I just can't pinpoint it yet. I'll try to write a simple test for this on Monday and re-read all the code again to be sure.

It might be (speculation!) that the write lock prevents the second call to

        element = cache.get(key);

from being executed, but since this internal cache is not synchronized, there is no guarantee by the JVM that a given thread will see a value already put in by another thread. Those locks serialize only the code execution, not memory visibility (oh, that sounds so smart ... must be just the effect of re-reading Concurrency in Practice)

Comment by Magnolia International [ 26/Mar/10 ]

Let's spend some time "asap" setting up a few reproducible test scenarios (jmeter) so we can retest this, especially when upgrading ehcache? (we're a few versions behind)

Comment by Philipp Bärfuss [ 29/Mar/10 ]

source: I looked at the code for ehcache 1.5 (what we bundle with 4.x) and ehcache 1.4.1 (bundled with 3.6.8)

re-reading: now that this code uses a mutex object and a cache object, I think this is not the case

test: the basic scenarios can be tested with unit tests

test a) one request blocked for key "a" -> test that a request for key "b" enters the rendering
test b) three requests to the same key (two blocked) -> test that all three requests return
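Test (a) could be sketched along these lines, using per-key locks only and no cache-wide synchronization (the class and method names are illustrative, not the actual Magnolia tests):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of test (a): with per-key locking only, a request stuck rendering
// key "a" must not keep a request for key "b" out of the rendering path.
public class PerKeyBlockingTest {
    static final ConcurrentMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    static ReentrantLock lockFor(String key) {
        return locks.computeIfAbsent(key, k -> new ReentrantLock());
    }

    public static boolean keyBReachableWhileAIsRendering() {
        lockFor("a").lock(); // simulate a slow render holding the mutex for "a"
        try {
            boolean got = lockFor("b").tryLock(200, TimeUnit.MILLISECONDS);
            if (got) {
                lockFor("b").unlock();
            }
            return got; // true: "b" is not serialized behind "a"
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        } finally {
            lockFor("a").unlock();
        }
    }
}
```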

Comment by Jan Haderka [ 30/Mar/10 ]

Added test to expose the scenario outlined above.
As of ehCache 1.5, the issue occurs only when MaxElementsInMemory is set to 0 (there are also posts on the net warning against such a setting).
As of ehCache 2.0, even with MaxElementsInMemory set to 0, everything works fine. There are, however, some API changes (SourceConfiguration) that need to be accommodated before a full upgrade to ehCache 2.0.

Comment by Jan Haderka [ 30/Mar/10 ]

It would also seem that the meaning of maxElementsInMemory==0 has changed in ehCache 2.0: http://ehcache.org/documentation/faq.html

Can I use Ehcache as a disk cache only?
As of Ehcache 2.0 this is not possible. You can set the maxElementsInMemory to 1, but setting the max size to 0 now gives an infinite capacity.

Comment by Jan Haderka [ 31/Mar/10 ]

The two most common scenarios (that I could think of) under which the cache can block. The first one (multiple requests) is solved by synchronizing on the whole cache, though the price for that is probably too high.
The second scenario is AFAIK currently not taken care of, since the flush policy doesn't synchronize on the cache, so the lock can be introduced by flushAll or flushByUuid calls. And more dangerously, since the Default cache policy does synchronize on the cache, one such blocking request blocks the whole cache forever.

Comment by Jan Haderka [ 07/May/10 ]

Implementing the timeout as shown in the patch attached to SUPPORT-538 certainly ensures that no single request blocks forever. However, this doesn't justify removal of the synchronized block.
The main purpose of the synchronized block is to ensure that the mutex is always placed in the hasElement() call and not in the subsequent get() call. The cache is accessible by other threads, and items can be removed by them directly (remove(Key)/clear()) or indirectly by eviction of expired items or due to reaching the maximum number of allowed elements.
Due to the facts outlined above, we either need the synchronized block to make sure that only one thread is able to execute these two calls in sequence, or we need to avoid using such a construction altogether. (And yes, there is still the possibility that an item will be evicted after the hasElement() and before the get() call, in which case get() will cause a mutex to be placed and an error page returned to the user.)
Anyway, the synchronization would need to be in all places where we manipulate the cache (cache.clear(), cache.remove(Key) and also cache.put(Key, Val), and if we ever expire entries by time-to-live, we would need to synchronize there too).
Below are outlined a few scenarios which help clarify the sort of problems that occur due to a lack of synchronization:

Scenario 1:

  1. Thread A: hasElement(KeyA) --> ehCache.get(KeyA) != null ==> false, [place mutex]
  2. Thread A: store(KeyA) --> <takes time>
  3. Thread B: hasElement(KeyA) --> ehCache.get(KeyA) ==> LockTimeoutEx
  4. Thread B: <return error to user>
  5. Thread A: store(KeyA) --> <finishes now> --> ehCache.put(KeyA, ValA) ==> [remove mutex]
  6. Thread A: <return page to user>

Scenario 2:

  1. Thread A: hasElement(KeyA) --> ehCache.get(KeyA) != null ==> true
  2. Thread B: evict/expire(KeyA)
  3. Thread A: get(KeyA) --> ehCache.get(KeyA) ==> null, [place mutex]
  4. Thread A: <return error to the user>
  5. Thread C: hasElement(KeyA) --> ehCache.get(KeyA) ==> LockTimeoutEx
  6. Thread C: <return error to the user>
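The race in Scenario 2 comes down to the hasElement()/get() pair not being atomic. A minimal, hypothetical demonstration (plain ConcurrentHashMap standing in for the cache; with the real blocking cache, the final miss is what places the mutex):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch of the race in Scenario 2: an eviction between hasElement() and
// get() turns a reported hit into a miss, so the policy's earlier answer
// no longer matches what get() sees.
public class DoubleCheckRace {
    static final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();

    public static boolean hasElement(String key) { return cache.get(key) != null; }
    public static String get(String key) { return cache.get(key); }

    public static boolean staleHitAfterEviction() {
        cache.put("KeyA", "ValA");
        boolean hit = hasElement("KeyA"); // step 1: Thread A sees a hit
        cache.remove("KeyA");             // step 2: Thread B evicts/expires KeyA
        String value = get("KeyA");       // step 3: the "hit" has become a miss
        return hit && value == null;
    }
}
```

Here the interleaving is forced sequentially to make it deterministic; in production the eviction happens on another thread, exactly as the scenario above describes.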

The synchronization issue could be dealt with in multiple ways:

  • add synchronized to all the places outlined above. Cons: hard to maintain, test, and ensure that it is always the case; synchronizing the whole cache means only one request can be processed at a time, creating a performance bottleneck
  • synchronize everywhere, but use a system of mutexes, one per key, similar to ehCache's own system. Pros: removes the bottleneck. Cons: extra code to write and maintain, significant increase in complexity
  • abandon usage of the problematic construction. There is no need to do the double check.
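The third option amounts to folding hasElement() and get() into a single call whose null result means "you now hold the mutex, render the entry and call put()". A hedged sketch of that shape (illustrative names, not the Magnolia Cache API):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Sketch of option 3: one getOrLock() call replaces hasElement()+get(),
// so no eviction can slip in between a presence check and the lookup.
public class SingleCallCache {
    private final ConcurrentMap<String, String> store = new ConcurrentHashMap<>();
    private final ConcurrentMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    // A hit releases the per-key lock and returns the value; a miss keeps
    // the lock held, so exactly one caller proceeds to render.
    public String getOrLock(String key) {
        ReentrantLock lock = locks.computeIfAbsent(key, k -> new ReentrantLock());
        lock.lock();
        String value = store.get(key);
        if (value != null) {
            lock.unlock();
        }
        return value; // null == "you hold the mutex, render and call put()"
    }

    public void put(String key, String value) {
        store.put(key, value);
        ReentrantLock lock = locks.get(key);
        if (lock != null && lock.isHeldByCurrentThread()) {
            lock.unlock();
        }
    }

    // With this shape, the policy needs no double check at all.
    public String fetch(String key, Supplier<String> renderer) {
        String cached = getOrLock(key);
        if (cached != null) {
            return cached; // "useCache"
        }
        String rendered = renderer.get(); // "store"
        put(key, rendered);
        return rendered;
    }
}
```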