[MAGNOLIA-3167] cache: single blocking request can block all other requests to cached content Created: 26/Mar/10 Updated: 23/Jan/13 Resolved: 07/May/10 |
|
| Status: | Closed |
| Project: | Magnolia |
| Component/s: | cache |
| Affects Version/s: | 3.6.8, 4.1.4, 4.2.3, 4.3.1 |
| Fix Version/s: | 4.3.2 |
| Type: | Bug | Priority: | Major |
| Reporter: | Philipp Bärfuss | Assignee: | Jan Haderka |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | |
| Issue Links: | |
| Template: | |
| Patch included: | Yes |
| Acceptance criteria: | Empty |
| Task DoD: |
[ ]* Doc/release notes changes? Comment present?
[ ]* Downstream builds green?
[ ]* Solution information and context easily available?
[ ]* Tests
[ ]* FixVersion filled and not yet released
[ ] Architecture Decision Record (ADR)
|
| Bug DoR: |
[ ]* Steps to reproduce, expected, and actual results filled
[ ]* Affected version filled
|
| Date of First Response: | |
| Description |
|
The cache mechanism can block all requests.
The critical scenario, which can prevent Magnolia from responding to any request (all threads blocked), is the following:

Solution:
|
| Comments |
| Comment by Philipp Bärfuss [ 26/Mar/10 ] |
|
Patch against 3.6.8 (draft) can be found at SUPPORT-538 |
| Comment by Magnolia International [ 26/Mar/10 ] |
|
I think the main reason we synchronize on the cache instance is to avoid relying on specifics of EhCache's implementation. Given the above, we could consider not doing it, and instead change our Cache interface to enforce locking at key level, like EhCache's BlockingCache does: by naming methods appropriately, by providing extra facilities to do said locking ourselves (should be doable with Java 5's java.util.concurrent.locks?), or by defining such contracts in the javadoc - or, of course, all of the above. |
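The key-level locking suggested above (via java.util.concurrent.locks) could look roughly like the following sketch. The class and method names here are hypothetical, not Magnolia's actual Cache interface; modern ConcurrentHashMap.computeIfAbsent is used for brevity:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of key-level locking: one ReentrantLock per cache key,
// so requests for different keys never serialize against each other.
public class KeyLockRegistry {
    private final ConcurrentMap<Object, ReentrantLock> locks = new ConcurrentHashMap<>();

    private ReentrantLock lockFor(Object key) {
        return locks.computeIfAbsent(key, k -> new ReentrantLock());
    }

    public void lock(Object key) {
        lockFor(key).lock();
    }

    public void unlock(Object key) {
        lockFor(key).unlock();
    }

    public boolean tryLock(Object key) {
        return lockFor(key).tryLock();
    }

    public static void main(String[] args) {
        KeyLockRegistry registry = new KeyLockRegistry();
        registry.lock("a");                         // request A renders key "a"
        boolean bAvailable = registry.tryLock("b"); // request B is not blocked by "a"
        System.out.println(bAvailable);             // true: keys lock independently
        registry.unlock("b");
        registry.unlock("a");
    }
}
```

With a registry like this, only concurrent requests for the same key would contend, instead of every request contending on the single cache monitor.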
| Comment by Jan Haderka [ 26/Mar/10 ] |
|
Actually, the reason we synchronize is related to the fact that we use a blocking cache by default. If I'm not mistaken, the scenario that requires synchronization is this:
The behavior above was the reason we introduced the synchronization on the cache in 3.6. I might have explained it in even more detail on the user list in the past. Now, considering this issue, the timeout is probably the best solution to avoid being locked forever if the thread handling request A gets stuck forever. As for serving multiple entries from the cache, we should lock on the key rather than on the whole cache, so that only requests for a given key ever get serialized access, as Greg already suggested above. ... Alternatively, we can drop the whole policy/executor based implementation and implement the cache in a way that would never allow calls to the cache to escape. |
| Comment by Philipp Bärfuss [ 26/Mar/10 ] |
If I understand that right, the following will happen if the synchronization block is removed. Well, then we have to find a better solution; just blocking all requests for that reason is not an option. But looking at the following code in net.sf.ehcache.constructs.blocking.BlockingCache.get(Object) (version 1.5):

if (element != null) {
    // ok, let the other threads in
    lock.release();
    return element;
} else {

it seems as if they handle the case in which the cache was populated while that thread was blocked. Either something is very fishy, or that extra synchronization block is wrong/obsolete. |
| Comment by Jan Haderka [ 26/Mar/10 ] |
|
Yes, that is correct. OTOH, looking at the current trunk of ehcache, I'm not so sure you had the right code; there the BlockingCache.get(key) method looks like:

/**
 * Looks up an entry. Blocks if the entry is null until a call to {@link #put} is done
 * to put an Element in.
 * <p/>
 * If a put is not done, the lock is never released.
 * <p/>
 * If this method throws an exception, it is the responsibility of the caller to catch that exception and call
 * <code>put(new Element(key, null));</code> to release the lock acquired. See {@link net.sf.ehcache.constructs.blocking.SelfPopulatingCache}
 * for an example.
 * <p/>
 * Note. If a LockTimeoutException is thrown while doing a <code>get</code> it means the lock was never acquired,
 * therefore it is a threading error to call {@link #put}
 *
 * @throws LockTimeoutException if timeout millis is non zero and this method has been unable to
 *                              acquire a lock in that time
 * @throws RuntimeException if thrown the lock will not released. Catch and call <code>put(new Element(key, null));</code>
 *                          to release the lock acquired.
 */
public Element get(final Object key) throws RuntimeException, LockTimeoutException {
    Sync lock = getLockForKey(key);
    Element element;
    acquiredLockForKey(key, lock, LockType.READ);
    element = cache.get(key);
    lock.unlock(LockType.READ);
    if (element == null) {
        acquiredLockForKey(key, lock, LockType.WRITE);
        element = cache.get(key);
        if (element != null) {
            lock.unlock(LockType.WRITE);
        }
    }
    return element;
}

Browsing through the code, I'm not convinced that this issue is solved. There is something fishy in that and related code, I just can't pinpoint it yet. I'll try to write a simple test for this on Monday and re-read all the code again to be sure. It might be (speculation!) that the write lock prevents the second call to element = cache.get(key); from being executed, but since this internal cache is not synchronized, there is no guarantee by the JVM that a given thread will see the value already put in by another thread. Those locks serialize only the code execution, not memory visibility (oh, that sounds so smart ... must be just the effect of re-reading Concurrency in Practice). |
| Comment by Magnolia International [ 26/Mar/10 ] |
|
Let's spend some time "asap" setting up a few reproducible test scenarios (jmeter) so we can retest this, especially when upgrading ehcache (we're a few versions behind). |
| Comment by Philipp Bärfuss [ 29/Mar/10 ] |
|
source: I looked at the code for ehcache 1.5 (what we bundle with 4.x) and ehcache 1.4.1 (bundled with 3.6.8)
re-reading: now that this code uses a mutex object and a cache object, I think this is not the case
test: the basic scenarios can be tested with unit tests
test a) one request blocked for key "a" -> test that a request for key "b" enters the rendering |
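Test a) above could be sketched as a plain unit test against a hypothetical per-key-locking stub (all names invented): while one request holds the lock for key "a", a request for key "b" must still get through.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of "test a": request A blocks on key "a"; a request for key "b"
// must not be serialized behind it. The per-key lock map stands in for a
// hypothetical key-locking cache implementation.
public class PerKeyBlockingTest {
    static final Map<String, ReentrantLock> LOCKS = new ConcurrentHashMap<>();

    static ReentrantLock lockFor(String key) {
        return LOCKS.computeIfAbsent(key, k -> new ReentrantLock());
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch aLocked = new CountDownLatch(1);
        Thread requestA = new Thread(() -> {
            lockFor("a").lock(); // request A starts rendering "a" and holds the lock
            aLocked.countDown();
        });
        requestA.start();
        aLocked.await(); // "a" is definitely locked now

        // Request B for a different key enters the rendering immediately:
        boolean bEntered = lockFor("b").tryLock(100, TimeUnit.MILLISECONDS);
        System.out.println(bEntered); // true
        requestA.join();
    }
}
```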
| Comment by Jan Haderka [ 30/Mar/10 ] |
|
Added test to expose the scenario outlined above. |
| Comment by Jan Haderka [ 30/Mar/10 ] |
|
It would also seem that the meaning of maxElementsInMemory==0 has changed in ehcache 2.0: http://ehcache.org/documentation/faq.html ("Can I use Ehcache as a disk cache only?") |
| Comment by Jan Haderka [ 31/Mar/10 ] |
|
The two most common scenarios (that I could think of) under which the cache can block. The first one (multiple requests) is solved by synchronizing on the whole cache, though the price for that is probably too high. |
| Comment by Jan Haderka [ 07/May/10 ] |
|
Implementing the timeout as shown in the patch attached to SUPPORT-538 certainly ensures that no single request blocks forever. However, this doesn't justify removal of the synchronized block. Scenario 1:
Scenario 2:
The synchronization issue could be dealt with in multiple ways:
|
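For illustration, here is a minimal sketch of the key-level-locking-plus-timeout direction discussed in the comments. This is an assumption about the shape of such a fix, not the code actually shipped in 4.3.2; all names are invented.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch only: key-level read/write locks plus a bounded wait,
// so a cache miss blocks only other requests for the *same* key, and never
// forever.
public class TimedBlockingCache<K, V> {

    /** Thrown when the thread populating the same key takes too long. */
    public static class LockTimeoutException extends RuntimeException { }

    private final Map<K, V> store = new ConcurrentHashMap<>();
    private final Map<K, ReentrantReadWriteLock> locks = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    public TimedBlockingCache(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    private ReentrantReadWriteLock lockFor(K key) {
        return locks.computeIfAbsent(key, k -> new ReentrantReadWriteLock());
    }

    /**
     * Returns the cached value, or null on a miss. On null the caller holds
     * the key's write lock and MUST call put() to release it (mirroring
     * BlockingCache's contract). Requests for other keys are never blocked.
     */
    public V get(K key) throws InterruptedException {
        ReentrantReadWriteLock lock = lockFor(key);
        lock.readLock().lock();
        try {
            V value = store.get(key);
            if (value != null) {
                return value;
            }
        } finally {
            lock.readLock().unlock();
        }
        // Miss: wait a bounded time for whoever may be populating the entry.
        if (!lock.writeLock().tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
            throw new LockTimeoutException();
        }
        V value = store.get(key); // re-check: populated while we waited?
        if (value != null) {
            lock.writeLock().unlock();
            return value;
        }
        return null; // caller now owns the write lock and must put()
    }

    public void put(K key, V value) {
        store.put(key, value);
        ReentrantReadWriteLock lock = lockFor(key);
        if (lock.writeLock().isHeldByCurrentThread()) {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        TimedBlockingCache<String, String> cache = new TimedBlockingCache<>(100);
        String miss = cache.get("page");       // null: this thread must render
        System.out.println(miss);              // null
        cache.put("page", "<html/>");          // stores the value, releases the lock
        System.out.println(cache.get("page")); // <html/>
    }
}
```

Unlike synchronizing on the whole cache, this serializes only requests for the same key, and the bounded tryLock ensures a stuck populating thread cannot hold waiters forever.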