[QAARQ-17] publish large files using rabbitmq result in out of memory exception Created: 07/Mar/17  Updated: 17/Dec/18  Resolved: 04/Jul/18

Status: Resolved
Project: Queued Asynchronous Activation over RabbitMQ
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Neutral
Reporter: Jann Forrer Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

SUSE Linux Enterprise Server 11 (x86_64)
Memory: 16 GB


Issue Links:
relation
Template:
Acceptance criteria:
Empty
Task DoD:
[ ]* Doc/release notes changes? Comment present?
[ ]* Downstream builds green?
[ ]* Solution information and context easily available?
[ ]* Tests
[ ]* FixVersion filled and not yet released
[ ]  Architecture Decision Record (ADR)
Bug DoR:
[ ]* Steps to reproduce, expected, and actual results filled
[ ]* Affected version filled
Date of First Response:

 Description   

We are just doing our final tests before using RabbitMQ in our production environment, but we found a problem when trying to publish a large asset (not really that large, ~200 MB).
Every time we publish such a file we get an OutOfMemory exception on the live server (see stack trace below). But that is not the only problem: if we continue publishing web pages or smaller assets, they all queue up in the respective RabbitMQ queue, and we have not found a way to deliver them to the live server. Even a restart of the Magnolia live server bound to that queue does not solve the problem.
The only way to fix it is to delete the queue and, with it, all queued messages. That is annoying because the files are then marked as published on the authoring server even though they are not, and if only one live server has the OutOfMemory problem, that live server may end up out of sync.

Note that increasing memory (we actually have 50 GB on our live server) allows publishing larger files, but that does not solve the problem, it only postpones it.

Any ideas? It would already help a lot if it were possible to delete such a message, so that the live server can consume the remaining queued messages.
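Judging from the stack trace, the consumer materializes the whole message body as one String (StringBuilder.toString / TextBuffer.contentsAsString) before parsing, so heap usage grows with asset size. A minimal stdlib sketch of the alternative pattern, reading in fixed-size chunks instead of buffering everything (illustrative names, not the module's actual code):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class ChunkedRead {

    // Processes a payload without ever holding it fully in memory;
    // returns the total number of bytes seen. A streaming parser would
    // be fed each chunk instead of a single giant String.
    static long processInChunks(InputStream in) {
        byte[] buffer = new byte[8192]; // constant heap footprint
        long total = 0;
        try {
            int read;
            while ((read = in.read(buffer)) != -1) {
                total += read; // hand each chunk to a streaming parser here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] payload = new byte[1 << 20]; // stands in for a large asset
        long seen = processInChunks(new ByteArrayInputStream(payload));
        System.out.println(seen); // prints 1048576
    }
}
```

With this shape, heap usage stays bounded by the buffer size regardless of whether the asset is 2 MB or 200 MB.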

--- StackTrace:
2017-03-07 11:42:50,355 WARN  itmq.activation.jobs.AbstractActivationConsumerJob: Could not connect to fan1 , no connection from uzh-client...
Exception in thread "Thread-20" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOfRange(Unknown Source)
        at java.lang.String.<init>(Unknown Source)
        at java.lang.StringBuilder.toString(Unknown Source)
        at org.codehaus.jackson.util.TextBuffer.contentsAsString(TextBuffer.java:362)
        at org.codehaus.jackson.impl.Utf8StreamParser.getText(Utf8StreamParser.java:278)
        at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.deserialize(UntypedObjectDeserializer.java:59)
        at org.codehaus.jackson.map.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:217)
        at org.codehaus.jackson.map.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:194)
        at org.codehaus.jackson.map.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:30)
        at org.codehaus.jackson.map.ObjectMapper._readMapAndClose(ObjectMapper.java:2732)
        at org.codehaus.jackson.map.ObjectMapper.readValue(ObjectMapper.java:1935)
        at info.magnolia.rabbitmq.activation.io.NodeServiceImpl.toNodesFromStackAck(NodeServiceImpl.java:117)
        at info.magnolia.rabbitmq.activation.jobs.AbstractActivationConsumerJob.activate(AbstractActivationConsumerJob.java:116)
        at info.magnolia.rabbitmq.activation.jobs.ActivationConsumerJob.processMessage(ActivationConsumerJob.java:74)
        at info.magnolia.rabbitmq.activation.jobs.AbstractActivationConsumerJob.run(AbstractActivationConsumerJob.java:76)
        at java.lang.Thread.run(Unknown Source)
2017-03-07 14:07:02,311 WARN  org.apache.jackrabbit.core.SessionImpl            : Unclosed session detected. The session was opened here: 
java.lang.Exception: Stack Trace
        at org.apache.jackrabbit.core.SessionImpl.<init>(SessionImpl.java:222)
        at org.apache.jackrabbit.core.SessionImpl.<init>(SessionImpl.java:239)
        at org.apache.jackrabbit.core.XASessionImpl.<init>(XASessionImpl.java:101)
        at org.apache.jackrabbit.core.RepositoryImpl.createSessionInstance(RepositoryImpl.java:1613)
        at org.apache.jackrabbit.core.RepositoryImpl.createSession(RepositoryImpl.java:956)
        at org.apache.jackrabbit.core.RepositoryImpl.login(RepositoryImpl.java:1501)
        at org.apache.jackrabbit.core.jndi.BindableRepository.login(BindableRepository.java:162)
        at info.magnolia.jackrabbit.ProviderImpl.getSystemSession(ProviderImpl.java:527)
        at info.magnolia.repository.DefaultRepositoryManager.getSystemSession(DefaultRepositoryManager.java:277)
        at info.magnolia.context.SystemRepositoryStrategy.internalGetSession(SystemRepositoryStrategy.java:54)
        at info.magnolia.context.AbstractRepositoryStrategy.getSession(AbstractRepositoryStrategy.java:74)
        at info.magnolia.context.AbstractContext.getJCRSession(AbstractContext.java:132)
        at info.magnolia.context.AbstractContext.getHierarchyManager(AbstractContext.java:205)
        at info.magnolia.context.MgnlContext.getHierarchyManager(MgnlContext.java:128)
        at info.magnolia.cms.core.version.MgnlVersioningNodeWrapper$1.exec(MgnlVersioningNodeWrapper.java:120)
        at info.magnolia.cms.core.version.MgnlVersioningNodeWrapper$1.exec(MgnlVersioningNodeWrapper.java:115)
        at info.magnolia.context.MgnlContext.doInSystemContext(MgnlContext.java:385)
        at info.magnolia.context.MgnlContext.doInSystemContext(MgnlContext.java:371)
        at info.magnolia.cms.core.version.MgnlVersioningNodeWrapper.remove(MgnlVersioningNodeWrapper.java:115)
        at info.magnolia.jcr.wrapper.DelegateNodeWrapper.remove(DelegateNodeWrapper.java:536)
        at info.magnolia.jcr.wrapper.MgnlPropertySettingNodeWrapper.remove(MgnlPropertySettingNodeWrapper.java:238)
        at info.magnolia.jcr.wrapper.DelegateNodeWrapper.remove(DelegateNodeWrapper.java:536)
        at info.magnolia.audit.MgnlAuditLoggingContentDecoratorNodeWrapper.remove(MgnlAuditLoggingContentDecoratorNodeWrapper.java:94)
        at info.magnolia.rabbitmq.activation.jobs.AbstractActivationConsumerJob.deactivate(AbstractActivationConsumerJob.java:138)
        at info.magnolia.rabbitmq.activation.jobs.ActivationConsumerJob.processMessage(ActivationConsumerJob.java:72)
        at info.magnolia.rabbitmq.activation.jobs.AbstractActivationConsumerJob.run(AbstractActivationConsumerJob.java:76)
        at java.lang.Thread.run(Unknown Source)
 


 Comments   
Comment by Karel de Witte [ 08/Mar/17 ]

Hi Jann,

Thanks for reporting.
Regarding the memory issue, I will investigate. Regarding the removal of the blocking message: you need to disconnect the exchange by shutting down the client in the RabbitMQ administration, then dequeue the faulty message using the panel. I can show you over remote control if you want.
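For reference, the same steps can be done from the command line (the queue name is a placeholder, and flag names vary with the rabbitmqadmin version):

```shell
# List queues and their message counts to find the blocked one
rabbitmqctl list_queues name messages

# Pop the faulty message off the head of the queue without requeueing it
# (older rabbitmqadmin versions use requeue=false instead of ackmode=...)
rabbitmqadmin get queue=<queue-name> ackmode=ack_requeue_false count=1

# Last resort: drop ALL queued messages on that queue
rabbitmqctl purge_queue <queue-name>
```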

Best regards,
Karel.

Comment by Jann Forrer [ 08/Mar/17 ]

Hi Karel

Thank you, I was able to delete the messages following the documentation (https://documentation.magnolia-cms.com/display/DOCS/RabbitMQ+modules#RabbitMQmodules-Troubleshootingandspecialusescases)

However, more questions arise:

1. What about the sequence number, if we assume that only one live server has a problem and we only need to delete the message on one queue? Does that result in different sequence numbers on the different live servers? If yes, is it possible to resync the sequence numbers again?

2. We have the same problem, i.e. queued messages which cannot be delivered, after killing a live server (during publishing) and restarting it. According to https://www.rabbitmq.com/reliability.html it should be possible to configure RabbitMQ so that no message is lost in such a case. Can that be achieved by configuring an ACK client on the consumer?

3. Is it possible to make messages persistent as described in the RabbitMQ documentation, by supplying a delivery_mode property with a value of 2?

4. Do you have an idea how many resources (memory, CPU, ...) a production RabbitMQ server needs? I can deliver figures on the number of tasks (publish, unpublish) if necessary.

Best regards
Jann

Comment by Jann Forrer [ 09/Mar/17 ]

.... concerning point 2: I found out that after a crash of a live server, I have to restart the consumer so that the messages are consumed again.
Out of interest: why is the consumer not working after starting the Magnolia server? I could see that the consumer was registered, but it did not start to consume messages. Only after a restart of the consumer were the messages processed again (note that I did not need to delete any message, only restart the consumer).

Comment by Jann Forrer [ 09/Mar/17 ]

.... concerning point 4: I got some figures from Nicole. She analyzed several days. The largest number we got is about 7,000 tasks (activation/deactivation). So if we assume 10,000 tasks a day we should be on the safe side.

Comment by Jann Forrer [ 16/Mar/17 ]

Hi Karel

Any news concerning points 1, 3 and 4?

Best regards
Jann

Comment by Karel de Witte [ 27/Mar/17 ]

Hi Jann,

Regarding 1) The sequence number on the public instance is there to make sure all instances received the same number of messages. If you delete a message manually, then of course you need to adapt the sequence number manually on the public instance. I might, however, add this to the REST control API so you can easily do it remotely.
Regarding 2) I will check, but normally the consumer should start automatically on restart. Note that you also have the possibility to restart the consumer using a REST call.

Besides, the consumer already acks the message when it is consumed correctly; this is why, in case of trouble, the consumer shuts down and unacks the message. The reason it shuts down is to avoid problems when consuming a node that, for instance, depends on another node. Imagine the parent is not correctly consumed: if the child node is then consumed correctly, you will get an inconsistent state on your public instance.
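The stop-on-failure policy described above can be modeled in a few lines (illustrative names, not the module's API): ack everything up to the first failure, then stop so later messages, which may depend on the failed one, stay queued.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class StopOnFailureConsumer {

    // Returns the messages that were acked; stops at the first failure
    // so everything after it stays queued (unacked) for a retry.
    static List<String> drain(List<String> queue, Predicate<String> process) {
        List<String> acked = new ArrayList<>();
        for (String msg : queue) {
            if (!process.test(msg)) {
                break; // unack + shut down: later messages must wait
            }
            acked.add(msg);
        }
        return acked;
    }

    public static void main(String[] args) {
        List<String> queue = List.of("parent", "child", "sibling");
        // if "parent" fails to activate, nothing after it is consumed,
        // so "child" can never land without its parent
        System.out.println(drain(queue, m -> !m.equals("parent"))); // prints []
    }
}
```

Skipping the failed message instead of stopping would risk exactly the parent/child inconsistency described above.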

Regarding 3) Messages are persisted if the queue is durable, which I think should already be the case.
Regarding 4) It will depend on the average size of your nodes, and may vary depending on whether you publish large assets or not and on how many queues you would like to maintain. Could you give me the size of the assets workspace and of the website workspace, both in terms of disk size?
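For illustration, 2) and 3) in plain RabbitMQ Java client terms (a sketch under assumptions, not the module's actual code; the host and queue name are placeholders, and it needs a running broker):

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.MessageProperties;

public class DurablePublishSketch {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // placeholder broker
        try (Connection conn = factory.newConnection();
             Channel channel = conn.createChannel()) {
            // durable=true: the queue definition survives a broker restart
            channel.queueDeclare("activation", true, false, false, null);
            // PERSISTENT_BASIC sets delivery_mode=2, so the message body is
            // written to disk as long as the queue itself is durable
            channel.basicPublish("", "activation",
                    MessageProperties.PERSISTENT_BASIC,
                    "payload".getBytes("UTF-8"));
        }
    }
}
```

Both halves are needed: a persistent message in a non-durable queue is still lost when the queue disappears on restart.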

I will work on enhancing 1) and also on refactoring the consumer.

Best regards,

Generated at Mon Feb 12 10:38:51 CET 2024 using Jira 9.4.2#940002-sha1:46d1a51de284217efdcb32434eab47a99af2938b.