[MLEARN-9] Training neural network sometimes fails due to "unknown identifier" Created: 11/Feb/19  Updated: 27/Mar/19  Resolved: 27/Mar/19

Status: Closed
Project: Machine Learning
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Neutral
Reporter: Cedric Reichenbach Assignee: Federico Grilli
Resolution: Cannot Reproduce Votes: 1
Labels: None
Remaining Estimate: 0d
Time Spent: 3.5d
Original Estimate: Not Specified

Issue Links:
relation
is related to MLEARN-6 Neural network storage debouncer is g... Closed
is related to MLEARN-15 Store labels lowercase Closed
Template:
Acceptance criteria:
Empty
Task DoD:
[ ]* Doc/release notes changes? Comment present?
[ ]* Downstream builds green?
[ ]* Solution information and context easily available?
[ ]* Tests
[ ]* FixVersion filled and not yet released
[ ]  Architecture Decision Record (ADR)
Bug DoR:
[ ]* Steps to reproduce, expected, and actual results filled
[ ]* Affected version filled
Date of First Response:
Epic Link: Periscope improvements
Sprint: Foundation 7
Story Points: 5

 Description   

As observed by fgrilli while running performance load tests, training of ranking neural networks sometimes fails with the following exception:

019-02-06 17:38:33,726 ERROR gnolia.periscope.rank.ml.NeuralNetworkResultRanker: Failed to train ranking neural network
java.lang.IllegalArgumentException: Unknown result with identifier: tours
	at info.magnolia.periscope.rank.ml.NeuralNetworkResultRanker.outputToArray(NeuralNetworkResultRanker.java:183) ~[magnolia-periscope-result-ranker-1.1-SNAPSHOT.jar:?]
	at info.magnolia.periscope.rank.ml.NeuralNetworkResultRanker.trainRanking(NeuralNetworkResultRanker.java:125) ~[magnolia-periscope-result-ranker-1.1-SNAPSHOT.jar:?]
	at info.magnolia.periscope.Periscope.resultPicked(Periscope.java:168) ~[magnolia-periscope-core-1.1-SNAPSHOT.jar:?]

It looks like there's an inconsistency between ensuring that a given label is in the labels list and actually using that label for training.
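
The suspected failure mode can be sketched roughly as follows. This is a hypothetical reconstruction, not the actual Periscope code: the class LabelIndex and its methods are invented for illustration, and only the exception message is taken from the log above. The sketch assumes that training builds a one-hot output vector from a label that is expected to have been registered earlier, and throws when it was not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the ranker's label storage; NOT the real Periscope class.
final class LabelIndex {
    private final List<String> labels = new ArrayList<>();

    // Registers a label when a search result arrives; duplicates are ignored.
    synchronized void add(String label) {
        if (!labels.contains(label)) {
            labels.add(label);
        }
    }

    // Builds a one-hot output vector for the picked label.
    // Throws if the label was never registered (exact string match).
    synchronized double[] outputToArray(String label) {
        int index = labels.indexOf(label);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown result with identifier: " + label);
        }
        double[] output = new double[labels.size()];
        output[index] = 1.0;
        return output;
    }

    public static void main(String[] args) {
        LabelIndex index = new LabelIndex();
        index.add("pages");
        index.add("assets");

        // A registered label yields a one-hot vector.
        System.out.println(Arrays.toString(index.outputToArray("assets")));

        // A label that was picked but never registered reproduces the error message.
        try {
            index.outputToArray("tours");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Any mismatch between the string that was registered and the string that is later picked would surface this way, whatever its cause.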



 Comments   
Comment by Federico Grilli [ 27/Mar/19 ]

Premise: I encountered this issue only once while running load tests in the scope of MLEARN-6 with ~100 concurrent users. 

This time I wasn't able to reproduce the issue, either manually or via magnolia-load-tests (due to DEV-1141).
I did some debugging and static code analysis, and here's my understanding of the process:

  • Upon login, Periscope (a singleton) starts a search via SearchRunner. 
  • SearchRunner gets a list of SearchResultSupplier(s) for several JCR workspaces (pages, assets, tours, etc.) and regulates concurrent access to such workspaces ("Jackrabbit does not support multiple threads concurrently reading from or writing to the same session. Each session should only ever be accessed from one thread.").
  • The labels (search results) are returned asynchronously and further processed by being added to a NeuralNetworkResultRanker (one instance per logged-in user), which stores them in an IndexedBuffer.
  • Upon clicking on a search result (in our case, a click on the top row of the search result grid, likely the Tours app), the machine learning process is triggered and the neural network is trained on the chosen item (label) (see https://deeplearning4j.org/api/latest/org/deeplearning4j/nn/multilayer/MultiLayerNetwork.html#fit-org.nd4j.linalg.api.ndarray.INDArray-int:A-).
  • We expect the chosen item (label) to be in the list of labels stored with the initial search but, in our case, it was not, hence the exception.
  • My explanation is that the labels are stored case-sensitively (so the list has "Tours" but not "tours") and, for some odd reason, what reached the server was the lowercase version (the value passed from client to server seems to be the value of the title column in the grid). If that is the case, it looks like a random issue, unlikely to be related to Periscope (maybe more a glitch in Vaadin when sending data to the server?).
    Should we store all labels lowercase and make sure the label we're searching for is lowercase as well?
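
A minimal sketch of the lowercase idea, assuming labels can be treated as case-insensitive identifiers. The class NormalizedLabelIndex and its methods are hypothetical and not part of Periscope; the only point is that the same normalization is applied both when a label is stored and when it is looked up, so "Tours" and "tours" resolve to the same entry.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical label store that normalizes case on both write and read.
final class NormalizedLabelIndex {
    private final Map<String, Integer> indexByLabel = new LinkedHashMap<>();

    // Locale.ROOT keeps lowercasing independent of the server's default locale.
    private static String normalize(String label) {
        return label.toLowerCase(Locale.ROOT);
    }

    // Stores the normalized label; an already known label keeps its position.
    synchronized void add(String label) {
        indexByLabel.putIfAbsent(normalize(label), indexByLabel.size());
    }

    // Looks up the position of a label regardless of the case it arrives in.
    synchronized int indexOf(String label) {
        Integer index = indexByLabel.get(normalize(label));
        if (index == null) {
            throw new IllegalArgumentException("Unknown result with identifier: " + label);
        }
        return index;
    }

    public static void main(String[] args) {
        NormalizedLabelIndex index = new NormalizedLabelIndex();
        index.add("Pages");
        index.add("Tours");
        // The lowercase variant now resolves to the entry stored as "Tours".
        System.out.println(index.indexOf("tours"));
    }
}
```

One trade-off of this approach is that two labels differing only in case can no longer be told apart, which may or may not be acceptable for search results.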
Generated at Mon Feb 12 02:29:01 CET 2024 using Jira 9.4.2#940002-sha1:46d1a51de284217efdcb32434eab47a99af2938b.