[MGNLRANK-4] [Shaping Ranking] What weighting does Lucene provide Created: 13/Oct/22  Updated: 05/Jul/23  Resolved: 05/Jul/23

Status: Closed
Project: Ranker
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task
Priority: Neutral
Reporter: Laura Delnevo
Assignee: Unassigned
Resolution: Obsolete
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
relation
is related to MGNLPER-178 DOC: What is the default now Closed
Template:
Acceptance criteria:
Empty
Task DoR:
Empty
Date of First Response:
Epic Link: User-based ranking
Team: AuthorX

 Description   
  • Find out what weighting the Lucene index provides.
    • Provide a list for internal documentation and further discussion.

Notes:

Jackrabbit uses the default Lucene algorithm to calculate the score for a jcr:contains clause. Any other query element will usually return a score of 1000.
A quick test showed the following for the query:

//*[jcr:contains(.,'apache')] order by @jcr:score descending
jcr:score | text property
----------------------------------------------------------------------
     1000 | "Apache Jackrabbit"
      848 | "some test jackrabbit apache, apache is great"
      350 | "this is a text that is much larger than the first one and only contains the word apache once."

Another article that is in line with the JCR documentation:
https://stackoverflow.com/questions/30885219/understanding-apache-lucenes-scoring-algorithm

Score calculation is complex. The starting point is Lucene's classic (TF-IDF) practical scoring function:

score(q,d) = coord(q,d) · queryNorm(q) · ∑[t in q] ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )

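In plain terms: the score depends, among other factors, on how often the search term occurs in the document (tf), how rare the term is across the index (idf), the boost factor assigned to the term, and the length of the matched field (norm). This matches the test above, where the short text scores highest and the long text with a single occurrence scores lowest.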
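
To make the formula concrete, here is a minimal Python sketch applied to the three sample texts from the quick test. It is an illustration under assumptions, not Jackrabbit's or Lucene's actual code. The factor definitions (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), norm = 1 / sqrt(number of terms)) are the ones used by Lucene's classic default similarity. The tokenizer, the function names and the three-document corpus are hypothetical simplifications, and the lossy one-byte encoding of norms in the index is not modelled.

```python
import math
import re

# The three sample texts from the quick test above.
SAMPLE_DOCS = [
    "Apache Jackrabbit",
    "some test jackrabbit apache, apache is great",
    "this is a text that is much larger than the first one "
    "and only contains the word apache once.",
]


def tokenize(text):
    """Lowercase and split on non-alphanumerics (simplified stand-in for a Lucene analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf(freq):
    """Term frequency factor: square root of the raw occurrence count."""
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    """Inverse document frequency: rarer terms get a larger weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_terms):
    """Field length norm: shorter fields get a larger weight (index-time boosts omitted)."""
    return 1.0 / math.sqrt(num_terms)


def classic_score(query_terms, doc_text, corpus, boost=1.0):
    """Evaluate coord * queryNorm * sum(tf * idf^2 * boost * norm) for one document."""
    doc_terms = tokenize(doc_text)
    if not doc_terms:
        return 0.0
    doc_sets = [set(tokenize(d)) for d in corpus]
    idfs = {
        t: idf(sum(t in terms for terms in doc_sets), len(corpus))
        for t in query_terms
    }
    # queryNorm makes scores comparable across queries; it does not change ranking.
    sum_sq = sum((idfs[t] * boost) ** 2 for t in query_terms)
    query_norm = 1.0 / math.sqrt(sum_sq) if sum_sq > 0 else 1.0
    matched = [t for t in query_terms if t in doc_terms]
    # coord rewards documents that match more of the query terms.
    coord = len(matched) / len(query_terms)
    norm = length_norm(len(doc_terms))
    total = sum(
        tf(doc_terms.count(t)) * idfs[t] ** 2 * boost * norm for t in matched
    )
    return coord * query_norm * total


scores = [classic_score(["apache"], d, SAMPLE_DOCS) for d in SAMPLE_DOCS]
top = max(scores)
for text, score in zip(SAMPLE_DOCS, scores):
    # Rescale so the best hit is 1000, only to ease comparison with the table.
    print(f"{round(1000 * score / top):>5} | {text}")
```

Rescaled so that the best hit is 1000, the sketch gives roughly 1000, 756 and 324 for the three texts. That is the same ordering as the measured 1000, 848 and 350, but not the same values. The gap is expected, because real analysis (for example stop-word handling), the byte-encoded norms, and whatever Jackrabbit does when mapping Lucene scores to jcr:score are not reproduced here.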


Generated at Mon Feb 12 10:39:20 CET 2024 using Jira 9.4.2#940002-sha1:46d1a51de284217efdcb32434eab47a99af2938b.