[MGNLEESOLR-72] Crawler ignores robots meta-tag from the page Created: 17/Jul/15  Updated: 03/Dec/20  Resolved: 26/Aug/15

Status: Closed
Project: Solr Search Provider
Component/s: None
Affects Version/s: None
Fix Version/s: 3.0

Type: Improvement Priority: Neutral
Reporter: Mariusz Chruscielewski Assignee: Milan Divilek
Resolution: Fixed Votes: 0
Labels: support
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows 7

Attachments: Java Source File MgnlCrawler.java    
Issue Links:
Cloners
Template:
Patch included:
Yes
Acceptance criteria:
Empty
Task DoD:
[ ]* Doc/release notes changes? Comment present?
[ ]* Downstream builds green?
[ ]* Solution information and context easily available?
[ ]* Tests
[ ]* FixVersion filled and not yet released
[ ]  Architecture Decision Record (ADR)
Date of First Response:
Visible to:
Edgar Vonk
Sprint: Sprint 7 (Kromeriz)
Story Points: 2

 Description   

The current implementation of magnolia-solr-search-provider does not check the value of a page's "robots" meta tag. As a result, every page the crawler finds is indexed, even when the robots meta tag is set to "noindex". The root cause is on the crawler4j side, which does not respect this flag. The issue has been open on their issue tracker (https://code.google.com/p/crawler4j/issues/detail?id=59) since 2011 and still exists. One option is to modify MgnlCrawler's visit(Page p) method so that it reads the flag from the parsed content and does not index the page in Solr when "noindex" is present. A possible solution is implemented in the attached MgnlCrawler.java (lines 108-112).
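
For illustration, the check described above could look roughly like the following. This is a minimal sketch, not the code from the attached MgnlCrawler.java: the class name RobotsMetaCheck and the method hasNoIndex are hypothetical, it works on the raw HTML string only, and it assumes quoted attribute values.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of magnolia-solr-search-provider or crawler4j).
final class RobotsMetaCheck {

    // Matches any <meta ...> tag, case-insensitively.
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Matches attr="value" or attr='value'; unquoted values are not handled.
    private static final Pattern ATTRIBUTE =
            Pattern.compile("([a-zA-Z-]+)\\s*=\\s*(\"([^\"]*)\"|'([^']*)')");

    private RobotsMetaCheck() {
    }

    // Returns true if the HTML contains <meta name="robots"> with a "noindex" directive.
    static boolean hasNoIndex(String html) {
        if (html == null) {
            return false;
        }
        Matcher tag = META_TAG.matcher(html);
        while (tag.find()) {
            String name = null;
            String content = null;
            Matcher attr = ATTRIBUTE.matcher(tag.group());
            while (attr.find()) {
                String key = attr.group(1).toLowerCase(Locale.ROOT);
                String value = attr.group(3) != null ? attr.group(3) : attr.group(4);
                if (key.equals("name")) {
                    name = value;
                } else if (key.equals("content")) {
                    content = value;
                }
            }
            if (name == null || content == null || !name.trim().equalsIgnoreCase("robots")) {
                continue;
            }
            // The content attribute is a comma-separated list of directives.
            for (String directive : content.toLowerCase(Locale.ROOT).split(",")) {
                if (directive.trim().equals("noindex")) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

In visit(Page p), the crawler would presumably obtain the page HTML from the parsed content, call a check like hasNoIndex on it, and return early without sending the document to Solr when it yields true. How the HTML is obtained there is an assumption and depends on the crawler4j version in use.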



 Comments   
Comment by Jaroslav Simak [ 27/Aug/15 ]

Fix copyright year when integrating

Generated at Mon Feb 12 10:59:49 CET 2024 using Jira 9.4.2#940002-sha1:46d1a51de284217efdcb32434eab47a99af2938b.