  1. Solr Search Provider
  2. MGNLEESOLR-72

Crawler ignores robots meta-tag from the page

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Neutral
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.0
    • Labels:
    • Environment:
      Windows 7
    • Patch included:
      Yes
    • Sprint:
      Sprint 7 (Kromeriz)
    • Story Points:
      2

      Description

      The current implementation of magnolia-solr-search-provider does not check the value of a page's "robots" meta tag. As a result, every page the crawler finds is indexed, even when the robots meta tag is set to "noindex". The problem is on the crawler4j side, which does not respect this flag. The issue has been open on their issue tracker (https://code.google.com/p/crawler4j/issues/detail?id=59) since 2011 and still exists. One option is to modify MgnlCrawler's visit(Page p) method to check the flag in the parsed content and skip indexing the page in Solr if "noindex" is present. A possible implementation is attached (lines 108-112).
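The check described above could look roughly like the following. This is only a minimal sketch, not the attached patch: the class name RobotsMetaCheck and the method hasNoIndex are hypothetical, and it assumes the page's raw HTML is available as a String inside visit(Page p). It uses only java.util.regex, so it does not depend on any particular crawler4j API.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: detects a "noindex" directive in a page's robots meta tag. */
public final class RobotsMetaCheck {

    // Any <meta ...> tag, regardless of case.
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // name="robots" or name='robots' inside a meta tag.
    private static final Pattern NAME_ROBOTS =
            Pattern.compile("\\bname\\s*=\\s*[\"']robots[\"']", Pattern.CASE_INSENSITIVE);

    // content="..." or content='...'; group 1 holds the directive list.
    private static final Pattern CONTENT_ATTR =
            Pattern.compile("\\bcontent\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    private RobotsMetaCheck() {
    }

    /** Returns true if the HTML contains a robots meta tag listing "noindex". */
    public static boolean hasNoIndex(String html) {
        if (html == null) {
            return false;
        }
        Matcher tags = META_TAG.matcher(html);
        while (tags.find()) {
            String tag = tags.group();
            if (!NAME_ROBOTS.matcher(tag).find()) {
                continue; // some other meta tag, e.g. description
            }
            Matcher content = CONTENT_ATTR.matcher(tag);
            if (!content.find()) {
                continue; // robots tag without a content attribute
            }
            // Directives are comma-separated, e.g. "noindex, nofollow".
            for (String directive : content.group(1).split(",")) {
                if ("noindex".equals(directive.trim().toLowerCase(Locale.ROOT))) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

In a crawler's visit method, the idea would be to call RobotsMetaCheck.hasNoIndex(html) first and return without sending the document to Solr when it yields true. A regex scan is a simplification; a real HTML parser would handle unquoted attributes and other edge cases more reliably.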

              People

              • Assignee:
                mdivilek Milan Divilek
                Reporter:
                mchruscielewski Mariusz Chruscielewski
                Visible to:
                Edgar Vonk
              • Votes:
                0
                Watchers:
                2