  1. Solr Search Provider
  2. MGNLEESOLR-61

Ability to provide a custom crawler implementation

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.1
    • Fix Version/s: 3.0
    • Sprint: Sprint 7 (Kromeriz)
    • Story Points: 2

      Description

      As a developer, I want to be able to implement my own web crawler in Magnolia. With our own crawler, we want to add logic that excludes certain pages from being indexed by Solr.

      Magnolia ships its own crawler (MgnlCrawler.java). This crawler is executed by the 'CrawlerIndexerCommand' command, which can be changed in the Magnolia configuration.

      What we tried so far:
      We implemented our own command (almost the same code as 'CrawlerIndexerCommand', except that the controller calls our own crawler) and added the factories and the indexer and crawler maps to our module class. This duplicates existing code and is not a clean way to do this in Java.

        @Override
        public void start(ModuleLifecycleContext moduleLifecycleContext) {
            dataIndexerFactory.init();
            crawlerIndexerFactory.init();
        }

        @Override
        public void stop(ModuleLifecycleContext moduleLifecycleContext) {
            dataIndexerFactory.cleanup();
            crawlerIndexerFactory.cleanup();
        }
      

      Possible solutions:
      1 Making the crawler implementation configurable in Magnolia.
      2 Extending 'MgnlCrawler'. We would like to reuse methods such as treatFieldMappings() and getIndexService(), but these methods are currently private. We only want to add some logic to the shouldVisit() and visit() methods.
      3 Extending 'CrawlerIndexerCommand'. However, the contentIndexerModule is private.
      4 An app in Magnolia to manage exclusions and other Solr configuration.

      Point four is a nice-to-have feature for the future.
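
To illustrate solution 2, here is a minimal, self-contained sketch of what extending the crawler could look like once the helper methods are protected instead of private. BaseCrawler is a hypothetical stand-in for 'MgnlCrawler', and its String-based shouldVisit() and visit() signatures are assumptions made to keep the example runnable. The real class and its method signatures will differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CrawlerExtensionSketch {

    // Hypothetical stand-in for MgnlCrawler; not the real Magnolia class.
    public static class BaseCrawler {
        private final List<String> indexed = new ArrayList<>();

        // Protected rather than private, so subclasses can reuse it.
        protected String treatFieldMappings(String url) {
            return "doc:" + url;
        }

        // Default rule (assumed for this sketch): only follow http(s) links.
        public boolean shouldVisit(String url) {
            return url.startsWith("http");
        }

        // Default behaviour (assumed): map the page and record it as indexed.
        public void visit(String url) {
            indexed.add(treatFieldMappings(url));
        }

        public void crawl(List<String> urls) {
            for (String url : urls) {
                if (shouldVisit(url)) {
                    visit(url);
                }
            }
        }

        public List<String> getIndexed() {
            return indexed;
        }
    }

    // Project-specific crawler: adds an exclusion rule, reuses everything else.
    public static class ExcludingCrawler extends BaseCrawler {
        private final List<String> excludedPrefixes;

        public ExcludingCrawler(List<String> excludedPrefixes) {
            this.excludedPrefixes = excludedPrefixes;
        }

        @Override
        public boolean shouldVisit(String url) {
            if (!super.shouldVisit(url)) {
                return false;
            }
            for (String prefix : excludedPrefixes) {
                if (url.startsWith(prefix)) {
                    return false; // excluded from indexing
                }
            }
            return true;
        }
    }

    public static void main(String[] args) {
        ExcludingCrawler crawler =
                new ExcludingCrawler(Arrays.asList("http://example.com/private"));
        crawler.crawl(Arrays.asList(
                "http://example.com/public/page",
                "http://example.com/private/page",
                "ftp://example.com/file"));
        System.out.println(crawler.getIndexed());
    }
}
```

With this shape, the subclass only carries the exclusion rule, and no command or factory code has to be copied.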

              People

              • Assignee: mdivilek Milan Divilek
              • Reporter: mvdmark Michaël van der Mark
              • Votes: 1
              • Watchers: 3

                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0d
                  Logged: 10m