  1. Solr Search Provider
  2. MGNLEESOLR-61

Ability to implement own crawler implementation


Details

    • Improvement
    • Resolution: Fixed
    • Major
    • 3.0
    • 2.1.1
    • Sprint 7 (Kromeriz)
    • 2

    Description

      As a developer, I want to be able to plug my own web crawler into Magnolia. With a custom crawler we can add logic that excludes certain pages from being indexed by Solr.
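
      The kind of exclusion logic meant here could look roughly like the sketch below. It is only an illustration: the class UrlExclusionRules, its method shouldIndex() and the example patterns are hypothetical and are not part of Magnolia or the Solr Search Provider module.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper (not a Magnolia class): decides whether a URL may be indexed. */
final class UrlExclusionRules {

    private final List<Pattern> excluded = new ArrayList<>();

    UrlExclusionRules(List<String> excludedRegexes) {
        // Compile each pattern once; they are reused for every URL the crawler sees.
        for (String regex : excludedRegexes) {
            excluded.add(Pattern.compile(regex));
        }
    }

    /** Returns false as soon as the URL matches any exclusion pattern. */
    boolean shouldIndex(String url) {
        for (Pattern pattern : excluded) {
            if (pattern.matcher(url).find()) {
                return false;
            }
        }
        return true;
    }
}
```

      A custom crawler would call something like shouldIndex() from its shouldVisit() method, so excluded pages never reach the indexer.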

      Magnolia ships its own crawler (MgnlCrawler.java). This crawler is run by the 'CrawlerIndexerCommand' command, which can be replaced in the Magnolia configuration.

      What we tried so far:
      We implemented our own command (almost the same code as 'CrawlerIndexerCommand', except that the controller calls our own crawler) and added the factories and the indexer and crawler maps to our Module class. This duplicates code, which is not how this should be done in Java.

      // Excerpt from our Module class: both factories are initialised on module start.
      @Override
      public void start(ModuleLifecycleContext moduleLifecycleContext) {
          dataIndexerFactory.init();
          crawlerIndexerFactory.init();
      }

      // Both factories are cleaned up again on module stop.
      @Override
      public void stop(ModuleLifecycleContext moduleLifecycleContext) {
          dataIndexerFactory.cleanup();
          crawlerIndexerFactory.cleanup();
      }
      

      Possible solutions:
      1 Make the crawler implementation configurable in Magnolia.
      2 Allow extending 'MgnlCrawler'. We would like to reuse methods such as treatFieldMappings() and getIndexService(), but these are currently private. We only want to add some logic to the shouldVisit() and visit() methods.
      3 Allow extending 'CrawlerIndexerCommand'. At the moment the contentIndexerModule is private.
      4 Provide an app in Magnolia to manage exclusions and other Solr configuration.

      Point 4 is a nice-to-have feature for the future.
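
      Points 1 and 2 could be combined roughly as in the sketch below. It assumes nothing about the real classes: BaseCrawler is a simplified, hypothetical stand-in for 'MgnlCrawler' (the real method signatures will differ), and ExcludingCrawler and CrawlerFactory are invented names used only to show protected hooks and a crawler class that is chosen by a configured class name.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for MgnlCrawler; not the real Magnolia class. */
class BaseCrawler {
    protected final List<String> indexed = new ArrayList<>();

    /** Hook: subclasses may narrow which URLs are crawled. */
    protected boolean shouldVisit(String url) {
        return url.startsWith("http");
    }

    /** Hook: subclasses may add work around indexing. */
    protected void visit(String url) {
        indexed.add(treatFieldMappings(url));
    }

    /** Protected rather than private, so subclasses can reuse it. */
    protected String treatFieldMappings(String url) {
        return "doc:" + url;
    }

    /** The crawl loop stays in the base class and calls the hooks. */
    final List<String> crawl(List<String> urls) {
        for (String url : urls) {
            if (shouldVisit(url)) {
                visit(url);
            }
        }
        return indexed;
    }
}

/** Custom crawler: adds only the exclusion rule, reuses everything else. */
class ExcludingCrawler extends BaseCrawler {
    @Override
    protected boolean shouldVisit(String url) {
        return super.shouldVisit(url) && !url.contains("/internal/");
    }
}

/** Point 1: the crawler class name comes from configuration. */
final class CrawlerFactory {
    static BaseCrawler create(String className) {
        try {
            return Class.forName(className)
                    .asSubclass(BaseCrawler.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot create crawler " + className, e);
        }
    }
}
```

      With this shape, a project would only ship a small subclass and set its class name in the configuration, instead of copying the command and the module wiring.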

              People

                mdivilek Milan Divilek
                mvdmark Michaël van der Mark

                    Time Tracking

                      Estimated: Not Specified
                      Remaining: 0d
                      Logged: 10m