[MGNLEESOLR-61] Ability to implement own crawler implementation Created: 08/May/15  Updated: 03/Dec/20  Resolved: 25/Aug/15

Status: Closed
Project: Solr Search Provider
Component/s: None
Affects Version/s: 2.1.1
Fix Version/s: 3.0

Type: Improvement Priority: Major
Reporter: Michaël van der Mark Assignee: Milan Divilek
Resolution: Fixed Votes: 1
Labels: maintenance, quickwin
Remaining Estimate: 0d
Time Spent: 10m
Original Estimate: Not Specified

Template:
Acceptance criteria:
Empty
Task DoD:
[ ]* Doc/release notes changes? Comment present?
[ ]* Downstream builds green?
[ ]* Solution information and context easily available?
[ ]* Tests
[ ]* FixVersion filled and not yet released
[ ]  Architecture Decision Record (ADR)
Date of First Response:
Sprint: Sprint 7 (Kromeriz)
Story Points: 2

 Description   

As a developer, I want to be able to implement my own web crawler in Magnolia. With our own crawler we want to add logic that excludes some pages from being indexed by Solr.

Magnolia provides its own crawler (MgnlCrawler.java), which is executed by the 'CrawlerIndexerCommand' command. This command can be changed in the Magnolia configuration.

What we tried so far:
We implemented our own command (almost the same code as 'CrawlerIndexerCommand', except that the controller starts our own crawler) and added the factories and the indexer and crawler maps to our Module class. This means copying code, which is not the way to do this in Java.

	@Override
	public void start(ModuleLifecycleContext moduleLifecycleContext) {
		dataIndexerFactory.init();
		crawlerIndexerFactory.init();
	}

	@Override
	public void stop(ModuleLifecycleContext moduleLifecycleContext) {
		dataIndexerFactory.cleanup();
		crawlerIndexerFactory.cleanup();
	}

Possible solutions:
1 Making the crawler implementation configurable in Magnolia.
2 Extending 'MgnlCrawler'. We would like to reuse methods such as treatFieldMappings() and getIndexService(), but these methods are currently private. We only want to add some logic to the shouldVisit() and visit() methods.
3 Extending 'CrawlerIndexerCommand'; however, contentIndexerModule is private.
4 An app in Magnolia to manage exclusions and other Solr configuration.

Point four is a nice-to-have feature for the future.
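
To illustrate option 2, here is a minimal sketch of the kind of subclass we have in mind. It does not use the real Magnolia or crawler4j classes: BaseCrawlerSketch is a hypothetical stand-in for MgnlCrawler, the String-based shouldVisit() and visit() signatures are simplified assumptions, and the exclusion patterns are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for MgnlCrawler; the real class and its signatures differ.
class BaseCrawlerSketch {
    // Records "indexed" URLs; stands in for sending pages to Solr.
    protected final List<String> indexed = new ArrayList<>();

    // Assumed to be overridable, which is what option 2 asks for.
    protected boolean shouldVisit(String url) {
        return url.startsWith("http");
    }

    protected void visit(String url) {
        indexed.add(url);
    }

    // Simplified crawl loop: visit every URL that passes shouldVisit().
    final void crawl(List<String> urls) {
        for (String url : urls) {
            if (shouldVisit(url)) {
                visit(url);
            }
        }
    }
}

// Hypothetical custom crawler that only adds exclusion logic on top of the base class.
class ExcludingCrawlerSketch extends BaseCrawlerSketch {
    private final List<Pattern> excludes;

    ExcludingCrawlerSketch(List<Pattern> excludes) {
        this.excludes = excludes;
    }

    @Override
    protected boolean shouldVisit(String url) {
        // Keep the base rules, then reject anything matching an exclusion pattern.
        if (!super.shouldVisit(url)) {
            return false;
        }
        for (Pattern exclude : excludes) {
            if (exclude.matcher(url).matches()) {
                return false;
            }
        }
        return true;
    }
}
```

With something like this, the subclass carries only the exclusion rules, and the field mapping and index service code stays in the base class instead of being copied.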



 Comments   
Comment by Edgar Vonk [ 14/Jul/15 ]

Any news on this? It is quite cumbersome, because we really want to write our own crawler class, but the MgnlCrawler class is hardcoded in the module. E.g. in CrawlerIndexerCommand:

controller.start(MgnlCrawler.class, config.getNbrCrawlers());

Ideally we would like to be able to configure the crawler class in the module meta-inf configuration. Something like:

  <components>
    <id>main</id>
    <component>
      <type>info.magnolia.module.indexer.crawler.MgnlCrawler</type>
      <implementation>org.OurCustomCrawler</implementation>
    </component>
  </components>
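
As a rough sketch of what the command could then do instead of referencing MgnlCrawler.class directly: resolve the crawler class from a configured class name and fall back to the default when nothing is configured. All types here (CrawlerSketch, DefaultCrawlerSketch, CustomCrawlerSketch, CrawlerClassResolver) are hypothetical stand-ins, not Magnolia or crawler4j API, and how the class name would reach the command is an assumption.

```java
// Hypothetical stand-in for the crawler base type.
interface CrawlerSketch {
    String name();
}

// Stands in for the default crawler (MgnlCrawler in the module).
class DefaultCrawlerSketch implements CrawlerSketch {
    public String name() {
        return "default";
    }
}

// Stands in for a project-specific crawler.
class CustomCrawlerSketch implements CrawlerSketch {
    public String name() {
        return "custom";
    }
}

class CrawlerClassResolver {
    // Returns the configured crawler class, or the default when none is configured.
    static Class<? extends CrawlerSketch> resolve(String configuredClassName) {
        if (configuredClassName == null || configuredClassName.trim().isEmpty()) {
            return DefaultCrawlerSketch.class;
        }
        try {
            // asSubclass rejects classes that are not crawlers.
            return Class.forName(configuredClassName).asSubclass(CrawlerSketch.class);
        } catch (ClassNotFoundException | ClassCastException e) {
            throw new IllegalArgumentException(
                    "Not a usable crawler class: " + configuredClassName, e);
        }
    }
}
```

The resolved class would then be passed to controller.start(...) in place of the hardcoded MgnlCrawler.class.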
Comment by Milan Divilek [ 03/Aug/15 ]

Hello Edgar, the ticket is planned for version 3.0. I'll have a look at it asap.
