[MGNLEESOLR-164] Support PolitenessDelay to reduce stressing resources. Created: 28/Apr/22  Updated: 23/Oct/23  Resolved: 21/Mar/23

Status: Closed
Project: Solr Search Provider
Component/s: None
Affects Version/s: 5.5.4
Fix Version/s: 6.1.3

Type: Improvement Priority: Neutral
Reporter: Minh Nguyen Assignee: Milan Divilek
Resolution: Fixed Votes: 1
Labels: None
Σ Remaining Estimate: Not Specified Remaining Estimate: Not Specified
Σ Time Spent: Not Specified Time Spent: Not Specified
Σ Original Estimate: Not Specified Original Estimate: Not Specified

Attachments: PNG File image-2022-04-28-11-03-53-142.png    
Issue Links:
Relates
relates to MGNLEESOLR-185 DOC: New properties in Solr version 6... Closed
Sub-Tasks:
Key             Summary         Type      Status     Assignee
MGNLEESOLR-177  Review          Sub-task  Completed  Michal Novak
MGNLEESOLR-178  preintQA        Sub-task  Completed  Michal Novak
MGNLEESOLR-179  QA              Sub-task  Completed  Javier Benito
MGNLEESOLR-183  Implementation  Sub-task  Completed  Milan Divilek
Template:
Acceptance criteria:
Empty
Task DoD:
[X]* Doc/release notes changes? Comment present?
[X]* Downstream builds green?
[X]* Solution information and context easily available?
[X]* Tests
[X]* FixVersion filled and not yet released
[ ]  Architecture Decision Record (ADR)
Documentation update required:
Yes
Epic Link: DevX Bucket
Sprint: DevX 33
Story Points: 1
Team: DeveloperX
Work Started:

 Description   

Recently, Swissre experienced high resource usage. We found several factors contributing to the high memory usage, and our Solr crawler was one of them.

We would like to slow the crawler down a little to reduce the load on the public servers. Could we add politenessDelay to the crawler config?

The underlying library already supports this: https://github.com/yasserg/crawler4j

Politeness
crawler4j is designed very efficiently and has the ability to crawl domains very fast (e.g., it has been able to crawl 200 Wikipedia pages per second). However, since this is against crawling policies and puts huge load on servers (and they might block you!), since version 1.3, by default crawler4j waits at least 200 milliseconds between requests. However, this parameter can be tuned:

crawlConfig.setPolitenessDelay(politenessDelay);
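
To illustrate what a politeness delay does, here is a minimal sketch of a gate that enforces a minimum gap between consecutive requests. PolitenessGate and its methods are hypothetical names made up for this example; this is not crawler4j's implementation, only the general idea under the assumption that all crawler threads share one gate.

```java
// Hypothetical helper, NOT a crawler4j class: it tracks when the last request
// was scheduled and reports how long the caller should wait before the next one.
final class PolitenessGate {
    private final long delayMillis;
    private long lastFetchMillis;
    private boolean fetchedBefore = false;

    PolitenessGate(long delayMillis) {
        if (delayMillis < 0) {
            throw new IllegalArgumentException("delay must be >= 0");
        }
        this.delayMillis = delayMillis;
    }

    /**
     * Returns how many milliseconds the caller must wait before its request
     * may start, and records the resulting start time for the next caller.
     */
    synchronized long reserve(long nowMillis) {
        long wait = 0;
        if (fetchedBefore) {
            long elapsed = nowMillis - lastFetchMillis;
            if (elapsed < delayMillis) {
                wait = delayMillis - elapsed;
            }
        }
        fetchedBefore = true;
        // Remember the scheduled start time, not the time of the call.
        lastFetchMillis = nowMillis + wait;
        return wait;
    }

    /** Blocks the calling thread until its reserved slot is reached. */
    void await() throws InterruptedException {
        long wait = reserve(System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
    }
}
```

A fetcher would call await() once before each HTTP request. With a 200 ms delay, a request arriving 50 ms after the previous one waits the remaining 150 ms.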

Development Note:
We can help add politenessDelay here, in the createCrawlController() method of
info.magnolia.module.indexer.crawler.commands.CrawlerIndexerCommand:

    protected CrawlController createCrawlController(String crawlerName, CrawlerConfig config) throws Exception {
        CrawlConfig crawlConfig = new CrawlConfig();
        crawlConfig.setCrawlStorageFolder(Files.createTempDirectory(FileSystems.getDefault().getPath(Path.getTempDirectory().getAbsolutePath()), crawlerName).toString());
        crawlConfig.setMaxDepthOfCrawling(config.getDepth());
        crawlConfig.setMaxOutgoingLinksToFollow(config.getMaxOutgoingLinksToFollow());

        AuthInfo authInfo = createAuthInfo(crawlerName, config);
        if (authInfo != null) {
            crawlConfig.addAuthInfo(authInfo);
        }

        PageFetcher pageFetcher = new PageFetcher(crawlConfig);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        CrawlController controller = new CrawlController(crawlConfig, pageFetcher, robotstxtServer);
        for (Site site : config.getSites()) {
            log.debug("Crawling site: " + site.getUrl());
            controller.addSeed(site.getUrl());
        }
        return controller;
    }
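
One possible way to wire the setting in is sketched below. It assumes a new optional crawler property exposed through a getter such as config.getPolitenessDelay(); that getter and the PolitenessDelaySettings class are hypothetical and are not part of the current module. The only library call relied on is crawlConfig.setPolitenessDelay(...), as quoted from the crawler4j README above.

```java
// Hypothetical helper for illustration; not part of the Magnolia module or crawler4j.
final class PolitenessDelaySettings {
    // Default wait between requests, taken from the crawler4j README quoted above.
    static final int DEFAULT_DELAY_MILLIS = 200;

    private PolitenessDelaySettings() {
    }

    /**
     * Resolves the value to pass to crawlConfig.setPolitenessDelay(...).
     * "configured" stands for the assumed optional property; null means "not set".
     */
    static int resolve(Integer configured) {
        if (configured == null) {
            return DEFAULT_DELAY_MILLIS;
        }
        if (configured < 0) {
            throw new IllegalArgumentException(
                    "politenessDelay must be >= 0, got " + configured);
        }
        return configured;
    }
}

// In createCrawlController(), next to the other crawlConfig setters
// (config.getPolitenessDelay() is an assumed getter name):
//
//     crawlConfig.setPolitenessDelay(
//             PolitenessDelaySettings.resolve(config.getPolitenessDelay()));
```

Falling back to the library default when the property is unset keeps existing crawler configurations behaving as they do today.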

Current Swissre configuration:

Sitemap pages:
https://www.swissre.com/sitemap~pages~.html
https://www.swissre.com/sitemap~profiles~.html

Discussion can be found here:
https://magnolia-cms.slack.com/archives/C0214MATDNU/p1650958746312519?thread_ts=1650947030.132319&cid=C0214MATDNU
Crawler logs:
https://magnolia-cms.slack.com/archives/C0214MATDNU/p1650959103580969?thread_ts=1650947030.132319&cid=C0214MATDNU

Thank you so much.


Generated at Mon Feb 12 11:00:44 CET 2024 using Jira 9.4.2#940002-sha1:46d1a51de284217efdcb32434eab47a99af2938b.