Type: Improvement
Resolution: Fixed
Priority: Neutral
Fix Version/s: 6.1.3
Affects Version/s: 5.5.4
Labels: None
Template:
Acceptance criteria: Empty
Task DoD:
Documentation update required: Yes
Epic Link: DEV-2020
Sprint: DevX 33
Story Points: 1

Recently, Swissre experienced high resource usage. We found several factors contributing to the high memory consumption, and our Solr crawler was one of them.

We would like to slow the crawler down slightly to reduce the load on the public servers. Could we add a politenessDelay setting to the crawler config?

The underlying library already supports this: https://github.com/yasserg/crawler4j

Politeness
crawler4j is designed very efficiently and has the ability to crawl domains very fast (e.g., it has been able to crawl 200 Wikipedia pages per second). However, since this is against crawling policies and puts huge load on servers (and they might block you!), since version 1.3, by default crawler4j waits at least 200 milliseconds between requests. However, this parameter can be tuned:

crawlConfig.setPolitenessDelay(politenessDelay);
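
To make the effect of the setting concrete, here is a minimal sketch of what a politeness delay means: consecutive requests are kept at least a fixed number of milliseconds apart. This is an illustration only, not crawler4j's implementation; the class name `PolitenessGate` and its method are hypothetical, and the clock is injected so the logic can be checked without actually sleeping.

```java
import java.util.function.LongSupplier;

/**
 * Illustrative sketch only (hypothetical class, not part of crawler4j).
 * Computes how long a fetcher should pause so that consecutive requests
 * are at least delayMillis apart.
 */
final class PolitenessGate {
    private final long delayMillis;
    private final LongSupplier clock; // injectable time source, in milliseconds
    private long lastRequestAt;
    private boolean hasRequested;

    PolitenessGate(long delayMillis, LongSupplier clock) {
        if (delayMillis < 0) {
            throw new IllegalArgumentException("delayMillis must not be negative");
        }
        this.delayMillis = delayMillis;
        this.clock = clock;
    }

    /**
     * Returns the number of milliseconds the caller should wait before
     * sending the next request, and records when that request will go out.
     */
    synchronized long reserveNextSlot() {
        long now = clock.getAsLong();
        long wait = 0;
        if (hasRequested) {
            long earliestAllowed = lastRequestAt + delayMillis;
            wait = Math.max(0, earliestAllowed - now);
        }
        lastRequestAt = now + wait;
        hasRequested = true;
        return wait;
    }
}
```

A caller would sleep for the returned number of milliseconds before each fetch. With a larger delay the crawl takes longer, but the request rate against the public servers drops accordingly.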

Development Note:
politenessDelay could be added in the createCrawlController() method of info.magnolia.module.indexer.crawler.commands.CrawlerIndexerCommand:

    protected CrawlController createCrawlController(String crawlerName, CrawlerConfig config) throws Exception {
        CrawlConfig crawlConfig = new CrawlConfig();
        crawlConfig.setCrawlStorageFolder(Files.createTempDirectory(FileSystems.getDefault().getPath(Path.getTempDirectory().getAbsolutePath()), crawlerName).toString());
        crawlConfig.setMaxDepthOfCrawling(config.getDepth());
        crawlConfig.setMaxOutgoingLinksToFollow(config.getMaxOutgoingLinksToFollow());

        AuthInfo authInfo = createAuthInfo(crawlerName, config);
        if (authInfo != null) {
            crawlConfig.addAuthInfo(authInfo);
        }

        PageFetcher pageFetcher = new PageFetcher(crawlConfig);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        CrawlController controller = new CrawlController(crawlConfig, pageFetcher, robotstxtServer);
        for (Site site : config.getSites()) {
            log.debug("Crawling site: " + site.getUrl());
            controller.addSeed(site.getUrl());
        }
        return controller;
    }
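
One possible shape for the change in the method above, as a sketch under stated assumptions: the helper `PolitenessDelayResolver` and the getter `CrawlerConfig#getPolitenessDelay()` are hypothetical names introduced for illustration, and the fix actually shipped may differ. The 200 ms fallback matches the default quoted from the crawler4j README.

```java
/**
 * Hypothetical helper (not part of the Magnolia module or crawler4j).
 * Picks the delay to pass to crawler4j's CrawlConfig#setPolitenessDelay(int).
 */
final class PolitenessDelayResolver {
    /** Default delay between requests documented by crawler4j, in milliseconds. */
    static final int DEFAULT_DELAY_MILLIS = 200;

    private PolitenessDelayResolver() {
    }

    /**
     * Returns the configured delay, or the default when the value is
     * missing or negative.
     */
    static int resolve(Integer configuredMillis) {
        if (configuredMillis == null || configuredMillis < 0) {
            return DEFAULT_DELAY_MILLIS;
        }
        return configuredMillis;
    }
}

// Inside createCrawlController(), assuming a hypothetical
// CrawlerConfig#getPolitenessDelay() that returns an Integer:
//
//     crawlConfig.setPolitenessDelay(
//             PolitenessDelayResolver.resolve(config.getPolitenessDelay()));
```

Falling back to the default keeps existing crawler configurations behaving as before, while sites such as the one described here can raise the value to reduce load.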

Current config of swissre:

Sitemap pages:
https://www.swissre.com/sitemap~pages~.html
https://www.swissre.com/sitemap~profiles~.html

Discussion can be found here:
https://magnolia-cms.slack.com/archives/C0214MATDNU/p1650958746312519?thread_ts=1650947030.132319&cid=C0214MATDNU
Crawler logs:
https://magnolia-cms.slack.com/archives/C0214MATDNU/p1650959103580969?thread_ts=1650947030.132319&cid=C0214MATDNU

Thank you so much.


image-2022-04-28-11-03-53-142.png
126 kB
28/Apr/22 6:03 AM

relates to

MGNLEESOLR-185 DOC: New properties in Solr version 6.1.3

Closed

1.	Review	Completed	Michal Novak
2.	preintQA	Completed	Michal Novak
3.	QA	Completed	Javier Benito
4.	Implementation	Completed	Milan Divilek

Assignee: Milan Divilek

Reporter: Minh Nguyen

Team: DeveloperX

Votes: 1

Watchers: 1

Created: 28/Apr/22 6:06 AM

Updated: 23/Oct/23 11:53 AM

Resolved: 21/Mar/23 10:03 AM

Work Started: 14/Mar/23 8:45 AM
