Solr Search Provider / MGNLEESOLR-164

Support politenessDelay to reduce stress on resources.


Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Neutral
    • Fix Version/s: 6.1.3, 5.5.4
    • Sprint: DevX 33
    • Story Points: 1

    Description

  Recently, Swissre experienced unusually high resource usage. We identified several factors contributing to the high memory consumption, and our Solr crawler was one of them.

  We would love to delay the crawler a bit to reduce the load on the public servers. Could we add a politenessDelay option to the crawler config?

  The underlying library already supports this; from the crawler4j documentation (https://github.com/yasserg/crawler4j):

      Politeness
      crawler4j is designed very efficiently and has the ability to crawl domains very fast (e.g., it has been able to crawl 200 Wikipedia pages per second). However, since this is against crawling policies and puts huge load on servers (and they might block you!), since version 1.3, by default crawler4j waits at least 200 milliseconds between requests. However, this parameter can be tuned:
      
      crawlConfig.setPolitenessDelay(politenessDelay);
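
      For illustration, a minimal sketch of applying the delay on a plain crawler4j CrawlConfig (the 500 ms value is just an example; crawler4j's built-in default is 200 ms):

          import edu.uci.ics.crawler4j.crawler.CrawlConfig;

          CrawlConfig crawlConfig = new CrawlConfig();
          // Wait at least 500 ms between consecutive requests so the crawl
          // puts less sustained load on the target servers.
          crawlConfig.setPolitenessDelay(500);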
      

      Development Note:
      We can add politenessDelay in info.magnolia.module.indexer.crawler.commands.CrawlerIndexerCommand, in the createCrawlController() method (a sketch of the proposed change follows the current code below).

      protected CrawlController createCrawlController(String crawlerName, CrawlerConfig config) throws Exception {
          CrawlConfig crawlConfig = new CrawlConfig();
          // Each crawler gets its own temp storage folder (Path here is info.magnolia.cms.core.Path).
          crawlConfig.setCrawlStorageFolder(Files.createTempDirectory(FileSystems.getDefault().getPath(Path.getTempDirectory().getAbsolutePath()), crawlerName).toString());
          crawlConfig.setMaxDepthOfCrawling(config.getDepth());
          crawlConfig.setMaxOutgoingLinksToFollow(config.getMaxOutgoingLinksToFollow());

          // Optional HTTP authentication for protected sites.
          AuthInfo authInfo = createAuthInfo(crawlerName, config);
          if (authInfo != null) {
              crawlConfig.addAuthInfo(authInfo);
          }

          PageFetcher pageFetcher = new PageFetcher(crawlConfig);
          RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
          RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

          // Register every configured site as a crawl seed.
          CrawlController controller = new CrawlController(crawlConfig, pageFetcher, robotstxtServer);
          for (Site site : config.getSites()) {
              log.debug("Crawling site: {}", site.getUrl());
              controller.addSeed(site.getUrl());
          }
          return controller;
      }
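
      A possible shape of the change, as a sketch only: assuming CrawlerConfig is extended with a politenessDelay property (the getPolitenessDelay() accessor below is hypothetical, not part of the current API), createCrawlController() could pass it through to crawler4j:

          protected CrawlController createCrawlController(String crawlerName, CrawlerConfig config) throws Exception {
              CrawlConfig crawlConfig = new CrawlConfig();
              // ... existing setup as above ...

              // Hypothetical new CrawlerConfig property; when it is not set
              // we keep crawler4j's built-in 200 ms default.
              if (config.getPolitenessDelay() > 0) {
                  crawlConfig.setPolitenessDelay(config.getPolitenessDelay());
              }

              // ... rest of the method unchanged ...
          }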
      

      Current config of Swissre:

      Sitemap pages:
      https://www.swissre.com/sitemap~pages~.html
      https://www.swissre.com/sitemap~profiles~.html

      Discussion can be found here:
      https://magnolia-cms.slack.com/archives/C0214MATDNU/p1650958746312519?thread_ts=1650947030.132319&cid=C0214MATDNU
      Logs of those crawlers:
      https://magnolia-cms.slack.com/archives/C0214MATDNU/p1650959103580969?thread_ts=1650947030.132319&cid=C0214MATDNU

      Thank you so much.


People

    Assignee: Milan Divilek (mdivilek)
    Reporter: Minh Nguyen (minh.nguyen)
    Votes: 1
    Watchers: 1
