[DEAD-6] strip fragments from URIs before checking them Created: 21/Mar/14  Updated: 26/Mar/14  Resolved: 26/Mar/14

Status: Resolved
Project: Deadlink
Component/s: None
Affects Version/s: None
Fix Version/s: 1.1.6

Type: Bug Priority: Neutral
Reporter: Richard Unger Assignee: Marvin Kerkhoff
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Template:
Patch included:
Yes
Acceptance criteria:
Empty
Task DoD:
[ ]* Doc/release notes changes? Comment present?
[ ]* Downstream builds green?
[ ]* Solution information and context easily available?
[ ]* Tests
[ ]* FixVersion filled and not yet released
[ ]  Architecture Decision Record (ADR)
Bug DoR:
[ ]* Steps to reproduce, expected, and actual results filled
[ ]* Affected version filled
Date of First Response:

 Description   

Currently, the link-checker also tries to check fragment-only and fragment-carrying URIs such as:

#nav
#maincontent
www.lfrz.at/#
www.lfrz.at/#anchor

Suggested solution:

1) Strip the fragment from the URL before checking it.
2) If the resulting URL is empty, there is no need to check it at all (see the sketch below).
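
A minimal sketch of that logic, using commons-lang StringUtils (which the patch below also relies on); the helper name stripFragment is illustrative only:

import org.apache.commons.lang.StringUtils;

public class FragmentStripSketch {

    /** Returns the URI without its fragment, or null if nothing checkable remains. */
    static String stripFragment(final String uri) {
        // 1) strip the fragment from the URL before checking it
        final String withoutFragment = StringUtils.substringBefore(uri, "#");
        // 2) if the resulting URL is empty, there is no need to check it
        return StringUtils.isEmpty(withoutFragment) ? null : withoutFragment;
    }

    public static void main(final String[] args) {
        System.out.println(stripFragment("#nav"));                // null  -> skip entirely
        System.out.println(stripFragment("www.lfrz.at/#"));       // www.lfrz.at/
        System.out.println(stripFragment("www.lfrz.at/#anchor")); // www.lfrz.at/
    }
}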



 Comments   
Comment by Richard Unger [ 21/Mar/14 ]

Patch:

diff --git a/src/main/java/de/marvinkerkhoff/checker/LinkChecker.java b/src/main/java/de/marvinkerkhoff/checker/LinkChecker.java
index d40aedb..52c46ec 100644
--- a/src/main/java/de/marvinkerkhoff/checker/LinkChecker.java
+++ b/src/main/java/de/marvinkerkhoff/checker/LinkChecker.java
@@ -31,6 +31,7 @@
 import javax.jcr.version.VersionException;
 
 import org.apache.commons.codec.binary.Base64;
+import org.apache.commons.lang.StringUtils;
 import org.apache.log4j.Logger;
 import org.jsoup.Connection;
 import org.jsoup.HttpStatusException;
@@ -68,10 +69,15 @@
     private long scanTime, startTime;
     private Long totalLinks = new Long(0);
     private Long goodLinks = new Long(0);
+	protected String[] IGNOREARRAY;
 
     public LinkChecker(final Node node) {
         loadProperties();
-
+        if (!StringUtils.isEmpty(IGNORELINKS))
+        	IGNOREARRAY = IGNORELINKS.split(",");
+        else
+        	IGNOREARRAY = new String[0];
+        	
         REPORTNODE = node;  
         
         try {
@@ -112,6 +118,10 @@
 
     public LinkChecker(final Node node, Boolean continueScan) {
         loadProperties();
+        if (!StringUtils.isEmpty(IGNORELINKS))
+        	IGNOREARRAY = IGNORELINKS.split(",");
+        else
+        	IGNOREARRAY = new String[0];
         REPORTNODE = node;
 
         if (continueScan) {
@@ -192,14 +202,22 @@
     private HashMap<String, PageLink> appendElements(final HashMap<String, PageLink> pageLinkList, String docTitle, final Elements elem, final String attrKey, final PageLink checkPage) {
         Outer:
         for (final Element pageElement : elem) {
-            String linkTarget = pageElement.attr(attrKey);
-            if (linkTarget == null || linkTarget.trim().length() < 1) {
-                linkTarget = pageElement.attr("href");
+            String linkTargetWithFragment = pageElement.attr(attrKey);
+            if (linkTargetWithFragment == null || linkTargetWithFragment.trim().length() < 1) {
+            	linkTargetWithFragment = pageElement.attr("href");
             }
             
-            String[] ignoreArray = IGNORELINKS.split(",");
+            // strip fragment portion from link
+            String linkTarget = StringUtils.substringBefore(linkTargetWithFragment, "#");
+            String linkFragment = StringUtils.substringAfter(linkTargetWithFragment, "#");
+            if (StringUtils.isEmpty(linkTarget)){
+            	LOG.debug("Skipping fragment link: "+linkTargetWithFragment);
+            	continue Outer; // continue with next link
+            }
+            if (!StringUtils.isEmpty(linkFragment))
+            	LOG.debug("Stripped fragment '#"+linkFragment+"' from link: "+linkTargetWithFragment);
             
-            for (String ignoreString : ignoreArray) {
+            for (String ignoreString : IGNOREARRAY) {
                 if (linkTarget.startsWith(ignoreString)) {
                     LOG.info("Skipping "+ignoreString+" link: " + pageElement);
                     continue Outer;

This patch checks for URI fragments and strips them from the URIs being checked. If the resulting URI is empty, it isn't checked.

I also moved the initialization of the ignore array (IGNOREARRAY) out of appendElements() and into the constructors, so IGNORELINKS is split only once per scan instead of once per element (minor optimization).

Comment by Marvin Kerkhoff [ 26/Mar/14 ]

Thanks for the patch, but I found a much shorter way to do it. This fix also catches links that point to the page itself.

diff --git a/src/main/java/de/marvinkerkhoff/checker/LinkChecker.java b/src/main/java/de/marvinkerkhoff/checker/LinkChecker.java
index d40aedb..68b03e6 100644
--- a/src/main/java/de/marvinkerkhoff/checker/LinkChecker.java
+++ b/src/main/java/de/marvinkerkhoff/checker/LinkChecker.java
@@ -31,6 +31,7 @@ import javax.jcr.query.InvalidQueryException;
 import javax.jcr.version.VersionException;

 import org.apache.commons.codec.binary.Base64;
+import org.apache.commons.lang.StringUtils;
 import org.apache.log4j.Logger;
 import org.jsoup.Connection;
 import org.jsoup.HttpStatusException;
@@ -200,7 +201,8 @@ public class LinkChecker {
             String[] ignoreArray = IGNORELINKS.split(",");

             for (String ignoreString : ignoreArray) {
-                if (linkTarget.startsWith(ignoreString)) {
+                boolean isOnlyHashTagOrSameURL = pageElement.baseUri().equals(StringUtils.substringBefore(linkTarget, "#"));
+                if (linkTarget.startsWith(ignoreString) || isOnlyHashTagOrSameURL) {
                     LOG.info("Skipping "+ignoreString+" link: " + pageElement);
                     continue Outer;
                 }
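
For illustration, a small standalone sketch of the same-URL check above. It assumes linkTarget holds the absolute URL of the link (e.g. obtained through jsoup's abs: attribute prefix); that assumption is not shown in the diff itself, and the class name is illustrative only:

import org.apache.commons.lang.StringUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SameUrlCheckSketch {
    public static void main(final String[] args) {
        // Parse a page that links to one of its own anchors and to another page.
        final Document doc = Jsoup.parse(
                "<a href='#top'>top</a> <a href='http://www.example.com/other.html'>other</a>",
                "http://www.example.com/page.html");

        for (final Element link : doc.select("a")) {
            // Assumption: the checker works with absolute URLs at this point (abs: prefix).
            final String linkTarget = link.attr("abs:href");

            // A link is "only a hash tag or the same URL" when the part before '#'
            // equals the page's own base URI.
            final boolean isOnlyHashTagOrSameURL =
                    link.baseUri().equals(StringUtils.substringBefore(linkTarget, "#"));

            System.out.println(linkTarget + " -> skip: " + isOnlyHashTagOrSameURL);
            // http://www.example.com/page.html#top -> skip: true
            // http://www.example.com/other.html    -> skip: false
        }
    }
}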