[DEAD-6] strip fragments from URIs before checking them Created: 21/Mar/14 Updated: 26/Mar/14 Resolved: 26/Mar/14 |
|
| Status: | Resolved |
| Project: | Deadlink |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 1.1.6 |
| Type: | Bug | Priority: | Neutral |
| Reporter: | Richard Unger | Assignee: | Marvin Kerkhoff |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Template: |
|
| Patch included: |
Yes
|
| Acceptance criteria: |
Empty
|
| Date of First Response: |
| Description |
|
Currently, the link checker checks fragment-only URIs such as #nav as if they were full link targets. Suggested solution: strip the fragment from the URL before checking it. |
| Comments |
| Comment by Richard Unger [ 21/Mar/14 ] |
|
Patch:
diff --git a/src/main/java/de/marvinkerkhoff/checker/LinkChecker.java b/src/main/java/de/marvinkerkhoff/checker/LinkChecker.java
index d40aedb..52c46ec 100644
--- a/src/main/java/de/marvinkerkhoff/checker/LinkChecker.java
+++ b/src/main/java/de/marvinkerkhoff/checker/LinkChecker.java
@@ -31,6 +31,7 @@
import javax.jcr.version.VersionException;
import org.apache.commons.codec.binary.Base64;
+import org.apache.commons.lang.StringUtils;
import org.apache.log4j.Logger;
import org.jsoup.Connection;
import org.jsoup.HttpStatusException;
@@ -68,10 +69,15 @@
private long scanTime, startTime;
private Long totalLinks = new Long(0);
private Long goodLinks = new Long(0);
+ protected String[] IGNOREARRAY;
public LinkChecker(final Node node) {
loadProperties();
-
+ if (!StringUtils.isEmpty(IGNORELINKS))
+ IGNOREARRAY = IGNORELINKS.split(",");
+ else
+ IGNOREARRAY = new String[0];
+
REPORTNODE = node;
try {
@@ -112,6 +118,10 @@
public LinkChecker(final Node node, Boolean continueScan) {
loadProperties();
+ if (!StringUtils.isEmpty(IGNORELINKS))
+ IGNOREARRAY = IGNORELINKS.split(",");
+ else
+ IGNOREARRAY = new String[0];
REPORTNODE = node;
if (continueScan) {
@@ -192,14 +202,22 @@
private HashMap<String, PageLink> appendElements(final HashMap<String, PageLink> pageLinkList, String docTitle, final Elements elem, final String attrKey, final PageLink checkPage) {
Outer:
for (final Element pageElement : elem) {
- String linkTarget = pageElement.attr(attrKey);
- if (linkTarget == null || linkTarget.trim().length() < 1) {
- linkTarget = pageElement.attr("href");
+ String linkTargetWithFragment = pageElement.attr(attrKey);
+ if (linkTargetWithFragment == null || linkTargetWithFragment.trim().length() < 1) {
+ linkTargetWithFragment = pageElement.attr("href");
}
- String[] ignoreArray = IGNORELINKS.split(",");
+ // strip fragment portion from link
+ String linkTarget = StringUtils.substringBefore(linkTargetWithFragment, "#");
+ String linkFragment = StringUtils.substringAfter(linkTargetWithFragment, "#");
+ if (StringUtils.isEmpty(linkTarget)){
+ LOG.debug("Skipping fragment link: "+linkTargetWithFragment);
+ continue Outer; // continue with next link
+ }
+ if (!StringUtils.isEmpty(linkFragment))
+ LOG.debug("Stripped fragment '#"+linkFragment+"' from link: "+linkTargetWithFragment);
- for (String ignoreString : ignoreArray) {
+ for (String ignoreString : IGNOREARRAY) {
if (linkTarget.startsWith(ignoreString)) {
LOG.info("Skipping "+ignoreString+" link: " + pageElement);
continue Outer;
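This patch checks for URI fragments and strips them from the URIs being checked. If the resulting URI is empty (a fragment-only link such as #nav), the link is skipped. I also moved the initialization of the ignore array (now IGNOREARRAY) out of appendElements() and into the class constructors, so IGNORELINKS is split once per LinkChecker instance instead of once per link element (optimization). |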
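For readers without the full source at hand, the stripping step described in this comment can be sketched in isolation. This is only an illustration under assumptions: FragmentStripper, stripFragment and checkableTargets are hypothetical names that do not exist in Deadlink, and plain String.indexOf stands in for the StringUtils.substringBefore call used in the patch so that the sketch needs no extra library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration only; not part of the Deadlink code base. */
final class FragmentStripper {

    /** Returns the part of the link before the first '#', or the whole link if it has none. */
    static String stripFragment(String link) {
        if (link == null) {
            return ""; // assumption: treat a missing attribute like an empty link
        }
        int hash = link.indexOf('#');
        return hash < 0 ? link : link.substring(0, hash);
    }

    /** Keeps only the links that still have a target once the fragment is removed. */
    static List<String> checkableTargets(List<String> links) {
        List<String> targets = new ArrayList<>();
        for (String link : links) {
            String target = stripFragment(link).trim();
            if (target.isEmpty()) {
                continue; // fragment-only link such as "#nav": nothing to request
            }
            targets.add(target);
        }
        return targets;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList("#nav", "/about.html#team", "http://example.com/");
        System.out.println(checkableTargets(links)); // prints [/about.html, http://example.com/]
    }
}
```

The sketch returns the stripped targets instead of logging and continuing a labeled loop, which keeps it testable on its own; the patch above does the equivalent inline in appendElements().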
| Comment by Marvin Kerkhoff [ 26/Mar/14 ] |
|
Thanks for the patch, but I found a much shorter way to do it. This fix also catches links that point to the page itself.
diff --git a/src/main/java/de/marvinkerkhoff/checker/LinkChecker.java b/src/main/java/de/marvinkerkhoff/checker/LinkChecker.java
index d40aedb..68b03e6 100644
--- a/src/main/java/de/marvinkerkhoff/checker/LinkChecker.java
+++ b/src/main/java/de/marvinkerkhoff/checker/LinkChecker.java
@@ -31,6 +31,7 @@ import javax.jcr.query.InvalidQueryException;
import javax.jcr.version.VersionException;
import org.apache.commons.codec.binary.Base64;
+import org.apache.commons.lang.StringUtils;
import org.apache.log4j.Logger;
import org.jsoup.Connection;
import org.jsoup.HttpStatusException;
@@ -200,7 +201,8 @@ public class LinkChecker {
String[] ignoreArray = IGNORELINKS.split(",");
for (String ignoreString : ignoreArray) {
- if (linkTarget.startsWith(ignoreString)) {
+ boolean isOnlyHashTagOrSameURL = pageElement.baseUri().equals(StringUtils.substringBefore(linkTarget, "#"));
+ if (linkTarget.startsWith(ignoreString) || isOnlyHashTagOrSameURL) {
LOG.info("Skipping "+ignoreString+" link: " + pageElement);
continue Outer;
}
|
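The same-page comparison in this second diff can also be tried out on its own. The following is a rough sketch, not the committed code: SamePageLinks and pointsToSamePage are hypothetical names, and java.net.URI.resolve is used here to make the link absolute first. That resolution step is an assumption, since the ticket does not show whether linkTarget already holds an absolute URL when it is compared with baseUri().

```java
import java.net.URI;

/** Hypothetical helper for illustration only; not part of the Deadlink code base. */
final class SamePageLinks {

    /**
     * Returns true when href, resolved against baseUri and with its fragment removed,
     * is the page it was found on. URI.create throws IllegalArgumentException for
     * malformed input, which a real checker would need to handle.
     */
    static boolean pointsToSamePage(String baseUri, String href) {
        String absolute = URI.create(baseUri).resolve(href).toString();
        int hash = absolute.indexOf('#');
        String target = hash < 0 ? absolute : absolute.substring(0, hash);
        return target.equals(baseUri);
    }

    public static void main(String[] args) {
        String page = "http://example.com/page.html";
        System.out.println(pointsToSamePage(page, "#nav"));                             // prints true
        System.out.println(pointsToSamePage(page, "http://example.com/page.html#top")); // prints true
        System.out.println(pointsToSamePage(page, "other.html#top"));                   // prints false
    }
}
```

Links for which the check returns true can be skipped, because the page they point to is the one already being scanned.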