Parsing Links

Jun 12, 2020

When testing AutoComic on many different websites, I noticed something that I had never noticed before, even when writing my own website: telling where a link goes is hard. An HTML link can be in one of many formats; before fetching a page found via a link, the script must turn that link into a full URL.

In addition to the formats listed at the webpage above, I discovered that href attributes can begin with a ?, which links to the same page with different query parameters.

Some websites also use .. in relative URLs, which must be resolved.

To take all of these into account, I wrote the following function:

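    import re  # used by the ".." case below

    def _getFullURL(self, path):
        if path == "":
            return path
        elif path[0] == '/':
            # Link relative to the site root
            return self.baseURL + path
        elif path[0] == '?':
            # Same page with different query parameters
            return self.noQueryURL + path
        elif path[:2] == '..':
            # Link that climbs to a parent directory: drop a "segment/../" pair
            return re.sub(r"/[^/]*/\.\./", "/", self.noQueryURL + '/' + path)
        elif path[:4] != "http":
            # Any other relative link
            return self.basePath + '/' + path
        # Already a full URL
        return path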

where baseURL, basePath, and noQueryURL are updated based on the current page.
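The function does not show how those three attributes are computed. Here is a minimal sketch of one way to derive them with urllib.parse. It assumes baseURL is the scheme and host, noQueryURL is the current page without its query string, and basePath is the directory containing the current page. Those meanings are inferred from how the function uses them, and split_page_url is a hypothetical helper, not part of AutoComic.

```python
from urllib.parse import urlsplit


def split_page_url(page_url):
    """Hypothetical helper: derive the three prefixes from the current page URL."""
    parts = urlsplit(page_url)
    # Treat a bare host such as "https://example.com" as the root path.
    path = parts.path or "/"
    # Assumed meaning of baseURL: scheme and host only.
    base_url = f"{parts.scheme}://{parts.netloc}"
    # Assumed meaning of noQueryURL: the page without its ?query or #fragment.
    no_query_url = base_url + path
    # Assumed meaning of basePath: everything before the last "/" of the page URL.
    base_path = no_query_url.rsplit("/", 1)[0]
    return base_url, base_path, no_query_url


# Example page URL, made up for illustration.
base_url, base_path, no_query_url = split_page_url(
    "https://example.com/comics/page1.html?lang=en"
)
print(base_url)
print(base_path)
print(no_query_url)
```

With these definitions, a plain relative link such as page2.html would be appended to basePath, and a link such as ?page=2 would be appended to noQueryURL.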

Most of the cases here were found by running the script on many different websites. I am sure that there are link formats that this cannot handle, but only more testing will reveal them.
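One way to look for unhandled formats without visiting more websites is to compare against urljoin from Python's standard library, which resolves a link against the URL of the page it was found on. The sketch below uses a made-up page URL and a hand-picked list of href values, including some formats not covered above (links starting with // and with #). It is a cross-check under those assumptions, not a description of how AutoComic works.

```python
from urllib.parse import urljoin

# Hypothetical current page, used only for illustration.
page_url = "https://example.com/comics/page1.html?lang=en"

# A mix of link formats: root-relative, query-only, parent directory,
# plain relative, protocol-relative, fragment-only, and already absolute.
hrefs = [
    "/about",
    "?page=2",
    "../archive/1.html",
    "page2.html",
    "//cdn.example.com/img.png",
    "#top",
    "https://other.example/x",
]

# Resolve each href against the page it was found on.
resolved = {href: urljoin(page_url, href) for href in hrefs}

for href, full_url in resolved.items():
    print(f"{href!r:32} -> {full_url}")
```

Running the same href values through the function above and comparing the two results would show which formats still need a case of their own.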