Parsing Links
Jun 12, 2020
When testing AutoComic on many different websites, I noticed something that I had never noticed before, even when writing my own website: telling where a link goes is hard. An HTML link can be in one of many formats; before fetching a page found via a link, the script must turn that link into a full URL.
In addition to the formats listed at the webpage above, I discovered that href attributes can begin with a ?, which links to the same page with different query parameters.
Some websites also use .. in relative URLs, which must be resolved against the current page.
To take all of these into account, I wrote the following function:
    def _getFullURL(self, path):
        if path == "":
            return path
        elif path[0] == '/':
            return self.baseURL + path
        elif path[0] == '?':
            return self.noQueryURL + path
        elif path[:2] == '..':
            return re.sub(r"/[^/]*/\.\./", "/", self.noQueryURL + '/' + path)
        elif path[:4] != "http":
            return self.basePath + '/' + path
        return path
where baseURL, basePath, and noQueryURL are updated based on the current page.
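The post does not show how those three values are computed, so the following is only a sketch of one plausible way to derive them from the current page's URL. The helper name derive_page_urls and the exact meaning given to each value (no trailing slashes, query string and fragment dropped) are my assumptions, not AutoComic's actual code.

```python
from urllib.parse import urlsplit


def derive_page_urls(page_url):
    """Hypothetical helper: split the current page URL into three prefixes."""
    parts = urlsplit(page_url)
    # Treat an empty path (e.g. "http://example.com") as the site root.
    path = parts.path or "/"
    # Assumed: scheme and host only, with no trailing slash.
    base_url = f"{parts.scheme}://{parts.netloc}"
    # Assumed: the page URL with its query string and fragment dropped.
    no_query_url = base_url + path
    # Assumed: everything up to, but not including, the last slash.
    base_path = no_query_url.rsplit("/", 1)[0]
    return base_url, base_path, no_query_url
```

Under these assumptions, a link of /about would be appended to the first value, ?page=6 to the third, and page6.html to the second followed by a slash.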
Most of the cases here were found by running the script on many different websites. I am sure that there are link formats that this cannot handle, but only more testing will reveal them.
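One way to look for unhandled formats without visiting more sites might be to compare results against urljoin from Python's standard urllib.parse module, which resolves a link against the page it appears on. This is a different technique from the hand-written function above, shown only as a reference point; the resolve wrapper and the example.com URLs are made up for illustration.

```python
from urllib.parse import urljoin


def resolve(page_url, href):
    """Resolve an href against the page it was found on (standard library)."""
    return urljoin(page_url, href)


# Illustrative page URL and link forms; none of these come from a real site.
page = "http://example.com/comics/page5.html?x=1"
hrefs = [
    "/about",                   # root-relative
    "?page=6",                  # same page, new query string
    "page6.html",               # relative to the current directory
    "../archive",               # parent directory
    "https://other.org/x",      # already a full URL
    "//cdn.example.com/a.png",  # scheme-relative
    "#top",                     # fragment only
]

for href in hrefs:
    print(f"{href!r:28} -> {resolve(page, href)}")
```

The last two forms, a scheme-relative link starting with // and a fragment-only link starting with #, do not seem to have their own case in the function above, so they are probably worth adding to any test list.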