- <?xml version="1.0" encoding="utf-8"?>
- <!--
-
- h t t :: / / t /
- h t t :: // // t //
- h ttttt ttttt ppppp sssss // // y y sssss ttttt //
- hhhh t t p p s // // y y s t //
- h hh t t ppppp sssss // // yyyyy sssss t //
- h h t t p s :: / / y .. s t .. /
- h h t t p sssss :: / / yyyyy .. sssss t .. /
-
- <https://y.st./>
- Copyright © 2016 Alex Yst <mailto:copyright@y.st>
- This program is free software: you can redistribute it and/or modify
- it under the terms of the GNU General Public License as published by
- the Free Software Foundation, either version 3 of the License, or
- (at your option) any later version.
- This program is distributed in the hope that it will be useful,
- but WITHOUT ANY WARRANTY; without even the implied warranty of
- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- GNU General Public License for more details.
- You should have received a copy of the GNU General Public License
- along with this program. If not, see <https://www.gnu.org./licenses/>.
- -->
- <!DOCTYPE html>
- <html xmlns="http://www.w3.org/1999/xhtml">
- <head>
- <base href="https://y.st./en/weblog/2016/01-January/08.xhtml" />
- <title>An error in the handling of relative URIs &lt;https://y.st./en/weblog/2016/01-January/08.xhtml&gt;</title>
- <link rel="icon" type="image/png" href="/link/CC_BY-SA_4.0/y.st./icon.png" />
- <link rel="stylesheet" type="text/css" href="/link/basic.css" />
- <link rel="stylesheet" type="text/css" href="/link/site-specific.css" />
- <script type="text/javascript" src="/script/javascript.js" />
- <meta name="viewport" content="width=device-width" />
- </head>
- <body>
- <nav>
- <p>
- <a href="/en/">Home</a> |
- <a href="/en/a/about.xhtml">About</a> |
- <a href="/en/a/contact.xhtml">Contact</a> |
- <a href="/a/canary.txt">Canary</a> |
- <a href="/en/URI_research/"><abbr title="Uniform Resource Identifier">URI</abbr> research</a> |
- <a href="/en/opinion/">Opinions</a> |
- <a href="/en/coursework/">Coursework</a> |
- <a href="/en/law/">Law</a> |
- <a href="/en/a/links.xhtml">Links</a> |
- <a href="/en/weblog/2016/01-January/08.xhtml.asc">{this page}.asc</a>
- </p>
- <hr/>
- <p>
- Weblog index:
- <a href="/en/weblog/"><abbr title="American Standard Code for Information Interchange">ASCII</abbr> calendars</a> |
- <a href="/en/weblog/index_ol_ascending.xhtml">Ascending list</a> |
- <a href="/en/weblog/index_ol_descending.xhtml">Descending list</a>
- </p>
- <hr/>
- <p>
- Jump to entry:
- <a href="/en/weblog/2015/03-March/07.xhtml">&lt;&lt;First</a>
- <a rel="prev" href="/en/weblog/2016/01-January/07.xhtml">&lt;Previous</a>
- <a rel="next" href="/en/weblog/2016/01-January/09.xhtml">Next&gt;</a>
- <a href="/en/weblog/latest.xhtml">Latest&gt;&gt;</a>
- </p>
- <hr/>
- </nav>
- <header>
- <h1>An error in the handling of relative <abbr title="Uniform Resource Identifier">URI</abbr>s</h1>
- <p>Day 00307: Friday, 2016 January 08</p>
- </header>
- <p>
- I awoke this morning to find that the spider had choked on a bad <abbr title="Uniform Resource Identifier">URI</abbr> that it had pulled from its own database.
- It assumes that all <abbr title="Uniform Resource Identifier">URI</abbr>s in its database are valid, but somehow, it had managed to put an invalid <abbr title="Uniform Resource Identifier">URI</abbr> there.
- Before attempting to diagnose the problem though, I made sure to check my email to see whether the school had written me back.
- They had not.
- With that out of the way, I ran a query against the database to find the offending page so that I could run tests on my <code>merge_uris()</code> function, which had to have returned invalid results for this to happen.
- The link was to <code>https://5jp7xtmox6jyoqd5.onion</code>, so I just searched for the page that linked to it.
- Only one result came up: <a href="http://52wdeibt3ivmcapq.onion/darknet.html"><code>http://52wdeibt3ivmcapq.onion/darknet.html</code></a>.
- This page contains many technically-invalid <abbr title="Uniform Resource Identifier">URI</abbr>s that my spider should have successfully sanitized.
- Or rather, my <code>merge_uris()</code> function should have successfully sanitized them.
- Running another query against the database, I found that this page alone had been allowed to add eight <abbr title="Uniform Resource Identifier">URI</abbr>s with no path component to my database.
- These <abbr title="Uniform Resource Identifier">URI</abbr>s are technically invalid and, by design rather than by bug, choke <code>merge_uris()</code> when passed as that function's "absolute <abbr title="Uniform Resource Identifier">URI</abbr>" parameter.
- But before reaching the database, they should have been given as the "relative <abbr title="Uniform Resource Identifier">URI</abbr>" parameter, causing the function to return the <abbr title="Uniform Resource Identifier">URI</abbr> with a slash at the end, making the <abbr title="Uniform Resource Identifier">URI</abbr>s valid.
- To make sure that it was in fact a bug in the function and not the spider, I tested the function on every invalid <abbr title="Uniform Resource Identifier">URI</abbr> that the page had added to my database.
- As I had expected, the function returned an incorrect result every time.
- </p>
- <p>
- I added a call to <code>\var_dump()</code> right before the return statement, but it was not getting executed.
- Searching for another spot where the function returns, I immediately found the problem.
- If the relative <abbr title="Uniform Resource Identifier">URI</abbr> has a different scheme than the absolute <abbr title="Uniform Resource Identifier">URI</abbr> that it is merged with, the relative <abbr title="Uniform Resource Identifier">URI</abbr> is assumed to be absolute.
- In theory, this should be accurate.
- If the scheme is different, relative <abbr title="Uniform Resource Identifier">URI</abbr>s should not be used at all, so the hyperlink should point to an absolute <abbr title="Uniform Resource Identifier">URI</abbr>.
- In practice though, some webmasters do not respect this, or do not even know it.
- It does not help that many Web browsers stupidly hide the path of a <abbr title="Uniform Resource Identifier">URI</abbr> when the path is only a slash.
- All this does is breed ignorance that the <abbr title="Uniform Resource Identifier">URI</abbr> even has the trailing slash, resulting in more people writing bad hyperlinks, such as those on the page that messed up my database.
- That said, my function should account for bad input as well, at least if the bad input is said to be a relative <abbr title="Uniform Resource Identifier">URI</abbr>.
- This function's whole purpose is to take incomplete <abbr title="Uniform Resource Identifier">URI</abbr>s and make them whole in an automated way.
- I think I managed to fix the function.
- I then used a short script to repair the damage, so I would not have to throw out the whole database.
- </p>
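- <p>
- The kind of fix described above can be sketched roughly as follows.
- This is an illustrative Python sketch, not the real PHP <code>merge_uris()</code>: the parameter names and the reliance on Python's <code>urllib.parse</code> module are assumptions made for the example.
- It resolves the relative <abbr title="Uniform Resource Identifier">URI</abbr> first, then adds the missing slash whenever the result has an authority but an empty path, even if the "relative" <abbr title="Uniform Resource Identifier">URI</abbr> carried its own scheme.
- </p>

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def merge_uris(absolute: str, relative: str) -> str:
    """Resolve `relative` against `absolute` and make sure a path is present.

    Hypothetical Python stand-in for the PHP function discussed in this entry.
    """
    # urljoin() hands back `relative` unchanged when it has its own scheme,
    # which is exactly the early-return case that used to skip the clean-up.
    merged = urljoin(absolute, relative)
    parts = urlsplit(merged)
    # Add the trailing slash to "scheme://host" style results that lack a path.
    if parts.netloc and parts.path == "":
        parts = parts._replace(path="/")
    return urlunsplit(parts)


if __name__ == "__main__":
    base = "http://52wdeibt3ivmcapq.onion/darknet.html"
    print(merge_uris(base, "https://5jp7xtmox6jyoqd5.onion"))
    # prints "https://5jp7xtmox6jyoqd5.onion/"
```

- <p>
- Doing the clean-up after the merge, rather than only on the relative-path branch, means there is no early return that can skip it.
- </p>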
- <p>
- In the process of scanning the database for errors, I found several erroneous <abbr title="Internet Relay Chat">IRC</abbr> <abbr title="Uniform Resource Identifier">URI</abbr>s, which got me thinking.
- At some point, I should scan the database and see how many onion-based <abbr title="Internet Relay Chat">IRC</abbr> networks I can find.
- </p>
- <p>
- Upon the next run of the spider, I found that it was requesting <abbr title="File Transfer Protocol">FTP</abbr> and Gopher pages.
- I thought that I had implemented a protocol whitelist and only allowed <abbr title="Hypertext Transfer Protocol Secure">HTTPS</abbr>- and <abbr title="Hypertext Transfer Protocol">HTTP</abbr>-based <abbr title="Uniform Resource Identifier">URI</abbr>s, but it seems that I neglected to make the spider actually check the whitelist.
- That means that the spider will be requesting pages that it does not know how to handle yet.
- However, because of the new logic flow that allows use of the MySQL database, if I fix the protocol whitelist feature, the spider will loop endlessly.
- </p>
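- <p>
- A whitelist only helps if it is consulted before every request.
- As a minimal sketch of such a check, written in Python purely for illustration (the spider's real code is not shown in this entry, and the names <code>ALLOWED_SCHEMES</code> and <code>is_crawlable()</code> are hypothetical):
- </p>

```python
from urllib.parse import urlsplit

# Hypothetical whitelist: only the two schemes the spider knows how to handle.
ALLOWED_SCHEMES = frozenset({"http", "https"})


def is_crawlable(uri: str) -> bool:
    """Return True only when the URI's scheme is on the whitelist."""
    # urlsplit() lower-cases the scheme, so "HTTP://..." is accepted too.
    return urlsplit(uri).scheme in ALLOWED_SCHEMES


if __name__ == "__main__":
    for uri in ("https://example.onion/", "ftp://example.onion/", "gopher://example.onion/"):
        print(uri, is_crawlable(uri))
```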
- <p>
- Having needed to update my base library, I put aside work on the spider, despite its current issues, to work on a wrapper class, as I had agreed to include at least one new wrapper class in every update.
- I have also decided to restructure the library version numbers a bit.
- Currently, the version numbers are increasing quickly, making it look like I am making more progress on it than I am.
- I will also be holding the version numbers back a bit until progress catches up with the version number.
- After building the <a href="https://secure.php.net/manual/en/ref.fdf.php">FDF</a> wrapper class, I started work on an <a href="https://secure.php.net/manual/en/ref.ftp.php"><abbr title="File Transfer Protocol">FTP</abbr></a> wrapper class.
- I found though that some <abbr title="File Transfer Protocol">FTP</abbr> functions rely on file resources, so I would need to complete a <a href="https://secure.php.net/manual/en/function.fopen.php">file</a> wrapper class first.
- However, for this wrapper class, I would need a wrapper class for stream resources.
- Looking into stream resources, I found that there is a documented prototype for implementing <a href="https://secure.php.net/manual/en/class.streamwrapper.php">stream objects</a>.
- My understanding of this class prototype though is that it does not meet my goals.
- Rather than wrapping up a stream resource with its related functions, it replaces stream resources altogether.
- I might try creating a class that both implements this prototype and wraps up the functions I want wrapped up.
- This class prototype though, and probably stream resources in general, requires stream context support, so I moved on and built a class wrapping stream context resources.
- </p>
- <p>
- For my own reference, current work that needs to be done on the spider includes:
- </p>
- <ul>
- <li>find a way to avoid trying to crawl uncrawlable <abbr title="Uniform Resource Identifier">URI</abbr>s without creating an endless loop</li>
- <li>replace mis-implemented protocol whitelist with a <code>switch()</code> statement to allow handling different protocols in different ways</li>
- <li>restructure program flow to cause <code>&lt;a/&gt;</code>s to be scanned for and saved before the <code>&lt;title/&gt;</code> is recorded, so that spider interruptions are no longer detrimental</li>
- <li>fix handling of <code>&lt;a/&gt;</code>s that have child nodes</li>
- </ul>
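- <p>
- The first two items on that list could fit together roughly like the sketch below.
- It is written in Python for illustration only and rests on an assumption: that the endless loop would come from skipped <abbr title="Uniform Resource Identifier">URI</abbr>s staying in the pending set and being picked again.
- The handler, the dispatch table and the in-memory queue are all hypothetical stand-ins for the real spider and its database.
- </p>

```python
from collections import deque
from urllib.parse import urlsplit


def handle_web_page(uri: str) -> str:
    """Hypothetical stand-in for fetching and parsing an HTTP(S) page."""
    return "crawled"


# Hypothetical dispatch table playing the role of a switch on the scheme:
# one handler per supported protocol.
HANDLERS = {
    "http": handle_web_page,
    "https": handle_web_page,
}


def crawl(pending: list) -> dict:
    """Process every queued URI exactly once and record what happened to it."""
    queue = deque(pending)
    outcome = {}
    while queue:
        uri = queue.popleft()
        if uri in outcome:
            continue  # already processed, so never handle it again
        handler = HANDLERS.get(urlsplit(uri).scheme)
        # Unsupported schemes are recorded as "skipped" rather than left
        # pending, so the loop cannot keep selecting the same URI.
        outcome[uri] = handler(uri) if handler else "skipped"
    return outcome


if __name__ == "__main__":
    print(crawl(["http://a.onion/", "ftp://b.onion/", "gopher://c.onion/"]))
```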
- <p>
- My <a href="/a/canary.txt">canary</a> still sings the tune of freedom and transparency.
- </p>
- <hr/>
- <p>
- Copyright © 2016 Alex Yst;
- You may modify and/or redistribute this document under the terms of the <a rel="license" href="/license/gpl-3.0-standalone.xhtml"><abbr title="GNU's Not Unix">GNU</abbr> <abbr title="General Public License version Three or later">GPLv3+</abbr></a>.
- If for some reason you would prefer to modify and/or distribute this document under other free copyleft terms, please ask me via email.
- My address is in the source comments near the top of this document.
- This license also applies to embedded content such as images.
- For more information on that, see <a href="/en/a/licensing.xhtml">licensing</a>.
- </p>
- <p>
- <abbr title="World Wide Web Consortium">W3C</abbr> standards are important.
- This document conforms to the <a href="https://validator.w3.org./nu/?doc=https%3A%2F%2Fy.st.%2Fen%2Fweblog%2F2016%2F01-January%2F08.xhtml"><abbr title="Extensible Hypertext Markup Language">XHTML</abbr> 5.1</a> specification and uses style sheets that conform to the <a href="http://jigsaw.w3.org./css-validator/validator?uri=https%3A%2F%2Fy.st.%2Fen%2Fweblog%2F2016%2F01-January%2F08.xhtml"><abbr title="Cascading Style Sheets">CSS</abbr>3</a> specification.
- </p>
- </body>
- </html>