csh
/
y.st.
Fork de y.st./y.st.


			
				
					
						
						
							123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137
							<?xml version="1.0" encoding="utf-8"?>
<!--
                                                                                     
 h       t     t                ::       /     /                     t             / 
 h       t     t                ::      //    //                     t            // 
 h     ttttt ttttt ppppp sssss         //    //  y   y       sssss ttttt         //  
 hhhh    t     t   p   p s            //    //   y   y       s       t          //   
 h  hh   t     t   ppppp sssss       //    //    yyyyy       sssss   t         //    
 h   h   t     t   p         s  ::   /     /         y  ..       s   t    ..   /     
 h   h   t     t   p     sssss  ::   /     /     yyyyy  ..   sssss   t    ..   /     
                                                                                     
	<https://y.st./>
	Copyright © 2016 Alex Yst <mailto:copyright@y.st>

	This program is free software: you can redistribute it and/or modify
	it under the terms of the GNU General Public License as published by
	the Free Software Foundation, either version 3 of the License, or
	(at your option) any later version.

	This program is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	GNU General Public License for more details.

	You should have received a copy of the GNU General Public License
	along with this program. If not, see <https://www.gnu.org./licenses/>.
-->
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
	<head>
		<base href="https://y.st./en/weblog/2016/02-February/06.xhtml" />
		<title>The spider does not handle broken markup well &lt;https://y.st./en/weblog/2016/02-February/06.xhtml&gt;</title>
		<link rel="icon" type="image/png" href="/link/CC_BY-SA_4.0/y.st./icon.png" />
		<link rel="stylesheet" type="text/css" href="/link/basic.css" />
		<link rel="stylesheet" type="text/css" href="/link/site-specific.css" />
		<script type="text/javascript" src="/script/javascript.js" />
		<meta name="viewport" content="width=device-width" />
	</head>
	<body>
		<nav>
			<p>
				<a href="/en/">Home</a> |
				<a href="/en/a/about.xhtml">About</a> |
				<a href="/en/a/contact.xhtml">Contact</a> |
				<a href="/a/canary.txt">Canary</a> |
				<a href="/en/URI_research/"><abbr title="Uniform Resource Identifier">URI</abbr> research</a> |
				<a href="/en/opinion/">Opinions</a> |
				<a href="/en/coursework/">Coursework</a> |
				<a href="/en/law/">Law</a> |
				<a href="/en/a/links.xhtml">Links</a> |
				<a href="/en/weblog/2016/02-February/06.xhtml.asc">{this page}.asc</a>
			</p>
			<hr/>
			<p>
				Weblog index:
				<a href="/en/weblog/"><abbr title="American Standard Code for Information Interchange">ASCII</abbr> calendars</a> |
				<a href="/en/weblog/index_ol_ascending.xhtml">Ascending list</a> |
				<a href="/en/weblog/index_ol_descending.xhtml">Descending list</a>
			</p>
			<hr/>
			<p>
				Jump to entry:
				<a href="/en/weblog/2015/03-March/07.xhtml">&lt;&lt;First</a>
				<a rel="prev" href="/en/weblog/2016/02-February/05.xhtml">&lt;Previous</a>
				<a rel="next" href="/en/weblog/2016/02-February/07.xhtml">Next&gt;</a>
				<a href="/en/weblog/latest.xhtml">Latest&gt;&gt;</a>
			</p>
			<hr/>
		</nav>
		<header>
			<h1>The spider does not handle broken markup well</h1>
			<p>Day 00336: Saturday, 2016 February 06</p>
		</header>
<p>
	I feel like I am stuck waiting for the time being.
	Until my appointment on Wednesday, in which I will learn what courses I would need to take to pursue finishing my associate degree at the local community college, I will not know whether I will be going back to school or attempting to find a full-time job for the time being.
	I cannot really make any plans until then.
</p>
<p>
	While the spider continues to run without error, I do not have any fatal bugs to work on, so it is time to work on feature support.
	The main two features on the waiting list are support for retrieving the full text of an <code>&lt;a/&gt;</code> tag even when it has child nodes and support for flat file databases.
	I think that it would be a lot of fun to try abstracting away the direct database calls, which would make supporting flat files and multiple database software options as easy as writing a new class, one that implements a particular interface, for each storage method desired.
	However, the <code>&lt;a/&gt;</code> tag issue is of more urgency, even if of less importance, so I set to work on that instead.
</p>
<p>
	Setting up that functionality turned out to be quite trivial.
	I set up a mock function that, if built, should provide exactly what I need to easily extract link text regardless of nesting.
	However, before I could build it, I ran some tests on some broken <abbr title="Extensible Markup Language">XML</abbr> to see how <abbr title="PHP: Hypertext Preprocessor">PHP</abbr>&apos;s <abbr title="Extensible Markup Language">XML</abbr> parser deals with broken <abbr title="Extensible Markup Language">XML</abbr>.
	As it turns out, it does not.
	As soon as an error is encountered, parsing stops.
	This is the proper behavior, but I had thought that I had tested the <abbr title="Extensible Markup Language">XML</abbr> parser on <abbr title="Hypertext Markup Language">HTML</abbr> (which is basically malformed <abbr title="Extensible Markup Language">XML</abbr> before.
	I had looked for some sort of error though, so when reasonable-looking data was returned, I must have assumed that it had not given up on the document.
	Partial data like this is probably worse than no data, as with partial data, I think that I have everything on the page parsed and do not realize that there is a bug in need of fixing.
</p>
<p>
	To be fair though, use of <abbr title="Hypertext Markup Language">HTML</abbr> and other broken markup is the real bug.
	But though the bug is not on my end, I need to be able to deal with it on my end.
</p>
<p>
	I added output to estimate the time remaining in the processing of a website, as it can be a bit bothersome waiting for the spider to be finished with a large website so that I can halt and restart it to see an update take effect.
	Next, I worked on improving my <abbr title="Uniform Resource Identifier">URI</abbr> handlers.
	In reality, a <abbr title="Uniform Resource Identifier">URI</abbr> is a specialized kind of data, so it would be better implemented as a class than as several functions like it was yesterday when I finished it.
	Additionally, as a class, it will only need to check the <abbr title="Uniform Resource Identifier">URI</abbr> components for validity during parsing, instead of both during parsing and during reconstruction.
	I think that I have the class working correctly, but I did not have time to test it tonight, so I will wait until tomorrow to upload it and integrate it with the spider.
</p>
<p>
	While combining the <abbr title="Uniform Resource Identifier">URI</abbr>-related functions into a class, I came across a corner case that I had not accounted for.
	<abbr title="Uniform Resource Identifier">URI</abbr>s such as <code>example:/.//</code> seem to be valid, though unnormalized.
	Using the standard normalization method though, the path becomes <code>///</code>.
	With no host component present, the path cannot begin with two slashes.
	However, normalizing the path is not allowed to alter the host component, so we cannot normalize the <abbr title="Uniform Resource Identifier">URI</abbr> to <code>example:///</code> and say that the new path is <code>/</code>.
	I have no idea how to properly normalize this <abbr title="Uniform Resource Identifier">URI</abbr>.
	For the time being, I set up a loop that is entered when this type of situation is encountered.
	As long as the first two characters of the path are both slashes, one will be removed and the loop will continue.
	I am not sure that this is the correct way to handle this situation, but it is a better idea than any of the alternatives I came up with, such as encoding one of the slashes as <code>%25</code> or prefixing the path with some character do that the path did not start with a slash at all.
</p>
<p>
	I spent most of the day running errands in Springfield with my mother.
	It seems that my presence was not really needed though and I was not able to help.
</p>
		<hr/>
		<p>
			Copyright © 2016 Alex Yst;
			You may modify and/or redistribute this document under the terms of the <a rel="license" href="/license/gpl-3.0-standalone.xhtml"><abbr title="GNU&apos;s Not Unix">GNU</abbr> <abbr title="General Public License version Three or later">GPLv3+</abbr></a>.
			If for some reason you would prefer to modify and/or distribute this document under other free copyleft terms, please ask me via email.
			My address is in the source comments near the top of this document.
			This license also applies to embedded content such as images.
			For more information on that, see <a href="/en/a/licensing.xhtml">licensing</a>.
		</p>
		<p>
			<abbr title="World Wide Web Consortium">W3C</abbr> standards are important.
			This document conforms to the <a href="https://validator.w3.org./nu/?doc=https%3A%2F%2Fy.st.%2Fen%2Fweblog%2F2016%2F02-February%2F06.xhtml"><abbr title="Extensible Hypertext Markup Language">XHTML</abbr> 5.1</a> specification and uses style sheets that conform to the <a href="http://jigsaw.w3.org./css-validator/validator?uri=https%3A%2F%2Fy.st.%2Fen%2Fweblog%2F2016%2F02-February%2F06.xhtml"><abbr title="Cascading Style Sheets">CSS</abbr>3</a> specification.
		</p>
	</body>
</html>