<?xml version="1.0" encoding="utf-8"?>
<!--

h t t :: / / t /
h t t :: // // t //
h ttttt ttttt ppppp sssss // // y y sssss ttttt //
hhhh t t p p s // // y y s t //
h hh t t ppppp sssss // // yyyyy sssss t //
h h t t p s :: / / y .. s t .. /
h h t t p sssss :: / / yyyyy .. sssss t .. /

<https://y.st./>
Copyright © 2015 Alex Yst <mailto:copyright@y.st>
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <https://www.gnu.org./licenses/>.
-->
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<base href="https://y.st./en/weblog/2015/12-December/28.xhtml" />
<title>The spider's first run &lt;https://y.st./en/weblog/2015/12-December/28.xhtml&gt;</title>
<link rel="icon" type="image/png" href="/link/CC_BY-SA_4.0/y.st./icon.png" />
<link rel="stylesheet" type="text/css" href="/link/basic.css" />
<link rel="stylesheet" type="text/css" href="/link/site-specific.css" />
<script type="text/javascript" src="/script/javascript.js" />
<meta name="viewport" content="width=device-width" />
</head>
<body>
<nav>
<p>
<a href="/en/">Home</a> |
<a href="/en/a/about.xhtml">About</a> |
<a href="/en/a/contact.xhtml">Contact</a> |
<a href="/a/canary.txt">Canary</a> |
<a href="/en/URI_research/"><abbr title="Uniform Resource Identifier">URI</abbr> research</a> |
<a href="/en/opinion/">Opinions</a> |
<a href="/en/coursework/">Coursework</a> |
<a href="/en/law/">Law</a> |
<a href="/en/a/links.xhtml">Links</a> |
<a href="/en/weblog/2015/12-December/28.xhtml.asc">{this page}.asc</a>
</p>
<hr/>
<p>
Weblog index:
<a href="/en/weblog/"><abbr title="American Standard Code for Information Interchange">ASCII</abbr> calendars</a> |
<a href="/en/weblog/index_ol_ascending.xhtml">Ascending list</a> |
<a href="/en/weblog/index_ol_descending.xhtml">Descending list</a>
</p>
<hr/>
<p>
Jump to entry:
<a href="/en/weblog/2015/03-March/07.xhtml">&lt;&lt;First</a>
<a rel="prev" href="/en/weblog/2015/12-December/27.xhtml">&lt;Previous</a>
<a rel="next" href="/en/weblog/2015/12-December/29.xhtml">Next&gt;</a>
<a href="/en/weblog/latest.xhtml">Latest&gt;&gt;</a>
</p>
<hr/>
</nav>
<header>
<h1>The spider's first run</h1>
<p>Day 00296: Monday, 2015 December 28</p>
</header>
<p>
For my new search engine to be at all effective, it needs to know how to handle relative <abbr title="Uniform Resource Identifier">URI</abbr>s.
It obviously cannot request pages with them directly, so I built a function that takes a base <abbr title="Uniform Resource Identifier">URI</abbr> and a relative <abbr title="Uniform Resource Identifier">URI</abbr> and merges them to form a new absolute <abbr title="Uniform Resource Identifier">URI</abbr>.
I found that <abbr title="PHP: Hypertext Preprocessor">PHP</abbr>'s <a href="https://php.net/manual/en/function.parse-url.php"><code>parse_url()</code> function</a> was very helpful for breaking <abbr title="Uniform Resource Identifier">URI</abbr>s into their components so that they could be merged.
However, all accounting for <code>.</code> and <code>..</code> directories, as well as all processing of the <abbr title="Uniform Resource Identifier">URI</abbr> components to form the new absolute <abbr title="Uniform Resource Identifier">URI</abbr>, had to be coded by hand.
It took several hours to get it right, but I think that my new function now does what I need it to do.
</p>
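<p>
For anyone curious, the general shape of the approach is something like the sketch below.
This is not the actual function from my spider; the name <code>resolve_uri()</code> is made up for illustration, and some corner cases of <abbr title="Request for Comments">RFC</abbr> 3986 are glossed over.
</p>
<pre>
&lt;?php
// A rough sketch, not the spider's actual code. Fragments are
// discarded, as a spider has no use for them.
function resolve_uri($base, $relative)
{
    $r = parse_url($relative);
    if (isset($r['scheme'])) {
        return $relative; // Already an absolute URI.
    }
    $b = parse_url($base);
    $host = isset($r['host']) ? $r['host'] : $b['host'];
    if (isset($r['host'])) {
        $path = isset($r['path']) ? $r['path'] : '/';
    } elseif (!isset($r['path']) || $r['path'] === '') {
        $path = isset($b['path']) ? $b['path'] : '/';
    } elseif ($r['path'][0] === '/') {
        $path = $r['path'];
    } else {
        // Merge the relative path onto the base path's directory.
        $bpath = isset($b['path']) ? $b['path'] : '/';
        $path = substr($bpath, 0, strrpos($bpath, '/') + 1) . $r['path'];
    }
    // Account for "." and ".." directories by hand.
    $in = explode('/', $path);
    $out = array();
    foreach ($in as $segment) {
        if ($segment === '.') {
            continue;
        } elseif ($segment === '..') {
            if (count($out) > 1) {
                array_pop($out);
            }
        } else {
            $out[] = $segment;
        }
    }
    // A trailing dot segment still names a directory.
    $last = end($in);
    if (($last === '.' || $last === '..') &amp;&amp; end($out) !== '') {
        $out[] = '';
    }
    $path = implode('/', $out);
    if ($path === '') {
        $path = '/';
    }
    $query = isset($r['query']) ? '?' . $r['query'] : '';
    return $b['scheme'] . '://' . $host . $path . $query;
}
</pre>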
<p>
My next task was to find a way to locate hyperlinks in a downloaded page so that the <abbr title="Uniform Resource Identifier">URI</abbr>s can be collected, normalized, and added to the database.
I thought that this task would be more difficult than the task of normalizing relative <abbr title="Uniform Resource Identifier">URI</abbr>s, but I was pleasantly surprised.
<abbr title="PHP: Hypertext Preprocessor">PHP</abbr>'s <a href="https://php.net/manual/en/function.xml-parse-into-struct.php"><code>xml_parse_into_struct()</code> function</a> performs most of the legwork.
This function's output will also make it easy to take into account the page's preferred base <abbr title="Uniform Resource Identifier">URI</abbr>, a feature that I had planned to add much later, but will now be able to build very early on.
I was worried that this would only work on <abbr title="Extensible Hypertext Markup Language">XHTML</abbr> pages, as <abbr title="Hypertext Markup Language">HTML</abbr> is not <abbr title="Extensible Markup Language">XML</abbr>-compliant, but it also seems to work on the few <abbr title="Hypertext Markup Language">HTML</abbr> pages that I tested it on.
Like the <code>curl_*()</code> functions, the <code>xml_*()</code> functions require passing around a resource handle, so I wrapped them up in a class as well.
</p>
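<p>
Stripped of the class wrapper, the heart of the idea looks something like the following sketch.
The variable names are illustrative only, and the structure keys are uppercase because the parser's case folding is left at its default setting.
</p>
<pre>
&lt;?php
// A rough sketch of link extraction, not the spider's actual code.
// $page is assumed to already hold the downloaded document.
$parser = xml_parser_create();
xml_parse_into_struct($parser, $page, $nodes);
xml_parser_free($parser);
$base = null;
$links = array();
foreach ($nodes as $node) {
    if (!isset($node['attributes']['HREF'])) {
        continue;
    }
    if ($node['tag'] === 'BASE') {
        // The page's preferred base URI, if it declares one.
        $base = $node['attributes']['HREF'];
    } elseif ($node['tag'] === 'A') {
        $links[] = $node['attributes']['HREF'];
    }
}
</pre>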
<p>
I quickly found a strange issue with the new class though.
Each object instantiated from it can only be used to parse a single <abbr title="Extensible Markup Language">XML</abbr> document.
I do not think that this is an error in the class, but rather an issue with the underlying <abbr title="PHP: Hypertext Preprocessor">PHP</abbr> functions; in fact, I witnessed the same behavior when I removed my wrapper class.
</p>
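<p>
One possible workaround, assuming the single-use behavior really does live in the underlying parser, would be for the wrapper to build a fresh parser resource for every document rather than holding one for the object's whole lifetime.
The sketch below is not my actual class, just the general idea.
</p>
<pre>
&lt;?php
// Hypothetical sketch: sidestep the one-document-per-parser
// limitation by creating and freeing a parser on each call.
class xml_struct_parser
{
    public function parse($document)
    {
        $parser = xml_parser_create();
        xml_parse_into_struct($parser, $document, $nodes);
        xml_parser_free($parser);
        return $nodes;
    }
}
</pre>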
<p>
I performed my first trial run of the search engine's spider today, and at first, it seemed to be doing very well.
However, it found a large file that someone had linked to and it got stuck there.
I think that it ate all of my machine's memory too, as the machine ended up locking up entirely after a couple of hours of being stuck on this one file.
I considered screening the <code>Content-Type</code> headers of files before downloading them, but the particular file that clogged up the spider claims to be of type <code>text/plain; charset=UTF-8</code>.
It was not a text file though; it was an XZ-compressed file.
Headers cannot always be trusted, as servers can be misconfigured.
However, I think that the real threat is not misconfigured servers, but maliciously-configured servers.
I should not rely on headers for anything as important as keeping the spider unclogged.
There does not seem to be a direct way to limit file download size, but someone on <abbr title="Internet Relay Chat">IRC</abbr> gave me <a href="https://stackoverflow.com/questions/17641073/how-to-set-a-maximum-size-limit-to-php-curl-downloads">a hint</a> as to how to set a download limit in a less direct way.
It seems a little confusing to me, possibly because it has been a long day, so I will try working with this again tomorrow.
</p>
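<p>
If I am reading the hint correctly, the idea is to watch the byte count from a progress callback and abort the transfer once it passes a cap.
Something like the sketch below should do it, though I have not built this into the spider yet; the one-mebibyte limit and the example <abbr title="Uniform Resource Identifier">URI</abbr> are placeholders.
</p>
<pre>
&lt;?php
// A sketch of the indirect size cap, not yet part of my spider.
$limit = 1048576; // Arbitrary example cap: one mebibyte.
$curl = curl_init('http://example.com/some-page');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_BUFFERSIZE, 8192);
curl_setopt($curl, CURLOPT_NOPROGRESS, false);
curl_setopt($curl, CURLOPT_PROGRESSFUNCTION,
    function ($handle, $dltotal, $dlnow, $ultotal, $ulnow) use ($limit) {
        // Returning a non-zero value makes cURL abort the transfer.
        return ($dlnow > $limit) ? 1 : 0;
    });
// $page is false when the transfer was aborted or otherwise failed.
$page = curl_exec($curl);
$too_large = (curl_errno($curl) === CURLE_ABORTED_BY_CALLBACK);
curl_close($curl);
</pre>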
<p>
My <a href="/a/canary.txt">canary</a> still sings the tune of freedom and transparency.
</p>
<hr/>
<p>
Copyright © 2015 Alex Yst;
You may modify and/or redistribute this document under the terms of the <a rel="license" href="/license/gpl-3.0-standalone.xhtml"><abbr title="GNU's Not Unix">GNU</abbr> <abbr title="General Public License version Three or later">GPLv3+</abbr></a>.
If for some reason you would prefer to modify and/or distribute this document under other free copyleft terms, please ask me via email.
My address is in the source comments near the top of this document.
This license also applies to embedded content such as images.
For more information on that, see <a href="/en/a/licensing.xhtml">licensing</a>.
</p>
<p>
<abbr title="World Wide Web Consortium">W3C</abbr> standards are important.
This document conforms to the <a href="https://validator.w3.org./nu/?doc=https%3A%2F%2Fy.st.%2Fen%2Fweblog%2F2015%2F12-December%2F28.xhtml"><abbr title="Extensible Hypertext Markup Language">XHTML</abbr> 5.1</a> specification and uses style sheets that conform to the <a href="http://jigsaw.w3.org./css-validator/validator?uri=https%3A%2F%2Fy.st.%2Fen%2Fweblog%2F2015%2F12-December%2F28.xhtml"><abbr title="Cascading Style Sheets">CSS</abbr>3</a> specification.
</p>
</body>
</html>