csh
/
y.st.
geforked van y.st./y.st.


			
				
					
						
						
							123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136
							<?xml version="1.0" encoding="utf-8"?>
<!--
                                                                                     
 h       t     t                ::       /     /                     t             / 
 h       t     t                ::      //    //                     t            // 
 h     ttttt ttttt ppppp sssss         //    //  y   y       sssss ttttt         //  
 hhhh    t     t   p   p s            //    //   y   y       s       t          //   
 h  hh   t     t   ppppp sssss       //    //    yyyyy       sssss   t         //    
 h   h   t     t   p         s  ::   /     /         y  ..       s   t    ..   /     
 h   h   t     t   p     sssss  ::   /     /     yyyyy  ..   sssss   t    ..   /     
                                                                                     
	<https://y.st./>
	Copyright © 2016 Alex Yst <mailto:copyright@y.st>

	This program is free software: you can redistribute it and/or modify
	it under the terms of the GNU General Public License as published by
	the Free Software Foundation, either version 3 of the License, or
	(at your option) any later version.

	This program is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	GNU General Public License for more details.

	You should have received a copy of the GNU General Public License
	along with this program. If not, see <https://www.gnu.org./licenses/>.
-->
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
	<head>
		<base href="https://y.st./en/weblog/2016/01-January/06.xhtml" />
		<title>The spider is almost usable again &lt;https://y.st./en/weblog/2016/01-January/06.xhtml&gt;</title>
		<link rel="icon" type="image/png" href="/link/CC_BY-SA_4.0/y.st./icon.png" />
		<link rel="stylesheet" type="text/css" href="/link/basic.css" />
		<link rel="stylesheet" type="text/css" href="/link/site-specific.css" />
		<script type="text/javascript" src="/script/javascript.js" />
		<meta name="viewport" content="width=device-width" />
	</head>
	<body>
		<nav>
			<p>
				<a href="/en/">Home</a> |
				<a href="/en/a/about.xhtml">About</a> |
				<a href="/en/a/contact.xhtml">Contact</a> |
				<a href="/a/canary.txt">Canary</a> |
				<a href="/en/URI_research/"><abbr title="Uniform Resource Identifier">URI</abbr> research</a> |
				<a href="/en/opinion/">Opinions</a> |
				<a href="/en/coursework/">Coursework</a> |
				<a href="/en/law/">Law</a> |
				<a href="/en/a/links.xhtml">Links</a> |
				<a href="/en/weblog/2016/01-January/06.xhtml.asc">{this page}.asc</a>
			</p>
			<hr/>
			<p>
				Weblog index:
				<a href="/en/weblog/"><abbr title="American Standard Code for Information Interchange">ASCII</abbr> calendars</a> |
				<a href="/en/weblog/index_ol_ascending.xhtml">Ascending list</a> |
				<a href="/en/weblog/index_ol_descending.xhtml">Descending list</a>
			</p>
			<hr/>
			<p>
				Jump to entry:
				<a href="/en/weblog/2015/03-March/07.xhtml">&lt;&lt;First</a>
				<a rel="prev" href="/en/weblog/2016/01-January/05.xhtml">&lt;Previous</a>
				<a rel="next" href="/en/weblog/2016/01-January/07.xhtml">Next&gt;</a>
				<a href="/en/weblog/latest.xhtml">Latest&gt;&gt;</a>
			</p>
			<hr/>
		</nav>
		<header>
			<h1>The spider is almost usable again</h1>
			<p>Day 00305: Wednesday, 2016 January 06</p>
		</header>
<p>
	The search engine spider was quickly filling up my <abbr title="random-access memory">RAM</abbr> and my machine was getting a bit hot, or so I thought.
	I felt that the best option was to terminate the spider for now.
	However, after terminating the spider, I found that my <abbr title="random-access memory">RAM</abbr> was still mostly in use.
	I terminated the spider and lost days worth of information for nothing! My <abbr title="random-access memory">RAM</abbr> was probably just in use because my system was being efficient.
	The thing that I had not taken into account is that a reasonable system will try to use most of the <abbr title="random-access memory">RAM</abbr>.
	<abbr title="random-access memory">RAM</abbr> that is not being used is <abbr title="random-access memory">RAM</abbr> that is not serving a function.
	By filling the <abbr title="random-access memory">RAM</abbr> with information from current processes, the system is able to load those processes quicker.
	If more <abbr title="random-access memory">RAM</abbr> is needed later, a reasonable system will clear the less-needed junk out of <abbr title="random-access memory">RAM</abbr> to make room for things that are needed.
	I have no reason to believe that Debian is not such a reasonable system, just trying to make the most of the hardware that is available.
	I will run the spider again once I have implemented some new features and am not running as much of it in <abbr title="random-access memory">RAM</abbr>.
	For now, it is in a broken state, so I cannot restart it yet.
</p>
<p>
	I have some thinking about how to store <code>/robots.txt</code> information and have come to the conclusion that I should not store it between crawls at all.
	At the point the spider is recrawling a page, it is considering the possibility that changes have been made.
	Changes too could have been made to the <code>/robots.txt</code> file, so it should be re-retrieved.
	However, if multiple pages are retrieved from the same website during a single crawl, the <code>/robots.txt</code> file should only be downloaded once.
	The indexed data from the last crawl should be alphabetized, but newly-found pages will be out of order.
	I do not know how well storing all the parsed <code>/robots.txt</code> files in <abbr title="random-access memory">RAM</abbr> would be though, so I think that I will only store one <code>/robots.txt</code> file at a time.
	This will be very inefficient during the first crawl, but later crawls should not involve too many extra <code>/robots.txt</code> file downloads.
	With this new perspective in mind, I wrote a class that handles standard <code>User-agent:</code> and <code>Disallow:</code> directives in addition to <code>Allow:</code> and <code>Sitemap:</code> directives.
	At some point, I would like to add sitemap support to the spider, though I have not added that today.
	My plans were to implement the blacklisting feature far later in the development process, but I decided to build it today, along with the function needed to implement it.
	Basically, this new function normalizes onion addresses before hashing them.
	This allows the spider to interpret a blacklist on a particular onion address as applying to all subdomains of that onion as well.
	If something is bad enough that it needs to be blacklisted, we can assume that the whole onion should be removed from the search results.
	For convenience, the function takes a whole <abbr title="Uniform Resource Identifier">URI</abbr> as input, though it only uses the domain of the <abbr title="Uniform Resource Identifier">URI</abbr> to determine the hash result.
	I found though that the progress report feature I added yesterday is no longer usable.
	It relies on the in-<abbr title="random-access memory">RAM</abbr> arrays that I just removed from the code.
	I do not know if I will be able to build a similar feature when using a MySQL database.
	I got most of the new logic set up, but I did run into an issue near the end and had to give up for the night.
	I need a way to SELECT <abbr title="Uniform Resource Identifier">URI</abbr>s from the `anchors` table, but only SELECT <abbr title="Uniform Resource Identifier">URI</abbr>s that are not present in the `titles` table.
</p>
<p>
	I tried to contact my old school again about getting a copy of my transcript.
	They still have not responded to me.
</p>
<p>
	We found something a bit annoying about the air here in Coos Bay.
	Because it is wetter than the air in Springfield, our perishable food is becoming inedible faster.
	We will have to keep this in mind for the future.
</p>
<p>
	My <a href="/a/canary.txt">canary</a> still sings the tune of freedom and transparency.
</p>
		<hr/>
		<p>
			Copyright © 2016 Alex Yst;
			You may modify and/or redistribute this document under the terms of the <a rel="license" href="/license/gpl-3.0-standalone.xhtml"><abbr title="GNU&apos;s Not Unix">GNU</abbr> <abbr title="General Public License version Three or later">GPLv3+</abbr></a>.
			If for some reason you would prefer to modify and/or distribute this document under other free copyleft terms, please ask me via email.
			My address is in the source comments near the top of this document.
			This license also applies to embedded content such as images.
			For more information on that, see <a href="/en/a/licensing.xhtml">licensing</a>.
		</p>
		<p>
			<abbr title="World Wide Web Consortium">W3C</abbr> standards are important.
			This document conforms to the <a href="https://validator.w3.org./nu/?doc=https%3A%2F%2Fy.st.%2Fen%2Fweblog%2F2016%2F01-January%2F06.xhtml"><abbr title="Extensible Hypertext Markup Language">XHTML</abbr> 5.1</a> specification and uses style sheets that conform to the <a href="http://jigsaw.w3.org./css-validator/validator?uri=https%3A%2F%2Fy.st.%2Fen%2Fweblog%2F2016%2F01-January%2F06.xhtml"><abbr title="Cascading Style Sheets">CSS</abbr>3</a> specification.
		</p>
	</body>
</html>