123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136 |
- <?xml version="1.0" encoding="utf-8"?>
- <!--
-
- h t t :: / / t /
- h t t :: // // t //
- h ttttt ttttt ppppp sssss // // y y sssss ttttt //
- hhhh t t p p s // // y y s t //
- h hh t t ppppp sssss // // yyyyy sssss t //
- h h t t p s :: / / y .. s t .. /
- h h t t p sssss :: / / yyyyy .. sssss t .. /
-
- <https://y.st./>
- Copyright © 2016 Alex Yst <mailto:copyright@y.st>
- This program is free software: you can redistribute it and/or modify
- it under the terms of the GNU General Public License as published by
- the Free Software Foundation, either version 3 of the License, or
- (at your option) any later version.
- This program is distributed in the hope that it will be useful,
- but WITHOUT ANY WARRANTY; without even the implied warranty of
- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- GNU General Public License for more details.
- You should have received a copy of the GNU General Public License
- along with this program. If not, see <https://www.gnu.org./licenses/>.
- -->
- <!DOCTYPE html>
- <html xmlns="http://www.w3.org/1999/xhtml">
- <head>
- <base href="https://y.st./en/weblog/2016/01-January/06.xhtml" />
- <title>The spider is almost usable again <https://y.st./en/weblog/2016/01-January/06.xhtml></title>
- <link rel="icon" type="image/png" href="/link/CC_BY-SA_4.0/y.st./icon.png" />
- <link rel="stylesheet" type="text/css" href="/link/basic.css" />
- <link rel="stylesheet" type="text/css" href="/link/site-specific.css" />
- <script type="text/javascript" src="/script/javascript.js" />
- <meta name="viewport" content="width=device-width" />
- </head>
- <body>
- <nav>
- <p>
- <a href="/en/">Home</a> |
- <a href="/en/a/about.xhtml">About</a> |
- <a href="/en/a/contact.xhtml">Contact</a> |
- <a href="/a/canary.txt">Canary</a> |
- <a href="/en/URI_research/"><abbr title="Uniform Resource Identifier">URI</abbr> research</a> |
- <a href="/en/opinion/">Opinions</a> |
- <a href="/en/coursework/">Coursework</a> |
- <a href="/en/law/">Law</a> |
- <a href="/en/a/links.xhtml">Links</a> |
- <a href="/en/weblog/2016/01-January/06.xhtml.asc">{this page}.asc</a>
- </p>
- <hr/>
- <p>
- Weblog index:
- <a href="/en/weblog/"><abbr title="American Standard Code for Information Interchange">ASCII</abbr> calendars</a> |
- <a href="/en/weblog/index_ol_ascending.xhtml">Ascending list</a> |
- <a href="/en/weblog/index_ol_descending.xhtml">Descending list</a>
- </p>
- <hr/>
- <p>
- Jump to entry:
- <a href="/en/weblog/2015/03-March/07.xhtml"><<First</a>
- <a rel="prev" href="/en/weblog/2016/01-January/05.xhtml"><Previous</a>
- <a rel="next" href="/en/weblog/2016/01-January/07.xhtml">Next></a>
- <a href="/en/weblog/latest.xhtml">Latest>></a>
- </p>
- <hr/>
- </nav>
- <header>
- <h1>The spider is almost usable again</h1>
- <p>Day 00305: Wednesday, 2016 January 06</p>
- </header>
- <p>
- The search engine spider was quickly filling up my <abbr title="random-access memory">RAM</abbr> and my machine was getting a bit hot, or so I thought.
- I felt that the best option was to terminate the spider for now.
- However, after terminating the spider, I found that my <abbr title="random-access memory">RAM</abbr> was still mostly in use.
- I terminated the spider and lost days worth of information for nothing! My <abbr title="random-access memory">RAM</abbr> was probably just in use because my system was being efficient.
- The thing that I had not taken into account is that a reasonable system will try to use most of the <abbr title="random-access memory">RAM</abbr>.
- <abbr title="random-access memory">RAM</abbr> that is not being used is <abbr title="random-access memory">RAM</abbr> that is not serving a function.
- By filling the <abbr title="random-access memory">RAM</abbr> with information from current processes, the system is able to load those processes quicker.
- If more <abbr title="random-access memory">RAM</abbr> is needed later, a reasonable system will clear the less-needed junk out of <abbr title="random-access memory">RAM</abbr> to make room for things that are needed.
- I have no reason to believe that Debian is not such a reasonable system, just trying to make the most of the hardware that is available.
- I will run the spider again once I have implemented some new features and am not running as much of it in <abbr title="random-access memory">RAM</abbr>.
- For now, it is in a broken state, so I cannot restart it yet.
- </p>
- <p>
- I have some thinking about how to store <code>/robots.txt</code> information and have come to the conclusion that I should not store it between crawls at all.
- At the point the spider is recrawling a page, it is considering the possibility that changes have been made.
- Changes too could have been made to the <code>/robots.txt</code> file, so it should be re-retrieved.
- However, if multiple pages are retrieved from the same website during a single crawl, the <code>/robots.txt</code> file should only be downloaded once.
- The indexed data from the last crawl should be alphabetized, but newly-found pages will be out of order.
- I do not know how well storing all the parsed <code>/robots.txt</code> files in <abbr title="random-access memory">RAM</abbr> would be though, so I think that I will only store one <code>/robots.txt</code> file at a time.
- This will be very inefficient during the first crawl, but later crawls should not involve too many extra <code>/robots.txt</code> file downloads.
- With this new perspective in mind, I wrote a class that handles standard <code>User-agent:</code> and <code>Disallow:</code> directives in addition to <code>Allow:</code> and <code>Sitemap:</code> directives.
- At some point, I would like to add sitemap support to the spider, though I have not added that today.
- My plans were to implement the blacklisting feature far later in the development process, but I decided to build it today, along with the function needed to implement it.
- Basically, this new function normalizes onion addresses before hashing them.
- This allows the spider to interpret a blacklist on a particular onion address as applying to all subdomains of that onion as well.
- If something is bad enough that it needs to be blacklisted, we can assume that the whole onion should be removed from the search results.
- For convenience, the function takes a whole <abbr title="Uniform Resource Identifier">URI</abbr> as input, though it only uses the domain of the <abbr title="Uniform Resource Identifier">URI</abbr> to determine the hash result.
- I found though that the progress report feature I added yesterday is no longer usable.
- It relies on the in-<abbr title="random-access memory">RAM</abbr> arrays that I just removed from the code.
- I do not know if I will be able to build a similar feature when using a MySQL database.
- I got most of the new logic set up, but I did run into an issue near the end and had to give up for the night.
- I need a way to SELECT <abbr title="Uniform Resource Identifier">URI</abbr>s from the `anchors` table, but only SELECT <abbr title="Uniform Resource Identifier">URI</abbr>s that are not present in the `titles` table.
- </p>
- <p>
- I tried to contact my old school again about getting a copy of my transcript.
- They still have not responded to me.
- </p>
- <p>
- We found something a bit annoying about the air here in Coos Bay.
- Because it is wetter than the air in Springfield, our perishable food is becoming inedible faster.
- We will have to keep this in mind for the future.
- </p>
- <p>
- My <a href="/a/canary.txt">canary</a> still sings the tune of freedom and transparency.
- </p>
- <hr/>
- <p>
- Copyright © 2016 Alex Yst;
- You may modify and/or redistribute this document under the terms of the <a rel="license" href="/license/gpl-3.0-standalone.xhtml"><abbr title="GNU's Not Unix">GNU</abbr> <abbr title="General Public License version Three or later">GPLv3+</abbr></a>.
- If for some reason you would prefer to modify and/or distribute this document under other free copyleft terms, please ask me via email.
- My address is in the source comments near the top of this document.
- This license also applies to embedded content such as images.
- For more information on that, see <a href="/en/a/licensing.xhtml">licensing</a>.
- </p>
- <p>
- <abbr title="World Wide Web Consortium">W3C</abbr> standards are important.
- This document conforms to the <a href="https://validator.w3.org./nu/?doc=https%3A%2F%2Fy.st.%2Fen%2Fweblog%2F2016%2F01-January%2F06.xhtml"><abbr title="Extensible Hypertext Markup Language">XHTML</abbr> 5.1</a> specification and uses style sheets that conform to the <a href="http://jigsaw.w3.org./css-validator/validator?uri=https%3A%2F%2Fy.st.%2Fen%2Fweblog%2F2016%2F01-January%2F06.xhtml"><abbr title="Cascading Style Sheets">CSS</abbr>3</a> specification.
- </p>
- </body>
- </html>
|