<?xml version="1.0" encoding="utf-8"?>
<!--

h t t :: / / t /
h t t :: // // t //
h ttttt ttttt ppppp sssss // // y y sssss ttttt //
hhhh t t p p s // // y y s t //
h hh t t ppppp sssss // // yyyyy sssss t //
h h t t p s :: / / y .. s t .. /
h h t t p sssss :: / / yyyyy .. sssss t .. /

<https://y.st./>
Copyright © 2015 Alex Yst <mailto:copyright@y.st>
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <https://www.gnu.org./licenses/>.
-->
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<base href="https://y.st./en/weblog/2015/12-December/28.xhtml" />
<title>The spider's first run &lt;https://y.st./en/weblog/2015/12-December/28.xhtml&gt;</title>
<link rel="icon" type="image/png" href="/link/CC_BY-SA_4.0/y.st./icon.png" />
<link rel="stylesheet" type="text/css" href="/link/basic.css" />
<link rel="stylesheet" type="text/css" href="/link/site-specific.css" />
<script type="text/javascript" src="/script/javascript.js" />
<meta name="viewport" content="width=device-width" />
</head>
<body>
<nav>
<p>
<a href="/en/">Home</a> |
<a href="/en/a/about.xhtml">About</a> |
<a href="/en/a/contact.xhtml">Contact</a> |
<a href="/a/canary.txt">Canary</a> |
<a href="/en/URI_research/"><abbr title="Uniform Resource Identifier">URI</abbr> research</a> |
<a href="/en/opinion/">Opinions</a> |
<a href="/en/coursework/">Coursework</a> |
<a href="/en/law/">Law</a> |
<a href="/en/a/links.xhtml">Links</a> |
<a href="/en/weblog/2015/12-December/28.xhtml.asc">{this page}.asc</a>
</p>
<hr/>
<p>
Weblog index:
<a href="/en/weblog/"><abbr title="American Standard Code for Information Interchange">ASCII</abbr> calendars</a> |
<a href="/en/weblog/index_ol_ascending.xhtml">Ascending list</a> |
<a href="/en/weblog/index_ol_descending.xhtml">Descending list</a>
</p>
<hr/>
<p>
Jump to entry:
<a href="/en/weblog/2015/03-March/07.xhtml">&lt;&lt;First</a>
<a rel="prev" href="/en/weblog/2015/12-December/27.xhtml">&lt;Previous</a>
<a rel="next" href="/en/weblog/2015/12-December/29.xhtml">Next&gt;</a>
<a href="/en/weblog/latest.xhtml">Latest&gt;&gt;</a>
</p>
<hr/>
</nav>
<header>
<h1>The spider's first run</h1>
<p>Day 00296: Monday, 2015 December 28</p>
</header>
<p>
For my new search engine to be at all effective, it needs to know how to handle relative <abbr title="Uniform Resource Identifier">URI</abbr>s.
It obviously cannot request pages with them directly, so I built a function that takes a base <abbr title="Uniform Resource Identifier">URI</abbr> and a relative <abbr title="Uniform Resource Identifier">URI</abbr> and merges them to form a new absolute <abbr title="Uniform Resource Identifier">URI</abbr>.
I found that <abbr title="PHP: Hypertext Preprocessor">PHP</abbr>'s <a href="https://php.net/manual/en/function.parse-url.php"><code>parse_url()</code> function</a> was very helpful for breaking <abbr title="Uniform Resource Identifier">URI</abbr>s into their components so that they could be merged.
However, all accounting for <code>.</code> and <code>..</code> directories, as well as all processing of the <abbr title="Uniform Resource Identifier">URI</abbr> components to form the new absolute <abbr title="Uniform Resource Identifier">URI</abbr>, had to be coded by hand.
It took several hours to get it right, but I think that my new function now does what I need it to do.
</p>
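<p>
For anyone curious, the general shape of the approach is something like the sketch below.
This is not the actual function from my spider; the name <code>resolve_uri()</code> is made up for illustration, and some corner cases of <abbr title="Request for Comments">RFC</abbr> 3986 are glossed over.
</p>
<pre>
&lt;?php
// A rough sketch, not the spider's actual code. Fragments are
// discarded, as a spider has no use for them.
function resolve_uri($base, $relative)
{
    $r = parse_url($relative);
    if (isset($r['scheme'])) {
        return $relative; // Already an absolute URI.
    }
    $b = parse_url($base);
    $host = isset($r['host']) ? $r['host'] : $b['host'];
    if (isset($r['host'])) {
        $path = isset($r['path']) ? $r['path'] : '/';
    } elseif (!isset($r['path']) || $r['path'] === '') {
        $path = isset($b['path']) ? $b['path'] : '/';
    } elseif ($r['path'][0] === '/') {
        $path = $r['path'];
    } else {
        // Merge the relative path onto the base path's directory.
        $bpath = isset($b['path']) ? $b['path'] : '/';
        $path = substr($bpath, 0, strrpos($bpath, '/') + 1) . $r['path'];
    }
    // Account for "." and ".." directories by hand.
    $in = explode('/', $path);
    $out = array();
    foreach ($in as $segment) {
        if ($segment === '.') {
            continue;
        } elseif ($segment === '..') {
            if (count($out) > 1) {
                array_pop($out);
            }
        } else {
            $out[] = $segment;
        }
    }
    // A trailing dot segment still names a directory.
    $last = end($in);
    if (($last === '.' || $last === '..') &amp;&amp; end($out) !== '') {
        $out[] = '';
    }
    $path = implode('/', $out);
    if ($path === '') {
        $path = '/';
    }
    $query = isset($r['query']) ? '?' . $r['query'] : '';
    return $b['scheme'] . '://' . $host . $path . $query;
}
</pre>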
<p>
My next task was to find a way to locate hyperlinks in a downloaded page so that the <abbr title="Uniform Resource Identifier">URI</abbr>s can be collected, normalized, and added to the database.
I thought that this task would be more difficult than the task of normalizing relative <abbr title="Uniform Resource Identifier">URI</abbr>s, but I was pleasantly surprised.
<abbr title="PHP: Hypertext Preprocessor">PHP</abbr>'s <a href="https://php.net/manual/en/function.xml-parse-into-struct.php"><code>xml_parse_into_struct()</code> function</a> performs most of the legwork.
This function's output will also make it easy to take into account the page's preferred base <abbr title="Uniform Resource Identifier">URI</abbr>, a feature that I had planned to add much later, but will now be able to build very early on.
I was worried that this would only work on <abbr title="Extensible Hypertext Markup Language">XHTML</abbr> pages, as <abbr title="Hypertext Markup Language">HTML</abbr> is not <abbr title="Extensible Markup Language">XML</abbr>-compliant, but it also seems to work on the few <abbr title="Hypertext Markup Language">HTML</abbr> pages that I tested it on.
Like the <code>curl_*()</code> functions, the <code>xml_*()</code> functions require passing around a resource handle, so I wrapped them up in a class as well.
</p>
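<p>
Stripped of the class wrapper, the heart of the idea looks something like the following sketch.
The variable names are illustrative only, and the structure keys are uppercase because the parser's case folding is left at its default setting.
</p>
<pre>
&lt;?php
// A rough sketch of link extraction, not the spider's actual code.
// $page is assumed to already hold the downloaded document.
$parser = xml_parser_create();
xml_parse_into_struct($parser, $page, $nodes);
xml_parser_free($parser);
$base = null;
$links = array();
foreach ($nodes as $node) {
    if (!isset($node['attributes']['HREF'])) {
        continue;
    }
    if ($node['tag'] === 'BASE') {
        // The page's preferred base URI, if it declares one.
        $base = $node['attributes']['HREF'];
    } elseif ($node['tag'] === 'A') {
        $links[] = $node['attributes']['HREF'];
    }
}
</pre>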
<p>
I quickly found a strange issue with the new class though.
Each object instantiated from it can only be used to parse a single <abbr title="Extensible Markup Language">XML</abbr> document.
I do not think that this is an error in the class, but rather an issue with the underlying <abbr title="PHP: Hypertext Preprocessor">PHP</abbr> functions; in fact, I witnessed the same behavior when I removed my wrapper class.
</p>
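<p>
One possible workaround, assuming the single-use behavior really does live in the underlying parser, would be for the wrapper to build a fresh parser resource for every document rather than holding one for the object's whole lifetime.
The sketch below is not my actual class, just the general idea.
</p>
<pre>
&lt;?php
// Hypothetical sketch: sidestep the one-document-per-parser
// limitation by creating and freeing a parser on each call.
class xml_struct_parser
{
    public function parse($document)
    {
        $parser = xml_parser_create();
        xml_parse_into_struct($parser, $document, $nodes);
        xml_parser_free($parser);
        return $nodes;
    }
}
</pre>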
<p>
I performed my first trial run of the search engine's spider today, and at first, it seemed to be doing very well.
However, it found a large file that someone had linked to and it got stuck there.
I think that it ate all of my machine's memory too, as the machine ended up locking up entirely after a couple of hours of being stuck on this one file.
I considered screening the <code>Content-Type</code> headers of files before downloading them, but the particular file that clogged up the spider claims to be of type <code>text/plain; charset=UTF-8</code>.
It was not a text file though; it was an XZ-compressed file.
Headers cannot always be trusted, as servers can be misconfigured.
However, I think that the real threat is not misconfigured servers, but maliciously-configured servers.
I should not rely on headers for anything as important as keeping the spider unclogged.
There does not seem to be a direct way to limit file download size, but someone on <abbr title="Internet Relay Chat">IRC</abbr> gave me <a href="https://stackoverflow.com/questions/17641073/how-to-set-a-maximum-size-limit-to-php-curl-downloads">a hint</a> as to how to set a download limit in a less direct way.
It seems a little confusing to me, possibly because it has been a long day, so I will try working with this again tomorrow.
</p>
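<p>
If I am reading the hint correctly, the idea is to watch the byte count from a progress callback and abort the transfer once it passes a cap.
Something like the sketch below should do it, though I have not built this into the spider yet; the one-mebibyte limit and the example <abbr title="Uniform Resource Identifier">URI</abbr> are placeholders.
</p>
<pre>
&lt;?php
// A sketch of the indirect size cap, not yet part of my spider.
$limit = 1048576; // Arbitrary example cap: one mebibyte.
$curl = curl_init('http://example.com/some-page');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_BUFFERSIZE, 8192);
curl_setopt($curl, CURLOPT_NOPROGRESS, false);
curl_setopt($curl, CURLOPT_PROGRESSFUNCTION,
    function ($handle, $dltotal, $dlnow, $ultotal, $ulnow) use ($limit) {
        // Returning a non-zero value makes cURL abort the transfer.
        return ($dlnow > $limit) ? 1 : 0;
    });
// $page is false when the transfer was aborted or otherwise failed.
$page = curl_exec($curl);
$too_large = (curl_errno($curl) === CURLE_ABORTED_BY_CALLBACK);
curl_close($curl);
</pre>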
<p>
My <a href="/a/canary.txt">canary</a> still sings the tune of freedom and transparency.
</p>
<hr/>
<p>
Copyright © 2015 Alex Yst;
You may modify and/or redistribute this document under the terms of the <a rel="license" href="/license/gpl-3.0-standalone.xhtml"><abbr title="GNU's Not Unix">GNU</abbr> <abbr title="General Public License version Three or later">GPLv3+</abbr></a>.
If for some reason you would prefer to modify and/or distribute this document under other free copyleft terms, please ask me via email.
My address is in the source comments near the top of this document.
This license also applies to embedded content such as images.
For more information on that, see <a href="/en/a/licensing.xhtml">licensing</a>.
</p>
<p>
<abbr title="World Wide Web Consortium">W3C</abbr> standards are important.
This document conforms to the <a href="https://validator.w3.org./nu/?doc=https%3A%2F%2Fy.st.%2Fen%2Fweblog%2F2015%2F12-December%2F28.xhtml"><abbr title="Extensible Hypertext Markup Language">XHTML</abbr> 5.1</a> specification and uses style sheets that conform to the <a href="http://jigsaw.w3.org./css-validator/validator?uri=https%3A%2F%2Fy.st.%2Fen%2Fweblog%2F2015%2F12-December%2F28.xhtml"><abbr title="Cascading Style Sheets">CSS</abbr>3</a> specification.
</p>
</body>
</html>