123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153 |
- <?xml version="1.0" encoding="utf-8"?>
- <!--
-
- h t t :: / / t /
- h t t :: // // t //
- h ttttt ttttt ppppp sssss // // y y sssss ttttt //
- hhhh t t p p s // // y y s t //
- h hh t t ppppp sssss // // yyyyy sssss t //
- h h t t p s :: / / y .. s t .. /
- h h t t p sssss :: / / yyyyy .. sssss t .. /
-
- <https://y.st./>
- Copyright © 2016 Alex Yst <mailto:copyright@y.st>
- This program is free software: you can redistribute it and/or modify
- it under the terms of the GNU General Public License as published by
- the Free Software Foundation, either version 3 of the License, or
- (at your option) any later version.
- This program is distributed in the hope that it will be useful,
- but WITHOUT ANY WARRANTY; without even the implied warranty of
- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- GNU General Public License for more details.
- You should have received a copy of the GNU General Public License
- along with this program. If not, see <https://www.gnu.org./licenses/>.
- -->
- <!DOCTYPE html>
- <html xmlns="http://www.w3.org/1999/xhtml">
- <head>
- <base href="https://y.st./en/weblog/2016/01-January/12.xhtml" />
- <title>Moving forward again <https://y.st./en/weblog/2016/01-January/12.xhtml></title>
- <link rel="icon" type="image/png" href="/link/CC_BY-SA_4.0/y.st./icon.png" />
- <link rel="stylesheet" type="text/css" href="/link/basic.css" />
- <link rel="stylesheet" type="text/css" href="/link/site-specific.css" />
- <script type="text/javascript" src="/script/javascript.js" />
- <meta name="viewport" content="width=device-width" />
- </head>
- <body>
- <nav>
- <p>
- <a href="/en/">Home</a> |
- <a href="/en/a/about.xhtml">About</a> |
- <a href="/en/a/contact.xhtml">Contact</a> |
- <a href="/a/canary.txt">Canary</a> |
- <a href="/en/URI_research/"><abbr title="Uniform Resource Identifier">URI</abbr> research</a> |
- <a href="/en/opinion/">Opinions</a> |
- <a href="/en/coursework/">Coursework</a> |
- <a href="/en/law/">Law</a> |
- <a href="/en/a/links.xhtml">Links</a> |
- <a href="/en/weblog/2016/01-January/12.xhtml.asc">{this page}.asc</a>
- </p>
- <hr/>
- <p>
- Weblog index:
- <a href="/en/weblog/"><abbr title="American Standard Code for Information Interchange">ASCII</abbr> calendars</a> |
- <a href="/en/weblog/index_ol_ascending.xhtml">Ascending list</a> |
- <a href="/en/weblog/index_ol_descending.xhtml">Descending list</a>
- </p>
- <hr/>
- <p>
- Jump to entry:
- <a href="/en/weblog/2015/03-March/07.xhtml"><<First</a>
- <a rel="prev" href="/en/weblog/2016/01-January/11.xhtml"><Previous</a>
- <a rel="next" href="/en/weblog/2016/01-January/13.xhtml">Next></a>
- <a href="/en/weblog/latest.xhtml">Latest>></a>
- </p>
- <hr/>
- </nav>
- <header>
- <h1>Moving forward again</h1>
- <p>Day 00311: Tuesday, 2016 January 12</p>
- </header>
- <p>
- Yesterday, I got almost nothing done, programing-wise.
- I think the problem is that I sat to work on boring wrapper classes and only boring wrapper classes.
- I wanted to make some progress on that so that I could be done with it sooner, but I need to throw in some actual usable stuff in between.
- I should focus more on my spider and only put the minimal effort that I promised into the wrapper classes.
- The wrapper classes will not be needed for quite some time, and if any of them become relevant before they are written, I can simply jump ahead and write the needed wrapper class.
- This should keep me more motivated to actually keep my code moving forward.
- </p>
- <p>
- As the first order of business today, I fixed the issue in which the spider will not recrawl the page that it was interrupted while crawling, though that will not help until the next time that it is interrupted or reaches the end of the current crawl.
- I forgot to yesterday though that the spider keep getting interrupted every time that my laptop disconnects from the server, even when I use the <code>disown</code> command or <code>&</code> operator, so I was able to deploy the fix fairly quickly.
- The spider takes longer to start up as it gathers more information though, and I always worry that I have introduced bugs such as those that might prevent any output at all.
- At one point, I caused the spider to loop endlessly, so I added a few debug information output lines so that I can see where the script is looping.
- It is a relief to be able to see the difference between a hang in the script versus an unending loop with no output.
- The spider hung for hours this time when I started it though, and I ended up adding even more debug lines.
- As it turns out, my database has gotten too big for the spider to effectively query, and thus, the spider cannot know where to crawl next.
- I thought that I would be fine if I simply did not hang onto information needed for search engine queries, but it seems that even holding onto data used by the spider itself is too much.
- I have had to downsize what the spider stores, making it a bit less effective.
- Now, it will only store information about website index pages.
- It will have to depend on websites having internal links that make everything accessible from the main index, though it can take several hops if needed.
- If a found <abbr title="Uniform Resource Identifier">URI</abbr> relates to a page on the same site, it will be crawled looking for more links to external sites, but if the found <abbr title="Uniform Resource Identifier">URI</abbr> is found that points to a different website's page, the <abbr title="Uniform Resource Identifier">URI</abbr> will be truncated to the root directory and stored in the database.
- </p>
- <p>
- I felt completely lost as to whether I need to rework the way that the crawling and database work first or implement support for other protocols first.
- In the end, I decided to learn more about the protocols so that I can parse the responses that I get from the servers.
- After I have built the tools that I need in order to understand server responses, I will set up the <code>switch()</code> statement for dealing with <abbr title="Uniform Resource Identifier">URI</abbr>s of different schemes differently, rework the <abbr title="Hypertext Transfer Protocol Secure">HTTPS</abbr>/<abbr title="Hypertext Transfer Protocol">HTTP</abbr>-handling code, and add the support of other schemes that <abbr title="Client for URLs/Client URL Request Library/Curl URL Request Library">cURL</abbr> can handle.
- I started out by learning <a href="https://en.wikipedia.org/wiki/Gopher_%28protocol%29">Gopher</a> directory-listing syntax.
- With some help from someone that prefers not to be mentioned, as well as some queries to a known Gopher server, I was able to decypher the very basic syntax.
- As it turns out, Gopher directory listings are listings of resources on the same server, a different server, or a mix.
- Gopher only allows linking to three types of servers though, assuming I understand correctly: Gopher servers, Telnet servers, and tn3270 servers, with tn3270 being an ancient form of Telnet that has its own <abbr title="Uniform Resource Identifier">URI</abbr> scheme.
- There is a <a href="https://en.wikipedia.org/wiki/Gopher_%28protocol%29#URL_links">hacky workaround</a> that allows linking to arbitrary <abbr title="Uniform Resource Identifier">URI</abbr>s, but it seems very messy to me.
- Basically, it is implemented as a link to an <abbr title="Hypertext Markup Language">HTML</abbr> page that is on the same server as the directory listing, but the text of the link has a specific syntax.
- The client is expected to recognize that because the specific syntax is being used, the link is not actually to be taken at face value.
- The "server" and "port" information of the link is to be ignored, and information from the link text is to be used instead.
- If the client fails to understand this, it will request the <abbr title="Hypertext Markup Language">HTML</abbr> file that the link actually points to, though as a fallback, the <abbr title="Hypertext Markup Language">HTML</abbr> file sent to the client when it does so will usually be an <abbr title="Hypertext Markup Language">HTML</abbr> redirection page that redirects the client to the correct <abbr title="Uniform Resource Identifier">URI</abbr>.
- I do not like this special-case-type syntax, and will probably have the spider ignore it by simply not programming it to recognize the special case.
- I assume that the <abbr title="Hypertext Markup Language">HTML</abbr> redirect pages used use <code><meta/></code> refresh tags, so Gopher's use of those may prompt me to work on support for <code><meta/></code> refresh tags at some point soon.
- This support would apply to <abbr title="Extensible Hypertext Markup Language">XHTML</abbr>/<abbr title="Hypertext Markup Language">HTML</abbr> pages retrieved over <abbr title="Hypertext Transfer Protocol Secure">HTTPS</abbr>/<abbr title="Hypertext Transfer Protocol">HTTP</abbr> as well.
- The other oddity in the Gopher directory listings is the informational message links.
- This links have a dummy host name and dummy port, due to the fact that they are not meant to to be links at all.
- However, as Gopher directory listings can contain only links, the only way to provide non-link information is by using such dummy information.
- Next, I looked into <abbr title="File Transfer Protocol">FTP</abbr> support.
- <abbr title="File Transfer Protocol">FTP</abbr> seems to return a directory listing of the files actually contained in a directory.
- There is no title information present though, and as far as I know, <abbr title="File Transfer Protocol">FTP</abbr> is not typically used to set up website-like pages the way that Gopher is, so I will not find <abbr title="Extensible Hypertext Markup Language">XHTML</abbr> files with links to other sites.
- There is really no reason for me to crawl these pages.
- I might put such services that do not need to be crawled in a separate database than the services that should be crawled.
- I kind of wonder if I want the spider to crawl anything <strong>*besides*</strong> <abbr title="Hypertext Transfer Protocol Secure">HTTPS</abbr>, <abbr title="Hypertext Transfer Protocol Secure">HTTPS</abbr>, and Gopher pages.
- Other types of resources might be good to query for a name, such as <abbr title="Internet Relay Chat over SSL">IRCS</abbr> servers, but I doubt that they will need to be recursively crawled.
- </p>
- <p>
- It appears that <a href="https://opalrwf4mzmlfmag.onion/">wowaname</a>'s hosting service has gone down for the time being.
- She left her computer at her roommate's house over winter break due to not having a stable Internet connection at home and <a href="http://answerstedhctbek.onion/">her roommate's older brother unplugged the machine</a>.
- From the sound of it, wowaname's hosting is usually provided over her school's Internet connection.
- They do not mind if students use this connection for whatever they like, such as providing hosting services, but they do not allow students to keep anything plugged in over the break.
- </p>
- <p>
- Cyrus has less than two weeks before his time is up and it is too late for him to complete his Boy Scout project.
- He was going to have me run some paperwork around town while he is in school tomorrow to save himself some time, but opted against it, saying that he had more that needed to be done before the paperwork could be dealt with.
- I wish that there was something I could do to help take the stress off of him a bit, but every time i ask, he says that there is nothing that I can do to help him yet.
- </p>
- <p>
- My <a href="/a/canary.txt">canary</a> still sings the tune of freedom and transparency.
- </p>
- <hr/>
- <p>
- Copyright © 2016 Alex Yst;
- You may modify and/or redistribute this document under the terms of the <a rel="license" href="/license/gpl-3.0-standalone.xhtml"><abbr title="GNU's Not Unix">GNU</abbr> <abbr title="General Public License version Three or later">GPLv3+</abbr></a>.
- If for some reason you would prefer to modify and/or distribute this document under other free copyleft terms, please ask me via email.
- My address is in the source comments near the top of this document.
- This license also applies to embedded content such as images.
- For more information on that, see <a href="/en/a/licensing.xhtml">licensing</a>.
- </p>
- <p>
- <abbr title="World Wide Web Consortium">W3C</abbr> standards are important.
- This document conforms to the <a href="https://validator.w3.org./nu/?doc=https%3A%2F%2Fy.st.%2Fen%2Fweblog%2F2016%2F01-January%2F12.xhtml"><abbr title="Extensible Hypertext Markup Language">XHTML</abbr> 5.1</a> specification and uses style sheets that conform to the <a href="http://jigsaw.w3.org./css-validator/validator?uri=https%3A%2F%2Fy.st.%2Fen%2Fweblog%2F2016%2F01-January%2F12.xhtml"><abbr title="Cascading Style Sheets">CSS</abbr>3</a> specification.
- </p>
- </body>
- </html>
|