123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137 |
- <?xml version="1.0" encoding="utf-8"?>
- <!--
-
- h t t :: / / t /
- h t t :: // // t //
- h ttttt ttttt ppppp sssss // // y y sssss ttttt //
- hhhh t t p p s // // y y s t //
- h hh t t ppppp sssss // // yyyyy sssss t //
- h h t t p s :: / / y .. s t .. /
- h h t t p sssss :: / / yyyyy .. sssss t .. /
-
- <https://y.st./>
- Copyright © 2016 Alex Yst <mailto:copyright@y.st>
- This program is free software: you can redistribute it and/or modify
- it under the terms of the GNU General Public License as published by
- the Free Software Foundation, either version 3 of the License, or
- (at your option) any later version.
- This program is distributed in the hope that it will be useful,
- but WITHOUT ANY WARRANTY; without even the implied warranty of
- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- GNU General Public License for more details.
- You should have received a copy of the GNU General Public License
- along with this program. If not, see <https://www.gnu.org./licenses/>.
- -->
- <!DOCTYPE html>
- <html xmlns="http://www.w3.org/1999/xhtml">
- <head>
- <base href="https://y.st./en/weblog/2016/02-February/06.xhtml" />
- <title>The spider does not handle broken markup well <https://y.st./en/weblog/2016/02-February/06.xhtml></title>
- <link rel="icon" type="image/png" href="/link/CC_BY-SA_4.0/y.st./icon.png" />
- <link rel="stylesheet" type="text/css" href="/link/basic.css" />
- <link rel="stylesheet" type="text/css" href="/link/site-specific.css" />
- <script type="text/javascript" src="/script/javascript.js" />
- <meta name="viewport" content="width=device-width" />
- </head>
- <body>
- <nav>
- <p>
- <a href="/en/">Home</a> |
- <a href="/en/a/about.xhtml">About</a> |
- <a href="/en/a/contact.xhtml">Contact</a> |
- <a href="/a/canary.txt">Canary</a> |
- <a href="/en/URI_research/"><abbr title="Uniform Resource Identifier">URI</abbr> research</a> |
- <a href="/en/opinion/">Opinions</a> |
- <a href="/en/coursework/">Coursework</a> |
- <a href="/en/law/">Law</a> |
- <a href="/en/a/links.xhtml">Links</a> |
- <a href="/en/weblog/2016/02-February/06.xhtml.asc">{this page}.asc</a>
- </p>
- <hr/>
- <p>
- Weblog index:
- <a href="/en/weblog/"><abbr title="American Standard Code for Information Interchange">ASCII</abbr> calendars</a> |
- <a href="/en/weblog/index_ol_ascending.xhtml">Ascending list</a> |
- <a href="/en/weblog/index_ol_descending.xhtml">Descending list</a>
- </p>
- <hr/>
- <p>
- Jump to entry:
- <a href="/en/weblog/2015/03-March/07.xhtml"><<First</a>
- <a rel="prev" href="/en/weblog/2016/02-February/05.xhtml"><Previous</a>
- <a rel="next" href="/en/weblog/2016/02-February/07.xhtml">Next></a>
- <a href="/en/weblog/latest.xhtml">Latest>></a>
- </p>
- <hr/>
- </nav>
- <header>
- <h1>The spider does not handle broken markup well</h1>
- <p>Day 00336: Saturday, 2016 February 06</p>
- </header>
- <p>
- I feel like I am stuck waiting for the time being.
- Until my appointment on Wednesday, in which I will learn what courses I would need to take to pursue finishing my associate degree at the local community college, I will not know whether I will be going back to school or attempting to find a full-time job for the time being.
- I cannot really make any plans until then.
- </p>
- <p>
- While the spider continues to run without error, I do not have any fatal bugs to work on, so it is time to work on feature support.
- The main two features on the waiting list are support for retrieving the full text of an <code><a/></code> tag even when it has child nodes and support for flat file databases.
- I think that it would be a lot of fun to try abstracting away the direct database calls, which would make supporting flat files and multiple database software options as easy as writing a new class, one that implements a particular interface, for each storage method desired.
- However, the <code><a/></code> tag issue is of more urgency, even if of less importance, so I set to work on that instead.
- </p>
- <p>
- Setting up that functionality turned out to be quite trivial.
- I set up a mock function that, if built, should provide exactly what I need to easily extract link text regardless of nesting.
- However, before I could build it, I ran some tests on some broken <abbr title="Extensible Markup Language">XML</abbr> to see how <abbr title="PHP: Hypertext Preprocessor">PHP</abbr>'s <abbr title="Extensible Markup Language">XML</abbr> parser deals with broken <abbr title="Extensible Markup Language">XML</abbr>.
- As it turns out, it does not.
- As soon as an error is encountered, parsing stops.
- This is the proper behavior, but I had thought that I had tested the <abbr title="Extensible Markup Language">XML</abbr> parser on <abbr title="Hypertext Markup Language">HTML</abbr> (which is basically malformed <abbr title="Extensible Markup Language">XML</abbr> before.
- I had looked for some sort of error though, so when reasonable-looking data was returned, I must have assumed that it had not given up on the document.
- Partial data like this is probably worse than no data, as with partial data, I think that I have everything on the page parsed and do not realize that there is a bug in need of fixing.
- </p>
- <p>
- To be fair though, use of <abbr title="Hypertext Markup Language">HTML</abbr> and other broken markup is the real bug.
- But though the bug is not on my end, I need to be able to deal with it on my end.
- </p>
- <p>
- I added output to estimate the time remaining in the processing of a website, as it can be a bit bothersome waiting for the spider to be finished with a large website so that I can halt and restart it to see an update take effect.
- Next, I worked on improving my <abbr title="Uniform Resource Identifier">URI</abbr> handlers.
- In reality, a <abbr title="Uniform Resource Identifier">URI</abbr> is a specialized kind of data, so it would be better implemented as a class than as several functions like it was yesterday when I finished it.
- Additionally, as a class, it will only need to check the <abbr title="Uniform Resource Identifier">URI</abbr> components for validity during parsing, instead of both during parsing and during reconstruction.
- I think that I have the class working correctly, but I did not have time to test it tonight, so I will wait until tomorrow to upload it and integrate it with the spider.
- </p>
- <p>
- While combining the <abbr title="Uniform Resource Identifier">URI</abbr>-related functions into a class, I came across a corner case that I had not accounted for.
- <abbr title="Uniform Resource Identifier">URI</abbr>s such as <code>example:/.//</code> seem to be valid, though unnormalized.
- Using the standard normalization method though, the path becomes <code>///</code>.
- With no host component present, the path cannot begin with two slashes.
- However, normalizing the path is not allowed to alter the host component, so we cannot normalize the <abbr title="Uniform Resource Identifier">URI</abbr> to <code>example:///</code> and say that the new path is <code>/</code>.
- I have no idea how to properly normalize this <abbr title="Uniform Resource Identifier">URI</abbr>.
- For the time being, I set up a loop that is entered when this type of situation is encountered.
- As long as the first two characters of the path are both slashes, one will be removed and the loop will continue.
- I am not sure that this is the correct way to handle this situation, but it is a better idea than any of the alternatives I came up with, such as encoding one of the slashes as <code>%25</code> or prefixing the path with some character do that the path did not start with a slash at all.
- </p>
- <p>
- I spent most of the day running errands in Springfield with my mother.
- It seems that my presence was not really needed though and I was not able to help.
- </p>
- <hr/>
- <p>
- Copyright © 2016 Alex Yst;
- You may modify and/or redistribute this document under the terms of the <a rel="license" href="/license/gpl-3.0-standalone.xhtml"><abbr title="GNU's Not Unix">GNU</abbr> <abbr title="General Public License version Three or later">GPLv3+</abbr></a>.
- If for some reason you would prefer to modify and/or distribute this document under other free copyleft terms, please ask me via email.
- My address is in the source comments near the top of this document.
- This license also applies to embedded content such as images.
- For more information on that, see <a href="/en/a/licensing.xhtml">licensing</a>.
- </p>
- <p>
- <abbr title="World Wide Web Consortium">W3C</abbr> standards are important.
- This document conforms to the <a href="https://validator.w3.org./nu/?doc=https%3A%2F%2Fy.st.%2Fen%2Fweblog%2F2016%2F02-February%2F06.xhtml"><abbr title="Extensible Hypertext Markup Language">XHTML</abbr> 5.1</a> specification and uses style sheets that conform to the <a href="http://jigsaw.w3.org./css-validator/validator?uri=https%3A%2F%2Fy.st.%2Fen%2Fweblog%2F2016%2F02-February%2F06.xhtml"><abbr title="Cascading Style Sheets">CSS</abbr>3</a> specification.
- </p>
- </body>
- </html>
|