123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159 |
- <?xml version="1.0" encoding="utf-8"?>
- <!--
-
- h t t :: / / t /
- h t t :: // // t //
- h ttttt ttttt ppppp sssss // // y y sssss ttttt //
- hhhh t t p p s // // y y s t //
- h hh t t ppppp sssss // // yyyyy sssss t //
- h h t t p s :: / / y .. s t .. /
- h h t t p sssss :: / / yyyyy .. sssss t .. /
-
- <https://y.st./>
- Copyright © 2016 Alex Yst <mailto:copyright@y.st>
- This program is free software: you can redistribute it and/or modify
- it under the terms of the GNU General Public License as published by
- the Free Software Foundation, either version 3 of the License, or
- (at your option) any later version.
- This program is distributed in the hope that it will be useful,
- but WITHOUT ANY WARRANTY; without even the implied warranty of
- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- GNU General Public License for more details.
- You should have received a copy of the GNU General Public License
- along with this program. If not, see <https://www.gnu.org./licenses/>.
- -->
- <!DOCTYPE html>
- <html xmlns="http://www.w3.org/1999/xhtml">
- <head>
- <base href="https://y.st./en/weblog/2016/01-January/23.xhtml" />
- <title>\CURLOPT_NOPROGRESS, \curl_getinfo(), and \parse_url() <https://y.st./en/weblog/2016/01-January/23.xhtml></title>
- <link rel="icon" type="image/png" href="/link/CC_BY-SA_4.0/y.st./icon.png" />
- <link rel="stylesheet" type="text/css" href="/link/basic.css" />
- <link rel="stylesheet" type="text/css" href="/link/site-specific.css" />
- <script type="text/javascript" src="/script/javascript.js" />
- <meta name="viewport" content="width=device-width" />
- </head>
- <body>
- <nav>
- <p>
- <a href="/en/">Home</a> |
- <a href="/en/a/about.xhtml">About</a> |
- <a href="/en/a/contact.xhtml">Contact</a> |
- <a href="/a/canary.txt">Canary</a> |
- <a href="/en/URI_research/"><abbr title="Uniform Resource Identifier">URI</abbr> research</a> |
- <a href="/en/opinion/">Opinions</a> |
- <a href="/en/coursework/">Coursework</a> |
- <a href="/en/law/">Law</a> |
- <a href="/en/a/links.xhtml">Links</a> |
- <a href="/en/weblog/2016/01-January/23.xhtml.asc">{this page}.asc</a>
- </p>
- <hr/>
- <p>
- Weblog index:
- <a href="/en/weblog/"><abbr title="American Standard Code for Information Interchange">ASCII</abbr> calendars</a> |
- <a href="/en/weblog/index_ol_ascending.xhtml">Ascending list</a> |
- <a href="/en/weblog/index_ol_descending.xhtml">Descending list</a>
- </p>
- <hr/>
- <p>
- Jump to entry:
- <a href="/en/weblog/2015/03-March/07.xhtml"><<First</a>
- <a rel="prev" href="/en/weblog/2016/01-January/22.xhtml"><Previous</a>
- <a rel="next" href="/en/weblog/2016/01-January/24.xhtml">Next></a>
- <a href="/en/weblog/latest.xhtml">Latest>></a>
- </p>
- <hr/>
- </nav>
- <header>
- <h1><code>\CURLOPT_NOPROGRESS</code>, <code>\curl_getinfo()</code>, and <code>\parse_url()</code></h1>
- <p>Day 00322: Saturday, 2016 January 23</p>
- </header>
- <p>
- I did a bit of searching, and I found that there is a <a href="https://bugs.php.net/bug.php?id=65236">bug in how <code>\xml_parse_into_struct()</code> handles deeply-nested tags</a>.
- According to the Red Hat bug report, this bug is capable of <a href="https://bugzilla.redhat.com/show_bug.cgi?id=983689">crashing a script</a>.
- Due to the fact that the file that crashes the spider not being an actual <abbr title="Extensible Markup Language">XML</abbr> file, I have no idea how it is being interpreted.
- The spider only sends it to that function because it has no way to know if the file is an <abbr title="Extensible Markup Language">XML</abbr>/<abbr title="Hypertext Markup Language">HTML</abbr> file, so it sends all files through that function.
- After several more hours of debugging though, I finally found a problem in my spider itself that is making this problem worse: the progress function that I set is not being called! I removed a line that set the <code>\CURLOPT_NOPROGRESS</code> option to false, as I thought that it was leftover code that was not needed.
- The <abbr title="PHP: Hypertext Preprocessor">PHP</abbr> manual says that <a href="https://secure.php.net/manual/en/function.curl-setopt.php"><code>\CURLOPT_NOPROGRESS</code> should only be used for debugging</a>, but this is a lie! If <code>\CURLOPT_NOPROGRESS</code> is not set to false, the progress function will never be called.
- The <a href="https://stackoverflow.com/questions/17641073/how-to-set-a-maximum-size-limit-to-php-curl-downloads">original instructions for setting up a file size limit</a> were very helpful, as was the fact that my limiting function was set to generate output, though not output was being generated.
- Once I had the problem fixed though, too much output was being generated, ,so I had to disable that feature.
- I set up a new setting to re-enable it in the configuration file though.
- It seems that my file size limit might have been too high though, so I turned it down to eight megabytes.
- I am unsure if this file size is low enough to prevent future problems with non-<abbr title="Extensible Markup Language">XML</abbr> files or if the problem is only prevented for now because the file that triggered the error this particular time is larger than this limit.
- </p>
- <p>
- I looked further into the <a href="https://secure.php.net/manual/en/function.curl-getinfo.php"><code>\curl_getinfo()</code> function</a>, but it seems that it is not suitable for use in the spider.
- If a single <abbr title="Client for URLs/Client URL Request Library/Curl URL Request Library">cURL</abbr> handle is used for multiple file transfers, the output of this function gets a bit muddled.
- Specifically, if this file is called multiple times, any value that the function returned on a previous call will be called again, unless that value is overridden by a new value.
- That means that if the function is not yet able to retrieve information on the current download when called from within my <code>curl_limit</code> class, it will reuse the last-known value, which could be from a different download.
- I theorize that if I modified the <code>curl_limit</code> class to abort selected downloads based on the Content-Type header, files that the spider attempts to retrieve after the aborted download will also be potentially aborted, as <code>curl_limit</code> objects seem to get called several times before the server actually returns data, so the headers are not yet available.
- </p>
- <p>
- On the topic of optimization of include.d code, I realized today that my <code>CURLOPT_TOR</code> constant has an issue.
- It relies on the \CURL* constants, which unless the <abbr title="Client for URLs/Client URL Request Library/Curl URL Request Library">cURL</abbr> extension is installed and loaded, do not exist.
- This constant is only useful when the <abbr title="Client for URLs/Client URL Request Library/Curl URL Request Library">cURL</abbr> extension is loaded, but it is defined any time any constant in the name space is needed.
- </p>
- <p>
- We went out to some dunes to do some cooking on the trail, as Cyrus had to do that today to meet one of his Boy Scout requirements.
- Unfortunately, he forgot to bring both the metal screen to put over the fire and the pot to cook in, so we had to go back home to get them, using up some of what little time we have.
- Upon returning to the dunes, we gathered firewood and sap to build a fire and cook a meal.
- We then rushed back into town so that Cyrus could meet up with one of his counselors and get some of his requirements signed off on.
- </p>
- <p>
- While he was meeting with the counselor, Alyssa gave me a calendar and newspaper from the volunteer organization that she works with to look over.
- Today was hectic, so I was unable to look them over today, but hopefully they will give me a better idea of what she does.
- </p>
- <p>
- Upon returning home, Cyrus went over a few quick things with us in order to pass off more requirements.
- We discussed emergency situations and what to do in them, put together an emergency backpack, then discussed the family finances.
- After that, we were off to the local public library to finish our work there.
- We finished pruning the analog electronic media (<abbr title="Video Home System">VHS</abbr> cassettes and audio cassettes), then finished moving them all over to a single section to condense them.
- We cleaned off the dust on the freed shelves, reconfigured the shelves to have different spacing, and spread the digital electronic media out to allow room for collection growth.
- </p>
- <p>
- Once home from the library, Cyrus worked more on his Boy Scout requirements on his own.
- For me, it was time to once again debug the spider.
- It was again choking of a supposedly-empty <abbr title="Uniform Resource Identifier">URI</abbr> due to that bug that I intentionally did not fix <a href="/en/weblog/2016/01-January/19.xhtml">the other day</a>.
- This time, however, my code was just fine; I had not gotten a blank <abbr title="Uniform Resource Identifier">URI</abbr> due to an error in my code.
- I tracked down the hyperlink that caused the issue, and it was pointed at the <abbr title="Uniform Resource Identifier">URI</abbr> "#fn:1".
- I immediately saw the problem: <code>\parse_url()</code> was seeing the colon and getting confused.
- To test my theory, I called <code>\parse_url()</code> directly and passed in the troublesome <abbr title="Uniform Resource Identifier">URI</abbr> fragment, and sure enough, the function returned falls.
- The next step was clear.
- I needed to find out whether the <abbr title="Uniform Resource Identifier">URI</abbr> fragment is valid.
- If the formatting is valid, I would know that <code>\parse_url()</code> is broken and I need to rewrite <code>merge_uris()</code> to use some other method or breaking <abbr title="Uniform Resource Identifier">URI</abbr>s into their components.
- On the other hand, if that syntax is invalid, this was not a problem in <code>\parse_url()</code>.
- Instead of rewriting <code>merge_uris()</code>, I needed to simply fix the bug in the spider.
- </p>
- <p>
- As it turns out, that <abbr title="Uniform Resource Identifier">URI</abbr> fragment is <a href="https://tools.ietf.org/html/rfc3986#section-3.5">completely valid</a>.
- I that same document, an example regular expression that parses a <abbr title="Uniform Resource Identifier">URI</abbr> is present in the document, though the <abbr title="Request for Comments">RFC</abbr>s are not under a free license.
- Luckily, someone that prefers to remain unmentioned helped me find a document saying that <a href="http://trustee.ietf.org/license-info/IETF-TLP-5.htm"><abbr title="Internet Engineering Task Force">IETF</abbr> document code components are covered by the Modified <abbr title="Berkeley Software Distribution">BSD</abbr> license</a>.
- I should be able to construct a working function tomorrow if I have time.
- </p>
- <p>
- In the end, Cyrus was unable to meet his project deadline of midnight tonight.
- However, it seems that he has been granted a short extension, allowing him to finish up by nine tomorrow.
- </p>
- <p>
- My <a href="/a/canary.txt">canary</a> still sings the tune of freedom and transparency.
- </p>
- <hr/>
- <p>
- Copyright © 2016 Alex Yst;
- You may modify and/or redistribute this document under the terms of the <a rel="license" href="/license/gpl-3.0-standalone.xhtml"><abbr title="GNU's Not Unix">GNU</abbr> <abbr title="General Public License version Three or later">GPLv3+</abbr></a>.
- If for some reason you would prefer to modify and/or distribute this document under other free copyleft terms, please ask me via email.
- My address is in the source comments near the top of this document.
- This license also applies to embedded content such as images.
- For more information on that, see <a href="/en/a/licensing.xhtml">licensing</a>.
- </p>
- <p>
- <abbr title="World Wide Web Consortium">W3C</abbr> standards are important.
- This document conforms to the <a href="https://validator.w3.org./nu/?doc=https%3A%2F%2Fy.st.%2Fen%2Fweblog%2F2016%2F01-January%2F23.xhtml"><abbr title="Extensible Hypertext Markup Language">XHTML</abbr> 5.1</a> specification and uses style sheets that conform to the <a href="http://jigsaw.w3.org./css-validator/validator?uri=https%3A%2F%2Fy.st.%2Fen%2Fweblog%2F2016%2F01-January%2F23.xhtml"><abbr title="Cascading Style Sheets">CSS</abbr>3</a> specification.
- </p>
- </body>
- </html>
|