
Plan

So for now I'm writing a program that assumes the file contains valid html code. I'm also assuming that there are no comments. With that assumption...actually couldn't I use bison or something to generate code that would parse this for me?

Probably.

Anyway, how would I go about parsing an html file? That's a really good question. I'd have to keep a record of what elements I am currently parsing... For example


cat simple.html


    
        
    
    
        

Hello, world!

When getc() gives me the "B" in Bootstrap, then at that point my data structures should look like the following:

element elements [] =

0 -> element
     name = "!DOCTYPE html"
     attribute.name = html
     attribute.contents = ""
     older_sibling = 0;
     younger_sibling = 0;
     child = 0;
     done_parsing = true;

1 -> element
     name = "head"
     contents = ? don't know yet. Not done parsing
     older_sibling = 0;
     younger_sibling = 0;
     done_parsing = false;
     child = elementptr -> element
         element.name = title
         element.contents = ? don't know yet. Not done parsing.
         element.older_sibling = 0
         element.younger_sibling = ? don't know yet not done parsing
         element.done_parsing = false;
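A rough C sketch of that record, to see whether the fields hang together. It assumes the siblings and child are pointers rather than array indexes and that each element carries a single attribute, as in the notes above. The helper names new_element and free_element are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* One attribute per element, as in the notes (an assumption; real HTML allows many). */
struct attribute {
    char *name;
    char *contents;
};

struct element {
    char *name;                       /* e.g. "head" */
    char *contents;                   /* NULL while not known yet */
    struct attribute attribute;
    struct element *older_sibling;    /* NULL plays the role of 0 in the notes */
    struct element *younger_sibling;
    struct element *child;
    bool done_parsing;
};

/* Hypothetical helper: allocate an element with a copy of `name`
   and every other field cleared. Returns NULL if allocation fails. */
struct element *new_element(const char *name)
{
    assert(name != NULL);
    struct element *e = malloc(sizeof *e);
    if (e == NULL)
        return NULL;
    *e = (struct element){0};         /* all pointers NULL, done_parsing false */
    e->name = malloc(strlen(name) + 1);
    if (e->name == NULL) {
        free(e);
        return NULL;
    }
    strcpy(e->name, name);
    return e;
}

/* Hypothetical helper: free an element, its children and its younger siblings. */
void free_element(struct element *e)
{
    if (e == NULL)
        return;
    free_element(e->child);
    free_element(e->younger_sibling);
    free(e->name);
    free(e->contents);
    free(e->attribute.name);
    free(e->attribute.contents);
    free(e);
}
```

With this, the "head" entry above would be new_element("head") with its child set to new_element("title"), and both would keep done_parsing == false until their closing tags show up.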


  while ((c = getc()) != EOF):
      switch (c):
      case "<":
          if ((c = getc()) == "/"):
              parse_bottom_element ()
          else:
              parse_top_element (c)

  def parse_top_element (first):
      string = first
      while ((c = getc()) != ">"):
          string += c
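The loop above could look roughly like this in C. This is only a sketch: it scans a string already in memory instead of calling getc(), it assumes valid HTML with no comments, and it only counts tags instead of building elements. The names scan_tags and tag_counts are invented here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tally of what the scanner saw. */
struct tag_counts {
    int opened;      /* tags like <div> */
    int closed;      /* tags like </div> */
    int max_depth;   /* deepest nesting reached */
};

/* Walk the text once. '<' starts a tag and '>' ends it.
   <!DOCTYPE ...> and self-closing tags such as <br/> do not change the depth. */
struct tag_counts scan_tags(const char *html)
{
    struct tag_counts counts = {0, 0, 0};
    int depth = 0;

    assert(html != NULL);
    for (const char *p = html; *p != '\0'; p++) {
        if (*p != '<')
            continue;                     /* plain text between tags */
        p++;                              /* first character after '<' */
        int is_closing = (*p == '/');
        int is_declaration = (*p == '!');
        const char *last = p;             /* last character seen before '>' */
        while (*p != '\0' && *p != '>') {
            last = p;
            p++;
        }
        if (*p == '\0')
            break;                        /* unterminated tag, stop scanning */
        int is_self_closing = (!is_closing && *last == '/');
        if (is_declaration || is_self_closing)
            continue;
        if (is_closing) {
            counts.closed++;
            depth--;
        } else {
            counts.opened++;
            depth++;
            if (depth > counts.max_depth)
                counts.max_depth = depth;
        }
    }
    return counts;
}
```

The point is that the opening/closing decision happens right after the "<", by looking at whether the next character is "/".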

parsing issues

I can't/shouldn't return an array from a function...

https://stackoverflow.com/questions/11656532/returning-an-array-using-c

returning a string from a function

I'll have to dynamically increase the size of the array inside the function.

https://stackoverflow.com/questions/25798977/returning-string-from-c-function
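One way this might work: the function owns a malloc'd buffer, doubles it with realloc when it fills up, and hands the pointer back to the caller, who then has to free it. A minimal sketch, where read_until is a made-up name and the starting capacity of 16 is arbitrary:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read characters from `in` up to, but not including, `stop` or EOF.
   Returns a malloc'd NUL-terminated string that the caller must free,
   or NULL if memory runs out. The `stop` character itself is consumed. */
char *read_until(FILE *in, int stop)
{
    assert(in != NULL);
    size_t capacity = 16;
    size_t length = 0;
    char *buffer = malloc(capacity);
    if (buffer == NULL)
        return NULL;

    int c;
    while ((c = getc(in)) != EOF && c != stop) {
        if (length + 1 == capacity) {     /* keep one byte free for '\0' */
            capacity *= 2;
            char *bigger = realloc(buffer, capacity);
            if (bigger == NULL) {
                free(buffer);
                return NULL;
            }
            buffer = bigger;
        }
        buffer[length++] = (char)c;
    }
    buffer[length] = '\0';
    return buffer;
}
```

Called right after seeing a "<", read_until(file, '>') would return everything inside the tag, for example "head".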

dynamically allocate 2D array

an example html data structure

I should probably be allocating these strings via malloc.

https://www.geeksforgeeks.org/dynamically-allocate-2d-array-c/
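A small sketch of the malloc approach, assuming an array of row pointers where each row is its own fixed-width string buffer. alloc_strings and free_strings are hypothetical names, not anything from the linked page.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate `count` strings of `width` bytes each, all initialised to "".
   Returns NULL if any allocation fails. */
char **alloc_strings(size_t count, size_t width)
{
    assert(count > 0 && width > 0);
    char **rows = malloc(count * sizeof *rows);
    if (rows == NULL)
        return NULL;
    for (size_t i = 0; i < count; i++) {
        rows[i] = calloc(width, 1);       /* zeroed, so each row is an empty string */
        if (rows[i] == NULL) {
            while (i > 0)                 /* undo the rows allocated so far */
                free(rows[--i]);
            free(rows);
            return NULL;
        }
    }
    return rows;
}

/* Free everything alloc_strings handed out. */
void free_strings(char **rows, size_t count)
{
    if (rows == NULL)
        return;
    for (size_t i = 0; i < count; i++)
        free(rows[i]);
    free(rows);
}
```

Because each row is a separate allocation, a row could later be grown on its own with realloc if an element name turns out to be longer than expected.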

<html>
  <body>
      <div>
        <p> Hello <span> World! </span> <em> What's happening?</em> </p>
      </div>
      <div>
         <div>
           <div>
             <p> cra cra How are you? </p>
             <br/>
           </div>
         </div>
      </div>
      <div>
        <p> What's going on here!? </p>
        <br/>
        <p> Hello </p>
      </div>
  </body>
</html>

If we reach a closing html element...

    element->done_parsing = true;
    element = element->parent_element;
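Here is how the open/close bookkeeping might fit together, as a sketch. It assumes each element stores a parent_element pointer and that new children are appended as the youngest sibling. open_element, close_element and free_tree are invented names, and element names are borrowed string pointers just to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct element {
    const char *name;                 /* borrowed, not copied, in this sketch */
    struct element *parent_element;
    struct element *child;            /* oldest child */
    struct element *older_sibling;
    struct element *younger_sibling;
    bool done_parsing;
};

/* Opening tag: create an element under `current` (NULL for the root)
   and return it so it becomes the element being parsed. */
struct element *open_element(struct element *current, const char *name)
{
    struct element *e = malloc(sizeof *e);
    if (e == NULL)
        return NULL;
    *e = (struct element){0};
    e->name = name;
    e->parent_element = current;
    if (current != NULL) {
        if (current->child == NULL) {
            current->child = e;
        } else {
            struct element *last = current->child;
            while (last->younger_sibling != NULL)
                last = last->younger_sibling;
            last->younger_sibling = e;
            e->older_sibling = last;
        }
    }
    return e;
}

/* Closing tag: mark the element finished and step back up to its parent. */
struct element *close_element(struct element *element)
{
    assert(element != NULL);
    element->done_parsing = true;
    return element->parent_element;
}

/* Free an element, its younger siblings and all their descendants. */
void free_tree(struct element *e)
{
    while (e != NULL) {
        struct element *next = e->younger_sibling;
        free_tree(e->child);
        free(e);
        e = next;
    }
}
```

When close_element returns NULL, the root has just been closed, which for valid input should coincide with the closing html tag.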