clean html strings: "<a>foo</a>" → "foo"


Introduction

trivial-sanitize is a tiny library to clean HTML strings.

Status

The library is under development.

Prerequisites:

  • alexandria
  • cl-ppcre
  • cl-html5-parser

To run the tests, the following is also required:

  • clunit2

All of them need to be installed (via quicklisp).

Installation:

This library is not yet in quicklisp. To install it, copy the sources into your quicklisp local-projects directory; the library can then be loaded with:

#+BEGIN_SRC common-lisp
(ql:quickload "trivial-sanitize")
#+END_SRC

Usage

The public API consists of a single function, sanitize.

#+BEGIN_SRC common-lisp
(sanitize:sanitize "<a>foo</a>" sanitize:*strips-all-tags-sanitize*)
#+END_SRC

The arguments of sanitize are:

#+BEGIN_SRC common-lisp
(sanitize:sanitize str rules
                   &key
                     (case-insensitive-tag-match t)
                     (strip-comments t)
                     (strips-all-tags-on-malformed-html nil))
#+END_SRC

where:

str
is the string that needs to be cleaned;
rules
is the set of cleaning filters to be applied to str; rules can be built using the macro define-rules, as in the example below:

#+BEGIN_SRC common-lisp
(define-rules *basic-sanitize*
  :tags ("a" "abbr" "b" "blockquote" "br" "cite" "code" "dd" "dfn" "dl" "dt" "em"
         "i" "kbd" "li" "mark" "ol" "p" "pre" "q" "s" "samp" "small" "strike"
         "strong" "sub" "sup" "time" "u" "ul" "var")

  :attributes ((:all         . ("title"))
               ("a"          . ("href"))
               ("blockquote" . ("cite"))
               ("dfn"        . ("title"))
               ("q"          . ("cite"))
               ("time"       . ("datetime" "pubdate")))

  :add-attributes (("a" . (("rel" . "nofollow"))))

  :protocols (("a"          . (("href" . (:ftp :http :https :mailto :relative))))
              ("blockquote" . (("cite" . (:http :https :relative))))
              ("q"          . (("cite" . (:http :https :relative))))))
#+END_SRC

  • the argument for :tags is a list of allowed tags;
  • the argument for :attributes is a list where each element is a list whose first element is a tag name or the special keyword :all, and whose rest lists the attributes allowed for that tag;
  • the argument for :add-attributes is a list of lists; the first element of each inner list is the name of a tag, and the rest is a list of conses giving the name of an attribute (the car) and its value (the cdr) to be added to the attributes of that tag; so, for example, a string like:

    #+BEGIN_SRC text
    <a>email</a>
    #+END_SRC

    will become:

    #+BEGIN_SRC text
    <a rel="nofollow">email</a>
    #+END_SRC
  • the argument for :protocols is a list where each element is a list formed as follows:

    #+BEGIN_SRC text
    first element: tag-name
    rest:          attribute-protocols
    #+END_SRC

    tag-name
    the name of the tag to which this rule applies;
    attribute-protocols
    a list where each element is itself a list:

    #+BEGIN_SRC text
    first element: attribute-name
    rest:          allowed-protocols
    #+END_SRC

For example, the list:

#+BEGIN_SRC common-lisp
'("a" . (("href" . (:ftp :http :https :mailto :relative))))
#+END_SRC

means that, for the tag a, the attribute href can only contain values whose protocol is one of "ftp", "http", "https", "mailto" or "relative".
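
The nesting of a :protocols entry can be explored with plain list operations. The sketch below does not use the library: *example-protocols* and allowed-protocols are hypothetical names introduced here for illustration, and the lookup only mirrors the structure described above, not necessarily the way sanitize processes its rules.

#+BEGIN_SRC common-lisp
;; Two entries written as in the text. A dotted pair whose cdr is a list
;; reads as an ordinary list, so the first entry is the same object
;; shape as ("a" ("href" :ftp :http :https :mailto :relative)).
(defparameter *example-protocols*
  '(("a" . (("href" . (:ftp :http :https :mailto :relative))))
    ("q" . (("cite" . (:http :https :relative))))))

;; Hypothetical helper, not part of the library.
(defun allowed-protocols (tag attribute rules)
  "Return the protocols listed for ATTRIBUTE of TAG in RULES, or NIL."
  (let ((attribute-protocols (cdr (assoc tag rules :test #'string=))))
    (cdr (assoc attribute attribute-protocols :test #'string=))))

(allowed-protocols "a" "href" *example-protocols*)
;; => (:FTP :HTTP :HTTPS :MAILTO :RELATIVE)

(allowed-protocols "q" "href" *example-protocols*)
;; => NIL, because no protocols are listed for href on q
#+END_SRC

The first element of each entry selects the tag, and the first element of each inner list selects the attribute, which is why two nested assoc calls are enough to reach the protocol list.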

Four sets of rules are already available:

  • *basic-sanitize*
  • *strips-all-tags-sanitize*
  • *restricted-sanitize*
  • *relaxed-sanitize*

Please see the file src/sanitize.lisp for their definition.

The keyword arguments of sanitize are:

case-insensitive-tag-match
if non nil, ignore case when matching tags (or other elements);
strip-comments
if non nil, remove comments (i.e. text wrapped in <!-- --> on the same line);
strips-all-tags-on-malformed-html
if non nil, when parsing str produces errors, run this function again using a set of rules that tries to strip all the tags from str.
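
A call that spells out all three keyword arguments might look like the sketch below. It assumes the library is already loaded; *dirty* and its contents are made up for illustration, and the result is not shown because it depends on the rule set and on the parser.

#+BEGIN_SRC common-lisp
;; Assumes trivial-sanitize is already loaded (see Installation).
;; *dirty* is a hypothetical input string, not taken from the library.
(defparameter *dirty* "<p>Hello <b>world</b><!-- hidden note --></p>")

(sanitize:sanitize *dirty*
                   sanitize:*strips-all-tags-sanitize*
                   ;; ignore case when matching tags (the default)
                   :case-insensitive-tag-match t
                   ;; remove the <!-- hidden note --> comment (the default)
                   :strip-comments t
                   ;; on a parse error, run again with rules that try to
                   ;; strip every tag (the default is nil)
                   :strips-all-tags-on-malformed-html t)
#+END_SRC

Only the last keyword differs from its default here; the first two are written out to make the behaviour explicit at the call site.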

Moreover, there are two important special variables that can help to fine-tune the generated HTML:

whitespace-elements
a list of tag names that are replaced with white space when the tag is stripped. For example, given the string

#+BEGIN_SRC text
"<div>foo</div>"
#+END_SRC

if div is present in the list bound to whitespace-elements, stripping the tag will result in

#+BEGIN_SRC text
" foo "
#+END_SRC

if it is not present, the result will be:

#+BEGIN_SRC text
"foo"
#+END_SRC

self-closing-elements
a list of tags that will be kept as self-closing tags. For example:

#+BEGIN_SRC text
"a<br />b"
#+END_SRC

will be kept as is if br is present in the list bound to self-closing-elements:

#+BEGIN_SRC text
"a<br />b"
#+END_SRC

but will become:

#+BEGIN_SRC text
"a<br></br>b"
#+END_SRC

if it is not present.

BUGS

  • when malformed HTML is passed to sanitize, a spurious "</root-tag>" could appear at the end of the filtered string

Please send bug reports or patches to the issue tracker.

License

This library is released under the Lisp Lesser General Public License (see the COPYING.LESSER file).

NO WARRANTY

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

Acknowledgment

My deep thanks to the authors of cl-sanitize and hunchentoot. Thank you!