doc | ||
docs | ||
src/html_to_md | ||
test/html_to_md | ||
.gitignore | ||
CHANGELOG.md | ||
LICENSE | ||
project.clj | ||
README.md |
html-to-md
A Clojure library designed to convert (Enlivened) HTML to markdown; but, more generally, a framework for [HT|SG|X]ML transformation.
Introduction
The itch I'm trying to scratch at present is to transform Blogger.com's dreadful tag-soup markup into markdown; but my architecture for doing this is to build a completely general [HT|SG|X]ML transformation framework and then specialise it.
WARNING: this is presently alpha-quality code, although it does have fair unit test coverage.
Usage
To use this library in your project, add the following leiningen dependency:
[org.clojars.simon_brooke/html-to-md "0.3.0"]
To use it in your namespace, require:
[html-to-md.core :refer [html-to-md]]
For default usage, that's all you need. To play more sophisticated tricks, consider:
[html-to-md.transformer :refer [transform process]]
[html-to-md.html-to-md :refer [markdown-dispatcher]]
The intended usage is as follows:
(require '[html-to-md.core :refer [html-to-md]])
(html-to-md url output-file)
This will read (X)HTML from url
and write Markdown to output-file
. If
output-file
is not supplied, it will return the markdown as a string:
(require '[html-to-md.core :refer [html-to-md]])
(def md (html-to-md url))
If you are specifically scraping blogger.com pages, you may try the following recipe:
(require '[html-to-md.core :refer [blogger-to-md]])
(blogger-to-md url output-file)
It works for my blogger pages. However, I'm not sure to what extent the skinning of blogger pages is pure CSS (in which case my recipe should work for yours) and to what extent it's HTML templating (in which case it probably won't). Results not guaranteed, if it doesn't work you get to keep all the pieces.
Extending the transformer
In principle, the transformer can transform any [HT|SG|X]ML markup into any other, or into any textual form. To extend it to do something other than markdown, supply a dispatcher. A dispatcher is essentially a function of one argument, a [HT|SG|X]ML tag represented as a Clojure keyword, which returns a processor, which should be a function of two arguments, an element assumed to have that tag, and a dispatcher. The processor should return the value that you want elements of that tag transformed into.
Obviously it is convenient to write dispatchers as maps, but it isn't required that you do so: anything which, given a keyword, will return a processor, will work.
License
Copyright © 2019 Simon Brooke simon@journeyman.cc
Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.