Merge branch 'release/0.2.0'

Simon Brooke 2019-04-30 20:15:39 +01:00
commit 97351feafe
9 changed files with 269 additions and 51 deletions

2
.gitignore vendored
View file

@ -11,3 +11,5 @@ pom.xml.asc
.hgignore
.hg/
*~
test\.md

View file

@ -1,6 +1,7 @@
# html-to-md
A Clojure library designed to convert (Enlivened) HTML to markdown; but, more
A Clojure library designed to convert
([Enlive](https://github.com/cgrand/enlive)ned) HTML to markdown; but, more
generally, a framework for [HT|SG|X]ML transformation.
## Introduction
@ -17,35 +18,50 @@ unit test coverage.
To use this library in your project, add the following leiningen dependency:
[org.clojars.simon_brooke/html-to-md "0.1.0"]
[org.clojars.simon_brooke/html-to-md "0.2.0"]
To use it in your namespace, require:
[html-to-md/transformer :refer [transform process]]
[html-to-md/html-to-md :refer [markdown-dispatcher]]
[html-to-md.core :refer [html-to-md]]
For default usage, that's all you need. To play more sophisticated tricks,
consider:
[html-to-md.transformer :refer [transform process]]
[html-to-md.html-to-md :refer [markdown-dispatcher]]
The intended usage is as follows:
```clojure
(require '[html-to-md.transformer :refer [transform]])
(require '[html-to-md.html-to-md :refer [markdown-dispatcher]])
(require '[html-to-md.core :refer [html-to-md]])
(transform URL markdown-dispatcher)
(html-to-md url output-file)
```
Where URL is any URL that references an HTML, SGML, XHTML or XML document.
However, my fancy multi-method doesn't work yet and may well be the wrong
approach, so for now use
This will read (X)HTML from `url` and write Markdown to `output-file`. If
`output-file` is not supplied, it will return the markdown as a string:
```clojure
(require '[html-to-md.core :refer [html-to-md]])
(require '[html-to-md.transformer :refer [process]])
(require '[html-to-md.html-to-md :refer [markdown-dispatcher]])
(require '[net.cgrand.enlive-html :as html])
(process (html/html-resource URL) markdown-dispatcher)
(def md (html-to-md url))
```
If you are specifically scraping [blogger.com](https://www.blogger.com/)
pages, you may *try* the following recipe:
```clojure
(require '[html-to-md.core :refer [blogger-to-md]])
(blogger-to-md url output-file)
```
It works for my blogger pages. However, I'm not sure to what extent the
skinning of blogger pages is pure CSS (in which case my recipe should work
for yours) and to what extent it's HTML templating (in which case it
probably won't). Results are not guaranteed; if it doesn't work, you get to
keep all the pieces.
## Extending the transformer
In principle, the transformer can transform any [HT|SG|X]ML markup into any
@ -66,3 +82,4 @@ Copyright © 2019 Simon Brooke <simon@journeyman.cc>
Distributed under the Eclipse Public License either version 1.0 or (at
your option) any later version.

View file

@ -1,3 +1,127 @@
# Introduction to html-to-md
TODO: write [great documentation](http://jacobian.org/writing/what-to-write/)
The itch I'm trying to scratch at present is to transform
[Blogger.com](http://www.blogger.com)'s dreadful tag-soup markup into markdown;
but my architecture for doing this is to build a completely general [HT|SG|X]ML
transformation framework and then specialise it.
**WARNING:** this is presently alpha-quality code, although it does have fair
unit test coverage.
## Usage
To use this library in your project, add the following leiningen dependency:
[org.clojars.simon_brooke/html-to-md "0.2.0"]
To use it in your namespace, require:
[html-to-md.core :refer [html-to-md]]
For default usage, that's all you need. To play more sophisticated tricks,
consider:
[html-to-md.transformer :refer [transform process]]
[html-to-md.html-to-md :refer [markdown-dispatcher]]
The intended usage is as follows:
```clojure
(require '[html-to-md.core :refer [html-to-md]])
(html-to-md url output-file)
```
This will read (X)HTML from `url` and write Markdown to `output-file`. If
`output-file` is not supplied, it will return the markdown as a string:
```clojure
(require '[html-to-md.core :refer [html-to-md]])
(def md (html-to-md url))
```
If you are specifically scraping [blogger.com](https://www.blogger.com/)
pages, you may *try* the following recipe:
```clojure
(require '[html-to-md.core :refer [blogger-to-md]])
(blogger-to-md url output-file)
```
It works for my blogger pages. However, I'm not sure to what extent the
skinning of blogger pages is pure CSS (in which case my recipe should work
for yours) and to what extent it's HTML templating (in which case it
probably won't). Results are not guaranteed; if it doesn't work, you get to
keep all the pieces.
## Extending the transformer
In principle, the transformer can transform any [HT|SG|X]ML markup into any
other, or into any textual form. To extend it to do something other than
markdown, supply a **dispatcher**. A dispatcher is essentially a function of one
argument, a [HT|SG|X]ML tag represented as a Clojure keyword, which returns
a **processor,** which should be a function of two arguments, an element assumed
to have that tag, and a dispatcher. The processor should return the value that
you want elements of that tag transformed into.
Thus the `html-to-md.html-to-md` namespace comprises a number of *processor*
functions, such as this one:
```clojure
(defn markdown-a
"Process the anchor element `e` into markdown, using dispatcher `d`."
[e d]
(str
"["
(s/trim (apply str (process (:content e) d)))
"]("
(-> e :attrs :href)
")"))
```
and a *dispatcher* map:
```clojure
(def markdown-dispatcher
"A despatcher for transforming (X)HTML into Markdown."
{:a markdown-a
:b markdown-strong
:br markdown-br
:code markdown-code
:body markdown-default
:div markdown-div
:em markdown-em
:h1 markdown-h1
:h2 markdown-h2
:h3 markdown-h3
:h4 markdown-h4
:h5 markdown-h5
:h6 markdown-h6
:html markdown-html
:i markdown-em
:img markdown-img
:ol markdown-ol
:p markdown-div
:pre markdown-pre
:samp markdown-code
:script markdown-omit
:span markdown-default
:strong markdown-strong
:style markdown-omit
:ul markdown-ul
})
```
Obviously it is convenient to write dispatchers as maps, but it isn't required
that you do so: anything which, given a keyword, will return a processor, will
work.
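As a rough illustration of that last point, here is a minimal sketch of a dispatcher written as a function rather than a map. It is self-contained, so it does not use the library itself: `process` below is a cut-down stand-in for `html-to-md.transformer/process`, written only from the contract described above, and `text-of`, `shout`, `plain`, `overrides`, `fallback-dispatcher` and `doc` are all hypothetical names invented for this example.

```clojure
(require '[clojure.string :as s])

;; Stand-in for `html-to-md.transformer/process`, reduced to the contract
;; described above. The real function may differ in detail.
(defn process
  [element dispatcher]
  (cond
    ;; An element: ask the dispatcher for a processor for its tag.
    (map? element)
    (if-let [processor (dispatcher (:tag element))]
      (processor element dispatcher)
      (map #(process % dispatcher) (:content element)))

    ;; Text passes through unchanged.
    (string? element) element

    ;; A sequence of nodes: process each one, dropping omitted (nil) results.
    (sequential? element)
    (remove nil? (map #(process % dispatcher) element))))

(defn text-of
  "Concatenate the processed content of element `e` into one string."
  [e d]
  (apply str (flatten (list (process (:content e) d)))))

(defn shout
  "Hypothetical processor: render element `e` as upper-case text."
  [e d]
  (s/upper-case (text-of e d)))

(defn plain
  "Hypothetical fallback processor: render just the text of element `e`."
  [e d]
  (text-of e d))

(def overrides
  "Tags which get special treatment; every other tag falls back on `plain`."
  {:strong shout
   :script (fn [_ _] nil)})

(defn fallback-dispatcher
  "A dispatcher which is a function, not a map: given a tag keyword, return
  the override for that tag if there is one, otherwise `plain`."
  [tag]
  (get overrides tag plain))

(def doc
  {:tag :p
   :content ["Hello, "
             {:tag :strong :content ["world"]}
             {:tag :script :content ["alert(1)"]}
             "!"]})

(process doc fallback-dispatcher)
```

Because `fallback-dispatcher` returns a processor for every tag, tags it has never heard of are still handled; with a plain map, those tags would instead get whatever treatment `process` gives to elements with no processor.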
## License
Copyright © 2019 Simon Brooke <simon@journeyman.cc>
Distributed under the Eclipse Public License either version 1.0 or (at
your option) any later version.

View file

@ -1,10 +1,11 @@
(defproject html-to-md "0.1.0"
(defproject html-to-md "0.2.0"
:description "Convert (Enlivened) HTML to markdown; but, more generally, a framework for [HT|SG|X]ML transformation."
:url "https://github.com/simon-brooke/html-to-md"
:license {:name "Eclipse Public License"
:url "http://www.eclipse.org/legal/epl-v10.html"}
:dependencies [[org.clojure/clojure "1.8.0"]
[enlive "1.1.6"]]
:plugins [[lein-codox "0.10.3"]]
:plugins [[lein-codox "0.10.3"]
[lein-release "1.0.5"]]
:lein-release {:deploy-via :clojars}
:signing {:gpg-key "Simon Brooke (Stultus in monte) <simon@journeyman.cc>"})

View file

@ -0,0 +1,41 @@
(ns html-to-md.blogger-to-md
(:require [clojure.string :as s]
[html-to-md.html-to-md :refer [markdown-dispatcher markdown-header]]
[html-to-md.transformer :refer [process]]
[net.cgrand.enlive-html :as html]))
(defn blogger-scraper
"Processor which scrapes the actual post content out of a blogger page.
*NOTE:* This was written to scrape *my* blogger pages, yours may be
different!"
[e d]
(let [title (first (html/select e [:h3.post-title]))
content (html/select e [:div.post-body])]
(if (and title content)
(apply
str
(cons
(markdown-header title d 1)
(process content d))))))
(defn image-table-processor
"Blogger's horrible tag soup wraps images in tables. Is this table such
a table? If so extract the image from it and process it to markdown;
otherwise, fall back on what `markdown-dispatcher` would do with the
table (which is currently nothing, but that will change)."
[e d]
(let [caption (process (first (html/select e [:td.tr-caption])) d)
alt (if caption (s/trim (apply str caption)))
image (first (html/select e [:img]))
src (if image (-> image :attrs :src))]
(if image
(str "![image: " alt "](" src ")")
(process e markdown-dispatcher))))
(def blogger-dispatcher
"Adaptation of `markdown-dispatcher`, q.v., with the `:table` and
`:html` dispatches overridden."
(assoc markdown-dispatcher
:html blogger-scraper
:table image-table-processor))

View file

@ -1,6 +1,21 @@
(ns html-to-md.core)
(ns html-to-md.core
(:require [html-to-md.transformer :refer [transform process]]
[html-to-md.html-to-md :refer [markdown-dispatcher]]
[html-to-md.blogger-to-md :refer [blogger-dispatcher]]))
(defn foo
"I don't do a whole lot."
[x]
(println x "Hello, World!"))
(defn html-to-md
"Transform the HTML document referenced by `url` into Markdown, and write
it to `output`, if supplied."
([url]
(apply str (transform url markdown-dispatcher)))
([url output]
(spit output (html-to-md url))))
(defn blogger-to-md
"Transform the Blogger post referenced by `url` into Markdown, and write
it to `output`, if supplied. *NOTE:* This was written to scrape *my*
blogger pages, yours may be different!"
([url]
(apply str (transform url blogger-dispatcher)))
([url output]
(spit output (blogger-to-md url))))

View file

@ -7,15 +7,18 @@
(defn markdown-a
"Process the anchor element `e` into markdown, using dispatcher `d`."
[e d]
(apply
str
(flatten
(list
"["
(map #(process % d) (:content e))
"]("
(-> e :attrs :href)
")"))))
(str
"["
(s/trim (apply str (process (:content e) d)))
"]("
(-> e :attrs :href)
")"))
(defn markdown-br
"Process the line-break element `e`, so beloved of tag-soupers, into
markdown"
[e d]
"\n")
(defn markdown-code
"Process the code or samp `e` into markdown, using dispatcher `d`."
@ -51,15 +54,12 @@
"Process the header element `e` into markdown, with level `level`,
using dispatcher `d`."
[e d level]
(apply
str
(flatten
(list
"\n"
(take level (repeat "#"))
" "
(map #(process % d) (:content e))
"\n"))))
(str
"\n"
(apply str (take level (repeat "#")))
" "
(s/trim (apply str (process (:content e) d)))
"\n"))
(defn markdown-h1
"Process the header element `e` into markdown, with level 1, using
@ -105,7 +105,7 @@
(defn markdown-img
"Process this image element `e` into markdown, using dispatcher `d`."
[e d]
(str "![" (-> e :attrs :alt) "](" (-> e :attrs :src) ")"))
(str "![image: " (-> e :attrs :alt) "](" (-> e :attrs :src) ")"))
(defn markdown-ol
"Process this ordered list element `e` into markdown, using dispatcher
@ -120,10 +120,15 @@
str
(flatten
(list "\n" (inc %2) ". " (process %1 d))))
(:content e)
(html/select e [:li])
(range))))
"\n\n"))
(defn markdown-omit
"Don't process the element `e` into markdown, but return `nil`."
[e d]
nil)
(defn markdown-pre
"Process the preformatted emphasis element `e` into markdown, using
dispatcher `d`."
@ -155,13 +160,15 @@
str
(flatten
(list "\n* " (process % d))))
(:content e))))
(html/select e [:li]))))
"\n\n"))
(def markdown-dispatcher
"A despatcher for transforming (X)HTML into Markdown."
{:a markdown-a
:b markdown-strong
:br markdown-br
:code markdown-code
:body markdown-default
:div markdown-div
@ -179,8 +186,10 @@
:p markdown-div
:pre markdown-pre
:samp markdown-code
:script markdown-omit
:span markdown-default
:strong markdown-strong
:style markdown-omit
:ul markdown-ul
})

View file

@ -26,26 +26,35 @@
(if processor
(apply processor (list element dispatcher))
(map #(process % dispatcher) (:content element))))
(string? element) element))
(string? element) element
(or (seq? element) (vector? element))
(remove nil? (map #(process % dispatcher) element))))
(defn- transformer-dispatch
[a _]
(class a))
(defmulti transform
"Transform the `obj` which is my first argument using the `dispatcher`
which is my second argument."
[class class] :default :default)
#'transformer-dispatch
:default :default)
(defmethod transform :default [obj dispatcher]
(process obj dispatcher))
(defmethod transform [java.net.URI Object] [uri dispatcher]
(process (html/html-resource uri) dispatcher))
(defmethod transform java.net.URI [uri dispatcher]
(remove nil? (process (html/html-resource uri) dispatcher)))
(defmethod transform [java.net.URL Object] [url dispatcher]
(defmethod transform java.net.URL [url dispatcher]
(transform (.toURI url) dispatcher))
(defmethod transform [String Object] [s dispatcher]
(defmethod transform String [s dispatcher]
(let [url (try (java.net.URL. s) (catch Exception any))]
(if url (transform url dispatcher)
;; otherwise, if s is not a URL, consider it as an HTML fragment,
;; parse and process it
(process (tagsoup/parser (java.io.StringReader. s)) dispatcher)
)))

View file

@ -73,7 +73,7 @@
(deftest img-test
(testing "Image tag."
(let [expected "![Hello dere!](http://foo.bar/image.png)"
(let [expected "![image: Hello dere!](http://foo.bar/image.png)"
actual (process
{:tag :img
:attrs {:src "http://foo.bar/image.png"