Merge branch 'release/0.2.0'
commit
97351feafe
2
.gitignore
vendored
|
@ -11,3 +11,5 @@ pom.xml.asc
|
||||||
.hgignore
|
.hgignore
|
||||||
.hg/
|
.hg/
|
||||||
*~
|
*~
|
||||||
|
|
||||||
|
test\.md
|
||||||
|
|
47
README.md
|
@ -1,6 +1,7 @@
|
||||||
# html-to-md
|
# html-to-md
|
||||||
|
|
||||||
A Clojure library designed to convert (Enlivened) HTML to markdown; but, more
|
A Clojure library designed to convert
|
||||||
|
([Enlive](https://github.com/cgrand/enlive)ned) HTML to markdown; but, more
|
||||||
generally, a framework for [HT|SG|X]ML transformation.
|
generally, a framework for [HT|SG|X]ML transformation.
|
||||||
|
|
||||||
## Introduction
|
## Introduction
|
||||||
|
@ -17,35 +18,50 @@ unit test coverage.
|
||||||
|
|
||||||
To use this library in your project, add the following leiningen dependency:
|
To use this library in your project, add the following leiningen dependency:
|
||||||
|
|
||||||
[org.clojars.simon_brooke/html-to-md "0.1.0"]
|
[org.clojars.simon_brooke/html-to-md "0.2.0"]
|
||||||
|
|
||||||
To use it in your namespace, require:
|
To use it in your namespace, require:
|
||||||
|
|
||||||
[html-to-md/transformer :refer [transform process]]
|
[html-to-md.core :refer [html-to-md]]
|
||||||
[html-to-md/html-to-md :refer [markdown-dispatcher]]
|
|
||||||
|
For default usage, that's all you need. To play more sophisticated tricks,
|
||||||
|
consider:
|
||||||
|
|
||||||
|
[html-to-md.transformer :refer [transform process]]
|
||||||
|
[html-to-md.html-to-md :refer [markdown-dispatcher]]
|
||||||
|
|
||||||
The intended usage is as follows:
|
The intended usage is as follows:
|
||||||
|
|
||||||
```clojure
|
```clojure
|
||||||
(require '[html-to-md.transformer :refer [transform]])
|
(require '[html-to-md.core :refer [html-to-md]])
|
||||||
(require '[html-to-md.html-to-md :refer [markdown-dispatcher]])
|
|
||||||
|
|
||||||
(transform URL markdown-dispatcher)
|
(html-to-md url output-file)
|
||||||
```
|
```
|
||||||
|
|
||||||
Where URL is any URL that references an HTML, SGML, XHTML or XML document.
|
This will read (X)HTML from `url` and write Markdown to `output-file`. If
|
||||||
However, my fancy multi-method doesn't work yet and may well be the wrong
|
`output-file` is not supplied, it will return the markdown as a string:
|
||||||
approach, so for now use
|
|
||||||
|
|
||||||
```clojure
|
```clojure
|
||||||
|
(require '[html-to-md.core :refer [html-to-md]])
|
||||||
|
|
||||||
(require '[html-to-md.transformer :refer [process]])
|
(def md (html-to-md url))
|
||||||
(require '[html-to-md.html-to-md :refer [markdown-dispatcher]])
|
|
||||||
(require '[net.cgrand.enlive-html :as html])
|
|
||||||
|
|
||||||
(process (html/html-resource URL) markdown-dispatcher)
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
If you are specifically scraping [blogger.com](https://www.blogger.com/)
|
||||||
|
pages, you may *try* the following recipe:
|
||||||
|
|
||||||
|
```clojure
|
||||||
|
(require '[html-to-md.core :refer [blogger-to-md]])
|
||||||
|
|
||||||
|
(blogger-to-md url output-file)
|
||||||
|
```
|
||||||
|
|
||||||
|
It works for my blogger pages. However, I'm not sure to what extent the
|
||||||
|
skinning of blogger pages is pure CSS (in which case my recipe should work
|
||||||
|
for yours) and to what extent it's HTML templating (in which case it
|
||||||
|
probably won't). Results are not guaranteed; if it doesn't work, you get to
|
||||||
|
keep all the pieces.
|
||||||
|
|
||||||
## Extending the transformer
|
## Extending the transformer
|
||||||
|
|
||||||
In principle, the transformer can transform any [HT|SG|X]ML markup into any
|
In principle, the transformer can transform any [HT|SG|X]ML markup into any
|
||||||
|
@ -66,3 +82,4 @@ Copyright © 2019 Simon Brooke <simon@journeyman.cc>
|
||||||
|
|
||||||
Distributed under the Eclipse Public License either version 1.0 or (at
|
Distributed under the Eclipse Public License either version 1.0 or (at
|
||||||
your option) any later version.
|
your option) any later version.
|
||||||
|
|
||||||
|
|
126
doc/intro.md
|
@ -1,3 +1,127 @@
|
||||||
# Introduction to html-to-md
|
# Introduction to html-to-md
|
||||||
|
|
||||||
TODO: write [great documentation](http://jacobian.org/writing/what-to-write/)
|
The itch I'm trying to scratch at present is to transform
|
||||||
|
[Blogger.com](http://www.blogger.com)'s dreadful tag-soup markup into markdown;
|
||||||
|
but my architecture for doing this is to build a completely general [HT|SG|X]ML
|
||||||
|
transformation framework and then specialise it.
|
||||||
|
|
||||||
|
**WARNING:** this is presently alpha-quality code, although it does have fair
|
||||||
|
unit test coverage.
|
||||||
|
|
||||||
|
## Usage
|
||||||
|
|
||||||
|
To use this library in your project, add the following leiningen dependency:
|
||||||
|
|
||||||
|
[org.clojars.simon_brooke/html-to-md "0.2.0"]
|
||||||
|
|
||||||
|
To use it in your namespace, require:
|
||||||
|
|
||||||
|
[html-to-md.core :refer [html-to-md]]
|
||||||
|
|
||||||
|
For default usage, that's all you need. To play more sophisticated tricks,
|
||||||
|
consider:
|
||||||
|
|
||||||
|
[html-to-md.transformer :refer [transform process]]
|
||||||
|
[html-to-md.html-to-md :refer [markdown-dispatcher]]
|
||||||
|
|
||||||
|
The intended usage is as follows:
|
||||||
|
|
||||||
|
```clojure
|
||||||
|
(require '[html-to-md.core :refer [html-to-md]])
|
||||||
|
|
||||||
|
(html-to-md url output-file)
|
||||||
|
```
|
||||||
|
|
||||||
|
This will read (X)HTML from `url` and write Markdown to `output-file`. If
|
||||||
|
`output-file` is not supplied, it will return the markdown as a string:
|
||||||
|
|
||||||
|
```clojure
|
||||||
|
(require '[html-to-md.core :refer [html-to-md]])
|
||||||
|
|
||||||
|
(def md (html-to-md url))
|
||||||
|
```
|
||||||
|
|
||||||
|
If you are specifically scraping [blogger.com](https://www.blogger.com/)
|
||||||
|
pages, you may *try* the following recipe:
|
||||||
|
|
||||||
|
```clojure
|
||||||
|
(require '[html-to-md.core :refer [blogger-to-md]])
|
||||||
|
|
||||||
|
(blogger-to-md url output-file)
|
||||||
|
```
|
||||||
|
|
||||||
|
It works for my blogger pages. However, I'm not sure to what extent the
|
||||||
|
skinning of blogger pages is pure CSS (in which case my recipe should work
|
||||||
|
for yours) and to what extent it's HTML templating (in which case it
|
||||||
|
probably won't). Results are not guaranteed; if it doesn't work, you get to
|
||||||
|
keep all the pieces.
|
||||||
|
|
||||||
|
## Extending the transformer
|
||||||
|
|
||||||
|
In principle, the transformer can transform any [HT|SG|X]ML markup into any
|
||||||
|
other, or into any textual form. To extend it to do something other than
|
||||||
|
markdown, supply a **dispatcher**. A dispatcher is essentially a function of one
|
||||||
|
argument, a [HT|SG|X]ML tag represented as a Clojure keyword, which returns
|
||||||
|
a **processor**, which should be a function of two arguments, an element assumed
|
||||||
|
to have that tag, and a dispatcher. The processor should return the value that
|
||||||
|
you want elements of that tag transformed into.
|
||||||
|
|
||||||
|
Thus the `html-to-md.html-to-md` namespace comprises a number of *processor*
|
||||||
|
functions, such as this one:
|
||||||
|
|
||||||
|
```clojure
|
||||||
|
(defn markdown-a
|
||||||
|
"Process the anchor element `e` into markdown, using dispatcher `d`."
|
||||||
|
[e d]
|
||||||
|
(str
|
||||||
|
"["
|
||||||
|
(s/trim (apply str (process (:content e) d)))
|
||||||
|
"]("
|
||||||
|
(-> e :attrs :href)
|
||||||
|
")"))
|
||||||
|
```
|
||||||
|
|
||||||
|
and a *dispatcher* map:
|
||||||
|
|
||||||
|
```clojure
|
||||||
|
(def markdown-dispatcher
|
||||||
|
"A despatcher for transforming (X)HTML into Markdown."
|
||||||
|
{:a markdown-a
|
||||||
|
:b markdown-strong
|
||||||
|
:br markdown-br
|
||||||
|
:code markdown-code
|
||||||
|
:body markdown-default
|
||||||
|
:div markdown-div
|
||||||
|
:em markdown-em
|
||||||
|
:h1 markdown-h1
|
||||||
|
:h2 markdown-h2
|
||||||
|
:h3 markdown-h3
|
||||||
|
:h4 markdown-h4
|
||||||
|
:h5 markdown-h5
|
||||||
|
:h6 markdown-h6
|
||||||
|
:html markdown-html
|
||||||
|
:i markdown-em
|
||||||
|
:img markdown-img
|
||||||
|
:ol markdown-ol
|
||||||
|
:p markdown-div
|
||||||
|
:pre markdown-pre
|
||||||
|
:samp markdown-code
|
||||||
|
:script markdown-omit
|
||||||
|
:span markdown-default
|
||||||
|
:strong markdown-strong
|
||||||
|
:style markdown-omit
|
||||||
|
:ul markdown-ul
|
||||||
|
})
|
||||||
|
```
|
||||||
|
|
||||||
|
Obviously it is convenient to write dispatchers as maps, but it isn't required
|
||||||
|
that you do so: anything which, given a keyword, will return a processor, will
|
||||||
|
work.
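
As a rough illustration of that last point, here is a minimal sketch of a dispatcher written as a plain function rather than a map. It is self-contained, so it does not call the library: `toy-process` is a simplified stand-in modelled on the `process` function in `html-to-md.transformer`, and `shout` and `heading-dispatcher` are hypothetical names invented for this example.

```clojure
(require '[clojure.string :as s])

;; Simplified stand-in for html-to-md.transformer/process, for illustration
;; only; the real function may differ in detail.
(defn toy-process
  [element dispatcher]
  (cond
    ;; An element with a tag: ask the dispatcher for a processor.
    (:tag element)
    (if-let [processor (dispatcher (:tag element))]
      (processor element dispatcher)
      ;; No processor for this tag: just process the children.
      (map #(toy-process % dispatcher) (:content element)))
    (string? element) element
    (sequential? element)
    (remove nil? (map #(toy-process % dispatcher) element))))

;; Hypothetical processor: render the element's text in upper case.
(defn shout
  [e d]
  (s/upper-case (apply str (flatten (toy-process (:content e) d)))))

;; A dispatcher as a function: any tag named h1..h6 gets `shout`,
;; every other tag gets nil (so its children are processed directly).
(defn heading-dispatcher
  [tag]
  (when (re-matches #"h[1-6]" (name tag))
    shout))

(def doc
  {:tag :div
   :content [{:tag :h2 :content ["Title"]} " body"]})

(def result
  (apply str (flatten (toy-process doc heading-dispatcher))))
```

Because a Clojure map is itself a function of its keys, the same `toy-process` would accept a map such as `{:h2 shout}` unchanged; the function form is only worth the trouble when the choice of processor follows a rule, as with the heading pattern here.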
|
||||||
|
|
||||||
|
## License
|
||||||
|
|
||||||
|
Copyright © 2019 Simon Brooke <simon@journeyman.cc>
|
||||||
|
|
||||||
|
Distributed under the Eclipse Public License either version 1.0 or (at
|
||||||
|
your option) any later version.
|
||||||
|
|
||||||
|
|
|
@ -1,10 +1,11 @@
|
||||||
(defproject html-to-md "0.1.0"
|
(defproject html-to-md "0.2.0"
|
||||||
:description "Convert (Enlivened) HTML to markdown; but, more generally, a framework for [HT|SG|X]ML transformation."
|
:description "Convert (Enlivened) HTML to markdown; but, more generally, a framework for [HT|SG|X]ML transformation."
|
||||||
:url "https://github.com/simon-brooke/html-to-md"
|
:url "https://github.com/simon-brooke/html-to-md"
|
||||||
:license {:name "Eclipse Public License"
|
:license {:name "Eclipse Public License"
|
||||||
:url "http://www.eclipse.org/legal/epl-v10.html"}
|
:url "http://www.eclipse.org/legal/epl-v10.html"}
|
||||||
:dependencies [[org.clojure/clojure "1.8.0"]
|
:dependencies [[org.clojure/clojure "1.8.0"]
|
||||||
[enlive "1.1.6"]]
|
[enlive "1.1.6"]]
|
||||||
:plugins [[lein-codox "0.10.3"]]
|
:plugins [[lein-codox "0.10.3"]
|
||||||
|
[lein-release "1.0.5"]]
|
||||||
:lein-release {:deploy-via :clojars}
|
:lein-release {:deploy-via :clojars}
|
||||||
:signing {:gpg-key "Simon Brooke (Stultus in monte) <simon@journeyman.cc>"})
|
:signing {:gpg-key "Simon Brooke (Stultus in monte) <simon@journeyman.cc>"})
|
||||||
|
|
41
src/html_to_md/blogger_to_md.clj
Normal file
|
@ -0,0 +1,41 @@
|
||||||
|
(ns html-to-md.blogger-to-md
|
||||||
|
(:require [clojure.string :as s]
|
||||||
|
[html-to-md.html-to-md :refer [markdown-dispatcher markdown-header]]
|
||||||
|
[html-to-md.transformer :refer [process]]
|
||||||
|
[net.cgrand.enlive-html :as html]))
|
||||||
|
|
||||||
|
(defn blogger-scraper
|
||||||
|
"Processor which scrapes the actual post content out of a blogger page.
|
||||||
|
*NOTE:* This was written to scrape *my* blogger pages, yours may be
|
||||||
|
different!"
|
||||||
|
[e d]
|
||||||
|
(let [title (first (html/select e [:h3.post-title]))
|
||||||
|
content (html/select e [:div.post-body])]
|
||||||
|
(if (and title content)
|
||||||
|
(apply
|
||||||
|
str
|
||||||
|
(cons
|
||||||
|
(markdown-header title d 1)
|
||||||
|
(process content d))))))
|
||||||
|
|
||||||
|
(defn image-table-processor
|
||||||
|
"Blogger's horrible tag soup wraps images in tables. Is this table such
|
||||||
|
a table? If so extract the image from it and process it to markdown;
|
||||||
|
otherwise, fall back on what `markdown-dispatcher` would do with the
|
||||||
|
table (which is currently nothing, but that will change)."
|
||||||
|
[e d]
|
||||||
|
(let [caption (process (first (html/select e [:td.tr-caption])) d)
|
||||||
|
alt (if caption (s/trim (apply str caption)))
|
||||||
|
image (first (html/select e [:img]))
|
||||||
|
src (if image (-> image :attrs :src))]
|
||||||
|
(if image
|
||||||
|
(str "![" alt "](" src ")")
|
||||||
|
(process e markdown-dispatcher))))
|
||||||
|
|
||||||
|
|
||||||
|
(def blogger-dispatcher
|
||||||
|
"Adaptation of `markdown-dispatcher`, q.v., with the `:table`, `:h3` and
|
||||||
|
`:html` dispatches overridden."
|
||||||
|
(assoc markdown-dispatcher
|
||||||
|
:html blogger-scraper
|
||||||
|
:table image-table-processor))
|
|
@ -1,6 +1,21 @@
|
||||||
(ns html-to-md.core)
|
(ns html-to-md.core
|
||||||
|
(:require [html-to-md.transformer :refer [transform process]]
|
||||||
|
[html-to-md.html-to-md :refer [markdown-dispatcher]]
|
||||||
|
[html-to-md.blogger-to-md :refer [blogger-dispatcher]]))
|
||||||
|
|
||||||
(defn foo
|
(defn html-to-md
|
||||||
"I don't do a whole lot."
|
"Transform the HTML document referenced by `url` into Markdown, and write
|
||||||
[x]
|
it to `output`, if supplied."
|
||||||
(println x "Hello, World!"))
|
([url]
|
||||||
|
(apply str (transform url markdown-dispatcher)))
|
||||||
|
([url output]
|
||||||
|
(spit output (html-to-md url))))
|
||||||
|
|
||||||
|
(defn blogger-to-md
|
||||||
|
"Transform the Blogger post referenced by `url` into Markdown, and write
|
||||||
|
it to `output`, if supplied. *NOTE:* This was written to scrape *my*
|
||||||
|
blogger pages, yours may be different!"
|
||||||
|
([url]
|
||||||
|
(apply str (transform url blogger-dispatcher)))
|
||||||
|
([url output]
|
||||||
|
(spit output (blogger-to-md url))))
|
||||||
|
|
|
@ -7,15 +7,18 @@
|
||||||
(defn markdown-a
|
(defn markdown-a
|
||||||
"Process the anchor element `e` into markdown, using dispatcher `d`."
|
"Process the anchor element `e` into markdown, using dispatcher `d`."
|
||||||
[e d]
|
[e d]
|
||||||
(apply
|
(str
|
||||||
str
|
"["
|
||||||
(flatten
|
(s/trim (apply str (process (:content e) d)))
|
||||||
(list
|
"]("
|
||||||
"["
|
(-> e :attrs :href)
|
||||||
(map #(process % d) (:content e))
|
")"))
|
||||||
"]("
|
|
||||||
(-> e :attrs :href)
|
(defn markdown-br
|
||||||
")"))))
|
"Process the line-break element `e`, so beloved of tag-soupers, into
|
||||||
|
markdown"
|
||||||
|
[e d]
|
||||||
|
"\n")
|
||||||
|
|
||||||
(defn markdown-code
|
(defn markdown-code
|
||||||
"Process the code or samp `e` into markdown, using dispatcher `d`."
|
"Process the code or samp `e` into markdown, using dispatcher `d`."
|
||||||
|
@ -51,15 +54,12 @@
|
||||||
"Process the header element `e` into markdown, with level `level`,
|
"Process the header element `e` into markdown, with level `level`,
|
||||||
using dispatcher `d`."
|
using dispatcher `d`."
|
||||||
[e d level]
|
[e d level]
|
||||||
(apply
|
(str
|
||||||
str
|
"\n"
|
||||||
(flatten
|
(apply str (take level (repeat "#")))
|
||||||
(list
|
" "
|
||||||
"\n"
|
(s/trim (apply str (process (:content e) d)))
|
||||||
(take level (repeat "#"))
|
"\n"))
|
||||||
" "
|
|
||||||
(map #(process % d) (:content e))
|
|
||||||
"\n"))))
|
|
||||||
|
|
||||||
(defn markdown-h1
|
(defn markdown-h1
|
||||||
"Process the header element `e` into markdown, with level 1, using
|
"Process the header element `e` into markdown, with level 1, using
|
||||||
|
@ -105,7 +105,7 @@
|
||||||
(defn markdown-img
|
(defn markdown-img
|
||||||
"Process this image element `e` into markdown, using dispatcher `d`."
|
"Process this image element `e` into markdown, using dispatcher `d`."
|
||||||
[e d]
|
[e d]
|
||||||
(str " ")"))
|
(str "![" (-> e :attrs :alt) "](" (-> e :attrs :src) ")"))
|
||||||
|
|
||||||
(defn markdown-ol
|
(defn markdown-ol
|
||||||
"Process this ordered list element `e` into markdown, using dispatcher
|
"Process this ordered list element `e` into markdown, using dispatcher
|
||||||
|
@ -120,10 +120,15 @@
|
||||||
str
|
str
|
||||||
(flatten
|
(flatten
|
||||||
(list "\n" (inc %2) ". " (process %1 d))))
|
(list "\n" (inc %2) ". " (process %1 d))))
|
||||||
(:content e)
|
(html/select e [:li])
|
||||||
(range))))
|
(range))))
|
||||||
"\n\n"))
|
"\n\n"))
|
||||||
|
|
||||||
|
(defn markdown-omit
|
||||||
|
"Don't process the element `e` into markdown, but return `nil`."
|
||||||
|
[e d]
|
||||||
|
nil)
|
||||||
|
|
||||||
(defn markdown-pre
|
(defn markdown-pre
|
||||||
"Process the preformatted emphasis element `e` into markdown, using
|
"Process the preformatted emphasis element `e` into markdown, using
|
||||||
dispatcher `d`."
|
dispatcher `d`."
|
||||||
|
@ -155,13 +160,15 @@
|
||||||
str
|
str
|
||||||
(flatten
|
(flatten
|
||||||
(list "\n* " (process % d))))
|
(list "\n* " (process % d))))
|
||||||
(:content e))))
|
(html/select e [:li]))))
|
||||||
"\n\n"))
|
"\n\n"))
|
||||||
|
|
||||||
|
|
||||||
(def markdown-dispatcher
|
(def markdown-dispatcher
|
||||||
|
"A despatcher for transforming (X)HTML into Markdown."
|
||||||
{:a markdown-a
|
{:a markdown-a
|
||||||
:b markdown-strong
|
:b markdown-strong
|
||||||
|
:br markdown-br
|
||||||
:code markdown-code
|
:code markdown-code
|
||||||
:body markdown-default
|
:body markdown-default
|
||||||
:div markdown-div
|
:div markdown-div
|
||||||
|
@ -179,8 +186,10 @@
|
||||||
:p markdown-div
|
:p markdown-div
|
||||||
:pre markdown-pre
|
:pre markdown-pre
|
||||||
:samp markdown-code
|
:samp markdown-code
|
||||||
|
:script markdown-omit
|
||||||
:span markdown-default
|
:span markdown-default
|
||||||
:strong markdown-strong
|
:strong markdown-strong
|
||||||
|
:style markdown-omit
|
||||||
:ul markdown-ul
|
:ul markdown-ul
|
||||||
})
|
})
|
||||||
|
|
||||||
|
|
|
@ -26,26 +26,35 @@
|
||||||
(if processor
|
(if processor
|
||||||
(apply processor (list element dispatcher))
|
(apply processor (list element dispatcher))
|
||||||
(map #(process % dispatcher) (:content element))))
|
(map #(process % dispatcher) (:content element))))
|
||||||
(string? element) element))
|
|
||||||
|
(string? element) element
|
||||||
|
(or (seq? element) (vector? element))
|
||||||
|
(remove nil? (map #(process % dispatcher) element))))
|
||||||
|
|
||||||
|
(defn- transformer-dispatch
|
||||||
|
[a _]
|
||||||
|
(class a))
|
||||||
|
|
||||||
(defmulti transform
|
(defmulti transform
|
||||||
"Transform the `obj` which is my first argument using the `dispatcher`
|
"Transform the `obj` which is my first argument using the `dispatcher`
|
||||||
which is my second argument."
|
which is my second argument."
|
||||||
[class class] :default :default)
|
#'transformer-dispatch
|
||||||
|
:default :default)
|
||||||
|
|
||||||
(defmethod transform :default [obj dispatcher]
|
(defmethod transform :default [obj dispatcher]
|
||||||
(process obj dispatcher))
|
(process obj dispatcher))
|
||||||
|
|
||||||
(defmethod transform [java.net.URI Object] [uri dispatcher]
|
(defmethod transform java.net.URI [uri dispatcher]
|
||||||
(process (html/html-resource uri) dispatcher))
|
(remove nil? (process (html/html-resource uri) dispatcher)))
|
||||||
|
|
||||||
(defmethod transform [java.net.URL Object] [url dispatcher]
|
(defmethod transform java.net.URL [url dispatcher]
|
||||||
(transform (.toURI url) dispatcher))
|
(transform (.toURI url) dispatcher))
|
||||||
|
|
||||||
(defmethod transform [String Object] [s dispatcher]
|
(defmethod transform String [s dispatcher]
|
||||||
(let [url (try (java.net.URL. s) (catch Exception any))]
|
(let [url (try (java.net.URL. s) (catch Exception any))]
|
||||||
(if url (transform url dispatcher)
|
(if url (transform url dispatcher)
|
||||||
;; otherwise, if s is not a URL, consider it as an HTML fragment,
|
;; otherwise, if s is not a URL, consider it as an HTML fragment,
|
||||||
;; parse and process it
|
;; parse and process it
|
||||||
(process (tagsoup/parser (java.io.StringReader s)) dispatcher)
|
(process (tagsoup/parser (java.io.StringReader s)) dispatcher)
|
||||||
)))
|
)))
|
||||||
|
|
||||||
|
|
|
@ -73,7 +73,7 @@
|
||||||
|
|
||||||
(deftest img-test
|
(deftest img-test
|
||||||
(testing "Image tag."
|
(testing "Image tag."
|
||||||
(let [expected ""
|
(let [expected ""
|
||||||
actual (process
|
actual (process
|
||||||
{:tag :img
|
{:tag :img
|
||||||
:attrs {:src "http://foo.bar/image.png"
|
:attrs {:src "http://foo.bar/image.png"
|
||||||
|
|