Merge branch 'release/0.2.0'
commit
97351feafe
2
.gitignore
vendored
|
@ -11,3 +11,5 @@ pom.xml.asc
|
|||
.hgignore
|
||||
.hg/
|
||||
*~
|
||||
|
||||
test\.md
|
||||
|
|
47
README.md
|
@ -1,6 +1,7 @@
|
|||
# html-to-md
|
||||
|
||||
A Clojure library designed to convert (Enlivened) HTML to markdown; but, more
|
||||
A Clojure library designed to convert
|
||||
([Enlive](https://github.com/cgrand/enlive)ned) HTML to markdown; but, more
|
||||
generally, a framework for [HT|SG|X]ML transformation.
|
||||
|
||||
## Introduction
|
||||
|
@ -17,35 +18,50 @@ unit test coverage.
|
|||
|
||||
To use this library in your project, add the following leiningen dependency:
|
||||
|
||||
[org.clojars.simon_brooke/html-to-md "0.1.0"]
|
||||
[org.clojars.simon_brooke/html-to-md "0.2.0"]
|
||||
|
||||
To use it in your namespace, require:
|
||||
|
||||
[html-to-md/transformer :refer [transform process]]
|
||||
[html-to-md/html-to-md :refer [markdown-dispatcher]]
|
||||
[html-to-md.core :refer [html-to-md]]
|
||||
|
||||
For default usage, that's all you need. To play more sophisticated tricks,
|
||||
consider:
|
||||
|
||||
[html-to-md.transformer :refer [transform process]]
|
||||
[html-to-md.html-to-md :refer [markdown-dispatcher]]
|
||||
|
||||
The intended usage is as follows:
|
||||
|
||||
```clojure
|
||||
(require '[html-to-md.transformer :refer [transform]])
|
||||
(require '[html-to-md.html-to-md :refer [markdown-dispatcher]])
|
||||
(require '[html-to-md.core :refer [html-to-md]])
|
||||
|
||||
(transform URL markdown-dispatcher)
|
||||
(html-to-md url output-file)
|
||||
```
|
||||
|
||||
Where URL is any URL that references an HTML, SGML, XHTML or XML document.
|
||||
However, my fancy multi-method doesn't work yet and may well be the wrong
|
||||
approach, so for now use
|
||||
This will read (X)HTML from `url` and write Markdown to `output-file`. If
|
||||
`output-file` is not supplied, it will return the markdown as a string:
|
||||
|
||||
```clojure
|
||||
(require '[html-to-md.core :refer [html-to-md]])
|
||||
|
||||
(require '[html-to-md.transformer :refer [process]])
|
||||
(require '[html-to-md.html-to-md :refer [markdown-dispatcher]])
|
||||
(require '[net.cgrand.enlive-html :as html])
|
||||
|
||||
(process (html/html-resource URL) markdown-dispatcher)
|
||||
(def md (html-to-md url))
|
||||
```
|
||||
|
||||
If you are specifically scraping [blogger.com](https://www.blogger.com/)
|
||||
pages, you may *try* the following recipe:
|
||||
|
||||
```clojure
|
||||
(require '[html-to-md.core :refer [blogger-to-md]])
|
||||
|
||||
(blogger-to-md url output-file)
|
||||
```
|
||||
|
||||
It works for my blogger pages. However, I'm not sure to what extent the
|
||||
skinning of blogger pages is pure CSS (in which case my recipe should work
|
||||
for yours) and to what extent it's HTML templating (in which case it
|
||||
probably won't). Results are not guaranteed; if it doesn't work you get to
|
||||
keep all the pieces.
|
||||
|
||||
## Extending the transformer
|
||||
|
||||
In principle, the transformer can transform any [HT|SG|X]ML markup into any
|
||||
|
@ -66,3 +82,4 @@ Copyright © 2019 Simon Brooke <simon@journeyman.cc>
|
|||
|
||||
Distributed under the Eclipse Public License either version 1.0 or (at
|
||||
your option) any later version.
|
||||
|
||||
|
|
126
doc/intro.md
|
@ -1,3 +1,127 @@
|
|||
# Introduction to html-to-md
|
||||
|
||||
TODO: write [great documentation](http://jacobian.org/writing/what-to-write/)
|
||||
The itch I'm trying to scratch at present is to transform
|
||||
[Blogger.com](http://www.blogger.com)'s dreadful tag-soup markup into markdown;
|
||||
but my architecture for doing this is to build a completely general [HT|SG|X]ML
|
||||
transformation framework and then specialise it.
|
||||
|
||||
**WARNING:** this is presently alpha-quality code, although it does have fair
|
||||
unit test coverage.
|
||||
|
||||
## Usage
|
||||
|
||||
To use this library in your project, add the following leiningen dependency:
|
||||
|
||||
[org.clojars.simon_brooke/html-to-md "0.2.0"]
|
||||
|
||||
To use it in your namespace, require:
|
||||
|
||||
[html-to-md.core :refer [html-to-md]]
|
||||
|
||||
For default usage, that's all you need. To play more sophisticated tricks,
|
||||
consider:
|
||||
|
||||
[html-to-md.transformer :refer [transform process]]
|
||||
[html-to-md.html-to-md :refer [markdown-dispatcher]]
|
||||
|
||||
The intended usage is as follows:
|
||||
|
||||
```clojure
|
||||
(require '[html-to-md.core :refer [html-to-md]])
|
||||
|
||||
(html-to-md url output-file)
|
||||
```
|
||||
|
||||
This will read (X)HTML from `url` and write Markdown to `output-file`. If
|
||||
`output-file` is not supplied, it will return the markdown as a string:
|
||||
|
||||
```clojure
|
||||
(require '[html-to-md.core :refer [html-to-md]])
|
||||
|
||||
(def md (html-to-md url))
|
||||
```
|
||||
|
||||
If you are specifically scraping [blogger.com](https://www.blogger.com/)
|
||||
pages, you may *try* the following recipe:
|
||||
|
||||
```clojure
|
||||
(require '[html-to-md.core :refer [blogger-to-md]])
|
||||
|
||||
(blogger-to-md url output-file)
|
||||
```
|
||||
|
||||
It works for my blogger pages. However, I'm not sure to what extent the
|
||||
skinning of blogger pages is pure CSS (in which case my recipe should work
|
||||
for yours) and to what extent it's HTML templating (in which case it
|
||||
probably won't). Results are not guaranteed; if it doesn't work you get to
|
||||
keep all the pieces.
|
||||
|
||||
## Extending the transformer
|
||||
|
||||
In principle, the transformer can transform any [HT|SG|X]ML markup into any
|
||||
other, or into any textual form. To extend it to do something other than
|
||||
markdown, supply a **dispatcher**. A dispatcher is essentially a function of one
|
||||
argument, a [HT|SG|X]ML tag represented as a Clojure keyword, which returns
|
||||
a **processor**, which should be a function of two arguments, an element assumed
|
||||
to have that tag, and a dispatcher. The processor should return the value that
|
||||
you want elements of that tag transformed into.
|
||||
|
||||
Thus the `html-to-md.html-to-md` namespace comprises a number of *processor*
|
||||
functions, such as this one:
|
||||
|
||||
```clojure
|
||||
(defn markdown-a
|
||||
"Process the anchor element `e` into markdown, using dispatcher `d`."
|
||||
[e d]
|
||||
(str
|
||||
"["
|
||||
(s/trim (apply str (process (:content e) d)))
|
||||
"]("
|
||||
(-> e :attrs :href)
|
||||
")"))
|
||||
```
|
||||
|
||||
and a *dispatcher* map:
|
||||
|
||||
```clojure
|
||||
(def markdown-dispatcher
|
||||
"A despatcher for transforming (X)HTML into Markdown."
|
||||
{:a markdown-a
|
||||
:b markdown-strong
|
||||
:br markdown-br
|
||||
:code markdown-code
|
||||
:body markdown-default
|
||||
:div markdown-div
|
||||
:em markdown-em
|
||||
:h1 markdown-h1
|
||||
:h2 markdown-h2
|
||||
:h3 markdown-h3
|
||||
:h4 markdown-h4
|
||||
:h5 markdown-h5
|
||||
:h6 markdown-h6
|
||||
:html markdown-html
|
||||
:i markdown-em
|
||||
:img markdown-img
|
||||
:ol markdown-ol
|
||||
:p markdown-div
|
||||
:pre markdown-pre
|
||||
:samp markdown-code
|
||||
:script markdown-omit
|
||||
:span markdown-default
|
||||
:strong markdown-strong
|
||||
:style markdown-omit
|
||||
:ul markdown-ul
|
||||
})
|
||||
```
|
||||
|
||||
Obviously it is convenient to write dispatchers as maps, but it isn't required
|
||||
that you do so: anything which, given a keyword, will return a processor, will
|
||||
work.
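
To make that last point concrete, here is a minimal sketch of a dispatcher written as a plain function rather than a map. The `process` defined here is a simplified stand-in modelled on the transformer hunk in this commit, not the library's actual function; `shout`, `text-dispatcher` and `doc` are hypothetical names used only for illustration.

```clojure
(require '[clojure.string :as s])

(defn process
  "Simplified stand-in for the library's `process` (an assumption): look up a
  processor for the element's tag; if there is none, recurse into the content."
  [element dispatcher]
  (cond
    (map? element)
    (if-let [processor (dispatcher (:tag element))]
      (processor element dispatcher)
      (map #(process % dispatcher) (:content element)))
    (string? element) element
    (sequential? element)
    (remove nil? (map #(process % dispatcher) element))))

(defn shout
  "Hypothetical processor: upper-case the text content of element `e`."
  [e d]
  (s/upper-case (apply str (flatten (process (:content e) d)))))

(defn text-dispatcher
  "Hypothetical dispatcher written as a function: given a tag keyword,
  return a processor, or nil to fall through to the element's content."
  [tag]
  (case tag
    (:h1 :h2 :h3) shout
    (:script :style) (fn [_ _] nil)
    nil))

;; An Enlive-style element tree, built by hand for the example.
(def doc
  {:tag :div
   :content [{:tag :h1 :content ["Title"]}
             {:tag :script :content ["alert(1)"]}
             {:tag :p :content ["Body " {:tag :em :content ["text"]}]}]})

(println (apply str (flatten (process doc text-dispatcher))))
```

Because the dispatcher is only ever called with a tag keyword, a map, a `case` expression or any other function of one argument can be swapped in without touching the processors.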
|
||||
|
||||
## License
|
||||
|
||||
Copyright © 2019 Simon Brooke <simon@journeyman.cc>
|
||||
|
||||
Distributed under the Eclipse Public License either version 1.0 or (at
|
||||
your option) any later version.
|
||||
|
||||
|
|
|
@ -1,10 +1,11 @@
|
|||
(defproject html-to-md "0.1.0"
|
||||
(defproject html-to-md "0.2.0"
|
||||
:description "Convert (Enlivened) HTML to markdown; but, more generally, a framework for [HT|SG|X]ML transformation."
|
||||
:url "https://github.com/simon-brooke/html-to-md"
|
||||
:license {:name "Eclipse Public License"
|
||||
:url "http://www.eclipse.org/legal/epl-v10.html"}
|
||||
:dependencies [[org.clojure/clojure "1.8.0"]
|
||||
[enlive "1.1.6"]]
|
||||
:plugins [[lein-codox "0.10.3"]]
|
||||
:plugins [[lein-codox "0.10.3"]
|
||||
[lein-release "1.0.5"]]
|
||||
:lein-release {:deploy-via :clojars}
|
||||
:signing {:gpg-key "Simon Brooke (Stultus in monte) <simon@journeyman.cc>"})
|
||||
|
|
41
src/html_to_md/blogger_to_md.clj
Normal file
|
@ -0,0 +1,41 @@
|
|||
(ns html-to-md.blogger-to-md
|
||||
(:require [clojure.string :as s]
|
||||
[html-to-md.html-to-md :refer [markdown-dispatcher markdown-header]]
|
||||
[html-to-md.transformer :refer [process]]
|
||||
[net.cgrand.enlive-html :as html]))
|
||||
|
||||
(defn blogger-scraper
|
||||
"Processor which scrapes the actual post content out of a blogger page.
|
||||
*NOTE:* This was written to scrape *my* blogger pages, yours may be
|
||||
different!"
|
||||
[e d]
|
||||
(let [title (first (html/select e [:h3.post-title]))
|
||||
content (html/select e [:div.post-body])]
|
||||
(if (and title content)
|
||||
(apply
|
||||
str
|
||||
(cons
|
||||
(markdown-header title d 1)
|
||||
(process content d))))))
|
||||
|
||||
(defn image-table-processor
|
||||
"Blogger's horrible tag soup wraps images in tables. Is this table such
|
||||
a table? If so extract the image from it and process it to markdown;
|
||||
otherwise, fall back on what `markdown-dispatcher` would do with the
|
||||
table (which is currently nothing, but that will change)."
|
||||
[e d]
|
||||
(let [caption (process (first (html/select e [:td.tr-caption])) d)
|
||||
alt (if caption (s/trim (apply str caption)))
|
||||
image (first (html/select e [:img]))
|
||||
src (if image (-> image :attrs :src))]
|
||||
(if image
|
||||
(str "![" alt "](" src ")")
|
||||
(process e markdown-dispatcher))))
|
||||
|
||||
|
||||
(def blogger-dispatcher
|
||||
"Adaptation of `markdown-dispatcher`, q.v., with the `:table` and
|
||||
`:html` dispatches overridden."
|
||||
(assoc markdown-dispatcher
|
||||
:html blogger-scraper
|
||||
:table image-table-processor))
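
The `assoc` pattern used for `blogger-dispatcher` can be tried in isolation. The following is a sketch only: `base-dispatcher`, `custom-dispatcher` and `render` are hypothetical names, and the processors are deliberately trivial so that the example does not depend on the library.

```clojure
(def base-dispatcher
  "Hypothetical base dispatcher: tag keyword -> processor of [element dispatcher]."
  {:em     (fn [e _] (str "*" (apply str (:content e)) "*"))
   :strong (fn [e _] (str "**" (apply str (:content e)) "**"))})

(def custom-dispatcher
  "Derived dispatcher: overrides `:em`, adds `:code`, inherits `:strong`."
  (assoc base-dispatcher
         :em   (fn [e _] (str "_" (apply str (:content e)) "_"))
         :code (fn [e _] (str "`" (apply str (:content e)) "`"))))

(defn render
  "Hypothetical helper: apply the processor for `e`'s tag if the dispatcher
  has one, otherwise return the element's content as a string."
  [e d]
  (if-let [processor (d (:tag e))]
    (processor e d)
    (apply str (:content e))))

(println (render {:tag :em :content ["x"]} base-dispatcher))
(println (render {:tag :em :content ["x"]} custom-dispatcher))
```

Since maps are immutable, the base dispatcher is unchanged by the `assoc`, so both variants can be used side by side.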
|
|
@ -1,6 +1,21 @@
|
|||
(ns html-to-md.core)
|
||||
(ns html-to-md.core
|
||||
(:require [html-to-md.transformer :refer [transform process]]
|
||||
[html-to-md.html-to-md :refer [markdown-dispatcher]]
|
||||
[html-to-md.blogger-to-md :refer [blogger-dispatcher]]))
|
||||
|
||||
(defn foo
|
||||
"I don't do a whole lot."
|
||||
[x]
|
||||
(println x "Hello, World!"))
|
||||
(defn html-to-md
|
||||
"Transform the HTML document referenced by `url` into Markdown, and write
|
||||
it to `output`, if supplied."
|
||||
([url]
|
||||
(apply str (transform url markdown-dispatcher)))
|
||||
([url output]
|
||||
(spit output (html-to-md url))))
|
||||
|
||||
(defn blogger-to-md
|
||||
"Transform the Blogger post referenced by `url` into Markdown, and write
|
||||
it to `output`, if supplied. *NOTE:* This was written to scrape *my*
|
||||
blogger pages, yours may be different!"
|
||||
([url]
|
||||
(apply str (transform url blogger-dispatcher)))
|
||||
([url output]
|
||||
(spit output (blogger-to-md url))))
|
||||
|
|
|
@ -7,15 +7,18 @@
|
|||
(defn markdown-a
|
||||
"Process the anchor element `e` into markdown, using dispatcher `d`."
|
||||
[e d]
|
||||
(apply
|
||||
str
|
||||
(flatten
|
||||
(list
|
||||
"["
|
||||
(map #(process % d) (:content e))
|
||||
"]("
|
||||
(-> e :attrs :href)
|
||||
")"))))
|
||||
(str
|
||||
"["
|
||||
(s/trim (apply str (process (:content e) d)))
|
||||
"]("
|
||||
(-> e :attrs :href)
|
||||
")"))
|
||||
|
||||
(defn markdown-br
|
||||
"Process the line-break element `e`, so beloved of tag-soupers, into
|
||||
markdown"
|
||||
[e d]
|
||||
"\n")
|
||||
|
||||
(defn markdown-code
|
||||
"Process the code or samp `e` into markdown, using dispatcher `d`."
|
||||
|
@ -51,15 +54,12 @@
|
|||
"Process the header element `e` into markdown, with level `level`,
|
||||
using dispatcher `d`."
|
||||
[e d level]
|
||||
(apply
|
||||
str
|
||||
(flatten
|
||||
(list
|
||||
"\n"
|
||||
(take level (repeat "#"))
|
||||
" "
|
||||
(map #(process % d) (:content e))
|
||||
"\n"))))
|
||||
(str
|
||||
"\n"
|
||||
(apply str (take level (repeat "#")))
|
||||
" "
|
||||
(s/trim (apply str (process (:content e) d)))
|
||||
"\n"))
|
||||
|
||||
(defn markdown-h1
|
||||
"Process the header element `e` into markdown, with level 1, using
|
||||
|
@ -105,7 +105,7 @@
|
|||
(defn markdown-img
|
||||
"Process this image element `e` into markdown, using dispatcher `d`."
|
||||
[e d]
|
||||
(str " ")"))
|
||||
(str " ")"))
|
||||
|
||||
(defn markdown-ol
|
||||
"Process this ordered list element `e` into markdown, using dispatcher
|
||||
|
@ -120,10 +120,15 @@
|
|||
str
|
||||
(flatten
|
||||
(list "\n" (inc %2) ". " (process %1 d))))
|
||||
(:content e)
|
||||
(html/select e [:li])
|
||||
(range))))
|
||||
"\n\n"))
|
||||
|
||||
(defn markdown-omit
|
||||
"Don't process the element `e` into markdown, but return `nil`."
|
||||
[e d]
|
||||
nil)
|
||||
|
||||
(defn markdown-pre
|
||||
"Process the preformatted emphasis element `e` into markdown, using
|
||||
dispatcher `d`."
|
||||
|
@ -155,13 +160,15 @@
|
|||
str
|
||||
(flatten
|
||||
(list "\n* " (process % d))))
|
||||
(:content e))))
|
||||
(html/select e [:li]))))
|
||||
"\n\n"))
|
||||
|
||||
|
||||
(def markdown-dispatcher
|
||||
"A despatcher for transforming (X)HTML into Markdown."
|
||||
{:a markdown-a
|
||||
:b markdown-strong
|
||||
:br markdown-br
|
||||
:code markdown-code
|
||||
:body markdown-default
|
||||
:div markdown-div
|
||||
|
@ -179,8 +186,10 @@
|
|||
:p markdown-div
|
||||
:pre markdown-pre
|
||||
:samp markdown-code
|
||||
:script markdown-omit
|
||||
:span markdown-default
|
||||
:strong markdown-strong
|
||||
:style markdown-omit
|
||||
:ul markdown-ul
|
||||
})
|
||||
|
||||
|
|
|
@ -26,26 +26,35 @@
|
|||
(if processor
|
||||
(apply processor (list element dispatcher))
|
||||
(map #(process % dispatcher) (:content element))))
|
||||
(string? element) element))
|
||||
|
||||
(string? element) element
|
||||
(or (seq? element) (vector? element))
|
||||
(remove nil? (map #(process % dispatcher) element))))
|
||||
|
||||
(defn- transformer-dispatch
|
||||
[a _]
|
||||
(class a))
|
||||
|
||||
(defmulti transform
|
||||
"Transform the `obj` which is my first argument using the `dispatcher`
|
||||
which is my second argument."
|
||||
[class class] :default :default)
|
||||
#'transformer-dispatch
|
||||
:default :default)
|
||||
|
||||
(defmethod transform :default [obj dispatcher]
|
||||
(process obj dispatcher))
|
||||
|
||||
(defmethod transform [java.net.URI Object] [uri dispatcher]
|
||||
(process (html/html-resource uri) dispatcher))
|
||||
(defmethod transform java.net.URI [uri dispatcher]
|
||||
(remove nil? (process (html/html-resource uri) dispatcher)))
|
||||
|
||||
(defmethod transform [java.net.URL Object] [url dispatcher]
|
||||
(defmethod transform java.net.URL [url dispatcher]
|
||||
(transform (.toURI url) dispatcher))
|
||||
|
||||
(defmethod transform [String Object] [s dispatcher]
|
||||
(defmethod transform String [s dispatcher]
|
||||
(let [url (try (java.net.URL. s) (catch Exception any))]
|
||||
(if url (transform url dispatcher)
|
||||
;; otherwise, if s is not a URL, consider it as an HTML fragment,
|
||||
;; parse and process it
|
||||
(process (tagsoup/parser (java.io.StringReader. s)) dispatcher)
|
||||
)))
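
The dispatch change in this hunk (a dispatch function returning the class of the first argument, with methods keyed on a single class) can be sketched on its own. `describe` and its return values are hypothetical and stand in for `transform`; the sketch only assumes standard `defmulti`/`defmethod` behaviour and does not call Enlive.

```clojure
(defmulti describe
  "Hypothetical multimethod: dispatch on the class of the first argument only."
  (fn [obj _] (class obj))
  :default :default)

;; Anything without a more specific method ends up here.
(defmethod describe :default [obj _]
  [:other obj])

(defmethod describe java.net.URI [uri _]
  [:uri (str uri)])

;; URLs are converted to URIs and re-dispatched.
(defmethod describe java.net.URL [url opts]
  (describe (.toURI url) opts))

;; Strings are treated as URLs when they parse as one, else as fragments.
(defmethod describe String [s opts]
  (if-let [url (try (java.net.URL. s)
                    (catch java.net.MalformedURLException _ nil))]
    (describe url opts)
    [:fragment s]))

(println (describe "http://example.com/a" nil))
(println (describe "<p>hi</p>" nil))
(println (describe 42 nil))
```

Keying methods on a single class rather than a `[class Object]` vector keeps the dispatch values in line with what the dispatch function actually returns.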
|
||||
|
||||
|
|
|
@ -73,7 +73,7 @@
|
|||
|
||||
(deftest img-test
|
||||
(testing "Image tag."
|
||||
(let [expected ""
|
||||
(let [expected ""
|
||||
actual (process
|
||||
{:tag :img
|
||||
:attrs {:src "http://foo.bar/image.png"
|
||||
|
|