Compare commits
9 commits
Author | SHA1 | Date | |
---|---|---|---|
| 7f2ccd2d29 | | |
| 08af659366 | | |
| ac80507b5f | | |
| 44b28902db | | |
| 4aa6bf978f | | |
| ebd6230bdb | | |
| f69fb619cb | | |
| 10d8574ace | | |
| 916c5d4f36 | | |
81
README.md
@@ -4,82 +4,9 @@ A Clojure library designed to convert
 ([Enlive](https://github.com/cgrand/enlive)ned) HTML to markdown; but, more
 generally, a framework for [HT|SG|X]ML transformation.
 
-## Introduction
+[Documentation is here](https://simon-brooke.github.io/html-to-md/). In
+particular, please read the
+[introduction](https://simon-brooke.github.io/html-to-md/intro.html), which
+contains everything you want to know.
 
-The itch I'm trying to scratch at present is to transform
-[Blogger.com](http://www.blogger.com)'s dreadful tag-soup markup into markdown;
-but my architecture for doing this is to build a completely general [HT|SG|X]ML
-transformation framework and then specialise it.
-
-**WARNING:** this is presently alpha-quality code, although it does have fair
-unit test coverage.
-
-## Usage
-
-To use this library in your project, add the following leiningen dependency:
-
-[org.clojars.simon_brooke/html-to-md "0.3.0"]
-
-To use it in your namespace, require:
-
-[html-to-md.core :refer [html-to-md]]
-
-For default usage, that's all you need. To play more sophisticated tricks,
-consider:
-
-[html-to-md.transformer :refer [transform process]]
-[html-to-md.html-to-md :refer [markdown-dispatcher]]
-
-The intended usage is as follows:
-
-```clojure
-(require '[html-to-md.core :refer [html-to-md]])
-
-(html-to-md url output-file)
-```
-
-This will read (X)HTML from `url` and write Markdown to `output-file`. If
-`output-file` is not supplied, it will return the markdown as a string:
-
-```clojure
-(require '[html-to-md.core :refer [html-to-md]])
-
-(def md (html-to-md url))
-```
-
-If you are specifically scraping [blogger.com](https://www.blogger.com/")
-pages, you may *try* the following recipe:
-
-```clojure
-(require '[html-to-md.core :refer [blogger-to-md]])
-
-(blogger-to-md url output-file)
-```
-
-It works for my blogger pages. However, I'm not sure to what extent the
-skinning of blogger pages is pure CSS (in which case my recipe should work
-for yours) and to what extent it's HTML templating (in which case it
-probably won't). Results not guaranteed, if it doesn't work you get to
-keep all the pieces.
-
-## Extending the transformer
-
-In principle, the transformer can transform any [HT|SG|X]ML markup into any
-other, or into any textual form. To extend it to do something other than
-markdown, supply a **dispatcher**. A dispatcher is essentially a function of one
-argument, a [HT|SG|X]ML tag represented as a Clojure keyword, which returns
-a **processor,** which should be a function of two arguments, an element assumed
-to have that tag, and a dispatcher. The processor should return the value that
-you want elements of that tag transformed into.
-
-Obviously it is convenient to write dispatchers as maps, but it isn't required
-that you do so: anything which, given a keyword, will return a processor, will
-work.
-
-## License
-
-Copyright © 2019 Simon Brooke <simon@journeyman.cc>
-
-Distributed under the Eclipse Public License either version 1.0 or (at
-your option) any later version.
 
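The README text removed in this hunk describes a dispatcher as a function from a tag keyword to a processor, and a processor as a function of an element and a dispatcher. A minimal sketch of that contract might look like the following. It assumes Enlive-style element maps with `:tag` and `:content` keys; `process-node`, `process-children`, and the processor names are hypothetical helpers invented for illustration, not part of the library's API.

```clojure
(require '[clojure.string :as string])

(declare process-children)

;; Processor (hypothetical): keep the text of the children, drop the markup.
(defn text-processor
  [element dispatcher]
  (string/join (process-children element dispatcher)))

;; Processor (hypothetical): wrap the children's text in markdown emphasis.
(defn emphasis-processor
  [element dispatcher]
  (str "*" (string/join (process-children element dispatcher)) "*"))

;; A dispatcher written as a map: keyword in, processor out.
(def map-dispatcher
  {:em emphasis-processor
   :i  emphasis-processor})

;; A dispatcher written as a function, falling back to plain text
;; for any tag the map does not mention.
(defn fn-dispatcher
  [tag]
  (get map-dispatcher tag text-processor))

;; Hypothetical stand-in for the library's traversal: strings pass
;; through, elements go to whichever processor the dispatcher returns.
(defn process-node
  [node dispatcher]
  (if (string? node)
    node
    (if-let [processor (dispatcher (:tag node))]
      (processor node dispatcher)
      "")))

(defn process-children
  [element dispatcher]
  (map #(process-node % dispatcher) (:content element)))

(def sample
  {:tag :p :attrs nil
   :content ["Hello " {:tag :em :attrs nil :content ["world"]}]})

(def result (process-node sample fn-dispatcher))
```

Because a Clojure map is itself callable with a key, `map-dispatcher` and `fn-dispatcher` can be passed interchangeably here, which is the point the removed text makes about dispatchers not having to be maps.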
@@ -1,4 +1,4 @@
-(defproject html-to-md "0.3.0"
+(defproject html-to-md "0.4.0-SNAPSHOT"
 :description "Convert (Enlivened) HTML to markdown; but, more generally, a framework for [HT|SG|X]ML transformation."
 :url "https://github.com/simon-brooke/html-to-md"
 :license {:name "Eclipse Public License"
@@ -93,6 +93,4 @@
 (if url (transform url dispatcher)
 ;; otherwise, if s is not a URL, consider it as an HTML fragment,
 ;; parse and process it
-(process (tagsoup/parser (java.io.StringReader s)) dispatcher)
-)))
-
+(process (tagsoup/parser (java.io.StringReader. s)) dispatcher))))
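The one-character change in this hunk turns on Clojure's constructor interop: a class name followed by a dot, as in `(java.io.StringReader. s)`, calls the constructor, whereas the old form without the dot tries to call the class object as a function and fails at run time instead of building a reader. A small sketch, independent of the library (the `html` value is made up for illustration):

```clojure
(def html "<h1>This is a header</h1>")

;; Trailing dot: construct a java.io.StringReader over the string.
(def reader (java.io.StringReader. html))

;; `new` is the equivalent long form of the same constructor call.
(def same-reader (new java.io.StringReader html))

;; slurp reads a Reader to the end, so both give back the original text.
(def round-trip (slurp reader))
(def round-trip-2 (slurp same-reader))
```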
10
test/html_to_md/transformer_test.clj
Normal file
@@ -0,0 +1,10 @@
+(ns html-to-md.transformer-test
+(:require
+[clojure.test :as t :refer [deftest is testing]]
+[html-to-md.html-to-md :refer [markdown-dispatcher]]
+[html-to-md.transformer :refer [transform]]))
+
+(deftest transform-payload
+(testing "String `obj` for: 3. A string representation of an (X)HTML fragment;"
+(is (= '("\n# This is a header\n")
+(transform "<h1>This is a header</h1>" markdown-dispatcher)))))