Compare commits

9 commits

Author SHA1 Message Date
Simon Brooke 7f2ccd2d29
Merge pull request #1 from johanmynhardt/develop
Fix: java.lang.ClassCastException when source `obj` is (X)HTML.
2022-12-27 11:51:41 +00:00
Johan Mynhardt (MEA) 08af659366 Remove unused import. 2022-05-26 21:24:56 +02:00
Johan Mynhardt ac80507b5f Update transformer_test.clj
Use more appropriate test without `str/trim`.
2022-05-26 04:32:23 +02:00
Johan Mynhardt (MEA) 44b28902db Fix: java.lang.ClassCastException
Use trailing dot in constructing the StringReader:

`(java.io.StringReader. s)`
2022-05-21 16:26:28 +02:00
Johan Mynhardt (MEA) 4aa6bf978f Add test for transform java.lang.ClassCastException
When `obj` argument is a string (X)HTML payload and
not a string URL or URI, the following exception is
thrown:

```clojure
java.lang.ClassCastException: class java.lang.Class cannot be cast to class clojure.lang.IFn
```
2022-05-21 16:19:57 +02:00
Simon Brooke ebd6230bdb Further README improvement. 2019-05-01 14:23:45 +01:00
Simon Brooke f69fb619cb Replaced README with a pointer to new documentation. 2019-05-01 14:20:39 +01:00
Simon Brooke 10d8574ace Upversioned to 0.4.0-SNAPSHOT 2019-05-01 14:11:36 +01:00
Simon Brooke 916c5d4f36 Merge branch 'release/0.3.0' into develop 2019-05-01 14:10:29 +01:00
4 changed files with 16 additions and 81 deletions

@ -4,82 +4,9 @@ A Clojure library designed to convert
 ([Enlive](https://github.com/cgrand/enlive)ned) HTML to markdown; but, more
 generally, a framework for [HT|SG|X]ML transformation.
-## Introduction
+[Documentation is here](https://simon-brooke.github.io/html-to-md/). In
+particular, please read the
+[introduction](https://simon-brooke.github.io/html-to-md/intro.html), which
+contains everything you want to know.
-The itch I'm trying to scratch at present is to transform
-[Blogger.com](http://www.blogger.com)'s dreadful tag-soup markup into markdown;
-but my architecture for doing this is to build a completely general [HT|SG|X]ML
-transformation framework and then specialise it.
-**WARNING:** this is presently alpha-quality code, although it does have fair
-unit test coverage.
-## Usage
-To use this library in your project, add the following leiningen dependency:
-[org.clojars.simon_brooke/html-to-md "0.3.0"]
-To use it in your namespace, require:
-[html-to-md.core :refer [html-to-md]]
-For default usage, that's all you need. To play more sophisticated tricks,
-consider:
-[html-to-md.transformer :refer [transform process]]
-[html-to-md.html-to-md :refer [markdown-dispatcher]]
-The intended usage is as follows:
-```clojure
-(require '[html-to-md.core :refer [html-to-md]])
-(html-to-md url output-file)
-```
-This will read (X)HTML from `url` and write Markdown to `output-file`. If
-`output-file` is not supplied, it will return the markdown as a string:
-```clojure
-(require '[html-to-md.core :refer [html-to-md]])
-(def md (html-to-md url))
-```
-If you are specifically scraping [blogger.com](https://www.blogger.com/")
-pages, you may *try* the following recipe:
-```clojure
-(require '[html-to-md.core :refer [blogger-to-md]])
-(blogger-to-md url output-file)
-```
-It works for my blogger pages. However, I'm not sure to what extent the
-skinning of blogger pages is pure CSS (in which case my recipe should work
-for yours) and to what extent it's HTML templating (in which case it
-probably won't). Results not guaranteed, if it doesn't work you get to
-keep all the pieces.
-## Extending the transformer
-In principle, the transformer can transform any [HT|SG|X]ML markup into any
-other, or into any textual form. To extend it to do something other than
-markdown, supply a **dispatcher**. A dispatcher is essentially a function of one
-argument, a [HT|SG|X]ML tag represented as a Clojure keyword, which returns
-a **processor,** which should be a function of two arguments, an element assumed
-to have that tag, and a dispatcher. The processor should return the value that
-you want elements of that tag transformed into.
-Obviously it is convenient to write dispatchers as maps, but it isn't required
-that you do so: anything which, given a keyword, will return a processor, will
-work.
-## License
-Copyright © 2019 Simon Brooke <simon@journeyman.cc>
-Distributed under the Eclipse Public License either version 1.0 or (at
-your option) any later version.
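
The dispatcher/processor contract described in the README text above (a function from tag keyword to processor, where a processor takes an element and a dispatcher) can be illustrated with a small standalone sketch. It does not use the library: `process-sketch`, `children-text` and `sketch-dispatcher` are hypothetical names invented here, and the element shape (a map with `:tag`, `:attrs` and `:content`, as Enlive produces) is an assumption about what the real processors receive.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in for the library's own processing function:
;; look up the processor for the element's tag and call it with the
;; element and the dispatcher. Strings pass through unchanged.
(defn process-sketch
  [element dispatcher]
  (cond
    (string? element) element
    (map? element)
    (if-let [processor (dispatcher (:tag element))]
      (processor element dispatcher)
      ;; No processor for this tag: fall back to the children's text.
      (str/join (map #(process-sketch % dispatcher) (:content element))))
    :else ""))

;; Helper for processors: transform all children and concatenate them.
(defn children-text
  [element dispatcher]
  (str/join (map #(process-sketch % dispatcher) (:content element))))

;; A dispatcher written as a map: a Clojure map is itself a function
;; of its keys, so (sketch-dispatcher :h1) returns the :h1 processor.
(def sketch-dispatcher
  {:h1 (fn [element dispatcher]
         (str "\n# " (children-text element dispatcher) "\n"))
   :em (fn [element dispatcher]
         (str "*" (children-text element dispatcher) "*"))})

(process-sketch
 {:tag :h1 :attrs nil
  :content ["Hello " {:tag :em :attrs nil :content ["world"]}]}
 sketch-dispatcher)
;; => "\n# Hello *world*\n"
```

Because the processor receives the dispatcher as its second argument, it can recurse into child elements without any global state, which is presumably why the contract passes it along.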

@ -1,4 +1,4 @@
-(defproject html-to-md "0.3.0"
+(defproject html-to-md "0.4.0-SNAPSHOT"
 :description "Convert (Enlivened) HTML to markdown; but, more generally, a framework for [HT|SG|X]ML transformation."
 :url "https://github.com/simon-brooke/html-to-md"
 :license {:name "Eclipse Public License"

@ -93,6 +93,4 @@
 (if url (transform url dispatcher)
 ;; otherwise, if s is not a URL, consider it as an HTML fragment,
 ;; parse and process it
-(process (tagsoup/parser (java.io.StringReader s)) dispatcher)
-)))
+(process (tagsoup/parser (java.io.StringReader. s)) dispatcher))))
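
Why the single trailing dot matters can be shown without the library. In Clojure, `(java.io.StringReader. s)` is constructor syntax, whereas `(java.io.StringReader s)` evaluates the symbol to the `Class` object and then tries to call it as a function, which matches the `ClassCastException` quoted in the commit message. The sketch below uses two hypothetical helper names, `call-class-as-fn` and `read-all`, purely for illustration.

```clojure
;; Without the trailing dot the class itself is placed in function
;; position, so the call fails at runtime with a ClassCastException
;; (java.lang.Class cannot be cast to clojure.lang.IFn).
(defn call-class-as-fn
  [s]
  (try
    (java.io.StringReader s)
    (catch ClassCastException _
      :class-cast-exception)))

;; With the trailing dot the form is a constructor call, equivalent
;; to (new java.io.StringReader s). `slurp` reads the Reader back
;; into a string so the result is easy to inspect.
(defn read-all
  [s]
  (slurp (java.io.StringReader. s)))

(call-class-as-fn "<h1>This is a header</h1>")
;; => :class-cast-exception

(read-all "<h1>This is a header</h1>")
;; => "<h1>This is a header</h1>"
```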

@ -0,0 +1,10 @@
+(ns html-to-md.transformer-test
+(:require
+[clojure.test :as t :refer [deftest is testing]]
+[html-to-md.html-to-md :refer [markdown-dispatcher]]
+[html-to-md.transformer :refer [transform]]))
+(deftest transform-payload
+(testing "String `obj` for: 3. A string representation of an (X)HTML fragment;"
+(is (= '("\n# This is a header\n")
+(transform "<h1>This is a header</h1>" markdown-dispatcher)))))