From 10d8574ace9da64fc18a0f99d8bead7607c8b853 Mon Sep 17 00:00:00 2001
From: Simon Brooke
Date: Wed, 1 May 2019 14:11:36 +0100
Subject: [PATCH 1/8] Upversioned to 0.4.0-SNAPSHOT

---
 project.clj | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/project.clj b/project.clj
index 578d571..797c09d 100644
--- a/project.clj
+++ b/project.clj
@@ -1,4 +1,4 @@
-(defproject html-to-md "0.3.0"
+(defproject html-to-md "0.4.0-SNAPSHOT"
   :description "Convert (Enlivened) HTML to markdown; but, more generally, a framework for [HT|SG|X]ML transformation."
   :url "https://github.com/simon-brooke/html-to-md"
   :license {:name "Eclipse Public License"

From cb9966386158b671763d64736083051af6b261b4 Mon Sep 17 00:00:00 2001
From: Simon Brooke
Date: Wed, 1 May 2019 14:16:01 +0100
Subject: [PATCH 2/8] Ooops! Must remember to put regenerating docs into the release process!

---
 docs/html-to-md.blogger-to-md.html | 2 +-
 docs/html-to-md.core.html | 2 +-
 docs/html-to-md.html-to-md.html | 2 +-
 docs/html-to-md.transformer.html | 2 +-
 docs/index.html | 2 +-
 docs/intro.html | 4 ++--
 6 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/docs/html-to-md.blogger-to-md.html b/docs/html-to-md.blogger-to-md.html
index 56285f1..e14c706 100644
--- a/docs/html-to-md.blogger-to-md.html
+++ b/docs/html-to-md.blogger-to-md.html
@@ -1,3 +1,3 @@
-html-to-md.blogger-to-md documentation

html-to-md.blogger-to-md

Convert blogger posts to Markdown format, omitting all the Blogger chrome and navigation.

blogger-dispatcher

Adaptation of markdown-dispatcher, q.v., with the :table and :html dispatches overridden.

blogger-scraper

(blogger-scraper e d)

Processor which scrapes the actual post content out of a blogger page. NOTE: This was written to scrape my blogger pages, yours may be different!

image-table-processor

(image-table-processor e d)

Blogger’s horrible tag soup wraps images in tables. Is this table such a table? If so extract the image from it and process it to markdown; otherwise, fall back on what markdown-dispatcher would do with the table (which is currently nothing, but that will change).

\ No newline at end of file
+html-to-md.blogger-to-md documentation

html-to-md.blogger-to-md

Convert blogger posts to Markdown format, omitting all the Blogger chrome and navigation.

blogger-dispatcher

Adaptation of markdown-dispatcher, q.v., with the :table and :html dispatches overridden.

blogger-scraper

(blogger-scraper e d)

Processor which scrapes the actual post content out of a blogger page. NOTE: This was written to scrape my blogger pages, yours may be different!

image-table-processor

(image-table-processor e d)

Blogger’s horrible tag soup wraps images in tables. Is this table such a table? If so extract the image from it and process it to markdown; otherwise, fall back on what markdown-dispatcher would do with the table (which is currently nothing, but that will change).

\ No newline at end of file
diff --git a/docs/html-to-md.core.html b/docs/html-to-md.core.html
index c9fdebe..7f95066 100644
--- a/docs/html-to-md.core.html
+++ b/docs/html-to-md.core.html
@@ -1,3 +1,3 @@
-html-to-md.core documentation

html-to-md.core

Top level functions intended for very simple use.

blogger-to-md

(blogger-to-md url)(blogger-to-md url output)

Transform the Blogger post referenced by url into Markdown, and write it to output, if supplied. NOTE: This was written to scrape my blogger pages, yours may be different!

html-to-md

(html-to-md url)(html-to-md url output)

Transform the HTML document referenced by url into Markdown, and write it to output, if supplied.

\ No newline at end of file
+html-to-md.core documentation

html-to-md.core

Top level functions intended for very simple use.

blogger-to-md

(blogger-to-md url)(blogger-to-md url output)

Transform the Blogger post referenced by url into Markdown, and write it to output, if supplied. NOTE: This was written to scrape my blogger pages, yours may be different!

html-to-md

(html-to-md url)(html-to-md url output)

Transform the HTML document referenced by url into Markdown, and write it to output, if supplied.

\ No newline at end of file
diff --git a/docs/html-to-md.html-to-md.html b/docs/html-to-md.html-to-md.html
index 6be170d..73138d9 100644
--- a/docs/html-to-md.html-to-md.html
+++ b/docs/html-to-md.html-to-md.html
@@ -1,3 +1,3 @@
-html-to-md.html-to-md documentation

html-to-md.html-to-md

Transform general HTML to Markdown, as faithfully as is reasonably possible.

markdown-a

(markdown-a e d)

Process the anchor element e into markdown, using dispatcher d.

markdown-br

(markdown-br e d)

Process the line-break element e, so beloved of tag-soupers, into markdown

markdown-code

(markdown-code e d)

Process the code or samp e into markdown, using dispatcher d.

markdown-default

(markdown-default e d)

Process an element e for which we have no other function into markdown, using dispatcher d.

markdown-dispatcher

A dispatcher for transforming (X)HTML into Markdown.

markdown-div

(markdown-div e d)

Process the division element e into markdown, using dispatcher d.

markdown-em

(markdown-em e d)

Process the emphasis element e into markdown, using dispatcher d.

markdown-h1

(markdown-h1 e d)

Process the header element e into markdown, with level 1, using dispatcher d.

markdown-h2

(markdown-h2 e d)

Process the header element e into markdown, with level 2, using dispatcher d.

markdown-h3

(markdown-h3 e d)

Process the header element e into markdown, with level 3, using dispatcher d.

markdown-h4

(markdown-h4 e d)

Process the header element e into markdown, with level 4, using dispatcher d.

markdown-h5

(markdown-h5 e d)

Process the header element e into markdown, with level 5, using dispatcher d.

markdown-h6

(markdown-h6 e d)

Process the header element e into markdown, with level 6, using dispatcher d.

markdown-header

(markdown-header e d level)

Process the header element e into markdown, with level level, using dispatcher d.

markdown-html

(markdown-html e d)

Process this HTML element e into markdown, using dispatcher d.

markdown-img

(markdown-img e d)

Process this image element e into markdown, using dispatcher d.

markdown-ol

(markdown-ol e d)

Process this ordered list element e into markdown, using dispatcher d.

markdown-omit

(markdown-omit e d)

Don’t process the element e into markdown, but return nil.

markdown-pre

(markdown-pre e d)

Process the preformatted emphasis element e into markdown, using dispatcher d.

markdown-strong

(markdown-strong e d)

Process the strong emphasis element e into markdown, using dispatcher d.

markdown-ul

(markdown-ul e d)

Process this unordered list element e into markdown, using dispatcher d.

\ No newline at end of file
+html-to-md.html-to-md documentation

html-to-md.html-to-md

Transform general HTML to Markdown, as faithfully as is reasonably possible.

markdown-a

(markdown-a e d)

Process the anchor element e into markdown, using dispatcher d.

markdown-br

(markdown-br e d)

Process the line-break element e, so beloved of tag-soupers, into markdown

markdown-code

(markdown-code e d)

Process the code or samp e into markdown, using dispatcher d.

markdown-default

(markdown-default e d)

Process an element e for which we have no other function into markdown, using dispatcher d.

markdown-dispatcher

A dispatcher for transforming (X)HTML into Markdown.

markdown-div

(markdown-div e d)

Process the division element e into markdown, using dispatcher d.

markdown-em

(markdown-em e d)

Process the emphasis element e into markdown, using dispatcher d.

markdown-h1

(markdown-h1 e d)

Process the header element e into markdown, with level 1, using dispatcher d.

markdown-h2

(markdown-h2 e d)

Process the header element e into markdown, with level 2, using dispatcher d.

markdown-h3

(markdown-h3 e d)

Process the header element e into markdown, with level 3, using dispatcher d.

markdown-h4

(markdown-h4 e d)

Process the header element e into markdown, with level 4, using dispatcher d.

markdown-h5

(markdown-h5 e d)

Process the header element e into markdown, with level 5, using dispatcher d.

markdown-h6

(markdown-h6 e d)

Process the header element e into markdown, with level 6, using dispatcher d.

markdown-header

(markdown-header e d level)

Process the header element e into markdown, with level level, using dispatcher d.

markdown-html

(markdown-html e d)

Process this HTML element e into markdown, using dispatcher d.

markdown-img

(markdown-img e d)

Process this image element e into markdown, using dispatcher d.

markdown-ol

(markdown-ol e d)

Process this ordered list element e into markdown, using dispatcher d.

markdown-omit

(markdown-omit e d)

Don’t process the element e into markdown, but return nil.

markdown-pre

(markdown-pre e d)

Process the preformatted emphasis element e into markdown, using dispatcher d.

markdown-strong

(markdown-strong e d)

Process the strong emphasis element e into markdown, using dispatcher d.

markdown-ul

(markdown-ul e d)

Process this unordered list element e into markdown, using dispatcher d.

\ No newline at end of file
diff --git a/docs/html-to-md.transformer.html b/docs/html-to-md.transformer.html
index 5867b5f..64d28e8 100644
--- a/docs/html-to-md.transformer.html
+++ b/docs/html-to-md.transformer.html
@@ -1,6 +1,6 @@
-html-to-md.transformer documentation

html-to-md.transformer

The actual transformation engine, which is actually far more general than just something to generate Markdown. It isn’t as general as XSL-T but can nevertheless do a great deal of transformation on [HT|SG|X]ML documents.

+html-to-md.transformer documentation

html-to-md.transformer

The actual transformation engine, which is actually far more general than just something to generate Markdown. It isn’t as general as XSL-T but can nevertheless do a great deal of transformation on [HT|SG|X]ML documents.

Terminology

In this documentation the following terminology is used:

    diff --git a/docs/index.html b/docs/index.html
    index e99a971..0f58b7b 100644
    --- a/docs/index.html
    +++ b/docs/index.html
    @@ -1,3 +1,3 @@
    -Html-to-md 0.2.0

    Html-to-md 0.2.0

    Released under the Eclipse Public License

    Convert (Enlivened) HTML to markdown; but, more generally, a framework for [HT|SG|X]ML transformation.

    Installation

    To install, add the following dependency to your project or build file:

    [html-to-md "0.2.0"]

    Topics

    Namespaces

    html-to-md.blogger-to-md

    Convert blogger posts to Markdown format, omitting all the Blogger chrome and navigation.

    html-to-md.core

    Top level functions intended for very simple use.

    Public variables and functions:

    html-to-md.transformer

    The actual transformation engine, which is actually far more general than just something to generate Markdown. It isn’t as general as XSL-T but can nevertheless do a great deal of transformation on [HT|SG|X]ML documents.

    Public variables and functions:

    \ No newline at end of file
    +Html-to-md 0.3.0

    Html-to-md 0.3.0

    Released under the Eclipse Public License

    Convert (Enlivened) HTML to markdown; but, more generally, a framework for [HT|SG|X]ML transformation.

    Installation

    To install, add the following dependency to your project or build file:

    [html-to-md "0.3.0"]

    Topics

    Namespaces

    html-to-md.blogger-to-md

    Convert blogger posts to Markdown format, omitting all the Blogger chrome and navigation.

    html-to-md.core

    Top level functions intended for very simple use.

    Public variables and functions:

    html-to-md.transformer

    The actual transformation engine, which is actually far more general than just something to generate Markdown. It isn’t as general as XSL-T but can nevertheless do a great deal of transformation on [HT|SG|X]ML documents.

    Public variables and functions:

    \ No newline at end of file
    diff --git a/docs/intro.html b/docs/intro.html
    index 48afcd2..6bf0935 100644
    --- a/docs/intro.html
    +++ b/docs/intro.html
    @@ -1,11 +1,11 @@
    -Introduction to html-to-md

    Introduction to html-to-md

    +Introduction to html-to-md

    Introduction to html-to-md

    The itch I’m trying to scratch at present is to transform Blogger.com’s dreadful tag-soup markup into markdown; but my architecture for doing this is to build a completely general [HT|SG|X]ML transformation framework and then specialise it.

    WARNING: this is presently alpha-quality code, although it does have fair unit test coverage.

    Usage

    To use this library in your project, add the following leiningen dependency:

    -
    [org.clojars.simon_brooke/html-to-md "0.2.0"]
    +
    [org.clojars.simon_brooke/html-to-md "0.3.0"]
     

    To use it in your namespace, require:

    [html-to-md.core :refer [html-to-md]]
    
    From f69fb619cb419eec37400948ef60de75454ec28c Mon Sep 17 00:00:00 2001
    From: Simon Brooke 
    Date: Wed, 1 May 2019 14:20:39 +0100
    Subject: [PATCH 3/8] Replaced README with a pointer to new documentation.
    
    ---
     README.md | 78 +------------------------------------------------------
     1 file changed, 1 insertion(+), 77 deletions(-)
    
    diff --git a/README.md b/README.md
    index 0223912..791a2e0 100644
    --- a/README.md
    +++ b/README.md
    @@ -4,82 +4,6 @@ A Clojure library designed to convert
     ([Enlive](https://github.com/cgrand/enlive)ned) HTML to markdown; but, more
     generally, a framework for [HT|SG|X]ML transformation.
     
    -## Introduction
    +[Documentation is here](https://simon-brooke.github.io/html-to-md/)
     
    -The itch I'm trying to scratch at present is to transform
    -[Blogger.com](http://www.blogger.com)'s dreadful tag-soup markup into markdown;
    -but my architecture for doing this is to build a completely general [HT|SG|X]ML
    -transformation framework and then specialise it.
    -
    -**WARNING:** this is presently alpha-quality code, although it does have fair
    -unit test coverage.
    -
    -## Usage
    -
    -To use this library in your project, add the following leiningen dependency:
    -
    -    [org.clojars.simon_brooke/html-to-md "0.3.0"]
    -
    -To use it in your namespace, require:
    -
    -    [html-to-md.core :refer [html-to-md]]
    -
    -For default usage, that's all you need. To play more sophisticated tricks,
    -consider:
    -
    -    [html-to-md.transformer :refer [transform process]]
    -    [html-to-md.html-to-md :refer [markdown-dispatcher]]
    -
    -The intended usage is as follows:
    -
    -```clojure
    -(require '[html-to-md.core :refer [html-to-md]])
    -
    -(html-to-md url output-file)
    -```
    -
    -This will read (X)HTML from `url` and write Markdown to `output-file`. If
    -`output-file` is not supplied, it will return the markdown as a string:
    -
    -```clojure
    -(require '[html-to-md.core :refer [html-to-md]])
    -
    -(def md (html-to-md url))
    -```
    -
    -If you are specifically scraping [blogger.com](https://www.blogger.com/")
    -pages, you may *try* the following recipe:
    -
    -```clojure
    -(require '[html-to-md.core :refer [blogger-to-md]])
    -
    -(blogger-to-md url output-file)
    -```
    -
    -It works for my blogger pages. However, I'm not sure to what extent the
    -skinning of blogger pages is pure CSS (in which case my recipe should work
    -for yours) and to what extent it's HTML templating (in which case it
    -probably won't). Results not guaranteed, if it doesn't work you get to
    -keep all the pieces.
    -
    -## Extending the transformer
    -
    -In principle, the transformer can transform any [HT|SG|X]ML markup into any
    -other, or into any textual form. To extend it to do something other than
    -markdown, supply a **dispatcher**. A dispatcher is essentially a function of one
    -argument, a [HT|SG|X]ML tag represented as a Clojure keyword, which returns
    -a **processor,** which should be a function of two arguments, an element assumed
    -to have that tag, and a dispatcher. The processor should return the value that
    -you want elements of that tag transformed into.
    -
    -Obviously it is convenient to write dispatchers as maps, but it isn't required
    -that you do so: anything which, given a keyword, will return a processor, will
    -work.
    -
    -## License
    -
    -Copyright © 2019 Simon Brooke 
    -
    -Distributed under the Eclipse Public License either version 1.0 or (at
    -your option) any later version.
     
    
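
The dispatcher/processor contract described in the README text removed by this patch can be sketched in plain Clojure. This is a minimal illustration under stated assumptions: `toy-process`, `heading-processor`, `text-processor` and `toy-dispatcher` are hypothetical names, elements are assumed to be Enlive-style maps with `:tag` and `:content`, and none of this is the library's own `process` or `transform`.

```clojure
;; Hypothetical stand-in for the library's tree walk; NOT html-to-md's `process`.
(defn toy-process
  "Walk an Enlive-style node. Strings pass through unchanged; an element is
  handed to whatever processor the dispatcher returns for its tag."
  [node dispatcher]
  (cond
    (string? node) node
    (map? node) (when-let [processor (dispatcher (:tag node))]
                  (processor node dispatcher))
    (sequential? node) (map #(toy-process % dispatcher) node)))

;; A processor is a function of two arguments: the element and the dispatcher.
(defn heading-processor
  [element dispatcher]
  (str "# " (apply str (toy-process (:content element) dispatcher))))

(defn text-processor
  [element dispatcher]
  (apply str (toy-process (:content element) dispatcher)))

;; A map works as a dispatcher because a Clojure map is a function of its keys.
(def toy-dispatcher
  {:h1 heading-processor
   :em text-processor})

(def example
  (toy-process {:tag :h1 :content ["Hello " {:tag :em :content ["world"]}]}
               toy-dispatcher))

(println example) ; prints "# Hello world"
```

Any function from keyword to processor would serve equally well as the dispatcher here; the map is only the most convenient form.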
    From ebd6230bdbe0df8e00d49a0fa4581371e8a26e04 Mon Sep 17 00:00:00 2001
    From: Simon Brooke 
    Date: Wed, 1 May 2019 14:23:45 +0100
    Subject: [PATCH 4/8] Further README improvement.
    
    ---
     README.md | 5 ++++-
     1 file changed, 4 insertions(+), 1 deletion(-)
    
    diff --git a/README.md b/README.md
    index 791a2e0..2f50ecd 100644
    --- a/README.md
    +++ b/README.md
    @@ -4,6 +4,9 @@ A Clojure library designed to convert
     ([Enlive](https://github.com/cgrand/enlive)ned) HTML to markdown; but, more
     generally, a framework for [HT|SG|X]ML transformation.
     
    -[Documentation is here](https://simon-brooke.github.io/html-to-md/)
    +[Documentation is here](https://simon-brooke.github.io/html-to-md/). In
    +particular, please read the
    +[introduction](https://simon-brooke.github.io/html-to-md/intro.html), which
    +contains everything you want to know.
     
     
    
    From 4aa6bf978f28c38e0faebadbec9ffe40f366a52a Mon Sep 17 00:00:00 2001
    From: "Johan Mynhardt (MEA)" 
    Date: Sat, 21 May 2022 16:19:57 +0200
    Subject: [PATCH 5/8] Add test for `transform` java.lang.ClassCastException
    
    When `obj` argument is a string (X)HTML payload and
    not a string URL or URI, the following exception is
    thrown:
    
    ```clojure
    java.lang.ClassCastException: class java.lang.Class cannot be cast to class clojure.lang.IFn
    ```
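
The failure can be reproduced with plain Java interop, without calling html-to-md at all. The following is a sketch only; the `fragment` string and the names `failure` and `reader` are illustrative assumptions, not code from the library.

```clojure
;; Illustrative input; any string would do.
(def fragment "<h1>This is a header</h1>")

;; Without the trailing dot, the symbol evaluates to the Class object, and
;; Clojure then tries to invoke that Class as a function.
(def failure
  (try
    (java.io.StringReader fragment)
    (catch ClassCastException e
      e)))

;; With the trailing dot, the form is a constructor call.
(def reader (java.io.StringReader. fragment))

(println (instance? ClassCastException failure)) ; prints true
(println (slurp reader))
```

The form compiles either way, so the mistake only surfaces at runtime, on the code path that treats the string as an (X)HTML fragment.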
    ---
     test/html_to_md/transformer_test.clj | 13 +++++++++++++
     1 file changed, 13 insertions(+)
     create mode 100644 test/html_to_md/transformer_test.clj
    
    diff --git a/test/html_to_md/transformer_test.clj b/test/html_to_md/transformer_test.clj
    new file mode 100644
    index 0000000..e1b7e5f
    --- /dev/null
    +++ b/test/html_to_md/transformer_test.clj
    @@ -0,0 +1,13 @@
    +(ns html-to-md.transformer-test
    +  (:require
    +   [clojure.string :as str]
    +   [clojure.test :as t :refer [deftest is testing]]
    +   [html-to-md.html-to-md :refer [markdown-dispatcher]]
    +   [html-to-md.transformer :refer [transform]]))
    +
    +(deftest transform-payload
    +  (testing "String `obj` for: 3. A string representation of an (X)HTML fragment;"
    +    (is (= "# This is a header"
    +           (str/trim (-> "<h1>This is a header"
    +                         (transform markdown-dispatcher)
    +                         (first)))))))
    
    From 44b28902db6c7aaa53bbf5c22b4fb527a2724e7b Mon Sep 17 00:00:00 2001
    From: "Johan Mynhardt (MEA)" 
    Date: Sat, 21 May 2022 16:26:28 +0200
    Subject: [PATCH 6/8] Fix: java.lang.ClassCastException
    
    Use trailing dot in constructing the StringReader:
    `(java.io.StringReader. s)`
    ---
     src/html_to_md/transformer.clj | 4 +---
     1 file changed, 1 insertion(+), 3 deletions(-)
    
    diff --git a/src/html_to_md/transformer.clj b/src/html_to_md/transformer.clj
    index 5933b3c..445aba5 100644
    --- a/src/html_to_md/transformer.clj
    +++ b/src/html_to_md/transformer.clj
    @@ -93,6 +93,4 @@
           (if url (transform url dispatcher)
             ;; otherwise, if s is not a URL, consider it as an HTML fragment,
             ;; parse and process it
    -        (process (tagsoup/parser (java.io.StringReader s)) dispatcher)
    -        )))
    -
    +        (process (tagsoup/parser (java.io.StringReader. s)) dispatcher))))
    
    From ac80507b5f5c39213a649d0aee2ca4877ac1ce4d Mon Sep 17 00:00:00 2001
    From: Johan Mynhardt 
    Date: Thu, 26 May 2022 04:10:09 +0200
    Subject: [PATCH 7/8] Update transformer_test.clj
    
    Use more appropriate test without `str/trim`.
    ---
     test/html_to_md/transformer_test.clj | 6 ++----
     1 file changed, 2 insertions(+), 4 deletions(-)
    
    diff --git a/test/html_to_md/transformer_test.clj b/test/html_to_md/transformer_test.clj
    index e1b7e5f..1a1e6d8 100644
    --- a/test/html_to_md/transformer_test.clj
    +++ b/test/html_to_md/transformer_test.clj
    @@ -7,7 +7,5 @@
     
     (deftest transform-payload
       (testing "String `obj` for: 3. A string representation of an (X)HTML fragment;"
    -    (is (= "# This is a header"
    -           (str/trim (-> "<h1>This is a header"
    -                         (transform markdown-dispatcher)
    -                         (first)))))))
    +    (is (= '("\n# This is a header\n")
    +           (transform "<h1>This is a header</h1>" markdown-dispatcher)))))
    
    From 08af65936684b8ff60a9b3ec83eb534f9c261836 Mon Sep 17 00:00:00 2001
    From: "Johan Mynhardt (MEA)" 
    Date: Thu, 26 May 2022 21:24:56 +0200
    Subject: [PATCH 8/8] Remove unused import.
    
    ---
     test/html_to_md/transformer_test.clj | 1 -
     1 file changed, 1 deletion(-)
    
    diff --git a/test/html_to_md/transformer_test.clj b/test/html_to_md/transformer_test.clj
    index 1a1e6d8..48369a4 100644
    --- a/test/html_to_md/transformer_test.clj
    +++ b/test/html_to_md/transformer_test.clj
    @@ -1,6 +1,5 @@
     (ns html-to-md.transformer-test
       (:require
    -   [clojure.string :as str]
        [clojure.test :as t :refer [deftest is testing]]
        [html-to-md.html-to-md :refer [markdown-dispatcher]]
        [html-to-md.transformer :refer [transform]]))