Added the blogger scraper.

Simon Brooke 2019-04-30 20:05:46 +01:00
parent 81a7337eb3
commit cb801b193f
5 changed files with 140 additions and 20 deletions

View file

@@ -21,29 +21,29 @@ To use this library in your project, add the following leiningen dependency:
 To use it in your namespace, require:
-    [html-to-md/transformer :refer [transform process]]
-    [html-to-md/html-to-md :refer [markdown-dispatcher]]
+    [html-to-md.core :refer [html-to-md]]
+For default usage, that's all you need. To play more sophisticated tricks,
+consider:
+    [html-to-md.transformer :refer [transform process]]
+    [html-to-md.html-to-md :refer [markdown-dispatcher]]
 The intended usage is as follows:
 ```clojure
-(require '[html-to-md.transformer :refer [transform]])
-(require '[html-to-md.html-to-md :refer [markdown-dispatcher]])
-(transform URL markdown-dispatcher)
+(require '[html-to-md.core :refer [html-to-md]])
+(html-to-md url output-file)
 ```
-Where URL is any URL that references an HTML, SGML, XHTML or XML document.
-However, my fancy multi-method doesn't work yet and may well be the wrong
-approach, so for now use
+This will read (X)HTML from `url` and write Markdown to `output-file`. If
+`output-file` is not supplied, it will return the markdown as a string:
 ```clojure
-(require '[html-to-md.transformer :refer [process]])
-(require '[html-to-md.html-to-md :refer [markdown-dispatcher]])
-(require '[net.cgrand.enlive-html :as html])
-(process (html/html-resource URL) markdown-dispatcher)
+(require '[html-to-md.core :refer [html-to-md]])
+(def md (html-to-md url))
 ```
 ## Extending the transformer
@@ -66,3 +66,4 @@ Copyright © 2019 Simon Brooke <simon@journeyman.cc>
 Distributed under the Eclipse Public License either version 1.0 or (at
 your option) any later version.

View file

@@ -1,3 +1,116 @@
# Introduction to html-to-md
TODO: write [great documentation](http://jacobian.org/writing/what-to-write/)
## Introduction
The itch I'm trying to scratch at present is to transform
[Blogger.com](http://www.blogger.com)'s dreadful tag-soup markup into markdown;
but my architecture for doing this is to build a completely general [HT|SG|X]ML
transformation framework and then specialise it.
**WARNING:** this is presently alpha-quality code, although it does have fair
unit test coverage.
## Usage
To use this library in your project, add the following leiningen dependency:

    [org.clojars.simon_brooke/html-to-md "0.1.0"]

To use it in your namespace, require:

    [html-to-md.core :refer [html-to-md]]

For default usage, that's all you need. To play more sophisticated tricks,
consider:

    [html-to-md.transformer :refer [transform process]]
    [html-to-md.html-to-md :refer [markdown-dispatcher]]

The intended usage is as follows:
```clojure
(require '[html-to-md.core :refer [html-to-md]])
(html-to-md url output-file)
```
This will read (X)HTML from `url` and write Markdown to `output-file`. If
`output-file` is not supplied, it will return the markdown as a string:
```clojure
(require '[html-to-md.core :refer [html-to-md]])
(def md (html-to-md url))
```
## Extending the transformer
In principle, the transformer can transform any [HT|SG|X]ML markup into any
other, or into any textual form. To extend it to do something other than
markdown, supply a **dispatcher**. A dispatcher is essentially a function of one
argument, a [HT|SG|X]ML tag represented as a Clojure keyword, which returns
a **processor**. A processor should be a function of two arguments: an element
assumed to have that tag, and a dispatcher. The processor should return the
value that you want elements of that tag transformed into.
Thus the `html-to-md.html-to-md` namespace comprises a number of *processor*
functions, such as this one:
```clojure
(defn markdown-a
  "Process the anchor element `e` into markdown, using dispatcher `d`."
  [e d]
  (str
    "["
    (s/trim (apply str (process (:content e) d)))
    "]("
    (-> e :attrs :href)
    ")"))
```
and a *dispatcher* map:
```clojure
(def markdown-dispatcher
  "A despatcher for transforming (X)HTML into Markdown."
  {:a markdown-a
   :b markdown-strong
   :br markdown-br
   :code markdown-code
   :body markdown-default
   :div markdown-div
   :em markdown-em
   :h1 markdown-h1
   :h2 markdown-h2
   :h3 markdown-h3
   :h4 markdown-h4
   :h5 markdown-h5
   :h6 markdown-h6
   :html markdown-html
   :i markdown-em
   :img markdown-img
   :ol markdown-ol
   :p markdown-div
   :pre markdown-pre
   :samp markdown-code
   :script markdown-omit
   :span markdown-default
   :strong markdown-strong
   :style markdown-omit
   :ul markdown-ul
   })
```
Obviously it is convenient to write dispatchers as maps, but it isn't required
that you do so: anything which, given a keyword, will return a processor, will
work.
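As a hedged illustration of that last point, here is a minimal sketch of a dispatcher written as a plain function instead of a map. The names `text-only`, `shout`, `fn-dispatcher` and `sample` are hypothetical, and the `process` defined here is only a simplified stand-in for `html-to-md.transformer/process`, included so the sketch runs without the library. It assumes elements are Enlive-style maps with `:tag` and `:content` keys.

```clojure
(require '[clojure.string :as s])

(declare process)

(defn text-only
  "Hypothetical processor: keep the text content of element `e`, drop its markup."
  [e d]
  (apply str (process (:content e) d)))

(defn shout
  "Hypothetical processor: upper-case the text content of element `e`."
  [e d]
  (s/upper-case (text-only e d)))

(defn fn-dispatcher
  "A dispatcher written as a function: given a tag keyword, return a
  processor, falling back to `text-only` for any tag not listed."
  [tag]
  (case tag
    (:h1 :h2 :h3) shout
    (:script :style) (constantly nil)
    text-only))

(defn process
  "Simplified stand-in for the library's `process` (an assumption, not the
  real implementation): strings pass through, maps are handed to the
  processor chosen by `dispatcher`, sequences are processed item by item
  and nils are dropped."
  [element dispatcher]
  (cond
    (string? element) element
    (map? element) ((dispatcher (:tag element)) element dispatcher)
    (sequential? element) (remove nil? (map #(process % dispatcher) element))))

(def sample
  "A tiny hand-built element tree standing in for parsed HTML."
  {:tag :div
   :content [{:tag :h1 :content ["Title"]}
             " body"
             {:tag :script :content ["x()"]}]})

(process sample fn-dispatcher)
```

One thing a function buys you over a map is the fallback: tags the dispatcher does not name still get a processor, so no lookup ever comes back empty.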
## License
Copyright © 2019 Simon Brooke <simon@journeyman.cc>
Distributed under the Eclipse Public License either version 1.0 or (at
your option) any later version.

View file

@@ -1,6 +1,11 @@
-(ns html-to-md.core)
+(ns html-to-md.core
+  (:require [html-to-md.transformer :refer [transform process]]
+            [html-to-md.html-to-md :refer [markdown-dispatcher]]))

-(defn foo
-  "I don't do a whole lot."
-  [x]
-  (println x "Hello, World!"))
+(defn html-to-md
+  "Transform the HTML document referenced by `url` into Markdown, and write
+  it to `output`, if supplied."
+  ([url]
+   (apply str (transform url markdown-dispatcher)))
+  ([url output]
+   (spit output (html-to-md url))))

View file

@@ -165,6 +165,7 @@
 (def markdown-dispatcher
+  "A despatcher for transforming (X)HTML into Markdown."
   {:a markdown-a
    :b markdown-strong
    :br markdown-br

View file

@@ -29,7 +29,7 @@
     (string? element) element
     (or (seq? element) (vector? element))
-    (doall (map #(process % dispatcher) element))))
+    (remove nil? (map #(process % dispatcher) element))))

 (defn- transformer-dispatch
   [a _]
@@ -45,7 +45,7 @@
   (process obj dispatcher))

 (defmethod transform java.net.URI [uri dispatcher]
-  (process (html/html-resource uri) dispatcher))
+  (remove nil? (process (html/html-resource uri) dispatcher)))

 (defmethod transform java.net.URL [url dispatcher]
   (transform (.toURI url) dispatcher))