# elboob

A site search engine for Cryogen, with search on the client side.
## Justification

Left, of course.

More seriously, elboob is as near as I can get to an inversion of Google.
## Design intention

This project is intended to be in two parts:
### The compiler

A Clojure function which scans a list of directories of Markdown files and produces an index: a map from each lexical token occurring in any file (Markdown formatting, common words, punctuation and so on excepted) to a map from the relative file path of each file in which that token occurs to the number of times it occurs within that file.
Thus, supposing we had one file, with the path name `content/md/posts/aquarius.md`, with the content

```
The Age of Aquarius
This is the dawning of the Age of Aquarius.
```

Then the output should be

```clojure
{"age"      {"content/md/posts/aquarius.md" 2}
 "aquarius" {"content/md/posts/aquarius.md" 2}
 "dawning"  {"content/md/posts/aquarius.md" 1}}
```
This map is then stored in a file `elboob.edn` in the root directory of the Cryogen public output. Whether the source path name (e.g. `content/md/posts/`) should be converted to the target path name (e.g. `/blog/posts-output/`) at compile time or at search time is something I'll decide later.
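By way of illustration, here is a minimal sketch of such a compiler, assuming file contents have already been read into memory; the namespace, the names `tokenise`, `index-file` and `build-index`, and the tiny stop-list are all hypothetical, not the actual elboob API:

```clojure
(ns cc.journeyman.elboob.sketch
  "A minimal sketch, not the actual elboob implementation; all names
   here are hypothetical."
  (:require [clojure.string :as string]))

(def stop-words
  "A tiny illustrative stop-list; a real one would be far longer."
  #{"a" "an" "and" "is" "of" "the" "this"})

(defn tokenise
  "Split `text` into lower-cased word tokens, dropping punctuation
   and stop words."
  [text]
  (->> (string/split (string/lower-case text) #"[^a-z0-9']+")
       (remove string/blank?)
       (remove stop-words)))

(defn index-file
  "Index a single file: a map from token to a map from `path` to the
   frequency of that token in `content`."
  [path content]
  (reduce (fn [index token]
            (update-in index [token path] (fnil inc 0)))
          {}
          (tokenise content)))

(defn build-index
  "Build the whole index from a map of relative path -> file content,
   merging the per-file maps token by token."
  [files]
  (apply merge-with merge
         (map (fn [[path content]] (index-file path content)) files)))

;; Writing the result is then a single call:
;; (spit "public/elboob.edn" (pr-str (build-index files)))
```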
### The searcher

The searcher is a little ClojureScript function which, given a sequence of search terms, will read the `elboob.edn` file and produce a web page showing a list of files which contain one or more of those search terms, ordered by the product of the numbers of occurrences of each term in the file.
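Loading aside (in ClojureScript the index could be fetched over HTTP and parsed with `cljs.reader/read-string`), the scoring step might look like this sketch. `score-files` is a hypothetical name, and here a term absent from a file contributes a factor of one rather than zero, so that files matching only some of the terms still appear:

```clojure
(defn score-files
  "Given an `index` (token -> path -> frequency) and a sequence of
   search `terms`, return [path score] pairs, best match first, where
   each file's score is the product of the frequencies of the terms
   which occur in it."
  [index terms]
  (->> terms
       (mapcat (fn [term] (get index term {})))
       (reduce (fn [scores [path freq]]
                 (update scores path (fnil * 1) freq))
               {})
       (sort-by val >)))

;; (score-files {"age"      {"content/md/posts/aquarius.md" 2}
;;               "aquarius" {"content/md/posts/aquarius.md" 2}}
;;              ["age" "aquarius"])
;; => (["content/md/posts/aquarius.md" 4])
```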
## Implementation

Implementation is at an early stage. I have a working indexer which conforms to the specification given above. There are problems with it:
- It contains many repetitions of long file path names, which results in a large data size (although it makes the index efficient to search);
- It doesn't contain human-readable metadata about the files, which it easily could, given that this is Cryogen and the files have metadata headers.
I could assign a gensym to each file path name, store that gensym in the main index, and add a separate dictionary map to the index which translates those gensyms back into the full file paths. That would substantially reduce the file size without greatly increasing the cost of search.
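A minimal sketch of that compaction, assuming the index shape shown above; `compact-index` is a hypothetical name, and `update-keys` needs Clojure 1.11 or later:

```clojure
(defn compact-index
  "Replace each file path in `index` with a short generated symbol,
   returning the rewritten index together with a dictionary which
   translates the symbols back into full file paths."
  [index]
  (let [paths     (->> index vals (mapcat keys) distinct)
        path->sym (into {} (map (fn [path] [path (gensym "f")]) paths))]
    {:index      (into {}
                       (map (fn [[token path-freqs]]
                              [token (update-keys path-freqs path->sym)])
                            index))
     :dictionary (into {} (map (fn [[path sym]] [sym path]) path->sym))}))

;; (compact-index {"age" {"content/md/posts/aquarius.md" 2}})
;; => {:index      {"age" {f1234 2}}
;;     :dictionary {f1234 "content/md/posts/aquarius.md"}}
```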
## License
Copyright © 2025 Simon Brooke. Licensed under the GNU General Public License, version 2.0 or (at your option) any later version.