# elboob

A site search engine for Cryogen, with search on the client side.
## Justification

Left, of course.

More seriously, elboob is as near as I can get to an inversion of Google.
## Design intention

This project is intended to be in two parts:
### The compiler

A Clojure function which scans a list of directories of Markdown files and produces an index: a map from each lexical token occurring in any file (Markdown formatting, common words, punctuation and so on excepted) to a map from the relative file path of each file in which that token occurs to the number of times it occurs within that file.
Thus, supposing we had one file, with the path name `content/md/posts/aquarius.md`, with the content

```
The Age of Aquarius
This is the dawning of the Age of Aquarius.
```

Then the output should be

```clojure
{"age"      {"content/md/posts/aquarius.md" 2}
 "aquarius" {"content/md/posts/aquarius.md" 2}
 "dawning"  {"content/md/posts/aquarius.md" 1}}
```
This map is then stored in a file `elboob.edn` in the root directory of the Cryogen public output. Whether the source path name (e.g. `content/md/posts/`) should be converted to the target path name (e.g. `/blog/posts-output/`) at compile time or at search time is something I'll decide later.
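By way of illustration, here is a minimal sketch of such a compiler, assuming file contents have already been read into memory; the namespace, the names `tokenise`, `index-file` and `build-index`, and the tiny stop-list are all hypothetical, not the actual elboob API:

```clojure
(ns cc.journeyman.elboob.sketch
  "A minimal sketch, not the actual elboob implementation; all names
   here are hypothetical."
  (:require [clojure.string :as string]))

(def stop-words
  "A tiny illustrative stop-list; a real one would be far longer."
  #{"a" "an" "and" "is" "of" "the" "this"})

(defn tokenise
  "Split `text` into lower-cased word tokens, dropping punctuation
   and stop words."
  [text]
  (->> (string/split (string/lower-case text) #"[^a-z0-9']+")
       (remove string/blank?)
       (remove stop-words)))

(defn index-file
  "Index a single file: a map from token to a map from `path` to the
   frequency of that token in `content`."
  [path content]
  (reduce (fn [index token]
            (update-in index [token path] (fnil inc 0)))
          {}
          (tokenise content)))

(defn build-index
  "Build the whole index from a map of relative path -> file content,
   merging the per-file maps token by token."
  [files]
  (apply merge-with merge
         (map (fn [[path content]] (index-file path content)) files)))

;; Writing the result is then a single call:
;; (spit "public/elboob.edn" (pr-str (build-index files)))
```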
### The searcher

The searcher is a little ClojureScript function which, given a sequence of search terms, will read the `elboob.edn` file and produce a web page showing a list of files which contain one or more of those search terms, ordered by the product of the numbers of occurrences of each term in the file.
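Loading aside (in ClojureScript the index could be fetched over HTTP and parsed with `cljs.reader/read-string`), the scoring step might look like this sketch. `score-files` is a hypothetical name, and here a term absent from a file contributes a factor of one rather than zero, so that files matching only some of the terms still appear:

```clojure
(defn score-files
  "Given an `index` (token -> path -> frequency) and a sequence of
   search `terms`, return [path score] pairs, best match first, where
   each file's score is the product of the frequencies of the terms
   which occur in it."
  [index terms]
  (->> terms
       (mapcat (fn [term] (get index term {})))
       (reduce (fn [scores [path freq]]
                 (update scores path (fnil * 1) freq))
               {})
       (sort-by val >)))

;; (score-files {"age"      {"content/md/posts/aquarius.md" 2}
;;               "aquarius" {"content/md/posts/aquarius.md" 2}}
;;              ["age" "aquarius"])
;; => (["content/md/posts/aquarius.md" 4])
```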
## Implementation

Implementation is at an early stage. I have a working indexer which conforms to the specification given above. There are problems with it:
- It contains many repetitions of long file path names, which results in a large data size (although it makes the index efficient to search);
- It doesn't contain human-readable metadata about the files, which it easily could, given that this is Cryogen and the files have metadata headers.
I could assign a gensym to each file path name, store that gensym in the main index, and add a separate dictionary map to the index which translates those gensyms back into the full file paths. That would substantially reduce the file size without greatly increasing the cost of search.
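A minimal sketch of that compaction, assuming the index shape shown above; `compact-index` is a hypothetical name, and `update-keys` needs Clojure 1.11 or later:

```clojure
(defn compact-index
  "Replace each file path in `index` with a short generated symbol,
   returning the rewritten index together with a dictionary which
   translates the symbols back into full file paths."
  [index]
  (let [paths     (->> index vals (mapcat keys) distinct)
        path->sym (into {} (map (fn [path] [path (gensym "f")]) paths))]
    {:index      (into {}
                       (map (fn [[token path-freqs]]
                              [token (update-keys path-freqs path->sym)])
                            index))
     :dictionary (into {} (map (fn [[path sym]] [sym path]) path->sym))}))

;; (compact-index {"age" {"content/md/posts/aquarius.md" 2}})
;; => {:index      {"age" {f1234 2}}
;;     :dictionary {f1234 "content/md/posts/aquarius.md"}}
```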
## License
Copyright © 2025 Simon Brooke. Licensed under the GNU General Public License, version 2.0 or (at your option) any later version.