OK, my first git repository may have got smashed, but that isn't a reason

for doing without. Reinitialising, history lost. And this time push to
remote!
This commit is contained in:
Simon Brooke 2013-10-31 08:10:05 +00:00
commit e59f160f70
20 changed files with 6205 additions and 0 deletions

6
.classpath Normal file

@@ -0,0 +1,6 @@
<?xml version="1.0" encoding="UTF-8"?>
<classpath>
<classpathentry kind="src" path="src"/>
<classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER/org.eclipse.jdt.internal.debug.ui.launcher.StandardVMType/JavaSE-1.6"/>
<classpathentry kind="output" path="bin"/>
</classpath>

17
.project Normal file

@@ -0,0 +1,17 @@
<?xml version="1.0" encoding="UTF-8"?>
<projectDescription>
<name>milkwood</name>
<comment></comment>
<projects>
</projects>
<buildSpec>
<buildCommand>
<name>org.eclipse.jdt.core.javabuilder</name>
<arguments>
</arguments>
</buildCommand>
</buildSpec>
<natures>
<nature>org.eclipse.jdt.core.javanature</nature>
</natures>
</projectDescription>


@@ -0,0 +1,11 @@
eclipse.preferences.version=1
org.eclipse.jdt.core.compiler.codegen.inlineJsrBytecode=enabled
org.eclipse.jdt.core.compiler.codegen.targetPlatform=1.6
org.eclipse.jdt.core.compiler.codegen.unusedLocal=preserve
org.eclipse.jdt.core.compiler.compliance=1.6
org.eclipse.jdt.core.compiler.debug.lineNumber=generate
org.eclipse.jdt.core.compiler.debug.localVariable=generate
org.eclipse.jdt.core.compiler.debug.sourceFile=generate
org.eclipse.jdt.core.compiler.problem.assertIdentifier=error
org.eclipse.jdt.core.compiler.problem.enumIdentifier=error
org.eclipse.jdt.core.compiler.source=1.6

72
README.txt Normal file

@@ -0,0 +1,72 @@
Trigrams process
Started at: 20131030:12:48 GMT
OK, it's a tokeniser, with a map. The map maps token tuples onto tokens.
(A, B) -> C
But one tuple (A, B) may map onto any one of a number of tokens, so it's
(A, B) -> one_of( C, D, E...)
From the problem specification: 'What do we do with punctuation? Paragraphs?'
Punctuation is just tokens and should follow the same rules as other tokens
i.e. 'I CAME, I SAW, I CONQUERED,' should be treated as 'I CAME COMMA I SAW COMMA I CONQUERED'
Paragraphs... Since this is nonsense, a paragraph won't contain a logical unit of narrative or argument, since there can be no logical units. It is effectively a lorem ipsum text. So paragraphs should be generated at random at the ends of sentences, with a roughly 20% probability (i.e. on average five sentences to a paragraph). A 'sentence' ends just exactly when we emit a period.
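The paragraph rule above reduces to a single random draw per sentence end. A minimal sketch (class and method names invented for illustration, not part of this code base):

```java
import java.util.Random;

public class ParagraphBreaks {

    /** On average five sentences to a paragraph, so roughly a 20%
     *  chance of a paragraph break at each sentence end. */
    static boolean breakParagraphHere(Random random) {
        return random.nextInt(5) == 0;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        int breaks = 0;
        for (int sentence = 0; sentence < 10000; sentence++) {
            if (breakParagraphHere(random)) {
                breaks++;
            }
        }
        // expect roughly 2000 breaks in 10000 sentences
        System.out.println("breaks: " + breaks);
    }
}
```

Emitting two newlines at each true draw gives paragraphs of five sentences on average.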
Is it really as simple as that? Seriously?
OK, this is (1) an ideal Prolog problem, and (2) something which it would be delightful to tackle in Lisp or Clojure (or indeed use that neat little Prolog-in-Clojure I saw somewhere recently) but I'm asked to do it in Java, so Java it shall (for now) be.
OK, interesting little problemette:
iterating over tokens on the input, I need to hold open N uncompleted tuples, where N is the number of tokens in a tuple. H'mmmm...
Oh, bother. A stream tokenizer doesn't return simple tokens, it tries to be clever.
Question: does it matter if, against a word, we map multiple identical tuples? Answer: no, it doesn't - it just slightly increases the probability of that sequence being selected on output.
Damn! and actually, this is why this really is a Lisp problem! the data structure we want is not
I -> [[I, CAME, COMMA],[I, CAME, TO],[I, SAW, COMMA],[I, SAW, HER]]
it would be better to have rule trees
I -> [I, [[CAME, [[COMMA], [TO]]], [SAW, [[COMMA], [HER]]]]]
Because then we could just walk the rule tree on output... Wait! No, dammit, it's even simpler than that. All we need to store is successors. Because we walk the succession hierarchy as deep as we need on output... Jings, that's neat.
But no, it won't work - because then I don't know where I've come from. Rule trees it must be.
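A rule tree of that shape is just nested maps: each node maps a successor word onto the subtree of what may follow it. A toy sketch (invented names, not this project's classes):

```java
import java.util.HashMap;
import java.util.Map;

public class RuleTrie {

    /** Successor word -> subtree of what may follow it. */
    final Map<String, RuleTrie> successors = new HashMap<String, RuleTrie>();

    /** Thread one observed word sequence into the trie, sharing prefixes. */
    void addSequence(String[] words, int from) {
        if (from < words.length) {
            RuleTrie child = successors.get(words[from]);
            if (child == null) {
                child = new RuleTrie();
                successors.put(words[from], child);
            }
            child.addSequence(words, from + 1);
        }
    }

    public static void main(String[] args) {
        RuleTrie root = new RuleTrie();
        root.addSequence(new String[] {"i", "came", ","}, 0);
        root.addSequence(new String[] {"i", "came", "to"}, 0);
        root.addSequence(new String[] {"i", "saw", ","}, 0);
        // under 'i' there are two branches: 'came' and 'saw'
        System.out.println(root.successors.get("i").successors.keySet());
    }
}
```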
Right. Backtracking required? My hunch is yes, because, suppose the input comprises the sequences:
A B B
A B C
B C A
B C D
C D A
C A B
D B C
Then, if we start by emitting A B C D A, we're stuck, because we have no rule with the left-hand side 'D A'. So we have to roll back the B C D step and choose B C A instead. Which means generation must be a recursive function.
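Using exactly those seven sequences, a minimal recursive generator with backtracking might look like this (an illustrative sketch with invented names, not this project's code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BacktrackSketch {

    /** Bigram -> possible next tokens, built from the sequences above. */
    static final Map<List<String>, List<String>> RULES =
            new HashMap<List<String>, List<String>>();
    static {
        RULES.put(Arrays.asList("A", "B"), Arrays.asList("B", "C"));
        RULES.put(Arrays.asList("B", "C"), Arrays.asList("A", "D"));
        RULES.put(Arrays.asList("C", "D"), Arrays.asList("A"));
        RULES.put(Arrays.asList("C", "A"), Arrays.asList("B"));
        RULES.put(Arrays.asList("D", "B"), Arrays.asList("C"));
    }

    /** Extend soFar to the target length; null signals a dead end, forcing
     *  the caller to try its next option - that is the backtracking. */
    static List<String> generate(List<String> soFar, int length) {
        if (soFar.size() >= length) {
            return soFar;
        }
        List<String> key = soFar.subList(soFar.size() - 2, soFar.size());
        List<String> options = RULES.get(key);
        if (options == null) {
            return null; // no rule with this left-hand side: dead end
        }
        for (String next : options) {
            List<String> candidate = new ArrayList<String>(soFar);
            candidate.add(next);
            List<String> solution = generate(candidate, length);
            if (solution != null) {
                return solution; // success on this branch
            }
        }
        return null; // every option failed: backtrack further
    }

    public static void main(String[] args) {
        List<String> seed = new ArrayList<String>(Arrays.asList("A", "B"));
        System.out.println(generate(seed, 6));
    }
}
```

Asked for six tokens from the seed A B, it first tries A B B, finds no rule for (B, B), backtracks, and succeeds with A B C A B C.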
OK, this would be SO MUCH easier in a functional language like Clojure... the problem is in the backtracking. I had thought it wouldn't matter not marking which branches I'd explored because I could just explore branches at random, but that doesn't work because either I end up getting stuck in an infinite loop retrying branches I've explored before, or else I could fail when there is a valid solution.
OK, the problem only specified that the tuple length should be two. I'm trying to build the general case. But the special case of tuple length = 2 would be easier to solve. Should I admit defeat, or shall I be arrogant? It would be more elegant to solve the general case.
Argh, power cut. This is not what I need. Copied everything onto laptop but Git repository is corrupt. Never mind, don't have time to fix it. Also, don't have Java 7 on laptop so no try-with-resources... bother.
Also don't have Netbeans on laptop and while Eclipse is handling the Netbeans project mostly fine, I can't do 'ant jar' because of stuff in the netbeans project file. Argghh! Hand-hacked project.properties and now it works...
Now overtired and making mistakes - a situation made worse by broken git. I don't think I'm going through the whole sequence as I intend; also, something is scrambling glanceBack - and it's something I've broken recently, which makes it worse. Taking a break.
And, dammit! Although I've specified that line feed and carriage return are whitespace, the parser is still treating them as special. Oh, no, I beg its pardon, it isn't. However, although I haven't specified 'period' as a word character, it is being treated as one. Bother.
Having said all that, the parse tree is looking very good. I'm extracting the rules well. It's applying them that's proving hard.
Right, StreamTokenizer was a poor choice. It seems to be legacy. But schlurping the whole text into a string and then using StringTokenizer or String.split() looks a bad choice too, since I don't know how long the string is. H'mmmm... This is a problem I kind of don't need, since it's not key to the project, and the .endsWith(PERIOD) hack works around it. Concentrate on output.
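For the record, the line-by-line alternative hinted at here - read with a BufferedReader and split words from punctuation by hand, so '.' can never glue itself to a word - might look something like this sketch (invented names, not part of this commit):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineTokenizer {

    /** Words are runs of letters or digits; every other non-space
     *  character (so '.' and ',') becomes a token of its own. */
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = Pattern.compile("[A-Za-z0-9]+|[^\\sA-Za-z0-9]").matcher(line);
        while (m.find()) {
            tokens.add(m.group().toLowerCase(Locale.ENGLISH));
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in =
                new BufferedReader(new StringReader("I came, I saw, I conquered."));
        for (String line = in.readLine(); line != null; line = in.readLine()) {
            System.out.println(tokenize(line));
        }
    }
}
```

Reading a line at a time keeps memory bounded however long the input is, which was the objection to slurping the whole text.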

74
build.xml Normal file

@@ -0,0 +1,74 @@
<?xml version="1.0" encoding="UTF-8"?>
<!-- You may freely edit this file. See commented blocks below for -->
<!-- some examples of how to customize the build. -->
<!-- (If you delete it and reopen the project it will be recreated.) -->
<!-- By default, only the Clean and Build commands use this build script. -->
<!-- Commands such as Run, Debug, and Test only use this build script if -->
<!-- the Compile on Save feature is turned off for the project. -->
<!-- You can turn off the Compile on Save (or Deploy on Save) setting -->
<!-- in the project's Project Properties dialog box.-->
<project name="milkwood" default="default" basedir=".">
<description>Builds, tests, and runs the project milkwood.</description>
<import file="nbproject/build-impl.xml"/>
<!--
There exist several targets which are by default empty and which can be
used for execution of your tasks. These targets are usually executed
before and after some main targets. They are:
-pre-init: called before initialization of project properties
-post-init: called after initialization of project properties
-pre-compile: called before javac compilation
-post-compile: called after javac compilation
-pre-compile-single: called before javac compilation of single file
-post-compile-single: called after javac compilation of single file
-pre-compile-test: called before javac compilation of JUnit tests
-post-compile-test: called after javac compilation of JUnit tests
-pre-compile-test-single: called before javac compilation of single JUnit test
-post-compile-test-single: called after javac compilation of single JUnit test
-pre-jar: called before JAR building
-post-jar: called after JAR building
-post-clean: called after cleaning build products
(Targets beginning with '-' are not intended to be called on their own.)
Example of inserting an obfuscator after compilation could look like this:
<target name="-post-compile">
<obfuscate>
<fileset dir="${build.classes.dir}"/>
</obfuscate>
</target>
For list of available properties check the imported
nbproject/build-impl.xml file.
Another way to customize the build is by overriding existing main targets.
The targets of interest are:
-init-macrodef-javac: defines macro for javac compilation
-init-macrodef-junit: defines macro for junit execution
-init-macrodef-debug: defines macro for class debugging
-init-macrodef-java: defines macro for class execution
-do-jar-with-manifest: JAR building (if you are using a manifest)
-do-jar-without-manifest: JAR building (if you are not using a manifest)
run: execution of project
-javadoc-build: Javadoc generation
test-report: JUnit report generation
An example of overriding the target for project execution could look like this:
<target name="run" depends="milkwood-impl.jar">
<exec dir="bin" executable="launcher.exe">
<arg file="${dist.jar}"/>
</exec>
</target>
Notice that the overridden target depends on the jar target and not only on
the compile target as the regular run target does. Again, for a list of available
properties which you can use, check the target you are overriding in the
nbproject/build-impl.xml file.
-->
</project>

3
manifest.mf Normal file

@@ -0,0 +1,3 @@
Manifest-Version: 1.0
X-COMMENT: Main-Class will be added automatically by build

1444
nbproject/build-impl.xml Normal file

File diff suppressed because it is too large


@@ -0,0 +1,8 @@
build.xml.data.CRC32=d35b316e
build.xml.script.CRC32=cd5c02b3
build.xml.stylesheet.CRC32=28e38971@1.56.1.46
# This file is used by a NetBeans-based IDE to track changes in generated files such as build-impl.xml.
# Do not edit this file. You may delete it but then the IDE will never regenerate such files for you.
nbproject/build-impl.xml.data.CRC32=d35b316e
nbproject/build-impl.xml.script.CRC32=0441a68e
nbproject/build-impl.xml.stylesheet.CRC32=c6d2a60f@1.56.1.46


@@ -0,0 +1,2 @@
compile.on.save=true
user.properties.file=/home/simon/.jmonkeyplatform/3.0/build.properties


@@ -0,0 +1,4 @@
<?xml version="1.0" encoding="UTF-8"?>
<project-private xmlns="http://www.netbeans.org/ns/project-private/1">
<editor-bookmarks xmlns="http://www.netbeans.org/ns/editor-bookmarks/2" lastBookmarkId="0"/>
</project-private>


@@ -0,0 +1,73 @@
annotation.processing.enabled=true
annotation.processing.enabled.in.editor=false
annotation.processing.processor.options=
annotation.processing.processors.list=
annotation.processing.run.all.processors=true
annotation.processing.source.output=${build.generated.sources.dir}/ap-source-output
build.classes.dir=${build.dir}/classes
build.classes.excludes=**/*.java,**/*.form
# This directory is removed when the project is cleaned:
build.dir=build
build.generated.dir=${build.dir}/generated
build.generated.sources.dir=${build.dir}/generated-sources
# Only compile against the classpath explicitly listed here:
build.sysclasspath=ignore
build.test.classes.dir=${build.dir}/test/classes
build.test.results.dir=${build.dir}/test/results
# Uncomment to specify the preferred debugger connection transport:
#debug.transport=dt_socket
debug.classpath=\
${run.classpath}
debug.test.classpath=\
${run.test.classpath}
# This directory is removed when the project is cleaned:
dist.dir=dist
dist.jar=${dist.dir}/milkwood.jar
dist.javadoc.dir=${dist.dir}/javadoc
excludes=
includes=**
jar.compress=false
javac.classpath=
# Space-separated list of extra javac options
javac.compilerargs=
javac.deprecation=false
javac.processorpath=\
${javac.classpath}
javac.source=1.6
javac.target=1.6
javac.test.classpath=\
${javac.classpath}:\
${build.classes.dir}
javac.test.processorpath=\
${javac.test.classpath}
javadoc.additionalparam=
javadoc.author=false
javadoc.encoding=${source.encoding}
javadoc.noindex=false
javadoc.nonavbar=false
javadoc.notree=false
javadoc.private=false
javadoc.splitindex=true
javadoc.use=true
javadoc.version=false
javadoc.windowtitle=
main.class=cc.journeyman.milkwood.Milkwood
manifest.file=manifest.mf
meta.inf.dir=${src.dir}/META-INF
mkdist.disabled=false
platform.active=JDK_1.6
platforms.JDK_1.6.home=/usr/lib/jvm/java-6-openjdk-amd64/
run.classpath=\
${javac.classpath}:\
${build.classes.dir}
# Space-separated list of JVM arguments used when running the project.
# You may also define separate properties like run-sys-prop.name=value instead of -Dname=value.
# To set system properties for unit tests define test-sys-prop.name=value:
run.jvmargs=
run.test.classpath=\
${javac.test.classpath}:\
${build.test.classes.dir}
source.encoding=UTF-8
src.dir=src
test.src.dir=test
project.license=unpublished


@@ -0,0 +1,72 @@
annotation.processing.enabled=true
annotation.processing.enabled.in.editor=false
annotation.processing.processor.options=
annotation.processing.processors.list=
annotation.processing.run.all.processors=true
annotation.processing.source.output=${build.generated.sources.dir}/ap-source-output
build.classes.dir=${build.dir}/classes
build.classes.excludes=**/*.java,**/*.form
# This directory is removed when the project is cleaned:
build.dir=build
build.generated.dir=${build.dir}/generated
build.generated.sources.dir=${build.dir}/generated-sources
# Only compile against the classpath explicitly listed here:
build.sysclasspath=ignore
build.test.classes.dir=${build.dir}/test/classes
build.test.results.dir=${build.dir}/test/results
# Uncomment to specify the preferred debugger connection transport:
#debug.transport=dt_socket
debug.classpath=\
${run.classpath}
debug.test.classpath=\
${run.test.classpath}
# This directory is removed when the project is cleaned:
dist.dir=dist
dist.jar=${dist.dir}/milkwood.jar
dist.javadoc.dir=${dist.dir}/javadoc
excludes=
includes=**
jar.compress=false
javac.classpath=
# Space-separated list of extra javac options
javac.compilerargs=
javac.deprecation=false
javac.processorpath=\
${javac.classpath}
javac.source=1.6
javac.target=1.6
javac.test.classpath=\
${javac.classpath}:\
${build.classes.dir}
javac.test.processorpath=\
${javac.test.classpath}
javadoc.additionalparam=
javadoc.author=false
javadoc.encoding=${source.encoding}
javadoc.noindex=false
javadoc.nonavbar=false
javadoc.notree=false
javadoc.private=false
javadoc.splitindex=true
javadoc.use=true
javadoc.version=false
javadoc.windowtitle=
main.class=cc.journeyman.milkwood.Milkwood
manifest.file=manifest.mf
meta.inf.dir=${src.dir}/META-INF
mkdist.disabled=false
platform.active=JDK_1.6
run.classpath=\
${javac.classpath}:\
${build.classes.dir}
# Space-separated list of JVM arguments used when running the project.
# You may also define separate properties like run-sys-prop.name=value instead of -Dname=value.
# To set system properties for unit tests define test-sys-prop.name=value:
run.jvmargs=
run.test.classpath=\
${javac.test.classpath}:\
${build.test.classes.dir}
source.encoding=UTF-8
src.dir=src
test.src.dir=test
project.license=unpublished

16
nbproject/project.xml Normal file

@@ -0,0 +1,16 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://www.netbeans.org/ns/project/1">
<type>org.netbeans.modules.java.j2seproject</type>
<configuration>
<data xmlns="http://www.netbeans.org/ns/j2se-project/3">
<name>milkwood</name>
<explicit-platform explicit-source-supported="true"/>
<source-roots>
<root id="src.dir"/>
</source-roots>
<test-roots>
<root id="test.src.dir"/>
</test-roots>
</data>
</configuration>
</project>


@@ -0,0 +1,74 @@
package cc.journeyman.milkwood;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
/*
* Proprietary unpublished source code property of
* Simon Brooke <simon@journeyman.cc>.
*
* Copyright (c) 2013 Simon Brooke <simon@journeyman.cc>
*/
/**
*
* @author Simon Brooke <simon@journeyman.cc>
*/
public class Milkwood {
/**
* Parse command line arguments and kick off the process. Expected
* arguments include:
* <dl>
* <dt>-i, -input</dt>
* <dd>Input file, expected to be an English (or, frankly, other natural
* language) text. Defaults to standard in.</dd>
* <dt>-n, -tuple-length</dt>
* <dd>The length of tuples into which the file will be analysed, default 2.</dd>
* <dt>-o, -output</dt>
* <dd>Output file, to which generated text will be written.
* Defaults to standard out.</dd>
* </dl>
*
* @param args the command line arguments
* @exception FileNotFoundException if the user specifies a file which
* isn't available.
* @exception IOException if the input could not be read or the output written.
*/
public static void main(String[] args) throws FileNotFoundException, IOException {
InputStream in = System.in;
OutputStream out = System.out;
int tupleLength = 2;
for (int cursor = 0; cursor < args.length; cursor++) {
String arg = args[cursor];
if (arg.startsWith("-") && arg.length() > 1) {
switch (arg.charAt(1)) {
case 'i':
// input
in = new FileInputStream(new File(args[++cursor]));
break;
case 'o': // output
out = new FileOutputStream(new File(args[++cursor]));
break;
case 'n':
case 't': // tuple length
tupleLength = Integer.parseInt(args[++cursor]);
break;
default:
throw new IllegalArgumentException(
String.format("Unrecognised argument '%s'", arg));
}
}
}
new TextGenerator().readAndGenerate( in, out, tupleLength);
}
}


@@ -0,0 +1,17 @@
/*
* Proprietary unpublished source code property of
* Simon Brooke <simon@journeyman.cc>.
*
* Copyright (c) 2013 Simon Brooke <simon@journeyman.cc>
*/
package cc.journeyman.milkwood;
/**
*
* @author Simon Brooke <simon@journeyman.cc>
*/
class NoSuchPathException extends Exception {
private static final long serialVersionUID = 1L;
}


@@ -0,0 +1,170 @@
/*
* Proprietary unpublished source code property of
* Simon Brooke <simon@journeyman.cc>.
*
* Copyright (c) 2013 Simon Brooke <simon@journeyman.cc>
*/
package cc.journeyman.milkwood;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.Random;
import java.util.Stack;
/**
* Mapping a word to its successor words. This is probably highly
* inefficient of store, but for the present purposes my withers are unwrung.
* Not thread safe in this form because of access to the random number generator.
*
* @author Simon Brooke <simon@journeyman.cc>
*/
public class RuleTreeNode {
/**
* The line separator on this platform.
*/
public static final String NEWLINE = System.getProperty("line.separator", "\n");
/**
* A random number generator.
*/
private static Random RANDOM = new Random();
/**
* The word at this node.
*/
private final String word;
/**
* Potential successors of this node
*/
private Map<String,RuleTreeNode> rules = new HashMap<String,RuleTreeNode>();
/**
* Create me wrapping this word.
* @param word the word I represent.
*/
public RuleTreeNode(String word) {
this.word = word;
}
public String toString() {
StringBuffer buffy = new StringBuffer();
this.printToBuffer( buffy, 0);
return buffy.toString();
}
private void printToBuffer(StringBuffer buffy, int indent) {
for (int i = 0; i < indent; i++) {
buffy.append( '\t');
}
buffy.append( this.getWord());
if ( this.rules.isEmpty()) {
buffy.append(NEWLINE);
} else {
buffy.append( " ==>").append(NEWLINE);
for ( String successor : this.getSuccessors()) {
rules.get(successor).printToBuffer(buffy, indent + 1);
}
buffy.append(NEWLINE);
}
}
/**
*
* @return my word.
*/
public String getWord() {
return word;
}
/**
*
* @return a shuffled list of the words which could follow this one.
*/
public Collection<String> getSuccessors() {
ArrayList<String> result = new ArrayList<String>();
result.addAll(rules.keySet());
Collections.shuffle(result, RANDOM);
return result;
}
/**
* Compile this sequence of tokens into rule nodes under me.
* @param sequence the sequence of tokens to compile.
*/
public void addSequence(Queue<String> sequence) {
if (!sequence.isEmpty()) {
String word = sequence.remove();
RuleTreeNode successor = this.getRule(word);
if (successor == null) {
successor = new RuleTreeNode(word);
this.rules.put(word, successor);
}
successor.addSequence(sequence);
}
}
/**
* Choose a successor at random.
*
* @return the successor chosen, or null if I have none.
*/
protected RuleTreeNode getRule() {
RuleTreeNode result = null;
if (!rules.isEmpty()) {
int target = RANDOM.nextInt(rules.keySet().size());
for (String key : rules.keySet()) {
/*
* NOTE: decrement after test.
*/
if (target-- == 0) {
result = rules.get(key);
break;
}
}
}
return result;
}
/**
*
* @param token a token to seek.
* @return the successor among my successors which has this token, if any.
*/
protected RuleTreeNode getRule(String token) {
return rules.get(token);
}
protected String getWord(Stack<String> path) throws NoSuchPathException {
final String result;
if ( path.isEmpty()) {
result = this.getWord();
} else {
final RuleTreeNode successor = this.getRule(path.pop());
if (successor == null) {
throw new NoSuchPathException();
} else {
result = successor.getWord(path);
}
}
return result;
}
}


@@ -0,0 +1,432 @@
/*
* Proprietary unpublished source code property of
* Simon Brooke <simon@journeyman.cc>.
*
* Copyright (c) 2013 Simon Brooke <simon@journeyman.cc>
*/
package cc.journeyman.milkwood;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.util.Collection;
import java.util.LinkedList;
import java.util.Locale;
import java.util.Queue;
import java.util.Random;
import java.util.Stack;
import java.util.logging.Level;
import java.util.logging.Logger;
/**
*
* @author Simon Brooke <simon@journeyman.cc>
*/
class TextGenerator {
/**
* The magic token which identifies the root node of the
* rule tree.
*/
private static final String ROOTMAGICTOKEN = "*ROOT*";
/**
* The special magic token which is deemed to end sentences.
*/
public static final String PERIOD = ".";
/**
* The average number of sentences in a paragraph.
*/
public static final int AVSENTENCESPERPARA = 5;
/**
* A random number generator.
*/
private static Random RANDOM = new Random();
/**
* Dictionary of first-words we know about; each first-word maps
* onto a tuple of tuples of word sequences beginning with that
* word, so 'I' might map onto [[I, CAME, COMMA],[I, SAW, COMMA],[I, CONQUERED, COMMA]].
*/
TupleDictionary dictionary = new TupleDictionary();
public TextGenerator() {
}
/**
* Read tokens from this input and use them to generate text on this output.
* @param in the input stream to read.
* @param out the output stream to write to.
* @param tupleLength the length of tuples to be used in generation.
* @throws IOException if the file system buggers up, which is not, in the
* cosmic scheme of things, very likely.
*/
void readAndGenerate(InputStream in, OutputStream out, int tupleLength) throws IOException {
/* The root of the rule tree I shall build. */
RuleTreeNode root = new RuleTreeNode( ROOTMAGICTOKEN);
int length = read(in, tupleLength, root);
System.err.println( root.toString());
generate( out, tupleLength, root, length);
}
/**
* Read tokens from the input stream, and compile them into a ruleset below root.
* @param in the input stream from which I read.
* @param tupleLength the length of the tuples I read.
* @param root the ruleset to which I shall add.
* @return the number of tokens read.
* @throws IOException
*/
private int read(InputStream in, int tupleLength, RuleTreeNode root) throws IOException {
int result = 0;
Queue<WordSequence> openTuples = new LinkedList<WordSequence>();
StreamTokenizer tok = prepareTokenizer(in);
for (int type = tok.nextToken(); type != StreamTokenizer.TT_EOF; type = tok.nextToken()) {
result ++;
final WordSequence newTuple = new WordSequence();
String token = readBareToken(tok, type);
openTuples.add(newTuple);
for ( WordSequence tuple : openTuples) {
tuple.add(token);
}
if (openTuples.size() > tupleLength) {
root.addSequence( openTuples.remove());
}
}
return result;
}
/**
* There surely must be a better way to get just the token out of a
* StreamTokenizer...!
* @param tok the tokenizer.
* @return just the next token.
*/
private String readBareToken(StreamTokenizer tok, int type) {
final String token;
switch (type) {
case StreamTokenizer.TT_EOL:
token = "FIXME"; // TODO: fix this!
break;
case StreamTokenizer.TT_NUMBER:
token = new Double(tok.nval).toString();
break;
case StreamTokenizer.TT_WORD:
token = tok.sval.toLowerCase();
break;
default:
StringBuffer buffy = new StringBuffer();
buffy.append((char) type);
token = buffy.toString();
break;
}
return token;
}
/**
* Prepare a tokeniser on this input stream, set up to handle at least
* Western European natural language text.
* @param in the stream.
* @return a suitable tokeniser.
*/
private StreamTokenizer prepareTokenizer(InputStream in) {
Reader gentle = new BufferedReader(new InputStreamReader(in));
StreamTokenizer tok = new StreamTokenizer(gentle);
tok.resetSyntax();
tok.whitespaceChars(8, 15);
tok.whitespaceChars(28, 32);
/* treat quotemarks as white space */
tok.whitespaceChars((int) '\"', (int) '\"');
tok.whitespaceChars((int) '\'', (int) '\'');
tok.wordChars((int) '0', (int) '9');
tok.wordChars((int) 'A', (int) 'Z');
tok.wordChars((int) 'a', (int) 'z');
tok.parseNumbers();
return tok;
}
private void generate(OutputStream out, int tupleLength, RuleTreeNode root, int length) throws IOException {
WordSequence tokens = this.compose( root, tupleLength, length);
if ( tokens.contains(PERIOD)) {
// TODO: eq = equal?
tokens = this.truncateAtLastInstance( tokens, PERIOD);
}
this.generate( out, tokens);
}
/**
* Write this sequence of tokens on this stream, sorting out minor
* issues of orthography.
* @param out the stream.
* @param tokens the tokens.
* @throws IOException if it is impossible to write (e.g. file system full).
*/
private void generate(OutputStream out, WordSequence tokens) throws IOException {
BufferedWriter dickens = new BufferedWriter(new OutputStreamWriter(out));
boolean capitaliseNext = true;
try {
for (String token : tokens) {
capitaliseNext = writeToken(dickens, capitaliseNext, token);
}
} finally {
dickens.flush();
dickens.close();
}
}
/**
* Deal with end of paragraph, capital after full stop, and other
* minor orthographic conventions.
* @param dickens the scrivener who writes for us.
* @param capitalise whether or not the token should be capitalised
* @param token the token to write;
* @return true if the next token to be written should be capitalised.
* @throws IOException
*/
private boolean writeToken(BufferedWriter dickens, boolean capitalise,
String token) throws IOException {
if ( this.spaceBefore(token)) {
dickens.write( " ");
}
if ( capitalise) {
dickens.write(token.substring(0, 1).toUpperCase(Locale.getDefault()));
dickens.write(token.substring(1));
} else {
dickens.write(token);
}
this.maybeParagraph( token, dickens);
return (token.endsWith(PERIOD));
}
/**
* Return false if token is punctuation, else true. Wouldn't it be
* nice if Java provided Character.isPunctuation(char)? However, since it
* doesn't, I can give this slightly special semantics: return true only if
* this is punctuation which would not normally be preceded with a space.
* @param token a token.
* @return true if the token should be preceded by a space, else false.
*/
private boolean spaceBefore(String token) {
final boolean result;
if (token.length() == 1) {
switch (token.charAt(0)) {
case '.':
case ',':
case ':':
case ';':
case 's':
/*
* an 's' on its own is probably evidence of a possessive with
* the apostrophe lost
*/
case 't':
/* similar; probably 'doesn't' or 'shouldn't' or other cases
* of 'not' with an elided 'o'.
*/
result = false;
break;
default:
result = true;
break;
}
} else {
result = false;
}
return result;
}
/**
* If this token is an end-of-sentence token then, with one chance in
* AVSENTENCESPERPARA, have the writer write two new lines. NOTE: The tokeniser is treating
* PERIOD ('.') as a word character, even though it has not been told to.
* Token.endsWith( PERIOD) is a hack to get round this problem.
* TODO: investigate and fix.
*
* @param token a token
* @param dickens our scrivener
* @throws IOException if Mr Dickens has run out of ink
*/
private void maybeParagraph(String token, BufferedWriter dickens) throws IOException {
if ( token.endsWith(PERIOD) && RANDOM.nextInt(AVSENTENCESPERPARA) == 0) {
dickens.write("\n\n");
}
}
/**
* Recursive, backtracking, output generator.
* @param rules
* @param tupleLength
* @param length
* @return
*/
private WordSequence compose(RuleTreeNode rules, int tupleLength, int length) {
Stack<String> preamble = composePreamble( rules);
WordSequence result = new WordSequence();
// composing the preamble will have ended with *ROOT* on top of the stack;
// get rid of it.
preamble.pop();
result.addAll(preamble);
result.addAll(this.compose( preamble, rules, rules, tupleLength, length));
return result;
}
/**
* Recursively attempt to find sequences in the ruleset to append to
* what's been composed so far.
* @param glanceBack
* @param allRules
* @param currentRules
* @param tupleLength
* @param length
* @return
*/
private WordSequence compose(Stack<String> glanceBack,
RuleTreeNode allRules, RuleTreeNode currentRules, int tupleLength,
int length) {
assert (glanceBack.size() == tupleLength) : "Shouldn't happen: bad tuple size";
assert (allRules.getWord().equals(ROOTMAGICTOKEN)) : "Shouldn't happen: bad rule set";
WordSequence result;
try {
@SuppressWarnings("unchecked")
String here = currentRules.getWord((Stack<String>) glanceBack.clone());
System.err.println( String.format( "Trying token %s", here));
result = new WordSequence();
result.add(here);
if (length != 0) {
/* we're not done yet */
Collection<String> options = allRules.getSuccessors();
for (String next : options) {
WordSequence rest =
this.tryOption( (Stack<String>) glanceBack.clone(), allRules,
currentRules.getRule(next), tupleLength, length - 1);
if (rest != null) {
/* we have a solution */
result.addAll(rest);
break;
}
}
}
} catch (NoSuchPathException ex) {
Logger.getLogger(TextGenerator.class.getName()).log(Level.WARNING,
String.format("No path %s: Backtracking...", glanceBack));
result = null;
}
return result;
}
/**
* Try composing with this ruleset
* @param glanceBack
* @param allRules all the rules there are.
* @param currentRules the current node in the rule tree.
* @param tupleLength the size of the glanceback window we're considering.
* @param length
* @return
*/
private WordSequence tryOption(Stack<String> glanceBack,
RuleTreeNode allRules, RuleTreeNode currentRules, int tupleLength,
int length) {
final Stack<String> restack = this.restack(glanceBack,
currentRules.getWord());
restack.pop();
return this.compose(restack, allRules, currentRules, tupleLength,
length);
}
/**
* Return a new stack comprising all the items on the current stack,
* with this new string added at the bottom
*
* @param stack the stack to restack.
* @param bottom the item to place on the bottom.
* @return the restacked stack.
*/
private Stack<String> restack(Stack<String> stack, String bottom) {
final Stack<String> result;
if (stack.isEmpty()) {
result = new Stack<String>();
result.push(bottom);
} else {
String top = stack.pop();
result = restack(stack, bottom);
result.push(top);
}
return result;
}
/**
* Random walk of the rule tree to extract (from the root) a legal sequence of words the length of our tuple.
*
* @param rules the rule tree (fragment) to walk.
* @return a sequence of words.
*/
private Stack<String> composePreamble(RuleTreeNode rules) {
final Stack<String> result;
final RuleTreeNode successor = rules.getRule();
if (successor == null) {
result = new Stack<String>();
} else {
result = this.composePreamble(successor);
result.push(rules.getWord());
}
return result;
}
/**
*
* @param tokens a sequence of tokens
* @param marker a marker to terminate after the last occurrence of.
* @return a copy of tokens, truncated at the last occurrence of the marker.
*/
private WordSequence truncateAtLastInstance(WordSequence tokens,
String marker) {
final WordSequence result = new WordSequence();
if (!tokens.isEmpty()) {
String token = tokens.remove();
result.add(token);
if (!(marker.equals(token) && !tokens.contains(marker))) {
/* woah, double negatives. If the token we're looking at is the
* marker, and the remainder of the tokens does not include the
* marker, we're done. Otherwise, we continue. OK? */
result.addAll(this.truncateAtLastInstance(tokens, marker));
}
}
return result;
}
}


@@ -0,0 +1,60 @@
/*
* Proprietary unpublished source code property of
* Simon Brooke <simon@journeyman.cc>.
*
* Copyright (c) 2013 Simon Brooke <simon@journeyman.cc>
*/
package cc.journeyman.milkwood;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
/**
*
* @author Simon Brooke <simon@journeyman.cc>
*/
public class TupleDictionary extends HashMap<String, Collection<WordSequence>> {
private static final long serialVersionUID = 1L;
/**
* Specialisation: if there isn't an existing entry, create one.
*
* @param token the token to look up
* @return the collection of possible tuples for that token.
*/
public Collection<WordSequence> get(String token) {
Collection<WordSequence> result = super.get(token);
if (result == null) {
result = new ArrayList<WordSequence>();
this.put(token, result);
}
return result;
}
/**
* Add a new, empty sequence to my entry for this token.
* @param token the token
* @return the new sequence which was added.
*/
protected WordSequence addSequence(String token) {
return this.addSequence(token, new WordSequence());
}
/**
* Add this sequence to my entry for this token.
* @param token the token.
* @param sequence the sequence to add. Must not be null!
* @return the sequence which was added.
*/
protected WordSequence addSequence(String token, WordSequence sequence) {
assert (sequence != null) : "invalid sequence argument";
this.get(token).add(sequence);
return sequence;
}
}


@@ -0,0 +1,22 @@
/*
* Proprietary unpublished source code property of
* Simon Brooke <simon@journeyman.cc>.
*
* Copyright (c) 2013 Simon Brooke <simon@journeyman.cc>
*/
package cc.journeyman.milkwood;
import java.util.LinkedList;
import java.util.Queue;
/**
* An ordered sequence of words. Of course it implements Queue since it is a
* LinkedList and LinkedList implements Queue, but I want to make it explicitly
* clear that this is a queue and can be used as such.
* @author Simon Brooke <simon@journeyman.cc>
*/
class WordSequence extends LinkedList<String> implements Queue<String> {
private static final long serialVersionUID = 1L;
}

3628
undermilkwood.txt Normal file

File diff suppressed because it is too large