Tokenisation - Don't drink too much HAIpe

Tokenization

As an Australian, I spell "Tokenisation" without the zed. However, this gets a bit annoying, because the libraries use the zed of the American spelling. Therefore, I use the American spelling.

Next I wanted to write the model's getText and setText functions. I started with Sentence and added a description to the class explaining that a sentence is simply a list of Tokens.

Read the explanatory comment in the Sentence class @src/dockb/models/sentence.py#L8-12.
Recommend the best NLP library with the best Tokenization functions which will achieve what is needed.
If my Token is too simplistic so that it will lose significant information of the Tokenization, advise me.
Don't do any coding work just now, wait for me to authorise you to proceed.

It recommended spaCy, and also pointed out that the Token class is too simplistic. It also mentioned that spaCy loses the whitespace between words, and would need a post process to re-insert the whitespace, so I asked the following:

If I use Hugging Face Transformers, would that give me the whitespace without requiring a post process?
If I used Hugging Face, would that require extra functions/properties in the Token, and what would the value be?
Would Hugging Face be less efficient or cause a lot of complexities in the code?

The answer was that it does not give me the whitespace, and that it could require a GPU. I decided to go with spaCy, and prompted it like so:

Does spaCy give me the option of tokenizing and embedding the whitespace into the words?
If so, I could tokenize it as words with embedded spaces, but before constructing the Token
object, extract it, and create an extra Token (or Tokens) for the whitespace. Is that possible with spaCy?

Aha! It corrected itself: spaCy does work with whitespace as tokens, but the most natural way in spaCy is to keep the trailing whitespace on the word token itself. I decided to prompt it again:

If I use spaCy in its more natural mode of having the whitespace as token.whitespace_,
then I could mirror that in my Token class - not have a TOKEN_IS_WHITESPACE type.
Check also - does spaCy emit punctuation Tokens?

It confirmed that spaCy handles punctuation as Tokens. It agreed that changing the way the Token handles whitespace was a good idea. It also pointed out that leading whitespace could be an issue, and suggested that it be a property of the Sentence class rather than the Token (which makes sense). I decided to go with the modified Token class. I changed the tests, and told it to fix the class.
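
To make that design concrete, here is a minimal sketch of a Token that carries its own trailing whitespace and a Sentence that owns the leading whitespace. It is not the project's actual code: the field names trailing_ws and leading_ws and the helper to_tokens are my assumptions for illustration. The only spaCy facts relied on are that its tokens expose .text and .whitespace_, and the SimpleNamespace objects stand in for real spaCy tokens so the sketch runs without a model.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace


@dataclass
class Token:
    text: str
    trailing_ws: str = ""  # mirrors spaCy's token.whitespace_


@dataclass
class Sentence:
    leading_ws: str = ""  # whitespace before the first token lives on the Sentence
    tokens: list[Token] = field(default_factory=list)

    def get_text(self) -> str:
        # Rebuild the text by joining each token with its trailing whitespace.
        body = "".join(tok.text + tok.trailing_ws for tok in self.tokens)
        return self.leading_ws + body


def to_tokens(spacy_like_tokens) -> list[Token]:
    """Copy .text and .whitespace_ from spaCy-like tokens (hypothetical helper)."""
    return [Token(t.text, t.whitespace_) for t in spacy_like_tokens]


# Stand-ins for spaCy tokens, so no language model is needed to run this.
fake_doc = [
    SimpleNamespace(text="Hello", whitespace_=""),
    SimpleNamespace(text=",", whitespace_=" "),
    SimpleNamespace(text="world", whitespace_=""),
    SimpleNamespace(text="!", whitespace_=""),
]

sentence = Sentence(leading_ws="  ", tokens=to_tokens(fake_doc))
print(repr(sentence.get_text()))
```

With this shape there is no need for a whitespace token type: punctuation is an ordinary token, and the original text is recovered by a plain join.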

That worked very well, and I have begun to learn about spaCy. So far, so good. Rather than bore you with all my prompts, I'll summarise:

Does spaCy handle multiple punctuation marks as a single token, or as multiple tokens? As singles.
Does spaCy interpret extended characters like the accented 'e' in 'café' as just part of the word? Fully part of the word.
Does spaCy separate special characters like emojis as single tokens? Yes, and emojis are distinct from punctuation, words, or digits. Mixed alphanum is a word.
Is a single number like '123' treated as a special type of token? Yes (I added a TOKEN_IS_NUMBER type).

Then I asked it to tell me what extra properties are needed in Token so that it properly captures what spaCy tokenizes.
It said the Token class is missing like_num, is_alpha, is_stop, lemma, and pos. I asked it to add those properties to the Token class, but not to modify the functions or tests. I wanted to look at the types it uses first.

I got it to add the tokenization to the Sentence class. In this, I departed completely from TDD, because I was prototyping using an LLM. This is what it wrote, and it gives me cause for concern:

        def tokenize(self) -> None:
            doc = _nlp(self.text)
            self.tokens = []
            for spacy_token in doc:
                token = Token()
                token.set_text(spacy_token.text)
                token.trailing_ws = spacy_token.whitespace_
                if spacy_token.pos_:
                    token.set_pos(spacy_token.pos_)
                token.set_lemma(spacy_token.lemma_)
                token.is_digit = spacy_token.is_digit
                token.like_num = spacy_token.like_num
                token.is_alpha = spacy_token.is_alpha
                self.tokens.append(token)
            self.dirty = False

Is set_text always going to create the correct type of Token? Are my rules 100% compatible with spaCy's? Token must not throw. Rather than throw, it should fall back to an uncategorised type. I decided to change all the tests, and get the LLM to fix the Token class accordingly.

After that change, my Token class adds value to spaCy's tokenization, but does not enforce its own rules above those of spaCy, which could have caused it to crash for no good reason.
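
The fallback idea can be sketched as below. This is an illustration under assumptions, not the project's Token class: TOKEN_IS_WORD and TOKEN_IS_NUMBER come from the text above, while TOKEN_IS_PUNCTUATION, TOKEN_IS_UNCATEGORISED and the classify function are names I made up. The rules are deliberately simple; the point is only that anything they do not recognise gets a fallback type instead of an exception.

```python
import unicodedata
from enum import Enum, auto


class TokenType(Enum):
    TOKEN_IS_WORD = auto()
    TOKEN_IS_NUMBER = auto()
    TOKEN_IS_PUNCTUATION = auto()    # hypothetical member
    TOKEN_IS_UNCATEGORISED = auto()  # hypothetical fallback member


def classify(text: str) -> TokenType:
    """Best-effort classification that never raises on unexpected input."""
    if not text:
        return TokenType.TOKEN_IS_UNCATEGORISED
    if text.isdigit():
        return TokenType.TOKEN_IS_NUMBER
    if text.isalnum():
        # Covers accented letters ("café") and mixed alphanumerics ("abc123").
        return TokenType.TOKEN_IS_WORD
    if all(unicodedata.category(ch).startswith("P") for ch in text):
        return TokenType.TOKEN_IS_PUNCTUATION
    # Emojis, contraction fragments such as "n't", and anything else the
    # tokenizer emits that these rules do not anticipate.
    return TokenType.TOKEN_IS_UNCATEGORISED


for sample in ["123", "café", "abc123", "!", "n't", "🙂"]:
    print(sample, classify(sample).name)
```

The design choice is that the tokenizer is the authority on where tokens begin and end, so the Token only labels what it is given and never rejects it.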

Then I wrote tests for the Sentence.tokenize function... and I ended up writing a DocCache class for caching the documents used by the tokenizer. Then I also wrote an AsyncSentenceHelper class... then wrote the tests. I admit this was not exactly TDD, but I wrote the classes hastily to help me think about what I really wanted. This was prototyping after all.
Then the real work: write the tests, and perhaps rewrite the classes with LLM assistance.

I noticed that I had classes in packages and was importing them individually from their files, rather than using an __init__.py file. I got the LLM to tidy that up rather than doing that boring work myself.

Then I ran pylint, and it gave me a stack of errors. I fed them all to the LLM and went and made breakfast for my wife and me.

At this point I realised it needed more specific instructions regarding circular dependencies, because I want types declared with normal imports rather than behind an if TYPE_CHECKING guard, and as a result I did a bit of manual cleanup. I also decided to do some cleanup with the following prompt:

Change the TokenType to a shorter form. (Rather than TokenType.TOKEN_IS_WORD, just Type.WORD.)
@src/dockb/models/token.py#L38 rename the class from TokenType to Type, and wherever TokenType
is referenced throughout the source, tests, and comments, change to Type.
@src/dockb/models/token.py#L41-44 remove the "TOKEN_IS_" prefix, and wherever the old forms are referenced in the
source, including any comments and in tests, change them there also.

Then I followed up with:

I have increased the line length in all the tools to 140 characters, so you can now
make the declaration of the elements of the "expected" array be each on a single line here:
@tests/dockb/models/test_sentence.py#L194

To be honest, a good IDE would have done that more quickly.
Then I wrote tests for the AsyncSentenceReconstructor.

I ran the tests myself and fixed a few problems, then asked the LLM to review the code in job_queue.py, because I'd had to fix a race condition. It found another one, along with a few other issues, and I gave it instructions about how to fix one of them. It missed one issue, which I pointed out to it, but then it fixed that problem very well. Compared to Claude, I'm doing a lot more work. Claude probably would have done a better job from the start and not needed so much prompting. However, this experiment is not really about Big Pickle's competency; it's to show how effectively an LLM (even a poor one) works when the user writes the tests and the specification first. So far, the only real problem I've had with that approach has been my own discipline: I've broken out of TDD mode a few times.

I noticed that the JobQueue was not thread safe, and I definitely want it to be, because this back end will be multi-threaded. So I asked the LLM how it would go about dispensing with AsyncIO and using a single Python worker thread instead. The plan seemed good, so I told it to go for it.
It did a pretty good job, but I reviewed its code and went through a few iterations of reviews and adjustments before it was all working nicely. This does not depart from TDD, all it shows is that the LLM was imperfect.
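
For readers who have not built one, a single-worker-thread queue can be sketched in a few lines. This is not the project's JobQueue; the method names enqueue and shutdown and the _STOP sentinel are my assumptions. It relies on the standard library's queue.Queue, which is safe to call from several threads, so any thread can submit work while exactly one thread runs it.

```python
import queue
import threading


class JobQueue:
    """Runs submitted callables one at a time on a single worker thread."""

    _STOP = object()  # sentinel that tells the worker to exit

    def __init__(self) -> None:
        self._jobs: queue.Queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def enqueue(self, job) -> None:
        # Safe to call from any thread.
        self._jobs.put(job)

    def shutdown(self) -> None:
        # Jobs already queued run first, then the worker exits.
        self._jobs.put(self._STOP)
        self._worker.join()

    def _run(self) -> None:
        while True:
            job = self._jobs.get()
            if job is self._STOP:
                break
            try:
                job()
            except Exception:
                # A real queue would log or report this; the sketch only
                # makes sure one failing job cannot kill the worker.
                pass


results = []
jobs = JobQueue()
for n in range(5):
    jobs.enqueue(lambda n=n: results.append(n))
jobs.shutdown()
print(results)
```

Because only the worker thread ever runs a job, the jobs themselves need no locking against each other; the only shared structure is the queue.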

I then used the LLM to assist me with a design issue. Here is my prompt:

consider (don't make changes) the idea of removing the tokenize method from all the models
and moving the logic of the tokenize method from the Sentence model class into the ReconstructJob class.
You would need to avoid causing circular dependencies, and I do not want to use the TYPE_CHECKING Pattern.
I want all __init__.py files to be empty. Tell me what would be involved, and propose a good approach

It replied, confirming my instinct:

What would be involved:

Move tokenize body from Sentence into ReconstructJob.run(). The Token/Type imports move with it — already fine since services → models is the established direction.
Remove tokenize from DockbModel base class. The default no-op method exists purely because SyncSentenceReconstructor calls model.tokenize(...). That caller would need to change — either hasattr(model, "tokens") check (it already has this) followed by inline logic, or a separate SentenceTokenizer service.
Remove tokenize from Sentence. No longer needed; the job owns that logic.
Remove DocCache import from models/base.py and models/sentence.py. This is the big win: models stop knowing about spaCy entirely.
SyncSentenceReconstructor would need similar treatment — its model.tokenize() call would become inline tokenization logic or delegate to a shared SentenceTokenizer utility in services (to avoid duplicating the logic from ReconstructJob).

No circular dependencies would be created. The services → models arrow stays one-way. TYPE_CHECKING is not needed because Token and Type would just be normal imports in services.

Proposed approach:

Extract the tokenization logic into a SentenceTokenizer class in services/semantics/sentence_tokenizer.py. Both ReconstructJob and SyncSentenceReconstructor delegate to it. This avoids code duplication, keeps Sentence clean, and models become a pure domain layer with zero spaCy awareness.

services/semantics/sentence_tokenizer.py (new)

takes (text: str, doc_cache: DocCache) → list[Token]

ReconstructJob.run() → calls SentenceTokenizer

SyncSentenceReconstructor.run() → calls SentenceTokenizer

Sentence → no tokenize, no DocCache import

DockbModel → no tokenize method

Want me to do this?

I replied, "Yes", and after it had finished this, prompted it again:

Add a mechanism in the SentenceTokenizer so that it can be interrupted if the ReconstructJob is cancelled.
Note that the cancel method in Job calls on_cancel(), and this can be called from another thread. Implement on_cancel() in the 
ReconstructJob to call a method in SentenceTokenizer which will set a variable, and have this variable checked at pertinent
points in the time consuming part of the tokenization process.

Oops. That was in effect a large design change. I should have written a test first, documented it, and then set the LLM going. It wrote a test... The test was not so great. I wrote more tests. I also moved DocCache from the utils directory of model to the semantics directory.
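
The cancellation mechanism I asked for in that prompt can be sketched roughly as follows. This is my own illustration, not the code the LLM produced: the constructor taking an nlp callable (instead of a DocCache), the cancel method name, and returning None on cancellation are all assumptions, and fake_nlp stands in for spaCy. A threading.Event is used as the shared variable because it is safe to set from another thread.

```python
import threading
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class Token:
    text: str
    trailing_ws: str = ""


class SentenceTokenizer:
    def __init__(self, nlp) -> None:
        self._nlp = nlp  # any callable returning spaCy-like tokens
        self._cancelled = threading.Event()

    def cancel(self) -> None:
        """Request cancellation; safe to call from another thread."""
        self._cancelled.set()

    def tokenize(self, text: str):
        """Return a list of Tokens, or None if cancelled along the way."""
        if self._cancelled.is_set():
            return None
        doc = self._nlp(text)  # the expensive call cannot be interrupted midway
        tokens = []
        for spacy_token in doc:
            if self._cancelled.is_set():  # checked once per token
                return None
            tokens.append(Token(spacy_token.text, spacy_token.whitespace_))
        return tokens


class ReconstructJob:
    def __init__(self, text: str, tokenizer: SentenceTokenizer) -> None:
        self._text = text
        self._tokenizer = tokenizer
        self.tokens = None

    def on_cancel(self) -> None:
        # Called by the job's cancel machinery, possibly from another thread.
        self._tokenizer.cancel()

    def run(self) -> None:
        self.tokens = self._tokenizer.tokenize(self._text)


def fake_nlp(text: str):
    """Stand-in for spaCy: splits on spaces and fills in .whitespace_."""
    words = text.split()
    last = len(words) - 1
    return [SimpleNamespace(text=w, whitespace_=" " if i < last else "") for i, w in enumerate(words)]


job = ReconstructJob("one two three", SentenceTokenizer(fake_nlp))
job.run()
print([t.text for t in job.tokens])
```

Note the limit of this approach: the flag is only checked between steps, so a cancel request that arrives during the expensive call takes effect once that call returns.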

My tests for the SentenceTokenizer were difficult to write. It was extremely difficult to set up a test that causes collisions, and then I realised that whatever I add to cause collisions on my PC possibly won't work on someone else's. I already have a test that causes collisions between two threads, based on event signalling, to make sure a running job does get cancelled correctly, so I removed that logic from the SentenceTokenizer test. However, when I did have collisions happening, I could only ever get one to occur, even when throwing 100 ReconstructJobs at the queue. This is because the queue is single threaded: while one job is running, any others that are enqueued at the same time will all cancel it, and the last one standing cancels any others still queued. By the time all 100 had been enqueued, only one had been cancelled while running. The very fact that this was difficult to simulate tells me that the queue is going to work very well in production, and that the design is robust.
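
The event-signalling style of test mentioned above looks roughly like this. It is a simplified sketch with a made-up Job class, not my actual test: the job announces that it has started, the test waits for that signal instead of sleeping and hoping, and only then cancels. That removes the dependence on how fast a particular PC happens to be.

```python
import threading


class Job:
    """Hypothetical long-running job that can be cancelled from another thread."""

    def __init__(self) -> None:
        self.started = threading.Event()
        self.cancelled = threading.Event()
        self.finished_normally = False

    def cancel(self) -> None:
        self.cancelled.set()

    def run(self) -> None:
        self.started.set()  # tell the test we are now inside run()
        # Simulated long work: block until cancelled, or give up after a timeout.
        if not self.cancelled.wait(timeout=5.0):
            self.finished_normally = True


def test_running_job_is_cancelled() -> None:
    job = Job()
    worker = threading.Thread(target=job.run)
    worker.start()

    # Wait for the job to signal that it is really running before cancelling.
    assert job.started.wait(timeout=5.0)
    job.cancel()

    worker.join(timeout=5.0)
    assert not worker.is_alive()
    assert job.cancelled.is_set()
    assert not job.finished_normally


test_running_job_is_cancelled()
```

The timeouts are only a safety net so that a broken implementation fails the test rather than hanging it; on the passing path nothing waits for them.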

Next I asked the LLM to check my test coverage. It came back with a fairly long list of issues. That is embarrassing. So I prompted it:

Starting from the first problem you mentioned above where you want to parametrize the
duplicates that all inherit DockbModel behaviour, keep separate tests for Document, Chapter, Paragraph,
and Sentence - because the tests don't know about the inheritance, and that may change.
But do parametrize them within the tests for those classes. Make the changes, and pause for me to ask
you to step to the next problem you identified. I'll ask you to step through each one, but I want to
look at the code in detail between each change you make. So pause after making each change you listed
For example after addressing the "690 lines of duplicated ..." which I've been talking about here,
when I say to continue, then go on with the "get_text() dirty-False branches ...", then after that
one, "clear_semantics()..." and so on, waiting for me to tell you to continue after each one.

It did a very good job of parameterising the tests, nothing was lost in that process, so I let it continue to the next problem: get_text() dirty=False branches on Document, Chapter, Paragraph — when dirty=False they aggregate text from child hierarchies, but no test populates children and verifies the aggregated output.
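
As an illustration of both points, here is a sketch of a table-driven test for one class that also covers the dirty=False aggregation branch. Several things here are assumptions: the tiny Sentence and Paragraph classes are stand-ins, the meaning I give to dirty (True returns the raw text, False aggregates the children) is inferred from the description above, and I use the standard library's unittest subTest in place of pytest's parametrize so that the sketch has no dependencies.

```python
import unittest


class Sentence:
    def __init__(self, text: str) -> None:
        self.text = text

    def get_text(self) -> str:
        return self.text


class Paragraph:
    def __init__(self, text: str = "") -> None:
        self.text = text
        self.children = []
        self.dirty = True  # assumed: True means the raw text is authoritative

    def get_text(self) -> str:
        if self.dirty:
            return self.text
        # Assumed: when clean, the text is rebuilt from the child hierarchy.
        return "".join(child.get_text() for child in self.children)


class ParagraphGetTextTest(unittest.TestCase):
    # One table of cases for this class only; the other models would keep
    # their own test classes even if they share behaviour today.
    CASES = [
        # (label, dirty, raw text, child texts, expected)
        ("dirty returns raw text", True, "Raw.", ["One. ", "Two."], "Raw."),
        ("clean aggregates children", False, "Raw.", ["One. ", "Two."], "One. Two."),
        ("clean with no children", False, "Raw.", [], ""),
    ]

    def test_get_text(self) -> None:
        for label, dirty, raw, child_texts, expected in self.CASES:
            with self.subTest(label):
                paragraph = Paragraph(raw)
                paragraph.children = [Sentence(t) for t in child_texts]
                paragraph.dirty = dirty
                self.assertEqual(paragraph.get_text(), expected)


if __name__ == "__main__":
    unittest.main()
```

The second case is the one that was missing from my suite: it populates the children and checks the aggregated output, so a broken aggregation branch cannot hide behind tests that only exercise the dirty path.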

It did a very good job of that also... I told it to continue, but it got carried away and did not step through its long list waiting for me to say "continue". This left me with a lot of tests to check and also a lot of its comments to check. It made two tests purposefully forgiving to hide a bug in the code.
I fixed the code and told it to revisit the tests. This confirms that TDD is the way to go: it is tempting to just let the LLM write the tests, but if you do, then you need to check each one. You may as well write them yourself, and just ask the LLM to spot code coverage issues. I did that, and apart from the two erroneous tests, the rest were good.