Debugging a .tsv spaCy dataset.

If you have ever worked with any kind of dataset, or tried to fine-tune an NLP model yourself using the spaCy v3 library for Python, you might have run into a problem like the one in the image below when trying to convert your manually tagged dataset…

This blog post aims to help you, the reader, in your ML adventures and get you debugging your custom-built NER dataset (or at least to remind the writer how to do it).

Searching the internet does not turn up much help specifically for .tsv files, and spaCy does not help much either by telling you where the problem is: the command simply shows the message above, or a similar one if the format is not correct. So ready up your regex abilities and let's go…

We start by searching for any run of space characters and replacing it with a single tab:

Next, search for two or more consecutive tab characters and replace them with a single one:

Also search for any tab character followed by a newline, which usually indicates an empty tag:

and for any three or more consecutive newline characters:
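The four clean-up passes above can be run in any editor with regex support, or sketched in Python as below. The exact replacements for the last two steps are my assumptions (dropping the trailing tab, and collapsing the newlines down to one blank line, since a blank line separates sentences in the IOB format):

```python
import re

def clean_iob_tsv(text):
    """Normalize whitespace in a token-per-line IOB file."""
    text = re.sub(r" +", "\t", text)        # runs of spaces -> one tab
    text = re.sub(r"\t{2,}", "\t", text)    # 2+ tabs -> one tab
    text = re.sub(r"\t\n", "\n", text)      # drop tab before newline (empty tag)
    text = re.sub(r"\n{3,}", "\n\n", text)  # 3+ newlines -> one blank line
    return text

sample = "foo  B-SKILL\nbar\t\tO\n\n\n\nbaz\tO\n"
print(clean_iob_tsv(sample))
```

To clean your real file, pass it `open("train.tsv", encoding="utf-8").read()` and write the result back out (the filename is just an example).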

This should cover most of the problems your dataset might have. If you are still unable to convert your .tsv IOB file, you can try the following.

Since there is no message indicating the line number where the conversion failed, there is a small hack we can apply to the script spaCy uses to convert the file.

Although the error only asks us to check that the contents of the file are appropriately formatted, we can find the Python module being used for the conversion in the error output. I'm using Python 3.8 in WSL2.

The file conll_ner_to_docs.py is located at:

/home/<user>/.local/lib/python3.8/site-packages/spacy/training/converters/

and contains the following code, where the error is raised:

Excerpt: lines 100 to 107 of conll_ner_to_docs.py

On line 105 we can see where the ValueError is raised. But what if we print the contents of the cols variable just before the raise?
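From memory, the check in conll_ner_to_docs.py looks roughly like the snippet below (a paraphrase, not spaCy's code verbatim): each sentence's lines are split into columns, and if fewer than two columns come back, the ValueError is raised. The hack is the single print() added before the raise, reproduced here on a toy sentence so you can see what gets printed:

```python
def split_sentence(lines):
    """Paraphrase of the column check in spaCy's conll_ner_to_docs.py."""
    cols = list(zip(*[line.split() for line in lines]))
    if len(cols) < 2:
        print(lines)  # <- the hack: show the offending sentence before failing
        raise ValueError(
            "The token-per-line NER file is not formatted correctly."
        )
    return cols

# A well-formed sentence produces two columns (tokens and tags):
split_sentence(["John B-PER", "works O", "here O"])

# A sentence with a missing tag triggers the print and the error:
try:
    split_sentence(["John B-PER", "works", "here O"])
except ValueError:
    pass
```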

That’s right.

We get to print the sentence where the error is happening.

Now we at least have an idea of where the error is, and can go search around that area in the dataset:

We can tag that missing skill we forgot earlier and run the command again. Hopefully the same phrase won't be printed again, and we'll get an output like this:

This indicates that your dataset was correctly converted into spaCy v3 JSON, so you can now convert it to .spacy and start training your custom NER model.
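If you'd rather not edit installed packages at all, a small stand-alone checker can report the offending line numbers in the .tsv directly. This is my own sketch, not part of spaCy; it assumes the token-per-line format of exactly one token, one tab, and one tag:

```python
def find_bad_lines(text):
    """Return (line_number, line) pairs that are not 'token<TAB>tag'."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():  # blank lines separate sentences, so they're fine
            continue
        fields = line.split("\t")
        if len(fields) != 2 or not all(f.strip() for f in fields):
            problems.append((lineno, line))
    return problems

sample = "John\tB-PER\nworks\n\nhere\tO\n"
for lineno, line in find_bad_lines(sample):
    print(f"line {lineno}: {line!r}")  # prints: line 2: 'works'
```

Run it on your real file with `find_bad_lines(open("train.tsv", encoding="utf-8").read())` (again, the filename is just an example).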

I hope you found the information in this blog post useful. If you have any suggestions or comments, please let me know; I'm always open to improving my knowledge.

You can read more about the story of this project in my blog post about it here.