- Doccano annotation server with spacy backend
fiete@ubu:~/Documents/programming/spacy/doccano_spacy$ tree -L 2 --dirsfirst
.
├── custom-model # contains the spacy model (training) files
│ ├── model-best # trained model (best)
│ ├── model-last # trained model (last)
│ ├── base_config.cfg
│ ├── config.cfg
│ └── train.spacy
├── data # contains the source data
│ ├── exported
│ ├── captum.csv
│ ├── captum.txt
│ └── label_config.json
├── spacy-server # spacy backend server
│ ├── app
│ ├── Dockerfile
│ └── run.sh
├── convert.py # convert reports csv to doccano format
├── docker-compose.yaml
├── exporter.py # contains helper functions
├── generate_train_file.py # generate data file used for training spacy
└── README.md
9 directories, 13 files

For proper authentication, you'll need to create a .env file with the following content in the root of this project:
SPACY_USER=admin
SPACY_PASSWORD=password
You can change the credentials to your liking, but make sure to also adjust the Authorization headers, as described in the Server README and the Set parameters step below.
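As a rough illustration of how the credentials relate to the Authorization header, the sketch below assumes the spacy server uses HTTP Basic authentication. That is an assumption; the Server README is the authoritative source. The helper names parse_env and basic_auth_header are hypothetical and not part of this project.

```python
import base64


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, as in the .env file shown above."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def basic_auth_header(user: str, password: str) -> str:
    """Build an Authorization header value.

    ASSUMPTION: the server expects HTTP Basic auth; check the Server README.
    """
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


env = parse_env("SPACY_USER=admin\nSPACY_PASSWORD=password\n")
header = basic_auth_header(env["SPACY_USER"], env["SPACY_PASSWORD"])
print(header)  # Basic YWRtaW46cGFzc3dvcmQ=
```

If you change the credentials in .env, the header value changes with them, which is why the headers in doccano have to be adjusted as well.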
docker-compose up -d

Doccano should now be available at http://localhost:8000 in your browser (Credentials: admin, password).
Shut down:
docker-compose stop

In order to start annotating, convert your csv file (in my case data/captum.csv) into the format doccano requires for imports.
python convert.py

Note that this requires the spacy en_core_web_md model, which can be obtained by running python -m spacy download en_core_web_md.
Open the web UI at http://localhost:8000.
Files you import must have a specific format. You can use convert.py to convert from a pandas dataframe to a text-line file (one document per line).
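To give an idea of what a text-line file looks like, here is a minimal sketch using only the standard library. It is not the project's convert.py (which works on a pandas dataframe and needs en_core_web_md); the column name "text" and the function name csv_to_textlines are assumptions made for illustration.

```python
import csv
import io


def csv_to_textlines(csv_file, column="text"):
    """Return one cleaned line of text per non-empty CSV row.

    ASSUMPTION: the reports live in a column called "text".
    """
    lines = []
    for row in csv.DictReader(csv_file):
        # Collapse internal newlines and repeated spaces so that each
        # document ends up on exactly one line.
        text = " ".join((row.get(column) or "").split())
        if text:
            lines.append(text)
    return lines


sample = io.StringIO('id,text\n1,"First\nreport"\n2,\n3,Second report\n')
result = csv_to_textlines(sample)
print("\n".join(result))
```

Writing the returned lines to a .txt file, one per line, gives a file in the shape that a text-line import expects.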

You can create labels in the labels section (side menu). Labels can also be imported and exported (see data/label_config.json).

Important: Make sure you have created your custom labels before setting this up!
Navigate to Settings and select the Auto Labeling tab. Hit Create and select Custom REST template.
In the next step, we are specifying the request properties. This includes setting the Content-Type and Authorization headers and the request Body. For details on how to obtain the correct Authorization Header, also check the Server README.
If all is configured correctly, the test should return a valid response.
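For reference, the request that doccano sends can be approximated in Python. This is only a sketch: the URL is a placeholder, the body shape {"text": ...} is an assumption about what the spacy server accepts, and build_autolabel_request is a hypothetical helper. The request is built but not sent.

```python
import json
import urllib.request


def build_autolabel_request(url, text, authorization):
    """Build (but do not send) a POST request like the one configured above."""
    # ASSUMPTION: the backend accepts a JSON body of the form {"text": "..."}.
    body = json.dumps({"text": text}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": authorization,  # value as described in the Server README
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Placeholder URL and header value, for illustration only.
request = build_autolabel_request(
    "http://localhost:9000/", "Some report text.", "Basic <token>"
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would send it, which is a quick way to check the headers outside of doccano.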

Here we can customize the mapping between the response we get from the annotation backend (in this case the spacy server) and doccano. The mapping is written as a Jinja2 template.

Finally, we have to provide the mapping between the labels returned by the spacy backend and the ones present in doccano. This appears to be required even when the labels are identical.
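The two mapping steps can be pictured as a plain Python function. This is a sketch of the transformation only, not the Jinja2 template itself. It assumes the backend returns {"entities": [{"start", "end", "label"}]} and that doccano expects spans with label, start_offset and end_offset; both shapes, and the name to_doccano_spans, are assumptions to verify against your own setup.

```python
def to_doccano_spans(response, label_map):
    """Convert a backend response into doccano-style spans.

    ASSUMPTION: response looks like
    {"entities": [{"start": 0, "end": 6, "label": "PRODUCT"}]}.
    """
    spans = []
    for entity in response.get("entities", []):
        label = label_map.get(entity["label"])
        if label is None:
            # In this sketch, labels without a mapping entry are dropped,
            # which is why identical labels still need an entry.
            continue
        spans.append(
            {
                "label": label,
                "start_offset": entity["start"],
                "end_offset": entity["end"],
            }
        )
    return spans


response = {
    "entities": [
        {"start": 0, "end": 6, "label": "PRODUCT"},
        {"start": 10, "end": 14, "label": "DATE"},
    ]
}
# Identity mapping for PRODUCT; DATE has no entry and is skipped.
spans = to_doccano_spans(response, {"PRODUCT": "PRODUCT"})
print(spans)
```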

pip install -r requirements

Download the spacy model:
python -m spacy download en_core_web_md

In Doccano, go to the Datasets page and export the dataset. This will create a zip file containing the annotations per user, i.e. admin.jsonl, as well as unknown.jsonl, which contains all the sections that have not been annotated yet.
In the data folder, create an exported folder and copy over the admin.jsonl file.
The training file is used by spacy in the spacy train command. Run the generate_train_file.py script to generate the file based on admin.jsonl.
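What such a conversion involves can be sketched as follows. This is not the project's generate_train_file.py. It assumes each exported line has a "text" field and a "label" (or "labels") list of [start, end, label] triples, which may differ between doccano versions. The function names are hypothetical, and spacy is only imported when write_docbin is actually called.

```python
import json


def load_doccano_jsonl(lines):
    """Turn exported JSONL lines into (text, [(start, end, label), ...]) pairs."""
    examples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # ASSUMPTION: spans are stored under "label" (or "labels").
        raw_spans = record.get("label") or record.get("labels") or []
        entities = [(int(s), int(e), str(name)) for s, e, name in raw_spans]
        examples.append((record["text"], entities))
    return examples


def write_docbin(examples, output_path, lang="en"):
    """Serialize the examples to a .spacy file (requires spacy v3)."""
    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank(lang)
    doc_bin = DocBin()
    for text, entities in examples:
        doc = nlp.make_doc(text)
        spans = [doc.char_span(s, e, label=name) for s, e, name in entities]
        # char_span returns None when offsets do not align with token
        # boundaries; such spans are skipped in this sketch.
        doc.ents = [span for span in spans if span is not None]
        doc_bin.add(doc)
    doc_bin.to_disk(output_path)


sample_lines = ['{"text": "Captum is a library", "label": [[0, 6, "PRODUCT"]]}', ""]
examples = load_doccano_jsonl(sample_lines)
print(examples)
```

With spacy installed, write_docbin(examples, "custom-model/train.spacy") would produce a file in the location the tree above shows. The label PRODUCT is only an example.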
python generate_train_file.py

python -m spacy train custom-model/config.cfg --output ./custom-model

SciSpacy
python -m spacy train custom-model/scispacy/config.cfg --output ./custom-model/scispacy/ --paths.train ./custom-model/train.spacy --paths.dev ./custom-model/train.spacy

If everything was successful, you should now have a model-best and model-last folder in the custom-model directory.
If the containers are still running, use docker-compose stop to stop them. Now we can recreate them with:
docker-compose up --build