This repository contains a simple containerized API to tokenize text using the spaCy library. The API is built using FastAPI.
The image is available on Docker Hub under the name codeinchq/nlp-tokenizer.
By default, the container listens on port 3000. The port is configurable using the PORT environment variable.
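To illustrate the PORT setting described above, here is a minimal sketch that builds a `docker run` command which sets PORT and publishes the same port on the host. The helper name `docker_run_args` is hypothetical and not part of this project; `-e` and `-p` are standard `docker run` flags. The sketch only prints the command, so it can be tried without Docker installed.

```python
import shlex

IMAGE = "codeinchq/nlp-tokenizer"  # image name from this README


def docker_run_args(host_port, container_port):
    """Build a `docker run` argument list that sets PORT and publishes it."""
    for port in (host_port, container_port):
        if not 0 < port < 65536:
            raise ValueError(f"port out of range: {port}")
    return [
        "docker", "run",
        "-e", f"PORT={container_port}",         # port the API listens on inside the container
        "-p", f"{host_port}:{container_port}",  # host port mapped to the container port
        IMAGE,
    ]


if __name__ == "__main__":
    # Print the command instead of executing it.
    print(shlex.join(docker_run_args(8000, 8000)))
```

Setting PORT explicitly keeps the published port and the listening port in agreement, whatever the image's default is.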
To run the container locally, execute the following command:
docker run -p "8000:8000" codeinchq/nlp-tokenizer
The sentence tokenization endpoint is available at /tokenize/sentences and accepts the following parameters:
text: the text to tokenize
lang: the language of the text (optional)
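For callers working from Python, the parameters above could be sent as in the following sketch, which uses only the standard library. The function names and `BASE_URL` are assumptions for illustration; the accepted values for `lang` and the shape of the response are not documented here, so the sketch returns the decoded JSON unchanged.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumed: same host and port as the curl examples


def build_sentences_request(text, lang=None):
    """Build a POST request for /tokenize/sentences without sending it."""
    payload = {"text": text}
    if lang is not None:
        payload["lang"] = lang  # optional; accepted values are not documented here
    return urllib.request.Request(
        f"{BASE_URL}/tokenize/sentences",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokenize_sentences(text, lang=None, timeout=10.0):
    """Send the request and return the decoded JSON body as-is."""
    request = build_sentences_request(text, lang)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Building the request separately from sending it makes the payload easy to inspect or test without a running container.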
curl -X POST "http://127.0.0.1:8000/tokenize/sentences" \
-H "Content-Type: application/json" \
-d '{"text": "This is a sample sentence for the documentation. It is used for illustrative purposes."}'
The word tokenization endpoint is available at /tokenize/words and accepts the following parameters:
text: the text to tokenize
lang: the language of the text (optional)
exclude_punct: exclude punctuation from the tokenization (optional, default: true)
lowercase: lowercase the tokens (optional, default: false)
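The optional flags above can be made explicit in the request body. The sketch below assumes they are sent as JSON booleans in the same body as `text`, which the curl examples suggest but this README does not state; the helper name `words_payload` is hypothetical.

```python
import json


def words_payload(text, lang=None, exclude_punct=True, lowercase=False):
    """Return the JSON body for /tokenize/words as a string."""
    payload = {
        "text": text,
        "exclude_punct": exclude_punct,  # documented default: true
        "lowercase": lowercase,          # documented default: false
    }
    if lang is not None:
        payload["lang"] = lang  # optional; omitted when not provided
    return json.dumps(payload)


if __name__ == "__main__":
    # Keep punctuation and lowercase the tokens.
    print(words_payload("Hello, World!", exclude_punct=False, lowercase=True))
```

The resulting string can be passed to curl with `-d` or encoded to bytes for an HTTP client.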
curl -X POST "http://127.0.0.1:8000/tokenize/words" \
-H "Content-Type: application/json" \
-d '{"text": "This is a sample sentence for the documentation. It is used for illustrative purposes."}'
The paragraph tokenization endpoint is available at /tokenize/paragraphs and accepts the following parameters:
text: the text to tokenize
lang: the language of the text (optional)
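Text with several paragraphs usually contains line breaks, which must be escaped inside a JSON string. The sketch below shows `json.dumps` handling that escaping. The helper name `paragraphs_body` is hypothetical, and the blank-line paragraph separator in the sample text is an assumption, since this README does not say how the service detects paragraph boundaries.

```python
import json


def paragraphs_body(text, lang=None):
    """Return the JSON body for /tokenize/paragraphs as a string."""
    payload = {"text": text}
    if lang is not None:
        payload["lang"] = lang
    # json.dumps escapes each line break as the two characters "\" and "n".
    return json.dumps(payload)


if __name__ == "__main__":
    # Assumed input shape: paragraphs separated by a blank line.
    sample = "First paragraph. It has two sentences.\n\nSecond paragraph."
    print(paragraphs_body(sample))
```

Encoding the body this way avoids the quoting problems that arise when multi-line text is pasted directly into a shell command.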
curl -X POST "http://127.0.0.1:8000/tokenize/paragraphs" \
-H "Content-Type: application/json" \
-d '{"text": "This is a sample sentence for the documentation. It is used for illustrative purposes."}'
A health check is available at the /health endpoint. The server returns a status code of 200 if the service is healthy, along with a JSON object:
{ "status": "up", "timestamp": "0001-01-01T00:00:00Z" }
This project is licensed under the MIT License - see the LICENSE file for details.
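As a companion to the /health response shown above, the following sketch parses such a body in Python. The helper name `parse_health` is hypothetical, and the code assumes the timestamp is always an ISO 8601 string with a trailing "Z", as in the sample.

```python
import json
from datetime import datetime


def parse_health(body):
    """Return (is_up, timestamp) parsed from a /health response body."""
    data = json.loads(body)
    is_up = data.get("status") == "up"
    # Replace the trailing "Z" so fromisoformat also works before Python 3.11.
    stamp = datetime.fromisoformat(data["timestamp"].replace("Z", "+00:00"))
    return is_up, stamp


if __name__ == "__main__":
    sample = '{ "status": "up", "timestamp": "0001-01-01T00:00:00Z" }'
    print(parse_health(sample))
```

A monitoring script would typically check the HTTP status code first and only then inspect the body.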