This repository was archived by the owner on Jan 18, 2025. It is now read-only.

codeinchq/nlp-tokenizer

NLP Tokenizer

This repository contains a simple containerized API to tokenize text using the spaCy library. The API is built using FastAPI.

The image is available on Docker Hub under the name codeinchq/nlp-tokenizer.

Configuration

By default, the container listens on port 8000. Set the PORT environment variable to use a different port.
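
When overriding the port, the value of PORT and the port published by Docker need to agree. The following is a minimal sketch that assembles such a command in Python. The helper name docker_run_command and the example port 9000 are illustrative assumptions and are not part of this project; -e and -p are the standard docker run flags for environment variables and port publishing.

```python
import shlex

IMAGE = "codeinchq/nlp-tokenizer"


def docker_run_command(port: int) -> list:
    """Build a `docker run` argument list that sets PORT and publishes the same port.

    Hypothetical helper for illustration only.
    """
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid TCP port: {port}")
    return [
        "docker", "run",
        "-e", f"PORT={port}",    # tell the service which port to listen on
        "-p", f"{port}:{port}",  # publish that same port on the host
        IMAGE,
    ]


if __name__ == "__main__":
    # Prints: docker run -e PORT=9000 -p 9000:9000 codeinchq/nlp-tokenizer
    print(shlex.join(docker_run_command(9000)))
```

The argument list can be passed to subprocess.run as-is, which avoids shell quoting issues.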

Usage

To run the container locally, execute the following command:

docker run -p "8000:8000" codeinchq/nlp-tokenizer

Sentence Tokenization

The sentence tokenization endpoint is available at /tokenize/sentences and accepts the following parameters:

  • text: the text to tokenize
  • lang: the language of the text (optional)
curl -X POST "http://127.0.0.1:8000/tokenize/sentences" \
-H "Content-Type: application/json" \
-d '{"text": "This is a sample sentence for the documentation. It is used for illustrative purposes."}'
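
The same request can be sent from Python using only the standard library. This is a sketch under a few assumptions: the service is reachable at the address used in the curl example, the optional lang parameter is sent as a field of the JSON body, and the function names are illustrative. The response format is not documented here, so the decoded JSON is returned unchanged.

```python
import json
import urllib.request
from typing import Optional

# Assumption: same address as in the curl example.
BASE_URL = "http://127.0.0.1:8000"


def build_request(endpoint: str, payload: dict) -> urllib.request.Request:
    """Create a JSON POST request for one of the tokenization endpoints."""
    return urllib.request.Request(
        f"{BASE_URL}{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokenize_sentences(text: str, lang: Optional[str] = None):
    """Call /tokenize/sentences and return the decoded JSON response as-is."""
    payload = {"text": text}
    if lang is not None:
        # Assumption: lang is passed in the JSON body next to text.
        payload["lang"] = lang
    request = build_request("/tokenize/sentences", payload)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires the container to be running locally.
    print(tokenize_sentences("This is a sample sentence. It is used for illustration."))
```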

Word Tokenization

The word tokenization endpoint is available at /tokenize/words and accepts the following parameters:

  • text: the text to tokenize
  • lang: the language of the text (optional)
  • exclude_punct: exclude punctuation from the tokenization (optional, default: true)
  • lowercase: lowercase the tokens (optional, default: false)
curl -X POST "http://127.0.0.1:8000/tokenize/words" \
-H "Content-Type: application/json" \
-d '{"text": "This is a sample sentence for the documentation. It is used for illustrative purposes."}'
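
The optional parameters can be combined in a single request body. The sketch below only builds the JSON payload and assumes that lang, exclude_punct and lowercase are sent as fields of the JSON body alongside text. The helper name word_payload is illustrative. Parameters left unset are omitted so that the server defaults apply.

```python
import json
from typing import Optional


def word_payload(
    text: str,
    lang: Optional[str] = None,
    exclude_punct: Optional[bool] = None,
    lowercase: Optional[bool] = None,
) -> dict:
    """Build the JSON body for /tokenize/words, leaving out unset options."""
    payload = {"text": text}
    optional = {"lang": lang, "exclude_punct": exclude_punct, "lowercase": lowercase}
    # Only send options the caller set explicitly; the server applies its own defaults.
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload


if __name__ == "__main__":
    # Keep punctuation and lowercase the tokens.
    # Prints: {"text": "Hello, World!", "exclude_punct": false, "lowercase": true}
    print(json.dumps(word_payload("Hello, World!", exclude_punct=False, lowercase=True)))
```

The resulting string can be passed to curl with -d or sent with any HTTP client.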

Paragraph Tokenization

The paragraph tokenization endpoint is available at /tokenize/paragraphs and accepts the following parameters:

  • text: the text to tokenize
  • lang: the language of the text (optional)
curl -X POST "http://127.0.0.1:8000/tokenize/paragraphs" \
-H "Content-Type: application/json" \
-d '{"text": "This is a sample sentence for the documentation. It is used for illustrative purposes."}'

Health check

A health check is available at the /health endpoint. The server returns a status code of 200 if the service is healthy, along with a JSON object:

{ "status": "up", "timestamp": "0001-01-01T00:00:00Z" }
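
A monitoring script could interpret this response as follows. This is a minimal sketch, assuming the endpoint answers GET requests at the same address as the examples above. The function names is_healthy and check are illustrative.

```python
import json
import urllib.error
import urllib.request


def is_healthy(status_code: int, body: str) -> bool:
    """Return True only for a 200 response whose JSON body reports status "up"."""
    if status_code != 200:
        return False
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and data.get("status") == "up"


def check(base_url: str = "http://127.0.0.1:8000") -> bool:
    """Query /health (assumed to be a GET endpoint) and evaluate the response."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as response:
            return is_healthy(response.status, response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError):
        # Connection failures and non-2xx responses both count as unhealthy.
        return False


if __name__ == "__main__":
    print("healthy" if check() else "unhealthy")
```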

License

This project is licensed under the MIT License - see the LICENSE file for details.
