NLP Tokenizer

This repository contains a simple containerized API to tokenize text using the spaCy library. The API is built using FastAPI.

The image is available on Docker Hub under the name codeinchq/nlp-tokenizer.

Configuration

By default, the container listens on port 3000. The port is configurable using the PORT environment variable.
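The lookup described above could be sketched as follows. This is only an illustration, assuming a Python entrypoint: `resolve_port` and `DEFAULT_PORT` are hypothetical names and are not taken from the repository's code.

```python
import os

DEFAULT_PORT = 3000  # default stated in this README


def resolve_port(environ=None):
    """Return the port from PORT, or the default when it is unset (hypothetical helper)."""
    env = os.environ if environ is None else environ
    raw = env.get("PORT", "").strip()
    if not raw:
        return DEFAULT_PORT
    port = int(raw)  # raises ValueError on non-numeric input
    if not 1 <= port <= 65535:
        raise ValueError(f"PORT out of range: {port}")
    return port
```

Calling `resolve_port({"PORT": "8000"})` returns 8000, while an empty environment falls back to 3000.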

Usage

To run the container locally on port 8000 (the port used by the examples below), execute the following command:

docker run -e PORT=8000 -p "8000:8000" codeinchq/nlp-tokenizer

Sentence Tokenization

The sentence tokenization endpoint is available at /tokenize/sentences and accepts the following parameters:

text: the text to tokenize
lang: the language of the text (optional)

curl -X POST "http://127.0.0.1:8000/tokenize/sentences" \
-H "Content-Type: application/json" \
-d '{"text": "This is a sample sentence for the documentation. It is used for illustrative purposes."}'
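The same call can be made from Python with only the standard library. The sketch below assumes the service is reachable at 127.0.0.1:8000 as in the curl example. The helper names are illustrative, and because the response format is not documented here, the decoded JSON is returned as-is.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumption: same host and port as the curl example


def build_sentence_request(text, lang=None):
    """Build (but do not send) a POST request for /tokenize/sentences."""
    payload = {"text": text}
    if lang is not None:
        payload["lang"] = lang  # optional parameter, omitted when not set
    return urllib.request.Request(
        f"{BASE_URL}/tokenize/sentences",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokenize_sentences(text, lang=None):
    """Send the request and return the decoded JSON body, whatever its shape."""
    request = build_sentence_request(text, lang)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect without a running container.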

Word Tokenization

The word tokenization endpoint is available at /tokenize/words and accepts the following parameters:

text: the text to tokenize
lang: the language of the text (optional)
exclude_punct: exclude punctuation from the tokenization (optional, default: true)
lowercase: lowercase the tokens (optional, default: false)

curl -X POST "http://127.0.0.1:8000/tokenize/words" \
-H "Content-Type: application/json" \
-d '{"text": "This is a sample sentence for the documentation. It is used for illustrative purposes."}'
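The curl example relies on the defaults for the optional parameters. As a rough sketch, the request body with both options overridden could be assembled like this. `build_word_payload` is a hypothetical helper, and its defaults simply mirror the parameter list above.

```python
import json


def build_word_payload(text, lang=None, exclude_punct=True, lowercase=False):
    """Assemble the JSON body for /tokenize/words using the documented defaults."""
    payload = {
        "text": text,
        "exclude_punct": exclude_punct,
        "lowercase": lowercase,
    }
    if lang is not None:
        payload["lang"] = lang  # optional parameter, omitted when not set
    return payload


# Keep punctuation and lowercase the tokens.
body = json.dumps(build_word_payload("Hello, World!", exclude_punct=False, lowercase=True))
print(body)
# prints {"text": "Hello, World!", "exclude_punct": false, "lowercase": true}
```

The resulting string can be passed to curl with -d or sent with any HTTP client.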

Paragraph Tokenization

The paragraph tokenization endpoint is available at /tokenize/paragraphs and accepts the following parameters:

text: the text to tokenize
lang: the language of the text (optional)

curl -X POST "http://127.0.0.1:8000/tokenize/paragraphs" \
-H "Content-Type: application/json" \
-d '{"text": "This is a sample sentence for the documentation. It is used for illustrative purposes."}'
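The sample text above is a single paragraph. When sending multi-paragraph text, line breaks must be escaped inside the JSON string, which `json.dumps` handles automatically. The sketch below assumes paragraphs are separated by a blank line; how the service actually detects paragraph boundaries is not documented here.

```python
import json

# Two paragraphs separated by a blank line (assumed delimiter).
text = "First paragraph, first sentence. Second sentence.\n\nSecond paragraph."

# json.dumps escapes each newline as \n so the body stays on a single line.
body = json.dumps({"text": text})
print(body)
```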

Health check

A health check is available at the /health endpoint. The server returns a status code of 200 if the service is healthy, along with a JSON object:

{ "status": "up", "timestamp": "0001-01-01T00:00:00Z" }
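A client or monitoring script could interpret this response roughly as follows. This is a sketch that assumes the body always has the shape shown above; `parse_health` is an illustrative name, not part of the project.

```python
import json
from datetime import datetime


def parse_health(body):
    """Return (is_up, timestamp) from a /health JSON body shaped like the sample above."""
    data = json.loads(body)
    is_up = data.get("status") == "up"
    raw = data.get("timestamp")
    timestamp = None
    if raw:
        # fromisoformat() only accepts a trailing "Z" from Python 3.11, so normalise it.
        timestamp = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    return is_up, timestamp
```

Checking the HTTP status code first and parsing the body second keeps the check usable even if the JSON shape changes.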

License

This project is licensed under the MIT License - see the LICENSE file for details.
