This repository contains a simple containerized API to convert PDF documents to text using Mozilla's pdf.js and pdf.js-extract.
The image is available on Docker Hub under the name codeinchq/pdf2txt.
By default, the container listens on port 3000. The port is configurable using the PORT environment variable.
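As a rough sketch of the port configuration, the helper below passes PORT into the container and publishes the same port on the host. The function name run_pdf2txt and the example port 8080 are illustrative assumptions; only the image name and the PORT variable come from this README.

```shell
# Hypothetical helper: start the container on a custom port.
# Assumes the server inside the container binds to $PORT, as described above,
# so the same value is used on both sides of the -p mapping.
run_pdf2txt() {
  local port="${1:-3000}"
  docker run -e "PORT=${port}" -p "${port}:${port}" codeinchq/pdf2txt
}
```

Calling `run_pdf2txt 8080` would then expose the API on http://localhost:8080 under these assumptions.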
All requests must be sent as POST requests to the /extract endpoint with a multipart/form-data content type. The request must contain a PDF file under the key file.
Additional parameters can be sent to customize the conversion process:
- firstPage: The first page to extract. Default is 1.
- lastPage: The last page to extract. Default is the last page of the document.
- password: The password to unlock the PDF. Default is none.
- normalizeWhitespace: If set to true, the server normalizes the whitespace in the extracted text. Default is true.
- format: The output format. Supported values are text (the server returns the raw text as text/plain) or json (the server returns a JSON object as text/json). Default is text.
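To illustrate how the optional parameters combine, here is a minimal sketch that requests a page range as JSON. The function name extract_pages and the PDF2TXT_URL variable are hypothetical conveniences, not part of the project; the form field names are the ones listed above.

```shell
# Hypothetical helper: extract pages FIRST..LAST of a PDF as JSON.
# PDF2TXT_URL is an assumed override; it defaults to the local default port.
extract_pages() {
  local pdf="$1" first="$2" last="$3"
  curl -sS -X POST \
    -F "file=@${pdf}" \
    -F "firstPage=${first}" \
    -F "lastPage=${last}" \
    -F "format=json" \
    "${PDF2TXT_URL:-http://localhost:3000}/extract"
}
```

For example, `extract_pages /path/to/file.pdf 2 5 > pages.json` would request pages 2 through 5, assuming the service is running locally.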
The server returns a 200 status code if the conversion was successful, and the extracted text is available in the response body. In case of error, the server returns a 400 status code with a JSON object containing the error message (format: {error: string}).
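A script calling the API will usually want to tell these two outcomes apart. The sketch below is one way to do that with curl's -w '%{http_code}' option; the function name extract_or_fail and the PDF2TXT_URL variable are assumptions made for illustration.

```shell
# Hypothetical helper: write the extracted text to OUT, or report the
# server's error body on stderr and return non-zero.
extract_or_fail() {
  local pdf="$1" out="$2" status
  # -o saves the body to OUT; -w prints only the HTTP status on stdout.
  status=$(curl -sS -o "$out" -w '%{http_code}' -X POST \
    -F "file=@${pdf}" \
    "${PDF2TXT_URL:-http://localhost:3000}/extract") || return 1
  if [ "$status" != "200" ]; then
    # On failure the body is expected to be the JSON error object.
    echo "extract failed (HTTP ${status}): $(cat "$out")" >&2
    rm -f "$out"
    return 1
  fi
}
```

Usage might look like `extract_or_fail /path/to/file.pdf out.txt && wc -w out.txt`.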
Start the container:

    docker run -p "3000:3000" codeinchq/pdf2txt

Convert a PDF file to text with a JSON response:

    curl -X POST -F "file=@/path/to/file.pdf" -F "format=json" http://localhost:3000/extract -o example.json

Convert a PDF file to text:

    curl -X POST -F "file=@/path/to/file.pdf" http://localhost:3000/extract

Extract a password-protected PDF file's text content as JSON and save it to a file:

    curl -X POST -F "file=@/path/to/file.pdf" -F "password=XXX" -F "format=json" http://localhost:3000/extract -o example.json

A health check is available at the /health endpoint. The server returns a status code of 200 if the service is healthy, along with a JSON object:

    { "status": "up" }

A PHP 8 client is available on GitHub and Packagist.
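The /health endpoint mentioned above can be used to wait for the container to become ready before sending documents. The following is a minimal sketch; the function name wait_for_pdf2txt, the PDF2TXT_URL variable, and the one-second polling interval are all assumptions.

```shell
# Hypothetical helper: poll /health until it answers successfully,
# giving up after TRIES attempts (default 30).
wait_for_pdf2txt() {
  local url="${PDF2TXT_URL:-http://localhost:3000}/health"
  local tries="${1:-30}" i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl exit non-zero on HTTP error statuses (>= 400).
    if curl -sf -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

A startup script could run `wait_for_pdf2txt 10 || exit 1` before issuing its first /extract request.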
This project is licensed under the MIT License - see the LICENSE file for details.