Docext
Docext is a document extraction and processing workspace. It contains the document extractor API, shared extraction code, model inference components, and documentation for the surrounding system.
Hosted Sites
- API documentation and Swagger UI: https://docext.unlp.nextdorf.de/docs
- Frontend: https://un-lp.com or https://unlp.nextdorf.de
Local Development
This repository is managed with uv. The main documentation can optionally be served with mdbook. Building the test PDF additionally requires typst.
Prerequisites
- uv
- Optional: mdbook, for serving main-docs locally
- Optional: typst, for building test-pdf/test-pdf.pdf
uv, mdbook, and typst can be installed through Rust's package manager, cargo:
cargo install uv
cargo install mdbook
cargo install typst-cli
cargo can be installed directly or through rustup as part of the Rust toolchain. When using rustup, you may need to add Cargo's binary directory to your shell path:
export PATH="$HOME/.cargo/bin:$PATH"
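To confirm that the prerequisites are reachable from the current shell, a small check along these lines may help. This is only a sketch: check_tools is a hypothetical helper that does not exist in the repository, and it verifies only that each command is on the PATH, not that its version is suitable.

```shell
# check_tools is a hypothetical helper, not part of the repository.
# It reports whether each named command can be found on PATH and
# returns non-zero if at least one of them is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found"
    else
      echo "$tool: missing"
      missing=1
    fi
  done
  return "$missing"
}

# uv is required; mdbook and typst are optional, so their absence is only reported.
check_tools uv || echo "install uv before running make uv-sync"
check_tools mdbook typst || true
```

If a tool is reported as missing right after cargo install, the PATH export above is the first thing to check.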
Install Dependencies
Install the workspace dependencies with:
make uv-sync
Configuration
Example environment files are provided as .env*.example files. Copy the relevant example file to a local .env file and fill in the values needed for your setup.
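The copy step can be sketched as follows. init_env_file is a hypothetical helper, not something the repository provides, and the choice to leave an existing target file untouched is an assumption made here so that filled-in values are not overwritten.

```shell
# init_env_file is a hypothetical helper, not part of the repository.
# It copies an example file to a target path unless the target already exists.
init_env_file() {
  example="$1"
  target="$2"
  if [ ! -f "$example" ]; then
    echo "missing example file: $example" >&2
    return 1
  fi
  if [ -e "$target" ]; then
    echo "keeping existing $target"
    return 0
  fi
  cp "$example" "$target"
  echo "created $target from $example"
}

# Run from the repository root; does nothing when the example file is absent.
if [ -f .env.example ]; then
  init_env_file .env.example .env
fi
```

After the copy, edit the new file and fill in only the values your setup needs.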
.env.example contains local runtime settings:
- PYTORCH_ROCM_ARCH: optional ROCm setting for selecting the AMD GPU architecture PyTorch should target.
- HSA_OVERRIDE_GFX_VERSION: optional ROCm setting for overriding the detected AMD GPU GFX version when needed by the local ROCm stack.
- TF_MIN_GPU_MULTIPROCESSOR_COUNT: optional GPU-related setting for TensorFlow workloads.
- DOCLING_CUDA_USE_FLASH_ATTENTION2: optional Docling setting for enabling or disabling FlashAttention 2 usage on CUDA setups.
- OPENROUTER_API_KEY: optional API key for OpenRouter-backed model access.
.env.modal.example contains Modal-specific settings:
- HF_TOKEN: optional Hugging Face token used by Modal jobs when they need authenticated access to Hugging Face resources.
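Since all of these settings are optional, it can be useful to see which ones the current shell actually has set. The following is a minimal sketch: report_env_vars is a hypothetical helper, and it inspects only the shell environment. Whether and how the project loads .env files at runtime is not covered here.

```shell
# report_env_vars is a hypothetical helper, not part of the repository.
# It reports which of the optional variables named in the example files are
# set in the current shell. Values are never printed, since some are secrets.
report_env_vars() {
  for name in PYTORCH_ROCM_ARCH HSA_OVERRIDE_GFX_VERSION \
              TF_MIN_GPU_MULTIPROCESSOR_COUNT DOCLING_CUDA_USE_FLASH_ATTENTION2 \
              OPENROUTER_API_KEY HF_TOKEN; do
    # Indirect lookup of the variable called $name; empty when unset.
    eval "value=\${$name:-}"
    if [ -n "$value" ]; then
      echo "$name: set"
    else
      echo "$name: unset"
    fi
  done
}

report_env_vars
```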
The repository also contains experimental, disabled ROCm support. The relevant PyTorch ROCm dependency configuration is currently commented out in pyproject.toml; enable and adjust it only if you are working on an AMD GPU setup and know which ROCm versions and GPU architecture values apply to your machine.
Run The Document Extractor API
Start the local API server with:
make serve-document-extractor
The API will be served at http://localhost:8100. Its local Swagger UI is available at http://localhost:8100/docs.
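Because the server may take a moment to start, a script that depends on it can poll the Swagger UI URL until it responds. This is a sketch under assumptions: wait_for_http is a hypothetical helper, curl is assumed to be installed, and the retry count and delay are arbitrary defaults rather than values taken from the project.

```shell
# wait_for_http is a hypothetical helper, not part of the repository.
# It polls a URL with curl until it answers successfully or attempts run out.
wait_for_http() {
  url="$1"
  attempts="${2:-30}"   # assumed default: 30 tries
  delay="${3:-1}"       # assumed default: 1 second between tries
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # --fail makes curl return non-zero on HTTP error responses (>= 400).
    if curl --silent --fail --max-time 2 --output /dev/null "$url"; then
      echo "ready: $url"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "not reachable: $url" >&2
  return 1
}

# Usage, once make serve-document-extractor is running in another terminal:
#   wait_for_http http://localhost:8100/docs
```

The same helper can be pointed at any other locally served URL by passing it as the first argument.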
Serve The Documentation
If mdbook is installed, serve the main documentation locally with:
make serve-main-docs
The documentation will be available at http://localhost:3080.
Serve the pipeline documentation locally with:
make serve-pipeline-docs
The pipeline documentation will be available at http://localhost:3081.