Apache Tika Server is a server edition of Apache Tika.
Apache Tika is a content detection and analysis framework. It lets users extract text from over a thousand different file types (such as PPT, XLS, and PDF) through a single interface. Tika is useful for search engine indexing, content analysis, translation, and much more.
Tika is a project of the Apache Software Foundation and was formerly a subproject of Apache Lucene.
wikipedia.org/wiki/Apache_Tika
$ docker run -d -p 9998:9998 --name some-tika kujira/tika
...via docker-compose
Example docker-compose.yml for kujira/tika:
version: '3.1'
services:
  tika:
    image: kujira/tika
    restart: always
    ports:
      - 9998:9998
Tika does not use a data store.
Tika exposes port 9998 in the container. Add -p 9998:9998 to the docker run arguments, then access either http://localhost:9998 or http://host-ip:9998 in a browser.
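Beyond the browser, a client can send documents to the server over HTTP. The following is a minimal Python sketch, assuming the image runs an unmodified Tika Server, whose REST API accepts a document via PUT on /tika and returns the extracted text. The names build_extract_request and extract_text are illustrative helpers, not part of the image.

```python
import urllib.request

# Assumes the container was started with -p 9998:9998 on the local machine.
TIKA_URL = "http://localhost:9998"


def build_extract_request(document: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT request for Tika Server's /tika endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tika",
        data=document,
        method="PUT",
        headers={"Accept": "text/plain"},  # ask for plain-text output
    )


def extract_text(path: str, base_url: str = TIKA_URL) -> str:
    """Send the file at `path` to the server and return the extracted text."""
    with open(path, "rb") as f:
        request = build_extract_request(f.read(), base_url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

Calling extract_text with the path of a local document returns its text content, provided the container is running and reachable at the given URL.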
When you start the kujira/tika image, you can adjust the configuration of the instance by passing one or more environment variables on the docker run command line.
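As a rough illustration of passing variables on the command line, the Python sketch below assembles a docker run invocation with one -e NAME=value pair per variable. EXAMPLE_VARIABLE is a hypothetical placeholder, not a real setting of this image, and docker_run_command is an illustrative helper.

```python
import shlex


def docker_run_command(env: dict[str, str], name: str = "some-tika") -> list[str]:
    """Return the argv for `docker run`, with one -e NAME=value pair per variable."""
    command = ["docker", "run", "-d", "-p", "9998:9998", "--name", name]
    for key, value in env.items():
        command += ["-e", f"{key}={value}"]
    command.append("kujira/tika")
    return command


# EXAMPLE_VARIABLE is a hypothetical placeholder; substitute a real variable name.
argv = docker_run_command({"EXAMPLE_VARIABLE": "true"})
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run, or the printed line can be pasted into a shell.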
Set the fraction value used to determine whether a text file is CSV. Default is 0.5.
Set true to enable HTML script extraction. Default is false.
Set true to enable VBA macro extraction. Default is false.
Set true to enable deleted content extraction. Default is false.
Set true to enable moved content extraction. Default is false.
Set true to enable moved content extraction. Default is false.
Set false to disable header and footer extraction. Default is true.
Set true to enable missing rows extraction. Default is false.
Set false to disable slide note extraction. Default is true.
Set false to disable slide master extraction. Default is true.
Set false to disable concatenation of phonetic (a.k.a. furigana) strings during extraction. Default is true.
When true, 山田太郎 will be extracted as 山田太郎ヤマダタロウ.
Set true to enable SAX docx and pptx extraction. Default is false.
Set the format of the date string. Default is yyyy-mm-dd.
You can find custom date formats here.
Set the Tesseract OCR language model name. Default is eng. You can join several model names with the + character, like this: eng+deu+fra
Supported model names
deu (German)
eng (English)
fra (French)
ita (Italian)
jpn (Japanese)
jpn_vert (Japanese Vertical)
spa (Spanish)
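The value can be assembled programmatically. This is a small Python sketch that joins model names with the + character and rejects names outside the supported list above; join_ocr_languages is an illustrative helper, not something shipped with the image.

```python
# Model names taken from the supported list above.
SUPPORTED_MODELS = ("deu", "eng", "fra", "ita", "jpn", "jpn_vert", "spa")


def join_ocr_languages(models: list[str]) -> str:
    """Join Tesseract model names with '+', rejecting empty or unsupported input."""
    if not models:
        raise ValueError("at least one model name is required")
    unknown = [m for m in models if m not in SUPPORTED_MODELS]
    if unknown:
        raise ValueError(f"unsupported model name(s): {', '.join(unknown)}")
    return "+".join(models)


print(join_ocr_languages(["eng", "deu", "fra"]))  # prints eng+deu+fra
```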
Set the timeout for Tesseract OCR execution. Default is 120.
Set true to automatically rotate images when needed. Default is false.
Set true to enable bookmark extraction. Default is false.
Set true to enable annotation extraction. Default is false.
Set an OCR strategy string. Default is no_ocr.
OCR strategy values
no_ocr (default)
ocr_only
ocr_and_text
auto