Java Pub House

Episode 104. It's all about Apache Tika, the project that lets you index EVERYTHING.



So we continue to have guests on our show to talk to us about interesting things... This time it's about Apache Tika. This is an incredible tool for search-oriented file processing and metadata extraction. Imagine you have tons of unstructured files, like emails or documents, and you want to extract, index, and then search them. This is Tika's purpose. And who better to walk us through how it does its magic than its Project Management Committee (PMC) Chair, Tim Allison!

So take a listen as we go deeper on ingesting tons of content (which is fundamental for things like training LLMs).

http://www.javapubhouse.com/datadog We thank DataDogHQ for sponsoring this podcast episode

Don't forget to SUBSCRIBE to our cool NewsCast OffHeap! http://www.javaoffheap.com/

Apache Tika
* https://tika.apache.org/

OpenSearch Project and OpenSearch Neural Plugin Tutorials
* https://opensearch.org/
* https://opensearch.org/docs/latest/search-plugins/neural-search/
* https://opster.com/guides/opensearch/opensearch-machine-learning/how-to-set-up-vector-search-in-opensearch/
* https://opster.com/guides/opensearch/opensearch-machine-learning/opensearch-hybrid-search/
* https://sease.io/2024/01/opensearch-knn-plugin-tutorial.html
* https://sease.io/2024/04/opensearch-neural-search-tutorial-hybrid-search.html

Selected Advanced File Processing toolkits/services
* https://unstructured.io/
* https://aws.amazon.com/textract/
* https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence

Selected Hybrid Search/RAG toolkits (there are _MANY_ others!)
* Haystack: https://haystack.deepset.ai/
* LangChain: https://www.langchain.com/
* LangStream: https://langstream.ai/

Search/Relevance Conferences
* https://haystackconf.com/
* https://2024.berlinbuzzwords.de/
* https://mices.co/

Tim's personal project
* JavaFX (ahem) tika-config writer UI: https://github.com/tballison/tika-gui-v2

Do you like the episodes? Want more? Help us out! Buy us a beer! https://www.javapubhouse.com/beer

And follow us! https://www.twitter.com/javapubhouse


Java Pub House, by Freddy Guime & Bob Paulin



