const systemPrompt = `
Convert the following PDF page to markdown.
Return only the markdown with no explanation text. Do not include deliminators like '''markdown.
You must include all information on the page. Do not exclude headers, footers, or subtext.
`;
From the source, Documind appears to:
1) Install tools like Ghostscript, GraphicsMagick, and LibreOffice with a JS script.
2) Convert document pages to Base64 PNGs and send them to OpenAI for data extraction.
3) Use Supabase for unclear reasons.
Some issues with this approach:
* OpenAI may retain and use your data for training, raising privacy concerns [1].
* Dependencies should be managed with Docker or package managers like Nix or Pixi, which are more robust. Example: a tool like Parsr [2] provides a Dockerized PDF-to-JSON solution, complete with OCR support and an HTTP API.
* GPT-4 Vision seems like a costly, error-prone, and unreliable solution, not really suited to extracting data from sensitive documents like invoices without review.
* Traditional methods (PDF parsers with OCR support) are cheaper, more reliable, and avoid retention risks for this particular use case. These tools do require some plumbing, though... and LLMs can probably really help with that!
While there are plenty of tools for structured data extraction, I think there’s still room for a streamlined, all-in-one solution. This gap likely explains the abundance of closed-source commercial options tackling this very challenge.
---
1: https://platform.openai.com/docs/models#how-we-use-your-data
2: https://github.com/axa-group/Parsr
Multimodal LLMs are not the way to do this for a business workflow yet.
In my experience you're much better off starting with Azure Document Intelligence or AWS Textract to first get the structure of the document (PDF). These tools are incredibly robust and do a great job with most of the common cases you can throw at them. From there you can use an LLM to interrogate and structure the data to your heart's delight.
I’ll have to test this against my local Python pipeline, which does all this without an LLM in attendance. There are a ton of existing Python libraries that have been doing this for a long time, so let’s take a look...
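As a reference point, the traditional route those libraries cover is: read the text layer first and fall back to OCR only for pages without usable text. A minimal sketch, assuming pypdf (the OCR step is stubbed and would use pytesseract or similar):

```python
try:
    from pypdf import PdfReader  # third-party: pip install pypdf
except ImportError:
    PdfReader = None

def needs_ocr(page_text, min_chars=20):
    """Heuristic: a near-empty text layer usually means a scanned page."""
    return len(page_text.strip()) < min_chars

def extract_pages(path):
    """Yield (kind, text) per page, flagging pages that should go to OCR."""
    if PdfReader is None:
        raise RuntimeError("pypdf is not installed")
    for page in PdfReader(path).pages:
        text = page.extract_text() or ""
        if needs_ocr(text):
            # Scanned page: render it to an image and run OCR
            # (e.g. pytesseract) here instead.
            yield ("ocr-needed", "")
        else:
            yield ("text-layer", text)
```

No LLM anywhere in the loop, and every step is deterministic and auditable.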
This looks like a promising tool for working with unstructured documents! A few questions come to mind:
1) Data Accuracy: How do you ensure the extracted data aligns perfectly with the source? Are there specific safeguards or confidence scoring mechanisms in place to flag potentially inaccurate extractions, or is this left entirely to manual review?
2) Customization and Flexibility: Many real-world scenarios involve highly specific schemas or even multi-step extraction workflows. Does Documind allow for layered or conditional parsing where fields depend on the values of others?
3) Local Hosting for Confidential Data: Data confidentiality is a big concern for many businesses (e.g., legal or financial industries). While it's great that Documind is open source, do you have any built-in provisions or guides for secure local hosting, especially in resource-constrained environments?
Looking forward to seeing how this evolves—seems like a tool with great potential for streamlining document processing!
From just reading the README, the example is not valid JSON. Is that intentional?
Otherwise it seems like a prompt-building tool, or am I missing something here?
Thanks for pointing this out. This was an error on my part.
I see someone opened an issue for it, so I'll fix it now.
Oof you’re right LOL
With such a system, how do you ensure that the extracted data matches the data in the source document? Run the process several times and check that the results are identical? Can it reject inputs for manual processing? Or is it intended to be always checked manually? How good is it, how many errors does it make, say per million extracted values?
Perhaps there's still value in the documents being transformed by this tool and someone reviewing them manually, but obviously the real value would be in reducing manual review. I don't think there's a world–for now–in which this manual review can be completely eliminated.
However, if you process, say, 1 million documents, you could sample and review a small percentage of them manually (a power calculation would help here). Assuming your random sample models the "distribution" (which may be tough to define/summarize) of the 1 million documents, you could then extrapolate your accuracy onto the larger set of documents without having to review each and every one.
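That sampling approach can be made concrete with the standard normal-approximation sample-size formula (z = 1.96 for 95% confidence, worst-case p = 0.5), plus a finite-population correction for a known batch size; the function below is a sketch of that calculation:

```python
import math

def sample_size(margin, z=1.96, p=0.5, population=None):
    """Documents to review to estimate the error rate within +/- margin.

    Uses the normal-approximation formula n = z^2 * p * (1 - p) / margin^2,
    with a finite-population correction when the batch size is known.
    """
    n = (z ** 2) * p * (1 - p) / margin ** 2
    if population is not None:
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)

# At a +/-1% margin and 95% confidence this lands around ~9,600 reviewed
# documents, even when the batch holds a full million.
```

The point being: the review burden grows with the desired precision, not with the size of the batch.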
You can sample the result to determine the error rate, but if you find an unacceptable level of errors, then you still have to review everything manually. On the other hand, if you use traditional techniques, pattern matching with regular expressions and things like that, then you can probably get pretty close to perfection for those cases where your patterns match and you can just reject the rest for manual processing. Maybe you could ask a language model to compare the source document and the extracted data and to indicate whether there are errors, but I am not sure if that would help, maybe what tripped up the extraction would also trip up the result evaluation.
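A minimal sketch of that pattern-matching route, with the reject-to-manual fallback (the invoice patterns here are invented for the example; a real deployment would keep one pattern set per known document template):

```python
import re

# Hypothetical patterns for one invoice layout.
PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*#?\s*:?\s*([A-Z0-9-]+)"),
    "date": re.compile(r"Date\s*:?\s*([0-9]{4}-[0-9]{2}-[0-9]{2})"),
    "total": re.compile(r"Total\s*:?\s*\$?([0-9][0-9,]*\.[0-9]{2})"),
}

def extract(text):
    """Return a dict of fields, or None to reject for manual processing."""
    out = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        if not match:
            return None  # pattern miss: route the document to a human
        out[field] = match.group(1)
    return out
```

When the patterns match, the output is exact by construction; everything else is explicitly rejected rather than guessed at.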
Just this weekend I was solving a similar problem.
What I've noticed is that on scanned documents, where stamped text and handwriting are just as important as printed text, Gemini was way better than ChatGPT.
Of course, my prompts might have been an issue, but Gemini produced significantly better results with very brief and generic queries.
Got excited about an open-source tool doing this.
Alas, I am let down. It is an open-source tool creating the prompt for the OpenAI API, and I can't go and send customer data to them.
I'm aware of https://github.com/clovaai/donut so i hoped this would be more like that.
You can self-host OpenAI-compatible models with LM Studio and the like. I've used it with https://anythingllm.com/
Hi. I totally get the concern about sending data to OpenAI. Right now, Documind uses OpenAI's API just so people could quickly get started and see what it is like, but I’m open to adding options and contributions that would be better for privacy.
You might be able to use Ollama, which has an OpenAI-compatible API.
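As a sketch of that route: Ollama serves an OpenAI-style API on its default port, so a stdlib-only client just needs to point a chat-completions payload at localhost (the model name is illustrative, and a local server must be running for the actual call):

```python
import json
from urllib import request

# Ollama exposes an OpenAI-compatible endpoint under /v1 on its default port.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_payload(model, page_markdown, schema_hint):
    """Assemble an OpenAI-style chat request for local extraction."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Extract the fields described below as JSON: " + schema_hint},
            {"role": "user", "content": page_markdown},
        ],
        "temperature": 0,  # reduce run-to-run variance for extraction
    }

def extract_locally(model, page_markdown, schema_hint):
    req = request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, page_markdown, schema_hint)).encode(),
        headers={"Content-Type": "application/json",
                 # Ollama ignores the key, but OpenAI-style clients expect one.
                 "Authorization": "Bearer ollama"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Nothing leaves the machine, which addresses the confidentiality concern directly.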
Not without changing the code (should be easy though)
https://github.com/DocumindHQ/documind/blob/d91121739df03867...
Very nice tool! Just last week, I was working on extracting information from PDFs for an automation flow I’m building. I used Unstructured (https://unstructured.io/), which supports multiple file types, not just PDFs.
However, my main issue is that I need to work with confidential client data that cannot be uploaded to a third party. Setting up the open-source, locally hosted version of Unstructured was quite cumbersome due to the numerous additional packages and installation steps required.
While I’m open to the idea of parsing content with an LLM that has vision capabilities, data safety and confidentiality are critical for many applications. I think your project would go from good to great if it were possible to connect to Ollama and run locally.
That said, this is an excellent application! I can definitely see myself using it in other projects that don’t demand such stringent data confidentiality.
Thank you, I appreciate the feedback! I understand people wanting data confidentiality and I'm considering connecting Ollama for future updates!
Documind: Open-Source AI for Document Data Extraction
If you're dealing with unstructured data trapped in PDFs, Documind might be the tool you’ve been waiting for. It’s an open-source solution that simplifies the process of turning documents into clean, structured JSON data with the power of AI.
Key Features:
1. Customizable Data Extraction
Define your own schema to extract exactly the information you need from PDFs—no unnecessary clutter.
2. Simple Input, Clean Output
Just provide a PDF link and your schema definition, and it returns structured JSON data, ready to integrate into your workflows.
3. Developer-Friendly
With a simple setup (`npm install documind`), you can get started right away and start automating tedious document processing tasks.
Whether you’re automating invoice processing, handling contracts, or working with any document-heavy workflows, Documind offers a lightweight, accessible solution. And since it’s open-source, you can customize it further to suit your specific needs.
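Purely as an illustration of the schema-in, JSON-out flow (these field names are invented for the example, not Documind's documented schema format), an invoice schema and its result might look like:

```json
{
  "schema": {
    "invoice_number": "string",
    "issue_date": "string",
    "total_amount": "number"
  },
  "result": {
    "invoice_number": "INV-001",
    "issue_date": "2024-05-01",
    "total_amount": 1234.56
  }
}
```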
Would love to hear if others in the community have tried it—how does it stack up for your use cases?
Looking at the source it seems this is just a thin wrapper over OpenAI. Am I missing something?
I am looking for a similar service that turns any document (PNG, PDF, DOCX) into JSON (preserving the field relationships). I tried ChatGPT, but hallucinations are common. Does anything exist?
I built a drag-and-drop document converter that extracts text into custom columns (for CSV) or keys (for JSON). You can schedule it to run at certain times and update a database as well.
I haven't had issues with hallucinations. If you're interested, my email is in my bio.
This is also using OpenAI's GPT models, so the same hallucinations are probable here for PDFs.
I'm not sure having a statistical model prone to fabrication extract text from PDFs would result in any mission-critical, reliable data.
That's a valid problem you are solving. I had a similar use case that I solved using PDF[dot]co
Not sure I would want something non-deterministic in my data pipeline. Maybe if it used GenAI to _develop a ruleset_ that could then be deployed, it would be more practical.
Reading from the comments, some of the common questions regarding document extraction are:
* Run locally or on premise for security/privacy reasons
* Support multiple LLMs and vector DBs - plug and play
* Support customisable schemas
* Method to check/confirm accuracy with source
* Cron jobs for automation
There is Unstract, which solves the above requirements.
https://github.com/Zipstack/unstract
> an interesting open source project
enthusiastically setting up a lounge chair
> OPENAI_API_KEY=your_openai_api_key
carrying it back apathetically
Thanks for the laugh and your feedback! I know that depending on OpenAI isn't ideal for everyone. I'm considering ways to make it more self-contained in the future, so it’s great to hear what users are looking for.
LiteLLM would be a start: you just pass in a model string that includes the provider and can default to OpenAI GPTs. That removes most of the effort in adapting things, both for you and for other users.
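A sketch of what that LiteLLM routing might look like (the model names are illustrative, and the `resolve_model` helper is mine, not part of LiteLLM):

```python
# LiteLLM routes "provider/model" strings to the right backend, so swapping
# OpenAI for a local model becomes a one-string change.
try:
    from litellm import completion  # third-party: pip install litellm
except ImportError:
    completion = None

DEFAULT_MODEL = "openai/gpt-4o-mini"  # illustrative default

def resolve_model(model_string):
    """Split 'provider/model' into its parts; bare names default to openai."""
    provider, _, name = model_string.partition("/")
    if not name:
        return "openai", provider
    return provider, name

def ask(prompt, model=DEFAULT_MODEL):
    if completion is None:
        raise RuntimeError("litellm is not installed")
    resp = completion(model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content
```

Switching to a locally hosted model would then be as simple as `ask(prompt, model="ollama/llama3")`.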