Llama-OCR: Document to Markdown

(llamaocr.com)

250 points | by lapnect 15 hours ago ago

84 comments

nutlope 13 hours ago ago
Hi all, I'm the author of llama-ocr. Thank you for sharing & for the kind comments! I built this earlier this week since I wanted a simple API to do OCR – it uses llama 3.2 vision (hosted on together.ai, where i work) to parse images into structured markdown. I also have it available as an npm package.
Planning to add a bunch of other features like the ability to parse PDFs, output a response in JSON, ect... If anyone has any questions, feel free to send them and I'll try to respond!
[-]
- nh2 11 hours ago ago
  I put in a bill that has 3 identical line items and it didn't include them as 3 bullet points as usual, but generated a table with a "quantity" column that doesn't exist on the original paper.
  Is this amount of larger transformation expected/desirable?
  (It also means that the output is sometimes a bullet point list, sometimes a table, making further automatic processing a bit harder.)
  [-]
  - zainia 4 hours ago ago
    Here's the prompt being used, tweaking that might help: https://github.com/Nutlope/llama-ocr/blob/main/src/index.ts#...
- gcr an hour ago ago
  How accurate is this?
  When compared with existing OCR systems, what sorts of mistakes does it make?
- Szpadel 8 hours ago ago
  > Need an example image? Try ours. Great idea, I wish more services would have similar feature
- Curiositry 11 hours ago ago
  Option to use a local LLM?
  [-]
  - Eisenstein 10 hours ago ago
    I made a script which does exactly the same thing but locally using koboldcpp for inference. It downloads MiniCPM-V 2.6 with image projector the first time you run it. If you want to use a different model you can, but you will want to edit the instruct template to match.
    * https://github.com/jabberjabberjabber/LLMOCR
    [-]
    - nirav72 10 hours ago ago
      MiniCPM-v 2.6 is probably the best self-hosted vision model I have used so far. Not just for OCR, but also image analysis. I have it setup, so my NVR (frigate) sends couple of images upon motion alert from a driveway security camera to Ollama with minicpm-v 2.6. I’m able to get a reasonably accurate description of the vehicle that pulled into the driveway. Including describing the person that exits the vehicle and also the license plate. All sent to my phone.
sdflhasjd 7 hours ago ago
Here's a bit of a quirk: I uploaded a webcomic as an example, all the dialog was ALL CAPS, but the output was inconsistently either sentence case or title case between panels.
I also tried some real examples a problem I'd like to use OCR with: I've got some old slides that needs digitising, and most of them are labelled, uploading one of these provides the output:
```
  The image appears to be a photograph of a slide or film frame, possibly from an old camera or projector. The slide is yellowed with age and has a rectangular cutout in the center, which is filled with a dark gray or black material. The cutout is surrounded by a thin border, and there is some text written on the slide in black ink.

  The text reads "Once Upon a Time" and is written in a cursive font. It is located at the bottom of the slide, below the cutout. There is also a small number "1069" written in the same font and color, but it is not clear what this number refers to.

  Overall, the image suggests that the slide is an old photograph or film frame that has been preserved for many years. The yellowing of the slide and the cursive writing suggest that it may be from the early 20th century or earlier.
```
So aside from unnecessary repetitious description of the slide, (and the "yellowing" is actually just white balance being off, though I can forgive that), the actual written text (not cursive) was "Once Uniquitous." and the number was 106g. It's very clearly a 'g' and not a '9'.
What I think is interesting about this is that it might be a demonstration of biases in models, it focuses too much on the slide being an antique that it hallucinated a completely cliche title. Also, it missed the forest for the trees and that the "black square" was the slide being front-lit so the text could be read, so the transparency wasn't visible.
Additionally, the API itself seems to have file size or resolution limits that are not documented
generalizations an hour ago ago
How does it handle images? That has seemed to be the major weak point of these doc-to-markdown systems.
philips 12 hours ago ago
I have recently used llama3.2-vision to handle some paper bidsheets for a charity auction and it is fairly accurate with some terrible handwriting. I hope to use it for my event next year.
I do find it rather annoying not being able to get it to consistently output a CSV though. ChatGPT and Gemini seem better at doing that but I haven’t tried to automate it.
The scale of my problem is about 100 pages of bidsheets and so some manual cleaning is ok. It is certainly better than burning volunteers time.
https://github.com/philips/paper-bidsheets
[-]
- mosselman 10 hours ago ago
  What about using llama3.2-vision to do the OCR bit and then deferring to ChatGPT to do the CSV part?
notsylver 13 hours ago ago
I've been doing a lot of OCR recently, mostly digitising text from family photos. Normal OCR models are terrible at it, LLMs do far better. Gemini Flash came out on top from the models I tested and it wasn't even close. It still had enough failures and hallucinations to make it faster to write it in by hand. Annoying considering how close it feels to working.
This seems worse. Sometimes it replies with just the text, sometimes it replies with a full "The image is a scanned document with handwritten text...". I was hoping for some fine tuning or something for it to beat Gemini Flash, it would save me a lot of time. :(
[-]
- philips 12 hours ago ago
  Have you tried downscaling the images? I started getting better results with lower resolution images. I was using scans made with mobile phone cameras for this.
  convert -density 76 input.pdf output-%d.png
  https://github.com/philips/paper-bidsheets
  [-]
  - notsylver 11 hours ago ago
    That's interesting. I downscaled the images to something like 800px but that was mostly to try improve upload times. I wonder if downscaling further and with a better algorithm would help.. I remember using CLIP and found different scaling algorithms helped text readability. Maybe the text is just being butchered when its rescaled.
    Though I also tried with the high detail setting which I think would deal with most issues that come from that and it didn't seem to help much
- danvk 5 hours ago ago
  I've had really good luck recently running OCR over a corpus of images using gpt-4o. The most important thing I realized was that non-fancy data prep is still important, even with fancy LLMs. Cropping my images to just the text (excluding any borders) and increasing the contrast of the image helped enormously. (I wrote about this in 2015 and this post still holds up well with GPT: https://www.danvk.org/2015/01/07/finding-blocks-of-text-in-a...).
  I also found that giving GPT at most a few paragraphs at a time worked better than giving it whole pages. Shorter text = less chance to hallucinate.
  [-]
  - pbhjpbhj 5 hours ago ago
    Have you tried doing a verification pass: so giving gpt-4o the output of the first pass, and the image, and asking if they can correct the text (or if they match, or...)?
    Just curious whether repetition increases accuracy or of it hurt increases the opportunities for hallucinations?
- og_kalu 13 hours ago ago
  >Normal OCR models are terrible at it, LLMs do far better. Gemini Flash came out on top from the models I tested and it wasn't even close.
  For Normal models, the state of Open Source OCR is pretty terrible. Unfortunately, the closed options from Microsoft, Google etc are much better. Did you try those ?
  Interesting about Flash, what LLMs did you test ?
  [-]
  - notsylver 12 hours ago ago
    I tried open source and closed source OCR models, all were pretty bad. Google vision was probably the best of the "OCR" models, but it liked adding spaces between characters and had other issues I've forgotten. It was bad enough that I wondered if I was using it wrong. By the time I was trying to pass the text to an LLM with the image so it could do "touchups" and fix the mistakes, I gave up and decided to try LLMs for the whole task.
    I don't remember the exact models, I more or less just went through the OpenRouter vision model list and tried them all. Gemini Flash performed the best, somehow better than Gemini Pro. GPT-4o/mini was terrible and expensive enough that it would have had to be near perfect to consider it. Pixtral did terribly. That's all I remember, but I tried more than just those. I think Llama 3.2 is the only one I haven't properly tried, but I don't have high hopes for it.
    I think even if OCR models were perfect, they couldn't have done some of the things I was using LLMs for. Like extracting structured information at the same time as the plain text - extracting any dates listed in the text into a standard ISO format was nice, as well as grabbing peoples names. Being able to say "Only look at the hand-written text, ignore printed text" and have it work was incredible.
    [-]
    - dleeftink 9 hours ago ago
      WordNinja is pretty good as a post-processing step on wrongly split/concatenated words:
      [0]: https://github.com/keredson/wordninja
  - pbhjpbhj 5 hours ago ago
    The OCR in OneNote is incredible IME. But, I've not tested in a wide range of fonts -- only that I have abysmal handwriting and it will find words that are almost unrecognisable.
- 8n4vidtmkvmk 13 hours ago ago
  That's a bummer. I'm trying to do the exact same thing right now, digitize family photos. Some of mine have German on the back. The last OCR to hit headlines was terrible, was hoping this would be better. ChatGPT 4o has been good though, when I paste individual images into the chat. I haven't tried with the API yet, not sure how much that would cost me to process 6500 photos, many of which are blank but I don't have an easy way to filter them either.
  [-]
  - notsylver 11 hours ago ago
    I found 4o to be one of the worst, but I was using the API. I didn't test it but sometimes it feels like images uploaded through ChatGPT work better than ones through the API. I was using Gemini Flash in the end, it seemed better than 4o and the images are so cheap that I have a hard time believing google is making any money even by bandwidth costs
    I also tried preprocessing images before sending them through. I tried cropping it to just the text to see if it helped. Then I tried filtering on top to try brighten the text, somehow that all made it worse. The most success I had was just holding the image in my hand and taking a photo of it, the busy background seemed to help but I have absolutely no idea why.
    The main problem was that it would work well for a few dozen images, you'd start to trust it, and then it'd hallucinate or not understand a crossed out word with a correction or wouldn't see text that had faded. I've pretty much given up on the idea. My new plan is to repurpose the website I made for verifying the results into one where you enter the text manually, as well as date/location/favourite status.
  - bosie 12 hours ago ago
    Use a local rubbish model to extract text. If it doesn’t find any on the back, don’t send it to chatgtp?
    Terrascan comes to mind
- bboygravity 10 hours ago ago
  Have you tried Claude?
  It's not good at returning the locations of text (yet), but it's insane at OCR as far as I have tested.
gexla 14 hours ago ago
Should this be a "Show HN" post? Seems to just be the front-end and has no association we may make with the name Llama? Maybe together.ai gave them cloud space?
mg 12 hours ago ago
I gave it a sentence, which I created by placing 500 circles via a genetic algorithm to form a sentence. And then drew with an actual physical circle:
https://www.instagram.com/marekgibney/p/BiFNyYBhvGr/
Interestingly, it sees the circles just fine, but not the sentence. It replied with this:
```
    The image contains no text or other elements
    that can be represented in Markdown. It is a
    visual composition of circles and does not
    convey any information that can be translated
    into Markdown format.
```
[-]
- Vetch 11 hours ago ago
  Based on the fact that squinting works, I applied a Gaussian blur to the image. Here's the response I got:
  Markdown:
  The provided image is a blurred text that reads "STOP THINKING IN CIRCLES." There are no other visible elements such as headers, footers, subtexts, images, or tables.
  Markdown Content:
  STOP THINKING IN CIRCLES
  As the response is not deterministic, I also tried several times with the unprocessed image but it never worked. However, all the low-pass filter effects I applied worked with a high success rate.
  https://imgur.com/q7Zd7fa
  [-]
  - mg 10 hours ago ago
    I guess blurring it is similar to reducing the resolution or to looking at the image from further away.
    It's interesting that the neural net figures out the circles, but not the words. Because the circles are also not so easily apparent from looking closely at the image. It could also be whirly lines.
- ggerules 5 hours ago ago
  Was the original LLM ever trained on original material like this?
  Pretty cool use of genetic algorithm! Would love to see the code or at least the reward function.
- DandyDev 12 hours ago ago
  I can't read this either.
  Edit: at a distance it's easier to read
  [-]
  - thih9 11 hours ago ago
    If you squint it’s easier too. I wonder if lowering the resolution of the image would make the text visible to ocr.
    [-]
    - pbhjpbhj 5 hours ago ago
      I wonder if you could do a composite image, like bracketed images, and so give the model multiple goes, for which it could amalgamate results. So, you could do an exposure bracket, do a focus/blur, maybe a stretch/compression, or an adjustment for font-height as a proportion of the image.
      Feed all of the alternatives to the model, tell it they each have the same textual content?
- echoangle 12 hours ago ago
  I can’t read anything but the „stop“ either without seeing the solution first
- wasyl 12 hours ago ago
  Why is it interesting? The image does not look like anything, and you need to skew it (by looking at an angle) to see any letters (barely).
xenodium 6 hours ago ago
Japanese OCR to structured content works very well via chatgpt API.
https://xenodium.com/images/chatgpt-shell-repo-splits-up/jap...
Other unrelated examples https://lmno.lol/alvaro/chatgpt-shell-repo-splits-up
Tepix 7 hours ago ago
So, i uploaded a HN screenshot and it showed some rendered text but where is the Markdown code? A site titles "Document to Markdown" that fails to give me the MarkDown? What am i overlooking?
cheema33 8 hours ago ago
I uploaded a multi-page PDF and it did not know what to do. This is before I went to the github repo and noticed that it wasn't supported. I think the tool should let the user know when they upload a file that is not supported.
amelius 7 hours ago ago
I tried it on a Walmart receipt. It misread a 9 for a 0.
https://imgur.com/a/ni8zOmb
nash 12 hours ago ago
Holy Hallucinations batman!
Even the example images hallucinates random text
[-]
- KeplerBoy 12 hours ago ago
  Same for me. The receipt headline only says "Trader Joe's" and yet the model insists on adding some information and transcribes "Trader Joe's Receipt". This is like Xeroxgate, but infinitely worse.
  Someday this will do great damage in ways we will completely neglect and overlook.
fros1y 5 hours ago ago
Are there any OCR engines out there that actually recognizes underlines properly? Even the LLMs seem to struggle to model the underline (though they get the text fine).
Eisenstein 14 hours ago ago
All it does is send the image to Llama 3.2 Vision and ask for it to read the text.
Note that this is just as open to hallucination as any other LLM output, because what it is doing is not reading the pixels looking for text characters, but describing the picture, which uses the images it trained on and their captions to determine what the text is. It may completely make up words, especially if it can't read them.
[-]
- M4v3R 14 hours ago ago
  This is also true for any other OCR system, we just never called these errors “hallucinations” in this context.
  [-]
  - geysersam 13 hours ago ago
    I gave this tool a picture of a restaurant menu and it made up several additional entries that didn't exist in the picture... What other OCR system would do that?
  - noduerme 12 hours ago ago
    No, it's not even close to OCR systems, which are based on analyzing points in a grid for each character stroke and comparing them with known characters. Just for one thing, OCR systems are deterministic. Deterministic. Look it up.
    [-]
    - visarga 12 hours ago ago
      OCR system use vision models and as such they can make mistakes. They don't sample but they produce a distribution of probability over words like LLMs.
    - alex_suzuki 10 hours ago ago
      One of my worries for the coming years is that people will forget what deterministic actually means. It terrifies me!
  - llm_trw 13 hours ago ago
    It really isn't since those systems are character based.
  - 8n4vidtmkvmk 12 hours ago ago
    OCR tools sometimes make errors, but they don't make things up. There's a difference.
LeoPanthera 14 hours ago ago
I wonder what the watts-per-character is of this tool.
[-]
- threatripper 14 hours ago ago
  Joules per character
  [-]
  - amelius 6 hours ago ago
    I'm running this with 60Hz on my HDMI output.
  - danielEM 13 hours ago ago
    I think it is perfectly fine to describe it in Watts per character as you can easily determine how many characters per second you can process.
rasz 2 hours ago ago
Old scan of Asus P3B-F motherboard schematic from 1997.
- only managed to extract some of the text from Title Block (project name, date etc)
- despite distinct font got all 8/B and 1/I mixed up.
- the actual useful info got turned into
```
    Tables
    Table 1: [Insert table 1 here]

    Other Elements
    [Insert other elements here]
```
AmazingTurtle 11 hours ago ago
One can combine apache tika OCR and feed it together with the image into LLM to fix typos.
[-]
- cess11 6 hours ago ago
  While I'm a fan of Tika a lot of people get queasy from Java and XML, they might be better served by their preferred scripting language and https://github.com/ocrmypdf/OCRmyPDF, which has the same OCR engine.
burnt-resistor 6 hours ago ago
I might've broken it as I gave it the Intel developer’s manual combined volumes. }:)
alecco 9 hours ago ago
Is it possible to do this locally with open source software? I have a lot of accounting PDFs to convert but due to privacy concerns it should not run in the cloud.
[-]
- criddell 8 hours ago ago
  Does it have to be open source, or just running locally? The paid version of Acrobat does this well. MacOS has pretty good built-in OCR capabilities and Windows isn’t far behind.
  If you have the hardware for it, you can run some LLMs locally. Although for accounting data, I probably wouldn’t trust it.
- bugglebeetle 4 hours ago ago
  Yes, Docling and Marker do very similar things and can be run fully locally.
- Eisenstein 5 hours ago ago
  I don't recommend using it for anything important unless you very diligently proofread it, but I made one that runs locally that I linked to elsewhere in this post:
  * https://news.ycombinator.com/item?id=42155548
- cess11 6 hours ago ago
  Either you need to be somewhat tolerant when it comes to misinterpretations and hallucinations, or you'll be proofreading a lot.
  A cheap hack is to push the documents through pdftotext from Poppler and if nothing or very little comes out, push them through OCRMyPDF and pipe it to pdftotext. If it's scanned you probably want some flags for deskewing and so on.
  To make a bulk load of PDF mostly greppable it's a decent technique, to get every 0 as a 0 you're probably going to proofread every conversion.
joeyblueee 6 hours ago ago
get this error in console when requesting /ocr, and a 504 status code """ An error occurred with your deployment
FUNCTION_INVOCATION_TIMEOUT """
MattDaEskimo 3 hours ago ago
Dreamt of fine design, layers of code, art refined— found wrappers instead.
Nothing to see here folks.
bbor 15 hours ago ago
Looks awesome! Been doing a lot of OCR recently, and love the addition to the space. The reigning champion in the PDF -> Markdown space (AFAIK) is Facebook's Nougat[1], and I'm excited to hook this up to DSPy and see which works better for philosophy books. This repo links the Zerox[2] project by some startup, which also looks awesome, and certainly more smoothly advertised than Nougat. Would love corrections/advice from any actual experts passing by this comment section :)
That said, I have a few questions if OP/anyone knows the answers:
1. What is Together.ai, and is this model OSS? Their website sells them as a hosting service, and the "Custom Models" page[3] seems to be about custom finetuning, not, like, training new proprietary models in-house. They might have a HuggingFace profile but it's hard to tell if it's them https://huggingface.co/TogetherAI
2. The GitHub says "hosted demo", but the hosting part is just the tiny (clean!) WebGUI, yes? It's implied that this functionality is and will always be available only through API calls?
P.S. The header links are broken on my desktop browser -- no onClick triggered
[1] https://facebookresearch.github.io/nougat/
[2] https://github.com/getomni-ai/zerox
[3] https://www.together.ai/products#custom-models
[-]
- jurnalanas 14 hours ago ago
  the project author is Devrel from Together.ai. This is a fantastic way to advertise a dev tool, though.
- gexla 14 hours ago ago
  My guess is together.ai is at least partially sponsoring the demo.
- magicalhippo 14 hours ago ago
  Yeah was hoping for something I could self-host, both for privacy and cost.
- rajansheth 13 hours ago ago
  together.ai serves 100+ open-source models including multi-modal Llama 3.2 with an OpenAI compatible API
d1sxeyes 13 hours ago ago
Seemed pretty good with handwriting. Didn’t make any mistakes with numbers in the sample I tried.
constantinum 6 hours ago ago
The problem with using LLMs for OCR is hallucinations. It makes it impossible to use in business use cases such as insurance, banking and health/medical — which demands high accuracy or predictable inaccuracy rate. Not to mention handling scale — processing millions of documents with speed and affordable costs.
For all the test use cases mentioned in this thread, I’d suggest trying LLMwhisperer. A general purpose text Pre-processor/OCR built for LLM consumption. https://pg.llmwhisperer.unstract.com
noduerme 12 hours ago ago
Um, I just quickly uploaded an unstructured RTF file to this and apparently broke it... unless it's just realllly slow.
If this is just for converting hand-written documents, maybe put that in the header of the website. Right now it just says "Document to Markdown", which could be interpreted lots of different ways.
sumedh 13 hours ago ago
Site is dead now :(
[-]
- nutlope 12 hours ago ago
  Should be up, please try again!
  [-]
  - mkl 10 hours ago ago
    It let me upload a file, but didn't produce any output.
revskill 9 hours ago ago
Non-English image is slow.
anothername12 13 hours ago ago
We tried this and it was an absolute shit show for us.
[-]
- cpursley 8 hours ago ago
  You could have at least provided some constructive feedback...
hrpnk 7 hours ago ago
Reading the Llama community license agreement, section "Redistribution and Use" I expected to find 'Built with Llama'. Is this not required?
https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instr... links to the community license.
[-]
- kennethwolters 7 hours ago ago
  Why don't you think that calling the app "Llama-OCR" is good enough?
  [-]
  - sdflhasjd 6 hours ago ago
    The license is pretty specific, if the API counts as a "service".
```
  i. If you distribute or make available the Llama Materials (or any derivative works thereof), or a product or service (including another AI model) that contains any of them, you shall (A) provide a copy of this Agreement with any such Llama Materials; and (B) prominently display “Built with Llama” on a related website, user interface, blogpost, about page, or product documentation.
```
HaiderAftab1 14 hours ago ago
Great tool for quickly converting plain text to Markdown, saving time and ensuring consistent formatting for documents
[-]
- nutlope 12 hours ago ago
  Thank you!