Interesting paper that introduces an encoder, DeepEncoder, which is really good at compressing the information inside an image into a small number of vision tokens while maintaining low activation memory.
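Rough token accounting for why the compression and the low activation memory go together. The specific numbers below are my recollection of the paper's base mode and should be treated as assumptions:

```python
# Back-of-the-envelope token math for DeepEncoder (numbers assumed, not verified).
image_size = 1024
patch_size = 16                                  # SAM-style patch embedding
patch_tokens = (image_size // patch_size) ** 2   # 4096 tokens, window attention only
conv_downsample = 16                             # conv compressor between SAM and CLIP stages
vision_tokens = patch_tokens // conv_downsample  # 256 tokens reach global attention

# Activation memory stays low because the expensive global-attention stage
# only ever sees the 256 compressed tokens, never the raw 4096.
text_tokens_equivalent = 10 * vision_tokens      # paper's ~10x compression at ~97% precision
print(patch_tokens, vision_tokens, text_tokens_equivalent)  # 4096 256 2560
```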
Questions
- I don’t really understand why they do the two-stage training. From what I understand, in the first stage they pair DeepEncoder with a small language model and train on language-modeling tasks, and in the second stage they freeze the convolutional and SAM parts and do the same thing again with the DeepSeek-3B MoE decoder. (A stage-2 sketch follows this list.)
- Unclear on the exact training data.
  - What is fitz? (A quick example follows this list.)
- How are they generating bounding boxes?
- I need to get better at parallelism: https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=high-level_overview
- Vision tokens are in some ways more expensive than text tokens, because creating them requires a linear projection and then a pass through the actual encoder. At what point does the vision encoder become worth using? Immediately on the first generation, or only after a certain number of autoregressive generations? It looks like the paper says even two uses make it worthwhile. (Rough break-even math follows this list.)
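On the two-stage training question above, a minimal sketch of how I picture stage 2's partial freezing. The module structure and names are placeholders I made up, not the paper's actual code:

```python
import torch.nn as nn

class DeepEncoder(nn.Module):
    """Stand-in pipeline: window attention -> 16x conv compressor -> global attention."""
    def __init__(self, dim=768):
        super().__init__()
        self.sam = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)   # placeholder for the SAM (window-attention) stage
        self.conv = nn.Conv1d(dim, dim, kernel_size=16, stride=16)               # placeholder for the conv compressor
        self.clip = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)  # placeholder for the CLIP (global-attention) stage

    def forward(self, tokens):                            # tokens: (batch, seq, dim)
        x = self.sam(tokens)                              # cheap attention over many tokens
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)  # shrink the sequence 16x
        return self.clip(x)                               # costly attention over few tokens

encoder = DeepEncoder()

# Stage 2 as described in the note above: freeze the SAM and conv parts,
# then run the same language-modeling recipe with the DeepSeek-3B MoE decoder.
for module in (encoder.sam, encoder.conv):
    for p in module.parameters():
        p.requires_grad_(False)

# Only the global-attention stage stays trainable in this sketch.
print(sorted({n.split(".")[0] for n, p in encoder.named_parameters() if p.requires_grad}))  # ['clip']
```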
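On fitz: it's the import name of PyMuPDF, a PDF parsing and rendering library, so my guess is it appears in their data pipeline for rasterizing PDF pages into training images. A minimal usage example:

```python
import fitz  # PyMuPDF -- "fitz" is its historical import name

# Render the first page of a PDF to a PNG, the kind of step a
# PDF -> image training pipeline would need. Filenames are hypothetical.
doc = fitz.open("sample.pdf")
pix = doc[0].get_pixmap(dpi=144)  # rasterize page 0 at 144 dpi
pix.save("page0.png")
doc.close()
```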
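On the vision-token break-even question, a toy cost model. Every constant here is made up for illustration; with cheaper encoding or longer contexts the break-even point drops fast, which is presumably how the paper gets to "worth it after two uses":

```python
# Toy model: when does paying the encoder once beat dragging a long text
# context through every autoregressive decode step? Constants are invented.
text_tokens = 2560      # raw text context length
vision_tokens = 256     # same content at ~10x compression
encoder_cost = 50_000   # one-time projection + encoder pass (arbitrary units)
per_token_step = 1.0    # cost per context token per decode step (arbitrary units)

def total_cost(context_len, decode_steps, upfront=0.0):
    # every decode step attends over the (roughly fixed-size) context
    return upfront + per_token_step * context_len * decode_steps

for steps in (1, 2, 5, 10, 50):
    text = total_cost(text_tokens, steps)
    vision = total_cost(vision_tokens, steps, upfront=encoder_cost)
    print(f"{steps:>3} steps: text={text:>9.0f} vision={vision:>9.0f} vision wins: {vision < text}")

# Under these made-up constants the vision path wins after ~22 decode steps;
# the one-time encoder cost amortizes across every subsequent step.
```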
Related
- Put to the test: https://www.runpulse.com/blog/putting-deepseek-ocr-to-the-test