Interesting paper that introduces an encoder, DeepEncoder, which is really good at compressing the information inside an image into a small number of vision tokens while maintaining low activation memory.
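Rough token accounting for why the compression and the low activation memory go together. The specific numbers below are my recollection of the paper's base mode and should be treated as assumptions:

```python
# Back-of-the-envelope token math for DeepEncoder (numbers assumed, not verified).
image_size = 1024
patch_size = 16                                  # SAM-style patch embedding
patch_tokens = (image_size // patch_size) ** 2   # 4096 tokens, window attention only
conv_downsample = 16                             # conv compressor between SAM and CLIP stages
vision_tokens = patch_tokens // conv_downsample  # 256 tokens reach global attention

# Activation memory stays low because the expensive global-attention stage
# only ever sees the 256 compressed tokens, never the raw 4096.
text_tokens_equivalent = 10 * vision_tokens      # paper's ~10x compression at ~97% precision
print(patch_tokens, vision_tokens, text_tokens_equivalent)  # 4096 256 2560
```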
Questions
- I don’t really understand why they do the two-stage training. From what I understand, in the first stage they pair DeepEncoder with a small language model and train on language-modeling tasks, and in the second stage they freeze the convolutional and SAM parts and do the same thing again with the DeepSeek-3B MoE decoder. (A stage-2 sketch follows this list.)
- Unclear on the exact training data.
  - What is fitz? (A quick example follows this list.)
- How are they generating bounding boxes?
- I need to get better at parallelism: https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=high-level_overview
- Vision tokens are in some ways more expensive than text tokens, because creating them requires a linear projection and then a pass through the actual encoder. At what point does the vision encoder become worth using? Immediately on the first generation, or only after a certain number of autoregressive generations? It looks like the paper says even two uses make it worthwhile. (Rough break-even math follows this list.)
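On the two-stage training question above, a minimal sketch of how I picture stage 2's partial freezing. The module structure and names are placeholders I made up, not the paper's actual code:

```python
import torch.nn as nn

class DeepEncoder(nn.Module):
    """Stand-in pipeline: window attention -> 16x conv compressor -> global attention."""
    def __init__(self, dim=768):
        super().__init__()
        self.sam = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)   # placeholder for the SAM (window-attention) stage
        self.conv = nn.Conv1d(dim, dim, kernel_size=16, stride=16)               # placeholder for the conv compressor
        self.clip = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)  # placeholder for the CLIP (global-attention) stage

    def forward(self, tokens):                            # tokens: (batch, seq, dim)
        x = self.sam(tokens)                              # cheap attention over many tokens
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)  # shrink the sequence 16x
        return self.clip(x)                               # costly attention over few tokens

encoder = DeepEncoder()

# Stage 2 as described in the note above: freeze the SAM and conv parts,
# then run the same language-modeling recipe with the DeepSeek-3B MoE decoder.
for module in (encoder.sam, encoder.conv):
    for p in module.parameters():
        p.requires_grad_(False)

# Only the global-attention stage stays trainable in this sketch.
print(sorted({n.split(".")[0] for n, p in encoder.named_parameters() if p.requires_grad}))  # ['clip']
```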
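On fitz: it's the import name of PyMuPDF, a PDF parsing and rendering library, so my guess is it appears in their data pipeline for rasterizing PDF pages into training images. A minimal usage example:

```python
import fitz  # PyMuPDF -- "fitz" is its historical import name

# Render the first page of a PDF to a PNG, the kind of step a
# PDF -> image training pipeline would need. Filenames are hypothetical.
doc = fitz.open("sample.pdf")
pix = doc[0].get_pixmap(dpi=144)  # rasterize page 0 at 144 dpi
pix.save("page0.png")
doc.close()
```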
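On the vision-token break-even question, a toy cost model. Every constant here is made up for illustration; with cheaper encoding or longer contexts the break-even point drops fast, which is presumably how the paper gets to "worth it after two uses":

```python
# Toy model: when does paying the encoder once beat dragging a long text
# context through every autoregressive decode step? Constants are invented.
text_tokens = 2560      # raw text context length
vision_tokens = 256     # same content at ~10x compression
encoder_cost = 50_000   # one-time projection + encoder pass (arbitrary units)
per_token_step = 1.0    # cost per context token per decode step (arbitrary units)

def total_cost(context_len, decode_steps, upfront=0.0):
    # every decode step attends over the (roughly fixed-size) context
    return upfront + per_token_step * context_len * decode_steps

for steps in (1, 2, 5, 10, 50):
    text = total_cost(text_tokens, steps)
    vision = total_cost(vision_tokens, steps, upfront=encoder_cost)
    print(f"{steps:>3} steps: text={text:>9.0f} vision={vision:>9.0f} vision wins: {vision < text}")

# Under these made-up constants the vision path wins after ~22 decode steps;
# the one-time encoder cost amortizes across every subsequent step.
```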
Related
- Put to the test: https://www.runpulse.com/blog/putting-deepseek-ocr-to-the-test