We have a slack channel for

Unlock business potential through effective first dataset management solutions.
asimm22
Posts: 8
Joined: Thu May 22, 2025 5:18 am

Post by asimm22 »

The MRC compression decomposes each image into a background, foreground and foreground mask, heavily compressing (and sometimes downscaling) each layer separately. The mask is compressed losslessly, ensuring that the text and lines in an image do not suffer from compression artifacts and look clear. Using this method, we observe a 10x compression factor for most of our books.
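To make the layer split concrete, here is a minimal pure-Python sketch of the idea, not the actual pipeline. It assumes a grayscale image stored as a list of rows, a fixed darkness threshold for the mask, and block averaging for the downscale. The names `mrc_decompose`, `mrc_compose`, `downscale`, `threshold` and `scale` are all made up for illustration, and the real per-layer codecs are not shown.

```python
def downscale(layer, scale):
    """Shrink a layer by averaging each scale x scale block of pixels."""
    height, width = len(layer), len(layer[0])
    out = []
    for y in range(0, height, scale):
        row = []
        for x in range(0, width, scale):
            block = [layer[j][i]
                     for j in range(y, min(y + scale, height))
                     for i in range(x, min(x + scale, width))]
            row.append(sum(block) // len(block))
        out.append(row)
    return out


def mrc_decompose(image, threshold=128, scale=2):
    """Split a grayscale image (rows of 0-255 ints) into mask, foreground, background."""
    # Mask: True where a pixel is dark enough to count as text or line art.
    # It stays at full resolution, since it is the layer kept lossless.
    mask = [[pixel < threshold for pixel in row] for row in image]

    ink = [p for row, mrow in zip(image, mask) for p, m in zip(row, mrow) if m]
    paper = [p for row, mrow in zip(image, mask) for p, m in zip(row, mrow) if not m]
    # Fill values for the "holes" in each layer (assumption: the mean is good enough).
    ink_fill = sum(ink) // len(ink) if ink else 0
    paper_fill = sum(paper) // len(paper) if paper else 255

    # Foreground keeps ink pixels, background keeps paper pixels; the holes are
    # filled so each layer is smooth and tolerates heavy compression.
    foreground = [[p if m else ink_fill for p, m in zip(row, mrow)]
                  for row, mrow in zip(image, mask)]
    background = [[paper_fill if m else p for p, m in zip(row, mrow)]
                  for row, mrow in zip(image, mask)]

    return mask, downscale(foreground, scale), downscale(background, scale)


def mrc_compose(mask, foreground, background, scale=2):
    """Rebuild the image: the mask picks foreground or background per pixel."""
    return [[(foreground if m else background)[y // scale][x // scale]
             for x, m in enumerate(mrow)]
            for y, mrow in enumerate(mask)]
```

Because the mask is kept at full resolution, the edges of text and lines survive even when the foreground and background layers are downscaled and blurred.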

The PDFs themselves are created using the high-performance MuPDF and PyMuPDF Python libraries; both projects were supportive and promptly fixed various bugs, which propelled our efforts forward.

And best of all, we have expanded our community to include people all over the world who are working together to make cultural materials more available. We now have a Slack channel for OCR researchers and implementers that you can join if you would like (to join, drop an email to [email protected]). We look to contribute software and data sets to these projects to help them improve (led by Merlijn Wajer and Derek Fukumori).

Next steps to fulfill the dream of Vannevar Bush’s Memex, Ted Nelson’s Xanadu, Michael Hart’s Project Gutenberg, Tim Berners-Lee’s World Wide Web, Raj Reddy’s call for Universal Access to All Knowledge (and now the Internet Archive’s mission statement).