pdfcomp: problems with inverted text that is often better in hocr.

Test form: https://www.kvk.nl/download/Formulier-14-wijziging-ondernemings-en-vestigingsgegevens_tcm109-365607.pdf

I saved the first page as a JPEG using this site: https://smallpdf.com

![0001](https://user-images.githubusercontent.com/3341558/175789293-f39ddfdb-6f3e-4598-8d16-80a1f4a88b36.jpg)

In the OCRmyPDF output, the left column is quite readable at a suitable screen resolution.

```
ocrmypdf --pdfa-image-compression lossless -O0  0001.jpg formulierhocrjpg.pdf
Input file is not a PDF, checking if it is an image...
Input file is an image
Input image has no ICC profile, assuming sRGB
Image seems valid. Try converting to PDF...
Successfully converted to PDF, processing...
Scanning contents: 100%|████████████████████████| 1/1 [00:00<00:00, 73.93page/s]
OCR: 100%|██████████████████████████████████| 1.0/1.0 [00:09<00:00,  9.92s/page]
Postprocessing...
PDF/A conversion: 100%|█████████████████████████| 1/1 [00:00<00:00,  2.46page/s]
Optimize ratio: 1.00 savings: 0.0%
Output file is a PDF/A-2B (as expected)

pdfcomp formulierhocrjpg.pdf formulierhocrjpgkleiner.pdf
Compression factor: 9.617848822158944

```
[formulierhocrjpgkleiner.pdf](https://github.com/internetarchive/archive-pdf-tools/files/8985496/formulierhocrjpgkleiner.pdf)

The compressed file contains unreadable text in the left column. The hOCR contains "Toelichting 1.1", but in the compressed PDF that text is completely unreadable.

My patch for the inversion ratio makes it more readable:

[formulierhocrjpgkleinerpatch.pdf](https://github.com/internetarchive/archive-pdf-tools/files/8985519/formulierhocrjpgkleinerpatch.pdf)

However, if you look at the mask image, it doesn't contain the text from the left column at all.

So my patch isn't the only change that routine needs.
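
To illustrate what I mean by an inversion ratio, here is a minimal sketch of a per-box check. It is not pdfcomp's actual code: the names `dark_ratio` and `box_mask`, the binarization threshold of 128, and the 0.5 cut-off are all assumptions for illustration. The idea is that a word box (for example a `bbox` taken from the hOCR) whose pixels are mostly dark probably holds light text on a dark background, so the mask for that box should mark the light pixels as text instead of the dark ones.

```python
from typing import List, Tuple

# Hypothetical box type: (x0, y0, x1, y1), with x1 and y1 exclusive.
Box = Tuple[int, int, int, int]


def dark_ratio(gray: List[List[int]], box: Box, threshold: int = 128) -> float:
    """Fraction of pixels inside the box that are darker than the threshold."""
    x0, y0, x1, y1 = box
    total = (x1 - x0) * (y1 - y0)
    if total <= 0:
        return 0.0
    dark = sum(
        1
        for y in range(y0, y1)
        for x in range(x0, x1)
        if gray[y][x] < threshold
    )
    return dark / total


def box_mask(
    gray: List[List[int]],
    box: Box,
    threshold: int = 128,
    invert_above: float = 0.5,  # assumed cut-off, not a value taken from pdfcomp
) -> List[List[int]]:
    """Binary mask for one box, where 1 marks a pixel treated as text.

    If most of the box is dark, the text is assumed to be the light pixels.
    """
    x0, y0, x1, y1 = box
    inverted = dark_ratio(gray, box, threshold) > invert_above
    mask = []
    for y in range(y0, y1):
        row = []
        for x in range(x0, x1):
            is_dark = gray[y][x] < threshold
            # Normal box: dark pixels are text. Inverted box: light pixels are text.
            row.append(int(is_dark != inverted))
        mask.append(row)
    return mask


# Light text on a dark background: the two bright pixels should end up in the mask.
inverted_patch = [
    [20, 20, 20, 20],
    [20, 240, 240, 20],
    [20, 20, 20, 20],
]
print(box_mask(inverted_patch, (0, 0, 4, 3)))
```

With a check like this done per box rather than once for the whole page, the text in a dark left column would still land in the mask even when the rest of the page is ordinary dark-on-light text.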



