extract scanned image from pdf

mael15
Ultimate wxWidgets Guru
Posts: 542
Joined: Fri May 22, 2009 8:52 am
Location: Bremen, Germany

extract scanned image from pdf

Post by mael15 »

Hi everyone, happy new year! :)
I have to extract a scanned image from an external PDF and then use it in my wxDC. I have read
viewtopic.php?f=30&t=44130&p=181192&hil ... ge#p181192
but I do not need text etc.; the external PDFs are created directly by the scanner, so every page is just one image. Is there a simple way to extract these images? PDFium seems like overkill, and I do not need a GUI like the one wxPdfView provides.
Thank you!
xaviou
Super wx Problem Solver
Posts: 437
Joined: Mon Aug 21, 2006 3:18 pm
Location: Annecy - France

Re: extract scanned image from pdf

Post by xaviou »

Hi.

You can easily do it "from scratch".
I've just run a test on such a PDF file, using a hex editor (https://mh-nexus.de/en/hxd) to see how a JPEG image is stored.

First comes the PDF header, then some information about the object itself (in this case, a JPEG image), and then the object's data:

00000000  25 50 44 46 2D 31 2E 33 0D 31 20 30 20 6F 62 6A  %PDF-1.3.1 0 obj
00000010  0D 3C 3C 2F 54 79 70 65 20 2F 58 4F 62 6A 65 63  .<</Type /XObjec
00000020  74 20 2F 53 75 62 74 79 70 65 20 2F 49 6D 61 67  t /Subtype /Imag
00000030  65 20 2F 4E 61 6D 65 20 2F 49 6D 31 20 2F 57 69  e /Name /Im1 /Wi
00000040  64 74 68 20 31 36 35 34 20 2F 48 65 69 67 68 74  dth 1654 /Height
00000050  20 32 33 33 38 20 2F 4C 65 6E 67 74 68 20 36 38   2338 /Length 68
00000060  38 31 35 37 2F 43 6F 6C 6F 72 53 70 61 63 65 20  8157/ColorSpace 
00000070  2F 44 65 76 69 63 65 52 47 42 20 2F 42 69 74 73  /DeviceRGB /Bits
00000080  50 65 72 43 6F 6D 70 6F 6E 65 6E 74 20 38 20 2F  PerComponent 8 /
00000090  46 69 6C 74 65 72 20 5B 20 2F 44 43 54 44 65 63  Filter [ /DCTDec
000000A0  6F 64 65 20 5D 20 3E 3E 20 73 74 72 65 61 6D 0D  ode ] >> stream.
000000B0  FF D8 FF E0 00 10 4A 46 49 46 00 01 01 01 00 C8  ÿØÿà..JFIF.....È
000000C0  00 C8 00 00 FF DB 00 43 00 08 06 06 07 06 05 08  .È..ÿÛ.C........
000000D0  07 07 07 09 09 08 0A 0C 14 0D 0C 0B 0B 0C 19 12  ................
000000E0  13 0F 14 1D 1A 1F 1E 1D 1A 1C 1C 20 24 2E 27 20  ........... $.' 
000000F0  22 2C 23 1C 1C 28 37 29 2C 30 31 34 34 34 1F 27  ",#..(7),01444.'

..................

000A8000  14 50 01 45 14 50 01 45 14 50 01 45 14 50 01 45  .P.E.P.E.P.E.P.E
000A8010  14 50 01 45 14 50 01 45 14 50 01 45 14 50 01 45  .P.E.P.E.P.E.P.E
000A8020  14 50 01 45 14 50 01 45 14 50 01 45 14 50 01 45  .P.E.P.E.P.E.P.E
000A8030  14 50 01 45 14 50 01 45 14 50 01 45 14 50 01 45  .P.E.P.E.P.E.P.E
000A8040  14 50 01 45 14 50 01 45 14 50 01 45 14 50 01 45  .P.E.P.E.P.E.P.E
000A8050  14 50 01 45 14 50 01 45 14 50 01 45 14 50 01 45  .P.E.P.E.P.E.P.E
000A8060  14 50 01 45 14 50 01 45 14 50 01 45 14 50 01 45  .P.E.P.E.P.E.P.E
000A8070  14 50 01 45 14 50 01 45 14 50 01 45 14 50 01 45  .P.E.P.E.P.E.P.E
000A8080  14 50 01 45 14 50 01 45 14 50 01 45 14 50 01 45  .P.E.P.E.P.E.P.E
000A8090  14 50 01 45 14 50 01 45 14 50 01 45 14 50 01 45  .P.E.P.E.P.E.P.E
000A80A0  14 50 01 45 14 50 01 45 14 50 01 45 14 50 01 45  .P.E.P.E.P.E.P.E
000A80B0  14 50 01 45 14 50 01 45 14 50 01 45 14 50 01 45  .P.E.P.E.P.E.P.E
000A80C0  14 50 01 45 14 50 01 45 14 50 07 FF D9 65 6E 64  .P.E.P.E.P.ÿÙend
000A80D0  73 74 72 65 61 6D 0D 65 6E 64 6F 62 6A 0D 32 20  stream.endobj.2 
000A80E0  30 20 6F 62 6A 0D 3C 3C 20 2F 4C 65 6E 67 74 68  0 obj.<< /Length
000A80F0  20 34 37 20 0D 3E 3E 0D 73 74 72 65 61 6D 0D 71   47 .>>.stream.q
000A8100  20 35 39 35 2E 34 34 20 30 20 30 20 38 34 31 2E   595.44 0 0 841.
000A8110  36 38 20 30 2E 30 30 20 30 2E 30 30 20 63 6D 20  68 0.00 0.00 cm 
000A8120  31 20 67 20 2F 49 6D 31 20 44 6F 20 51 0D 65 6E  1 g /Im1 Do Q.en
000A8130  64 73 74 72 65 61 6D 0D 65 6E 64 6F 62 6A 0D 33  dstream.endobj.3
..................
I tried removing the bytes from 0x00 to 0xAF and from 0xA80CD to the end, and it worked: what was left was a valid JPEG image file.
To check this, you have the real length of the object in its dictionary (the /Length entry).
Here, Length 688157 corresponds to 0xA80CC (address of the last byte of the image) - 0xAF (address of the last byte before the image) = 688157 bytes.
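
As a quick sanity check of that arithmetic, the sketch below recomputes the stream length from the two offsets visible in the dump. The constant and function names are made up for illustration only.

```cpp
#include <cstddef>

// Offsets read from the hex dump above (illustrative names).
constexpr std::size_t kFirstImageByte = 0xB0;    // FF D8, first byte of the JPEG
constexpr std::size_t kLastImageByte  = 0xA80CC; // D9, last byte of the JPEG

// Number of bytes in the inclusive range [first, last].
constexpr std::size_t InclusiveLength(std::size_t first, std::size_t last)
{
    return last - first + 1;
}

// Same value as 0xA80CC - 0xAF, and it matches "/Length 688157".
static_assert(InclusiveLength(kFirstImageByte, kLastImageByte) == 688157,
              "stream length should match the /Length entry");
```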

So you can extract the needed bytes in memory and create a wxImage directly from the extracted data.
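
The manual steps above could be sketched roughly as follows. This is only a minimal sketch that assumes the simple layout shown in the dump: a single image dictionary naming /DCTDecode, a direct numeric /Length, and no nested dictionaries. `ExtractFirstJpeg` is a hypothetical helper, not part of wxWidgets or of any PDF library, and it expects the whole PDF file to be already loaded into a `std::string`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: copies the raw bytes of the first /DCTDecode (JPEG)
// stream found in `pdf` into `jpeg`. Returns false whenever the file does
// not match the simple layout assumed here.
bool ExtractFirstJpeg(const std::string& pdf, std::string& jpeg)
{
    const std::size_t npos = std::string::npos;

    // 1. Find the filter name that marks a JPEG-encoded stream.
    const std::size_t filter = pdf.find("/DCTDecode");
    if (filter == npos) return false;

    // 2. Assume the dictionary opens at the nearest "<<" before the filter.
    const std::size_t dict = pdf.rfind("<<", filter);
    if (dict == npos) return false;

    // 3. The "stream" keyword follows the dictionary.
    const std::size_t streamKw = pdf.find("stream", filter);
    if (streamKw == npos) return false;

    // 4. Read the direct /Length value from the dictionary.
    const std::size_t lengthKey = pdf.find("/Length", dict);
    if (lengthKey == npos || lengthKey > streamKw) return false;
    std::size_t pos = lengthKey + 7; // 7 == strlen("/Length")
    while (pos < pdf.size() && pdf[pos] == ' ') ++pos;
    std::size_t length = 0, digits = 0;
    while (pos < pdf.size() && pdf[pos] >= '0' && pdf[pos] <= '9') {
        length = length * 10 + static_cast<std::size_t>(pdf[pos] - '0');
        ++pos;
        ++digits;
    }
    if (digits == 0) return false;
    // A second number here means an indirect reference ("/Length 5 0 R"),
    // which this sketch does not resolve.
    while (pos < pdf.size() && pdf[pos] == ' ') ++pos;
    if (pos < pdf.size() && pdf[pos] >= '0' && pdf[pos] <= '9') return false;

    // 5. Skip the end-of-line after "stream" (CR, LF or CR LF).
    std::size_t start = streamKw + 6; // 6 == strlen("stream")
    if (start < pdf.size() && pdf[start] == '\r') ++start;
    if (start < pdf.size() && pdf[start] == '\n') ++start;

    // 6. Bounds check, then verify the JPEG start marker FF D8.
    if (length < 2 || start > pdf.size() || length > pdf.size() - start)
        return false;
    if (static_cast<unsigned char>(pdf[start]) != 0xFF ||
        static_cast<unsigned char>(pdf[start + 1]) != 0xD8)
        return false;

    jpeg = pdf.substr(start, length);
    return true;
}
```

On success, the bytes in `jpeg` could then be wrapped in a `wxMemoryInputStream(jpeg.data(), jpeg.size())` and loaded with the `wxImage(stream, wxBITMAP_TYPE_JPEG)` constructor, provided the JPEG handler has been registered first (for example with `wxInitAllImageHandlers()`).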

Regards
Xav'
My wxWidgets stuff web page : X@v's wxStuff
mael15

Re: extract scanned image from pdf

Post by mael15 »

Wow, that is a pretty unique approach!
I just checked three different PDFs in a hex editor. While the first few lines are quite similar, there are some major differences before the binary data starts. One PDF was version 1.4 and the other 1.6, so I guess your approach would work fine if all the PDFs came from the same scanner and it is never updated?
In my case, I get these PDFs from different sources, so I suppose I need something less error-prone?
xaviou

Re: extract scanned image from pdf

Post by xaviou »

Hi.

Perhaps you can try building the PoDoFo library and using it in your project.

I did not try it myself, but it comes with a tool named "podofoimgextract" which should do the job (or which you can take inspiration from).

I could not find any prebuilt version of the library, so I can't test it: sorry.

Regards
Xav'