Discussion:
ghostscript PDF page extraction, leaving text as text
David Mathog
2010-05-06 21:19:15 UTC
Ghostscript may be used to extract pages from a PDF file with a
command like this:

gs -sDEVICE=pdfwrite \
-dNOPAUSE -dBATCH -dSAFER \
-dFirstPage=48 -dLastPage=48 \
-sOutputFile=onepage.pdf input.pdf

The problem is that, while the extracted page looks the same as the
original in a PDF reader, it seems to be an image rather than an
"object" representation. That is, open the original PDF in something
like Acrobat or PDF XChange Viewer and "search" and "text selection"
work, whereas in the extracted one neither function works. Presumably
this is because the text has been rasterized.

Is it possible to use gs to extract ranges of pages, preferably also
reducing the resolution of the embedded images, but leaving the text
as text? I frequently need to reduce the size of PDF files, but it
should all come out of the resolution of the images, and the text
should remain as accessible as it was in the original.

If ghostscript cannot do this, is there another linux tool that can?

Thanks,

David Mathog
LEE Sau Dan
2010-05-07 01:58:58 UTC
David> gs -sDEVICE=pdfwrite \
David> -dNOPAUSE -dBATCH -dSAFER \
David> -dFirstPage=48 -dLastPage=48 \
David> -sOutputFile=onepage.pdf input.pdf

I've just tried this with a PDF file, and it works: search and select
work on both onepage.pdf and input.pdf.


David> The problem is that, while the extracted page looks the same
David> as the original in a PDF reader, it seems to be an image
David> rather than an "object" representation. That is, open the
David> original PDF in something like Acrobat or PDF XChange Viewer
David> and "search" and "text selection" work, whereas in the
David> extracted one neither function works. Presumably this is
David> because the text has been rasterized.

Maybe, your PDF file has used some special features (e.g. transparency),
so that GS has decided that the most faithful way of converting it into
PDF is to rasterize the page?


David> Is it possible to use gs to extract ranges of pages,
David> preferably also reducing the resolution of the embedded
David> images, but leaving the text as text? I frequently need to
David> reduce the size of PDF files, but it should all come out of
David> the resolution of the images, and the text should remain as
David> accessible as it was in the original.

For page selection, try 'pdftk' or 'pdfjam'.
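
With pdftk, for instance, pulling out a page range looks roughly like
the sketch below. It assumes pdftk is installed; extract_pages is only
an illustrative helper name and the file names are placeholders. pdftk
copies the pages without re-rendering them, so text stays as it was in
the input, but image resolution is not reduced either.

```shell
# Hypothetical helper: copy pages FIRST..LAST of a PDF into a new file.
# Usage: extract_pages input.pdf onepage.pdf 48 48
extract_pages() {
    in=$1; out=$2; first=$3; last=$4
    # "cat" selects the page range, "output" names the result file.
    pdftk "$in" cat "${first}-${last}" output "$out"
}
```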


David> If ghostscript cannot do this, is there another linux tool
David> that can?

GS can do it, but maybe not in your special case.
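
For the image-resolution part, pdfwrite accepts the distiller-style
downsampling parameters. A minimal sketch, under some assumptions:
extract_small is a made-up helper name, and the 100 dpi default is an
arbitrary choice, not a recommendation. Images that are already close
to the target resolution may be left alone.

```shell
# Hypothetical helper: extract pages FIRST..LAST and downsample the
# colour and greyscale images to DPI (default 100, chosen arbitrarily).
# Usage: extract_small input.pdf small.pdf 48 48 100
extract_small() {
    in=$1; out=$2; first=$3; last=$4; dpi=${5:-100}
    gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER \
       -dFirstPage="$first" -dLastPage="$last" \
       -dDownsampleColorImages=true -dColorImageResolution="$dpi" \
       -dDownsampleGrayImages=true -dGrayImageResolution="$dpi" \
       -sOutputFile="$out" "$in"
}
```

Whether the text survives as searchable text still depends on the
input file, as in your case.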
--
Lee Sau Dan 李守敦

E-mail: ***@informatik.uni-freiburg.de
Home page: http://www.informatik.uni-freiburg.de/~danlee

a***@hotmail.com
2010-05-07 12:24:51 UTC
In some cases such as bitmap or Type3 fonts -dEmbedAllFonts=true has
worked.

Ed
David Mathog
2010-05-07 16:16:58 UTC
Post by a***@hotmail.com
In some cases such as bitmap or Type3 fonts -dEmbedAllFonts=true has
worked.
Tried that and it didn't work.
Post by LEE Sau Dan
Maybe, your PDF file has used some special features (e.g. transparency),
so that GS has decided that the most faithful way of converting it into
PDF is to rasterize the page?
Yes, I think there is something wrong with the input file. When it
runs through ghostscript this is emitted:

GPL Ghostscript 8.64 (2009-02-03)
Copyright (C) 2009 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
**** Warning: File has an invalid xref entry: 22. Rebuilding
xref table.
Processing pages 1 through 1.
Page 1

**** This file had errors that were repaired or ignored.
**** The file was produced by:
**** >>>> Mac OS X 10.5.8 Quartz PDFContext <<<<
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.

This run was on Linux (but the problem originally surfaced with
ghostscript on Windows). As you can see the PDF was generated on a
Mac, so it could be some Mac vs. other OS incompatibility. Plus
whatever this xref corruption is.

Here is the problem input file (25MB - that's why I want to reduce
it.)

http://saf.bio.caltech.edu/pub/pickup/problem.pdf

It is a set of lecture notes from a class. The unselectable text
problem surfaces even when only the first page is extracted.
Helge Blischke
2010-05-07 16:57:46 UTC
Post by David Mathog
Post by a***@hotmail.com
In some cases such as bitmap or Type3 fonts -dEmbedAllFonts=true has
worked.
Tried that and it didn't work.
Post by LEE Sau Dan
Maybe, your PDF file has used some special features (e.g. transparency),
so that GS has decided that the most faithful way of converting it into
PDF is to rasterize the page?
Yes, I think there is something wrong with the input file. When it
runs through ghostscript this is emitted:
GPL Ghostscript 8.64 (2009-02-03)
Copyright (C) 2009 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
**** Warning: File has an invalid xref entry: 22. Rebuilding
xref table.
Processing pages 1 through 1.
Page 1
**** This file had errors that were repaired or ignored.
**** The file was produced by:
**** >>>> Mac OS X 10.5.8 Quartz PDFContext <<<<
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.
This run was on Linux (but the problem originally surfaced with
ghostscript on Windows). As you can see the PDF was generated on a
Mac, so it could be some Mac vs. other OS incompatibility. Plus
whatever this xref corruption is.
Here is the problem input file (25MB - that's why I want to reduce
it.)
http://saf.bio.caltech.edu/pub/pickup/problem.pdf
It is a set of lecture notes from a class. The unselectable text
problem surfaces even when only the first page is extracted.
The "text" on the first two pages indeed is contained in images.
Apart from that, the PDF as downloaded from your site is OK. The xref
complaints you got must be due to a transfer error (probably some
end-of-line conversion during transfer?).

Helge
David Mathog
2010-05-07 18:19:57 UTC
Post by Helge Blischke
The "text" on the first two pages indeed is contained in images.
Apart from that, the PDF as downloaded from your site is OK. The xref
complaints you got must be due to a transfer error (probably some end of line
conversion during transfer?).
It still isn't working for me, and the transfer was OK: the md5sum
didn't change. Ghostscript is 8.64.
On the Linux system, extract a single page with this command:

gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER \
-dFirstPage=48 -dLastPage=48 \
-sOutputFile=foo48.pdf -dEmbedAllFonts=true \
BMB170c_2010_LECTURE11.pdf

Open that up with PDF XChange Viewer on Windows XP. It looks OK.
HOWEVER, search doesn't work. To see why, select any text, then copy
and paste it into a word processor. Garbage. Search on the original
page did work. So it looks like ghostscript is remapping the
characters to the font entries during the extraction, possibly at the
**** Warning: File has an invalid xref entry: 22. Rebuilding
xref table.

step. Specifically, the first text on page 48 of the original is
"Polysaccharide A" and it copies that way to another application.
However, in the extracted page 48 copy/paste of that text reads "=%∀#:
3∃∃635∋79)Β" (some of those are unprintable characters, not sure how
they will post) even though when displayed in the PDF viewer it still
displays as "Polysaccharide A". Here is the extracted file:

http://saf.bio.caltech.edu/pub/pickup/foo48.pdf

The properties for the "Polysaccharide A" text were:

Font: Calibri (Embedded Subset)
Type: TrueType
Encoding: Custom
Object Number: [X]
Global Object ID: 0
Font Size 43.5 pt
Horizontal Scaling: 100%
Baseline Offset: 0.0 pt

in both the original and the extracted page, with [X]=10 in the
extracted file, [X]=23 in the original. Not sure if this relates to
the error message, could be a coincidence, but as all programmers
know, 22 is 23 if you count from 0 instead of 1.

Thanks.

David Mathog
ken
2010-05-08 08:02:02 UTC
Post by David Mathog
gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER \
-dFirstPage=48 -dLastPage=48 \
-sOutputFile=foo48.pdf -dEmbedAllFonts=true \
BMB170c_2010_LECTURE11.pdf
Open that up with PDF Xchange viewer on Windows XP. It looks OK.
HOWEVER, search doesn't work.
To see why, select any text, copy and paste into a word processor.
Garbage. Search on the original page did work.
This is an example of why it's important to describe the problem
carefully. Your original post was quite firm that the problem was
conversion to a bitmap image.

This is actually quite a different problem. In order to
search/copy/paste text, Acrobat wants a ToUnicode CMap in the output PDF
file; this allows it to 'know' what the Unicode code point is for a
given glyph on the page.

Without that, Acrobat will fall back to other approaches: if the
Encoding for the font is one of the standard encodings then it will use
that to work out the Unicode values. If the Encoding is non-standard,
but the glyph names are recognisable, it will try to use those.

If none of the above is true, then it is forced to give up. In this case
it copies the character indices directly from the PDF file, as a stream
of bytes. From your later description it seems to me that this is what
is happening.
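
One way to check which case you are in, sketched with the poppler
command-line tools. This assumes pdffonts and pdftotext are installed;
check_text_layer is a hypothetical helper name. pdffonts prints a "uni"
column saying whether each font carries a ToUnicode map, and pdftotext
shows what a consumer can actually recover from the page.

```shell
# Hypothetical helper: list the fonts used on one page, then test
# whether a phrase visible on that page can be extracted as text.
# Usage: check_text_layer foo48.pdf 48 'Polysaccharide A'
check_text_layer() {
    pdf=$1; page=$2; phrase=$3
    # The "uni" column reports yes/no for a ToUnicode map per font.
    pdffonts -f "$page" -l "$page" "$pdf"
    # "-" sends the extracted text to stdout instead of a file.
    if pdftotext -f "$page" -l "$page" "$pdf" - | grep -qF -- "$phrase"; then
        echo "text extracts cleanly"
    else
        echo "text does not extract cleanly"
    fi
}
```

Running it on both the original and the extracted page should show
whether the two files differ in this respect.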
Post by David Mathog
So it looks like ghostscript is remapping the
characters to the font entries during
the extraction, possibly at the
**** Warning: File has an invalid xref entry: 22. Rebuilding
xref table.
step.
No, that is Ghostscript telling you that it thinks there is something
wrong with the original file. The xref table simply tells the PDF
consumer where to find all the objects in the file so that it can
interpret them. If the index is damaged then Ghostscript will scan the
entire file looking for object declarations (e.g. 1 0 obj) and build a
completely new index from that information. The presence or absence of a
ToUnicode CMap, and the encoding of the fonts, is not affected by this.
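
The rebuild scan can be imitated very roughly with GNU grep. This is
only an illustration of the idea, not what Ghostscript actually runs;
list_pdf_objects is a hypothetical name, and the pattern will miss
objects stored inside compressed object streams and may also match
bytes that happen to look like a declaration inside binary stream data.

```shell
# Hypothetical helper: print "byte-offset:declaration" for every
# "N G obj" object declaration found in a PDF file.
# Usage: list_pdf_objects problem.pdf
list_pdf_objects() {
    # -a: treat the binary file as text
    # -b: print the byte offset of each match
    # -o: print only the matching part
    LC_ALL=C grep -a -b -o -E '[0-9]+ [0-9]+ obj' "$1"
}
```

An xref table is essentially this list of offsets, indexed by object
number.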
Post by David Mathog
Font: Calibri (Embedded Subset)
Type: TrueType
Encoding: Custom
So it's not a standard Encoding, as I suspected.

Without seeing the original file (and no, I'm sorry but I'm not going to
download and examine a 25MB file) I can't really say for sure what is
going on. However, I would suggest that you try a more up-to-date
version of Ghostscript. 8.64 is a year old now (the current version is
8.71), and there have been a number of changes to pdfwrite over the
last year.


Ken

uhhu
2010-05-07 20:37:09 UTC
Post by David Mathog
Here is the problem input file (25MB - that's why I want to reduce
it.)
http://saf.bio.caltech.edu/pub/pickup/problem.pdf
Go to page 47 and search for ">". Note that the two "ti" strings also
match!
David Mathog
2010-05-07 20:49:18 UTC
Post by uhhu
Post by David Mathog
Here is the problem input file (25MB - that's why I want to reduce
it.)
http://saf.bio.caltech.edu/pub/pickup/problem.pdf
Go to page 47 and search for ">". Note that the two "ti" strings also
match!
Yup. Also "tt" as seen in the display is represented by "K" in the
encoding. (Search for "zwiker" and it finds "zwitter".) Feels like
the PDF generator had a few too many characters to fit into a finite
sized lookup table and so used an alternative encoding.