Extracting PDF Document Content
The Document class provides alternate access to the content of a PDF document. Document meta-data, text, and images can all be extracted.
ICEpdf supports extracting document meta-data via the API that is available on the document hierarchy classes in the com.icesoft.pdf.pobjects package. The main entry-point into the document meta-data is the Document class.
See Content Extraction Examples for an example that illustrates extracting meta-data from a document. Also, see the API documentation for the com.icesoft.pdf.pobjects package for more information on what types of data are available.
Text extraction is possible for most PDF documents. There are, however, some limitations that depend on how a document's text is encoded and on the type of font used to render the text.
Note: If a document is encrypted, the document permissions should be checked to make sure that content extraction is allowed.
The following code demonstrates how to extract text from the first page of a PDF document. The text on the first page of the document is extracted into a vector of StringBuffer objects using the Document getPageText(int pgNumber) method. The StringBuffer vector is then iterated and each entry is written, followed by a line break, to a single file that contains all of the text for the page.
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.net.URL;
    import java.util.Enumeration;

    // load the file
    URL documentURL = new URL("your url");
    Document document = new Document();
    document.setUrl(documentURL);
    try {
        // create an output file
        FileOutputStream fileOutputStream = new FileOutputStream("extracted.txt");
        // write each StringBuffer of page text on its own line
        Enumeration pageText = document.getPageText(0).elements();
        while (pageText.hasMoreElements()) {
            StringBuffer text = (StringBuffer) pageText.nextElement();
            fileOutputStream.write(text.toString().getBytes());
            fileOutputStream.write(10); // line break
        }
        fileOutputStream.close();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // clean up the document resources
        document.dispose();
    }

Image extraction is possible for all PDF documents.
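Looking back at the text extraction example, the file-writing step can be tried on its own without a PDF or the ICEpdf library. The following is a minimal sketch under stated assumptions: the class name PageTextWriter and the method writeLines are hypothetical, the vector is filled with stand-in strings rather than the result of getPageText(0), and UTF-8 is chosen explicitly so that the output does not depend on the platform default encoding.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Vector;

// Hypothetical helper, not part of ICEpdf.
public class PageTextWriter {

    // Writes each StringBuffer on its own line, encoded as UTF-8.
    public static void writeLines(Vector<StringBuffer> pageText, Path target)
            throws IOException {
        // try-with-resources closes the stream even if a write fails
        try (OutputStream out = Files.newOutputStream(target)) {
            for (StringBuffer text : pageText) {
                out.write(text.toString().getBytes(StandardCharsets.UTF_8));
                out.write('\n'); // same byte value (10) as in the example above
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data; in the example above the vector comes from the document.
        Vector<StringBuffer> pageText = new Vector<>();
        pageText.add(new StringBuffer("First line of page text"));
        pageText.add(new StringBuffer("Second line of page text"));

        Path target = Files.createTempFile("extracted", ".txt");
        writeLines(pageText, target);
        System.out.println("Wrote " + Files.size(target) + " bytes to " + target);
    }
}
```

Closing the stream through try-with-resources means the file handle is released even when a write throws, which the inline close() call in the original example does not guarantee.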
Note: If a document is encrypted, the document permissions should be checked to make sure that content extraction is allowed.
The following code demonstrates how to extract images from the first page of a PDF document. The images on the first page of the document are extracted into a vector of Image objects using the Document getPageImages(int pgNumber) method. The image vector is then iterated and each image is saved to disk as a separate JPEG file.
    import java.awt.Graphics2D;
    import java.awt.Image;
    import java.awt.image.BufferedImage;
    import java.awt.image.RenderedImage;
    import java.io.File;
    import java.io.IOException;
    import java.net.URL;
    import java.util.Enumeration;
    import javax.imageio.ImageIO;

    // load the file
    URL documentURL = new URL("your url");
    Document document = new Document();
    document.setUrl(documentURL);
    // Get the images for a single page
    Enumeration tmpImages = document.getPageImages(0).elements();
    // Save the images as JPEGs
    int count = 0;
    while (tmpImages.hasMoreElements()) {
        Image image = (Image) tmpImages.nextElement();
        // create new buffered image to paint to; "this" is the ImageObserver,
        // so the enclosing class is assumed to be a Component or similar
        BufferedImage bufferedImage = new BufferedImage(image.getWidth(this),
                image.getHeight(this), BufferedImage.TYPE_INT_RGB);
        Graphics2D g2d = bufferedImage.createGraphics();
        g2d.drawImage(image, 0, 0, image.getWidth(this), image.getHeight(this), this);
        RenderedImage rendImage = bufferedImage;
        try {
            // Save as JPEG
            File file = new File("newimage_" + count + ".jpg");
            ImageIO.write(rendImage, "jpg", file);
        } catch (IOException e) {
            e.printStackTrace();
        }
        g2d.dispose();
        count++; // next image gets its own file name
    }
    // Clean up document resources
    document.dispose();
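The image-saving loop can also be exercised without a PDF. The sketch below is an illustration under stated assumptions, not ICEpdf code: the class name PageImageSaver and the method saveAsJpegs are hypothetical, the input vector holds generated BufferedImage stand-ins instead of the result of getPageImages(0), and a null ImageObserver is used, which is adequate only for images that are already fully loaded.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.Image;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of ICEpdf.
public class PageImageSaver {

    // Copies each image into an RGB BufferedImage and saves it as <prefix><index>.jpg.
    public static List<File> saveAsJpegs(Vector<Image> images, File dir, String prefix)
            throws IOException {
        List<File> written = new ArrayList<>();
        int count = 0;
        for (Image image : images) {
            // A null observer is fine for fully loaded images such as BufferedImage.
            int width = image.getWidth(null);
            int height = image.getHeight(null);
            BufferedImage rgb = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
            Graphics2D g2d = rgb.createGraphics();
            try {
                g2d.drawImage(image, 0, 0, width, height, null);
            } finally {
                g2d.dispose(); // release the graphics context even if drawing fails
            }
            File file = new File(dir, prefix + count + ".jpg");
            // ImageIO.write returns false when no suitable writer is found
            if (!ImageIO.write(rgb, "jpg", file)) {
                throw new IOException("No JPEG writer available for " + file);
            }
            written.add(file);
            count++; // give the next image a distinct file name
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in images; in the example above they come from the document.
        Vector<Image> images = new Vector<>();
        for (Color color : new Color[] {Color.RED, Color.BLUE}) {
            BufferedImage img = new BufferedImage(32, 16, BufferedImage.TYPE_INT_ARGB);
            Graphics2D g = img.createGraphics();
            g.setColor(color);
            g.fillRect(0, 0, 32, 16);
            g.dispose();
            images.add(img);
        }
        File dir = Files.createTempDirectory("extracted-images").toFile();
        for (File file : saveAsJpegs(images, dir, "newimage_")) {
            System.out.println("Saved " + file);
        }
    }
}
```

Painting into a TYPE_INT_RGB image before writing drops any alpha channel, which the JPEG format cannot store. Incrementing the counter on every pass keeps each image in its own file.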
Copyright 2005-2007. ICEsoft Technologies, Inc. http://www.icesoft.com