Extract Text from pdf using C#

Hi,
We are solution developers using Acrobat. As we have a requirement to extract text from PDF files using C#, we downloaded and installed the Adobe SDK. We found only four examples in C#, and those are used only for viewing a PDF in a Windows application. Can you please guide us on how to extract text from a PDF using the SDK in C#?
Thank you for your help.
Regards
kiranmai

Okay, so I went ahead and added text extraction to my own C# application. The client had requested this feature anyhow; we were originally told to skip it if it wasn't "cut and dry", but it wasn't bad, so I gave the client the text extraction they wanted. I decided I'd post the source code here for you. This returns the text from the entire document as a string.
        // Requires: using System; using System.Reflection; using System.Text; using Acrobat;
        private static string GetText(AcroPDDoc pdDoc)
        {
            StringBuilder text = new StringBuilder();
            int pages = pdDoc.GetNumPages();
            // The JavaScript bridge object belongs to the document, so fetch it once.
            object jso = pdDoc.GetJSObject();
            if (jso == null)
            {
                return "";
            }
            for (int i = 0; i < pages; i++)
            {
                try
                {
                    object[] args = new object[] { i };
                    object jsNumWords = jso.GetType().InvokeMember("getPageNumWords", BindingFlags.InvokeMethod, null, jso, args, null);
                    int numWords = Int32.Parse(jsNumWords.ToString());
                    // Word indices run from 0 to numWords - 1.
                    for (int j = 0; j < numWords; j++)
                    {
                        // The third argument (false) keeps punctuation and whitespace on each word.
                        object[] argsj = new object[] { i, j, false };
                        object jsWord = jso.GetType().InvokeMember("getPageNthWord", BindingFlags.InvokeMethod, null, jso, argsj, null);
                        text.Append((string)jsWord);
                    }
                }
                catch
                {
                    // Skip pages whose text cannot be read.
                }
            }
            return text.ToString();
        }

Similar Messages

How to read/extract text from pdf

Respected All,
I want to read/extract text from a PDF. I tried using Etymon but did not succeed.
Could anyone guide me on this?
Thanks and regards,
Ajay.

Thank you very much Abhilshit, PDFBox works for reading pdf.
Regards,
Ajay.

Applescript or workflow to extract text from PDF and rename PDF with the results

Hi Everyone,
I get supplied hundreds of PDFs which each contain a stock code, but the PDFs themselves are not named consistently, or they are supplied as multi-page PDFs.
What I need to do is name each PDF with the code which is in the text on the PDF.
It would work like this in an ideal world:
1. Split PDF into single pages
2. Extract text from PDF
3. Rename PDF using the extracted text
I'm struggling with part 3!
I can get a text file containing just the code (I extract it using a call to BBEdit).
I did think about using a variable for the name, but the rename function doesn't let me use variables.

Hello
You may also try the following AppleScript, which is a wrapper around a RubyCocoa script. It will ask you to choose the source PDF files and a destination directory. Then it will scan the text of each page of the PDF files for the predefined pattern and save the page as a new PDF file, named after the string matched by the pattern, in the destination directory. Pages which do not contain a string matching the pattern are ignored. (Ignored pages, if any, are reported in the result of the script.)
Currently the regex pattern is set to:
/HB-.._[0-9]{6}/
which means HB- followed by two characters and _ and 6 digits.
Minimally tested under 10.6.8.
Hope this may help,
H
_main()
on _main()
    script o
        property aa : choose file with prompt ("Choose pdf files.") of type {"com.adobe.pdf"} ¬
            default location (path to desktop) with multiple selections allowed
        set my aa's beginning to choose folder with prompt ("Choose destination folder.") ¬
            default location (path to desktop)
        set args to ""
        repeat with a in my aa
            set args to args & a's POSIX path's quoted form & space
        end repeat
        considering numeric strings
            if (system info)'s system version < "10.9" then
                set ruby to "/usr/bin/ruby"
            else
                set ruby to "/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby"
            end if
        end considering
        do shell script ruby & " <<'EOF' - " & args & "
require 'osx/cocoa'
include OSX
require_framework 'PDFKit'
outdir = ARGV.shift.chomp('/')
ARGV.select {|f| f =~ /\\.pdf$/i }.each do |f|
    url = NSURL.fileURLWithPath(f)
    doc = PDFDocument.alloc.initWithURL(url)
    path = doc.documentURL.path
    pcnt = doc.pageCount
    (0 .. (pcnt - 1)).each do |i|
        page = doc.pageAtIndex(i)
        page.string.to_s =~ /HB-.._[0-9]{6}/
        name = $&
        unless name
            puts \"no matching string in page #{i + 1} of #{path}\"
            next # ignore this page
        end
        doc1 = PDFDocument.alloc.initWithData(page.dataRepresentation) # doc for this page
        unless doc1.writeToFile(\"#{outdir}/#{name}.pdf\")
            puts \"failed to save page #{i + 1} of #{path}\"
        end
    end
end
EOF"
    end script
    tell o to run
end _main
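
The pattern used in the script above can also be tried on its own before running it over a batch of files. The following is a minimal Python sketch, not part of the original script; the helper name find_stock_code and the sample strings are made up for illustration.

```python
import re

# Same pattern as in the script: "HB-", any two characters, "_", six digits.
STOCK_CODE = re.compile(r"HB-.._[0-9]{6}")


def find_stock_code(page_text):
    """Return the first stock code found in page_text, or None."""
    match = STOCK_CODE.search(page_text)
    return match.group(0) if match else None


if __name__ == "__main__":
    print(find_stock_code("Item HB-AB_123456 rev 2"))
    print(find_stock_code("no code on this page"))
```

If your codes follow a different scheme, only the pattern needs to change; the same change would then be made in the regex inside the script.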

How to read line number text from PDF using plugin?

Hi, I would like to know how to read line number text from PDF using plugin?
Thanks in advance.

Ok, some background reading of the PDF Reference will help you understand why this is so difficult. PDF files are not organised into lines. It is best to think of each word or character on the page as being a graphic with its own position. The human eye sees lines where a series of graphics (words) are roughly in the same horizontal region.
In the general case it is difficult or even impossible to answer this. You may have columns with different spacing (but the PDF stores no information on what is a column). You may have subscripts and superscripts. You may have text in graphics coinciding with other text. Commonly, there may be titles, headings or page numbers which are just ordinary text and might count as lines.
That said, what you need to do is extract the text on the page and its positions. The WordFinder APIs are the way to do that. Now, sort all the words out, using the Y coordinates and size to try and guess what makes a "line". Now you are in a position to find the text (divided into words, not strings) and report the "line number" you have estimated.
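
The grouping step described above might be sketched as follows. This is a rough Python illustration, not plugin code: it assumes the words and their positions have already been extracted (for example with the WordFinder APIs), and the Word class, the tolerance value and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float       # left edge
    y: float       # baseline, measured upwards from the bottom of the page
    height: float  # approximate glyph height


def group_into_lines(words, tolerance=0.5):
    """Group words into estimated lines, top of page first.

    A word joins the current line when its baseline differs from the
    baseline of that line's first word by less than tolerance * height.
    """
    lines = []
    # Assumes y grows upwards, so the top of the page has the largest y.
    for word in sorted(words, key=lambda w: -w.y):
        if lines and abs(lines[-1][0].y - word.y) < tolerance * lines[-1][0].height:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order words left to right.
    return [sorted(line, key=lambda w: w.x) for line in lines]


def line_number_of(words, target):
    """Return the 1-based estimated line number of the first line containing target."""
    for number, line in enumerate(group_into_lines(words), start=1):
        if any(w.text == target for w in line):
            return number
    return None
```

This is only a guess at what a reader would call a line: columns, superscripts and text inside graphics will still confuse it, for the reasons given above.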

Extract text from pdf

Hi, is it possible to extract text from a PDF file using the command line, to get output like you would get by using the File menu and then "Save as text..."?
I also noticed that in the installation folder there is a small executable called AcroTextExtractor which sounds interesting, but I was unable to figure out how to use it.

What's wrong with using Automator for this? This certainly seems the easiest. I'm not aware of any built-in AppleScript commands that will do this, but you should also ask on the AppleScript forum under Mac OS Technologies.

Can't seem to save non-English as text from PDF using Reader

I have several PDF documents that were originally generated by OpenOffice from a UTF8-encoded text file. The text is in different languages, e.g. Korean, Arabic, Russian, English. When I open these documents and then "save as text", the resulting text files contain garbage or nothing at all in all cases except for English. Is it possible to extract non-English text from a PDF document using Reader? If not, is there a different product that could be used for this purpose? Thanks much!

They're using fonts that you don't have on your system so no, it isn't possible with Reader.

Extract text from PDF without opening PDF in window C#

Hello,
I'm creating an application for searching text in PDFs. I found some code which uses the Acrobat SDK (installed on my system), but all the snippets I find seem to open a PDF window and then extract the text. Is it possible to extract the text without opening this window? I think this would reduce the search time, since I need to search a lot of files, and I just need a list with the file name and page number where the search string is found.
This is how the document is opened:
AcroAVDoc avDoc = (AcroAVDoc)gAppClass.GetInterface("Acrobat.AcroAVDoc");
avDoc.Open(System.IO.Path.GetFullPath(filespec), filespec);
Then I use the JavaScript objects to access "getPageNumWords" and "getPageNthWord" in a loop, putting the words in a string.
I didn't want to put the entire code here because it's easily found all over the web.
Thanks in advance for your help.

Hello,
I own a copy of Acrobat Pro 9, and it is for my own use. I am not a professional developer and this application will not be distributed.

Extracting text from pdf file

Hi All
I want to extract only text from a pdf file.
I am trying to extract text from a PDF file using PDFBox, but I am getting an error. My code is like this:
/*
 * Main.java
 *
 * Created on den 10 september 2007, 23:01
 */
package extracttext;

import org.pdfbox.exceptions.InvalidPasswordException;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class Main {

    /**
     * @param args the command line arguments
     */
    public static void main( String[] args ) throws Exception
    {
        int startPage = 1;
        int endPage = Integer.MAX_VALUE;
        PDDocument document = null;
        try
        {
            document = PDDocument.load( "C:\\thesis\\fileread\\sim.pdf" );
            if( document.isEncrypted() )
            {
                try
                {
                    // Try the empty user password first.
                    document.decrypt( "" );
                }
                catch( InvalidPasswordException e )
                {
                    System.err.println( "Error: Document is encrypted with a password." );
                    System.exit( 1 );
                }
            }
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setSortByPosition( true );
            stripper.setStartPage( startPage );
            stripper.setEndPage( endPage );
            System.out.println("Text: " + stripper.getText(document));
        }
        finally
        {
            // Always release the document, even if extraction failed.
            if( document != null )
            {
                document.close();
            }
        }
    }
}
Can anybody please help me solve this problem?
Regards,
UK

i get the following error message:
Exception in thread "main" java.lang.NoClassDefFoundError: org/fontbox/afm/FontMetric
at org.pdfbox.pdmodel.font.PDFont.getAFM(PDFont.java:334)
at org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(PDSimpleFont.java:104)
at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:336)
at org.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:80)
at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
at extracttext.Main.main(Main.java:55)
Java Result: 1
BUILD SUCCESSFUL (total time: 1 second)
I would appreciate it if you could help me write a Java program that can extract only the text from a PDF file.

Extracting text from PDF in columns?

I'm trying to extract the text from a number of PDFs in which the text is in two columns. When I copy the text and paste it into Word, it ends up being in a single column, the width of one of the original columns, but with a hard return at the end of each line.
I don't really care if the extracted text ends up being in one column or two, I just want it not to have a return at the end of each line. Can anyone suggest an alternative method?
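
One possible approach is to post-process the pasted text and join the hard-wrapped lines back together. The following Python sketch is only an illustration under an assumption: paragraphs in the pasted text are separated by at least one blank line. If the copied text has no blank lines between paragraphs, everything ends up in a single paragraph. The function name unwrap_lines is made up for this example.

```python
import re


def unwrap_lines(text):
    """Join hard-wrapped lines so each paragraph becomes a single line.

    Assumes paragraphs are separated by one or more blank lines.
    Words hyphenated at a line end are NOT rejoined.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    result = []
    for paragraph in paragraphs:
        # Drop the line breaks inside the paragraph and normalise spacing.
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        result.append(" ".join(lines))
    return "\n\n".join(result)


if __name__ == "__main__":
    sample = "The text is in\ntwo columns on\nevery page.\n\nA second\nparagraph.\n"
    print(unwrap_lines(sample))
```

The same effect can usually be had in Word itself with find and replace on paragraph marks, but a script is easier to repeat over a number of PDFs.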

How to Extract Paragraph from Pdf using Adobe Pdf Library in C# or Java

Using this library, I extracted the content of a PDF file.
I got the content line by line (by using Last wordOnline):
<Line> Content </Line>
But I want to extract the content paragraph by paragraph, like:
<Paragraph> Content </Paragraph>

Thanks for the reply.
Here I have used the "Y" coordinate of each line to find paragraphs, but I cannot get the expected output.
Can you please explain the logic for finding paragraphs using coordinates?
Here is my code.
                double PreY2 = 0;
                double Result2 = 0;
                foreach (DataRow oRow in dtLineMaster.Rows)
                {
                    double Result1 = Math.Round(PreY2 - Convert.ToDouble(oRow["Y2"].ToString()));
                    if (Result1 > Result2)
                    {
                        MessageBox.Show("" + oRow["LineText"].ToString());
                        Result2 = Result1;
                    }
                    PreY2 = Convert.ToDouble(oRow["Y2"].ToString());
                }
I have already extracted the PDF file into a database with line ID and X and Y coordinates. I am running the above code on that data.
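
One common heuristic is to compare each vertical gap with the typical line spacing instead of with the largest gap seen so far, which may be why the loop above misses later paragraph breaks. The following is a minimal Python sketch of that idea, not Adobe PDF Library code. It assumes the lines are already in reading order, that Y decreases down the page, and that a gap of more than 1.5 times the median spacing marks a new paragraph; the function name and the factor are hypothetical. Page and column breaks are not handled.

```python
from statistics import median


def split_paragraphs(lines, factor=1.5):
    """Group (y, text) lines into paragraphs.

    lines must be in reading order with y decreasing down the page.
    A new paragraph starts where the gap to the previous line is larger
    than factor * the median gap between consecutive lines.
    """
    if not lines:
        return []
    gaps = [prev[0] - cur[0] for prev, cur in zip(lines, lines[1:])]
    typical = median(gaps) if gaps else 0.0
    paragraphs = [[lines[0][1]]]
    for gap, (_, text) in zip(gaps, lines[1:]):
        if gap > factor * typical:
            paragraphs.append([text])      # unusually large gap: new paragraph
        else:
            paragraphs[-1].append(text)    # normal spacing: same paragraph
    return [" ".join(p) for p in paragraphs]


if __name__ == "__main__":
    sample = [
        (700, "First line"),
        (688, "of paragraph one."),
        (664, "Second paragraph"),
        (652, "continues here"),
        (640, "and ends."),
    ]
    for paragraph in split_paragraphs(sample):
        print("<Paragraph> " + paragraph + " </Paragraph>")
```

Documents that mark paragraphs by first-line indentation rather than extra spacing would need the X coordinate as well.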

Extracting text from PDF files produced by Oracle reports

Hi,
I am currently using Report Builder 9.0.4.0.21 to produce reports in PDF format.
The pdf reports were displayed to screen and printed to printer correctly.
However, doing a copy-and-paste from the pdf report to a text editor produces
garbage characters. Also, I failed to extract the text using any of available adobe
plug-ins. I know that the PDF report is using font subsetting with custom
encoding. I have already read the PDF reference manual and it seems that
the PDF report is missing the mapping tables to convert the custom encoding
used in the report back to ansi or unicode.
Is there a solution to this problem?
Are there any environment variables or settings that I am missing?
Your help is really appreciated.

Hello,
Your problem may be related to a limitation in the PDF generated by Reports 9.0.2 / 9.0.4 when using subsetting:
Font Subsetting Creates PDF Output not Searchable with Acrobat Reader (Doc ID 311345.1)
This limitation no longer exists in Reports 10.1.2 / 11.1.
Regards

Reg Extracting data from PDF using file adapter

Hi Experts,
In my business process I will get different files in PDF form. I have to extract the fields from each file and send them to the ECC system. Can anyone suggest how to do it without using CA?
Regards
Suresh

You might have to use a custom solution.
You will find tips in the thread "Trouble writing out a PDF in XI/PI?"

Extract text from PDF and parse to Excel, automator? applescript?

I've been trying to do this with Automator and have been getting nowhere.
I have some PDF files that have selectable, formatted text in them.
It's an old college directory that I would like to import. There are chunks of text:
Name: XXX, XXX
Telephone: (XXX) XXX-XXXX        DOB:    XX/XX/XX          Gender: XXX
Hometown:
dflasdkfjdlkj
asdflajksdflj
adsflakjdsf
Name: XXX, XXX
Telephone: (XXX) XXX-XXXX        DOB:    XX/XX/XX          Gender: XXX
Hometown:
dflasdkfjdlkj
asdflajksdflj
adsflakjdsf
Name: XXX, XXX
Telephone: (XXX) XXX-XXXX        DOB:    XX/XX/XX          Gender: XXX
Hometown:
dflasdkfjdlkj
asdflajksdflj
adsflakjdsf
etc.
I'd like to find a way to extract the names, the phone numbers, and the hometowns and get it all into those respective columns in excel. Seems that I don't stand a chance with automator. Do I need a programmer for this?
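
Once the text is out of the PDF, a short script can handle this. The following Python sketch is only an illustration under assumptions: the text has already been extracted, each record starts with a "Name:" line, the phone number sits between "Telephone:" and "DOB:", and the hometown is on the same line as its label. The function names and the sample data in the test are made up.

```python
import csv
import io
import re

RECORD_START = re.compile(r"^Name:\s*(.*)$")
TELEPHONE = re.compile(r"Telephone:\s*(.*?)\s+DOB:")
HOMETOWN = re.compile(r"^Hometown:\s*(.*)$")


def parse_directory(text):
    """Return a list of dicts with name, telephone and hometown."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        m = RECORD_START.match(line)
        if m:
            # A "Name:" line opens a new record.
            current = {"name": m.group(1), "telephone": "", "hometown": ""}
            records.append(current)
            continue
        if current is None:
            continue  # text before the first record
        m = TELEPHONE.search(line)
        if m:
            current["telephone"] = m.group(1)
            continue
        m = HOMETOWN.match(line)
        if m:
            current["hometown"] = m.group(1)
    return records


def to_csv(records):
    """Serialise the records as CSV text that Excel can open."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["name", "telephone", "hometown"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()
```

Writing the result of to_csv to a .csv file gives one row per person with the three columns; any other lines in each chunk are ignored.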

Can I extract images from PDFs using Batch Processing as I have many separate PDFs all with images t

I have about 500 separate PDF pages that all need their images extracted. Surely there must be a way to run a batch command on them?
Please help! It will take me forever!

Advanced>Batch Processing...
Click the "New Sequence" button
Name the sequence (e.g. Extract Images)
Click the "Select Commands..." button
Select one of the following items:
- Export All Images As JPEG,
- Export All Images As JPEG2000,
- Export All Images As PNG,
- Export All Images As TIFF
Click the "Add" button
Click the "OK" button
Select your preference in the "Run commands on:" pop-up menu
Select your preference in the "Select output location:" pop-up menu
Click the "Output Options..." button
In the "Output Options" dialog box, make your preference selections.
Click OK
Click OK
Click the "Run Sequence" button.
Sabian

Extracting images from pdf

I am trying to extract images from PDFs using pdfimages, but I am unable to retrieve all the images. When I open the PDFs in Acrobat Reader 9.0, I can select the images that pdfimages retrieved using the select tool, but for other figures/images we need to try other options, like print screen and then cropping the relevant image. I was wondering why or when Acrobat treats the figures/images differently.

Hi Dave,
Thanks for the reply. My question was not about any non-Adobe product like pdfimages. It was about the general way Acrobat handles images when creating PDFs.
I wanted to know why we can select some of the images in the PDF using the select tool but cannot select others, for which we need to print screen and cut. Is there anything in the EPS files of the included images that causes this effect?
Thanks.
