Extract text from PDF and parse to Excel: Automator? AppleScript?
I've been trying to do this with Automator and have been getting nowhere.
I have some PDF files that have selectable, formatted text in them.
It's an old college directory that I would like to import. There are chunks of text:
Name: XXX, XXX
Telephone: (XXX) XXX-XXXX DOB: XX/XX/XX Gender: XXX
Hometown:
dflasdkfjdlkj
asdflajksdflj
adsflakjdsf
Name: XXX, XXX
Telephone: (XXX) XXX-XXXX DOB: XX/XX/XX Gender: XXX
Hometown:
dflasdkfjdlkj
asdflajksdflj
adsflakjdsf
Name: XXX, XXX
Telephone: (XXX) XXX-XXXX DOB: XX/XX/XX Gender: XXX
Hometown:
dflasdkfjdlkj
asdflajksdflj
adsflakjdsf
etc.
I'd like to find a way to extract the names, the phone numbers, and the hometowns and get them into their respective columns in Excel. It seems that I don't stand a chance with Automator. Do I need a programmer for this?
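A rough sketch of the parsing step, in case it helps. It assumes the text has already been pulled out of the PDF into a plain string (for instance by copy and paste, or with Automator's "Extract PDF Text" action), and that every record follows the sample layout: a "Name:" line, a "Telephone:" line, then a "Hometown:" label with its value on the lines below. Everything between "Hometown:" and the next "Name:" is treated as hometown text, which is a guess about the real data. The function names `parse_directory` and `to_csv` are made up for this example. The output is CSV, which Excel opens directly.

```python
import csv
import io
import re

# Assumed layout, taken from the sample in the question.
RECORD_START = re.compile(r"^Name:\s*(.+)$")
PHONE = re.compile(r"^Telephone:\s*(.*?)\s+DOB:")


def parse_directory(text):
    """Return a list of dicts with name, telephone and hometown keys."""
    records = []
    current = None
    in_hometown = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        name = RECORD_START.match(line)
        if name:
            # A "Name:" line starts a new record.
            current = {"name": name.group(1), "telephone": "", "hometown": ""}
            records.append(current)
            in_hometown = False
        elif current is None:
            continue  # skip anything before the first record
        elif line.startswith("Telephone:"):
            phone = PHONE.match(line)
            current["telephone"] = phone.group(1) if phone else ""
            in_hometown = False
        elif line.startswith("Hometown:"):
            in_hometown = True
            rest = line[len("Hometown:"):].strip()
            if rest:
                current["hometown"] = rest
        elif in_hometown:
            # Join multi-line hometown text with single spaces.
            current["hometown"] = (current["hometown"] + " " + line).strip()
    return records


def to_csv(records):
    """Serialise the records as CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["name", "telephone", "hometown"])
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()
```

To save the result, write the string returned by `to_csv` to a file opened with `newline=""` and open that file in Excel. Names such as "XXX, XXX" contain a comma, so the csv module quotes them automatically.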
Similar Messages
-
Applescript or workflow to extract text from PDF and rename PDF with the results
Hi Everyone,
I am supplied with hundreds of PDFs which each contain a stock code, but the PDFs themselves are not named consistently, or they are supplied as multi-page PDFs.
What I need to do is name each PDF with the code which is in the text on the PDF.
It would work like this in an ideal world:
1. Split PDF into single pages
2. Extract text from PDF
3. Rename PDF using the extracted text
I'm struggling with part 3!
I can get a text file containing just the code (I'm extracting the code with a call to BBEdit).
I did think about using a variable for the name, but the rename function doesn't let me use variables.
Hello
You may also try the following AppleScript, which is a wrapper around a RubyCocoa script. It will ask you to choose the source PDF files and a destination directory. It then scans the text of each page of the PDF files for the predefined pattern and saves the page as a new PDF file in the destination directory, named after the string the pattern matched. Pages that contain no string matching the pattern are ignored. (Ignored pages, if any, are reported in the result of the script.)
Currently the regex pattern is set to:
/HB-.._[0-9]{6}/
which means HB- followed by two characters and _ and 6 digits.
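To try the pattern on its own before running it against real files, here is a small sketch in Python (chosen only for illustration; the script in this post uses Ruby). It applies the same pattern to a page's text and builds the file name the page would be saved under. The helper name `target_name` and the sample codes in the usage note are invented.

```python
import re

# Same pattern as in the post: "HB-", any two characters, "_", six digits.
CODE = re.compile(r"HB-.._[0-9]{6}")


def target_name(page_text):
    """Return '<code>.pdf' for the first match in page_text, or None."""
    match = CODE.search(page_text)
    return match.group(0) + ".pdf" if match else None
```

For example, `target_name("Stock code: HB-AB_123456 rev 2")` gives `"HB-AB_123456.pdf"`, while a page with no matching string gives `None`, which corresponds to the pages the script ignores.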
Minimally tested under 10.6.8.
Hope this may help,
H
_main()
on _main()
    script o
        property aa : choose file with prompt ("Choose pdf files.") of type {"com.adobe.pdf"} ¬
            default location (path to desktop) with multiple selections allowed
        set my aa's beginning to choose folder with prompt ("Choose destination folder.") ¬
            default location (path to desktop)
        set args to ""
        repeat with a in my aa
            set args to args & a's POSIX path's quoted form & space
        end repeat
        considering numeric strings
            if (system info)'s system version < "10.9" then
                set ruby to "/usr/bin/ruby"
            else
                set ruby to "/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby"
            end if
        end considering
        do shell script ruby & " <<'EOF' - " & args & "
require 'osx/cocoa'
include OSX
require_framework 'PDFKit'
outdir = ARGV.shift.chomp('/')
ARGV.select {|f| f =~ /\\.pdf$/i }.each do |f|
  url = NSURL.fileURLWithPath(f)
  doc = PDFDocument.alloc.initWithURL(url)
  path = doc.documentURL.path
  pcnt = doc.pageCount
  (0 .. (pcnt - 1)).each do |i|
    page = doc.pageAtIndex(i)
    page.string.to_s =~ /HB-.._[0-9]{6}/
    name = $&
    unless name
      puts \"no matching string in page #{i + 1} of #{path}\"
      next # ignore this page
    end
    doc1 = PDFDocument.alloc.initWithData(page.dataRepresentation) # doc for this page
    unless doc1.writeToFile(\"#{outdir}/#{name}.pdf\")
      puts \"failed to save page #{i + 1} of #{path}\"
    end
  end
end
EOF"
    end script
    tell o to run
end _main
-
Extract Text from pdf using C#
Hi,
We are solution developers using Acrobat. As we have a requirement to extract text from PDF using C#, we have downloaded and installed the Adobe SDK. We have found only four examples in C#, and those are used only for viewing PDFs in a Windows application. Can you please guide us on how to extract text from PDF using the SDK in C#?
Thank you for your help.
Regards
kiranmai
Okay, so I went ahead and added the text extraction functionality to my own C# application. The client had requested this feature; we were originally told to skip it if it wasn't "cut and dried", but it wasn't bad, so I gave the client the text extraction they wanted. I decided to post the source code here for you. It returns the text from the entire document as a string.
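// Needs: using System; using System.Collections.Generic; using System.Reflection;
// plus a reference to the Acrobat interop assembly.
private static string GetText(AcroPDDoc pdDoc)
{
    AcroPDPage page;
    int pages = pdDoc.GetNumPages();
    string pageText = "";
    for (int i = 0; i < pages; i++)
    {
        page = (AcroPDPage)pdDoc.AcquirePage(i);
        object jso, jsNumWords, jsWord;
        List<string> words = new List<string>();
        try
        {
            // The JavaScript bridge exposes getPageNumWords / getPageNthWord.
            jso = pdDoc.GetJSObject();
            if (jso != null)
            {
                object[] args = new object[] { i };
                jsNumWords = jso.GetType().InvokeMember("getPageNumWords", BindingFlags.InvokeMethod, null, jso, args, null);
                int numWords = Int32.Parse(jsNumWords.ToString());
                // Word indexes are zero-based, so stop before numWords.
                for (int j = 0; j < numWords; j++)
                {
                    // bStrip = false keeps the whitespace and punctuation after each word.
                    object[] argsj = new object[] { i, j, false };
                    jsWord = jso.GetType().InvokeMember("getPageNthWord", BindingFlags.InvokeMethod, null, jso, argsj, null);
                    words.Add((string)jsWord);
                }
            }
        }
        catch
        {
            // Keep whatever words were read from this page before the failure.
        }
        foreach (string word in words)
        {
            pageText += word;
        }
    }
    return pageText;
}
-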
How to read/extract text from pdf
Respected All,
I want to read/extract text from PDF. I tried using Etymon but did not succeed.
Could anyone guide me in this?
Thanks and regards,
Ajay.
Thank you very much Abhilshit, PDFBox works for reading PDF.
Regards,
Ajay. -
Hi, is it possible to extract text from a PDF file using the command line to get an output like you would get by using the File menu and then "Save as text..."?
I also noticed that in the installation folder there is a small executable called AcroTextExtractor which sounds interesting, but I was unable to figure out how to use it.
What's wrong with using Automator for this? This certainly seems the easiest. I'm not aware of any built-in AppleScript commands that will do this. But you should also ask on the AppleScript forum under Mac OS Technologies.
Message was edited by: V.K. -
Extract text from PDF without opening PDF in window C#
Hello,
I'm creating an application for searching text in PDFs. I found some code which uses the SDK from Acrobat (installed on my system). But all the snippets I find seem to open a PDF window and then extract the text. Is it possible to extract the text without opening this window? I think this would reduce the search time, since I need to search a lot of files. And I just need a list with the file name and page number where the search string is found.
AcroAVDoc avDoc = (AcroAVDoc)gAppClass.GetInterface("Acrobat.AcroAVDoc");
avDoc.Open(System.IO.Path.GetFullPath(filespec), filespec);
Then I use the JavaScript objects to access "getPageNumWords" and "getPageNthWord" in a loop, putting the words in a string.
I didn't want to put the entire code here because it's easily found all over the web.
Thanks in advance for your help.
Hello,
I own a copy of Acrobat Pro 9, and it is for my own use. I am not a professional developer and this application will not be distributed. -
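For the last part of the question above (a list of file names and page numbers where the search string occurs), here is a small sketch in Python. It does not address the window problem. It assumes the page text has already been collected by some other means, such as the getPageNthWord loop, into a dictionary mapping each file name to a list of page strings. The name `find_hits` and that data layout are made up for this example.

```python
def find_hits(pages_by_file, needle):
    """Return (file name, 1-based page number) pairs whose text contains needle."""
    needle = needle.lower()
    hits = []
    for name, pages in pages_by_file.items():
        for number, text in enumerate(pages, start=1):
            # Case-insensitive substring match on the page text.
            if needle in text.lower():
                hits.append((name, number))
    return hits
```

Keeping the search separate from the extraction means the text of each file only has to be extracted once, however many strings are searched for afterwards.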
Extracting text from PDF in columns?
I'm trying to extract the text from a number of PDFs in which the text is in two columns. When I copy the text and paste it into Word, it ends up in a single column, the width of one of the original columns, but with a hard return at the end of each line.
I don't really care if the extracted text ends up being in one column or two, I just want it not to have a return at the end of each line. Can anyone suggest an alternative method? -
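For the hard returns described in the two-column question above, one possible approach is to clean the pasted text afterwards. The sketch below (Python, picked only for illustration) joins the lines of each paragraph with spaces. It assumes paragraphs are separated by at least one blank line in the pasted text; if they are not, everything ends up in one paragraph. The function name `unwrap` is made up.

```python
import re


def unwrap(text):
    """Join hard-wrapped lines; keep blank lines as paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # split() with no argument collapses every run of whitespace,
    # including the line breaks inside a paragraph.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)
```

Words that were hyphenated across a line break are left as they are (for example "extrac- tion"), since rejoining them automatically risks damaging genuine hyphens.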
Hi All
I want to extract only the text from a PDF file.
I am trying to extract text from a PDF file using PDFBox, but I am getting an error. My code is like this:
/*
 * Main.java
 *
 * Created on den 10 september 2007, 23:01
 *
 * To change this template, choose Tools | Template Manager
 * and open the template in the editor.
 */
package extracttext;

import org.pdfbox.exceptions.InvalidPasswordException;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;
//import java.awt.Rectangle;
//import java.util.List;
import org.pdfbox.pdmodel.PDPage;

public class Main {
    /** Creates a new instance of Main */
    public Main() {
    }

    /**
     * @param args the command line arguments
     */
    public static void main( String[] args ) throws Exception
    {
        int startPage = 1;
        int endPage = Integer.MAX_VALUE;
        PDDocument document = null;
        try
        {
            document = PDDocument.load( "C:\\thesis\\fileread\\sim.pdf" );
            if( document.isEncrypted() )
            {
                try
                {
                    document.decrypt( "" );
                }
                catch( InvalidPasswordException e )
                {
                    System.err.println( "Error: Document is encrypted with a password." );
                    System.exit( 1 );
                }
            }
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setSortByPosition( true );
            stripper.setStartPage( startPage );
            stripper.setEndPage( endPage );
            System.out.println("Text: " + stripper.getText(document));
        }
        finally
        {
            if( document != null )
            {
                document.close();
            }
        }
    }
}
Can anybody please help me solve this problem?
Regards,
UK
I get the following error message:
Exception in thread "main" java.lang.NoClassDefFoundError: org/fontbox/afm/FontMetric
at org.pdfbox.pdmodel.font.PDFont.getAFM(PDFont.java:334)
at org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(PDSimpleFont.java:104)
at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:336)
at org.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:80)
at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
at extracttext.Main.main(Main.java:55)
Java Result: 1
BUILD SUCCESSFUL (total time: 1 second)
I would appreciate it if you could help me write a Java program that can extract only the text from a PDF file. -
How can I copy text from PDF and include the source filename in the pasted selection?
I'm a biologist and frequently cut-and-paste notes from PDFs of scientific articles. I name all of the PDF articles with their PubMed ID, a short unique identifier (e.g. 19397482.pdf). When I take notes, I will select a few sentences from the PDF and then paste them into a text editor for later reference.
Can anyone suggest a method or script that would allow me to paste the copied text with the PubMed filename included in a single action? I would want the pasted output to look something like this, with the filename appended to the end:
Of the transcripts that were significantly different, there was a greater number of transcripts that were down-regulated in the IVC embryos (380) than the number of transcripts that were up-regulated (208). [20668257.pdf]
This would really help me to properly cite information sources during the writing process. I know there are bibliography managers that might be able to do something like this, but I prefer to read the PDF articles directly in Preview and select the text as I am reading.
Thanks very much for any suggestions / ideas.
jjw
To copy and paste in a single action:
tell application "Preview" to activate
tell application "System Events" to tell process "Preview"
    -- Get the PubMed ID:
    get the title of the front window
    set thePubMedID to word 1 of result
    -- Copy the selected text to the clipboard:
    keystroke "c" using {command down} -- ⌘C
    delay 0.25 -- adjust if necessary
    -- Add the PubMed ID to the contents of the clipboard:
    set theNotes to the clipboard
    set the clipboard to (theNotes & space & "[" & thePubMedID & ".pdf]")
end tell
tell application "Notational Velocity" to activate
tell application "System Events"
    -- Paste the contents of the clipboard to the end of the Notational Velocity document
    key code 125 using command down -- ⌘↓
    keystroke return & return
    keystroke "v" using {command down} -- ⌘V
end tell
-
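The string-building step of the script above (selected text, a space, then the file name in square brackets) can also be written as a plain function, which is handy if the notes are processed outside of Preview. This is only a sketch in Python; the name `tag_note` and the path in the usage note are invented.

```python
from pathlib import Path


def tag_note(selection, pdf_path):
    """Append '[<file name>]' to the selected text, as in the example above."""
    return f"{selection.strip()} [{Path(pdf_path).name}]"
```

For instance, `tag_note("Some sentence.", "/Users/jjw/papers/20668257.pdf")` returns `"Some sentence. [20668257.pdf]"`.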
How to read text from PDF and HTML
I have a solution for reading text from a .txt file but didn't find code for PDF and HTML.
I don't want to convert PDF to txt.
Please help me ...
Reading from a file is always the same. Using the same strategy you used for a .txt file will allow you to read a .pdf file.
Of course, in itself that will be useless because PDF files have a special internal structure.
HTML files are identical to txt files.
What are you trying to accomplish with the files you are reading? -
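As the reply above says, an HTML file can be read like any text file, but getting the readable text out of it still means skipping the markup. Here is a minimal sketch using Python's standard html.parser module; Python is an assumption here, since the thread does not say which language is in use, and the names `TextExtractor` and `html_to_text` are made up. PDF is a different matter and needs a PDF library (PDFBox is the one mentioned elsewhere on this page).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, skipping script and style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, separated by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

The HTML itself can be read with an ordinary `open(path, encoding="utf-8").read()` and passed to `html_to_text`, provided the file really is UTF-8.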
Copy text from PDF and paste to imessage
Before iOS 7 I was able to copy text from a PDF file and paste it normally into a text message or iMessage. Now when I do the same thing, it doesn't paste the actual text I copied; it pastes an HTML attachment instead. Have I changed a setting, or is this just a bug that needs to be fixed? Please help.
I get the PDF in an email and open it to view it. I'm not sure what the PDF opens in automatically, but that is the way I've been doing it since I got the iPhone 4, and I hadn't had issues until iOS 7.
I just tried opening a file in the default view and then switched it to open in Adobe Reader, and I can copy and paste normally. So now I have to do an extra step all the time? -
Discoverer Plus Automation- Extract data from Pac2K and fill in Excel 5 min
Hi,
I am new to this tool. I have zero knowledge on this tool.
I have a requirement to fetch the data from the Pac2K tool every 10 minutes and export it to an Excel sheet. This should be done using Discoverer Plus. I am using the Discoverer Plus web application.
As it is very difficult to fetch the data manually every 10 minutes, I have been asked to automate it.
So, if you would be so kind, could you please let me know whether it is possible to automate the above process using the Discoverer Plus web application?
If possible, please let me know in detail.
Hi,
Using Plus, it is not possible to automate the export to Excel.
This can be done with the Desktop edition, using a command line.
You can create a batch file and schedule it using any scheduler.
Tamir -
Extracting text from PDF files produced by Oracle reports
Hi,
I am currently using Report Builder 9.0.4.0.21 to produce reports in PDF format.
The pdf reports were displayed to screen and printed to printer correctly.
However, copying and pasting from the PDF report to a text editor produces
garbage characters. Also, I failed to extract the text using any of the available Adobe
plug-ins. I know that the PDF report is using font subsetting with custom
encoding. I have already read the PDF reference manual, and it seems that
the PDF report is missing the mapping tables to convert the custom encoding
used in the report back to ANSI or Unicode.
Is there a solution to this problem?
Are there any environment variables or settings that I am missing?
Your help is really appreciated.
Hello,
Your problem may be related to a limitation in the PDF generated with Reports 9.0.2 / 9.0.4 when using subsetting:
Font Subsetting Creates PDF Output not Searchable with Acrobat Reader (Doc ID 311345.1)
This limitation no longer exists in Reports 10.1.2 / 11.1.
Regards -
Copying text from PDF and no spaces
I get a PDF of Investors Business Daily from their site. It is copy protected, but I use PDFKey to unlock it so I can copy text. When I copy text and paste it elsewhere, often there are no spaces between words. At first I thought it had to do with the justification they used. However, if I use Preview the spaces are there! So why the difference? I use Acrobat Pro rather than Preview for other reasons. Is there some setting in preferences?
Second question. On this site, at the upper right, there is a box and a magnifying glass. If I use that to search the forums I get results for ALL forums. Is there any way to limit the search to just the forum you're in?
Steve,
I think you misunderstand. Since IBD posts most of the articles on their webpage, and that is not copy protected, I could just do that. Also, since I am not republishing the articles but merely reorganizing them for my own use, this does NOT break copyright law (check the laws about copying any copyrighted material for your own personal use).
You must own a copy of the material being reproduced. Check, I am a subscriber.
Purpose of copying - for your own private use. Check.
Copies cannot be lent or shared with anyone. Check.
The work being copied must be a legal (i.e. non-pirate) copy. Check.
Beyond this there are also fair use and research and study laws that, if they applied to my use, would still permit copying the material.
That aside, I am NOT asking for help in using PDF cracking software. In fact, Preview copies the material from the original file just fine.
All that PDFKey does is flip one bit in the file; it does not change spaces or the spacing between words anywhere in the file.
I am just asking how it can be that copying the same material from one source file, using two different PDF reading apps, gives different results. -
Problems when i copy text from Pdf and paste on Word
In the PDF document the text is in perfect condition, but when I copy the text and paste it into a Word document the characters change into random garbage like: "()*"*&!(!*"(!"(!)"( )*"()!*("!&("@*")(!*@"!*@(
How do I fix this?
I have the same problem when copying the PDF into a Word file. I tried Save as RTF doc, and it is still just symbols.
It could be a font problem, because it has some weird Gill Sans and Futura fonts. I am sending a picture as an example.
The option of exporting as TIFF and then applying OCR is interesting, but it is still kind of slow when I have a 100-page document. If the fonts are the problem, is there any way I could select the whole text and apply a known font like Arial to it?
Thanks for any info!
Cheers,
Sebastian