Extract URL from HTML text

Suppose you have the following String, which is body text containing HTML.
String bodyText = " My name is Blake. I live in New York City. See my image here: <img href=\"http://www.blake.com/blake.jpg\"/> isn't my picture awesome? Tata for now!";
I want to extract the URL that contains the location of the image in this bodyText. Ideally I would create a method, public String extractor(String bodyText), to be used like this:
String imageURL = extractor(bodyText);
//imageURL should be "http://www.blake.com/blake.jpg"
My first thought is to use a regular expression, but the only place I know of to use one is the replace methods in the String class. I am by no means an expert on regular expressions, so I haven't spent much time trying to figure it out that way. I could obviously do a linear search of the bodyText with a ton of if statements, but that's just poor coding. I want to see if anyone has come across this or has insight into this problem.
Thanks for all the help,
Blake

How would the regexp change if there were multiple img tags within the String?
I don't rightly know, I'm just a raw beginner in regexes.
Would this regexp return all the img URLs found in the String?
No, as it stands it would return only the last URL. But this will:
String bodyText = " My name is Blake. " +
      "I live in New York City. See my image here: " +
      "<img href=\"http://www.blake.com/blake.jpg\"/>" +
      " isn't my picture awesome? Here's another: " +
      "<img href='http://www.blake.com/Vandelay.jpg'/>" +
      " Tata for now!";
String regex = "(?<=<img\\shref=[\"'])http://.*?(?=[\"']/?>)";
Pattern pattern = Pattern.compile (regex);
Matcher matcher = pattern.matcher (bodyText);
while (matcher.find ()) {
   System.out.println (matcher.group ());
}
Note the enhancement that takes into account that both single and double quotes are legal in HTML. But unlike the earlier example, this does not tolerate more than one space between <img and href=; I couldn't find a way to achieve that.
Visit this thread later, there are some real regex experts around who may post a better solution.
db
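
One possible way around the single-space limitation mentioned above is to drop the lookbehind and use a capturing group instead, so that \s+ can match any amount of whitespace. The following is only a sketch: the class name ImgUrlExtractor and the method extractAll are made up for illustration, and it keeps the thread's href attribute even though real img tags normally use src. A regex like this is fine for simple, known markup but is not a general HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original thread.
class ImgUrlExtractor {
    // "<img", one or more whitespace characters, "href", optional whitespace
    // around '=', an opening quote (group 1), the URL (group 2), and then the
    // same kind of quote again (backreference \1).
    private static final Pattern IMG_HREF = Pattern.compile(
            "<img\\s+href\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    // Returns every URL found, in document order.
    static List<String> extractAll(String bodyText) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = IMG_HREF.matcher(bodyText);
        while (matcher.find()) {
            urls.add(matcher.group(2));
        }
        return urls;
    }

    // Same signature as the method asked for in the question; returns the
    // first URL, or null when the text contains no matching img tag.
    static String extractor(String bodyText) {
        List<String> urls = extractAll(bodyText);
        return urls.isEmpty() ? null : urls.get(0);
    }

    public static void main(String[] args) {
        String bodyText = "See my image here: <img href=\"http://www.blake.com/blake.jpg\"/>"
                + " Here's another: <img   href='http://www.blake.com/Vandelay.jpg'/>";
        System.out.println(extractAll(bodyText));
    }
}
```

Because the URL is taken from group 2 rather than from the whole match, the whitespace and quote handling can be as loose as needed without affecting what is returned.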

Similar Messages

Hi i am new to labview. i want to extract data from a text file and display it on the front panel. how do i proceed??

Hi, I am new to LabVIEW.
I want to extract data from a text file and display it on the front panel.
How do I proceed?
I have attached a file to give you a brief idea...

RoopeshV wrote:
Hi,
The code below shows how to read from a txt file and display the values in the particular fields.
Why have you used waveform?
Regards,
Roopesh
There are so many things wrong with this VI, I'm not even sure where to start.
Hard-coding paths that point to your user folder on the block diagram. What if somebody else tries to run it? They'll get an error. What if somebody tries to run this on Windows 7? They'll get an error. What if somebody tries to run this on a Mac or Linux? They'll get an error.
Not using Read From Spreadsheet File.
Use of local variables to populate an array.
Cannot insert values into an empty array.
What if there's a line missing from the text file? Now your data will not line up. Your case structure does not handle this.
Also, how does this answer the poster's question?

Extract URL from a href string

Greetings,
I am trying to solve a problem (in a specific way) that will lead to the solution for another problem (which is similar).
I am trying to extract a URL from an HTML <a> tag like this:
<a href="www.someurl.com">Go to www.someurl.com</a>
Can anyone help with the best solution? I've tried it using String.split() and StringTokenizer (which might work for this example), but, for my real problem, they seem quite inadequate. I'm guessing the solution involves some regex, but I don't know where to look (and am too unfamiliar with regex to know how).
Thank you in advance.

Actually, I am trying to break down many sections of the URL.
The URL I am trying to breakdown has a lot of data that I need.
I guess a better example would be:
<a href="www.someurl.com" id="url-id-43612">Link Text - data here</a>
In my program, I need to grab three things: the href URL, the id, and "Link Text - data here."
Here is an SSCCE of how I am currently dealing with this:
import java.util.StringTokenizer;

/**
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 *
 * @author Ryan
 */
public class htmlparser {
    public static void main(String[] args) {
        String str = "<a href=\"www.someurl.com\" id=\"url-id-43612\">Link Text - data here</a>";
        // Replace each piece of markup with a newline so only the data remains.
        str = str.replaceAll("<a href=\"", "\n")
            .replaceAll("\" id=\"url-id-", "\n")
            .replaceAll("\">", "\n")
            .replaceAll("</a>", "\n");
        // Leading and trailing newlines produce no tokens, so three remain.
        StringTokenizer tokenizer = new StringTokenizer(str, "\n");
        String url = tokenizer.nextToken();
        String id = tokenizer.nextToken();
        String text = tokenizer.nextToken();
        System.out.println(url);
        System.out.println(id);
        System.out.println(text);
    }
}
But I am looking for a more elegant solution to the problem. Is there one?
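
One candidate for a more compact approach is a single regular expression with three capturing groups, one for each value. This is a sketch only: the class name AnchorParser and the method parse are invented for illustration, and the pattern assumes the attributes always appear in the order href then id, with double quotes and the url-id- prefix, exactly as in the example above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original post.
class AnchorParser {
    // group 1: href value, group 2: numeric part of the id, group 3: link text
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\s+href=\"([^\"]*)\"\\s+id=\"url-id-([^\"]*)\"\\s*>(.*?)</a>");

    // Returns {url, id, text}, or null when the input does not match.
    static String[] parse(String html) {
        Matcher m = ANCHOR.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String str = "<a href=\"www.someurl.com\" id=\"url-id-43612\">Link Text - data here</a>";
        for (String part : parse(str)) {
            System.out.println(part);
        }
    }
}
```

If the markup varies more than this (different attribute order, nested tags), an HTML parser is likely to be more robust than any single pattern.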

Extract textdata from HTML with AUTO_FILTER

Hi,
we're using Oracle AUTO_FILTER to extract text-information from DOC and PDF - Documents.
Works fine.
We also have data stored within a HTML structure.
We use our filter with the following options:
ctx_ddl.create_preference('SEARCH_iMT_ATTRIB_AFILTER', 'AUTO_FILTER');
ctx_ddl.create_policy('SE_IMT_POLICY', 'SEARCH_iMT_ATTRIB_AFILTER');
The filter itself is called within a loop:
CTX_DOC.Policy_Filter('SE_IMT_POLICY', v_blobtab(i), v_ctmp, TRUE);
It seems as if AUTO_FILTER converts our HTML to HTML again.
When trying to insert the AUTO_FILTER output, I get ORA-31167: XML nodes over 64K in size cannot be inserted.
How can I force the AUTO_FILTER only to return plain-text?
Thanks in advance
Message was edited by:
user557708

You need to create a section group preference employing HTML_SECTION_GROUP and then use it when creating your Policy.
Faisal

Parsing url from HTML file

hello
I have an HTML page converted to text format. I need to parse the URLs present in that file.
I got all the URLs, but I want to extract only a specific kind, i.e. the URL in the anchor tag,
e.g. <a href = "http://www.java.sun.com"> Java </a>
In that file, if I need only the URL in the anchor tag, what would be the method?
Any help would be great
Thanks in advance
k

Thank you very much for your prompt reply and for the link you have provided.

Extract article from HTML code

Hi,
I'm trying to build a search engine for an RSS feed. The thing is, I'm trying to store every article as a BLOB field in the database. To optimize my search I'll need to extract only the article and nothing else (no unrelated hyperlinks or HTML code).
I'm using Swing's HTMLEditorKit to get the HTML content without the code, but that's not enough. I need to clean the page of things like headers and footers (they affect the search results).

I already parsed the XML in the RSS (I've got Informa); the problem is the article itself.
Take this article for example: [http://news.bbc.co.uk/2/hi/europe/7572635.stm]. You've got the article, but you've also got things all around it, from links to different articles to headers, footers and menus.
All of this affects the search results dramatically, so I just want the article itself.
That's the greatest challenge.

How can I extract XML from a text document?

I have tons of text documents containing useless text and a section of XML. I would like to use either Mac Automator or Apple Script to pull the XML section out and place it in a new document with a .xml extension. How can I do that?
Here is a sample of the XML section that I need to pull:
- ---Start ACNS XML
<?xml version="1.0" encoding="UTF-8"?>
<Infringement xsi:schemaLocation="http://www.movielabs.com/ACNS/ACNS2v1.xsd" xmlns="http://www.movielabs.com/ACNS" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <Case>
<ID>22242387629</ID>
<Status>OPEN</Status>
<Severity>Normal</Severity>
</Case>
<Complainant>
<Entity>MPAA Search and Notify</Entity>
<Contact></Contact>
<Address></Address>
<Phone>5555555555</Phone>
<Email>[email protected]</Email>
</Complainant>
<Service_Provider>
<Entity>Some Place, Somewhere</Entity>
<Contact></Contact>
<Address>Some Place, Somewhere </Address>
<Phone></Phone>
<Email>[email protected]</Email>
</Service_Provider>
<Source>
<TimeStamp>2011-12-02T23:41:59.94Z</TimeStamp>
<IP_Address>127.0.0.1</IP_Address>
<Port>64153</Port>
<Type>P2P</Type>
<SubType BaseType="P2P" Protocol="BitTorrent" />
<UserName></UserName>
<Number_Files>1</Number_Files>
</Source>
<Content>
<Item>
<TimeStamp>2011-12-02T23:41:59.94Z</TimeStamp>
<AlsoSeen Start="2011-12-02T23:40:00.11Z" End="2011-12-02T23:41:59.94Z"></AlsoSeen>
<Title>asdfasdf (2011)</Title>
<Artist></Artist>
<FileName>asdfasdf (2011) DVDRip XviD-MAXSPEED</FileName>
<FileSize>1580908467</FileSize>
<Type>Video</Type>
<Hash Type="SHA1">8FB7B1F4984AB6E0746B43D2B82D4ED8102984D5</Hash>
</Item>
</Content>
<History></History>
<Notes></Notes><Type Retraction="false">DMCA</Type>
<Detection>
<Asset>
<OriginalAssetName>asdfasdf (2011)</OriginalAssetName>
</Asset>
<ContentMatched Audio="false" Video="true" Text="false" />
<HashMatched>true</HashMatched>
<VerificationID>Manual and automated watermark verification</VerificationID>
</Detection>
<Verification>
<VerificationLevel Type="DT">2</VerificationLevel>
</Verification>
<TextNotice><![CDATA[12-03-2011

XML portion always starts with <Infringement and ends with </Infringement>.
Actually, it doesn't... the XML starts with the <?xml> tag, but that's just me being pedantic.
Given what you've said, though, it's easy to extract the XML data from a given block of text.
First, read the source data:
set theText to read file "path:to:the:file"
Then you can extract the XML via something like:
set start_tag to "<?xml"
set end_tag to "</Infringement>"
set start_of_data to offset of start_tag in theText
set end_of_data to (offset of end_tag in theText) + (-1 + (length of end_tag))
set theXML to text start_of_data through end_of_data of theText
Now you can write that data to a file:
set theFile to open for access file ((path to desktop as text) & "output.xml" as text) with write permission
set eof theFile to 0
write theXML to theFile starting at 0
close access theFile
If you have multiple files you can either run this in a loop that iterates over the files, or save the script as a droplet, then drop the files onto the script icon. Let me know if you need help with that, too.
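
For comparison, the same offset arithmetic can be written in Java, the language used for the other sketches on this page. This is only an illustration of the start/end slicing idea from the AppleScript above; the class name XmlSlice and the method extractXml are hypothetical, and reading or writing the files is left out.

```java
// Hypothetical helper mirroring the AppleScript offset logic above.
class XmlSlice {
    // Returns the text from "<?xml" through "</Infringement>" inclusive,
    // or null when either marker is missing.
    static String extractXml(String text) {
        String startTag = "<?xml";
        String endTag = "</Infringement>";
        int start = text.indexOf(startTag);
        if (start < 0) {
            return null;
        }
        // Search for the end tag only after the start tag.
        int end = text.indexOf(endTag, start);
        if (end < 0) {
            return null;
        }
        return text.substring(start, end + endTag.length());
    }

    public static void main(String[] args) {
        String sample = "useless text\n- ---Start ACNS XML\n"
                + "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<Infringement><Case><ID>1</ID></Case></Infringement>\n"
                + "more useless text";
        System.out.println(extractXml(sample));
    }
}
```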

Extract URL from Mail message

Hello Applescript-Meisters,
I have the following problem:
Every week we receive an email which contains a secure URL (https://) with always different username and password to download a ".tar" file.
Since I'm new to Applescript I tried to extract the paragraph of the email message with the line which contains the URL, but doesn't seem to work:
tell application "Mail"
set theMessages to message 1 of mailbox "EXAMPLE" of account "AccountTest" whose subject begins with "Data request"
set theContent to content of theMessages
set someData to paragraphs of theContent
repeat with someData in theContent
if someData begins with "https://" then
set theURL to paragraph of someData
else
set theURL to paragraph 18 of theContent -- this works but this line can change weekly
end if
end repeat
end tell
Any help or suggestions would be greatly appreciated.
Cheers,
gilles

The most obvious problem is how you're trying to iterate through the paragraphs.
You start off with:
set theContent to content of theMessages
which is fine - theContent now contains the text of the email.
You then extract the paragraphs of that text:
set someData to paragraphs of theContent
But you then reset 'someData' to be the iterator in a loop:
repeat with someData in theContent
Additionally, you're iterating through theContent, which is a single block of text.
I think what you mean to do is iterate through the paragraphs of theContent, like:
repeat with someData in (get paragraphs of theContent)
Once you have this, you know someData is the paragraph you're currently looking at so you can:
set theURL to someData
Finally, you do not want an 'else ... set theURL' statement because that will reset theURL every time you find a line that doesn't begin with "https://" (e.g. if paragraph 10 starts with https:// you set theURL to that paragraph, but as soon as you move on to paragraph 11 you reset theURL because that paragraph doesn't begin with https://).
To address this you should just iterate through the paragraphs and only set theURL to paragraph 18 if you didn't find any matching lines.
Putting it all together:
set theURL to "" -- default value for theURL
set theContent to content of theMessages
repeat with someData in (get paragraphs of theContent)
    if someData begins with "https://" then
        set theURL to someData
    end if
end repeat
if theURL = "" then
    try
        set theURL to paragraph 18 of theContent
    end try
end if
Note that I've included the 'paragraph 18' line in a try block. That's so that the script doesn't fail if there are fewer than 18 paragraphs in the message (you can't get paragraph 18 of a 17-paragraph document).

Extracting info from HTML documents

My program returns the HTML of any web page entered by the user. The HTML documents that are returned all contain pricing information that I want to extract. Any idea of the best way to search an HTML document for the specific information I require? It seems like a huge task to split it all into tokens and search for the currency sign!

This is a nightmare of a problem... the HTML files that I am retrieving are huge. All I need from them are a couple of lines of information. How do I find the specific information I need?
Load the entire file and search it. You find the information the same way you would when you look for it in the file's source code.
Is it possible from a Java program to open the HTML file in a web browser, search, then return the info? The HTML files seem really complex to search.
How would this help?

Extracting URL From Application Object

If I type http://mymachine.com/myapp?a=aaa&b=bbb in my browser (to access application "myapp") how do I extract just the http://mymachine.com/myapp part from the application object? I would have figured that the code below would allow me to construct that string but these methods instead return the stuff in comments...
URL theURL = application.getResource("/");
theURL.getProtocol(); // jndi
theURL.getHost(); // null
theURL.getPort(); // -1
theURL.getFile(); // org.apache.naming.resources.FileDirContext@!8bf135
URL theURL = application.getResource("/myapp");
theURL.getProtocol(); // null pointer exception
theURL.getHost(); // null pointer exception
theURL.getPort(); // null pointer exception
theURL.getFile(); // null pointer exception

Tech note: HttpUtils is deprecated in newer Servlet APIs (2.3+). Instead, you would use the equivalent methods in HttpServletRequest:
URL theURL = new URL(request.getRequestURL().toString());
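
HttpServletRequest.getRequestURL() already leaves out the query string, so inside a servlet the line above should be enough. If all you have is the full address as a plain string, a sketch along the following lines may help; it uses only java.net.URI, and the class name BaseUrl and the method withoutQuery are made up for this example.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper; works on a plain string, no servlet container needed.
class BaseUrl {
    // Rebuilds the URL from scheme, authority and path, dropping the
    // query string and fragment.
    static String withoutQuery(String fullUrl) {
        URI uri = URI.create(fullUrl);
        try {
            return new URI(uri.getScheme(), uri.getAuthority(), uri.getPath(), null, null)
                    .toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot rebuild URL: " + fullUrl, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(withoutQuery("http://mymachine.com/myapp?a=aaa&b=bbb"));
    }
}
```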

Extract URLs from webpage

Hi guys,
I'd like to know if it is possible to find the hyperlinks on a webpage and write their targets to a text file using AppleScript.
I'll take youtube as an example, because nearly everyone is familiar with it:
When I click on a username, I will be redirected to the Channel Page of that particular user. What I want is to get the URL of this Channel Page.
I think I'll have to create a list, with all the URLs in it, then filter them and save them into a text file.
The problem is that I can't find a way to do that, whereas Automator offers that feature.
Hope you can help me!
Cheers

I have no experience with automator but I think I understand. Here is something to get you started.
set theURL to text returned of (display dialog "Enter URL" default answer "http://www.")
try
    -- quoted form of protects characters such as & and ? from the shell
    set theSource to (do shell script "curl " & quoted form of theURL)
on error
    display dialog "Error getting website source."
end try
This will set the variable theSource to the source code of the entered webpage. You then need to search through this source code and extract the information that you want. I'm still not exactly sure what that is though and there are going to be a ton of links in it. Hope that helps at least a little.

Get Url from a text box

Hello,
I have read a couple of posts regarding the getURL function but couldn't find what I'm looking for.
Trusting someone can help.
Here are two reference files:
1- Main swf that loads the movie clip with the links
2- Swf that displays the links I'm opening in a new window
I'm able to open the URL in a new window but couldn't find a way to control the height of this window. Please note that the links are all inside a text box. They are not separate buttons; they are just text hyperlinks.
Please let me know if any of you knows how to control the height of the new _blank window.
Thanks
Fe

Hi. You should consider posting this to either the iPhone forum or the Developer forum. This forum is for discussing the Unix subsystem of OS X...

Extracting data from a text file

Hi there, I am trying to create a program that can extract a message from a csv file. The csv is an output from a database that stores messages with the following syntax:
username,password,date,to,from,subject,body
I am very new to java (about 3 days) so any help would be appreciated.
Thanks

Lesson: I/O: Reading and Writing (but no 'rithmetic)
String.split
Resources for Beginners
Sun's basic Java tutorial
Sun's New To Java Center. Includes an overview of what Java is, instructions for setting up Java, an intro to programming (that includes links to the above tutorial or to parts of it), quizzes, a list of resources, and info on certification and courses.
http://javaalmanac.com. A couple dozen code examples that supplement The Java Developers Almanac.
jGuru. A general Java resource site. Includes FAQs, forums, courses, more.
JavaRanch. To quote the tagline on their homepage: "a friendly place for Java greenhorns." FAQs, forums (moderated, I believe), sample code, all kinds of goodies for newbies. From what I've heard, they live up to the "friendly" claim.
Bruce Eckel's Thinking in Java (Available online.)
Joshua Bloch's Effective Java
Bert Bates and Kathy Sierra's Head First Java.
James Gosling's The Java Programming Language. Gosling is the creator of Java. It doesn't get much more authoritative than this.
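
To make the String.split pointer above concrete, here is a minimal sketch for the seven-field format in the question. The class name CsvMessage and the sample line are invented for illustration. It assumes that only the last field (body) may contain commas and that fields are not quoted; real CSV exports with quoting would need a proper CSV parser.

```java
// Hypothetical helper for lines of the form
// username,password,date,to,from,subject,body
class CsvMessage {
    static final int FIELD_COUNT = 7;

    // The limit argument stops splitting after the sixth comma, so any
    // commas inside the body stay in the last element.
    static String[] parseLine(String line) {
        return line.split(",", FIELD_COUNT);
    }

    public static void main(String[] args) {
        String[] fields = parseLine(
                "alice,secret,2009-01-01,bob,alice,Lunch,Hi Bob, are you free at noon?");
        System.out.println("subject = " + fields[5]);
        System.out.println("body    = " + fields[6]);
    }
}
```

Reading the file itself can be done line by line with a BufferedReader, as described in the I/O lesson listed above, calling parseLine on each line.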

Use SESSION variable in URL from HTML region

Hello,
I have what looks like a simple question, but I've been struggling with it for three days now and I need your help!
I have a page with an HTML region. I want to display some links to other pages within the application, so I thought I'd use this:
a href="http://server.xxxx.com:7777/pls/htmldb/f?p=114:13:&SESSION"
But the session ID is not interpreted, and the link doesn't work. Any idea what's wrong here, or how I should create links with the session ID in them?
Thanks!!!! Matt
Message was edited by:
matt_amsterdam

Matt,
You've missed the period (.) off the end; use &SESSION. instead.

Extract addresses from text

I'm trying to use Automator to extract addresses from a text document. It's a long list of addresses and I eventually need to print them onto mailing labels. The trouble is getting them identified and extracted. After that I can put them into Address Book and take care of the printing. Any suggestions on where to start in Automator?

Plain text, entries are separated by a blank line.
The format is typically:
City, State
Company
address line 1
address line 2
City, State Zip
Phone: ###-###-####
Fax: ###-###-####
E-Mail: [email protected]
Web: http://www.website.com
We're only looking to use the first 5 lines. I would like to pipe this information into Address Book and then print labels from there, but that can be handled outside of Automator. It's pulling the data out of the text that I'm having trouble with.
It's a lot to do by hand. Work smarter, not harder, right?
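
The thread asks about Automator, but the parsing step itself is small enough to sketch in a few lines of code. The following Java sketch (the language used for the other examples on this page) splits the text on blank lines and keeps the first five lines of each entry, as described above. The class name AddressBlocks and the method labelLines are hypothetical, and getting the result into Address Book is not covered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; assumes entries are separated by a blank line.
class AddressBlocks {
    static final int LINES_TO_KEEP = 5;

    // Returns one list per entry, holding at most the first five lines.
    static List<List<String>> labelLines(String text) {
        List<List<String>> entries = new ArrayList<>();
        // A separator is a newline, optional whitespace, then another newline.
        for (String block : text.split("\\n\\s*\\n")) {
            if (block.trim().isEmpty()) {
                continue; // skip empty chunks from leading blank lines
            }
            List<String> kept = new ArrayList<>();
            for (String line : block.trim().split("\\n")) {
                if (kept.size() == LINES_TO_KEEP) {
                    break;
                }
                kept.add(line.trim()); // trim also removes a trailing \r
            }
            entries.add(kept);
        }
        return entries;
    }

    public static void main(String[] args) {
        String text = "Springfield, IL\nAcme Corp\n1 Main St\nSuite 2\n"
                + "Springfield, IL 62701\nPhone: 555-555-5555\n\n"
                + "Albany, NY\nBeta LLC\n9 Elm Rd\nFloor 3\nAlbany, NY 12207\n";
        for (List<String> entry : labelLines(text)) {
            System.out.println(entry);
        }
    }
}
```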
