Extract URLs from webpage

Hi guys,
I'd like to know if it's possible to find the hyperlinks on a webpage and write their targets to a text file using AppleScript.
I'll take YouTube as an example, because nearly everyone is familiar with it:
When I click on a username, I'm redirected to the channel page of that particular user. What I want is the URL of that channel page.
I think I'll have to create a list with all the URLs in it, then filter them and save them to a text file.
The problem is that I can't find a way to do that in AppleScript, even though Automator offers that feature.
Hope you can help me!
Cheers

I have no experience with Automator, but I think I understand what you're after. Here is something to get you started:
set theURL to text returned of (display dialog "Enter URL" default answer "http://www.")
try
    set theSource to (do shell script "curl " & quoted form of theURL) -- quoted form protects the URL from the shell
on error
    display dialog "Error getting website source."
    return -- stop here, since theSource was never set
end try
This will set the variable theSource to the source code of the entered webpage. You then need to search through this source code and extract the information that you want. I'm still not exactly sure what that is though and there are going to be a ton of links in it. Hope that helps at least a little.
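If it helps to see the whole idea end to end, here is a rough sketch of the same approach in Java (fetch the page source, pull out every href value with a simple regex, write them to a text file). The URL, output file name, and class name are just examples, and a regex like this only catches plain href="..." links:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class LinkDump {
    public static void main(String[] args) throws Exception {
        URL page = new URL("http://www.youtube.com/"); // example URL
        StringBuilder source = new StringBuilder();
        // Fetch the page source (the equivalent of the curl call above).
        try (BufferedReader in = new BufferedReader(new InputStreamReader(page.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                source.append(line).append('\n');
            }
        }
        // Pull out every href="..." value and write it to a text file.
        Matcher m = Pattern.compile("href=\"([^\"]+)\"").matcher(source);
        try (PrintWriter out = new PrintWriter("links.txt")) { // example output file
            while (m.find()) {
                out.println(m.group(1)); // filter here, e.g. keep only the channel-page links you want
            }
        }
    }
}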

Similar Messages

  • Automator Stops Responding while trying to perform "Get Image URLs from Webpage"

    I have seen it in many tutorials, and it seems straightforward enough, so here is my order of events for my script:
    Get Current Webpage from Safari
    Get Image URLs from Webpage [linked from these webpages]
    Download URLs [Downloads]
    However, when I run this script it gets the current webpage quickly but then hangs on Get Image URLs from Webpage, which either never finishes or takes 700+ seconds to move to the next step.
    The times it did pass to the next step, it worked about 98% of the time but still missed images occasionally.
    After I figure out how to fix this issue, I am looking for a way to make it create a new folder and name it according to the subdirectory of the webpage, for example:
    http://www.mypictures.com/Birthday/01.JPG
    http://www.mypictures.com/Birthday/02.JPG
    would create a folder Birthday and save images 1 and 2.
    Then in the future I want to download another set, such as:
    http://www.mypictures.com/MothersDay/01.JPG
    http://www.mypictures.com/MothersDay/02.JPG
    Since these are both named 01.JPG, the sets would overwrite each other, or get strange numbers appended, and would not be divided by set by default.
    So I need this script to create folders based on the directory portion of the URL; in my example that would be the Birthday and MothersDay folders.
    Any help would be greatly appreciated!
    Thanks in advance!
                    Mike

    I should add, it has simply frozen and stopped responding about 95% of the 20 or so times I have run it. I have tried rebooting (not sure if that helps on a Mac; I'm still new to the OS, having been a Windows user for a long time, but so far I love my Mac).

  • Extract URL from a href string

    Greetings,
    I am trying to solve a problem (in a specific way) that will lead to the solution for another problem (which is similar).
    I am trying to extract a URL from an HTML <a> tag like this:
    <a href="www.someurl.com">Go to www.someurl.com</a>Can anyone help with the best solution? I've tried it using String.split() and StringTokenizer (which might work for this example), but, for my real problem, they seem quite inadequate. I'm guessing the solution involves some regex but I don't know where to look (and am quite unfamiliar with regex to know how).
    Thank you in advance.

    Actually, I am trying to break down many sections of the URL.
    The URL I am trying to break down has a lot of data that I need.
    I guess a better example would be:
    <a href="www.someurl.com" id="url-id-43612">Link Text - data here</a>In my program, I need to grab three things: the href URL, the id, and "Link Text - data here."
    Here is an SSCCE of how I am currently dealing with this:
    import java.util.StringTokenizer;
    /*
     * To change this template, choose Tools | Templates
     * and open the template in the editor.
     * @author Ryan
     */
    public class htmlparser {
        public static void main(String[] args) {
            String str = "<a href=\"www.someurl.com\" id=\"url-id-43612\">Link Text - data here</a>";
            str = str.replaceAll("<a href=\"", "\n")
                .replaceAll("\" id=\"url-id-", "\n")
                .replaceAll("\">", "\n")
                .replaceAll("</a>", "\n");
            StringTokenizer tokenizer = new StringTokenizer(str, "\n");
            String url = tokenizer.nextToken();
            String id = tokenizer.nextToken();
            String text = tokenizer.nextToken();
            System.out.println(url);
            System.out.println(id);
            System.out.println(text);
        }
    }
    But I am looking for a more elegant solution to the problem; is there one?
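    For what it's worth, one common "more elegant" route is a regex with capture groups, so you can pull all three pieces out in a single pass. A rough sketch (only checked against your sample line; it assumes the href comes before the id, as in your example):
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    public class AnchorRegexDemo {
        public static void main(String[] args) {
            String str = "<a href=\"www.someurl.com\" id=\"url-id-43612\">Link Text - data here</a>";
            // One capture group each for the href value, the numeric part of the id,
            // and the link text.
            Pattern p = Pattern.compile("<a\\s+href=\"([^\"]*)\"\\s+id=\"url-id-([^\"]*)\">([^<]*)</a>");
            Matcher m = p.matcher(str);
            while (m.find()) {
                System.out.println(m.group(1)); // www.someurl.com
                System.out.println(m.group(2)); // 43612
                System.out.println(m.group(3)); // Link Text - data here
            }
        }
    }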

  • Extracting info from webpages

    I am trying to create a hotel program that has various features, including finding the cheapest hotel prices on the web. My program searches the web and returns the appropriate web page results in HTML format. My problem is I'm not sure of the best way to extract the information I want. Below is an example of a web page (I know it's long, but if you copy it into a web browser it does work, honest!). From this page I want to extract the hotel names and prices.
    http://www.bookings.org/searchresults.html?class_interval=2&country=gb&error_url=http%3A%2F%2Fwww.bookings.org%2Fcountry%2Fgb.html%3F&search_by=city&city=-2595386&region=Avon+Aberdeenshire&class_key=1&class=0&do_availability_check=on&checkin_monthday=21&checkin_year_month=2005-6&checkout_monthday=22&checkout_year_month=2005-6&newlangurl=%2Fcountry%2Fgb.en.html&x=77&y=14
    At present I can return the HTML of this page. However, can anyone suggest how I go about extracting the specific info I require? The HTML file is huge!
    Regards
    Ross

    Sorry, I pasted the wrong URL. This one should work.
    http://www.bookings.org/searchresults.html?class_interval=2&country=gb&error_url=http%3A%2F%2Fwww.bookings.org%2Fcountry%2Fgb.html%3F&search_by=city&city=-2595386&region=Avon+Aberdeenshire&class_key=1&class=0&do_availability_check=on&checkin_monthday=24&checkin_year_month=2005-3&checkout_monthday=27&checkout_year_month=2005-3&newlangurl=%2Fcountry%2Fgb.en.html&x=88&y=5
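    If you can use a third-party library, an HTML parser such as jsoup makes this kind of extraction much easier than hand-rolled string searching. A rough sketch; the CSS selectors are placeholders, since you would need to look at the actual page source to find the elements that wrap the hotel name and price:
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    public class HotelScraper {
        public static void main(String[] args) {
            String html = "<html>...</html>"; // the page source you already retrieved
            Document doc = Jsoup.parse(html);
            // Placeholder selectors: inspect the real page to find the tags/classes
            // that actually wrap each hotel row, its name, and its price.
            for (Element row : doc.select("td.hotel")) {
                String name = row.select("a.hotelname").text();
                String price = row.select("span.price").text();
                System.out.println(name + " - " + price);
            }
        }
    }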

  • Extracting URL From Application Object

    If I type http://mymachine.com/myapp?a=aaa&b=bbb in my browser (to access application "myapp") how do I extract just the http://mymachine.com/myapp part from the application object? I would have figured that the code below would allow me to construct that string but these methods instead return the stuff in comments...
    URL theURL = application.getResource("/");
    theURL.getProtocol(); // jndi
    theURL.getHost(); // null
    theURL.getPort(); // -1
    theURL.getFile(); // org.apache.naming.resources.FileDirContext@!8bf135
    URL theURL = application.getResource("/myapp");
    theURL.getProtocol(); // null pointer exception
    theURL.getHost(); // null pointer exception
    theURL.getPort(); // null pointer exception
    theURL.getFile(); // null pointer exception

    Tech note... HttpUtils is deprecated in newer Servlet APIs (2.3+). Instead, you would use the same methods on the HttpServletRequest:
    URL theURL = new URL(request.getRequestURL().toString());
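    For the original question (getting just the http://mymachine.com/myapp part), here is a minimal sketch that rebuilds it from the request itself rather than from application.getResource(); all of the calls are standard HttpServletRequest methods:
    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    public class BaseUrlServlet extends HttpServlet {
        protected void doGet(HttpServletRequest request, HttpServletResponse response) throws IOException {
            // Rebuild "http://mymachine.com/myapp" from the request.
            int port = request.getServerPort();
            boolean defaultPort = ("http".equals(request.getScheme()) && port == 80)
                    || ("https".equals(request.getScheme()) && port == 443);
            String base = request.getScheme() + "://" + request.getServerName()
                    + (defaultPort ? "" : ":" + port)
                    + request.getContextPath(); // e.g. "/myapp"
            response.getWriter().println(base);
        }
    }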

  • Extract URL from Mail message

    Hello Applescript-Meisters,
    I have the following problem:
    Every week we receive an email which contains a secure URL (https://), with a different username and password each time, to download a ".tar" file.
    Since I'm new to AppleScript, I tried to extract the paragraph of the email message containing the URL, but it doesn't seem to work:
    tell application "Mail"
    set theMessages to message 1 of mailbox "EXAMPLE" of account "AccountTest" whose subject begins with "Data request"
    set theContent to content of theMessages
    set someData to paragraphs of theContent
    repeat with someData in theContent
    if someData begins with "https://" then
    set theURL to paragraph of someData
    else
    set theURL to paragraph 18 of theContent -- this works but this line can change weekly
    end if
    end repeat
    end tell
    Any help or suggestions would be greatly appreciated.
    Cheers,
    gilles

    The most obvious problem is how you're trying to iterate through the paragraphs.
    You start off with:
    set theContent to content of theMessages
    which is fine - theContent now contains the text of the email.
    You then extract the paragraphs of that text:
    set someData to paragraphs of theContent
    But you then reset 'someData' to be the iterator in a loop:
    repeat with someData in theContent
    Additionally, you're iterating through theContent, which is a single block of text.
    I think what you mean to do is iterate through the paragraphs of theContent, like:
    repeat with someData in (get paragraphs of theContent)
    Once you have this, you know someData is the paragraph you're currently looking at so you can:
    set theURL to someData
    Finally, you do not want an 'else ... set theURL' statement, because that will reset theURL every time you find a line that doesn't begin with "https://" (e.g. if paragraph 10 starts with https:// you set theURL to that paragraph, but as soon as you move on to paragraph 11 you reset theURL because that paragraph doesn't begin with https://).
    To address this, you should just iterate through the paragraphs and only set theURL to paragraph 18 if you didn't find any matching lines.
    Putting it all together:
    set theURL to "" -- default value for theURL
    theContent to content of theMessage
    repeat with someData in (get paragraphs of theContent)
     if someData begins with "http://" then
      set theURL to someData
     end if
    end repeat
    if theURL = "" then
     try
      set theURL to paragraph 18 of theContent
     end try
    end if
    Note that I've included the 'paragraph 18' line in a try block. That's so that the script doesn't fail if there are less than 18 paragraphs in the message (you can't get paragraph 18 of a 17-paragraph document).

  • Extract URL from HTML text

    Suppose you have the following String that is body text with HTML.
    String bodyText = " My name is Blake. I live in New York City. See my image here: <img href="http://www.blake.com/blake.jpg"/> isn't my picture awesome? Tata for now!"
    I want to extract the URL that contains the location of the image in this bodyText. The ideal would be to create a function called public String extractor(String bodyText) to be used
    String imageURL = extractor(bodyText);
    //imageURL should be "http://www.blake.com/blake.jpg"
    My first thought is to use a regex, but the only place I can see to use one is the .replaceAll method in the String class. I'm by no means an expert on regexes, so I haven't taken much time to try to figure it out. I obviously could do a linear search of the bodyText with a ton of if statements, but that's just poor coding. I want to see if anyone has come across this or has insight into this problem.
    Thanks for all the help,
    Blake

    How would the regexp change if there were multiple img tags within the String?
    I don't rightly know, I'm just a raw beginner in regexes.
    Would this regexp return all the img URLs found in the String?
    No, as it stands it would return only the last URL. But this will:
    String bodyText = " My name is Blake. " +
          "I live in New York City. See my image here: " +
          "<img href=\"http://www.blake.com/blake.jpg\"/>" +
          " isn't my picture awesome? Here's another: " +
          "<img href='http://www.blake.com/Vandelay.jpg'/>" +
          " Tata for now!";
    String regex = "(?<=<img\\shref=[\"'])http://.*?(?=[\"']/?>)";
    Pattern pattern = Pattern.compile (regex);
    Matcher matcher = pattern.matcher (bodyText);
    while (matcher.find ()) {
       System.out.println (matcher.group ());
    }
    Note the enhancement that takes into account that both single and double quotes are legal in HTML. But unlike the earlier example, this does not tolerate more than one space between <img and href=; I couldn't find a way to achieve that.
    Visit this thread later, there are some real regex experts around who may post a better solution.
    db
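    On that last limitation: a Java lookbehind must have a bounded length, which is why something like \s+ can't go inside one. One way around it is to drop the lookbehind and capture the URL in a group instead; a rough sketch (class name is just for illustration):
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    public class ImgUrlExtractor {
        public static void main(String[] args) {
            String bodyText = "See my image here: <img   href=\"http://www.blake.com/blake.jpg\"/>"
                    + " and another: <img href='http://www.blake.com/Vandelay.jpg'/> Tata!";
            // \s+ tolerates any amount of whitespace between <img and href=;
            // the URL itself is capture group 1.
            Pattern pattern = Pattern.compile("<img\\s+href=[\"'](http://.*?)[\"']\\s*/?>");
            Matcher matcher = pattern.matcher(bodyText);
            while (matcher.find()) {
                System.out.println(matcher.group(1));
            }
        }
    }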

  • Automator's Link URLs from Webpages not finding links

    There are clearly <a href='s in the page I'm looking at but Automator's action simply will not return them.
    Is this a common problem?

    Yes you are correct, it's the space character—not the apostrophe. The trouble is that the source of your web page does not contain an escaped %20, it has merely printed literal space characters in the link tag. While Safari managed to interpret this correctly and convert the spaces to %20, Automator did not. As a test I made an html file containing:
    <a href="red blue">green</a>
    and Automator returned:
    {"file:///Volumes/hd2/test/red"}
    There may be a very long way around this bug. I'll think about it.

  • How to extract an mp3's url from m3u playlist (and download the file)?

    hello!
    I've managed to get Automator to retrieve (and save to my hard drive) an .m3u playlist from a radio station's website that broadcasts a program I like in mp3.
    Inside this m3u playlist is the link to an actual mp3... both the m3u and the mp3 change every week.
    If I drag the m3u to TextEdit it opens up and I can see the 3 lines of "source code", of which the last line is the full URL of the mp3 file (http://www.thatradiosurl.com/audio/shownumber.mp3)
    My question is as follows: how do I get Automator to extract that URL and save that file to my hard drive (so I can automatically add it to a playlist in iTunes and retrieve the show on my iPod at sync)?
    I'll save the final workflow as an iCal event that will hopefully grab this show every week.
    before you ask... no, that show is not available as a podcast! I wish it were... I think they're a bit slow on catching up with the tech.
    thanks to all!
    have a great day
    d
    pbg415"1.67   Mac OS X (10.4.6)  

    Hi!
    You might have seen this post in the Automator part of the forum... but I might have posted it in the wrong place (got no interest :( )
    I've managed to get Automator to retrieve (and save to hdd) an .m3u playlist from a radio's website that broadcasts a program I like in mp3.
    Inside this m3u playlist is the link to an actual mp3 file... both the m3u and the mp3 change every week.
    If I drag the m3u to TextEdit it opens up and I can see the 3 lines of "source code", of which the last line is the full URL of the mp3 file (http://www.thatradiosurl.com/audio/show-shownumber.mp3)
    Since you already have the playlist in a text file, you have two options:
    1. Use Automator's 'Get Link URLs from Webpages' action to extract the URL. 'Get Link URLs from Webpages' is under the Safari library.
    2. Run the Python script behind 'Get Link URLs from Webpages' from AppleScript. The script below will get your mp3 URLs:
    set playlist_file to "/path/to/playlist/file.txt" -- change to your actual playlist file path
    set _URLs to do shell script "/System/Library/Automator/Get\\ Link\\ URLs\\ from\\ Webpages.action/Contents/Resources/links file://" & quoted form of playlist_file
    set mp3_URLs to ""
    repeat with _URL in paragraphs of _URLs
    if _URL contains ".mp3" then set mp3_URLs to mp3_URLs & _URL & (ASCII character 10)
    end repeat
    mp3_URLs is the list you want to extract.

  • JWS Download Dialog's URL(From:) different from browser's URL

    Hi everyone,
    My problem:
    I have deployed application A.jar at 192.168.0.10 and B.jar at 192.168.0.20.
    They have always launched successfully.
    But sometimes when I enter 192.168.0.10 in the browser address bar, B.jar from 192.168.0.20 is launched (A.jar is not).
    When I clear the Java cache the problem goes away, but after a few days it happens again...
    I don't know how to solve it.
    Can anyone help me? Thanks.
    My English is poor; I hope you get my point. :)

  • Auto-extracting URL LINKS from webpage?

    Is there a way to extract all the linked URLs from a webpage?
    I am trying to put together some lists and would like to gather these up and then insert them into a Numbers spreadsheet. In some cases there are 30 to 50 of these listed on the left-hand side of a webpage, and right now I imagine clicking on a link, copying the URL from the browser, pasting it into Numbers, clicking back, clicking on the next link, etc.
    Is there a way to save some steps by exporting these to a text document and then copying and pasting from there?
    thanks for any help.

    You create the new bookmark with the javascript replacing the original URL from any old bookmark you have and no longer need, or just create a new bookmark for anything at all, specifically for your links script. Then once you are at a page where you want the links displayed, you just select your new links bookmark. Your page will stay open, but a new window will open as well, with all the links listed. You might want to change the size of the window; the width=400 and height=200 result in a pretty small window. Anyway, I put the Links javascript bookmarklet in my Bookmarks menu, so while I am here on this page I just go to the Bookmarks menu, slide down and select Links, and a new window opens with all the links on this page listed.
    Francine
    PS--It would be more impressive if I had written that script, but I didn't. I found it years ago and saved it with a batch of bookmarklets that do other useful things, such as suppressing animated gifs. I just checked, and a particularly useful collection, Jesse's Bookmarklets, still exists:
    https://www.squarefree.com/bookmarklets/

  • How to extract data from web URL

    I am working on a project which needs to extract data from web pages and then analyze that data. The question is how to extract the data: should I use an HTML parser? Need help, thanks a lot.

    I am working on a project which needs to extract data from web pages and then analyze that data. The question is how to extract the data: should I use an HTML parser?
    Try this:
    http://java.sun.com/docs/books/tutorial/networking/urls/readingURL.html
    Or, like you said yourself, use an HTML parser:
    http://java-source.net/open-source/html-parsers
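    If you don't want a third-party library, the JDK also ships a very basic HTML parser in javax.swing.text.html. A minimal sketch that prints every link (href) on a page; the URL is just an example, and for messy real-world pages one of the parsers from the second link will be more robust:
    import java.io.InputStreamReader;
    import java.net.URL;
    import javax.swing.text.MutableAttributeSet;
    import javax.swing.text.html.HTML;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;
    public class HrefLister {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://www.example.com/"); // example URL
            // Callback that fires for every start tag; we only care about <a href=...>.
            HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
                public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                    if (tag == HTML.Tag.A) {
                        Object href = attrs.getAttribute(HTML.Attribute.HREF);
                        if (href != null) {
                            System.out.println(href);
                        }
                    }
                }
            };
            new ParserDelegator().parse(new InputStreamReader(url.openStream()), callback, true);
        }
    }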

  • "Message from Webpage (error) There was an error in the browser while setting properties into the page HTML, possibly due to invalid URLs or other values. Please try again or use different property values."

    I created a site column at the root of my site and I have publishing turned on.  I selected the Hyperlink with formatting and constraints for publishing.
    I went to my subsite and added the column.  The request was to have "Open in new tab" for their hyperlinks.  I was able to get the column to be added and yesterday we added items without a problem. 
    The problem arose when, today, a user told me that he could not edit the hyperlink.  He has modify / delete permissions on this list.
    He would edit the item in a custom list, click on the address ("click to add a new hyperlink"), and then get the error below after successfully putting in the Selected URL (http://www.xxxxxx.com), the Open Link in New Window checkbox, the Display Text, and the Tooltip:
    "Message from Webpage  There was an error in the browser while setting properties into the page HTML, possibly due to invalid URLs or other values. Please try again or use different property values."
    We are on IE 9.0.8.1112 x86, Windows 7 SP1 Enterprise Edition x64
    The farm is running SharePoint 2010 SP2 Enterprise Edition August 2013 CU Mark 2, 14.0.7106.5002
    I saw another post, linked below, from someone who had a similar problem; an IISReset fixed it there, as it did for this problem. I wonder if this is resolved in the latest CU of SharePoint, the April 2014 CU?
    Summary from the link below: comment out the following line in AssetPickers.js:
    //callbackThis.VerifyAnchorElement(HtmlElement, Config);
    perform IISReset
    This is referenced in the item below:
    http://social.technet.microsoft.com/Forums/en-US/d51a3899-e8ea-475e-89e9-770db550c06e/message-from-webpage-error-there-was-an-error-in-the-browser-while-setting?forum=sharepointgeneralprevious
    This is possibly the same information that I saw, possibly using the above link as a reference.
    http://seanshares.com/post/69022029652/having-problems-with-sharepoint-publishing-links-after
    Again, if I update my SharePoint 2010 farm to April 2014 CU is this going to resolve the issue I have?
    I don't mind changing the JS file, however I'd like to know / see if there is anything official regarding this instead of my having to change files.
    Thank you!
    Matt

    We had the same issue after applying SP2 & the August CU. We opened a case with MSFT and got the same resolution you mentioned.
    I blogged about this issue, including the official reference.
    MSFT later released a hotfix for this on December 10, 2013, which I am 100% positive should be part of future CUs.
    So if you apply the April CU then you will be fine.
    Please remember to mark your question as answered and vote helpful if this solves/helps your problem. Thanks -WS MCITP (SharePoint 2010, 2013) Blog: http://wscheema.com/blog

  • I am looking to find out if it is possible in Adobe Muse to have a section on a webpage that will update every 30 seconds or 30 minutes with a snapshot URL From a live webcam stream

    I am looking to find out if it is possible in Adobe Muse to have a section on a webpage that will update every 30 seconds or 30 minutes with a snapshot URL From a live webcam stream

    You can do this via Insert HTML; you would need to use the live feed code on the Muse page.
    Thanks,
    Sanjit

  • Hide webpage link URL from browser status bar?

    Anyone know how to hide webpage link URL from the browser status bar with Dreamweaver CS4? So that when the mouse cursor hovers over a link on a webpage, the URL won't be shown in the browser status bar. This comes in useful for when I put my email as a link on a webpage but I don't want to let my email address be known prematurely. I prefer to do this through the Dreamweaver CS4 interface without coding, if possible. What is the easiest way to hide link URL like this? Thanks.

    More than the anal users. Many people use this information to decide whether they will click on the link. By hiding it, you remove that extremely valuable security information - for example, a link tells you on the screen that it is taking you to www.wachovia.com, but the status bar link tells you it's taking you to www.iamahackersiteandwilldrinkyourmilkshake.cn
    I don't think you want to remove that from your site, unless, of course, you are the webmaster for 'iamahackersiteandwilldrinkyourmilkshake.cn....'
    Murray --- ICQ 71997575
    Adobe Community Expert
    "joeq" <[email protected]> wrote in message
    news:gib1tf$kn7$[email protected]..
    > it can be done - Behaviors > Set Text > Set Text
    of Status Bar.
    >
    > do what you want, but know that it's somewhat unreliable
    (see the caveat
    > in
    > the DW Behavior box) and that some anal users will
    object to the masking
    > of
    > your links.
    >
    >
    >
