Extract URLs from webpage

Hi guys,
I'd like to know if it's possible to find the hyperlinks on a webpage and write their targets to a text file using AppleScript.
I'll take YouTube as an example, because nearly everyone is familiar with it:
When I click on a username, I'm redirected to the channel page of that particular user. What I want is the URL of that channel page.
I think I'll have to create a list with all the URLs in it, then filter them and save them to a text file.
The problem is that I can't find a way to do that in AppleScript, even though Automator offers that feature.
Hope you can help me!
Cheers

I have no experience with Automator, but I think I understand what you're after. Here is something to get you started:
set theURL to text returned of (display dialog "Enter URL" default answer "http://www.")
try
    set theSource to (do shell script "curl " & quoted form of theURL) -- quoted form protects the URL from the shell
on error
    display dialog "Error getting website source."
    return -- stop here, since theSource was never set
end try
This will set the variable theSource to the source code of the entered webpage. You then need to search through this source code and extract the information that you want. I'm still not exactly sure what that is though and there are going to be a ton of links in it. Hope that helps at least a little.
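If it helps to see the whole idea end to end, here is a rough sketch of the same approach in Java (fetch the page source, pull out every href value with a simple regex, write them to a text file). The URL, output file name, and class name are just examples, and a regex like this only catches plain href="..." links:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class LinkDump {
    public static void main(String[] args) throws Exception {
        URL page = new URL("http://www.youtube.com/"); // example URL
        StringBuilder source = new StringBuilder();
        // Fetch the page source (the equivalent of the curl call above).
        try (BufferedReader in = new BufferedReader(new InputStreamReader(page.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                source.append(line).append('\n');
            }
        }
        // Pull out every href="..." value and write it to a text file.
        Matcher m = Pattern.compile("href=\"([^\"]+)\"").matcher(source);
        try (PrintWriter out = new PrintWriter("links.txt")) { // example output file
            while (m.find()) {
                out.println(m.group(1)); // filter here, e.g. keep only the channel-page links you want
            }
        }
    }
}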

Similar Messages

  • Automator Stops Responding while trying to perform "Get Image URLs from Webpage"

    I have seen it in many tutorials, and it seems straightforward enough, so here is my order of events for my script:
    Get Current Webpage from Safari
    Get Image URLs from Webpage [linked from these webpages]
    Download URLs [Downloads]
    However, when I run this script it gets the current webpage quickly but then hangs on Get Image URLs from Webpage, which either never finishes or takes 700+ seconds to move to the next step.
    The times it did pass to the next step, it worked about 98% of the time but still missed images occasionally.
    After I figure out how to fix this issue, I am looking for a way to make it create a new folder and name it according to the subdirectory of the webpage, for example:
    http://www.mypictures.com/Birthday/01.JPG
    http://www.mypictures.com/Birthday/02.JPG
    would create a folder Birthday and save images 1 and 2.
    Then in the future I want to download another set, such as:
    http://www.mypictures.com/MothersDay/01.JPG
    http://www.mypictures.com/MothersDay/02.JPG
    Since these are both named 01.JPG, the sets would overwrite each other, or get strange numbers appended, and would not be divided by set by default.
    So I need this script to create folders based on the directory portion of the URL; in my example that would be the Birthday and MothersDay folders.
    Any help would be greatly appreciated!
    Thanks in advance!
                    Mike

    I should add, it has simply frozen and stopped responding about 95% of the 20 or so times I have run it. I have tried rebooting (not sure if that helps on a Mac; I'm still new to the OS, having been a Windows user for a long time, but so far I love my Mac).

  • Extract URL from a href string

    Greetings,
    I am trying to solve a problem (in a specific way) that will lead to the solution for another problem (which is similar).
    I am trying to extract a URL from an HTML <a> tag like this:
    <a href="www.someurl.com">Go to www.someurl.com</a>Can anyone help with the best solution? I've tried it using String.split() and StringTokenizer (which might work for this example), but, for my real problem, they seem quite inadequate. I'm guessing the solution involves some regex but I don't know where to look (and am quite unfamiliar with regex to know how).
    Thank you in advance.

    Actually, I am trying to break down many sections of the URL.
    The URL I am trying to break down has a lot of data that I need.
    I guess a better example would be:
    <a href="www.someurl.com" id="url-id-43612">Link Text - data here</a>In my program, I need to grab three things: the href URL, the id, and "Link Text - data here."
    Here is an SSCCE of how I am currently dealing with this:
    import java.util.StringTokenizer;
    /*
     * To change this template, choose Tools | Templates
     * and open the template in the editor.
     * @author Ryan
     */
    public class htmlparser {
        public static void main(String[] args) {
            String str = "<a href=\"www.someurl.com\" id=\"url-id-43612\">Link Text - data here</a>";
            str = str.replaceAll("<a href=\"", "\n")
                .replaceAll("\" id=\"url-id-", "\n")
                .replaceAll("\">", "\n")
                .replaceAll("</a>", "\n");
            StringTokenizer tokenizer = new StringTokenizer(str, "\n");
            String url = tokenizer.nextToken();
            String id = tokenizer.nextToken();
            String text = tokenizer.nextToken();
            System.out.println(url);
            System.out.println(id);
            System.out.println(text);
        }
    }
    But I am looking for a more elegant solution to the problem; is there one?
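    For what it's worth, one common "more elegant" route is a regex with capture groups, so you can pull all three pieces out in a single pass. A rough sketch (only checked against your sample line; it assumes the href comes before the id, as in your example):
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    public class AnchorRegexDemo {
        public static void main(String[] args) {
            String str = "<a href=\"www.someurl.com\" id=\"url-id-43612\">Link Text - data here</a>";
            // One capture group each for the href value, the numeric part of the id,
            // and the link text.
            Pattern p = Pattern.compile("<a\\s+href=\"([^\"]*)\"\\s+id=\"url-id-([^\"]*)\">([^<]*)</a>");
            Matcher m = p.matcher(str);
            while (m.find()) {
                System.out.println(m.group(1)); // www.someurl.com
                System.out.println(m.group(2)); // 43612
                System.out.println(m.group(3)); // Link Text - data here
            }
        }
    }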

  • Extracting info from webpages

    I am trying to create a hotel program that has various features, including finding the cheapest hotel prices on the web. My program searches the web and returns the appropriate web page results in HTML format. My problem is I'm not sure of the best way to extract the information I want. Below is an example of a web page (I know it's long, but if you copy it into a web browser it does work, honest!). From this page I want to extract the hotel names and prices.
    http://www.bookings.org/searchresults.html?class_interval=2&country=gb&error_url=http%3A%2F%2Fwww.bookings.org%2Fcountry%2Fgb.html%3F&search_by=city&city=-2595386&region=Avon+Aberdeenshire&class_key=1&class=0&do_availability_check=on&checkin_monthday=21&checkin_year_month=2005-6&checkout_monthday=22&checkout_year_month=2005-6&newlangurl=%2Fcountry%2Fgb.en.html&x=77&y=14
    At present I can return the HTML of this page. However, can anyone suggest how I go about extracting the specific info I require? The HTML file is huge!
    Regards
    Ross

    Sorry, I pasted the wrong URL. This one should work.
    http://www.bookings.org/searchresults.html?class_interval=2&country=gb&error_url=http%3A%2F%2Fwww.bookings.org%2Fcountry%2Fgb.html%3F&search_by=city&city=-2595386&region=Avon+Aberdeenshire&class_key=1&class=0&do_availability_check=on&checkin_monthday=24&checkin_year_month=2005-3&checkout_monthday=27&checkout_year_month=2005-3&newlangurl=%2Fcountry%2Fgb.en.html&x=88&y=5
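    If you can use a third-party library, an HTML parser such as jsoup makes this kind of extraction much easier than hand-rolled string searching. A rough sketch; the CSS selectors are placeholders, since you would need to look at the actual page source to find the elements that wrap the hotel name and price:
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    public class HotelScraper {
        public static void main(String[] args) {
            String html = "<html>...</html>"; // the page source you already retrieved
            Document doc = Jsoup.parse(html);
            // Placeholder selectors: inspect the real page to find the tags/classes
            // that actually wrap each hotel row, its name, and its price.
            for (Element row : doc.select("td.hotel")) {
                String name = row.select("a.hotelname").text();
                String price = row.select("span.price").text();
                System.out.println(name + " - " + price);
            }
        }
    }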

  • Extracting URL From Application Object

    If I type http://mymachine.com/myapp?a=aaa&b=bbb in my browser (to access application "myapp") how do I extract just the http://mymachine.com/myapp part from the application object? I would have figured that the code below would allow me to construct that string but these methods instead return the stuff in comments...
    URL theURL = application.getResource("/");
    theURL.getProtocol(); // jndi
    theURL.getHost(); // null
    theURL.getPort(); // -1
    theURL.getFile(); // org.apache.naming.resources.FileDirContext@!8bf135
    URL theURL = application.getResource("/myapp");
    theURL.getProtocol(); // null pointer exception
    theURL.getHost(); // null pointer exception
    theURL.getPort(); // null pointer exception
    theURL.getFile(); // null pointer exception

    Tech note... HttpUtils is deprecated in newer Servlet APIs (2.3+). Instead, you would use the same methods on the HttpServletRequest:
    URL theURL = new URL(request.getRequestURL().toString());
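    For the original question (getting just the http://mymachine.com/myapp part), here is a minimal sketch that rebuilds it from the request itself rather than from application.getResource(); all of the calls are standard HttpServletRequest methods:
    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    public class BaseUrlServlet extends HttpServlet {
        protected void doGet(HttpServletRequest request, HttpServletResponse response) throws IOException {
            // Rebuild "http://mymachine.com/myapp" from the request.
            int port = request.getServerPort();
            boolean defaultPort = ("http".equals(request.getScheme()) && port == 80)
                    || ("https".equals(request.getScheme()) && port == 443);
            String base = request.getScheme() + "://" + request.getServerName()
                    + (defaultPort ? "" : ":" + port)
                    + request.getContextPath(); // e.g. "/myapp"
            response.getWriter().println(base);
        }
    }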

  • Extract URL from Mail message

    Hello Applescript-Meisters,
    I have the following problem:
    Every week we receive an email which contains a secure URL (https://), with a different username and password each time, to download a ".tar" file.
    Since I'm new to AppleScript, I tried to extract the paragraph of the email message containing the URL, but it doesn't seem to work:
    tell application "Mail"
    set theMessages to message 1 of mailbox "EXAMPLE" of account "AccountTest" whose subject begins with "Data request"
    set theContent to content of theMessages
    set someData to paragraphs of theContent
    repeat with someData in theContent
    if someData begins with "https://" then
    set theURL to paragraph of someData
    else
    set theURL to paragraph 18 of theContent -- this works but this line can change weekly
    end if
    end repeat
    end tell
    Any help or suggestions would be greatly appreciated.
    Cheers,
    gilles

    The most obvious problem is how you're trying to iterate through the paragraphs.
    You start off with:
    set theContent to content of theMessages
    which is fine - theContent now contains the text of the email.
    You then extract the paragraphs of that text:
    set someData to paragraphs of theContent
    But you then reset 'someData' to be the iterator in a loop:
    repeat with someData in theContent
    Additionally, you're iterating through theContent, which is a single block of text.
    I think what you mean to do is iterate through the paragraphs of theContent, like:
    repeat with someData in (get paragraphs of theContent)
    Once you have this, you know someData is the paragraph you're currently looking at so you can:
    set theURL to someData
    Finally, you do not want an 'else ... set theURL' statement, because that will reset theURL every time you find a line that doesn't begin with "https://" (e.g. if paragraph 10 starts with https:// you set theURL to that paragraph, but as soon as you move on to paragraph 11 you reset theURL because that paragraph doesn't begin with https://).
    To address this, you should just iterate through the paragraphs and only set theURL to paragraph 18 if you didn't find any matching lines.
    Putting it all together:
    set theURL to "" -- default value for theURL
    theContent to content of theMessage
    repeat with someData in (get paragraphs of theContent)
     if someData begins with "http://" then
      set theURL to someData
     end if
    end repeat
    if theURL = "" then
     try
      set theURL to paragraph 18 of theContent
     end try
    end if
    Note that I've included the 'paragraph 18' line in a try block. That's so that the script doesn't fail if there are less than 18 paragraphs in the message (you can't get paragraph 18 of a 17-paragraph document).

  • Extract URL from HTML text

    Suppose you have the following String that is body text with HTML.
    String bodyText = " My name is Blake. I live in New York City. See my image here: <img href="http://www.blake.com/blake.jpg"/> isn't my picture awesome? Tata for now!"
    I want to extract the URL that contains the location of the image in this bodyText. The ideal would be to create a function called public String extractor(String bodyText) to be used
    String imageURL = extractor(bodyText);
    //imageURL should be "http://www.blake.com/blake.jpg"
    My first thought is to use a regex, but the only place I can see to use one is the .replaceAll method in the String class. I'm by no means an expert on regexes, so I haven't taken much time to try to figure it out. I obviously could do a linear search of the bodyText with a ton of if statements, but that's just poor coding. I want to see if anyone has come across this or has insight into this problem.
    Thanks for all the help,
    Blake

    How would the regexp change if there were multiple img tags within the String?
    I don't rightly know, I'm just a raw beginner in regexes.
    Would this regexp return all the img URLs found in the String?
    No, as it stands it would return only the last URL. But this will:
    String bodyText = " My name is Blake. " +
          "I live in New York City. See my image here: " +
          "<img href=\"http://www.blake.com/blake.jpg\"/>" +
          " isn't my picture awesome? Here's another: " +
          "<img href='http://www.blake.com/Vandelay.jpg'/>" +
          " Tata for now!";
    String regex = "(?<=<img\\shref=[\"'])http://.*?(?=[\"']/?>)";
    Pattern pattern = Pattern.compile (regex);
    Matcher matcher = pattern.matcher (bodyText);
    while (matcher.find ()) {
       System.out.println (matcher.group ());
    }
    Note the enhancement that takes into account that both single and double quotes are legal in HTML. But unlike the earlier example, this does not tolerate more than one space between <img and href=; I couldn't find a way to achieve that.
    Visit this thread later, there are some real regex experts around who may post a better solution.
    db
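    On that last limitation: a Java lookbehind must have a bounded length, which is why something like \s+ can't go inside one. One way around it is to drop the lookbehind and capture the URL in a group instead; a rough sketch (class name is just for illustration):
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    public class ImgUrlExtractor {
        public static void main(String[] args) {
            String bodyText = "See my image here: <img   href=\"http://www.blake.com/blake.jpg\"/>"
                    + " and another: <img href='http://www.blake.com/Vandelay.jpg'/> Tata!";
            // \s+ tolerates any amount of whitespace between <img and href=;
            // the URL itself is capture group 1.
            Pattern pattern = Pattern.compile("<img\\s+href=[\"'](http://.*?)[\"']\\s*/?>");
            Matcher matcher = pattern.matcher(bodyText);
            while (matcher.find()) {
                System.out.println(matcher.group(1));
            }
        }
    }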

  • Automator's Link URLs from Webpages not finding links

    There are clearly <a href='s in the page I'm looking at but Automator's action simply will not return them.
    Is this a common problem?

    Yes you are correct, it's the space character—not the apostrophe. The trouble is that the source of your web page does not contain an escaped %20, it has merely printed literal space characters in the link tag. While Safari managed to interpret this correctly and convert the spaces to %20, Automator did not. As a test I made an html file containing:
    <a href="red blue">green</a>
    and Automator returned:
    {"file:///Volumes/hd2/test/red"}
    There may be a very long way around this bug. I'll think about it.

  • How to extract an mp3's url from m3u playlist (and download the file)?

    hello!
    I've managed to get Automator to retrieve (and save to my hard drive) an .m3u playlist from a radio station's website that broadcasts a program I like in mp3.
    Inside this m3u playlist is the link to an actual mp3... both the m3u and the mp3 change every week.
    If I drag the m3u to TextEdit it opens up and I can see the 3 lines of "source code", of which the last line is the full URL of the mp3 file (http://www.thatradiosurl.com/audio/shownumber.mp3)
    My question is as follows: how do I get Automator to extract that URL and save that file to my hard drive (so I can automatically add it to a playlist in iTunes and retrieve the show on my iPod at sync)?
    I'll save the final workflow as an iCal event that will hopefully grab this show every week.
    before you ask... no, that show is not available as a podcast! I wish it were... I think they're a bit slow on catching up with the tech.
    thanks to all!
    have a great day
    d
    pbg415"1.67   Mac OS X (10.4.6)  

    Hi!
    You might have seen this post in the Automator part of the forum... but I might have posted it in the wrong place (got no interest :( )
    I've managed to get Automator to retrieve (and save to hdd) an .m3u playlist from a radio's website that broadcasts a program I like in mp3.
    Inside this m3u playlist is the link to an actual mp3 file... both the m3u and the mp3 change every week.
    If I drag the m3u to TextEdit it opens up and I can see the 3 lines of "source code", of which the last line is the full URL of the mp3 file (http://www.thatradiosurl.com/audio/show-shownumber.mp3)
    Since you already have the playlist in a text file, you have two options:
    1. Use Automator's 'Get Link URLs from Webpages' action to extract the URL. 'Get Link URLs from Webpages' is under the Safari library.
    2. Run the Python script behind 'Get Link URLs from Webpages' from AppleScript. The script below will get your mp3 URLs:
    set playlist_file to "/path/to/playlist/file.txt" -- change to your actual playlist file path
    set _URLs to do shell script "/System/Library/Automator/Get\\ Link\\ URLs\\ from\\ Webpages.action/Contents/Resources/links file://" & quoted form of playlist_file
    set mp3_URLs to ""
    repeat with _URL in paragraphs of _URLs
    if _URL contains ".mp3" then set mp3_URLs to mp3_URLs & _URL & (ASCII character 10)
    end repeat
    mp3_URLs is the list you want to extract.

  • JWS Download Dialog's URL(From:) different from browser's URL

    Hi everyone,
    My problem:
    I have deployed application A.jar at 192.168.0.10 and B.jar at 192.168.0.20.
    They have always launched successfully.
    But sometimes when I enter 192.168.0.10 in the browser address bar, B.jar from 192.168.0.20 is launched (A.jar is not).
    When I clear the Java cache the problem goes away, but after a few days it happens again...
    I don't know how to solve it.
    Can anyone help me? Thanks.
    My English is poor; I hope you get my point. :)

  • Auto-extracting URL LINKS from webpage?

    Is there a way to extract all the linked URLs from a webpage?
    I am trying to put together some lists and would like to gather these up and then insert them into a Numbers spreadsheet. In some cases there are 30 to 50 of these listed on the left-hand side of a webpage, and right now I imagine clicking on a link, copying the URL from the browser, pasting it into Numbers, clicking back, clicking on the next link, etc.
    Is there a way to save some steps by exporting these to a text document and then copying and pasting from there?
    thanks for any help.

    You create the new bookmark with the javascript replacing the original URL from any old bookmark you have and no longer need, or just create a new bookmark for anything at all, specifically for your links script. Then once you are at a page where you want the links displayed, you just select your new links bookmark. Your page will stay open, but a new window will open as well, with all the links listed. You might want to change the size of the window; the width=400 and height=200 result in a pretty small window. Anyway, I put the Links javascript bookmarklet in my Bookmarks menu, so while I am here on this page I just go to the Bookmarks menu, slide down and select Links, and a new window opens with all the links on this page listed.
    Francine
    PS--It would be more impressive if I had written that script, but I didn't. I found it years ago and saved it with a batch of bookmarklets that do other useful things, such as suppressing animated gifs. I just checked, and a particularly useful collection, Jesse's Bookmarklets, still exists:
    https://www.squarefree.com/bookmarklets/

  • How to extract data from web URL

    I am working on a project which needs to extract data from web pages and then analyze that data. The question is how to extract the data: should I use an HTML parser? Need help, thanks a lot.

    I am working on a project which needs to extract data from web pages and then analyze that data. The question is how to extract the data: should I use an HTML parser?
    Try this:
    http://java.sun.com/docs/books/tutorial/networking/urls/readingURL.html
    Or, like you said yourself, use an HTML parser:
    http://java-source.net/open-source/html-parsers
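    If you don't want a third-party library, the JDK also ships a very basic HTML parser in javax.swing.text.html. A minimal sketch that prints every link (href) on a page; the URL is just an example, and for messy real-world pages one of the parsers from the second link will be more robust:
    import java.io.InputStreamReader;
    import java.net.URL;
    import javax.swing.text.MutableAttributeSet;
    import javax.swing.text.html.HTML;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;
    public class HrefLister {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://www.example.com/"); // example URL
            // Callback that fires for every start tag; we only care about <a href=...>.
            HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
                public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                    if (tag == HTML.Tag.A) {
                        Object href = attrs.getAttribute(HTML.Attribute.HREF);
                        if (href != null) {
                            System.out.println(href);
                        }
                    }
                }
            };
            new ParserDelegator().parse(new InputStreamReader(url.openStream()), callback, true);
        }
    }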

  • "Message from Webpage (error) There was an error in the browser while setting properties into the page HTML, possibly due to invalid URLs or other values. Please try again or use different property values."

    I created a site column at the root of my site and I have publishing turned on.  I selected the Hyperlink with formatting and constraints for publishing.
    I went to my subsite and added the column.  The request was to have "Open in new tab" for their hyperlinks.  I was able to get the column to be added and yesterday we added items without a problem. 
    The problem arose when, today, a user told me that he could not edit the hyperlink.  He has modify / delete permissions on this list.
    He would edit the item in a custom list, click on the address ("click to add a new hyperlink"), and then get the error below after successfully putting in the Selected URL (http://www.xxxxxx.com), the Open Link in New Window checkbox, the Display Text, and the Tooltip:
    "Message from Webpage  There was an error in the browser while setting properties into the page HTML, possibly due to invalid URLs or other values. Please try again or use different property values."
    We are on IE 9.0.8.1112 x86, Windows 7 SP1 Enterprise Edition x64
    The farm is running SharePoint 2010 SP2 Enterprise Edition August 2013 CU Mark 2, 14.0.7106.5002
    I saw another post, linked below, from someone who had a similar problem; an IISReset fixed it there, as it did for this problem. I wonder if this is resolved in the latest CU of SharePoint, the April 2014 CU?
    Summary from the link below: comment out the following line in AssetPickers.js:
    //callbackThis.VerifyAnchorElement(HtmlElement, Config);
    perform IISReset
    This is referenced in the item below:
    http://social.technet.microsoft.com/Forums/en-US/d51a3899-e8ea-475e-89e9-770db550c06e/message-from-webpage-error-there-was-an-error-in-the-browser-while-setting?forum=sharepointgeneralprevious
    This is possibly the same information that I saw, possibly using the above link as a reference.
    http://seanshares.com/post/69022029652/having-problems-with-sharepoint-publishing-links-after
    Again, if I update my SharePoint 2010 farm to April 2014 CU is this going to resolve the issue I have?
    I don't mind changing the JS file, however I'd like to know / see if there is anything official regarding this instead of my having to change files.
    Thank you!
    Matt

    We had the same issue after applying SP2 & the August CU. We opened a case with MSFT and got the same resolution you mentioned.
    I blogged about this issue, including the official reference.
    MSFT later released a hotfix for this on December 10, 2013, which I am 100% positive should be part of future CUs.
    So if you apply the April CU then you will be fine.
    Please remember to mark your question as answered and vote helpful if this solves/helps your problem. Thanks -WS MCITP (SharePoint 2010, 2013) Blog: http://wscheema.com/blog

  • I am looking to find out if it is possible in Adobe Muse to have a section on a webpage that will update every 30 seconds or 30 minutes with a snapshot URL From a live webcam stream

    I am looking to find out if it is possible in Adobe Muse to have a section on a webpage that will update every 30 seconds or 30 minutes with a snapshot URL From a live webcam stream

    You can do this via Insert HTML; you would need to use the live feed code on the Muse page.
    Thanks,
    Sanjit

  • Hide webpage link URL from browser status bar?

    Anyone know how to hide webpage link URL from the browser status bar with Dreamweaver CS4? So that when the mouse cursor hovers over a link on a webpage, the URL won't be shown in the browser status bar. This comes in useful for when I put my email as a link on a webpage but I don't want to let my email address be known prematurely. I prefer to do this through the Dreamweaver CS4 interface without coding, if possible. What is the easiest way to hide link URL like this? Thanks.

    More than the anal users. Many people use this information to decide whether they will click on the link. By hiding it, you remove that extremely valuable security information - for example, a link tells you on the screen that it is taking you to www.wachovia.com, but the status bar link tells you it's taking you to www.iamahackersiteandwilldrinkyourmilkshake.cn
    I don't think you want to remove that from your site, unless, of course, you are the webmaster for 'iamahackersiteandwilldrinkyourmilkshake.cn....'
    Murray --- ICQ 71997575
    Adobe Community Expert
    "joeq" <[email protected]> wrote in message
    news:gib1tf$kn7$[email protected]..
    > it can be done - Behaviors > Set Text > Set Text
    of Status Bar.
    >
    > do what you want, but know that it's somewhat unreliable
    (see the caveat
    > in
    > the DW Behavior box) and that some anal users will
    object to the masking
    > of
    > your links.
    >
    >
    >
