Regex for web page parsing

I am trying to parse the HTML output of a web page. I need to grab all the URLs that appear in href, action and similar attributes.
I am using the regex pattern
Pattern p = Pattern.compile("http.*?//\\S+[^\"]");
(basically, grab anything that starts with http until you hit a double quote and whitespace)
When I do a find using a Matcher, it gets me URLs like
href="http://www.abcd.com/" or
href="http://www.abcd.com/mypath" or
href="http://www.abcd.com/mypath?p1=v1&p2=v2"
But it doesn't work for something like
href="http://www.abcd.com"><img src="/abcd/my.gif"
Is there a good pattern to catch all of these variants of URLs?
Thanks

First off, your regex is wrong. After matching the double slash, it gobbles up as many non-whitespace characters as it can, then looks for one more character that is not a double quote. When the attribute value is followed by a space, the \S+ stops there (having already swallowed the closing quote), and the [^"] then matches the space. This at least gives you a usable result; all you have to do is strip off the quote and the space. But if the value is immediately followed by a closing angle bracket, the \S+ just keeps on gobbling. In your last example, it won't stop until it hits the space after the "<img", and now all you have is garbage.
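To see this behaviour concretely, here is a minimal sketch that runs the pattern from the question against two of the sample inputs. The class name OriginalPatternDemo and the helper firstMatch are made up for illustration; only the pattern itself comes from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern string is taken from the question.
final class OriginalPatternDemo {
    // The pattern from the question, unchanged.
    static final Pattern ORIGINAL = Pattern.compile("http.*?//\\S+[^\"]");

    // Returns the first match of the original pattern, or null if there is none.
    static String firstMatch(String html) {
        Matcher m = ORIGINAL.matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Value followed by a space: the match carries the closing quote and the space.
        System.out.println("[" + firstMatch("href=\"http://www.abcd.com/\" or") + "]");
        // Value followed by '>': \S+ runs on into the next tag before stopping at the space.
        System.out.println("[" + firstMatch("href=\"http://www.abcd.com\"><img src=\"/abcd/my.gif\"") + "]");
    }
}
```

The brackets in the printed output are only there to make the trailing space visible.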
Here's a better way to extract URLs:
  Pattern p = Pattern.compile(
    "(href|src|background)  # ...and whatever other attr's  \n" +
    "\\s*=\\s*                                              \n" +
    "(?:                                                    \n" +
    "   \"([^\"]+)\"  # anything in double quotes           \n" +
    " |               # or                                  \n" +
    "   \'([^\']+)\'  # anything in single quotes           \n" +
    " |               # or                                  \n" +
    "   ([^\\s>]+)    # any non-WS except closing bracket   \n" +
    ")", Pattern.COMMENTS);

Instead of looking for the URLs themselves, it looks for the names of attributes that should have URLs as values. The attribute name will be captured in group(1) and the URL will be in group(2), group(3) or group(4), depending on whether it was double-quoted, single-quoted or not quoted. If need be, you can verify that the value is indeed a URL using the java.net.URL class.

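As a rough sketch of how the attribute-based pattern might be used, the class below collects the captured values and checks each one with java.net.URL. The names UrlExtractor, extractUrls and isAbsoluteUrl are assumptions made for this example, not part of the answer above. Note that the one-argument URL constructor rejects relative values such as /abcd/my.gif, and recent JDKs mark that constructor as deprecated.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class illustrating the attribute-based approach.
final class UrlExtractor {
    // Same idea as the pattern above: match attribute names, capture their values.
    static final Pattern ATTR = Pattern.compile(
        "(href|src|background)  # attribute names expected to hold URLs \n" +
        "\\s*=\\s*                                                       \n" +
        "(?: \"([^\"]+)\"       # double-quoted value                    \n" +
        "  | '([^']+)'          # single-quoted value                    \n" +
        "  | ([^\\s>]+)         # unquoted value                         \n" +
        ")", Pattern.COMMENTS);

    // Returns every attribute value found, in document order.
    static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = ATTR.matcher(html);
        while (m.find()) {
            // Only one of groups 2, 3 and 4 participates in any given match.
            String value = m.group(2) != null ? m.group(2)
                         : m.group(3) != null ? m.group(3)
                         : m.group(4);
            urls.add(value);
        }
        return urls;
    }

    // True if java.net.URL accepts the value; relative paths are rejected.
    static boolean isAbsoluteUrl(String value) {
        try {
            new URL(value);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.abcd.com\"><img src=\"/abcd/my.gif\"></a>";
        for (String value : extractUrls(html)) {
            System.out.println(value + " absolute=" + isAbsoluteUrl(value));
        }
    }
}
```

Relative values would need to be resolved against the page's own address before they can be fetched; that step is left out here.
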
Similar Messages

  • I'm having problems with 8.1 and Continuity / Handoff. It will work fine for web pages, etc. but in email when I try to do it between my iPhone 5s running 8.1 and my MacBook Pro running Yosemite I consistently get an error.

    I'm having problems with 8.1 and Continuity / Handoff. It will work fine for web pages, etc. but in email when I try to do it between my iPhone 5s running 8.1 and my MacBook Pro running Yosemite I consistently get an error: "Failed to Continue Activity" Cocoa Error 4609. Handoff is working for phone calls and text messages, but email just crashes each time. It was also doing it under 8.0.2. My iPhone and iPad handle this fine. It's only the MacBook to the iPhone that fails, and only on email.

    Handoff Continuity Troubleshooting

  • How do I fix my printer when it won't show the print preview for web pages? It worked and now not.

    How do I fix my printer when it won't show the print preview for web pages? It worked for a while and now it doesn't. The printer is an HP Officejet 7310 All-in-One.

    I would suspect this is a hardware issue. The rollers are probably having issues picking up the relatively smooth thick media. You might have better results by cleaning the paper pickup rollers with a damp paper towel. Also make sure the paper is snugly loaded and the paper guides have been correctly positioned.
    Regards,
    Bob Headrick, MS MVP Printing/Imaging
    Bob Headrick,  HP Expert
    I am not an employee of HP, I am a volunteer posting here on my own time.

  • How can I see visitor statistics for web page hosted on OS X Lion Server

    Hello
    How can I see visitor statistics for a web page hosted on OS X Lion Server?
    Thanks
    Adrian

    Just click inside the URL address bar. The full URL will appear, highlighted.
    Best.

  • Using Java for Web Pages

    Is there any way to completely avoid HTML for web pages? Can Java be used entirely for sites?

    (X)HTML cannot be completely avoided; it is the only markup language that web browsers interpret to display websites. If you want to avoid writing plain vanilla HTML and scriptlets and want more interaction with Java, consider an MVC framework like JSF or Struts.

  • Remove details for web page object types in search results

    Within the search results, the first 200 characters are displayed for documents, web pages, etc. Web pages show the HTML code. Is there a way to hide these details for web pages but have them visible for all other object types?

    The solution can be found in this thread: http://discussions.apple.com/thread.jspa?threadID=2456976
    Download Secrets at http://secrets.blacktree.com/
    Install the prefspane, select Safari in the left hand list, and uncheck "Use new URL completion list".
    You need to log out and log in again for it to work.

  • The latest version of Firefox has totally stuffed my computer. Nothing now works properly and all my settings for web pages have totally disappeared.

    Latest version of Firefox has totally stuffed my computer. Nothing works properly and all settings for web pages have disappeared.

    Make sure that you allow pages to choose their colors and that you haven't enabled High Contrast in the Accessibility settings.
    *Tools > Options > Content : Fonts & Colors > Colors : [X] "Allow pages to choose their own colors, instead of my selections above"
    *http://kb.mozillazine.org/Website_colors_are_wrong
    *https://support.mozilla.org/kb/Websites+look+wrong

  • Resizing pixs for web page use

    I have resized the photos to use on MS Frontpage but when the thumbnails are clicked on to be enlarged in the web page they are still too large. How do I make photos usable for web page use?

    What version of Elements are you using? How are you now going about creating the thumbnails and larger images? What size are you making the large ones? I suspect all you need to do is change the width and height parameters to something smaller.

  • Fireworks or Indesign for web page layout?

    Hi, I've currently ordered CS3 Web Premium and it won't ship until tomorrow, and I don't know whether I should cancel it and get the Design Premium instead.
    It all depends on whether Fireworks is a must-have for web page design. Does it offer anything invaluable that Photoshop CS3 won't already cover? I saw the tour for InDesign and was impressed by how you can create amazing page layouts and export them to XHTML.
    I probably won't be using Adobe Contribute, as I prefer making my own WYSIWYG editor in ASP.NET for my web pages, because I don't want to buy a new license of Contribute for every web page I make for a different client. So the choice of whether I should leave Web Premium and get Design Premium rests on whether Fireworks is more useful for web page design and whether creating web page layouts with InDesign is worth paying the extra money for (I don't mind using it to create computer software manuals as well).
    I appreciate any input on this matter. :)

    Excellent clarifications, DWD.
    Jim Babbage - .:Community MX:. & .:Adobe Community Expert:.
    Extending Knowledge, Daily
    http://www.communityMX.com/
    CommunityMX - Free Resources:
    http://www.communitymx.com/free.cfm
    .:Adobe Community Expert for Fireworks:.
    news://forums.macromedia.com/macromedia.fireworks
    news://forums.macromedia.com/macromedia.dreamweaver
    Deaf Web Designer wrote:
    > Hi TheStrangerSome,
    >
    > If I may chime in and add to Jim Babbage's advice or suggestion.
    >
    > Please note that there is no such thing as a "WYSIWYG" editor. It is more accurate to call it an HTML editing app. This part of the discussion, about web authoring apps, should really be addressed in the general discussion forums for either Dreamweaver or Contribute.
    >
    > As for "changing the order" from Web Premium to Design Premium, it is BEST if you talk to the Adobe customer service department immediately. Though, I believe that their business hours are based on Pacific Standard Time, Monday through Friday. Please refer to Contact, which appears at the top of every Adobe web page, including this forum. If you scroll your browser up to the top of the page, you'll notice the term "Contact". Use that link for further details.
    >
    > I think you might be mistaken about the term "layout" as used in InDesign or Fireworks. Please note that InDesign is specifically designed for high-end professional-quality print layout such as magazine articles, newspapers and PDF documentation (exported as PDF) straight out of the InDesign application. That might confuse you or lead you into thinking that InDesign is ideally a layout design tool for the web. Actually, it is not.
    >
    > Perhaps you should stick to the fundamentals of Fireworks or Photoshop, use them for the layout structure and then transfer the result into Dreamweaver CS3. Please try to keep this as simple as you can when you produce a layout design in Fireworks (or Photoshop). If you are good at HTML and CSS editing, then all you have to do is put the image files into Dreamweaver and code away by linking or referencing the image files in the HTML.
    >
    > By the sound of it you are a bit confused or don't understand the primary difference between Fireworks, Photoshop and InDesign, so you ought to keep it simple and take it easy. Otherwise you will probably feel overwhelmed by all of these powerful applications without knowing their primary functionality or main purposes.
    >
    > So, hopefully this helps you better understand and try to figure this out yourself, because you know what's best for you.
    >
    > Hope that helps, no?
    >
    > P.S. Please note that Adobe forums are run on a "volunteer basis", which means they are not monitored or run by full-time Adobe staff. Although, there are times that some people from Adobe chime in and offer some tips, suggestions or pointers. In other words, there are thousands of forum participants, like you, helping each other. Hopefully that helps, too.

  • Since installing Firefox, none of my desktop icons for web page shortcuts work. Why?

    None of my desktop icon short cuts for web pages work since I installed firefox.
    Why?
    Can you help me with this?

    You can check for issues with the Windows icon cache and try to rebuild the icon cache.
    # Open the Task Manager (Shift+Ctrl+ESC)
    # In the Process tab, right-click on the Explorer.exe process and select End Process.
    # Open the file picker via "File > New Task (Run)" and click the Browse button.
    # Type or Paste %USERPROFILE%\AppData\Local (%LocalAppData%) in the File name field (AppData is a hidden folder).
    # Select the IconCache.db file and use "Delete" in the right-click context menu to delete the file.
    # After the IconCache.db file has been deleted, start a new explorer.exe process via "File > New Task" to get the desktop and Taskbar back.
    The IconCache.db file is a hidden file, so make sure that you can see hidden files.
    * http://kb.mozillazine.org/Show_hidden_files_and_folders

  • What happened to the add to home page option for web pages. Can't find it.

    What happened to the add to home page option for web pages. Can't find it.

    Try stopping Safari followed by resetting your phone: double-tap the home button, locate Safari at the bottom (swiping if necessary), tap and hold until it wiggles, tap the minus sign. Now reset: hold the on/off and home buttons until you see the Apple logo (ignore the off slider that appears first), then release. If it is still missing you may need to restore your phone, which will put a fresh copy of iOS on your phone (see http://support.apple.com/kb/HT1414).

  • Is it possible to specify print settings for web pages?

    Is it possible to specify print settings for web pages designed in Muse? For example, can you exclude background images from printing?

    Hello,
    You can define print settings for a page by adding a CSS style in the page header, inside a style tag.
    To do that, go to Page > Page Properties > Metadata > HTML for <Head>.
    You can refer to the link below for more details about print properties in CSS.
    http://www.smashingmagazine.com/2011/11/24/how-to-set-up-a-print-style-sheet/
    Regards
    Vivek

  • How can I add a "search" field for web pages content-not blogs or podcasts?

    This seems to be such a basic function, I can't believe I'm having so much trouble. I don't have a blog or a podcast on my new website that I'm in the process of designing. So how do I allow people who'll visit my site to search there for specified content? The only instructions I find are for the RSS in inspector for blogs or podcasts. All I want is for people to be able to search my site/web pages (that have no blogs or podcasts on them). This must be a common request... or am I crazy? How do I do that?
    Thanks for your help.

    I think I've answered my own question after a few hours of searching:
    http://services.google.com/searchcode2.html?accept=on
    Thank you Google.

  • CSS Flexbox - a new layout method for web pages

    Hi,
    As you may all know, CSS flexbox is a new layout method for laying out your web page, and with Firefox releasing version 22 within the next few days (it should have been yesterday), it is no longer hidden behind browser settings. Add to this the fact that all desktop browsers except IE10 and Safari 5 & 6 no longer require the use of a vendor prefix, and that http://html5please.com/ simply says ’USE’, and flexbox is now finally a viable layout alternative, especially for responsive layouts.
    Flexbox has been available on all Android and iOS devices (and any device using the WebKit engine, e.g. new BlackBerry and Kindle Fire devices) since they were first released.
    To help find problems, and provide help in using flexbox, we would be grateful if you would experiment and provide feedback in the use of flexbox for layouts.
    To help you get started a video tutorial, (with files to download) and examples of layouts, (more to be added next week) and tips & tricks to help with some of the more common problems have been provided at http://flexboxlayouts.com/. This site will also be updated to provide you with a list of problems and hopefully the solutions to those problems, so that you will have a ’one stop’ reference site for using flexbox.
    If you have any tips in using flexbox, or a flexbox layout that you would like added to the site, then you can use the ’Submit’ email address on the http://flexboxlayouts.com/ site.
    For more info on the css flexbox specifications see -
    http://www.w3.org/TR/2012/CR-css3-flexbox-20120918/
    or if you just want to know what property is supported by which browser/device -
    http://flexboxlayouts.com/pdfs/flexbox%20browser%20Properties%20.pdf
    Note to regular contributors & moderators:
    As I only have time to visit and help in the forum for a few hours per week, should you find any flexbox problems unanswered then please let me know via email.
    PZ

    Hi Al
    Yes I know, and this is why I am asking people to experiment with flexbox.
    The fix for IE10 is to give the right hand sidebar, (or the content) a larger flex-grow property, the bug is caused by the flex shorthand property not recognizing % values, and using the flex-basis as a set width instead of a preferred size.
    I have logged the bug, (and many others) with the various browser bug bases.
    This and other bugs, do have 'fixes', so I hope users will experiment with flexbox and provide feedback.
    Flexbox is no different than any other css feature, "we can only find the problems in actual use".
    Strangely enough, I have found the most consistent to use is the old 2009 implementation on mobile devices, (no doubt iOS7 will change all that ).
    PZ

  • Not enough memory on adobe acrobat 9 for web page capture

    I use Adobe Acrobat 9 Pro to create a PDF with the web page capture function.
    After a while (more than about 500 pages) the free RAM becomes insufficient and the program stops.
    I am running Windows 7 with 4 GB of RAM.
    When I look at memory utilisation I see that Acrobat uses more than 1.7 GB of RAM.
    During capture the RAM usage keeps increasing, and I don't see anything being saved to disk.
    Does anyone have an idea?

    The .rpt file size is 14MB with the Data Save option enabled, 12MB without Data Save.  Presumably the 12MB file size is because of the 24bit PNG we have as our background.
    The Designer executes the report in less than a second and we can scroll through all pages and see the image fields perfectly.
    When we Export to PDF, the Designer takes a long time, eventually gets to the 77%, the 7th record and returns "Export report failed" followed by "Memory full".  If we export only page 1 of the 3 pages, it also returns a Memory full error.  However, when the same report is run with only 1 page, that page exports to PDF but with a ridiculously large size and export time.
    The machine has 2GB of physical memory with an 8GB pagefile with Windows 2003 (latest everything).  The process runs up to about 1GB before reporting the memory full error.
    We've also tried a variety of other suggestions posted in the other thread with no success.
    We're happy to provide the RPT file to the Report Team to diagnose the problem.  Ultimately, we need to be able to produce a 15 page report with approximately 45 images.
    Our preferred scenario is fixing problem 2.  The CR Designer seems quite capable of rendering our report and printing it to our third party PDF printer in a timely manner with small size.  However, the API reports memory full.
    The API resides in a dedicated reporting web service with NO other code except for loading the report, setting parameters and printing.  When executing, it uses up to about 1.1GB before reporting the error.
    Are there any other suggestions for fixing what we have?  Are there known problems with large images in reports?  Do we need to lodge a formal support request?
    Regards,  Grant.
    PS.  Grr and my message formatting is lost when I edited this message!!!
    There is a 1500 character limit and then all formatting is removed to save space. Break your posts into separate entries.
    Edited by: grantph on Sep 30, 2009 2:49 AM
