Download a web page, how to?

Can anyone help me with code for downloading a web page, given its URL address? I can download the page, but the problem is it doesn't download the associated images, JavaScript files etc., nor does it create an associated folder as one might expect when saving a page with a browser.
Below is the code snippet -
                    URL url = new URL(address);
                    out = new BufferedOutputStream(
                         new FileOutputStream(localFileName));
                    conn = url.openConnection();
                    in = conn.getInputStream();
                    byte[] buffer = new byte[1024];
                    int numRead;
                    long numWritten = 0;
                    // copy the raw bytes of the page into the local file
                    while ((numRead = in.read(buffer)) != -1) {
                         out.write(buffer, 0, numRead);
                         numWritten += numRead;
                    }
                    System.out.println(localFileName + "\t" + numWritten);
                    in.close();
                    out.close();

javaflex wrote:
I don't think a web crawler would work.
A web crawler simply takes every link or URL on the given page and digs into it.
Would it work for JavaScript? Given a URL like xyz.com/a.html,
1. the code above would download the plain HTML.
2. parse the HTML to find JavaScript files, images (anything else I need to look at?)
3. download those
4. put everything in one folder (but the question is: do I then need to rename the pointers in the downloaded HTML to point at the other content on the disk?)
This is a naive approach (see the sketch below) - anything better?
thanks.

More advanced web crawlers parse the JavaScript source files (or embedded JS sources inside HTML files) and (try to) execute the script in order to find new links. So the answer is: yes, some crawlers do. I know for a fact that Heritrix can do this quite well, but it is a rather "large" crawler and can take a while to get to work with. But it really is one of the best (if not the best) open source Java web crawlers around.
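
To make the naive approach above concrete, here is a minimal sketch in Java. It assumes the jsoup library (https://jsoup.org) is available for the HTML parsing step; jsoup, the class name NaivePageSaver and the helper guessExtension are illustrative choices, not part of the original post. The sketch downloads the page, finds <img> and <script> references, saves each one next to the HTML, and rewrites the pointers so the saved copy works offline:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    import java.io.InputStream;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class NaivePageSaver {

        public static void savePage(String address, Path outDir) throws Exception {
            Files.createDirectories(outDir);

            // 1. download and parse the plain HTML
            Document doc = Jsoup.connect(address).get();

            // 2. find referenced images and scripts
            int counter = 0;
            for (Element el : doc.select("img[src], script[src]")) {
                String absUrl = el.absUrl("src");
                if (absUrl.isEmpty()) continue;

                // 3. download each resource into the same folder
                String localName = "res" + (counter++) + guessExtension(absUrl);
                try (InputStream in = new URL(absUrl).openStream()) {
                    Files.copy(in, outDir.resolve(localName));
                }

                // 4. rewrite the pointer so the saved HTML uses the local copy
                el.attr("src", localName);
            }

            Files.write(outDir.resolve("index.html"),
                        doc.outerHtml().getBytes(StandardCharsets.UTF_8));
        }

        // crude guess at a file extension from the URL, purely for readable file names
        private static String guessExtension(String url) {
            String path = url.split("[?#]")[0];   // drop query string / fragment
            int dot = path.lastIndexOf('.');
            return (dot > path.lastIndexOf('/')) ? path.substring(dot) : "";
        }
    }

CSS files, <link> references, error handling and resources pulled in by scripts are deliberately left out; that is exactly the ground a crawler like Heritrix covers much more thoroughly.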

Similar Messages

  • Where to Download WPC [ Web Page Composer ] and How to install it ?

    Hi Experts,
    I need to download the Web Page Composer and install it for our use in my company. Can anyone help me
    with where to get it and how to install it?
    thanks
    Suresh

    Hi,
    Check the SAP Note Number: [1080110 |https://www.sdn.sap.com/irj/servlet/prt/portal/prtroot/com.sap.km.cm.docs/oss_notes/sdn_oss_ep_km/~form/handler]
    Also some links that may help you:
    https://www.sdn.sap.com/irj/sdn/go/portal/prtroot/docs/library/uuid/d07b5354-c058-2a10-a98d-a23f775808a6
    There are also lots of documents available on SDN, so just use SDN search.
    Regards,
    Praveen Gudapati

  • I recently downloaded a web page but then I deleted my history, which deleted the web page. There is still a shortcut to it but it says the file does not exist. Can I retrieve this, as the web page no longer exists?

    As the question states, I downloaded a web page, but before I put it on a memory stick I changed the options in my Firefox and the file is no longer there.
    There is a shortcut in the 'recently changed' folder in Windows Explorer, but when I click on it it says the file has moved or no longer exists.
    Is there any way to retrieve this, as the web page no longer exists?

    Try:
    *Extended Copy Menu (fix version): https://addons.mozilla.org/firefox/addon/extended-copy-menu-fix-vers/

  • When I copy something from a web page, how can I then paste it, and where to, i.e. Pages/Messages etc. on my iPad?


    You can paste it into any word processing program you have, Evernote as stated, the Notes app, even into e-mail if you want.

  • When downloading a Web page, Firefox includes ALL information on the site, plus the Web page. I am forced to eliminate Firefox and return to Safari for saving an individual page.

    For further info, I have Safari as my Mac provider for the internet, and although I click on it, Firefox takes over when I wish to download a Web page. It saves two files--a folder, in which all kinds of massive-like data will appear, and a file with some of the information I wish to save from a page or two.
    I've tried to select only the page I wish to save in several ways. However, nothing has worked until I take Firefox out of "Applications" and put it in the trash basket.
    Ideally, I'd like Safari to be my default provider, and I'd like Firefox to be my secondary provider.
    Please advise.
    Arline Ostolaza

    Did you try to save the page as "Web Page, HTML only" instead of "Web Page, complete", if that is what you want ?

  • Can we download a web page using SOA?

    Hi,
    We have a requirement to download a web page, whether in HTML or XML format, by using SOA middleware. Is it possible with this? Has anyone tried this earlier? Any suggestions regarding this will be of great help.
    Thanks,
    SOA Team

    Hello Iyanu_IO,
    Thank you for answering my question "can I download a web page as a PDF".
    I downloaded and installed the add-on and it works really well.
    Regards
    Paul_Abrahams

  • Why is the MacBook Air slower to download a web page than the iPad 2? Even worse, sometimes the iPad 2 can load web pages but the MBA can't.

    Why is the MacBook Air slower to download a web page than the iPad 2?
    Even worse, why can the iPad 2 sometimes load web pages when the MBA can't?


  • How can I download a web page's text only?

    Hello All
    I have some code (below) which can download the HTML of a web page, and this is usually good enough for me. However, sometimes it would just be easier to work with if I could download just the text of a given page. And sometimes the text that appears on the given page is not even available in the HTML source, I guess because it is the result of some script or other in the HTML rather than being defined by the HTML itself.
    What I would like to know is: is there any way I can download the "final text" of a web page rather than its HTML source? Even if that text is generated by a script in the HTML? As far as I can see, the only way to do this is to load the page in a browser control and then get the document text from that - but that's far from ideal.
    And I dare say somebody out there knows better!
    Anyway, here's my existing code:
    Public Function downloadWebPage(ByVal url As String) As String
        Dim txt, ua As String
        Try
            Dim myWebClient As New WebClient()
            ua = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)"
            myWebClient.Headers.Add(HttpRequestHeader.UserAgent, ua)
            myWebClient.Headers("Accept") = "/"
            Dim myStream As Stream = myWebClient.OpenRead(url)
            Dim sr As New StreamReader(myStream)
            txt = sr.ReadToEnd()
            myStream.Close()
            Return txt
        Catch ex As Exception
        End Try
        Return Nothing
    End Function

    A WebBrowser's DocumentText looks like below. The OuterText of each HTML element does not necessarily contain all of the text displayed in a WebBrowser Document. Nor does the HTML InnerText or anything in the document's code necessarily contain all of the text displayed in a WebBrowser Document, although I suppose the code tells where to display something from somewhere.
    Therefore you would need to screen-scrape the image displayed in a WebBrowser control in order to collect all of the characters displayed in the image of the WebBrowser Document.
    <!DOCTYPE html>
    <html>
    <head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>What&#39;s My User Agent?</title>
    <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <meta name="author" content="Bruce Horst" />
    <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" />
    <link href='http://fonts.googleapis.com/css?family=Raleway:400,700' rel='stylesheet' type='text/css'>
    <link href="//maxcdn.bootstrapcdn.com/font-awesome/4.2.0/css/font-awesome.min.css" rel="stylesheet">
    <script src="/js/mainjs?v=Ux43t_hse2TjAozL2sHVsnZmOvskdsIEYFzyzeubMjE1"></script>
    <link href="/css/stylecss?v=XWSktfeyFOlaSsdgigf1JDf3zbthc_eTQFU5lbKu2Rs1" rel="stylesheet"/>
    </head>
    <body>
    Code for OuterHTML using WebBrowser
    Option Strict On

    Public Class Form1

        WithEvents WebBrowser1 As New WebBrowser
        Dim WBDocText As New List(Of String)

        Private Sub Form1_Load(sender As Object, e As EventArgs) Handles MyBase.Load
            Me.Location = New Point(CInt((Screen.PrimaryScreen.WorkingArea.Width / 2) - (Me.Width / 2)), CInt((Screen.PrimaryScreen.WorkingArea.Height / 2) - (Me.Height / 2)))
            Button1.Anchor = AnchorStyles.Top
            RichTextBox1.Anchor = CType(AnchorStyles.Bottom + AnchorStyles.Left + AnchorStyles.Right + AnchorStyles.Top, AnchorStyles)
        End Sub

        Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
            WebBrowser1.Navigate("http://www.whatsmyuseragent.com/", Nothing, Nothing, "User-Agent: Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko)")
        End Sub

        Private Sub WebBrowser1_DocumentCompleted(sender As Object, e As WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
            RichTextBox1.Clear()
            WBDocText.Clear()
            Dim First As Boolean = True
            If WebBrowser1.IsBusy = False Then
                For Each Item As HtmlElement In WebBrowser1.Document.All
                    Try
                        If First = True Then
                            First = False
                            WBDocText.Add(Item.OuterText)
                        Else
                            Dim Contains As Boolean = False
                            For i = WBDocText.Count - 1 To 0 Step -1
                                If WBDocText(i) = Item.OuterText Then
                                    Contains = True
                                End If
                            Next
                            If Contains = False Then
                                WBDocText.Add(Item.OuterText)
                            End If
                        End If
                    Catch ex As Exception
                    End Try
                Next
                If WBDocText.Count > 0 Then
                    For Each Item In WBDocText
                        Try
                            RichTextBox1.AppendText(Item)
                        Catch ex As Exception
                        End Try
                    Next
                End If
            End If
        End Sub

    End Class
    Some text returned
    What's My User Agent?
    Your User Agent String is: Analyze
    Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko)
    (adsbygoogle = window.adsbygoogle || []).push({}); Your IP Address is:
    207.109.140.1
    Client Information:
    JavaScript Enabled:Yes
    Cookies Enabled:Yes
    Device Pixel Ratio:1DevicePixelRation.com
    Screen Resolution:1366 px x 768 pxWhatsMyScreenResolution.com
    Browser Window:250 px x 250 px
    Local Time:9:48 am
    Time Zone:-7 hours
    Recent User Agents Visiting this Page:
    You!! Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko)
    Mozilla/5.0 (compatible; MSIE 10.0; Windows Phone 8.0; Trident/6.0; IEMobile/10.0; ARM; Touch; NOKIA; Lumia 520)
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36
    Mozilla/5.0 (compatible; MSIE 10.0; Windows Phone 8.0; Trident/6.0; IEMobile/10.0; ARM; Touch; NOKIA; Lumia 630)
    Mozilla/5.0 (compatible; MSIE 10.0; Windows Phone 8.0; Trident/6.0; IEMobile/10.0; ARM; Touch; NOKIA; Lumia 630)
    curl/7.35.0
    curl/7.35.0
    curl/7.35.0
    curl/7.35.0
    <!-- google_ad_client = "ca-pub-0399362207612216"; /* WMUA LargeRect */ google_ad_slot = "5054480278"; google_ad_width = 336; google_ad_height = 280; //-->
    La vida loca

  • How to download a web page intact as one file, which can be easily done with Safari

    On Safari, all I do with a complex web page is 'Save As' (whatever I wish, or its existing description). That shows as a complete web page in a single file. Firefox saves it as a file PLUS a folder of elements. That takes up twice my desktop real estate, and makes going back too complex. Chrome has a sort of fix which now only works occasionally. I am probably overlooking some kind of add-on, but have yet to find it. Any thoughts?

    Safari saves the web pages in the "web archive" format. Basically, it rolls every element of the web page into a single file. It provides the convenience of having everything in a single file but it may not necessarily mean that it saves space. Note that this format is not very portable - other browsers like Internet Explorer cannot open this web archive file.
    Since Firefox saves the web page and its associated elements separately, it can be opened in any other browser. To allow saving web pages in web archive format from Firefox, you can try this add-on: [https://addons.mozilla.org/en-US/firefox/addon/mozilla-archive-format/ Mozilla Archive Format].

  • Qt downloading for web page

    hi to all
    I've got a stupid question that I hope someone can answer for me. I have a QT movie of 80 MB (6 minutes approx.) that I want to put on my web page, with the option of "right-click, download to my computer" for the viewers of my page. How do I do that? I am not using QT Pro. I would also like to know if there is a way to make the viewing of this film on the site smoother, because at the moment it is taking ages for it to download on the site (half an hour).
    thanks

    You could place a "text" link (complete URL) and instruct your visitors to first "copy" the link (a click may, or may not, open the file in its own window).
    The second part of your instructions would have your visitors open their QuickTime Player application. Even the "free" QT Player allows users to open a URL. If they press Command (Control on a PC) and then hit the letter U key, a dialog window will open. This is where they "paste" the copied URL.
    Your file will open inside a new QuickTime Player window.
    Your viewers could also "right-click" and choose "Save linked file as..."
    This would also download the file (bypassing the QT plug-in).
    80 MB for a 6-minute file is a bit large. QuickTime Pro can re-compress your source (or the 80 MB version) to make it a bit smaller. Adjust the codec, dimensions and audio to make a more "Web Friendly" file size.

  • Why does Firefox re-download a web page that has already appeared in my browser? VERY time-consuming! Chrome just saves from the cache!

    Every time I go to save a web page or image, Firefox insists on re-downloading the page/image, which consumes a great deal of time. Chrome (and my other browsers) seem to save the item directly from the cache or memory, which is MUCH FASTER! Is there an option to make Firefox save more rapidly, ie directly from cache or memory, so I don't have to sit (and swear and fume) while the file is being RE-downloaded??

    600+ MB does sound like a lot, but that depends on what you are using as your homepage, and how much of that 600 is being used by the plugin container - Flash and other plugins. Also, using that amount might have to do with how much RAM you have installed - the more RAM installed, the more Firefox allows itself to use. If you have AVG 2012 installed and the AVG Advisor is giving you that message, see this:
    http://blogs.avg.com/community/avg-feedback-update-week-44/
    See - 3. AVG Advisor = Disable AVG Advisor performance notifications
    or try using the "Change when these notifications appear" option and set a higher threshold

  • How do I search for text on Safari web pages?

    How do I do a text search on a Safari web page? Please help

    When you're on the page you want to search, tap the URL bar to turn it blue. Tap the x so it disappears. Type the word you want to search for; at the bottom of the suggestions is "find on page", tap that. Your highlighted text should appear.

  • Firefox frequently has problems completely downloading web pages. It often takes several clicks of the "Save As" button to save more than a few bytes or KB of a web page.

    OS is Windows 2000. Problem is with websites in general. Pictures are frequently missing or are blank unless the download is attempted more than once. Ads are frequently downloaded as blank areas. No problems downloading text-only sites. PDF file downloads are sometimes stopped.
    When web pages are printed using HP software, sometimes the printing is "padded" before and after with blank pages, or the range of pages is incomplete both in the printing review and the printed results. This does not happen when printing from the IE browsers supported by the OS.

    It is possible that your security software (firewall, anti-virus) blocks or restricts Firefox or the plugin-container process without informing you, possibly after detecting changes (update) to the Firefox program.
    Remove all rules for Firefox and the plugin-container from the permissions list in the firewall and let your firewall ask again for permission to get full, unrestricted, access to internet for Firefox and the plugin-container process and the updater process.
    See:
    *https://support.mozilla.org/kb/Server+not+found
    *https://support.mozilla.org/kb/Firewalls
    *https://support.mozilla.org/kb/fix-problems-connecting-websites-after-updating
    Do a malware check with several malware scanning programs on the Windows computer.
    Please scan with all programs because each program detects different malware.
    All these programs have free versions.
    Make sure that you update each program to get the latest version of their databases before doing a scan.
    *Malwarebytes' Anti-Malware: http://www.malwarebytes.org/mbam.php
    *AdwCleaner: http://www.bleepingcomputer.com/download/adwcleaner/ and http://www.softpedia.com/get/Antivirus/Removal-Tools/AdwCleaner.shtml
    *SuperAntispyware: http://www.superantispyware.com/
    *Microsoft Safety Scanner: http://www.microsoft.com/security/scanner/en-us/default.aspx
    *Windows Defender: http://windows.microsoft.com/en-us/windows/using-defender
    *Spybot Search & Destroy: http://www.safer-networking.org/en/index.html
    *Kaspersky Free Security Scan: http://www.kaspersky.com/security-scan
    You can also do a check for a rootkit infection with TDSSKiller.
    *Anti-rootkit utility TDSSKiller: http://support.kaspersky.com/5350?el=88446
    See also:
    *"Spyware on Windows": http://kb.mozillazine.org/Popups_not_blocked

  • When OS X Lion enlarges a web page how do I get it back to normal size?

    How do I get the web pages back to normal size when Lion expands them? I don't even know what I am doing with my magic mouse to expand the web pages.

    Barney-15E wrote:
    If you double-click with two fingers, the section of the web page you clicked on will expand to fill screen. Double-click again with two fingers to unzoom.
    That is how it works on a Trackpad; I assume it is the same on the Magic Mouse.
    On the Magic Mouse it is a one-finger double-tap.

  • Fresh install of Firefox not displaying any fonts in the menus or web pages, how do I fix this?

    I completely uninstalled the previous version of Firefox and just installed Firefox 13.0.1. It doesn't display any fonts, not on web pages and not within the Firefox menu structure.

    Try to disable hardware acceleration in Firefox.
    *Tools > Options > Advanced > General > Browsing: "Use hardware acceleration when available"
    *https://hacks.mozilla.org/2010/09/hardware-acceleration/
    *https://support.mozilla.org/kb/how-do-i-upgrade-my-graphics-drivers
