How to download a complete web page?
Can anyone help me with code for downloading a web page, given the URL address? I can download the page, but the problem is it doesn't download the associated images, JavaScript files etc., nor does it create an associated folder as one might expect when saving a page using a browser.
Below is the code snippet -
URL url = new URL(address);
URLConnection conn = url.openConnection();
InputStream in = conn.getInputStream();
OutputStream out = new BufferedOutputStream(
        new FileOutputStream(localFileName));
byte[] buffer = new byte[1024];
int numRead;
long numWritten = 0;
// Copy the response body to disk in 1 KB chunks.
while ((numRead = in.read(buffer)) != -1) {
    out.write(buffer, 0, numRead);
    numWritten += numRead;
}
System.out.println(localFileName + "\t" + numWritten);
out.close();
in.close();
javaflex wrote:
I don't think a web crawler would work - a web crawler simply takes every link or URL on the given address and digs into it...
Would it work for JavaScript? Given a URL like xyz.com/a.html:
1. the code above would download the plain HTML.
2. parse the HTML to find JavaScript files and images (anything else I need to look at?)
3. download those
4. put everything in one folder (but the question is, do I then need to rename the pointers in the downloaded HTML to point at the other contents on the disk?)
This is a naive approach - anything better?
thanks.
More advanced web crawlers parse the JavaScript source files (or embedded JS sources inside HTML files) and try to execute the script in order to find new links. So the answer is: yes, some crawlers do. I know for a fact that Heritrix can do this quite well, but it is a rather "large" crawler and can take a while to get to work with. But it really is one of the best (if not the best) open source Java web crawlers around.
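The parse step (2) in the list above can be sketched with a quick regex pass over the downloaded HTML. This is only an illustration - attribute quoting varies in real pages, so a proper crawler would use an HTML parser (jsoup, for example), and the `ResourceFinder` class name here is just a placeholder:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResourceFinder {

    // Matches src="..." and href="..." attributes (images, scripts, stylesheets).
    // Fragile by design: a real crawler should use an HTML parser instead.
    private static final Pattern RESOURCE =
            Pattern.compile("(?:src|href)\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    public static List<String> extractResourceUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = RESOURCE.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<html><head><script src=\"a.js\"></script></head>"
                + "<body><img src=\"logo.png\"></body></html>";
        System.out.println(extractResourceUrls(html));  // [a.js, logo.png]
    }
}
```

Each extracted URL would then be resolved against the page URL (java.net.URI.resolve) and downloaded with the same stream-copy code as above; rewriting the pointers in the saved HTML to relative local paths is what makes the saved copy render offline.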
Similar Messages
-
Where to download WPC [Web Page Composer] and how to install it?
Hi Experts,
I need to download Web Page Composer and install it for use in my company. Can anyone help me with where to get it and how to install it?
thanks
Suresh
Hi,
Check the SAP Note Number: [1080110|https://www.sdn.sap.com/irj/servlet/prt/portal/prtroot/com.sap.km.cm.docs/oss_notes/sdn_oss_ep_km/~form/handler]
Also some links that may help you:
https://www.sdn.sap.com/irj/sdn/go/portal/prtroot/docs/library/uuid/d07b5354-c058-2a10-a98d-a23f775808a6
There are also lots of documents available on SDN, so just use SDN search.
Regards,
Praveen Gudapati -
As the question states, I downloaded a web page, but before I put it on a memory stick I changed the options in my Firefox and the file is no longer there.
There is a shortcut in the 'recently changed' folder in Windows Explorer, but when I click on it, it says the file has moved or no longer exists.
Is there any way to retrieve this, as the web page no longer exists?
Try:
*Extended Copy Menu (fix version): https://addons.mozilla.org/firefox/addon/extended-copy-menu-fix-vers/ -
When I copy something from a web page, how can I then paste it, and where to - e.g. Pages/Messages etc. on my iPad?
You can paste it into any word processing program you have - Evernote as stated, the Notes app, even into e-mail if you want.
-
For further info, I have Safari as my Mac browser for the internet, and although I click on it, Firefox takes over when I wish to download a web page. It saves two files - a folder, in which all kinds of massive-like data will appear, and a file with some of the information I wish to save from a page or two.
I've tried to select only the page I wish to save in several ways. However, nothing has worked until I take Firefox out of "Applications" and put it in the trash basket.
Ideally, I'd like Safari to be my default browser, and I'd like Firefox to be my secondary browser.
Please advise.
Arline Ostolaza
[email protected]
Did you try to save the page as "Web Page, HTML only" instead of "Web Page, complete", if that is what you want?
-
Can we download a web page using SOA?
Hi,
We have a requirement to download a web page, whether in HTML or XML format, by using SOA middleware. Is it possible? Has anyone tried this earlier? Any suggestions regarding this will be of great help.
Thanks,
SOA Team
Hello Iyanu_IO,
Thank you for answering my question "can I download a web page as a PDF".
I downloaded and installed the add-on and it works really well.
Regards
Paul_Abrahams -
Why is the speed of the MacBook Air to download a web page lower than the iPad 2?
Even worse, why can the iPad 2 sometimes load web pages when the MBA can't? -
How can I download a web page's text only?
Hello All
I have some code (below) which can download the HTML of a web page, and this is usually good enough for me. However, sometimes it would be easier to work with if I could download just the text of a given page. And sometimes the text that appears on the given page is not even available in the HTML source - I guess because it is the result of some script or other in the HTML rather than being defined by the HTML itself.
What I would like to know is: is there any way I can download the "final text" of a web page rather than its HTML source, even if that text is generated by a script in the HTML? As far as I can see, the only way to do this is to load the page in a
browser control and then get the document text from that - but that's far from ideal.
And I dare say somebody out there knows better!
Anyway, here's my existing code:
Public Function downloadWebPage(ByVal url As String) As String
    Dim txt, ua As String
    Try
        Dim myWebClient As New WebClient()
        ' Spoof a browser user agent so servers don't reject the request.
        ua = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)"
        myWebClient.Headers.Add(HttpRequestHeader.UserAgent, ua)
        myWebClient.Headers("Accept") = "*/*"
        Dim myStream As Stream = myWebClient.OpenRead(url)
        Dim sr As New StreamReader(myStream)
        txt = sr.ReadToEnd()
        myStream.Close()
        Return txt
    Catch ex As Exception
        ' Errors are swallowed; Nothing is returned instead.
    End Try
    Return Nothing
End Function
A WebBrowser's DocumentText looks like the example below. The OuterText of each HTML element does not necessarily contain all of the text displayed in a WebBrowser document, nor does the InnerText or anything else in the document's source necessarily contain all of the displayed text - although I suppose the source tells the browser where to display something from.
Therefore you would require screen scraping the image displayed in a WebBrowser control in order to collect all of the characters displayed in the image of the WebBrowser document.
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>What's My User Agent?</title>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta name="author" content="Bruce Horst" />
<link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" />
<link href='http://fonts.googleapis.com/css?family=Raleway:400,700' rel='stylesheet' type='text/css'>
<link href="//maxcdn.bootstrapcdn.com/font-awesome/4.2.0/css/font-awesome.min.css" rel="stylesheet">
<script src="/js/mainjs?v=Ux43t_hse2TjAozL2sHVsnZmOvskdsIEYFzyzeubMjE1"></script>
<link href="/css/stylecss?v=XWSktfeyFOlaSsdgigf1JDf3zbthc_eTQFU5lbKu2Rs1" rel="stylesheet"/>
</head>
<body>
Code for OuterHTML using WebBrowser
Option Strict On
Public Class Form1
WithEvents WebBrowser1 As New WebBrowser
Dim WBDocText As New List(Of String)
Private Sub Form1_Load(sender As Object, e As EventArgs) Handles MyBase.Load
Me.Location = New Point(CInt((Screen.PrimaryScreen.WorkingArea.Width / 2) - (Me.Width / 2)), CInt((Screen.PrimaryScreen.WorkingArea.Height / 2) - (Me.Height / 2)))
Button1.Anchor = AnchorStyles.Top
RichTextBox1.Anchor = CType(AnchorStyles.Bottom + AnchorStyles.Left + AnchorStyles.Right + AnchorStyles.Top, AnchorStyles)
End Sub
Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
WebBrowser1.Navigate("http://www.whatsmyuseragent.com/", Nothing, Nothing, "User-Agent: Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko)")
End Sub
Private Sub WebBrowser1_DocumentCompleted(sender As Object, e As WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
RichTextBox1.Clear()
WBDocText.Clear()
Dim First As Boolean = True
If WebBrowser1.IsBusy = False Then
For Each Item As HtmlElement In WebBrowser1.Document.All
Try
If First = True Then
First = False
WBDocText.Add(Item.OuterText)
Else
Dim Contains As Boolean = False
For i = WBDocText.Count - 1 To 0 Step -1
If WBDocText(i) = Item.OuterText Then
Contains = True
End If
Next
If Contains = False Then
WBDocText.Add(Item.OuterText)
End If
End If
Catch ex As Exception
End Try
Next
If WBDocText.Count > 0 Then
For Each Item In WBDocText
Try
RichTextBox1.AppendText(Item)
Catch ex As Exception
End Try
Next
End If
End If
End Sub
End Class
Some text returned
What's My User Agent?
Your User Agent String is: Analyze
Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko)
(adsbygoogle = window.adsbygoogle || []).push({}); Your IP Address is:
207.109.140.1
Client Information:
JavaScript Enabled:Yes
Cookies Enabled:Yes
Device Pixel Ratio:1DevicePixelRation.com
Screen Resolution:1366 px x 768 pxWhatsMyScreenResolution.com
Browser Window:250 px x 250 px
Local Time:9:48 am
Time Zone:-7 hours
Recent User Agents Visiting this Page:
You!! Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko)
Mozilla/5.0 (compatible; MSIE 10.0; Windows Phone 8.0; Trident/6.0; IEMobile/10.0; ARM; Touch; NOKIA; Lumia 520)
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36
Mozilla/5.0 (compatible; MSIE 10.0; Windows Phone 8.0; Trident/6.0; IEMobile/10.0; ARM; Touch; NOKIA; Lumia 630)
Mozilla/5.0 (compatible; MSIE 10.0; Windows Phone 8.0; Trident/6.0; IEMobile/10.0; ARM; Touch; NOKIA; Lumia 630)
curl/7.35.0
curl/7.35.0
curl/7.35.0
curl/7.35.0
<!-- google_ad_client = "ca-pub-0399362207612216"; /* WMUA LargeRect */ google_ad_slot = "5054480278"; google_ad_width = 336; google_ad_height = 280; //-->
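Before resorting to full screen scraping, note that for pages whose text is already present in the static HTML, a rough tag-stripping pass is often enough. It cannot see script-generated text (the limitation discussed above), and a real implementation should use an HTML parser; this is only a minimal sketch of the idea, written in Java:

```java
public class HtmlText {

    // Strips <script>/<style> blocks, then all remaining tags,
    // then collapses whitespace. A rough sketch only: it cannot see
    // text that JavaScript generates at runtime.
    public static String extractText(String html) {
        String noScripts = html
                .replaceAll("(?is)<script.*?</script>", " ")
                .replaceAll("(?is)<style.*?</style>", " ");
        String noTags = noScripts.replaceAll("(?s)<[^>]+>", " ");
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><script>var x=1;</script></head>"
                + "<body><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(html));  // Hello world
    }
}
```

The same regex approach translates directly to VB.NET via System.Text.RegularExpressions.Regex.Replace if you want it inside the downloadWebPage function above.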
La vida loca -
How to download a web page intact as one file, which can be easily done with Safari
On Safari all I do with a complex web page is 'Save As' (whatever I wish, or its existing description); that shows as a complete web page in a single file. Firefox saves it as a file PLUS a folder of elements. That takes up twice my desktop real estate and makes going back too complex. Chrome has a sort of fix which now only works occasionally. I am probably overlooking some kind of add-on, but have yet to find it. Any thoughts?
Safari saves the web pages in the ''web archive'' format. Basically, it rolls every element of the web page into a single file. It provides the convenience of having everything in a single file but it may not necessarily mean that it saves space. Note that this format is not very portable - other browsers like Internet Explorer cannot open this web archive file.
Since Firefox saves the web page and its associated elements separately, it can be opened in any other browser. To allow saving web pages in web archive format from Firefox, you can try this add-on: [https://addons.mozilla.org/en-US/firefox/addon/mozilla-archive-format/ Mozilla Archive Format]. -
Hi to all,
I've got a stupid question that I hope someone can answer for me. I have a QT movie of 80 MB (6 minutes approx.) that I want to put on my web page with a right-click, download-to-my-computer option for the viewers of my page. How do I do that? I am not using QT Pro. I would also like to know if there is a way to make viewing this film on the site smoother, because at the moment it's taking ages for it to download to the site (half an hour).
Thanks
You could place a "text" link (complete URL) and instruct your visitors to first "copy" the link (a click may, or may not, open the file in its own window).
The second part of your instructions would have your visitors open their QuickTime Player application. Even the "free" QT Player allows users to open a URL. If they press Command (Control on a PC) and then hit the letter U, a dialog window will open. This is where they "paste" the copied URL.
Your file will open inside a new QuickTime Player window.
Your viewers could also "right-click" and choose "Save linked file as..."
This would also download the file (bypassing the QT plug-in).
80 MB for a 6-minute file is a bit large. QuickTime Pro can re-compress your source (or the 80 MB version) to make it a bit smaller. Adjust the codec, dimensions and audio to make a more "Web Friendly" file size. -
Every time I go to save a web page or image, Firefox insists on re-downloading the page/image, which consumes a great deal of time. Chrome (and my other browsers) seem to save the item directly from the cache or memory, which is MUCH FASTER! Is there an option to make Firefox save more rapidly, ie directly from cache or memory, so I don't have to sit (and swear and fume) while the file is being RE-downloaded??
600+ MB does sound like a lot, but that depends on what you are using as your homepage, and how much of that 600 is being used by the plugin container - Flash and other plugins. Also, using that amount might have to do with how much RAM you have installed - the more RAM installed, the more Firefox allows itself to use. If you have AVG 2012 installed and the AVG Advisor is giving you that message, see this:
http://blogs.avg.com/community/avg-feedback-update-week-44/
See - 3. AVG Advisor = Disable AVG Advisor performance notifications
or try using the "Change when these notifications appear" option and set a higher threshold -
How do I search for text on Safari web pages?
How do I do a text search on a Safari web page? Please help.
When you're on the page you want to search, tap the URL bar to turn it blue. Tap the x so it disappears. Type the word you want to search for; at the bottom of the suggestions is "Find on Page" - tap that. Your highlighted text should appear.
-
OS is Windows 2000. The problem is with websites in general. Pictures are frequently missing or blank unless the download is attempted more than once. Ads are frequently downloaded as blank areas. No problems downloading text-only sites. PDF file downloads are sometimes stopped.
When web pages are printed using HP software, sometimes the printing is "padded" before and after with blank pages, or the range of pages is incomplete, both in the print preview and the printed results. This does not happen when printing from the IE browsers supported by the OS.
It is possible that your security software (firewall, anti-virus) blocks or restricts Firefox or the plugin-container process without informing you, possibly after detecting changes (an update) to the Firefox program.
Remove all rules for Firefox and the plugin-container from the permissions list in the firewall and let your firewall ask again for permission to get full, unrestricted, access to internet for Firefox and the plugin-container process and the updater process.
See:
*https://support.mozilla.org/kb/Server+not+found
*https://support.mozilla.org/kb/Firewalls
*https://support.mozilla.org/kb/fix-problems-connecting-websites-after-updating
Do a malware check with several malware scanning programs on the Windows computer.
Please scan with all programs because each program detects different malware.
All these programs have free versions.
Make sure that you update each program to get the latest version of their databases before doing a scan.
*Malwarebytes' Anti-Malware: http://www.malwarebytes.org/mbam.php
*AdwCleaner: http://www.bleepingcomputer.com/download/adwcleaner/ or http://www.softpedia.com/get/Antivirus/Removal-Tools/AdwCleaner.shtml
*SuperAntispyware: http://www.superantispyware.com/
*Microsoft Safety Scanner: http://www.microsoft.com/security/scanner/en-us/default.aspx
*Windows Defender: http://windows.microsoft.com/en-us/windows/using-defender
*Spybot Search & Destroy: http://www.safer-networking.org/en/index.html
*Kaspersky Free Security Scan: http://www.kaspersky.com/security-scan
You can also do a check for a rootkit infection with TDSSKiller.
*Anti-rootkit utility TDSSKiller: http://support.kaspersky.com/5350?el=88446
See also:
*"Spyware on Windows": http://kb.mozillazine.org/Popups_not_blocked -
When OS X Lion enlarges a web page how do I get it back to normal size?
How do I get the web pages back to normal size when Lion expands them? I don't even know what I am doing with my magic mouse to expand the web pages.
Barney-15E wrote:
If you double-click with two fingers, the section of the web page you clicked on will expand to fill screen. Double-click again with two fingers to unzoom.
That is how it works on a Trackpad; I assume it is the same on the Magic Mouse.
On the Magic Mouse it is a one-finger double-tap. -
I completely uninstalled the previous version of Firefox and just installed Firefox 13.0.1. It doesn't display any fonts, neither on web pages nor within the Firefox menu structure.
Try to disable hardware acceleration in Firefox.
*Tools > Options > Advanced > General > Browsing: "Use hardware acceleration when available"
*https://hacks.mozilla.org/2010/09/hardware-acceleration/
*https://support.mozilla.org/kb/how-do-i-upgrade-my-graphics-drivers