How can I download a web page's text only?

Hello all,
I have some code (below) that downloads the HTML of a web page, and this is usually good enough for me. However, sometimes it would be easier to work with just the text of a given page. And sometimes the text that appears on the page is not even present in the HTML source, I guess because it is generated by a script in the page rather than being defined by the HTML itself.
What I would like to know is: is there any way to download the "final text" of a web page rather than its HTML source, even if that text is generated by a script? As far as I can see, the only way to do this is to load the page in a browser control and then get the document text from that, but that's far from ideal.
I dare say somebody out there knows better!
Anyway, here's my existing code:
Imports System.IO
Imports System.Net

Public Function downloadWebPage(ByVal url As String) As String
    Try
        ' WebClient, Stream and StreamReader are disposable, so wrap them in Using blocks.
        Using myWebClient As New WebClient()
            Dim ua As String = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)"
            myWebClient.Headers.Add(HttpRequestHeader.UserAgent, ua)
            ' Accept any content type.
            myWebClient.Headers("Accept") = "*/*"
            Using myStream As Stream = myWebClient.OpenRead(url)
                Using sr As New StreamReader(myStream)
                    Return sr.ReadToEnd()
                End Using
            End Using
        End Using
    Catch ex As Exception
        ' Any download error is swallowed and reported to the caller as Nothing.
        Return Nothing
    End Try
End Function
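
For pages whose text is already present in the HTML source, one rough way to get "text only" from the string returned by downloadWebPage is to strip the markup. The following is a minimal sketch, not a full HTML parser: the module name HtmlTextSketch and the function HtmlToPlainText are hypothetical names introduced here for illustration, and the regular expressions are an assumption that works on reasonably well-formed pages only. It cannot recover text that a script generates in the browser.

```vbnet
Imports System.Net
Imports System.Text.RegularExpressions

Module HtmlTextSketch
    ' Hypothetical helper (not part of the original post): reduces static HTML to rough plain text.
    Public Function HtmlToPlainText(ByVal html As String) As String
        If String.IsNullOrEmpty(html) Then Return String.Empty
        Dim opts As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline
        ' Drop script and style blocks together with their contents.
        Dim text As String = Regex.Replace(html, "<(script|style)\b[^>]*>.*?</\1\s*>", " ", opts)
        ' Drop HTML comments.
        text = Regex.Replace(text, "<!--.*?-->", " ", opts)
        ' Replace every remaining tag with a space so adjacent words stay separated.
        text = Regex.Replace(text, "<[^>]+>", " ", opts)
        ' Turn entities such as &#39; and &amp; back into characters.
        text = WebUtility.HtmlDecode(text)
        ' Collapse runs of whitespace into single spaces.
        Return Regex.Replace(text, "\s+", " ").Trim()
    End Function
End Module
```

It could be called as HtmlToPlainText(downloadWebPage(url)). Because HtmlToPlainText returns an empty string for Nothing, a failed download simply yields no text.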

A WebBrowser's DocumentText looks like the listing below. The OuterText of each HTML element does not necessarily contain all of the text displayed in a WebBrowser document, and neither does the InnerText or anything else in the document's code, although I suppose the code tells the browser where to display something from somewhere.
Therefore you would need to screen-scrape the image displayed in a WebBrowser control in order to collect all of the characters displayed in the WebBrowser document.
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>What&#39;s My User Agent?</title>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta name="author" content="Bruce Horst" />
<link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" />
<link href='http://fonts.googleapis.com/css?family=Raleway:400,700' rel='stylesheet' type='text/css'>
<link href="//maxcdn.bootstrapcdn.com/font-awesome/4.2.0/css/font-awesome.min.css" rel="stylesheet">
<script src="/js/mainjs?v=Ux43t_hse2TjAozL2sHVsnZmOvskdsIEYFzyzeubMjE1"></script>
<link href="/css/stylecss?v=XWSktfeyFOlaSsdgigf1JDf3zbthc_eTQFU5lbKu2Rs1" rel="stylesheet"/>
</head>
<body>
Code for collecting each element's OuterText using WebBrowser
Option Strict On
Public Class Form1
    WithEvents WebBrowser1 As New WebBrowser
    Dim WBDocText As New List(Of String)

    Private Sub Form1_Load(sender As Object, e As EventArgs) Handles MyBase.Load
        ' Centre the form on the primary screen and let the controls follow the form size.
        Me.Location = New Point(CInt((Screen.PrimaryScreen.WorkingArea.Width / 2) - (Me.Width / 2)), CInt((Screen.PrimaryScreen.WorkingArea.Height / 2) - (Me.Height / 2)))
        Button1.Anchor = AnchorStyles.Top
        RichTextBox1.Anchor = CType(AnchorStyles.Bottom + AnchorStyles.Left + AnchorStyles.Right + AnchorStyles.Top, AnchorStyles)
    End Sub

    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
        ' Load the page, sending a custom User-Agent header.
        WebBrowser1.Navigate("http://www.whatsmyuseragent.com/", Nothing, Nothing, "User-Agent: Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko)")
    End Sub

    Private Sub WebBrowser1_DocumentCompleted(sender As Object, e As WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
        RichTextBox1.Clear()
        WBDocText.Clear()
        Dim First As Boolean = True
        If WebBrowser1.IsBusy = False Then
            ' Collect the OuterText of every element, skipping text that was already collected.
            For Each Item As HtmlElement In WebBrowser1.Document.All
                Try
                    If First = True Then
                        First = False
                        WBDocText.Add(Item.OuterText)
                    Else
                        Dim Contains As Boolean = False
                        For i = WBDocText.Count - 1 To 0 Step -1
                            If WBDocText(i) = Item.OuterText Then
                                Contains = True
                            End If
                        Next
                        If Contains = False Then
                            WBDocText.Add(Item.OuterText)
                        End If
                    End If
                Catch ex As Exception
                    ' Elements whose text cannot be read are skipped.
                End Try
            Next
            ' Show the collected text.
            If WBDocText.Count > 0 Then
                For Each Item In WBDocText
                    Try
                        RichTextBox1.AppendText(Item)
                    Catch ex As Exception
                    End Try
                Next
            End If
        End If
    End Sub
End Class
Some text returned
What's My User Agent?
Your User Agent String is: Analyze
Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko)
(adsbygoogle = window.adsbygoogle || []).push({}); Your IP Address is:
207.109.140.1
Client Information:
JavaScript Enabled:Yes
Cookies Enabled:Yes
Device Pixel Ratio:1DevicePixelRation.com
Screen Resolution:1366 px x 768 pxWhatsMyScreenResolution.com
Browser Window:250 px x 250 px
Local Time:9:48 am
Time Zone:-7 hours
Recent User Agents Visiting this Page:
You!! Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko)
Mozilla/5.0 (compatible; MSIE 10.0; Windows Phone 8.0; Trident/6.0; IEMobile/10.0; ARM; Touch; NOKIA; Lumia 520)
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36
Mozilla/5.0 (compatible; MSIE 10.0; Windows Phone 8.0; Trident/6.0; IEMobile/10.0; ARM; Touch; NOKIA; Lumia 630)
Mozilla/5.0 (compatible; MSIE 10.0; Windows Phone 8.0; Trident/6.0; IEMobile/10.0; ARM; Touch; NOKIA; Lumia 630)
curl/7.35.0
curl/7.35.0
curl/7.35.0
curl/7.35.0
<!-- google_ad_client = "ca-pub-0399362207612216"; /* WMUA LargeRect */ google_ad_slot = "5054480278"; google_ad_width = 336; google_ad_height = 280; //-->
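
As a possibly simpler variant of the element-by-element loop above, the text of the whole rendered page can be read in one call from Document.Body.InnerText once the page has loaded. This is only a sketch under assumptions: the class name RenderedTextForm and its fields are hypothetical, the form builds its own controls instead of using the designer, and text that scripts insert after DocumentCompleted fires may still be missing.

```vbnet
Imports System
Imports System.Windows.Forms

' Hypothetical form (not from the original post) that shows the rendered text of a page.
Public Class RenderedTextForm
    Inherits Form

    Private WithEvents Browser As New WebBrowser()
    Private ReadOnly Output As New RichTextBox()

    Public Sub New()
        Output.Dock = DockStyle.Fill
        Browser.Dock = DockStyle.Top
        Browser.Height = 200
        Browser.ScriptErrorsSuppressed = True
        ' Add the Fill-docked control first so it takes the space left over by the browser.
        Controls.Add(Output)
        Controls.Add(Browser)
    End Sub

    Protected Overrides Sub OnLoad(e As EventArgs)
        MyBase.OnLoad(e)
        Browser.Navigate("http://www.whatsmyuseragent.com/")
    End Sub

    Private Sub Browser_DocumentCompleted(sender As Object, e As WebBrowserDocumentCompletedEventArgs) Handles Browser.DocumentCompleted
        ' DocumentCompleted also fires for frames; only react to the top-level page.
        If Not e.Url.Equals(Browser.Url) Then Return
        If Browser.Document Is Nothing OrElse Browser.Document.Body Is Nothing Then Return
        ' InnerText of the body is the text of the parsed page without any markup.
        Dim pageText As String = Browser.Document.Body.InnerText
        Output.Text = If(pageText, String.Empty)
    End Sub
End Class
```

The form can be shown with Application.Run(New RenderedTextForm()) from a Windows Forms application's Main method.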

Similar Messages

  • How can I download my web pages back into IWeb

    I have created a site with about 10 pages photos everything it is still on web. problem is the local files are not on my MacPro, lost them somehow. Is there a way to get them back from the web?

    No unfortunately not, as iWeb has no import facility and so is unable to open any html or css files/an already published site.
    If you really don't have access to your domain.sites files, which is what you need and which can be found under your User/Library/Application Support/iWeb/domain.sites, then you are looking at re-building your site from scratch.
    You can use copy and paste to re-create it, but you are still looking at re-building if you don't have that all important domain.sites file.
    Also, if you have to re-build, then consider using different software instead of iWeb, as Apple no longer actively supports or develops it.
    There are various alternatives such as RapidWeaver, Sandvox, Freeway Express/Pro, Flux 4, WebAcappella 4 and Quick n Easy Website Builder as well as EasyWeb that is in development by Rage Software and Adobe Muse.
    You can download free trials of most of these so that you can try before you buy.

  • How can I print a web page to a hard copy with Adobe?

    How can I print a web page to a hard copy with Adobe? When I print this particular page in the usual way it prints only the text of that page, not the photo that's also part of that page.

    I have Adobe SendNow, but I can buy another Adobe product if it will solve my web page printing problem.

  • How can I "expire" a web page (prevent BACK button access)

    How can I "expire" a web page ?
    I know some sites will display the following message if we click on the
    BACK button in your browser.
    Warning: Page has Expired The page you requested was created using
    information you submitted in a form. This page is no longer available. As a
    security precaution, Internet Explorer does not automatically resubmit your
    information for you.
    To resubmit your information and view this Web page, click the Refresh
    button.
    How do I implement this feature??

    Do you want to disable the back button?
    If so, I think this will do it:
    javascript:window.history.forward(1);

  • Can we download a web page using SOA?

    Hi,
    We have a requirement to download a web page, in either HTML or XML format, using SOA middleware. Is this possible? Has anyone tried this before? Any suggestions would be a great help.
    Thanks,
    SOA Team

    Hello Iyanu_IO,
    Thank you for answering my question "can I download a web page as a PDF".
    I downloaded and installed the add-on and it works really well.
    Regards
    Paul_Abrahams

  • HT4972 How can I zoom the web page only vertically or horizontally, i.e., when I want to read the web page, can I enlarge the page so that only the font increases in size?

    How can I zoom the web page only vertically or horizontally, i.e., when I want to read the web page, can I enlarge the page so that only the font increases in size?

    You can't.  (Fonts increase in size like everything else, not in height alone).

  • How can I upload my web page to the internet?

    How can I upload  my web page to the internet. When I click on "publish" it automatically takes me to mobileme - which is no longer accepting members..

    http://financialbin.com/2011/06/13/steve-jobs-confirms-discontinuation-of-iweb-in-icloud-transition/
    I don't think you can upload pages from iWeb anymore.
    Information about the MobileMe transition

  • How can I open a web page that is not compatible with safari?

    Hi all
    I get the following message when I try to go to a certain web site:
    http://www.sia.homeoffice.gov.uk/Pages/licensing-rolh.aspx click on the "GO TO THE REGISTER" button
    Please note: clicking the button above will transfer you to our secure server, which will open in a new window. You must be using Internet Explorer 6, 7 or 8, or Netscape 7.02 for the new site to display properly. You can return to this site by closing the new window.
    I have looked at Internet Explorer and Netscape and both seem to be out of date with our IOS.
    How can I access this page?

    Try downloading another browser, Firefox for example. Open the link in Firefox and once the page opens copy the link and paste it in safari address bar. That works for me.
    Still trying to find out why we have to do this though since it works just fine in Safari once you had it opened in another browser.
    Hope that helps you!

  • How do you download a web page to a folder on the desktop without having all of the other rubbish with it

    When looking at web sites associated with Ancestry information, how can I save the information I find to a folder on my desktop without getting 3 or 4 extra files containing information that I do not want or need? The information I want is in a second file, which is great, but when I delete the 3 or 4 files I do not want, it also removes the file that I do want. I use Windows 7.

    Hi bjmw,
    What method are you using to download the web page? Are you right-clicking on the page and choosing "Save as..", or using an alternative method? The 3 or 4 files you are getting are likely the style sheets associated with the website; without them the website would be nearly impossible to read.
    Thanks,
    Scott

  • How can I save a web page without creating a dozen or more tiny files, which is worse than Internet Explorer?

    When I try to save a Firefox web page, I get about a dozen tiny files with funny names cluttering up my documents folder, and making it hard to find anything. This reminds me of problems I used to have with Internet Explorer more than ten years ago. But at least, IE was kind enough to gather all of these files in a single folder, named "files". Firefox dumps them all into my documents folder. Messy.
    Years ago, I discovered Opera, which offered a file format that combined the original html code with the other files into one file: an internet archive, with the extension mht instead of htm. I used Opera for years, but recently became aware of the advantages of Firefox.
    A few days ago, I downloaded the latest version of Firefox, and installed it. I have been evaluating it ever since. It looks good; I have figured out (more or less) how bookmarks work, and I'm getting used to the taskbars.
    But when I try to save a page, I still get a deluge of tiny files cluttering up the target folder. Firefox doesn't save pages in archive form, but only in the old scattered form. Of course, I could simply save only the html. Maybe that's the best thing, but it loses a lot.
    The chaos of junky files cluttering up my target folder is enough to send me back to Opera, despite its limitations.

    This extension allows web pages to be saved in MHT format.
    https://addons.mozilla.org/en-US/firefox/addon/8051

  • How can I save the web page by using the Page Title as the file name ?

    When i use Internet Explorer ,I can save the web page by using Page Title as the file name(as default no need to adjust anything). But when i use Firefox ,It can not use the Page Title to save as the file name. How can I do that like in Internet Explorer?

    See:
    *File Title: https://addons.mozilla.org/firefox/addon/834
    *Title Save: https://addons.mozilla.org/firefox/addon/712

  • How can I fit a web page to the screen?

    How I can open a web page and have this fit on the screen on my mac. every program I open displays on the top the whole page but only half screen.

    Hi z,
    In the upper left hand corner of the window, click the green button. Or you can "grab" the bottom right corner of the window and drag it to whatever size you want.

  • How can I HIDE a web Page From the Index above?

    I sometimes would like to directly link one page to another instead of listing every single page in the INDEX. For instance, I have a "Renovations" page with two other renovation pages following.....Can this be done from page to page with Hyperlinks ONLY? Is there a way to "hide" the titles in the Index listings above each page so they don't show with the web pages that I DO WANT listed?
    G5   Mac OS X (10.4.6)  

    Welcome twincrks
    You can use Inspector, Page Tab and uncheck "Include
    Page in navigation menu", this will stop the page
    being displayed in the navigation bar at the top of
    the screen.
    You can then manually create links to your other
    pages (see www.willg4pb.com - Hyperlinks for
    creating hyperlinks). Then, again using Inspector,
    this time the Link Tab, choose to Link To: One of My
    Pages, and choose the page you wish to link to from
    the drop down list.
    Will
    AWESOME! I just knew that there was a genius out there to help me. Thank you Will! Kathy

  • How can I get my web page to center in IE?

    My web page aligns left in IE, but not in other browsers.  Any suggestions?   http://www.theolderuggedcross.com/

    Hello,
    you can write here, so it should not be this: "... some reason i could not create e new post." and you should see this:
    If not you should ask Adobe. Particularly well I found that Adobe now also offers the chat. Here the link:
    http://www.adobe.com/support/acrobatdotcom/supportinfo/.
    Another "help-line" you will find here: http://helpx.adobe.com/contact/.
    Hans-Günter

  • How can I email a web page. The older version would enable you to email an entire page contents, but Mavericks does not

    When I try to email a Safari web page with Mavericks, some times it does not allow me to send the actual page and its contents. Instead it offers
    you to email as a web page, link or PDF file. However, when you send the page, the images do not show up in the email.

    Hi,
    You can try to right-click on the page and choose Send Link.
    You can also install the Addon Compatibility Reporter (https://addons.mozilla.org/en-US/firefox/addon/add-on-compatibility-reporter/?src=search) and try to enable the said extension in Tools (Alt + T) > Addons. Hopefully this will restore at least partial functionality. You may also be able to find alternatives via the search box on the top right of the Addons page, or on AMO (https://addons.mozilla.org/).
