Extracting HTML Data

I am developing a Java web crawler that extracts data from a given website. Is it possible to extract text without using regular expressions? If so, how?
Thanks

I think you need to be more specific.
From a single String, it's easy to extract whatever you want using substring. You do need to look for, and expect, certain characters in certain places, but it will do the trick.
Additionally, there may be libraries already out there that do exactly what you're asking. I'm too lazy to look myself, but if you're looking for an easy way out, that would probably be the way to go.
Now, if you're trying to write this yourself, it shouldn't be all that hard: find the characters that bound the text you're looking for and grab everything in between (chances are these boundaries would be HTML tags, but they could be anything). Or, if you only need to check whether the HTML contains a specific piece of text, use stringname.contains("whatever you want").
Like I said though, I think you need to be more specific in your question.
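
A minimal Java sketch of the boundary approach described above, assuming the page has already been downloaded into a String. The class name TextBetween, the extract helper, and the sample HTML are made up for illustration; only indexOf, substring, and contains come from java.lang.String.

```java
/** Sketch: pull the text between two known boundary strings without regular expressions. */
final class TextBetween {

    /**
     * Returns the text between the first occurrence of start and the next
     * occurrence of end, or null if either boundary is missing.
     */
    static String extract(String html, String start, String end) {
        int from = html.indexOf(start);
        if (from < 0) {
            return null;                 // opening boundary not found
        }
        from += start.length();          // skip past the opening boundary itself
        int to = html.indexOf(end, from);
        if (to < 0) {
            return null;                 // closing boundary not found
        }
        return html.substring(from, to);
    }

    public static void main(String[] args) {
        // Hypothetical page content, used only for illustration.
        String html = "<html><head><title>My Page</title></head>"
                + "<body><p>Hello</p></body></html>";

        System.out.println(extract(html, "<title>", "</title>")); // prints "My Page"
        System.out.println(html.contains("Hello"));               // prints "true"
    }
}
```

This only finds the first match and knows nothing about nesting or attributes, so for anything beyond simple, predictable pages a proper HTML parser library is probably the safer route.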

Similar Messages

  • Extracting HTML data in BSP from Browser

    Hello,
    I am displaying an Adobe Interactive Form as HTML in an IFRAME on a BSP page. I know there is another way of using an Adobe Interactive Form in BSP, but it would occupy the entire BSP page and overwrite every other BSP element, so those elements would not be displayed. The requirement is to enhance the existing BSP page: display buttons such as SAVE and SUBMIT at the top of the page and have the Interactive Form occupy the rest of the page as an HTML display in the IFRAME. I understand that such functionality can easily be achieved using Web Dynpro ABAP or Java, but my options are very limited. I used the code below to render the Interactive Adobe Form:
    DATA: cached_response TYPE REF TO if_http_response.
        CREATE OBJECT cached_response
          TYPE
            cl_http_response
          EXPORTING
            add_c_msg        = 1.
        cached_response->set_data( file_content ).
        cached_response->set_header_field( name  = if_http_header_fields=>content_type
                                           value = file_mime_type ).
        cached_response->set_status( code = 200 reason = 'OK' ).
        cached_response->server_cache_expire_rel( expires_rel = 180 ).
        DATA: guid TYPE guid_32.
        CALL FUNCTION 'GUID_CREATE'
          IMPORTING
            ev_guid_32 = guid.
        CONCATENATE runtime->application_url '/' guid INTO display_url.
        cl_http_server=>server_cache_upload( url      = display_url
                                             response = cached_response ).
    Once the form is displayed as HTML in the IFRAME, is there a way to capture the data entered in the Interactive Adobe Form from BSP? I can still extract the data even if it arrives as XML or XSTRING. Please let me know if there is a way to extract it.
    Regards,
    Shishir.P

    Hi,
    Have you gone through this link
    [check this|http://www.sdn.sap.com/irj/scn/go/portal/prtroot/docs/library/uuid/d0e58022-2a39-2a10-69a8-c1a892e2b3f4?quicklink=index&overridelayout=true]
    Cheers,
    bhavana

  • Extraction of data from ECC to 3rd Party systems

    Hi All,
    I want to know all the options available for extracting data from ECC to a 3rd party system (a custom data warehouse such as Teradata, Hyperion, etc.). I also want to know whether best-practice documentation is available for extracting data from ECC to any 3rd party system.
    Thanks,
    SB.

    Hi SB,
    Check the following link
    http://expertisesapbi.blogspot.com/2010/06/how-to-transfer-data-from-sap-system-to.html
    Ranganath.

  • STATIC "HTML Data Set Photo Gallery" with CAPTIONS

    Because of search engines and validation I would really like to use the STATIC "HTML Data Set Photo Gallery" http://labs.adobe.com/technologies/spry/demos/gallery_pe/static/china.html.
    But I definitely need captions for the Photos. I found possibilities to add them via XML, but not for the HTML-Version.
    As far as I can tell, I have to edit this function from "gallery_hds.js" and add a RegEx to extract the caption text.
    function PhotosFilter(ds, row, rowIndex)
    {
        var tnStr = row.thumbimg;
        if (tnStr)
        {
            row.path = tnStr.replace(/.*<a[^>]*href="?([^"]*)"?.*/i, "$1");
            row.thumbpath = tnStr.replace(/.*<img[^>]*src="?([^"]*)"?.*/i, "$1");
        }
        return row;
    }
    In the following rows within the html-document
    <span class="thumbnail"><a href="../../gallery/galleries/paris/images/paris_02.jpg"><img src="../../gallery/galleries/paris/thumbnails/paris_02.jpg" alt="paris_02.jpg" /></a></span>
    I would insert text before the closing span-tag (or before the closing a-tag?) - like so
    <span class="thumbnail"><a href="../../gallery/galleries/paris/images/paris_02.jpg"><img src="../../gallery/galleries/paris/thumbnails/paris_02.jpg" alt="paris_02.jpg" /></a>Here is some HTML Text</span>
    This text must be extracted via RegEx, similar to the lines inside the if-statement of the "PhotosFilter" function above. But here my RegEx knowledge is definitely too short!
    Thank you very much in advance for any help or suggestion.
    Angela Fengler
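
    One possible pattern for the caption, sketched here in Java rather than in the JavaScript of gallery_hds.js: capture whatever plain text sits between the closing a-tag and the closing span-tag. The class name CaptionExtractor and the caption helper are hypothetical, and the sample markup is the line from the question. The pattern syntax itself is also valid in a JavaScript regex, so it could probably be adapted to the replace calls in PhotosFilter, though that is untested against Spry.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: read the caption text placed between </a> and </span> in a thumbnail cell. */
final class CaptionExtractor {

    // Group 1 is the text between the closing </a> and the closing </span>,
    // with surrounding whitespace left out of the group.
    private static final Pattern CAPTION =
            Pattern.compile("</a>\\s*([^<]*?)\\s*</span>", Pattern.CASE_INSENSITIVE);

    /** Returns the caption text, or an empty string when no caption is found. */
    static String caption(String thumbHtml) {
        Matcher m = CAPTION.matcher(thumbHtml);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String cell = "<span class=\"thumbnail\">"
                + "<a href=\"../../gallery/galleries/paris/images/paris_02.jpg\">"
                + "<img src=\"../../gallery/galleries/paris/thumbnails/paris_02.jpg\" alt=\"paris_02.jpg\" />"
                + "</a>Here is some HTML Text</span>";

        System.out.println(caption(cell)); // prints "Here is some HTML Text"
    }
}
```

    Because the group is [^<]*, a caption that itself contains tags (for example bold text) would not match and the helper would return an empty string. Placing the caption after the closing a-tag, as in the second variant in the question, is what this pattern assumes.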

  • Extract Excel Data to RT (perl)

    Hi, I am trying to extract some data from Excel and export it into a Perl program (Request Tracker). I am not sure what technologies I should use.
    E.g. convert to XML? CSV? Or to Java? Use Java EE?
    Please advise.
    Thanks

    armalcolm wrote:
    Also, modern versions of Excel will save files in an XML format, so you could then go straight to Perl for a completely Java-free solution :)
    Doesn't even need to do that! ;-)
    [Active State Perl and Excel|http://aspn.activestate.com/ASPN/docs/ActivePerl-5.6/faq/Windows/ActivePerl-Winfaq12.html]

  • Auto extraction of data...

    Hi,
    Is there a software product or tool I can purchase that allows me to accomplish the following tasks? I've looked at screen scrape, which only takes care of my "data extraction" requirement. I've also looked at iMacros, which only takes care of the auto submit of the form.
    Tasks I need to accomplish:
    I need a software product or tool that I can configure to automatically extract certain data from an email message that comes into my Microsoft Office Outlook inbox, paste it into a textarea (memo) box on a web page, and then submit the form (containing the textarea field with the pasted data) into a table in a database.
    Thanks in advance!

    Hi cf_menace,
    I have a very simple template (below) that retrieves only the first 5 emails (MAXROWS="5") from my mailbox, but it takes a long time to return the result set. Can you tell me why? Is there something in the CF Administrator that I have to configure to make this faster? Please see the code below:
    <!--- This view-only example shows the use of CFPOP --->
    <HTML>
    <HEAD>
    <TITLE>CFPOP Example</TITLE>
    </HEAD>
    <BODY>
    <H3>CFPOP Example</H3>
    <P>CFPOP allows you to retrieve and manipulate mail in a POP3 mailbox. This view-only example shows how to create one feature of a mail client, allowing you to display the mail headers in a POP3 mailbox.
    <!--- <P>Simply uncomment this code and run with a mail-enabled CF Server to see this feature in action. --->
    <CFIF IsDefined("form.server")>
    <!--- make sure server, username are not empty --->
    <CFIF Trim(form.server) is not "" and Trim(form.username) is not "">
    <CFPOP SERVER="#server#" USERNAME="#username#" PASSWORD="#pwd#" ACTION="GETHEADERONLY" NAME="GetHeaders" MAXROWS="5">
    <H3>Message Headers in Your Inbox</H3>
    <P>Number of Records: <CFOUTPUT>#GetHeaders.RecordCount#</CFOUTPUT></P>
    <UL>
    <CFOUTPUT QUERY="GetHeaders">
    <LI>Row: #CurrentRow#: From: #From# -- Subject: #Subject#
    </CFOUTPUT>
    </UL>
    </CFIF>
    </CFIF>
    <FORM ACTION="CFPOP.cfm" METHOD="POST">
    <P>Enter your mail server: <INPUT TYPE="Text" NAME="server">
    <BR>Enter your username: <INPUT TYPE="Text" NAME="username">
    <BR>Enter your password: <INPUT TYPE="password" NAME="pwd">
    <P><INPUT TYPE="Submit" VALUE="Get Message Headers">
    </FORM>
    </BODY>
    </HTML>

  • Save HTML data in a Oracle Column

    What would be the best way to save HTML data in an Oracle column?
    VARCHAR2 can be used for up to 4000 bytes, but it would still mean escaping a lot of special characters. Is there a better way to do this? Any help would be greatly appreciated.

    Besides the XML types available to you, and the associated Oracle-provided packages for loading and extracting XML, I have heard arguments that both forms should be stored in the database. That is, you should store the extracted data in normal Oracle columns so it can be used like any other attribute, and you should store the XML as XML so it can be used as XML.
    For data that is only inserted and deleted I can see this method working, but if updates to information within the XML are required, you have just added another set of work requirements and complexity.
    Who is going to access the data? What tools are the users going to use? Where else does the data need to be provided, and in what format? The answers to who will use the data, and how, should tell you what form the data should be stored in.
    My personal view is that a relational database should be used for what it was designed for, storing relational data.
    HTH -- Mark D Powell --

  • Extracting the date value from digital signature/certificate

    Hello,
    I'd like to extract the date from the signature properties and copy the value over to the date field as shown in snapshot.
    I am aware that we can change the appearance of the digital signature to make the date visible, but in most cases it is too small to read on hard copies.
    Instead we resort to typing in the date manually, zooming into the PDF to read the visible date (if any) associated with the signature image, clicking on the signature image to open the Signature Properties dialog, or opening the Signatures tab docked on the left.
    Typing in the date manually exposes us to a discrepancy between the date typed when the PDF was created and the actual date the PDF was signed (the date value associated with the digital signature/certificate). For example, person A creates a PDF with the date typed in and then sends the file to person B (who approves the document), who may digitally sign it a few days later.
    Hope I am making sense.
    Regards,
    Devin
    Note: I originally posted my question in another thread at http://forums.adobe.com/message/3296355

    You can get the date and other signature properties using the signatureInfo field method: http://livedocs.adobe.com/acrobat_sdk/9.1/Acrobat9_1_HTMLHelp/JS_API_AcroJS.88.756.html
    But for your application you really should be setting the date field before the signature is applied, since changing it afterwards would invalidate the signature. You can execute a script that sets the value of the date field to the current date using the "Signature Signed" event, which you'll see as one of the tabs of the signature field properties dialog.

  • Error while extracting the data in R/3 production system

    Hi Team,
    We got the following errors while extracting data in the R/3 production system:
    Error 7 When Sending an IDoc R3 3
    No Storage space available for extending the inter 44 R3 299
    No storage space available for extending the inter R3 299
    Error in Source System RSM 340
    Please guide us to fix the issue

    It's very difficult to help you without knowing
    - what is going to be transferred
    - where you get this error
    - system configuration
    - actual memory usage
    - operating system
    - database and configuration, etc.
    I suggest you open an OSS call and let support have a look at your system. It's much easier to find the cause of the problem with system access.
    Markus

  • Unable to extract the data from ECC 6.0 to PSA

    Hello,
    I'm trying to extract data from the ECC 6.0 DataSource 2LIS_11_VAHDR into BI 7.0.
    When I try to run a full load into the PSA, I get the following error message:
    Error Message: "DataSource 2LIS_11_VAHDR must be activated"
    The DataSource is already active; I checked it with transaction LBWE.
    In BI, when I right-click the DataSource (2LIS_11_VAHDR) and select "Manage", the system throws the error message below:
    "Invalid DataStore object name /BIC/B0000043: Reason: No valid entry in table RSTS"
    If anybody has faced this error message, please advise what I'm doing wrong.
    Thanks in advance

    ECC 6.0 side
    Delete the setup tables
    Fill the data into setup tables
    Schedule the job
    I can see the data using RSA3 (2LIS_11_VAHDR) 1000 records
    BI 7.0 (Service Pack 15)
    Replicate the DataSource in Production in the background
    Migrate the DataSource from 3.5 to 7.0 in Development
    I didn't migrate from 3.5 to 7.0 in Production; the system does not allow it
    When I try to schedule the InfoPackage, it gives the error message "Data Source is not active"
    I'm sure this problem is related to the DataSource 3.5-to-7.0 conversion in Production. In Development there is no problem because I converted the DataSource from 3.5 to 7.0 manually.
    Thanks

  • How to extract Slide data in 3rd party application from clipboard

    I need to be able to copy/paste or drag/drop from PowerPoint into another application (C# WPF). In my OnDrop method the DragEventArgs Data has these formats:
            [0]    "Preferred DropEffect"    string
            [1]    "InShellDragLoop"    string
            [2]    "PowerPoint 12.0 Internal Slides"    string
            [3]    "ActiveClipBoard"    string
            [4]    "PowerPoint 14.0 Slides Package"    string
            [5]    "Embedded Object"    string
            [6]    "Link Source"    string
            [7]    "Object Descriptor"    string
            [8]    "Link Source Descriptor"    string
            [9]    "PNG"    string
            [10]    "JFIF"    string
            [11]    "GIF"    string
            [12]    "Bitmap"    string
            [13]    "System.Drawing.Bitmap"    string
            [14]    "System.Windows.Media.Imaging.BitmapSource"    string
            [15]    "EnhancedMetafile"    string
            [16]    "System.Drawing.Imaging.Metafile"    string
            [17]    "MetaFilePict"    string
            [18]    "PowerPoint 12.0 Internal Theme"    string
            [19]    "PowerPoint 12.0 Internal Color Scheme"    string
    The "PowerPoint 14.0 Slides Package" is a byte array... can this be converted into Slides?
    If not how would I go about getting high-resolution images + slide text from a drag/drop?
    [Originally posted here: http://answers.microsoft.com/en-us/office/forum/office_2013_release-powerpoint/how-to-extract-slide-data-in-3rd-part-application/a0b5ed64-eb77-49bb-bf44-e0732e23a5eb]

    What I'd like to do:
    Open PowerPoint
    In PPT open a presentation
    In PPT select a slide
    Drag it to my 3rd party WPF application
    In the 3rd party WPF application drop handler get the slide data (text, background image, etc...).
    When I do this I get the DragEventArgs Data (the clipboard data) and it has the 20 supported formats I listed in the 1st post. From these formats #4 seemed like it could have some useful info.
    WPF
    <Window x:Class="PowerPointDropSlide.MainWindow"
    xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
    xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
    Title="MainWindow" Height="350" Width="525" AllowDrop="True" Drop="UIElement_OnDrop" DragOver="UIElement_OnDragOver">
    <Grid HorizontalAlignment="Stretch" VerticalAlignment="Stretch" Background="LightBlue">
    <TextBlock Text="Drop something here!"/>
    </Grid>
    </Window>
    Handlers:
    public void UIElement_OnDragOver(object sender, DragEventArgs e)
    {
    }
    public void UIElement_OnDrop(object sender, DragEventArgs e)
    {
        string[] supportedFormats = e.Data.GetFormats();
        object pptSlidesPackage = e.Data.GetData("PowerPoint 14.0 Slides Package");
    }

  • Not able to extract performance data from .ETL file using xperf commands. getting error "Events were lost in this trace. Data may be unreliable ..."

    Not able to extract  performance data from .ETL file using xperf commands.
    Xperf Commands:
    xperf -i C:\TempFolder\Test.etl -o C:\TempFolder\BootData.csv -a process
    Getting following error after executing above command:
    "33288636 Events were lost in this trace. Data may be unreliable
    This is usually caused by insufficient disk bandwidth for ETW logging.
    Please try increasing the minimum and maximum number of buffers and/or the buffer size. Doubling these values would be a good first attempt.
    Please note, though, that this action increases the amount of memory reserved for ETW buffers, increasing memory pressure on your scenario.
    See "xperf -help start" for the associated command line options."
    I changed the page file size, but that does not work for me.
    Does anyone have an idea how to solve this problem and extract the ETL file data?

    I want to mention one point here. I have 4 machines in total; on 3 of them the above commands work properly. Only one machine has this problem.
    Hi,
    I suggest using xperf to collect a new trace .etl file on this computer and checking whether it can be extracted there:
    Refer to following articles:
    start
    http://msdn.microsoft.com/en-us/library/windows/hardware/hh162977.aspx
    Using Xperf to take a Trace (updated)
    http://blogs.msdn.com/b/pigscanfly/archive/2008/02/16/using-xperf-to-take-a-trace.aspx
    Kate Li
    TechNet Community Support

  • How to extract authorization data to standard BW DSOs from SAP R/3 system

    Hi All,
    Does anyone have any experience with this topic? I want to use SAP R/3 as the source system and, after extracting the data to the Business Content DSOs in BW, generate authorization objects from those DSOs.
    I am using the standard BC DSOs:
    • 0TCA_DS01 Authorization data - Values
    • 0TCA_DS02 Authorization data - Hierarchies
    • 0TCA_DS03 Descriptive Text Authorizations
    • 0TCA_DS04 Assignment User Authorizations
    • 0TCA_DS05 Generate users for Authorizations
    I have researched this in depth but can't find anything.
    Best Regards
    Ozan

    Hi Ozan,
    You can go through the thread provided by Suman. These DSOs help to maintain Analysis Authorizations in BW automatically. In short, you don't need to maintain them manually: the data comes from R/3 and the same authorizations are configured in BW.
    Regards,
    Ganesh

  • How to extract Inventory data from SAP R/3  system

    Hi friends, how do I extract Inventory data from an SAP R/3 system? What reports can we expect from Inventory?

    Hi,
    Inventory management
    https://www.sdn.sap.com/irj/servlet/prt/portal/prtroot/com.sap.km.cm.docs/documents/a1-8-4/how%20to%20handle%20inventory%20management%20scenarios.pdf
    How to Handle Inventory Management Scenarios in BW (NW2004)
    https://www.sdn.sap.com/irj/sdn/go/portal/prtroot/docs/library/uuid/f83be790-0201-0010-4fb0-98bd7c01e328
    Loading of Cube
    Refer to page 18 in the "Upgrade and Migration Aspects for BI in SAP NetWeaver 2004s" paper
    http://www.sapfinug.fi/downloads/2007/bi02/BI_upgrade_migration.pdf
    Non-Cumulative Values / Stock Handling
    https://www.sdn.sap.com/irj/sdn/go/portal/prtroot/docs/library/uuid/93ed1695-0501-0010-b7a9-d4cc4ef26d31
    Non-Cumulatives
    http://help.sap.com/saphelp_nw2004s/helpdata/en/8f/da1640dc88e769e10000000a155106/frameset.htm
    http://help.sap.com/saphelp_nw2004s/helpdata/en/80/1a62ebe07211d2acb80000e829fbfe/frameset.htm
    http://help.sap.com/saphelp_nw2004s/helpdata/en/80/1a62f8e07211d2acb80000e829fbfe/frameset.htm
    Here you will find all the Inventory Management BI Contents:
    http://help.sap.com/saphelp_nw70/helpdata/en/fb/64073c52619459e10000000a114084/frameset.htm
    2LIS_03_BX - Initial Stock/Material stock
    2LIS_03_BF - Material movements
    2LIS_03_UM - Revaluations/Find the price of the stock
    The first DataSource (2LIS_03_BX) is used to extract an opening stock balance on a detailed level (material, plant, storage location and so on). At this moment, the opening stock is the operative stock in the source system. "At this moment" is the point in time at which the statistical setup ran for DataSource 2LIS_03_BX. (This is because no documents are to be posted during this run and so the stock does not change during this run, as we will see below). It is not possible to choose a key date freely.
    The second DataSource (2LIS_03_BF) is used to extract the material movements into the BW system. This DataSource provides the data as material documents (MCMSEG structure).
    The third of the above DataSources (2LIS_03_UM) contains data from valuated revaluations in Financial Accounting (document BSEG). This data is required to update valuated stock changes for the calculated stock balance in the BW. This information is not required in many situations as it is often only the quantities that are of importance. This DataSource only describes financial accounting processes, not logistical ones. In other words, only the stock value is changed here, no changes are made to the quantities. Everything that is subsequently mentioned here about the upload sequence and compression regarding DataSource 2LIS_03_BF also applies to this DataSource. This means a detailed description is not required for the revaluation DataSource.
    http://help.sap.com/saphelp_bw32/helpdata/en/05/c69480c357354a8846cc61f7b6e085/content.htm
    http://help.sap.com/saphelp_bw33/helpdata/en/ed/16c29a27db6e4d81a015be8673eb80/content.htm
    These are the standard data sources used for Inventory extraction.
    Hope this helps.
    Thanks,
    JituK

  • How to extract PS data from sap r/3 to bw

    Hi,
    How to extract PS data from sap r/3 to bw
    PS data like plans, budget, accruals and commitments.
    Can anyone help me with this?
    Thanks in Advance,
    Shankar.

    Hi Shankar,
    You can refer to the link below for details on the relevant extractors and InfoProviders:
    http://help.sap.com/erp2005_ehp_04/helpdata/EN/17/416d030524064cb2b8d58ffb306f3a/frameset.htm
    Regards,
    Sathya
