How to extract content from a html-document

Hi!
I need to display the content of a news-feed
like: http://news.bbc.co.uk/go/rss/-/1/hi/sci/tech/7662565.stm
I'd like to display this in a JEditorPane with only the relevant content without the banner, menu and so on.
Any idea on how I can do this?
Cheers

In this particular case, you don't want to parse the HTML but the RSS feed used to build this this page. The feed for this page is [http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/sci/tech/rss.xml]
RSS feeds are XML based so it's easy to parse them. You can do it by hand or use a specialized library. One of the well-known library is ROME ( [http://wiki.java.net/bin/view/Javawsxml/Rome] ).
Example : [Parse a RSS XML file|http://www.rgagnon.com/javadetails/java-0557.html]
Bye.

Similar Messages

  • How to extract elements from an HTML page?

    Hi:
    If we have the following returned page which is the web query result from a library: http://library.fsi.uiuc.edu/dbtw-wpd/exec/dbtwpub.dll?XC=%2Fdbtw-wpd%2Fexec%2Fdbtwpub.dll&BU=http%3A%2F%2Flibrary.fsi.uiuc.edu%2Fdbtw-wpd%2FnewWeb%2Fauthor.asp&submit=Submit+Query&QB0=AND&QF0=Author&QI0=william&QB1=AND&QF1=Corporate+Author&QI1=&MR=20&TN=Catalog&DF=Display-Web&RF=Brief+Title+-+Web&DL=1&RL=1&NP=1&AC=QBE_QUERY
    If we want to store all the returned information in the local memory,
    how can Java be used to extract the records out of the html page?
    What kind of classes should be used?
    For example, should I just use regular expression to do this (java.util.regex class),
    build-in html parser (javax.swing.text.html.parser.Parser, or even javax.swing.text.html.HTMLEditorKit.Parser), or any other classes? Thanks.

    I been working on a simular project but i might not be using a method that you would want.
    I am reading the webpage code and then getting the data by looking for certain words. The problem with this is that it is reading the page "blindly" so it could read the wrong data or screw up on an error on the page. but here is a bit of code i am using. I am using ti to get data that appears in " " (quotes) so i can get urls, images, etc
    This code isnt totaly mine, i amended it from some one and dont know the original owner :(
    try {
            is =                    new BufferedInputStream(new URL(jTextField1.getText()).openStream());
          while ((record = is.read()) != -1) {
            buffer.append((char) record);
            length++;
            System.out.print((char)record);
          }     is.close();
    } catch // error management herei am using a seperate thread running at same time to enter data
    do {
            while (lengthposition != length) {
              //System.out.print(buffer.charAt(lengthposition));
              temp[0] = "" + buffer.charAt(lengthposition);
              if (buffer.charAt(lengthposition) != /*34*/64 && tolist == true)
                temp[1] += ""+ buffer.charAt(lengthposition);
              else if (buffer.charAt(lengthposition) == /*34*/64 && tolist == false)
                tolist = true;
              else if (buffer.charAt(lengthposition) == /*34*/64 && tolist == true) {
                tolist = false;
                list1.addItem(temp[1]);
                temp[1] = "";
              lengthposition++;
          } while (buffer.charAt(lengthposition) != 32);I hope you find out a solution to your problem :)

  • How to extract data from  table in PDF document

    Can somebody let me know that how to read tables from a PDF document as I need to process this data further and I want it to be saved as a database.

    Google around; you might find a couple of sourceforge project.
    Rich.

  • How to Extract data from Oracle DB to BW via DBConncet interface.

    HI All,
    Do you know how to extract data from ORACLE data base to BW, using DBConnect.
    Here we are not using R/3 Business content structures.
    How to do it in both source system  and BW side?
    How to define structures on both sides.
    Please provide any documents on that.
    Thanks in Advance.
    Sri.

    Hi Srilaxmi
    Have a look at these links
    http://help.sap.com/saphelp_nw04/helpdata/en/58/54f9c1562d104c9465dabd816f3f24/content.htm
    http://help.sap.com/saphelp_nw04/helpdata/en/a1/89786c3df35c4ea930a994e884bb4c/content.htm
    http://help.sap.com/saphelp_bw30b/helpdata/en/80/1a618ae07211d2acb80000e829fbfe/content.htm
    and  this thread
    Extract from Oracle View with DB connect
    regards
    KR

  • How to extract data from SAP in FDM 11121

    I came across some documents from version 11113, says there is a SAP adapter, however ,when I check 11121 there's no such adapter, does anyone know how to extract data from SAP in FDM 11121?

    I download a package from Bristlecone, but I dont see any xml files in it, just a bunch of dll and I didn't find any instructions on how to set up/configure in FDM, it just explained how to register in SAP app server and client.
    Hyperion 11113 has readme on sap adapter(Hyperion Readme Template), but I cannot find the same readme file in 11121, all I can find is ERP Integration Adapter document. any idears where I can find at least a readme document?

  • How to extract data from web URL

    I was doing one project which need to extract data from web pages and then analyze these data. the question is how to extract data from there, using html parser? need help, thanks a lot

    I was doing one project which need to extract data
    from web pages and then analyze these data. the
    question is how to extract data from there, using
    html parser? need help, thanks a lotTry this:
    http://java.sun.com/docs/books/tutorial/networking/urls/readingURL.html
    Or, like you said yourself, use an HTML parser:
    http://java-source.net/open-source/html-parsers

  • How to extract data from oracle database directly in to bi7.0 (net weaver)

    how to extract data from oracle database directly in to bi7.0 (net weaver)? is it something do with EDI? can anybody explain me in detail?
    Thanks
    York

    You can use UDConnect to get from Oracle database in to BW
    <b>Data Transfer with UD Connect -</b>
    http://help.sap.com/saphelp_nw04/helpdata/en/78/ef1441a509064abee6ffd6f38278fd/content.htm
    <b>Prerequisites</b>
    You have installed the SAP WAS J2EE Engine with BI Java components.  You can find more information on this in the SAP BW installation guide on the SAP Service Marketplace at service.sap.com/instguides.
    Hope it Helps
    Chetan
    @CP..

  • Code Challange - Code for extracting content from a file in KM

    Hi,
      I want to extract content from a file which is in KM Repository using the Webdynpro View.  I wrote the following code for the extraction. But it is throwing some exception.
    Code Written:
    <b>  
      URLConnection conn = null;
          DataInputStream data = null;
          String line;
          StringBuffer buf = new StringBuffer();
         wdContext.currentContextElement().setTextdisp("Hello");
         try {
          URL theURL = new URL("http://ctssap1:50000/irj/servlet/prt/portal/prtroot/com.sap.km.cm.docs/documents/News/New%20text.txt");     
            conn = theURL.openConnection();
           // conn.connect();
            wdContext.currentContextElement().setTextdisp("Inside Try");      
            data = new DataInputStream(new BufferedInputStream(
                               conn.getInputStream()));
            while ((line = data.readLine()) != null) {
               buf.append(line + "\n");
            data.close();
            String buffer = new String(buf);
            wdContext.currentContextElement().setTextdisp(buffer);
          catch (IOException e) {
              wdContext.currentContextElement().setTextdisp("IO Error:" + e.getMessage());
          }</b>
    Exception Got:
    <b>IO Error:Server returned HTTP response code: 401 for URL: http://ctssap1:50000/irj/servlet/prt/portal/prtroot/com.sap.km.cm.docs/documents/News/New%20text.txt</b>
    Can anyone help me out in this process... or please suggest some other way to extract the content.
    Thanks in Advance.
    Regards,
    Srinivas.

    Hi Srinivas,
    VaPath will have the path of ur document.
    Check this modified code snippet which will answer ur  questions and solve ur problem.
    IWDClientUser wdClientUser = WDClientUser.getCurrentUser();
    IUser sapUser = wdClientUser. getSAPUser();
    com.sapportals.portal.security.usermanagement.IUser ep5User = WPUMFactory.getUserFactory().getEP5User(sapUser);
    IResourceContext context = new ResourceContext(ep5User);
    /Specify the path of ur document here./
    RID pathRID = RID.getRID("/documents/file2");
    IResource resource = ResourceFactory.getInstance()
    .getResource(context);
    InputStream in = resource.getContent().getInputStream();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[4096];
    int bytesread = 0;
    while ((bytesread = in.read(buffer)) != -1) {
    out.write(buffer, 0, bytesread);
    String myData = out.toString();
    /*myFile will containS the content of the document./
    wdComponentAPI.getMessageManager().reportSuccess(myData);
    Hope this solves ur problem.
    Regards,
    Sowjanya.

  • How to extract text from a PDF file?

    Hello Suners,
    i need to know how to extract text from a pdf file?
    does anyone know what is the character encoding in pdf file, when i use an input stream to read the file it gives encrypted characters not the original text in the file.
    is there any procedures i should do while reading a pdf file,
    File f=new File("D:/File.pdf");
                   FileReader fr=new FileReader(f);
                   BufferedReader br=new BufferedReader(fr);
                   String s=br.readLine();any help will be deeply appreciated.

    jverd wrote:
    First, you set i once, and then loop without ever changing it. So your loop body will execute either 0 times or infinitely many times, writing the same byte every time. Actually, maybe it'll execute once and then throw an ArrayIndexOutOfBoundsException. That's basic java looping, and you're going to need a firm grip on that before you try to do anything as advanced as PDF reading. the case.oops you are absolutely right that was a silly mistake to forget that,
    Second, what do the docs for getPageContent say? Do they say that it simply gives you the text on the page as if the thing were a simple text doc? I'd be surprised if that's the case.getPageContent return array of bytes so the question will be:
    how to get text from this array? i was thinking of :
        private void jButton1_actionPerformed(ActionEvent e) {
            PdfReader read;
            StringBuffer buff=new StringBuffer();
            try {
                read = new PdfReader("d:/getjobid2727.pdf");
                read.getMetaData();
                byte[] data=read.getPageContent(1);
                int i=0;
                while(i>-1){ 
                    buff.append(data);
    i++;
    String str=buff.toString();
    FileOutputStream fos = new FileOutputStream("D:/test.txt");
    Writer out = new OutputStreamWriter(fos, "UTF8");
    out.write(str);
    out.close();
    read.close();
    } catch (Exception f) {
    f.printStackTrace();
    "D:/test.txt"  hasn't been created!! when i ran the program,
    is my steps right?                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       

  • How to extract text from a PDF file using php?

    How to extract text from a PDF file using php?
    thanks
    fabio

    > Do you know of any other way this can be done?
    There are many ways. But this out of scope of this forum. You can try this forum: http://forum.planetpdf.com/

  • Need suggestion on how to extract data from a table

    Many years ago, I wrote many Perl scripts to increase my work productivity.
    Now I am starting learning Java, Javascript and PHP for some purposes. Just wrote a simple html2txt java program.
    Now I need experts' suggestion for one specific purpose - extract data from a html table. One of the examples is here: http://mops.tse.com.tw/nas/t06sa18/200902/A02_2454_200902.htm
    I need to extract datum from the table and make some analysis. (Txt is in Chinese, sorry)
    What is the best way to code it? Java, PHP or other? If Java, any suggestion what classes/module and approach to use?
    Thanks very much.

    xcomme wrote:
    Many years ago, I wrote many Perl scripts to increase my work productivity.
    Now I am starting learning Java, Javascript and PHP for some purposes. Just wrote a simple html2txt java program.
    Now I need experts' suggestion for one specific purpose - extract data from a html table. One of the examples is here: http://mops.tse.com.tw/nas/t06sa18/200902/A02_2454_200902.htm
    I need to extract datum from the table and make some analysis. (Txt is in Chinese, sorry)
    What is the best way to code it? It this a one shot (one time only task)? Then the best way is whatever way you are most comfortable with and which works.
    If on going then I doubt language choice matters but using a html parser rather than attempting to parse it yourself is going to help in any language.
    And are you starting with html files or starting with a http server? The two are very different.

  • How to extract data from Chart History?

    Dear all, I have read a lot of posts, but still don't understand how to extract data from Chart history.
    Suppose you acquired 1024 points of data every time, then use "build array" to build an array, then send to intensity chart to show, then save the history into a .txt file,  but how to extact data from the file?
    How Labview store the data in the 2D history buffer?
    Anybody has any examples?

    The simplest would be to save the 2D array as a spreadsheet file, the read it back the same way.
    Maybe the attached simple example can give you some ideas (LabVIEW 7.1). Just run it. At any time, press "write to file". At any later time, you can read the save data into the second history chart.
    If you are worried about performance, it might be better to use binary files, but it will be a little more complicated. See how far you get.
    LabVIEW Champion . Do more with less code and in less time .
    Attachments:
    IntensityChartHistorySave.vi ‏79 KB

  • How to extract data from planning book

    nu t apo. need help regarding how to extract data from planning book givn the planning book name , view name & some key figures.
    Total Demand (Key Figure DMDTO):
    o     Forecast
    o     Sales Order
    o     Distribution Demand (Planned)
    o     Distribution Demand (Confirmed)
    o     Distribution Demand (TLB-Confirmed)
    o     Dependent Demand
    Total Receipts (Key Figure RECTO):
    o     Distribution Receipt (Planned)
    o     Distribution Receipt (Confirmed)
    o     Distribution Receipt (TLB-Confirmed)
    o     In-Transit
    o     Production (Planned)
    o     Production (Confirmed)
    o     Manufacture of Co-Products
    Stock on Hand (Key Figure STOCK):
    o     Stock on Hand (Excluding Blocked stock)
    o     Stock on Hand (Including Blocked stock)
    In-Transit Inventory (Cross-company stock transfers)
    Work-in-progress (Planned orders / Process orders with start date in one period and finish date in another period)
    Production (Planned), Production (Confirmed) and Distribution Receipt elements need to be converted based on Goods Receipt date for projected inventory calculation.

    Hello Debadrita,
    Function Module BAPI_PBSRVAPS_GETDETAIL2 or BAPI_PBSRVAPS_GETDETAIL can help you.
    For BAPI BAPI_PBSRVAPS_GETDETAIL, the parameters are:
    1) PLANNINGBOOK - the name of your planning book
    2) DATA_VIEW - name of your data view
    3) KEY_FIGURE_SELECTION - list of key figures you want to read
    4)  SELECTION - selection parameters which describe the attributes of the data you want to read (e.g. the category or brand). This is basically a list of characteristics and characteristic values.
    BAPI_PBSRVAPS_GETDETAIL2 is very similar to BAPI_PBSRVAPS_GETDETAI but is only available from SCM 4.1 onwards.
    For the complete list of parameters, you can go to transaction SE37, enter the function module names above and check out the documentation.
    Please post again if you have questions.
    Hope this helps.

  • How to extract data from ODS to non-SAP system

    Hi,
    Can anybody tell me, step by step, how to extract data from ODS to a non-SAP system?
    Is it possible to do it without programming effort? And is there volume limits for this kind of extraction?
    The non-SAP system is an unix system.
    Thanks in advance
    Ella

    Ella,
    You can look at it from the concept of a BADI / Infospoke
    Extract the data from the ODS to a Flat file / RDBMS using an infospoke. I am not sure as to how the infospoke loads data into the RDBMS ( did it very long ago ) but then you can push it into an RDBMS and I am sure it will be system neutral.
    Hope this helps...
    Arun
    Assign points if it helps

  • How to extract data from custom made Idoc that is not sent

    Hi experts,
    Could you please advise if there is a way how to extract data from custom made idoc (it collects a lot of data from different SAP tables)? Please note that this idoc is not sent as target system is not fully maintained.
    As by now, we would like to verify - what data is extracted now.
    Any help, would be appreciated!

    Hi,
    The fields that are given for each segment have their length given in EDSAPPL table. How you have to map is explained in below example.
    Suppose for segment1, EDSAPPL has 3 fields so below are entries
    SEGMENT          FIELDNAME           LENGTH
    SEGMENT1         FIELD1                   4
    SEGMENT1         FIELD2                   2
    SEGMENT1         FIELD3                   2
    Data in EDID4 would be as follows
    IDOC           SEGMENT                          APPLICATION DATA
    12345         SEGMENT1                        XYZ R Y
    When you are extracting data from these tables into your internal table, mapping has to be as follows:
    FIELD1 = APPLICATIONDATA+0(4)        to read first 4 characters of this field, because the first 4 characters in this field would belong to FIELD1
    Similarly,
    FIELD2 = APPLICATIONDATA+4(2).
    FIELD3 = APPLICATIONDATA+6(2).  
    FIELD1 would have XYZ, FIELD2 = R, FIELD3 = Y
    This would remain true in all cases. So all you need to do is identify which fields you want to extract, and simply code as above to extract the data from this table.
    Hope this was helpful in explaining how to derive the data.

Maybe you are looking for