Regular expressions for XML parsing

I have an XML parsing problem that I have to solve using regular expressions; using any other method is not an option for me. There is one part I cannot seem to wrap my head around. I want to extract the contents of a tag, but this tag occurs several times in the XML file and I only want the contents of one particular occurrence. Basically the problem is as follows:
I want to extract
<bp:NAME ***stuff***>(I want this part)</bp:NAME>
This tag can occur in several places. For example here:
<bp:ORGANISM>
***bunch of tags***
<bp:NAME ***stuff***>***stuff***</bp:NAME>
***bunch of tags***
</bp:ORGANISM>
or here:
<bp:DATABASE>
***bunch of tags***
<bp:NAME ***stuff***>***stuff***</bp:NAME>
***bunch of tags***
</bp:DATABASE>
I do not want the content of those tags. I want the content of the <NAME> tag that is not between either the <ORGANISM> tags or the <DATABASE> tags. These tags can be in any order. For the life of me I cannot figure this problem out. I tried several different approaches. For example, I tried using the following regex:
(?:<bp:NAME [^>]*>([^<]*).*?<bp:ORGANISM>.*?</bp:ORGANISM>|
<bp:ORGANISM>.*?</bp:ORGANISM>.*?<bp:NAME [^>]*>([^<]*))
This kind of works: the information I want is either in the first captured group or in the second one, so I just check which group is not empty and that is the one I want. But this only works if there is only one other tag containing the name tag (in this particular regular expression that is the organism tag). Since there is another tag (the database tag) I have to work around, and these tags can be in any order, the regular expression becomes three times as large and there are six different groups in which the information I want can occur. This does not seem like a good idea to me. There has to be another way to do this. So I tried using the following regex:
(?:</bp:ORGANISM>)?.*?(?:</bp:DATABASE>)?.*?<bp:NAME [^>]*>([^<]*)
I thought this would get rid of any occurrences of the other tags in front of the name tag, but it doesn't work either. It seems like it is not greedy enough. Well, I think you get the point. I don't know what to try next, so I really need some help.
Here is an example of the type of data I will run into. The tags can be in any order and they do not always have to occur. In the example below the <DATABASE> tag is not part of the data, and the name tag I want just happens to be in front of the organism tag, but this is not always the case. The name tag I want is the first name tag in the file, namely:
<bp:NAME rdf:datatype="xsd:string">Progesterone receptor</bp:NAME>
So I don't want the name tag that is in between the organism tags.
<bp:protein rdf:ID="CPATH-27885">
<bp:COMMENT rdf:datatype="xsd:string">
Belongs to the nuclear hormone receptor family. NR3 subfamily. SIMILARITY: Contains 1 nuclear receptor DNA-binding domain. WEB RESOURCE: Name=NIEHS-SNPs; URL="http://egp.gs.washington.edu/data/pgr/"; WEB RESOURCE: Name=Wikipedia; Note=Progesterone receptor entry; URL="http://en.wikipedia.org/wiki/Progesterone_receptor"; GENE SYNONYMS: NR3C3. COPYRIGHT:  Protein annotation is derived from the UniProt Consortium (http://www.uniprot.org/).  Distributed under the Creative Commons Attribution-NoDerivs License.
</bp:COMMENT>
<bp:SYNONYMS rdf:datatype="xsd:string">Nuclear receptor subfamily 3 group C member 3</bp:SYNONYMS>
<bp:SYNONYMS rdf:datatype="xsd:string">PR</bp:SYNONYMS>
<bp:NAME rdf:datatype="xsd:string">Progesterone receptor</bp:NAME>
<bp:ORGANISM>
<bp:bioSource rdf:ID="CPATH-LOCAL-112384">
<bp:NAME rdf:datatype="xsd:string">Homo sapiens</bp:NAME>
<bp:TAXON-XREF>
<bp:unificationXref rdf:ID="CPATH-LOCAL-112385">
<bp:DB rdf:datatype="xsd:string">NCBI_TAXONOMY</bp:DB>
<bp:ID rdf:datatype="xsd:string">9606</bp:ID>
</bp:unificationXref>
</bp:TAXON-XREF>
</bp:bioSource>
</bp:ORGANISM>
<bp:SHORT-NAME rdf:datatype="xsd:string">PRGR_HUMAN</bp:SHORT-NAME>
<bp:XREF>
<bp:relationshipXref rdf:ID="CPATH-LOCAL-112386">
<bp:DB rdf:datatype="xsd:string">ENTREZ_GENE</bp:DB>
<bp:ID rdf:datatype="xsd:string">5241</bp:ID>
</bp:relationshipXref>
</bp:XREF>
<bp:XREF>
<bp:unificationXref rdf:ID="CPATH-LOCAL-112387">
<bp:DB rdf:datatype="xsd:string">UNIPROT</bp:DB>
<bp:ID rdf:datatype="xsd:string">P06401</bp:ID>
</bp:unificationXref>
</bp:XREF>
<bp:XREF>
<bp:unificationXref rdf:ID="CPATH-LOCAL-112388">
<bp:DB rdf:datatype="xsd:string">UNIPROT</bp:DB>
<bp:ID rdf:datatype="xsd:string">A7X8B0</bp:ID>
</bp:unificationXref>
</bp:XREF>
<bp:XREF>
<bp:relationshipXref rdf:ID="CPATH-LOCAL-112389">
<bp:DB rdf:datatype="xsd:string">GENE_SYMBOL</bp:DB>
<bp:ID rdf:datatype="xsd:string">PGR</bp:ID>
</bp:relationshipXref>
</bp:XREF>
<bp:XREF>
<bp:relationshipXref rdf:ID="CPATH-LOCAL-112390">
<bp:DB rdf:datatype="xsd:string">REF_SEQ</bp:DB>
<bp:ID rdf:datatype="xsd:string">NP_000917</bp:ID>
</bp:relationshipXref>
</bp:XREF>
<bp:XREF>
<bp:unificationXref rdf:ID="CPATH-LOCAL-112391">
<bp:DB rdf:datatype="xsd:string">UNIPROT</bp:DB>
<bp:ID rdf:datatype="xsd:string">Q9UPF7</bp:ID>
</bp:unificationXref>
</bp:XREF>
<bp:XREF>
<bp:unificationXref rdf:ID="CPATH-LOCAL-113580">
<bp:DB rdf:datatype="http://www.w3.org/2001/XMLSchema#string">CPATH</bp:DB>
<bp:ID rdf:datatype="http://www.w3.org/2001/XMLSchema#string">27885</bp:ID>
</bp:unificationXref>
</bp:XREF>
</bp:protein>
Edited by: Dani3ll3 on Nov 19, 2009 2:51 AM
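
One regex-only way to approach the question, sketched here as an illustration and not taken from the thread: first delete every <bp:ORGANISM>...</bp:ORGANISM> and <bp:DATABASE>...</bp:DATABASE> block, then match <bp:NAME> in what is left. This avoids one capture group per tag ordering. The class and method names are hypothetical, and the sketch assumes the two container tags are never nested inside themselves and never appear in self-closing form.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameExtractor {
    // A whole ORGANISM or DATABASE element; \1 forces the closing tag to match the opening one.
    private static final Pattern NESTED = Pattern.compile(
            "<bp:(ORGANISM|DATABASE)\\b[^>]*>.*?</bp:\\1>", Pattern.DOTALL);
    // A NAME element with plain text content.
    private static final Pattern NAME = Pattern.compile(
            "<bp:NAME\\b[^>]*>([^<]*)</bp:NAME>");

    /** Returns the text of the first NAME outside ORGANISM/DATABASE, or null if there is none. */
    static String outerName(String xml) {
        String stripped = NESTED.matcher(xml).replaceAll("");
        Matcher m = NAME.matcher(stripped);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String xml = "<bp:protein>"
                + "<bp:ORGANISM><bp:NAME rdf:datatype=\"xsd:string\">Homo sapiens</bp:NAME></bp:ORGANISM>"
                + "<bp:NAME rdf:datatype=\"xsd:string\">Progesterone receptor</bp:NAME>"
                + "</bp:protein>";
        System.out.println(outerName(xml)); // prints "Progesterone receptor"
    }
}
```

Because the container blocks are removed before the NAME match runs, the order of the tags in the file no longer matters.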

Dani3ll3 wrote:
Thanks a lot, after I did that the regular expression worked. :)
Good. But remember that in real life, you would then have to apply the XML rules to get the actual contents of the text node. For example it might be a CDATA section or it might include characters like ampersands which have been escaped and which you need to unescape. That's why it's better to use a proper parser, as already suggested.
It seems to me this forum is full of posts where people are doing homework questions which teach them to do things the wrong way. But of course there's nothing the student can do about that.
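
For comparison, here is a minimal sketch of the "proper parser" route using the JDK's DOM API. It assumes the record's root element is the one whose direct children should be searched (bp:protein in the example above) and that the parser is left in its default non-namespace-aware mode, so the undeclared bp: prefix is treated as part of the element name. The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DirectChildName {
    /** Returns the text of the first bp:NAME that is a direct child of the root, or null. */
    static String directName(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Walk only the root's direct children, so NAME elements nested in
            // ORGANISM or DATABASE are never visited.
            for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.ELEMENT_NODE && "bp:NAME".equals(n.getNodeName())) {
                    // getTextContent() resolves escapes such as &amp; and includes CDATA text.
                    return n.getTextContent();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<bp:protein>"
                + "<bp:ORGANISM><bp:NAME>Homo sapiens</bp:NAME></bp:ORGANISM>"
                + "<bp:NAME>Progesterone &amp; receptor</bp:NAME>"
                + "</bp:protein>";
        System.out.println(directName(xml)); // prints "Progesterone & receptor"
    }
}
```

The parser handles the unescaping and CDATA cases mentioned in the reply, which the regular expression does not.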

Similar Messages

  • How to form a regular expression for matching the xml tag?

    Hi, I want to find and match an XML tag, and for that I need to write a regex.
    For example, I have a string[] str={"<data>abc</data>"};
    I want this string to be split into <data>, abc and </data>, so that I can read the split values.
    The above is a small exercise, but the tag name and value can be any combination of characters, digits, special symbols and so on.
    So please help me write the regular expression for the above requirement.

    I would appreciate a suggestion on how to get started: which parser should be used, and so on.
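
    A minimal sketch of the split described above, using java.util.regex. It assumes the input is exactly one element with text content and no nested child elements; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagSplit {
    // group 1: opening tag, group 2: tag name, group 3: content, group 4: matching closing tag
    private static final Pattern ELEMENT = Pattern.compile(
            "(<([^\\s/>]+)[^>]*>)(.*?)(</\\2>)", Pattern.DOTALL);

    /** Splits "<tag>value</tag>" into {"<tag>", "value", "</tag>"}; returns an empty array on no match. */
    static String[] split(String s) {
        Matcher m = ELEMENT.matcher(s);
        if (!m.matches()) {
            return new String[0];
        }
        return new String[] { m.group(1), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("<data>abc</data>"))); // prints "[<data>, abc, </data>]"
    }
}
```

    For anything beyond a single flat element, a real XML parser is the safer choice.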

  • Regular expression and XML

    Hello,
    I have an XML file containing regular expressions. I parse the file, extract the pattern from it and search for it using the Java regex package. It works fine when the patterns are words, but when the pattern is something like
    write \\d+ (write followed by a space followed by one or more digits) it doesn't work.
    I wrote the same code with the pattern embedded in it, i.e. without using XML, and it worked. But when extracting from XML it fails.
    Also, if the pattern is write[0-9] it only extracts write[0-9 and gives an error about a missing closing bracket.
    Could anyone please tell me what I am missing?
    Thank you

    Thank you for your replies. I still have not got past the problem, so I am posting my code here in the hope that it can get solved.
    import org.xml.sax.*;
    import org.xml.sax.helpers.*;
    import java.io.*;
    import java.util.regex.*;
    class textextractor extends DefaultHandler {
        // true while the parser is inside a <REGEX> element
        boolean regex = false;
        public void startElement(String namespaceURI, String localName, String qn, Attributes attr) {
            if (localName.equals("REGEX"))
                regex = true;
        }
        // called for each chunk of character data the parser reports
        public void characters(char[] text, int start, int length) throws SAXException {
            String t = new String(text, start, length);
            boolean flag = false;
            if (regex == true) {
                Pattern pattern;
                String w = new String(t);
                pattern = Pattern.compile(w);
                Matcher matcher;
                matcher = pattern.matcher("there is a bat   read  write 13    error at line ");
                while (matcher.find()) {
                    flag = true;
                    System.out.println("I found the text \"" + matcher.group() + "\" starting at index "
                            + matcher.start() + "and ending at index " + matcher.end() + ".");
                }
                if (!flag)
                    System.out.println("not found");
                regex = false;
            }
        }
    }
    public class saxt2 {
        public static void main(String args[]) {
            try {
                XMLReader parser = XMLReaderFactory.createXMLReader();
                ContentHandler handler = new textextractor();
                parser.setContentHandler(handler);
                parser.parse("d:\\regex.xml");
            } catch (Exception e) {
                System.err.println(e);
            }
        }
    }
    The xml file is
                      <RegularExpression>
                      <REGEX>write</REGEX>
                      <REGEX>write \\d+</REGEX>
                      <REGEX>read[0-9]</REGEX>
                      </RegularExpression>
    By running the code you can see that write is found, write \\d+ doesn't match write 13 in the string, and read[0-9] gives an error.
    Any help will be greatly appreciated
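
    Two things are likely behind the behaviour described, offered here as a hedged reading and not as a confirmed diagnosis. First, backslashes are only doubled inside Java string literals; in the XML file the pattern should read write \d+, because write \\d+ is a regex for a literal backslash followed by the letter d. Second, a SAX parser is allowed to deliver the text of one element in several characters() calls, which would explain a pattern arriving as write[0-9 without its closing bracket. The sketch below buffers the text and compiles it in endElement(); the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class RegexFromXml {
    static final String TEXT = "there is a bat   read  write 13    error at line ";

    /** Runs every <REGEX> pattern found in the XML against TEXT and collects all matches. */
    static List<String> findAll(String xml) {
        List<String> hits = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder buf; // non-null only while inside a <REGEX> element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if (qName.equals("REGEX")) buf = new StringBuilder();
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (buf != null) buf.append(ch, start, length); // may be called more than once per element
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("REGEX")) {
                    Matcher m = Pattern.compile(buf.toString()).matcher(TEXT);
                    while (m.find()) hits.add(m.group());
                    buf = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return hits;
    }

    public static void main(String[] args) {
        // The XML below contains a single backslash; it is doubled only for the Java literal.
        String xml = "<RegularExpression><REGEX>write</REGEX>"
                + "<REGEX>write \\d+</REGEX><REGEX>read[0-9]</REGEX></RegularExpression>";
        System.out.println(findAll(xml)); // prints "[write, write 13]"
    }
}
```

    Note that read[0-9] compiles but finds nothing here, since "read" is followed by a space in the test string.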

  • Regular Expression for filename

    I want to read XML files only if the filename starts with a letter.
    Can anybody tell me the regular expression for this?
    Regards
    V Kumar
    Message was edited by:
    user640551

    thanks dhrmendra,
    i got the solution and correct expression is "[a-zA-z].\*.xml"
    regards
    V Kumar
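
    A hedged note on the expression above, written as a small Java check. The range A-z also covers the ASCII characters between Z and a (such as _ and ^), and an unescaped dot before xml matches any character. A tighter variant, assuming ASCII letters only and a hypothetical helper name, could look like this:

```java
import java.util.regex.Pattern;

class XmlFileNames {
    // First character must be an ASCII letter; the name must end in a literal ".xml".
    private static final Pattern XML_FILE = Pattern.compile("[A-Za-z].*\\.xml");

    static boolean accept(String fileName) {
        return XML_FILE.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        System.out.println(accept("data.xml"));   // prints "true"
        System.out.println(accept("1data.xml"));  // prints "false"
        // The looser expression accepts names it probably should not:
        System.out.println("_dataxml".matches("[a-zA-z].*.xml")); // prints "true"
    }
}
```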

  • Regular expression for recognizing all tables in a sql statement

    Hi all
    I need a regular expression for recognizing all the table names in a generic statement.
    Unfortunately I need a regular expression that also handles inner joins. I'm sorry, but this matter is new to me and I cannot find any useful help on the web.
    Regards

    If you insist, it should be something like:
    "SELECT ([A-Z0-9_]+)[.][A-Z0-9_]+(,([A-Z0-9_]+)[.][A-Z0-9_]+)* FROM (([A-Z0-9_]+)[.][A-Z0-9_]+) INNER JOIN (([A-Z0-9_]+)[.][A-Z0-9_]+) ON .+" plus spaces etc... Yes it's for this kind of statements only.
    But an SQL parser is better, because you will need to at least remove duplicates from the names found anyway...
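
    As a rough illustration of the regex route and the duplicate removal mentioned above, the sketch below collects the identifier that follows each FROM or JOIN keyword. This is a simplification and not the expression from the reply: it ignores comma-separated FROM lists, subqueries, quoted identifiers and keywords inside string literals. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TableNames {
    // The identifier (optionally schema.table) right after FROM or JOIN.
    private static final Pattern TABLE = Pattern.compile(
            "\\b(?:FROM|JOIN)\\s+([A-Z0-9_]+(?:\\.[A-Z0-9_]+)?)", Pattern.CASE_INSENSITIVE);

    /** Returns table names in order of first appearance, without duplicates. */
    static List<String> tables(String sql) {
        Set<String> names = new LinkedHashSet<>();
        Matcher m = TABLE.matcher(sql);
        while (m.find()) {
            names.add(m.group(1));
        }
        return new ArrayList<>(names);
    }

    public static void main(String[] args) {
        String sql = "SELECT e.name FROM emp e INNER JOIN dept d ON e.deptno = d.deptno "
                + "LEFT JOIN emp m ON e.mgr = m.empno";
        System.out.println(tables(sql)); // prints "[emp, dept]"
    }
}
```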

  • What should be the regular expression for string MT940_UB_*.txt to be used in SFTP sender channel in PI 7.31?

    Hi All,
    What should be the regular expression for the strings MT940_UB_*.txt and MT940_MB_*.txt, to be used as the filename in an SFTP sender channel in PI 7.31?
    If anyone has any idea on this, please let me know.
    Thanks
    Neha

    Hi All,
    None of the file names suggested is working.
    I have tried using - MT940_MB_*\.txt , MT940_MB_*.*txt , MT940*.txt
    None of them is able to pick this filename - MT940_MB_20142204060823_1.txt
    Currently I am using generic regular expression which picks all .txt files. - ([^\s]+(\.(txt))$)
    Let me know your suggestions on this.
    Thanks
    Neha Verma
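
    In regular expression syntax, * means "zero or more of the preceding item", so MT940_MB_*\.txt only allows extra underscores before .txt. The wildcard MT940_MB_*.txt corresponds to MT940_MB_.*\.txt as a regex. The Java check below illustrates the difference with java.util.regex; whether the channel's filename field expects regex or wildcard syntax is an assumption to verify in the adapter configuration.

```java
import java.util.regex.Pattern;

class Mt940FileNames {
    // Regex equivalent of the wildcards MT940_UB_*.txt and MT940_MB_*.txt.
    private static final Pattern MT940 = Pattern.compile("MT940_(UB|MB)_.*\\.txt");

    static boolean accept(String fileName) {
        return MT940.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        String name = "MT940_MB_20142204060823_1.txt";
        System.out.println(accept(name));                      // prints "true"
        System.out.println(name.matches("MT940_MB_*\\.txt"));  // prints "false"
    }
}
```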

  • Using regular expressions for validation in i18n

    Can we use regular expressions to validate inputs in a Java application while also taking care of i18n aspects? Zip codes differ between locales. Can we use regular expressions to validate zip code inputs from different locales?

    Hi,
    For that, do I have to create individual patterns for matching the inputs from different locales, or will a single pattern do when validating phone numbers, zip codes etc. from around the world? If different patterns are required, the programmer needs to know how the patterns differ between locales.
    regards
    sdas
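
    One common arrangement is a separate pattern per country, looked up by country code. The sketch below is illustrative only: the three patterns are simplified, the country list is far from complete, and the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.regex.Pattern;

class PostalCodes {
    // Simplified per-country patterns, keyed by ISO 3166 country code.
    private static final Map<String, Pattern> BY_COUNTRY = Map.of(
            "US", Pattern.compile("\\d{5}(-\\d{4})?"),   // 12345 or 12345-6789
            "DE", Pattern.compile("\\d{5}"),             // 10115
            "NL", Pattern.compile("\\d{4} ?[A-Z]{2}"));  // 1234 AB

    /** Returns true only if a pattern is registered for the country and the input matches it fully. */
    static boolean isValid(String countryCode, String input) {
        Pattern p = BY_COUNTRY.get(countryCode);
        return p != null && p.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("US", "12345-6789")); // prints "true"
        System.out.println(isValid("DE", "1234"));       // prints "false"
    }
}
```

    Treating an unregistered country as invalid is a design choice of this sketch; an application might prefer to skip validation in that case.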

  • What is the best api for xml parsing?

    I think the API that comes with J2SE is not that good for XML parsing. Is there any open source API which is simple, easy and powerful?

    JArsenic wrote:
    Hey, I feel XMLBeans would be an optimal solution for XML parsing, as it provides you with a whole set of methods to parse your XML tags as Java objects. You may download XMLBeans at http://xmlbeans.apache.org/.
    What advantage would that have over JAXB? It can already do all that and is built into Java itself, so you don't need a separate download.
    Also: mapping XML to Java beans is a very specific way of handling XML and is definitely not "the best" in all situations.
    For similar quest you may reach @ [somesite]
    Please, no advertisement here; read the Code of Conduct that you agreed to when signing up with this page.

  • Regular Expression for a Person's Name

    Hi,
    I am using the org.apache.regexp package and trying to find the regular expression for a person's name. It should allow only alphabetic strings.
    I tried [a-zA-Z]+. But this also accepts things like "BUSH88", which is not what I want...
    Can anybody help me figure this out?
    Thanks in advance,
    Tong

    Try this:
    ^[a-zA-Z]+$
    the ^ represents the start of the String and the $ represents the end.
    So the expression is saying: "between the beginning and the end of the String there will only be alphabetical characters"
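
    The effect of the anchors can be seen in a short check. The question uses the org.apache.regexp package; the sketch below uses java.util.regex instead, as an assumption made for illustration, and the class name is hypothetical.

```java
import java.util.regex.Pattern;

class NameCheck {
    private static final Pattern UNANCHORED = Pattern.compile("[a-zA-Z]+");
    private static final Pattern ANCHORED = Pattern.compile("^[a-zA-Z]+$");

    /** True only if the whole input consists of ASCII letters. */
    static boolean isAlphabetic(String input) {
        return ANCHORED.matcher(input).find();
    }

    public static void main(String[] args) {
        // Without anchors, find() succeeds because "BUSH" is a matching substring.
        System.out.println(UNANCHORED.matcher("BUSH88").find()); // prints "true"
        System.out.println(isAlphabetic("BUSH88"));              // prints "false"
        System.out.println(isAlphabetic("BUSH"));                // prints "true"
    }
}
```

    Real names often contain spaces, hyphens, apostrophes and non-ASCII letters, so a letters-only rule may be stricter than intended.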

  • How to write the regular expression for Square brackets?

    Hi,
    I want a regular expression for the [] 'square brackets'.
    I have tried to insert them in the code below, but the expression does not validate the [] square brackets.
    If anyone knows, please help me write the regular expression for '[]' square brackets.
    private static final Pattern DESC_PATTERN = Pattern.compile("({1}[a-zA-Z])" +"([a-zA-Z0-9\\s.,_():}{/&#-]+)$");
    Thanks
    Raghav

    Since square brackets are metacharacters in regex, they need to be escaped when they are used as regular characters, so prefix them with \\ (the escape character, written as a double backslash in a Java string literal).
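
    Applied to a character class like the one in the question, the escaping could look as follows. This is a simplified variant of the original pattern (a leading letter followed by the allowed characters), not a drop-in replacement, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class DescPattern {
    // \\[ and \\] add literal square brackets to the allowed set; the trailing - is a literal hyphen.
    private static final Pattern DESC = Pattern.compile(
            "[a-zA-Z][a-zA-Z0-9\\s.,_():}{/&#\\[\\]-]+");

    static boolean isValid(String description) {
        return DESC.matcher(description).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("Item [A]")); // prints "true"
        System.out.println(isValid("Item <A>")); // prints "false"
    }
}
```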

  • Need a regular expression for the text field

    Hi ,
    I need a regular expression for a text field.
    If the value is alphanumeric, then it should have at least 3 characters,
    and if the value is numeric ([0-9]), then there is no limit on the number of characters in the field.
    Any help is appreciated...
    thanks
    bharathi.

    Try the following in the change event:
    r=/^[a-z]{1,3}$|^\d+$/i;
    if (!r.test(xfa.event.newText))
    xfa.event.change="";
    Kyle
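
    Reading the requirement literally (digits only: any length; otherwise letters and digits with at least 3 characters), an alternative expression is sketched below in Java for illustration. The reply above targets a form change event in JavaScript; this sketch only demonstrates the pattern itself, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class FieldRule {
    // Either digits only (one or more), or at least three letters/digits.
    private static final Pattern RULE = Pattern.compile("\\d+|[A-Za-z0-9]{3,}");

    static boolean isValid(String value) {
        return RULE.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("7"));   // prints "true"
        System.out.println(isValid("ab"));  // prints "false"
        System.out.println(isValid("ab1")); // prints "true"
    }
}
```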

  • Regular Expression for /, \, #, -, & ‘

    Hi,
    Can anybody tell me the regular expression for provided characters.
    Code is preferable.
    Thanks in advance.

    "[-/\\\\#&']"
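
    In a Java string literal, the four backslashes become two in the regex, which in turn match one literal backslash. A short usage sketch (the class and method names are hypothetical) that removes all six characters from a string:

```java
import java.util.regex.Pattern;

class SpecialChars {
    // Matches any one of: - / \ # & '
    private static final Pattern SPECIAL = Pattern.compile("[-/\\\\#&']");

    static String strip(String input) {
        return SPECIAL.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("a/b\\c#d-e&f'g")); // prints "abcdefg"
    }
}
```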

  • What is the regular expression for the end of a story?

    Forgive me if this is wrong forum for asking this, but I'm trying to use the Find command using GREP and I need to know the regular expression for the end of a story. (Or, the last character of a story.) Thanks in advance.

    I'd try search for .\z (that's a dot in front) which ought to find the very last character in the story, and replace with $0 and your additional text.
    You know you can use a keyboard shortcut to move your cursor to the end of any story, right? Ctrl + End on Windows, Cmd + End, I think, on Mac. Unless you want to do this to every single story in the document, I would think you might be just as well off to put your text on the clipboard, put the cursor in the story and hit the key combo followed by Ctrl/Cmd + V to paste.
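
    The same idea can be tried outside the GREP dialog. The sketch below uses java.util.regex, where \z also means "end of input" and $0 in the replacement stands for the whole match; the GREP engine in the question may differ in details, so treat this as an illustration only. The class and method names are hypothetical.

```java
class EndOfStory {
    /** Appends text after the last character of the story; an empty story is returned unchanged. */
    static String append(String story, String addition) {
        // (?s) lets the dot match a line break too, in case the story ends with one.
        return story.replaceAll("(?s).\\z", "$0" + java.util.regex.Matcher.quoteReplacement(addition));
    }

    public static void main(String[] args) {
        System.out.println(append("Once upon a time.", " The end.")); // prints "Once upon a time. The end."
    }
}
```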

  • Regular Expression For Dreamweaver

    I still haven't had the time to really become a professional when it comes to regular expressions, and sadly I am in need of one and finding it difficult to wrap my head around.
    In a text file I have hundreds of instances like the following:
    {Click here to visit my website}{http://www.adobe.com/}
    I need a regular expression for Dreamweaver that I can run within the "Find and Replace" window to switch the order of the above elements to:
    {http://www.adobe.com/}{Click here to visit my website}
    Can anyone provide some guidance? I'm coming up short due to my lack of experience with regular expressions.
    Thank you in advance!

    So you have a string that starts { and goes until the first }.  Then you have another string exactly the same.  And you want to swap them.  I'm not making any assumption that the second one has to look like a URL (that's a whole other minefield, but perhaps you could do something simple like it must start with http). 
    You don't specify how your text file is divided up, have you got this as a complete line to itself, or is it just  a huge block of text?  Preferably as individual lines.
    I don't have Dreamweaver, but this worked for me in Notepad++
    Find: ^{(.*?)}{(.*?)}$
    Replace with: {\2}{\1}
    My file looked like this:
    {Click here to visit my website}{http://www.adobe.com/}
    {some other site}{http://www.example.com/foo}
    And doing a Replace All ended up like this:
    {http://www.adobe.com/}{Click here to visit my website}
    {http://www.example.com/foo}{some other site}
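
    The same swap expressed with java.util.regex, as a sketch for comparison. Java requires the literal braces in the pattern to be escaped and uses $1 and $2 in the replacement where Notepad++ uses \1 and \2. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class SwapBraces {
    // One {first}{second} pair per line; MULTILINE makes ^ and $ match at line boundaries.
    private static final Pattern PAIR = Pattern.compile(
            "^\\{(.*?)\\}\\{(.*?)\\}$", Pattern.MULTILINE);

    static String swap(String text) {
        return PAIR.matcher(text).replaceAll("{$2}{$1}");
    }

    public static void main(String[] args) {
        String text = "{Click here to visit my website}{http://www.adobe.com/}\n"
                + "{some other site}{http://www.example.com/foo}";
        System.out.println(swap(text));
    }
}
```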

  • Regular expression for LOV?

    I have a list of strings in an LOV. I tried filtering it by typing in "^disk" in the search bar, which I hope will return a list of strings starting with "disk", but I failed.
    Any idea on how to use regular expression for LOVs? Thanks!

    Hi Buffalo,
    I have a select list item on my page 1 named :P1_EMPNAME with the LOV query
    select ename as d, ename as r from emp WHERE REGEXP_LIKE(ename,:P1_SEARCH) or :P1_SEARCH IS NULL
    I have a search text box on my page 1 named :P1_SEARCH.
    When I run the page, by default all the employee names are displayed in the LOV list item.
    I entered ^buffalo in the search text item and clicked the submit button, and it showed the employee buffalo in my list item LOV.
    If you want all the entries that start with S, search for ^s
    For entries that end with R, use r$
    Please try this link: http://download.oracle.com/docs/cd/B28359_01/appdev.111/b28424/adfns_regexp.htm
    Thanks
    Logaa
