OSB - Iterating over large XML files with content streaming

Hi @ll
I have to iterate over all items in large XML files and insert them into an Oracle database.
The file is about 200 MB and contains around 500'000 items, and I am using OSB 10gR3.
The XML structure is something like this:
<allItems>
<item>.....</item>
<item>.....</item>
<item>.....</item>
<item>.....</item>
<item>.....</item>
</allItems>
My first thought was to use a proxy service with content streaming enabled and a "for each" action to iterate
over all items. But for that, the whole XML structure has to be materialized into a variable; otherwise it is not possible.
More about streaming large files can be found here:
[http://download.oracle.com/docs/cd/E13159_01/osb/docs10gr3/userguide/context.html#large_messages]
The documentation states: "When you enable streaming for large message processing, you cannot use the ... for each ...".
For accessing single items you should use an assign action with an XPath like "$body/allItems/item[1]";
this works fine, and the whole XML stream does not have to be materialized.
So my idea was to use the "for each" action and process all items sequentially with an XPath like:
$body/allItems/item[$counter]
But the "for each" action only allows iterating over a sequence of XML items by defining a selection XPath
and the variable that contains all the items. I would like to have a "repeat until" construct that iterates as long as
$body/allItems/item[$counter] returns a non-empty result. Or can I use the "for each" action differently?
Does OSB provide any other iteration mechanism? I know there is the split-join construct that supports
different looping techniques, but as far as I know it does not support content streaming; is this correct?
Did I miss something?
Thanks a lot for helping!
Cheers
Dani

Hi Dani,
Yes, in my opinion this would be the best approach. You can use content streaming to pass this large XML to an EJB, and once it is passed successfully the EJB should operate on it. If you want any result back (for further routing), you can get it back from the EJB.
An EJB gives you the power of Java to process this file, and from a Java perspective a file of this size is not very large. Ensure that you are using buffering. Check out this link for an explanation of Java IO streams and, in particular, buffered streams:
http://java.sun.com/developer/technicalArticles/Streams/ProgIOStreams/
Try dom4j with the XPP (XML Pull Parser) parser in case you have a parsing requirement. We have worked with a 1.2 GB file using this technique.
Regards,
Anuj
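
As an illustration of the buffered, pull-based processing suggested above, here is a minimal StAX sketch that walks the <item> elements one at a time without materializing the document; the file name and the handleItem callback are hypothetical placeholders, and the actual Oracle insert is left abstract:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class ItemStreamer {
    public static void main(String[] args) throws Exception {
        // Buffered stream keeps IO cheap; StAX keeps memory flat.
        InputStream in = new BufferedInputStream(new FileInputStream("allItems.xml"));
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "item".equals(reader.getLocalName())) {
                    handleItem(reader);
                }
            }
        } finally {
            reader.close();
            in.close();
        }
    }

    static void handleItem(XMLStreamReader reader) {
        // Hypothetical: read the item's child elements here and add a row
        // to a JDBC batch insert against the Oracle table.
    }
}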

Similar Messages

  • Failed to load XML file with Content ID 'XYZ'

    Hello,
    We are using UCM Version: 11.1.1.8.1DEV-2014-01-06 04:18:30Z-r114490 (Build: 7.3.5.185) with Site Studio for creating templates and web sites.
    While switching to contribution mode, we get a 'Failed to load XML file with Content ID 'XYZ'' error (here XYZ is the locally checked-in content).
    In the region we are using Dynamic Converter to convert the style of a native document; below are the region and its element details.
    <region  id="region3" name="Add_Content_Here" flags="1111111100100" metadata="xIdcProfile%3AisHidden%3Dtrue%26xTemplateType%3AisHidden%3Dtrue%26xShowInStaff%3AisHidden%3Dtrue%26xShowInVisitors%3AisHidden%3Dtrue%26xShowInFaculty%3AisHidden%3Dtrue%26xDiscussionCount%3AisHidden%3Dtrue%26xDiscussionType%3AisHidden%3Dtrue" dccommand="ssIncDynamicConversionByRule(SS_DATAFILE, 'Colleges_Template_Rule')">
          <!--$region3_ACTIONS="EIMPRS",region3_DCCOMMAND="ssIncDynamicConversionByRule(SS_DATAFILE, 'Colleges_Template_Rule')" -->
          <element  id="region3_element1" name="Editor" label="Editor" type="1" flags="111111111111111111111100000111100000000000001111001110111010001111101000000000000000000000000000">
            <!--$region3_element1="Add_Content_Here/Editor" -->
            <linktoregioncontent  createnewxml="true" createnewnative="false" choosemanaged="true" chooselocal="false" choosenone="false">
              <choosemanagedquerytext  corecontentonly="FALSE">
                <![CDATA[xWebsiteObjectType <Matches> `Data File` <OR> xWebsiteObjectType <Matches> `Native Document`]]>
              </choosemanagedquerytext>
            </linktoregioncontent>
          </element>
          <switchregioncontent  createnewxml="true" createnewnative="true" choosemanaged="true" chooselocal="false" choosenone="false">
            <createnewnativedoctypes >
              <![CDATA[.doc,.docx,.txt,.rtf]]>
            </createnewnativedoctypes>
            <choosemanagedquerytext  corecontentonly="FALSE">
              <![CDATA[xWebsiteObjectType <Matches> `Data File` <OR> xWebsiteObjectType <Matches> `Native Document`]]>
            </choosemanagedquerytext>
            <defaultmetadata >
              <![CDATA[xIdcProfile%3AisHidden%3Dtrue%26xTemplateType%3AisHidden%3Dtrue%26xCollegesList%3AisHidden%3Dtrue%26xShowInStudents%3AisHidden%3Dtrue%26xShowInStaff%3AisHidden%3Dtrue%26xShowInVisitors%3AisHidden%3Dtrue%26xShowInFaculty%3AisHidden%3Dtrue%26xArticleSection%3AisHidden%3Dtrue%26xDiscussionCount%3AisHidden%3Dtrue%26xDiscussionType%3AisHidden%3Dtrue%26dpTriggerValue%3DCSE]]>
            </defaultmetadata>
          </switchregioncontent>
        </region>
    <!--SS_BEGIN_OPENREGIONMARKER(region3)--><!--$SS_REGIONID="region3"--><!--$include ss_open_region_definition --><!--SS_END_OPENREGIONMARKER(region3)-->
    <!--SS_BEGIN_ELEMENT(region3_element1)--><!--$ssIncludeXml(SS_DATAFILE,region3_element1 & "/node()")--><!--SS_END_ELEMENT(region3_element1)-->
    <!--SS_BEGIN_CLOSEREGIONMARKER(region3)--><!--$include ss_close_region_definition --><!--SS_END_CLOSEREGIONMARKER(region3)-->
    Regards,
    Syed

    Hi Syed,
    Add the following trace sections:
    requestaudit, sitestudio*, system + full verbose tracing
    Clear the server output.
    Replicate the same steps, and once the error shows up, refresh the server output, copy the logs to a text file, and upload them here.
    Thanks,
    Srinath

  • Transform Large XML files with XSL

    HELP, LARGE XML FILES
    I have a 30-50 MB XML file, and I would like to transform it
    with XSLT, but I get an OutOfMemory exception.
    I tried to find a solution on the Java site, but I didn't find one.
    I cannot split my XML file. I hope for some help.
    I tried really everything.
    Thanks a lot

    What is your machine configuration?
    The above two suggestions would help, but it does depend on how your software is written.
    Please post more info about your environment and object design.
    Chintan

  • Querying large XML files with WEBSERVICE()

    I am trying to use the WEBSERVICE function to query large chunks of data, without success.
    Let's go with an example:
    =SERVICEWEB("http://api.eve-central.com/api/marketstat?&regionlimit=10000002&typeid=2305")
    This line works, and gives me the following result:
    <?xml version='1.0' encoding='utf-8'?>
    <evec_api version="2.0" method="marketstat_xml">
          <marketstat><type id="2305">
              <buy><volume>125924386</volume><avg>2.09</avg><max>2.39</max><min>1.00</min><stddev>0.36</stddev><median>2.20</median><percentile>2.37</percentile></buy>
              <sell><volume>384731177</volume><avg>11.87</avg><max>30.01</max><min>3.40</min><stddev>8.34</stddev><median>3.82</median><percentile>3.80</percentile></sell>
              <all><volume>431842145</volume><avg>5.63</avg><max>20.01</max><min>0.57</min><stddev>3.92</stddev><median>3.83</median><percentile>1.57</percentile></all>
            </type></marketstat>
        </evec_api>
    This is exactly the same content as the webpage from the URL used (http://api.eve-central.com/api/marketstat?&regionlimit=10000002&typeid=2305)
    In my case, I am working with several typeid values, and I would like to send the server only one query instead of N queries (where N is the number of typeid values I use).
    I want to use this query:
    http://api.eve-central.com/api/marketstat?&regionlimit=10000002&typeid=2073&typeid=2288&typeid=2286&typeid=2306&typeid=2309&typeid=2305&typeid=2311&typeid=2310&typeid=2308&typeid=2270&typeid=2287&typeid=2267&typeid=2307&typeid=2272&typeid=2268&typeid=2393&typeid=2396&typeid=3779&typeid=2401&typeid=2390&typeid=2397&typeid=2392&typeid=3683&typeid=2389&typeid=2399&typeid=2395&typeid=2398&typeid=9828&typeid=2400&typeid=3645&typeid=2329&typeid=3828&typeid=9836&typeid=9832&typeid=44&typeid=3693&typeid=15317&typeid=3725&typeid=3689&typeid=2327&typeid=9842&typeid=2463&typeid=2317&typeid=2321&typeid=3695&typeid=9830&typeid=3697&typeid=9838&typeid=2312&typeid=3691&typeid=2319&typeid=9840&typeid=3775&typeid=2328&typeid=2358&typeid=2345&typeid=2344&typeid=2367&typeid=17392&typeid=2348&typeid=9834&typeid=2366&typeid=2361&typeid=17898&typeid=2360&typeid=2354&typeid=2352&typeid=9846&typeid=9848&typeid=2351&typeid=2349&typeid=2346&typeid=12836&typeid=17136&typeid=28974&typeid=2375&typeid=2868&typeid=2869&typeid=2870&typeid=2871&typeid=2872&typeid=2875&typeid=2876
    When you go to the webpage you can see that all the results are there. However, in Excel I am getting:
    #VALUE!
    I think Excel is not waiting long enough to get all the data, decides there is no answer, and tells me there are "no correct values".
    How could I make this work within one request?
    Thank you for any input

  • java.lang.StackOverflowError in getting large XML file with RMI

    Hello, Java Gurus.
    I have a Java client that gets XML data from a server using RMI. When the XML data has more than a certain number of nodes, I get the StackOverflowError below on the client side. What is the actual problem? Is the file too big, or is something wrong in the Java code?
    thanks. dave
    java.lang.StackOverflowError
         at java.io.BufferedInputStream.read1(Compiled Code)
         at java.io.BufferedInputStream.read(Compiled Code)
         at java.io.ObjectInputStream.read(Compiled Code)
         at java.io.DataInputStream.readFully(Compiled Code)
         at java.io.DataInputStream.readUTF(Compiled Code)
         at java.io.DataInputStream.readUTF(Compiled Code)
         at java.io.ObjectInputStream.readUTF(Compiled Code)
         at java.io.ObjectInputStream.readObject(Compiled Code)
         at java.io.ObjectInputStream.inputClassFields(Compiled Code)
         at java.io.ObjectInputStream.defaultReadObject(Compiled Code)
         at java.io.ObjectInputStream.inputObject(Compiled Code)
         at java.io.ObjectInputStream.readObject(Compiled Code)
         at java.io.ObjectInputStream.inputClassFields(Compiled Code)
         at java.io.ObjectInputStream.defaultReadObject(Compiled Code)
         at org.apache.xerces.dom.ChildAndParentNode.readObject(Compiled Code)
         at java.lang.reflect.Method.invoke(Native Method)
         at java.lang.reflect.Method.invoke(Compiled Code)
    ...followed by many repetitions of these lines:
    at java.io.ObjectInputStream.readObject(Compiled Code)
         at java.io.ObjectInputStream.inputClassFields(Compiled Code)
         at java.io.ObjectInputStream.defaultReadObject(Compiled Code)
         at java.io.ObjectInputStream.inputObject(Compiled Code)
    at org.apache.xerces.dom.ChildAndParentNode.readObject(Compiled Code)
         at java.lang.reflect.Method.invoke(Native Method)
         at java.lang.reflect.Method.invoke(Compiled Code)
         at java.io.ObjectInputStream.invokeObjectReader(Compiled Code)
    at sun.rmi.server.UnicastRef.unmarshalValue(Compiled Code)
         at sun.rmi.server.UnicastRef.invoke(Compiled Code)
    at InterfaceImpl_Stub.ExportData(Compiled Code)
         at ClientServlet.handleRequest(Compiled Code)
         at ClientServlet.doGet(Compiled Code)
         at javax.servlet.http.HttpServlet.service(Compiled Code)
         at javax.servlet.http.HttpServlet.service(Compiled Code)
         at org.apache.tomcat.core.ServletWrapper.handleRequest(Compiled Code)
         at org.apache.tomcat.core.ServletWrapper.handleRequest(Compiled Code)
         at org.apache.tomcat.servlets.InvokerServlet.service(Compiled Code)
         at javax.servlet.http.HttpServlet.service(Compiled Code)
         at org.apache.tomcat.core.ServletWrapper.handleRequest(Compiled Code)
         at org.apache.tomcat.core.ContextManager.service(Compiled Code)
         at org.apache.tomcat.service.connector.Ajp12ConnectionHandler.processConnection(Compiled Code)
         at org.apache.tomcat.service.TcpConnectionThread.run(Compiled Code)
         at java.lang.Thread.run(Compiled Code)

    Hi yue42, thanks a lot for your reply.
    The error occurred before I tried to apply the XSL template to translate the XML data into an HTML page; anyway, here is my template. dave
    <?xml version="1.0"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="html"/>
    <xsl:include href="global.xsl"/>
    <xsl:template match="/">
    <html>
    <head><title>News</title></head>
    <body>
    <table>
    <xsl:for-each select="//NEWS/STORY">
    <xsl:if test="6>position()">
    <tr>
    <td><xsl:value-of select="position()"/></td>
    <td><xsl:value-of select="./HEADLINE"/></td>
    </tr>
    </xsl:if>
    </xsl:for-each>
    </table>
    </body>
    </html>
    </xsl:template>
    </xsl:stylesheet>
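
    The deep recursion in the trace comes from deserializing the Xerces DOM over RMI: each node's readObject pulls in its children, one stack frame per node. A common workaround, sketched here as an assumption rather than anything confirmed in this thread, is to flatten the Document to a String on the server and send that instead:

    import java.io.StringWriter;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerException;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import org.w3c.dom.Document;

    public class DomFlattener {
        // Serialize the DOM to flat text so the RMI call ships one String
        // instead of a deeply nested object graph.
        public static String toXmlString(Document doc) throws TransformerException {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        }
    }

    The client can then re-parse the String locally, which avoids the recursive ObjectInputStream calls entirely.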

  • Transform a 100 MB XML file with XSL

    Hi,
    I am trying to transform a 100 MB XML file with XSL. input.xml is the XML file, format.xsl is the XSL file. I type this on the command line:
    java org.apache.xalan.xslt.Process -IN input.xml -XSL format.xsl -OUT output.xml
    It got "out of memeory" error. My questions are:
    1. Is it possible to transfer such large XML file with XSLT?
    2. The XSL processor used SAX or DOM to parse XML file?
    3. Any suggestions?
    Thanks.
    James

    maybe?
    java -Xmx200m org.apache.xalan.xslt.Process -IN input.xml -XSL format.xsl -OUT output.xml
    http://java.sun.com/j2se/1.3/docs/tooldocs/win32/java-classic.html
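
    If raising the heap is not enough, the same transform can also be run in-process through the standard JAXP/TrAX API; a minimal sketch using the file names from the command line above:

    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    public class Transform {
        public static void main(String[] args) throws Exception {
            // Xalan still builds an internal source tree, so a generous -Xmx
            // setting may be needed regardless of how the transform is invoked.
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource("format.xsl"));
            t.transform(new StreamSource("input.xml"), new StreamResult("output.xml"));
        }
    }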

  • Problem with parsing large XML files chunked over HTTP

    I'm trying to isolate a bug that was introduced when upgrading the JRE in use from Java 7u51 to 7u71, without changing any code. The problem appears to be very similar to Bug ID: JDK-8027359, "XML parser returns incorrect parsing results".
    Further investigation showed that it was also introduced in the same version (7u71) where that fix was applied. Unlike that bug, though, my XML is marked as version 1.0. It also appears to happen only with large XML files, on the order of 10 MB or so.
    The closest I've been able to narrow it down to is that the code is using JAXB to unmarshal a stream that the debugger tells me is an org.apache.http.conn.EofSensorInputStream / org.apache.http.impl.io.ChunkedInputStream. The exception I get is not consistent, but it typically appears to come from chunks being overwritten or shuffled, resulting in letters appearing in attributes that are actually numbers, or, as in the following, an attribute "testAttribute" being partially overwritten by the end of a timestamp from a different section of the XML.
    javax.xml.bind.UnmarshalException
    - with linked exception:
    [javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,98748]
    Message: Attribute name "testAttribu00Z" associated with an element type "testElement" must be followed by the ' = ' character.]
      at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.handleStreamException(UnmarshallerImpl.java:421)
      at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:357)
      at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(UnmarshallerImpl.java:334)
    Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,98748]
    Message: Attribute name "testAttribu00Z" associated with an element type "testElement" must be followed by the ' = ' character.
      at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:598)
      at com.sun.xml.internal.bind.v2.runtime.unmarshaller.StAXStreamConnector.bridge(StAXStreamConnector.java:181)
      at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:355)
      ... 6 more
    Here's some code that seems to reproduce it if you can connect to an XML server that returns a large chunked XML file:
      SchemeRegistry registry = new SchemeRegistry();
      registry.register(
                    new Scheme("http", 80, PlainSocketFactory.getSocketFactory()));
      HttpClient client = new DefaultHttpClient(new BasicClientConnectionManager(registry));
      String url = "http://someUrlReturningAlargeChunkedXML";
      HttpGet method = new HttpGet(url);
      HttpResponse response = client.execute(method);
      InputStream inputStream = response.getEntity().getContent();
      // factory and unmarshaller were declared elsewhere in the original code;
      // shown here so the fragment stands alone:
      XMLInputFactory factory = XMLInputFactory.newInstance();
      Unmarshaller unmarshaller = JAXBContext.newInstance(JaxBObjectOfResponse.class).createUnmarshaller();
      XMLStreamReader responseReader = factory.createXMLStreamReader(inputStream);
      JAXBElement<JaxBObjectOfResponse> wot = unmarshaller.unmarshal(responseReader, JaxBObjectOfResponse.class);
    If you connect using URL.openStream() to the same service there is no error. If I read bytes directly and write to a file, there is no error. The error only happens when I try to unmarshal it, and it's large, and I'm using Java 7u71 (or later). It can be consistently repeated with the jsp webapp that I'm using, but didn't show the error when I used the same code with a Wikipedia dump XML file.
    How can I unmarshal in a different way to avoid this problem? Or, how can I better isolate the bug so it can be posted to the appropriate bug system?

    Apparently, adding the Woodstox XML libraries avoids the bug. Is there anyone who can reproduce this on another system? Were there any changes to the StAX implementation between u67 and u71 that may have introduced a bug like this?
    Edit: When setting the logging level to DEBUG, I once saw the overwritten buffer being logged as if that was what was received (as in the testAttribu00Z example above). I can't repeat that anymore though, and very rarely it parses with no exception (though it may still have been corrupted). Now the error seems to be consistently on one of the buffer boundaries, as in:
    17:08:09,705 DEBUG wire:63 - << "2000[\r][\n]"
    17:08:09,705 DEBUG wire:77 - << "trend>....OTHER XML...<trend hours=""
    17:08:09,705 DEBUG wire:77 - << "634.0972777777778" datetime="2013-05-21T00:43:48.350Z" t"
    17:08:09,705 DEBUG wire:63 - << "[\r][\n]"
    17:08:09,705 DEBUG wire:63 - << "2000[\r][\n]"
    17:08:09,705 DEBUG wire:77 - << "rend-mode="0">
    Exception in thread "main" java.lang.NumberFormatException: t34.0972777777778
      at com.sun.xml.internal.bind.DatatypeConverterImpl._parseDouble(DatatypeConverterImpl.java:213)
      at mypackage.Trend_JaxbXducedAccessor_hours.parse(TransducedAccessor_field_Double.java:48)
      at com.sun.xml.internal.bind.v2.runtime.unmarshaller.StructureLoader.startElement(StructureLoader.java:194)
      at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallingContext._startElement(UnmarshallingContext.java:486)
      at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallingContext.startElement(UnmarshallingContext.java:465)
      at com.sun.xml.internal.bind.v2.runtime.unmarshaller.InterningXmlVisitor.startElement(InterningXmlVisitor.java:60)
      at com.sun.xml.internal.bind.v2.runtime.unmarshaller.StAXStreamConnector.handleStartElement(StAXStreamConnector.java:231)
      at com.sun.xml.internal.bind.v2.runtime.unmarshaller.StAXStreamConnector.bridge(StAXStreamConnector.java:165)
      at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:355)
      at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(UnmarshallerImpl.java:334)
    Or:
    17:19:12,563 DEBUG wire:63 - << "2000[\r][\n]"
    17:19:12,563 DEBUG wire:77 - << ...OTHER XML...<trend index="5"
    17:19:12,563 DEBUG wire:77 - << "" label="N"
    17:19:12,563 DEBUG wire:63 - << "[\r][\n]"
    Exception in thread "main" java.lang.NumberFormatException: Not a number: N
      at com.sun.xml.internal.bind.DatatypeConverterImpl._parseInt(DatatypeConverterImpl.java:106)
      at com.sun.xml.internal.bind.DatatypeConverterImpl._parseShort(DatatypeConverterImpl.java:118)
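
    For reference, a minimal sketch of wiring in the Woodstox workaround mentioned above, assuming the Woodstox jar is on the classpath (the factory class name comes from the Woodstox distribution):

    // Pin the StAX implementation explicitly rather than taking the JDK default:
    XMLInputFactory factory = new com.ctc.wstx.stax.WstxInputFactory();
    // Or, without a compile-time dependency on Woodstox:
    System.setProperty("javax.xml.stream.XMLInputFactory", "com.ctc.wstx.stax.WstxInputFactory");
    XMLStreamReader responseReader = factory.createXMLStreamReader(inputStream);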

  • Does the parser work with large XML files?

    Is there a restriction on the XML file size that can be loaded into the parser?
    I am getting an out-of-memory exception reading in a large XML file (10 MB) using these commands:
    DOMParser parser = new DOMParser();
    URL url = createURL(argv[0]);
    parser.setErrorStream(System.err);
    parser.setValidationMode(true);
    parser.showWarnings(true);
    parser.parse(url);
    Win NT 4.0 Server
    Sun JDK 1.2.2
    ===================================
    Error output
    ===================================
    Exception in thread "main" java.lang.OutOfMemoryError
    at oracle.xml.parser.v2.ElementDecl.getAttrDecls(ElementDecl.java, Compiled Code)
    at java.util.Hashtable.<init>(Unknown Source)
    at oracle.xml.parser.v2.DTDDecl.<init>(DTDDecl.java, Compiled Code)
    at oracle.xml.parser.v2.ElementDecl.getAttrDecls(ElementDecl.java, Compiled Code)
    at oracle.xml.parser.v2.ValidatingParser.checkDefaultAttributes(ValidatingParser.java, Compiled Code)
    at oracle.xml.parser.v2.NonValidatingParser.parseAttributes(NonValidatingParser.java, Compiled Code)
    at oracle.xml.parser.v2.NonValidatingParser.parseElement(NonValidatingParser.java, Compiled Code)
    at oracle.xml.parser.v2.ValidatingParser.parseRootElement(ValidatingParser.java:97)
    at oracle.xml.parser.v2.NonValidatingParser.parseDocument(NonValidatingParser.java:199)
    at oracle.xml.parser.v2.XMLParser.parse(XMLParser.java:146)
    at TestLF.main(TestLF.java:40)
    null

    We have a number of test files that are that size and it works without a problem. However using the DOMParser does require significantly more memory than your doc size.
    What is the memory configuration of the JVM that you are running with? Have you tried increasing it? Are you using our latest version 2.0.2.6?
    Oracle XML Team
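
    For example, following the pattern shown elsewhere in this digest, the maximum heap for the run can be raised on the command line (TestLF is the class from the stack trace; the file name is a placeholder):

    java -Xmx256m TestLF yourfile.xml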

  • Problem with parsing large xml files

    Hello All,
    I am parsing a large XML file of 20 MB, and I use DocumentBuilder.parse(File). This method works for small XML files of less than 20 MB, but the application hangs and doesn't throw any error message when parsing 20 MB XML files. Please let me know what I have to do at this point.
    Thanks & Regards,
    Kumar.

    Well... i can't agree.
    If you have such structure:
    <task>
      <task/>
      <task>
         <task>
            <task/>
         </task>
         <task/>
      </task>
    </task>
    ...you may always keep a stack of tasks (push at startElement, pop at endElement), so at every leaf of the tree you will have all the ancestors of that leaf.
    for such structure:
    <task id="1" parent="0"/>
    <task id="2" parent="1"/>
    <task id="3" parent="1"/>
    <task id="4" parent="2"/>
    <task id="5" parent="3"/>
    ...it will be much faster to go through the document with SAX several times to build the tree of tasks than to load the whole document into memory...
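
    A minimal sketch of the stack-of-tasks idea for the nested form, assuming SAX and a <task> element with an optional id attribute:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    // Keeps the chain of enclosing <task> elements, so every leaf
    // can see its full list of ancestors.
    class TaskStackHandler extends DefaultHandler {
        private final Deque<String> ancestors = new ArrayDeque<String>();

        public void startElement(String uri, String local, String qName, Attributes atts) {
            if ("task".equals(qName)) {
                String id = atts.getValue("id");
                ancestors.push(id != null ? id : "(anonymous)");
                System.out.println("task opened, ancestor chain: " + ancestors);
            }
        }

        public void endElement(String uri, String local, String qName) {
            if ("task".equals(qName)) {
                ancestors.pop();
            }
        }
    }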

  • Problems with Large XML files

    I have tried increasing the memory pool using the -mx and -ms options. It doesn't work. I am using your latest XML parser for Java v2. Please let me know if there are specific options I should be using.
    Thanx,
    -Sameer

    You might try using a different JDK/JRE - either a 1.1.6+ or 1.3 version, as 1.2 in our experience has the largest footprint. If this doesn't work, can you give us some details about your system configuration? Finally, you might try the SAX interface, as it does not need to load the entire DOM tree into memory.
    Oracle XML Team

  • Veeery Slooow (Dom +quite large xml file)

    I have an XML file with 50000 to 100000 XML elements like this:
    <mittaus>
    <pvm>25.1.2006</pvm>
    <klo>12.10.27.234</klo>
    <arvo>44</arvo>
    </mittaus>
    When I try to print the contents of the XML file with this code:
    File file = new File(gv.FilePath() + nimi + ".xml");
    Document doc = builder.parse(file);
    // Root element
    Element root = doc.getDocumentElement();
    // All "mittaus" blocks
    NodeList time = root.getElementsByTagName("mittaus");
    for (int i = 0; i < time.getLength(); i++) {
        Element element = (Element) time.item(i);
        NodeList arvoList = element.getElementsByTagName("arvo");
        Element arvoElement = (Element) arvoList.item(0);
        String arvo = arvoElement.getFirstChild().getNodeValue();
        NodeList timeList = element.getElementsByTagName("klo");
        Element fileElement = (Element) timeList.item(0);
        String timeS = fileElement.getFirstChild().getNodeValue();
        NodeList dateList = element.getElementsByTagName("pvm");
        Element dateElement = (Element) dateList.item(0);
        String dateS = dateElement.getFirstChild().getNodeValue();
        palautus = arvo + ";" + timeS + ";" + dateS;
    }
    The palautus String is then printed on the screen.
    It takes about 1 second per couple of elements, so printing the whole file takes about 20000 seconds!
    Is there something wrong with the code? If the file has over 100000 elements, I also get an out-of-memory failure (java.lang.OutOfMemoryError or something like that).
    What the heck is wrong? With smaller XML files (under 1000 elements) it works OK.

    If you use a SAX parsing methodology, then the memory problem goes away. We are using it here at work to parse large streams. It is very fast, and uses little resource.
    Open up your file as an InputStream, send it to the SAX parser, and handle the tags in the given event handler.
    Inside the event handler, send your data to your print or other output stream.
    Example fragments:
    import java.io.*;
    import javax.xml.parsers.*;
    import org.xml.sax.*;
    import org.xml.sax.helpers.*;
    //parse event handler (not all events represented in the fragment)
    //simply override the handler event methods as specified in http://java.sun.com/j2ee/1.4/docs/api/index.html
    class MyHandler extends DefaultHandler {
        //collector for the text between tags
        StringBuffer someStringCollector = new StringBuffer();
        //handle events
        public void startElement(String namespaceURI, String localName, String qName, Attributes atts) {
            //code to deal with the start of the element
        }
        public void endElement(String namespaceURI, String localName, String qName) {
            //send contents of the current string buffer to the printer, or other handling
            //as necessary, then reset the character buffer
            if (qName.equals("something")) {
                //handle data
            }
            someStringCollector.setLength(0);
        }
        public void characters(char[] ch, int start, int length) {
            //gather up data in a StringBuffer for handling when endElement is reached
            someStringCollector.append(ch, start, length);
        }
    }
    //parser
    public void xmlParser(InputStream xmlByteStream) throws Exception {
        DefaultHandler handler = new MyHandler();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.newSAXParser().parse(xmlByteStream, handler);
    }
    Some good examples can be found at: http://userpage.fu-berlin.de/~ram/pub/pub_jf47htuHHt/java_sax_parser_en

  • What are the best tools for opening very large XML files and examining the tree and confirming they are valid?

    I am generating some very large XML files (600,000+ lines, 50 MB+ of characters). I finally have them all being valid XML and valid UTF-8.
    But the files are so large that Safari and Chrome will often not open them. Firefox will, though.
    Instead of these browsers, I was wondering if there are any other recommended apps for the Mac for opening and viewing the XML, getting an error message if they are not valid for some reason, and examining the XML tree?
    I opened the file in the default app for XML, which is Xcode, but that is just like opening it in a plain text editor. You can't expand/collapse the XML tree like you can with a browser, and it doesn't report errors.
    Thanks,
    Doug

    Hi Tom,
    I had not seen that list. I'll look it over.
    I'm also in touch with the developer of BBEdit (they are quite responsive) and they are willing to look at the file in question and see why it is not reporting UTF-8 errors while Chrome is.
    For now I have all the invalid characters quashed and things are working. But it would be useful in the future.
    By the by, some of those editors are quite pricey!
    doug

  • How to parse large xml file

    I need to parse a large XML file which contains the following tags. The size of the file is 10 MB-50 MB or more.
    <departments>
    <department>
    <a_depart id="124">
    <b_depart id="Bss_253">
    <bss_depart id="253">
    <attributes>
    <name_one>abc</name_one>
    </attributes>
    </bss_depart>
    </b_depart>
    </a_depart>
    </department>
    <department>
    <a_depart id="124">
    <b_depart id="Bss_254">
    <mss_depart id="253">
    <attributes>
    <name_one>abc</name_one>
    <name_two>xyz</name_two>
    </attributes>
    </mss_depart>
    </b_depart>
    </a_depart>
    </department>
    <department>
    <a_depart id="124">
    <b_depart id="Bss_254">
    <mss_depart id="255">
    <attributes>
    <name_one>abc</name_one>
    <name_two>xyz</name_two>
    </attributes>
    </mss_depart>
    </b_depart>
    </a_depart>
    </department>
    <department>
    <a_depart id="125">
    <b_depart id="Bss_254">
    <mss_depart id="253">
    <attributes>
    <name_one>abc</name_one>
    <name_two>xyz</name_two>
    </attributes>
    </mss_depart>
    </b_depart>
    </a_depart>
    </department>
    </departments>
    I want to get the information from that XML file, like mss_depart id=233, building an XPath dynamically for every id and loading
    it using dom4j, which is very, very slow.
    Is there any other solution to read the data using a SAX parser only?
    I want to execute the XPath queries the following way:
    //a_depart/@id ------> all the ids of a_depart tags; if it returns 3 values, say 123, 124, 125,
    after that I want to execute
    //a_depart[@id='123']/b_depart/@id, and so on, to retrieve the values at all the levels.
         I am executing the following XPath for every unique id at all levels:
         List l = doc.selectNodes(xPathForID);
         List l1 = doc.selectNodes(xPathForAttributes + attributes.get(j) + "/text()");
    But it is very slow and takes a lot of time.
    Is there any other way to solve this problem? If so, please mail me; it is urgent.
    I am using JDK 1.4 and JDK 1.5.
    Is there any support in JDK 1.5 for a SAX parser to execute XPath directly, without using dom4j?
    Thanks in advance....

    I doubt you will find a preexisting solution to your problem.
    SAX is usually recommended for processing big files (where "big" is undefined). It works on big files by avoiding the messy problem of storing the data -- that is left as an exercise for you.
    DOM (and its variants) works by building a Document object as the head of the tree of objects for the entire contents. With DOM, you can then use XPath, because there is something to search that is already in memory. To use XPath, you seem to have two choices: build a DOM-ish tree, or find an XPath processor (I'm not sure one exists) that can process the XML file directly; but that will be slow, since you are looking for "all" occurrences of an attribute, which means you have to read the entire file each time.
    It might be worth exploring a hybrid approach -- use SAX to get some information, and build your own objects to store the data, maybe with a HashMap as the main index. But that will keep you from using XPath, since you do not have the data structures it expects.
    A third alternative would be to look at JAXB. It builds Java code from a schema of your data, and then when you import the data, it creates the necessary objects and fills in the values. But I don't think XPath will work there either.
    Dave Patterson
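
    As a sketch of the hybrid SAX-plus-HashMap approach suggested above (element and attribute names taken from the sample XML; the rest is illustrative, not a drop-in solution):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    // One SAX pass that indexes every id attribute by element name,
    // replacing the repeated per-id XPath queries.
    public class DepartmentIndexer extends DefaultHandler {
        final Map<String, List<String>> idsByElement = new HashMap<String, List<String>>();

        public void startElement(String uri, String local, String qName, Attributes atts) {
            String id = atts.getValue("id");
            if (id != null) {
                List<String> ids = idsByElement.get(qName);
                if (ids == null) {
                    ids = new ArrayList<String>();
                    idsByElement.put(qName, ids);
                }
                ids.add(id);
            }
        }

        public static void main(String[] args) throws Exception {
            DepartmentIndexer indexer = new DepartmentIndexer();
            SAXParserFactory.newInstance().newSAXParser().parse(new java.io.File(args[0]), indexer);
            System.out.println(indexer.idsByElement); // e.g. {a_depart=[124, 124, 124, 125], ...}
        }
    }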

  • Reading large XML file using a file event generator and a JPD process

    I am using a FileEventGenerator and a JPD subscription process to read a large XML file. The large XML file basically contains repeated XML elements. My understanding is that the file subscription method reads the whole file into memory, which causes lots of problems for large files (1 MB and up). Is there a way to read the file piecewise, or to read chunks of data from a large file, or any other alternative? I would like to process the file in a loop, iteration by iteration.

    Hitejain,
    Here are a couple of pointers you could try. One is that the file event generator has a pass-by-reference (filename) functionality, which you could use to do the following inside of your process (see the sketch below):
    1) Read the file name from the reference.
    2) Move the file to a processed directory (so it doesn't get picked up again). Note: I don't know how the embedded archive methods of the file event generator play with pass by reference.
    3) Open a stream to the file.
    4) Use a SAX or a combined SAX-DOM approach to parse your XML while managing the memory usage inside of your process.
    There is another possibility which might fit your needs, related to the RawData object that BEA provides. If I understand it correctly, it provides wrapping functionality around a stream object, but depending on your parsing methods it might just postpone the problem.
    Hope this helps
    Chris Falling
    Stormforge Software
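
    A minimal sketch of steps 3 and 4 from the list above, assuming the file name has already been read from the pass-by-reference event (the handler body is a placeholder):

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class FileRefParser {
        public static void main(String[] args) throws Exception {
            // args[0] stands in for the file name obtained from the event reference.
            InputStream in = new BufferedInputStream(new FileInputStream(args[0]));
            try {
                SAXParserFactory.newInstance().newSAXParser().parse(in, new DefaultHandler() {
                    public void startElement(String uri, String local, String qName, Attributes atts) {
                        // Handle one repeated element at a time here, keeping memory flat.
                    }
                });
            } finally {
                in.close();
            }
        }
    }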

  • Loading, processing and transforming Large XML Files

    Hi all,
    I realize this may have been asked before, but searching the history of the forum isn't easy, considering it's not always a safe bet which words to use in the search.
    Here's the situation. We're trying to load and manipulate large XML files of up to 100 MB in size.
    The difference between what we have in our hands and other related issues posted is that the XML isn't big because it has a largely branched tree of data, but rather because it includes large base64-encoded files in the XML itself. The size of the 'clean' XML is relatively small (a few hundred bytes to some kilobytes).
    We had to deal with transferring the XML to our application using a webservice, loading the XML into memory in order to read values from it, and now we also need to transform the XML to a different format.
    We solved the webservice issue using XFire.
    We solved the loading of the XML using JAXB. Nevertheless, we use string manipulation to 'cut' the XML before we load it into memory -- otherwise we get OutOfMemory errors. We don't need to load the whole XML into memory, but I really hate this solution because of the 'unorthodox' manipulation of the XML (i.e. the cutting of it).
    Now we need to deal with the transformation of those XMLs, but obviously we can't cut them down this time. We have little experience writing XSL, and no experience using Java to apply the XSL files. We're looking for suggestions on how to do it most efficiently.
    The biggest problem we encounter is the OutOfMemory errors.
    So I ask several questions in one post:
    1. Is there a better way to transfer the large files using a webservice?
    2. Is there a better way to load and manipulate the large XML files?
    3. What's the best way for us to transform those large XMLs?
    4. Are we missing something in terms of memory management? Is there a better way to control it? We really are struggling there.
    I assume this is an important piece of information: we currently use JDK 1.4.2, and cannot upgrade to 1.5.
    Thanks for the help.

    I think there may be a way to do it.
    First, for low RAM needs, nothing beats SAX as the first processor of the data. With SAX, you control the memory use, since SAX only processes one "chunk" of the file at a time. You supply a class with methods named startElement, endElement, and characters. It calls the startElement method when it finds a new element. It calls the characters method when it wants to pass you some or all of the text between the start and end tags. It calls endElement to signal that passing characters is over, and to let you get ready for the next element. So, if your characters method did nothing with the base64 data, you could see the XML go by with low memory needs.
    Since we know in your case that the characters calls will carry large chunks of data, you can expect many calls as SAX feeds your code. The only workable solution is to use a StringBuffer to accumulate the data. When endElement is called, you can decode the base64 data and keep it somewhere. The most efficient way to do this is to have one StringBuffer for the class handling the SAX calls. Instantiate it with a big enough size to hold the largest of your binary data streams. In startElement, you can set the length of the StringBuffer to zero and reuse it over and over (see the sketch below).
    You did not say what you wanted to do with the XML data once you have processed it. SAX is nice from a memory perspective, but it makes you do all the work of storing the data. Unless you build a structured set of classes "on the fly", nothing is kept. There is a way to pass the output of one SAX pass into a DOM processor (without the binary data, in this case), and then you would wind up with a nice tree object with the rest of your data and a group of binary data objects. I've never done the SAX/DOM combo, but it is called a SAXFilter, and you should be able to google an example.
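
    A minimal sketch of the buffer-reuse pattern just described; the element name "content" is hypothetical, and java.util.Base64 (Java 8+) stands in for whatever decoder is available on JDK 1.4:

    import java.util.Base64;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    class Base64Handler extends DefaultHandler {
        // One buffer for the whole parse, sized for the largest payload.
        private final StringBuffer buf = new StringBuffer(1024 * 1024);

        public void startElement(String uri, String local, String qName, Attributes atts) {
            buf.setLength(0); // reset and reuse the same buffer
        }

        public void characters(char[] ch, int start, int length) {
            buf.append(ch, start, length); // SAX may call this many times per element
        }

        public void endElement(String uri, String local, String qName) {
            if ("content".equals(qName)) {
                // The MIME decoder tolerates the line breaks typical of base64 in XML;
                // on JDK 1.4 a third-party decoder would be needed instead.
                byte[] data = Base64.getMimeDecoder().decode(buf.toString());
                // Store or stream `data` somewhere.
            }
        }
    }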
    So, the bottom line is that it is very possible to do what you want, but it will take some careful design on your part.
    Dave Patterson
