Parsing Large files

Has anyone tried parsing large XML files? I need to parse a file
of about 70 MB+. When I try to parse this file I get "Invalid UTF8 encoding".
When I break the file down into smaller pieces of about 2 MB, it has no
problems. Also, the sample data I am using was created using
OracleXMLQuery. Has anyone had the same problem, or any solutions or
suggestions?
Any help is appreciated.

I wasn't using any kind of encoding, but I did have problems
parsing large files with the DOMParser. It seemed that whenever
I attempted to use a Reader or InputSource, it would throw some
error... I think it was an array-out-of-bounds error. I used the
following code, and it seemed to work:
// xmlDoc is a String of XML
byte[] aByteArr = xmlDoc.getBytes();
ByteArrayInputStream bais = new ByteArrayInputStream(aByteArr, 0, aByteArr.length);
This also works if you use the URL version of .parse,
but I am under the constraint of not being able to write out a
file, so I need to use some kind of memory-based buffer.
ByteArrayInputStream works for me. I think the reason this
works is that the actual length of the stream is specified.
Hope this helps.
Arpan (guest) wrote:
: Has anyone tried parsing large XML files? I need to parse a file
: of about 70 MB+. When I try to parse this file I get
: "Invalid UTF8 encoding".
: When I break the file down into smaller pieces of about 2 MB, it has no
: problems. Also, the sample data I am using was created using
: OracleXMLQuery. Has anyone had the same problem, or any solutions or
: suggestions?
: Any help is appreciated.
: Thanks
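
A minimal sketch of the in-memory approach from the reply above, assuming the standard JAXP DocumentBuilder rather than the Oracle DOMParser used in the thread (the class and method names are hypothetical). It encodes the String explicitly as UTF-8 instead of relying on the platform default charset, which is one possible (not confirmed) source of "Invalid UTF8 encoding" errors when the document declares UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class InMemoryXmlParse {
    // Hypothetical helper: parses XML held in a String without touching the file system.
    static Document parse(String xmlDoc) throws Exception {
        // Explicit charset, so the bytes match an encoding="UTF-8" declaration.
        byte[] bytes = xmlDoc.getBytes(StandardCharsets.UTF_8);
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(bytes));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?><rows><row>caf\u00e9</row></rows>");
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

Note that this still builds the whole document tree in memory, so it does not by itself reduce the footprint of a 70 MB file.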

Similar Messages

  • Parse large (2GB) text file

    Hi all,
    I would like your expert tips on efficient ways (speed and memory considerations) for parsing large text files (~2GB) in Java.
    Specifically, the text files in hand contain mobility traces that I need to process. Traces have a predefined format, and each trace is given on a new line in the text file.
    To obtain the attributes of each trace I use java.util.regex.Pattern and java.util.regex.Matcher.
    Thanks in advance,

    Memory-mapped files are faster when you need random access and you don't need to load all the data; here, however, they just add complexity you don't need, IMHO.
    I suspect most of the time is taken by the parser, so if you customise your parser it could be faster. Here is a simple custom parser:
    import java.io.*;
    import java.util.Date;
    public static void main(String... args) throws IOException {
        String template = "tr at %.1f \"$obj(1) pos 123.20 270.98 0.0 2.4\"%n";
        File file = new File("/tmp/deleteme.txt");
        // if (!file.exists()) {
        System.out.println(new Date() + ": Writing to " + file);
        PrintWriter pw = new PrintWriter(file);
        for (int i = 0; i < Integer.MAX_VALUE / template.length(); i++)
            pw.printf(template, i / 10.0);
        pw.close(); // flush to disk before reporting the file length
        System.out.println(new Date() + ": ... finished writing to " + file + " length= " + file.length() / 1024 / 1024 + " MB.");
        // }
        long start = System.nanoTime();
        final BufferedReader br = new BufferedReader(new FileReader(file), 64 * 1024);
        for (String line; (line = br.readLine()) != null; ) {
            // Each line looks like: tr at <time> "$obj(1) pos <x> <y> <z> <velocity>"
            int pos = 6; // skip the leading "tr at "
            int end = line.indexOf(' ', pos);
            double time = Double.parseDouble(line.substring(pos, end));
            pos = line.indexOf('s', end + 12) + 2; // jump past "pos "
            end = line.indexOf(' ', pos + 1);
            double x = Double.parseDouble(line.substring(pos, end));
            pos = end + 1;
            end = line.indexOf(' ', pos + 1);
            double y = Double.parseDouble(line.substring(pos, end));
            pos = end + 1;
            end = line.indexOf(' ', pos + 1);
            double z = Double.parseDouble(line.substring(pos, end));
            pos = end + 1;
            end = line.indexOf('"', pos + 1); // the last field ends at the closing quote
            double velocity = Double.parseDouble(line.substring(pos, end));
        }
        br.close();
        long time = System.nanoTime() - start;
        System.out.printf(new Date() + ": Took %,f sec to read %s%n", time / 1e9, file.toString());
    }
    Sun May 08 09:38:02 BST 2011: Writing to /tmp/deleteme.txt
    Sun May 08 09:42:15 BST 2011: ... finished writing to /tmp/deleteme.txt length= 2208 MB.
    Sun May 08 09:43:21 BST 2011: Took 66.610883 sec to read /tmp/deleteme.txt
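
For comparison with the index-based parser above, here is a sketch of the regex approach mentioned in the question, using a single precompiled Pattern. It assumes the line format of the reply's template string (the real trace format was not posted), and the class and method names are hypothetical. No timing is claimed for it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TraceRegex {
    // Compiled once and reused for every line; assumes lines shaped like
    // tr at <time> "$obj(1) pos <x> <y> <z> <velocity>"
    private static final Pattern TRACE = Pattern.compile(
            "^tr at (-?[\\d.]+) \"\\S+ pos (-?[\\d.]+) (-?[\\d.]+) (-?[\\d.]+) (-?[\\d.]+)\"$");

    // Returns {time, x, y, z, velocity}, or null if the line does not match.
    static double[] parse(String line) {
        Matcher m = TRACE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        double[] fields = new double[5];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = Double.parseDouble(m.group(i + 1));
        }
        return fields;
    }

    public static void main(String[] args) {
        double[] f = parse("tr at 0.5 \"$obj(1) pos 123.20 270.98 0.0 2.4\"");
        System.out.println(java.util.Arrays.toString(f));
    }
}
```

Measuring both variants on the same file would show whether the regex is actually the bottleneck.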

  • How to parse large xml file

    I need to parse a large XML file which contains the following tags. The size of the file is 10 MB to 50 MB or more.
    <a_depart id="124">
    <b_depart id="Bss_253">
    <bss_depart id="253">
    </bss_depart>
    </b_depart>
    </a_depart>
    <a_depart id="124">
    <b_depart id="Bss_254">
    <mss_depart id="253">
    <a_depart id="124">
    <b_depart id="Bss_254">
    <mss_depart id="255">
    <a_depart id="125">
    <b_depart id="Bss_254">
    <mss_depart id="253">
    I want to get the information from that XML file, e.g. mss_depart id=233. I am building an XPath dynamically for every id and loading
    it using dom4j, which is very, very slow.
    Is there any other solution for reading the data using a SAX parser only?
    I want to execute the XPath queries in the following way:
    //a_depart/@id ------> all the ids of a_depart tags; say it returns 3 values: 123, 124, 125
    After that I want to execute
    //a_depart[@id='123']/b_depart/@id and so on, to retrieve the values at all the levels ...
         I am executing the following XPath for every unique id at all levels.
         List l = doc.selectNodes(xPathForID);
         List l1 = doc.selectNodes(xPathForAttributes+attributes.get(j)+"/text()");
    But it is very slow and takes a lot of time.
    Is there any other way to solve this problem? If so, please mail me; it is urgent.
    I am using jdk1.4 and jdk1.5.
    Is there any support in jdk1.5 for executing XPath directly with a SAX parser, without using dom4j?
    Thanks in advance....

    I doubt you will find a preexisting solution to your problem.
    SAX is usually recommended for processing big files (where "big" is undefined). It works on big files by avoiding the messy problem of storing the data -- that is left as an exercise to you.
    DOM (and its variants) works by building a Document object as the head of the tree of objects for the entire contents. With DOM, you can then use XPath, because there is something to search that is already in memory. To use XPath, you seem to have two choices: build a DOM-ish tree, or find an XPath processor (I'm not sure if one exists) that can process the XML file directly. The latter will be slow, since you are looking for "all" occurrences of an attribute, and this means you have to read the entire file each time.
    It might be worth exploring a hybrid approach -- use SAX to get some information, and build your own objects to store the data. Maybe a HashMap as the main index. But that will keep you from using XPath, since you do not have the data structures it expects.
    A third alternative would be to look at JAXB. It builds Java code from a Schema of your data and then, when you import the data, it creates the necessary objects and fills in values. But I don't think XPath will work there either.
    Dave Patterson
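
The hybrid approach described above could be sketched as follows: one SAX pass that fills a map from each a_depart id to the ids of its b_depart children, so later lookups replace the repeated XPath queries. This is only an illustration under assumptions: the element names come from the sample in the question, the class name is hypothetical, only two levels are indexed, and the code uses Java 8 syntax rather than the JDK 1.4/1.5 mentioned in the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class DepartIndex extends DefaultHandler {
    // a_depart id -> ids of the b_depart elements nested inside it
    final Map<String, List<String>> bByA = new LinkedHashMap<>();
    private String currentA; // id of the enclosing a_depart, or null when outside one

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("a_depart".equals(qName)) {
            currentA = atts.getValue("id");
            bByA.computeIfAbsent(currentA, k -> new ArrayList<>());
        } else if ("b_depart".equals(qName) && currentA != null) {
            bByA.get(currentA).add(atts.getValue("id"));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("a_depart".equals(qName)) {
            currentA = null;
        }
    }

    // Streams the input once; nothing but the index is kept in memory.
    static Map<String, List<String>> index(InputSource in) throws Exception {
        DepartIndex handler = new DepartIndex();
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler.bByA;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root>"
                + "<a_depart id=\"124\"><b_depart id=\"Bss_253\"><bss_depart id=\"253\"/></b_depart></a_depart>"
                + "<a_depart id=\"125\"><b_depart id=\"Bss_254\"/></a_depart>"
                + "</root>";
        System.out.println(index(new InputSource(new StringReader(xml))));
    }
}
```

With the index built, `bByA.keySet()` plays the role of `//a_depart/@id` and `bByA.get("124")` the role of `//a_depart[@id='124']/b_depart/@id`.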

  • Help : Parsing large XML files

    Someone please help. I am trying to parse XML files of about 60 MB; I have to parse through 120 of them, search for a particular node, and print it. I am using jdk1.3.x with JDOM.
    On the available sample files of 114 KB I am able to run my code and get the result, but as soon as the large files are used I get the following error:
    Exception in thread main

    I guess you are using a DOM parser which builds a complete tree of the document. For what you are trying to do this is probably not necessary so a SAX parser may be better. If JDOM doesn't have one try using Xerces from Apache.
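
A minimal sketch of the SAX suggestion above: a handler that collects the text of every element with a given name while streaming, so no document tree is built. The class name and the element name in the example are hypothetical, and the code uses the JAXP SAX API with modern Java syntax, not the jdk1.3 setup from the question.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class ElementTextFinder extends DefaultHandler {
    private final String target;
    private final List<String> matches = new ArrayList<>();
    private StringBuilder text; // non-null only while inside a target element

    ElementTextFinder(String target) {
        this.target = target;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (target.equals(qName)) {
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks, so accumulate them.
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (target.equals(qName) && text != null) {
            matches.add(text.toString());
            text = null;
        }
    }

    static List<String> find(InputStream in, String target) throws Exception {
        ElementTextFinder handler = new ElementTextFinder(target);
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler.matches;
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<r><x>skip</x><name>alpha</name><y><name>beta</name></y></r>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(find(new ByteArrayInputStream(xml), "name"));
    }
}
```

For the 120 files in the question, `find` would be called once per file with a FileInputStream; target elements nested inside each other are not handled by this sketch.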

  • Parsing large xml file and display using swing

    Hi all,
    I want to read a large xml file and display graphically in swing as a tree structure.
    I implemented it and works fine for files of 5MB size after increasing the jvm heap size (-Xmx). If the file size is larger than 5MB it throws out of memory error. I'm creating a custom datastructure from the xml and I'm using sax parsing.
    After displaying the datastructure, the user could do some operation on this, like search etc.
    Can any of you suggest a method, to support larger files ? What I'm looking for is create the datastructure in file system, rather than in memory.
    Any other tips for memory management would be greatly appreciated
    Thanks in Advance.

    Use a memory-mapped file?
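
One way to read the memory-mapped suggestion, sketched under assumptions: the data structure has already been written to a file in some layout of the application's choosing, the in-memory tree keeps only offsets and lengths, and each record is mapped and decoded on demand. The class name and the offset/length scheme are hypothetical.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedSlice {
    // Reads `length` bytes starting at `offset` without loading the rest of the file.
    static String read(Path file, long offset, int length) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
            byte[] bytes = new byte[length];
            buffer.get(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("mapped", ".txt");
        Files.write(tmp, "0123456789".getBytes(StandardCharsets.UTF_8));
        System.out.println(read(tmp, 3, 4));
    }
}
```

Whether this helps depends on how much of the tree the Swing view needs at once; a lazily populated tree model would pair naturally with it.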

  • Windows Explorer misreads large-file .zip archives

       I just spent about 90 minutes trying to report this problem through
    the normal support channels with no useful result, so, in desperation,
    I'm trying here, in the hope that someone can direct this report to some
    useful place.
       There appears to be a bug in the .zip archive reader used by Windows
    Explorer in Windows 7 (and up, most likely).
       An Info-ZIP Zip user recently reported a problem with an archive
    created using our Zip program.  The archive was valid, but it contained
    a file which was larger than 4GiB.  The complaint was that Windows
    Explorer displayed (and, apparently believed) an absurdly large size
    value for this large-file archive member.  We have since reproduced the
    problem.
       The original .zip archive format includes uncompressed and compressed
    sizes for archive members (files), and these sizes were stored in 32-bit
    fields.  This caused problems for files which are larger than 4GiB (or,
    on some system types, where signed size values were used, 2GiB).  The
    solution to this fundamental limitation was to extend the .zip archive
    format to allow storage of 64-bit member sizes, when necessary.  (PKWARE
    identifies this format extension as "Zip64".)
       The .zip archive format includes a mechanism, the "Extra Field", for
    storing various kinds of metadata which had no place in the normal
    archive file headers.  Examples include OS-specific file-attribute data,
    such as Finder info and extended attributes for Apple Macintosh; record
    format, record size, and record type data for VMS/OpenVMS; universal
    file times and/or UID/GID for UNIX(-like) systems; and so on.  The Extra
    Field is where the 64-bit member sizes are stored, when the fixed 32-bit
    size fields are too small.
       An Extra Field has a structure which allows multiple types of extra
    data to be included.  It comprises one or more "Extra Blocks", each of
    which has the following structure:
           Size (bytes) | Description
                2       | Type code
                2       | Number of data bytes to follow
            (variable)  | Extra block data
       The problem with the .zip archive reader used by Windows Explorer is
    that it appears to expect the Extra Block which includes the 64-bit
    member sizes (type code = 0x0001) to be the first (or only) Extra Block
    in the Extra Field.  If some other Extra Block appears at the start of
    the Extra Field, then its (non-size) data are being incorrectly
    interpreted as the 64-bit sizes, while the actual 64-bit size data,
    further along in the Extra Field, are ignored.
       Perhaps the .zip archive _writer_ used by Windows Explorer always
    places the Extra Block with the 64-bit sizes in this special location,
    but the .zip specification does not demand any particular order or
    placement of Extra Blocks in the Extra Field, and other programs
    (Info-ZIP Zip, for example) should not be expected to abide by this
    artificial restriction.  For details, see section "4.5 Extensible data
    fields" in the PKWARE APPNOTE:

       A .zip archive reader is expected to consider the Extra Block type
    codes, and interpret accordingly the data which follow.  In particular,
    it's not sufficient to trust that any particular Extra Block will be the
    first one in the Extra Field.  It's generally safe to ignore any Extra
    Block whose type code is not recognized, but it's crucial to scan the
    Extra Field, identify each Extra Block, and handle it according to its
    type code.
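
The scanning behaviour described above can be sketched as follows. This is an illustration, not the Info-ZIP or Windows code: the class and method names are hypothetical, and it assumes that the Zip64 block carries both 8-byte sizes (uncompressed first, then compressed), which is the case when both 32-bit header fields are saturated.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

class Zip64ExtraScan {
    static final int ZIP64_TYPE = 0x0001;

    // Walks every Extra Block and returns {uncompressedSize, compressedSize}
    // from the Zip64 block, wherever it sits; returns null if none is found.
    static long[] findZip64Sizes(byte[] extraField) {
        ByteBuffer buf = ByteBuffer.wrap(extraField).order(ByteOrder.LITTLE_ENDIAN);
        while (buf.remaining() >= 4) {
            int type = buf.getShort() & 0xFFFF; // 2-byte type code
            int size = buf.getShort() & 0xFFFF; // 2-byte count of data bytes that follow
            if (size > buf.remaining()) {
                return null; // truncated or corrupt Extra Field
            }
            if (type == ZIP64_TYPE && size >= 16) {
                return new long[] { buf.getLong(), buf.getLong() };
            }
            buf.position(buf.position() + size); // unrecognized block: skip its data
        }
        return null;
    }

    public static void main(String[] args) {
        // A "UT" block (type 0x5455, 5 data bytes) placed before the Zip64 block.
        byte[] extra = ByteBuffer.allocate(4 + 5 + 4 + 16).order(ByteOrder.LITTLE_ENDIAN)
                .putShort((short) 0x5455).putShort((short) 5).put(new byte[5])
                .putShort((short) ZIP64_TYPE).putShort((short) 16)
                .putLong(4362076160L).putLong(14800839L)
                .array();
        System.out.println(Arrays.toString(findZip64Sizes(extra)));
    }
}
```

The key point is the skip step: a block with an unrecognized type code is stepped over using its own length field rather than being read as size data.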
       Here are some relatively small (about 14MiB each) test archives which
    illustrate the problem:

       Correct info, from UnZip 6.00 ("unzip -lv"):
     Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
    4362076160  Defl:X 14800839 100% 05-01-2014 15:33 6d8d2ece  test_4g.txt
     Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
    4362076160  Defl:X 14800839 100% 05-01-2014 15:33 6d8d2ece  test_4g.txt
     Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
    4362076160  Defl:X 14800839 100% 05-01-2014 15:33 6d8d2ece  test_4g.txt
    (In these reports, "Length" is the uncompressed size; "Size" is the
    compressed size.)
       Incorrect info, from (Windows 7) Windows Explorer:
    Archive        Name          Compressed size   Size
                   test_4g.txt         14,454 KB   562,951,376,907,238 KB
                   test_4g.txt         14,454 KB   8,796,110,221,518 KB
                   test_4g.txt         14,454 KB   1,464,940,363,777 KB
       Faced with these unrealistic sizes, Windows Explorer refuses to
    extract the member file, for lack of (petabytes of) free disk space.
       One archive has the following Extra Blocks: universal
    time (type = 0x5455) and 64-bit sizes (type = 0x0001).  Another
    has: PKWARE VMS (type = 0x000c) and 64-bit sizes (type = 0x0001).
    The third has: NT security descriptor (type = 0x4453), universal
    time (type = 0x5455), and 64-bit sizes (type = 0x0001).  Obviously,
    Info-ZIP UnZip has no trouble correctly finding the 64-bit size info in
    these archives, but Windows Explorer is clearly confused.  (Note that
    "1,464,940,363,777 KB" translates to 0x0005545500000400 (bytes), and
    "0x00055455" looks exactly like the size, "0x0005", and the type code,
    "0x5455", for a "UT" universal time Extra Block, which was present in
    that archive.  This is consistent with the hypothesis that the wrong
    data in the Extra Field are being interpreted as the 64-bit size data.)
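
The conversion quoted in the parenthetical note can be checked with a few lines of arithmetic. The sketch below (hypothetical class and method names) multiplies the KB figure shown by Windows Explorer by 1024 and prints the result as a 16-digit hexadecimal byte count.

```java
class SizeHexCheck {
    // Converts a size in KB (units of 1024 bytes) to a zero-padded 64-bit hex byte count.
    static String kbToHexBytes(long kb) {
        return String.format("0x%016x", kb * 1024L);
    }

    public static void main(String[] args) {
        System.out.println(kbToHexBytes(1_464_940_363_777L)); // prints 0x0005545500000400
    }
}
```

The upper half, 0x00055455, splits into 0x0005 and 0x5455, matching the data length and type code of a "UT" block as described above.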
       Without being able to see the source code involved here, it's hard to
    know exactly what it's doing wrong, but it does appear that the .zip
    reader used by Windows Explorer is using a very (too) simple-minded
    method to extract 64-bit size data from the Extra Field, causing it to
    get bad data from a properly formed archive.
       I suspect that the engineer involved will have little trouble finding
    and fixing the code which parses an Extra Field to extract the 64-bit
    sizes correctly, but if anyone has any questions, we'd be happy to help.
       For the Info-ZIP team,
       Steven Schweda

    > We can't get the source (info-zip) program for test.
       I don't know why you would need to, but yes, you can:

    You can also get pre-built executables for Windows:

    > In addition, since other zip application runs correctly. Since it should
    > be your software itself issue.
       You seem to misunderstand the situation.  The facts are these:
       1.  For your convenience, I've provided three test archives, each of
    which includes a file larger than 4GiB.  These archives are valid.
       2.  Info-ZIP UnZip (version 6.00 or newer) can process these archives
    correctly.  This is consistent with the fact that these archives are
    valid.
       3.  Programs from other vendors can process these archives correctly.
    I've supplied a screenshot showing one of them (7-Zip) doing so, as you
    requested.  This is consistent with the fact that these archives are
    valid.
       4.  Windows Explorer (on Windows 7) cannot process these archives
    correctly, apparently because it misreads the (Zip64) file size data.
    I've supplied a screenshot of Windows Explorer showing the bad file size
    it gets, and the failure that occurs when one tries to use it to extract
    the file from one of these archives, as you requested.  This is
    consistent with the fact that there's a bug in the .zip reader used by
    Windows Explorer.
       Yes, "other zip application runs correctly."  Info-ZIP UnZip runs
    correctly.  Only Windows Explorer does _not_ run correctly.

  • File-format module cannot parse the file

    What does the error message: "Could not complete your request because the file-format module cannot parse the file." mean? I made the png image in PS. Now it can't parse it, but I don't know what that means or how to correct it.

    Dear Curious,
    I volunteer my time too. I help others learn about Adobe software. I am not "trained" in this, but learn as I go. I have never had the warning I wrote about and that is why I posted. I thought as much, but posted anyway to be sure that I wasn't missing some clue. I deal with a large number of applications and all levels of members. It may be easy to be glib, but the hours I spend helping others is as valid as yours. I may not have points, but that is because I don't 'know' all I feel I should to answer. We each contribute as we can. I do wish I had more training, and I work on line with the series. Thank you for your help. I hope your curiosity is satisfied. :-)
    If you have a lot of skills in PS and didn't mind the occasional question I'd like to write you again.

  • Error parsing XSL file (weblogic.xml.jaxp.RegistryXMLReader cannot be cast

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <meta name="generator" content="HTML Tidy for Java (vers. 26 Sep 2004), see">
    <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
    <link type="text/css" rel="stylesheet" href="css/CascadeMenu.css">
    <body id="Bdy">
    Hello all, I've run into a perplexing problem with a new and unexpected error on a web application that resides in a JDeveloper 11g environment. I just run it from JDeveloper on my laptop, with no deployment other than to the default server at run time (IntegratedWebLogicServer). I am doing an XML transform using XSLT and it had been working fine until I tried to use the page yesterday. I get the following error.
    javax.servlet.ServletException: javax.xml.transform.TransformerConfigurationException: XML-22000: (Fatal Error) Error while parsing XSL file (weblogic.xml.jaxp.RegistryXMLReader cannot be cast to oracle.xml.parser.v2.SAXParser).
    at weblogic.servlet.jsp.PageContextImpl.handlePageException(
    at jsp_servlet.__transform._jspService(
    at weblogic.servlet.jsp.JspBase.service(
    at weblogic.servlet.internal.StubSecurityHelper$
    at weblogic.servlet.internal.StubSecurityHelper.invokeServlet(
    at weblogic.servlet.internal.ServletStubImpl.execute(
    at weblogic.servlet.internal.ServletStubImpl.onAddToMapException(
    at weblogic.servlet.internal.ServletStubImpl.execute(
    at weblogic.servlet.internal.TailFilter.doFilter(
    at weblogic.servlet.internal.FilterChainImpl.doFilter(
    at$ at Method) at at at at
    at weblogic.servlet.internal.FilterChainImpl.doFilter(
    at oracle.dms.wls.DMSServletFilter.doFilter(
    at weblogic.servlet.internal.FilterChainImpl.doFilter(
    at weblogic.servlet.internal.WebAppServletContext$
    at at
    at weblogic.servlet.internal.WebAppServletContext.securedExecute(
    at weblogic.servlet.internal.WebAppServletContext.execute(
    at at at
    Caused by: javax.xml.transform.TransformerConfigurationException: XML-22000: (Fatal Error) Error while parsing XSL file (weblogic.xml.jaxp.RegistryXMLReader cannot be cast to oracle.xml.parser.v2.SAXParser).
    at oracle.xml.jaxp.JXSAXTransformerFactory.reportConfigException(
    at oracle.xml.jaxp.JXSAXTransformerFactory.newTemplates(
    at oracle.xml.jaxp.JXSAXTransformerFactory.newTransformer(
    at weblogic.xml.jaxp.RegistryTransformerFactory.newTransformer(
    at org.apache.taglibs.standard.tag.common.xml.TransformSupport.doStartTag(
    at jsp_servlet.__transform._jsp__tag2(
    at jsp_servlet.__transform._jspService(
    ... 25 more
    Caused by: java.lang.ClassCastException: weblogic.xml.jaxp.RegistryXMLReader cannot be cast to oracle.xml.parser.v2.SAXParser
    at oracle.xml.jaxp.JXSAXTransformerFactory.newTemplates(
    ... 30 more
    ------------------------------------------------
    I changed no code and moved no XML or XSLT files. I do see an error in the log regarding a bad URL:
    -----------------------------------------------
    XML-22108: (Error) Invalid Source - URL format is incorrect.
    XML-22000: (Fatal Error) Error while parsing XSL file (weblogic.xml.jaxp.RegistryXMLReader cannot be cast to oracle.xml.parser.v2.SAXParser).
    <[ServletContext@10343785[app:QSBQAR module:QSBQAR-QSBQAR-context-root path:/QSBQAR-QSBQAR-context-root spec-version:2.5], request: weblogic.servlet.internal.ServletRequestImpl@699744[
    GET /QSBQAR-QSBQAR-context-root/Transform.jsp?reqtype=1 HTTP/1.1
    Accept: image/gif, image/jpeg, image/pjpeg, application/x-ms-application, application/, application/xaml+xml, application/x-ms-xbap, application/x-shockwave-flash, application/, application/, application/msword, */*
    Accept-Language: en-us
    User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; WOW64; Trident/4.0; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .
    ------------------------------ Here is the XML ------------------------------
    <?xml version="1.0" encoding="windows-1252" standalone="no"?>
    ACME Bird Seed Co. Capture the Road Runner using a boulder, rope and bird seed. Quinn Brian 00 00 00 00 00 00 00 00 00 11 08 08 08 08 00 43 43 Hours have been approved. APPROVED Smart Jean 00 00 00 00 00 00 00 00 00 Hours approved. APPROVED
    --------------------------------------------------------------------------------------- Here is the XSL ---------------------------------------------------------------------------------------
    <?xml version="1.0" encoding="windows-1252"?>
    <!-- Root template -->
    <h2>Project Hours Worked</h2>
    ----------------------------------------------------------------------------------------- Here is the JSP with the transform ----------------------------------------------------------------------------------------
    <td>Week Ending Date:--</td>
    <th>Last Name</th>
    <th>First Name</th>
    <td>Total Hours: </td>
    <%@ page contentType="text/html;charset=windows-1252"%><%@ taglib uri="" prefix="x" %><%@ taglib uri="" prefix="c" %></table>
    <script type="text/javascript" src="scripts/CascadeMenu.js">
    <% int bad = 1; %>
    <div id="menuBar" class="menuBar">
    <div id="Bar1" class="Bar">Home</div>
    <div id="Bar3" class="Bar">Accounting</div>
    <div id="Bar4" class="Bar">Help</div>
    <div style="background:#84ffff; color:Aqua; "><br>
    <p style="color:Orange; font-size:x-large; font-style:italic; font-weight:bold;
    font-family:Arial, Helvetica, sans-serif; "><img src="images/logoqsq.jpg" style="border:1" height="120" width="120" alt="Q Squared">
    <p style="color:Black; font-size:x-large; font-style:italic; font-weight:bold; font-family:Arial, Helvetica, sans-serif;"><img src="images/dilbert.gif" alt="Dilbert" height="100" width="100">
    ? ? Welcome to Q Squared-Brian Quinn Consulting - Manager Time Approval</p>
    <table width="100%" class="table1">
    <td style="width:15%; border-width:medium; background-color:silver ">
    <h3>Contractor Resources</h3>
    <ul style="list-style-type:circle; ">
    <li>Time Entry</li>
    <h3>Manager Resources</h

    LOL - I didn't think about the forum message area having trouble displaying my XML XSLT problem
    It seemed to mix the code with the site XML.
    Oh brother
    The deal is this.
    The XML XSLT transform was working and now it is not and I think it has something to do with
    the HTTP links for either the Oracle core and/or XML TAGLIBs. Either that or the has
    outdated XSLT http links.
    Anyone know if changes have been made to any of these taglib links?
    This in the JSP
    <%@ taglib uri="" prefix="x" %>
    <%@ taglib uri="" prefix="c" %>
    <c:import url="HoursWorked.xml" var="xmlHoursWorked" charEncoding="windows-1252"/>
    <c:import url="./HoursWorked3.xsl" var="xslt" charEncoding="windows-1252"/>
    <x:transform xml="${xmlHoursWorked}" xslt="${xslt}" />
    This in the XSL
    <xsl:stylesheet version="2.0" xmlns:xsl="">
    And the other JSP having the same problem.
    <%@ page contentType="text/html;charset=windows-1252"
    import="java.util.List, qsbqar.XMLHandler, org.w3c.dom.NodeList,
    org.w3c.dom.Node, oracle.xml.parser.v2.*,, " %>
    <%@ taglib uri="" prefix="c" %>
    <%@ taglib uri="" prefix="x" %>
    <xsl:param name="employeeID" value="2"/>
    <%session.setAttribute("employee_ID", request.getParameter("consultantID")); %>
    <c:import url="HoursWorked.xml" var="xmlHoursWorked" charEncoding="windows-1252"/>
    <c:import url="./HoursWorked4.xsl" var="xslt" charEncoding="windows-1252"/>
    <x:transform xml="${xmlHoursWorked}" xslt="${xslt}">
    <x:param name="employeeID" value="${sessionScope.employee_ID }"/>
    Edited by: B of Carbon on Dec 19, 2010 12:25 AM
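
One way to narrow a problem like this down is to run the same stylesheet and data through plain JAXP, outside the JSP and the JSTL x:transform tag. The sketch below is not the poster's code; the class name and the sample XML and XSL in main are hypothetical, and it uses whatever TransformerFactory the runtime provides. The JDK's built-in processor implements XSLT 1.0, so a version="2.0" stylesheet may behave differently there.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class PlainJaxpTransform {
    // Applies the stylesheet text to the XML text and returns the result as a String.
    static String transform(String xml, String xslt) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xslt = "<xsl:stylesheet version=\"1.0\""
                + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
                + "<xsl:output method=\"text\"/>"
                + "<xsl:template match=\"/\"><xsl:value-of select=\"hours/total\"/></xsl:template>"
                + "</xsl:stylesheet>";
        System.out.println(transform("<hours><total>43</total></hours>", xslt));
    }
}
```

If the files transform cleanly this way, the stylesheet and data are probably fine and the ClassCastException in the trace is more likely tied to which XML parser and transformer implementations the server wires together.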

  • Please advise me on reading a large file

    I need advice on reading a large file using JSP. I need to read a text file which contains around 2 million records (lines). Sometimes I also need to find a specific line in it.
    I have to make a small web application to generate reports on the text files after parsing them into a database (Oracle or MySQL). I already have a small, similar application based on PHP / MySQL, but parsing the text file into the database still takes too long.
    How much will it improve if I go for Tomcat / Oracle / Java Web Component Development?
    Please advise me.

    Try changing your approach - don't read it into an array, process it one line at a time. Obviously any approach where you have the whole file in memory is going to exceed memory at some size of the file.
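
A minimal sketch of the line-at-a-time approach, assuming the goal is to find the first line containing some text. Only the current line is held in memory, so the file size does not matter. The class and method names are hypothetical; in the real application the Reader would wrap the 2-million-line file and each line would be handed to the database import instead of being searched.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class LineScan {
    // Returns the 1-based number of the first line containing `needle`, or -1 if none does.
    static long findFirst(Reader source, String needle) throws IOException {
        try (BufferedReader reader = new BufferedReader(source)) {
            long lineNo = 0;
            for (String line; (line = reader.readLine()) != null; ) {
                lineNo++;
                if (line.contains(needle)) {
                    return lineNo;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(findFirst(new StringReader("a\nb target\nc"), "target"));
    }
}
```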

  • Parsing HTML files

    I have a question about parsing HTML files. Usually when I get an HTML file and need to find all the text in it, I do the following. The code collects all of the hyperlinks and ignores the HTML tags, keeping just the actual text. It's fine for smaller files, but occasionally I'll hit a large online text file; it works, but it's way too slow for large files. I don't need any of this HTML tag stripping for plain text files, however. Is there a way to still grab all the text without doing any tag searching, to make it faster?
    private void find() throws IOException
    {
        // Really slow for large text files.  Need a way to just use a regular scanner on an internet text file
        new ParserDelegator().parse(new InputStreamReader(myBase.openStream()),
                new ParserListener(),
                true); // third argument (ignoreCharSet) was cut off in the post; true is an assumption
    }
    /**
     * Inner class for processing all "<a href.."> tags when reading a base URL.
     */
    private class ParserListener extends HTMLEditorKit.ParserCallback
    {
        final String IGNORED_LINKS = "^(http|mailto|\\W).*";
        public void handleStartTag (HTML.Tag t, MutableAttributeSet a, int pos)
        {
            if (t == HTML.Tag.A)
            {
                String href = (String)(a.getAttribute(HTML.Attribute.HREF));
                //System.out.println(href.matches(IGNORED_LINKS) + "\t" + href);
                if (! (href == null || href.matches(IGNORED_LINKS)) && !myURLs.contains(href))
                {
                    // body of this branch was cut off in the original post
                }
            }
            //TODO fix
            if (t == HTML.Tag.TITLE)
            {
                String title = (String) (a.getAttribute(HTML.Attribute.TITLE));
                if (!(title == null))
                    myTitle = title;
                else myTitle = "No title was found";
            }
        }
        public void handleText (char[] data, int pos)
        {
            myText.append(" ");
            // any further statements were cut off in the original post
        }
    }

    JFactor2004 wrote:
    > My question is. If I know an html file is actually just a txt file
    This isn't a question. HTML files are text by definition.
    > is it possible to look through it (maybe use something similar to a regular scanner) without doing anything with html.
    That depends on what you mean by "doing something with HTML". You can certainly read it one line at a time.

  • PS CC: Camera Raw filter gives "cannot parse your files" error

    I can open files fine in Photoshop CC, but if I try to use the Camera Raw filter, I get the "file format module cannot parse your files" error. This only happens with larger TIFFs, say 300 MB or so - 70 MB files work fine. I have a fast machine running Windows 7 with 12GB of RAM, so I would expect to have enough space, but maybe it isn't configured properly in PS preferences?
    Thanks for any help,

    siskin11 wrote:
    I can open files fine in Photoshop CC, but if I try to use the Camera Raw filter, I get the "file format module cannot parse your files" error. This only happens with larger TIFFs, say 300 MB or so - 70 MB files work fine.
    A 300 MB TIFF seems large for a TIFF, which leads me to believe it's a layered TIFF. ACR does not have layer support. Try opening the TIFF file in Photoshop, not through ACR, to see if there are layers. If there are, target all of them and convert them to a Smart Object, then try ACR as a filter. A Smart Object is a single layer in Photoshop.

  • Recommended Structure for Large Files

    I am working at re-familiarizing myself with Oracle development and management, so please forgive my ignorance on some of these topics or questions. I will be working with a client who is planning a large-scale database for what is called "Flow Cytometry" data, which will be linked to research publications. The actual data files (FCS) and various text, tab-delimited and XML files will all be provided by researchers in a wrapper or zip container, which will be parsed by some as-yet-to-be-developed tools. For the most part, data will consist of a large FCS file containing the actual Flow Cytometry data, along with various accompanying text/XML files containing the metadata (experiment details, equipment, reagents etc). What is most important is the metadata, which will be used to search for experiments etc. For the most part the actual FCS data files (up to 100-300 MB) will only need to be linked (stored as BLOBs?) to the metadata, and their content will be used at a later time for actual analysis.
    1: Since the actual FCS files are large, and may not initially be parsed and imported into the DB for later analysis, how best can/should Oracle be configured/partitioned so that a larger/direct-attached storage drive/partition can be used for the large files, so as not to take up space where the actual running instance of Oracle is installed? We are expecting around 1 TB of data files initially.
    2: Are there any on-line resources which might be of value to such an implementation?

    Large files can be stored as BFILE datatypes. The data need not be transferred to Oracle tablespaces and the files will reside in OS.
    It is also possible to index bfiles using Oracle text indexing.

  • XML parsing huge file

    I have a 36 MB XML file I need to parse, and I'm new to XML.
    I usually get a 200 KB file in CSV format from most of my clients, which they transfer into their account; I then simply update the MSSQL database with the CSV file at midnight on my server. But now I have 74 clients that have been regrouped, and they send me one XML file.
    When I run it using the sample they gave me it works fine, but on the 36 MB file I get a JRun error. Then I found out that:
    <CFFile action="READ" variable="xmlfile" file="c:\mypath\#clientfile#.xml" charset="utf-8">
    <cfset xmlObj = xmlParse(#xmlfile#)>
    doesn't work on big files because it runs out of memory.
    I need a way to parse that file using Java. I downloaded xmlsax.js, but I don't know how to use it to parse and then get my parsed variable back from it. Can anyone help me, please?
    I got the file here :
    Thank you

    In response to Owain North's comments about DOM parsing.
    I'm not sure if the memory issues are the fault of the DOM parsing method being used or if the problem is in how CF converts XML text into the CF objects (arrays, structs) that the XML text represents.  It's possible that the CF objects are responsible for using excessive amounts of memory.  Either way, it sounds like CF's XML parsing capabilities aren't appropriate for larger (large being a relative term) XML files.
    It might be an interesting experiment to use third party Java components (such as Xerces2) to parse some XML files and see what the performance and memory usage look like.
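    A minimal sketch of what such an experiment could look like, assuming the JDK's bundled JAXP SAX API is used instead of a separately installed Xerces2. The class name SaxElementCounter, the method countElements, and the sample XML are all hypothetical and not from the thread. SAX delivers events as it reads, so the document is never built into an in-memory tree:

    ```java
    import java.io.ByteArrayInputStream;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.Map;
    import java.util.TreeMap;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    // Hypothetical helper (not from the thread): counts elements by name
    // with SAX, so memory use does not grow with the size of the document.
    final class SaxElementCounter {

        static Map<String, Integer> countElements(InputStream in) throws Exception {
            final Map<String, Integer> counts = new TreeMap<>();
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(in, new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attributes) {
                    // Called once per opening tag; only the running tally is kept.
                    counts.merge(qName, 1, Integer::sum);
                }
            });
            return counts;
        }

        public static void main(String[] args) throws Exception {
            // Invented sample data standing in for the real client file.
            String xml = "<clients><client id=\"1\"/><client id=\"2\"/></clients>";
            try (InputStream in = new ByteArrayInputStream(
                    xml.getBytes(StandardCharsets.UTF_8))) {
                System.out.println(countElements(in));
            }
        }
    }
    ```

    For a real file, the ByteArrayInputStream could be swapped for java.nio.file.Files.newInputStream(path). Whether this helps inside CF depends on how the results are handed back to CF objects, which this sketch does not address.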
    I will re-state my original advice.  The poster needs to import data from XML files into tables on MS SQL Server.  Bulk import tasks, such as from XML or CSV files, are generally better handled in MS SQL Server.  Some options include: a job that executes T-SQL, an Integration Services package, or the Bulk Copy Program (BCP) utility. 
    From: Owain North
    Sent: Fri 11/12/2010 8:57 AM
    Subject: XML parsing huge file
    Couldn't agree more, and to be honest I can't believe this hasn't come up before. To me, the thought that something like CF should have to be bypassed when you get to files of a few megs is utterly ridiculous. I haven't looked into the different methods of parsing XML as it's really not my thing, but are we saying that DOM parsing is necessary for CF to be able to perform the functions it does on the resulting XML object? Or does one create the same result, just through a different method?
    Owain North

  • PS CS3 Cannot parse the file?

    Hello all,
    I have been having this problem all year. Sometimes, not always, I cannot bring a PNG file into CS3. It tells me that the file format module cannot parse the file. I can't even find info on what this means, let alone how to fix it.
    It is usually a file larger than 1 MB.
    Created in CS3, by me.
    Completely graphics.
    No photos.
    Not downloaded from any browser or site.
    Sometimes I can bring the file in IF I close all other files I am working on in CS3.  Sometimes it never works.
    The external hard drive that I save these files on is a terabyte and is nowhere near even 5% full. My computer has 4 GB of memory and a 250 GB hard drive. I had the same problem before I started to use the terabyte drive, so the drive is not the cause.
    I am the first to admit I am not computer savvy, but I have been working with PS CS3 for 9 months, 10+ hours a day, and cannot solve this problem.
    My husband is an application developer and cannot figure out the solution either.
    I have been forced to save smaller files just so I can open them in CS3 to continue editing them.
    It is extremely frustrating and I have not found ANY help at all online or in PS help.
    I don't understand what I am doing wrong, or what is wrong with the program.
    If you have ANY suggestions please post them.
    Irrational Andi

    Oops, just noticed that these are PNGs from PS. I was trying to beat the scheduled shutdown of the fora and missed that.
    The only thing even close to this has been with saving PSDs from PS without the Compatibility option, and then InDesign and AI not being able to "parse the file." That should not happen, but it has, to the point that I now have the Compatibility Mode set as my default.
    Sorry for my comments, as they are not likely to help you.

  • Rtorrent: issue with DL large files (>4GB) to NTFS

    Using the latest rtorrent/rutorrent: every time I download a large (>4GB) file with rtorrent to the NTFS drive, it shows the whole file downloading MB by MB, but when I hash check (via rutorrent), only a partial percentage has been downloaded. For example, if I download a 4.36 GB .mkv file and then hash check, only 10% is done: ~400MB, or about 6 minutes of the video.
    If I do ls -l --block-size=MB, the file shows normal 4GB+ size.
    If I do ls -s, file appears to be only a few hundred MB.
    If I DL to my root ext4 drive, there's no issue unless I change the save path of the torrent in rutorrent and elect for the files to be moved to the NTFS drive.
    I've transferred large files with 'cp' from another NTFS to this NTFS with no issue.
    I thought the problem was rutorrent plugin autotools, but I removed it from my plugins folder and the problem persists.
    I have all the relevant directories in /etc/php.ini open_basedir:  the user/session, the mounted drive, and /srv/http/rutorrent
    I did #chown -R http:http /srv/http/rutorrent
    http is a member of the group with NTFS drive access
    the rutorrent/tmp directory is changed to be within /srv/http/rutorrent
    This is a pesky issue that I didn't have with my last arch install using the same general set up.
    I DL to an NTFS formatted drive and mount it the same way I did before: ntfs-3g defaults,auto,uid=XXXX,gid=XXXX,dmask=027,fmask=037
    My rtorrent user is the uid (owner) and is in the group that has access to the drive (along with my audio server user and http)
    I run rtorrent in screen as the rtorrent user
    I imagine this is an issue with rutorrent?
    Any tips before I reformat the whole 4TB to ext4?
    EDIT:  the issue is definitely isolated to rtorrent.  I manually added large size torrent using rtorrent, it completed.  I then hash checked (in rtorrent) and again only ~10% was shown as complete.
    EDIT2:  It is most definitely not a permissions issue.  Tried this again without mount permissions options and the same thing happens.
    Last edited by beerhoof (2015-01-30 22:05:57)

    I'm afraid I don't understand the question.
    7.2 now correctly parses the Canon XF .CIF sidecar files to determine whether the media is supposed to be spanned or not.  This has been a feature request that has been finally addressed to work correctly.
    (It was also there in 7.1 and earlier, but had limitations: the performance wasn't as good, there had been issues in the past with audio pops at cut points, and it required that the Canon XF folder structure remain intact, i.e. if you copied the media to a flattened folder structure, it would fail to do the spanning correctly.)
    If you are looking for a means to disable the automatic spanning, simply removing the .CIF files will achieve that, although I'm not sure I understand why you're looking to do that. Most people *want* spanning to happen automatically; otherwise you're forced to sync spanned media segments by hand.
