Fastest Text Parsing Technique

Hi all,
I have a set of huge text files that I need to parse quickly, removing all the asterisk characters. At the moment my simple program looks like this:
import java.io.*;
public class QuickParse
{
     public static void main(String[] args) throws Exception
     {
          BufferedReader in = new BufferedReader(new FileReader(args[0]));
          PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(args[1])));
          String str = "";
          while ((str = in.readLine()) != null)
               out.write(str.replaceAll("\\*","") + "\r\n");
          in.close();
          out.close();
     }
}
Is there a faster way of doing this? These files are well over a gig each. I'm aware that at the moment this program may well be leaving lots of String objects in memory, but I'm not sure what to do about it, or if this would slow the process down.
Thanks for any help or ideas.

I created a test file with 100k lines and total file size of about 10.6M. Each line looked like this:
0*1000000000*2000000000*3000000000*4000000000*5000000000*6000000000*7000000000*8000000000*9000000000*
I never know exactly how to test IO since the second time you read a file it is usually faster because part of the file is in memory. Anyway, I ran my test program 6 times changing the order of method invocation within each run. Here are my results:
c:\java\temp>java QuickParse 1
Replace : 2297
Tokenize: 1859
Movebyte: 1938
c:\java\temp>java QuickParse 2
Tokenize: 1000
Movebyte: 1937
Replace : 2219
c:\java\temp>java QuickParse 3
Movebyte: 1937
Replace : 2281
Tokenize: 1563
c:\java\temp>java QuickParse 1
Replace : 2297
Tokenize: 1469
Movebyte: 1922
c:\java\temp>java QuickParse 2
Tokenize: 1015
Movebyte: 1906
Replace : 2219
c:\java\temp>java QuickParse 3
Movebyte: 1906
Replace : 2281
Tokenize: 969
Order of method execution didn't seem to matter. The results (from best to worst) were always:
a) Tokenize
b) Movebyte
c) Replace
Here is the code I executed:
import java.io.*;
import java.util.*;
public class QuickParse
{
     public static void main(String[] args) throws Exception
     {
//          String input = "C:\\Java\\j2sdk1.4.2\\src\\javax\\swing\\JComponent.java";
          String input = "data100000";
          // The argument only selects the order in which the three methods run
          if (args[0].equals("1"))
          {
               replace(input, "output.replace");
               tokenize(input, "output.tokenize");
               movebyte(input, "output.movebyte");
          }
          else if (args[0].equals("2"))
          {
               tokenize(input, "output.tokenize");
               movebyte(input, "output.movebyte");
               replace(input, "output.replace");
          }
          else if (args[0].equals("3"))
          {
               movebyte(input, "output.movebyte");
               replace(input, "output.replace");
               tokenize(input, "output.tokenize");
          }
     }
     // Line by line, stripping the asterisks with a regular expression
     public static void replace(String input, String output) throws Exception
     {
          long start = System.currentTimeMillis();
          BufferedReader in = new BufferedReader( new FileReader( input ) );
          BufferedWriter out = new BufferedWriter( new FileWriter( output ) );
          String str = "";
          while ((str = in.readLine()) != null)
               out.write(str.replaceAll("\\*","") + "\r\n");
          in.close();
          out.close();
          System.out.println("Replace : " + (System.currentTimeMillis() - start) );
     }
     // Line by line, writing only the tokens found between asterisks
     public static void tokenize(String input, String output) throws Exception
     {
          long start = System.currentTimeMillis();
          BufferedReader in = new BufferedReader( new FileReader( input ) );
          BufferedWriter out = new BufferedWriter( new FileWriter( output ) );
          String str = "";
          while ((str = in.readLine()) != null)
          {
               StringTokenizer st = new StringTokenizer(str, "*");
               while (st.hasMoreTokens())
                    out.write(st.nextToken());
               out.write("\r\n");
          }
          in.close();
          out.close();
          System.out.println("Tokenize: " + (System.currentTimeMillis() - start) );
     }
     // Raw bytes in 4K chunks, copying every byte that is not an asterisk
     public static void movebyte(String input, String output) throws Exception
     {
          long start = System.currentTimeMillis();
          BufferedInputStream in = new BufferedInputStream( new FileInputStream( input ) );
          BufferedOutputStream out = new BufferedOutputStream( new FileOutputStream( output ) );
          byte[] buffer = new byte[4 * 1024];
          int bytes;
          while ( (bytes = in.read( buffer )) != -1 )
          {
               for (int i = 0; i < bytes; i++)
               {
                    if (buffer[i] != (byte)'*')
                         out.write( buffer, i, 1 );
               }
          }
          in.close();
          out.close();
          System.out.println("Movebyte: " + (System.currentTimeMillis() - start) );
     }
}
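A possible variation on movebyte, not part of the test above and not benchmarked here: instead of calling out.write once per surviving byte, each chunk can be compacted in place and written with a single call. The class name, method name and the 64K buffer size are my own choices for illustration. It also assumes an ASCII-compatible encoding (such as ASCII, Latin-1 or UTF-8), where the byte value of '*' never occurs inside another character.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class CompactFilter {
    // Copies in to out, dropping every '*' byte.
    // Each chunk is compacted in place, then written with one call.
    static void copyWithoutAsterisks(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[64 * 1024]; // chunk size is an arbitrary choice
        int read;
        while ((read = in.read(buffer)) != -1) {
            int kept = 0;
            for (int i = 0; i < read; i++) {
                byte b = buffer[i];
                if (b != (byte) '*') {
                    buffer[kept++] = b; // slide the surviving bytes to the front
                }
            }
            out.write(buffer, 0, kept);
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] = input file, args[1] = output file
        try (InputStream in = new FileInputStream(args[0]);
             OutputStream out = new FileOutputStream(args[1])) {
            copyWithoutAsterisks(in, out);
        }
    }
}
```

Whether this actually beats the tokenizer version would have to be measured on the real files; disk speed may well dominate either way.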

Similar Messages

Parsing Techniques?

I am working on a little tool that uses the StringTokenizer class to make it easier to manage some text-based account files. However, those files have to be built in a very specific way for the program to understand the data correctly, so my parsing technique isn't really user-friendly. Since then, I've been wondering what the best parsing techniques are. The Java compiler, for example, can understand many code structures, and I'd like my program to use similarly robust parsing techniques :)
If anyone can point me to:
-tutorials
-APIs
Or if anyone can teach me directly in this post,
I'd be very thankful :)

Here's an extra tip for you. I think you might be better off if you change from an INI-like file format to XML documents. That would enable you to add a degree of validation by creating an XML schema and using the existing Java XML parser technology in your product. You probably don't want to do something like that right away, but I'm just saying it's a good thing to consider for the (near) future.
Example XML based on your example INI:
<account name="AccountName">
<password>password</password>
<level>admin</level>
<lang>en</lang>
<totalconnecttime>845</totalconnecttime>
</account>
<account name="AccountName2">
<password>password</password>
<level>guest</level>
<lang>fra</lang>
<totalconnecttime>157</totalconnecttime>
</account>
With an XML schema, you can specify things such as "totalconnecttime must be numeric" and "lang must be 2 or 3 characters long". By incorporating an existing, validating Java XML parser, you need relatively little effort to validate your files. Effectively, you make your files more reliable this way...
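As a rough sketch of what that validation could look like with the standard javax.xml.validation API: the schema below is a hypothetical one written for this example, and it assumes the account elements are wrapped in a single root element called accounts (an XML document needs exactly one root, which the fragment above does not show). The class and method names are also made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class AccountSchemaCheck {
    // Hypothetical schema: numeric totalconnecttime, lang of 2 or 3 characters.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='accounts'><xs:complexType><xs:sequence>"
      + "<xs:element name='account' maxOccurs='unbounded'><xs:complexType>"
      + "<xs:sequence>"
      + "<xs:element name='password' type='xs:string'/>"
      + "<xs:element name='level' type='xs:string'/>"
      + "<xs:element name='lang'><xs:simpleType><xs:restriction base='xs:string'>"
      + "<xs:minLength value='2'/><xs:maxLength value='3'/>"
      + "</xs:restriction></xs:simpleType></xs:element>"
      + "<xs:element name='totalconnecttime' type='xs:nonNegativeInteger'/>"
      + "</xs:sequence>"
      + "<xs:attribute name='name' type='xs:string' use='required'/>"
      + "</xs:complexType></xs:element>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    // Returns true if the document conforms to the schema above.
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            // With no error handler set, validation errors are thrown as SAXException
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a real product the schema would normally live in its own .xsd file, and the exception message would be reported to the user rather than collapsed into a boolean.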

How to use parsing technique to access a particular value from a webpage

hi,
I need code to access a particular value from a webpage only by using its link. One of my friends said that we can do this with a parsing technique, but I don't have any knowledge of it. Can anyone help me?

ksnagendran26 wrote:
> hi,
> I need code to access a particular value from a webpage only by using its link. One of my friends said that we can do this with a parsing technique, but I don't have any knowledge of it. Can anyone help me?
I'm sorry, could you explain in detail what you mean by "access a particular value from a webpage only by using its link"?

Text parsing and java

I'd like to develop a java application that takes an .html plain text file and finds an html table within the file, and puts that table into a jtable (formatting it according to my preferences). I'm assuming a text parser would be my best bet. I can build my own text parser, but I'd like to avoid the work if there's an efficient parser in the API. Is there?

These examples from the Java Developers Almanac (http://javaalmanac.com/egs/javax.swing.text.html/pkg.html) might help you out.
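For a feel of what the javax.swing.text.html classes offer, here is a minimal sketch (my own, not taken from the Almanac) that collects the cell text of table rows using the callback-based parser in the JDK. The class and method names are made up, it treats all rows in the file as one table, and it ignores nested tables and colspan/rowspan.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class HtmlTableReader {
    // Returns one list of cell strings per <tr> found in the document.
    static List<List<String>> readTable(Reader html) throws IOException {
        final List<List<String>> rows = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private List<String> row;   // row being built, or null outside <tr>
            private StringBuilder cell; // cell being built, or null outside <td>/<th>

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.TR) {
                    row = new ArrayList<>();
                } else if (tag == HTML.Tag.TD || tag == HTML.Tag.TH) {
                    cell = new StringBuilder();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (cell != null) {
                    cell.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if ((tag == HTML.Tag.TD || tag == HTML.Tag.TH) && cell != null) {
                    if (row != null) {
                        row.add(cell.toString().trim());
                    }
                    cell = null;
                } else if (tag == HTML.Tag.TR && row != null) {
                    rows.add(row);
                    row = null;
                }
            }
        };
        // The third argument tells the parser to ignore any charset declared in the HTML
        new ParserDelegator().parse(html, callback, true);
        return rows;
    }
}
```

The resulting rows could then be copied into a DefaultTableModel and handed to a JTable. The JDK's HTML parser is old and lenient at best, so for messy real-world pages a dedicated HTML library may be the better choice.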

Rt2:Full Text Parser

Hi,
if a collector cannot parse a record and calls sendUnsupported(), rt2 gets
set to 'Full Text Parser'. In Sentinel 7, the event is then marked with a
"binary" icon.
I've seen this behavior with the Generic and the SLES collector. However, if
I do this from a custom collector built from the 2001.1r1 template, the
event in Sentinel will have rt (SentinelProcessingComponent) set to the name
of the collector plugin.
Nobert

Hi Bhuppyd,
Thanks for your posting.
Could you please post the sample code you use to create the FULL-TEXT catalog? Here is the document for your reference:
CREATE FULLTEXT CATALOG:
http://msdn.microsoft.com/en-us/library/ms189520(v=sql.105).aspx
In addition, I would suggest you provide more detailed error message information so that we can investigate further. Troubleshooting Errors in a Full-Text Population:
http://technet.microsoft.com/en-us/library/ms142495(v=sql.105).aspx
Regards,
Elvis Long
TechNet Community Support

Internal.text parsed by XML object ?

Is it possible to use the XML object to parse internal xml
instead of loading an external file?
I am writing an application that converts arrays &
objects into xml.
In my development, I am displaying the xml in a text field on
the stage. In testing I want to read this text field as a xml file.
Currently I copy and paste the xml from flash into the external xml
file (with notepad) then reload the xml file with
sample_xml.load("xml/sample_1.xml");
Can I point to the text field in flash to get the xml parsing
instead of the external file?
Thanks

Actually two ways:
var myXML:XML = new XML(myTextField.text)
or
var myXML:XML = new XML();
//maybe:
myXML.ignoreWhite=true;
//and later...
myXML.parseXML(myTextField.text);

Java text parsing

Hi all!
I'm a Java beginner and now I have to parse a text file of the form:
attribute1 1|2|443|554|56
attribute2 someString
attribute3 someOtherString
...and so on...
I have to change the value of one of the attributes.
Could you tell me how to overwrite this attribute.
Thanx,
Pesho

There are some very powerful String tools, which you might want to look at. The first is the Scanner Class http://java.sun.com/j2se/1.5.0/docs/api/java/util/Scanner.html , which helps you to tokenize your Strings. It also helps you with regular expressions, which you should look up in the Java tutorial and here: http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/package-frame.html
A simpler version which is easier to handle is the StringTokenizer Class:
http://java.sun.com/j2se/1.5.0/docs/api/java/util/StringTokenizer.html
About changing the file: you just can't change a few bits of your file. You have to read the whole file into some data structure, change the data structure and save it back to disk.
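A minimal sketch of that read-change-write cycle, assuming (as in the example above) that each line starts with the attribute name followed by whitespace and the value. The class name, method name and command-line arguments are made up for illustration, and UTF-8 is an assumed encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class AttributeFile {
    // Returns a copy of the lines in which the attribute called name gets newValue.
    // All other lines are passed through untouched.
    static List<String> withAttribute(List<String> lines, String name, String newValue) {
        List<String> result = new ArrayList<>(lines.size());
        for (String line : lines) {
            String[] parts = line.split("\\s+", 2); // parts[0] is the attribute name
            if (parts[0].equals(name)) {
                result.add(name + " " + newValue);
            } else {
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // args: file, attribute name, new value
        Path file = Paths.get(args[0]);
        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
        Files.write(file, withAttribute(lines, args[1], args[2]), StandardCharsets.UTF_8);
    }
}
```

For example, running it with the arguments data.txt attribute2 newString would rewrite only the attribute2 line. Writing to a temporary file and renaming it afterwards would be safer if the program can be interrupted mid-write.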

Text parsing into proper array

I think I have been staring at this too long. I am working with very old equipment that gives me the attached test.txt. I need to parse it so that I can pull out individual variables, but I can't parse it into a proper array because the amount of white space between values varies. Can anyone help me with this?
Attachments:
test.txt ‏1 KB

Hi Henry,
I understand it. But you can first read your file with it. You can also read it as a "normal" text file. I played with your file, and attached you can find a picture which shows how it's possible to replace the multiple spaces with only one space.
Hope it helps.
Mike
Attachments:
remove_multipleSpaces.PNG ‏25 KB
other_cases.PNG ‏10 KB
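The attached pictures are not reproduced here. As a rough textual equivalent of the same idea, sketched in Java rather than in the poster's own tool: collapse every run of white space into one space, or split on runs of white space directly. The sample line in main is invented, since the contents of test.txt are not shown.

```java
import java.util.Arrays;

class WhitespaceFields {
    // Replaces every run of white space with a single space.
    static String collapseSpaces(String line) {
        return line.trim().replaceAll("\\s+", " ");
    }

    // Splits a line into fields, whatever the amount of white space between them.
    static String[] fields(String line) {
        String trimmed = line.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        String sample = "  12.5    ABC   7 "; // made-up line
        System.out.println(collapseSpaces(sample));
        System.out.println(Arrays.toString(fields(sample)));
    }
}
```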

Oracle Text Parser/Filter

Hi everyone,
I'm currently working with oracle text, everything is working fine but now I have to extend some features so the problem is this:
- Oracle provides some parsers for common file types such as .doc, .docx, .pdf, .html, etc.. but now I must create a custom parser - its use will possibly be parsing custom file types or encrypted files.
Does anyone know of some documentation that can help me create my custom parser and get it to work properly?
Any help would be appreciated.
Regards,

> - In what language and format (.exe, .pl, .class, etc.) should my custom parser be developed?
Any executable file.
> - Where should it be in the database directory tree?
Whatever directory you have your sql files stored in. The default is your bin sub-directory of your oracle home directory path, except on Windows it must be in the ctx/bin sub-directory of the oracle home directory path.
> - How can I tell that my custom parser is only to be used when the document extension meets certain criteria? (I want to mix doc, docx and my custom extension so the parsers provided by oracle should be available and working too)
Use a format column.
> - Where shall I insert the tokens and with what info?
If you create a user_filter preference, then set your executable file as the command attribute to that preference, then use that preference when creating your index, the tokens are automatically inserted in the dr$your_index_name$i domain index table.
You may not have scrolled down far enough in the documentation to see all of the information and complete example provided.

Plain-text parser

Hi everybody
The project I've been working on involves parsing plain-text, structured messages. These messages contain records of fixed length, and of several different types. Some of these record types have data & behaviour in common, so that it feels quite natural AND good practice to use inheritance
to illustrate this, here's (roughly) the grammar of my messages :
Message => MessageHeader Invoice* MessageFooter
Invoice => InvoiceHeader Record* InvoiceFooter
Record => RecordX | RecordY | RecordZ
As I said, RecordX, RecordY and RecordZ have data & behaviour in common (same applies for MessageHeader/Footer and InvoiceHeader/Footer) and are composed of many fields of different types
The first treatment I want to apply is to parse such messages from the filesystem and construct a graph of objects (in order to save it to DB afterwards via hibernate)
How would you organise a suitable and proper-OO parser package in order to minimize code redundancy, allow TDD... I have already written dozens of solutions, none satisfies me 100%...
If you want further info / description / excerpts of my precedent solutions, no problem ! I don't want to be TOO specific, since I have the feeling that there's some kind of generic pattern that could apply to this problem elegantly... Am I wrong ?

>> Parsing is separate from the graph.
> sure ok about that
>> So what exactly do you think needs improvement?
> I don't like these long parse(InputStream) methods that just call some "basic parser" (implementing methods like parse, parseInt, parseShort, and so on) again and again, especially when a lot of parsing activity code is duplicated (same sub-sequences of fields in records of different types)
> It gives you something like that :
> public RecordX parse(InputStream inputStream) {
> ....
I don't see a problem.
You can generalize it using a description component if you wish.
But, speaking from experience, unless you have a lot of these which are very similar, doing so makes it harder to maintain (due to special cases). And it doesn't make it faster.
>> And "many fields" doesn't say much. In particular just because an attribute exists or doesn't exist does not necessarily imply that inheritance is needed. For example I am still a human whether I have a job or not.
> what I meant is : some records of different types have many attributes and most of their behaviour in common, so parsing them in separate classes (specialised parsers) induces lengthy methods and lots of duplicate code... This couldn't possibly scale if there were many more types of records (I feel I'm deserving a good "YAGNI" here ;-))
I suspect that you look at something like the following two snippets and see 'duplicate' code (pseudo representation of what you are doing.)
         fieldx1 = parse.parseInt(5)
         parse.skip(3)
         fieldx2 = parse.parseInt(7)

         fieldy1 = parse.parseInt(5)
         parse.skip(3)
         fieldy2 = parse.parseInt(4)
         fieldy3 = parse.parseInt(7)
There are similarities in the above but they are not duplicate.
And you could create a routine that handles both, but there are so many dissimilarities between them that the duplicate part would become obscured by the part that isn't similar.
> roughly, when I'm looking at the parser module of the project and the classes involved, I can't help the feeling there's something wrong :
It isn't quick and easy, but learning how to generate code makes the production of such code a lot easier.
Instead of writing code you would produce something like this....
       fieldx1 int 5
       skip 3
       fieldx2 int 7
This would then be fed to a generator which then produces the appropriate code.
Various ways exist to do this...
- JavaCC
- XDoclet
- roll your own
And there are probably others (pretty sure there are alternatives to JavaCC.)
However whether this "helps" depends on what is being parsed. (Although for myself I find it more interesting to generate code even if it takes me as long to do it as if I did it manually.)
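To make the idea of a field description a little more concrete, here is a sketch that interprets such a description at run time instead of generating code from it. That is a simpler stand-in for the generator approach, not the same thing. It only understands the two kinds of line shown above ("name int width" and "skip width"); the class name and the sample record are invented for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FixedWidthSpec {
    // Walks the spec line by line, consuming the record from left to right.
    // Supported spec lines: "<name> int <width>" and "skip <width>".
    static Map<String, Integer> parse(List<String> spec, String record) {
        Map<String, Integer> values = new LinkedHashMap<>();
        int pos = 0;
        for (String line : spec) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2 && parts[0].equals("skip")) {
                pos += Integer.parseInt(parts[1]);
            } else if (parts.length == 3 && parts[1].equals("int")) {
                int width = Integer.parseInt(parts[2]);
                String raw = record.substring(pos, pos + width).trim();
                values.put(parts[0], Integer.parseInt(raw));
                pos += width;
            } else {
                throw new IllegalArgumentException("Unsupported spec line: " + line);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> spec = Arrays.asList("fieldx1 int 5", "skip 3", "fieldx2 int 7");
        // Made-up 15-character record: 5 digits, 3 filler characters, 7 digits
        System.out.println(parse(spec, "00042xxx0001234"));
    }
}
```

Records that share a sub-sequence of fields could then share the corresponding spec lines, which addresses the duplication complaint without inheritance. The special cases mentioned above are exactly where an approach like this tends to get awkward.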
> This couldn't possibly scale if there were many more types of records
Don't strive to build code that meets all possibilities. Strive to build code that meets the current requirements and which meets the known future possibilities. (Known is a tentative term and isn't limited to just requirements in hand but can be based on a comment from someone that suggests something like this might be needed in the future.)
Having built systems that parse various structures I find it unlikely that additions in the future are going to easily fit a current model. Many times they are completely different.

Text Parsing and the Competence of the Programmer

Often we as programmers are encouraged to devise our own algorithms to accomplish certain tasks. I know we use prewritten code in libraries/packages to eliminate redundancies in code and to more easily implement tasks that we frequently need to. We also read tutorials on how to do certain things. After we practice writing the code within the guidelines of the code presented in the tutorial, it becomes easy for us to implement the things those tutorials taught us. However, I want to ask the question: If we fail to implement certain functionalities because we could not devise the needed code ourselves, and we still succeed in doing so by following a pattern we find in tutorials, does that prove that we do not have a knack for programming, and that we are not as gifted as we thought we were? I mean, suppose I attempt to write a class that can translate a certain style of speech into another (for example: prep talk, slang etc., to proper English) and I fail to find a way to do it, but then learn the basic pattern of algorithms necessary to do it and implement it following that pattern (or even if I simply learn the mechanics of it and implement it with my understanding of those mechanics). Does that mean I am a poor programmer?
Tell me, what is your opinion? Just what does it mean if I do that?

Hmmm I'm not sure exactly what you're aiming at... Finding algorithms to accomplish what you want does seem to require a good level of intelligence but whether or not you succeed in solving a particular problem also largely depends on the problem itself. Some things are just complex and it's not a shame if you don't find a solution in a short time. Some problems can even be proven to be very hard to solve (NP-completeness), see the funny picture with the boss here: http://www.codinghorror.com/blog/archives/001270.html.
To stay in the spirit of your example, try making a good text translator. Just for fun, check out Google Translate or Babel Fish and translate between two languages you know reasonably well. Sometimes the results are good, sometimes they are just plain bad.
Does that mean the programmers of these tools are bad programmers? I'd think the problem of translating text is simply a very complex matter. Yes better programmers may build better text translators, but it doesn't mean you are a bad programmer if you can't write one in a short time.
As you pointed out, there are good libraries and tutorials out there and we shouldn't hesitate to use them. We should learn from what others have done and avoid reinventing the wheel. This too can sometimes be hard in practice, even with the internet at my disposal and knowing that someone else out there has already solved the problem I'm facing!
Certainly for parsing text I would not try to reinvent the wheel but look up existing solutions. At the least have a look at what regular expressions can do for you; they can save you heaps of procedural code.
But yes, in practice you will not always find a textbook example and you have to break up your problem in subproblems yourself. This is not only a matter of intelligence but also a matter of experience and of trying different ways to look at the same problem.

Text parsing

I want to parse non-xml text files, but do it in an event driven framework. I can't seem to find any guides for this. Can someone point me to a good tutorial? I am familiar with all the regex/tokenizer java classes, and sax and dom parsing, but the latter 2 don't seem applicable since this isn't xml...

I've never tried it, but maybe JavaCC will help.

Unusual request; text parse??

Hello brains,
Below is the DB info for my question.
Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production
PL/SQL Release 11.1.0.7.0 - Production
CORE     11.1.0.7.0     Production
TNS for IBM/AIX RISC System/6000: Version 11.1.0.7.0 - Production
NLSRTL Version 11.1.0.7.0 - Production
I’m in need of your assistance. I have an unusual request.
I have data stored in varchar2(4000) field that is a cut and paste from a screen; from the cut and paste data I need the following.
PRINCIPAL BALANCE      391,787.58
INTEREST 12/02/10      38,198.56
PRO RATA MIP/PMI           .00
ESCROW ADVANCE      8,601.59
ESCROW BALANCE           .00
SUSPENSE BALANCE           .00
HUD BALANCE           .00
REPLACEMENT RESERVE           .00
RESTRICTED ESCROW           .00
TOTAL-FEES      56.00
ACCUM LATE CHARGES 209.16
ACCUM NSF CHARGES           .00
OTHER FEES DUE      75.95
PENALTY INTEREST           .00
FLAT/OTHER PENALTY FEE           .00
CR LIFE/ORIG FEE RBATE           .00
RECOVERABLE BALANCE      5,747.10
TOTAL TO PAYOFF 444,675.94
Here is the insert statement
Insert into TEST_TABLE
(LM_NOTES)
Values
(' PRINCIPAL BALANCE 391,787.58 ----------- RATE CHANGES ----------
INTEREST 12/02/10 38,198.56 CALC INT FROM RATE AMOUNT
PRO RATA MIP/PMI .00 09/01/08 4.43500 18,523.39
ESCROW ADVANCE 8,601.59 08/01/09 4.27600 1,396.07
ESCROW BALANCE .00 09/01/09 4.12600 1,347.10
SUSPENSE BALANCE .00 10/01/09 3.98300 1,300.41
HUD BALANCE .00 11/01/09 3.85700 1,259.27
REPLACEMENT RESERVE .00 12/01/09 3.76900 1,230.54
RESTRICTED ESCROW .00 01/01/10 3.70600 1,209.97
TOTAL-FEES 56.00 02/01/10 3.69600 1,206.71
ACCUM LATE CHARGES 209.16 03/01/10 3.68800 1,204.09
ACCUM NSF CHARGES .00 04/01/10 3.66600 1,196.91
OTHER FEES DUE 75.95 05/01/10 3.64600 1,190.38
PENALTY INTEREST .00 06/01/10 3.63800 1,187.77
FLAT/OTHER PENALTY FEE .00 TOTAL INTEREST 38,198.56
CR LIFE/ORIG FEE RBATE .00 TOTAL TO PAYOFF 444,675.94
RECOVERABLE BALANCE 5,747.10 NUMBER OF COPIES: 1 PRESS PF1 TO PRINT');
This is a DUMP of the data
32,80,82,73,78,67,73,80,65,76,32,66,65,76,65,78,67,69,32,32,32,32,32,32,32,32,51,57,49,44,55,56,55,46,53,56,32,32,32,32,32,32,32,32,45,45,45,45,45,45,45,45,45,45,45,32,82,65,84,69,32,67,72,65,78,71,69,83,32,45,45,45,45,45,45,45,45,45,45,13,10,32,73,78,84,69,82,69,83,84,32,49,50,47,48,50,47,49,48,32,32,32,32,32,32,32,32,32,51,56,44,49,57,56,46,53,54,32,32,67,65,76,67,32,32,73,78,84,32,70,82,79,77,32,32,32,32,32,82,65,84,69,32,32,32,32,32,32,32,32,32,65,77,79,85,78,84,13,10,32,80,82,79,32,82,65,84,65,32,77,73,80,47,80,77,73,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,46,48,48,32,32,32,32,32,32,32,32,48,57,47,48,49,47,48,56,32,32,32,32,52,46,52,51,53,48,48,32,32,32,32,32,32,49,56,44,53,50,51,46,51,57,13,10,32,69,83,67,82,79,87,32,65,68,86,65,78,67,69,32,32,32,32,32,32,32,32,32,32,32,32,32,56,44,54,48,49,46,53,57,32,32,32,32,32,32,32,32,48,56,47,48,49,47,48,57,32,32,32,32,52,46,50,55,54,48,48,32,32,32,32,32,32,32,49,44,51,57,54,46,48,55,13,10,32,69,83,67,82,79,87,32,66,65,76,65,78,67,69,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,46,48,48,32,32,32,32,32,32,32,32,48,57,47,48,49,47,48,57,32,32,32,32,52,46,49,50,54,48,48,32,32,32,32,32,32,32,49,44,51,52,55,46,49,48,13,10,32,83,85,83,80,69,78,83,69,32,66,65,76,65,78,67,69,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,46,48,48,32,32,32,32,32,32,32,32,49,48,47,48,49,47,48,57,32,32,32,32,51,46,57,56,51,48,48,32,32,32,32,32,32,32,49,44,51,48,48,46,52,49,13,10,32,72,85,68,32,66,65,76,65,78,67,69,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,46,48,48,32,32,32,32,32,32,32,32,49,49,47,48,49,47,48,57,32,32,32,32,51,46,56,53,55,48,48,32,32,32,32,32,32,32,49,44,50,53,57,46,50,55,13,10,32,82,69,80,76,65,67,69,77,69,78,84,32,82,69,83,69,82,86,69,32,32,32,32,32,32,32,32,32,32,32,32,32,46,48,48,32,32,32,32,32,32,32,32,49,50,47,48,49,47,48,57,32,32,32,32,51,46,55,54,57,48,48,32,32,32,32,32,32,32,49,44,50,51,48,46,53,52,13,10,32,82,69,83,84,82,73,67,84,69,68,32,69,83,67,82,79,87,32,32,32,32,32,32,32,32,32,32
,32,32,32,32,32,46,48,48,32,32,32,32,32,32,32,32,48,49,47,48,49,47,49,48,32,32,32,32,51,46,55,48,54,48,48,32,32,32,32,32,32,32,49,44,50,48,57,46,57,55,13,10,32,84,79,84,65,76,45,70,69,69,83,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,53,54,46,48,48,32,32,32,32,32,32,32,32,48,50,47,48,49,47,49,48,32,32,32,32,51,46,54,57,54,48,48,32,32,32,32,32,32,32,49,44,50,48,54,46,55,49,13,10,32,65,67,67,85,77,32,76,65,84,69,32,67,72,65,82,71,69,83,32,32,32,32,32,32,32,32,32,32,32,50,48,57,46,49,54,32,32,32,32,32,32,32,32,48,51,47,48,49,47,49,48,32,32,32,32,51,46,54,56,56,48,48,32,32,32,32,32,32,32,49,44,50,48,52,46,48,57,13,10,32,65,67,67,85,77,32,78,83,70,32,67,72,65,82,71,69,83,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,46,48,48,32,32,32,32,32,32,32,32,48,52,47,48,49,47,49,48,32,32,32,32,51,46,54,54,54,48,48,32,32,32,32,32,32,32,49,44,49,57,54,46,57,49,13,10,32,79,84,72,69,82,32,70,69,69,83,32,68,85,69,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,55,53,46,57,53,32,32,32,32,32,32,32,32,48,53,47,48,49,47,49,48,32,32,32,32,51,46,54,52,54,48,48,32,32,32,32,32,32,32,49,44,49,57,48,46,51,56,13,10,32,80,69,78,65,76,84,89,32,73,78,84,69,82,69,83,84,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,46,48,48,32,32,32,32,32,32,32,32,48,54,47,48,49,47,49,48,32,32,32,32,51,46,54,51,56,48,48,32,32,32,32,32,32,32,49,44,49,56,55,46,55,55,13,10,32,70,76,65,84,47,79,84,72,69,82,32,80,69,78,65,76,84,89,32,70,69,69,32,32,32,32,32,32,32,32,32,32,46,48,48,32,32,32,32,32,32,32,32,84,79,84,65,76,32,73,78,84,69,82,69,83,84,32,32,32,32,32,32,32,32,32,32,32,51,56,44,49,57,56,46,53,54,13,10,32,67,82,32,76,73,70,69,47,79,82,73,71,32,70,69,69,32,82,66,65,84,69,32,32,32,32,32,32,32,32,32,32,46,48,48,32,32,32,32,32,32,32,32,84,79,84,65,76,32,84,79,32,80,65,89,79,70,70,32,32,32,32,32,32,32,32,32,52,52,52,44,54,55,53,46,57,52,13,10,32,82,69,67,79,86,69,82,65,66,76,69,32,66,65,76,65,78,67,69,32,32,32,32,32,32,32,32,53,44,55,52,55,46,49,48,32,32,78,85,77,66,69,82,32,79,70,32
I don’t even know where to begin. ANY help would be truly appreciated. I was thinking about a loop that splits on the 13,10 (carriage return, line feed) pairs, finds the first decimal number and substrings the line, but I'm not sure. Any ideas?? They basically need that data formatted when I export it into a doc.

Hi,
Thanks for posting the INSERT statement.
You may have noticed that this site normally compresses whitespace. When posting any formatted text on this site, type these 6 characters:
{code} (small letters only, inside curly brackets) before and after each section of formatted text, to preserve spacing.
Do you want a SQL result set (with separate columns of different datatypes), or just formatted text as output?
This job will be a lot easier if you can use PL/SQL. Divide the string into separate lines, as you suggested.
On each separate line, you seem to be interested in the first number (in '999,999,999.00' format), and the text that comes before that number (which may include numbers, like '12' or '02' or '10'). In that case
REGEXP_SUBSTR ( single_line
     , ' '          || -- a space
     '(\d{0,3})'      || -- 0 to 3 digits
          '(,\d{3})*'     || -- any number of 3-digit groups, each starting with a comma
          '\.'          || -- a decimal point
          '\d\d'          -- exactly 2 digits
     )
will return the number (in '999,999,999.00' format). If the expression above returns NULL, then the string did not contain such a number.
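The thread is about PL/SQL, but the pattern itself is easy to try out elsewhere. Purely as an illustration, here is the same expression as a Java regular expression, applied to one line at a time, returning both the label (the text before the number) and the number. The class and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PayoffLine {
    // A space, 0 to 3 digits, any number of ",ddd" groups, a decimal point, 2 digits.
    // The capture group excludes the leading space.
    private static final Pattern AMOUNT =
        Pattern.compile(" (\\d{0,3}(?:,\\d{3})*\\.\\d\\d)");

    // Returns {label, amount} for the first amount on the line, or null if there is none.
    static String[] labelAndAmount(String line) {
        Matcher m = AMOUNT.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { line.substring(0, m.start()).trim(), m.group(1) };
    }

    public static void main(String[] args) {
        String[] r = labelAndAmount(" INTEREST 12/02/10         38,198.56  CALC  INT FROM     RATE         AMOUNT");
        System.out.println(r[0] + " | " + r[1]);
    }
}
```

Note that the date 12/02/10 is skipped because it is never followed by a decimal point and two digits, which is the same reason the original expression works.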

Please Help with text parsing problem

Hello,
I have the following text in a file (cut and pasted from VI)
12 15 03 12 15 03 81 5 80053 1 1,2,3 $23.00 1 ^M
12 15 03 12 15 03 81 5 84550 1 1,2,3 $15.00 1 ^M
12 15 03 12 15 03 81 5 84100 1 1,2,3 $15.00 1 ^M
12 15 03 12 15 03 81 5 83615 1 1,2,3 $15.00 1 ^M
12 15 03 12 15 03 81 5 82977 1 1,2,3 $15.00 1 ^M
12 15 03 12 15 03 81 5 80061 1 1,2,3 $44.00 1 ^M
12 15 03 12 15 03 81 5 83721 1 1,2,3 $15.00 1 ^M
12 15 03 12 15 03 81 5 84439 1 1,2,3 $44.00 1 ^M
12 15 03 12 15 03 81 5 84443 1 1,2,3 $40.00 1 ^M
12 15 03 12 15 03 81 5 85025 1 1,2,3 $26.00 1 ^M
12 15 03 12 15 03 81 5 85008 1 1,2,3 $5.00 1 ^M
this method reads the text from a file and stores it in a ArrayList
    public ArrayList readInData(){
        File claimFile = new File(fullClaimPath);
        ArrayList returnDataAL = new ArrayList();
        if(!claimFile.exists()){
            System.out.println("Error: claim data - File Not Found");
            System.exit(1);
        }
        try{
            BufferedReader br = new BufferedReader(new FileReader(claimFile));
            String s;
            while ((s = br.readLine()) != null){
                     System.out.println(s + " HHHH");
                    returnDataAL.add(s);
            }
        }catch(Exception e){ System.out.println(e);}
        return returnDataAL;
    }//close loadFile()
if i print the lines from above ... from the arraylist ... here is what i get ...
2 15 03 12 15 03 81 5 80053 1 1,2,3 $23.00 1 HHHH
HHHH
12 15 03 12 15 03 81 5 84550 1 1,2,3 $15.00 1 HHHH
HHHH
12 15 03 12 15 03 81 5 84100 1 1,2,3 $15.00 1 HHHH
HHHH
12 15 03 12 15 03 81 5 83615 1 1,2,3 $15.00 1 HHHH
HHHH
12 15 03 12 15 03 81 5 82977 1 1,2,3 $15.00 1 HHHH
HHHH
12 15 03 12 15 03 81 5 80061 1 1,2,3 $44.00 1 HHHH
HHHH
12 15 03 12 15 03 81 5 83721 1 1,2,3 $15.00 1 HHHH
HHHH
12 15 03 12 15 03 81 5 84439 1 1,2,3 $44.00 1 HHHH
HHHH
12 15 03 12 15 03 81 5 84443 1 1,2,3 $40.00 1 HHHH
HHHH
12 15 03 12 15 03 81 5 85025 1 1,2,3 $26.00 1 HHHH
HHHH
12 15 03 12 15 03 81 5 85008 1 1,2,3 $5.00 1 HHHH
HHHH
I see the ^M on the end of the lines ... but I don't understand why I'm getting the blank lines
in between each entry ... I printed "HHHH" just to help with debugging ... anyone have any ideas why I am getting the extra blank lines?
thanks,
jd

Maybe it's a FileReader deal... I'm not sure; maybe try using InputStreams. This code works for me (it reads from data.txt), give it a try and see if it works:
import java.io.*;
public class Example {
     public Example() throws IOException {
          BufferedReader b = new BufferedReader(new InputStreamReader(new FileInputStream("data.txt")));
          String s = "";
          while ((s = b.readLine()) != null) {
               System.out.println(s);
          }
          b.close();
     }
     public static void main(String[] args) {
          try {
               new Example();
          }
          catch (IOException e) {
               e.printStackTrace();
          }
     }
}
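One possible explanation for the blank lines, which nobody in the thread confirms: BufferedReader.readLine() treats a lone carriage return, a lone line feed, and a carriage return followed by a line feed each as one line terminator. If the ^M shown by vi is an extra carriage return sitting in front of a normal CR+LF line ending, every record is followed by two terminators, and the second one produces an empty line. Switching from FileReader to InputStreamReader would not change that. The sketch below, with made-up class and method names, reproduces the effect from a string and shows a simple way to skip the empty lines.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class StrayCarriageReturns {
    // Reads all lines; when skipEmpty is true, blank lines are dropped.
    static List<String> readLines(Reader source, boolean skipEmpty) throws IOException {
        List<String> lines = new ArrayList<>();
        BufferedReader reader = new BufferedReader(source);
        String line;
        while ((line = reader.readLine()) != null) {
            if (skipEmpty && line.trim().isEmpty()) {
                continue;
            }
            lines.add(line);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Assumed file content: each record ends with CR, then CR+LF
        String data = "12 15 03 $23.00 1 \r\r\n12 15 03 $15.00 1 \r\r\n";
        System.out.println(readLines(new StringReader(data), false).size()); // counts the empty lines too
        System.out.println(readLines(new StringReader(data), true).size());
    }
}
```

If that assumption holds, the cleaner fix is probably to repair the file (or the program that produces it) so that each line has a single terminator.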

Text parsing tool - need suggestion

I need to do the following on a text file, tab separated.
The lines I'm interested in begin all with 4 digits; other lines must remain untouched.
The structure of a line is:
4-digit code
start
end
descriptive string
If the "end" of line X is equal to the "start" of line X+1, the end of the first line must be replaced by (for instance) --- ; I need to compare line 1 and 2, then line 2 and 3, then line 3 and 4, and so on. If a line doesn't begin with 4 digits, it must be printed literally and the comparison must be done with the following line.
Something like:
line 1 begins with 4 digits, so compare it with the following line
line 2 begins with text: print it and skip to the following line for the comparison with line 1
line 3 begins with 4 digits: use it for the comparison with line 1 and then compare it with the following line
line 4 begins with 4 digits: use it for the comparison with the previous line and then compare it with the following line
### example input ###
Some garbage text
0123 001 003 dfghdfhg
4569 002 003 fdsygiusdg
Some other garbage text
5001 003 005 fguoiauhg
4786 005 007 fdhgsdhg
5641 007 008 kgjhlks
4455 009 010 dlkhsjsdf
### example desired output ###
Some garbage text
0123 001 003 dfghdfhg
4569 002 --- fdsygiusdg
Some other garbage text
5001 003 --- fguoiauhg
4786 005 --- fdhgsdhg
5641 007 008 kgjhlks
4455 009 010 dlkhsjsdf
Can someone suggest the right tool for this and give some hints?

I've added my own comments to the code:
#!/usr/bin/perl
use strict;
use warnings;
# initialize an array to hold the matches from a line that fits our regular expression
# we initialize it here to make it global so we can use it in the loop
my @last_line = ();
# the stack will temporarily hold all the lines that don't match the regular expression
# we initialize it here too so that we can use it in the loop
my $stack = '';
# here's the main loop... it gets run once for each line of input
# the "my $line = <STDIN>" sets "$line" to the value of the next line on STDIN
# the "defined()" function is just to make sure that we get all of the input because
# perl can interpret some strings and digits as "False"
# without the "defined()" part, the actual value of each line would be used
while (defined(my $line = <STDIN>))
{
    # first we need to decide how to print the lines
    # the tricky part is that before we print one line,
    # we need to check the NEXT line to see if its "start"
    # is this line's "end"
    # we also need to keep the junk text in the right order,
    # which means that we can't print the last line or any of
    # the junk lines between it and the next matching line
    # until we've checked it so we dump the junk lines on
    # a stack that we can print after we've printed the last
    # matching line
    # here we check for lines that begin with 4 digits,
    # followed by spaces, digits, spaces, more digits,
    # then whatever's left on the line
    # we capture the first 4 digits and the following spaces in the first group
    # the "start" in the second group
    # the spaces between the "start" and the "end" in the third group
    # the "end" in the fourth group
    # and then the rest of the line in the fifth group
    # the "m" at the beginning is not necessary as it defaults
    # to matching, but I prefer to be explicit when sharing code
    # to make it clearer... you could skip it
    if ($line =~ m/^(\d{4}\s+)(\d+)(\s+)(\d+)(.+\n)$/)
    {
        # now we assign the matches to an array
        # remember that arrays are 0-indexed, so
        # group 1 will be at [0], etc
        my (@current_line) = ($1,$2,$3,$4,$5);
        # check if the "start" of the current line
        # (group 2, index 1) matches the "end" of the
        # last line (group 4, index 3)
        # the "@last_line &&" part skips the comparison while
        # @last_line is still empty, which avoids an
        # "uninitialized value" warning on the first matching line
        if (@last_line && $last_line[3] eq $current_line[1])
        {
            # if it does, replace each character in
            # the "end" of the last line with "-"
            $last_line[3] =~ s/./-/g;
        }
        # print the last line by joining the 5 groups
        # together in a string without spaces then
        # print all the junk lines that came after it
        # before the current matching line
        print join ('',@last_line), $stack;
        # make the current line the "last line" for
        # the comparison with the next matching line
        @last_line=@current_line;
        # clear the stack because we've printed those
        # lines that were in it already
        $stack = '';
    }
    # otherwise, the line didn't match our regular expression
    else
    {
        # so drop it on the stack for later
        # if we printed it now, it would be printed before
        # the last matching line
        $stack .= $line;
    }
}
# now we've run through all the input but we haven't
# printed the last matching line because we always waited
# for the next matching line to compare its "start" with the
# held line's "end"
# now there's nothing more to compare so we can print
# the last line and the stack
print join ('',@last_line), $stack;
Let's look at a modified version of your example input (with line numbers) and what happens when the script is run:
1 Some garbage text
2 0123 001 003 dfghdfhg
3 4569 002 003 fdsygiusdg
4 Some other garbage text
5 Even more garbage text
6 5001 003 005 fguoiauhg
7 4786 005 007 fdhgsdhg
8 5641 007 008 kgjhlks
9 4455 009 010 dlkhsjsdf
@last_line is initialized (empty) and so is $stack
line 1 doesn't match the regex so it gets dropped on the stack
line 2 matches the regex
line 2's "start" (001) doesn't match @last_line's "end" (which is empty) so no changes are made to @last_line's "end"
print @last_line (""), followed by the stack (line 1) and clear the stack
@last_line = line 2
line 3 matches the regex
line 3's "start" (002) doesn't match @last_line's "end" (003) so no changes are made
print @last_line (line 2) followed by the stack ("")
@last_line = line 3
line 4 doesn't match the regex so it's dropped on the stack
same for line 5 so the stack is now line 4 + line 5
line 6 matches the regex
line 6's "start" (003) matches @last_line's (line 3's) "end" (003) so replace "003" with "---"
print @last_line (line 3) followed by the stack (line 4 + line 5)
etc
Without the stack, lines 4 and 5 would have been printed before line 3 because we were holding on to it for comparison with line 6.
(I sense another one of those "you have too much time on your hands" comments in this thread... I can't help it though, I like to help, plus it's good exercise to explain how you've programmed something and why)
Last edited by Xyne (2009-03-10 20:34:35)
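
For readers who would rather stay in Java, the hold-back-and-buffer idea from the Perl script can be sketched roughly as below. This is an illustrative port under stated assumptions, not the original poster's code: the class name EndMarker and the method process are invented here, lines are assumed to arrive without their trailing newline (as BufferedReader.readLine() returns them), and the last regex group uses ".*" instead of the Perl script's ".+\n", so a record with nothing after its "end" field also matches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EndMarker {
    // Groups: (1) 4-digit code plus gap, (2) start, (3) gap, (4) end, (5) rest of line.
    private static final Pattern RECORD =
            Pattern.compile("^(\\d{4}\\s+)(\\d+)(\\s+)(\\d+)(.*)$");

    public static List<String> process(List<String> lines) {
        List<String> out = new ArrayList<>();
        String[] last = null;                      // held-back record, split into its 5 groups
        List<String> pending = new ArrayList<>();  // non-record lines seen since 'last'
        for (String line : lines) {
            Matcher m = RECORD.matcher(line);
            if (m.matches()) {
                String[] cur = { m.group(1), m.group(2), m.group(3), m.group(4), m.group(5) };
                if (last != null) {
                    // The previous record's "end" equals this record's "start":
                    // overwrite each character of that "end" with '-'.
                    if (last[3].equals(cur[1])) {
                        last[3] = last[3].replaceAll(".", "-");
                    }
                    out.add(String.join("", last));
                }
                // Buffered junk lines go out after the record they followed.
                out.addAll(pending);
                pending.clear();
                last = cur;
            } else {
                pending.add(line);
            }
        }
        // Nothing left to compare against: flush the held record and the buffer.
        if (last != null) {
            out.add(String.join("", last));
        }
        out.addAll(pending);
        return out;
    }

    public static void main(String[] args) {
        // Example input taken from the question in this thread.
        List<String> input = Arrays.asList(
                "Some garbage text",
                "0123 001 003 dfghdfhg",
                "4569 002 003 fdsygiusdg",
                "Some other garbage text",
                "5001 003 005 fguoiauhg",
                "4786 005 007 fdhgsdhg",
                "5641 007 008 kgjhlks",
                "4455 009 010 dlkhsjsdf");
        for (String line : process(input)) {
            System.out.println(line);
        }
    }
}
```

The sketch collects everything in a list to keep it short. For very large files it would make more sense to write each line out as soon as it is released, the way the Perl script prints inside its loop.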
