Process sampled data and perform simple calculations to bypass Matlab?

I have files that contain a large number of current samples (around 300,000, for example). I need to perform simple calculations on them (for example, to calculate energy).
Until now I have used Matlab to load all those samples into an array and do all the calculations there.
However... that process is kind of slow. What can I use to simplify this and replace Matlab with a script?
Can I use Perl or PHP to run this operation over a set of sample files and write the final calculated values into a new file?
Last edited by kdar (2012-05-11 13:21:05)

I'm surprised no one has suggested numpy and scipy, which are modules for Python 2. They have syntax that's similar to Matlab, they're very fast, and Python is useful in other circumstances and easy to learn.
I did something similar for spectral data. Here's an example of what I did in Python:
#!/usr/bin/env python
# This program goes through Rayleigh line data and finds the mean shift
# in nanometers and the standard deviation.
import sys, os
import numpy as np
import scipy as sp
import scipy.optimize as op
import time

ray = []
filenames = []
line = 633  # laser line in nm

def rs(wavelength, laser):
    # wavelength (nm) -> Raman shift relative to the laser line, in rel. cm^(-1)
    return ((float(1)/laser) - (float(1)/wavelength)) * (10**7)

def main(argv):
    # Goes through a file and finds the peak position of the Rayleigh line
    f = np.loadtxt(argv).transpose()                 # loads the file (row 0: wavelength, row 1: intensity)
    maxi = np.amax(f[1])                             # finds the value of the peak of the Rayleigh line
    intensity = [f[1, i] for i in range(len(f[1]))]  # extracts the array into a list
    indi = intensity.index(maxi)                     # finds the index of the Rayleigh line
    ray.append(f[0, indi])
    filenames.append(str(argv))

# Goes through each file named in the CLI call and applies the main function to it
for filename in sys.argv[1:]:
    main(filename)

# Use numpy for some basic calculations
mean = np.mean(ray)
StandardDeviation = np.std(ray)
median = np.median(ray)
variance = np.var(ray)
ramanshift = [rs(ray[i], line) for i in range(len(ray))]
rsmean = np.mean(ramanshift)
rsSD = np.std(ramanshift)
rsmedian = np.median(ramanshift)
rsvariance = np.var(ramanshift)
tname = str(time.asctime())

# Write all calculations to a file
output = open('rayleigh_' + tname + '.dat', 'w')
output.write('#The files used for this compilation are:\n')
for i in range(len(filenames)):
    output.write('#' + filenames[i] + '\n')
output.write('The wavelengths of the Rayleigh line are (in nm):\n')
for i in range(len(ray)):
    output.write(str(ray[i]) + '\n')
output.write('The Raman shifts of the Rayleigh line for ' + str(line) + ' nm are (in rel. cm^(-1)):\n')
for i in range(len(ray)):
    output.write(str(ramanshift[i]) + '\n')
output.write('Mean = ' + str(mean) + ' nm, or ' + str(rsmean) + ' rel. cm^(-1)\n')
output.write('Standard Deviation = ' + str(StandardDeviation) + ' nm, or ' + str(rsSD) + ' rel. cm^(-1)\n')
output.write('Median = ' + str(median) + ' nm, or ' + str(rsmedian) + ' rel. cm^(-1)\n')
output.write('Variance = ' + str(variance) + ' nm, or ' + str(rsvariance) + ' rel. cm^(-1)\n')
output.close()
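For the original question (energy from a file of current samples) the same pattern is even shorter. A minimal sketch, assuming each file holds a single column of current samples and that the load is a plain resistor; the sample interval dt and resistance R below are placeholders you would replace with your own values:

#!/usr/bin/env python
# Minimal sketch: E = R * dt * sum(i^2) over all samples in each input file.
# dt and R are assumed values -- substitute your real sample interval and load resistance.
import sys
import numpy as np

dt = 1e-4   # sample interval in seconds (placeholder)
R = 1.0     # load resistance in ohms (placeholder)

with open('energies.dat', 'w') as out:
    for filename in sys.argv[1:]:
        current = np.loadtxt(filename)          # 300,000 samples load quickly with loadtxt
        energy = R * dt * np.sum(current ** 2)  # discrete approximation of the integral of i^2 * R dt
        out.write('%s %g\n' % (filename, energy))

Save it as, say, energy.py and run it as ./energy.py file1.dat file2.dat ...; it writes one line per input file to energies.dat.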
Last edited by Yurlungur (2012-05-13 21:14:54)

Similar Messages

  • Date and time difference calculations

    Dear Experts,
    I have a doubt regarding the date and time difference calculation; please help me out.
    This is regarding a report called "Vehicle tracking system". The purpose of this report is to give the user information such as how many vehicles have come in, how long each vehicle has been parked for filling in the refinery, and the average time taken per customer.
    For this we have created 4 info objects called
    1)     Park In Time
    2)     Park In Date
    3)     Gate Out Time
    4)     Gate Out Date
    We have created a routine which calculates the difference between Vehicle Park In Date & Time and Gate Out Date & Time.
    The time difference is in number format.
    Example: For Vehicle1
    Park in Date and Park in Time are         27.11.2008 and 17:43:35
    Gate out Date and Gate out Time are     29.11.2008 and 09:36:16            
    Here the routine calculates the difference between the above date and time values and the output looks like this: 1,155.138,000 (i.e. 1 day, 15 hours, 51 minutes and 38 seconds), which means the vehicle was parked for 1 day and 15 hours etc.
    But we require the output like    1 day, 15:51:38
    Please give us a solution for how to go about this.
    Thanks in advance,
    venkat

    Hi,
    I think the solution to your problem will be to fill the variables at query runtime using a replacement path variable.
    Please see the link below; it is fairly self-explanatory and relevant to your problem.
    Hope it helps!
    [http://sd-solutions.com/documents/SDS_BW_Replacement%20Path%20Variables.html]
    Regards,
    Neha.
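    For the output format itself ("1 day, 15:51:38") the arithmetic is just integer division on the total elapsed seconds between the two timestamps. A rough, language-neutral illustration in Python, using the park-in/gate-out values from the example above (the same steps can be reproduced in an ABAP routine):

    # Illustration only: turn a park-in / gate-out pair into "N day(s), HH:MM:SS".
    from datetime import datetime

    park_in  = datetime(2008, 11, 27, 17, 43, 35)   # 27.11.2008 17:43:35
    gate_out = datetime(2008, 11, 29, 9, 36, 16)    # 29.11.2008 09:36:16

    total = int((gate_out - park_in).total_seconds())
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    print('%d day(s), %02d:%02d:%02d' % (days, hours, minutes, seconds))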

  • Need help in framing an SQL query - Sample data and output required is mentioned.

    Sample data :
    ID Region State
    1 a A1
    2 b A1
    3 c B1
    4 d B1
    Result should be :
    State Region1 Region2
    A1 a b
    B1 c d

    create table #t (id int, region char(1), state char(2))
    insert into #t values (1,'a','a1'),(2,'b','a1'),(3,'c','b1'),(4,'d','b1')
    select state,
           max(case when region in ('a','c') then region end) region1,
           max(case when region in ('b','d') then region end) region2
    from #t
    group by state
    Best Regards,Uri Dimant SQL Server MVP,
    http://sqlblog.com/blogs/uri_dimant/
    MS SQL optimization: MS SQL Development and Optimization
    MS SQL Consulting:
    Large scale of database and data cleansing
    Remote DBA Services:
    Improves MS SQL Database Performance
    SQL Server Integration Services:
    Business Intelligence

  • Increase the number of background work processes for data load performance

    Hi all,
    There are 10 available background work processes in the BW system. We're doing a mass load to multiple ODS objects, but the system uses only 3 background processes. How can I
    increase the number of background work processes used for the new data load?
    I tried to change the number of processes with RSODSO_SETTINGS, but with no success. Are there any other settings that need to be changed?
    thanks,
    Yigit

    Hi Sankar,
    I entered the max proc. number into ROIDOCPRMS, but it doesn't make a difference; the system still uses only 3 background processes. RSCUSTA2 is replaced by
    RSODSO_SETTINGS in BI 7.0, and that transaction can only change the processes for data activation, SID generation and rollback. I need to change the number of processes for data extraction.

  • Picking a DD payment date and a bill calculation d...

    I want to pay by monthly direct debit and pay off the whole monthly bill every month (like the good loyal BT customer that I am). To suit the way I do my accounts, I need to be advised of the bill amount just before the end of a month, in advance of a DD that is taken just after the start of the next month. Having both the bill calculated and the payment taken part way through the month would not work.
    I phoned BT in the past and asked if it was possible to change the DD and bill dates and was told it was not possible.
    Can it be done? Can the DD and bill dates be changed?
    Solved!
    Go to Solution.

    Hi greenbandit,
    Each area has a different bill cycle; however, if you choose to pay your entire bill by direct debit each month (Whole Bill Direct Debit Monthly), your bills will be produced on a specific date, not of your choosing, and the direct debit will be taken within 10 days.
    You can set up a Monthly Payment Plan (MPP), which pays set monthly amounts towards a quarterly bill.  This allows you to pick a specific date each month for your payment to be made, but not the date of the bill production.
    Hope this helps
    Thx
    Craig
    BTCare Community Mod
    If we have asked you to email us with your details, please make sure you are logged in to the forum, otherwise you will not be able to see our ‘Contact Us’ link within our profiles.
    We are sorry but we are unable to deal with service/account queries via the private message (PM) function, so please don't PM your account info; we need to deal with this via our email account :-)
    td-p/30">Ratings star on the left-hand side of the post.
    If someone answers your question correctly please let other members know by clicking on ’Mark as Accepted Solution’.

  • SAP BI Date and Time Difference calculations in DSO Transformation

    Hello Guys,
    Could you please tell me how to calculate the date and Time difference between 2 fields.
    I have 2 date fields :
    Arrival Date : 6/16/2007
    Departure date : 6/19/2007
    Also i have 2 time fields for the above
    Arrival Time : 13:00:00
    Departure Time : 11:50:00
    I want to display all the four fields and 2 fields for the difference in Date and Time.
    Is it better to calculate the differences in the DSO transformation, or can you do it in the report itself? Could you please let me know the solution.
    Thanks,
    BI Consultant

    Hello Consultant BI,
    Computing the difference of two dates is easy (assuming you really just want the number of days). You can simply subtract the two dates using ABAP:
    data: w_arrival_date type sy-datum,
          w_dep_date     type sy-datum,
          w_diff         type i.
    w_arrival_date = <your arrival date field here>.
    w_dep_date     = <your departure date field here>.
    w_diff = w_dep_date - w_arrival_date.
    Getting the time difference alone isn't really that meaningful. I think what you want instead is to compute the total days (and extra hours), right? If that is the case, then you can convert the date+time for both the arrival and the departure into a timestamp variable first and then get the difference.
    Hope this helps.

  • Trying to figure out how to perform simple calculations.

    I have a database that users input information into.  I simply want to add all the numbers and divide by the number of entries to get an average.  It will change as users input more entries, though.  Is this something that can be done fairly easily with CS4?

    Ok, but how do I do it outside?  What does that mean?
    Try this
    SELECT AVG(score) AS average_score
    FROM (SELECT score FROM mytable
          ORDER BY date DESC
          LIMIT 20) AS latest_scores
    >Or can I just calculate it when it's pulled out.
    Normally you should not store calculated fields. This violates rules of normalization. There are exceptions.
    >That would be fine also, but when I try it only calculates for
    >the first record in the table...not for each one. Please help.
    You need to explain better. Are you displaying records in a repeating region? If so, you would put the calculation in the region and it would display the calculation for each row.

  • How to compare To date and From date in XSLT

    Hi,
    I have to check "To date" and "From date" in my process and make sure "To date" is greater than or equal to "From date".
    Is there any function which performs this task?
    I will appreciate if someone can provide any Solution on the same..
    Thanks in advance..

    Hi,
    You can compare dates in XSLT, as in BPEL and XPath, using <, > and =; you can even add a duration in days or months to your date using xp20:add-dayTimeDuration-to-dateTime().
    The only catch is that all the dates have to be in ISO 8601 format...
    http://www.w3.org/TR/NOTE-datetime
    For example:
    *1994-11-05T08:15:30-05:00* corresponds to November 5, 1994, 8:15:30 am, US Eastern Standard Time.
    *1994-11-05T13:15:30Z* corresponds to the same instant.
    The following will return true if $dateFrom is less than $dateTo...
    <xsl:value-of select="$dateFrom &lt; $dateTo"/>
    Cheers,
    Vlad

  • Mac has become insanely slow : Processes SystemUIServer, UserEventAgent and loginwindow using a lot of memory

    I have been using my Mac for many months without any problem. But recently, all of a sudden, the Mac became insanely slow.
    I opened Activity Monitor to see what was happening. For three processes, SystemUIServer, UserEventAgent and loginwindow, the memory gradually increases and reaches up to 2 GB for each process. This completely hangs up my Mac.
    I tried the following:
    Restart Mac
    Restart Mac in safe mode
    Manually kill the processes
    Remove Date and Time from the menu bar (this was supposed to be the cause of the SystemUIServer process's memory growth, according to many users)
    Removed the externally connected keyboard and mouse (some had suggested this for UserEventAgent's memory)
    No luck with any of those. The moment I log in, the memory spikes up.
    Any idea what on earth is happening? Please help.

    Create a new user account with admin status. Log out of your account, then log into the new one. Does your problem stop? If not, then boot into Safe Mode. Does the problem stop? Boot normally. Does the problem stop?
    If none of the above help, then you have a corrupted system and need to reinstall Lion.
    Reinstalling Lion/Mountain Lion Without Erasing the Drive
    Boot to the Recovery HD: Restart the computer and after the chime press and hold down the COMMAND and R keys until the menu screen appears. Alternatively, restart the computer and after the chime press and hold down the OPTION key until the boot manager screen appears. Select the Recovery HD and click on the downward pointing arrow button.
    Repair the Hard Drive and Permissions: Upon startup select Disk Utility from the main menu. Repair the Hard Drive and Permissions as follows.
    When the recovery menu appears select Disk Utility. After DU loads select your hard drive entry (mfgr.'s ID and drive size) from the left side list.  In the DU status area you will see an entry for the S.M.A.R.T. status of the hard drive.  If it does not say "Verified" then the hard drive is failing or failed. (SMART status is not reported on external Firewire or USB drives.) If the drive is "Verified" then select your OS X volume from the list on the left (sub-entry below the drive entry), click on the First Aid tab, then click on the Repair Disk button. If DU reports any errors that have been fixed, then re-run Repair Disk until no errors are reported. If no errors are reported click on the Repair Permissions button. Wait until the operation completes, then quit DU and return to the main menu.
    Reinstall Lion/Mountain Lion: Select Reinstall Lion/Mountain Lion and click on the Continue button.
    Note: You will need an active Internet connection. I suggest using Ethernet if possible because it is three times faster than wireless.

  • How to calculate Date and Days

    Hi BW Experts,
    I have a requirement: I have a field, Original GI Date, which is calculated based on 'Original promise date' - 'Transport time'. The formula is 'Original GI Date' = 'Original promise date' - 'Transport time'.
    We are getting the data for Original promise date in date format and Transport time in days (e.g. 1 or 2 days).
    How can I convert this into a date, i.e. how can I calculate with dates and days? I am working on BW 3.5.
    Please help me how can I overcome this requirement.
    Points will be assigned.
    regards
    Yedu.

    Hi Ventatesh
    It's not a problem
    You can subtract days from date to get the resultant date
    Original Promise time  =  GI Date  - Transit time ( in days)
    Add Original Promise Time in your data target and fill that up with the above rule.
    If the above is not working you can use the function module
    DATE_IN_FUTURE
    Here you need to pass a date and a number of days to get a future date. The only trick you need to apply is that if the transit time is 5 days, you pass -5 to this function module.
    But this function module is not available in the BW system. Just copy the code from the ECC system and create a Z FM for the BW system.
    Regards
    Anindya
    Edited by: Anindya Bose on Feb 9, 2012 4:36 AM

  • Auto Lexer on XML Data and Exadata (11.2.0.2.0)

    I am trying to migrate to Auto Lexer from World Lexer. We have a table with a CLOB Column where we store some data in XML format and we search via INPATH
    WHERE CONTAINS (XML,'(( (2005) INPATH(/record/extend/identifier) ))' ) > 0
    Index create scripts are
    WORLD_LEXER
    CREATE INDEX TEST_XML ON TEST(XML)
    INDEXTYPE IS CTXSYS.CONTEXT
    PARAMETERS ('
      filter        GTP_FILTER_NULL
      section group GTP_SECTION_GROUP_PATH
      lexer         GTP_LEXER_WORLD
      wordlist      GTP_WORDLIST_BASIC
      stoplist      GTP_STOPLIST_NULL
      storage       GTP_STORAGE_BASIC
    ')
    For the AUTO LEXER => Lexer GTP_LEXER_AUTO
    For the World Lexer Index I get results but the Auto Lexer returns zero rows.
    I am running on ExaData as well. (11.2.0.2)
    Edited by: amin_adatia on 16-Aug-2011 11:31 AM

    Can you put together a reasonably complete test script that has some sample data and shows the problem occurring? I'm using 11.2.0.2 (not Exadata), and I haven't had any issues.

  • ODI Not Processing Stage Data

    Hello Friends,
    We have ODI 11.1.1.7.0 installed properly, and we ran the Main package with the "no agent" option at log level 6. The package was processing data properly and pushing it to the Reporting DB (Oracle) for 2 days. Suddenly, after the second day, it stopped processing any data and is not coming out of the sleep state. It has been about 15 days and "ODISleep" is still running. There is no error in the logs. Events are being continuously generated and are reaching the Stage DB.
    It would be of really a great help if you all could provide guidance on this issue.
    Thank you....
    Still didn't find any specific reason yet. Please reply in case anyone is aware of the reason.
    Thank you..
    Regards,
    Adi.

    Hi,
    I'd like to provide some KnowledgeBase articles here and here about utilizing queues in your VI. This should also fix the problem with your data flow, or you could try using shift registers. Are you receiving any errors when you run your VI?
    Amanda Howard
    Americas Services and Support Recruiting Manager
    National Instruments

  • Sample rate and data recording rate on NI Elvis

    I am currently working on a project that requires me to record my data at 1 ms intervals or less. Currently the lowest timing interval I can record at is 10 ms. If I set my wait timer to anything below 10, the recorded data in Excel skips time. For example, instead of starting at 1 ms and counting 2, 3, 4, 5, 6, etc., it skips from 2 to 5 to 12 to 19, etc. So my question is whether this is a limitation of the NI Elvis or possibly a problem with how I've written my LabVIEW code. From an operational standpoint my program is working great, but the data recording is preventing me from moving on to my testing phase. Any help on this matter would be greatly appreciated.
    Other information that might be relevant:
    Operating System: Windows 7
    Processor: Intel(R) Xeon(R) CPU E31245 @ 3.00 GHz
    Memory: 12GB
    DirectX Version: 11
    Attachments:
    Count Digital(mod12).vi ‏76 KB

    Hi crashdx,
    So my immediate thought on this issue is that the code inside your primary while loop might be taking too long to process to achieve such a high sample rate, especially when making calls into external applications (such as Excel), which can take a large amount of time. 
    There is a very useful debugging tool called the Performance and Memory tool. If you aren't familiar with this tool, it will allow you to see how much memory the various chunks of your code are using and, more importantly here, how much time each subVI is taking to execute. Does the code inside your while loop take longer than 1ms to run? If so, then you will definitely see unwanted logging behavior and will need to change your approach. Would it be possible to collect more than a single sample at a time and perform calculations on a large number of samples at once before writing them to Excel in bigger chunks?
    I've included a link to the LabVIEW help detailing the Profile Performance and Memory tool.
    http://zone.ni.com/reference/en-XX/help/371361H-01/lvdialog/profile/
    I would first try and figure out how long it's taking your loop code to execute and go from there.
    I hope this helps!
    Andy C.
    Applications Engineer
    National Instruments

  • Date Based Dynamic Member Calculation Performance (MTD, YTD, PeriodsToDate)

    I'm working on an SSAS 2012 cube and I have defined several dynamic calculations based on a Date (MTD, YTD, TD, Thru Previous Month, etc.).
    The cube has a well defined Date dimension and I have set up a DYNAMIC calculated set in the cube as shown below.
    CREATE DYNAMIC SET CURRENTCUBE.[Latest Date]
    AS TAIL(EXISTS([Payment Date].[Calendar Date].[Date].Members,[Payment Date].[Calendar Date].CURRENTMEMBER,"Claim Payment"));
    ... and here is an example of one of my calculations.
    CREATE MEMBER CURRENTCUBE.[Measures].[Face Amount Paid MTD]
    AS SUM(MTD([Latest Date].ITEM(0).ITEM(0)), [Measures].[Face Amount Paid]),
    FORMAT_STRING = "$#,##0.00;-$#,##0.00",
    //NON_EMPTY_BEHAVIOR = { [Claim Payment Fact Count] },
    VISIBLE = 1 , DISPLAY_FOLDER = 'Face Amount' , ASSOCIATED_MEASURE_GROUP = 'Claim Payment';
    This calculation returns the correct results, but performs horribly.  I've noticed that if I change the [Latest Date] set to STATIC, performance greatly improves, but the numbers are no longer accurate, as they are based on the Tail of the Claim Payment measure group without considering the filtered dates.  This is because the date is evaluated at process time, but it needs to be based on the user's date selection to be accurate.  Therefore, STATIC does not appear to be an option.  Is there a better way to perform this calculation dynamically based on the filtered or unfiltered date dimension?

    Typically I would just do YTD/MTD/etc off the date the user has selected.  It seems like you want to do it based on the last date with data within the range they have selected.  Why not just give the user what they are asking for?
    In other words, why not
    YTD([Payment Date].[Calendar Date].currentmember)

  • ASCII character/string processing and performance - char[] versus String?

    Hello everyone
    I am a relative novice to Java; I have a procedural C programming background.
    I am reading many very large (many GB) comma/double-quote separated ASCII CSV text files and performing various kinds of pre-processing on them, prior to loading into the database.
    I am using Java 7 (the latest) and NIO.2.
    The IO performance is fine.
    My question is regarding the performance of char[] arrays versus the String and StringBuilder classes with their charAt() methods.
    I read a file one line/record at a time and then I process it. Regex is not an option (too slow, and it cannot handle all the cases I need to cover).
    I noticed that accessing a single character of a given String (or StringBuilder) using the String.charAt(i) method is several times (5+ times?) slower than indexing into a char array.
    My question: is this a correct observation regarding the charAt() versus char[i] performance difference, or am I doing something wrong in my use of the String class?
    What is the best way (performance-wise) to process character strings in Java if I need to process them one character at a time?
    Is there another approach that I should consider?
    Many thanks in advance

    >
    Once I took that String.length() method out of the 'for loop' and used an integer length local variable, as you have in your code, the performance is very close between the char array and String charAt() approaches.
    >
    You are still worrying about something that is irrelevant in the greater scheme of things.
    It doesn't matter how fast the CPU processing of the data is if it is faster than you can write the data to the sink. The process is:
    1. read data into memory
    2. manipulate that data
    3. write data to a sink (database, file, network)
    The reading and writing of the data are going to be tens of thousands of times slower than any CPU you will be using. That read/write part of the process is the limiting factor of your throughput, not the CPU manipulation of step #2.
    Step #2 can only go as fast as steps #1 and #3 permit.
    Like I said above:
    >
    The best 'file to database' performance you could hope to achieve would be loading simple, 'known to be clean' records of a file into ONE table column defined, perhaps, as VARCHAR2(1000); that is, with NO processing of the record at all to determine column boundaries.
    That performance would be the standard you would measure all others against and would typically be in the hundreds of thousands or millions of records per minute.
    What you would find is that you can perform one heck of a lot of processing on each record without slowing that 'read and load' process down at all.
    >
    Regardless of the sink (DB, file, network) when you are designing data transport services you need to identify the 'slowest' parts. Those are the 'weak links' in the data chain. Once you have identified and tuned those parts the performance of any other step merely needs to be 'slightly' better to avoid becoming a bottleneck.
    That CPU part for step #2 is only rarely, if ever, the problem. Don't even consider it for specialized tuning until you demonstrate that it is needed.
    Besides, if your code is properly designed and modularized you should be able to 'plug n play' different parse and transform components after the framework is complete and in the performance test stage.
    >
    The only thing that is fixed is that all input files are ASCII (not Unicode) characters in range of 'space' to '~' (decimal 32-126) or common control characters like CR,LF,etc.
    >
    Then you could use byte arrays and byte processing to determine the record boundaries even if you then use String processing for the rest of the manipulation.
    That is what my framework does. You define the character set of the file and a 'set' of allowable record delimiters as Strings in that character set. There can be multiple possible record delimiters and each one can be multi-character (e.g. you can use 'XyZ' if you want).
    The delimiter set is converted to byte arrays and the file is read using RandomAccessFile and double-buffering and a multiple mark/reset functionality. The buffers are then searched for one of the delimiter byte arrays and the location of the delimiter is saved. The resulting byte array is then saved as a 'physical record'.
    Those 'physical records' are then processed to create 'logical records'. The distinction is due to possible embedded record delimiters as you mentioned. One logical record might appear as two physical records if a field has an embedded record delimiter. That is resolved easily since each logical record in the file MUST have the same number of fields.
    So a record with an embedded delimiter will have fewer fields than required, meaning it needs to be combined with one or more of the following records, as sketched below.
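    A rough sketch of that combining step, in Python for brevity (the framework described here is Java, and names such as expected_fields are placeholders, not the poster's actual API):

    # Illustration only: merge "physical" records that were split by an embedded
    # record delimiter inside a quoted field, using the known field count.
    def to_logical_records(physical_records, expected_fields, field_sep=','):
        logical = []
        pending = ''
        for rec in physical_records:
            pending = rec if not pending else pending + '\n' + rec  # re-join with the delimiter that split them
            # Crude field count; a real parser must ignore separators inside quotes.
            if pending.count(field_sep) + 1 >= expected_fields:
                logical.append(pending)
                pending = ''
        return logical

    # A 3-field record whose middle field contains an embedded newline:
    print(to_logical_records(['1,"first part', 'second part",3'], expected_fields=3))
    # -> ['1,"first part\nsecond part",3']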
    >
    My files have no metadata, some are comma delimited and some comma and double quote delimited together, to protect the embedded commas inside columns.
    >
    I didn't mean the files themselves needed to contain metadata. I just meant that YOU need to know what metadata to use. For example you need to know that there should ultimately be 10 fields for each record. The file itself may have fewer physical fields due to TRAILING NULLCOLS, whereby all consecutive NULL fields at the end of a record do not need to be present.
    >
    The number of columns in a file is variable and each line in any one file can have a different number of columns. Ragged columns.
    There may be repeated null columns in any line, like ,,, or "","","" or any combination of the above.
    There may also be spaces between delimiters.
    The files may be UNIX/Linux terminated or Windows Server terminated (CR/LF or CR or LF).
    >
    All of those are basic requirements and none of them present any real issue or problem.
    >
    To make it even harder, there may be embedded LF characters inside the double quoted columns too, which need to be caught and weeded out.
    >
    That only makes it 'harder' in the sense that virtually NONE of the standard software available for processing delimited files takes that into account. There have been some attempts (you can find them on the net) at using various 'escaping' techniques to escape those characters where they occur, but none of them ever caught on and I have never found any in widespread use.
    The main reason for that is that the software used to create the files to begin with isn't written to ADD the escape characters but is written on the assumption that they won't be needed.
    That read/write for 'escaped' files has to be done in pairs. You need a writer that can write escapes and a matching reader to read them.
    Even the latest version of Informatica and DataStage cannot export a simple one-column table that contains an embedded record delimiter and read it back properly. Those tools simply have NO functionality to let you even TRY to detect that embedded delimiters exist, let alone do anything about it by escaping those characters. I gave up back in the '90s trying to convince the Informatica folk to add that functionality to their tool. It would be simple to do.
    >
    Some numeric columns will also need processing to handle currency signs and numeric formats that are not valid for database input.
    It does not feel like a job for RegEx (I want to be able to maintain the code, and complex RegEx is often 'write-only' code that a 9200bpm modem would be proud of!) and I don't think PL/SQL will be any faster or easier than Java for this sort of character-based work.
    >
    Actually for 'validating' that a string of characters conforms (or not) to a particular format is an excellent application of regular expressions. Though, as you suggest, the actual parsing of a valid string to extract the data is not well-suited for RegEx. That is more appropriate for a custom format class that implements the proper business rules.
    You are correct that PL/SQL is NOT the language to use for such string parsing. However, Oracle does support Java stored procedures, so that could be done in the database. I would only recommend pursuing that approach if you already needed to perform some substantial data validation or processing in the DB to begin with.
    >
    I have no control over format of the incoming files, they are coming from all sorts of legacy systems, many from IBM mainframes or AS/400 series, for example. Others from Solaris and Windows.
    >
    Not a problem. You just need to know what the format is so you can parse it properly.
    >
    Some files will be small, some many GB in size.
    >
    Not really relevant except as it relates to the need to SINK the data at some point. The larger the amount of SOURCE data the sooner you need to SINK it to make room for the rest.
    Unfortunately, the very nature of delimited data with varying record lengths and possible embedded delimiters means that you can't really chunk the file to support parallel read operations effectively.
    You need to focus on designing the proper architecture to create a modular framework of readers, writers, parsers, formatters, etc. Your concern with details about String versus Array is way premature at best.
    My framework has been doing what you are proposing and has been in use for over 20 years by three different major international clients. I have never had any issues with the level of detail you have asked about in this thread.
    Throughput is limited by the performance of the SOURCE and the SINK. The processing in-between has NEVER been an issue.
    A modular framework allows you to fine-tune or even replace a component at any time with just 'plug n play'. That is what Interfaces are all about. Any code you write for a parser should be based on an interface contract. That allows you to write the initial code using the simplest possible method and then later, if and ONLY if that particular module becomes a bottleneck, replace that module with one that is more performant.
    Your initial code should ONLY use standard, well-established constructs until there is a demonstrated need for something else. For your use case that means String processing, not byte arrays (except for detecting record boundaries).
