XML Solutions for Large Data Sets

Hi,
I'm working with a large data set (9 million records comprising 36 gigabytes) and am exploring the use of XML with it.
I've experimented with a JDBC app (taken straight from Steve Muench's excellent Oracle XML Applications) for writing to CLOBs, but I'm achieving throughput well below 40 KB/s, roughly the minimum rate needed to get through 36 GB in under 10 days.
What kind of throughput is possible when loading XML records from CLOBs into multiple tables (using server-side Java apps)?
Could anyone comment on whether XML is feasible for a data set of this size?
Regards,
Mike

I'd just like to identify myself (I'm the submitter):
Michael Driscoll <[email protected]>.
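
For reference, one common pattern for this kind of load is to stream-parse each CLOB with SAX (so the document is never fully materialised) and batch the inserts over JDBC with manual commits. The sketch below is hypothetical: the table names (master_t, detail_t), the element names (record, id, name, amount) and the batch size are placeholders, not taken from the original post, and a real load would map them to the actual schema.

// Hypothetical sketch: stream-parse <record> elements from a CLOB and batch-insert
// into two tables. Table names, element names and batch size are placeholders.
import java.io.Reader;
import java.sql.Connection;
import java.sql.PreparedStatement;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ClobXmlLoader {

    private static final int BATCH_SIZE = 500;   // tune against memory and redo/undo

    public static void load(Connection conn, Reader clobReader) throws Exception {
        conn.setAutoCommit(false);               // commit per batch, not per row
        try (PreparedStatement master = conn.prepareStatement(
                 "INSERT INTO master_t (id, name) VALUES (?, ?)");
             PreparedStatement detail = conn.prepareStatement(
                 "INSERT INTO detail_t (master_id, amount) VALUES (?, ?)")) {

            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(clobReader), new DefaultHandler() {
                private final StringBuilder text = new StringBuilder();
                private String id, name, amount;
                private int pending;

                @Override public void startElement(String uri, String local, String qName, Attributes a) {
                    text.setLength(0);
                    if ("record".equals(qName)) { id = name = amount = null; }
                }
                @Override public void characters(char[] ch, int start, int len) {
                    text.append(ch, start, len);
                }
                @Override public void endElement(String uri, String local, String qName) {
                    try {
                        if ("id".equals(qName))          id = text.toString().trim();
                        else if ("name".equals(qName))   name = text.toString().trim();
                        else if ("amount".equals(qName)) amount = text.toString().trim();
                        else if ("record".equals(qName)) {
                            master.setString(1, id);   master.setString(2, name);   master.addBatch();
                            detail.setString(1, id);   detail.setString(2, amount); detail.addBatch();
                            if (++pending % BATCH_SIZE == 0) {
                                master.executeBatch(); detail.executeBatch(); conn.commit();
                            }
                        }
                    } catch (Exception e) { throw new RuntimeException(e); }
                }
            });
            master.executeBatch();
            detail.executeBatch();
            conn.commit();
        }
    }
}

The Reader would typically come from the CLOB column itself, e.g. rs.getCharacterStream("xml_doc") while iterating the staging table ("xml_doc" being a placeholder column name).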

Similar Messages

  • Report download is slow for large data sets

    APEX 4.1
    I have a classic report which retrieves around 100,000+ rows.
    Downloading the report as Excel takes 5-10 minutes.
    Any solution for this?
    Can we download in sets, e.g. the first 1000 records and then the next 1000, etc.?

    What I understood is that we can export CSV in the background using Kubicek's export_to_excel package.
    1. We can provide a button to execute the procedure. - jfosteroracle says the custom CSV download was slow
    2. Use a job to download the Excel file in the background. - need to check with the client whether they wish to go ahead with this.
    Correct. You need to use the custom package and a button on the page to submit the request so the report is generated in the background.
    Is it possible to zip a file first and then download it?
    No. To my knowledge it's not possible to zip the file first and then download it.
    Thanks
    Lakshmi

  • Best Version of SQL Server to Run on Windows 7 Professional for Large Data Sets

    My company will soon be upgrading my work PC from XP to Windows 7 Professional. I am currently running SQL Server 2000 on my PC and use it to load and analyze large volumes of data. I often need to work with 3 GB to 5 GB of data, and have had databases reach 15 GB in size. What would be the best version of SQL Server to install on my PC after the upgrade? SQL Server Express just won't cut it. I need more than 2 GB of data and the current version of DTS functionality to load and transform data.
    Thanks.

    Hi,
    It's difficult to say what would be best for you. You can install SQL Server 2012 Standard Edition, because that is supported on Windows 7 SP1; Enterprise Edition is not supported. SQL Server 2012 Express now has a database size limit of 10 GB (which does not include FILESTREAM or log file size), but it does not provide SSIS features.
    Just have a look at the features supported by the various editions of SQL Server; it will help you decide.
    My inclination is towards SQL Server 2012, because it is now on SP2 and, in my personal opinion, more stable than 2014.

  • Large data sets and key terms

    Hello, I'm looking for some guidance on how BI can help me. I am a business analyst in a health solutions firm, but not proficient in SQL. However, I have to work with large data sets that just exceed the capabilities of Excel.
    Basically, I'm having to use Excel to manually search for key terms and apply values to those results. For instance, I have a medical claims file with Provider Names, Tax ID, Charges, etc. It's 300,000 records long and 15-25 columns wide. I need to search for key terms in the provider name like Ambulance, Fire Dept, Rescue, EMT, EMS, etc. - anything that resembles an ambulance service. I also need to include abbreviations such as AMB or FD, and variations like EMT, E M T, EMS, E M S, etc. Each time I do a search, I have to filter and apply an "N/A" flag.
    That's just one key term. I also have things like Dentists or DDS, Vision, Optometry and a dozen other Provider Types that need to be flagged as "N/A".
    Is this something that can be handled using BI? I have access to a BI group, but I need to understand more about the capabilities of what can be done. As an analyst, I'm having to deal with poor data integrity, so just cleaning up the file can be extremely taxing and cumbersome.
    Some insight would be very helpful. Thanks.

    I am not sure if you are looking for an explanation of different BI products; if so, this forum may not be the place to get a straight answer.
    But the Information Discovery product suite might be useful in your case. Regarding the "large data set" you mentioned, searching and analyzing 300,000 records may not be considered a large data set, at least by Endeca standards :).
    All your other requests could also be implemented very easily using Endeca's product suite. Please reach out to Oracle's Endeca product team; they can guide you on how this product suite would help you.
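
    If the BI route takes a while to spin up, the flagging itself is small enough to script directly. A hypothetical sketch in Java - the file name, column position and term list below are assumptions for illustration, not anything from the post:

    // Hypothetical sketch: flag rows whose provider name matches ambulance-style terms.
    // File layout, column index and the term list are assumptions for illustration.
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.regex.Pattern;

    public class ProviderFlagger {
        // Word-boundary match on whole terms; "E M T"-style spacing handled by \s*
        private static final Pattern AMBULANCE = Pattern.compile(
            "\\b(AMBULANCE|AMB|FIRE\\s+DEPT|FD|RESCUE|E\\s*M\\s*T|E\\s*M\\s*S)\\b",
            Pattern.CASE_INSENSITIVE);

        public static void main(String[] args) throws IOException {
            int providerCol = 1;  // assumed position of the provider-name column
            try (BufferedReader in = Files.newBufferedReader(Paths.get("claims.csv"));
                 PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get("claims_flagged.csv")))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] cols = line.split(",", -1);          // naive CSV split; no quoted commas
                    String flag = (cols.length > providerCol
                            && AMBULANCE.matcher(cols[providerCol]).find()) ? "N/A" : "";
                    out.println(line + "," + flag);               // append the flag as a new column
                }
            }
        }
    }

    Each additional provider type (DDS, Vision, etc.) would just be another pattern applied in the same pass.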

  • How to handle large data sets?

    Hello All,
    I am working on an editable form document. It uses a flowing subform with a table. The table may contain up to 50k rows, and the generated PDF may take up to 2-4 GB of memory; in some cases Adobe Reader fails and "gives up" opening these large data sets.
    Any suggestions? 

    On 25.04.2012 01:10, Alan McMorran wrote:
    > How large are you talking about? I've found QVTo scales pretty well as the dataset size increases, but we're using at most maybe 3-4 million objects as the input and maybe 1-2 million on the output. They can be pretty complex models though, so we're seeing 8GB heap spaces in some cases to accommodate the full transformation process.
    Ok, that is good to know. We will be working in roughly the same order of magnitude. The final application will run on a well equipped server; unfortunately my development machine is not as powerful, so I can't really test that.
    > The big challenges we've had to overcome is that our model is essentially flat with no containment in it so there are parts of the
    We have a very hierarchical model. I still wonder to what extent EMF and QVTo at least try to let go of objects which are not needed anymore and allow them to be garbage collected?
    > Is the GC overhead limit not tied to the heap space limits of the JVM?
    Apparently not, quoting http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html:
    "The concurrent collector will throw an OutOfMemoryError if too much time is being spent in garbage collection: if more than 98% of the total time is spent in garbage collection and less than 2% of the heap is recovered, an OutOfMemoryError will be thrown. This feature is designed to prevent applications from running for an extended period of time while making little or no progress because the heap is too small. If necessary, this feature can be disabled by adding the option -XX:-UseGCOverheadLimit to the command line."
    I will experiment a little bit with different GCs, namely the parallel GC.
    Regards
    Marius
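
    An aside, not from the original exchange: when trying out different collectors (e.g. -XX:+UseParallelGC, or disabling the limit with -XX:-UseGCOverheadLimit as quoted above), it can help to confirm from inside the JVM which collectors are actually active. A minimal Java sketch using the standard management API:

    // Minimal sketch: print the garbage collectors the running JVM is using,
    // useful when trying out different -XX GC options from the command line.
    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcInfo {
        public static void main(String[] args) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.println(gc.getName()
                        + "  collections=" + gc.getCollectionCount()
                        + "  time(ms)=" + gc.getCollectionTime());
            }
            System.out.println("max heap (MB): "
                    + Runtime.getRuntime().maxMemory() / (1024 * 1024));
        }
    }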

  • Working with Large data sets Waveforms

    When collecting data at a high rate (30 kHz) and for a long period (120 seconds), I'm unable to rearrange the data due to memory errors. Is there a more efficient method?
    Attachments:
    Convert2Dto1D.vi (36 KB)

    Some suggestions:
    Preallocate your final data before you start your calculations.  The build array you have in your loop will tend to fragment memory, giving you issues.
    Use the In Place Element to get data to/from your waveforms.  You can use it to get single waveforms from your 2D array and Y data from a waveform.
    Do not use the Transpose and autoindex.  It is adding a copy of data.
    Use the Array palette functions (e.g. Reshape Array) to change sizes of current data in place (if possible).
    You may want to read Managing Large Data Sets in LabVIEW.
    Your initial post is missing some information.  How many channels are you acquiring and what is the bit depth of each channel?  30kHz is a relatively slow acquisition rate for a single channel (NI sells instruments which acquire at 2GHz).  120s of data from said single channel is modestly large, but not huge.  If you have 100 channels, things change.  If you are acquiring them at 32-bit resolution, things change (although not as much).  Please post these parameters and we can help more.

  • Need to load large data set from Oracle table onto desktop using ODBC

    I don't have TOAD or any other tool for querying the database. I'm wondering how I can load a large data set from an Oracle table onto my desktop using Excel, Access, or some other tool, with or without ODBC. I need the results in a .csv file or something similar. Speed is what is important here. I'm looking to load more than 1 million but fewer than 10 million records at once. Thanks.

    Use Oracle's free SQL Developer
    http://www.oracle.com/technetwork/developer-tools/sql-developer/downloads/index.html
    You can just issue a query like this
    SELECT /*csv*/ * FROM SCOTT.EMP
    Then just save the results to a file
    See this article by Jeff Smith for other options
    http://www.thatjeffsmith.com/archive/2012/05/formatting-query-results-to-csv-in-oracle-sql-developer/
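
    If installing a tool isn't possible at all, the same extract can be scripted over plain JDBC; the main lever for speed on millions of rows is a large fetch size. A rough, hypothetical Java sketch - the connection URL, credentials, query and output file are placeholders, and it needs the Oracle JDBC driver on the classpath:

    // Hypothetical sketch: dump a query result to CSV over JDBC with a large fetch size.
    // URL, credentials, query and file name are placeholders; CSV quoting is left naive.
    import java.io.PrintWriter;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.Statement;

    public class TableToCsv {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//dbhost:1521/ORCL", "scott", "tiger");
                 Statement stmt = conn.createStatement();
                 PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get("emp.csv")))) {

                stmt.setFetchSize(5000);                       // fetch rows in big chunks, not one by one
                try (ResultSet rs = stmt.executeQuery("SELECT * FROM scott.emp")) {
                    ResultSetMetaData md = rs.getMetaData();
                    int n = md.getColumnCount();
                    while (rs.next()) {
                        StringBuilder row = new StringBuilder();
                        for (int i = 1; i <= n; i++) {
                            if (i > 1) row.append(',');
                            String v = rs.getString(i);
                            row.append(v == null ? "" : v);
                        }
                        out.println(row);
                    }
                }
            }
        }
    }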

  • 64-bit LabVIEW - still major problems with large data sets

    Hi Folks -
    I have LabVIEW 2009 64-bit version running on a Win7 64-bit OS with Intel Xeon dual quad core processor, 16 gbyte RAM.  With the release of this 64-bit version of LabVIEW, I expected to easily be able to handle x-ray computed tomography data sets in the 2 and 3-gbyte range in RAM since we now have access to all of the available RAM.  But I am having major problems - sluggish (and stoppage) operation of the program, inability to perform certain operations, etc.
    Here is how I store the 3-D data that consists of a series of images. I store each of my 2d images in a cluster, and then have the entire image series as an array of these clusters.  I then store this entire array of clusters in a queue which I regularly access using 'Preview Queue' and then operate on the image set, subsets of the images, or single images.
    Then I enqueue that array (screenshot of the enqueue omitted).
    I remember talking to LabVIEW R&D years ago that this was a good way to do things because it allowed non-contiguous access to memory (versus contiguous access that would be required if I stored my image series as a 3-D array without the clusters) (R&D - this is what I remember, please correct if wrong).
    Because I am experiencing tremendous slowness in the program after these large data sets are loaded, and I think disk access as well to obtain memory beyond 16 gbytes, I am wondering if I need to use a different storage strategy that will allow seamless program operation while still using RAM storage (do not want to have to recall images from disk).
    I have other CT imaging programs that are running very well with these large data sets.
    This is a critical issue for me as I move forward with LabVIEW in this application.   I would like to work with LabVIEW R&D to solve this issue.  I am wondering if I should be thinking about establishing say, 10 queues, instead of 1, to address this.  It would mean a major program rewrite.
    Sincerely,
    Don

    First, I want to add that this strategy works reasonably well for data sets in the 600 - 700 mbyte range with the 64-bit LabVIEW. 
    With LabVIEW 32-bit, 100 - 200 mbyte sets were about the limit before I experienced problems.
    So I definitely noticed an improvement.
    I use the queuing strategy to move this large amount of data in RAM.   We could have used other means such a LV2 globals.  But the idea of clustering the 2-d array (image) and then having a series of those clustered arrays in an array (to see the final structure I showed in my diagram) versus using a 3-D array I believe even allowed me to get this far using RAM instead of recalling the images from disk.
    I am sure data copies are being made - yes, the memory is ballooning to 15 gbyte. I probably need to have someone examine this code while I am explaining things to them live. This is a very large application, and a significant amount of time would be required to simplify it, and that might not allow us to duplicate the problem. In some of my applications, I use the in-place structure for indexing data out of arrays to minimize data copies. I expect I might have to consider this strategy here as well. Just a thought.
    What I can do is send someone (in the US) via large file transfer a 1.3 - 2.7 gbyte set of image data - and see how they would best advise on storing and extracting the images using RAM, how best to optimize the RAM usage, and avoid making data copies. The operations that I apply to the images are irrelevant. It is the storage, movement, and extraction that are causing the problems. I can also show screen shot(s) of how I extract the images (but I have major problems even before I get to that point).
    Can someone else comment on how data value references may help here, or how they have helped in one of their applications?  Would the use of this eliminate copies?   I currently have to wait for 64-bit version of the Advanced Signal Processing Toolkit for LabVIEW 2010 before I can move to LabVIEW 2010.
    Don

  • Sample XML request for Parametric Data Collection.

    Hello Experts,
    Can anyone please post a sample XML request for Parametric Data Collection?
    Thanks in advance.
    Rgds
    Nityanand Singh

    Stuart,
    I have dealt with the issue you mentioned as a bug with DcGroupRef. In fact, it is not a bug but a misinterpretation of the WS. The fields you mentioned are part of Test Plan measurement collection (the MeasureGroup and MeasureName fields in particular). However, if you check the database, Test Plan measurement values are collected in a separate table which is not processed by the WS at all; the WS pushes the data to the table used by DC500, and within this context DcGroupRef is a must.
    I have rectified such a situation at one customer by manually adding the values to the incomplete records. They used to collect the data that way as of ME 5.0 or even earlier by means of Prod XML.
    The implementations of the WS in 5.2 and 6.0 are completely different, especially taking into account the switch to using PAPI.
    So, it is not a bug but a feature not implemented yet.
    And with that, I'm taking over the ticket.
    Regards,
    Sergiy

  • Not all my titles are showing for each data set

    Hello,
    I have made a column graph for 4 different data sets, and I need the titles to show for each data set along the X axis. I highlight the titles in my data when making the graph; however, only two of the four are showing. Is there any way I can get all 4 to show?
    (Screenshot omitted: the titles are needed after Baseline and after Numb Hand.)
    Thank you

    Try clicking the chart then the blue 'Edit Data References' rectangle, and then in the lower left of the window switch from 'Plot Columns as Series' to 'Plot Rows as Series'.  Then switch back if needed. All four should then show up.
    SG

  • XML file for large number of data records

    I have to create an XML file using data from multiple tables. The problem I am facing is that the data is huge - millions of rows - so I was wondering whether there is any effective way of creating such an XML file.
    It would be great if you could suggest an approach to achieve my requirement.
    Thanks,
    -Vinod

    You'd probably be better off asking over in the XML DB forum, as that forum is dedicated to the XML side of things.
    The FAQ in that forum is here: {thread:id=410714}
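
    Whichever forum answers the details, the general pattern for millions of rows is the same: fetch with a cursor and write each row through a streaming XML writer instead of building the whole document in memory. A hypothetical Java sketch using StAX - the query, column names and element names are placeholders:

    // Hypothetical sketch: write millions of rows as XML without building the document in memory.
    // Query, columns and element names are placeholders.
    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamWriter;

    public class RowsToXml {
        public static void export(Connection conn) throws Exception {
            XMLOutputFactory factory = XMLOutputFactory.newInstance();
            try (Statement stmt = conn.createStatement();
                 BufferedOutputStream fileOut =
                     new BufferedOutputStream(new FileOutputStream("export.xml"))) {

                stmt.setFetchSize(1000);                             // stream from the cursor
                XMLStreamWriter xml = factory.createXMLStreamWriter(fileOut, "UTF-8");
                xml.writeStartDocument("UTF-8", "1.0");
                xml.writeStartElement("rows");
                try (ResultSet rs = stmt.executeQuery(
                         "SELECT id, name FROM some_table")) {       // placeholder query
                    while (rs.next()) {
                        String id = rs.getString("id");
                        String name = rs.getString("name");
                        xml.writeStartElement("row");
                        xml.writeStartElement("id");
                        xml.writeCharacters(id == null ? "" : id);
                        xml.writeEndElement();
                        xml.writeStartElement("name");
                        xml.writeCharacters(name == null ? "" : name);
                        xml.writeEndElement();
                        xml.writeEndElement();                       // </row>
                    }
                }
                xml.writeEndElement();                               // </rows>
                xml.writeEndDocument();
                xml.close();
            }
        }
    }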

  • XY graphs under-perform on large data sets

    If for example you have 3 signals with 8 million points each and you plot these on a regular waveform graph, the user interface is able to display the data smoothly. All graph palette operations (zoom, scroll etc.) respond in "real-time".
    Put the same 3x8 million points on an XY graph, and you have one sluuuuggish user interface. Scrolling is for example no longer possible in any practical fashion.
    I'm sure a lot of it has to do with the overhead of having all those X-values (often unnecessarily many - as discussed in this idea), but the performance degradation compared to a regular waveform graph (even if the latter is fed twice the amount of Y values for example) is severe.
    Are there ways around this performance issue? Sure. We can e.g. write code that decimates the data we send to the indicator, and refills it when the user zooms or scrolls and therefore needs additional data points. But this requires lots of code, and can never become as transparent/integrated and smooth as an implementation within the indicator itself. 
    And competing products are already there, that's what bugs me right now. I've got colleagues that get such functionality "for free" with the graphing tools they have.
    So, we're about to develop an XControl that makes it possible to present such large non-continuous data sets in a smooth manner. (Ironically, one solution is to add data points so that I have continuous data - and then use the regular graph...) But has anyone already done this? And how far off is a native XY graph indicator that makes such code obsolete?
    MTO
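
    For what it's worth, the decimation pass itself is small in any language; the usual trick is min/max bucketing so peaks survive the reduction. A hypothetical sketch in Java (the LabVIEW equivalent would be a loop over array subsets); names are illustrative only, and the X array would be reduced with the same bucket indices:

    // Hypothetical sketch of min/max decimation: reduce a huge Y array to ~2*buckets points
    // so a graph stays responsive while peaks remain visible.
    public final class Decimator {
        public static double[] minMaxDecimate(double[] y, int buckets) {
            if (buckets <= 0) throw new IllegalArgumentException("buckets must be positive");
            if (y.length <= 2 * buckets) return y;            // already small enough
            double[] out = new double[2 * buckets];
            int perBucket = y.length / buckets;
            for (int b = 0; b < buckets; b++) {
                int start = b * perBucket;
                int end = (b == buckets - 1) ? y.length : start + perBucket;
                double min = y[start], max = y[start];
                for (int i = start + 1; i < end; i++) {
                    if (y[i] < min) min = y[i];
                    if (y[i] > max) max = y[i];
                }
                out[2 * b] = min;                              // keep both extremes per bucket
                out[2 * b + 1] = max;
            }
            return out;
        }
    }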

    Have a look at the topic "Lost reference of main controller within popup".
    "I hate windows popups", and MVC too.
    In newer versions there is a nice popup managed via DHTML (like Web Dynpro does), but basically you should have a common reference to the data somewhere. You can use server-side cookies, attributes of your application class, public and static attributes of a specific controller...
    Sergio

  • Just in case anyone needs an Observable Collection that deals with large data sets, and supports FULL EDITING...

    The VirtualizingObservableCollection does the following:
    Implements the same interfaces and methods as ObservableCollection<T> so you can use it anywhere you’d use an ObservableCollection<T> – no need to change any of your existing controls.
    Supports true multi-user read/write without resets (maximizing performance for large-scale concurrency scenarios).
    Manages memory on its own so it never runs out of memory, no matter how large the data set is (especially important for mobile devices).
    Natively works asynchronously – great for slow network connections and occasionally-connected models.
    Works great out of the box, but is flexible and extendable enough to customize for your needs.
    Has a data access performance curve so good it’s just as fast as the regular ObservableCollection – the cost of using it is negligible.
    Works in any .NET project because it’s implemented in a Portable Code Library (PCL).
    The latest package can be found on NuGet: Install-Package VirtualizingObservableCollection. The source is on GitHub.

    Good job, thank you for sharing
    Best Regards,

  • Date Format for Spry data set sort

    Hi,
    Just feeling my way through the use of Spry datasets for the first time, and I have a couple of issues that hopefully someone with more knowledge of them than me knows the answer to.
    I had an issue with my Spry dataset initially: it would not work in IE7 but was OK in FF3. After some mucking around I realised the error I was getting - albeit obscure in IE - was telling me it was a date format issue in my data set.
    The db data I am playing with here is fairly simple: I have a couple of text fields, an integer field that contains the unixtimestamp of the entry (it's a simple diary application) and a formal date field that holds the same date in MySQL's date format.
    When I display the date in the dataset I do so in the format "23rd May 2009" (as an example) - I code this using PHP's date function in my XML query.
    I had set this field to date format in the Spry dataset conditions, but IE seems to barf on this. I can change it to textstring, but then my sort is done as an alpha/numeric sort on the first character of the date field, which is rubbish.
    I only display the date in this format and one of the other text fields in my dataset - the unixdate is for programmatic purposes, not general display, so I cannot sort on this field if it's invisible. How can I achieve a date-based sort with this setup, or what date formats does Spry prefer for date sorts? (My client wants the date to show as I have explained.) Many thanks. I have a second query which I will post separately!
    Kenny

    "Tanushiheadbash" <[email protected]> wrote
    in message
    news:gqa70o$iat$[email protected]..
    > I think I follow what you are saying and in fact I think
    its what I
    > already
    > have. I have set the sort order to use the unixtime when
    the page
    > initially
    > loads and thats OK.
    I am sorry, but you aren't following my explanation. English
    isn't my mother
    tongue, and I am not able to explain it any better.
    > However what I need to be able to do is to have the
    AJAX/Javascript sort
    > (done in this case with Spry- ) to sort on the date when
    the column header
    > is
    > clicked. The problem I have is the date format in this
    visible column is
    > in
    > DDth Month YYYY format and Spry does not recocnise this
    as a date format-
    > it
    > wants it as a string ( or ie gives an error). Maybe its
    not possible what
    > I am
    > trying to do- just thought there might be a clever way
    to implement this.
    You can take a look at this page:
    http://visual.unipv.it/tmt_calendar/admin/reports/events.cfm
    Even if all the dates here are incidentally using the
    yyyy-mm-dd format, the
    dataset display the date from the "start_date_formatted"
    field:
    <td><a href="javascript:"
    onclick="showUpdate('{event_id}')">{start_date_formatted}</a></td>
    But uses another field to sort the table:
    <th scope="col" spry:sort="start_date">Start
    date</th>
    You can have the same date, using two different formats,
    inside two,
    separated dataset fields. One is used for display, the other
    one is used for
    sort.
    You may try to read again my previous explanations, look at
    the code in the
    page above and see if you get the idea.
    Massimo Foti, web-programmer for hire
    Tools for ColdFusion, JavaScript and Dreamweaver:
    http://www.massimocorner.com

  • Reg:Efficient solution for a data upload scenario

    Hi All,
            I have the following task.
    Required: data from a legacy system (which generates data only in the form of flat files) needs to be loaded into SAP R/3 as FB01 journals, and the output file should be generated periodically (daily, weekly, fortnightly, etc.).
    Solution Approaches:
    1) Write a BDC program to extract the data.
    2) Write an ABAP program to populate an IDoc (if a standard IDoc is available), or generate an outbound proxy (if a standard IDoc is not available) to push the data into SAP XI.
    Could anyone tell me which would be the best and most efficient approach for this task? I need your recommendations.
    Thanks in Advance.
    B.Lavanya
    Edited by: Lavanya Balanandham on Mar 31, 2008 2:23 PM

    Hi Lavanya,
    Required data from a legacy system (which generates data only in the form of flat files) to SAP R/3 as FB01 journals - use BDC for this, because it will be better for large source files.
    The output file should be generated periodically (daily, weekly, fortnightly, etc.) - if this output file contains an acknowledgment for the data uploaded by the above process, create an ABAP report for it and schedule it. But if this output contains some other IDoc data which you need to send as a file to a third-party system, then go for SAP XI, provided the IDoc data is not too large; if the IDoc size is huge, then just create an ABAP report to output the data to a file on the application server and FTP the file to the third-party system.
    Regards,
    Rajeev Gupta
