Performance improvement using Tez/Hive

Hi,
I’m a newbie to HDInsight, so sorry for asking simple questions. I have queries around improving the performance of my Hive query over file data of 90 GB (15 GB * 6).
We have set the execution engine to Tez. I heard that the Avro format improves execution speed. Is the Avro SerDe already available to Tez queries, or do I need to upload *.jar files to WASB? I’m using the latest version. Any sample query?
In Tez, will the ORC columnar format and Avro compression work together when we set the ORC compression level in Hive to Snappy or LZO? Is there any limit on the number of columns for ORC tables?
Is there a best compression technique for uploading data files to Blob storage, I mean compress and upload? I used *.gz, which compressed the file to 1/4th of its size before uploading to Blob, but the problem is that *.gz is not splittable and will always use fewer (single) mappers. Or should I use Avro with Snappy compression? Does the Microsoft Avro Library perform Snappy compression, or is there a codec that both compresses and splits?
If the data structure of the file changes over time, will there be any need to reload the older data? Can existing queries work without a change in code?
It has been said that Tez has real-time reporting capability, but when I query the 90 GB file (the query includes GROUP BY and ORDER BY clauses), it takes almost 8 minutes on 20 nodes. Are there any pointers to improve performance further and get the query result in seconds?
Mahender

-- Tez is an execution engine; I don't think you need any additional jar files to get the Avro SerDe working in Hive when Tez is used. You can use AvroSerDe, AvroContainerInputFormat & AvroContainerOutputFormat to get Avro working when Tez is used.
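For example, a minimal sketch of an Avro-backed table (the table name, location, and schema URL here are hypothetical, adjust them to your data):

CREATE EXTERNAL TABLE events_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/events'
TBLPROPERTIES ('avro.schema.url'='wasb:///schemas/events.avsc');

On recent Hive versions, STORED AS AVRO is a shorter equivalent. Because the table reads its schema from the .avsc file, adding a new field with a default value later lets existing queries keep working against older files, which also speaks to your schema-evolution question.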
-- I tried creating a table with about 220 columns; although the table was empty, I was able to query it. How many columns does your table hold?
CREATE EXTERNAL TABLE LargColumnTable02(t1 string,.... t220 string)
PARTITIONED BY(EventDate string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC
LOCATION '/data'
TBLPROPERTIES("orc.compress"="SNAPPY");
-- You can refer to the "Getting Avro data into Azure Blob Storage" section of
http://dennyglee.com/2013/03/12/using-avro-with-hdinsight-on-azure-at-343-industries/
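On Avro with Snappy: Avro container files compress block by block, so Snappy-compressed Avro stays splittable, unlike a whole-file *.gz. A hedged sketch of writing Snappy-compressed Avro from Hive (events_avro and raw_events are the hypothetical tables above):

SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;

INSERT OVERWRITE TABLE events_avro
SELECT * FROM raw_events;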
-- It depends on what data has changed and on whether you are using Hadoop, HBase, etc.
-- You will have to monitor your application and check the node manager logs if there is a pause in execution again. It depends on what you are doing; I would suggest opening a support case to investigate further.

Similar Messages

  • Performance improvement using RIDC

    Hi all,
    Which approach is faster for getting the images?
    Actually I am using RIDC to get the folder information; once I have it, I pass the docId to the GetFileById web service to get the image/file object. This is a bit slow. Can you give any other suggestions to improve performance?
    Below is my code for your reference.
    // To get the folder information
    IdcClientManager manager = new IdcClientManager();
    IdcClient idcClient = manager.createClient(ucmUrl);
    IdcContext userContext = new IdcContext(ucmUserName, ucmPassword);
    DataBinder dataBinder = idcClient.createBinder();
    dataBinder.putLocal("IdcService", "COLLECTION_GET_CONTENTS");
    dataBinder.putLocal("hasCollectionPath", "true");
    dataBinder.putLocal("dCollectionPath", folderPath);
    ServiceResponse response = idcClient.sendRequest(userContext, dataBinder);
    DataBinder serverBinder = response.getResponseAsBinder();
    DataResultSet resultSet = serverBinder.getResultSet("CONTENTS");
    if (resultSet != null) {
        for (DataObject dataObject1 : resultSet.getRows()) {
            String fileId = dataObject1.get("dID");
            // Calling the web service to get the blob data etc.
            GetFileByID request = objFactory.createGetFileByID();
            request.setDID(new Integer(fileId));
            request.setRendition("Primary");
            soapAct = new SoapActionCallback(soapAction);
            // Call the web service
            resp = (GetFileByIDResponse) webServiceTemplate.marshalSendAndReceive(request, soapAct);
            // Response
            fileResponse = resp.getGetFileByIDResult();
            // Once the file data is received, display the blob data in the web page etc.
        }
    }

    Why not do a search:

    public String getURL(String docName) {
        DataBinder binder = idcClient.createBinder();
        binder.putLocal("IdcService", "GET_SEARCH_RESULTS");
        binder.putLocal("QueryText", "dDocName <matches> `" + docName + "`");
        binder.putLocal("ResultCount", "1");
        ServiceResponse response =
            idcClient.sendRequest(new IdcContext(ucmContext.getUsername(),
                                                 ucmContext.getPassword()),
                                  binder);
        DataBinder serverBinder = response.getResponseAsBinder();
        DataResultSet resultSet = serverBinder.getResultSet("SearchResults");
        for (DataObject dataObject : resultSet.getRows()) {
            return dataObject.get("WebfilePath");
        }
        return null;
    }

    You can use the return value in an <img> tag. This is the fastest way to do it and does not cause any performance issues. I also use a similar technique in one of my projects and it works great.

  • Help required regarding performance improvement using reflection.

    Good Day Fellow programmers,
    I need some help.
    I am working on a project in which it is quite unavoidable to use reflection, and I am getting some performance issues because of it. The program is not meeting expectations, and the problem is due to reflection.
    It is sort of an OR mapper in which I have to fetch data from the database, and corresponding to every row in the result set there is a helper class object which holds the data from that table row in its fields.
    Now, my application gets the class name, field names, etc. from an XML file, so I don't know the method names beforehand.
    Since every class instance corresponds to a row in the database and I have to call the get and set methods of each class instance, the performance keeps degrading as the number of columns and rows increases.
    I can't use any existing application (Hibernate etc.) as I am doing some processing in between.
    I have used a method-caching technique, but invoke is still taking too much time.
    Can somebody suggest some improvement regarding this, and regarding creating multiple instances of the same object?
    Please Help
    Thanks
    Saurabh

    The program is not meeting the expectations and the problem is due to reflection.
    Do we know this for certain?
    ... my application gets the class name, field name etc. from an XML file so I don't know their method names beforehand. Now since every class instance corresponds to a row in the database and I have to call get and set methods of each class instance, the performance keeps on degrading as the number of columns and rows increases.
    Class.forName() will be using a hash already, so there is probably not much room for improvement there.
    Class.newInstance() probably does not take significantly more processing than a simple "new Fubar();".
    Umpteen reflective method invocations (one per column) for each row/instance - are you saying these are the problem?
    You can test this easily enough: if you comment out the reflective method invocations and leave the rest of your code untouched, does your application processing speed up significantly?

  • Query performance improvement using pipelined table function

    Hi,
    I have got two select queries. One is like...
    select * from table
    ...and the other uses a pipelined table function:
    select *
    from table(pipelined_function(cursor(select * from table)))
    Which query will return the result set faster?
    Please suggest methods for retrieving a dataset faster (using a pipelined table function) than a normal select query.
    rgds
    somy

    Compare the performance between these solutions:

    create table big as select * from all_objects;

    First, test the performance of a normal select statement:

    begin
      for r in (select * from big) loop
        null;
      end loop;
    end;
    /

    Second, a pipelined function:

    create type rc_vars as object
    (OWNER        VARCHAR2(30)
    ,OBJECT_NAME  VARCHAR2(30));

    create or replace type rc_vars_table as table of rc_vars;

    create or replace function rc_get_vars
    return rc_vars_table
    pipelined
    as
      cursor c_aobj
             is
             select owner, object_name
             from   big;
      l_aobj c_aobj%rowtype;
    begin
      for r_aobj in c_aobj loop
        pipe row(rc_vars(r_aobj.owner, r_aobj.object_name));
      end loop;
      return;
    end;
    /

    Test the performance of the pipelined function:

    begin
      for r in (select * from table(rc_get_vars)) loop
        null;
      end loop;
    end;
    /

    On my system the simple select statement is 20 times faster.
    Correction: it is 10 times faster, not 20.
    Message was edited by: wateenmooiedag

  • How to improve the load performance while using Datasources for the Invoice

    Hi All,
    How can I improve the load performance while using DataSources for the invoice? My invoice load (approx. 0.4 M records) is taking a very long time, nearly ~16 to 18 hrs, to update data from R/3 to 0ASA_DS01.
    If I load through a flat file, it loads within ~20 min for the same amount of data.
    Please suggest how to improve the load performance.
    PS: I have done the InfoPackage settings as per the OSS note.
    Regards,
    Srinivasarao Namburi

    Hi Srinivas,
    Please refer to my blog posting /people/divyesh.jain/blog/2010/07/20/package-size-in-spend-performance-management-extraction which gives the details about the package size setting for extractors. I am sure it will be helpful in your case.
    Thanks,
    Divyesh
    Edited by: Divyesh Jain on Jul 20, 2010 8:47 PM

  • Tabular Model Performance Improvements

    Hi!
    We have built an inline tabular model which has a fact table and 2 dimension tables. The performance of the SSRS report is very slow, and we have a bottleneck in deciding on SSRS as the reporting tool.
    Can you help us with performance improvements for the inline tabular model?
    Regards,

    Hi Bhadri,
    As Sorna said, it is hard to give you detailed tips to improve the tabular model performance with the limited information given. Here is a useful link about performance tuning of tabular models in SQL Server 2012 Analysis Services; please refer to the link below.
    http://msdn.microsoft.com/en-us/library/dn393915.aspx
    If this is not what you want, please elaborate with more detailed information so that we can make further analysis.
    Regards,
    Charlie Liao
    TechNet Community Support

  • What are the limitations in terms of data size  or performance while using csv or text file as datasource?

    Also, what are the limitations in terms of data size or performance-related issues while using a CSV or text file? Is it best practice to use a CSV or text file as a datasource to improve performance? Please advise...

    Hi,
    Create the same data input for CSV and text file, create 2 different reports (one for CSV and one for text), and run them both.
    Then go to the Report menu and select Performance Information. Use the data there to check which one is the better datasource for performance.
    Cheers,
    Rahul

  • DS 5.2 P4 performance improvement

    We have +/- 300,000 users that regularly authenticate using our DS. The user ou is divided into ou=internal (20,000 uids) and ou=external (280,000 uids). Approximately 85-90% of the traffic happens on the internal ou. The question is: could I get any performance improvement by separating the internal branch into its own suffix/database? Would running two databases adversely affect performance instead? We see performance impacts when big searches are performed on the ou=external branch. Would the separation isolate the issue, or would those searches most likely affect the DS as a whole?
    Thanks for your help!
    Enrique.

    Thank you for the info. Are u a Sun guy - do you work for Sun?
    Yes I am. I'm the Architect for Directory Server Enterprise Edition 6.0. Previously I worked on all DS 5 releases (mostly on Replication).
    You are getting the Dukes!
    Thanks.
    Ludovic.

  • Performance improvement in a function module

    Hi All,
    I am using SAP version 6.0. I have a function module to retrieve the POs; for just 10,000 records it is taking a long time.
    Can anyone suggest ways to improve the performance?
    Thanks in advance.

    Moderator message - Welcome to SCN.
    But
    Moderator message - Please see "Please Read before Posting in the Performance and Tuning Forum" before posting.
    Just 10,000 records? The first rule of performance improvement is to reduce the amount of selected data. If you cannot do that, it is going to take time.
    I wouldn't bother with a BAPI for so many records. Write some custom code to get only the data you need.
    Tob

  • Pls help me to modify the query for performance improvement

    Hi,
    I have the below initialization:

    DECLARE @Active bit = 1;
    DECLARE @id int;

    SELECT @Active = CASE WHEN id = @id AND [Rank] = 'Good' THEN 0 ELSE 1 END
    FROM dbo.Students;

    I have to change this query in such a way that the conditions id = @id and [Rank] = 'Good' move into the WHERE clause of the query. In that case, how can I use a CASE statement to retrieve 1 or 0? Can you please help me modify this initialization?

    I don't understand your query... Maybe the below? Or provide us sample data and your expected output...

    SELECT * FROM dbo.Students
    WHERE @Active = CASE WHEN id = @id AND [Rank] = 'Good' THEN 0 ELSE 1 END;

    But I doubt you will get a performance improvement here. Do you have an index on id?
    If you are looking to get the data for @id with [Rank] = 'Good', then use the below (make sure you have an index on the id, [Rank] combination):

    SELECT * FROM dbo.Students
    WHERE id = @id
    AND [Rank] = 'Good';

  • Performance issue using WebView and OpenLayers

    Hi Team,
    I am running JavaFX 2.2 from JDK 7u21, and the OpenLayers version used is 2.12.
    When I try to add a huge number of vectors to a vector layer, the CPU usage goes high and nothing gets displayed.
    Most of the CPU time is spent in com.sun.javafx.sg.prism.NGWebView.update().
    It works well up to around 4K vectors, but goes for a toss beyond that.
    The OpenLayers code is as follows.
    Vector layer definition
    ==============
    var vector1 = new OpenLayers.Layer.Vector("Drivers", {
        styleMap: new OpenLayers.StyleMap({
            "default": new OpenLayers.Style(OpenLayers.Util.applyDefaults({
                pointRadius: 3,
                fillColor: "blue",
                graphicName: "circle",
                fillOpacity: 1
            }, OpenLayers.Feature.Vector.style["default"])),
            "select": new OpenLayers.Style({
                externalGraphic: "${select_externalGraphic}"
            })
        })
    });
    vector1.events.on({
        "featureselected": function(e) {
            // TODO: on selection
            app.printOnConsole("From Vector Event>>" + e.feature.attributes.name);
            app.printOnConsole("From Vector Event" + Object.toJSON(e.feature.attributes));
        },
        "featureunselected": function(e) {
            // TODO: on deselect
        }
    });
    Adding the vector code
    ===============
    var mymarker = new OpenLayers.Feature.Vector(
        new OpenLayers.Geometry.Point(LON, LAT), {
            default_externalGraphic: 'triangle_8.png',
            select_externalGraphic: 'triangle_8.png',
            rat: jsonObj.rat
        });
    mylayer.addFeatures([mymarker]);
    Please suggest if there is a way I can fine-tune my code to display the vectors.

    If running under Windows, make sure you use a 32-bit Java runtime (64-bit Java runtimes on Windows do not have a JavaScript JIT compiler and are many times slower).
    Try Java 8; it has many performance improvements.
    https://jdk8.java.net/download.html
    Use code tags:
    https://forums.oracle.com/forums/ann.jspa?annID=1622

  • Performance improvement in OBIEE 11.1.1.5

    Hi all,
    In OBIEE 11.1.1.5, reports take a long time to load. Kindly provide me some performance improvement guides.
    Thanks,
    Haree.

    Hi Haree,
    Steps to improve the performance.
    1. implement caching mechanism
    2. use aggregates
    3. use aggregate navigation
    4. limit the number of initialisation blocks
    5. turn off logging
    6. carry out calculations in database
    7. use materialized views if possible
    8. use database hints
    9. alter the NQSConfig.INI parameters
    Note: calculate all the aggregates in the repository itself and create a fast refresh for MVs (materialized views).
    You can also schedule an iBot to run the report every hour or so, so that the report data gets cached and, when the user runs the report, the BI Server extracts the data from the cache.
    This is the latest tuning guide for OBIEE 11g:
    http://blogs.oracle.com/pa/resource/Oracle_OBIEE_Tuning_Guide.pdf
    Report level:
    1. Enable caching -- in NQSConfig.INI, change NO to YES.
    2. Go to the Physical layer --> right-click the table --> Properties --> check Cacheable.
    3. Try to implement an aggregate mechanism.
    4. Create indexes/partitions at the database level.
    There are multiple other ways to fine-tune reports from the OBIEE side itself:
    1) You can check the granularity of your measures in reports and have level-based measures created in the RPD using the OBIEE utility.
    http://www.rittmanmead.com/2007/10/using-the-obiee-aggregate-persistence-wizard/
    This will make queries pick your aggregate tables rather than the detailed tables.
    2) You can use cache-seeding options, using an iBot or the NQCMD command utility:
    http://www.artofbi.com/index.php/2010/03/obiee-ibots-obi-caching-strategy-with-seeding-cache/
    http://satyaobieesolutions.blogspot.in/2012/07/different-to-manage-cache-in-obiee-one.html
    OR
    http://hiteshbiblog.blogspot.com/2010/08/obiee-schedule-purge-and-re-build-of.html
    Using one of the above 2 methods, you can fine-tune your reports and reduce the query time.
    Also, on the safer side, take the physical SQL from the log and run it directly on the DB to see the time taken, and check the explain plan with the help of a DBA.
    Hope this helps.
    Thanks,
    Satya
    Edited by: Satya Ranki Reddy on Aug 12, 2012 7:39 PM
    Edited by: Satya Ranki Reddy on Aug 12, 2012 8:12 PM
    Edited by: Satya Ranki Reddy on Aug 12, 2012 8:20 PM

  • MV Refresh Performance Improvements in 11g

    Hi there,
    the 11g New Features Guide says in section "1.4.1.8 Refresh Performance Improvements":
    "Refresh operations on materialized views are now faster with the following improvements:
    1. Refresh statement combinations (merge and delete)
    2. Removal of unnecessary refresh hint
    3. Index creation for UNION ALL MV
    4. PCT refresh possible for UNION ALL MV"
    While I understand (3) and (4), I don't quite understand (1) and (2). Has there been a change in the internal implementation of the refresh (away from a single MERGE statement)? If yes, which? Is there a note or something in the knowledge base about these enhancements in 11g? I couldn't find any.
    These considerations are necessary for the decision on whether to migrate to 11g or not...
    Thanks in advance.

    I am not quite sure what you mean. Do you mean perhaps that the MV logs work correctly when you perform MERGE statements with DELETE on the detail tables of the MV?
    And where are the performance improvements? What is the refresh hint?
    Though I am using MVs and MV logs at the moment, our app performs deletes and inserts in the background (no merges). The MV-log-based fast refresh scales very, very badly, meaning that performance drops very quickly as the changed data set grows.

  • Why GN_INVOICE_CREATE has no performance improvement even in HANA landscape?

    Hi All,
    We have a pricing update program which is used to update the price for a material-customer combination (CMC). This update is done using the FM 'GN_INVOICE_CREATE'.
    The logic is designed to loop over customers, wherein this FM is called with all the materials valid for that customer.
    This process takes days (approx. 5 days) to execute and update the CMC of 100 million records.
    Hence we are planning to move to HANA for a better improvement in performance.
    We built the same program in the HANA landscape and executed it in both systems for 1 customer and 1000 material combinations.
    Unfortunately, both systems gave the same runtime of around 27 seconds.
    This is very disappointing considering the performance improvement we expected on the HANA landscape.
    Could anyone throw light on any areas where we are missing out and why no performance improvement was obtained?
    Also, are there any configuration-related changes to be made on the HANA landscape for better performance?
    The details regarding both systems are as below.
    Suite on HANA:
    SAP_BASIS : 740
    SAP_APPL  : 617
    ECC
    SAP_BASIS : 731
    SAP_APPL  : 606
    Thanks & regards,
    Naseem

    Hi,
    just to fill in on Lars' already exhaustive comments:
    Migrating to HANA gives you lots of options to replace your own functionality (custom ABAP code) with HANA artifacts - views or SQLScript procedures. This is where you can really gain performance. Expecting ABAP code to automatically run faster on HANA may be unrealistic, since it depends on what the code does and how well it "translates" to a HANA environment. The key to really minimizing run time is to replace DB calls with specific HANA views or procedures, then call these from your code.
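    For illustration only, a minimal sketch of what such a push-down might look like as a SQLScript procedure; the procedure, table, and column names here are hypothetical, not the actual pricing objects:

    CREATE PROCEDURE get_customer_prices(
        IN  iv_customer NVARCHAR(10),
        OUT et_prices   TABLE(material NVARCHAR(18), price DECIMAL(15, 2))
    )
    LANGUAGE SQLSCRIPT READS SQL DATA AS
    BEGIN
      -- one set-based selection replaces a per-material loop in ABAP,
      -- so the heavy lifting stays inside the HANA engine
      et_prices = SELECT material, price
                  FROM   prices
                  WHERE  customer = :iv_customer;
    END;

    The ABAP side then consumes the whole result set in a single call instead of looping per material.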
    I wrote a blog on this; you might find it useful as a general introduction:
    A practical example of ABAP on HANA optimization
    When it comes to SAP standard code, like your mentioned FM, it is true that SAP is migrating some of this functionality to HANA-optimized versions, but this doesn't mean everything will be optimized in one go. This particular FM is probably not among those being initially selected for "HANAification", so you basically have to either create your own functionality (which might not be advisable due to the fact that this might violate data integrity) or just be patient.
    But again, the beauty of HANA lies in the brand new options for developers to utilize the new ways of pushing code down to the DB server. Check out the recommendations from Lars and you'll find yourself embarking on a new and exciting journey!
    Also - as a good starting point - check out the HANA developer course on open.sap.com.
    Regards,
    Trond

  • Performance Improvement between GDK and EDK portlets

    Are there any performance improvements to be expected by migrating a portlet from the GDK library to the EDK library? I'm not looking at what GDK and EDK offer, more at whether we would improve the load time of a portal page by changing a portlet from GDK to EDK.

    With GDK, my pages inherit from "Plumtree.Remote.Csp.UI.Page" and under the hood, the context is created (SettingsManager) automatically. Apparently, this is not the case anymore with the EDK. Am I correct?
    According to the EDK doc, I need to call "PortletContextFactory.CreatePortletContext(Request,Response)" for such purpose. Still correct?
    -- Yes, correct. In the EDK, no SettingsManager is used, and the functionality is wrapped into IPortletRequest and IPortletResponse.
    The other, more important change is that with the GDK, the language of the current thread is automatically set to the language passed by the portal in the "Accept-Language" HTTP header. This is no longer the case, to my knowledge, and I found out that I need to insert this:
    String sLanguage = HttpContext.Current.Request.UserLanguages[0];
    System.Threading.Thread.CurrentThread.CurrentCulture = new System.Globalization.CultureInfo(sLanguage);
    Is this correct or did I miss something?
    -- You do not need to use the HttpContext object of .NET. The Plumtree EDK allows you to retrieve the language as follows: the portal language is stored in a User Pref named "strLocale", and a remote portlet can read this User Pref. The only point to note is that, as with all User Prefs, you must ensure that the specific prefs are sent to the portlet in the Portlet Web Service registration.
    PortletRequest.GetSettingValue(Plumtree.Remote.Portlet.SettingType.User, "strTimeZone")
