AXI DMA velocity and DCache doubts

This thread contains two sets of questions.
I'm basing my model on the matrix multiplication example.
First set of questions:
After some optimizations I now have an MM2S throughput of 1009 Mbytes/s and an S2MM throughput of 383 Mbytes/s.
1.1- Why is S2MM so slow compared to MM2S?
1.2- Are these the maximum throughputs you think I can get? Currently I'm not using interrupts; I just wait for each transfer to end before continuing. Will it be faster if I use interrupts?
Second set of questions:
Following some advice, I am going to divide my 512x512 matrices into 32x32 blocks and treat each of them separately, since the Zedboard resources are not that big.
I have already tested my design and the first 32x32 matrix is properly normalized. Input and output values are correct. However, if I use a for loop over all the matrices, the results are not correct, and I think this is related to the cache flush and invalidate.
Here's my code:
#define DIM 32
//Start my IP block and enable auto-restart (doubts about this..)
mm2s_bufferPtr=(u32 *)(SLOPE_ADDR+i*DIM*DIM*4);
result_Ptr =(u32 *)(FINAL_ADDR+i*DIM*DIM*4);
Status = XAxiDma_SimpleTransfer(&axi_dma, (u32)mm2s_bufferPtr, DIM*DIM*4, XAXIDMA_DMA_TO_DEVICE);
if (Status != XST_SUCCESS)
    xil_printf("ERROR! Failed to kick off MM2S transfer!\n\r");
while (XAxiDma_Busy(&axi_dma,XAXIDMA_DMA_TO_DEVICE));
// Kick off DMA S2MM transfer
Status = XAxiDma_SimpleTransfer(&axi_dma, (u32)result_Ptr, DIM*DIM*4, XAXIDMA_DEVICE_TO_DMA);
if (Status != XST_SUCCESS)
    xil_printf("ERROR! DMA transfer from Vivado HLS block failed!\n\r");
while (XAxiDma_Busy(&axi_dma,XAXIDMA_DEVICE_TO_DMA));
I have already read this carefully regarding DCacheFlush and DCacheInvalidate, but I clearly still have not understood them well.
Note: In the linker script I changed the heap and stack sizes to 10Kbytes instead of 1Kbyte.
Ty very much in advance for the help.

Hello jmales,
Right now my main IP (designed with Vivado HLS) is receiving a 32x32 matrix of floats. So, every time I call my IP block within SDK I transfer 4096 bytes with the MM2S and then 4096 bytes back with the S2MM transfer. So is 4096 my bandwidth?
Oh okay, I see what you mean. The remaining question to answer first is:
- How often do you need to transfer a new 32x32 matrix? Or, probably the more useful question: how often do you need to transfer the 512x512 matrices? In other words, how fast are you acquiring new 512x512 matrices? What is your sample rate? This will tell you how much bandwidth you need.
If you only take in a new matrix every 10 seconds, then we only need to move data at a rate of 512 * 512 * 4 (bytes) / 10 = ~105KB/s, in which case we don't really care about DMA efficiency because data is coming in so slowly. All we need to do is clock the DMA in the MHz range and you'll easily be able to keep up. The DMA will be sitting idle for a while as we wait for a new matrix anyway, so any overhead associated with setting up the next transfer will be completely eclipsed and inconsequential. However, if you're taking in a new matrix every 10 milliseconds, we need to move data at 512*512*4/10e-3 = ~105MB/s, which becomes a more difficult problem to solve, and DMA efficiency may become a larger factor.
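To make the arithmetic above easy to repeat for other acquisition rates, here is a minimal C sketch. The function name required_bandwidth_bytes_per_s is made up for illustration, and it assumes square matrices of 4-byte floats as in this thread.

```c
#include <assert.h>

/* Hypothetical helper (not part of any Xilinx API): bytes per second
 * needed to move one n x n matrix of 4-byte floats every period_s
 * seconds. */
double required_bandwidth_bytes_per_s(unsigned n, double period_s)
{
    assert(period_s > 0.0);            /* a zero period is meaningless */
    return (double)n * (double)n * 4.0 / period_s;
}
```

With n = 512, a period of 10.0 seconds gives about 104,858 bytes/s (~105KB/s), and a period of 10e-3 seconds gives about 104,857,600 bytes/s (~105MB/s), matching the two cases above.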
Another thing to think about is latency. Do you care about the absolute time (in milliseconds) from when you acquire your matrix until it arrives at its final destination? This will affect the rate at which you clock your accelerator hardware (including DMAs, interconnects, etc). Take our 10 second matrix acquisition rate example. Even though the data rate is slow, if we need to pass data from DDR to the accelerator, perform the matrix multiplication, then send data back to DDR, all in under 10 milliseconds, then you'll need to run your DMA operations faster. For minimal latency, every clock cycle counts, so DMA configuration overhead might not be acceptable.
I was looking into this and actually posted a new thread regarding that meanwhile. So if I set the frequency of the MM2S and S2MM ports to 200MHz and the frequency of my custom IP block to 100MHz, will a FIFO or two FIFOs make sure everything runs smoothly?
Reading the other thread, I want to be clear about which problem this will solve. Doing this will ensure that your 100MHz clock domain will receive a continuous stream of data (i.e. no bubble cycles where the accelerator is not taking in new data). This is because the FIFO will have data in it which the accelerator can be processing while the DMA is being configured for the next transfer.
From a raw bandwidth (MB/s) perspective, this won't be as high as just running everything at 200MHz and allowing for those handful of bubble cycles where the DMA is being reconfigured.
Maybe running some numbers will help to show the distinction:
1) FIFO case where you run processing at 150MHz with no bubble cycles (in the processing clock domain) due to DMA downtime. 150MHz * 4 bytes per clock * 100% efficiency = 600MB/s
2) No FIFO, everything runs at 200MHz, DMA transfer length is 32*32 = 1024 cycles (or 4096 bytes), and say it takes 20 cycles to reconfigure the DMA for each transfer. The efficiency is 1024/(1024+20) = ~98%. So total bandwidth is 200MHz * 4 bytes per clock * 98% = 784MB/s
3) Again, no FIFO and everything runs at 200MHz with 20 cycles to reconfigure the DMA, but this time let's say we transfer the data one row at a time so our transfer length is 32 cycles (or 128 bytes). Now our efficiency is only 32/(32+20) = ~61.5%. So total bandwidth is 200MHz * 4 bytes per clock * 61.5% = 492MB/s
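The three cases above all follow the same formula, so a small C sketch may help when trying other transfer lengths. The function names are hypothetical, and the 20-cycle reconfiguration cost is only the illustrative figure used above, not a measured value for the AXI DMA.

```c
#include <assert.h>

/* Hypothetical helper: fraction of cycles spent moving data when each
 * transfer of transfer_cycles is followed by setup_cycles of DMA
 * reconfiguration. */
double dma_efficiency(unsigned transfer_cycles, unsigned setup_cycles)
{
    assert(transfer_cycles > 0);
    return (double)transfer_cycles /
           ((double)transfer_cycles + (double)setup_cycles);
}

/* Hypothetical helper: effective bandwidth in MB/s for a stream that
 * moves bytes_per_cycle bytes on every data cycle at clock_mhz. */
double effective_bandwidth_mb_per_s(double clock_mhz,
                                    unsigned bytes_per_cycle,
                                    unsigned transfer_cycles,
                                    unsigned setup_cycles)
{
    return clock_mhz * (double)bytes_per_cycle *
           dma_efficiency(transfer_cycles, setup_cycles);
}
```

Calling effective_bandwidth_mb_per_s(150.0, 4, 1024, 0) reproduces the 600MB/s of the FIFO case, while (200.0, 4, 1024, 20) and (200.0, 4, 32, 20) give roughly 785MB/s and 492MB/s. The middle figure differs slightly from the 784MB/s above only because the efficiency is not rounded to 98% first.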
So if we figure out the number of matrices you need to process per second and the input-to-output latency requirements, then we can start deciding on architectural stuff about how best to move the data around.
