Does overwriting an existing SAS dataset take more time?

I have a short question. If we create a SAS dataset, say Sample.sas7bdat, that already exists, will the code take longer to execute (because it has to overwrite the existing dataset) than it would if the dataset did not exist yet?

    data sample;
      .....
      .....
    run;

I did some research on the Internet but could not find a satisfactory answer. It seems to me that the code should take a little extra time, although I'm not sure how large the effect would be for a 10-gigabyte dataset.

3 answers

You can test this yourself quite easily. A few caveats:

  • Make sure that the dataset is large enough that the difference is not lost in ordinary random background activity on the machine. 100+ MB is usually a good target.
  • Make sure that you run the test several times, the more the better, with no pause between runs. A single test is never sufficient and will always show the first dataset as faster, because it benefits from write caching (basically, the OS reports that the data was written when it was not; the write is merely queued in memory).

Here is an example of my test. This is a data set of 100 million rows with two 8-byte numbers, so 1.6 GB.

First, the results. I see a difference of a few seconds. Why? SAS performs several operations when replacing a dataset:

  • Write the dataset to a temporary file
  • Delete the old dataset
  • Rename the temporary dataset to the new dataset name

On some operating systems this seems to be faster than on others; I have found desktop Windows to be fairly slow at it compared with Unix or even Windows Server, which are quite fast. I assume desktop Windows does something more careful on delete than just changing the file system pointer, but I really don't know. It is certainly not copying the entire file over from the utility directory (there is not enough time for that). I also suspect that write caching is still helping the new datasets a bit, especially since the time for all datasets grows as the test goes on. The real difference is probably only about a second or so; the gap between _REP in iteration 2 and _NEW in iteration 3 seems the most reasonable estimate to me.

 Iteration 1 _NEW=7.26999998099927 _REP=12.9079999922978
 Iteration 2 _NEW=10.0119998454974 _REP=11.0789999961998
 Iteration 3 _NEW=10.1360001564025 _REP=15.3819999695042
 Iteration 4 _NEW=14.7720000743938 _REP=17.4649999142056
 Iteration 5 _NEW=16.2560000418961 _REP=19.2009999752044

Note that the first _NEW iteration is much faster than the others, and that the total time increases as the test goes on (as write caching is less and less able to keep up). I suspect that if you let it run longer (or use an even larger file, which I do not have time for right now), you would see more consistent timings. I'm also not sure what happens with write caching when a file that still has cached writes pending is deleted; maybe the OS has to wait until the cached writes are flushed to disk before it can perform the delete, or something like that. You could run a test that waits 30 seconds between _NEW and _REP to check this.

The code:

    %macro test_me(iter=1);
      %do _i=1 %to &iter.;
        /* First write: test&_i. does not exist yet */
        %let start = %sysfunc(time());
        data test&_i.;
          do x = 1 to 1e8;
            y=x**2;
            output;
          end;
        run;
        /* Second write: replaces the dataset just created */
        %let mid=%sysfunc(time());
        data test&_i.;
          do x = 1 to 1e8;
            y=x**2;
            output;
          end;
        run;
        %let end=%sysfunc(time());
        %let _new = %sysevalf(&mid.-&start.);
        %let _rep = %sysevalf(&end.-&mid.);
        %put Iteration &_i. &=_new. &=_rep.;
      %end;
      /* Clean up: deletes every member of the WORK library */
      proc datasets nolist kill;
      quit;
    %mend test_me;

    options nosource nonotes nomprint nosymbolgen;
    %test_me(iter=5);
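
The 30-second wait suggested above could be tried with a variant of the macro along the following lines. This is only a sketch: the macro name `test_pause` and the `pause=` parameter are made up for illustration, `datetime()` is swapped in for `time()` so that a run crossing midnight does not produce a negative duration, and the pause is excluded from both timings.

```sas
%macro test_pause(iter=1, pause=30);
  %local _i rc start mid stop _new _rep;
  %do _i = 1 %to &iter.;
    /* Time the first write: the dataset does not exist yet */
    %let start = %sysfunc(datetime());
    data test&_i.;
      do x = 1 to 1e8;
        y = x**2;
        output;
      end;
    run;
    %let mid = %sysfunc(datetime());
    %let _new = %sysevalf(&mid. - &start.);

    /* Wait so that queued writes have a chance to reach the disk;
       the second argument 1 means the unit is seconds */
    %let rc = %sysfunc(sleep(&pause., 1));

    /* Time the second write: replaces the dataset created above */
    %let start = %sysfunc(datetime());
    data test&_i.;
      do x = 1 to 1e8;
        y = x**2;
        output;
      end;
    run;
    %let stop = %sysfunc(datetime());
    %let _rep = %sysevalf(&stop. - &start.);

    %put Iteration &_i. &=_new &=_rep;
  %end;
  /* Clean up: deletes every member of the WORK library */
  proc datasets nolist kill;
  quit;
%mend test_pause;

options nosource nonotes nomprint nosymbolgen;
%test_pause(iter=5, pause=30);
```

If the gap between _NEW and _REP shrinks noticeably with the pause in place, that would point toward pending cached writes rather than the delete and rename themselves.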

When overwriting, more file operations occur. After creating the table, SAS will delete the old table and rename the new one. In my tests, it took 0.2 seconds of extra time.


In a short test, my 800-megabyte dataset took about 4 seconds to create as a new dataset and 10-15 seconds to overwrite. I assume this is because SAS must preserve the existing dataset until the DATA step finishes executing, in order to maintain data integrity. This is why you can get the following log message:

 WARNING: Data set dset was not replaced because this step was stopped. 
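
A small way to see this behaviour for yourself is sketched below. The dataset names are illustrative, and `work.no_such_table` is assumed not to exist so that the second step fails. It is meant for an interactive session; in batch mode SAS may switch to syntax-check mode after the error, and the final PROC PRINT would then print nothing.

```sas
/* Create an initial one-row dataset */
data work.dset;
  x = 1;
run;

/* This step stops with an error because the input table is
   assumed not to exist, so work.dset should not be replaced */
data work.dset;
  set work.no_such_table;
  x = 2;
run;

/* The original row with x = 1 should still be there */
proc print data=work.dset;
run;
```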

Overwrite test

 NOTE: The data set WORK.SAMPLE has 100000000 observations and 1 variables.
 NOTE: DATA statement used (Total process time):
       real time                       10.06 seconds
       user cpu time                   3.08 seconds
       system cpu time                 1.48 seconds
       memory                          1506.46k
       OS Memory                       26268.00k
       Timestamp                       08/12/2014 11:43:06 AM
       Step Count                      42  Switch Count  38
       Page Faults                     0
       Page Reclaims                   155
       Page Swaps                      0
       Voluntary Context Switches      190
       Involuntary Context Switches    288
       Block Input Operations          0
       Block Output Operations         1588496

New data test

 NOTE: The data set WORK.SAMPLE1 has 100000000 observations and 1 variables.
 NOTE: DATA statement used (Total process time):
       real time                       3.94 seconds
       user cpu time                   3.14 seconds
       system cpu time                 0.80 seconds
       memory                          1482.18k
       OS Memory                       26268.00k
       Timestamp                       08/12/2014 11:43:10 AM
       Step Count                      43  Switch Count  38
       Page Faults                     0
       Page Reclaims                   112
       Page Swaps                      0
       Voluntary Context Switches      99
       Involuntary Context Switches    294
       Block Input Operations          0
       Block Output Operations         1587464

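The main difference between the two logs is the real time (10.06 versus 3.94 seconds), while the user CPU time and the number of block output operations are almost identical. To me this indicates that the extra time is spent on file system operations on the dataset files rather than on producing the data.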
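
The code behind these logs is not shown, so the following is only a guess at a comparable test. It assumes a single numeric variable `x` and 100 million rows (matching the observation and variable counts in the logs) and uses the `FULLSTIMER` system option to get the detailed resource statistics.

```sas
/* Ask SAS to write the full list of resource statistics to the log */
options fullstimer;

/* Create work.sample so that there is something to overwrite */
data work.sample;
  do x = 1 to 1e8;
    output;
  end;
run;

/* Overwrite test: work.sample already exists */
data work.sample;
  do x = 1 to 1e8;
    output;
  end;
run;

/* New data test: work.sample1 is assumed not to exist yet */
data work.sample1;
  do x = 1 to 1e8;
    output;
  end;
run;
```

Comparing the "real time" lines of the second and third steps in the log gives the overwrite versus new-dataset timing.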

NB: I tested this on SAS (r) Proprietary Software Release 9.4 TS1M2, which I run through SAS Studio online. I believe it runs on Linux; results may vary depending on your operating system.
