How do I stop SAS from adding an extra empty byte to each string variable when I use PROC EXPORT?

When I export a dataset to Stata format using PROC EXPORT, SAS 9.4 automatically widens each string variable, adding one extra (empty) byte per observation. For example, in this dataset:

    data test1;
        input cust_id   $ 1
              month     3-8
              category  $ 10-12
              status    $ 14-14
        ;
        datalines;
    A 200003 ABC C
    A 200004 DEF C
    A 200006 XYZ 3
    B 199910 ASD X
    B 199912 ASD C
    ;
    run;

    proc export data = test1
        file = "test1.dta"
        dbms = stata replace;
    run;

the cust_id, category and status variables should be str1, str3 and str1 in the final Stata file and therefore occupy 1 byte, 3 bytes and 1 byte, respectively, for each observation. However, SAS automatically adds an extra empty byte to each string variable, which widens their data types to str2, str4 and str2 in the output Stata file.

This is a serious problem because the extra byte is added to every observation of every string variable. For large datasets (I have about 530 million observations and numerous string variables), this can add a few gigabytes to the exported file.
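As a rough back-of-the-envelope check of that overhead, here is a minimal sketch. The observation count is the one given above; the number of string variables (`n_str_vars`) is a hypothetical placeholder, not the real count.

```sas
/* Rough estimate of the padding overhead: one extra byte per
   observation per string variable. */
data _null_;
    n_obs      = 530e6;  /* approximate observation count from the question */
    n_str_vars = 10;     /* hypothetical number of string variables */
    extra_gb   = n_obs * n_str_vars / 1e9;  /* extra bytes, in gigabytes */
    putlog "Estimated padding overhead (GB): " extra_gb 8.2;
run;
```

Each string variable costs roughly half a gigabyte at this row count, so the total grows linearly with the number of string variables.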

Once the file is loaded into Stata, the compress command can strip these empty bytes and shrink the file. For large datasets, however, PROC EXPORT adds so many extra bytes that I don't always have enough memory to load the dataset into Stata in the first place.

Is there a way to stop SAS from padding string variables in the first place? When I export a file with a one-character string variable (for example), I want that variable to be stored as a one-character string variable in the output file.

2 answers

Here's how you can do it using existing functions, by writing a fixed-width text file in which each variable takes up exactly its format width.

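    filename FT41F001 temp;

    data _null_;
        file FT41F001;
        set test1;
        put 256*' ' @;   /* initialise the output buffer with 256 blanks */
        __s = 1;         /* start column of the next field */
        do while(1);
            length __name $32;
            call vnext(__name);   /* next variable name in the PDV */
            /* stop at the first helper variable (names starting with __) */
            if missing(__name) or __name eq: '__' then leave;
            substr(_FILE_, __s) = vvaluex(__name);   /* formatted value */
            putlog _all_;   /* debugging only; remove for large datasets */
            __s = sum(__s, vformatwx(__name));   /* advance by format width */
        end;
        _file_ = trim(_file_);
        put;
        format month f6.;
    run;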

To avoid using _FILE_:

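    data _null_;
        file FT41F001;
        set test1;
        __s = 1;
        do while(1);
            length __name $32 __value $128 __w 8;
            call vnext(__name);   /* next variable name in the PDV */
            /* stop at the first helper variable (names starting with __) */
            if missing(__name) or __name eq: '__' then leave;
            __value = vvaluex(__name);     /* formatted value */
            __w     = vformatwx(__name);   /* format width */
            put __value $varying128. __w @;   /* write exactly __w characters */
        end;
        put;
        format month f6.;
    run;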
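Both steps depend on the format width of each variable, which is why they attach `f6.` to month. The following minimal sketch (my own illustration, not part of the answer's code) prints the widths SAS reports for the test1 variables when no format is attached. It uses VFORMATW, the variant of VFORMATWX that takes a variable reference rather than a name string; the `w_` variable names are made up for the example.

```sas
/* Inspect the format width of each variable in test1.
   Only the first observation is read, since widths are per variable. */
data _null_;
    set test1(obs=1);
    w_cust_id  = vformatw(cust_id);
    w_month    = vformatw(month);     /* no format attached in this step */
    w_category = vformatw(category);
    w_status   = vformatw(status);
    putlog w_cust_id= w_month= w_category= w_status=;
run;
```

As far as I know, a character variable with no format reports its length, while a numeric variable with no format reports the default width of 12. That would explain why the steps above set `format month f6.`: without it, month would occupy 12 columns instead of 6.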

If you are happy with a flat-file answer, I came up with a fairly simple way of creating one that I think has the properties you need:

    data test1;
        input cust_id   $ 1
              month     3-8
              category  $ 10-12
              status    $ 14-14
        ;
        datalines;
    A 200003 ABC C
    A 200004 DEF C
    A 200006 XYZ 3
    B 199910 SD X
    B 199912 DC
    ;
    run;

    data _null_;
        file "/folders/myfolders/test.txt";
        set test1;
        put @;                    /* hold the output line */
        _FILE_ = cat(of _all_);   /* concatenate every variable */
        put;
    run;

    /* Print contents of the file to the log (for debugging only) */
    data _null_;
        infile "/folders/myfolders/test.txt";
        input;
        put _infile_;
    run;

This should work as-is, provided that the total assigned length of all the variables in your dataset is less than 32767, the limit of the cat function in the data step environment. (The lower limit of 200 characters applies only when cat is used to create a variable that has not been assigned a length.) Beyond that limit, you may run into truncation problems. The workaround when that happens is to cat together only a limited number of variables at a time. That is a manual process, but much less laborious than writing out put statements based on the lengths of all the variables, and depending on your data it may never come up.

Alternatively, you can go down a more complex macro route, getting the variable lengths from the vlength function or the dictionary.columns table and using these, plus the variable names, to build the necessary put statements.
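A minimal sketch of that route, under some assumptions of my own: the dataset is WORK.TEST1, character variables are written with `$CHARw.` at their stored length, numeric variables are written with `best12.` (you would probably want to substitute each variable's real format), and the output path `test1.txt` and the macro variable name `putlist` are hypothetical.

```sas
/* Build a list of "name format" pairs from dictionary.columns,
   e.g.  cust_id $char1. month best12. category $char3. status $char1. */
proc sql noprint;
    select case
               when type = 'char'
                   then catx(' ', name, cats('$char', length, '.'))
               else catx(' ', name, 'best12.')   /* assumed numeric width */
           end
        into :putlist separated by ' '
        from dictionary.columns
        where libname = 'WORK' and memname = 'TEST1'
        order by varnum;
quit;

/* Formatted output writes each field at exactly its format width,
   with no separator inserted between fields. */
data _null_;
    file "test1.txt";   /* hypothetical output path */
    set test1;
    put &putlist;
run;
```

Running `%put &putlist;` after the PROC SQL step is a quick way to check the generated specification before writing the full file.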
