Implementing string literal concatenation in C and C++

AFAIK, this question applies equally to C and C++.

Phase 6 of the "translation phases" specified in the C standard (5.1.1.2 in the draft C99 standard) states that adjacent string literals are concatenated into a single literal. I.e.:

printf("helloworld.c" ": %d: Hello " "world\n", 10); 

is equivalent (syntactically) to:

 printf("helloworld.c: %d: Hello world\n", 10); 

However, the standard does not say which part of the implementation should handle this: the preprocessor (cpp) or the compiler itself. Some online research tells me that this is generally expected to be done by the preprocessor (source #1, source #2, and there are more), which makes sense.

However, running cpp on Linux shows that cpp does not do it:

 eliben@eliben-desktop:~/test$ cat cpptest.c
 int a = 5; "string 1" "string 2" "string 3"
 eliben@eliben-desktop:~/test$ cpp cpptest.c
 # 1 "cpptest.c"
 # 1 "<built-in>"
 # 1 "<command-line>"
 # 1 "cpptest.c"
 int a = 5; "string 1" "string 2" "string 3"

So my question is: where should this language feature be handled, in the preprocessor or in the compiler itself?

There may not be a single good answer. Heuristic answers based on experience, well-known compilers, and general good engineering practice will be appreciated.


P.S. If you're wondering why I care about this: I'm trying to figure out whether my Python-based C parser should handle string literal concatenation (which it does not do at the moment), or leave it to cpp, which it assumes has run before it.

+7
c++ c c-preprocessor string-literals
5 answers

The standard does not distinguish between a preprocessor and a compiler; it only specifies the translation phases you have already noted. Traditionally, phases 1 through 4 were done in the preprocessor, phases 5 through 7 in the compiler, and phase 8 in the linker, but none of that is required by the standard.

+8

Unless the preprocessor is specified to handle this, it is safe to assume it is the compiler's job.

Edit:

The "i.e." link at the beginning of your post answers the question:

Adjacent string literals are concatenated at compile time; this allows long strings to be split over multiple lines, and also allows string literals resulting from C preprocessor defines and macros to be appended to strings at compile time ...

+3

In the ANSI C standard, this detail is described in section 5.1.1.2, paragraph (6):

5.1.1.2 Translation Phases
...

4. Preprocessing directives are executed and macro invocations are expanded. ...

5. Each source character set member and escape sequence in character constants and string literals is converted to a member of the execution character set.

6. Adjacent character string literal tokens are concatenated and adjacent wide string literal tokens are concatenated.

The standard does not specify that an implementation should use a preprocessor and compiler as such.

Phase 4 is clearly the preprocessor's responsibility.

Phase 5 requires that the "execution character set" be known. The compiler needs that information as well. It is easier to port the compiler to a new platform if the preprocessor contains no platform dependencies, so the tendency is to implement phase 5, and therefore phase 6, in the compiler.

+2

There are tricky rules for how string literal concatenation interacts with escape sequences. Suppose you have

 const char x1[] = "a\15" "4";
 const char y1[] = "a\154";
 const char x2[] = "a\r4";
 const char y2[] = "al";

then x1 and x2 must compare equal according to strcmp, and likewise y1 and y2. (This is what the answer quoting the translation phases is getting at: escape conversion happens before string constant concatenation.) There is also a requirement that if any of the string constants in a concatenation group has an L or U prefix, you get a wide or Unicode string. Put it all together, and it winds up being much more convenient to do this work as part of the "compiler" rather than the "preprocessor".

+1

I would handle it in the token-scanning part of the parser, and therefore in the compiler. It seems more logical. The preprocessor does not need to know the "structure" of the language, and in fact it usually ignores it, which is why macros can generate code that does not compile. It handles nothing more than what it is entitled to handle: the directives specifically addressed to it ( # ... ) and their "consequences" (for example, those of a #define x h, which makes the preprocessor change every x into h).

0
