Validate user-entered PHP code before passing it to eval()

Before passing a string to eval(), I would like to make sure that its syntax is correct and that it contains only:

  • Two functions: a() and b()
  • Four operators: / * - +
  • Brackets: ( )
  • Numbers: 1.2, -1, 1

How can I do this? Is this something the PHP Tokenizer can help with?

I'm actually trying to build a simple formula interpreter, so a() and b() will be replaced with ln() and exp(). I do not want to write a tokenizer and parser from scratch.

+7
5 answers

As for validation, the following tokens are valid:

    operator: [/*+-]
    funcs:    (a\(|b\()
    brackets: [()]
    numbers:  \d+(\.\d+)?
    space:    [ ]

A simple check can then test whether the input string consists of any combination of these patterns. Because the funcs token is fairly specific and does not collide with the other tokens, this check should be fairly robust without the need to implement any syntax or grammar:

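    $tokens = array(
        'operator' => '[/*+-]',
        'funcs'    => '(a\(|b\()',
        'brackets' => '[()]',
        'numbers'  => '\d+(\.\d+)?',
        'space'    => '[ ]',
    );

    // Wrap each token pattern in a non-capturing group and join them with "|".
    $pattern = '';
    foreach ($tokens as $token) {
        $pattern .= sprintf('|(?:%s)', $token);
    }

    // Anchor the alternation so the whole input must consist of these tokens.
    $pattern = sprintf('~^(%s)*$~', ltrim($pattern, '|'));
    echo $pattern;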

The check passes only if the entire input string matches the token-based pattern. The input can still be syntactically invalid PHP, but you can at least be sure that it is built only from the specified tokens:

 ~^((?:[/*+-])|(?:(a\(|b\())|(?:[()])|(?:\d+(\.\d+)?)|(?:[ ]))*$~ 

If you build the pattern dynamically, as in the example, it is easier to change the language's tokens later.
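To show how such a whitelist pattern might be applied, here is a minimal sketch that hands it to preg_match(). The helper names buildWhitelistPattern() and isWhitelisted() are made up for this illustration and are not part of the answer above; the token patterns are the ones listed there.

```php
<?php
// Hypothetical helper: builds the anchored whitelist pattern from the token list.
function buildWhitelistPattern(array $tokens): string
{
    $alternatives = [];
    foreach ($tokens as $token) {
        $alternatives[] = sprintf('(?:%s)', $token);
    }
    return sprintf('~^(%s)*$~', implode('|', $alternatives));
}

// Hypothetical helper: true only if the whole input is made of whitelisted tokens.
function isWhitelisted(string $input, string $pattern): bool
{
    return preg_match($pattern, $input) === 1;
}

$pattern = buildWhitelistPattern([
    'operator' => '[/*+-]',
    'funcs'    => '(a\(|b\()',
    'brackets' => '[()]',
    'numbers'  => '\d+(\.\d+)?',
    'space'    => '[ ]',
]);

var_dump(isWhitelisted('a(1.2) + b(-1) * 2', $pattern)); // bool(true)
var_dump(isWhitelisted('system("ls")', $pattern));       // bool(false)
var_dump(isWhitelisted('1 + * )', $pattern));            // bool(true): valid tokens, invalid syntax
```

The last call illustrates the limitation: the check only guarantees which tokens appear, not that they form a valid expression.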

In addition, this can be the first step towards your own tokenizer / lexer. The token stream can then be passed to a parser, which parses and interprets it. That is the part user187291 wrote about.

As an alternative to writing a full lexer plus parser, if all you need is to validate the syntax, you can also formulate your grammar in terms of tokens and then apply a regex-based token grammar to the token representation of the input.

Tokens are the words you use in your grammar. You will then need to describe the brackets and the function definition more precisely as tokens, and the tokenizer must follow clearer rules about which token takes precedence over another. The concept is outlined in another question of mine. It also uses regex to formulate the grammar and validate the syntax, but it still does not parse. In your case, eval would be the parser you are making use of.
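One way to picture a regex-based token grammar is the following rough sketch. It is an illustration under assumptions, not the code from the linked question: the single-letter token codes (F, N, O, M, L, R, S), the function names, and the grammar itself are all invented here. The input is first reduced to a string of token codes, and a recursive PCRE pattern then checks that string.

```php
<?php
// Hypothetical tokenizer: reduces the input to one letter per token,
// or returns null if a character outside the token set is found.
function toTokenString(string $input): ?string
{
    $tokens = [
        'F' => '[ab]\(',        // function name plus opening bracket
        'N' => '\d+(?:\.\d+)?', // number
        'O' => '[/*+]',         // binary-only operators
        'M' => '-',             // minus: binary operator or sign
        'L' => '\(',            // opening bracket
        'R' => '\)',            // closing bracket
        'S' => ' +',            // spaces (dropped from the output)
    ];
    $out = '';
    $pos = 0;
    $len = strlen($input);
    while ($pos < $len) {
        foreach ($tokens as $code => $regex) {
            // \G anchors the match at the current offset.
            if (preg_match('~\G(?:' . $regex . ')~', $input, $m, 0, $pos) === 1) {
                if ($code !== 'S') {
                    $out .= $code;
                }
                $pos += strlen($m[0]);
                continue 2;
            }
        }
        return null;
    }
    return $out;
}

// Hypothetical grammar over the token codes:
//   expr    := operand (operator operand)*
//   operand := optional sign, then a number or a bracketed / function-call expr
function isValidFormula(string $input): bool
{
    $grammar = '~^(?<expr>(?<operand>M?(?:N|[FL](?&expr)R))(?:[OM](?&operand))*)$~';
    $tokenString = toTokenString($input);
    return $tokenString !== null && preg_match($grammar, $tokenString) === 1;
}

var_dump(toTokenString('a(1) - 2'));            // string(5) "FNRMN"
var_dump(isValidFormula('a(1.2) + b(-1) * 2')); // bool(true)
var_dump(isValidFormula('1 + * )'));            // bool(false)
```

Unlike a plain token whitelist, this rejects input whose tokens are individually valid but badly ordered or unbalanced.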

+3

Parser generators have already been written for PHP. In particular, LIME ships with the classic “calculator” example, which would be an obvious starting point for your “mini-language”: http://sourceforge.net/projects/lime-php/

It has been years since I last played with LIME, but even then it was mature and stable.

Notes:

1) Using a parser generator lets you avoid PHP's eval() altogether if you want: you can have LIME emit a parser that effectively provides an "eval" function for expressions written in your mini-language, with validation baked in. It also lets you add support for new features as needed.

2) At first it might seem like overkill to use a parser generator for such an apparently small task, but once you have a few working examples you will be impressed by how easy they are to modify and extend. And it is very easy to underestimate the difficulty of writing a bug-free parser (even a “trivial” one) from scratch.

+2

Yes, you need a tokenizer or something similar, but that is only part of the story. A tokenizer (usually called a "lexer") can only read and recognise the individual elements of an expression; it cannot detect that something like "foo() + * bar)" is invalid. You will need a second part, called the parser, which organises the tokens into a tree (called an "AST") or reports an error when that fails. Ironically, once you have the tree, "eval" is no longer required: you can evaluate your expression directly from the tree.

I would recommend you write the parser manually, because it is a very useful exercise and a lot of fun. Recursive descent parsers are fairly easy to program.
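As a rough idea of how small a hand-written recursive descent parser can be, here is a sketch for the language in the question. Everything in it is an assumption made for illustration: the class name FormulaParser, the grammar in the comments, and the mapping of a() to PHP's log() (natural logarithm) and b() to exp(). It evaluates while parsing instead of building a tree first, so no eval() is involved.

```php
<?php
// Hypothetical sketch of a recursive descent evaluator; not production code.
final class FormulaParser
{
    private string $src = '';
    private int $pos = 0;

    public function evaluate(string $src): float
    {
        $this->src = $src;
        $this->pos = 0;
        $value = $this->expr();
        if ($this->peek() !== '') {
            throw new InvalidArgumentException("Unexpected '{$this->peek()}' at offset {$this->pos}");
        }
        return $value;
    }

    // expr := term (('+' | '-') term)*
    private function expr(): float
    {
        $value = $this->term();
        while (in_array($this->peek(), ['+', '-'], true)) {
            $op = $this->src[$this->pos++];
            $rhs = $this->term();
            $value = $op === '+' ? $value + $rhs : $value - $rhs;
        }
        return $value;
    }

    // term := factor (('*' | '/') factor)*
    private function term(): float
    {
        $value = $this->factor();
        while (in_array($this->peek(), ['*', '/'], true)) {
            $op = $this->src[$this->pos++];
            $rhs = $this->factor();
            $value = $op === '*' ? $value * $rhs : $value / $rhs;
        }
        return $value;
    }

    // factor := '-' factor | number | ('a' | 'b') '(' expr ')' | '(' expr ')'
    private function factor(): float
    {
        $c = $this->peek();
        if ($c === '-') {
            $this->pos++;
            return -$this->factor();
        }
        if ($c === 'a' || $c === 'b') {
            $this->pos++;
            $arg = $this->bracketed();
            return $c === 'a' ? log($arg) : exp($arg); // assumed mapping: a = ln, b = exp
        }
        if ($c === '(') {
            return $this->bracketed();
        }
        if (preg_match('~\G\d+(?:\.\d+)?~', $this->src, $m, 0, $this->pos) === 1) {
            $this->pos += strlen($m[0]);
            return (float) $m[0];
        }
        throw new InvalidArgumentException("Unexpected '$c' at offset {$this->pos}");
    }

    private function bracketed(): float
    {
        $this->expect('(');
        $value = $this->expr();
        $this->expect(')');
        return $value;
    }

    private function expect(string $char): void
    {
        if ($this->peek() !== $char) {
            throw new InvalidArgumentException("Expected '$char' at offset {$this->pos}");
        }
        $this->pos++;
    }

    // Skips spaces and returns the next character, or '' at end of input.
    private function peek(): string
    {
        while (($this->src[$this->pos] ?? '') === ' ') {
            $this->pos++;
        }
        return $this->src[$this->pos] ?? '';
    }
}

$parser = new FormulaParser();
var_dump($parser->evaluate('1 + 2 * 3')); // float(7)
```

Invalid input such as "foo() + * bar)" ends in an InvalidArgumentException rather than reaching any evaluation step, which is the validation the question asks for.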

0

You can use token_get_all(), check each token, and abort at the first invalid token.
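A minimal sketch of that idea follows. The function name firstInvalidToken() and the exact whitelist are assumptions for illustration. token_get_all() only lexes PHP source, so the input is prefixed with an opening tag; single-character tokens come back as plain strings and all others as arrays of [id, text, line].

```php
<?php
// Hypothetical helper: returns the text of the first disallowed token,
// or null if every token is on the whitelist.
function firstInvalidToken(string $input): ?string
{
    $allowedChars = ['+', '-', '*', '/', '(', ')'];
    $allowedIds   = [T_OPEN_TAG, T_WHITESPACE, T_LNUMBER, T_DNUMBER];
    $allowedFuncs = ['a', 'b'];

    foreach (token_get_all('<?php ' . $input) as $token) {
        if (is_string($token)) {
            // Single-character token such as "+" or "(".
            if (!in_array($token, $allowedChars, true)) {
                return $token;
            }
            continue;
        }
        [$id, $text] = $token;
        $ok = in_array($id, $allowedIds, true)
            || ($id === T_STRING && in_array($text, $allowedFuncs, true));
        if (!$ok) {
            return $text;
        }
    }
    return null;
}

var_dump(firstInvalidToken('a(1.2) + b(-1) * 2')); // NULL
var_dump(firstInvalidToken('system("ls")'));       // string(6) "system"
```

Like the regex whitelist, this only checks which tokens appear, not whether they form a valid expression. A side effect of using PHP's own lexer is that multi-character tokens such as variables, string literals, or comments are not on the whitelist and are therefore rejected.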

0

Hakre's answer using regular expressions is a good solution, but it is a bit complicated. Handling the whitelist of functions also gets rather messy, and if something goes wrong it can seriously affect your system.

Is there a reason you are not using JavaScript's eval?

0
