Finding all strings in a PHP code base

I have a PHP code base of several million lines with no true separation of display and logic, and I'm trying to extract all the strings that appear in the code for localization purposes. Separating display and logic is a long-term goal, but for now I just want to be able to localize.

In the code, strings appear in every format PHP allows, so I need a theoretical (or practical) way to parse our entire source and at least LOCATE where each string lives. Ideally, of course, I would replace each string with a function call, so that, for example,

  "this is a string" 

will be replaced by

  _("this is a string")

Of course, I will need to support both the single-quote and double-quote formats. I'm not too worried about the other formats; they appear so rarely that I can change them manually.

In addition, I don't want to localize array indices, of course. So strings like

  $arr["value"]

should not become

  $arr[_("value")]

Can someone help me get started with this?

+4
3 answers

You can use token_get_all() to get all the tokens from a PHP file, for example:

    <?php
    $fileStr = file_get_contents('file.php');
    foreach (token_get_all($fileStr) as $token) {
        // single-character tokens such as ";" are plain strings, not arrays
        if (is_array($token) && $token[0] == T_CONSTANT_ENCAPSED_STRING) {
            echo "found string {$token[1]}\r\n";
            // $token[2] is the line number of the string
        }
    }
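
One thing worth checking before relying on T_CONSTANT_ENCAPSED_STRING alone: a double-quoted string that contains an interpolated variable is not emitted as a single token. Its literal pieces come back as T_ENCAPSED_AND_WHITESPACE tokens instead. The following is a minimal sketch that lists both kinds; listStringTokens() and the sample source are made up for illustration.

```php
<?php
// Hypothetical helper: returns [token name, text, line] for every string-like token.
function listStringTokens(string $code): array
{
    $found = [];
    foreach (token_get_all($code) as $token) {
        // Single-character tokens such as ";" or "[" are plain strings, not arrays.
        if (!is_array($token)) {
            continue;
        }
        [$id, $text, $line] = $token;
        if ($id === T_CONSTANT_ENCAPSED_STRING || $id === T_ENCAPSED_AND_WHITESPACE) {
            $found[] = [token_name($id), $text, $line];
        }
    }
    return $found;
}

// Made-up sample: one plain string and one string with interpolation.
$sample = <<<'PHP'
<?php
$a = 'plain';
$b = "Hello $name!";
PHP;

foreach (listStringTokens($sample) as [$name, $text, $line]) {
    echo "$name on line $line: $text\n";
}
```

Interpolated strings cannot simply be wrapped in _() piece by piece, so they are probably candidates for manual review rather than automatic replacement.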

You can do a really dirty check that it is not used as an array index with something like:

    $fileLines = file('file.php');

    // inside the loop and the if
    $line = $fileLines[$token[2] - 1];
    if (false === strpos($line, "[{$token[1]}]")) {
        // not an array index
    }

but you will really struggle to do it properly, because someone may have written something you don't expect, for example:

    $str = 'string that is not immediately an array index';
    doSomething($array[$str]);

Edit: As Ant P says, you would probably be better off looking for [ and ] in the surrounding tokens for the second part of this answer, rather than using my strpos hack. Something like this:

    $tokens = token_get_all(file_get_contents('file.php'));
    $num = count($tokens);
    for ($i = 0; $i < $num; $i++) {
        $token = $tokens[$i];
        if ($token[0] != T_CONSTANT_ENCAPSED_STRING) {
            // not a string, ignore
            continue;
        }
        if ($tokens[$i - 1] == '[' && $tokens[$i + 1] == ']') {
            // immediately used as an array index, ignore
            continue;
        }
        echo "found string {$token[1]}\r\n";
        // $token[2] is the line number of the string
    }
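
Since concatenating the text of every token reproduces the original source, the same loop can be extended to do the replacement the question asks for. The sketch below is only that, a sketch: wrapStrings() and neighbourText() are hypothetical names, and the neighbour check skips whitespace so that `$arr[ "value" ]` is also treated as an index. It would also skip a one-element short array literal such as `['a']`, and it would still wrap keys written as `'key' => ...`, so its output needs review.

```php
<?php
// Hypothetical helper: text of the nearest non-whitespace token before ($step = -1)
// or after ($step = 1) position $i, or null if there is none.
function neighbourText(array $tokens, int $i, int $step): ?string
{
    for ($j = $i + $step; isset($tokens[$j]); $j += $step) {
        $t = $tokens[$j];
        if (is_array($t) && $t[0] === T_WHITESPACE) {
            continue;
        }
        return is_array($t) ? $t[1] : $t;
    }
    return null;
}

// Hypothetical helper: wraps constant strings in _() unless they sit directly
// between "[" and "]".
function wrapStrings(string $code): string
{
    $tokens = token_get_all($code);
    $out = '';
    foreach ($tokens as $i => $token) {
        $text = is_array($token) ? $token[1] : $token;
        $isString = is_array($token) && $token[0] === T_CONSTANT_ENCAPSED_STRING;
        if ($isString
            && !(neighbourText($tokens, $i, -1) === '[' && neighbourText($tokens, $i, 1) === ']')
        ) {
            $text = "_($text)";
        }
        $out .= $text;
    }
    return $out;
}

echo wrapStrings('<?php echo "hi", $arr[ "value" ];'), "\n";
```

Running it twice over the same file would wrap the strings twice, so it is best applied to a clean checkout and the diff reviewed before committing.
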
+12

Besides associative array keys, there are other situations that may exist in the code base which an automatic search and replace would completely break.

SQL queries:

    $myname = "steve";
    $sql = "SELECT foo FROM bar WHERE name = " . $myname;

Variable variables (indirect references):

    $bar = "Hello, World"; // a string that needs localization
    $foo = "bar";          // a string that should not be localized
    echo($$foo);

String literals inside SQL:

 $sql = "SELECT CONCAT('Greetings, ', firstname) as greeting from users where id = ?"; 

There is no automatic way to filter out all of these cases. Perhaps the solution is to write an application that builds a “moderation” queue of candidate strings and displays each one highlighted and in the context of several lines of code. You can then glance at the code to decide whether the string needs localization, and press a single key to localize or ignore it.
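
A minimal sketch of the queue-building half of that idea, using token_get_all() to find candidates and attaching a few surrounding source lines to each. The function name buildReviewQueue(), the array keys and the sample source are all assumptions for illustration; the interactive "press one key" part is left out.

```php
<?php
// Hypothetical helper: one queue entry per constant string, with $context lines
// of source on either side for the reviewer.
function buildReviewQueue(string $code, int $context = 2): array
{
    $lines = preg_split('/\R/', $code);
    $queue = [];
    foreach (token_get_all($code) as $token) {
        if (!is_array($token) || $token[0] !== T_CONSTANT_ENCAPSED_STRING) {
            continue;
        }
        $line = $token[2];                    // 1-based line number of the string
        $start = max(1, $line - $context);    // first context line, clamped to the file start
        $queue[] = [
            'string'    => $token[1],
            'line'      => $line,
            'firstLine' => $start,
            // array_slice() stops at the end of the file by itself
            'context'   => array_slice($lines, $start - 1, $line + $context - $start + 1),
        ];
    }
    return $queue;
}

// Made-up sample based on the SQL case above.
$sample = <<<'PHP'
<?php
$myname = "steve";
$sql = "SELECT foo FROM bar WHERE name = " . $myname;
echo "Hello";
PHP;

foreach (buildReviewQueue($sample, 1) as $item) {
    echo "--- line {$item['line']}: {$item['string']}\n";
    echo implode("\n", $item['context']), "\n";
}
```

The reviewer's decisions could then be stored per file and line, and applied in a separate pass.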

+5

Instead of trying to solve this problem with an overly clever command line using perl or grep, you should write a program for this :)

Write a perl/python/ruby/whatever script that searches each file for a pair of single or double quotes. Each time it finds a match, it should ask whether to wrap it in your underscore function, and you can either tell it to do so or move on to the next one.

In an ideal world you would write something that does all of this for you, but in the end this way will take less time and you will run into fewer errors.

Pseudocode:

    for fname in yourBigFileList:
        create file handle for actual source file
        create temp file handle (like fname + ".tmp" or something)
        for fline in fname:
            get quoted strings
            for qstring in quoted_strings:
                show it in context, i.e. the entire line of code
                replace with _()?
                if Y, replace and write line to tmp file
                if N, just write that line to the tmp file
        close file handles
        rename it to current name + ".old"
        rename ".tmp" file to name of original file
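
For one file, the pseudocode above might look roughly like this in PHP. This is a sketch under assumptions: localizeFile() and its $decide callback are hypothetical names, and the regex-based quote matching is naive (it also matches quotes inside comments and knows nothing about heredocs). The demo uses a fixed rule in place of the Y/N prompt; an interactive version could read the answer with fgets(STDIN) inside the callback.

```php
<?php
// Hypothetical helper: rewrites one file line by line. $decide receives the quoted
// string and the full line, and returns true to wrap the string in _().
function localizeFile(string $path, callable $decide): void
{
    // Matches "..." or '...' while allowing backslash-escaped characters inside.
    $quoted = '/"(?:[^"\\\\]|\\\\.)*"|\'(?:[^\'\\\\]|\\\\.)*\'/';
    $tmp = $path . '.tmp';
    $out = fopen($tmp, 'w');
    foreach (file($path) as $line) {   // file() keeps the line endings
        $rewritten = preg_replace_callback(
            $quoted,
            fn (array $m): string => $decide($m[0], $line) ? "_({$m[0]})" : $m[0],
            $line
        );
        fwrite($out, $rewritten);
    }
    fclose($out);
    rename($path, $path . '.old');     // keep the original as a backup
    rename($tmp, $path);
}

// Demo on a throwaway file, accepting every string except "key".
$path = tempnam(sys_get_temp_dir(), 'l10n');
file_put_contents($path, "<?php\necho 'Hello';\n\$x = \$arr[\"key\"];\n");
localizeFile($path, fn (string $s, string $line): bool => $s !== '"key"');
echo file_get_contents($path);
```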

I'm sure there is a more *nix-fu way to do this, but this method lets you look at each instance yourself and decide. If it's a million lines, each of them contains a string, and each one takes you 1 second to evaluate, then you will need roughly 280 hours (1,000,000 s / 3,600 s per hour) to do it all... Maybe you should ignore this post :)

-3