PHP internal string representation

Question


I am writing a simple site parser in PHP 5.2.10.
When using the default internal encoding (which is ISO-8859-1), I always get an error with the same function call:

$start = mb_strpos($index, '<a name=gr1>');

Fatal error: Allowed memory size of 50331648 bytes exhausted (tried to allocate 11924760 bytes)

The length of the $index string in this case was 2981190 bytes, exactly one quarter of what PHP was trying to allocate (2981190 × 4 = 11924760).

Now if I use

mb_internal_encoding('UTF-8');

the error disappears. Does this mean that PHP uses more memory for single-byte strings than for multi-byte ones? How is this possible? Any ideas?

UPD: memory usage does not seem to depend on the encoding: the average memory_get_usage() value is almost the same with UTF-8 and ISO-8859-1. I think the problem may be in mb_strpos. The string $index is actually encoded in Windows-1251 (Cyrillic), so it contains byte sequences that are not valid UTF-8. That may cause mb_strpos to attempt a conversion, or to use additional memory for some other purpose. I will look for the answer in the mb_strpos sources.
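The suspicion that Windows-1251 text is not valid UTF-8 can be checked directly. The following is a minimal sketch, assuming the mbstring extension is loaded; the helper name is_valid_utf8() and the sample bytes are illustrative and do not come from the original page.

```php
<?php
// Hypothetical helper: reports whether $bytes is well-formed UTF-8.
function is_valid_utf8($bytes)
{
    return mb_check_encoding($bytes, 'UTF-8');
}

// The word "Привет" encoded as Windows-1251: six single-byte characters.
$cp1251 = "\xCF\xF0\xE8\xE2\xE5\xF2";

// Convert once, naming the source encoding explicitly.
$utf8 = mb_convert_encoding($cp1251, 'UTF-8', 'Windows-1251');

var_dump(is_valid_utf8($cp1251)); // raw Windows-1251 bytes: bool(false)
var_dump(is_valid_utf8($utf8));   // converted copy: bool(true)
```

The raw bytes fail the check because 0xCF opens a two-byte UTF-8 sequence and 0xF0 is not a valid continuation byte.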


string php memory

Dmitry Aug 25 '12 at 20:19


1 answer

Adamjonr · Accepted Answer · 2012-08-29T05:41:45+0000

Sorry if you have already considered these potential issues.

Multibyte string functions check UTF-8 input for errors and, if there are invalid characters, return an empty string or false (as mb_strpos() does): http://www.serverphorums.com/read.php?7,552099

Are you checking the result with the === operator to make sure that you are not getting false instead of 0?
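To illustrate why the strict comparison matters, here is a small sketch, assuming the mbstring extension is loaded. The helper describe_match() is a hypothetical name introduced only for this example.

```php
<?php
// Hypothetical helper: describes the result of mb_strpos() without
// confusing "found at offset 0" with "returned false".
function describe_match($haystack, $needle)
{
    $pos = mb_strpos($haystack, $needle, 0, 'UTF-8');
    if ($pos === false) {
        return 'no match (or the call failed)';
    }
    return 'match at offset ' . $pos;
}

echo describe_match('<a name=gr1>rest of page', '<a name=gr1>'), "\n";
echo describe_match('no anchor here', '<a name=gr1>'), "\n";

// A loose comparison cannot tell the two cases apart:
var_dump(0 == false);  // bool(true)
var_dump(0 === false); // bool(false)
```

With ==, a match at the very start of the page (offset 0) and a false return look identical, which is the confusion the question above is probing for.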

The mb_strpos() function uses mbfl_strpos(), which makes copies of the strings (needle, haystack) when it has to perform conversions (which increases memory use, as you noticed): https://github.com/php/php-src/blob/master/ext/mbstring/libmbfl/mbfl/mbfilter.c#L811

So I am wondering whether, with the default internal encoding (ISO-8859-1), the conversion was allowed to proceed and hit the memory limit, whereas with UTF-8 the call short-circuited because of illegal characters and returned false (which, if you were testing with ==, could make it look as if the function simply found no match).
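One way to take the guessing out of the encoding question is to name the haystack's real encoding in the call instead of relying on mb_internal_encoding(). The sketch below assumes mbstring is loaded and that the page really is Windows-1251; find_in_cp1251() and the sample string are illustrative. Whether this avoids the extra allocation in a particular PHP build is not something this thread establishes.

```php
<?php
// Hypothetical helper: locate a needle in Windows-1251 text.
function find_in_cp1251($haystack, $needle)
{
    // Pass the actual encoding rather than the internal default.
    return mb_strpos($haystack, $needle, 0, 'Windows-1251');
}

// "Привет" in Windows-1251, followed by the anchor from the question.
$index = "\xCF\xF0\xE8\xE2\xE5\xF2<a name=gr1>";

var_dump(find_in_cp1251($index, '<a name=gr1>')); // int(6)

// Windows-1251 is a single-byte encoding, so byte offsets and character
// offsets coincide and the byte-oriented strpos() gives the same answer.
var_dump(strpos($index, '<a name=gr1>'));         // int(6)
```

Because every Windows-1251 character occupies one byte, plain strpos() is a reasonable alternative here when only the offset of an ASCII needle is needed.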

It's worth a shot :)
