Make encoding uniform before string comparison in PHP

I am working on a function that requires me to retrieve the contents of a web page and then check if certain text is present on this page. This is a backlink verification tool.

The problem is that the function works fine in most cases, but sometimes it reports the link as missing when the link is clearly there. I tracked this down to the point of visually comparing the strings in the output, and they match perfectly, but the PHP == operator tells me that they don't match.

Suspecting that this is some kind of encoding problem, I decided to see what happens if I run base64_encode() on them, to check whether the two strings (which look exactly the same) produce different results.

My suspicions were confirmed: running base64_encode() on the "matching" strings gave a different result for each. Problem found! The trouble is that I do not know how to solve it.

Is there a way to make these strings identical based on the output text (which matches), so that they also match when I compare them in PHP?

+3
6 answers

I'm not completely sold on the idea that this is an encoding problem. PHP stores all of its strings internally in the same format. Could you try this code? It compares the byte value (via ord()) of each character in both strings, which may reveal something you can't see by comparing the strings visually.

    $str1 = ...;
    $str2 = ...;

    if (strlen($str1) != strlen($str2)) {
        echo "Lengths are different!";
    } else {
        // Compare the strings byte by byte and report the first mismatch.
        for ($i = 0; $i < strlen($str1); $i++) {
            if (ord($str1[$i]) != ord($str2[$i])) {
                echo "Character $i is different! str1: " . ord($str1[$i]) . ", str2: " . ord($str2[$i]);
                break;
            }
        }
    }
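
When the lengths differ, the check above only reports that fact. A hex dump of both strings shows which bytes are extra. This is a minimal sketch with made-up sample data (the names $expected and $fetched are hypothetical): one string uses a regular space, the other a UTF-8 non-breaking space.

```php
<?php
// Hypothetical sample data: both render as "my link" in a browser.
$expected = "my link";           // regular space (0x20)
$fetched  = "my\xC2\xA0link";    // UTF-8 non-breaking space (0xC2 0xA0)

var_dump($expected == $fetched); // bool(false)

// Hex dumps make the differing bytes visible.
echo bin2hex($expected), "\n";   // 6d79206c696e6b
echo bin2hex($fetched), "\n";    // 6d79c2a06c696e6b
```

Hex tends to be easier to read than base64 here because every byte maps to exactly two characters.
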
+2

Without seeing your code, it's hard to say what is happening.

Try using trim() on the strings to remove surrounding whitespace that is invisible to the naked eye.

You may also get better results with strcmp().
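
A rough sketch of both suggestions together, using made-up values (the URL and variable names are assumptions for illustration only):

```php
<?php
// Hypothetical values: the fetched text carries an invisible trailing CRLF.
$expected = "http://example.com/";
$fetched  = "http://example.com/\r\n";

// By default trim() strips spaces, tabs, newlines, carriage returns,
// NUL bytes and vertical tabs from both ends of the string.
$same = strcmp(trim($expected), trim($fetched)) === 0;
var_dump($same);                  // bool(true)

// == compares numeric-looking strings as numbers; strcmp() always
// compares byte by byte.
var_dump("1e3" == "1000");        // bool(true)
var_dump(strcmp("1e3", "1000") !== 0); // bool(true)
```

Note that trim() does not remove a UTF-8 non-breaking space by default, so a mismatch can survive this step.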

+1

How about running the strings through a sanitizing filter (if you have PHP 5.2.0 or later)? I don't know that it will do anything, but it might.

http://www.phpro.org/tutorials/Filtering-Data-with-PHP.html#12
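
One possible reading of this suggestion, as a sketch only: strip invisible control characters with filter_var() before comparing. The input string is invented, and whether this helps depends on what the stray bytes actually are.

```php
<?php
// Hypothetical input containing invisible control characters (NUL and TAB).
$fetched = "back\x00link\t";

// FILTER_UNSAFE_RAW leaves the text as it is; FILTER_FLAG_STRIP_LOW removes
// every character with a byte value below 32.
$clean = filter_var($fetched, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW);

var_dump($clean === "backlink"); // bool(true)
```

Be aware that this flag also removes newlines and tabs inside the text, not just at the ends.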

0

Try mb_strstr() and trim(), as dcaunt suggested.
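
A small sketch of that combination. The page text and link text are invented, and both are assumed to be UTF-8 already; mb_strstr() cannot help if the two strings are in different encodings.

```php
<?php
// Hypothetical page text and link text, both assumed to be UTF-8.
$pageText = "  Visit Caf\xC3\xA9 Example for more.  ";
$needle   = "Caf\xC3\xA9 Example";

// mb_strstr() returns the part of the haystack starting at the first match,
// or false when the needle is not found.
$found = mb_strstr(trim($pageText), trim($needle), false, 'UTF-8') !== false;

var_dump($found); // bool(true)
```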

0

You can try using the DOM extension for PHP. When creating a new DOMDocument, you can specify the encoding of the underlying document or web page. According to this website, everything is handled internally in UTF-8. You could then find the DOM nodes you are interested in and compare their text content.

If the web pages you are dealing with do not declare a proper character encoding, I would suggest looking at the multibyte functions, in particular mb_detect_encoding and mb_convert_encoding.
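
The two ideas can be combined roughly as follows. This is a sketch under assumptions, not a definitive implementation: the sample HTML is invented, the candidate encoding list is a guess, and converting non-ASCII characters to numeric entities before loadHTML() is a swapped-in workaround so the parser does not have to guess the charset.

```php
<?php
// Hypothetical page body in ISO-8859-1 ("é" is the single byte 0xE9).
$html = "<html><body><a href=\"http://example.com/\">Caf\xE9</a></body></html>";

// Guess the source encoding from a short candidate list (a heuristic, not a
// guarantee), then normalise the page to UTF-8.
$from = mb_detect_encoding($html, ['UTF-8', 'ISO-8859-1'], true);
$utf8 = mb_convert_encoding($html, 'UTF-8', $from);

// Turn every non-ASCII character into a numeric entity so that loadHTML()
// does not depend on a charset declaration in the markup.
$safe = mb_encode_numericentity($utf8, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

$doc = new DOMDocument();
@$doc->loadHTML($safe); // @ silences warnings about sloppy markup

// Compare the text content of each link against the expected UTF-8 text.
$found = false;
foreach ($doc->getElementsByTagName('a') as $a) {
    if (trim($a->textContent) === "Caf\xC3\xA9") {
        $found = true;
    }
}
var_dump($found);
```

For a backlink checker, the same loop could compare $a->getAttribute('href') instead of the link text.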

0

If you cannot reliably determine the encoding, you can use mb_convert_encoding with 'auto' as the source encoding.

    $string1 = mb_convert_encoding($string1, 'utf-8', 'auto');
    $string2 = mb_convert_encoding($string2, 'utf-8', 'auto');

If you can determine the encoding (from the HTTP headers or the meta tag), you should specify it instead of using 'auto'.

    $string1 = mb_convert_encoding($string1, 'utf-8', $encoding1);
    $string2 = mb_convert_encoding($string2, 'utf-8', $encoding2);
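
One way to obtain $encoding1 from an HTTP header, as a hedged sketch: charsetFromContentType() is a hypothetical helper, not part of PHP, and the header value is made up.

```php
<?php
// Hypothetical helper: pull the charset out of a Content-Type header value,
// falling back to a default when none is declared.
function charsetFromContentType(string $header, string $default = 'UTF-8'): string
{
    if (preg_match('/charset\s*=\s*"?([\w.:-]+)/i', $header, $m)) {
        return $m[1];
    }
    return $default;
}

$encoding1 = charsetFromContentType('text/html; charset=ISO-8859-1');

// "Caf\xE9" is "Café" in ISO-8859-1; after conversion it is valid UTF-8.
$string1 = mb_convert_encoding("Caf\xE9", 'UTF-8', $encoding1);

var_dump($string1 === "Caf\xC3\xA9"); // bool(true)
```

In PHP 8, mb_convert_encoding() throws a ValueError for an encoding name it does not recognise, so a declared charset taken from an untrusted page may need validating first.
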
0
