Matching HTML attributes with regex in PHP

I am trying to write an expression that searches a page, for example how2bypass.co.cc, and returns the contents of the action attribute of the form tag, plus the contents of the name and type attributes of any input tags. I cannot use an HTML parser, because my ultimate goal is to detect automatically whether a given page is a web proxy. Once sites catch on to what I am doing, they will probably start doing stupid things like writing the entire document with JavaScript to stop me from parsing it.

I am using this code:

preg_match_all('/<form.*action\="(.*?)".*>[^<]*<input.*type\=/i', $pageContents, $inputMatches); 

which works great for the action attribute, but as soon as I put a " after type\= the code stops working. Why is this? Does it work fine once, but not twice?

+4
2 answers

Regular expression quantifiers are greedy by default.

If you check the page source, this part of the expression is probably matching from the first <input all the way to the last type= and swallowing everything in between:

 `<input.*type\=` 
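
To make the greedy behaviour concrete, here is a minimal sketch that runs the greedy and the lazy form of that fragment against the same string. The sample markup in $html is invented for illustration and is not taken from the page in the question.

```php
<?php
// Hypothetical sample markup, not taken from the page in the question.
$html = '<input name="a" type="text"><input name="b" type="hidden">';

// Greedy: .* runs to the end of the string, then backtracks to the LAST "type=".
preg_match('/<input.*type=/i', $html, $greedy);

// Lazy: .*? stops at the FIRST "type=" that follows "<input".
preg_match('/<input.*?type=/i', $html, $lazy);

echo $greedy[0], "\n"; // <input name="a" type="text"><input name="b" type=
echo $lazy[0], "\n";   // <input name="a" type=
```

The greedy match spans both tags, so anything placed after type= in the pattern is tested against the last input on the page rather than the first.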

You cannot capture the form and all of its inputs with your current expression, because not every input is directly preceded by a form tag. You need to approach it in one of the following ways:

  • Capture the whole form markup, <form>...</form>, and then run a second regex over the captured block to match all the inputs.
  • Make your current expression non-greedy with .*? and allow it to capture multiple input tags.
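
The first approach could be sketched roughly as follows. The sample page in $pageContents and the result layout in $forms are assumptions made for illustration, and the action pattern here only handles double-quoted values.

```php
<?php
// Hypothetical sample page, used only to demonstrate the two-step idea.
$pageContents = <<<'HTML'
<html><body>
<form method="post" action="/browse.php">
  <input name="u" type="text">
  <input name="go" type="submit">
</form>
<input name="outside" type="text">
</body></html>
HTML;

$forms = [];

// Step 1: capture each form's opening-tag attributes and its body.
// The s flag lets . match newlines; .*? keeps each match to a single form.
preg_match_all('/<form\b([^>]*)>(.*?)<\/form>/is', $pageContents, $formMatches, PREG_SET_ORDER);

foreach ($formMatches as $m) {
    // Step 2a: pull the action attribute out of the opening tag (double quotes assumed).
    $action = preg_match('/\baction="([^"]*)"/i', $m[1], $a) ? $a[1] : null;

    // Step 2b: collect every <input ...> tag found inside this form's body only.
    preg_match_all('/<input\b[^>]*>/i', $m[2], $inputTags);

    $forms[] = ['action' => $action, 'inputs' => $inputTags[0]];
}

print_r($forms);
```

Because the second regex only sees the captured form body, the input that sits outside the form is not reported.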
+1

Without seeing the actual page you want to extract from, only a few things can be guessed:

  • The type= attribute value may not be wrapped in double quotes, as in type=text. It may also use single quotes, or have spaces around the = sign.
  • The .* placeholders may fail if there are newlines between or within tags. Using the /s regex flag is recommended.
  • It is usually more reliable to use negated character classes such as [^<>]* or [^"]* instead of .* .
  • You do not need to escape the equals sign ( \= ).

It may also be better to split the task: use one regex to extract the <form>..</form> block, and then look for the <input> tags inside it.
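
As a rough illustration of the points above, the sketch below reads one attribute from a single tag while tolerating double quotes, single quotes, no quotes, and whitespace around the = sign. The helper name attrValue and the sample tags are hypothetical. It uses negated character classes instead of .*, so newlines inside the tag need no special flag.

```php
<?php
// Hypothetical helper: returns the value of one attribute from a single tag,
// or null when the attribute is absent.
function attrValue(string $tag, string $attr): ?string
{
    // (?| ... ) is a branch-reset group, so whichever alternative matches
    // ("double", 'single', or unquoted) always lands in capture group 1.
    $pattern = '/\s' . preg_quote($attr, '/')
             . '\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s"\'<>]+))/i';

    return preg_match($pattern, $tag, $m) ? $m[1] : null;
}

// Invented sample tags covering the quoting styles listed above.
$tags = [
    '<input type=text name="q">',
    "<input name='go' type = 'submit'>",
    "<input\n name=\"u\"\n type=\"hidden\">",
];

foreach ($tags as $tag) {
    printf("name=%s type=%s\n", attrValue($tag, 'name'), attrValue($tag, 'type'));
}
```

The leading \s in the pattern requires whitespace before the attribute name, which keeps a lookup for type from matching inside a longer name such as data-type.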

0
