Let's look at the requirements. You have text provided by the user that you want to display using hyperlinks.
- The http: // protocol prefix must be optional.
- Both domains and IP addresses must be accepted.
- Any valid top-level domain must be accepted, such as .aero and .xn - jxalpdlp.
- Port numbers must be allowed.
- URLs should be allowed in the normal context of the proposal. For example, in "Visit stackoverflow.com." The last period is not part of the URL.
- You probably want to allow the URL "https: //" as well as possibly others.
- As always, when displaying user-provided text in HTML, you want to prevent cross-site scripting (XSS). In addition, you need ampersands in the URLs to be properly escaped as & amp ..
- You may not need IPv6 address support.
- Change As noted in the comments, email support is certainly a plus.
- Edit : Only plain text input should be supported - HTML input tags should not be counted. (The Bitbucket version supports HTML input.)
Edit : Check out GitHub for the latest version with support for email addresses, authenticated URLs, quoted and bracketed URLs, HTML input, and an updated TLD list.
Here is my take:
<?php $text = <<<EOD Here are some URLs: stackoverflow.com/questions/1188129/pregreplace-to-detect-html-php Here the answer: http://www.google.com/search?rls=en&q=42&ie=utf-8&oe=utf-8&hl=en. What was the question? A quick look at http://en.wikipedia.org/wiki/URI_scheme#Generic_syntax is helpful. There is no place like 127.0.0.1! Except maybe http://news.bbc.co.uk/1/hi/england/surrey/8168892.stm? Ports: 192.168.0.1:8080, https://example.net:1234/. Beware of Greeks bringing internationalized top-level domains: xn--hxajbheg2az3al.xn--jxalpdlp. And remember.Nobody is perfect. <script>alert('Remember kids: Say no to XSS-attacks! Always HTML escape untrusted input!');</script> EOD; $rexProtocol = '(https?://)?'; $rexDomain = '((?:[-a-zA-Z0-9]{1,63}\.)+[-a-zA-Z0-9]{2,63}|(?:[0-9]{1,3}\.){3}[0-9]{1,3})'; $rexPort = '(:[0-9]{1,5})?'; $rexPath = '(/[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]*?)?'; $rexQuery = '(\?[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]+?)?'; $rexFragment = '(#[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]+?)?'; // Solution 1: function callback($match) { // Prepend http:// if no protocol specified $completeUrl = $match[1] ? $match[0] : "http://{$match[0]}"; return '<a href="' . $completeUrl . '">' . $match[2] . $match[3] . $match[4] . '</a>'; } print "<pre>"; print preg_replace_callback("&\\b$rexProtocol$rexDomain$rexPort$rexPath$rexQuery$rexFragment(?=[?.!,;:\"]?(\s|$))&", 'callback', htmlspecialchars($text)); print "</pre>";
- To avoid & lt; and & characters, I throw all the text through htmlspecialchars before processing. This is not ideal, since html escaping can lead to incorrect URL borders.
- As "And remember. Nobody is perfect." in a line (which I didn’t remember. Nobody considers it as a URL due to lack of space), further verification of valid top-level domains may be useful.
Edit : The following code fixes the two above problems, but more verbose, as I am more or less preg_replace_callback using preg_match .
// Solution 2: $validTlds = array_fill_keys(explode(" ", ".aero .asia .biz .cat .com .coop .edu .gov .info .int .jobs .mil .mobi .museum .name .net .org .pro .tel .travel .ac .ad .ae .af .ag .ai .al .am .an .ao .aq .ar .as .at .au .aw .ax .az .ba .bb .bd .be .bf .bg .bh .bi .bj .bm .bn .bo .br .bs .bt .bv .bw .by .bz .ca .cc .cd .cf .cg .ch .ci .ck .cl .cm .cn .co .cr .cu .cv .cx .cy .cz .de .dj .dk .dm .do .dz .ec .ee .eg .er .es .et .eu .fi .fj .fk .fm .fo .fr .ga .gb .gd .ge .gf .gg .gh .gi .gl .gm .gn .gp .gq .gr .gs .gt .gu .gw .gy .hk .hm .hn .hr .ht .hu .id .ie .il .im .in .io .iq .ir .is .it .je .jm .jo .jp .ke .kg .kh .ki .km .kn .kp .kr .kw .ky .kz .la .lb .lc .li .lk .lr .ls .lt .lu .lv .ly .ma .mc .md .me .mg .mh .mk .ml .mm .mn .mo .mp .mq .mr .ms .mt .mu .mv .mw .mx .my .mz .na .nc .ne .nf .ng .ni .nl .no .np .nr .nu .nz .om .pa .pe .pf .pg .ph .pk .pl .pm .pn .pr .ps .pt .pw .py .qa .re .ro .rs .ru .rw .sa .sb .sc .sd .se .sg .sh .si .sj .sk .sl .sm .sn .so .sr .st .su .sv .sy .sz .tc .td .tf .tg .th .tj .tk .tl .tm .tn .to .tp .tr .tt .tv .tw .tz .ua .ug .uk .us .uy .uz .va .vc .ve .vg .vi .vn .vu .wf .ws .ye .yt .yu .za .zm .zw .xn--0zwm56d .xn--11b5bs3a9aj6g .xn--80akhbyknj4f .xn--9t4b11yi5a .xn--deba0ad .xn--g6w251d .xn--hgbk6aj7f53bba .xn--hlcj6aya9esc7a .xn--jxalpdlp .xn--kgbechtv .xn--zckzah .arpa"), true); $position = 0; while (preg_match("{\\b$rexProtocol$rexDomain$rexPort$rexPath$rexQuery$rexFragment(?=[?.!,;:\"]?(\s|$))}", $text, &$match, PREG_OFFSET_CAPTURE, $position)) { list($url, $urlPosition) = $match[0]; // Print the text leading up to the URL. print(htmlspecialchars(substr($text, $position, $urlPosition - $position))); $domain = $match[2][0]; $port = $match[3][0]; $path = $match[4][0]; // Check if the TLD is valid - or that $domain is an IP address. $tld = strtolower(strrchr($domain, '.')); if (preg_match('{\.[0-9]{1,3}}', $tld) || isset($validTlds[$tld])) { // Prepend http:// if no protocol specified $completeUrl = $match[1][0] ? $url : "http://$url"; // Print the hyperlink. printf('<a href="%s">%s</a>', htmlspecialchars($completeUrl), htmlspecialchars("$domain$port$path")); } else { // Not a valid URL. print(htmlspecialchars($url)); } // Continue text parsing from after the URL. $position = $urlPosition + strlen($url); } // Print the remainder of the text. print(htmlspecialchars(substr($text, $position)));
Søren Løvborg Jul 27 '09 at 14:55 2009-07-27 14:55
source share