Extract all urls inside a string in Ruby

I have some text content that contains a list of URLs.

I am trying to extract all the URLs and put them in an array.

I have this code:

content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html"

urls = content.scan(/^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$/ix)

I am trying to get the final results:

['http://www.google.com', 'http://www.google.com/index.html']

The above code does not work correctly. Does anyone know what I'm doing wrong?

Thanks

+5
4 answers

Another approach, from the "perfect is the enemy of good" school of thought:

urls = content.split(/\s+/).find_all { |u| u =~ /^https?:/ }
+5

Easy:

ruby-1.9.2-p136 :006 > require 'uri'
ruby-1.9.2-p136 :006 > URI.extract(content, ['http', 'https'])
  => ["http://www.google.com", "http://www.google.com/index.html"] 
+42

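The problem is that String.scan, when the regular expression contains capturing groups, returns an array of the group captures for each match rather than the whole matched string. So the results are shaped like this: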

[['http', '.google.com'], ...]

Use non-capturing groups, /(?:stuff)/, and scan returns the whole matches instead.
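To make the difference concrete, here is a minimal sketch of how scan behaves with and without capturing groups. The sample string and the simplified patterns are made up for illustration and are not the pattern from the question:

```ruby
# Hypothetical sample text, only for demonstrating scan's return shape.
text = "see http://www.google.com and https://example.org"

# With capturing groups, scan returns one array of captures per match.
with_groups = text.scan(/(https?):\/\/([a-z0-9.]+)/i)
# => [["http", "www.google.com"], ["https", "example.org"]]

# With non-capturing groups (?:...), scan returns the full matched strings.
without_groups = text.scan(/(?:https?):\/\/(?:[a-z0-9.]+)/i)
# => ["http://www.google.com", "https://example.org"]
```

The only change between the two patterns is `(` becoming `(?:`, which is why the same fix applies to the longer regex in the question.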

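(Edit): There are other problems as well. The pattern is anchored (^ and $), so it cannot match URLs in the middle of content. Also, ([0-9]{1,5})? looks like it is meant to match a port number, but it lacks the leading colon, so a URL with a port will not match.

Putting it together, something like this should work: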

content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html http://example.com:3000/foo"
urls = content.scan(/(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/ix)
# => ["http://www.google.com", "http://www.google.com/index.html", "http://example.com:3000/foo"]

Note, however, that URLs whose host is an IP address (for example, http://127.0.0.1) will still not match, because the pattern requires a [a-z]{2,5} TLD.
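A quick way to see that limitation is the following sketch. It reuses the pattern from above unchanged; the sample string and variable names are invented for the demonstration:

```ruby
# Same pattern as in the answer above, stored in a variable for reuse.
url_pattern = /(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/ix

# Hypothetical input containing one IP-based URL and one named host.
sample = "local http://127.0.0.1/status and remote http://example.com/page"

found = sample.scan(url_pattern)
# => ["http://example.com/page"]
# The IP-based URL is skipped: no dotted segment of 127.0.0.1 is made of
# letters, so the \.[a-z]{2,5} part of the pattern can never match.
```

If IP-based URLs matter for your input, URI.extract from the other answer may be the simpler choice.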

+5


Ruby has a URI module that has a regular expression for doing these things:

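require "uri"

# Schemes to look for (the duplicate 'ftp' entry has been dropped)
uris_you_want_to_grab = ['ftp', 'http', 'https', 'mailto', 'see']

urls = []  # collects every matched URI

html_string.scan(URI.regexp(uris_you_want_to_grab)) do |*matches|
  urls << $&  # $& holds the full text of the last match
end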

For more information, visit Ruby Ref: URI

+4
