Regex to match hashtags in a sentence using Ruby

I am trying to extract hashtags for a simple college project using Ruby on Rails. I am having trouble with tags that contain only digits and with tags that are not separated by spaces.

text = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second" 

I have the regular expression /(?:^|\s)#(\w+)/i (source).

This regular expression returns ["box", "5", "2good", "first"].
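The behaviour described above can be reproduced with a short script. This is a minimal sketch, assuming the pattern is applied with String#scan and the nested result is flattened; the variable name `found` is only illustrative.

```ruby
# Sample text and pattern taken from the question.
text = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"

# scan returns one array per match because the pattern has a capture group,
# so flatten is used to get a plain list of tag names.
found = text.scan(/(?:^|\s)#(\w+)/i).flatten

p found  # prints ["box", "5", "2good", "first"]
```

The all-digit tag "5" and the unterminated tag "first" get through because the pattern only checks what comes before the hash, not what follows the word.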

How can I make it return ["box", "2good"] and ignore the rest, since they are not "real" hashtags?

+6
3 answers

Can you try this regex:

 /(?:^|\s)(?:(?:#\d+?)|(#\w+?))\s/i 

UPDATE 1:
There are several cases that the regular expression above does not match, such as #blah23blah and #23blah23. The regular expression below has been modified to handle all of these cases.

Regex:

 /(?:\s|^)(?:#(?!\d+(?:\s|$)))(\w+)(?=\s|$)/i 

Breakdown:

  • (?:\s|^) - Matches a preceding whitespace character or the start of the line, without capturing it.
  • # - Matches the hash character, without capturing it.
  • (?!\d+(?:\s|$)) - Negative lookahead that rejects tags made up only of digits between the # and the next whitespace (or the end of the line).
  • (\w+) - Matches and captures all the word characters.
  • (?=\s|$) - Positive lookahead that requires a following whitespace character or the end of the line. Because the lookahead does not consume that whitespace, adjacent valid hashtags separated by a single space still match.
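The breakdown above can be checked against the sample text from the question. This is a minimal sketch; the constant name `HASHTAG` is a hypothetical label introduced here, not part of the original answer.

```ruby
# Pattern from the answer above; HASHTAG is an illustrative name.
HASHTAG = /(?:\s|^)(?:#(?!\d+(?:\s|$)))(\w+)(?=\s|$)/i

text = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"

# scan yields one single-element array per match (one capture group),
# so flatten converts the result to a flat list of tag names.
tags = text.scan(HASHTAG).flatten

p tags  # prints ["box", "2good"]
```

Here "#5" is rejected by the negative lookahead, "#first" is rejected by the trailing lookahead because "#" follows it, and the tags glued to "liquor." and "link.com/liquor" fail the leading whitespace check.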

Example text modified to capture most cases:

#blah Pack my #box with #5 dozen #good2 #3good liquor.#jugs link.com/liquor#jugs #mkvef214asdwq sd #3e4 flsd #2good #first#second #3

Matches:

Match 1: blah
Match 2: box
Match 3: good2
Match 4: 3good
Match 5: mkvef214asdwq
Match 6: 3e4
Match 7: 2good

Rubular link

UPDATE 2:

To exclude words that begin or end with an underscore, add those cases to the negative lookahead:

 /(?:\s|^)(?:#(?!(?:\d+|\w+?_|_\w+?)(?:\s|$)))(\w+)(?=\s|$)/i 
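A quick way to see the effect of the extended lookahead is to run it on a few edge cases. This is a minimal sketch; the sample string and the constant name `HASHTAG_NO_EDGE_UNDERSCORE` are assumptions made up for illustration.

```ruby
# Pattern from UPDATE 2; the constant name is illustrative only.
HASHTAG_NO_EDGE_UNDERSCORE =
  /(?:\s|^)(?:#(?!(?:\d+|\w+?_|_\w+?)(?:\s|$)))(\w+)(?=\s|$)/i

# Hypothetical sample: leading underscore, trailing underscore,
# inner underscore, digits only, and a plain tag.
sample = "#_lead #trail_ #mid_dle #42 #ok"

kept = sample.scan(HASHTAG_NO_EDGE_UNDERSCORE).flatten

p kept  # prints ["mid_dle", "ok"]
```

An underscore inside the tag is still accepted; only tags whose first or last character is an underscore, or which consist solely of digits, are dropped.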

The pattern, test string, and matches are saved at the Rubular link.

+9

I would do it this way:

 text.scan(/ #[[:digit:]]?[[:alpha:]]+ /).map{ |s| s.strip[1..-1] } 

which returns:

 [ [0] "box", [1] "2good" ] 

I don't try to do everything in a regex. I prefer to keep patterns as simple as possible, then filter and clean up the results once I have captured the base data. My reasoning is that regexes become harder to maintain the more complex they get. I would rather spend my time doing something else than maintaining patterns.
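The same "simple pattern first, filter afterwards" idea can be taken one step further. This is a minimal sketch of a variant, not the code from the answer: it swaps scan for String#split plus Enumerable#grep, and the variable names are illustrative.

```ruby
text = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"

# Step 1: split on whitespace so each candidate is a whole token.
tokens = text.split

# Step 2: keep tokens that are exactly one '#' followed by word characters.
# Tokens such as "liquor.#jugs" or "#first#second" fail this simple check.
candidates = tokens.grep(/\A#\w+\z/)

# Step 3: drop the leading '#' and filter out tags that are only digits.
simple_tags = candidates.map { |t| t[1..-1] }.reject { |t| t =~ /\A\d+\z/ }

p simple_tags  # prints ["box", "2good"]
```

Each step uses a pattern that can be read at a glance, at the cost of a few more lines of Ruby.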

+2

Try the following:

 /\s#([[\d]]?[[a-z]]+\s)/i 

Output:

 1.9.3-p194 :010 > text = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"
  => "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"
 1.9.3-p194 :011 > puts text.scan /\s#([[\d]]?[[a-z]]+\s)/i
 box
 2good
  => nil

+1

Source: https://habr.com/ru/post/923615/

