Perl paragraph n-gram

Question

Say I have a text sentence:

$body = 'the quick brown fox jumps over the lazy dog';

and I want to get this sentence into a hash of "keywords", but I want to allow multi-word keywords. I have the following to get the single-word keywords:

$words{$_}++ for $body =~ m/(\w+)/g;

After this is completed, I have a hash that looks like this:

'the' => 2,
'quick' => 1,
'brown' => 1,
'fox' => 1,
'jumps' => 1,
'over' => 1,
'lazy' => 1,
'dog' => 1

The next step, to get 2-word keywords, is the following:

$words{$_}++ for $body =~ m/(\w+ \w+)/g;

But that only gets every other pair:

'the quick' => 1,
'brown fox' => 1,
'jumps over' => 1,
'the lazy' => 1

I also need the pairs that are offset by one word:

'quick brown' => 1,
'fox jumps' => 1,
'over the' => 1

Is there an easier way to do this than the following?

my $orig_body = $body;
# single word keywords
$words{$_}++ for $body =~ m/(\w+)/g;
# double word keywords
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body = $orig_body;
# triple word keywords
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body = $orig_body;
$body =~ s/^(\w+ \w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;

+5

perl n-gram

Glen solsberry Aug 18 '10 at 20:58

5 answers

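You can do this with a look-ahead, which captures the following word or words without consuming them, so the next match can start there: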

my $body = 'the quick brown fox jumps over the lazy dog';

my %words;

++$words{$1}         while $body =~ m/(\w+)/g;
++$words{"$1 $2"}    while $body =~ m/(\w+) \s+ (?= (\w+) )/gx;
++$words{"$1 $2 $3"} while $body =~ m/(\w+) \s+ (?= (\w+) \s+ (\w+) )/gx;

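Note that the whitespace has to be matched with \s+ (a literal space in the pattern is ignored under /x), and that the word captured inside the look-ahead is available as $2.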
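To see why the look-ahead yields overlapping pairs, it can help to print where each match leaves off. The following is a minimal sketch, not part of the answer; the helper name `trace_pairs` is made up for illustration. It records each two-word phrase together with the value of pos after the match.

```perl
use strict;
use warnings;

# trace_pairs is a hypothetical helper: it returns [phrase, pos] pairs,
# i.e. each two-word phrase and the offset where the next search resumes.
sub trace_pairs {
    my ($text) = @_;
    my @trace;
    while ($text =~ m/(\w+) \s+ (?= (\w+) )/gx) {
        # $2 was captured inside the look-ahead, so it is not consumed:
        # pos() points at the start of the second word, not past it.
        push @trace, [ "$1 $2", pos($text) ];
    }
    return @trace;
}

printf "%-12s next search starts at offset %d\n", @$_
    for trace_pairs('the quick brown');
```

For 'the quick brown' the first match resumes at offset 4, the start of 'quick', which is why 'quick brown' is found as well.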

+2

cjm Aug 18 '10 at 21:28

Another option, also using lookaheads.

For two-word phrases:

$words{$_}++ for $body =~ m/(?=(\w+ \w+))\w+/g;

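Because the lookahead is zero-width, the pair is captured without being consumed, and the trailing \w+ then advances the match by only 1 word.

This gives: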

%words: {
          'brown fox' => 1,
          'fox jumps' => 1,
          'jumps over' => 1,
          'lazy dog' => 1,
          'over the' => 1,
          'quick brown' => 1,
          'the lazy' => 1,
          'the quick' => 1
        }

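To generalize, put the count in a variable. Note that {$n} counts the words after the first one, so this pattern captures phrases of $n + 1 words: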

my $n    = 4;
$words{$_}++ for $body =~ m/(?=(\w+(?: \w+){$n}))\w+/g;
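As a rough sketch of how the counted lookahead could cover every phrase length in one pass, the pattern can be run in a loop. The sub name `count_phrases` and its parameters are hypothetical, and the code assumes words are separated by exactly one space, as the pattern above does.

```perl
use strict;
use warnings;

# count_phrases is a hypothetical wrapper around the lookahead pattern.
# It counts every phrase of 1 .. $max_words words in $text and returns
# a hash reference mapping phrase => count.
sub count_phrases {
    my ($text, $max_words) = @_;
    my %count;
    for my $extra (0 .. $max_words - 1) {   # $extra = words after the first
        $count{$_}++ for $text =~ m/(?=(\w+(?: \w+){$extra}))\w+/g;
    }
    return \%count;
}

my $counts = count_phrases('the quick brown fox jumps over the lazy dog', 3);
print "$_ => $counts->{$_}\n" for sort keys %$counts;
```

With a maximum of 3 words this produces the single words, the pairs, and the triples in the same hash.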

+2

Axeman Aug 18 '10 at 21:43

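The perlfunc documentation on pos:

pos SCALAR
Returns the offset of where the last m//g search left off for the variable in question ($_ is used when the variable is not specified).

And the perlvar documentation on @-:

@LAST_MATCH_START
@-
$-[0] is the offset of the start of the last successful match. $-[n] is the offset of the start of the substring matched by the n-th subpattern, or undef if the subpattern did not match.

Putting the two together, for example: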

#! /usr/bin/perl

use warnings;
use strict;

my $body = 'the quick brown fox jumps over the lazy dog';

my %words;
while ($body =~ /(\w+ (\w+))/g) {
  ++$words{$1};
  pos($body) = $-[2];
}

for (sort { index($body,$a) <=> index($body,$b) } keys %words) {
  print "'$_' => $words{$_}\n";
}

Output:

'the quick' => 1
'quick brown' => 1
'brown fox' => 1
'fox jumps' => 1
'jumps over' => 1
'over the' => 1
'the lazy' => 1
'lazy dog' => 1
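The same rewinding idea might be extended to phrases of any length. This is only a sketch under the assumption of single-space separators; `phrases_by_pos` is an illustrative name, not something from the answer. It matches n words, records the whole match, and then moves pos back to the start of the second word.

```perl
use strict;
use warnings;

# phrases_by_pos is a hypothetical helper: it returns all phrases of
# $n words ($n >= 2) by rewinding pos() after every match.
sub phrases_by_pos {
    my ($text, $n) = @_;
    my $rest = $n - 2;                      # words after the second one
    my @phrases;
    while ($text =~ /\w+ (\w+(?: \w+){$rest})/g) {
        # $-[0] and $+[0] delimit the whole match
        push @phrases, substr($text, $-[0], $+[0] - $-[0]);
        # group 1 starts at the second word, so resume the search there
        pos($text) = $-[1];
    }
    return @phrases;
}

print "$_\n" for phrases_by_pos('the quick brown fox', 3);
```

Since pos is always moved to a point after the start of the current match, the loop advances by one word per iteration and terminates when fewer than n words remain.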

+1

Greg Bacon Aug 18 '10 at 21:07

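Is there a particular reason to use regexes for this? A more direct approach is to split the text into an array of words and then use a pair of nested loops to build the phrases. Something like: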

#!/usr/bin/env perl

use strict;
use warnings;

my $text = 'the quick brown fox jumps over the lazy dog';
my $max_words = 3;

my @words = split / /, $text;
my %counts;

for my $pos (0 .. $#words) {
  for my $phrase_len (0 .. ($pos >= $max_words ? $max_words - 1 : $pos)) {
    my $phrase = join ' ', @words[($pos - $phrase_len) .. $pos];
    $counts{$phrase}++;
  }
} 

use Data::Dumper;
print Dumper(\%counts);

Output:

$VAR1 = {
          'over the lazy' => 1,
          'the' => 2,
          'over' => 1,
          'brown fox jumps' => 1,
          'brown fox' => 1,
          'the lazy dog' => 1,
          'jumps over' => 1,
          'the lazy' => 1,
          'the quick brown' => 1,
          'fox jumps' => 1,
          'over the' => 1,
          'brown' => 1,
          'fox jumps over' => 1,
          'quick brown' => 1,
          'jumps' => 1,
          'lazy' => 1,
          'jumps over the' => 1,
          'lazy dog' => 1,
          'dog' => 1,
          'quick brown fox' => 1,
          'fox' => 1,
          'the quick' => 1,
          'quick' => 1
        };

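Edit: Fixed the $phrase_len loop to prevent negative array indexes, which produced incorrect results, as pointed out by cjm.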
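The upper bound on $phrase_len in the loop above matters because Perl array slices accept negative indexes, which count from the end of the array. The sketch below uses two made-up helpers (`phrase_ending_at` and `max_phrase_len`) on a shorter sentence to show what an unbounded length would do at the first word.

```perl
use strict;
use warnings;

my @words = split ' ', 'the quick brown fox';

# Hypothetical helper: join the words from index ($pos - $len) through
# $pos, the same slice the inner loop of the answer builds.
sub phrase_ending_at {
    my ($pos, $len) = @_;
    return join ' ', @words[ ($pos - $len) .. $pos ];
}

# Hypothetical helper: the largest $phrase_len the answer's loop allows
# at a given position (the same ternary as in the answer).
sub max_phrase_len {
    my ($pos, $max_words) = @_;
    return $pos >= $max_words ? $max_words - 1 : $pos;
}

# Without the bound, position 0 with length 1 slices @words[-1 .. 0],
# which wraps around to the last word of the array.
print phrase_ending_at(0, 1), "\n";

# With the bound, position 0 only ever yields the single word.
print phrase_ending_at(0, max_phrase_len(0, 3)), "\n";
```

The first print shows a phrase that never occurs in the text, because index -1 refers to the last word; the bounded call returns just the first word.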

+1

Dave Sherohman Aug 18 '10 at 22:13

Grrrr · Accepted Answer · 2010-08-18T21:35:44+0000

Have you looked on CPAN for a module that builds n-grams? Text::Ngrams (as opposed to Text::Ngram) can do word-based n-gram analysis.
