How to write a simple Ragel tokenizer (no backtracking)?

UPDATE 2

The original question: can I avoid using Ragel's |**| if I don't need backtracking?

Updated answer: Yes, you can write a simple tokenizer with ()* if you don't need backtracking.

UPDATE 1

I realized that the question of tokenizing XML is a red herring, because what I am doing is not specific to XML.

END UPDATES

I have a Ragel scanner / tokenizer that just searches for FooBarEntity elements in files like:

    <ABC>
      <XYZ>
        <FooBarEntity>
          <Example>Hello world</Example>
        </FooBarEntity>
      </XYZ>
      <XYZ>
        <FooBarEntity>
          <Example>sdrastvui</Example>
        </FooBarEntity>
      </XYZ>
    </ABC>

Scanner Version:

    %%{
      machine simple_scanner;

      action Emit {
        emit data[(ts+14)..(te-15)].pack('c*')
      }

      foo = '<FooBarEntity>' any+ :>> '</FooBarEntity>';

      main := |*
        foo => Emit;
        any;
      *|;
    }%%

Version without a scanner (i.e. uses ()* instead of |**| )

    %%{
      machine simple_tokenizer;

      action MyTs {
        my_ts = p
      }
      action MyTe {
        my_te = p
      }
      action Emit {
        emit data[my_ts...my_te].pack('c*')
        my_ts = nil
        my_te = nil
      }

      foo = '<FooBarEntity>' any+ >MyTs :>> '</FooBarEntity>' >MyTe %Emit;
      main := ( foo | any+ )*;
    }%%
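One way to sanity-check either machine is to compare what it emits against a plain-Ruby reference. This is only a rough sketch, assuming the goal is the text between each `<FooBarEntity>` / `</FooBarEntity>` pair. The names `ENTITY` and `expected_tokens` are made up for illustration and are not part of the linked repository.

```ruby
# Hypothetical test oracle, not part of the Ragel machines above.
# .+? requires at least one character and stops at the first closing tag,
# which is the behaviour `any+ :>> '</FooBarEntity>'` is meant to have.
ENTITY = %r{<FooBarEntity>(.+?)</FooBarEntity>}m

# Returns the text found between each pair of FooBarEntity tags.
def expected_tokens(text)
  text.scan(ENTITY).flatten
end

sample = '<ABC><FooBarEntity>Hello world</FooBarEntity>' \
         '<FooBarEntity>sdrastvui</FooBarEntity></ABC>'
p expected_tokens(sample)
```

The regex engine backtracks internally, so this is useful as a reference for tests rather than as a replacement for the tokenizer.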

I implemented this and wrote tests for it at https://github.com/seamusabshere/ruby_ragel_examples

You can see the reading/buffering code in https://github.com/seamusabshere/ruby_ragel_examples/blob/master/lib/simple_scanner.rl and https://github.com/seamusabshere/ruby_ragel_examples/blob/master/lib/simple_tokenizer.rl

2 answers

You do not need to use a scanner to parse XML. I implemented a simple XML parser in Ragel without a scanner. Here is a blog post with some timings and more information.

Edit: You can do this in many ways. You can use a scanner. Or you can parse word by word: when you see STARTANIMAL, start collecting words until you see STOPANIMAL.
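The word-collecting idea can be sketched in plain Ruby (no Ragel), assuming whitespace-separated words. The method name `collect_between_markers` and its keyword arguments are hypothetical, chosen just for this illustration.

```ruby
# Hypothetical helper: gathers the words that appear between a start marker
# and a stop marker, in a single left-to-right pass over the words.
def collect_between_markers(text, start_marker: 'STARTANIMAL', stop_marker: 'STOPANIMAL')
  groups = []
  current = nil # nil while we are outside a marked region
  text.split.each do |word|
    if current.nil?
      current = [] if word == start_marker
    elsif word == stop_marker
      groups << current
      current = nil
    else
      current << word
    end
  end
  groups
end

p collect_between_markers('x STARTANIMAL cat dog STOPANIMAL y STARTANIMAL emu STOPANIMAL')
# prints [["cat", "dog"], ["emu"]]
```

Each word is looked at exactly once, so nothing here needs backtracking. A region that is opened but never closed is silently dropped in this sketch.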


To paraphrase Ockham: you do not need a scanner if you do not need backtracking. Without a scanner, you can process one character at a time, perhaps reading from a stream with no buffer.
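To illustrate the one-character-at-a-time idea, here is a hand-written Ruby sketch that reads bytes from an IO and never revisits input. It is not generated by Ragel and is not the code from the question. `TagMatcher` and `each_entity` are hypothetical names, and the only thing kept in memory is the content of the entity currently being read.

```ruby
require 'stringio'

# Hypothetical helper: recognizes one fixed tag in a byte stream using a
# KMP-style fallback table, so no input byte is ever read twice.
class TagMatcher
  def initialize(tag)
    @tag = tag.bytes
    @fallback = Array.new(@tag.length, 0)
    k = 0
    (1...@tag.length).each do |i|
      k = @fallback[k - 1] while k > 0 && @tag[i] != @tag[k]
      k += 1 if @tag[i] == @tag[k]
      @fallback[i] = k
    end
    @matched = 0
  end

  # Feed one byte; returns true when that byte completes the tag.
  def feed(byte)
    @matched = @fallback[@matched - 1] while @matched > 0 && @tag[@matched] != byte
    @matched += 1 if @tag[@matched] == byte
    return false if @matched < @tag.length
    @matched = 0
    true
  end
end

OPEN_TAG  = '<FooBarEntity>'
CLOSE_TAG = '</FooBarEntity>'

# Yields the (binary-encoded) text between each pair of tags.
def each_entity(io)
  open_tag = TagMatcher.new(OPEN_TAG)
  close_tag = TagMatcher.new(CLOSE_TAG)
  content = nil # nil while outside an entity
  io.each_byte do |byte|
    if content.nil?
      content = String.new if open_tag.feed(byte) # empty binary buffer
      next
    end
    content << byte
    next if content.bytesize == 1 # like any+, the first byte is always content
    next unless close_tag.feed(byte)
    # The closing tag was appended byte by byte, so trim it off the end.
    yield content.byteslice(0, content.bytesize - CLOSE_TAG.bytesize)
    content = nil
  end
end

xml = '<ABC><FooBarEntity>Hello world</FooBarEntity>' \
      '<FooBarEntity>sdrastvui</FooBarEntity></ABC>'
each_entity(StringIO.new(xml)) { |text| puts text }
```

Any object that responds to `each_byte` (a File, a socket, a StringIO) should work as the input here.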
