I have a large amount of text and want to find the patterns that occur most often. I thought about solving this with the N-gram approach, which has in fact been proposed as a solution to a similar problem, but my requirement is slightly different. To clarify, I have text like this:
I wake up every day morning and read the newspaper and then go to work
I wake up every day morning and eat my breakfast and then go to work
I am not sure that this is the solution but I will try
I am not sure that this is the answer but I will try
I am not feeling well today but I will get the work done and deliver it tomorrow
I was not feeling well yesterday but I will get the work done and let you know by tomorrow
and I am trying to extract the "patterns" as follows:
I wake up every day morning and ... and then go to work
I am not sure that this is the ... but I will try
I ... not feeling well ... but I will get the work done and ... tomorrow
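To make the expected output concrete, here is a minimal sketch of how one such pattern could be derived from a single pair of lines. It assumes whitespace tokenization and word-level alignment with Python's `difflib.SequenceMatcher`. The function name `common_pattern` and the `gap` parameter are made up for illustration. It only shows the target for two lines that are already known to be similar; it does not group lines or scale to a million of them.

```python
from difflib import SequenceMatcher


def common_pattern(line_a, line_b, gap="..."):
    """Return the words shared by both lines, in order, with `gap` where they differ."""
    a, b = line_a.split(), line_b.split()
    # autojunk=False keeps frequent words such as "and" eligible for matching
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    parts = []
    end_a = end_b = 0  # positions just past the previous shared run
    for block in matcher.get_matching_blocks():
        # Words skipped in either line since the previous run become one gap
        if block.a > end_a or block.b > end_b:
            parts.append(gap)
        parts.extend(a[block.a:block.a + block.size])
        end_a, end_b = block.a + block.size, block.b + block.size
    return " ".join(parts)


print(common_pattern(
    "I am not feeling well today but I will get the work done and deliver it tomorrow",
    "I was not feeling well yesterday but I will get the work done and let you know by tomorrow",
))
# I ... not feeling well ... but I will get the work done and ... tomorrow
```

The hard part of the question is everything this sketch leaves out: deciding which lines belong together and doing so without comparing every pair.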
I'm looking for an approach that can scale up to a million lines of text. Can I adapt the same N-gram approach to solve this problem, or are there any alternatives?
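For reference, the plain N-gram counting I would be adapting looks roughly like the sketch below. It assumes whitespace tokenization and word trigrams; the name `ngram_counts` is just illustrative. It counts contiguous runs of words only, so on its own it cannot express the "..." gaps in the patterns above.

```python
from collections import Counter


def ngram_counts(lines, n=3):
    """Count every contiguous run of n words across all lines."""
    counts = Counter()
    for line in lines:
        words = line.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


lines = [
    "I wake up every day morning and read the newspaper and then go to work",
    "I wake up every day morning and eat my breakfast and then go to work",
    "I am not sure that this is the solution but I will try",
    "I am not sure that this is the answer but I will try",
    "I am not feeling well today but I will get the work done and deliver it tomorrow",
    "I was not feeling well yesterday but I will get the work done and let you know by tomorrow",
]
counts = ngram_counts(lines, n=3)
print(counts[("but", "I", "will")])  # 4
```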