I am using R for text mining; after tokenization, I end up with a character vector in which punctuation marks, apostrophes and hyphens are retained. For example, I have this original text
txt <- "this ain't a Hewlett-Packard box - it's an Apple box, a very nice one!"
After tokenization (which I do with scan_tokenizer from the tm package), I get the following character vector
> vec1
[1] "this" "ain't" "a" "Hewlett-Packard"
[5] "box" "-" "it's" "an"
[9] "Apple" "box," "a" "very"
[13] "nice" "one!"
Now, to get rid of the punctuation marks, I do the following
vec2 <- gsub("[^[:alnum:][:space:]']", "", vec1)
That is, I replace everything that is not an alphanumeric character, a space or an apostrophe with ""; however, this is the result
> vec2
[1] "this" "ain't" "a" "HewlettPackard" "box"
[6] "" "it's" "an" "Apple" "box"
[11] "a" "very" "nice" "one"
I want to keep hyphenated words such as "Hewlett-Packard" while getting rid of lone hyphens. Basically, I need a regular expression that excludes hyphens of the form \w-\w from the gsub expression for vec2.
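
To make the goal concrete, here is a minimal sketch of one pattern that might do this. It assumes PCRE lookarounds (perl = TRUE in gsub) and treats a hyphen as "inside a word" only when it has an alphanumeric character on both sides. The names pattern and vec3 are illustrative and not part of my original code.

```r
# Token vector as produced by the tokenizer above
vec1 <- c("this", "ain't", "a", "Hewlett-Packard", "box", "-", "it's",
          "an", "Apple", "box,", "a", "very", "nice", "one!")

# Three alternatives, all replaced by "":
#   (?<![[:alnum:]])-          a hyphen NOT preceded by an alphanumeric
#   -(?![[:alnum:]])           a hyphen NOT followed by an alphanumeric
#   [^[:alnum:][:space:]'-]    any other punctuation (apostrophes and hyphens excluded)
pattern <- "(?<![[:alnum:]])-|-(?![[:alnum:]])|[^[:alnum:][:space:]'-]"

vec2 <- gsub(pattern, "", vec1, perl = TRUE)

# The lone "-" becomes an empty string; drop empty tokens afterwards
vec3 <- vec2[nzchar(vec2)]
print(vec3)
```

With this pattern "Hewlett-Packard" should keep its hyphen, because that hyphen has a letter on each side, while the standalone "-" token is emptied and then dropped by the nzchar() filter. A leading or trailing hyphen (for example "box-") would also be stripped, which may or may not be what is wanted.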