Reliably removing duplicates is considerably harder than just sorting the file. As another answer points out, there is no guaranteed way to detect duplicates precisely without keeping a full copy of every line in memory, which seems to be exactly what you are trying to avoid.
You could keep an in-memory or on-disk hash index and use it to pull the actual lines back out of the file for comparison, but that essentially duplicates what a database would do for you.
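For illustration, here is a minimal sketch of that hash-index idea in Python (the function name `dedupe_lines` and the choice of hash are arbitrary for this example, not anything from the question): it keeps only a hash and a byte offset per line in memory, and seeks back into the input file to compare the real text whenever two hashes match.

```python
import hashlib

def dedupe_lines(in_path, out_path):
    """Copy in_path to out_path, dropping duplicate lines, while keeping
    only a hash and a byte offset per kept line in memory."""
    seen = {}  # line hash -> list of byte offsets of lines already kept
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        while True:
            offset = src.tell()
            line = src.readline()
            if not line:
                break
            digest = hashlib.blake2b(line, digest_size=16).digest()
            duplicate = False
            for prev_offset in seen.get(digest, []):
                # Hashes match: re-read the earlier line from disk to rule
                # out a hash collision before declaring a duplicate.
                src.seek(prev_offset)
                if src.readline() == line:
                    duplicate = True
                # Either way, return to where we left off.
                src.seek(offset + len(line))
                if duplicate:
                    break
            if not duplicate:
                seen.setdefault(digest, []).append(offset)
                dst.write(line)
```

This keeps memory roughly proportional to the number of distinct lines (hash plus offset each), at the cost of extra seeks on hash matches, which is the bookkeeping a database would otherwise handle for you.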
An alternative is to post-process the file once it has been completely written. The UNIX sort command handles large files well (see "How can the UNIX sort command sort a very large file?"), so I would expect the standard UNIX command-line approach to work reasonably:
sort my-file-of-strings.txt | uniq > my-filtered-file-of-strings.txt
(Note that the file must be sorted before being piped to uniq, because uniq only removes adjacent duplicate lines.)
If you do not have these tools (or equivalents) available, you can always implement some kind of external merge sort yourself.
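If you do roll your own, the usual shape of an external merge is: read the file in chunks that fit in memory, sort each chunk and write it out as a temporary run file, then do a k-way merge of the runs while skipping repeated lines. A minimal Python sketch under assumed names and an arbitrary chunk size (nothing here comes from the question itself):

```python
import heapq
import itertools
import os
import tempfile

def external_sort_unique(in_path, out_path, max_lines_in_memory=1_000_000):
    """Sort in_path and remove duplicate lines without holding the whole
    file in memory: sorted runs on disk, then a k-way merge.
    Assumes every line ends with a newline."""
    run_paths = []
    with open(in_path, "r", encoding="utf-8") as src:
        while True:
            chunk = list(itertools.islice(src, max_lines_in_memory))
            if not chunk:
                break
            chunk.sort()
            fd, run_path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w", encoding="utf-8") as run:
                run.writelines(chunk)
            run_paths.append(run_path)

    # k-way merge of the sorted runs, emitting each line only once.
    runs = [open(p, "r", encoding="utf-8") for p in run_paths]
    try:
        with open(out_path, "w", encoding="utf-8") as dst:
            previous = None
            for line in heapq.merge(*runs):
                if line != previous:
                    dst.write(line)
                    previous = line
    finally:
        for f in runs:
            f.close()
        for p in run_paths:
            os.remove(p)
```

This is essentially what sort | uniq does for you, so it is only worth writing when those tools are not an option.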