How to reduce sqlite3 database size for iphone?

edit: many thanks for all the answers. Here are the results after applying the optimizations so far:

  • Switching to sorting the characters and run-length encoding - new DB size 42 MB
  • Dropping the indexes on the booleans - new DB size 33 MB

The really nice part is that this did not require any changes to the iPhone code.

I have an iPhone application with a large dictionary stored in SQLite format (read-only). I am looking for ideas to reduce the size of the DB file, which is currently very large.

Here is the number of records and the resulting sqlite DB size:

    franks-macbook:DictionaryMaker frank$ ls -lh dictionary.db
    -rw-r--r-- 1 frank staff 59M 8 Oct 23:08 dictionary.db
    franks-macbook:DictionaryMaker frank$ wc -l dictionary.txt
    453154 dictionary.txt

... an average of about 135 bytes per record.

Here is my DB schema:

    create table words (word text primary key, sowpods boolean, twl boolean, signature text);
    create index sowpods_idx on words(sowpods);
    create index twl_idx on words(twl);
    create index signature_idx on words(signature);

Here are some sample data:

    photoengrave|1|1|10002011000001210101010000
    photoengraved|1|1|10012011000001210101010000
    photoengraver|1|1|10002011000001210201010000
    photoengravers|1|1|10002011000001210211010000
    photoengraves|1|1|10002011000001210111010000
    photoengraving|1|1|10001021100002210101010000

The last field holds the letter frequencies of the word, used for anagram retrieval (each position is in the range 0..9). The two booleans mark membership in sub-dictionaries.

I need to make queries such as:

    select signature from words where word = 'foo';
    select word from words where signature = '10001021100002210101010000' order by word asc;
    select word from words where word like 'foo' order by word asc;
    select word from words where word = 'foo' and (sowpods='1' or twl='1');

One idea I have is to encode the letter frequencies more efficiently, e.g. binary-encode them as a blob (perhaps with RLE, since there are many zeros?). Any ideas on how best to achieve this, or other ideas to reduce the size? I build the DB in Ruby and read it on the phone in Objective-C.

Also, is there a way to get database statistics so that I can see what uses the most space?

+7
ruby sqlite iphone compression
11 answers

I'm not clear on all the use cases for the signature field, but it seems that storing an alphabetized version of the word instead would be beneficial.
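A minimal sketch of that idea in Python, assuming lowercase a-z words; the helper name `alphagram` is hypothetical. All anagrams of a word share the same sorted-letter string, so it can serve as the lookup key in place of the 26-digit count string.

```python
def alphagram(word):
    """Return the letters of `word` in alphabetical order."""
    return "".join(sorted(word.lower()))


# Anagrams map to the same key, so `where signature = ?` still finds them.
print(alphagram("photoengrave"))                   # aeeghnooprtv
print(alphagram("listen") == alphagram("silent"))  # True
```

The key is as long as the word itself (12 characters here instead of 26), so short words benefit most.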

+2

Have you tried running the "vacuum" command to make sure the db doesn't contain free space that was never reclaimed?
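To see what VACUUM does, here is a small sketch using Python's built-in `sqlite3` module. The table name `scratch` and the row count are made up for the demonstration, and it assumes the default `auto_vacuum` setting (off), under which deleted pages stay in the file until VACUUM rebuilds it.

```python
import os
import sqlite3
import tempfile


def vacuum_demo():
    """Return (size_before, size_after) in bytes around a VACUUM."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        # Autocommit mode: VACUUM cannot run inside a transaction.
        conn = sqlite3.connect(path, isolation_level=None)
        conn.execute("create table scratch (payload text)")
        conn.execute("begin")
        conn.executemany(
            "insert into scratch values (?)",
            (("x" * 100,) for _ in range(5000)),
        )
        conn.execute("commit")
        # Deleted pages go onto the free list; the file keeps its size.
        conn.execute("delete from scratch")
        before = os.path.getsize(path)
        conn.execute("vacuum")
        after = os.path.getsize(path)
        conn.close()
        return before, after


before, after = vacuum_demo()
print(before, "->", after)
```

If the dictionary file was built in one pass with no deletes or dropped indexes, VACUUM will likely change little.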

+5

Remove indexes on sowpods and twl - they probably don't help your queries and definitely take up a lot of space.

You can get database statistics using sqlite3_analyzer, available from the SQLite downloads page.
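A rough way to check the effect of the boolean indexes without sqlite3_analyzer is to compare file sizes before and after dropping them. The sketch below uses Python's `sqlite3` module with the schema from the question; the rows are synthetic placeholders, not the real dictionary, so the absolute numbers mean nothing.

```python
import os
import sqlite3
import tempfile

SCHEMA = """
create table words (word text primary key, sowpods boolean, twl boolean, signature text);
create index sowpods_idx on words(sowpods);
create index twl_idx on words(twl);
create index signature_idx on words(signature);
"""


def size_without_boolean_indexes(n_rows=5000):
    """Return (bytes_with_indexes, bytes_without) for a synthetic table."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "dictionary.db")
        conn = sqlite3.connect(path, isolation_level=None)  # autocommit
        conn.executescript(SCHEMA)
        conn.execute("begin")
        conn.executemany(
            "insert into words values (?, ?, ?, ?)",
            ((f"word{i:06d}", i % 2, 1, "0" * 26) for i in range(n_rows)),
        )
        conn.execute("commit")
        before = os.path.getsize(path)
        conn.execute("drop index sowpods_idx")
        conn.execute("drop index twl_idx")
        conn.execute("vacuum")  # give the freed pages back to the file system
        after = os.path.getsize(path)
        conn.close()
        return before, after


before, after = size_without_boolean_indexes()
print(before, "->", after)
```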

+4

As a completely different approach, you could try using a Bloom filter instead of a comprehensive database. Basically, a Bloom filter consists of a number of hash functions, each of which is associated with a bit field. For each legal word, each hash function is evaluated, and the corresponding bit in the corresponding bit field is set. The drawback is that false positives are theoretically possible, but they can be minimized or practically eliminated with enough hashes. The plus side is a huge space savings.
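The description above can be sketched in Python as follows. This is an illustration, not a tuned implementation: the class name, the field size and the use of salted SHA-256 as the family of hash functions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Bloom filter with one bit field per hash function."""

    def __init__(self, bits_per_field=1 << 16, num_hashes=7):
        self.bits_per_field = bits_per_field
        self.num_hashes = num_hashes
        self.fields = [
            bytearray((bits_per_field + 7) // 8) for _ in range(num_hashes)
        ]

    def _positions(self, word):
        # Salting SHA-256 with the index yields num_hashes different hashes.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
            yield i, int.from_bytes(digest[:8], "big") % self.bits_per_field

    def add(self, word):
        for i, pos in self._positions(word):
            self.fields[i][pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        return all(
            self.fields[i][pos // 8] & (1 << (pos % 8))
            for i, pos in self._positions(word)
        )


words = BloomFilter()
for w in ("photoengrave", "photoengraved", "photoengraver"):
    words.add(w)
print("photoengrave" in words)  # True: added words are always found
```

Note that such a filter only answers "is this a legal word?"; it cannot return the anagrams of a signature or the matches of a `like` pattern, so it would cover only part of the queries in the question.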

+3

Your best bet is compression, which SQLite unfortunately does not support natively at this time. Fortunately, someone took the time to develop a compression extension that might be what you need.

Otherwise, I would recommend storing your data mostly in compressed form and decompressing it on the fly.
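As a sketch of the "compress and expand on the fly" idea, the following Python uses the standard `zlib` module to deflate a chunk of the word list into one blob and restore it. The function names are hypothetical, and how the words are grouped into chunks (for example by first letter) is left open.

```python
import zlib


def compress_words(words):
    """Join words with newlines and deflate them into a single blob."""
    return zlib.compress("\n".join(words).encode("utf-8"), 9)


def expand_words(blob):
    """Inverse of compress_words."""
    text = zlib.decompress(blob).decode("utf-8")
    return text.split("\n") if text else []


chunk = [
    "photoengrave", "photoengraved", "photoengraver",
    "photoengravers", "photoengraves", "photoengraving",
]
blob = compress_words(chunk)
print(len("\n".join(chunk)), "->", len(blob))
```

Sorted dictionary words share long prefixes, which is the kind of redundancy deflate removes well; the cost is that a single-word lookup has to expand a whole chunk.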

+1

The creator of SQLite is selling a version of SQLite that includes database compression (and encryption). That would be great.

+1

As a text field, the signature currently uses at least 26 * 8 bits per record (208 bits, i.e. 26 bytes), but if you were to pack the data into a bit field, you could probably get away with 3 bits per letter (reducing the maximum frequency per letter to 7). That would mean you could pack the entire signature into 26 * 3 bits = 78 bits = 10 bytes. Even if you used 4 bits per letter (for a maximum frequency of 15 per letter), you would only use 104 bits (13 bytes).

EDIT: after a bit more thought, I think 4 bits per letter (instead of 3) would be a better idea, because it would make the binary math easier.
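A minimal Python sketch of the 4-bits-per-letter packing, assuming the signature is the 26-digit string from the question (one decimal digit per letter); the function names are hypothetical. Two counts share each byte, so the result is 13 bytes.

```python
def pack_signature(signature):
    """Pack 26 decimal digits (one per letter) into 13 bytes, 4 bits each."""
    if len(signature) != 26 or not (signature.isascii() and signature.isdigit()):
        raise ValueError("expected 26 decimal digits")
    counts = [int(ch) for ch in signature]
    # High nibble holds the first count of each pair, low nibble the second.
    return bytes((counts[i] << 4) | counts[i + 1] for i in range(0, 26, 2))


def unpack_signature(blob):
    """Inverse of pack_signature."""
    return "".join(f"{b >> 4}{b & 0x0F}" for b in blob)


packed = pack_signature("10002011000001210101010000")
print(len(packed))               # 13
print(unpack_signature(packed))  # 10002011000001210101010000
```

Because equal signatures pack to equal byte strings, an equality lookup on the packed column should still find all anagrams, provided the query parameter is bound as a blob.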

EDIT2: Reading the SQLite documentation on data types, it seems that you might be able to just replace the "signature" field with 26 columns of type INTEGER, and SQLite will do the right thing and use only as much space as it needs to store each value.

+1

Am I correct that you have approximately 450 thousand words in your database?

I don't have the slightest idea about the iPhone, nor any serious one about SQLite, but ... as long as SQLite does not let you store the file as gz right away (maybe it already does so internally? No, it doesn't look like it when you say it's about 135 bytes per entry, not even with both indexes), I would move away from the table approach, store the dictionary manually in compressed form, and build the rest on the fly and in memory. That should suit your type of data very well.

Wait ... are you using that signature to allow for full-text searching or typo recognition? Wouldn't full-text search in SQLite make that field obsolete?

0

As already noted, storing the "signature" more efficiently seems like a good idea.

However, it also seems that you could save a ton of space by using some sort of lookup table for words. Since you seem to take a root word and then append "er", "ed", "es", etc., why not have a column with a numeric ID that references the root word in a separate lookup table, and then a separate column with a numeric ID that references a table of common word suffixes to be appended to the base word?
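One possible shape for such a layout, sketched with Python's `sqlite3` module and an in-memory database. The table and column names (`roots`, `suffixes`, `root_id`, `suffix_id`) are made up for illustration, and deciding where a word splits into root and suffix is assumed to happen elsewhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table roots (id integer primary key, root text unique);
create table suffixes (id integer primary key, suffix text unique);
create table words (root_id integer, suffix_id integer);
""")


def add_word(conn, root, suffix):
    """Store a word as a pair of IDs into the two lookup tables."""
    conn.execute("insert or ignore into roots (root) values (?)", (root,))
    conn.execute("insert or ignore into suffixes (suffix) values (?)", (suffix,))
    conn.execute(
        "insert into words (root_id, suffix_id) "
        "select r.id, s.id from roots r, suffixes s "
        "where r.root = ? and s.suffix = ?",
        (root, suffix),
    )


def all_words(conn):
    """Rebuild the full words by joining root and suffix."""
    rows = conn.execute(
        "select r.root || s.suffix as word from words w "
        "join roots r on r.id = w.root_id "
        "join suffixes s on s.id = w.suffix_id "
        "order by word"
    )
    return [row[0] for row in rows]


for suffix in ("", "d", "r", "rs", "s"):
    add_word(conn, "photoengrave", suffix)
print(all_words(conn))
```

The trade-off is that the full word is no longer stored anywhere, so a query such as `where word = 'foo'` has to reassemble or split words instead of using an index directly.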

If there were any tricks regarding storing abbreviated signatures for several records with one root word, you could also use them to reduce the size of stored signatures (not sure which algorithm produces these values)

This also makes a lot of sense to me because you have the "word" column as the primary key but do not even index it; just create a separate numeric column to serve as the primary ID for the table.

0

Hmm ... iPhone ... doesn't it have a permanent data connection? I think this is where a web application / web service could step in nicely. Move most of your business logic to the web server (it will have a real SQL database with FTS and lots of memory) and fetch that information online from the client on the device.

0

As mentioned elsewhere, lose the indexes on the boolean columns; they will almost certainly be slower (if used at all) than a table scan and will use space needlessly.

I would consider applying a simple compression to the words; Huffman coding is pretty good for this sort of thing. Also, I would look at the signatures: sort the columns in letter-frequency order and don't bother storing trailing zeros, which can be implied. I guess you could Huffman-encode those, too.

Always assuming your encoded strings don't upset SQLite, of course.
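The signature part of this suggestion can be sketched in Python as below. The letter order in `FREQ_ORDER` is an assumed typical English ranking; in practice it would be derived from the dictionary itself. The sketch also assumes lowercase a-z words with at most 9 occurrences of any letter, as in the question.

```python
import string

# Assumption: a common English letter-frequency ranking, most frequent first.
FREQ_ORDER = "etaoinshrdlcumwfgypbvkjxqz"


def compact_signature(word):
    """Letter counts in frequency order, with trailing zeros dropped."""
    counts = "".join(str(word.count(ch)) for ch in FREQ_ORDER)
    return counts.rstrip("0")


def expand_signature(compact):
    """Restore the 26-digit a..z signature used in the question."""
    padded = compact.ljust(26, "0")  # the dropped zeros are implied
    by_letter = dict(zip(FREQ_ORDER, padded))
    return "".join(by_letter[ch] for ch in string.ascii_lowercase)


short = compact_signature("photoengrave")
print(short, len(short))
print(expand_signature(short))
```

Rare letters end up at the tail, so most words lose a run of zeros; the compact form is still plain digits, which should not cause SQLite any trouble as a text value.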

0