Working with large chunks of data

todddixon · December 2, 2015, 5:13am

Hi all.

I have a poser regarding the handling of large text files, in this case a massive dictionary of words as part of a game, perhaps on the order of 200,000 words. The question is quite simple in one regard, i.e. how do I handle them?

Whilst initially there will be only one dictionary, in the future I hope to use in-app purchases to let users buy other subject-specific word lists.

To develop the app to this point I imported the text into tabs in Codea after wrapping it up in a table where each entry is a group of words and the index is the corresponding word length, e.g. words[3] = {"aah", "all", "aba", etc.} and words[4] = {"aali", "able", "ably", etc.}. Nice and easy.
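A minimal sketch of that layout in plain Lua, assuming the words start out as a flat list. The function name groupByLength is hypothetical and only used for illustration:

```lua
-- Group a flat list of words into a table indexed by word length,
-- e.g. byLength[3] = {"aah", "all", "aba"}.
local function groupByLength(wordList)
    local byLength = {}
    for _, word in ipairs(wordList) do
        local len = #word
        byLength[len] = byLength[len] or {}
        table.insert(byLength[len], word)
    end
    return byLength
end

local words = groupByLength({"aah", "aali", "all", "able", "aba", "ably"})
print(#words[3], #words[4])
```

Building the table at load time like this means the word data itself can live in a plain text file rather than in a code tab.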

The first problem is that even if I give one tab to each table entry (maximum 10 tabs, because the largest word will be ten letters), Codea chokes and takes forever to open a tab. In theory this is OK because I don’t need to open those word tabs once the data is there. However, the questions I face now and in the future are:

1] I wish to edit my Codea apps on the computer and not on the iPad itself. Air Code is just not good enough: it lacks the search and replace and other features of a fully functional text editor, and it doesn’t like playing with those huge tabs.

In the past I used xcodea, which was great, but it is broken with recent versions of Codea and the author does not look like he is going to continue work on it.

Subsequently I have set up codea-SCM and integrated it with GitHub, but it is slow, likely because it has to pull down the whole project every time an update is made on the computer, which makes quick edits and testing unfeasible. Moreover, going the other way, doing a quick fix on the iPad and then attempting to push it back to GitHub fails. It times out or gets a “too large” error because of the transfer size. This is likely related to the size of each of the chunks of words.

To get around this whilst editing I guess I could do an import to projectdata and reference it from there but how will this go when I export to Xcode? Will this project data be sitting in Xcode ready to be compiled? I am a little unclear on that side of things.

2] As mentioned above I will want to offer in-app purchases of different dictionaries. I assume they will be downloaded upon purchase via an http.request and then stored in projectdata (or global data?) as above. If the app gets wiped/reinstalled then it will need to download the dictionary again, I guess. Alternatively I could preinstall all dictionaries in the app (and do app updates to add new ones for purchase). What I am after is the ability to do this with either method while keeping the editing of Codea via a computer fast enough.

Any insights on the general direction I might head would be truly appreciated. Thanks!

Ignatz · December 2, 2015, 5:51am

I can’t help you with Xcode, but there was some discussion of dictionaries on the forum a couple of years ago.

http://codea.io/talk/discussion/1506/scramwords-a-word-game

@Mark may have some advice for you

I wouldn’t try to include the dictionary in code tabs. It chokes, as you say, and there isn’t room for multiple dictionaries. The idea of downloading them on demand and storing them locally seems sensible. It also gives you the option to remotely correct or upgrade dictionaries as required.

Another option that has been explored is storing dictionaries in images, because you can store three letters per pixel, in R,G,B. So if you have 200,000 words averaging 5 letters, with a separator character, that’s 1.2m characters, or 400,000 pixels, which is only a 633x633 image. This gives you good compression, but the downside is a little time spent extracting the data from the image.
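The arithmetic above, plus the three-characters-per-pixel idea, can be sketched in plain Lua. This is only an illustration: packToPixels and unpackFromPixels are hypothetical names, the "pixels" are ordinary {r, g, b} tables rather than a Codea image, and a zero byte is assumed to be free for use as padding:

```lua
-- Rough size estimate: three characters per pixel (R, G and B).
local wordCount = 200000
local averageLength = 5
local charsPerPixel = 3

-- Each word carries one separator character.
local totalChars = wordCount * (averageLength + 1)
local pixels = math.ceil(totalChars / charsPerPixel)
local side = math.ceil(math.sqrt(pixels))
print(totalChars, pixels, side)

-- Pack a string into a list of {r, g, b} byte triples, padding with zeros.
local function packToPixels(text)
    local result = {}
    for i = 1, #text, 3 do
        local r, g, b = text:byte(i, i + 2)
        result[#result + 1] = {r or 0, g or 0, b or 0}
    end
    return result
end

-- Reverse the packing, skipping the zero padding.
local function unpackFromPixels(pixelList)
    local chars = {}
    for _, p in ipairs(pixelList) do
        for c = 1, 3 do
            if p[c] > 0 then
                chars[#chars + 1] = string.char(p[c])
            end
        end
    end
    return table.concat(chars)
end

local packed = packToPixels("aah,all,aba")
print(#packed, unpackFromPixels(packed))
```

In Codea the triples would then be written into an image's pixels and read back on load. That step is left out here because it depends on how the image is saved.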

todddixon · December 2, 2015, 6:52am

@Ignatz Thanks for the feedback. I had a good read of that and it’s everything I am encountering. I am going to do some experiments with Xcode and establish the persistence of the data as it is exported, compiled and run as an app. This of course leads me to examine the differences between saving global, project or local data when it comes to getting it across to Xcode.

However, upon further reading of the reference manual, am I missing the obvious and not looking closely enough at readText and saveText? The asset packs migrate to Xcode for compilation, so nice plain text files should do the job fine. But can you write to an asset pack after you have compiled your project as an app? If not, then I guess that is where you can use the saveProjectData etc. commands. I’ll do some speed tests too to find the optimal solution. My biggest desire is to minimise the to’ing and fro’ing between Mac and iPad when it comes to editing.

Cheers

yojimbo2000 · December 2, 2015, 7:22am

The easiest way to do this would be to store the text files in Dropbox/Apps/Codea and then read them with readText (it’s a bit picky about extensions, there’s a list somewhere on the forum of what it accepts, .txt, .json etc). When you export to Xcode, Codea might not auto detect the assets, but it’s not hard to drag and drop them into the resulting Dropbox.assets folder in the Xcode repo. Then, add Git to the repo (you have to do this on the command line).

todddixon · December 2, 2015, 8:25am

@yojimbo2000 Thanks also for your feedback.

I can see how importing the dictionary files upon launch from a text file to their respective tables would work nicely whilst either in codea or when compiled. Much more elegant to maintain should I wish to tweak those dictionary files e.g. remove offensive words if they surface. A lot easier too than manually table.concat’ing them to the projectdata before compilation and then ‘exploding’ them when running the app!

However, what about when I want to implement in-app purchases of different dictionaries? I know I could simply bundle all the dictionaries in there at the beginning and unlock them as required via in-app purchases. However, if I wanted the app to dynamically download dictionaries on the fly from an http source I imagine I would have a problem. Am I right in believing that saveText only works when it’s running in Codea and not when compiled? I.e. can you write to asset packs in a compiled app? I assume not. If I bring them down via an http.request I may have no alternative but to save them in projectdata? Does that make sense?

Oh and I don’t understand what you are getting at with the Git reference.

yojimbo2000 · December 2, 2015, 10:46am

Yes, you could grab new files with http.request and use saveText to save them to one of the asset folders (Dropbox, documents, or project). In Xcode projects I’ve only ever used readText, not saveText, so you might want to experiment to check it works.
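A rough sketch of that download-and-save flow. Several things here are assumptions rather than anything confirmed in this thread: the function name downloadDictionary, the example.com URL, the "Documents:" prefix for the saved file, and the success callback receiving the response data followed by the status code. The request and save parameters exist only so the function can be tried with stand-ins outside Codea; inside Codea they fall back to http.request and saveText:

```lua
-- Download a dictionary and store it locally.
-- request/save default to Codea's http.request and saveText (assumed
-- signatures); pass your own functions to try this outside Codea.
local function downloadDictionary(name, url, onDone, request, save)
    request = request or http.request
    save = save or saveText
    request(url,
        function(data, status)
            if status == 200 and data and #data > 0 then
                save("Documents:" .. name, data)
                onDone(true)
            else
                onDone(false, "unexpected response: " .. tostring(status))
            end
        end,
        function(err)
            onDone(false, err)
        end)
end

-- Hypothetical usage inside Codea:
-- downloadDictionary("medical", "https://example.com/medical.txt",
--     function(ok, err) print(ok, err) end)
```

Whether the saved file survives in an exported Xcode app is exactly the thing that still needs testing, so treat this as a starting point for that experiment.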

Re Git, I just meant that Xcode has an option to add git to a newly created repo, but not to an already existing one. So if you’re wanting to use Xcode’s source control options on the repo exported by Codea (and I’d recommend that all projects do tbh), you need to add git in the terminal (Google “add git to existing Xcode repo”) before you can use the version control GUI within Xcode.

yojimbo2000 · December 2, 2015, 10:48am

Check out json.encode and json.decode for converting tables to strings and back, but be aware that they are quite slow, so depending on your needs, you might be better off rolling your own routine with table.concat.
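One way such a hand-rolled routine might look, as a sketch only. It assumes the length-indexed table described earlier in the thread, that no word contains a comma or a newline, and the names encodeWords and decodeWords are made up for the example:

```lua
-- Turn a length-indexed word table into one string:
-- one line per word length, words separated by commas.
local function encodeWords(byLength)
    local lines = {}
    for _, group in pairs(byLength) do
        if #group > 0 then
            lines[#lines + 1] = table.concat(group, ",")
        end
    end
    return table.concat(lines, "\n")
end

-- Rebuild the table; the index is taken from the length of the
-- first word on each line.
local function decodeWords(text)
    local byLength = {}
    for line in text:gmatch("[^\n]+") do
        local group = {}
        for word in line:gmatch("[^,]+") do
            group[#group + 1] = word
        end
        if #group > 0 then
            byLength[#group[1]] = group
        end
    end
    return byLength
end

local text = encodeWords({[3] = {"aah", "all", "aba"}, [4] = {"aali", "able"}})
local restored = decodeWords(text)
print(#restored[3], #restored[4])
```

The resulting string is plain text, so it can be stored with whichever save function turns out to work in the exported app.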

dave1707 · December 2, 2015, 3:15pm

@todddixon I have a dictionary file containing 300,249 words that are in a txt file in my Dropbox folder. I use readText and put them in a single table. On my iPad Air, it reads the file and creates the table in less than a second. To read all the words in the table takes approx 1/10 second.

EDIT: Actual time to read text file and build table, .796 seconds. Actual time to read each word in the table, .044 seconds.
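For anyone wanting to repeat this kind of measurement, here is a sketch using os.clock from standard Lua. The word list is synthetic so the snippet runs anywhere; in Codea the text would come from readText instead, and the timings will of course differ by device:

```lua
-- Build a synthetic "file" of 300,000 words, one per line.
local parts = {}
for i = 1, 300000 do
    parts[i] = "word" .. i
end
local text = table.concat(parts, "\n")

-- Time splitting the text into a single table.
local start = os.clock()
local words = {}
for word in text:gmatch("[^\r\n]+") do
    words[#words + 1] = word
end
local buildTime = os.clock() - start

-- Time visiting every word in the table.
start = os.clock()
local count = 0
for _ in ipairs(words) do
    count = count + 1
end
local readTime = os.clock() - start

print(string.format("build %.3f s, read %.3f s, %d words",
    buildTime, readTime, count))
```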

yojimbo2000 · December 2, 2015, 8:37pm

If anyone’s interested in how the automatic asset bundling works in Codea Xcode export, assets are included if you use readText/readImage with the literal name of the asset, but don’t get included if the name is generated programmatically, e.g.

readText("Dropbox:myTextFile")  -- file automatically included on export

assets = {"Dropbox:myTextFile", "Dropbox:myTextFile2", "Dropbox:myTextFile3"}

for _, fileName in ipairs(assets) do
  readText(fileName) --files not automatically included
end

It’s not hard to add missing assets to the repo after export
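If the behaviour is as described above, one possible way to keep a loop while still writing every asset name as a literal inside a readText call is a table of small loader functions. This is an untested guess about what the exporter picks up, and loaders and loadAll are made-up names:

```lua
-- Each loader calls readText with a literal asset name. Whether the
-- exporter detects literals inside functions is an assumption to verify.
local loaders = {
    function() return readText("Dropbox:myTextFile") end,
    function() return readText("Dropbox:myTextFile2") end,
    function() return readText("Dropbox:myTextFile3") end,
}

-- Run every loader and collect the file contents in order.
local function loadAll()
    local texts = {}
    for i, load in ipairs(loaders) do
        texts[i] = load()
    end
    return texts
end
```

It is worth checking the exported Dropbox.assets folder after trying this, and adding any missing files by hand if the guess turns out to be wrong.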

Ignatz · December 2, 2015, 9:34pm

@yojimbo2000 - we could do with a good FAQ on porting to Xcode. I would try, but I know nothing about it.

yojimbo2000 · December 2, 2015, 11:32pm

There’s quite a few changes in Codea 2.3.2 reflecting Xcode 7/iOS 9. I just hope Apple pull their finger out and approve 2.3.2

se24vad · December 3, 2015, 9:10am

@Ignatz - Quite interesting/clever approach to use images for this task! Actually, I think you could save even 4 characters into one pixel (R,G,B,A). With some clever algorithm you could compress it even more.
I also think images would be quicker to parse (read) than a text file.

Ignatz · December 3, 2015, 9:39am

@se24vad - you can’t use A, because it is anti aliased automatically behind the scenes, which alters the value

Ignatz · December 3, 2015, 9:40am

I’ve also written code to compress text files using RLE and the available ASCII characters used by Lua (about 80), which achieved a compression ratio equivalent to zip.

But text tends to be pretty small, so is there any need, these days?

yojimbo2000 · December 3, 2015, 1:21pm

You can turn off the anti-aliasing with noSmooth (does that fix the alpha issue?) But I agree that there’s probably not much point using images to pass non-image data. The only thing I’d use it for is passing data to a vertex shader.

se24vad · December 3, 2015, 6:59pm

@Ignatz does rendering (antialiasing) really play any role? I wouldn’t put the image on screen, just into memory and read the values. I did a pixel animation app some time ago, where I saved images with different opacities and read them without any problem. I also didn’t notice any shifts of values…

dave1707 · December 3, 2015, 8:37pm

@se24vad Not showing the image doesn’t matter. It’s saving the image and reading it back that causes the problem. And setting noSmooth doesn’t help either. I asked Simeon about this a couple of years ago when I first saved a dictionary file in an image and he said the problem was in either saving the file or reading it back. I tried to find his exact answer, but I’m not sure if it was in a post or an email. Ignoring the alpha value and just using the bits of r,g,b , you can save either 4 or 5 characters depending on what the 5th character is.
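A sketch of the bit-packing idea, under the assumption that words use only the lowercase letters a to z. Each letter then needs 5 bits (codes 1 to 26, with 0 as padding), so four letters take 20 of the 24 bits in r, g, b. The four spare bits, which could carry a restricted fifth character, are left unused here. packFour and unpackFour are hypothetical names, and plain arithmetic is used instead of bit operators so it runs on any Lua version:

```lua
-- Pack up to four lowercase letters into three 8-bit channel values.
local function packFour(s)
    local value = 0
    for i = 1, 4 do
        local byte = s:byte(i)
        local code = byte and (byte - 96) or 0   -- 'a' = 1 ... 'z' = 26
        value = value * 32 + code
    end
    -- value uses at most 20 bits; split it across r, g and b
    local r = math.floor(value / 65536)
    local g = math.floor(value / 256) % 256
    local b = value % 256
    return r, g, b
end

-- Recover the letters, dropping trailing padding.
local function unpackFour(r, g, b)
    local value = r * 65536 + g * 256 + b
    local chars = {}
    for i = 4, 1, -1 do
        local code = value % 32
        value = math.floor(value / 32)
        if code > 0 then
            chars[i] = string.char(code + 96)
        end
    end
    return table.concat(chars)
end

print(unpackFour(packFour("able")), unpackFour(packFour("aah")))
```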

West · December 3, 2015, 11:44pm

@todddixon

Here is an old thread from when I wrote Anagramal, which might give food for thought (and @Ignatz helped out a lot).

http://codea.io/talk/discussion/4762/anagram-permutations-and-spell-checking#latest

I have 26 tabs, one for each letter of the alphabet. I did all the editing on the iPad with the main code in a separate tab and didn’t have issues editing once I got them in in the first place. I realise this doesn’t address your Air Code issue, but have you tried keeping the dictionary tabs in another project and then referencing it as a dependency (untried, just a thought)? I also realise that this doesn’t help the multiple dictionaries issue - sorry!