Problem Solutions: suffix array advanced problem 209

colin_mcnicholl 2019-07-11 12:55:34

Hi Rodion,

I quickly learned my earlier naive solution gave a Memory Error with the input size for this problem. Looking at the choice of suffix array algorithms out there I chose to have a go at the SA-IS one, it being said to have time complexity O(n). With some effort I got it working correctly and when I submitted my solution (Python) I got the feedback that the answer was correct but not within 90 seconds.

Profiling my code I find it takes ~ 164 seconds for input size ~ 500,000 words and that each time I have the number of words the run time halved. I cannot see any obvious faults. The function that gives a True or False output when called to check if a character is an S-type or not takes the biggest proportion of the runtime but has a very small runtime per call - I just happens to get called a very large number of times.

Google seaches suggest an MP algorithm by Maniscalco and Puglisi is faster in practice. However, I any references to this algorith I found heavy reading.

Any suggestions?

Rodion (admin) 2019-07-13 07:50:31

Colin, Hello!

Sorry for delay, but this puzzled me significantly :) First of all I quite vaguely remember this task. Also I'm definitely not acquainted with the algorithm you use. And also never before I've tried running python profiler (but now I've found it is quite convenient tool).

Regretfully Python (as any scripting language) is tough language when we want high performance...

Important notice - your solution is fast enough. It works for about 40 seconds on my desktop (and this is roughly 10-year old machine with about 3GHz 64 bit cpu).

So it may be important to check what OS and Python interpreter do you use. Mine is default one found in ubuntu, 3.5 I believe.

Also of course this would be some times slower if run on the, say, netbook with "energy-saving" 1 GHz CPU. On slower machines it may make sense trying PyPy3 interpreter or what is now popular...

Alternatively using numpy library provides us with considerably faster arrays instead of lists.

Now about the code. As about isLMSChar(offset, typemap) function - it is called much more times than there are various offsets, right? So it may be worth caching its values.

I have put where we build typemap:

islms = []  # on global level, just to show it

...

def makeSuffixArrayByInducedSorting
    ...
    global islms  # we want this be global
    typemap = buildTypeMap(string)
    islms = [isLMSChar(x, typemap) for x in range(len(typemap))] # from now use this list

Also passing typemap everywhere hints that probably we may make it global variable too (though it is not that important as we eliminated calls to most-called function).

Regretfully, this reduces running time for me only about 20%. Next we have long executing induceSortX functions which heavily rely on list access operations - and it may be where we depend much on interpreter and list type. May be arrays will work faster. Maybe numpy will be even better. May be just switching python interpreter may help. Regretfully I'm not quite guru here :)

P.S. I shall try to recollect what solution I have meant myself. Probably it is something simpler and not necessarily O(N) but we know that some O(N)-s really become very slow due to big constant factors...

P.P.S. Here is the web-site with algorithms in which I take small part - and it seems to have description of N*logN version of the algorithm... But I can't remember right now if it is much simpler...

Rodion (admin) 2019-07-13 09:37:58

UPD: I've tried downloading pypy3, portable version (first time in my life) - your solution runs in 6 seconds with it instead of 30 I achieved before. So it may be worth trying.

colin_mcnicholl 2019-07-13 11:16:20

Hi Rodion,

No problem whatsoever.

Wow! - an impressively comprehensive reply. Your suggestions make a lot of sense, even to me. Several to try - so I will, in due course, ley you know how I get on.

Many Thanks,

colin_mcnicholl 2019-07-19 13:22:37

Hi Rodion,

I have an ~ 8 year old 1.65 Ghz, 64-bit cpu with Windows7 OS. Python v3.7.3.

About the code:

The function isLMSChar(offset, typemap) is called about 57 times for every word, which I would expect is many more times than there are offsets.

Unsurprisinly I experienced a 20% improvement on following your recommendation of making islms a global variable. This resulted in 131 seconds to complete.

I am no expert with numpy either. I converted my arrays to numpy arrays but this resulted in an increase in the run time - so perhaps my implementation was flawed.

I was not sure how to use pypy. Download fine but download to where and then what. It was not clear to me.

I tried using lru_cache and other memoization variations on the lmsSubstringsAreEqual(string, offsetA, offsetB) function but could not overcome the error:

TypeError: unhashable type: 'list'

problem.

I have not had a proper read through the article in the web link you shared as yet.

I will park this problem for now and wait for some inspiration.

Rodion (admin) 2019-07-19 18:08:26

Dear Colin, Hi!

I'm sorry for leaving here that many confusing thoughts.

I think there is no need in inspiration or clever articles - just as I've tried later and added somewhere above - try this speed-optimized python interpreter - pypy. For me it boosted your code about 5 times.

As I understand the only drawback of PyPy is that it may be not compatible with some libraries - but in our case it is there is no libraries, so no problem.

colin_mcnicholl 2019-07-22 09:32:47

Hi Rodion,

Your thoughts were clear enough. Problem may be that I am not so confident in some of the terminology.

I can type directly into the pypy interpreter and get output and I can use import to import my script. However, when I type:

suffixarrayadvanced_problem209.main()

I get:

with open(filename, 'r') as f: IOError: [Errno 2] No such file or directory: 'words.txt'

I thought if might have been associated with using the with keyword so I tried

f = open(filename, 'r') instead.

That did not work.

I have 'words.txt' in the same folder as my script file, 'suffixarrayadvanced_problem209' which is in a folder that is in the python module search path and the latter did not give an error on import so I am not sure how to proceed.

colin_mcnicholl 2019-07-23 10:02:29

Hi Rodion,

By placing the file 'words.txt' in a different folder I managed to get a step further.

Now when I type:

suffixarrayadvanced_problem209.main()

at the interactive interpreter prompt I no longer get the IOError.

Instead I get

TypeError: new() got an unexpected keyword argument 'encoding'

Removing the keyword leads to a TypeError: list index must be an integer for my code

res[char] += 1

Rodion (admin) 2019-07-23 11:53:37

Colin, Hi!

I'm afraid I'm a bit bewildered. You've mentioned you use Windows 7. So you have started windows console and run pypy from it? In this case I think we want something like this:

place your suffix_array.py program and words.txt into the same folder
in this folder execute c:\some\where\it\is\installed\pypy.exe suffix_array.py...

But you probably are trying some different way like load and execute code from interpreter? I'm not sure, this could be successful (probably by import-ing code), but may be somewhat more difficult...

colin_mcnicholl 2019-07-24 09:25:36

Actually I launched pypy.exe by double clicking on it.

I tried from the windows console as you suggest using the command given. Same result.

I tried from the windows powershell (what I normally use) console. Same result.

Downloaded pypy3 instead and success !! 22 secs down from 131 secs.

Thanks a lot for your support - I probably would have given up otherwise.