A Proposed New Problem - Array Merging

CSFPython     2021-08-09 14:24:01

Having really enjoyed working my way through the problems on this site, I thought it was probably time that I contributed some ideas of my own.

I have tried to devise a problem where the basic idea is simple, where the solution can be derived in a fairly straightforward way for small data sets, and where a clever strategy is needed to handle larger data sets. This means that the problem could be set at two different levels (Basic and Advanced).

The problem concerns an array of positive integers and the differences between each pair of adjacent integers in the array. The idea is to rearrange the integers in the array in order to minimise the sum of these differences. If there is no restriction on the way in which the numbers can be arranged then the problem becomes trivial, so the problem I am going to describe distinguishes between valid and invalid orderings. A valid ordering is achieved by starting with two arrays of equal size. These are merged into one array such that the order of integers in the two small arrays is preserved in the larger array. The following examples should make this clearer.

Consider the following array of 5 positive integers.

[61, 89, 45, 71, 93]

There are 4 pairs of adjacent integers, with differences 28, 44, 26 and 22, giving a sum of differences of 120.
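
In Python this quantity is a one-liner; a minimal sketch (the helper name is my own):

    def sum_of_differences(arr):
        # sum of absolute differences between adjacent elements
        return sum(abs(a - b) for a, b in zip(arr, arr[1:]))

    sum_of_differences([61, 89, 45, 71, 93])  # -> 120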

Next consider the two arrays (each of size 5):

A = [61, 89, 45, 71, 93]    B = [52, 73, 72, 85, 3]

We want to merge these into a single array of size 10 without changing the order of the integers in array A or in array B. There are many ways of doing this. One possible way is

[52, 61, 89, 73, 45, 72, 85, 3, 71, 93]

It is easier to see that the order of integers in A and B has been preserved by looking at the following two copies of the array. The first has all elements of B replaced by asterisks and the second has all elements of A replaced by asterisks.

[**, 61, 89, **, 45, **, **, **, 71, 93]

[52, **, **, 73, **, 72, 85,  3, **, **]

The following merged array is invalid because the numbers 73 and 72 from array B are not in the same order in the merged array.

[52, 61, 89, 72, 45, 73, 85, 3, 71, 93]
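
If you want to check a candidate merge by program, one possible sketch (my own helper names; it assumes all values are distinct, which is effectively true for the generator described below):

    from collections import Counter

    def is_subsequence(sub, seq):
        it = iter(seq)
        return all(x in it for x in sub)  # "x in it" consumes "it" while searching

    def is_valid_merge(merged, A, B):
        # valid iff merged uses exactly the elements of A and B,
        # and both A and B appear in their original order
        return (Counter(merged) == Counter(A) + Counter(B)
                and is_subsequence(A, merged)
                and is_subsequence(B, merged))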

We now need to define the mechanism for generating the arrays A and B. This is done using a Linear Congruential Generator. The one I have chosen is

X[n+1] = K*X[n] mod M   where K = 48271 and M = 2^31 - 1 = 2147483647

This generates a sequence of random numbers over a wide range of values. Each random value X[n] is generated from the previous value X[n-1]. All that is needed to generate the entire sequence is to provide the initial value X[1].

We will fill array A first and then array B. The initial value X[1] is the first value in array A.

As an example, consider arrays A and B each of size 5, giving a merged array of size 10. If we take the value of X[1] to be 987654321 then arrays A and B will be created as follows:

A = [987654321, 924765591, 1764756619, 185446553, 978719167]
B = [1260159904, 1704424709, 2039127922, 830962617, 696926541]
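
In Python the generator can be transcribed directly; something along these lines reproduces the two arrays:

    def lcg_arrays(seed, size, k=48271, m=2**31 - 1):
        # X[n+1] = K*X[n] mod M; the first `size` values fill A, the rest fill B
        values = [seed]
        for _ in range(2 * size - 1):
            values.append(values[-1] * k % m)
        return values[:size], values[size:]

    A, B = lcg_arrays(987654321, 5)
    # A == [987654321, 924765591, 1764756619, 185446553, 978719167]
    # B == [1260159904, 1704424709, 2039127922, 830962617, 696926541]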

The problem is to find a valid merged array which minimises the sum of the differences between adjacent integers in the merged array. There are usually multiple solutions having the same sum of differences so it is only necessary to find the numeric value of the sum. The corresponding merged array is not required.

For this example the minimum sum of differences is 3824205044.

One possible valid merged array which gives this value is

[987654321, 924765591, 1260159904, 1704424709, 1764756619, 2039127922, 830962617, 185446553, 696926541, 978719167].

To try something more substantial start with the same value of X[1] and create arrays A and B of size 12, giving a merged array of size 24. The minimum sum of differences is now 8554354549.

THE ADVANCED VERSION OF THE PROBLEM

The previous two examples are easily solved by working through each of the valid merged arrays. In fact this is easily possible for arrays A and B up to a size of 15. Beyond that you would need to leave your program running for a long time. So I suggest that a size of 15 is the maximum for a basic version of the problem.
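
For concreteness, a recursive sketch of this brute force (my own names; it carries a running sum, so no ordering is ever stored):

    def min_merge_sum_brute(A, B):
        # walk every valid interleaving of A and B;
        # exponential, so only practical up to a size of about 15
        best = float('inf')

        def extend(i, j, last, total):
            nonlocal best
            if i == len(A) and j == len(B):
                best = min(best, total)
                return
            if i < len(A):
                step = abs(A[i] - last) if last is not None else 0
                extend(i + 1, j, A[i], total + step)
            if j < len(B):
                step = abs(B[j] - last) if last is not None else 0
                extend(i, j + 1, B[j], total + step)

        extend(0, 0, None, 0)
        return best

For the size-5 example above this returns 3824205044.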

For the Advanced version I suggest that the arrays A and B should each have a size of 1000, giving a merged array of size 2000. To tackle a problem of this size you need to develop a clever strategy!

If you want to try an example, let us use the same starting value of X[1] = 987654321. The minimum sum of differences is now 865033913336.

This calculation takes 1.5 seconds in normal Python. Most other languages will be significantly faster, so the online checker will have no trouble verifying submitted solutions.

Since the value of X[1] lies in the range 0 < X[1] < 2147483647, there are clearly many possible values for X[1], so the online problem setter can set a different problem each time it is invoked.

You can try this problem whether or not it is taken up by CodeAbbey. I hope you enjoy it.

qwerty     2021-08-10 09:17:16

This calculation takes 1.5 seconds

Took 8 hours to complete for size = 12 with several optimizations. Perhaps you meant it takes 1.5 seconds for size = 5?

I think the size should not be greater than 10 for the basic version.

CSFPython     2021-08-10 18:41:54

Hi qwerty_one. Thanks for the feedback. 1.5 seconds in Python is genuine for size 1000, using a clever method. For size 12 in normal Python the clever method is instantaneous. A method that generates all of the valid orderings takes 15 seconds. I can only assume that your program is generating all possible orderings. It is fairly simple to generate only the valid orderings. There are far fewer of these. This method takes 20 minutes in normal Python for size 15. That is why I suggested a maximum of 15 for the basic problem.

Think about how you can generate the valid orderings directly. It might help to try this on paper with size 5.

qwerty     2021-08-11 06:35:34

Think about how you can generate the valid orderings directly.

Unfortunately, this is just what my program does - generates all the valid orderings, and only those.

There are 252 valid orderings for size 5 and 2704156 valid orderings for size 12, right?
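
These counts are just binomial coefficients - we choose which n of the 2n positions receive the elements of A; a quick check:

    from math import comb
    comb(10, 5)   # 252
    comb(24, 12)  # 2704156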

Processing 252 orderings takes less than a second, but 2704156 takes around 8 hours to complete.

I can only suppose you have some beast PC, but you should not rely on all users having such a PC, so it seems sensible to me to limit the basic version to size 10.

qwerty     2021-08-11 06:44:04

Anyway, I greatly appreciate such an interesting problem! Let's hope that we will see it live on CodeAbbey.

CSFPython     2021-08-12 22:46:32

I am intrigued that you think that I own a supercomputer. I wish!

Since I don't play computer games I have no need for a powerful processor. My processor is an Intel Core i3-8350K 4 GHz. The i3 range is intended to be low priced and hence of only moderate performance. This is also an 8th generation processor; the i3 range is now up to its 11th generation.

I assume that your program is in Python. If so I would expect the run times to be similar to those of my Python programs (within a factor of 2; possibly a factor of 4 if you are using a fairly old machine).

Your program takes 8 hours and mine takes 15 seconds. That makes my program 1920 times faster! This could only be achieved by a huge amount of parallel processing (i.e. a supercomputer).

You are, of course, completely correct in your statement of the number of valid orderings for size 5 and size 12. For each of the 2.7 million orderings for size 12 the computer will need to do 23 subtractions and a similar number of additions. Even in Python this is not going to take a huge amount of time.

I am now really curious to know what it is that is causing your program to run so slowly. It might be a problem with internal memory. If you are storing all of the orderings before processing them you will certainly be using quite a bit of memory. Each ordering contains 24 integers. Python will use at least 4 bytes for each of these (probably rather more). That results in a total storage requirement of at least 260 Mbytes. A modern computer should have no difficulty in coping with this but yours may have limited memory, especially if other parts of your system are using a substantial part of what is available.

If you are storing the orderings I suggest that you do the calculations for each ordering as soon as it is created. It can then be discarded, since you only need to keep the best sum of differences achieved so far. If this does not make a difference I am not sure what to suggest next. I may need to see your program to work out what is happening.
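
To illustrate the "score it as soon as it is created" point, one possible shape (a sketch with my own names; itertools does the generation and only the best sum is kept):

    from itertools import combinations

    def min_merge_sum_stream(A, B):
        n = len(A)
        best = float('inf')
        for a_pos in combinations(range(2 * n), n):  # slots occupied by A
            a_it, b_it, a_set = iter(A), iter(B), set(a_pos)
            merged = [next(a_it) if p in a_set else next(b_it)
                      for p in range(2 * n)]
            total = sum(abs(x - y) for x, y in zip(merged, merged[1:]))
            best = min(best, total)  # the ordering is discarded immediately
        return best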

qwerty     2021-08-13 22:43:20

This may well be a memory issue... I am not experienced enough in the language to fully understand its workings...

I may need to see your program to work out what is happening.

Please take a look

Thanks in advance.

Rodion (admin)     2021-08-29 08:33:09

Friends, thanks a lot for the suggestion! I obviously missed something, but hopefully I'll work through the description soon and get back with more thoughts! Sorry for the delay!

Rodion (admin)     2021-08-30 06:46:20

UPD: I played a bit with it, trying a few stupid approaches, but none of them produces the optimal "interweaving" of the sequences. I obviously need to figure this out to create the checker :)

BTW, do I understand correctly that you not only have a "smart strategy" but also a proof that its result is the best? Or is it "heuristically proved" on smaller input sets?

CSFPython     2021-08-30 21:44:42

Rodion, thanks for taking a look at this.

I can certainly guarantee that my smart algorithm always gives the correct result. This is fairly obvious from the algorithm itself, but I do not have a mathematical proof. I did check it against the brute force algorithm for several data sets at each size from 2 to 15. This checking was mainly done to ensure that I had no bugs in my program code. If you see the algorithm you will see that it must give a correct result.

I have no proof that my smart strategy is the best but I cannot think of any way of improving on it. The fact that it takes 1.5 seconds to solve size 1000 in normal Python should be a good enough yardstick on which to base any further estimates. The efficiency of the algorithm is not affected by the particular data set being used. The algorithm also uses a very modest amount of internal memory, so would work on qwerty's old computer!

The program for the smart method is actually simpler than the program for the more obvious brute force method (which will work for sizes up to 10 or possibly 12 to 15). I am a problem solver rather than a programmer so I wanted to create a problem where it was necessary to think carefully to design an efficient method rather than to use some clever programming tools.

I can send you my Python program if you like. I have a fairly simple programming style and make minimal use of library functions. My programs are often very similar to pseudocode so should be fairly easy to read. I can also send a written description of the algorithm (it is actually surprisingly simple).

I had thought that the problem could be presented as a basic and an advanced version. This is similar to Easter Eggs and Easter Eggs Advanced, except that Easter Eggs Advanced does require some knowledge of maths. The problem I have presented here does not require any specialist knowledge.

You may well think that the basic version of the problem is too simple to use. I understand your need for a lower cut-off on the difficulty of new problems, but only you can really decide what that is.

Rodion (admin)     2021-09-05 17:20:56

Clive, hello again!

Thanks for suggestion and details!

I'm in a curious position - obviously I can't come up with the solution myself (perhaps I can ask people at other sites though) - so we have at least a couple of options:

  • either we add it with "precalc" datasets - you provide 100-200 input sets (or rather their generating numbers) with corresponding answers - I embed these into the checker and the thing is ready to amuse our other "top-solvers" - it is probably the easiest way for me :)
  • or you take pains to explain the algorithm to me (I believe a verbal description rather than Python code will be better) - and after understanding it I encode it into the checker, so it will provide people with truly random data every time (and in the process I will probably understand the algorithm well enough to prove it to myself).

In both cases there remains a small chance that some day someone will find a better algorithm (though I don't believe this) - but presumably it won't be much of a problem; things happen!

The program for the smart method is actually simpler than the program for the more obvious brute force

I immediately suspected this, since the "brute force" requires obvious but tedious generation of the different ways of merging. I just haven't come upon a simple way that I can prove :)

I had thought that the problem could be presented as a basic and advanced version

We can do this, yep. Not 100% sure the "basic" version is a useful exercise - but at least it may teach people, eh... what is it called? some kind of combinations or allocations? In this sense I'd say it is not really "basic", ha-ha.

Perhaps I should start with it then (and perhaps I'll be lucky enough to come upon your smart method in the process, though I don't count on such luck).

Meanwhile I want to ask you about a suitable generating algorithm for producing long sequences for this task - do you feel an LCG is ok? I know it creates fragments with similar behavior here and there - or does this not really matter for the task?

Also, is it important that the numbers are large? E.g. for a sequence of 1000 numbers, is it ok to use values in a range up to just 10000?

CSFPython     2021-09-06 15:12:30

Rodion, Hi.

As it is not possible to discuss the algorithm in an open forum I have sent a description of the algorithm to you by e-mail. I hope it does not get classified as junk!

I am happy to go with anything that you decide for the LCG. You know far more about these things than I do. Anything that produces a sequence of vaguely random numbers will do. The range of the numbers is also not important. A maximum of 10,000 seems perfectly reasonable.

I don't think it makes much difference whether the user generates the numbers from a given seed or the checker program generates the arrays (from a random seed) and presents them to the user.

I think that the checker should make use of the algorithm to determine the correct solution. Having banks of pre-prepared data seems rather unnecessary.

I am happy to go with any decision that you make about setting both versions of the problem, or just one, or even neither if you don't like my algorithm! My reasoning for suggesting both versions is that this might provide the necessary encouragement for members to think about ways to improve their simple algorithm.

I envisage solutions of the simple version of the problem (with size up to 15) generating each of the valid sequences. (Size 15 takes almost 20 minutes in Python.) If members are told that it is possible to solve size 1000 in less than 2 seconds, using an algorithm that is designed to avoid unnecessary calculations, they are probably more likely to give it some serious thought if they have already been rewarded for a correct solution to the easy version. The program for the simple version is not trivial so it is still a worthy exercise.

I remember doing the Easter Eggs problem. I am sure that almost everyone (me included) simply generated combinations until they found the correct one. There seemed little point in doing anything else. The Advanced version of the problem was a good way of introducing members to the fact that certain things can be done far more efficiently. Because I knew some relevant maths I found the Advanced version quite straightforward. I wonder if it is worth stating in a problem that some specialist knowledge would be useful. For example, a statement along the lines of:

"You might find that reading about X would be useful before attempting this problem."

Conversely, members might be put off attempting a problem if they suspect that some specialist knowledge or the use of some fancy library routine will be needed in order to solve it. I purposely set out to avoid both of these, and hope that this information could be included with the problem in order to encourage people to think about it.

Rodion (admin)     2021-09-06 17:08:06

Clive, glad to see you again!

Just at the same time I was struggling with the "basic" version - so while I'm digesting your answer, you (and everyone) are welcome to test it:

Train Merge

CSFPython     2021-09-06 19:25:08

Rodion, Hi.

I really like what you have done with the problem, turning it into two trains.

I like the idea of having multiple data sets for the small version. The full version should have just one.

I have written a Python program which takes a simple approach to this problem. For sizes 10 to 15 it takes the following times:

10: 1 sec; 11: 4 sec; 12: 17 sec; 13: 1 min 13 sec; 14: 4 min 30 sec; 15: 20 min

I initially suggested an upper limit of size 15, assuming that there would be only one data set. A simple algorithm in Python would be punished by a run-time of 20 minutes but should still succeed.

I have looked at the problem posed by the online system. This has 14 different data sets. This is a good idea but 6 of them are at size 15. My simple program would take 2.5 hours to run through them all. This is probably too much of a penalty.

With multiple data sets it might be a good idea to use sizes in the range 8 to 13.

As for testing the problem online, I would prefer not to. It seems only right that someone else should have the honour of claiming the first solution.

Rodion (admin)     2021-09-06 20:15:33

Clive, thanks for the quick response!

My simple program would take 2.5 hours to run through them all

Ha-ha, I feel evil :) But that's how it happened - I read (above) your report of 20 min for N=15 and decided it was a good "reference implementation".

So, as Python is the most used language here, I feel it is only fair that people should care about its performance at least a bit... Really, there are multiple options:

  • either try finding your "wise strategy"
  • or try optimizing "simple way"
  • or at least download PyPy interpreter
  • or use a faster language sometimes :)

Seemingly, there are quite different versions of the "simple approach". We should try as much as possible to avoid shifting data in the array (or anything like that). My version takes about 10 seconds, though it is in C - multiply that by about 20, which I think would be standard Python performance for the same code - i.e. reducing it to about 30 min instead of 2 hours.

Simply using PyPy instead should have a similar effect. Using both should then probably make it solve N=15 in about 1 minute...

Thus, honestly, I'd prefer to leave it as is - so that people may feel a bit unsatisfied with the simple approach :)

As for testing the problem online, I would prefer not to

If you don't want to have it passed, you can just submit 1 - simply to check that the expected answers are correct. This may be quite helpful, since I found a couple of stupid bugs in my implementation which led to slightly wrong results in some cases. That would definitely enrage other people, if I have missed some more :)

CSFPython     2021-09-07 20:49:08

Rodion, Hi.

I have just tried the Train Merge program ten times in a row. This must amount to about 130 data sets. I used my fast algorithm to solve them and the result agreed with the checker every time.

I think this is reasonable evidence that the problem is working as it should. Hopefully people will actually notice the fact that it has appeared in the problem list. I think quite a few of the members should be able to solve it.

I think the train pictures are great!

TimeToMoveOn     2021-09-07 23:16:03

Hehe, the trains provide a very good visualization and story :D The problem was a lot of fun :) The online compiler unfortunately couldn't handle it, but locally I could run it (in Java) in 15-20 seconds, which was still acceptable for me :)

Unfortunately it's blunt brute force; overnight I'll rummage around in my mind to see if dynamic programming could help me here, or if there's another approach I didn't have so directly in mind.

Anyway, that was another great exercise for recursion (funnily enough I'm writing a module exam tomorrow called "Algorithms and Data Structures" - recursion is one of the main topics) :D

Accordingly, thank you for the implementation! I'd gladly take more of these :)

CSFPython     2021-09-08 08:31:43

TimeToMoveOn, Hi.

Congratulations on being first to solve the new problem. I hope that the full version (with size 1000) will appear on the site at some stage. Let me know if you manage to solve it. It is certainly worth the effort. The solution is remarkably simple and rather elegant.

If you like this kind of problem I can devise others from time to time (providing that Rodion is happy to include them on the site).

Rodion (admin)     2021-09-08 11:18:33

Ah, Friends, be sure Rodion is quite happy, right - it's far better than trying to invent problems myself :)

As for the full version - I'm sure it is going to arrive soon. I'm still trying to come up with a solution without peeking at your email - I just had an idea: since my "brute-force" is recursive, to turn it into DP.
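
Roughly, the idea is to memoise the recursion on the state (cars taken from A, cars taken from B, which train supplied the last car), which gives O(N^2) states. A sketch of my own - I haven't checked whether it matches Clive's method (it assumes both arrays are non-empty):

    def min_merge_sum_dp(A, B):
        # dpA[i][j]: best sum after placing i items of A and j of B,
        #            where the last item placed was A[i-1]
        # dpB[i][j]: the same, but the last item placed was B[j-1]
        n, m = len(A), len(B)
        INF = float('inf')
        dpA = [[INF] * (m + 1) for _ in range(n + 1)]
        dpB = [[INF] * (m + 1) for _ in range(n + 1)]
        dpA[1][0] = dpB[0][1] = 0  # first item comes from A or from B
        for i in range(n + 1):
            for j in range(m + 1):
                if dpA[i][j] < INF:  # last item was A[i-1]
                    if i < n:
                        dpA[i + 1][j] = min(dpA[i + 1][j],
                                            dpA[i][j] + abs(A[i] - A[i - 1]))
                    if j < m:
                        dpB[i][j + 1] = min(dpB[i][j + 1],
                                            dpA[i][j] + abs(B[j] - A[i - 1]))
                if dpB[i][j] < INF:  # last item was B[j-1]
                    if i < n:
                        dpA[i + 1][j] = min(dpA[i + 1][j],
                                            dpB[i][j] + abs(A[i] - B[j - 1]))
                    if j < m:
                        dpB[i][j + 1] = min(dpB[i][j + 1],
                                            dpB[i][j] + abs(B[j] - B[j - 1]))
        return min(dpA[n][m], dpB[n][m])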

TimeToMoveOn     2021-09-09 21:47:02

It looks like Rodion has moved on ;) I've tried many different approaches now... unfortunately without success :( The last ones were mostly based on greedily choosing the "smarter" element... e.g. the smaller one, or the one with the smaller difference (I'm only posting this here because it obviously doesn't work). Well, it's not meant to be, I guess :) It would have been quite elegant. Maybe one day in the distant future :)
