Comment Stripper - new problem 242

Back to General discussions forum

Rodion (admin)     2021-11-21 14:15:21
User avatar

We wanted some continuation for "regexp" related problems as I remember, so here comes one, please.

Comment Stripper

I suspect the most general solution will be bit too hard, so test cases are somewhat easy and hopefully it is possible to come up with some sort of regexp in a few minutes.

HouseDwarf     2021-11-26 20:01:27

Hi,

I looked at one of the solutions posted and am suspicious that it would fail the following test case:

print('newline character: \n.')

Or in general any string in which a backslash is followed by something other than a quote or a backslash. Do there exist test cases like that in the tests?

Rodion (admin)     2021-11-26 20:16:18
User avatar

Hi! You are completely right, such case is not covered. But it is in problem statement:

Suppose that no other symbol except single or double quote, or backslash could follow escaping backslash

I.e. we do not try to cover full real python syntax - I suspect it is still possible with regexps, but they will look extremely horribly... Such problem should really be solved with state automat, rather than regexp (which, of course, contains state automat inside, but we have limited way of "programming" it)...

BTW curious to see that someone (you) decided to start with latest problems. Surely that's better, more diversity :)

gardengnome     2021-11-26 20:28:55
User avatar

Feels like my long lost cousin ... :)

HouseDwarf     2021-11-26 20:41:25

Ah, my mistake - I only skim read the problem statement and didn't notice that. I figured that the disclaimers would essentially mean you wouldn't do something tricky like:

print(r'\\'') # this is a bit evil

In the test cases :-).

The reason I decided to do some of the later problems is that I occasionally talk to someone else that uses the site regularly and they'll describe the problems they're doing. After that I sometimes get the desire to do the problem we were talking about

Rodion (admin)     2021-11-26 21:45:50
User avatar

Feels like my long lost cousin ... :)

Ha-ha, that note is more puzzling rather than clarifying - either you met your cousin, or just think you met, or you think that person could be your cousin if you ever had one...

But probably this wasn't meant to be clarifying at all!

As a side note I dare say I expected this problem to be very simple, and solution by HouseDwarf is even simpler than mine - but after it hanged here for a few days before first solutions, I start suspecting I misjudged it a bit...

print(r'\\'') # this is a bit evil

Honestly, I'm looser in Python so this made me go-and-check - this isn't syntactically correct at all, right?

gardengnome     2021-11-26 21:57:28
User avatar

gardengnome & HouseDwarf ...

HouseDwarf     2021-11-26 21:58:22

Ah, just ran that in the interpreter and that is indeed not correct. I misunderstood the semantics of raw strings. It looks like they probably mean "accept a normal string literal but don't convert any of the escapes when converting it to its value". I thought they would have different definitions for what follows the backslash than normal string literals. I found it surprising, for instance, that:

print(r'\'')   # is fine - prints: \'
print(r'\\'')  # is not fine
print(r'\\\'') # is fine - prints: \\\'

In which case if the actual meaning of raw string is to process a normal string literal but don't process the escapes when converting it to its value then I can't think of much more tricky stuff that you can do... The only other trick I can think of is:

print(''' this's evil ''')

Which would break solutions naive to triple quotes...

Rodion (admin)     2021-11-26 22:05:14
User avatar

gardengnome & HouseDwarf

oh my, I'm either blind or too stupid. thanks :)

by the way I remember long ago I first misread gnome for genome and was quite puzzled. Is there any curious story behind this username, if not a secret?

Which would break solutions naive to triple quotes...

Ah, that's true. I vaguely remembered r'...' and u'...' and triple quotes, but thought the latter are mainly for multiple line usage. Well, I even hate to think about writing regexp to cover this all. Anyway initial problem was inspired by exercise in K&R ancient book, which was written probably in pre-regex times :)

gardengnome     2021-11-26 22:05:22
User avatar

With regard to the difficulty of this problem: the actual ask and format of the solution is a little unclear to me. Is the regexp supposed to match only the comment part, or a whole line with the comment being one of matched groups? What format should the replacement pattern take - a string, no quotes, single quotes, double quotes?

Rodion (admin)     2021-11-26 22:10:20
User avatar

Aha, thanks for this note, I'll try to improve problem statement.

Regexp could cover either comment (provided that replacement is empty string) or more (but then one needs to use capture groups in replacement part, I guess). You have freedom in this sense.

What format should the replacement pattern take

not sure I understand the question, but I'd better add alternative example with capture group:

(.*)#.* $1

Ignoring quoted strings, such simple regexp and replacement will capture anything before sharp sign and replace the whole matched part with the content of the matched group. Hm. I shall try to write it down carefully in the morning :)

But one can just play with "regexp" button to get idea...

HouseDwarf     2021-11-27 09:29:56

I thought about this problem a little more and wrote a solution that is triple quote aware (I've posted that as my solution now - I think it's quite readable for what it is). I've spent a while trying to think of anything that would break it, but I can't think of any special cases it can't handle. If anyone else can I'd be interested to know :-). Edit: to be clear - I'll be interested in any valid line of python that breaks my solution (other than triple quote strings across multiple lines). After messing around with raw strings a bit I think that my solution should handle them.

Oh, and by the way, because of the comment I decided to look up when K&R was written and when regex became a thing. K&R was written in 1978, and regular expressions were devised in the 1950s. I figured I'd look this up as I think of regular expressions as "mathsy" and C as "engineery", so thought regular expressions would be older. Personally, my main memory of regular expressions is automata theory and language hierarchies at university, where it had a very "theoretical" feel. I think that the kind of regular expressions that we use day to day aren't actually regular expressions as they add some operations which are strictly more powerful than formal regular expressions (depending on the implementation).

Also, when thinking about writing a regular expression to match this I considered that if the problem were too difficult to write a regular expression by hand it might be easier to write a DFA and then convert that to a regular expression. I didn't previously know whether there existed a DFA to RE conversion algorithm, but figured there probably did and that DFA and RE are equivalent. Anyway, I looked it up and found there do indeed exist DFA to RE conversion algorithms like this. I don't know how far down the regular expression rabbit hole you want to go on this site, but I think making a problem where the easiest way to generate a regular expression to solve the problem is writing a DFA to RE compiler could be fun XD.

Vadim Pelyushenko     2022-02-16 12:25:16
User avatar

I would not wish this problem on my worst enemy. Took me quite a bit of time to figure out the subtle flaw in my regex (needed [^\\'] instead of [^'] and [^\\"] instead of [^"]) On the bright side, my solution seems to be the shortest, though I didn't make it handle triple quotes, and maybe it's still wrong but no test case covers the flaw.

Fun fact: There exist poly-time algorithms for minimizing DFAs (and the minimal DFA for a regular language is unique). Meanwhile minimizing regexs is NP-Hard.

TestUser     2022-02-16 16:44:55
User avatar

Vadim, Hi! Glad to hear from you again!

Ha-ha. Imagine someone left such a subtle flaw in production code to next generation of developers :)

But I wonder... About DFA... We now have problems on turing machine... though it is a bit larger than DFA, we can employ it for the original task (of stripping comments from C code)... And I remember there was some suggestion of problem related to TM from you also!

Vadim Pelyushenko     2022-02-16 19:56:15
User avatar

Obviously we should have a problem where we use a TM that can simulate an arbitrary TM on an arbitrary input sequence. :P

In other words, a problem where you have to create a Universal Turing Machine (UTM).

Vadim Pelyushenko     2022-02-16 20:02:28
User avatar

As for doing this comment stripping problem by implementing it as a TM... I would honestly think that doing this problem w/ a TM would be easier than doing it with Regex, as TMs are essentially nothing but DFAs which are able to have some specific side effects. And a DFA is the most natural solution to this comment stripping problem in my opinion (after struggling w/ making the regex I drew out the DFA and tried to base the regex off that).

Please login and solve 5 problems to be able to post at forum