Escape hyphens to ease conversion to Ruby? #1

Closed
ghost opened this issue Mar 22, 2021 · 5 comments

@ghost

ghost commented Mar 22, 2021

I'm getting regex compile warnings on regexes like

mail.[a-zA-Z0-9-.]+.pl\\/o\\/
email.[a-zA-Z0-9-.]+\\/o\\/[a-zA-Z0-9-_.]{64}

Ruby issues a warning when it compiles these:

warning: character class has '-' without escape:...

If I change the original regexes to

mail.[a-zA-Z0-9\\-.]+.pl\\/o\\/
email.[a-zA-Z0-9\\-.]+\\/o\\/[a-zA-Z0-9\\-_.]{64}

to escape the 'naked' dash, the regexes compile just fine. I'm pretty sure this is an artifact of JS that makes this legal but I just don't know for sure...

My question is: Are the second versions legal JS regexes?
My followup question is: Does this change alter the meanings of the original regexes?
My 2nd followup question would be: Which of the two versions is 'more correct™'?

I'm not trying to start a JS vs Ruby flamewar here...just trying to make the awk script as robust as I can. I'm not a JS guy... So many languages...so little time...
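
One way to check the second question from the Ruby side is to compile both spellings and compare them on a few sample strings. The sketch below is only an illustration: the sample inputs are invented, and `same_matches?` is a hypothetical helper, not part of the awk script. It assumes the patterns have already been decoded from the JSON source, so `\\/` has become `\/`. Ruby may still print the hyphen warning when it compiles the first pattern.

```ruby
# Compare the unescaped and escaped spellings of one pattern from the issue.
# Patterns are built from strings with Regexp.new, the way a converted table
# of tracker patterns would be. All sample inputs are made up.

UNESCAPED = 'email.[a-zA-Z0-9-.]+\/o\/[a-zA-Z0-9-_.]{64}'
ESCAPED   = 'email.[a-zA-Z0-9\-.]+\/o\/[a-zA-Z0-9\-_.]{64}'

# Hypothetical helper: true when both patterns agree on every sample.
def same_matches?(pattern_a, pattern_b, samples)
  a = Regexp.new(pattern_a)
  b = Regexp.new(pattern_b)
  samples.all? { |s| a.match?(s) == b.match?(s) }
end

# 64 characters drawn from the second character class (letters, digit, '-', '_', '.').
TOKEN = 'aB3-_.' * 10 + 'abcd'

SAMPLES = [
  "email.mail-server.example.com/o/#{TOKEN}",         # well-formed tracker URL
  "email.mail-server.example.com/o/#{TOKEN[0, 10]}",  # token too short
  "email.mail server/o/#{TOKEN}"                      # space is not in the class
].freeze

puts same_matches?(UNESCAPED, ESCAPED, SAMPLES)
```

If the two spellings ever disagreed on a sample, the helper would return `false`, which makes it a cheap regression check to run over the whole converted table.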

@leggett
Owner

leggett commented Mar 22, 2021

I'll test and let you know.

@leggett
Owner

leggett commented Mar 23, 2021

It seems to be fine. The escape isn't technically necessary, as hyphens do not need to be escaped there, but I think I could add it.

Another option: first do a replace of a-zA-Z0-9- with a-zA-Z0-9\\-, as this will catch all the instances of a hyphen that you would need to escape for the conversion.

leggett changed the title from "Question?" to "Escape hyphens to ease conversion to Ruby?" on Mar 23, 2021
@ghost
Author

ghost commented Mar 23, 2021

The particular statements are:

sub(/-_.]/, "\\\\-_.]")    # substitute the string '-_.]' with the string '\\-_.]'
sub(/-.]/, "\\\\-.]")      # substitute the string '-.]' with the string '\\-.]'        

If you could change your regexes, that would be great...I could get rid of this bit of ugliness.

I've also written a ruby program that uses the hash created by the awk script...like what I asked for here. It's a simple filter but demonstrates the technique. It only uses a few common gems. It should be understandable by anyone using the mail gem for handling mail in ruby.

It detects/removes the tracker <img> tags and adds an X-Trackers-Blocked header to the message. The content of the X-Trackers-Blocked is a JSON-formatted array of the names of the blocked trackers. It also blocks <img> tags with height="1" width="1" and calls those unknown. I'll add all of this as a demonstration after I do some cleanup and add some comments. As always, use at your own risk and no support expressed or implied.
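
A rough sketch of the filtering step described above, for readers who want to see the shape of it. This is not the actual program: it leaves out the mail gem entirely and works on a plain HTML string, and `TRACKERS`, `IMG_TAG`, and `strip_trackers` are hypothetical names. The one-entry tracker table stands in for the hash generated by the awk script.

```ruby
require 'json'

# Hypothetical tracker table: name => pattern string (hyphen already escaped).
# The real table would come from the hash produced by the awk script.
TRACKERS = {
  'Mailcastr.com' => 'mailcastr.com\/image\/[a-zA-Z0-9\-_]{64}'
}.transform_values { |src| Regexp.new(src) }

# Deliberately simple <img> matcher; real-world HTML may need a proper parser.
IMG_TAG = /<img\b[^>]*>/i

# Returns [cleaned_html, blocked_names].
def strip_trackers(html)
  blocked = []
  cleaned = html.gsub(IMG_TAG) do |tag|
    name, = TRACKERS.find { |_, re| re.match?(tag) }
    # 1x1 images that match no known tracker are reported as "unknown".
    name ||= 'unknown' if tag =~ /\bheight="1"/i && tag =~ /\bwidth="1"/i
    if name
      blocked << name
      ''   # drop the tracker tag
    else
      tag  # keep ordinary images untouched
    end
  end
  [cleaned, blocked.uniq]
end

sample_html = '<p>Hi</p>' \
              "<img src=\"https://mailcastr.com/image/#{'a' * 64}\">" \
              '<img src="https://example.com/p.gif" height="1" width="1">' \
              '<img src="https://example.com/logo.png" width="200" height="50">'

cleaned_html, blocked_names = strip_trackers(sample_html)
puts cleaned_html
puts JSON.generate(blocked_names) # candidate value for the X-Trackers-Blocked header
```

Attaching the JSON string to the message as the X-Trackers-Blocked header is left to whatever mail library is in use.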

It's based on something I already had (which is much more sophisticated) so I didn't have to re-invent the wheel...

@leggett
Owner

leggett commented Mar 23, 2021

If you have a fix for escaping the hyphens, I'd rather not escape them in JS since they don't need to be escaped there.

Your sub misses a case though:
"Mailcastr.com": "mailcastr.com\\/image\\/[a-zA-Z0-9-_]{64}",

So I think you'd be better off with:
sub(/a-zA-Z0-9-/, "a-zA-Z0-9\\\\-") # substitute the string 'a-zA-Z0-9-' with the string 'a-zA-Z0-9\\-'

I will try to keep the hyphen always after the 0-9 in these patterns.
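
For anyone doing the same conversion directly in Ruby rather than awk, the substitution can be sketched as below. `escape_class_hyphen` is a hypothetical helper name, and the patterns are assumed to be already decoded from JSON (so `\\/` is `\/`). It uses `gsub` rather than a single substitution because one pattern may contain more than one such class; the `email.` pattern from the original report has two.

```ruby
# Hypothetical Ruby counterpart of the substitution above: escape the hyphen
# that directly follows "a-zA-Z0-9" inside a character class.
def escape_class_hyphen(pattern)
  # The block form of gsub inserts its result literally, which sidesteps
  # backslash handling in a replacement string.
  pattern.gsub('a-zA-Z0-9-') { 'a-zA-Z0-9\-' }
end

# The three patterns mentioned in this thread.
PATTERNS = [
  'mail.[a-zA-Z0-9-.]+.pl\/o\/',
  'email.[a-zA-Z0-9-.]+\/o\/[a-zA-Z0-9-_.]{64}',
  'mailcastr.com\/image\/[a-zA-Z0-9-_]{64}'
].freeze

PATTERNS.each { |p| puts escape_class_hyphen(p) }
```

Running the helper on an already escaped pattern leaves it unchanged, because the search string no longer occurs once the backslash is in place.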

Nice job removing 1x1 images. I would also remove 0x0 images as those happen sometimes too.

leggett closed this as completed on Mar 23, 2021
@ghost
Author

ghost commented Mar 23, 2021

You write:

> If you have a fix for escaping the hyphens, I'd rather not escape them in JS since they don't need to be escaped there.

OK.

> So I think you'd be better off with:
> sub(/a-zA-Z0-9-/, "a-zA-Z0-9\\\\-") # substitute the string 'a-zA-Z0-9-' with the string 'a-zA-Z0-9\\-'

OK... thanks.

> Nice job removing 1x1 images. I would also remove 0x0 images as those happen sometimes too.

Thanks for the tip...an easy fix -- change '==' to '<=' in a couple of places...
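
As an illustration of the '==' to '<=' change, a size check along these lines would treat both 1x1 and 0x0 images as trackers. `tiny_image?` is a hypothetical helper, not code from the actual filter, and it assumes the dimensions are given as plain `height` and `width` attributes on the tag.

```ruby
# Hypothetical helper: true when an <img> tag declares both height and width
# as 0 or 1. Tags missing either attribute are not considered tiny.
def tiny_image?(tag)
  dims = %w[height width].map do |name|
    m = tag.match(/\b#{name}\s*=\s*["']?(\d+)/i)
    m && m[1].to_i
  end
  # '<=' rather than '==' so that 0x0 images are caught as well as 1x1.
  dims.all? { |d| d && d <= 1 }
end

puts tiny_image?('<img src="p.gif" height="1" width="1">')
puts tiny_image?('<img src="p.gif" height="0" width="0">')
puts tiny_image?('<img src="logo.png" height="50" width="200">')
```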
