Regexp matching in pig

Using Apache Pig and the text

hahahah.  my brother just didnt do anything wrong. He cheated on a test? no way!

I'm trying to match "my brother just didnt do anything wrong."

Ideally, I'd want to match anything beginning with "my brother just" and ending with either punctuation (end of sentence) or EOL.

Looking at the pig docs, and then following the link to java.util.regex.Pattern, I figure I should be able to use

extrctd = FOREACH fltr GENERATE FLATTEN(EXTRACT(txt, '(my brother just .*\p{Punct})')) as (txt:chararray);

But that seems to match until the end of the line. Any suggestions for performing this match? I'm ready to pull my hair out, and by pull my hair out, I mean switch to Python streaming.

Best answer

By default quantifiers are greedy. This means they match as much as possible. In this case you want to match only up to the first punctuation mark. In other words you want to match as little as possible.

So to solve your problem, you should make the quantifier non-greedy by adding a ? immediately after it:

my brother just .*?\p{Punct}
                  ^

Note that this use of ? is different from its use as a quantifier, where it means "match zero or one".
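
To see the difference outside Pig, here is a minimal Java sketch using java.util.regex directly, which is the engine the question links to. The class name GreedyVsReluctant and the helper firstMatch are made up for this illustration; the input is the sample text from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Hypothetical helper: returns the first substring of text matched by
    // regex, or null if nothing matches.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "hahahah.  my brother just didnt do anything wrong. "
                    + "He cheated on a test? no way!";

        // Greedy: .* runs on to the LAST punctuation character in the input.
        System.out.println(firstMatch("my brother just .*\\p{Punct}", text));

        // Reluctant: .*? stops at the FIRST punctuation character.
        System.out.println(firstMatch("my brother just .*?\\p{Punct}", text));
    }
}
```

The first call prints everything through "no way!", and the second prints only "my brother just didnt do anything wrong." Keep in mind that \p{Punct} also includes the apostrophe, so if the text read "didn't" the reluctant match would stop right there.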

Other answers

Have you tried: .*(my brother just .*\p{Punct})

It looks like your expression expects "my brother" to be at the beginning of the string, but in your example it's in the middle of the string, so you have to account for everything before "my brother".

You are matching .*, which matches everything. Try [a-z]* to match lowercase letters only.
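
A narrower character class can also cover the "punctuation or EOL" case from the question. The following is only a sketch in plain Java, not Pig: the class name SentenceExtract and the method extract are invented for the example, and it assumes each record is a single line. It uses Matcher.find(), which searches anywhere in the input, so no leading .* is needed here; if the function you call requires a whole-string match instead, the leading .* suggested above would still apply.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SentenceExtract {
    // Consume only non-punctuation characters after the prefix, then require
    // either one punctuation character or the end of the input.
    static final Pattern SENTENCE =
        Pattern.compile("my brother just [^\\p{Punct}]*(?:\\p{Punct}|$)");

    // Hypothetical helper: first match anywhere in text, or null if none.
    static String extract(String text) {
        Matcher m = SENTENCE.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Ends at the first punctuation mark after the prefix.
        System.out.println(extract("hahahah.  my brother just didnt do anything wrong. "
                                 + "He cheated on a test? no way!"));
        // No punctuation after the prefix: the match runs to the end of the input.
        System.out.println(extract("well, my brother just left"));
    }
}
```

Because the class excludes punctuation, the greedy * cannot run past the first punctuation mark, so no reluctant quantifier is needed in this variant.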




