python - string mask and offset with regex


I have a string for which I'm trying to create a regex mask that shows N words at a given offset. Say I have the following string:

"The quick, brown fox jumps over the lazy dog."

I want to show 3 words at a time:

offset 0: "The quick, brown"
offset 1: "quick, brown fox"
offset 2: "brown fox jumps"
offset 3: "fox jumps over"
offset 4: "jumps over the"
offset 5: "over the lazy"
offset 6: "the lazy dog."

I'm using the following simple regex to find the first 3 words:

    >>> import re
    >>> s = "The quick, brown fox jumps over the lazy dog."
    >>> re.search(r'(\w+\W*){3}', s).group()
    'The quick, brown '

But I can't figure out what kind of mask would show the next 3 words rather than the first ones. I need to preserve the punctuation.

Prefix matching option

You can make this work by using a variable-length prefix in the regex to skip the first offset words, and then capturing the word triplet in a group.

Then something like this:

    import re

    s = "The quick, brown fox jumps over the lazy dog."

    # The {N} prefix skips N words; group 1 captures the next 3.
    # Each result keeps the whitespace that follows its last word.
    print(re.search(r'(?:\w+\W*){0}((?:\w+\W*){3})', s).group(1))
    # The quick, brown
    print(re.search(r'(?:\w+\W*){1}((?:\w+\W*){3})', s).group(1))
    # quick, brown fox
    print(re.search(r'(?:\w+\W*){2}((?:\w+\W*){3})', s).group(1))
    # brown fox jumps

Let's look at the pattern:

     _"word"_      _"word"_
    /        \    /        \
    (?:\w+\W*){2}((?:\w+\W*){3})
                 \_____________/
                     group 1

This does what it says: match 2 words, then, capturing into group 1, match 3 words.

The (?:...) constructs are used to group for the repetition, but they are non-capturing.
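The offset and the word count can also be passed in as parameters instead of being hard-coded. The following is only a sketch: the helper name `words_at` and its arguments are made up for illustration, and it uses `re.match` so that the prefix is anchored at the start of the string.

```python
import re

# The "word" pattern from the answer above: a run of word characters
# followed by any (possibly empty) run of non-word characters.
WORD = r"\w+\W*"

def words_at(text, offset, count=3):
    """Return `count` words starting at word index `offset`, or None.

    Hypothetical helper: skips `offset` words with a non-capturing
    prefix, then captures the next `count` words in group 1.
    """
    pattern = r"(?:%s){%d}((?:%s){%d})" % (WORD, offset, WORD, count)
    match = re.match(pattern, text)
    return match.group(1) if match else None

s = "The quick, brown fox jumps over the lazy dog."
for offset in range(7):
    # repr() makes the trailing whitespace of each result visible
    print(offset, repr(words_at(s, offset)))
```

Call `.rstrip()` on the result if the trailing whitespace is unwanted.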


Note on the "word" pattern

It should be said that \w+\W* is a poor choice for a "word" pattern, as the following example demonstrates:

    import re

    s = "nothing"
    print(re.search(r'(\w+\W*){3}', s).group())
    # nothing

There are not 3 words here, but the regex was able to match anyway, because \W* is allowed to match the empty string.

Perhaps a better pattern is something like this:

    \w+(?:\W+|$)

That is, \w+ followed by either \W+ or the end of the string, $.
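To see the difference, here is a small comparison of the two "word" patterns, each repeated 3 times. The names `LOOSE` and `STRICT` are chosen only for this illustration.

```python
import re

# \W* may match the empty string, so one long word can be split
# into three "words" by backtracking.
LOOSE = r"(?:\w+\W*){3}"

# Every word must be followed by at least one non-word character
# or by the end of the string.
STRICT = r"(?:\w+(?:\W+|$)){3}"

for text in ("nothing", "one two three"):
    loose = re.search(LOOSE, text)
    strict = re.search(STRICT, text)
    print(repr(text), bool(loose), bool(strict))
# 'nothing' True False
# 'one two three' True True
```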


Capturing lookahead option

As suggested in a comment, this option is simpler in that you only need one static pattern to capture all the triplets. It uses findall:

    import re

    s = "The quick, brown fox jumps over the lazy dog."

    triplets = re.findall(r"\b(?=((?:\w+(?:\W+|$)){3}))", s)

    print(triplets)
    # ['The quick, brown ', 'quick, brown fox ', 'brown fox jumps ',
    #  'fox jumps over ', 'jumps over the ', 'over the lazy ', 'the lazy dog.']

    print(triplets[3])
    # fox jumps over

This works by matching at the zero-width word boundary \b and using a lookahead to capture 3 "words" in group 1.

        ______lookahead______
       /   ____"word"____    \
      /   /              \    \
    \b(?=((?:\w+(?:\W+|$)){3}))
         \___________________/
                group 1
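The same lookahead pattern can be wrapped so that the window size is a parameter and the offset is just a list index. This is a sketch: the function name `word_windows` is hypothetical, and stripping the trailing whitespace is an added convenience rather than part of the pattern.

```python
import re

def word_windows(text, count=3):
    """Return every run of `count` consecutive words in `text`.

    Hypothetical helper built on the capturing-lookahead pattern.
    Punctuation attached to a word is kept; trailing whitespace
    is stripped from each window.
    """
    pattern = r"\b(?=((?:\w+(?:\W+|$)){%d}))" % count
    return [window.rstrip() for window in re.findall(pattern, text)]

s = "The quick, brown fox jumps over the lazy dog."
windows = word_windows(s)

print(windows[0])  # The quick, brown
print(windows[6])  # the lazy dog.
```

Since the lookahead consumes nothing, findall can report overlapping windows in a single pass.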
