regex - Speeding up regular expressions in Python


I need to extract text quickly from HTML files. I am using the following regular expressions instead of a full-fledged parser, because I need to be fast rather than exact (I have more than a terabyte of text). The profiler shows that most of the time in my script is spent in re.sub. What are good ways to speed up my process? I could implement some parts in C, but I wonder whether that would help, given that the time is spent inside re.sub, which I assume is already implemented efficiently.

    import re

    # Remove scripts, styles, tags, entities, and extra spaces:
    scriptRx = re.compile(r"<script.*?/script>", re.I)
    styleRx = re.compile(r"<style.*?/style>", re.I)
    tagsRx = re.compile(r"<[!/]?[a-zA-Z-]+[^<>]*>")
    entitiesRx = re.compile(r"&[0-9a-zA-Z]+;")
    spacesRx = re.compile(r"\s{2,}")
    # ....
    text = scriptRx.sub(" ", text)
    text = styleRx.sub(" ", text)
    # ....

Thanks!

First, use an HTML parser built for this, like BeautifulSoup.
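BeautifulSoup is a third-party package, so as a dependency-free illustration of the same parser-based idea, here is a minimal sketch that uses the standard library's `html.parser` as a stand-in. The names `TextExtractor` and `extract_text` are made up for this example and are not part of any library. It is not tuned for speed.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased from HTMLParser.
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse runs of whitespace to single spaces.
        return " ".join(" ".join(self._chunks).split())


def extract_text(html):
    """Hypothetical helper: return the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `extract_text(page)` on each file would replace the chain of `re.sub` calls. Whether it is fast enough for a terabyte of input is something only a measurement on real data can settle.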

Then, you can identify the remaining slow spots with the profiler.
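As a rough sketch of what that could look like with the standard library's `cProfile` and `pstats` modules: the helper `profile_call` and the function `strip_tags` are hypothetical names for this example, and the pattern is the tag regex from the question.

```python
import cProfile
import io
import pstats
import re

# Tag-stripping pattern taken from the question.
TAGS_RX = re.compile(r"<[!/]?[a-zA-Z-]+[^<>]*>")


def strip_tags(text):
    """Replace every HTML tag with a single space."""
    return TAGS_RX.sub(" ", text)


def profile_call(func, *args):
    """Run func(*args) under cProfile; return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Show the ten entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


if __name__ == "__main__":
    sample = "<p>some <b>text</b></p>" * 10000
    _, report = profile_call(strip_tags, sample)
    print(report)
```

The report lists each function with its call count and cumulative time, which shows whether a particular pattern dominates the run.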

And for learning about regular expressions, I have found the following very valuable, no matter the programming language:

Apart from this:

Given the use case as restated, I would say that, for this request, the above is not what you want. My recommendation would be:

