ruby - Can I use Hpricot to find the main article text of any/most websites?


I want a way to scrape the main text from any webpage that displays an article, the way Readability does on any website. Can the main text be found in the same way?

I am using Ruby on Rails, so I think Hpricot is my best bet. Is this possible with Hpricot? Is there an example? Thanks for reading.

You can definitely use Hpricot to scrape content from any HTML page.

Here's a step-by-step tutorial:

Hpricot expressions (CSS selectors or XPath) are ideal for parsing a file with a known HTML structure.

However, reading arbitrary web pages and identifying the main article text is a harder problem. I think you would need some kind of AI (at least a minimal one), which is well outside the purview of Hpricot.

If a narrower approach is acceptable, you could write a set of scrapers for the common HTML formats you want to scrape (probably WordPress, Tumblr, Blogger, etc.).

I am also sure you could come up with a heuristic to try to do it (depending on how good Readability is, that is what I think it does; it seems to be far from working perfectly):

1) Identify a (fixed) set of tags that can be considered part of the "main block of text" (e.g. <br> etc.).

2) Scrape the page and find the largest block of text on that page that contains only tags from (1).

3) Return the text from (2) with the tags from (1) removed.

Looking at the results of Readability, I would guess this heuristic works about as well.
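
The three steps can be sketched roughly as follows. This is only an illustration under some assumptions: it uses plain Ruby string handling instead of Hpricot so that it runs without any gems, the `TEXT_TAGS` list and the `main_text` method name are made up for this example, and "largest block" is taken to mean the most characters of text. HTML entities are left as they are. In a real application you would probably let Hpricot do the parsing and apply the same idea to the parsed document.

```ruby
# Step 1: tags treated as part of running article text (an assumed, tunable list).
TEXT_TAGS = %w[p br a b i em strong span].freeze

# Allowed tags that separate words or lines, so they are replaced by a space.
SPACING_TAGS = %w[p br].freeze

def main_text(html)
  # Drop scripts, styles and comments so their contents never count as text.
  cleaned = html.gsub(/<(script|style)\b.*?<\/\1\s*>/mi, ' ')
                .gsub(/<!--.*?-->/m, ' ')

  blocks = [String.new]

  # Splitting with a capture group keeps the tags in the token list.
  cleaned.split(/(<[^>]*>)/).each do |token|
    name = token[/\A<[\/!]?\s*([a-z][a-z0-9]*)/i, 1]&.downcase

    if name.nil?
      blocks.last << token            # plain text: belongs to the current block
    elsif TEXT_TAGS.include?(name)
      # Step 3: an allowed tag is removed; the text around it stays together.
      blocks.last << ' ' if SPACING_TAGS.include?(name)
    else
      blocks << String.new            # Step 2: any other tag ends the block
    end
  end

  # Normalize whitespace and return the block with the most text.
  blocks.map { |b| b.gsub(/\s+/, ' ').strip }.max_by(&:length)
end

sample = <<~HTML
  <html><body>
    <div id="nav"><a href="/">Home</a> | <a href="/about">About</a></div>
    <div id="post">
      <p>First paragraph of the <b>article</b>.</p>
      <p>Second paragraph,<br/>with a line break.</p>
    </div>
  </body></html>
HTML

puts main_text(sample)
```

The obvious weakness is that a long comment thread or sidebar can beat a short article, so the tag list and the "largest" measure are the parts to tune.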

