Has anyone tried the open source crawler Bixo? Can you share what you learned? Is it easy to build crawlers with it (compared to Nutch / Heritrix)? Thanks.

Bixo was used in production at a large social networking site (100M page views / day) to classify user-generated content (essentially anything produced by a user that contains a link).
If you know Cascading, Bixo works like any other Cascading component: it essentially takes URLs in the required format as input and emits a bunch of records with page information as output.
One thing I realized early on is that for a very large crawler, the crawling itself is "only" a small piece of the puzzle. The whole workflow around it can be very complex, and if you go with a different crawler product, you then have to find a way to integrate it. With Cascading, Bixo is just one more input into your workflow.
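To illustrate the idea of the crawler being just one stage in a larger workflow, here is a minimal sketch in plain Python. It does not use Bixo or Cascading and does not reflect their APIs; every name in it (`normalize_urls`, `fetch_pages`, `classify`, `run_workflow`, `FAKE_WEB`) is hypothetical, and the "web" is an in-memory dictionary so the example runs without network access.

```python
import re

# Hypothetical in-memory "web" so the sketch runs offline (an assumption,
# not real crawl data).
FAKE_WEB = {
    "http://example.com/a": '<a href="http://example.com/b">b</a> hello',
    "http://example.com/b": "no links here",
}

LINK_RE = re.compile(r'href="([^"]+)"')


def normalize_urls(records):
    """Upstream stage: put URLs into the form the fetch stage requires."""
    for rec in records:
        url = str(rec["url"]).strip()
        if url.startswith(("http://", "https://")):
            yield {"url": url}
        # Anything else is treated as a "dirty" URL and dropped.


def fetch_pages(records):
    """Crawl stage: URL records in, page records (content + outlinks) out."""
    for rec in records:
        body = FAKE_WEB.get(rec["url"])
        if body is None:
            continue  # unknown URL: skip instead of hanging or failing
        yield {
            "url": rec["url"],
            "content": body,
            "outlinks": LINK_RE.findall(body),
        }


def classify(records):
    """Downstream stage: a stand-in for the content classification step."""
    for rec in records:
        yield {**rec, "has_links": bool(rec["outlinks"])}


def run_workflow(source, stages):
    """Chain the stages so each one consumes the previous stage's output."""
    stream = source
    for stage in stages:
        stream = stage(stream)
    return list(stream)


if __name__ == "__main__":
    urls = [
        {"url": " http://example.com/a "},
        {"url": "not a url"},
        {"url": "http://example.com/b"},
    ]
    for row in run_workflow(urls, [normalize_urls, fetch_pages, classify]):
        print(row["url"], row["outlinks"], row["has_links"])
```

Because the fetch stage has the same shape as every other stage (records in, records out), swapping it or adding steps around it does not change the rest of the workflow, which is the integration benefit described above.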
Bixo seems very solid. Ken Krugler (the lead dev) is super responsive and was able to fix some hanging issues within a day (there are many "dirty" URLs in my dataset). The project has a very comprehensive automated test suite to ensure that Bixo works according to its design.
Overall, I cannot recommend it highly enough: I built the whole system in 6-9 months, and I do not think I could have done it in that time frame without it.