Sequence of batch operations
1. Inject -> Populates CrawlDB from seed list
2. Generate -> Selects URLs to fetch in segment
3. Fetch -> Fetches URLs from segment
4. Parse -> Parses content (text + metadata)
5. UpdateDB -> Updates CrawlDB (new URLs, new status...)
6. InvertLinks -> Builds web graph
7. Solr Index -> Sends docs to Solr
8. Solr Dedup -> Removes duplicate docs based on signature
Repeat steps 2 to 8
Or use the all-in-one crawl script
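
The step order above can be sketched as a small dry-run planner. This is only an illustration, not the all-in-one crawl script: it builds the command lines and prints them without running anything. The directory names, the Solr URL, the segment placeholder, and the function name plan_crawl are all assumptions made for the example, and the subcommand names and argument order follow the Nutch 1.x command line as commonly documented, so they should be checked against the usage output of bin/nutch for the installed version.

```python
from typing import List

# Paths and the Solr URL are illustrative assumptions, not values from the source.
CRAWLDB = "crawl/crawldb"
LINKDB = "crawl/linkdb"
SEGMENTS = "crawl/segments"
SEEDS = "urls"
SOLR_URL = "http://localhost:8983/solr"


def plan_crawl(rounds: int) -> List[List[str]]:
    """Return the ordered commands for one inject plus `rounds` crawl rounds."""
    # Step 1 runs once: seed the CrawlDB from the seed list.
    plan = [["bin/nutch", "inject", CRAWLDB, SEEDS]]
    for i in range(rounds):
        # Placeholder for the segment directory created by generate;
        # a real run would look up the newest directory under SEGMENTS.
        segment = f"{SEGMENTS}/<segment-{i + 1}>"
        # Steps 2 to 8 repeat once per round, in this order.
        plan += [
            ["bin/nutch", "generate", CRAWLDB, SEGMENTS],
            ["bin/nutch", "fetch", segment],
            ["bin/nutch", "parse", segment],
            ["bin/nutch", "updatedb", CRAWLDB, segment],
            ["bin/nutch", "invertlinks", LINKDB, "-dir", SEGMENTS],
            ["bin/nutch", "solrindex", SOLR_URL, CRAWLDB,
             "-linkdb", LINKDB, segment],
            ["bin/nutch", "solrdedup", SOLR_URL],
        ]
    return plan


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in plan_crawl(rounds=2):
        print(" ".join(cmd))
```

Inject appears once at the top, while steps 2 to 8 sit inside the loop, which mirrors the "Repeat steps 2 to 8" note. Each round works on a fresh segment, so only the CrawlDB and LinkDB carry state from one round to the next.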