Table Enforcer is my attempt to apply a sort of "test driven development" workflow to data cleaning and validation. It is a Python package that facilitates the iterative process of developing and using schema-like representations of pandas DataFrames to recode and validate instances of those data.
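To illustrate the idea, here is a minimal sketch of that workflow in plain pandas: a per-column "schema" of recoders and validators that is re-run against the data after each change, much like a test suite. The column names and check functions are hypothetical and this is not Table Enforcer's actual API.

```python
import pandas as pd

# Hypothetical example data; column names are illustrative only.
df = pd.DataFrame({"age": ["34", "52", "unknown"], "sex": ["F", "m", "Male"]})

# Schema-like spec: a recoder and a validator per column, in the spirit of
# "test driven" data cleaning (not Table Enforcer's real interface).
def recode_age(s: pd.Series) -> pd.Series:
    return pd.to_numeric(s, errors="coerce")

def recode_sex(s: pd.Series) -> pd.Series:
    return s.str.strip().str.upper().str[0]

def valid_age(s: pd.Series) -> pd.Series:
    return s.between(0, 120)

def valid_sex(s: pd.Series) -> pd.Series:
    return s.isin({"F", "M"})

schema = {"age": (recode_age, valid_age), "sex": (recode_sex, valid_sex)}

# Recode every column, then report which rows fail each column's validator.
recoded = pd.DataFrame({col: recode(df[col]) for col, (recode, _) in schema.items()})
failures = {col: recoded.index[~validate(recoded[col])].tolist()
            for col, (_, validate) in schema.items()}
print(recoded)
print(failures)  # e.g. {'age': [2], 'sex': []}
```

Each time the cleaning logic changes, the whole schema is re-applied and the failing rows are surfaced again, which is what makes the loop feel like test driven development.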
CSVParser is a tool to parse CSV files using the univocity and Commons CSV parsers. It cleans newline (\n) and special characters embedded in field data, and it also handles various kinds of garbage data such as an odd number of quotes or delimiters inside quoted fields. It validates each record against the expected delimiter count and separates the records into _GoodRecords.CSV and _BadRecords.CSV files. It is a data-cleaning tool meant to run before ingestion into a data lake, ensuring the data is in a well-formed CSV format that tables can be built on.
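The core good/bad split described above can be sketched as follows. The real tool is Java-based (univocity / Commons CSV); this Python sketch only illustrates the concept, and the file names, delimiter, and expected column count are hypothetical.

```python
import csv

DELIMITER = ","
EXPECTED_COLUMNS = 5  # hypothetical expected field count per record

def split_records(in_path: str) -> None:
    # Hypothetical output naming, mirroring the _GoodRecords/_BadRecords convention.
    good_path = in_path.replace(".csv", "_GoodRecords.CSV")
    bad_path = in_path.replace(".csv", "_BadRecords.CSV")
    with open(in_path, newline="") as src, \
         open(good_path, "w", newline="") as good, \
         open(bad_path, "w", newline="") as bad:
        good_writer = csv.writer(good, delimiter=DELIMITER)
        bad_writer = csv.writer(bad, delimiter=DELIMITER)
        for record in csv.reader(src, delimiter=DELIMITER):
            # Strip embedded newlines and collapse stray whitespace in each field.
            cleaned = [" ".join(field.split()) for field in record]
            # Records with the expected number of fields are "good"; anything
            # else (extra/missing delimiters, quotes that break field
            # boundaries) is routed to the bad-records file.
            if len(cleaned) == EXPECTED_COLUMNS:
                good_writer.writerow(cleaned)
            else:
                bad_writer.writerow(cleaned)

if __name__ == "__main__":
    split_records("input.csv")  # hypothetical input file
```

Running the split before ingestion means only records that already conform to the table's expected shape reach the data lake, while malformed rows are preserved separately for inspection.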