COLLECTED BY
The Open Syllabus collection contains WARC files from a mid-2021 crawl of about 50 million unique seed URLs extracted from the Open Syllabus version 2.6 dataset and their page requisites. The bulk of the seed URLs are from ".com", ".org", ".edu", and ".uk" TLDs.
Crawl Summary
Crawl start: 2021-04-12
Crawl end: 2021-09-05
Seed URLs: 49,735,419
Archived URLs: 338,690,414
Collection Size: 25 TB
Crawler: Heritrix/3.3.0-hq1-SNAPSHOT-2015-03-16T18:09:23Z
Crawl depth: maxHops=0
Seed Summary
Unique URLs: 49,735,419
Unique Canonical URLs: 48,956,395
Unique Hosts: 984,223
IPv4 Addresses: 3,328
Unique TLDs: 21,761
Unique IANA Valid TLDs: 739
Wayback Machine URLs*: 6,568,213
* NOTE: More than 13% of the URLs in the dataset point to the Wayback Machine!
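The "more than 13%" note can be checked directly from the Seed Summary figures. The following is a minimal sketch, not part of the original crawl tooling; the variable names and the helper share_percent are hypothetical and introduced only for illustration.

```python
# Figures copied from the Seed Summary (not recomputed from the crawl itself).
seed_urls = 49_735_419      # "Unique URLs"
wayback_urls = 6_568_213    # "Wayback Machine URLs"


def share_percent(part: int, whole: int) -> float:
    """Return `part` as a percentage of `whole` (hypothetical helper)."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


# Share of seed URLs that point to the Wayback Machine.
wayback_share = share_percent(wayback_urls, seed_urls)
print(f"Wayback Machine share of seed URLs: {wayback_share:.2f}%")
```

The result lies a little above 13%, which agrees with the note in the summary.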
The Wayback Machine - https://web.archive.org/web/20210413224656/https://github.com/TEIC/TEI-Simple/issues
This repository has been archived by the owner. It is now read-only.