Tools and utilities for writing, reading, inspecting and managing WARC files.
- GitHub: cc-warc-examples CommonCrawl WARC/WET/WAT examples and processing code.
- GitHub: warc-mapreduce WARC and WET support for Hadoop's MapReduce API.
- GitHub: warc-tools Miscellaneous tools for processing WARC files from CommonCrawl.
- Heritrix The Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
- GitHub: WarcMiddleware Lets you download a mirror copy of a website when running a web crawl with Scrapy, the Python web crawler.
- GitHub: WarcProxy Saves proxied HTTP traffic to a WARC file.
- GitHub: WarcMITMProxy HTTP(S) proxy that saves traffic to a WARC file, using libmitmproxy.
- GitHub: Alard/warc-proxy Viewer for browsing the contents of a WARC file.
- GitHub: Megawarc Nondestructive WARC-in-tar to WARC conversion.
- GitHub: warctozip-service An HTTP-based WARC-to-ZIP converter.
- GitHub: archiveteam-megawarc-factory Scripts to bundle Archive Team uploads and upload them to Archive.org.
- GitHub: CDX-Writer Python script to create CDX index files of WARC data.
- GitHub: Heritrix-Cassandra A library for writing Heritrix output directly to Cassandra.
- GitHub: python-heritrix Simple Python wrapper around the Heritrix API.
- WARCreate Extension that allows a user to create a Web ARChive (WARC) file from any browsable webpage. The resulting files can then be used with other tools, such as the Internet Archive's open-source Wayback Machine.
- GitHub: Wpull Wget-compatible web downloader and crawler.
- Java Web Archive Toolkit (JWAT) A package to read and validate WARC, ARC and GZip files.
- SiteStory Transactional archiving tool that selectively captures and stores the transactions that take place between a web client (browser) and a web server.
- NetarchiveSuite A complete web archiving package whose primary function is to plan, schedule and run web harvests of parts of the Internet. It is built around the Heritrix web crawler.
- GitHub: WarcQtViewer UI to view and manage .warc and .warc.gz files.
- WarcManager Database web application that indexes a collection of WARC data and provides a browsing and search interface to it.
- DeDuplicator (Heritrix Add-on) An add-on module (plug-in) for the web crawler Heritrix. It offers a means to reduce the amount of duplicate data collected in a series of snapshot crawls.
- IIPC: Open Wayback Development Landing site for open source Wayback development.
- Web Archiving Integration Layer (WAIL) A graphical user interface (GUI) atop multiple web archiving tools intended to be used as an easy way for anyone to preserve and replay web pages.
- WARCAT Python tool and library for handling Web ARChive (WARC) files.