wpull
https://github.com/archiveteam/wpull
HTML
Wget-compatible web downloader and crawler.
8 Subscribers
Help out
- Issues
  - URL fetches are not logged in cygwin environment
  - Next warc is started on resuming, regardless --warc-max-size, when --warc-append
  - Change order of retries: retry all errors once before reattempting the remaining errors
  - wpull parsing HTMLs for links even if it doesn't have to
  - ftp crash: sre_constants.error: bad character range
  - Support text file for sitemaps or general link extraction
  - Logging error: RuntimeError: reentrant call inside _io.BufferedWriter
  - Show general progress of fetched vs todo URLs
  - Support DNS record lookups as first-class citizen
  - Scripting hooks: Links from get_urls do not have True verdict on accept_url
- Docs
  - HTML not yet supported