paperoni

Author	SHA1	Message	Date
Kenneth Gitere	dbac7c3b69	Refactor `grab_article` to return a Result - Add ReadabilityError field - Refactor `article` getter in Extractor to return a &NodeRef. This relies on the assumption that the article has already been parsed and should otherwise panic.	2021-04-21 19:11:57 +03:00
Kenneth Gitere	ae1ddb9386	Add printing of table for failed article downloads - Map errors in `fetch_html` to include the source url - Change `article_link` to `article_source` - Add `Into` conversion for `UTF8Error` - Collect errors in `generate_epubs` for displaying in a table	2021-04-20 21:33:24 +03:00
Kenneth Gitere	60fb30e8a2	Add url field in Extractor struct	2021-04-20 21:06:54 +03:00
Kenneth Gitere	04a1eed4e2	Add progress indicators for the cli	2021-04-17 17:28:07 +03:00
Kenneth Gitere	217cd3e442	Minor refactor Change cli to grab version from the Cargo manifest Rename fetch_url to fetch_html	2021-04-17 12:37:53 +03:00
Kenneth Gitere	7e9dcfc2b7	Add custom error types and ignore failed image downloads Using this custom error type, many instances of unwrap are replaced with mapping to errors that are then logged in main.rs. This allows paperoni to stop crashing when downloading articles when the errors are possibly recoverable or should not affect other downloads. This subsequently introduces ignoring the failed image downloads and instead leaving the original URLs intact.	2021-04-17 12:04:06 +03:00
Kenneth Gitere	912bc9d915	Add flag for configuring maximum concurrent requests Change printing macro for error messages to go out to stderr	2021-02-21 13:11:26 +03:00
Kenneth Gitere	b0c4c47413	Add support for merging articles into a single epub This is still experimental as it lacks validation of the target file name	2021-02-11 13:51:21 +03:00
Kenneth Gitere	003953332f	Refactor downloading of HTML pages This change allows for parallel downloads of HTML pages up to a maximum number of concurrent HTTP requests, which is more efficient than before, when all HTTP requests were likely to begin at the same time.	2021-02-06 17:06:03 +03:00
Kenneth Gitere	b402472ba6	Add http and epub modules	2021-02-06 12:59:03 +03:00
Kenneth Gitere	08f847531f	Remove empty lines when reading from an input file	2021-02-03 07:39:51 +03:00
Kenneth Gitere	3d56023592	Add -f flag for adding links from a file instead of needing to use cat	2021-02-01 11:31:24 +03:00
Kenneth Gitere	21c3ffd922	Refactor fetch_url This adds: - More validation of responses to ensure the HTML response is valid. - Better handling of redirecting URLs which allows for fetching of links proxied to Medium.	2021-01-24 17:52:31 +03:00
Kenneth Gitere	8407c613df	Bug fixes - Prevent downloading images with base64 strings as the source - Add escaping of quotation characters in the serializer - Disable redirects when downloading images which fails on multiple sites - Remove invalid characters for making the epub export file name - Fix version number in release	2020-12-24 14:03:36 +03:00
Kenneth Gitere	725c73c83f	Add basic redirect provided by surf and early exit of the program if the response is not a 200	2020-11-24 18:31:16 +03:00
Kenneth Gitere	5f99bddc10	Add custom serializer for XHTML	2020-11-24 14:54:23 +03:00
Kenneth Gitere	37cb4e1fd2	Change from structopt to clap This allows printing the help message if no args are passed	2020-11-24 09:58:50 +03:00
Kenneth Gitere	aff4054ca9	Update crates and fix bugs The bug fixes are for: - <base> elements with "/" as the href - articles containing an ampersand in the title which would create corrupted manifest files.	2020-11-23 15:55:58 +03:00
Kenneth Gitere	ef3efdba81	Refactor to use temp directory and update surf Change from using res directory for image downloads to using temp directories. Update surf to v2 which required changing the way Content-Type headers are read from.	2020-11-23 13:38:58 +03:00
Kenneth Gitere	ab800d0174	Bug fix and add printing of the name of the extracted EPUB The fix prevents creating the res directory if it already exists	2020-11-23 09:06:13 +03:00
Kenneth Gitere	be48cc1e47	Fix alignment in README Update manifest file Add fix in serialized file to emit self-closing tags, the absence of which is invalid XHTML	2020-10-22 19:18:18 +03:00
Kenneth Gitere	1b4c4ee658	Change CLI option to allow for multiple arguments Add basic looping in async runtime	2020-10-22 15:22:56 +03:00
Kenneth Gitere	db11e78d8c	Add template for epub output Change output format to name file with the title name Add getters in MetaData	2020-10-22 13:55:02 +03:00
Kenneth Gitere	703de7e3bf	Merge the readability module with the rest of the extractor	2020-10-22 12:12:30 +03:00
Kenneth Gitere	75018894ae	Add regexes module in moz_readability that contains the regular expressions used. For optimal performance, the regular expressions are compiled to static values to prevent recompiling in loops	2020-10-15 22:25:10 +03:00
Kenneth Gitere	e1debf5630	Add moz_readability initial code and accompanying unit tests This currently contains the preprocessing code of the Readability. It is a port of Readability.js by Mozilla.	2020-08-31 19:30:09 +03:00
Kenneth Gitere	9f56c58dd9	Add simple CLI wrapper	2020-05-16 10:09:44 +03:00
Kenneth Gitere	271d3c8951	Change download code to save images to a folder Add downloaded images to the output epub file	2020-05-05 12:24:11 +03:00
Kenneth Gitere	4e8812c1ee	Add first attempt to save an epub file	2020-05-02 19:25:31 +03:00
Kenneth Gitere	e5a318282d	Update img tags with new src values to point to the local files	2020-05-02 19:06:03 +03:00
Kenneth Gitere	78ba40f57a	Add image download functionality	2020-05-02 18:33:45 +03:00
Kenneth Gitere	f24e72e70f	Change signature of `extract_content` to copy the reference to article DOM node instead of writing to file	2020-05-02 14:51:53 +03:00
Kenneth Gitere	529704d227	Add test for extract content	2020-05-01 20:42:41 +03:00
Kenneth Gitere	b5336e078d	Factor out text extraction into extractor module	2020-05-01 16:17:59 +03:00
Kenneth Gitere	4527fb07d9	Initial extraction code to get meta information on a blog	2020-04-30 11:05:53 +03:00

35 commits