paperoni

Archived

Author	SHA1	Message	Date
Kenneth Gitere	b217448601	Add printing of tables upon successful extraction	2021-04-20 14:02:56 +03:00
Kenneth Gitere	04a1eed4e2	Add progress indicators for the cli	2021-04-17 17:28:07 +03:00
Kenneth Gitere	217cd3e442	Minor refactor Change cli to grab version from the Cargo manifest Rename fetch_url to fetch_html	2021-04-17 12:37:53 +03:00
Kenneth Gitere	7e9dcfc2b7	Add custom error types and ignore failed image downloads Using this custom error type, many instances of unwrap are replaced with mapping to errors that are then logged in main.rs. This allows paperoni to stop crashing when downloading articles when the errors are possibly recoverable or should not affect other downloads. This subsequently introduces ignoring the failed image downloads and instead leaving the original URLs intact.	2021-04-17 12:04:06 +03:00
Kenneth Gitere	d6cbbe405b	Fix bug in `inline_css_str_to_map`	2021-04-14 18:07:39 +03:00
Kenneth Gitere	2762bc5086	Merge pull request #7 from hipstermojo/dev Update README	2021-02-24 13:28:56 +03:00
Kenneth Gitere	b8c0cf29f1	Update README	2021-02-24 13:27:43 +03:00
Kenneth Gitere	e9f96d2970	Merge pull request #6 from hipstermojo/dev Update to 0.3.0	2021-02-24 13:13:36 +03:00
Kenneth Gitere	165b2187be	Bump version	2021-02-24 13:03:52 +03:00
Kenneth Gitere	912bc9d915	Add flag for configuring maximum concurrent requests Change printing macro for error messages to go out to stderr	2021-02-21 13:11:26 +03:00
Kenneth Gitere	b0c4c47413	Add support for merging articles into a single epub This is still experimental as it lacks validation of the target file name	2021-02-11 13:51:21 +03:00
Kenneth Gitere	f0a610c2ac	Bug fix with empty titles The code for title retrieval previously assumed that meta tags concerned with the title would always contain a value but some sites leave the value empty thus it had to be checked for as well.	2021-02-09 12:56:07 +03:00
Kenneth Gitere	65fdd967c1	Refactor image downloading and update README Image downloading uses streams instead of spawned tasks to ensure that it does not start an unbounded number of spawned tasks	2021-02-09 10:34:35 +03:00
Kenneth Gitere	003953332f	Refactor downloading of HTML pages This change allows for parallel downloads of HTML pages up to a maximum number of concurrent HTTP requests which is more efficient than before where all HTTP requests were likely to begin at the same time.	2021-02-06 17:06:03 +03:00
Kenneth Gitere	6b62051942	Add `replace_metadata_value` function	2021-02-06 13:53:04 +03:00
Kenneth Gitere	b402472ba6	Add http and epub modules	2021-02-06 12:59:03 +03:00
Kenneth Gitere	08f847531f	Remove empty lines when reading from an input file	2021-02-03 07:39:51 +03:00
Kenneth Gitere	3d56023592	Add -f flag for adding links from a file instead of needing to use cat	2021-02-01 11:31:24 +03:00
Kenneth Gitere	c82071a871	Merge pull request #5 from hipstermojo/dev Merge 0.2.2-alpha-1	2021-01-24 18:00:50 +03:00
Kenneth Gitere	b98c0a69a6	Bump version	2021-01-24 17:54:33 +03:00
Kenneth Gitere	21c3ffd922	Refactor fetch_url This adds: - More validation of responses to ensure the HTML response is valid. - Better handling of redirecting URLs which allows for fetching of links proxied to Medium.	2021-01-24 17:52:31 +03:00
Kenneth Gitere	1dc7b3432b	Bug fixes The bug fixes include: - `<html>` nodes being added to the replaced image when `unwrap_noscript_tags` is called. - Remove `srcset` attribute of <img> tags after downloading the image. This prevented readers like Foliate from displaying the downloaded image	2021-01-12 10:27:46 +03:00
Kenneth Gitere	ca1f9e2800	Merge pull request #4 from hipstermojo/dev Update to 0.2.1-alpha1	2020-12-24 14:11:42 +03:00
Kenneth Gitere	8407c613df	Bug fixes - Prevent downloading images with base64 strings as the source - Add escaping of quotation characters in the serializer - Disable redirects when downloading images which fails on multiple sites - Remove invalid characters for making the epub export file name - Fix version number in release	2020-12-24 14:03:36 +03:00
Kenneth Gitere	3c7dc9a416	Merge pull request #3 from hipstermojo/dev 0.2.0 update	2020-11-24 18:42:29 +03:00
Kenneth Gitere	3bfa82ba60	Update README and version	2020-11-24 18:39:51 +03:00
Kenneth Gitere	725c73c83f	Add basic redirect provided by surf and early exit of the program if the response is not a 200	2020-11-24 18:31:16 +03:00
Kenneth Gitere	5f99bddc10	Add custom serializer for XHTML	2020-11-24 14:54:23 +03:00
Kenneth Gitere	37cb4e1fd2	Change from structopt to clap This allows printing the help message if no args are passed	2020-11-24 09:58:50 +03:00
Kenneth Gitere	cdfbc2b3f6	Refactor inline_css_str_to_map to use a better tokenizer	2020-11-24 08:29:00 +03:00
Kenneth Gitere	aff4054ca9	Update crates and fix bugs The bug fixes are for: - <base> elements with "/" as the href - articles containing an ampersand in the title which would create corrupted manifest files.	2020-11-23 15:55:58 +03:00
Kenneth Gitere	ef3efdba81	Refactor to use temp directory and update surf Change from using res directory for image downloads to using temp directories. Update surf to v2 which required changing the way Content-Type headers are read from.	2020-11-23 13:38:58 +03:00
Kenneth Gitere	ab800d0174	Bug fix and add printing of the name of the extracted EPUB The fix prevents creating the res directory if it already exists	2020-11-23 09:06:13 +03:00
Kenneth Gitere	b0e402d685	Resize logo	2020-10-24 08:20:47 +03:00
Kenneth Gitere	fbf2f0b3d8	Merge pull request #2 from hipstermojo/dev Merge v0.1.0	2020-10-22 19:25:19 +03:00
Kenneth Gitere	566c3427be	Merge pull request #1 from hipstermojo/readability Add Readability port	2020-10-22 19:24:31 +03:00
Kenneth Gitere	be48cc1e47	Fix alignment in README Update manifest file Add fix in serialized file to have self closing tags which is invalid xhtml	2020-10-22 19:18:18 +03:00
Kenneth Gitere	6aef1631e3	Add README	2020-10-22 16:03:57 +03:00
Kenneth Gitere	1b4c4ee658	Change CLI option to allow for multiple arguments Add basic looping in async runtime	2020-10-22 15:22:56 +03:00
Kenneth Gitere	db11e78d8c	Add template for epub output Change output format to name file with the title name Add getters in MetaData	2020-10-22 13:55:02 +03:00
Kenneth Gitere	703de7e3bf	Merge the readability module with the rest of the extractor	2020-10-22 12:12:30 +03:00
Kenneth Gitere	679bf3cb04	Add logic for attempting different rounds for content extraction with different flags set Add additional test in `fix_relative_uris`	2020-10-22 11:50:34 +03:00
Kenneth Gitere	a0f69ccf80	Fix bug in `is_probably_visible` Add fix in `grab_article` when appending nodes. This internally detaches children so it can end up running only once	2020-10-22 11:37:02 +03:00
Kenneth Gitere	a94798cc95	Add flags for conditional cleaning and removal of nodes This also includes updating the function signatures of the affected methods	2020-10-22 08:24:46 +03:00
Kenneth Gitere	f17c9bfbc9	Add bug fixes for overflows in subtraction, giving a default for capture groups and in extracting nodes. Add fix in `is_probably_visible`	2020-10-21 20:48:21 +03:00
Kenneth Gitere	350447d1c4	Change calls on replacing regexes to `replace_all` Add `fix_relative_uris`, `clean_classes`, `clean_readability_attrs` and `post_process_content`	2020-10-21 19:55:22 +03:00
Kenneth Gitere	aacb442b7a	Move MetaAttr to `moz_readability` and rename to `MetaData` Add get_article_metadata, get_article_title and unescape_html_entities and their tests	2020-10-20 22:27:40 +03:00
Kenneth Gitere	d99b1c687b	Fix counting of h2 nodes in prep_article Add test for prep_article	2020-10-20 10:13:34 +03:00
Kenneth Gitere	94fa8db218	Fix bug in deletion of multiple nodes. When calling `detach` in a for loop or `for_each` iterator consumer, only the first node is ever deleted. Fix replacement of table nodes in prep_article Edit clean_conditionally to remove unnecessary assignment.	2020-10-20 10:04:12 +03:00
Kenneth Gitere	ccdbbb5a16	Add initial implementation of `grabArticle` Change function signature of setNodeTag to return a NodeRef Minor fix in clean, clean_headers and clean_conditionally	2020-10-20 07:42:32 +03:00

75 commits