paperoni

Archived

Author	SHA1	Message	Date
Kenneth Gitere	21c3ffd922	Refactor fetch_url This adds: - More validation of responses to ensure the HTML response is valid. - Better handling of redirecting URLs which allows for fetching of links proxied to Medium.	2021-01-24 17:52:31 +03:00
Kenneth Gitere	1dc7b3432b	Bug fixes The bug fixes include: - `<html>` nodes being added to the replaced image when `unwrap_noscript_tags` is called. - Remove `srcset` attribute of <img> tags after downloading the image. This prevented readers like Foliate from displaying the downloaded image	2021-01-12 10:27:46 +03:00
Kenneth Gitere	8407c613df	Bug fixes - Prevent downloading images with base64 strings as the source - Add escaping of quotation characters in the serializer - Disable redirects when downloading images which fails on multiple sites - Remove invalid characters for making the epub export file name - Fix version number in release	2020-12-24 14:03:36 +03:00
Kenneth Gitere	3bfa82ba60	Update README and version	2020-11-24 18:39:51 +03:00
Kenneth Gitere	725c73c83f	Add basic redirect provided by surf and early exit of the program if the response is not a 200	2020-11-24 18:31:16 +03:00
Kenneth Gitere	5f99bddc10	Add custom serializer for XHTML	2020-11-24 14:54:23 +03:00
Kenneth Gitere	37cb4e1fd2	Change from structopt to clap This allows printing the help message if no args are passed	2020-11-24 09:58:50 +03:00
Kenneth Gitere	cdfbc2b3f6	Refactor inline_css_str_to_map to use a better tokenizer	2020-11-24 08:29:00 +03:00
Kenneth Gitere	aff4054ca9	Update crates and fix bugs The bug fixes are for: - <base> elements with "/" as the href - articles containing an ampersand in the title which would create corrupted manifest files.	2020-11-23 15:55:58 +03:00
Kenneth Gitere	ef3efdba81	Refactor to use temp directory and update surf Change from using res directory for image downloads to using temp directories. Update surf to v2 which required changing the way Content-Type headers are read from.	2020-11-23 13:38:58 +03:00
Kenneth Gitere	ab800d0174	Bug fix and add printing of the name of the extracted EPUB The fix prevents creating the res directory if it already exists	2020-11-23 09:06:13 +03:00
Kenneth Gitere	b0e402d685	Resize logo	2020-10-24 08:20:47 +03:00
Kenneth Gitere	566c3427be	Merge pull request #1 from hipstermojo/readability Add Readability port	2020-10-22 19:24:31 +03:00
Kenneth Gitere	be48cc1e47	Fix alignment in README Update manifest file Add fix in serialized file to have self closing tags which is invalid xhtml	2020-10-22 19:18:18 +03:00
Kenneth Gitere	6aef1631e3	Add README	2020-10-22 16:03:57 +03:00
Kenneth Gitere	1b4c4ee658	Change CLI option to allow for multiple arguments Add basic looping in async runtime	2020-10-22 15:22:56 +03:00
Kenneth Gitere	db11e78d8c	Add template for epub output Change output format to name file with the title name Add getters in MetaData	2020-10-22 13:55:02 +03:00
Kenneth Gitere	703de7e3bf	Merge the readability module with the rest of the extractor	2020-10-22 12:12:30 +03:00
Kenneth Gitere	679bf3cb04	Add logic for attempting different rounds for content extraction with different flags set Add additional test in `fix_relative_uris`	2020-10-22 11:50:34 +03:00
Kenneth Gitere	a0f69ccf80	Fix bug in `is_probably_visible` Add fix in `grab_article` when appending nodes. This internally detaches children so it can end up running only once	2020-10-22 11:37:02 +03:00
Kenneth Gitere	a94798cc95	Add flags for conditional cleaning and removal of nodes This also includes updating the function signatures of the affected methods	2020-10-22 08:24:46 +03:00
Kenneth Gitere	f17c9bfbc9	Add bug fixes for overflows in subtraction, giving a default for capture groups and in extracting nodes. Add fix in `is_probably_visible`	2020-10-21 20:48:21 +03:00
Kenneth Gitere	350447d1c4	Change calls on replacing regexes to `replace_all` Add `fix_relative_uris`, `clean_classes`, `clean_readability_attrs` and `post_process_content`	2020-10-21 19:55:22 +03:00
Kenneth Gitere	aacb442b7a	Move MetaAttr to `moz_readability` and rename to `MetaData` Add get_article_metadata, get_article_title and unescape_html_entities and their tests	2020-10-20 22:27:40 +03:00
Kenneth Gitere	d99b1c687b	Fix counting of h2 nodes in prep_article Add test for prep_article	2020-10-20 10:13:34 +03:00
Kenneth Gitere	94fa8db218	Fix bug in deletion of multiple nodes. When calling `detach` in a for loop or `for_each` iterator consumer, only the first node is ever deleted. Fix replacement of table nodes in prep_article Edit clean_conditionally to remove unnecessary assignment.	2020-10-20 10:04:12 +03:00
Kenneth Gitere	ccdbbb5a16	Add initial implementation of `grabArticle` Change function signature of setNodeTag to return a NodeRef Minor fix in clean, clean_headers and clean_conditionally	2020-10-20 07:42:32 +03:00
Kenneth Gitere	3254064c0d	Fix calls to `select` to return an iterator excluding the original calling node. Edit `next_element` to either return an element node only or element/ text node	2020-10-17 07:13:39 +03:00
Kenneth Gitere	6377c01fb3	Add tests for `clean_conditionally` and `fix_lazy_images` Minor refactor in `fix_lazy_images` Fix incorrect boolean expression and bug in element node name comparison in `clean_conditionally`	2020-10-16 08:03:01 +03:00
Kenneth Gitere	78d6e16618	Add unit tests for `clean`, `clean_styles`, `clean_headers` and `clean_matched_nodes` Add missing function calls in `prep_article`	2020-10-16 08:00:47 +03:00
Kenneth Gitere	b661211f0f	Refactored code to use regexes from regexes module Extracted constants from the code for easier reusability in some cases. Change select queries for multiple elements to use the `,` operator instead of calling `chain`. Remove check for "null" in `fix_lazy_images`. This mitigates a JSDOM issue so it doesn't affect the Rust code in any way.	2020-10-15 22:45:18 +03:00
Kenneth Gitere	75018894ae	Add regexes module in moz_readability that contains the regular expressions used. For optimal performance, the regular expressions are compiled to static values to prevent recompiling in loops	2020-10-15 22:25:10 +03:00
Kenneth Gitere	d2bd31dc47	Add helper functions for the grabArticle function	2020-10-07 20:46:08 +03:00
Kenneth Gitere	87ff21b676	Add regex and lazy_static crates	2020-10-07 20:44:35 +03:00
Kenneth Gitere	7219198524	Change function signature of `next_element` to return an Option rather than mutate a given value. The new function signature reads a little easier than before. Remove TODO task in replace_brs	2020-09-23 22:52:07 +03:00
Kenneth Gitere	7fb09130e8	Add calls to remove_scripts and prep_document	2020-08-31 20:40:37 +03:00
Kenneth Gitere	e1debf5630	Add moz_readability initial code and accompanying unit tests This currently contains the preprocessing code of the Readability. It is a port of Readability.js by Mozilla.	2020-08-31 19:30:09 +03:00
Kenneth Gitere	a27e45b5f3	Merge branch 'master' into dev	2020-05-16 10:35:47 +03:00
Kenneth Gitere	5e7cf7ddfe	Fixed img resolving bug	2020-05-16 10:32:36 +03:00
Kenneth Gitere	6dab011cac	Fixed img resolving bug	2020-05-16 10:22:49 +03:00
Kenneth Gitere	9f56c58dd9	Add simple CLI wrapper	2020-05-16 10:09:44 +03:00
Kenneth Gitere	c30d5f732e	Fix test data	2020-05-06 14:01:49 +03:00
Kenneth Gitere	271d3c8951	Change download code to save images to a folder Add downloaded images to the output epub file	2020-05-05 12:24:11 +03:00
Kenneth Gitere	f02973157d	Refactor downloading code to download images in parallel	2020-05-05 09:40:44 +03:00
Kenneth Gitere	4e8812c1ee	Add first attempt to save an epub file	2020-05-02 19:25:31 +03:00
Kenneth Gitere	e5a318282d	Update img tags with new src values to point to the local files	2020-05-02 19:06:03 +03:00
Kenneth Gitere	78ba40f57a	Add image download functionality	2020-05-02 18:33:45 +03:00
Kenneth Gitere	f24e72e70f	Change signature of `extract_content` to copy the reference to article DOM node instead of writing to file	2020-05-02 14:51:53 +03:00
Kenneth Gitere	529704d227	Add test for extract content	2020-05-01 20:42:41 +03:00
Kenneth Gitere	b5336e078d	Factor out text extraction into extractor module	2020-05-01 16:17:59 +03:00

52 commits