paperoni

Archived

Author	SHA1	Message	Date
Kenneth Gitere	2f4da824ba	feat: add HTML exports with inlining of images; fix: typo fix; refactor: refactor `add_stylesheets` function	2021-07-24 12:08:18 +03:00
Kenneth Gitere	c6c10689eb	fix: fix broken links in ToC generation. The fix involves ensuring the ToC is generated prior to serialization, because ToC generation mutates the document and will not work otherwise. chore: add .vscode config to .gitignore	2021-06-16 18:09:05 +03:00
Kenneth Gitere	282d229754	fix: fix ordering issue with merged articles. This commit adds the itertools crate, which is used to dedup the Vec when downloading URLs. fix: fix error message; feat: change the serif and mono font declarations	2021-06-11 14:21:41 +03:00
Kenneth Gitere	4247fab1ea	feat: add css library for EPUB exports	2021-06-09 08:04:50 +03:00
Kenneth Gitere	d50bbdfb58	fix: minor fixes: restore default debug level when logging to file; return early from generating EPUBs if there are no articles; fix serialization bug in creating attributes	2021-06-09 07:26:52 +03:00
Kenneth Gitere	b496abb576	Fix serialization issue with poorly defined attribute names	2021-04-22 19:00:32 +03:00
Kenneth Gitere	dbac7c3b69	Refactor `grab_article` to return a Result. Add ReadabilityError field. Refactor `article` getter in Extractor to return a &NodeRef; this relies on the assumption that the article has already been parsed and should otherwise panic.	2021-04-21 19:11:57 +03:00
Kenneth Gitere	60fb30e8a2	Add url field in Extractor struct	2021-04-20 21:06:54 +03:00
Kenneth Gitere	7e9dcfc2b7	Add custom error types and ignore failed image downloads. Using this custom error type, many instances of unwrap are replaced with mapping to errors that are then logged in main.rs. This allows paperoni to stop crashing when downloading articles when the errors are possibly recoverable or should not affect other downloads. As a consequence, failed image downloads are now ignored and the original URLs are left intact.	2021-04-17 12:04:06 +03:00
Kenneth Gitere	b402472ba6	Add http and epub modules	2021-02-06 12:59:03 +03:00
Kenneth Gitere	1dc7b3432b	Bug fixes: stop `<html>` nodes being added to the replaced image when `unwrap_noscript_tags` is called; remove the `srcset` attribute of `<img>` tags after downloading the image, since it prevented readers like Foliate from displaying the downloaded image	2021-01-12 10:27:46 +03:00
Kenneth Gitere	8407c613df	Bug fixes: prevent downloading images with base64 strings as the source; add escaping of quotation characters in the serializer; disable redirects when downloading images which fails on multiple sites; remove invalid characters when making the EPUB export file name; fix version number in release	2020-12-24 14:03:36 +03:00
Kenneth Gitere	725c73c83f	Add basic redirect provided by surf and early exit of the program if the response is not a 200	2020-11-24 18:31:16 +03:00
Kenneth Gitere	5f99bddc10	Add custom serializer for XHTML	2020-11-24 14:54:23 +03:00
Kenneth Gitere	ef3efdba81	Refactor to use temp directory and update surf. Change from using res directory for image downloads to using temp directories. Update surf to v2, which required changing the way Content-Type headers are read.	2020-11-23 13:38:58 +03:00
Kenneth Gitere	db11e78d8c	Add template for epub output. Change output format to name file with the title name. Add getters in MetaData	2020-10-22 13:55:02 +03:00
Kenneth Gitere	703de7e3bf	Merge the readability module with the rest of the extractor	2020-10-22 12:12:30 +03:00
Kenneth Gitere	aacb442b7a	Move MetaAttr to `moz_readability` and rename to `MetaData`. Add get_article_metadata, get_article_title and unescape_html_entities and their tests	2020-10-20 22:27:40 +03:00
Kenneth Gitere	6dab011cac	Fixed img resolving bug	2020-05-16 10:22:49 +03:00
Kenneth Gitere	c30d5f732e	Fix test data	2020-05-06 14:01:49 +03:00
Kenneth Gitere	271d3c8951	Change download code to save images to a folder. Add downloaded images to the output epub file	2020-05-05 12:24:11 +03:00
Kenneth Gitere	f02973157d	Refactor downloading code to download images in parallel	2020-05-05 09:40:44 +03:00
Kenneth Gitere	e5a318282d	Update img tags with new src values to point to the local files	2020-05-02 19:06:03 +03:00
Kenneth Gitere	78ba40f57a	Add image download functionality	2020-05-02 18:33:45 +03:00
Kenneth Gitere	f24e72e70f	Change signature of `extract_content` to copy the reference to article DOM node instead of writing to file	2020-05-02 14:51:53 +03:00
Kenneth Gitere	529704d227	Add test for extract content	2020-05-01 20:42:41 +03:00
Kenneth Gitere	b5336e078d	Factor out text extraction into extractor module	2020-05-01 16:17:59 +03:00

27 commits
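
Commit 7e9dcfc2b7 describes replacing `unwrap` calls with a custom error type, and keeping the original image URL when a download fails. The sketch below illustrates that pattern only; it is not paperoni's code. The names `DownloadError`, `fetch_image` and `localize_images` are hypothetical, and the network request is replaced by a stand-in function so the example uses nothing outside the standard library.

```rust
use std::fmt;

/// Hypothetical error type for this sketch; paperoni's real error type
/// is not shown in the commit log.
#[derive(Debug, PartialEq)]
enum DownloadError {
    Http(u16),
    InvalidUrl(String),
}

impl fmt::Display for DownloadError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DownloadError::Http(status) => write!(f, "unexpected HTTP status {}", status),
            DownloadError::InvalidUrl(url) => write!(f, "invalid URL: {}", url),
        }
    }
}

impl std::error::Error for DownloadError {}

/// Stand-in for a network fetch (assumption: no real request is made).
/// On success it returns the local path the image would be saved to.
fn fetch_image(url: &str) -> Result<String, DownloadError> {
    if !url.starts_with("http") {
        return Err(DownloadError::InvalidUrl(url.to_string()));
    }
    if url.contains("missing") {
        // Simulate a server that answers 404 for this image.
        return Err(DownloadError::Http(404));
    }
    // Use the last path segment as the local file name.
    let name = url.rsplit('/').next().unwrap_or("image");
    Ok(format!("images/{}", name))
}

/// Returns the `src` value to use for each image, plus the errors that
/// were collected along the way. A failed download does not abort the
/// loop: the original URL is kept and the error is handed back so the
/// caller can log it.
fn localize_images(urls: &[&str]) -> (Vec<String>, Vec<DownloadError>) {
    let mut srcs = Vec::new();
    let mut errors = Vec::new();
    for url in urls {
        match fetch_image(url) {
            Ok(path) => srcs.push(path),
            Err(err) => {
                errors.push(err);
                srcs.push(url.to_string());
            }
        }
    }
    (srcs, errors)
}
```

A caller would pass the image URLs found in an article to `localize_images`, rewrite each `src` with the returned values, and log the returned errors instead of panicking.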