Commit graph

29 commits

Author SHA1 Message Date
Kenneth Gitere
07479afeac refactor: refactor update_imgs_base64
chore: add doc comment on ResourceType alias

fix: add error when image MIME type is invalid on an image
2021-07-28 10:00:45 +03:00
Kenneth Gitere
e6f901eb5a refactor: rename Extractor to Article 2021-07-24 12:43:40 +03:00
Kenneth Gitere
2f4da824ba feat: add HTML exports with inlining of images
fix: typo fix
refactor: refactor `add_stylesheets` function
2021-07-24 12:08:18 +03:00
Kenneth Gitere
c6c10689eb fix: fix broken links in toc generation
The fix ensures the ToC is generated before serialization, because
ToC generation mutates the document and would not work otherwise.

chore: add .vscode config to .gitignore
2021-06-16 18:09:05 +03:00
Kenneth Gitere
282d229754 fix: fix ordering issue with merged articles
This commit adds the itertools crate, which is used to deduplicate the
Vec when downloading URLs.

fix: fix error message
feat: change the serif and mono fonts declarations
2021-06-11 14:21:41 +03:00
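
The deduplication mentioned in this commit can be illustrated with a standard-library sketch. The commit does not say which itertools function is used, so this is a swapped-in technique (a `HashSet` filter) and not the project's code; the function name `dedup_preserving_order` is hypothetical. It keeps the first occurrence of each URL, so the order in which articles were requested is preserved.

```rust
use std::collections::HashSet;

/// Hypothetical helper: drops repeated URLs while keeping the first
/// occurrence of each one in its original position.
fn dedup_preserving_order(urls: Vec<String>) -> Vec<String> {
    let mut seen = HashSet::new();
    urls.into_iter()
        // `insert` returns false when the value was already present,
        // so only first occurrences pass the filter.
        .filter(|url| seen.insert(url.clone()))
        .collect()
}
```

Unlike `Vec::dedup`, which only removes consecutive duplicates, this approach also removes repeats that are not adjacent.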
Kenneth Gitere
4247fab1ea feat: add css library for EPUB exports 2021-06-09 08:04:50 +03:00
Kenneth Gitere
d50bbdfb58 fix: minor fixes
- restore default debug level when logging to file
- return early from generating epubs if there are no articles
- fix serialization bug in creating attributes
2021-06-09 07:26:52 +03:00
Kenneth Gitere
b496abb576 Fix serialization issue with poorly defined attribute names 2021-04-22 19:00:32 +03:00
Kenneth Gitere
dbac7c3b69 Refactor grab_article to return a Result
- Add ReadabilityError field
- Refactor `article` getter in Extractor to return a &NodeRef. This
  relies on the assumption that the article has already been parsed;
  the getter panics otherwise.
2021-04-21 19:11:57 +03:00
Kenneth Gitere
60fb30e8a2 Add url field in Extractor struct 2021-04-20 21:06:54 +03:00
Kenneth Gitere
7e9dcfc2b7 Add custom error types and ignore failed image downloads
Using this custom error type, many instances of unwrap are replaced
with mapping to errors that are then logged in main.rs. This allows
paperoni to stop crashing when downloading articles when the errors
are possibly recoverable or should not affect other downloads.

As a consequence, failed image downloads are now ignored and the
original URLs are left intact.
2021-04-17 12:04:06 +03:00
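
The pattern described in this commit (a custom error type in place of `unwrap`, with failed image downloads falling back to the original URL) might look roughly like the sketch below. Everything here is an assumption made for illustration: `DownloadError`, `fake_download` and `resolve_image_sources` are hypothetical names, and the download itself is simulated so that the example runs without network access.

```rust
use std::fmt;

/// Hypothetical error type; not the one defined in paperoni.
#[derive(Debug)]
enum DownloadError {
    Http(u16),
    InvalidUrl(String),
}

impl fmt::Display for DownloadError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DownloadError::Http(code) => write!(f, "unexpected HTTP status {}", code),
            DownloadError::InvalidUrl(url) => write!(f, "invalid URL: {}", url),
        }
    }
}

impl std::error::Error for DownloadError {}

/// Simulated download: `.png` URLs succeed and yield a local path,
/// non-HTTP URLs are rejected, anything else pretends to be a 404.
fn fake_download(url: &str) -> Result<String, DownloadError> {
    if !url.starts_with("http") {
        return Err(DownloadError::InvalidUrl(url.to_string()));
    }
    if url.ends_with(".png") {
        let name = url.rsplit('/').next().unwrap_or("image.png");
        Ok(format!("local/{}", name))
    } else {
        Err(DownloadError::Http(404))
    }
}

/// Returns the `src` value to use for each image: the local path on
/// success, the original URL on failure. Errors are collected so the
/// caller can log them instead of crashing.
fn resolve_image_sources(urls: &[&str]) -> (Vec<String>, Vec<DownloadError>) {
    let mut sources = Vec::with_capacity(urls.len());
    let mut errors = Vec::new();
    for url in urls {
        match fake_download(url) {
            Ok(path) => sources.push(path),
            Err(e) => {
                errors.push(e);
                sources.push((*url).to_string());
            }
        }
    }
    (sources, errors)
}
```

Returning the errors alongside the results keeps the logging decision with the caller, which matches the commit's note that errors are logged in main.rs.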
Kenneth Gitere
b402472ba6 Add http and epub modules 2021-02-06 12:59:03 +03:00
Kenneth Gitere
1dc7b3432b Bug fixes
The bug fixes include:
- `<html>` nodes being added to the replaced image when `unwrap_noscript_tags`
  is called.
- Remove `srcset` attribute of `<img>` tags after downloading the image. The
  attribute prevented readers like Foliate from displaying the downloaded image.
2021-01-12 10:27:46 +03:00
Kenneth Gitere
8407c613df Bug fixes
- Prevent downloading images with base64 strings as the source
- Add escaping of quotation characters in the serializer
- Disable redirects when downloading images which fails on multiple sites
- Remove invalid characters for making the epub export file name
- Fix version number in release
2020-12-24 14:03:36 +03:00
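
Two of the fixes listed in this commit can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `is_inline_image` and `sanitize_file_name` are hypothetical, and the set of characters treated as invalid is a guess at what common file systems reject, not the set used by the project.

```rust
/// Hypothetical check: an image whose `src` is a data URI (for example
/// a base64 string) is already inline and should not be downloaded.
fn is_inline_image(src: &str) -> bool {
    src.trim_start().starts_with("data:")
}

/// Hypothetical helper: removes characters that are commonly invalid
/// in file names (assumed set) before the title is used as the name
/// of the exported file.
fn sanitize_file_name(title: &str) -> String {
    title
        .chars()
        .filter(|c| {
            !matches!(c, '/' | '\\' | ':' | '*' | '?' | '"' | '<' | '>' | '|')
                && !c.is_control()
        })
        .collect()
}
```

A caller would skip the download step for any `src` where `is_inline_image` returns true and pass the article title through `sanitize_file_name` before building the output path.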
Kenneth Gitere
725c73c83f Add basic redirect provided by surf and early exit of the program if the response is not a 200 2020-11-24 18:31:16 +03:00
Kenneth Gitere
5f99bddc10 Add custom serializer for XHTML 2020-11-24 14:54:23 +03:00
Kenneth Gitere
ef3efdba81 Refactor to use temp directory and update surf
Change from using res directory for image downloads to using temp directories.
Update surf to v2, which required changing the way Content-Type headers
are read.
2020-11-23 13:38:58 +03:00
Kenneth Gitere
db11e78d8c Add template for epub output
Change output format to name file with the title name
Add getters in MetaData
2020-10-22 13:55:02 +03:00
Kenneth Gitere
703de7e3bf Merge the readability module with the rest of the extractor 2020-10-22 12:12:30 +03:00
Kenneth Gitere
aacb442b7a Move MetaAttr to moz_readability and rename to MetaData
Add get_article_metadata, get_article_title and unescape_html_entities
and their tests
2020-10-20 22:27:40 +03:00
Kenneth Gitere
6dab011cac Fixed img resolving bug 2020-05-16 10:22:49 +03:00
Kenneth Gitere
c30d5f732e Fix test data 2020-05-06 14:01:49 +03:00
Kenneth Gitere
271d3c8951 Change download code to save images to a folder
Add downloaded images to the output epub file
2020-05-05 12:24:11 +03:00
Kenneth Gitere
f02973157d Refactor downloading code to download images in parallel 2020-05-05 09:40:44 +03:00
Kenneth Gitere
e5a318282d Update img tags with new src values to point to the local files 2020-05-02 19:06:03 +03:00
Kenneth Gitere
78ba40f57a Add image download functionality 2020-05-02 18:33:45 +03:00
Kenneth Gitere
f24e72e70f Change signature of extract_content to copy the reference to article DOM
node instead of writing to file
2020-05-02 14:51:53 +03:00
Kenneth Gitere
529704d227 Add test for extract content 2020-05-01 20:42:41 +03:00
Kenneth Gitere
b5336e078d Factor out text extraction into extractor module 2020-05-01 16:17:59 +03:00