Commit graph

116 commits

Author SHA1 Message Date
Kenneth Gitere
b496abb576 Fix serialization issue with poorly defined attribute names 2021-04-22 19:00:32 +03:00
Kenneth Gitere
313041a109 Update dependencies and restore redirect middleware in download_images 2021-04-22 18:01:23 +03:00
Kenneth Gitere
960f114dc6 Minor fixes in moz_readability
- swap unwrap for if let statement in `get_article_metadata`
- add default when extracting the title from a possible `<title>` element
- fix extracting alternative titles from h1 tags
2021-04-21 19:52:41 +03:00
Kenneth Gitere
dbac7c3b69 Refactor grab_article to return a Result
- Add ReadabilityError field
- Refactor `article` getter in Extractor to return a &NodeRef. This
  relies on the assumption that the article has already been parsed
  and should otherwise panic.
2021-04-21 19:11:57 +03:00
Kenneth Gitere
ae1ddb9386 Add printing of table for failed article downloads
- Map errors in `fetch_html` to include the source url
- Change `article_link` to `article_source`
- Add `Into` conversion for `UTF8Error`
- Collect errors in `generate_epubs` for displaying in a table
2021-04-20 21:33:24 +03:00
Kenneth Gitere
60fb30e8a2 Add url field in Extractor struct 2021-04-20 21:06:54 +03:00
Kenneth Gitere
b217448601 Add printing of tables upon successful extraction 2021-04-20 14:02:56 +03:00
Kenneth Gitere
04a1eed4e2 Add progress indicators for the cli 2021-04-17 17:28:07 +03:00
Kenneth Gitere
217cd3e442 Minor refactor
Change cli to grab version from the Cargo manifest
Rename fetch_url to fetch_html
2021-04-17 12:37:53 +03:00
Kenneth Gitere
7e9dcfc2b7 Add custom error types and ignore failed image downloads
Using this custom error type, many instances of unwrap are replaced
with mapping to errors that are then logged in main.rs. This allows
paperoni to stop crashing when downloading articles when the errors
are possibly recoverable or should not affect other downloads.

This subsequently introduces ignoring the failed image downloads
and instead leaving the original URLs intact.
2021-04-17 12:04:06 +03:00
Kenneth Gitere
d6cbbe405b Fix bug in inline_css_str_to_map 2021-04-14 18:07:39 +03:00
Kenneth Gitere
165b2187be Bump version 2021-02-24 13:03:52 +03:00
Kenneth Gitere
912bc9d915 Add flag for configuring maximum concurrent requests
Change printing macro for error messages to go out to stderr
2021-02-21 13:11:26 +03:00
Kenneth Gitere
b0c4c47413 Add support for merging articles into a single epub
This is still experimental as it lacks validation of the target file name
2021-02-11 13:51:21 +03:00
Kenneth Gitere
f0a610c2ac Bug fix with empty titles
The code for title retrieval previously assumed that meta tags concerned
with the title would always contain a value, but some sites leave the value
empty, so an empty value has to be checked for as well.
2021-02-09 12:56:07 +03:00
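
The empty-title check described in this commit can be illustrated with a minimal sketch. It is not the project's actual code: `first_non_empty` and the candidate values are hypothetical, and the sketch only assumes that several title sources are tried in order.

```rust
/// Hypothetical helper: returns the first candidate that is present and
/// still non-empty after trimming whitespace.
fn first_non_empty<'a>(candidates: &[Option<&'a str>]) -> Option<&'a str> {
    candidates
        .iter()
        .copied()
        .flatten() // skip sources that are missing entirely
        .map(str::trim)
        .find(|value| !value.is_empty()) // skip sources that exist but are empty
}

fn main() {
    // A page whose title meta tag exists but carries an empty value.
    let meta_title = Some("");
    let other_meta_title = None;
    let title_element = Some("  An Actual Title ");

    let title = first_non_empty(&[meta_title, other_meta_title, title_element])
        .unwrap_or("Untitled");
    println!("{}", title);
}
```

Treating "present but empty" the same as "absent" lets the lookup fall through to the next source instead of producing a blank title.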
Kenneth Gitere
65fdd967c1 Refactor image downloading and update README
Image downloading uses streams instead of spawned tasks to ensure that
it does not start an unbounded number of spawned tasks
2021-02-09 10:34:35 +03:00
Kenneth Gitere
003953332f Refactor downloading of HTML pages
This change allows for parallel downloads of HTML pages up to a maximum
number of concurrent HTTP requests, which is more efficient than
before, when all HTTP requests were likely to begin at the same time.
2021-02-06 17:06:03 +03:00
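
The idea of capping the number of in-flight requests can be sketched with the standard library alone. This is an illustration rather than the commit's implementation: it swaps in scoped threads where the project works with async HTTP requests, and `run_bounded`, `max_conn` and the stand-in job are hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Runs `job` over every input with at most `max_conn` jobs in flight.
/// Results are returned in input order.
fn run_bounded<T, R, F>(inputs: &[T], max_conn: usize, job: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    let next = AtomicUsize::new(0); // index of the next unclaimed input
    let results = Mutex::new(Vec::with_capacity(inputs.len()));
    // Never start more workers than the limit or than there are inputs.
    let workers = max_conn.max(1).min(inputs.len());

    thread::scope(|scope| {
        for _ in 0..workers {
            scope.spawn(|| loop {
                // Each worker claims one input at a time until none remain.
                let i = next.fetch_add(1, Ordering::SeqCst);
                if i >= inputs.len() {
                    break;
                }
                let out = job(&inputs[i]);
                results.lock().unwrap().push((i, out));
            });
        }
    });

    let mut results = results.into_inner().unwrap();
    results.sort_by_key(|(i, _)| *i); // restore input order
    results.into_iter().map(|(_, out)| out).collect()
}

fn main() {
    let urls = ["https://example.com/a", "https://example.com/b"];
    // Stand-in for an HTTP fetch: report the length of each URL.
    let lengths = run_bounded(&urls, 2, |url| url.len());
    println!("{:?}", lengths);
}
```

Because each worker pulls the next input only after finishing its current one, the number of simultaneous jobs never exceeds the limit, no matter how many links are queued.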
Kenneth Gitere
6b62051942 Add replace_metadata_value function 2021-02-06 13:53:04 +03:00
Kenneth Gitere
b402472ba6 Add http and epub modules 2021-02-06 12:59:03 +03:00
Kenneth Gitere
08f847531f Remove empty lines when reading from an input file 2021-02-03 07:39:51 +03:00
Kenneth Gitere
3d56023592 Add -f flag for adding links from a file instead of needing to use cat 2021-02-01 11:31:24 +03:00
Kenneth Gitere
b98c0a69a6 Bump version 2021-01-24 17:54:33 +03:00
Kenneth Gitere
21c3ffd922 Refactor fetch_url
This adds:
- More validation of responses to ensure the HTML response is valid.
- Better handling of redirecting URLs which allows for fetching of
  links proxied to Medium.
2021-01-24 17:52:31 +03:00
Kenneth Gitere
1dc7b3432b Bug fixes
The bug fixes include:
- `<html>` nodes being added to the replaced image when `unwrap_noscript_tags`
  is called.
- Remove the `srcset` attribute of `<img>` tags after downloading the image. The
  attribute prevented readers like Foliate from displaying the downloaded image.
2021-01-12 10:27:46 +03:00
Kenneth Gitere
8407c613df Bug fixes
- Prevent downloading images with base64 strings as the source
- Add escaping of quotation characters in the serializer
- Disable redirects when downloading images which fails on multiple sites
- Remove invalid characters for making the epub export file name
- Fix version number in release
2020-12-24 14:03:36 +03:00
Kenneth Gitere
725c73c83f Add basic redirect provided by surf and early exit of the program if the response is not a 200 2020-11-24 18:31:16 +03:00
Kenneth Gitere
5f99bddc10 Add custom serializer for XHTML 2020-11-24 14:54:23 +03:00
Kenneth Gitere
37cb4e1fd2 Change from structopt to clap
This allows printing the help message if no args are passed
2020-11-24 09:58:50 +03:00
Kenneth Gitere
cdfbc2b3f6 Refactor inline_css_str_to_map to use a better tokenizer 2020-11-24 08:29:00 +03:00
Kenneth Gitere
aff4054ca9 Update crates and fix bugs
The bug fixes are for:
- <base> elements with "/" as the href
- articles containing an ampersand in the title which would create
  corrupted manifest files.
2020-11-23 15:55:58 +03:00
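
A raw ampersand breaks an XML manifest because `&` begins an entity reference. A minimal sketch of escaping a title before writing it into XML follows. `escape_xml` is a hypothetical helper, not necessarily how the project fixed the bug, and the `<dc:title>` element is used only as an illustrative target.

```rust
/// Hypothetical helper: escapes the five characters that XML reserves.
fn escape_xml(text: &str) -> String {
    let mut escaped = String::with_capacity(text.len());
    for ch in text.chars() {
        match ch {
            '&' => escaped.push_str("&amp;"),
            '<' => escaped.push_str("&lt;"),
            '>' => escaped.push_str("&gt;"),
            '"' => escaped.push_str("&quot;"),
            '\'' => escaped.push_str("&apos;"),
            other => escaped.push(other),
        }
    }
    escaped
}

fn main() {
    let title = "Cats & Dogs";
    // prints "<dc:title>Cats &amp; Dogs</dc:title>"
    println!("<dc:title>{}</dc:title>", escape_xml(title));
}
```

Note that this sketch escapes every ampersand, so input that is already escaped would be escaped a second time.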
Kenneth Gitere
ef3efdba81 Refactor to use temp directory and update surf
Change from using res directory for image downloads to using temp directories.
Update surf to v2, which required changing how Content-Type headers are
read.
2020-11-23 13:38:58 +03:00
Kenneth Gitere
ab800d0174 Bug fix and add printing of the name of the extracted EPUB
The fix prevents creating the res directory if it already exists
2020-11-23 09:06:13 +03:00
Kenneth Gitere
be48cc1e47 Fix alignment in README
Update manifest file
Add fix in serialized file to have self closing tags which is invalid
xhtml
2020-10-22 19:18:18 +03:00
Kenneth Gitere
1b4c4ee658 Change CLI option to allow for multiple arguments
Add basic looping in async runtime
2020-10-22 15:22:56 +03:00
Kenneth Gitere
db11e78d8c Add template for epub output
Change output format to name file with the title name
Add getters in MetaData
2020-10-22 13:55:02 +03:00
Kenneth Gitere
703de7e3bf Merge the readability module with the rest of the extractor 2020-10-22 12:12:30 +03:00
Kenneth Gitere
679bf3cb04 Add logic for attempting different rounds for content extraction
with different flags set

Add additional test in `fix_relative_uris`
2020-10-22 11:50:34 +03:00
Kenneth Gitere
a0f69ccf80 Fix bug in is_probably_visible
Add fix in `grab_article` when appending nodes. This internally
detaches children so it can end up running only once
2020-10-22 11:37:02 +03:00
Kenneth Gitere
a94798cc95 Add flags for conditional cleaning and removal of nodes
This also includes updating the function signatures of the affected
methods
2020-10-22 08:24:46 +03:00
Kenneth Gitere
f17c9bfbc9 Add bug fixes for overflows in subtraction, giving a default for
capture groups and in extracting nodes. Add fix in `is_probably_visible`
2020-10-21 20:48:21 +03:00
Kenneth Gitere
350447d1c4 Change calls on replacing regexes to replace_all
Add `fix_relative_uris`, `clean_classes`, `clean_readability_attrs`
and `post_process_content`
2020-10-21 19:55:22 +03:00
Kenneth Gitere
aacb442b7a Move MetaAttr to moz_readability and rename to MetaData
Add get_article_metadata, get_article_title and unescape_html_entities
and their tests
2020-10-20 22:27:40 +03:00
Kenneth Gitere
d99b1c687b Fix counting of h2 nodes in prep_article
Add test for prep_article
2020-10-20 10:13:34 +03:00
Kenneth Gitere
94fa8db218 Fix bug in deletion of multiple nodes.
When calling `detach` in a for loop or `for_each` iterator consumer,
only the first node is ever deleted.

Fix replacement of table nodes in prep_article
Edit clean_conditionally to remove unnecessary assignment.
2020-10-20 10:04:12 +03:00
Kenneth Gitere
ccdbbb5a16 Add initial implementation of grabArticle
Change function signature of setNodeTag to return a NodeRef

Minor fix in clean, clean_headers and clean_conditionally
2020-10-20 07:42:32 +03:00
Kenneth Gitere
3254064c0d Fix calls to select to return an iterator excluding the original
calling node.

Edit `next_element` to either return an element node only or element/
text node
2020-10-17 07:13:39 +03:00
Kenneth Gitere
6377c01fb3 Add tests for clean_conditionally and fix_lazy_images
Minor refactor in `fix_lazy_images`
Fix incorrect boolean expression and bug in element node name comparison
in `clean_conditionally`
2020-10-16 08:03:01 +03:00
Kenneth Gitere
78d6e16618 Add unit tests for clean, clean_styles, clean_headers and
`clean_matched_nodes`

Add missing function calls in `prep_article`
2020-10-16 08:00:47 +03:00
Kenneth Gitere
b661211f0f Refactored code to use regexes from regexes module
Extracted constants from the code for easier reusability in some cases.
Change select queries for multiple elements to use the `,` operator
instead of calling `chain`.

Remove check for "null" in `fix_lazy_images`. The check mitigates a JSDOM
issue, so it doesn't affect the Rust code in any way.
2020-10-15 22:45:18 +03:00
Kenneth Gitere
75018894ae Add regexes module in moz_readability that contains the regular
expressions used. For optimal performance, the regular expressions
are compiled to static values to prevent recompiling in loops
2020-10-15 22:25:10 +03:00
Kenneth Gitere
d2bd31dc47 Add helper functions for the grabArticle function 2020-10-07 20:46:08 +03:00
Kenneth Gitere
7219198524 Change function signature of next_element to return an Option
rather than mutate a given value.

The new function signature reads a little easier than before.
Remove TODO task in replace_brs
2020-09-23 22:52:07 +03:00
Kenneth Gitere
7fb09130e8 Add calls to remove_scripts and prep_document 2020-08-31 20:40:37 +03:00
Kenneth Gitere
e1debf5630 Add moz_readability initial code and accompanying unit tests
This currently contains the preprocessing code of the Readability.
It is a port of Readability.js by Mozilla.
2020-08-31 19:30:09 +03:00
Kenneth Gitere
6dab011cac Fixed img resolving bug 2020-05-16 10:22:49 +03:00
Kenneth Gitere
9f56c58dd9 Add simple CLI wrapper 2020-05-16 10:09:44 +03:00
Kenneth Gitere
c30d5f732e Fix test data 2020-05-06 14:01:49 +03:00
Kenneth Gitere
271d3c8951 Change download code to save images to a folder
Add downloaded images to the output epub file
2020-05-05 12:24:11 +03:00
Kenneth Gitere
f02973157d Refactor downloading code to download images in parallel 2020-05-05 09:40:44 +03:00
Kenneth Gitere
4e8812c1ee Add first attempt to save an epub file 2020-05-02 19:25:31 +03:00
Kenneth Gitere
e5a318282d Update img tags with new src values to point to the local files 2020-05-02 19:06:03 +03:00
Kenneth Gitere
78ba40f57a Add image download functionality 2020-05-02 18:33:45 +03:00
Kenneth Gitere
f24e72e70f Change signature of extract_content to copy the reference to article DOM
node instead of writing to file
2020-05-02 14:51:53 +03:00
Kenneth Gitere
529704d227 Add test for extract content 2020-05-01 20:42:41 +03:00
Kenneth Gitere
b5336e078d Factor out text extraction into extractor module 2020-05-01 16:17:59 +03:00
Kenneth Gitere
4527fb07d9 Initial extraction code to get meta information on a blog 2020-04-30 11:05:53 +03:00