
151 commits

Author SHA1 Message Date
KOVACS Tamas
7649f6aa18 moz_readability/mod.rs: fix laziness check in fix_lazy_images
fix_lazy_images checks whether an img node is lazily loaded. An img is
considered lazily loaded if it does not have an src/srcset attribute, or
if its class contains the 'lazy' string. If an img is considered lazy,
fix_lazy_images will attempt to replace its src.

However, if an img was missing the class attribute, it was incorrectly
assumed to be lazy and had its src replaced.

Fixes hipstermojo/paperoni#13
2021-05-10 10:08:33 +02:00
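
The laziness rule described in commit 7649f6aa18 can be sketched roughly as below. This is a minimal illustration and not the code in moz_readability/mod.rs: the `Img` struct and `is_lazy` function are hypothetical stand-ins for the real DOM node handling, and reading "does not have an src/srcset attribute" as "has neither" is an assumption.

```rust
/// Hypothetical, simplified stand-in for the attributes of an <img> node.
struct Img<'a> {
    src: Option<&'a str>,
    srcset: Option<&'a str>,
    class: Option<&'a str>,
}

/// Returns true when the image should be treated as lazily loaded.
/// Assumption: "no src/srcset" means that both attributes are absent.
fn is_lazy(img: &Img<'_>) -> bool {
    let missing_source = img.src.is_none() && img.srcset.is_none();
    // A missing class attribute must NOT count as lazy; only a class
    // that actually contains "lazy" does.
    let lazy_class = img
        .class
        .map_or(false, |class| class.to_lowercase().contains("lazy"));
    missing_source || lazy_class
}
```

With this shape, an image that has a `src` but no `class` attribute is left alone, which is the case the commit message says was previously mishandled.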
KOVACS Tamas
d50f08b875 moz_readability/mod.rs: add testcase for issue #13
This patch adds a testcase for issue #13, where an img node without
a class attribute is automatically assumed to be lazy and its src is
replaced.
2021-05-10 10:08:25 +02:00
Kenneth Gitere
312dff95e2
Merge pull request #12 from kxt/11-image-status-codes
Check response status for fetched images
2021-05-10 10:58:23 +03:00
KOVACS Tamas
8ec491ff06 http.rs: check response status for fetched images
This patch checks if fetching an image resulted in a non-success status
code. In case of non-success status, the response is discarded and an
error is emitted.

This relies on having 3xx codes already handled by surf's Redirect
middleware, so we should see 4xx and 5xx codes here.

Fixes hipstermojo/paperoni#11
2021-05-09 14:35:55 +02:00
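
The status handling described in commit 8ec491ff06 might look something like the sketch below. It deliberately avoids surf's types and works on a bare `u16`; `ImgFetchError` and `check_img_status` are hypothetical names for illustration only, and treating a stray 1xx or 3xx code as an error is an assumption.

```rust
/// Hypothetical error type; the real project defines its own error structs.
#[derive(Debug, PartialEq)]
enum ImgFetchError {
    Client(u16),
    Server(u16),
    Unexpected(u16),
}

/// Accepts only 2xx responses. Redirects (3xx) are assumed to have been
/// followed already by a redirect middleware, so in practice only 4xx and
/// 5xx codes should reach the error arms.
fn check_img_status(status: u16) -> Result<(), ImgFetchError> {
    match status {
        200..=299 => Ok(()),
        400..=499 => Err(ImgFetchError::Client(status)),
        500..=599 => Err(ImgFetchError::Server(status)),
        other => Err(ImgFetchError::Unexpected(other)),
    }
}
```

A caller would discard the response body and report the error whenever this returns `Err`, matching the behaviour the commit message describes.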
KOVACS Tamas
4581f07330 http.rs: extract process_img_response function 2021-05-08 21:32:15 +02:00
Kenneth Gitere
474d97c6bd
Merge pull request #10 from hipstermojo/dev
v0.4.0 release
2021-04-30 08:48:11 +03:00
Kenneth Gitere
538a65f6fd Update dependencies in lockfile 2021-04-30 08:34:09 +03:00
Kenneth Gitere
f93017ab73 Fix README formatting 2021-04-30 08:29:08 +03:00
Kenneth Gitere
4fd71311a1 Fix bug when validating the download file name in merged mode 2021-04-30 07:47:25 +03:00
Kenneth Gitere
cae9227ab0 Update documentation 2021-04-30 06:55:02 +03:00
Kenneth Gitere
c00582ac29 Fix verbosity levels ordering 2021-04-30 06:42:08 +03:00
Kenneth Gitere
ae52cc4e13 Add features for logging and cli
- display of partial downloads in the summary
- custom file name that is displayed after the summary ensuring it is visible
- log-to-file flag which specifies that logs will be sent to the default directory
- verbose flag (v) used to configure the log levels
- disabling the progress bars when logging to the terminal is active
2021-04-29 20:02:08 +03:00
Kenneth Gitere
00d704fdd6 Move initializing logger to logs module 2021-04-28 07:47:45 +03:00
Kenneth Gitere
36c3eb65c6 Add appendix page for listing the source of the article 2021-04-28 07:46:07 +03:00
Kenneth Gitere
088699b2c3 Add debug flag 2021-04-24 15:50:43 +03:00
Kenneth Gitere
a9787d7b5a Add colored output and configuring of a paperoni root directory for logs 2021-04-24 15:13:44 +03:00
Kenneth Gitere
65f8ebda56 Add logs crate for dealing with printing out the final download summary 2021-04-24 13:58:03 +03:00
Kenneth Gitere
a3de3fb6ff Add ImgError struct for representing errors in downloading article images 2021-04-24 13:57:06 +03:00
Kenneth Gitere
910c45abf7 Add logging configured to send to a file by default 2021-04-24 13:56:02 +03:00
Kenneth Gitere
c0323a6ae4 Minor refactor and add non zero exit upon failure to download any article
- Move printing of the successfully downloaded articles into main.rs
- Add summary text
2021-04-24 09:00:18 +03:00
Kenneth Gitere
b496abb576 Fix serialization issue with poorly defined attribute names 2021-04-22 19:00:32 +03:00
Kenneth Gitere
313041a109 Update dependencies and restore redirect middleware in download_images 2021-04-22 18:01:23 +03:00
Kenneth Gitere
960f114dc6 Minor fixes in moz_readability
- swap unwrap for if let statement in `get_article_metadata`
- add default when extracting the title from a possible `<title>` element
- fix extracting alternative titles from h1 tags
2021-04-21 19:52:41 +03:00
Kenneth Gitere
dbac7c3b69 Refactor grab_article to return a Result
- Add ReadabilityError field
- Refactor `article` getter in Extractor to return a &NodeRef. This
  relies on the assumption that the article has already been parsed
  and should otherwise panic.
2021-04-21 19:11:57 +03:00
Kenneth Gitere
ae1ddb9386 Add printing of table for failed article downloads
- Map errors in `fetch_html` to include the source url
- Change `article_link` to `article_source`
- Add `Into` conversion for `UTF8Error`
- Collect errors in `generate_epubs` for displaying in a table
2021-04-20 21:33:24 +03:00
Kenneth Gitere
60fb30e8a2 Add url field in Extractor struct 2021-04-20 21:06:54 +03:00
Kenneth Gitere
b217448601 Add printing of tables upon successful extraction 2021-04-20 14:02:56 +03:00
Kenneth Gitere
04a1eed4e2 Add progress indicators for the cli 2021-04-17 17:28:07 +03:00
Kenneth Gitere
217cd3e442 Minor refactor
Change cli to grab version from the Cargo manifest
Rename fetch_url to fetch_html
2021-04-17 12:37:53 +03:00
Kenneth Gitere
7e9dcfc2b7 Add custom error types and ignore failed image downloads
Using this custom error type, many instances of unwrap are replaced
with mapping to errors that are then logged in main.rs. This allows
paperoni to stop crashing when downloading articles when the errors
are possibly recoverable or should not affect other downloads.

This subsequently introduces ignoring the failed image downloads
and instead leaving the original URLs intact.
2021-04-17 12:04:06 +03:00
Kenneth Gitere
d6cbbe405b Fix bug in inline_css_str_to_map 2021-04-14 18:07:39 +03:00
Kenneth Gitere
2762bc5086
Merge pull request #7 from hipstermojo/dev
Update README
2021-02-24 13:28:56 +03:00
Kenneth Gitere
b8c0cf29f1 Update README 2021-02-24 13:27:43 +03:00
Kenneth Gitere
e9f96d2970
Merge pull request #6 from hipstermojo/dev
Update to 0.3.0
2021-02-24 13:13:36 +03:00
Kenneth Gitere
165b2187be Bump version 2021-02-24 13:03:52 +03:00
Kenneth Gitere
912bc9d915 Add flag for configuring maximum concurrent requests
Change printing macro for error messages to go out to stderr
2021-02-21 13:11:26 +03:00
Kenneth Gitere
b0c4c47413 Add support for merging articles into a single epub
This is still experimental as it lacks validation of the target file name
2021-02-11 13:51:21 +03:00
Kenneth Gitere
f0a610c2ac Bug fix with empty titles
The code for title retrieval previously assumed that meta tags concerned
with the title would always contain a value, but some sites leave the value
empty, so an empty value has to be checked for as well.
2021-02-09 12:56:07 +03:00
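
The empty-title check from commit f0a610c2ac could be sketched as follows. The function name and the slice-of-options input are hypothetical; the real code reads values from meta tags in the parsed document.

```rust
/// Picks the first title candidate that is present and not blank.
/// Each entry stands for the content of one title-related meta tag:
/// `None` when the tag is absent, `Some("")` when a site leaves it empty.
fn first_non_empty_title<'a>(candidates: &[Option<&'a str>]) -> Option<&'a str> {
    candidates
        .iter()
        .copied()
        .flatten() // drop absent tags
        .map(str::trim)
        .find(|value| !value.is_empty()) // drop empty or whitespace-only values
}
```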
Kenneth Gitere
65fdd967c1 Refactor image downloading and update README
Image downloading uses streams instead of spawned tasks to ensure that
it does not start an unbounded number of spawned tasks
2021-02-09 10:34:35 +03:00
Kenneth Gitere
003953332f Refactor downloading of HTML pages
This change allows for parallel downloads of HTML pages up to a maximum
number of concurrent HTTP requests, which is more efficient than
before, when all HTTP requests were likely to begin at the same time.
2021-02-06 17:06:03 +03:00
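
The idea behind commits 65fdd967c1 and 003953332f, capping how many requests run at once, can be illustrated with the sketch below. The commits themselves use async streams; this example swaps in plain standard-library threads pulling from a shared queue so that it stays dependency-free. `run_bounded` and its parameters are hypothetical names, and `work` stands in for an HTTP fetch.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Runs `work` over every job using at most `max_concurrent` workers and
/// returns the results in the original job order.
fn run_bounded<T, R, F>(jobs: Vec<T>, max_concurrent: usize, work: F) -> Vec<R>
where
    T: Send + 'static,
    R: Send + 'static,
    F: Fn(T) -> R + Send + Sync + 'static,
{
    // Tag each job with its index so the output order can be restored.
    let queue = Arc::new(Mutex::new(jobs.into_iter().enumerate().collect::<Vec<_>>()));
    let results = Arc::new(Mutex::new(Vec::new()));
    let work = Arc::new(work);

    let handles: Vec<_> = (0..max_concurrent.max(1))
        .map(|_| {
            let queue = Arc::clone(&queue);
            let results = Arc::clone(&results);
            let work = Arc::clone(&work);
            thread::spawn(move || loop {
                // The queue lock is released at the end of this statement,
                // so it is never held while a job is running.
                let next = queue.lock().unwrap().pop();
                match next {
                    Some((index, job)) => {
                        let output = work(job);
                        results.lock().unwrap().push((index, output));
                    }
                    None => break, // queue drained, worker exits
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    let mut results = results.lock().unwrap();
    results.sort_by_key(|(index, _)| *index);
    results.drain(..).map(|(_, output)| output).collect()
}
```

However many jobs are queued, no more than `max_concurrent` of them are in flight at any moment, which is the property both commit messages describe.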
Kenneth Gitere
6b62051942 Add replace_metadata_value function 2021-02-06 13:53:04 +03:00
Kenneth Gitere
b402472ba6 Add http and epub modules 2021-02-06 12:59:03 +03:00
Kenneth Gitere
08f847531f Remove empty lines when reading from an input file 2021-02-03 07:39:51 +03:00
Kenneth Gitere
3d56023592 Add -f flag for adding links from a file instead of needing to use cat 2021-02-01 11:31:24 +03:00
Kenneth Gitere
c82071a871
Merge pull request #5 from hipstermojo/dev
Merge 0.2.2-alpha-1
2021-01-24 18:00:50 +03:00
Kenneth Gitere
b98c0a69a6 Bump version 2021-01-24 17:54:33 +03:00
Kenneth Gitere
21c3ffd922 Refactor fetch_url
This adds:
- More validation of responses to ensure the HTML response is valid.
- Better handling of redirecting URLs which allows for fetching of
  links proxied to Medium.
2021-01-24 17:52:31 +03:00
Kenneth Gitere
1dc7b3432b Bug fixes
The bug fixes include:
- `<html>` nodes being added to the replaced image when `unwrap_noscript_tags`
  is called.
- Remove the `srcset` attribute of `<img>` tags after downloading the image. The
  leftover attribute prevented readers like Foliate from displaying the downloaded image
2021-01-12 10:27:46 +03:00
Kenneth Gitere
ca1f9e2800
Merge pull request #4 from hipstermojo/dev
Update to 0.2.1-alpha1
2020-12-24 14:11:42 +03:00
Kenneth Gitere
8407c613df Bug fixes
- Prevent downloading images with base64 strings as the source
- Add escaping of quotation characters in the serializer
- Disable redirects when downloading images which fails on multiple sites
- Remove invalid characters for making the epub export file name
- Fix version number in release
2020-12-24 14:03:36 +03:00
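
Two of the fixes listed in commit 8407c613df, skipping base64 image sources and cleaning the export file name, can be sketched as below. Both function names are hypothetical, and the set of characters treated as invalid is an assumption, since the commit message does not list them.

```rust
/// Returns true for inline `data:` sources (for example base64-encoded
/// images), which should not be sent to the downloader.
fn is_inline_data(src: &str) -> bool {
    src.trim_start()
        .get(..5)
        .map_or(false, |prefix| prefix.eq_ignore_ascii_case("data:"))
}

/// Removes characters that are commonly rejected in file names.
/// Assumption: the exact character set used by the project may differ.
fn sanitize_file_name(title: &str) -> String {
    title
        .chars()
        .filter(|c| {
            !matches!(c, '/' | '\\' | ':' | '*' | '?' | '"' | '<' | '>' | '|')
                && !c.is_control()
        })
        .collect()
}
```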