paperoni

Archived

Author	SHA1	Message	Date
Kenneth Gitere	5f99bddc10	Add custom serializer for XHTML	2020-11-24 14:54:23 +03:00
Kenneth Gitere	ef3efdba81	Refactor to use temp directory and update surf. Change from using res directory for image downloads to using temp directories. Update surf to v2 which required changing the way Content-Type headers are read from.	2020-11-23 13:38:58 +03:00
Kenneth Gitere	db11e78d8c	Add template for epub output. Change output format to name file with the title name. Add getters in MetaData.	2020-10-22 13:55:02 +03:00
Kenneth Gitere	703de7e3bf	Merge the readability module with the rest of the extractor	2020-10-22 12:12:30 +03:00
Kenneth Gitere	aacb442b7a	Move MetaAttr to `moz_readability` and rename to `MetaData`. Add get_article_metadata, get_article_title and unescape_html_entities and their tests.	2020-10-20 22:27:40 +03:00
Kenneth Gitere	6dab011cac	Fixed img resolving bug	2020-05-16 10:22:49 +03:00
Kenneth Gitere	c30d5f732e	Fix test data	2020-05-06 14:01:49 +03:00
Kenneth Gitere	271d3c8951	Change download code to save images to a folder. Add downloaded images to the output epub file.	2020-05-05 12:24:11 +03:00
Kenneth Gitere	f02973157d	Refactor downloading code to download images in parallel	2020-05-05 09:40:44 +03:00
Kenneth Gitere	e5a318282d	Update img tags with new src values to point to the local files	2020-05-02 19:06:03 +03:00
Kenneth Gitere	78ba40f57a	Add image download functionality	2020-05-02 18:33:45 +03:00
Kenneth Gitere	f24e72e70f	Change signature of `extract_content` to copy the reference to article DOM node instead of writing to file	2020-05-02 14:51:53 +03:00
Kenneth Gitere	529704d227	Add test for extract content	2020-05-01 20:42:41 +03:00
Kenneth Gitere	b5336e078d	Factor out text extraction into extractor module	2020-05-01 16:17:59 +03:00

14 commits