Changes from all commits (46 commits)
563c140
Perform 2to3
ohyou Jun 19, 2016
56b361d
Make it work as a module
ohyou Jun 19, 2016
2a4477e
Remove unnecessary output
ohyou Jun 20, 2016
60f0856
Fix errors when downloading from gfycat
ohyou Jul 1, 2016
236a592
Handle some errors
ohyou Jul 3, 2016
8c1311f
update .gitignore
jtara1 Aug 25, 2016
ab18926
Merge https://github.com/ohyou/RedditImageGrab
jtara1 Aug 25, 2016
bf02892
downloads imgur imgs using jtara1/imgur-album-downloader
jtara1 Aug 25, 2016
0edcfb3
fix func names & update .gitignore
jtara1 Aug 25, 2016
75f4c8b
fix slugify func so --filename-format arg works
jtara1 Aug 26, 2016
3d29e43
added func to make cleaner filename, update readme
jtara1 Aug 26, 2016
1758e32
updates & fixes, func process_deviant_url works, & ImgurDownloader up…
jtara1 Aug 30, 2016
9258a07
new feature, last-id tracking in file '._history.txt'
jtara1 Aug 31, 2016
d32ca53
new feature, load subreddits from subreddits text file and process ea…
jtara1 Aug 31, 2016
b401526
update readme.md & add todo.md
jtara1 Aug 31, 2016
1c85413
fix, [--subreddit-list srl] and [--dir <dest_file>] work properly now
jtara1 Sep 1, 2016
0186590
update DOWNLOADED, ERRORS, & other vars to keep track of progress
jtara1 Sep 1, 2016
2c5321d
update readme
jtara1 Sep 1, 2016
167e5df
critical fix, parse_subreddit_list will now break main loop when newl…
jtara1 Sep 1, 2016
6539870
critical fix, --subreddits-list arg and everything else should be wor…
jtara1 Sep 1, 2016
e90d53d
parse_subreddit_list.py now skips iteration (continue) when line=='\n'
jtara1 Sep 1, 2016
82a89d8
reorganized loop to handle subreddits more clearly, changed user-agen…
jtara1 Sep 1, 2016
839fc57
update readme & todo
jtara1 Sep 3, 2016
72a60b9
fix ARGS.num checker, update readme
jtara1 Sep 9, 2016
af0b953
update readme
jtara1 Sep 10, 2016
3361953
add jtara1/imgur-downloader repository
jtara1 Sep 11, 2016
28bf0d1
added jtara1/imgur-downloader & rm test imgs
jtara1 Sep 11, 2016
ec208d5
update, rename some files
jtara1 Sep 11, 2016
a3c430c
CRIT FIXES: url corrected in reddit.py, PROG_REPORT variables corrected
jtara1 Sep 12, 2016
90e6860
update
jtara1 Sep 12, 2016
1cd0764
hot fix for gfycat & printing is more helpful
jtara1 Sep 12, 2016
3517e8d
update readme
jtara1 Sep 12, 2016
5a92db0
update readme
jtara1 Sep 14, 2016
ba4e8e6
fix gfycat downloading, Does Not Exist errors are handled properly now
jtara1 Sep 14, 2016
9177594
update txt
jtara1 Sep 14, 2016
8c9be20
update readme
jtara1 Sep 14, 2016
eeb204e
update imgur-downloader
jtara1 Sep 15, 2016
3426c5d
rm imgur-downloader/readme.md
jtara1 Sep 15, 2016
8e4787e
add subreddit list examples
jtara1 Sep 15, 2016
e113d2c
update history_log, cli args, readme, & vars renamed
jtara1 Sep 16, 2016
ab53b4a
new cli arg, --restart (begins downloading from beginning of subreddit
jtara1 Sep 16, 2016
f9a27f1
update readme, todo, minor fixes
jtara1 Sep 17, 2016
ef89c68
update history_log func, imports relocated in gfycat.py
jtara1 Sep 17, 2016
2c9ffda
update parse_subreddit_list, CRIT FIX: imgur-downloader
jtara1 Sep 17, 2016
5399043
update imgur-downloader docstrings
jtara1 Sep 17, 2016
ddf0745
hotfix for when last ITEM is comment thread (caused infinite looping)
jtara1 Sep 17, 2016
7 changes: 6 additions & 1 deletion .gitignore
@@ -1,15 +1,20 @@
*/*.jpg
*/*.png
*/*.webm
*/*.mp4
*.swp
*.bak
*.DS_Store
*.sh
*.pyc
/.idea

venv/*

/.project
/*~
/*.webm
/gfycat
/build
/.pydevproject

cli.txt
81 changes: 62 additions & 19 deletions readme.md → README.md
@@ -8,15 +8,56 @@ fresh and interesting. The main idea is that the script would download
any JPEG- or PNG-formatted image it found listed in the specified
subreddit to a folder.

## jtara1 Fork
> **Contributor** (inline comment): only for fork


# Requirements:
### Features and Changes:

* Python 2 (Python3 might be supported over 2to3, but see for
yourself and report back).
* Optional requirements: listed in setup.py under extras_require.


# Usage:
* Adapted to Python 3, mostly by merging [ohyou/RedditImageGrab](https://github.com/ohyou/RedditImageGrab), along with some additional fixes

* `--num` cli argument now counts by reddit submission rather than individual image

* added submodule `imgur-downloader` which enabled the above feature among other things


* file `._history.txt` stores the reddit id of the last downloaded submission, keyed by `subreddit` & `ARGS.sort_type` (a sketch of this bookkeeping follows this list), e.g.:

> {'wallpapers': {'topmonth': {'last-id': '4x4so2'}}}

* positional argument, `<subreddit>`, now autodetects whether its value is a subreddit name or a subreddit list file


* `--subreddit-list srl-filename` cli argument added, where srl-filename is a text file containing the list of subreddits to process

* added a function that parses the subreddit list for subreddit links & an associated save location for each

* at this time, the same cli arguments are used for every subreddit in the list, but the save folder can be altered

* examples for subreddits.txt added, in folder `subreddit-list-examples`

* updated progress report variables such as DOWNLOADED and ERRORS to accommodate processing a list of subreddits

* `--restart` cli arg added which begins downloading from the beginning of the subreddit rather than resuming from last download ID.
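
The last-id bookkeeping can be sketched as follows (a minimal illustration of the `._history.txt` format shown above, not the PR's exact implementation; the helper names and the use of `ast.literal_eval` are assumptions, chosen because the file shown uses Python-repr quoting rather than strict JSON):

    import ast
    import os

    def load_last_id(history_path, subreddit, sort_type):
        # hypothetical helper: '._history.txt' holds a nested dict keyed by
        # subreddit, then sort type, e.g.
        # {'wallpapers': {'topmonth': {'last-id': '4x4so2'}}}
        if not os.path.isfile(history_path):
            return None
        with open(history_path) as f:
            history = ast.literal_eval(f.read())
        return history.get(subreddit, {}).get(sort_type, {}).get('last-id')

    def save_last_id(history_path, subreddit, sort_type, last_id):
        # hypothetical helper: record the newest downloaded submission id
        history = {}
        if os.path.isfile(history_path):
            with open(history_path) as f:
                history = ast.literal_eval(f.read())
        history.setdefault(subreddit, {}).setdefault(sort_type, {})['last-id'] = last_id
        with open(history_path, 'w') as f:
            f.write(repr(history))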

### Fixes:

* `--filename-format` cli arg now works as expected

* `gfycat.py` failed to download direct links to .webm & .mp4 files (see the sketch after this list)

* `gfycat.py` failed to process gfycat links that did not exist
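
The direct-link fix amounts to fetching `.webm`/`.mp4` URLs as-is instead of resolving them through the gfycat API; a minimal sketch under that assumption (a hypothetical helper, reusing the User-Agent workaround from `gfycat.py`):

    import urllib.request

    def fetch_direct(url, dest):
        # direct .webm/.mp4 links need no API lookup; gfycat page links
        # would first be resolved to an mp4Url via the /cajax/get endpoint
        headers = {'User-Agent': 'Mozilla/5.0'}  # avoid CloudFlare blocks
        req = urllib.request.Request(url, None, headers)
        data = urllib.request.urlopen(req).read()
        with open(dest, 'wb') as f:
            f.write(data)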

## Issues
> **Contributor** (inline comment): rather than issue, maybe todo?


* needs more testing

## Requirements:

* Python 3
* Optional requirements: listed in setup.py under extras_require.

## Usage:

See `./redditdl.py --help` for up-to-date details.

@@ -33,14 +74,16 @@ ordering = ('key', )

Downloads files with specified extension from the specified subreddit.

positional arguments:
main arguments:

<subreddit> Subreddit name.
<dest_file> Dir to put downloaded files in.
subreddit <subreddit> Subreddit or subreddit list file name.
dir <dest_file> Dir to put downloaded files in.

optional arguments:

-h, --help show this help message and exit
--subreddit-list srl-filename
Take a list of subreddits from a text file, srl = subreddits.txt
--multireddit Take multireddit instead of subreddit as input. If so,
provide /user/m/multireddit-name as argument
--last l ID of the last downloaded file.
@@ -54,43 +97,43 @@ optional arguments:
--skipAlbums Skip all albums
--mirror-gfycat Download available mirror in gfycat.com.
--filename-format FILENAME_FORMAT
Specify filename format: reddit (default), title or
url
Specify filename format: reddit (default), title or url
--sort-type Sort the subreddit.
--restart Begin downloading from the beginning of the subreddit rather than resuming from the last downloaded submission.


# Examples
## Examples

An example of running this script to download images with a score
greater than 50 from the wallpaper subreddit into a folder called
wallpaper would be as follows:

python redditdl.py wallpaper wallpaper --score 50
python3 redditdl.py wallpaper wallpaper --score 50

And to run the same query but only get new images you don't already
have, run the following:

python redditdl.py wallpaper wallpaper --score 50 -update
python3 redditdl.py wallpaper wallpaper --score 50 -update

For getting some nice pictures of cats in your catsfolder (which will be created if it
doesn't exist yet) run:

python redditdl.py cats ~/Pictures/catsfolder --score 1000 --num 5 --sfw --verbose
python3 redditdl.py cats ~/Pictures/catsfolder --score 1000 --num 5 --sfw --verbose


## Advanced Examples
### Advanced Examples

Retrieve last 10 pics in the 'wallpaper' subreddit with the word
Retrieve pics from the last 10 submissions in the 'wallpaper' subreddit with the word
"sunset" in the title (note: case is ignored by (?i) predicate)

python redditdl.py wallpaper sunsets --regex '(?i).*sunset.*' --num 10
python3 redditdl.py wallpaper sunsets --regex '(?i).*sunset.*' --num 10

Download the top post of the week from the subreddit 'animegifs', using the gfycat mirror (if available)

python redditdl.py animegifs --sort-type topweek --mirror-gfycat
python3 redditdl.py animegifs --sort-type topweek --mirror-gfycat
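
To run one query over every subreddit in a list file (one entry per line, as in the `subreddit-list-examples` folder), pass the file in place of the subreddit name; this sketch assumes a local `subreddits.txt`:

    python3 redditdl.py subreddits.txt ~/Pictures --num 10

Adding `--restart` would start each subreddit from the beginning instead of resuming from the recorded last-id.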


## Sorting
### Sorting

The available sort types are the following: hot, new, rising, controversial, top, gilded
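
For example, to grab this month's top submissions (assuming the same period-suffix convention as the topweek and topmonth values used above):

    python3 redditdl.py wallpapers wallpapers --sort-type topmonth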

7 changes: 7 additions & 0 deletions TODO.md
@@ -0,0 +1,7 @@
## todo

* fix downloading from deviantart, tumblr, pixiv.net, instagram & other sites

* record metadata (submission link & comments, local file location) in database

* integrate youtube-dl module to handle all video links (a sketch follows)
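
A rough sketch of what the youtube-dl integration might look like, using youtube-dl's documented embedding API (the helper name and option values are assumptions, not part of this PR):

    import youtube_dl  # pip install youtube_dl

    def download_video(url, dest_dir):
        # hypothetical helper: let youtube-dl handle any video link it supports
        opts = {'outtmpl': dest_dir + '/%(id)s.%(ext)s', 'quiet': True}
        with youtube_dl.YoutubeDL(opts) as ydl:
            ydl.download([url])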
2 changes: 1 addition & 1 deletion redditdl.py
@@ -11,4 +11,4 @@


if __name__ == '__main__':
main()
main("")
37 changes: 18 additions & 19 deletions redditdownload/gfycat.py
@@ -1,5 +1,10 @@
from collections import namedtuple

import urllib.request, urllib.error, urllib.parse
from urllib.error import URLError
import json
import random
import string
import requests

class gfycat(object):

@@ -23,21 +28,17 @@ def __init__(self):
super(gfycat, self).__init__()

def __fetch(self, url, param):
import urllib2
import json
try:
# added simple User-Agent string to avoid CloudFlare blocking this request
headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url+param, None, headers)
connection = urllib2.urlopen(req).read()
except urllib2.HTTPError, err:
req = urllib.request.Request(url+param, None, headers)
connection = urllib.request.urlopen(req).read()
except urllib.error.HTTPError as err:
raise ValueError(err.read())
result = namedtuple("result", "raw json")
return result(raw=connection, json=json.loads(connection))
return result(raw=connection, json=json.loads(connection.decode('ascii')))

def upload(self, param):
import random
import string
# gfycat needs to get a random string before our search parameter
randomString = ''.join(random.choice
(string.ascii_uppercase + string.digits) for _ in range(5))
@@ -55,9 +56,6 @@ def uploadFile(self, file):

def __fileHandler(self, file):
# Thanks thesourabh for the implementation
import random
import string
import requests
# gfycat needs a random key before upload
key = ''.join(random.choice
(string.ascii_uppercase + string.digits) for _ in range(10))
@@ -80,8 +78,11 @@ def __fileHandler(self, file):

def more(self, param):
result = self.__fetch(self.url, "/cajax/get/%s" % param)
if "error" in result.json["gfyItem"]:
raise ValueError("%s" % self.json["gfyItem"]["error"])
try:
if result.json['error']:
raise URLError('%s%s%s' % ('DNE: ', 'http://gfycat.com/', param))
except KeyError:
pass # no error reported in json
return _gfycatMore(result)

def check(self, param):
@@ -117,26 +118,24 @@ def get(self, what):
return ("Sorry, can't find %s" % error)

def download(self, location):
import urllib2
if not location.endswith(".mp4"):
location = location + self.get("gfyName") + ".mp4"
try:
# added simple User-Agent string to avoid CloudFlare blocking this request
headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(self.get("mp4Url"), None, headers)
file = urllib2.urlopen(req)
req = urllib.request.Request(self.get("mp4Url"), None, headers)
file = urllib.request.urlopen(req)
# make sure that the status code is 200, and the content type is mp4
if int(file.code) != 200 or file.headers["content-type"] != "video/mp4":
raise ValueError("Problem downloading the file. Status code is %s or the content-type is not right %s"
% (file.code, file.headers["content-type"]))
data = file.read()
with open(location, "wb") as mp4:
mp4.write(data)
except urllib2.HTTPError, err:
except urllib.error.HTTPError as err:
raise ValueError(err.read())

def formated(self, ignoreNull=False):
import json
if not ignoreNull:
return json.dumps(self.js, indent=4,
separators=(',', ': ')).strip('{}\n')
22 changes: 11 additions & 11 deletions redditdownload/img_scrap_stuff.py
@@ -10,11 +10,11 @@
import re
import json
import logging
import urlparse
import urllib.parse
import traceback

from PIL import Image
from cStringIO import StringIO
from io import StringIO
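# note: cStringIO accepted bytes in Python 2; if this buffer ends up wrapping
# binary image data for PIL's Image.open, io.BytesIO is needed in Python 3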
import lxml
import html5lib # Heavily recommended for bs4 (apparently)
import bs4
@@ -52,7 +52,7 @@ def indexall_re(topstr, substr_re):
def walker(text, opening='{', closing='}'):
""" A near-useless experiment that was intended for `get_all_objects` """
stack = []
for pos in xrange(len(text)):
for pos in range(len(text)):
if text[pos:pos + len(opening)] == opening:
stack.append(pos)
continue
@@ -88,7 +88,7 @@ def get_all_objects(text, beginning=r'{', debug=False):
"""

def _dbg_actual(st, *ar):
print "D: ", st % ar
print("D: ", st % ar)

_dbg = _dbg_actual if debug else (lambda *ar: None)

Expand All @@ -106,9 +106,9 @@ def __getitem__(self, key):
class TheLoader(yaml.SafeLoader):
ESCAPE_REPLACEMENTS = ddd(yaml.SafeLoader.ESCAPE_REPLACEMENTS)

from cStringIO import StringIO
from io import StringIO
# optimised slicing
if isinstance(text, unicode):
if isinstance(text, str):
_dbg("encoding")
text = text.encode('utf-8')
_dbg("Length: %r", len(text))
@@ -214,13 +214,13 @@ def get_get_get(url, **kwa):

def get_get(*ar, **kwa):
retries = kwa.pop('_xretries', 5)
for retry in xrange(retries):
for retry in range(retries):
try:
return get_get_get(*ar, **kwa)
except Exception as exc:
traceback.print_exc()
ee = exc
print "On retry #%r (%s)" % (retry, repr(exc)[:30])
print("On retry #%r (%s)" % (retry, repr(exc)[:30]))
raise GetError(ee)


@@ -244,7 +244,7 @@ def get(url, cache_file=None, req_params=None, bs=True, response=False, undecode
for chunk in resp.iter_content(chunk_size=16384):
data += chunk
if len(data) > _max_len:
print "Too large"
print("Too large")
break
data = bytes(data) ## Have to, alas.
data_bytes = data
@@ -274,7 +274,7 @@ def _filter(l):


def _url_abs(l, base_url):
return (urlparse.urljoin(base_url, v) for v in l)
return (urllib.parse.urljoin(base_url, v) for v in l)


def _preprocess_bs_links(bs, links):
@@ -413,7 +413,7 @@ def _pp(lst):
for val in lst
if val.startswith('http') or val.startswith('/')]
# (urljoin should be done already though)
return [urlparse.urljoin(url, val) for val in res]
return [urllib.parse.urljoin(url, val) for val in res]

imgs, links = bs2img(bs), bs2lnk(bs)
to_check = imgs + links
3 changes: 3 additions & 0 deletions redditdownload/imgur-downloader/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
.DS_Store
test.py
*.pyc
7 changes: 7 additions & 0 deletions redditdownload/imgur-downloader/LICENSE
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
Copyright (C) 2012 Alex Gisby

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Empty file.
Binary file added redditdownload/imgur-downloader/imgur-dne.png