SnooScraper

A small program to scrape content from a subreddit, a reddit user, and from various popular image sites, downloading files matching your criteria.

You can scrape a subreddit or a user account, downloading posts and the content they link to, depending on the options you pick. I have tested it by downloading large amounts of data and it works well. Several parameters customise what kind of content to fetch and where to fetch it from; the more of them you turn on, the longer a scrape will take, but the more content you will be able to download.

Animated GIFs and videos, for example, can be large and quickly eat up valuable disk space, so downloading them can be turned off. Another issue on reddit is that images are often not linked directly but via webpages on sites like imgur. I have included functions which go on to fetch the media from those sites, so nothing is missed. These functions can also be called separately using the command line options, for example to download a whole imgur album on its own.
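As an illustration of what resolving such a link involves (a minimal sketch only, not SnooScraper's actual code; the URL below is a placeholder and the regular expression is an assumption about imgur's page markup):

# Sketch: fetch an imgur page and download any directly linked images.
url="https://imgur.com/gallery/example"    # placeholder URL
curl -sL "$url" |
  grep -Eo 'https://i\.imgur\.com/[A-Za-z0-9]+\.(jpg|jpeg|png|gif|mp4)' |
  sort -u |
  while read -r img; do
    curl -sLO "$img"    # keep the original filename from the URL
  done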

My aim is to write SnooScraper in POSIX-compliant shell script for portability and efficiency, so it should work in any shell. So far I have tried it in bash and dash on GNU/Linux, and in bash under Cygwin on Windows. It should also work on other operating systems that I haven't yet tried.
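If you want to check POSIX compliance yourself, running the script explicitly under dash is a quick test, and shellcheck can do a static pass (both assume those tools are installed; neither is needed to use SnooScraper):

# Run under dash to catch bashisms at runtime
dash ./snooscraper -h

# Optional static check against the POSIX sh dialect
shellcheck -s sh snooscraper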

SnooScraper is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

Supported External Sites: imgur, instagram, pinterest, tumblr

Getting SnooScraper

If you use Arch Linux or an Arch-based GNU/Linux distribution, you can get SnooScraper from the AUR.
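For example, with the standard manual AUR workflow (the package name snooscraper is an assumption here; an AUR helper such as yay would also work):

# Build and install the AUR package manually
git clone https://aur.archlinux.org/snooscraper.git
cd snooscraper
makepkg -si    # builds the package and installs it with pacman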

Other than a shell and the standard tools curl, sed and grep, the only dependency is jq. jq is available in most GNU/Linux distributions, via Homebrew on macOS, in Cygwin on Windows, or can be downloaded from GitHub.
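A quick way to confirm everything needed is on your PATH (just a convenience check, not part of SnooScraper):

# Report any missing dependencies
for cmd in curl sed grep jq; do
  command -v "$cmd" >/dev/null 2>&1 || echo "missing: $cmd"
done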

You can get the script and default config file by downloading and extracting a release, or by cloning this repository; then make the script executable:

git clone https://notabug.org/odg/SnooScraper.git #if you haven't downloaded it
cd SnooScraper
chmod +x snooscraper
./snooscraper -h

Using SnooScraper

The help text (-h) explains the command line options. Review the parameters in the config file and customise them as you like before use. In short, you can pass a subreddit name or username to the script to download the contents of its posts, or pass URLs from the supported external sites directly. Files that are already present will not be downloaded again; delete them to force redownloading.
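Some illustrative invocations (the subreddit, username and URL below are placeholders, and the exact option syntax may differ; the -h output is authoritative):

./snooscraper somesubreddit                  # scrape a subreddit by name
./snooscraper someusername                   # scrape a user's posts
./snooscraper https://imgur.com/a/example    # pass a supported external URL directly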

Files are named after the ID (a base 36 number) of the reddit post from which they originate, or after a unix timestamp if a URL is passed to the script directly. Reddit post IDs are zero-padded, so sorting the files alphabetically also puts them in chronological order. Images from albums get a zero-padded number suffix after a dash, for example: abcxyz-01.jpg, abcxyz-02.jpg...
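As a sketch of that naming scheme (illustrative only, not the script's actual code), zero-padded album filenames can be produced with printf:

# Produce names like abcxyz-01.jpg, abcxyz-02.jpg, ...
id="abcxyz"    # reddit post ID (base 36)
n=1
for img in first.jpg second.jpg; do
  printf '%s-%02d.%s\n' "$id" "$n" "${img##*.}"
  n=$((n + 1))
done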

A cleaning function (-c) is also included. It attempts to remove duplicates (i.e. reposts), empty files (404 errors and other failed downloads) and files with the wrong extension (depending on the configuration) in a given directory.
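The kind of cleanup this performs can be sketched as follows (illustrative only, not the script's own function; the extension filtering mentioned above depends on the config and is omitted, and md5sum may be named differently outside GNU systems):

dir="downloads"    # placeholder directory to clean

# Remove empty files left behind by failed downloads (e.g. 404s)
find "$dir" -type f -size 0 -exec rm -f {} +

# Remove duplicates, keeping the first file of each checksum
find "$dir" -type f -exec md5sum {} + | sort |
  while read -r sum file; do
    [ "$sum" = "$prev" ] && rm -f "$file"    # same content as the previous file: a repost
    prev="$sum"
  done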

Planned Features

In the future I plan to add more sites (twitter, flickr, wikimedia, maybe more) and the ability to scrape multiple subreddits/accounts simultaneously, along with more options: changing the sort method, narrowing down by upvotes, starting/ending searches at different dates, etc. I may also add different download methods to increase speed and better handle parallel downloads.
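For the parallel download idea, one common approach (just a sketch, not currently part of the script; xargs -P is a widespread extension rather than strict POSIX) is to hand a list of URLs to xargs:

# Download up to four URLs at a time from urls.txt (one URL per line)
xargs -n 1 -P 4 curl -sLO < urls.txt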

Thanks