easierscrape package

easierscrape module

class easierscrape.Scraper(url, download_path='easierscrape_downloads')

Bases: object

A scraper that targets a single url and saves every downloaded file under download_path, which is resolved relative to the current working directory. A Scraper object acts as a “one-stop shop” for all scraping functions.
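As a rough illustration of how the methods documented below fit together, here is a minimal sketch. The helper names run_basic_scrape and main, and the https://example.com url, are assumptions made for this example and are not part of the package. The helper only calls methods listed in this reference, so it accepts any object that exposes them.

```python
def run_basic_scrape(scraper):
    """Call a few documented Scraper methods and collect their results.

    `scraper` is expected to be an easierscrape.Scraper, but any object
    exposing the same three methods works (handy for offline testing).
    """
    return {
        "anchors": scraper.parse_anchors(),           # List[str]
        "text": scraper.parse_text(),                 # List[str]
        "images_downloaded": scraper.parse_images(),  # int
    }


def main():
    # Needs the easierscrape package and network access.
    # The url below is a placeholder chosen for this sketch.
    from easierscrape import Scraper

    scraper = Scraper("https://example.com", download_path="easierscrape_downloads")
    print(run_basic_scrape(scraper))
    scraper.clear_downloads()  # remove the download directory afterwards
```

main() is deliberately not called here; invoke it yourself once the package is installed.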

clear_downloads()

Deletes the Scraper download directory.

Returns

True if the Scraper download directory existed and was deleted, False otherwise.

Return type

bool
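The return contract above (True only when a directory was actually there to delete) can be pictured with the standard library. This is a sketch of the described behavior, not the package's own implementation; remove_download_dir is a hypothetical name.

```python
import shutil
from pathlib import Path


def remove_download_dir(path):
    """Delete the directory at `path`; report whether anything was deleted."""
    directory = Path(path)
    if not directory.is_dir():
        # Nothing to delete, mirroring the documented False case.
        return False
    shutil.rmtree(directory)
    return True
```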

get_screenshot()

Downloads a screenshot of the Scraper url to the Scraper download directory.

Returns

True

Return type

bool

parse_anchors()

Parses a list of anchor tags from the Scraper url.

Returns

List of anchor tags in the url.

Return type

List[str]

parse_files(filetypes=[])

Downloads all files of the given filetypes from the Scraper url to the Scraper download directory.

Parameters

filetypes (List[str]) – List of filetypes (“pdf”, “txt”, etc.) to scrape.

Returns

List with the number of files downloaded for each filetype, in the same order as filetypes. For example, if filetypes=[“pdf”, “txt”] and the return value is [1, 30], then 1 pdf file and 30 txt files were downloaded.

Return type

List[int]
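Because the returned counts line up positionally with filetypes, it can be convenient to pair them. A small sketch follows; counts_by_filetype is a hypothetical helper, not part of the package.

```python
def counts_by_filetype(filetypes, counts):
    """Pair each requested filetype with its download count."""
    if len(filetypes) != len(counts):
        raise ValueError("filetypes and counts must have the same length")
    return dict(zip(filetypes, counts))


# Values taken from the example in the Returns description above.
summary = counts_by_filetype(["pdf", "txt"], [1, 30])
```

In practice the second argument would be the list returned by parse_files(filetypes).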

parse_images()

Downloads all images from the Scraper url to the Scraper download directory.

Returns

Number of images downloaded from url.

Return type

int

parse_lists()

Parses a list of lists from the Scraper url.

Returns

List of lists (each stored as a List) in the url.

Return type

List[List[str]]

parse_tables(output_type='csv')

Downloads all tables from the Scraper url to the Scraper download directory.

Supported output types are csv and xlsx (defaults to csv).

  • If downloaded as a csv file, each table will be stored in a separate csv.

  • If downloaded as an xlsx file, all tables will be stored as separate sheets in a “tables.xlsx” file.

Parameters

output_type (str) – The filetype to output to (defaults to csv).

Returns

Number of tables downloaded from url.

Return type

int
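To picture the csv layout described above (one file per table), here is a standard-library sketch. It is not the package's implementation: the function name write_tables_as_csv and the table_<index>.csv naming are assumptions, and the xlsx case is not covered.

```python
import csv
from pathlib import Path


def write_tables_as_csv(tables, directory):
    """Write each table (a list of rows) to its own csv file."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, rows in enumerate(tables):
        # Hypothetical naming scheme; the real file names may differ.
        path = directory / f"table_{index}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(rows)
        paths.append(path)
    return paths
```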

parse_text()

Parses a list of text fragments from the Scraper url.

Returns

List of text fragments in the url.

Return type

List[str]

print_tree(maxdepth, blacklist=[], whitelist=[])

Prints a tree of depth=maxdepth starting at the Scraper url.

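Parameters
  • maxdepth (int) – The depth to print the tree to.

  • blacklist (List[str]) – A list of domains to exclude from the tree.

  • whitelist (List[str]) – A list of the only domains to include in the tree.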

tree_gen(maxdepth, blacklist=[], whitelist=[])

Generates a tree of depth=maxdepth starting at the Scraper url. If the blacklist argument is used, none of the blacklisted domains will appear. If the whitelist argument is used, only the whitelisted domains will appear.

Parameters
  • maxdepth (int) – The depth to generate the tree to.

  • blacklist (List[str]) – A list of domains to exclude from the tree.

  • whitelist (List[str]) – A list of the only domains to include in the tree.

Returns

Head node of an anytree hyperlink tree.

Return type

Node
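The blacklist and whitelist rules described for tree_gen can be sketched as a simple domain filter. This is an illustration under stated assumptions: is_domain_allowed is a hypothetical helper, and exact, case-insensitive matching on the url's network location is a guess; the package's actual matching rule is not specified here.

```python
from urllib.parse import urlparse


def is_domain_allowed(url, blacklist=(), whitelist=()):
    """Return True if the url's domain passes the blacklist/whitelist rules."""
    domain = urlparse(url).netloc.lower()
    blocked = {d.lower() for d in blacklist}
    allowed = {d.lower() for d in whitelist}
    if domain in blocked:
        return False  # blacklisted domains never appear
    if allowed and domain not in allowed:
        return False  # with a whitelist, only listed domains appear
    return True


# Placeholder links chosen for this sketch.
links = [
    "https://example.com/docs",
    "https://ads.example.net/banner",
    "https://example.org/about",
]
kept = [u for u in links if is_domain_allowed(u, blacklist=["ads.example.net"])]
```

An empty whitelist is treated as "no restriction", which matches the wording "if the whitelist argument is used".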