easierscrape package
easierscrape module
- class easierscrape.Scraper(url, download_path='easierscrape_downloads')
Bases: object
Class for a scraper that targets a specific url and downloads all files to a download_path relative to the current working directory. A Scraper object acts as a “one-stop-shop” for all scraping functions.
- clear_downloads()
Deletes the Scraper download directory.
- Returns
True if the Scraper download directory exists and is deleted. False otherwise.
- Return type
bool
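The return contract above (True only when the directory existed and was removed) can be illustrated with a small standard-library sketch. This is not easierscrape's implementation; `remove_download_dir` is a hypothetical helper used only to show the boolean semantics.

```python
import shutil
import tempfile
from pathlib import Path


def remove_download_dir(path):
    """Delete `path` recursively and report whether anything was deleted."""
    directory = Path(path)
    if not directory.is_dir():
        # Nothing to delete, mirroring the documented False case.
        return False
    shutil.rmtree(directory)
    return True


# Demonstrate both outcomes on a throwaway directory.
demo_dir = Path(tempfile.mkdtemp()) / "easierscrape_downloads"
demo_dir.mkdir()
first_call = remove_download_dir(demo_dir)   # directory existed
second_call = remove_download_dir(demo_dir)  # directory is already gone
```

Calling the helper twice shows why a False result is not necessarily an error: it may simply mean there was nothing to clean up.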
- get_screenshot()
Downloads a screenshot of the Scraper url to the Scraper download directory.
- Returns
True
- Return type
bool
- parse_anchors()
Parses a list of anchor tags from the Scraper url.
- Returns
List of anchor tags in the url.
- Return type
List[str]
- parse_files(filetypes=[])
Downloads provided filetypes from the Scraper url to the Scraper download directory.
- Parameters
filetypes (List[str]) – List of filetypes (“pdf”, “txt”, etc.) to scrape.
- Returns
List with the number of files downloaded from url for each filetype, in the same order as filetypes. For example, if filetypes=["pdf", "txt"] and the return value is [1, 30], then 1 pdf file and 30 txt files were downloaded.
- Return type
List[int]
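Because the counts come back as a positional list, it can help to pair them with the requested filetypes. The sketch below assumes the counts follow the order of the filetypes argument, as the example in the description implies; `counts_by_filetype` and `download_and_summarize` are hypothetical helper names, not part of easierscrape.

```python
def counts_by_filetype(filetypes, counts):
    """Pair each requested filetype with its download count."""
    if len(filetypes) != len(counts):
        raise ValueError("expected one count per requested filetype")
    return dict(zip(filetypes, counts))


def download_and_summarize(url, filetypes):
    """Run parse_files for `url` and return a {filetype: count} mapping."""
    # Imported lazily so the helper above works without easierscrape installed.
    from easierscrape import Scraper

    scraper = Scraper(url)
    return counts_by_filetype(filetypes, scraper.parse_files(filetypes))


# Using the example values from the description above.
summary = counts_by_filetype(["pdf", "txt"], [1, 30])
```

Here `summary` maps "pdf" to 1 and "txt" to 30, which is easier to read than the bare list when many filetypes are requested.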
- parse_images()
Downloads all images from the Scraper url to the Scraper download directory.
- Returns
Number of images downloaded from url.
- Return type
int
- parse_lists()
Parses a list of lists from the Scraper url.
- Returns
List of lists (each stored as a List) in the url.
- Return type
List[List[str]]
- parse_tables(output_type='csv')
Downloads all tables from the Scraper url to the Scraper download directory.
Supported output types are csv and xlsx (defaults to csv).
If downloaded as a csv file, each table will be stored in a separate csv.
If downloaded as an xlsx file, all tables will be stored as separate sheets in a “tables.xlsx” file.
- Parameters
output_type (str) – The filetype to output to (defaults to csv).
- Returns
Number of tables downloaded from url.
- Return type
int
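The documentation does not say what happens when output_type is something other than csv or xlsx. One cautious option is to validate the value before calling parse_tables. The wrapper below is a sketch under that assumption; `download_tables` and `SUPPORTED_OUTPUT_TYPES` are hypothetical names, and `scraper` is any object exposing the parse_tables method described above.

```python
# The two output types listed as supported in the description above.
SUPPORTED_OUTPUT_TYPES = ("csv", "xlsx")


def download_tables(scraper, output_type="csv"):
    """Call scraper.parse_tables only for a documented output type."""
    if output_type not in SUPPORTED_OUTPUT_TYPES:
        raise ValueError(
            f"unsupported output_type {output_type!r}; "
            f"expected one of {SUPPORTED_OUTPUT_TYPES}"
        )
    # Returns the number of tables downloaded, as documented.
    return scraper.parse_tables(output_type=output_type)
```

Failing early this way keeps a typo such as "xls" from reaching the scraper at all.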
- parse_text()
Parses a list of text fragments from the Scraper url.
- Returns
List of text fragments in the url.
- Return type
List[str]
- print_tree(maxdepth, blacklist=[], whitelist=[])
Prints a tree of depth=maxdepth starting at the Scraper url.
- Parameters
maxdepth (int) – The depth you want to print the tree to.
blacklist (List[str]) – A list of all domains to ignore in the tree generation.
whitelist (List[str]) – A list of the only domains to include in the tree generation.
- tree_gen(maxdepth, blacklist=[], whitelist=[])
Generates a tree of depth=maxdepth starting at the Scraper url. If the blacklist argument is used, none of the blacklisted domains will appear. If the whitelist argument is used, only the whitelisted domains will appear.
- Parameters
maxdepth (int) – The depth you want to generate the tree to.
blacklist (List[str]) – A list of all domains to ignore in the tree generation.
whitelist (List[str]) – A list of the only domains to include in the tree generation.
- Returns
Head node of an anytree hyperlink tree.
- Return type
Node
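The blacklist and whitelist rules described for tree_gen can be sketched as a simple per-link filter. This is an illustration of the stated rule only, not easierscrape's code: how the library actually matches domains is not documented here, so the exact match on the URL's network location below is an assumption, and `domain_allowed` is a hypothetical helper.

```python
from urllib.parse import urlparse


def domain_allowed(url, blacklist=(), whitelist=()):
    """Return True if `url` passes the blacklist/whitelist rule."""
    domain = urlparse(url).netloc  # e.g. "docs.example.org"
    if domain in blacklist:
        # Blacklisted domains never appear in the tree.
        return False
    if whitelist and domain not in whitelist:
        # When a whitelist is given, only its domains appear.
        return False
    return True


links = [
    "https://example.com/a",
    "https://ads.example.net/banner",
    "https://docs.example.org/guide",
]
kept_with_blacklist = [
    u for u in links if domain_allowed(u, blacklist=["ads.example.net"])
]
kept_with_whitelist = [
    u for u in links if domain_allowed(u, whitelist=["example.com"])
]
```

An empty whitelist is treated as "no restriction", matching the default whitelist=[] in the signature.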