# Web Scraper
Simple web scraping with Beautiful Soup 4 (BS4) and Selenium (headless browser).
Cookie banners can be handled; see `parse_urls.py`. To do so, wait until the banner's button has loaded, click it, and then wait again until the content of the site has loaded. This is only needed for the first URL; for the following URLs, the cookies are already set.
With BS4 it's easy to extract information from HTML. To get a `div` by its id, write `elem = soup.find('div', id='my_id')`. To find all descendants of that element (children, children of children, etc.), write `children = elem.find_all('span')`; `findAll` is the older spelling of the same method.
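
A small self-contained example of these two calls, using made-up HTML. It uses `find_all`, the current name of `findAll`, and also shows `recursive=False`, which limits the search to direct children.

```python
from bs4 import BeautifulSoup

# Made-up HTML, only for illustration.
html = """
<div id="my_id">
  <span>first</span>
  <p><span>nested</span></p>
</div>
<span>outside</span>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the div by its id.
elem = soup.find("div", id="my_id")

# All span descendants of the div, at any depth.
children = elem.find_all("span")
texts = [span.get_text() for span in children]
print(texts)  # ['first', 'nested']

# Only the spans that are direct children of the div.
direct = [span.get_text() for span in elem.find_all("span", recursive=False)]
print(direct)  # ['first']
```
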
## Chromedriver
To use Selenium with Chrome, you need to download ChromeDriver in a version that matches your installed Chrome version.