Web Scraping with Regex

DISCLAIMER: The current method is very inefficient for both my own server and theirs. Ideally I should be scraping with JS, which would cache the results on the user’s machine. Regardless, I’ve decided to keep this on my site for historical purposes.

Link: https://dreamyz.net/webscrape-hpcreatures.php
(The website will be slow to load. Be patient and refresh in case of a timeout.)

Just a few of the 227 entries parsed by the code.

I used regular expressions in PHP to parse the Harry Potter wikia page for monsters, automatically visiting each linked page and extracting the image and other info from it. The script does this every time it runs, since it doesn’t store any information on disk. This means that if a new Harry Potter or Fantastic Beasts movie comes out, the page will update automatically without any extra work from me. While the script is aimed at the HP wikia, I suspect it would work nearly unchanged on many other wiki pages, but that’s not yet supported.
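
To give a rough idea of the two steps described above (collect the links from the index page, then pull the image and other info out of each page), here is a minimal sketch. It is not the original script: the function names `extractCreatureLinks` and `extractCreatureInfo`, the regex patterns, and the sample markup are all assumptions made for illustration, and the real wiki HTML will differ. It runs on inline sample strings instead of fetching anything.

```php
<?php
// Hypothetical sketch: the patterns and sample markup below are assumptions,
// not the original script or the wiki's real HTML.

/** Return the unique /wiki/... paths linked from an index page's HTML. */
function extractCreatureLinks(string $html): array
{
    // Match href="/wiki/Name"; paths containing ':' (e.g. Category:Foo),
    // '#' or '?' are skipped because the character class excludes them.
    preg_match_all('~href="(/wiki/[^":#?]+)"~', $html, $matches);
    return array_values(array_unique($matches[1]));
}

/** Pull the title and first image URL out of one creature page's HTML. */
function extractCreatureInfo(string $html): array
{
    $title = preg_match('~<h1[^>]*>(.*?)</h1>~s', $html, $m)
        ? trim(strip_tags($m[1]))
        : null;
    $image = preg_match('~<img[^>]+src="([^"]+)"~i', $html, $m)
        ? html_entity_decode($m[1])
        : null;
    return ['title' => $title, 'image' => $image];
}

// Offline sample data standing in for fetched pages (assumed markup).
$indexHtml = '<ul>
  <li><a href="/wiki/Hippogriff">Hippogriff</a></li>
  <li><a href="/wiki/Niffler">Niffler</a></li>
  <li><a href="/wiki/Hippogriff">Hippogriff (again)</a></li>
  <li><a href="/wiki/Category:Creatures">Category</a></li>
</ul>';
$pageHtml = '<h1 class="page-title"> Niffler </h1>
<img alt="Niffler" src="https://example.com/niffler.png">';

$links = extractCreatureLinks($indexHtml);
$info = extractCreatureInfo($pageHtml);
print_r($links);
print_r($info);
```

In a live version, the sample strings would be replaced by the HTML fetched for the index page and for each link, which is exactly the part that makes the approach slow when it runs on every page load.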
