I finished building my web scraper yesterday. I wanted to learn how to build one, so I followed some online tutorials and used the Nokogiri gem. It was quite a smooth and interesting experience. Nokogiri is useful because you can choose which elements to scrape using either CSS selectors or XPath.
I wasn’t sure what XPath is, so okay, a quick Google search says it’s a syntax for targeting parts of an XML document (Nokogiri lets you use it on HTML pages too). XML is similar to HTML, except you define your own tag names, so the tags tend to be more descriptive, like actual words instead of <p>s, <br>s and whatnot.
Okay, so anyway, while learning to build the web scraper with Nokogiri, I learnt some cool new things.
- I learnt about other web scraping gems like spidey, and about other tools. One useful tool is a Chrome extension called SelectorGadget (the name sounds like Inspector Gadget, that 80s cartoon series). It helps you easily find and group the CSS selectors or XPath expressions for the elements you want to target.
- I also learnt to send a random User-Agent header when fetching the HTML page that Nokogiri then parses. If you don’t specify a User-Agent, the HTTP client sends a default one, and after a few tries you may get a 429 error code (Too Many Requests) from the site or page you want to scrape.
- Lastly, no matter how small the project is, there’s always something new to be learnt. I guess this is because of how new I am to web dev/programming.
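The random User-Agent idea from the list above could look roughly like this. Treat it as a sketch under assumptions rather than the exact code from the scraper: the strings in USER_AGENTS and the build_request and fetch helpers are hypothetical names made up for illustration, and it uses Ruby’s built-in Net::HTTP to send the header, since Nokogiri only parses the HTML once it has been fetched.

```ruby
require "net/http"
require "uri"

# Hypothetical pool of browser-like User-Agent strings (illustrative only).
USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
  "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0"
].freeze

# Builds a GET request with a randomly picked User-Agent header.
def build_request(url, user_agents = USER_AGENTS)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri)
  request["User-Agent"] = user_agents.sample
  request
end

# Sends the request and warns if the site answers with 429 Too Many Requests.
def fetch(url)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    response = http.request(build_request(url))
    warn "Got 429 from #{uri.host}, time to slow down" if response.code == "429"
    response
  end
end
```

The body of the response returned by fetch can then be handed to Nokogiri::HTML for parsing. A random User-Agent is not a guarantee against 429s, so pausing between requests is still a good idea.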
My initial goal to scrape my own diaryland website is accomplished. Good job lol. I followed an online tutorial that was very informative. This one.
What next? I think there are improvements to be made:
- For one, I’ll need to learn how to run the scraping as a background job, so that the Rails server can do other things while it’s scraping, like serving the initial results first.
- And pagination. I think that’s easier with the Kaminari gem.
Okay, this is it. Oh yeah, another thing I learnt: I have more than 680 diaryland entries. That’s amazing haha. I’ve been writing stuff on my diaryland page since I was 24? That was in 2004, so it’s been more than 13 years. Wow. Frankly, I’m surprised diaryland is still around.
So okay. See you. Oh, and in case you want to scrape your own diaryland page with Rails (if you have either), you can see the source code here.