Hernandez.gg

Pagalworld Scraper

A tool for crawling pagalworld.pw and downloading its songs and playlists.

Work, January 2022

What I Made ⚙

I made a scraper/downloader for a client that crawls pagalworld.pw and downloads MP3 files.

What I Used 🛠

1. Python

Python is the programming/scripting language used for building this project.

2. Beautiful Soup 4

I used the Beautiful Soup library to parse the HTML fetched with Python's Requests library.

3. Selenium

I used Selenium to load data that required JavaScript and user interaction.

4. Batch

I used a batch file to make the installation process easier for the client.

5. PowerShell

I used a PowerShell script to make installation easier for the client.

About this site 🔍

This scraper was created for a client who needed to download MP3 files from pagalworld.pw.

What I learned 🧠

I learned how to take advantage of Python's concurrent.futures library to speed up the scraping process. The client wanted to download a large number of MP3 files from this site, and going through them one by one would have taken much longer without it.
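To illustrate the idea, here is a minimal sketch of fanning downloads out over a thread pool with concurrent.futures. It is not the code from the project: `download_all` and `fake_download` are hypothetical names, the example.com URLs are placeholders, and `fake_download` only sleeps to mimic network latency, so the sketch runs without touching any real site.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def fake_download(url: str) -> bytes:
    """Hypothetical stand-in for an HTTP download; sleeps to mimic latency."""
    time.sleep(0.05)
    return f"audio bytes for {url}".encode()


def download_all(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = fake_download,
    max_workers: int = 8,
) -> Dict[str, bytes]:
    """Fetch every URL on a thread pool and return a url -> content mapping."""
    results: Dict[str, bytes] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything up front, remembering which future belongs to which URL.
        future_to_url = {pool.submit(fetch, url): url for url in urls}
        # Collect results in completion order, not submission order.
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One failed download should not stop the rest.
                print(f"failed: {url}: {exc}")
    return results


if __name__ == "__main__":
    song_urls = [f"https://example.com/song-{i}.mp3" for i in range(16)]
    started = time.perf_counter()
    files = download_all(song_urls)
    elapsed = time.perf_counter() - started
    print(f"downloaded {len(files)} files in {elapsed:.2f}s")
```

Threads suit this kind of work because each download spends most of its time waiting on the network, so several can overlap. In a real scraper, `fetch` would be replaced with a function that performs the actual HTTP request.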

Finally 🔥

Building this scraper was a little challenging because of the layout of the site. The client had given me a list of playlists that supposedly contained URLs to the MP3s they wanted to download. These playlists were really playlists of other playlists, and sometimes the nesting went several levels deeper before actually reaching an MP3 download page.

Wanting to take advantage of Python's concurrent.futures library, I needed a way to tell the site's different URL types apart so that each worker would know what to do with the URL it was given. I looked around the site and found an identifier for each URL type. I then wrote a subroutine for each URL type, and URLs were added to or removed from the list for their type after the subroutine completed. These subroutines ran in parallel with each other and finished once an MP3 page was reached. The result was a list of MP3 objects, each holding a link and its name, waiting to be downloaded.
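The approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the original code: the `/playlist/` and `/song/` path markers, the `FAKE_SITE` mapping, and every function name are hypothetical, and the page fetching is replaced by a dictionary lookup so the example runs offline.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Mp3:
    name: str
    link: str


# Hypothetical in-memory stand-in for the site: page URL -> links found on it.
FAKE_SITE: Dict[str, List[str]] = {
    "/playlist/top": ["/playlist/rock", "/playlist/pop"],
    "/playlist/rock": ["/song/anthem"],
    "/playlist/pop": ["/playlist/pop-2021", "/song/ballad"],
    "/playlist/pop-2021": ["/song/summer"],
}


def classify(url: str) -> str:
    """Pick a URL type from a marker in the path (the markers are assumptions)."""
    if "/playlist/" in url:
        return "playlist"
    if "/song/" in url:
        return "mp3"
    return "unknown"


def handle_playlist(url: str) -> List[str]:
    """Return the links a playlist page contains (may be more playlists)."""
    return FAKE_SITE.get(url, [])


def handle_mp3(url: str) -> Mp3:
    """Build the final MP3 record from a song page URL."""
    return Mp3(name=url.rsplit("/", 1)[-1], link=f"{url}.mp3")


def process(url: str) -> Tuple[List[str], List[Mp3]]:
    """Dispatch on URL type; return (newly found URLs, MP3s found)."""
    kind = classify(url)
    if kind == "playlist":
        return handle_playlist(url), []
    if kind == "mp3":
        return [], [handle_mp3(url)]
    return [], []


def crawl(start_urls: Iterable[str], max_workers: int = 4) -> List[Mp3]:
    """Expand nested playlists in parallel until only MP3 pages remain."""
    mp3s: List[Mp3] = []
    seen = set(start_urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = {pool.submit(process, url) for url in seen}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                new_urls, found = future.result()
                mp3s.extend(found)
                for url in new_urls:
                    if url not in seen:  # avoid crawling a page twice
                        seen.add(url)
                        pending.add(pool.submit(process, url))
    return mp3s


if __name__ == "__main__":
    for mp3 in sorted(crawl(["/playlist/top"]), key=lambda m: m.name):
        print(mp3.name, mp3.link)
```

Only the main thread touches the `seen` set and the `mp3s` list here, so the workers need no locks. The order of the returned list depends on which tasks finish first.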
