Crawl collection information from QQ Music for further data analysis. The crawler first retrieves data from the QQ Music collection homepage, then visits each detail page to get the list of songs. The two crawlers are split into two files. The first file, "qq_music_collection_step1.py", crawls the basic information of each collection and stores it in the Mongo database. The second file, "qq_music_collection_step2.py", gets the detailed information from the detail page.
Python 3.6.5
- Install virtualenv
pip install virtualenv
- Activate virtualenv
cd [project_path]
virtualenv .venv --python=python3
source .venv/bin/activate
- Install packages
pip install -r requirements.txt
- Set up MongoDB connection
Depending on your Mongo setup, you need to configure the URI in settings.py in order to store data in the database. Start from settings.py.example and modify it to proceed.
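A minimal settings.py might look like the sketch below. The variable names (MONGO_URI, MONGO_DB, MONGO_COLLECTION) are assumptions for illustration; check which names the step scripts actually import.

```python
# settings.py -- hypothetical example layout; adjust to your MongoDB deployment.
# The variable names below are assumed, not taken from the project's code.

# Standard MongoDB connection string (add username/password if auth is enabled)
MONGO_URI = "mongodb://localhost:27017"

# Database and collection names used to store the crawled data
MONGO_DB = "qq_music"
MONGO_COLLECTION = "collections"
```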
- Modify request.py
You will likely need some kind of IP rotation to avoid being banned. Modify the request.py file and add a proxy in it.
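As a rough sketch, proxy rotation could look like the following. The proxy addresses and the get_proxy helper are hypothetical; the actual integration point depends on how request.py issues its HTTP requests.

```python
import random

# Hypothetical proxy pool -- replace with real proxies or a proxy-service API.
PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]

def get_proxy():
    """Pick a random proxy, in the dict format the `requests` library expects."""
    addr = random.choice(PROXIES)
    return {"http": addr, "https": addr}

# Usage sketch (assuming request.py uses the `requests` library):
#   requests.get(url, headers=headers, proxies=get_proxy(), timeout=10)
```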
- Run the first script
python qq_music_collection_step1.py
This will get the basic information of the collections from the homepage.
- Run the second script
python qq_music_collection_step2.py
This will get the detailed information of each collection obtained in the first step.
Known anti-crawling mechanisms
- Need to set Referer in the header of each request
- Need to set User-Agent
- IP rotation (not tested yet)
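The first two points can be handled by attaching headers to every request, sketched below with the standard library. The exact Referer value is an assumption; it should point at the QQ Music page that links to the resource being fetched.

```python
import urllib.request

# Headers that mimic a regular browser visit. The Referer URL is an assumed
# example -- QQ Music checks that requests appear to come from its own pages.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Referer": "https://y.qq.com/n/ryqq/playlist",
}

def build_request(url):
    """Attach the anti-crawling headers to a stdlib urllib request."""
    return urllib.request.Request(url, headers=HEADERS)
```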