# doujinstyle-scraper
Ethically scrapes doujinstyle.com.
> [!IMPORTANT]
> Work in progress (nothing so far! A literal stub right now.)
## doujinstyle.com 🌐

> DoujinStyle functions as an index of content found publicly on the Internet
>
> https://doujinstyle.com/?p=dmca

In this case, "content" is mostly music.
> [!WARNING]
> Tested and developed using Python 3.13. I don't expect the app to run on anything below 3.12.
## Format 📦

Exports each entry into a single JSON file. Its exact contents are still W.I.P.
## Time & Breakage ⏳

Since we are scraping and parsing the website's public HTML, and not any kind of API, it is very likely this project will not last long. The website need only become prettier, modifying or adding HTML, and the existing parser will most likely break.

It is also likely the website may modernize in a way that adds a cruel CAPTCHA or a rate limiter.

There is also the possibility of the website being taken down, somehow. At the time of writing, "Version 3" is displayed near the site's logo, implying earlier versions of the website might have been taken down, or just modernized.
## Motivation 💿

While searching for a high-quality FLAC recording of LEMON MELON COOKIE (TAK), I stumbled upon this website. It immediately sparked a flame of need within me: the need to SCRAPE. doujinstyle.com looked so docile and scrapable, I couldn't resist scraping it to the bone!
## Requests & Inner Workings ⚡

Let N be the number of IDs you want to fetch. The program makes 2 * N HTTP requests:

- One HTTP GET to fetch the contents of the item page.
- One HTTP POST on the download form to fetch the download link.

Since the POST also redirects, the real count may be more than 2 * N, but for the sake of simplicity we'll say it's 2 * N.

I reckon this POST request is how the website counts the number of times an item has been downloaded, visible via the `# of Downloads:` label on each item.
> [!NOTE]
> Replace `<item_id>` with the ID of the item.
- The HTTP GET request URL looks like this:

  ```
  https://doujinstyle.com/?p=page&type=1&id=<item_id>
  ```

  This returns the same HTML that is sent when visiting the page in a web browser.

- The HTTP POST request data is as follows:

  ```json
  {
      "type": "1",
      "id": "<item_id>",
      "source": "0",
      "download_link": ""
  }
  ```

  This returns the download link associated with the item (usually Mediafire or Mega). It can be sent either to an item URL `https://doujinstyle.com/?p=page&type=1&id=<item_id>` or directly to the base URL `https://doujinstyle.com/`; both seem to work.
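To make the flow above concrete, here is a minimal synchronous sketch of the two requests per item. It uses the third-party `requests` library (not necessarily what the app itself uses), and it assumes, based on the notes above, that the download link comes back as a redirect:

```python
# Sketch of the two requests made per item: GET the page, POST the download form.
# Assumes the download link is returned as a redirect Location header.
import requests

BASE_URL = "https://doujinstyle.com/"


def fetch_item(item_id: int) -> tuple[str, str]:
    """Return (page_html, download_url) for a given item ID."""
    page_url = f"{BASE_URL}?p=page&type=1&id={item_id}"

    # 1) GET the item page: the same HTML a web browser would receive.
    page_html = requests.get(page_url, timeout=30).text

    # 2) POST the download form to obtain the actual download link.
    form_data = {
        "type": "1",
        "id": str(item_id),
        "source": "0",
        "download_link": "",  # must be present (empty) for the link to be returned
    }
    response = requests.post(page_url, data=form_data, allow_redirects=False, timeout=30)

    # The redirect target should be the Mediafire/Mega download link.
    return page_html, response.headers.get("Location", "")


if __name__ == "__main__":
    html, link = fetch_item(6)
    print(link)
```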
Concerning the values of the POST data:

- `type`: I don't know what it means, only that sometimes, e.g., for ID=6, setting it to `1` returns this download URL:

  ```
  https://mega.nz/#!ZE5UXYIA!VYp8h5mG1_pgQA8PebVN0gEElMjNAOijtUZf-_-dxLc
  ```

  And setting it to `2` returns this one:

  ```
  https://mega.nz/#!8QMF3YBI!Bj7OJnXHpfTBnr6jfY5O_k_oXVyEV8OMUpPIxH1OERM
  ```

  Different URLs, but the first one seems to be the good one: when a user clicks the 'Download' button on the page, it redirects to that same URL.

- `id`: The item ID.

- `source`: I don't know what it means. It is set to `0` by default. Maybe it selects a different CDN; however, when set to `1`, the posted URL is returned instead of the download link. When set to an empty string, the POST request still seems to function.

- `download_link`: I don't know what it means. Only that it must be present as an empty string for the download URL to be returned; otherwise, the posted URL is returned.
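For illustration, a small sketch of how the `type` observation above could be reproduced, varying only that field and keeping the others at their defaults (same `requests` usage and redirect assumption as in the sketch above):

```python
# Probe how the `type` field changes the returned download URL for one item (ID=6).
# Assumes the link is returned as a redirect Location header, as noted above.
import requests

ITEM_URL = "https://doujinstyle.com/?p=page&type=1&id=6"

for type_value in ("1", "2"):
    data = {
        "type": type_value,
        "id": "6",
        "source": "0",
        "download_link": "",
    }
    resp = requests.post(ITEM_URL, data=data, allow_redirects=False, timeout=30)
    print(type_value, "->", resp.headers.get("Location"))
```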
## App Components

The app has three main components:

- The `logger`, which initializes an app logger.
- The `fetcher`, which does all the asynchronous requests to the website.
- The `parser`, which parses the response from the `fetcher` to get usable data.

The `fetcher` and `parser` communicate via a callback function that is called whenever the `fetcher` has fetched the data.
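A rough sketch of that wiring is below. The module and function names are hypothetical (not the repo's actual API), and it uses `aiohttp` purely as an example async HTTP client:

```python
# Illustrative sketch of the logger / fetcher / parser split and the callback
# bridging fetcher and parser. Names are made up for this example.
import asyncio
import logging
from typing import Callable

import aiohttp

logger = logging.getLogger("doujinstyle-scraper")  # the `logger` component


def parse_item(item_id: int, html: str) -> dict:
    """The `parser`: turn raw HTML into usable data (stubbed here)."""
    return {"id": item_id, "html_length": len(html)}


async def fetch_items(ids: list[int], on_fetched: Callable[[int, str], None]) -> None:
    """The `fetcher`: request each item page asynchronously, then invoke the callback."""
    async with aiohttp.ClientSession() as session:

        async def fetch_one(item_id: int) -> None:
            url = f"https://doujinstyle.com/?p=page&type=1&id={item_id}"
            async with session.get(url) as resp:
                html = await resp.text()
            logger.info("fetched item %d", item_id)
            on_fetched(item_id, html)  # the callback bridging fetcher and parser

        await asyncio.gather(*(fetch_one(i) for i in ids))


async def main() -> None:
    results: list[dict] = []
    await fetch_items([1, 2, 3], lambda i, html: results.append(parse_item(i, html)))
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```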
## Find Highest Item ID 🗻

- Visit doujinstyle.com and click on the title of the latest item (top left-hand corner).
- Copy the ID from the URL, which follows this format:

  ```
  https://doujinstyle.com/?p=page&type=1&id=<item_id>
  ```

  `<item_id>` is the latest, highest ID.
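If you'd rather pull the ID out of the copied URL programmatically, a tiny standard-library sketch (the helper name and the example ID are made up):

```python
# Extract the item ID from a copied doujinstyle.com item URL.
from urllib.parse import parse_qs, urlparse


def extract_item_id(url: str) -> int:
    query = parse_qs(urlparse(url).query)
    return int(query["id"][0])


print(extract_item_id("https://doujinstyle.com/?p=page&type=1&id=12345"))  # 12345
```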