Subdomains Wordlist: How Is It Made?


Hi there!

I hope you are doing well.

Recently, I created a new GitHub repository for a subdomain wordlist. The repo can be found at https://github.com/shriyanss/subdomains_wordlist.

This repository contains multiple wordlists, but they are all generated from the same dataset. So first of all, I'll tell you how this dataset is obtained.

How is the dataset obtained?

The tooling first gets the subdomains for a list of domains. That list of domains is obtained from @arkadiyt/bounty-targets-data/data/domains.txt (thanks to @arkadiyt for this repo 😉 ).
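For reference, that file can be pulled straight from the repo. A minimal sketch of the download step (the raw URL and branch name here are my assumption, not something stated in the post):

```
# sketch: download the scope file from bounty-targets-data
# (raw URL / branch name assumed; adjust if the repo layout differs)
import urllib.request

URL = "https://raw.githubusercontent.com/arkadiyt/bounty-targets-data/main/data/domains.txt"

with urllib.request.urlopen(URL) as resp:
    scope_entries = resp.read().decode().splitlines()

print(f"{len(scope_entries)} scope entries downloaded")
```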

Since this list is too raw to process further, I first extract just the domains from it with a custom Python script. By calling it raw, I mean that there are programs that add something like app.target.com to their scope. From an entry like that, I'm not interested in the app part (or whatever else is prepended); I'm only interested in the domain itself.
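The script itself isn't published in this post, but the idea looks roughly like the sketch below. I'm using the third-party tldextract package purely for illustration; the real script may do this differently.

```
# sketch: reduce scope entries like "app.target.com" or "*.target.com"
# to their registered domain "target.com"
# (tldextract is my assumption; the actual script may differ)
import tldextract

def extract_domains(entries):
    domains = set()
    for entry in entries:
        ext = tldextract.extract(entry.strip().lstrip("*."))
        if ext.domain and ext.suffix:
            domains.add(f"{ext.domain}.{ext.suffix}")
    return sorted(domains)

with open("domains_raw.txt") as f:
    raw = f.read().splitlines()

with open("domains.txt", "w") as f:
    f.write("\n".join(extract_domains(raw)) + "\n")
```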

Once I've got the domains, the next step is subdomain enumeration, which a bash script handles. But before passing everything to it, I wrote another Python script that hands it X domains at a time to enumerate. This is because I don't want my machine to be blasting away at subdomain enumeration for a whole week, which might starve other processes. The script simply reads a file called last.txt and returns the next X domains from the domains file; if it reaches the end of the file, it wraps around to the top to fill out that batch of X.
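A minimal sketch of that batching logic (the file names come from the post; the batch size, and the assumption that last.txt stores an index, are mine):

```
# sketch: hand the enumeration script the next batch of domains,
# remembering where we stopped (last.txt) and wrapping around at the end
BATCH = 50  # the "X" from the post; pick whatever your machine can handle

with open("domains.txt") as f:
    domains = f.read().splitlines()

try:
    with open("last.txt") as f:
        start = int(f.read().strip())
except FileNotFoundError:
    start = 0

# take BATCH domains starting at `start`, wrapping to the top if needed
batch = [domains[(start + i) % len(domains)] for i in range(BATCH)]

with open("last.txt", "w") as f:
    f.write(str((start + BATCH) % len(domains)))

with open("batch.txt", "w") as f:
    f.write("\n".join(batch) + "\n")
```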

At this point, our dataset to be processed is ready.

At the time of writing this blog, I had nearly 1 million subdomains. This count can easily increase later.

Generating the wordlist

Once the subdomain list is ready, I simply run my tool subgen on it, and a raw wordlist.txt is complete. The thing is, it's not really a wordlist yet. It's just a raw dump of subdomains from different companies.

The next thing to be done is to remove some prominent noise from the wordlist. Currently, I manually scroll through the list and look for noise patterns. For example, subdomains containing just numbers are not very useful and are quite prominent in the wordlist, so I simply remove them. More noise patterns like this are removed from the wordlist, and the results are written to no_number.txt, no_uuid.txt, and others.
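The exact filters live in the repo; as an illustration, the number-only and UUID-shaped entries could be matched with something like the sketch below (the regexes and file handling here are my own, not necessarily what the repo uses):

```
# sketch: drop number-only and UUID-looking entries from the raw dump
# (regexes are illustrative; the repo's actual filters may differ)
import re

NUMBER_ONLY = re.compile(r"^\d+$")
UUID_LIKE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I
)

with open("wordlist.txt") as f:
    entries = [line.strip() for line in f if line.strip()]

no_number = [e for e in entries if not NUMBER_ONLY.match(e)]
no_uuid = [e for e in entries if not UUID_LIKE.match(e)]

with open("no_number.txt", "w") as f:
    f.write("\n".join(no_number) + "\n")
with open("no_uuid.txt", "w") as f:
    f.write("\n".join(no_uuid) + "\n")
```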

The file filtered_subdomain_wordlist.txt is a combination of the above two files (no UUIDs and no numbers) and multiple other patterns I've identified.

Next comes frequent.txt. The generation of this file takes the longest. It iterates through filtered_subdomain_wordlist.txt and checks how often each line occurs: it goes to the first line, which contains, say, www, then iterates through the whole file and counts its occurrences. I know this process is inefficient, but it can be improved later when it gets really slow. The threshold is currently set to 5, meaning that if any subdomain appears at least 5 times in the list, it will be included in the frequent wordlist. I might change this later, based on the noise I get, or maybe separate them into different files.
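The per-line rescan described above is quadratic in the file size. A single-pass sketch of the same idea, using a Counter and the threshold of 5 mentioned in the post (this is my suggested alternative, not the script as it exists today):

```
# sketch: keep entries that appear at least THRESHOLD times,
# counted in a single pass instead of rescanning the file per line
from collections import Counter

THRESHOLD = 5

with open("filtered_subdomain_wordlist.txt") as f:
    counts = Counter(line.strip() for line in f if line.strip())

frequent = sorted(word for word, n in counts.items() if n >= THRESHOLD)

with open("frequent.txt", "w") as f:
    f.write("\n".join(frequent) + "\n")
```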

I hope this was informative. See you next time 🙂

Also, feel free to star the wordlist repo. Link: https://github.com/shriyanss/subdomains_wordlist

Pull requests related to grammar, missing content, or anything else are always welcome (but no spam) 🙂
