How I Collect Passwords
Some of you out there know that I have been collecting passwords for quite some time. Since 1998 to be exact. Originally I did it just to have big wordlists for password cracking, then I started gathering them for research on my Perfect Passwords book, finally it became like a big ball of string where you just do it because it makes no sense to stop now. My list currently contains about 6 million unique username/password combinations (not counting those from public lists from Gawker, RockYou, and others).
So I thought that some people might be interested in how I collect these passwords. Note that every password on my list has already been made public and can easily be found by anyone. Also note that so far I have never shared this list with anyone.
- I use tools such as Athena, which does massive Google searches for and collects passwords in the format “http://user:password@example.com/members”. This tool can easily gather 200,000 combos in a day but the majority of these are already in my database. I run this about once a month.
- I have a script that nightly leeches from a huge list of well-known password sharing web sites.
- I use a number of Google alerts that watch for common keylogger log formats. There are a surprisingly huge number of these logs that can be found via Google, although it is sometimes difficult to parse the passwords from the content.
- I use Google alerts to watch for SQL database dumps of forum and other common software databases.
- I also use Google alerts to look for passwords on pastebin.com and other related sites.
- I use a script that grabs all the Google alerts as RSS feeds and parses out URLs, then another script visits each site and leeches the passwords.
- I use RSS feeds from filestube.com to watch for and download password lists that might show up on a number of file sharing sites.
- I use RSS feeds from various torrent searches that I put into uTorrent to download automatically.
- I use a number of IRC bots that hang out in a large number of IRC channels where password sharing happens. These aren’t as effective as they once were but I still use them occasionally.
- I use a script to automatically download posts from various Usenet newsgroups, although most of those are just spam nowadays.
- I visit a number of public and private hacking-related forums to get wordlists and hacked passwords. I often pay for VIP memberships (usually the lifetime ones) so that I can access premium content areas. Leeching from forums has to be done manually, because you often have to comment on posts to be able to download the lists, but occasionally I will spend half a day leeching from these forums. Some forums will let you subscribe to posts and will include the entire post contents in the email. This bypasses the often-used “hide hack” and I can just use another script to save that inbox to local files.
- I use various FTP search engines to watch for interesting filenames that might show up on FTP sites.
- In the past I have used various P2P networks (such as LimeWire) to search for files but those don’t produce many results nowadays.
- Every once in a while someone will send me a big dump of their own lists they have collected.
As these scripts collect data, it is all dumped into a directory on my hard drive, and I regularly run a program I wrote that parses all the data looking for passwords in common formats.
Here are some examples of what the program recognizes:
http://www.example.com/members/ L:user1 P:password1
http://www.example.com/members login:user1 password:password1
http://www.example.com/members user: user1 pass: password1
Login: user1 passw:password1
L:user1 P:password1
username:user1 password:password1
http://www.example.com/members L: user1 P: password1
username = user1 password= password1
u=user1 p=password1
username user1 password password1
login id: user1 password: password1
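A minimal sketch of how lines like these could be recognized with a single regular expression. This is not the actual program described in the post; the label lists, the `COMBO_RE` pattern, and the `parse_combo` helper are assumptions built only from the sample lines above, and short labels such as `L` or `u` will produce false positives on free-form text.

```python
import re

# Hypothetical label sets, taken only from the sample lines above.
# Longer labels come first so "username" wins over "user" and "u".
USER_KEYS = r"(?:login id|username|user|login|l|u)"
PASS_KEYS = r"(?:password|passw|pass|p)"
SEP = r"\s*[:=]?\s*"  # optional ":" or "=", with optional spaces around it

COMBO_RE = re.compile(
    rf"(?:(?P<url>https?://\S+)\s+)?"       # optional leading URL
    rf"\b{USER_KEYS}{SEP}(?P<user>\S+)\s+"  # username label and value
    rf"{PASS_KEYS}{SEP}(?P<password>\S+)",  # password label and value
    re.IGNORECASE,
)


def parse_combo(line):
    """Return (url, user, password) for a recognized line, else None."""
    match = COMBO_RE.search(line)
    if match is None:
        return None
    return match.group("url"), match.group("user"), match.group("password")


if __name__ == "__main__":
    for sample in (
        "http://www.example.com/members/ L:user1 P:password1",
        "username = user1 password= password1",
        "login id: user1 password: password1",
    ):
        print(parse_combo(sample))
```

The URL group is optional, so the same pattern covers both the lines that start with a site address and the bare label/value lines.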
The program grabs the username/password combos and saves them into a text log file. After a while these files accumulate and I merge them into my master database. In the database I perform cleanup steps such as removing passwords from well-known password hackers (such as pr0test) and other junk that might appear. I also strip domain names off usernames that are email addresses.
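The two cleanup steps mentioned here could be sketched roughly as follows. The `JUNK` set and the `clean_combo` helper are hypothetical; the post only names pr0test as an example and does not say whether the check applies to the username or the password, so this sketch assumes either field.

```python
# Hypothetical blocklist; the post names only "pr0test" as an example.
JUNK = {"pr0test"}


def clean_combo(user, password):
    """Return a cleaned (user, password) pair, or None if it should be dropped."""
    # Strip the domain from usernames that are email addresses.
    user = user.split("@", 1)[0]
    # Drop empty fields and known junk entries (assumed to match either field).
    if not user or not password:
        return None
    if user.lower() in JUNK or password.lower() in JUNK:
        return None
    return user, password


if __name__ == "__main__":
    print(clean_combo("alice@example.com", "hunter2"))
    print(clean_combo("pr0test", "pr0test"))
```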
What is interesting about all this is how difficult it is to find new username/password combos that aren’t already on my list. These scripts can easily collect 100,000 unique username/password combos every day, but only a few thousand of those are not already on my list.
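Checking a day's batch against the master list comes down to a set difference. A small sketch, assuming the master list fits in memory as a Python set of `(user, password)` tuples; `split_new` is a hypothetical helper name, and a real database would more likely use a unique index.

```python
def split_new(collected, master):
    """Return (new_combos, fraction_new) for one batch against the master set."""
    batch = set(collected)  # de-duplicate within the batch first
    new = batch - master    # combos not yet on the master list
    fraction = len(new) / len(batch) if batch else 0.0
    return new, fraction


if __name__ == "__main__":
    master = {("a", "1"), ("b", "2")}
    batch = [("a", "1"), ("c", "3"), ("c", "3"), ("d", "4"), ("b", "2")]
    new, fraction = split_new(batch, master)
    print(sorted(new), fraction)
```

With the figures above, a batch of 100,000 unique combos that yields a few thousand new ones gives a fraction of only a few percent.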
After 12+ years of collecting passwords, I have found a few interesting facts:
- Although my list contains about 6 million username/password combos, the list only contains about 1,300,000 unique passwords.
- Of those, approximately 300,000 are used by more than one person; about 1,000,000 appear only once (and a good portion of those are obviously generated by a computer).
- The list of the top 20 passwords rarely changes and 1 out of every 50 people uses one of these passwords.
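Figures like these can be computed from a combo list with a frequency count. The following is a sketch, not the method actually used for the numbers above; `password_stats` and its field names are hypothetical.

```python
from collections import Counter


def password_stats(combos, top_n=20):
    """Summarize password reuse in an iterable of (user, password) pairs."""
    counts = Counter(password for _user, password in combos)
    total = sum(counts.values())
    shared = sum(1 for n in counts.values() if n > 1)
    top_total = sum(n for _password, n in counts.most_common(top_n))
    return {
        "combos": total,                      # all username/password pairs
        "unique_passwords": len(counts),      # distinct passwords
        "shared": shared,                     # used by more than one account
        "single_use": len(counts) - shared,   # appear exactly once
        "top_share": top_total / total if total else 0.0,
    }


if __name__ == "__main__":
    sample = [("a", "123456"), ("b", "123456"), ("c", "qwerty"), ("d", "x9$kQ")]
    print(password_stats(sample, top_n=1))
```

A `top_share` of 0.02 with `top_n=20` would correspond to the "1 out of every 50 people" figure.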
There are a few flaws with my list that I should point out:
- Many of these passwords have been cracked from hashes so a good percentage of them would by nature be crackable, skewing the statistics some.
- These passwords are largely dominated by passwords from adult web sites, which are the ones mostly publicly shared. This results in a higher percentage of adult-related and obscene passwords.
- These passwords are usually from web sites that often do not enforce the strong password policies that a private organization might. This is bad because this data doesn’t truly reflect all passwords, but on the other hand it shows the kind of passwords users will select if a password policy is not enforced.
- My scripts only grab usernames and passwords between 3 and 30 characters long; all others are thrown out.
- None of the passwords contain a colon, because that is the delimiter used to separate usernames and passwords in the combo lists my scripts generate.
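The last two limitations amount to a simple filter plus a colon-delimited output format. A sketch under stated assumptions: `keep_combo` and `format_combo` are hypothetical names, and this version also rejects colons in usernames, which the post does not state but which keeps each output line unambiguous.

```python
def keep_combo(user, password, min_len=3, max_len=30):
    """Apply the 3 to 30 character limit and the no-colon rule."""
    for value in (user, password):
        if not (min_len <= len(value) <= max_len):
            return False
        if ":" in value:  # colon is the delimiter in the combo list
            return False
    return True


def format_combo(user, password):
    """Render one combo as a 'user:password' line."""
    return f"{user}:{password}"


if __name__ == "__main__":
    for user, password in [("user1", "password1"), ("ab", "password1"), ("user1", "pass:word")]:
        if keep_combo(user, password):
            print(format_combo(user, password))
```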
So that is how I collect my passwords. Maybe someday I will share the list itself.
Incidentally, the one tool I really wish I had time to build is either a proxy server or a Greasemonkey script that will automatically parse and log username and password combos from web pages that you visit. That would be extremely helpful!
Tags: combos, Gawker, Hacking, hashes, passes, password, password combinations, password lists, Passwords, pastebin, RockYou, Tools, wordlists


Why not share the list?
While the work you’ve done is impressive, 80% of the list could be compiled with one or two days’ work, if only using the bullet points of this post. This is not to belittle your work in any way — please believe me in that regard.
Having a large, maintained, common list would also help with quick security audits for legitimate security professionals. Brute-forcing is hard, comparing against 1.3 million commonly-used passwords is not. Black-hat folks already have the resources necessary to compile their own lists.
“While the work you’ve done is impressive, 80% of the list could be compiled with one or two days’ work”
That may be true, but the last 20% is what takes so long.
And I do plan on releasing this list soon.
Can appreciate your work on Human selected password frequencies.
I had noticed it as a result of my large password cracking times. Until now, I have been using Pareto Principle, 80% of passwords come from 20% of the key space.
I would be interested in a re-cast of your statistics based on key space usage. This approach varies by Language, but is a direct aid in computational modeling of Average and Better than Average Password formation strategies.
For practical purposes, I adapt to Moore’s Law, that computing power to crack passwords grows by a power of 10 every 10 years, by holding a constant, inflation-adjusted price for the cracking system, software and purchased tables.
By the metric that the cracking system remains worth less than $2000 USD in 1990 dollars, the time taken to crack 50% of the password population defines the average password survival time. This remains between 1 and 2.5 minutes.
Mathematically, the average password should last about 0.03 seconds on such systems.
Have you also picked up a copy of the SONY password file release of 1 Million passwords?
I do create training materials in human password selection. What I learned is that any training rule should inspire a student to select a password and that should be re-cracked for the purpose of improving password selection training materials. I have made many generations of improvement in password training. With them, I can train a user to form a password that has a better than 50% odds of surviving 2 to 36 hours of password cracking effort in about 20 minutes.
Surveys of trained professionals that crack passwords for a living tend to support the view that the average (50% likelihood) password cracking effort lasts 2 hours before self-termination. That is, the cracker has better things to do with their computer than seek diminishing returns from their computing system.
Your effort fits into partial optimization of the first 2 minutes of the cracking effort.
I also did work on the formation of Designed Experiment Passwords. These are useful in characterizing the cracking efficiency of password cracking tools.
It is possible to make a set of passwords that can enable a statistical curve fit of the cracking times each specific cracking tool would take to crack that password. That curve fit can map back to the length, complexity and placement inside each password. Also, because the passwords are designed, it is impractical for the tool maker to alter their software to re-optimize on these designed passwords. Thus, measuring passwords based on features that lead to long vs short cracking times defines the phrase, “A Strong vs Weak Password.”
Were you ever trained or did you assume you knew what the word Strong in the Phrase “Strong Password” actually means? If like me, you are an IT professional, then in general one is the most likely to assume they know while having the least level of training before they make such a claim. Does this picture sound about right?
Take a look at these lists but realize that you are looking at the mistakes in password choice that do not survive the first 30 seconds. What is your plan to last the next 2 hours, so that your password has a 50/50 chance of remaining uncracked?
The good news is that what survives 2 hours only needs small improvements to last 36 hours. Then, the password has better than an 84% chance, +1 Sigma, of remaining uncracked.
Hello, I found an image that is a password tag cloud that has you as the credit… I would love to purchase a poster of this as I think it is unique and would fit well in my office… do you either 1) Offer this as a print or 2) can you provide a high-resolution image that I could have professionally printed? I am not looking to sell these or give away.. I just want one for my personal use.
Thanks,
Steve
The PDF version should be high resolution. Here is a 36″ x 12″ poster of the tag cloud:
https://www.zazzle.com/top_500_most_common_passwords_poster-228062979422658434
(be sure to click on the little photos to see how it would look above the fireplace, in the children’s room, etc.)