Center for Democracy & Technology
March 2003
Every day, millions of people receive dozens of unsolicited commercial e-mails (UCE), known popularly as "spam." Some users see spam as a minor annoyance, while others are so overwhelmed with spam that they are forced to switch e-mail addresses. This has led many Internet users to wonder: How did these people get my e-mail address?
In the summer of 2002, CDT embarked on a project to attempt to determine the source of spam. To do so, we set up hundreds of different e-mail addresses, used them for a single purpose, and then waited six months to see what kind of mail those addresses were receiving. It should come as no surprise to most e-mail users that many of the addresses CDT created for this study attracted spam, but it is very interesting to see the different ways that e-mail addresses attracted spam -- and the different volumes -- depending on where the e-mail addresses were used.
The results offer Internet users insights about what online behavior results in the most spam. The results also debunk some of the myths about spam.
Spammers use programs known as robots or spiders to record e-mail addresses listed on Web sites, including both personal Web pages and institutional (corporate or non-profit) Web pages.
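CDT tested two methods of obstructing address harvesting: writing the address in human-readable form (for example, "example at domain dot com") and encoding it with HTML numeric character equivalents.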
E-mail addresses posted to Web sites using these conventions did not receive any spam.
Currently there is no foolproof way to prevent spam. Based on our research, we recommend that Internet users try the following methods to prevent spam:
CDT received the most spam at addresses that had simply been placed at the bottom of a Web page. Spammers "harvest" these addresses with computer programs that collect and process addresses and add them to spam mailing lists. If a user must post his/her e-mail address in a public place, it is useful to disguise the address through simple means such as replacing "example@domain.com" with "example at domain dot com" or other variations such as the HTML numeric equivalent, in which "example@domain.com" could be written as "example&#64;domain&#46;com".
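The two disguises described above can be produced mechanically. The following is a minimal sketch, not the tool used in the study; the function names are hypothetical, and the second function encodes every character as a decimal HTML character reference, which is only one possible variation of the "HTML numeric equivalent."

```python
def to_human_readable(address):
    """Spell out the punctuation: 'example@domain.com' -> 'example at domain dot com'.

    Only dots in the domain part are spelled out (an assumption of this sketch).
    """
    local, _, domain = address.partition("@")
    return local + " at " + domain.replace(".", " dot ")


def to_html_entities(address):
    """Replace each character with its decimal HTML character reference."""
    return "".join("&#{};".format(ord(ch)) for ch in address)


print(to_human_readable("example@domain.com"))
print(to_html_entities("example@domain.com"))
```

A browser renders the entity-encoded form as the normal address, so a human visitor sees no difference, while the page source no longer contains a literal "@" sign.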
Opt out of member directories that may place your e-mail
address online. If your employer places your e-mail address online, ask the
Webmaster to make sure it is disguised in some way.
If you don't want to receive e-mail from a Web site operator,
don't give them your e-mail address unless they offer the option of declining
to receive e-mail and you exercise that option. If you are asked for your
e-mail address in an online setting such as a form, make sure you pay
attention to any options discussing how the address will be used. Pay
attention to check boxes that request the right to send you e-mails or share
your e-mail address with partners. Read the privacy policies of Web sites. If
you suspect that a Web site has violated its privacy policy, you can report it
to your state attorney general or the Federal Trade Commission.
When using an unfamiliar Web site or posting to a newsgroup, establish an e-mail address for that specific purpose. Alternatively, instead of relying on just one or two e-mail addresses, you can use "disposable e-mail addresses," which consolidate e-mail in a single location but allow you to immediately shut off any address that is attracting spam. By recording which disposable address was used at which Web site, you can track which sites are causing spam. Many Web sites now provide free e-mail accounts. A search in Google Directory for "disposable e-mail addresses" provides a list of e-mail providers designed for one-time-use addresses.
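The record-keeping suggested here can be as simple as a lookup table. The sketch below is illustrative only: the addresses, site names, and the `spam_by_site` helper are hypothetical and do not come from the study.

```python
from collections import Counter

# Hypothetical record of which disposable address was given to which site.
address_to_site = {
    "shop-7f3k@example.org": "shop.example",
    "news-q92x@example.org": "news.example",
}


def spam_by_site(recipients, address_to_site):
    """Count spam messages per site, given the recipient address of each message."""
    counts = Counter()
    for recipient in recipients:
        # Addresses are compared case-insensitively; unknown ones are grouped.
        site = address_to_site.get(recipient.lower(), "unknown")
        counts[site] += 1
    return counts


received = ["shop-7f3k@example.org", "SHOP-7F3K@example.org", "x@example.org"]
print(spam_by_site(received, address_to_site))
```

An address that starts to accumulate a high count can then be shut off, and the corresponding site identified as the likely source.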
Many ISPs and free e-mail services now provide spam filtering. While filters are not perfect, they can greatly reduce the amount of spam a user receives.
At least one spammer tried to guess the e-mail addresses used in this study by sending mail to short and common addresses. E-mail addresses composed of short names and initials like bob@ or tse@, or basic combinations like smithj@ or toms@ will probably receive more spam. E-mail addresses need not be incomprehensible, but a user with a common or short name may want to modify or add to it in some way in his or her e-mail address.
For further information, please contact Ari Schwartz at the Center for Democracy & Technology, 202-637-9800, ari@cdt.org.
Junk e-mail, a.k.a. spam, inconveniences tens of millions of Internet users and imposes huge costs on ISPs. Armed with lists of e-mail addresses, "spammers" send billions of e-mail messages every day -- messages that most users don't want.
It is often difficult or impossible to tell how a spammer acquired a user's e-mail address. Was it a result of some activity the user engaged in? Did the user give his/her e-mail address to the wrong person? Was the user randomly targeted? Are there steps the user could take to avoid such spam in the future?
This study attempts to answer some of these questions by analyzing common activities of Internet users and looking for evidence of some activities that resulted in one e-mail address receiving more spam than others. We do not believe that this report answers every question about spam, where it comes from, or how to stop it. However, by illuminating some of the ways that an e-mail address can be added to a spam list, the study provides users and policymakers with a better understanding of the problem and some guidance about how to better avoid spam in the future.
The goal of this study was to understand whether certain kinds of Internet activities make a user an easy target for spam.
To determine how a person's e-mail address finds its way onto spam lists, CDT created hundreds of e-mail accounts and seeded the addresses in dozens of popular Internet locations.
Each e-mail address was used or posted in only one place; Table 1 summarizes the ways in which the addresses were used or posted. The addresses themselves were randomized, making it unlikely that a spam sender could guess them [1] -- one sample address was "m45k5e@egovtoolkit.org."[2]
Table 1 - Usage Categories
Type of online activity | Control addresses | Experimental addresses |
---|---|---|
Public Web posting | Addresses were posted on a publicly accessible Web page and left online for six months. | |
USENET | Addresses were used in the headers of posted messages. | |
Web services (Appendix 1 lists the Web-based companies and organizations to which e-mail addresses were provided) | Addresses were provided to Web sites offering various online services using default and/or "opt-in" privacy preferences. | |
Web-based postings | An e-mail address was provided as part of posting to a Web-based job, auction, or discussion board. | |
WHOIS database | An e-mail address was provided as part of registering a ".com" or ".org" domain. | None. |
The project examined five basic ways of posting or otherwise disclosing an e-mail address, and how each could affect the amount of spam a user received. The activities examined were: 1) posting an e-mail address on a public Web site; 2) posting an e-mail address on a USENET newsgroup; 3) providing an e-mail address to a popular Web site in connection with some service; 4) providing an e-mail address to a popular Web site in order to post on a job, auction, or discussion board; and 5) posting the address in the WHOIS database of information about domain name registrants.
In each area of online activity, we created a set of "control" addresses, provided in a straightforward manner with no attempt to avoid future spam, and one or more "experimental" addresses, each of which utilized a particular anti-spam measure.
In six months of operation, our project received over 10,000 e-mail messages at the more than 250 single-use e-mail addresses we created. About 1,600 of these were legitimate e-mail communications that we had requested from various online services. Another 62 were unclassifiable due to incomplete e-mail headers or other missing data. A further 16 messages arrived after we had opted out of future communications from a business to which we had given an e-mail address, but within the two-week grace period that our methodology allowed. We classified the remaining 8,842 messages as unsolicited e-mail, a.k.a. spam.
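As a quick check, the categories above can be added up. This is only a rough tally: the count of legitimate messages is reported as "about 1,600," so the total and the resulting share are approximate.

```python
# Message counts as reported above; "legitimate" is an approximate figure.
legitimate = 1600
unclassifiable = 62
grace_period = 16
spam = 8842

total = legitimate + unclassifiable + grace_period + spam
spam_share = spam / total

print(total)                     # approximate total, consistent with "over 10,000"
print(round(spam_share * 100))   # approximate percentage classified as spam
```

Under these figures, roughly five out of every six messages received by the test addresses were spam.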
Figure 1 - Sources of addresses used by spammers
The vast majority of the spam we received -- over 97% of it -- was delivered to addresses that had been posted on the public Web.
Figure 2 - Messages received by addresses on the public Web
All the plaintext e-mail addresses we placed on the public web received some spam. The number of messages received seems to be related to the popularity of the web site. GetNetWise.org is a well-known online safety site that is linked to by major portals like AOL and Yahoo!, and the addresses posted there received a lot of spam, while ConsumerPrivacyGuide.org is a relatively new site, and addresses posted there received much less spam.
But none of the addresses that were obscured, whether in "human-readable" or "HTML-obscured" form, received a single piece of spam, leading us to conclude that e-mail address "harvesters" are not presently capable of collecting such addresses. While this may change as time passes and technology develops, for the time being it appears that obscuring an e-mail address is an effective means of avoiding spam.
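One plausible explanation for this result is that harvesters search page source for literal address patterns. The sketch below is an assumption about how a simple harvester might work, not a description of any program observed in the study; the pattern and the sample page are hypothetical.

```python
import re

# A deliberately simple pattern for plaintext addresses (an assumption;
# real harvesting programs may behave differently).
ADDRESS_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def harvest(page_source):
    """Return every substring of the raw page source that looks like an address."""
    return ADDRESS_PATTERN.findall(page_source)


page = """
<p>Plaintext: example@domain.com</p>
<p>Human-readable: example at domain dot com</p>
<p>HTML-obscured: example&#64;domain&#46;com</p>
"""

print(harvest(page))
```

Only the plaintext address matches. A harvester that first decoded HTML character references would also find the third form, which is why this protection may weaken over time.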
Figure 3 - Sample HTML code from GetNetWise.org/index.html
Two weeks after placing our test addresses on the public Web, we removed some of them in order to determine how long an e-mail address, once placed on the public Web, would continue to receive spam after its removal. The effect was significant for all three Web sites tested.
Figure 4 - Effect of removing an address from the public Web
Over the remainder of the study, the address that had been removed from the public Web received significantly less spam than the address still on the Web.
The second-greatest amount of spam we received was from public postings to USENET newsgroups. Once again, we posted addresses in plaintext, "human-readable," and "HTML-obscured" form.
Figure 5 - Sample USENET posting with e-mail addresses in plaintext, human-readable, and HTML-obscured form
Once again, neither the "human-readable" nor the "HTML-obscured" e-mail addresses received any spam. And while not every message posted to USENET generated spam to the plaintext address we provided, most (85%) did.
Figure 6 - Messages received by addresses on USENET newsgroups
For the vast majority of the spam we received due to USENET postings, messages were sent to addresses referenced in the message header, not to addresses referenced in the text of the message. In a very few cases (<1% of all USENET-related spam we received), messages were sent to addresses referenced in the message text. In all cases, spam was sent to addresses that were included in plaintext, not obscured in any way.
The chart above indicates the distribution of spam we received relative to the newsgroups to which we posted. While "alt.sex.erotica" generated twice as much spam as the next newsgroup, we do not believe that this data supports any strong conclusion regarding which newsgroups are the most susceptible to spam.
The third area we tested was the degree to which Web companies respected consumer attempts to opt out of receiving commercial e-mail.
First, in every case where we disclosed an e-mail address and asked not to receive commercial e-mail, the Web site operator respected that request: we received no spam at addresses for which we opted out at the time we first provided them. In a variation on this test, we changed our preference from permitting commercial e-mail to opting out of it. Any e-mail we received more than two weeks after an attempt to opt out was classified as spam. We tested two kinds of opt-out: first, opting out immediately after opting in (simulating a consumer who changes his/her mind about privacy preferences right away), and second, opting out two or more weeks after the initial opt-in (simulating a consumer who changes his/her mind after some time).
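The two-week rule can be expressed as a small classification function. This is a sketch under stated assumptions: the function name, the category labels, and the sample dates are hypothetical, and a message arriving exactly two weeks after the opt-out is treated as within the grace period.

```python
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(weeks=2)


def classify_after_opt_out(opt_out_time, received_time):
    """Classify a message from a site relative to the time of the opt-out request."""
    if received_time < opt_out_time:
        return "permitted"       # sent while the opt-in was still in effect
    if received_time - opt_out_time <= GRACE_PERIOD:
        return "grace period"    # after the opt-out, but within two weeks
    return "spam"                # more than two weeks after the opt-out


opt_out = datetime(2002, 9, 1)   # hypothetical opt-out date
print(classify_after_opt_out(opt_out, datetime(2002, 9, 10)))
print(classify_after_opt_out(opt_out, datetime(2002, 10, 1)))
```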
We pursued this methodology with thirty-one Web businesses and other organizations with myriad offerings.
Figure 7 - Web sites' respect for a change in user privacy preferences
For the majority of Web sites we encountered no difficulty and found that "opt-outs" were respected within the two-week grace period our methodology provided. In five cases, however, opt-outs were not respected, and a total of eighty-two "spam" messages were received from the companies well after an opt-out request had been submitted.
Figure 8 - Messages received after changing to opt-out from further communication
Our study also tested whether Web companies and other organizations shared or sold e-mail addresses in inappropriate ways. For this study, "inappropriate" was defined as sharing/sale (i) without notice to the consumer (in the form of a Web site privacy policy, or some other notice), (ii) in contradiction to the terms of the stated privacy policy, or (iii) in contradiction to the personal preferences we entered.
In general, we found inappropriate sharing/sale of e-mail addresses to be limited. We received just twenty-five such messages, mostly from gambling- and adult-content related websites.
We also reviewed how much spam might result from a user's participation in a Web-based discussion board. In most cases, no spam was received. The only exception was an e-mail provided to Intelihealth.com, which generated fifteen spam messages.
When a user registers a domain name in one of the Internet's seven global Top-Level Domains or certain country-code Top-Level Domains, his or her contact information is entered into a publicly accessible database known as the WHOIS database. We tested how much spam would be sent to an address provided in the WHOIS database. Contrary to our expectations, just one spam e-mail was generated in the six months that our project was operational. Since WHOIS records are permanent, however, it is possible that additional spam could still be generated. Additionally, it should be noted that in the six months that this project was operational, none of the domains in question were up for renewal -- anecdotal reporting from many Internet users describes a significant increase in spam when renewals draw near.
Finally, at one point in the project our mail system began receiving spam messages to addresses that had never been used for any purpose, had been submitted to no one and, in many cases, did not even exist. By reviewing the server logs, we determined that our system had been the victim of a "brute force attack" in which a spammer had attempted to send e-mails to every possible combination of letters that could form an e-mail address.
Figure 9 - Example addresses used in a brute-force attack
a@egovtoolkit.org
b@egovtoolkit.org
c@egovtoolkit.org
d@egovtoolkit.org
...
z@egovtoolkit.org
aa@egovtoolkit.org
ab@egovtoolkit.org
ac@egovtoolkit.org
ad@egovtoolkit.org
...
zz@egovtoolkit.org
aaa@egovtoolkit.org
aab@egovtoolkit.org
aac@egovtoolkit.org
aad@egovtoolkit.org
...
zzz@egovtoolkit.org
aaaa@egovtoolkit.org
aaab@egovtoolkit.org
aaac@egovtoolkit.org
etc....
The strain of so many e-mails severely impaired our mail server, and our team decided to install a block that would prevent any more messages from the responsible network (in this case, h8h.com) from entering our server. Our system received 8,506 "brute force" e-mails before the block was installed. Few, if any, of these e-mails actually made their way to existing e-mail addresses. In order to maintain the integrity of our conclusions, we did not include these 8,506 messages in the data above.
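The ordering shown in Figure 9 corresponds to enumerating every letter combination of length one, then two, then three, and so on. The sketch below reproduces that ordering for illustration; the generator name is hypothetical, "example.org" is a placeholder domain, and nothing is known about the software the sender actually used.

```python
from itertools import count, islice, product
from string import ascii_lowercase


def brute_force_addresses(domain):
    """Yield a@, b@, ..., z@, aa@, ab@, ... in the order shown in Figure 9."""
    for length in count(1):
        for letters in product(ascii_lowercase, repeat=length):
            yield "".join(letters) + "@" + domain


# First 28 candidates: the 26 single letters, then "aa" and "ab".
first = list(islice(brute_force_addresses("example.org"), 28))
print(first[0], first[25], first[26], first[27])

# Number of candidates with one to three letters: 26 + 26**2 + 26**3.
short_candidates = sum(26 ** n for n in range(1, 4))
print(short_candidates)
```

Because short combinations are tried first and there are relatively few of them, short addresses are the most exposed to this kind of guessing, while long randomized addresses are unlikely to be reached.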
When an address has been posted on the public Web, it can potentially be viewed by hundreds of millions of users. People who develop spam lists exploit this feature by using address-harvesting programs to surf across thousands of web sites, collecting any e-mail addresses that they encounter. Most users have no idea that their addresses have been harvested until they begin receiving spam.
[1] During the course of this project CDT's mail system suffered a "dictionary attack," in which a would-be spam sender attempted to guess every e-mail address on our system.
[2] We used the egovtoolkit.org domain for all addresses in this project. The domain is owned and operated by CDT, but is not presently used except internally. This was done to avoid the small chance that a spam-sender might recognize the cdt.org domain and treat those addresses differently from all others.
[3] If you'd like to obscure your e-mail address, or any other piece of text, try the free E-mail Address Encoder at http://www.wbwip.com/wbw/emailencoder.html.