How to Catch a MySpace Creep


Jeb

 

Six months ago, Wired News launched an investigation of MySpace with the goal of comparing the company's 120 million user profiles against public sex offender registries to see how many matches we could find.

 

The project began when Wired News contributor Jenn Shreve found a handful of matches based on a random search. How many would you find with a software script that systematically went through those records and compared them all?

 

We decided to find out. I wrote a series of Perl scripts and began sifting the data.

 

The technique was crude, like searching for a needle in a haystack. When I began checking ostensible matches by hand, false positives registered in the thousands.

 

Nevertheless, after several weeks of part-time work on the project, I was led to one suspect whose behavior was so disturbing I contacted New York's Suffolk County Police Department for comment. The suspect, Andrew Lubrano, was arrested earlier this month on attempted child endangerment charges.

 

Some 700 other matches were also confirmed, though none of those individuals could be linked by public MySpace posts to actual evidence of wrongdoing.

 

Today, Wired News is releasing the code used in this investigation (click here to download the gzip file). Anyone is free to take the software, look at it, validate (or invalidate) the methodology, discuss, tinker and improve the code.

 

We're releasing this code completely and utterly unsupported, under a BSD license. We'll happily link to open-source development efforts that pick it up for adoption, if notified.

 

Warning: These scripts were developed for a one-off project, and all of them admittedly could use a thorough scrubbing.

 

It's also worth noting what this code is not.

 

First, it is not a plug-and-play application for average parents or concerned citizens who want to protect specific children who have MySpace accounts. To use it, you'll need to have a web server, a MySQL database server and some rudimentary knowledge of Perl.

 

Nor is it a fully automated find-a-perp program. Rather, this code is a starting point for human reporting.

 

It's of paramount importance to realize that the vast majority of matches produced by this method are false. I backed my automated search with an eyeball inspection of the candidate profiles, comparing photos, ages and, in many cases, birth dates and other biographical data available on MySpace and in the offender registry. This software makes those comparisons relatively easy, but they still have to be made by hand.

 

If you do make matches after careful visual review, don't go all vigilante. The state Megan's Laws that created the registries also generally proscribe using the data to harass ex-offenders. Another important legal point: At the time that I ran this code, neither MySpace nor the Department of Justice had any prohibition on automated searches of their data. That could change at any time.

 

Now, to the code.

 

Finding sex offenders on MySpace is a three-step process. First, you need the list of offenders. I put together the first script, scraperps.pl, in late April. From a list of ZIP codes, the program simply fills out the query form on the DOJ's registry, maxing out the query by running five ZIPs at a time. Then it stores the results -- name, ZIP, city, county, state -- in a database, within a table called `perps`.
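
The "five ZIPs at a time" batching can be sketched as follows. This is a minimal illustration in Python, not the original Perl from scraperps.pl; the function name `batch_zips` and the sample ZIP codes are hypothetical, and the form submission itself is left out.

```python
from typing import Iterator, List


def batch_zips(zip_codes: List[str], size: int = 5) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` ZIP codes."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(zip_codes), size):
        yield zip_codes[start:start + size]


# Hypothetical input; the article does not show the real ZIP code list.
sample = ["10001", "10002", "10003", "10004", "10005", "10006", "10007"]
batches = list(batch_zips(sample))
# Seven ZIP codes become one full batch of five and one partial batch of two.
```

Each batch would then correspond to a single registry query, which cuts the number of requests by up to a factor of five.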

 

My first run quickly got me temporarily blocked from the site. It turns out the DOJ server doesn't like you running a lot of queries back-to-back. When the ban was lifted (never let it be said that the Justice Department is unforgiving), I incorporated a 30-second pause between queries, which seemed to satisfy the server. That raised the run time to over 71 hours.
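
To see how a fixed pause drives the run time, here is a small Python sketch (again, not the original Perl). The names `run_throttled` and `pause_hours` are hypothetical. The article does not state how many queries were made; the figure of 8,520 below is simply what 71 hours of 30-second pauses works out to, ignoring the time each query itself takes.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_throttled(
    items: Iterable[T],
    action: Callable[[T], R],
    pause_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[R]:
    """Apply `action` to each item, pausing between consecutive calls."""
    results = []
    for index, item in enumerate(items):
        if index > 0:
            sleep(pause_seconds)  # no pause is needed before the first call
        results.append(action(item))
    return results


def pause_hours(query_count: int, pause_seconds: float = 30.0) -> float:
    """Hours spent purely on pauses, ignoring the queries themselves."""
    return query_count * pause_seconds / 3600.0


# Inferred, not reported: 8,520 pauses of 30 seconds add up to 71 hours.
hours = pause_hours(8520)
```

The `sleep` parameter is injectable so the loop can be exercised without actually waiting.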

 

While that was under way, I went to work on screen-scraping MySpace. When you register for MySpace, you're prompted to provide your full name and your ZIP code. That information doesn't appear in your MySpace profile, which may help explain why so many offenders felt comfortable providing it. But MySpace's search engine lets you search by name, and restrict the results to within five miles of a particular ZIP code. That made it a natural match for the sex offender registry.

 

The MySpace scraper, myspacebot.pl, performs this search for every entry in `perps`, and loads the result into a table called `myspace`.

 

Easy in theory, this code still wound up far more complicated than the DOJ scraper. It underwent a lot of tweaking, and it comes to you in need of a thorough cleaning. Like all the code, it was written for one-time use; comments are scarce, and variable names like "$foo" and "$bar" are unhelpful. I almost used a GOTO.

 

MySpace servers are unpredictable. Searches fail, producing blank pages or error messages. One search will respond instantly, then the very next will drag on for half a minute. For that reason, the code is persistent. It'll keep trying, over and over again, until it runs the search. You can run multiple threads at once. To stay polite, I ran only four.

 

At one point, noticing that very long load times invariably ended in failure, I shortened the client's timeout to 10 seconds, which dramatically sped up the process. But that afternoon it just stopped working. I realized it was because teens were getting home from school and logging on in eager droves, dragging down the servers and turning my short timeout into a fool's pipe dream. I boosted it to 20 seconds, which seemed to do the trick.
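
The keep-trying behavior with a client timeout might look roughly like this in Python. It is a sketch under assumptions, not the logic of myspacebot.pl: `fetch_until_success`, its parameters and the `fetch` callable are all hypothetical, and the default check only rejects blank pages, so detecting error-message pages would need a stricter `is_usable` test.

```python
import time
from typing import Callable, Optional


def fetch_until_success(
    fetch: Callable[[float], Optional[str]],
    timeout_seconds: float = 20.0,
    wait_seconds: float = 1.0,
    max_attempts: Optional[int] = None,
    is_usable: Callable[[str], bool] = lambda page: bool(page.strip()),
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `fetch(timeout_seconds)` repeatedly until it returns a usable page.

    With max_attempts=None the loop never gives up, mirroring the
    "over and over again" persistence described in the article.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        try:
            page = fetch(timeout_seconds)
        except OSError:  # includes TimeoutError and most network failures
            page = None
        if page is not None and is_usable(page):
            return page
        sleep(wait_seconds)  # brief wait before the next attempt
    raise RuntimeError(f"no usable response after {attempt} attempts")
```

A short timeout makes each failed attempt cheap, but if it is shorter than the server's typical response time under load, every attempt fails, which matches the afternoon slowdown described above.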

 

Once I had the potential matches, another script, lookup.pl, went back to the DOJ to gather the direct links to the state offender page for each potentially matching perp.

 

I put a lot of thought into the third step in the process -- manual analysis. This is where I'd be spending the most time.

 

I settled on a CGI script, msresults.pl, that presents each potential match, one line for the offender information, one for each matching MySpace profile. Name, last login time, reported location, age and the profile's default photo are all displayed above the bare-bones information from the sex offender site and a link to the offender's rap sheet. Direct links to the MySpace profile, comment page, photo page and friends list are also presented.

 

There are fields for notes, and check boxes to flag a profile for follow-up or hide it from view once you've excluded it. There are options -- set in the code, or in the URL -- to hide profiles with no default photo (I used this), to show only profiles flagged by the user as a match, or to present a printer-friendly version of the final results, among other things.
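
Options that can be set either in the code or in the URL follow a common pattern: defaults live in the script, and query-string parameters override them. The Python sketch below illustrates that pattern only; the option names (`hide_no_photo`, `flagged_only`, `printable`) are hypothetical and are not taken from msresults.pl.

```python
from typing import Dict
from urllib.parse import parse_qs

# Hypothetical defaults "set in the code".
DEFAULTS: Dict[str, bool] = {
    "hide_no_photo": True,
    "flagged_only": False,
    "printable": False,
}

TRUTHY = {"1", "true", "yes", "on"}


def view_options(query_string: str, defaults: Dict[str, bool] = DEFAULTS) -> Dict[str, bool]:
    """Return the defaults, overridden by any matching URL parameters."""
    options = dict(defaults)
    supplied = parse_qs(query_string)
    for name in options:
        if name in supplied:
            # If a parameter is repeated, the last value wins.
            options[name] = supplied[name][-1].lower() in TRUTHY
    return options


opts = view_options("flagged_only=1&hide_no_photo=0")
```

Unknown parameters are ignored, so a mistyped option name silently falls back to the default.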

 

I quickly learned that browsing around MySpace is hazardous, both to one's sense of aesthetics and to Firefox, which often chooses suicide rather than processing the hodgepodge of dodgy JavaScript, streaming media and insane HTML.

 

So I wrote another script, friendbot.pl, to try to ease the process of auditing a suspect's friends list. This is optional, and you needn't run it.

 

friendbot.pl failed to work reliably, possibly because of the intermittence of MySpace's servers. But when it works, the program crawls the friends lists, loads the profile for each friend and grabs the vital stats, most relevantly the friend's age. That all goes into the database, and it can be presented through friend.pl, another CGI script that can serve as an alternative to the actual, MySpace-hosted friends list.

 

You might use the data gathered by friendbot.pl to try to pull out candidate profiles with an unusually high number of underage friends.

 

age_filler.pl counts the number of friends under 17, the number who are exactly 16 and the total number of friends, and loads those counts into that profile's entry in `myspace`. This data can be displayed by msresults.pl.
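
The three tallies can be sketched in a few lines of Python. This is an illustration of the counting as described, not the code of age_filler.pl; the function name, the dictionary keys and the sample ages are hypothetical, and treating a missing age as counting toward the total only is an assumption.

```python
from typing import Dict, Iterable, Optional


def friend_age_counts(ages: Iterable[Optional[int]]) -> Dict[str, int]:
    """Tally friends under 17, friends exactly 16, and all friends."""
    counts = {"under_17": 0, "exactly_16": 0, "total": 0}
    for age in ages:
        counts["total"] += 1
        if age is None:
            continue  # assumption: an unknown age counts toward the total only
        if age < 17:
            counts["under_17"] += 1
        if age == 16:
            counts["exactly_16"] += 1
    return counts


# Hypothetical sample: six friends, one of them with no listed age.
counts = friend_age_counts([16, 22, None, 15, 16, 30])
```

Note that a 16-year-old is counted in both "under_17" and "exactly_16", since the second tally is a subset of the first.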

 

You'll see how in the code.

http://wired.com/news/technology/0,71976-0...l?tw=wn_index_1
