Why do so many web pages about a topic have exactly the same text?

Because they’re content scrapers.

Plagiarism 1.0 — handwriting

Have you ever seen someone in school write an essay by copying out an encyclopaedia article by hand?  It’s unethical and very tedious.

Plagiarism 2.0 — copy and paste

Computers are good at automating tedious tasks.  Find a web page on the topic, copy it, paste it, print it, and hand it in.  This will free up plenty of time for practising your best “innocent” expression in front of a mirror.  You will need it when your teacher runs a web search using phrases from your essay ((How to cheat good, Alex Halavais, 2006)) and finds the exact site you stole it from.

Plagiarism 3.0 — more copy and paste

If your innocent expression is good enough, you can pass high school — or even University — without learning any skill other than copying and pasting.  What can you do with that skill?

How about running a website full of advertising?  It’s not as profitable as it used to be, but you can try.  However, you need some content: some text, images or videos to attract browsers.  If you had learnt to write or draw while in school, you could create some content yourself.  Fortunately the Web is full of other people’s content, so maybe you could copy and paste that instead.  You start browsing high-traffic sites and copying the content to your own, but there’s just too much of it.  Select, copy, paste, select, copy, paste: it becomes tedious again.

Plagiarism 4.0 — automated scraping

Once again, computers can automate away the tedium.  Instead of visiting a website, you send your computer to do it.  It takes some programming effort, but there are ways to separate the meaningful content from the menus and advertising.  This is called content scraping.  Put the scraped content on your own site and add your own advertisements.  Perfect!
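
To make that concrete, here is a minimal sketch in Python of the “separate the content from the menus” step.  It assumes the page politely marks its navigation and advertising with the tags listed in SKIPPED_TAGS, which many real pages do not.  The class name, function name and sample page are all invented for illustration, and real scrapers are presumably more elaborate.

```python
from html.parser import HTMLParser

# Assumption: anything inside these tags is menus, advertising or page plumbing.
SKIPPED_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any of the SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many skipped tags we are currently inside
        self.chunks = []     # pieces of text kept so far

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the text of an HTML page with the skipped regions left out."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# A made-up page: one heading and one paragraph surrounded by clutter.
page = """
<html><body>
<nav><a href="/">Home</a> <a href="/apps">Apps</a></nav>
<h1>Example article</h1>
<p>This is the paragraph a reader actually came for.</p>
<aside>Buy cheap widgets now!</aside>
<footer>Copyright notice</footer>
</body></html>
"""

print(extract_main_text(page))
# prints: Example article This is the paragraph a reader actually came for.
```

Note how little the program needs to understand.  It never reads the article; it only throws away the parts that look like furniture.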

What does scraping look like?

I just typed “SepiaScape” into a commonly used search engine.  The first results were:

  1. iTunes App Store entry for SepiaScape Cataract
  2. Cowirrie website page for SepiaScape
  3. Cowirrie website demonstration for SepiaScape Longford
  4. Video demonstration of the three SepiaScape apps
  5. Content scraper
  6. Content scraper
  7. Content scraper
  8. Content scraper
  9. Unrelated use of the name “sepiascape”
  10. Content scraper
  11. Art titled “Sepiascape”
  12. Art titled “Sepiascape”
  13. Art titled “Sepiascape”
  14. Art titled “Sepiascape”
  15. Art titled “Sepiascape”
  16. Content scraper
  17. Content scraper
  18. iTunes App Store entry for SepiaScape Richmond
  19. The blog you’re currently reading
  20. Content scraper

Cowirrie has not been singled out here.  Search for most apps by name and you will find a dozen pages that have copied the text and other details from iTunes, with added advertising.

Often these sites pretend that they sell the app, but the links simply go to the iTunes App Store.  You cannot install iOS apps from any site but the iTunes App Store ((For completeness: you can install iOS apps from other sources if you jailbreak your iOS device.  We don’t recommend that.  Even if you have good reasons for jailbreaking your device, adding apps through channels other than iTunes probably means you’re pirating the apps.  The content scrapers found above all linked to iTunes.  Most apps, including ours, are also pirated on filesharing sites, but those appear well down in most regular search engine results.)).

Many scrapers also encourage user reviews.  They don’t just steal content, they then ask other people to write more content just for them.  What a deal!

While app store scraping was given here as an example, it appears in many other places, like these:

  • Wikipedia: There are even publishers selling print-on-demand physical books that are nothing but Wikipedia entries ((The odd tale of Alphascript Publishing and Betascript Publishing, Chris Rand, 2012)).
  • Product listings: This can lead to sellers scraping each other to optimise prices, with hilarious results ((Amazon’s $23,698,655.93 book about flies, Michael Eisen, 2011)).
  • Business listings: Search for a small business that doesn’t have a website and you will find a multitude of pages that expand a single line from the local telephone directory into some widely spaced text surrounded by advertising.
  • Blog posts: This is already common, and irony dictates that this very post will one day be scraped ((Did the scraper preserve the footnotes?)).

Where’s the harm?

Some content scrapers duplicate copyrighted material in its entirety, which is simply illegal ((In many jurisdictions you can quote from a copyrighted work for purposes of review, but not simply duplicate that work.  Make sure you know your country’s fair use law or its equivalent.)).

Content scraping damages the Web even when it is legal, as when Wikipedia content is reused under the Creative Commons Attribution-ShareAlike license ((Terms of Use: Licensing of Content, Wikipedia, 2012)).  It clutters up search engine results, hiding original writing by real people.  The scraped content is often out of date relative to the source.  Because scraping is automatic, the content may be grouped in ways that make no sense ((Scraped descriptions of Cowirrie apps have appeared on sites listing Windows and Android software.  At the time of writing, all Cowirrie apps only run on iOS.)).

In extreme cases, people intending to spend money may give up searching the Web for the thing they want to spend it on ((Dishwashers, and How Google Eats Its Own Tail, Paul Kedrosky, 2009.)), so everyone loses.

How do I find the original document?

Search engines find their reputations harmed when they serve up scraped results, so they try to detect the scrapers and demote them.  The scrapers try to avoid detection.  The result is an arms race where search results may change rapidly over months, weeks or even days.
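
As a rough illustration of how copied text can be spotted, the sketch below compares two texts by the overlapping three-word sequences (“shingles”) they share.  This is a textbook technique, not a description of what any particular search engine does, and the function names and sample sentences are made up for this example.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences ("shingles") in text."""
    words = text.lower().split()
    if len(words) < size:
        # Too short to form a full shingle: treat the whole text as one.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(text_a, text_b, size=3):
    """Shared shingles divided by all distinct shingles, from 0.0 to 1.0."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


original = "Computers are good at automating tedious tasks such as copying text"
# A scraper's copy: the same sentence with an advertisement bolted on the front.
scraped = "Click here now! " + original
# An honest rewrite that shares the idea but none of the wording.
rewritten = "Machines excel when asked to repeat dull chores like duplicating prose"

print(jaccard_similarity(original, scraped))    # 0.75
print(jaccard_similarity(original, rewritten))  # 0.0
```

A high score only says that two pages share wording.  Working out which one is the original is a separate problem, and the scrapers have every incentive to blur it.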

If you are not finding what you need, try a different search engine.  Even if the default search for your browser perfectly anticipated your preferences last week, it may be full of spam today… and helpful again next week.

Does all copying count as content scraping?

No.  Your school essays could usually contain text from other writers, as long as it was acknowledged as a quote and the source was cited.  Quoting and citing is also fine on the Web, especially if you link to the source.  The problem is when you copy an entire page, and still more so if you copy without attribution or linking.

Is content scraping always bad?

No.  Sometimes content scrapers make the world a better place:

  • Summaries and analysis: search engines have to scrape the web in order to search it.  However, just because they store the entire site doesn’t mean they display it.
  • Archives: the Internet Archive ((Internet Archive, 2012)) stores entire sites at multiple points in time.  However, it always makes it clear that you’re reading an archived site, and links to the current version.
  • Accessibility: certain websites are hard to navigate for people with impaired vision, don’t display properly on some devices, or are blocked in some countries.  It’s best if the original site fixes these problems, but in urgent cases third parties may have to “mirror” the content in other forms.

What can I do?

The Web is full of scraped content because scraping is profitable.  You can deny scrapers that profit.  If you find yourself on a site that appears to be scraping, go away without clicking anywhere.  Even portions of the site that look “real” may be advertisements, earning click-through payments for the site owner.

Even better, learn to recognise scrapers in your search results and avoid them, denying ad-view payments as well.

You will make the Web a better place for yourself, and may just make it better for everyone else.

For even better browsing, consider installing an ad blocker like Adblock Plus for Firefox ((Adblock Plus :: Add-ons for Firefox, Henrik Aasted Sørensen & Michael McDonald & Wladimir Palant, 2012)), Adblock Plus for Chrome ((Chrome Web Store – Adblock Plus (Beta), Henrik Aasted Sørensen & Michael McDonald & Wladimir Palant, 2012)), AdBlock for Chrome ((Chrome Web Store – AdBlock, Michael Gundlach, 2012)) or AdBlock For Safari ((AdBlock For Safari, Michael Gundlach, 2012)).  These also deny advertising income to sites you like, so please support those sites by changing your ad blocker filter accordingly.  Or, for sites you can subscribe or donate to, cut out the advertising middleman and give your money directly to the people who need it.
