You’ve probably come across websites that want you to prove that you’re human and not a robot. This may come in the form of a picture challenge (for example, select all the squares with bicycles) or it may simply require you to check a box to assert that you’re not a robot. Perhaps you’re wondering why you need to do this. Why is the website so concerned about being visited by robots? Alternatively, perhaps you’re a website developer and are determined to find a way to keep out all bots.
What is a bot? Are all bots bad?
As with cookies, bots are important tools in the digital world. However, as with cookies, bots can also be used for unwholesome purposes.
“Bot” is short for robot and is simply a piece of software (an application) that visits websites. A bot may follow one link after another, crawling through pages across the World Wide Web. For this reason, they are often called “crawlers” or “spiders”.
If you go to your favourite search engine and type in a keyword or phrase (or use a voice activated request on your mobile device) then the results usually come up fairly quickly. This is only possible because the search engine has an index that has been compiled by bots that have followed link after link, gathering information. Without this index, it would take a very long time to scour all the millions of pages that make up the web to find something relevant.
Not all bots are crawlers. For example, Facebook has a bot that’s used when a post contains a link. The bot is used to check that the link exists and it reads any Open Graph markup. This allows Facebook to include an image and short excerpt to arouse the interest of anyone who views the post. Unlike the search engine bots, this bot doesn’t roam free about the Internet but instead restricts itself to links posted on Facebook pages.
Well behaved bots commonly identify themselves in the user agent string in the form:
For example, “facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)” identifies the Facebook bot (facebookexternalhit), its version number (1.1) and a way of finding out information about the bot.
So these are useful bots that help users to discover interesting sites.
However, even the good bots don’t always honestly identify themselves. For example, if you post a link in the Signal Messaging app, the bot used to fetch the preview information identifies itself as WhatsApp, and this basically seems to be a rehash of the “all browsers identify themselves as Mozilla” problem.
Not So Good Bots
Although the crawlers used by search engines are useful, some crawlers that index sites to provide certain types of information for their users (who may require free or paid accounts to access it) can be a nuisance because they’re not well-behaved. For example, they may not follow the robot instructions stipulated by the website (robots.txt), they may try to access pages that are only intended for human visitors or they may hit the site so hard (that is, they look up pages so fast) that they slow down the site and it becomes unusable for everyone else.
This could be because the bot’s developer made a mistake (a bug in the bot’s code or an inexperienced programmer) or it could be because the developer simply doesn’t care and wants the information quickly regardless of the inconvenience to others (perhaps to satisfy the demands of paying customers). In the long run, this is counter-productive as it will lead to the bot (which is identified in the user agent string) being banned.
Web scraping (or harvesting) is when a bot extracts data from a webpage. In the earlier case of search engines and social media, this data can just be keywords or phrases or the URL for the page image, but some bots are designed to gather all information from a page in order to reproduce it verbatim on another site. This is often done to lure visitors to their own copycat site, which will most likely be stuffed full of adverts and tracking (which makes a profit for their owner). This is usually a violation of intellectual property. Even where the original page is available under a permissive licence, attribution is usually required but is often omitted. This happens a lot for question and answer sites, such as Stack Exchange, or forums.
These bots may well have the user agent string empty or set to the default value for the given API that they are built with.
Trolls and Spambots
These are the types of bots that the pages that require you to identify yourself as a human are mostly trying to block. The user agent string is typically set to a common browser and platform to make the bot appear as though it is a human visitor. These bots search for forms to fill in, such as contact forms to send spam messages or comment forms to advertise dubious products and sites.
While spambots are the digital equivalent of fly-posters, trollbots are the equivalent of poison-pen letter writers. They are created by individuals who take a puckish delight in causing hurt and discord. These bots are designed to search for certain keywords on a page and craft an offensive or divisive comment that relates to the topic. The creators of these bots may have a particular hatred towards a certain group of people, but they can also be chaotic nihilists with a set of offensive comments for every group.
The expression “don’t feed the trolls” has been around for a long time. I remember first encountering it on Usenet back in the early 1990s (accompanied by some ASCII art). It’s very good advice. Don’t give trolls the attention that they are looking for, but, in some cases, the troll posting the offensive comments isn’t human. It’s a bot that has no ability to reason, no feelings, no embarrassment. Its function is solely to post content that its creator programmed into it.
Chatbots can come under both this category and the next. Chatbots in general are just a tool that simulates conversation, and are often used for legitimate services, such as online help, but they are also used by criminals to deceive people. For example, a fraudster might create a fake account on an Internet dating site and use a chatbot to hook victims who believe they are chatting with a human. Once the chatbot has gained the victim’s trust, the fraudster takes over.
The worst of the bad bots are the ones created by cyber-criminals and they are designed to wreak havoc, stealing data and installing malware. These bots look for dynamic web pages that use parameters and will try to inject malicious code into the parameter values.
For example, the page https://www.dickimaw-books.com/booklist.php?book_id=11 has a parameter (
book_id) that identifies a particular edition of a book. (In this case, the second paperback edition of The Private Enemy.) The parameter value (11) uniquely identifies this edition in the database that contains all the title information.
A malicious bot will try altering the parameter value to break into the database. For example, it may start out by simply appending an apostrophe (
book_id=11'). If this triggers a syntax error then the site is vulnerable to SQL injection and the bot can then try something far nastier to access the contents of the database.
Or the parameter value may be the name of a template file, which is used for the main body of the web page, so the bot will try replacing the parameter value with
../etc/passwd etc) in order to trick that web page into revealing the contents of the password file instead.
Bad bots can also disrupt a website by repeatedly accessing pages in rapid succession (a denial of service attack or, where an army of bots are working together, a distributed denial of service attack). This can make the site completely inaccessible to anyone else.
These types of bots rarely identify themselves honestly. The user agent string is typically empty or contains a common browser and platform combination (as with the trolls and spambots). I’ve also encountered attempts at SQL injection where the user agent string was the same as the aforementioned Facebook bot. At first glance, it gives the impression that a Facebook bot has gone rogue (or followed a bad link) but the IP was registered to somewhere in Russia, which seems an unlikely origin for a Facebook bot, so bad bots not only pretend to be human but also try to pass themselves off as legitimate bots.
Sometimes the user agent string will contain “sqlmap”. This is a legitimate pen testing application. However, in many jurisdictions, penetration testing can only be performed by mutual consent between the pen tester and the website owner. If you are a website developer and a pen tester has been hired by your organisation, then don’t block bots with this user agent as the site needs to be tested by an unblocked bot since most bad bots don’t conveniently identify themselves. If a pen tester hasn’t been engaged then the tool is being used illegally (which is par for the course with criminals). [However, non-invasive probes by security researchers to test for known vulnerabilities that are responsibly reported are beneficial to the site owner.]
So, if you’re a website developer and you want to stop bad bots, remember that you can’t rely on the user agent string. Bots pretend to be human and some humans blank their user agent string for privacy reasons. The first line of defence is to filter (e.g. ensure that a numeric value is actually a number), escape special characters (e.g.
htmlentities) and use prepared statements.
If you’re just a regular website user, don’t assume that every comment you read was actually posted by a human and, while captchas may be frustrating, your web browsing experience may be far worse without them.
Update 2021-08-08: added paragraph on Signal in Good Bots section and paragraph on chatbots in Trolls and Spambots section.
- Children’s Illustrated Fiction
- Illustrated fiction for young children: The Foolish Hedgehog and Quack, Quack, Quack. Give My Hat Back!
- Creative Writing
- The art of writing fiction, inspiration and themes.
- Crime Fiction
- The crime fiction category covers the crime novels The Private Enemy and The Fourth Protectorate and also the crime short stories I’ve Heard the Mermaid Sing and I’ve Heard the Mermaid Sing.
- Fiction books and other stories.
- Natural languages including regional dialects.
- The TeX typesetting system in general or the LaTeX format in particular.
- This category is about the county of Norfolk in East Anglia (the eastern bulgy bit of England). It’s where The Private Enemy is set and is also where the author lives.
- Information about the Dickimaw Books site.
- Speculative Fiction
- The speculative fiction category includes the novel The Private Enemy (set in the future), the alternative history novel The Fourth Protectorate, and the fantasy novel Muirgealia.
- Alternative History
- Sub-genre of speculative fiction, alternative history is “what if?” fiction.
- book samples
- Conservation of Detail
- A part of the creative writing process, conservation of detail essentially means that only significant information should be added to a work of fiction.
- Information about the site cookies.
- Regional dialects, in particular the Norfolk dialect.
- The education system.
- Sub-genre of speculative fiction involving magical elements.
- File formats
- A pochette (pocket violin) with a hippo headpiece.
- I’ve Heard the Mermaid Sing
- A crime fiction short story (available as an ebook) set in the late 1920s on the RMS Aquitania. See the story’s main page for further details.
- The little things that inspired the author’s stories.
- Posts about the website migration.
- A fantasy novel. See the book’s main page for further details.
- Online Store
- Posts about the Dickimaw Books store.
- Quack, Quack, Quack. Give My Hat Back!
- Information about the illustrated children’s book. See the book’s main page for further details.
- Articles that were previously published elsewhere and reproduced on this blog in order to collect them all together in one place.
- Posts about sales that are running or are pending at the time of the post.
- Site settings
- Information about the site settings.
- Story creation
- The process of creating stories.
- TeX Live
- The Foolish Hedgehog
- Information about the illustrated children’s book. See the book’s main page for further details.
- The Fourth Protectorate
- Alternative history novel set in 1980s/90s London. See the book’s main page for further details.
- The Private Enemy
- A crime/speculative fiction novel set in a future Norfolk run by gangsters. See the book’s main page for further details.
- Unsocial Media
- A cybercrime fiction short story (available as an ebook). See the story’s main page for further details.
- World Book Day
- World Book Day (UK and Ireland) is an annual charity event held in the United Kingdom and the Republic of Ireland on the first Thursday in March. It’s a local version of the global UNESCO World Book Day.