You’ve probably come across websites that want you to prove that you’re human and not a robot. This may come in the form of a picture challenge (for example, select all the squares with bicycles) or it may simply require you to check a box to assert that you’re not a robot. Perhaps you’re wondering why you need to do this. Why is the website so concerned about being visited by robots? Alternatively, perhaps you’re a website developer and are determined to find a way to keep out all bots.
What is a bot? Are all bots bad?
As with cookies, bots are important tools in the digital world. However, as with cookies, bots can also be used for unwholesome purposes.
“Bot” is short for robot and is simply a piece of software (an application) that visits websites. A bot may follow one link after another, crawling through pages across the World Wide Web. For this reason, they are often called “crawlers” or “spiders”.
If you go to your favourite search engine and type in a keyword or phrase (or use a voice activated request on your mobile device) then the results usually come up fairly quickly. This is only possible because the search engine has an index that has been compiled by bots that have followed link after link, gathering information. Without this index, it would take a very long time to scour all the millions of pages that make up the web to find something relevant.
Not all bots are crawlers. For example, Facebook has a bot that’s used when a post contains a link. The bot is used to check that the link exists and it reads any Open Graph markup. This allows Facebook to include an image and short excerpt to arouse the interest of anyone who views the post. Unlike the search engine bots, this bot doesn’t roam free about the Internet but instead restricts itself to links posted on Facebook pages.
Well behaved bots commonly identify themselves in the user agent string in the form:
For example, “facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)” identifies the Facebook bot (facebookexternalhit), its version number (1.1) and a way of finding out information about the bot.
So these are useful bots that help users to discover interesting sites.
However, even the good bots don’t always honestly identify themselves. For example, if you post a link in the Signal Messaging app, the bot used to fetch the preview information identifies itself as WhatsApp, and this basically seems to be a rehash of the “all browsers identify themselves as Mozilla” problem.
Not So Good Bots
Although the crawlers used by search engines are useful, some crawlers that index sites to provide certain types of information for their users (who may require free or paid accounts to access it) can be a nuisance because they’re not well-behaved. For example, they may not follow the robot instructions stipulated by the website (robots.txt), they may try to access pages that are only intended for human visitors or they may hit the site so hard (that is, they look up pages so fast) that they slow down the site and it becomes unusable for everyone else.
This could be because the bot’s developer made a mistake (a bug in the bot’s code or an inexperienced programmer) or it could be because the developer simply doesn’t care and wants the information quickly regardless of the inconvenience to others (perhaps to satisfy the demands of paying customers). In the long run, this is counter-productive as it will lead to the bot (which is identified in the user agent string) being banned.
Web scraping (or harvesting) is when a bot extracts data from a webpage. In the earlier case of search engines and social media, this data can just be keywords or phrases or the URL for the page image, but some bots are designed to gather all information from a page in order to reproduce it verbatim on another site. This is often done to lure visitors to their own copycat site, which will most likely be stuffed full of adverts and tracking (which makes a profit for their owner). This is usually a violation of intellectual property. Even where the original page is available under a permissive licence, attribution is usually required but is often omitted. This happens a lot for question and answer sites, such as Stack Exchange, or forums.
These bots may well have the user agent string empty or set to the default value for the given API that they are built with.
Trolls and Spambots
These are the types of bots that the pages that require you to identify yourself as a human are mostly trying to block. The user agent string is typically set to a common browser and platform to make the bot appear as though it is a human visitor. These bots search for forms to fill in, such as contact forms to send spam messages or comment forms to advertise dubious products and sites.
While spambots are the digital equivalent of fly-posters, trollbots are the equivalent of poison-pen letter writers. They are created by individuals who take a puckish delight in causing hurt and discord. These bots are designed to search for certain keywords on a page and craft an offensive or divisive comment that relates to the topic. The creators of these bots may have a particular hatred towards a certain group of people, but they can also be chaotic nihilists with a set of offensive comments for every group.
The expression “don’t feed the trolls” has been around for a long time. I remember first encountering it on Usenet back in the early 1990s (accompanied by some ASCII art). It’s very good advice. Don’t give trolls the attention that they are looking for, but, in some cases, the troll posting the offensive comments isn’t human. It’s a bot that has no ability to reason, no feelings, no embarrassment. Its function is solely to post content that its creator programmed into it.
Chatbots can come under both this category and the next. Chatbots in general are just a tool that simulates conversation, and are often used for legitimate services, such as online help, but they are also used by criminals to deceive people. For example, a fraudster might create a fake account on an Internet dating site and use a chatbot to hook victims who believe they are chatting with a human. Once the chatbot has gained the victim’s trust, the fraudster takes over.
The worst of the bad bots are the ones created by cyber-criminals and they are designed to wreak havoc, stealing data and installing malware. These bots look for dynamic web pages that use parameters and will try to inject malicious code into the parameter values.
For example, the page https://www.dickimaw-books.com/booklist.php?book_id=11 has a parameter (
book_id) that identifies a particular edition of a book. (In this case, the second paperback edition of The Private Enemy.) The parameter value (11) uniquely identifies this edition in the database that contains all the title information.
A malicious bot will try altering the parameter value to break into the database. For example, it may start out by simply appending an apostrophe (
book_id=11'). If this triggers a syntax error then the site is vulnerable to SQL injection and the bot can then try something far nastier to access the contents of the database.
Or the parameter value may be the name of a template file, which is used for the main body of the web page, so the bot will try replacing the parameter value with
../etc/passwd etc) in order to trick that web page into revealing the contents of the password file instead.
Bad bots can also disrupt a website by repeatedly accessing pages in rapid succession (a denial of service attack or, where an army of bots are working together, a distributed denial of service attack). This can make the site completely inaccessible to anyone else.
These types of bots rarely identify themselves honestly. The user agent string is typically empty or contains a common browser and platform combination (as with the trolls and spambots). I’ve also encountered attempts at SQL injection where the user agent string was the same as the aforementioned Facebook bot. At first glance, it gives the impression that a Facebook bot has gone rogue (or followed a bad link) but the IP was registered to somewhere in Russia, which seems an unlikely origin for a Facebook bot, so bad bots not only pretend to be human but also try to pass themselves off as legitimate bots.
Sometimes the user agent string will contain “sqlmap”. This is a legitimate pen testing application. However, in many jurisdictions, penetration testing can only be performed by mutual consent between the pen tester and the website owner. If you are a website developer and a pen tester has been hired by your organisation, then don’t block bots with this user agent as the site needs to be tested by an unblocked bot since most bad bots don’t conveniently identify themselves. If a pen tester hasn’t been engaged then the tool is being used illegally (which is par for the course with criminals).
So, if you’re a website developer and you want to stop bad bots, remember that you can’t rely on the user agent string. Bots pretend to be human and some humans blank their user agent string for privacy reasons. The first line of defence is to filter (e.g. ensure that a numeric value is actually a number), escape special characters (e.g.
htmlentities) and use prepared statements.
If you’re just a regular website user, don’t assume that every comment you read was actually posted by a human and, while captchas may be frustrating, your web browsing experience may be far worse without them.
Update 2021-08-08: added paragraph on Signal in Good Bots section and paragraph on chatbots in Trolls and Spambots section.