URL Rewriting for Beginners
http://www.addedbytes.com/articles/for-beginners/url-rewriting-for-beginners/

Introduction

URL rewriting can be one of the best and quickest ways to improve the usability and search friendliness of your site. It can also be the source of near-unending misery and suffering. It's definitely worth playing with carefully - lots of testing is recommended. With great power comes great responsibility, and all that.

There are several other guides on the web already that may suit your needs better than this one.

Before reading on, you may find it helpful to have the mod_rewrite cheat sheet and/or the regular expressions cheat sheet handy. A basic grasp of the concept of regular expressions would also be very helpful.

What is "URL Rewriting"?

Most dynamic sites include variables in their URLs that tell the site what information to show the user. Typically, this gives URLs like the following, telling the relevant script on a site to load product number 7.

http://www.pets.com/show_a_product.php?product_id=7

The problem with this kind of URL structure is that the URL is not at all memorable, and it's difficult to read out over the phone (you'd be surprised how many people pass URLs this way). Search engines and users alike get no useful information about the content of a page from that URL. You can't tell from that URL that the page allows you to buy a Norwegian Blue Parrot (lovely plumage). It's a fairly standard URL - the sort you'd get by default from most CMSes. Compare that to this URL:

http://www.pets.com/products/7/

Clearly a much cleaner and shorter URL. It's much easier to remember, and vastly easier to read out. That said, it doesn't exactly tell anyone what it refers to. But we can do more:

http://www.pets.com/parrots/norwegian-blue/

Now we're getting somewhere. You can tell from the URL, even when it's taken out of context, what you're likely to find on that page. Search engines can split that URL into words (hyphens in URLs are treated as spaces by search engines, whereas underscores are not), and they can use that information to better determine the content of the page. It's an easy URL to remember and to pass to another person.

Unfortunately, the last URL cannot be easily understood by a server without some work on our part. When a request is made for that URL, the server needs to work out how to process that URL so that it knows what to send back to the user. URL rewriting is the technique used to "translate" a URL like the last one into something the server can understand.

Platforms and Tools

Depending on the software your server is running, you may already have access to URL rewriting modules. If not, most hosts will enable or install the relevant modules for you if you ask them very nicely.

Apache is the easiest system to get URL rewriting running on. It usually comes with its own built-in URL rewriting module, mod_rewrite, enabled, and working with mod_rewrite is as simple as uploading correctly formatted and named text files.

IIS, Microsoft's server software, doesn't include URL rewriting capability as standard, but there are add-ons out there that can provide this functionality. ISAPI_Rewrite is the one I recommend working with, as I've so far found it to be the closest to mod_rewrite's functionality. Instructions for installing and configuring ISAPI_Rewrite can be found at the end of this article.

The code that follows is based on URL rewriting using mod_rewrite.

Basic URL Rewriting

To begin with, let's consider a simple example. We have a website, and we have a single PHP script that serves a single page. Its URL is:

http://www.pets.com/pet_care_info_07_07_2008.php

We want to clean up the URL, and our ideal URL would be:

http://www.pets.com/pet-care/

In order for this to work, we need to tell the server to internally redirect all requests for the URL "pet-care" to "pet_care_info_07_07_2008.php". We want this to happen internally, because we don't want the URL in the browser's address bar to change.

To accomplish this, we need to first create a text document called ".htaccess" to contain our rules. It must be named exactly that (not ".htaccess.txt" or "rules.htaccess"). This would be placed in the root directory of the server (the same folder as "pet_care_info_07_07_2008.php" in our example). There may already be an .htaccess file there, in which case we should edit that rather than overwrite it.

The .htaccess file is a configuration file for the server. If there are errors in the file, the server will display an error message (usually with an error code of "500"). If you are transferring the file to the server using FTP, you must make sure it is transferred using the ASCII mode, rather than BINARY. We use this file to perform two simple tasks in this instance - first, to tell Apache to turn on the rewrite engine, and second, to tell Apache what rewriting rule we want it to use. We need to add the following to the file:

RewriteEngine On # Turn on the rewriting engine
RewriteRule ^pet-care/?$ pet_care_info_07_07_2008.php [NC,L] # Handle requests for "pet-care"

A couple of quick items to note: everything following a hash symbol in an .htaccess file is ignored as a comment, and I'd recommend you use comments liberally; and the "RewriteEngine" line should only be used once per .htaccess file (please note that I've not included this line from here onwards in code examples).

The "RewriteRule" line is where the magic happens. The line can be broken down into 5 parts:

  • RewriteRule - Tells Apache that this line refers to a single RewriteRule.
  • ^pet-care/?$ - The "pattern". The server will check the URL of every request to the site to see if this pattern matches. If it does, then Apache will swap the URL of the request for the "substitution" section that follows.
  • pet_care_info_07_07_2008.php - The "substitution". If the pattern above matches the request, Apache uses this URL instead of the requested URL.
  • [NC,L] - "Flags", that tell Apache how to apply the rule. In this case, we're using two flags. "NC", tells Apache that this rule should be case-insensitive, and "L" tells Apache not to process any more rules if this one is used.
  • # Handle requests for "pet-care" - Comment explaining what the rule does (optional but recommended)

The rule above is a simple method for rewriting a single URL, and is the basis for almost all URL rewriting rules.

Patterns and Replacements

The rule above allows you to redirect requests for a single URL, but the real power of mod_rewrite comes when you start to identify and rewrite groups of URLs based on patterns they contain.

Let's say you want to change all of your site URLs as described in the first pair of examples above. Your existing URLs look like this:

http://www.pets.com/show_a_product.php?product_id=7

And you want to change them to look like this:

http://www.pets.com/products/7/

Rather than write a rule for every single product ID, you of course would rather write one rule to manage all product IDs. Effectively you want to change URLs of this format:

http://www.pets.com/show_a_product.php?product_id={a number}

And you want to change them to look like this:

http://www.pets.com/products/{a number}/

In order to do so, you will need to use "regular expressions". These are patterns, defined in a specific format that the server can understand and handle appropriately. A typical pattern to identify a number would look like this:

[0-9]+

The square brackets contain a range of characters, and "0-9" indicates all the digits. The plus symbol indicates that the pattern will identify one or more of whatever precedes the plus - so this pattern effectively means "one or more digits" - exactly what we're looking to find in our URL.

The entire "pattern" part of the rule is treated as a regular expression by default - you don't need to turn this on or activate it at all.

RewriteRule ^products/([0-9]+)/?$ show_a_product.php?product_id=$1 [NC,L] # Handle product requests

The first thing I hope you'll notice is that we've wrapped our pattern in brackets. This allows us to "back-reference" (refer back to) that section of the URL in the following "substitution" section. The "$1" in the substitution tells Apache to put whatever matched the earlier bracketed pattern into the URL at this point. You can have lots of backreferences, and they are numbered in the order they appear.

And so, this RewriteRule will now mean that Apache redirects all requests for domain.com/products/{number}/ to show_a_product.php?product_id={same number}.
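To illustrate multiple backreferences, a hypothetical rule might capture both a category and a product ID (the "show_products_in_category.php" script and its "category" parameter are invented for illustration - they are not part of the example site above):

RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^products/([A-Za-z-]+)/([0-9]+)/?$ show_products_in_category.php?category=$1&product_id=$2 [NC,L] # Category and product

A request for products/parrots/7/ would then set $1 to "parrots" and $2 to "7", in the order the bracketed groups appear in the pattern.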

Regular Expressions

A complete guide to regular expressions is rather beyond the scope of this article. However, important points to remember are that the entire pattern is treated as a regular expression, so always be careful of characters that are "special" characters in regular expressions.

The most common instance of this is when people use a period in their pattern. In a pattern, this actually means "any character" rather than a literal period, and so if you want to match a period (and only a period) you will need to "escape" the character - precede it with another special character, a backslash, that tells Apache to take the next character as literal.

For example, this RewriteRule will not just match the URL "rss.xml" as intended - it will also match "rss1xml", "rss-xml" and so on.

RewriteRule ^rss.xml$ rss.php [NC,L] # Change feed URL

This does not usually present a serious problem, but escaping characters properly is a very good habit to get into early. Here's how it should look:

RewriteRule ^rss\.xml$ rss.php [NC,L] # Change feed URL

This only applies to the pattern, not to the substitution. Other characters that require escaping (referred to as "metacharacters") follow, with their meaning in brackets afterwards:

  • . (any character)
  • * (zero or more of the preceding)
  • + (one or more of the preceding)
  • {} (minimum to maximum quantifier)
  • ? (zero or one of the preceding; after a quantifier, makes it ungreedy)
  • ! (at start of string means "negative pattern")
  • ^ (start of string, or "negative" if at the start of a range)
  • $ (end of string)
  • [] (match any of contents)
  • - (range if used between square brackets)
  • () (group, backreferenced group)
  • | (alternative, or)
  • \ (the escape character itself)
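Several of these metacharacters can be seen working together in a hypothetical rule (the filenames are invented for illustration): the brackets group two alternatives separated by a pipe, the period is escaped, and the question mark makes the final "l" optional:

RewriteRule ^(index|home)\.html?$ index.php [NC,L] # Map old static URLs

That pattern matches "index.htm", "index.html", "home.htm" and "home.html", but nothing else.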

Using regular expressions, it is possible to search for all sorts of patterns in URLs and rewrite them when they match. Time for another example - we wanted earlier to be able to identify this URL and rewrite it:

http://www.pets.com/parrots/norwegian-blue/

And we want to be able to tell the server to interpret this as the following, but for all products:

http://www.pets.com/get_product_by_name.php?product_name=norwegian-blue

And we can do that relatively simply, with the following rule:

RewriteRule ^parrots/([A-Za-z0-9-]+)/?$ get_product_by_name.php?product_name=$1 [NC,L] # Process parrots

With this rule, any URL that starts with "parrots" followed by a slash (parrots/), then one or more (+) of any combination of letters, numbers and hyphens ([A-Za-z0-9-]), is matched and rewritten. Note the hyphen at the end of the selection of characters within square brackets - it must be added there to be treated literally rather than as a range separator. The brackets capture the product name, which we reference with $1 in the substitution.

We can make it even more generic, if we want, so that it doesn't matter what directory a product appears to be in, it is still sent to the same script, like so:

RewriteRule ^[A-Za-z-]+/([A-Za-z0-9-]+)/?$ get_product_by_name.php?product_name=$1 [NC,L] # Process all products

As you can see, we've replaced "parrots" with a pattern that matches letters and hyphens. That rule will now match anything in the parrots directory, or any other directory whose name is made up of one or more letters and hyphens.

Flags

Flags are added to the end of a rewrite rule to tell Apache how to interpret and handle the rule. They can be used to tell Apache to treat the rule as case-insensitive, to stop processing rules if the current one matches, or a variety of other options. They are comma-separated, and contained in square brackets. Here's a list of the flags, with their meanings (this information is included on the cheat sheet, so no need to try to learn them all).

  • C (chained with next rule)
  • CO=cookie (set specified cookie)
  • E=var:value (set environment variable var to value)
  • F (forbidden - sends a 403 header to the user)
  • G (gone - no longer exists)
  • H=handler (set handler)
  • L (last - stop processing rules)
  • N (next - continue processing rules)
  • NC (case insensitive)
  • NE (do not escape special URL characters in output)
  • NS (ignore this rule if the request is a subrequest)
  • P (proxy - i.e., apache should grab the remote content specified in the substitution section and return it)
  • PT (pass through - use when processing URLs with additional handlers, e.g., mod_alias)
  • R (temporary redirect to new URL)
  • R=301 (permanent redirect to new URL)
  • QSA (append query string from request to substituted URL)
  • S=x (skip next x rules)
  • T=mime-type (force specified mime type)
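As an example of the "QSA" flag, the product rule from earlier can be made to carry the request's query string across the rewrite (a sketch based on the earlier example; the "currency" parameter is invented for illustration):

RewriteRule ^products/([0-9]+)/?$ show_a_product.php?product_id=$1 [NC,QSA,L] # Keep query string

With "QSA", a request for products/7/?currency=gbp becomes show_a_product.php?product_id=7&currency=gbp. Without it, because the substitution contains its own query string, the "currency" parameter would be discarded.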

Moving Content

RewriteRule ^article/?$ http://www.new-domain.com/article/ [R,NC,L] # Temporary Move

Adding an "R" flag to the flags section changes how a RewriteRule works. Instead of rewriting the URL internally, Apache will send a message back to the browser (an HTTP header) to tell it that the document has moved temporarily to the URL given in the "substitution" section. Either an absolute or a relative URL can be given in the substitution section. The header sent back includes a code - 302 - that indicates the move is temporary.

RewriteRule ^article/?$ http://www.new-domain.com/article/ [R=301,NC,L] # Permanent Move

If the move is permanent, append "=301" to the "R" flag to have Apache tell the browser the move is considered permanent. In both cases the redirect is external, so the browser will show the new address in the address bar; the difference is that search engines and caches treat a 301 as permanent and will update their records to point at the new URL.

This is one of the most common methods of rewriting URLs of items that have moved to a new URL (for example, it is in use extensively on this site to forward users to new post URLs whenever they are changed).

Conditions

Rewrite rules can be preceded by one or more rewrite conditions, and these can be strung together. This can allow you to only apply certain rules to a subset of requests. Personally, I use this most often when applying rules to a subdomain or alternative domain as rewrite conditions can be run against a variety of criteria, not just the URL. Here's an example:

RewriteCond %{HTTP_HOST} ^addedbytes\.com [NC]
RewriteRule ^(.*)$ http://www.addedbytes.com/$1 [L,R=301]

The rewrite rule above redirects all requests, no matter what for, to the same URL at "www.addedbytes.com". Without the condition, this rule would create a loop, with every request matching that rule and being sent back to itself. The rule is intended to only redirect requests missing the "www" URL portion, though, and the condition preceding the rule ensures that this happens.

The condition operates in a similar way to the rule. It starts with "RewriteCond" to tell mod_rewrite this line refers to a condition. Following that is what should actually be tested, and then the pattern to test. Finally, the flags in square brackets, the same as with a RewriteRule.

The string to test (the second part of the condition) can be a variety of different things. You can test the domain being requested, as with the above example, or you could test the browser being used, the referring URL (commonly used to prevent hotlinking), the user's IP address, or a variety of other things (see the "server variables" section for an outline of how these work).

The pattern is almost exactly the same as that used in a RewriteRule, with a couple of small exceptions. The pattern may not be interpreted as a pattern if it starts with specific characters as described in the following "exceptions" section. This means that if you wish to use a regular expression pattern starting with <, >, or a hyphen, you should escape them with the backslash.

Rewrite conditions can, like rewrite rules, be followed by flags, and there are only two. "NC", as with rules, tells Apache to treat the condition as case-insensitive. The other available flag is "OR". If you only want to apply a rule if one of two conditions match, rather than repeat the rule, add the "OR" flag to the first condition, and if either match then the following rule will be applied. The default behaviour, if a rule is preceded by multiple conditions, is that it is only applied if all of the conditions match.
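For example, to apply a single redirect when either of two hypothetical hostnames is requested (the domain names are invented for illustration):

RewriteCond %{HTTP_HOST} ^old-pets\.com [NC,OR]
RewriteCond %{HTTP_HOST} ^www\.old-pets\.com [NC]
RewriteRule ^(.*)$ http://www.pets.com/$1 [R=301,L] # Redirect both old hostnames

Note that only the first condition carries the "OR" flag; the last condition in the group uses just "NC".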

Exceptions and Special Cases

Rewrite conditions can be tested in a few different ways - they do not need to be treated as regular expression patterns, although this is the most common way they are used. Here are the various ways rewrite conditions can be processed:

  • <Pattern (is test string lower than pattern)
  • >Pattern (is test string greater than pattern)
  • =Pattern (is test string equal to pattern)
  • -d (is test string a valid directory)
  • -f (is test string a valid file)
  • -s (is test string a valid file with size greater than zero)
  • -l (is test string a symbolic link)
  • -F (is test string a valid file, and accessible (via subrequest))
  • -U (is test string a valid URL, and accessible (via subrequest))
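A common use of the "-f" and "-d" tests, combined with the "!" negation mentioned earlier, is to send every request that does not correspond to a real file or directory to a single front controller script (a sketch - "index.php" and the "path" parameter are illustrative, not part of the examples above):

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ index.php?path=$1 [QSA,L] # Route everything else to one script

Real files (images, stylesheets) are served as normal, and everything else is handed to the script to interpret.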

Server Variables

Server variables are a selection of items you can test when writing rewrite conditions. This allows you to apply rules based on all sorts of request parameters, including browser identifiers, referring URL or a multitude of other strings. Variables are of the following format:

%{VARIABLE_NAME}

And "VARIABLE_NAME" can be replaced with any one of the following items:

  • HTTP Headers
    • HTTP_USER_AGENT
    • HTTP_REFERER
    • HTTP_COOKIE
    • HTTP_FORWARDED
    • HTTP_HOST
    • HTTP_PROXY_CONNECTION
    • HTTP_ACCEPT
  • Connection Variables
    • REMOTE_ADDR
    • REMOTE_HOST
    • REMOTE_USER
    • REMOTE_IDENT
    • REQUEST_METHOD
    • SCRIPT_FILENAME
    • PATH_INFO
    • QUERY_STRING
    • AUTH_TYPE
  • Server Variables
    • DOCUMENT_ROOT
    • SERVER_ADMIN
    • SERVER_NAME
    • SERVER_ADDR
    • SERVER_PORT
    • SERVER_PROTOCOL
    • SERVER_SOFTWARE
  • Dates and Times
    • TIME_YEAR
    • TIME_MON
    • TIME_DAY
    • TIME_HOUR
    • TIME_MIN
    • TIME_SEC
    • TIME_WDAY
    • TIME
  • Special Items
    • API_VERSION
    • THE_REQUEST
    • REQUEST_URI
    • REQUEST_FILENAME
    • IS_SUBREQ
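As an example of testing one of these variables, the HTTP_REFERER header can be used to block hotlinking of images (a sketch, assuming the site lives at www.pets.com):

RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?pets\.com/ [NC]
RewriteRule \.(gif|jpe?g|png)$ - [F,NC] # Forbid image requests from other sites

The first condition allows through requests with an empty referrer, since some browsers and proxies strip the header entirely.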

Working With Multiple Rules

The more complicated a site, the more complicated the set of rules governing it can be. This can be problematic when it comes to resolving conflicts between rules. You will find this issue rears its ugly head most often when you add a new rule to a file, and it doesn't work. What you may find, if the rule itself is not at fault, is that an earlier rule in the file is matching the URL and so the URL is not being tested against the new rule you've just added.

RewriteRule ^([A-Za-z0-9-]+)/([A-Za-z0-9-]+)/?$ get_product_by_name.php?category_name=$1&product_name=$2 [NC,L] # Process product requests
RewriteRule ^([A-Za-z0-9-]+)/([A-Za-z0-9-]+)/?$ get_blog_post_by_title.php?category_name=$1&post_title=$2 [NC,L] # Process blog posts

In the example above, the product pages of a site and the blog post pages have identical patterns. The second rule will never match a URL, because anything that would match that pattern will have already been matched by the first rule.

There are a few ways to work around this. Several CMSes (including WordPress) handle this by adding an extra portion to the URL to denote the type of request, like so:

RewriteRule ^products/([A-Za-z0-9-]+)/([A-Za-z0-9-]+)/?$ get_product_by_name.php?category_name=$1&product_name=$2 [NC,L] # Process product requests
RewriteRule ^blog/([A-Za-z0-9-]+)/([A-Za-z0-9-]+)/?$ get_blog_post_by_title.php?category_name=$1&post_title=$2 [NC,L] # Process blog posts

You could also write a single PHP script to process all requests, which checks to see whether the second part of the URL matches a blog post or a product. I usually go for this option, as while it may increase the load on the server slightly, it gives much cleaner URLs.

RewriteRule ^([A-Za-z0-9-]+)/([A-Za-z0-9-]+)/?$ get_product_or_blog_post.php?category_name=$1&item_name=$2 [NC,L] # Process product and blog requests

There are certain situations where you can work around this issue by writing more precise rules and ordering your rules intelligently. Imagine a blog where there were two archives - one by topic and one by year.

RewriteRule ^([A-Za-z0-9-]+)/?$ get_archives_by_topic.php?topic_name=$1 [NC,L] # Get archive by topic
RewriteRule ^([A-Za-z0-9-]+)/?$ get_archives_by_year.php?year=$1 [NC,L] # Get archive by year

The above rules will conflict. Of course, years are numeric and only 4 digits, so you can make that rule more precise, and by running it first the only type of conflict you could encounter would be if you had a topic with a 4-digit number for a name.

RewriteRule ^([0-9]{4})/?$ get_archives_by_year.php?year=$1 [NC,L] # Get archive by year
RewriteRule ^([A-Za-z0-9-]+)/?$ get_archives_by_topic.php?topic_name=$1 [NC,L] # Get archive by topic

mod_rewrite

Apache's mod_rewrite comes as standard with most Apache hosting accounts, so if you're on shared hosting, you are unlikely to have to do anything. If you're managing your own box, then you most likely just have to turn on mod_rewrite. If you are using Apache1, you will need to edit your httpd.conf file and remove the leading '#' from the following lines:

#LoadModule rewrite_module modules/mod_rewrite.so
#AddModule mod_rewrite.c

If you are using Apache2 on a Debian-based distribution, you need to run the following command and then restart Apache:

sudo a2enmod rewrite

Other distributions and platforms differ. If the above instructions are not suitable for your system, then Google is your friend. You may need to edit your apache2 configuration file and add "rewrite" to the "APACHE_MODULES" list, or edit httpd.conf, or even download and compile mod_rewrite yourself. For the majority, however, installation should be simple.

ISAPI_Rewrite

ISAPI_Rewrite is a URL rewriting plugin for IIS, based on mod_rewrite, and is not free. It provides most of the same functionality as mod_rewrite, and there is a good quality ISAPI_Rewrite forum where most common questions are answered. As ISAPI_Rewrite works with IIS, installation is relatively simple - there are installation instructions available.

ISAPI_Rewrite rules go into a file named httpd.ini. Errors will go into a file named httpd.parse.errors by default.

Leading Slashes

I have found myself tripped up numerous times by leading slashes in URL rewriting systems. Whether they should be used in the pattern or in the substitution section of a RewriteRule or used in a RewriteCond statement is a constant source of frustration to me. This may be in part because I work with different URL rewriting engines, but I would advise being careful of leading slashes - if a rule is not working, that's often a good place to start looking. I never include leading slashes in mod_rewrite rules and always include them in ISAPI_Rewrite.
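As an illustration of that convention, the same rule might be written differently for the two engines (a sketch - check your ISAPI_Rewrite version's documentation, as its syntax has varied between releases):

# mod_rewrite (.htaccess) - no leading slashes
RewriteRule ^pet-care/?$ pet_care_info_07_07_2008.php [NC,L]

# ISAPI_Rewrite (httpd.ini) - leading slashes
RewriteRule /pet-care/?$ /pet_care_info_07_07_2008.php [NC,L]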

Sample Rules

To redirect an old domain to a new domain:

RewriteCond %{HTTP_HOST} old_domain\.com [NC]
RewriteRule ^(.*)$ http://www.new_domain.com/$1 [L,R=301]

To redirect all requests missing "www" (yes www):

RewriteCond %{HTTP_HOST} ^domain\.com [NC]
RewriteRule ^(.*)$ http://www.domain.com/$1 [L,R=301]

To redirect all requests with "www" (no www):

RewriteCond %{HTTP_HOST} ^www\.domain\.com [NC]
RewriteRule ^(.*)$ http://domain.com/$1 [L,R=301]

Redirect old page to new page:

RewriteRule ^old-url\.htm$ http://www.domain.com/new-url.htm [NC,R=301,L]


Summary

Hopefully if you've made it this far you now have a clear understanding of what URL rewriting is and how to add it to your site. It is worth taking the time to become familiar with - it can benefit your SEO efforts immediately, and increase the usability of your site.



Dave Child - Mon, 04 Aug 2008 12:53:00 +0100
Block Referrer Spam (Updated)
http://www.addedbytes.com/blog/block-referrer-spam/

Log files are a useful tool for webmasters. It helps to know how people are finding your site, and what software they are using to view it, among other things. A strange decision by a small group of bloggers, though, has given unscrupulous marketers another window of opportunity to manipulate search engines to increase their traffic.

The decision made by these short-sighted bloggers was to display, on their site, a list of recent referrers to each page. I can't imagine any reason why a visitor might be in the least bit interested in seeing this, but nevertheless many sites now display referrers on every page.

As search engine spiders visit sites, they grab the contents of each page they visit. They use this snapshot in their index - meaning that although a page may change every minute or two, a search engine may be using a single copy of a page for several days, or even weeks.

So a referral URL that is on a page when the spiders come to visit can have quite a bit of value, if the search engine visiting uses link popularity in any way (Google uses link popularity, as do many others).

So marketers have started to use programs to visit pages using a fake referral header, to get their URLs listed on as many sites as possible, in the hopes that this will increase their traffic.

However, this renders log files almost completely useless. These fake visitors usually visit from search engines, having searched for a keyphrase relevant to their own site. They skew statistics relating to number of visitors received, the countries used to visit, the technology used to view the page, how users found the page, how long they spent on the site ... and so on.

A webmaster may find their search engine rankings dropping because of this, and they may find search engines have removed them completely. Many sites that use spam techniques are quickly identified and penalised, and penalties will often be applied to sites that link to them as well.

There are plenty of techniques available for blocking referrer spam, and everyone has their favourite. Personally, I use a combination of two techniques.

The first is fairly simple - my referrer log is not indexable. I don't display referrers on the pages of my site. My referral log is publicly available, but search engines are instructed to ignore it. This removes the main incentive for people to referrer-spam my site (the other reason for this type of spam - the hope that the site owner will themselves visit the spamming URL - is less common, because it has such a low response rate).

Second, I use an .htaccess file to block requests from whatever I've managed to identify as either a crawler designed to find URLs to spam or a spamming URL. This is a relatively simple blacklist, and though it cannot work as a long term solution to this problem, it keeps me happy for now.

To implement this technique on your own site, first make sure you are running Apache with mod_rewrite. If you are, create a file called ".htaccess" (just that, not .htaccess.txt or anything else) and paste the following into it:

Update: 14th September 2005

The list below has been expanded substantially over the last year, and now covers much more spam than before. As stated before, this is not a practical solution to the problem in the long term, as this list can only ever get longer and longer, and may become unmaintainable, or even (eventually) slow a site to a crawl as all the rules are processed. However, as of now, it is still a useful tool.

RewriteEngine on

# Block Referrer Spam

# Drugs / Herbal
RewriteCond %{HTTP_REFERER} (sleep-?deprivation) [NC,OR]
RewriteCond %{HTTP_REFERER} (sleep-?disorders) [NC,OR]
RewriteCond %{HTTP_REFERER} (insomnia) [NC,OR]
RewriteCond %{HTTP_REFERER} (phentermine) [NC,OR]
RewriteCond %{HTTP_REFERER} (phentemine) [NC,OR]
RewriteCond %{HTTP_REFERER} (vicodin) [NC,OR]
RewriteCond %{HTTP_REFERER} (hydrocodone) [NC,OR]
RewriteCond %{HTTP_REFERER} (levitra) [NC,OR]
RewriteCond %{HTTP_REFERER} (hgh-) [NC,OR]
RewriteCond %{HTTP_REFERER} (-hgh) [NC,OR]
RewriteCond %{HTTP_REFERER} (ultram-) [NC,OR]
RewriteCond %{HTTP_REFERER} (-ultram) [NC,OR]
RewriteCond %{HTTP_REFERER} (cialis) [NC,OR]
RewriteCond %{HTTP_REFERER} (soma-) [NC,OR]
RewriteCond %{HTTP_REFERER} (-soma) [NC,OR]
RewriteCond %{HTTP_REFERER} (diazepam) [NC,OR]
RewriteCond %{HTTP_REFERER} (gabapentin) [NC,OR]
RewriteCond %{HTTP_REFERER} (celebrex) [NC,OR]
RewriteCond %{HTTP_REFERER} (viagra) [NC,OR]
RewriteCond %{HTTP_REFERER} (fioricet) [NC,OR]
RewriteCond %{HTTP_REFERER} (ambien) [NC,OR]
RewriteCond %{HTTP_REFERER} (valium) [NC,OR]
RewriteCond %{HTTP_REFERER} (zoloft) [NC,OR]
RewriteCond %{HTTP_REFERER} (finasteride) [NC,OR]
RewriteCond %{HTTP_REFERER} (lamisil) [NC,OR]
RewriteCond %{HTTP_REFERER} (meridia) [NC,OR]
RewriteCond %{HTTP_REFERER} (allegra) [NC,OR]
RewriteCond %{HTTP_REFERER} (diflucan) [NC,OR]
RewriteCond %{HTTP_REFERER} (zovirax) [NC,OR]
RewriteCond %{HTTP_REFERER} (valtrex) [NC,OR]
RewriteCond %{HTTP_REFERER} (lipitor) [NC,OR]
RewriteCond %{HTTP_REFERER} (proscar) [NC,OR]
RewriteCond %{HTTP_REFERER} (acyclovir) [NC,OR]
RewriteCond %{HTTP_REFERER} (sildenafil) [NC,OR]
RewriteCond %{HTTP_REFERER} (tadalafil) [NC,OR]
RewriteCond %{HTTP_REFERER} (xenical) [NC,OR]
RewriteCond %{HTTP_REFERER} (melatonin) [NC,OR]
RewriteCond %{HTTP_REFERER} (xanax) [NC,OR]
RewriteCond %{HTTP_REFERER} (herbal) [NC,OR]
RewriteCond %{HTTP_REFERER} (drugs) [NC,OR]
RewriteCond %{HTTP_REFERER} (lortab) [NC,OR]
RewriteCond %{HTTP_REFERER} (adipex) [NC,OR]
RewriteCond %{HTTP_REFERER} (propecia) [NC,OR]
RewriteCond %{HTTP_REFERER} (carisoprodol) [NC,OR]
RewriteCond %{HTTP_REFERER} (tramadol) [NC]
RewriteRule .* - [F]

# Porn
RewriteCond %{HTTP_REFERER} (porno) [NC,OR]
RewriteCond %{HTTP_REFERER} (shemale) [NC,OR]
RewriteCond %{HTTP_REFERER} (gangbang) [NC,OR]
RewriteCond %{HTTP_REFERER} (-cock) [NC,OR]
RewriteCond %{HTTP_REFERER} (-anal) [NC,OR]
RewriteCond %{HTTP_REFERER} (-orgy) [NC,OR]
RewriteCond %{HTTP_REFERER} (cock-) [NC,OR]
RewriteCond %{HTTP_REFERER} (anal-) [NC,OR]
RewriteCond %{HTTP_REFERER} (orgy-) [NC,OR]
RewriteCond %{HTTP_REFERER} (singles-?christian) [NC,OR]
RewriteCond %{HTTP_REFERER} (dating-?christian) [NC,OR]
RewriteCond %{HTTP_REFERER} (cumeating) [NC,OR]
RewriteCond %{HTTP_REFERER} (cream-?pies) [NC,OR]
RewriteCond %{HTTP_REFERER} (cumsucking) [NC,OR]
RewriteCond %{HTTP_REFERER} (cumswapping) [NC,OR]
RewriteCond %{HTTP_REFERER} (cumfilled) [NC,OR]
RewriteCond %{HTTP_REFERER} (cumdripping) [NC,OR]
RewriteCond %{HTTP_REFERER} (krankenversicherung) [NC,OR]
RewriteCond %{HTTP_REFERER} (cumpussy) [NC,OR]
RewriteCond %{HTTP_REFERER} (suckingcum) [NC,OR]
RewriteCond %{HTTP_REFERER} (drippingcum) [NC,OR]
RewriteCond %{HTTP_REFERER} (pussycum) [NC,OR]
RewriteCond %{HTTP_REFERER} (swappingcum) [NC,OR]
RewriteCond %{HTTP_REFERER} (eatingcum) [NC,OR]
RewriteCond %{HTTP_REFERER} (cum-) [NC,OR]
RewriteCond %{HTTP_REFERER} (-cum) [NC,OR]
RewriteCond %{HTTP_REFERER} (sperm) [NC,OR]
RewriteCond %{HTTP_REFERER} (christian-?dating) [NC,OR]
RewriteCond %{HTTP_REFERER} (jewish-?singles) [NC,OR]
RewriteCond %{HTTP_REFERER} (sex-?meetings) [NC,OR]
RewriteCond %{HTTP_REFERER} (swinging) [NC,OR]
RewriteCond %{HTTP_REFERER} (swingers) [NC,OR]
RewriteCond %{HTTP_REFERER} (personals) [NC,OR]
RewriteCond %{HTTP_REFERER} (sleeping) [NC,OR]
RewriteCond %{HTTP_REFERER} (libido) [NC,OR]
RewriteCond %{HTTP_REFERER} (grannies) [NC,OR]
RewriteCond %{HTTP_REFERER} (mature) [NC,OR]
RewriteCond %{HTTP_REFERER} (enhancement) [NC,OR]
RewriteCond %{HTTP_REFERER} (sexual) [NC,OR]
RewriteCond %{HTTP_REFERER} (gay-?teen) [NC,OR]
RewriteCond %{HTTP_REFERER} (teen-?chat) [NC,OR]
RewriteCond %{HTTP_REFERER} (gay-?chat) [NC,OR]
RewriteCond %{HTTP_REFERER} (adult-?finder) [NC,OR]
RewriteCond %{HTTP_REFERER} (adult-?friend) [NC,OR]
RewriteCond %{HTTP_REFERER} (friend-?finder) [NC,OR]
RewriteCond %{HTTP_REFERER} (friend-?adult) [NC,OR]
RewriteCond %{HTTP_REFERER} (finder-?adult) [NC,OR]
RewriteCond %{HTTP_REFERER} (finder-?friend) [NC,OR]
RewriteCond %{HTTP_REFERER} (discrete-?encounters) [NC,OR]
RewriteCond %{HTTP_REFERER} (cheating-?wives) [NC,OR]
RewriteCond %{HTTP_REFERER} (housewives) [NC,OR]
RewriteCond %{HTTP_REFERER} (\-sex\.) [NC,OR]
RewriteCond %{HTTP_REFERER} (xxx) [NC,OR]
RewriteCond %{HTTP_REFERER} (snowballing) [NC]
RewriteRule .* - [F]

# Weight
RewriteCond %{HTTP_REFERER} (fat-) [NC,OR]
RewriteCond %{HTTP_REFERER} (-fat) [NC,OR]
RewriteCond %{HTTP_REFERER} (diet) [NC,OR]
RewriteCond %{HTTP_REFERER} (pills) [NC,OR]
RewriteCond %{HTTP_REFERER} (weight) [NC,OR]
RewriteCond %{HTTP_REFERER} (supplement) [NC]
RewriteRule .* - [F]

# Gambling
RewriteCond %{HTTP_REFERER} (texas-?hold-?em) [NC,OR]
RewriteCond %{HTTP_REFERER} (poker) [NC,OR]
RewriteCond %{HTTP_REFERER} (casino) [NC,OR]
RewriteCond %{HTTP_REFERER} (blackjack) [NC]
RewriteRule .* - [F]

# Loans / Finance
RewriteCond %{HTTP_REFERER} (mortgage) [NC,OR]
RewriteCond %{HTTP_REFERER} (refinancing) [NC,OR]
RewriteCond %{HTTP_REFERER} (cash-?advance) [NC,OR]
RewriteCond %{HTTP_REFERER} (cash-?money) [NC,OR]
RewriteCond %{HTTP_REFERER} (pay-?day) [NC]
RewriteRule .* - [F]

# User Agents
RewriteCond %{HTTP_USER_AGENT} (Program\ Shareware|Fetch\ API\ Request) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Microsoft\ URL\ Control) [NC]
RewriteRule .* - [F]

# Misc / Specific Sites
RewriteCond %{HTTP_REFERER} (netwasgroup\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (nic4u\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (wear4u\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (foxmediasolutions\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (liveplanets\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (aeterna-tech\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (continentaltirebowl\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (chemsymphony\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (infolibria\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (globaleducationeurope\.net) [NC,OR]
RewriteCond %{HTTP_REFERER} (soma\.125mb\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (mitglied\.lycos\.de) [NC,OR]
RewriteCond %{HTTP_REFERER} (jroundup\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (feathersandfurvanlines\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (conecrusher\.org) [NC,OR]
RewriteCond %{HTTP_REFERER} (sbj-broadcasting\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (edthompson\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (codychesnutt\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (artsmallforsenate\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (axionfootwear\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (protzonbeer\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (candiria\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (bigsitecity\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (coresat\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (istarthere\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (amateurvoetbal\.net) [NC,OR]
RewriteCond %{HTTP_REFERER} (alleghanyeda\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (xadulthosting\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (datashaping\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (zick\.biz) [NC,OR]
RewriteCond %{HTTP_REFERER} (newprinceton\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (dvdsqueeze\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (xopy\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (webdevboard\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (devaddict\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (eaton-inc\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (whiteguysgroup\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (guestbookz\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (webdevsquare\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (indfx\.net) [NC,OR]
RewriteCond %{HTTP_REFERER} (snap\.to) [NC,OR]
RewriteCond %{HTTP_REFERER} (2y\.net) [NC,OR]
RewriteCond %{HTTP_REFERER} (astromagia\.info) [NC,OR]
RewriteCond %{HTTP_REFERER} (free-?sms) [NC]
RewriteRule .* - [F]

The above will block just about all of the most common referral spam that I've seen so far. I'm adding to the list constantly (last addition: 14th September 2005), so if you're using it, do check back for updates.

This technique has two potential problems. The first is that, as more and more URLs are added, the list will in time become unmanageable. The second is that there is always a possibility that genuine visitors will be blocked. For that reason, on this site, instead of the last line above, I've actually used something a little more user-friendly:

RewriteRule .* bad_referrer.php [L]

Instead of a "Forbidden" message, this displays a quick note explaining why there has been an error and that the user can click on a link to proceed. If you want to check this out for yourself, try visiting http://www.addedbytes.com/swingers/block-referrer-spam/ (note the "swingers" portion of the URL). This page will reload with a new URL. Then try visiting http://www.addedbytes.com/spam/block-referrer-spam/. You should find you get a message explaining what has happened, and a URL to click if you want to proceed.
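Putting the pieces together, one of the blocks above rewritten to use the friendly page might look something like this (the keywords shown are just a small illustrative subset of the full list, and bad_referrer.php is whatever script you choose to explain the block):

```apache
RewriteEngine On

# Gambling-related referrers: send them to an explanation page
# instead of a bare 403 Forbidden response
RewriteCond %{HTTP_REFERER} (poker) [NC,OR]
RewriteCond %{HTTP_REFERER} (casino) [NC]
RewriteRule .* bad_referrer.php [L]
```

Note the [L] flag, which stops Apache processing any further rewrite rules once a spammy referrer has been matched.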

And there we have it. With minimum effort (for now), referral log spam on my site has been almost entirely eliminated. Before adding this set of rules and scripts, I was seeing around 200 fake referrals per day in my log files. Now, I see about 3 or 4 a week. Hopefully, this will continue until I can devise a better way of protecting against this kind of problem - before blacklists become impossible to manage.



Wed, 14 Sep 2005 11:36:00 +0100 http://www.addedbytes.com/blog/block-referrer-spam/ Dave Child
Password Protect a Directory with .htaccess http://www.addedbytes.com/blog/code/password-protect-a-directory-with-htaccess/ Password protecting a directory can be done several ways. Many people use PHP or ASP to verify users, but if you want to protect a directory of files or images (for example), that often isn't practical. Fortunately, Apache has a built-in method for protecting directories from prying eyes, using the .htaccess file.

In order to protect your chosen directory, you will first need to create an .htaccess file. This is the file that the server will check before allowing access to anything in the same directory. That's right, the .htaccess file belongs in the directory you are protecting, and you can have one in each of as many directories as you like.

You'll need first to define a few parameters for the .htaccess file. It needs to know where to find certain information, for example a list of valid usernames and passwords. This is a sample of the few lines required in an .htaccess file to begin with, telling it where the usernames and passwords can be found, amongst other things.

AuthUserFile /full/path/to/.htpasswd
AuthName "Please Log In"
AuthType Basic

You've now defined a few basic parameters for Apache to manage the authorisation process. First, you've defined the location of the .htpasswd file. This is the file that contains all the usernames and encrypted passwords for your site. We'll cover adding information to this file shortly. It's extremely important that you place this file outside of the web root. You should only be able to access it by FTP, not over the web.

The AuthName parameter defines the title of the password entry box shown when the user logs in. It's not the most important part of the file, but it should be defined. The AuthType tells the server what sort of authentication is in use, and "Basic" is the most common and perfectly adequate for almost any purpose.

We've told Apache where to find its files, but we've not told it which of the people defined in the .htpasswd file may access the directory. For that, we still have one more line to define.

If we want to grant access to everyone in the .htpasswd file, we can add this line ("valid-user" is like a keyword, telling Apache any user will do):

require valid-user

If we want to just grant access to a single user, we can use "user" and their username instead of "valid-user":

require user dave
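You can also list more than one username after "user", to grant access to several specific people (the usernames here are just examples):

```apache
require user dave sarah
```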

A normal and complete .htaccess file might look like this:

AuthUserFile /home/dave/.htpasswd
AuthName "Dave's Login Area"
AuthType Basic
require user dave

Now we have almost everything defined, but we are still missing an .htpasswd file. Without that, the server won't know what usernames and passwords are ok.

An .htpasswd file is made up of a series of lines, one for each valid user. Each line looks like this, with a username, then colon, then encrypted password:

username:encryptedpassword

The password encryption is the same as you'll find in PHP's crypt() function. It is not reversible, so you can't find out a password from the encrypted version. (Please note that on page 2 of this article is a tool to help you generate an .htpasswd file, that will help you encrypt passwords).

A user of "dave" and password of "dave" might be added with the following line:

dave:XO5UAT7ceqPvc

Each time you run an encryption function like crypt(), you will almost certainly get a different result. This is down to something called "salt", which in the example above was "XO" (the first two characters of the encrypted password). A different salt gives a different encrypted value, and if one is not explicitly specified, it will be randomly generated. Don't worry, though: the server is quite capable of understanding all this. If you came up with a different value for the encrypted password and replaced it, everything would still work fine, as long as the password was the same.
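To illustrate the salt idea (this is not part of the article's tools, just a sketch): in the traditional crypt() format, the salt is stored in plain sight as the first two characters of the encrypted password, so you can recover it from any .htpasswd line:

```python
# Illustrative only: split an .htpasswd line in the traditional
# crypt() format into its username, encrypted password and salt.
def parse_htpasswd_line(line):
    username, encrypted = line.strip().split(":", 1)
    salt = encrypted[:2]  # crypt() keeps the salt as the first two characters
    return username, encrypted, salt

user, encrypted, salt = parse_htpasswd_line("dave:XO5UAT7ceqPvc")
print(user, salt)  # dave XO
```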

Once you've created your .htpasswd file, you need to upload it to a safe location on your server, and check you've set the .htaccess file to point to it correctly. Then, upload the .htaccess file to the directory you want to protect and you'll be all set. Simply visit the directory to check it is all working.

.htpasswd Generator

The .htpasswd file needs encrypted passwords, which can be a problem for anyone without experience with a programming language. For that reason, I've created this simple tool, which, if you enter the username and password you wish to use, will generate the appropriate line to add to your .htpasswd file.
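If you'd rather generate lines yourself, here is a small sketch in Python. Rather than the crypt() format discussed above, it uses Apache's "{SHA}" password format, which htpasswd also accepts; the username and password are examples only:

```python
import base64
import hashlib

def htpasswd_sha_line(username, password):
    """Build an .htpasswd line in Apache's {SHA} format:
    the base64 encoding of the raw SHA-1 digest of the password."""
    digest = hashlib.sha1(password.encode("utf-8")).digest()
    return "%s:{SHA}%s" % (username, base64.b64encode(digest).decode("ascii"))

# Example (hypothetical credentials):
print(htpasswd_sha_line("dave", "dave"))
```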


[!htpasswd!]

Tue, 15 Mar 2005 09:58:46 +0000 http://www.addedbytes.com/blog/code/password-protect-a-directory-with-htaccess/ Dave Child
.htaccess Error Documents for Beginners http://www.addedbytes.com/articles/for-beginners/error-documents-for-beginners/ In Apache, you can set up each directory on your server individually, giving them different properties or requirements for access. And while you can do this through normal Apache configuration, some hosts may wish to give users the ability to set up their own virtual server how they like. And so we have .htaccess files, a way to set Apache directives on a directory by directory basis without the need for direct server access, and without being able to affect other directories on the same server.

One up-side of this (amongst many) is that with a few short lines in an .htaccess file, you can tell your server that, for example, when a user asks for a page that doesn't exist, they are shown a customized error page instead of the bog-standard error page they've seen a million times before. If you visit http://www.addedbytes.com/random_made_up_address then you'll see this in action - instead of your browser's default error page, you see an error page sent by my server to you, telling you that the page you asked for doesn't exist.

This has a fair few uses. For example, my 404 (page not found) error page also sends me an email whenever somebody ends up there, telling me which page they were trying to find, and where they came from to find it - hopefully, this will help me to fix broken links without needing to trawl through mind-numbing error logs.

[Aside: If you set up your custom error page to email you whenever a page isn't found, remember that "/favicon.ico" requests failing doesn't mean that a page is missing. Internet Explorer 5 assumes everyone has a "favicon" and so asks the server for it. It's best to filter error messages about missing "/favicon.ico" files from your error logging, if you plan to do any.]
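The filtering step from that aside might look like the following sketch (the function and the path list are made up for illustration; your error page script would call something like this before sending an email):

```python
# Paths whose 404s don't indicate a broken link and shouldn't be emailed
IGNORED_PATHS = {"/favicon.ico"}

def should_report_404(requested_path):
    """Return True if a missing-page request is worth emailing about."""
    return requested_path not in IGNORED_PATHS

print(should_report_404("/favicon.ico"))        # False
print(should_report_404("/some/old/page.html")) # True
```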

Setting up your .htaccess file is a piece of cake. First things first, open notepad (or better yet, [url=http://www.editplus.com/]EditPlus2[/url]), and add the following to a new document:

ErrorDocument 404     /404.html

Next you need to save the file. You need to save it as ".htaccess". Not ".htaccess.txt", or "mysite.htaccess" - just ".htaccess". I know it sounds strange, but that is what these files are - just .htaccess files. Nothing else. Happy? If not, take a look at this [url=http://wsabstract.com/howto/htaccess.shtml].htaccess guide[/url], which also explains the naming convention of .htaccess in a little more depth. If you do use Notepad, you may need to rename the file after saving it, and you can do this before or after uploading the file to your server.

Now, create a page called 404.html, containing whatever you want a visitor to your site to see when they try to visit a page that doesn't exist. Now, upload both to your website, and type in a random, made-up address. You should, with any luck, see your custom error page instead of the traditional "Page Not Found" error message. If you do not see that, then there is a good chance your server does not support .htaccess, or it has been disabled. I suggest the next thing you do is check quickly with your server administrator that you are allowed to use .htaccess to serve custom error pages.

If all went well, and you are now viewing a custom 404 (page not found) error page, then you are well on your way to a complete set of error documents to match your web site. There are more errors out there, you know, not just missing pages. Of course, you can also use PHP, ASP or CFML pages as error documents - very useful for keeping track of errors.

You can customize these directives a great deal. For example, you can add directives for any of the status codes below, to show custom pages for any error the server may report. You can also, if you want, specify a full URL instead of a relative one. And if you are truly adventurous, you could even use pure HTML in the .htaccess file to be displayed in case of an error, as below. Note that if you want to use HTML, you must start the HTML with a quotation mark, however you should not put one at the other end of the HTML (you can include quotation marks within the HTML itself as normal).

ErrorDocument 404 "Ooops, that page was <b>not found</b>. Please try a different one or <a href="mailto:owner@site.com">email the site owner</a> for assistance.

Server response codes

A server response code is a three digit number sent by a server to a user in response to a request for a web page or document. They tell the user whether the request can be completed, or if the server needs more information, or if the server cannot complete the request. Usually, these codes are sent 'silently' - so you never see them, as a user - however, there are some common ones that you may wish to set up error pages for, and they are listed below. Most people will only ever need to set up error pages for server codes 400, 401, 403, 404 and 500, and you would be wise to always have an error document for 404 errors at the very least.

It is also relatively important to ensure that any error page is over 512 bytes in size. When sent an error page of less than 512 bytes, Internet Explorer 5 will display its own default error document instead of yours. Feel free to use padding if this is an issue - personally, I'm not going to increase the size of a page because Internet Explorer 5 doesn't behave well.

In order to set up an error page for any other error codes, you simply add more lines to your .htaccess file. If you wanted to have error pages for the above five errors, your .htaccess file might look something like this:

ErrorDocument 400     /400.html
ErrorDocument 401     /401.html
ErrorDocument 403     /403.html
ErrorDocument 404     /404.html
ErrorDocument 500     /500.html

Informational

  • 100 - Continue
  • 101 - Switching Protocols
Successful
  • 200 - OK
  • 201 - Created
  • 202 - Accepted
  • 203 - Non-Authoritative Information
  • 204 - No Content
  • 205 - Reset Content
  • 206 - Partial Content
Redirection
  • 300 - Multiple Choices
  • 301 - Moved Permanently
  • 302 - Found
  • 303 - See Other
  • 304 - Not Modified
  • 305 - Use Proxy
  • 307 - Temporary Redirect
Client Error
  • 400 - Bad Request
  • 401 - Unauthorized
  • 402 - Payment Required
  • 403 - Forbidden
  • 404 - Not Found
  • 405 - Method Not Allowed
  • 406 - Not Acceptable
  • 407 - Proxy Authentication Required
  • 408 - Request Timeout
  • 409 - Conflict
  • 410 - Gone
  • 411 - Length Required
  • 412 - Precondition Failed
  • 413 - Request Entity Too Large
  • 414 - Request-URI Too Long
  • 415 - Unsupported Media Type
  • 416 - Requested Range Not Satisfiable
  • 417 - Expectation Failed
Server Error
  • 500 - Internal Server Error
  • 501 - Not Implemented
  • 502 - Bad Gateway
  • 503 - Service Unavailable
  • 504 - Gateway Timeout
  • 505 - HTTP Version Not Supported

If you are interested, I have also put together a more thorough [url=http://www.addedbytes.com/apache/http-status-codes-explained/]explanation of the meaning of the HTTP Status Response Codes[/url].



Sun, 23 Nov 2003 11:28:02 +0000 http://www.addedbytes.com/articles/for-beginners/error-documents-for-beginners/ Dave Child