Added Bytes

Blog

Dropbox is an excellent cross-platform freemium file synchronisation and online storage application. If that doesn't have you salivating already, it has a few more tricks up its sleeve.


Hashes are used almost everywhere on the web, behind the scenes, to protect your passwords. Learn why it's important to always add salt to your hashes.


Blogging in Business


20 November 2009   |   Comments   |   marketing, blogging, social media, seo, content

As internet marketers, we are always telling our clients to start blogging, and it isn't always an easy sell.


The second article in the "Improve Your Website Conversion Rate" series. Learned the lessons of part 1? Here are nine more ways to improve your conversion rate.


The fourth part of the Writing Secure PHP series, covering cross-site scripting, cross-site request forgery and character encoding security issues.


XSS Alarm Userscript


11 September 2008   |   Comments

A user script for Opera, Firefox and Chrome that notifies you when a site is loading scripts from unrecognised third parties to help you spot potential XSS attacks more easily.


A beginner's guide to URL rewriting, with plenty of examples.


Readability Code Open Sourced


1 August 2008   |   Comments   |   readability

In July 2004 I wrote some code to calculate the readability of text using the most common algorithms available (Flesch-Kincaid and Gunning-Fog). The code hasn't aged well, and had many flaws, especially when it came to the subject of syllable counting.

Syllable counting is a tricky prospect. Consider the following sentence, for example: "I moped about, hopeful that my moped would be back on the highway soon". Sound innocuous? There's a pair of homographs in there (two words, spelled the same, that sound different) - and these have different syllable counts depending on which of the two words you mean. Words can be almost identical, with the same order and number of consonants and vowels (and it's that order you generally use to calculate syllable numbers) - "sired" has one syllable, while "sided" has two. Throw in prefixes, suffixes, plurals and compound words and you've got yourself a challenge.
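To make the problem concrete, here is a minimal sketch of the "count the vowel groups" approach, written in Python for brevity. The function name is made up for illustration and this is not the code from the project; it simply shows why the order of vowels and consonants alone is not enough.

```python
import re

def naive_syllable_count(word: str) -> int:
    """Count one syllable per run of consecutive vowels (hypothetical helper)."""
    vowel_runs = re.findall(r"[aeiouy]+", word.lower())
    # Every word has at least one syllable, even if no vowel run was found.
    return max(1, len(vowel_runs))

# Both words have the same consonant/vowel pattern, so the naive rule
# cannot tell them apart, even though "sired" is spoken as one syllable.
print(naive_syllable_count("sired"))  # 2
print(naive_syllable_count("sided"))  # 2
```

The same limitation applies to "moped": whichever answer the rule gives, it is wrong for one of the two meanings.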

Syllable counting is a minefield, with a small set of rules and a massive set of exceptions to handle.

That said, I've spent some time working through a set of test data and have come up with a small set of rules to take on the task. It helped tremendously having the work of Greg Fast (creator of the Perl module Lingua::EN::Syllable) handy for reference, and setting up a decent set of unit tests allowed me to experiment with different rules until I found a set that works. So far, at least. I expect to find more and more exceptions as time goes on, and hopefully the rules can be expanded to account for them.
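One way to picture "a small set of rules plus a list of exceptions" is a lookup table that is checked before the general rule runs. The sketch below is an assumption about the general shape of such an approach, not the rules used in the project; the table contents and function names are hypothetical.

```python
import re

# Hypothetical exception table: words the general rule gets wrong,
# with syllable counts worked out by hand.
EXCEPTIONS = {"sired": 1, "shoreline": 2, "hopeful": 2}

def rule_based_count(word: str) -> int:
    """General rule: one syllable per run of vowels, minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word)))

def syllable_count(word: str) -> int:
    """Check the exception table first, then fall back to the rule."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    return rule_based_count(word)
```

Each newly discovered exception becomes one more table entry (or, better, a refinement of the rule) plus one more unit test. Homographs such as "moped" cannot be fixed this way, because the spelling alone does not say which word is meant.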

It wasn't just the syllable counting that was bad. The code was inefficient, disorganised, and incapable of handling anything unpredictable (every extra space counted as an extra word, for example). There were lines in there that didn't make any sense. And I hadn't documented anything, so couldn't tell you why I'd added them in the first place. Oh to be that young and inexperienced again ...
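The extra-space problem is easy to reproduce. The following Python sketch (function names are illustrative, not taken from the project) contrasts splitting on a single space with splitting on any run of whitespace.

```python
def naive_word_count(text: str) -> int:
    # Splitting on one space yields an empty string for every extra space,
    # and each empty string is then counted as a word.
    return len(text.split(" "))

def word_count(text: str) -> int:
    # split() with no argument splits on runs of whitespace and
    # ignores leading and trailing whitespace.
    return len(text.split())

sample = "one  two"  # two spaces between the words
print(naive_word_count(sample))  # 3
print(word_count(sample))        # 2
```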

So, as with the other releases in the last few weeks, I went back and rewrote the code properly. The new and improved version has been released as a Google Code project by the name of PHP Text Statistics. It's released (as with the other projects I've set free recently) under a New BSD License.

TextStatistics.php

It consists (so far) of a single class that will tell you various things about the text you feed it:

  • String length
  • Letter count
  • Syllable count
  • Sentence count
  • Average words per sentence
  • Average syllables per word

It will also calculate the readability of the text you enter according to six readability algorithms, including the Flesch-Kincaid and Gunning-Fog measures.
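For reference, here is a sketch of the Flesch-Kincaid and Gunning-Fog calculations using the commonly published formulas. It is written in Python and takes pre-computed counts as input; the function and parameter names are assumptions for illustration, and the class itself may structure these calculations differently.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def gunning_fog(words: int, sentences: int, complex_words: int) -> float:
    """Gunning-Fog index; complex words are those of three or more syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))
```

All three depend on the word, sentence and syllable counts listed above, which is why an error in syllable counting feeds straight through into the scores.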

TextStatistics.php4

There is also a PHP4 compatible version of the code. At the time of writing, it returns the correct scores for test data, though given PHP4's decline and the rise of PHP5, this version may not remain as current as the previous file.

tests/

The next thing to be aware of is the unit tests included in the project. There's no easy way to check your calculations are correct, unless you have a set of verified numbers to compare them against. So, I put together (so far) three files with a variety of different tests for the code. These tests should be run with PHPUnit and at the time of writing they all pass (which means there aren't enough of them yet).

tests/TextStatisticsTest.php

The basic unit test class lists a large selection of words and compares their calculated syllable count with their actual syllable count (worked out the old-fashioned way). It includes a variety of tests to ensure sentence counting and word counting both work as intended. It also includes a small selection of sentences, for which readability scores have been calculated by hand, and checks that the class returns the correct scores for these items.
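The idea of checking calculated counts against hand-counted ones can be sketched as a table-driven test. The project's tests use PHPUnit; the example below swaps in Python's unittest module purely for illustration, and the function under test is a hypothetical stand-in rather than the project's code.

```python
import re
import unittest

def syllable_count(word: str) -> int:
    """Stand-in for the function under test (hypothetical, for illustration)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

# Each entry pairs a word with its syllable count worked out by hand.
KNOWN_WORDS = [
    ("highway", 2),
    ("soon", 1),
    ("sided", 2),
    ("back", 1),
]

class SyllableCountTest(unittest.TestCase):
    def test_known_words(self):
        for word, expected in KNOWN_WORDS:
            # subTest reports each failing word separately.
            with self.subTest(word=word):
                self.assertEqual(syllable_count(word), expected)
```

Saved to a file, this can be run with `python -m unittest`. Adding a newly found problem word is then a one-line change to the table.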

tests/TextStatisticsKiplingIf.php

Rudyard Kipling's If is, aside from being a brilliant piece of inspiring poetry, one long sentence composed of lots and lots of short words (take a look - it's impressive how few of the words are multi-syllabic). This file contains a selection of tests to run against If. It checks that all of the words of the poem have their syllable count correctly calculated, and that all of the readability scores are correctly calculated, by matching the calculated scores against hand-calculated numbers.

tests/TextStatisticsMelvilleMobyDick.php

Herman Melville's Moby Dick is up next (well, the first paragraph is - I'm not prepared to count, by hand, the number of syllables in the entire book). Like If, it is (I believe) in the public domain, so can be used for this sort of purpose without complications. It's also a brilliant read. This file contains a selection of tests to run against the first paragraph of Moby Dick. It checks all of the words of the passage have their syllable count correctly calculated, and that all of the readability scores are correctly calculated, by matching the calculated scores against hand-calculated numbers.

Get Involved!

This project can benefit from the involvement of people in many ways. Initially, the most helpful thing anyone can do is find words whose syllable count is not correctly calculated by the script and add a new test for that word. There are going to be a lot out there (especially compound words, like "shoreline", and odd words that are not pronounced according to normal rules, like "simile").

The class could be expanded to give more information about text - like letter frequencies, word and phrase frequencies (useful for SEO) and unique word count, among other things. I've made a start on making the code multi-byte character set safe, but there's lots more to do there too.
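As a rough idea of what those extra statistics might look like, here is a Python sketch. The function name and return structure are made up for illustration, and the word pattern assumes plain ASCII English text, so it does not address the multi-byte issue.

```python
import re
from collections import Counter

def text_frequencies(text: str) -> dict:
    """Return letter frequencies, word frequencies and unique word count."""
    lowered = text.lower()
    # Assumption: words are runs of ASCII letters and apostrophes.
    words = re.findall(r"[a-z']+", lowered)
    letters = [ch for ch in lowered if ch.isalpha()]
    return {
        "letter_frequencies": Counter(letters),
        "word_frequencies": Counter(words),
        "unique_word_count": len(set(words)),
    }

stats = text_frequencies("The cat saw the other cat")
print(stats["word_frequencies"].most_common(2))
print(stats["unique_word_count"])  # 4
```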

The really brave could add more test text, too. Paragraphs of (public domain) text provide an excellent way to check the tool is working as it should. I'd suggest using either the Kipling or Melville file as a template to work from, and prepare for a boring few hours. You get a great feeling of satisfaction at the end, though, when the whole thing is done!

There's a discussion group for ... well, discussion. Suggestions, comments and feedback all welcome. If you would like to get involved in this project, start there (or email me), or grab a copy of the code from the SVN repository on Google Code.

Modem Emulator Open Sourced


18 July 2008   |   Comments

In July 2004 I released a modem emulator (a.k.a. a throughput throttling proxy). It was created to help give designers a sense of how their sites function for people with slower connections.
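The basic idea behind throttling throughput can be sketched in a few lines: send the response in small chunks and pause after each one so that the average rate stays near a chosen cap. This Python sketch is only an illustration of that general technique; the names and the chunk size are assumptions, and it is not how the emulator itself is implemented.

```python
import time

def throttled_chunks(data: bytes, bytes_per_second: int,
                     chunk_size: int = 1024, sleep=time.sleep):
    """Yield data in chunks, pausing after each to approximate a slow link."""
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        yield chunk
        # Wait for as long as this chunk would take at the capped rate.
        sleep(len(chunk) / bytes_per_second)

# A 56 kbit/s modem moves at most 7000 bytes per second (56000 / 8).
page = b"x" * 2000
for chunk in throttled_chunks(page, bytes_per_second=7000):
    pass  # a real proxy would write each chunk to the client here
```

The `sleep` parameter exists only so the delay can be replaced in tests.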

I've had to take it offline a number of times due to the volume of traffic and the various ways it was being used (turns out it was a highly effective way to bypass workplace web filters).

Not only that, the code was badly out of date (code soup, not an object in sight, no real validation ... the shame) and badly needed an update.

It's been sitting there, half-working and half-not, and begging for an update for almost exactly 4 years. Ultimately, the choice was to update it or kill it permanently.

So, I spent some quality time rewriting the whole thing, pretty much from the ground up, and am now pleased to announce that it has been turned into an open source project (yes, another one). The code is available from Google Code under a New BSD License.

With any luck, this will allow more people to make this tool part of their workflow.

I've updated the Email Address Validation function posted in June 2004. I've converted it to a PHP5 compatible class, and released it under a New BSD License on Google Code.

