Tagged with "readability" - Web Development in Brighton - Added Bytes

Readability-Score.com Gets Some Love
http://www.addedbytes.com/blog/readability-score.com-gets-some-love/

Way, way back in 2004, I wrote a piece of code to analyze the readability of text using the various algorithms that had been developed for the task. Usage grew, and over the next few years it became an Open Source project and got its own website. And then it sat around, being useful but quiet.

Every so often I would take a look at the stats, modify a little piece of code, or actually use the tool myself. But more often than not, it didn't really venture into my thoughts, except briefly when some kind soul made a donation because the site was useful to them.

But that's no way to run a website, even if it is a small, single-purpose site. So, it's had some attention. Dinner and a movie, a quick fumble behind the bike sheds - usual sort of thing.

What's New?

One of the first changes that you might spot on the new site is the addition of a great big "Premium" tab. Premium supporters pay a small amount every year (as much as they like) to help pay for the site. (They also get access to the new bulk processing feature - more info below.)

Next, Readability-Score.com now supports URLs. Instead of copying and pasting text from another site, you can just drop in the URL, and it will go off, fetch the page, process the text and give you a score for the readability of the page. If you're familiar with IDs in HTML, you can also specify a single portion of the page, and it will score just that part - good for preventing navigation and footer elements from skewing your text scores.
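A minimal sketch of how scoring one portion of a page by its ID might work. This is not the site's actual code: the helper name extract_text_by_id is hypothetical, and it assumes PHP's DOM extension is available and that the HTML has already been fetched into a string.

```php
<?php
// Hypothetical helper, not the site's code: return the text content of the
// element whose id attribute equals $id, or of the whole page when $id is null.
function extract_text_by_id(string $html, ?string $id = null): string
{
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate messy real-world markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $doc->documentElement;
    if ($id !== null) {
        $node = null;
        $xpath = new DOMXPath($doc);
        // Compare ids in PHP rather than building the id into the XPath string.
        foreach ($xpath->query('//*[@id]') as $element) {
            if ($element->getAttribute('id') === $id) {
                $node = $element;
                break;
            }
        }
    }
    if ($node === null) {
        return '';
    }

    // Collapse runs of whitespace so they do not inflate word counts.
    return trim(preg_replace('/\s+/', ' ', $node->textContent));
}

$html = '<html><body><div id="nav">Home About</div>'
      . '<div id="content"><p>Toad is rather rich.</p></div></body></html>';

echo extract_text_by_id($html, 'content'), "\n";
```

Passing no ID returns the text of the whole page, navigation included, which is exactly the skew the ID option is there to avoid.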

The site is also now served over HTTPS. Because why not.

There's now a quick and convenient bookmarklet, for sending the URL of the page you're browsing to Readability-Score.com.

Next, there is now a handy bulk processing feature, for people with lots of text or URLs to score. It takes a CSV, processes the text or URLs within it, and emails the results back to you. Bulk processing is only available to premium supporters at the moment.

It's had a small set of changes made to the codebase, thanks in part to fellow text statistics enthusiasts on GitHub. So it's now a little faster and a tiny bit more accurate. Though it still can't tell how many syllables there are in moped, and doesn't handle numbers particularly well.

Finally, everyone who has donated to Readability-Score in the last couple of years has been made a premium supporter and should have received an email with their login details; if you donated and haven't received your login, please let me know.



Tue, 11 Feb 2014 14:15:28 +0000 http://www.addedbytes.com/blog/readability-score.com-gets-some-love/ Dave Child
Text Readability Scores
http://www.addedbytes.com/blog/code/readability-score/

This tool has moved! It's now on its own domain, at Readability-Score.com. (It's also had a small makeover and is now full of speedy AJAXy goodness.)

The code that powers the tool is still available on GitHub.



Wed, 07 Jul 2004 15:14:59 +0100 http://www.addedbytes.com/blog/code/readability-score/ Dave Child
Readability Code Open Sourced
http://www.addedbytes.com/blog/readability-code-open-sourced/

In July 2004 I wrote some code to calculate the readability of text using the most common algorithms available (Flesch-Kincaid and Gunning-Fog). The code hasn't aged well, and had many flaws, especially when it came to the subject of syllable counting.

Syllable counting is a tricky prospect. Consider the following sentence, for example: "I moped about, hopeful that my moped would be back on the highway soon". Sound innocuous? There's a pair of homographs in there (two words, spelled the same, that sound different) - and these have different syllable counts depending on which of the two words you mean. Words can be almost identical, with the same order and number of consonants and vowels (and it's that order you generally use to calculate syllable numbers) - "sired" has one syllable, while "sided" has two. Throw in prefixes, suffixes, plurals and compound words and you've got yourself a challenge.
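To make the problem concrete, here is a deliberately naive sketch that counts syllables purely by counting groups of vowels. The function name naive_syllable_count is hypothetical and this is not the project's algorithm; it only shows why vowel and consonant order alone cannot separate "sired" from "sided".

```php
<?php
// Naive approach (an illustration only): one syllable per run of vowels.
function naive_syllable_count(string $word): int
{
    $word = strtolower(preg_replace('/[^a-zA-Z]/', '', $word));
    $vowelGroups = preg_split('/[^aeiouy]+/', $word, -1, PREG_SPLIT_NO_EMPTY);
    return max(1, count($vowelGroups));
}

// Both words have the same vowel and consonant pattern, so both score 2,
// even though "sired" is spoken as a single syllable.
foreach (['sired', 'sided', 'moped'] as $word) {
    echo $word, ': ', naive_syllable_count($word), "\n";
}
```

The two readings of "moped" cannot be told apart by any rule that only looks at the spelling, which is why exceptions end up outnumbering rules.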

Syllable counting is a minefield, with a small set of rules and a massive set of exceptions to handle.

That said, I've spent some time working through a set of test data and have come up with a small set of rules to take on the task. It helped tremendously having the work of Greg Fast (creator of Perl module Lingua::EN::Syllables) handy for reference, and setting up a decent set of unit tests allowed me to experiment with different rules until I found a set that works. So far. I expect to find more and more exceptions as time goes on, and hopefully the rules can be expanded to account for them.

It wasn't just the syllable counting that was bad. The code was inefficient, disorganised, and incapable of handling anything unpredictable (every extra space counted as an extra word, for example). There were lines in there that didn't make any sense. And I hadn't documented anything, so couldn't tell you why I'd added them in the first place. Oh to be that young and inexperienced again ...
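The extra-space problem is easy to reproduce. The sketch below is an illustration rather than the released code, and the function name count_words is hypothetical: it contrasts splitting on a single space with splitting on runs of whitespace.

```php
<?php
// Split on runs of whitespace and drop empty pieces, so doubled spaces
// and trailing spaces do not count as words.
function count_words(string $text): int
{
    return count(preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY));
}

$text = "The  stables are   over there. ";

// explode() returns an empty string for every extra space: 9 pieces here.
echo count(explode(' ', $text)), "\n";

// The whitespace-aware version finds the 5 real words.
echo count_words($text), "\n";
```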

So, as with the other releases in the last few weeks, I went back and rewrote the code properly. The new and improved version has been released as a Google Code project by the name of PHP Text Statistics. It's released (as with the other projects I've set free recently) under a New BSD License.

TextStatistics.php

It consists (so far) of a single class that will tell you various things about the text you feed it:

  • String length
  • Letter count
  • Syllable count
  • Sentence count
  • Average words per sentence
  • Average syllables per word

It will also calculate the readability of the text you enter according to the 6 known algorithms (links go to Wikipedia):

TextStatistics.php4

There is also a PHP4 compatible version of the code. At the time of writing, it returns the correct scores for test data, though given PHP4's decline and the rise of PHP5, this version may not remain as current as the previous file.

tests/

Next thing to be aware of is the unit tests included in the project. There's no easy way to check your calculations are correct, unless you have a set of verified numbers to compare them against. So, I put together (so far) three files with a variety of different tests for the code. These tests should be run with PHPUnit and at the time of writing they all pass (which means there's not enough of them yet).

tests/TextStatisticsTest.php

The basic unit test class lists a large selection of words and compares their calculated syllable count with their actual syllable count (worked out the old fashioned way). It includes a variety of tests to ensure sentence counting and word counting both work as intended. It also includes a small selection of sentences, for which readability scores have been calculated by hand, and checks that the class returns the correct scores for these items.
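The idea of checking calculated counts against hand-counted ones can be sketched without PHPUnit. Everything here is an assumption made for illustration: naive_count is a stand-in counter, not the project's class, and find_syllable_mismatches is a hypothetical helper.

```php
<?php
// Stand-in counter (an assumption): one syllable per run of vowels.
function naive_count(string $word): int
{
    $groups = preg_split('/[^aeiouy]+/', strtolower($word), -1, PREG_SPLIT_NO_EMPTY);
    return max(1, count($groups));
}

// Run a counter over hand-counted words and return only the failures.
function find_syllable_mismatches(callable $counter, array $expected): array
{
    $mismatches = [];
    foreach ($expected as $word => $syllables) {
        $actual = $counter($word);
        if ($actual !== $syllables) {
            $mismatches[$word] = ['expected' => $syllables, 'actual' => $actual];
        }
    }
    return $mismatches;
}

// Counts worked out the old fashioned way.
$handCounted = ['cat' => 1, 'sided' => 2, 'sired' => 1, 'simile' => 3];

print_r(find_syllable_mismatches('naive_count', $handCounted));
```

With this stand-in counter only "sired" is reported, which is the kind of word that then earns its own rule or exception.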

tests/TextStatisticsKiplingIf.php

Rudyard Kipling's If is, aside from a brilliant piece of inspiring poetry, one long sentence comprised of lots and lots of short words (take a look - impressive how few of the words are multi-syllabic). This file contains a selection of tests to run against If. It checks all of the words of the poem have their syllable count correctly calculated, and that all of the readability scores are correctly calculated, by matching the calculated scores against hand-calculated numbers.

tests/TextStatisticsMelvilleMobyDick.php

Herman Melville's Moby Dick is up next (well, the first paragraph is - I'm not prepared to count, by hand, the number of syllables in the entire book). Like If, it is (I believe) in the public domain, so can be used for this sort of purpose without complications. It's also a brilliant read. This file contains a selection of tests to run against the first paragraph of Moby Dick. It checks all of the words of the passage have their syllable count correctly calculated, and that all of the readability scores are correctly calculated, by matching the calculated scores against hand-calculated numbers.

Get Involved!

This project can benefit from the involvement of people in many ways. Initially, the most helpful thing anyone can do is find words whose syllable count is not correctly calculated by the script and add a new test for that word. There are going to be a lot out there (especially compound words, like "shoreline", and odd words that are not pronounced according to normal rules, like "simile").

The class could be expanded to give more information about text - like letter frequencies, word and phrase frequencies (useful for SEO) and unique word count, among other things. I've made a start on making the code multi-byte character set safe, but there's lots more to do there too.
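As a rough idea of what such additions could look like, here is a sketch of letter frequencies and a unique word count that copes with multi-byte text. The function names are hypothetical and not part of the class, and the code assumes UTF-8 input and the mbstring extension.

```php
<?php
// Count how often each letter appears, case-insensitively.
// Assumes UTF-8 input; \p{L} matches letters in any script.
function letter_frequencies(string $text): array
{
    preg_match_all('/\p{L}/u', mb_strtolower($text, 'UTF-8'), $matches);
    $frequencies = array_count_values($matches[0]);
    arsort($frequencies); // most common letters first
    return $frequencies;
}

// Count distinct words, treating runs of letters and apostrophes as words.
function unique_word_count(string $text): int
{
    preg_match_all("/[\\p{L}']+/u", mb_strtolower($text, 'UTF-8'), $matches);
    return count(array_unique($matches[0]));
}

print_r(letter_frequencies('Été'));
echo unique_word_count('The cat and the hat'), "\n";
```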

The really brave could add more test text, too. Paragraphs of (public domain) text provide an excellent way to check the tool is working as it should. I'd suggest using either the Kipling or Melville file as a template to work from, and prepare for a boring few hours. You get a great feeling of satisfaction at the end, though, when the whole thing is done!

There's a discussion group for ... well, discussion. Suggestions, comments and feedback all welcome. If you would like to get involved in this project, start there (or email me), or grab a copy of the code from the SVN repository on Google Code.



Fri, 01 Aug 2008 12:01:00 +0100 http://www.addedbytes.com/blog/readability-code-open-sourced/ Dave Child
Flesch-Kincaid Reading Level
http://www.addedbytes.com/blog/code/flesch-kincaid-function/

PLEASE NOTE: This code is now considered out of date. An updated version has been released under an open source license as a Google Code project: php-text-statistics. There is more about this change in the post Readability Code Open Sourced.

A tool for checking the readability scores of text is available - this article covers the functions behind that tool.

Calculations based upon word structure can tell you a fair bit about the text on your site, most notably the readability of your copy. A lot of sites have text on them that is simply too advanced for their users, which is as useful as having no text at all.

It is therefore usually a good idea to check the copy on your website as thoroughly as possible. Spelling and grammar should be checked as a matter of course. You should also check how difficult your text is to read. If a user cannot easily understand what they are reading, they will leave the site and find one they can comprehend.

The following are two calculations that can give you an indicator of how easy your text is to read.

Flesch-Kincaid Reading Ease

The Flesch-Kincaid reading ease score is worked out using the following calculation, which gives a number. The higher that number is, the easier the text is to read.

206.835 - (1.015 * average_words_sentence) - (84.6 * average_syllables_word)
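As a quick worked example of the formula above, the sketch below plugs in averages that have already been measured. The helper name reading_ease_from_averages is hypothetical and the input figures are made up for illustration.

```php
<?php
// Hypothetical helper: apply the reading ease formula to known averages.
function reading_ease_from_averages(float $wordsPerSentence, float $syllablesPerWord): float
{
    return 206.835 - (1.015 * $wordsPerSentence) - (84.6 * $syllablesPerWord);
}

// 15 words per sentence, 1.5 syllables per word:
// 206.835 - 15.225 - 126.9 = 64.71
echo round(reading_ease_from_averages(15, 1.5), 2), "\n";

// Longer sentences and longer words pull the score down sharply:
// 206.835 - 25.375 - 169.2 = 12.26
echo round(reading_ease_from_averages(25, 2.0), 2), "\n";
```

Note how much heavier the syllable term is: half a syllable more per word costs 42.3 points, while ten more words per sentence costs about 10.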

The function you will need to use to work this score out (in addition to the three at the bottom of this page) is:

function calculate_flesch($text) { return (206.835 - (1.015 * average_words_sentence($text)) - (84.6 * average_syllables_word($text))); }

And you can call the function like so:

$flesch_score = calculate_flesch($text);

Flesch-Kincaid Grade level

The Flesch-Kincaid grade level is a similar calculation, but it gives a number that corresponds to the school grade a reader needs to have reached to understand the text. For example, a Grade level score of 8 means that an eighth grader should be able to understand the text.

(.39 * average_words_sentence) + (11.8 * average_syllables_word) - 15.59
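The same made-up averages can be run through the grade level formula above. Again, the helper name grade_level_from_averages is hypothetical and the inputs are illustrative only.

```php
<?php
// Hypothetical helper: apply the grade level formula to known averages.
function grade_level_from_averages(float $wordsPerSentence, float $syllablesPerWord): float
{
    return (0.39 * $wordsPerSentence) + (11.8 * $syllablesPerWord) - 15.59;
}

// 15 words per sentence, 1.5 syllables per word:
// 5.85 + 17.7 - 15.59 = 7.96
echo round(grade_level_from_averages(15, 1.5), 2), "\n";

// 25 words per sentence, 2 syllables per word:
// 9.75 + 23.6 - 15.59 = 17.76
echo round(grade_level_from_averages(25, 2.0), 2), "\n";
```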

The function you will need to use to work this score out (in addition to the three at the bottom of this page) is:

function calculate_flesch_grade($text) { return ((.39 * average_words_sentence($text)) + (11.8 * average_syllables_word($text)) - 15.59); }

And you can call the function like so:

$flesch_grade = calculate_flesch_grade($text);

Both of the functions above make use of the functions below, so these will need to be included in your scripts in order for either function to be used.

Neither score is perfectly accurate. It is not always possible to work out the number of syllables in a word programmatically, nor to correctly calculate the number of sentences, or the number of words per sentence, in a piece of text. However, each function will return a close approximation of the value - certainly good enough for our purposes.

Ideally, you should aim for a reading ease of around 60 to 70 (equivalent to a Grade level of around 6 to 8). The nearer 100 your text scores, the easier it is to read (and conversely, the lower the grade score, the easier the text is to read). Comics, for example, are usually in the 90s. The Harvard Law Review scores in the low 30s. Legal documents are usually lucky to make it into double figures.

The functions you will need in order to calculate the Flesch-Kincaid reading ease or Grade level of text are:

function average_words_sentence($text) {
    // Count sentence terminators; treat text with none as one sentence.
    $sentences = max(1, strlen(preg_replace('/[^\.!?]/', '', $text)));
    // Split on runs of whitespace so extra spaces are not counted as words.
    $words = count(preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY));
    return ($words / $sentences);
}

function average_syllables_word($text) {
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) == 0) {
        return 0;
    }
    $syllables = 0;
    foreach ($words as $word) {
        $syllables += count_syllables($word);
    }
    return ($syllables / count($words));
}

function count_syllables($word) {
    // Based on Greg Fast's Perl module Lingua::EN::Syllables
    // Patterns that look like an extra vowel group but are not a syllable.
    $subsyl = Array('cial', 'tia', 'cius', 'cious', 'giu', 'ion', 'iou', 'sia$', '.ely$');
    // Patterns that hide a syllable the vowel group count would miss.
    $addsyl = Array('ia', 'riet', 'dien', 'iu', 'io', 'ii', '[aeiouym]bl$', '[aeiou]{3}', '^mc', 'ism$', '([^aeiouy])\1l$', '[^l]lien', '^coa[dglx].', '[^gq]ua[^auieo]', 'dnt$');

    $word = preg_replace('/[^a-z]/is', '', strtolower($word));
    // Each run of vowels is a candidate syllable.
    $valid_word_parts = preg_split('/[^aeiouy]+/', $word, -1, PREG_SPLIT_NO_EMPTY);

    $syllables = 0;
    // Thanks to Joe Kovar for correcting a bug in the following lines
    foreach ($subsyl as $syl) {
        $syllables -= preg_match('~'.$syl.'~', $word);
    }
    foreach ($addsyl as $syl) {
        $syllables += preg_match('~'.$syl.'~', $word);
    }
    if (strlen($word) == 1) {
        $syllables++;
    }
    $syllables += count($valid_word_parts);
    $syllables = ($syllables == 0) ? 1 : $syllables;
    return $syllables;
}

Examples

The following are two examples of text and the readability of that text.

The first is an excerpt from The Wind in the Willows (http://www.online-literature.com/grahame/windwillows/). It is what most people would call easy to read:

"There's Toad Hall," said the Rat; "and that creek on the left, where the notice-board says, 'Private. No landing allowed,' leads to his boat-house, where we'll leave the boat. The stables are over there to the right. That's the banqueting-hall you're looking at now - very old, that is. Toad is rather rich, you know, and this is really one of the nicest houses in these parts, though we never admit as much to Toad."

For reading ease, this scored 69. It also had a Grade Level of 7. This particular passage of Wind in the Willows scores at almost exactly the same level web page text should ideally score.

On the other hand, the following is an excerpt from a legal document, and would give many a headache (both this text and the above were generously provided by Bill Slawski, http://members.dca.net/slawski, by the way):

The foregoing warranties by each party are in lieu of all other warranties, express or implied, with respect to this agreement, including but not limited to implied warranties of merchantability and fitness for a particular purpose. Neither party shall have any liability whatsoever for any cover or setoff nor for any indirect, consequential, exemplary, incidental or punitive damages, including lost profits, even if such party has been advised of the possibility of such damages.

This scores an incredible -1 on the reading ease scale. The Grade Level required to read it? 22. This is about as unreadable as text on a web page can get.

These are, perhaps, extreme examples, but they should give an idea of the differences between good and bad text on a web page.



Wed, 07 Jul 2004 14:17:00 +0100 http://www.addedbytes.com/blog/code/flesch-kincaid-function/ Dave Child
Gunning-Fog Index
http://www.addedbytes.com/blog/code/gunning-fog-function/

PLEASE NOTE: This code is now considered out of date. An updated version has been released under an open source license as a Google Code project: php-text-statistics. There is more about this change in the post Readability Code Open Sourced.

A tool for checking the readability scores of text is available - this article covers the functions behind that tool.

The Gunning-Fog index is a measure of text readability. It represents the approximate reading age of the text - the age someone will need to be to understand what they are reading.

The following is the algorithm to determine the Gunning-Fog index:

(average_words_sentence + percentage_of_words_with_three_or_more_syllables) * 0.4

The above produces a number, which is a rough measure of the age someone must be to understand the content. The lower the number, the more understandable the content will be to your visitors. Web sites should aim to have content that falls roughly in the 11-15 range for this test.

Any number returned over the value of 22 can be taken to be just 22, and is roughly equivalent to post-graduate level.
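A short worked example of the calculation and the cap at 22 described above. The helper name gunning_fog_from_averages is hypothetical, the input figures are made up, and "complex" here means words of three or more syllables.

```php
<?php
// Hypothetical helper: apply the Gunning-Fog formula to known figures,
// then cap the result at 22 as described in the text.
function gunning_fog_from_averages(float $wordsPerSentence, float $percentComplexWords): float
{
    $score = ($wordsPerSentence + $percentComplexWords) * 0.4;
    return min(22.0, $score);
}

// 15 words per sentence, 10% complex words: (15 + 10) * 0.4 = 10
echo gunning_fog_from_averages(15, 10), "\n";

// 40 words per sentence, 30% complex words: (40 + 30) * 0.4 = 28, capped to 22
echo gunning_fog_from_averages(40, 30), "\n";
```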

Below is a selection of functions you can use to determine the Gunning-Fog index of text. To calculate it, all you need to do is call the function as follows, where $text is the text whose readability you wish to measure.

$gunning_fog_score = gunning_fog_score($text);

function gunning_fog_score($text) {
    return ((average_words_sentence($text) + percentage_number_words_three_syllables($text)) * 0.4);
}

function average_words_sentence($text) {
    // Count sentence terminators; treat text with none as one sentence.
    $sentences = max(1, strlen(preg_replace('/[^\.!?]/', '', $text)));
    // Split on runs of whitespace so extra spaces are not counted as words.
    $words = count(preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY));
    return ($words / $sentences);
}

function percentage_number_words_three_syllables($text) {
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) == 0) {
        return 0;
    }
    $complex = 0;
    foreach ($words as $word) {
        // A word counts as complex when it has three or more syllables.
        if (count_syllables($word) > 2) {
            $complex++;
        }
    }
    // Rounded to a whole percentage, as the original number_format() call did.
    return round(($complex / count($words)) * 100);
}

function count_syllables($word) {
    // Based on Greg Fast's Perl module Lingua::EN::Syllables
    // Patterns that look like an extra vowel group but are not a syllable.
    $subsyl = Array('cial', 'tia', 'cius', 'cious', 'giu', 'ion', 'iou', 'sia$', '.ely$');
    // Patterns that hide a syllable the vowel group count would miss.
    $addsyl = Array('ia', 'riet', 'dien', 'iu', 'io', 'ii', '[aeiouym]bl$', '[aeiou]{3}', '^mc', 'ism$', '([^aeiouy])\1l$', '[^l]lien', '^coa[dglx].', '[^gq]ua[^auieo]', 'dnt$');

    $word = preg_replace('/[^a-z]/is', '', strtolower($word));
    // Each run of vowels is a candidate syllable.
    $valid_word_parts = preg_split('/[^aeiouy]+/', $word, -1, PREG_SPLIT_NO_EMPTY);

    $syllables = 0;
    // Thanks to Joe Kovar for correcting a bug in the following lines
    foreach ($subsyl as $syl) {
        $syllables -= preg_match('~'.$syl.'~', $word);
    }
    foreach ($addsyl as $syl) {
        $syllables += preg_match('~'.$syl.'~', $word);
    }
    if (strlen($word) == 1) {
        $syllables++;
    }
    $syllables += count($valid_word_parts);
    $syllables = ($syllables == 0) ? 1 : $syllables;
    return $syllables;
}

Tue, 06 Jul 2004 11:41:35 +0100 http://www.addedbytes.com/blog/code/gunning-fog-function/ Dave Child