Return to Project-GC

Question

What are the revised rules for word length definition? Are they consistent with the ones used by Kyle's BadgeGen?

1.9k views

Yes, the word count can vary depending on exactly how a "word" is defined. I did a little testing using the following 6 tools:
* Microsoft Word
* wordcounter.net
* wordcountertool.com
* wordcounttool.com
* wordcounttools.com
* Microsoft Excel

The count results from these tools on the examples are, respectively:
    X X X X X X X => 7, 7, 7, 7, 7, 7 => (no variation)
    -------------- => 1, 0, 1, 1, 0, 1 => (0 or 1)
    fri3nd => 1, 1, 1, 2, 1, 1 => (1 or 2)
    1.0 => 1, 2, 1, 0, 1, 1 => (0, 1 or 2)
    100 => 1, 1, 1, 0, 1, 1 => (0 or 1)
    5/5 => 1, 2, 1, 0, 2, 1 => (0, 1, or 2)
    foo-bar => 1, 1, 1, 1, 1, 1 => (no variation)
    friend(name) => 1, 1, 1, 2, 2, 1 => (1 or 2)
    måste => 1, 1, 1, 2, 1, 1 => (1 or 2)

Total counts for the 9 examples together => 15, 16, 15, 15, 16, 15 => (15 or 16)

Could you advise how Project-GC has counted the above using the previous algorithm and the new algorithm? It will be useful to understand the rules used.

At the end of the day, it probably doesn't matter a great deal as long as it is consistently applied. However, it is useful for the Project-GC algorithm to be compatible with that used for Kyle's BadgeGen. Do you know how the old and new algorithms compare with BadgeGen's?

related to an answer for: "Log Length, words: Total words" numbers have gone down. Why?

asked Jan 30, 2016 in Support and help by huhugrub (270 points)

I have no idea what algorithm PGC uses, but I can look at the BadgeGen code, and it is:
BEGINSUB name=LogWordCount
SHOWSTATUS msg="Counting Average Log Length" top=10 Left=10
$_sql="Select lText from LogsAll where g_foundlog(ltype) AND lIsowner"
$Data=Sqlite("sql",$_sql)
$Words=RegExCount(".\b",$Data)/2
$_sql="select count(*) from logsall where lisowner and g_foundlog(ltype)"
$Count=Val(Sqlite("sql",$_sql))
$Writ=Round($Words/$Count,3)
ENDSUB #name=LogWordCount

The interesting line is "$Words=RegExCount(".\b",$Data)/2", which uses a regular expression that matches a word break. It uses PCRE as its regexp engine, and \b matches at the position between a \w and a \W character, in either order. Each word has a boundary at its start and another at its end, hence the division by 2.
From the man page:
\w= Match a "word" character (alphanumeric plus "_", plus other connector punctuation chars plus Unicode marks)
\W= Match a non-"word" character

Without reading the code I don't know exactly what the "other connector punctuation chars" are, but from the test below it is obvious that . / - ( are not word characters. A test shows that "doesn't" counts as two words, so ' is not a word character either.

Results from GSAK with the BadgeGen code:
X X X X X X X = 7
-------------- = 0
fri3nd = 1
1.0 = 2
100 = 1
5/5 = 2
foo-bar = 2
friend(name) = 2
måste = 1

One example of why the BadgeGen code is inappropriate is the old log edit timestamp, which Groundspeak has now stopped using. The line below is counted as 16 words in BadgeGen. In my opinion it should not be included in the log length:
This entry was edited by Target. on Sunday, 22 June 2014 at 15:37:49 UTC.

Another is that a time such as the one I write for an FTF, like (15:30), counts as two words, and a caching nickname with special characters in it will also count as two words.

Soon-to-be-obsolete HTML and BB code formatting is also counted as words. I have seen logs with TFTC where each letter is in a different colour, and the syntax is:
[color=blue]T[/color]
Each letter will be 4 words, and the multicoloured TFTC will be 16 words. I suspect most people will agree that this is not appropriate. Any text formatting adds a minimum of 2 words.
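
To see where the extra words come from, here is a small Python sketch that lists the runs of word characters that a \b-based count treats as separate words. It is an illustration only: it assumes Python's \w behaves like PCRE's for these strings, and word_runs is a hypothetical helper name.

```python
import re

def word_runs(text):
    # Each maximal run of \w characters is one "word" under the \b rule.
    return re.findall(r"\w+", text)

edit_line = ("This entry was edited by Target. on Sunday, "
             "22 June 2014 at 15:37:49 UTC.")
one_letter = "[color=blue]T[/color]"

print(len(word_runs(edit_line)))  # 16
print(word_runs("(15:30)"))       # ['15', '30']
print(word_runs(one_letter))      # ['color', 'blue', 'T', 'color']
```

The tag names and the colour value are counted just like ordinary text, which is why one visible letter ends up as 4 words.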

My guess at a better implementation is to require a space or a newline between words. Doing that by removing all characters that are not word characters, spaces, or newlines will probably give a more reasonable result.
With that rule, every entry counted as 2 in the list starting with "X X X X X X X" becomes 1,
the edit line becomes 14 because the time 15:37:49 is one word, and the coloured TFTC becomes one word.
If valid BB and HTML tags can have spaces in them, they should probably be removed from the calculation.
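
The proposed rule can be sketched in a few lines of Python. This is only one reading of the idea, not code from BadgeGen or Project-GC: spaced_word_count is a hypothetical name, and the sketch assumes that "word chars" means whatever Python's \w matches.

```python
import re

def spaced_word_count(text):
    # Drop everything except word characters, spaces and newlines,
    # then count the chunks that are left between whitespace.
    cleaned = re.sub(r"[^\w \n]", "", text)
    return len(cleaned.split())

edit_line = ("This entry was edited by Target. on Sunday, "
             "22 June 2014 at 15:37:49 UTC.")

for sample in ["1.0", "5/5", "foo-bar", "friend(name)",
               "[color=blue]T[/color]", edit_line]:
    print(f"{sample} = {spaced_word_count(sample)}")
```

Because punctuation is deleted rather than replaced by a space, "foo-bar" collapses to one chunk and the edit line comes out as 14. A tag that contains a space would still be split into several words, which is the case the last remark above is about.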

I tried removing all :,*()[] characters from my logs before calculating the average log length. The average dropped by about 2 words, but the runtime increased from 2.7 s to 4.7 s:
runtime   #words   avg log length
4.734 s   704067   130.455   (characters removed)
2.741 s   715140   132.507   (original)

commented Jan 30, 2016 by Target. (Expert) (104k points)

1 Answer

Answer 1 · 2016-02-01T01:47:24+0000

We are not planning to publish how we calculate words, since there are so many users who abuse the system just to get their word count up. Write real logs and they will be counted fairly. Not publishing it also saves us development time that would otherwise go into countering these things.

We do not know how BadgeGen calculates them. In fact, I believe it's GSAK that does it and not BadgeGen.

EDIT: I can see from Target.'s post that it is BadgeGen that calculates the words.
