+1 vote
1.8k views
Yes, the word count might vary depending on exactly how a "word" is defined.  I did a little testing using the following 6 tools:
* Microsoft Word
* wordcounter.net
* wordcountertool.com
* wordcounttool.com
* wordcounttools.com
* Microsoft Excel

The count results from these tools on the examples are, respectively:
    X X X X X X X  => 7, 7, 7, 7, 7, 7 => (no variation)
    --------------  => 1, 0, 1, 1, 0, 1 => (0 or 1)
    fri3nd  => 1, 1, 1, 2, 1, 1 => (1 or 2)
    1.0  => 1, 2, 1, 0, 1, 1 => (0, 1 or 2)
    100  => 1, 1, 1, 0, 1, 1 => (0 or 1)
    5/5  => 1, 2, 1, 0, 2, 1 => (0, 1, or 2)
    foo-bar  => 1, 1, 1, 1, 1, 1 => (no variation)
    friend(name)  => 1, 1, 1, 2, 2, 1 => (1 or 2)
    måste  => 1, 1, 1, 2, 1, 1 => (1 or 2)

Total counts for the 9 examples together  => 15, 16, 15, 15, 16, 15 => (15 or 16)

Could you advise how Project-GC has counted the above using the previous algorithm and the new algorithm?  It will be useful to understand the rules used.

At the end of the day, it probably doesn't matter a great deal as long as it is consistently applied.  However, it is useful for the Project-GC algorithm to be compatible with that used for Kyle's BadgeGen.  Do you know how the old and new algorithms compare with BadgeGen's?
in Support and help by huhugrub (270 points)
I have no idea what algorithm PGC uses, but I can look at the BadgeGen code, and it is:
BEGINSUB name=LogWordCount
    SHOWSTATUS msg="Counting Average Log Length" top=10 Left=10
    $_sql="Select lText from LogsAll where g_foundlog(ltype) AND lIsowner"
    $Data=Sqlite("sql",$_sql)
    $Words=RegExCount(".\b",$Data)/2
    $_sql="select count(*) from logsall where lisowner and g_foundlog(ltype)"
    $Count=Val(Sqlite("sql",$_sql))
    $Writ=Round($Words/$Count,3)
ENDSUB #name=LogWordCount

The interesting line is "$Words=RegExCount(".\b",$Data)/2", which uses a regular expression that matches a word break. It uses PCRE as its regexp engine, and \b matches a position where a \w is followed by a \W, or the other way round. Every word has a break at both ends, which is presumably why the count is divided by 2.
From the man page:
\w= Match a "word" character (alphanumeric plus "_", plus other connector punctuation chars plus Unicode marks)
\W= Match a non-"word" character

Without reading the code I don't know exactly what "other connector punctuation chars" covers, but from the test below it is obvious that . / - ( are not word characters. A test shows that "doesn't" counts as two words, so ' is not a word character either.

Result from GSAK with the BadgeGen code:
X X X X X X X = 7
-------------- = 0
fri3nd = 1
1.0 = 2
100 = 1
5/5 = 2
foo-bar = 2
friend(name) = 2
måste = 1
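
For anyone who wants to try this outside GSAK, here is a minimal Python sketch of the same idea. It is only an approximation under a few assumptions: it uses Python's re module instead of PCRE, it counts bare \b positions rather than the macro's ".\b" pattern, and the function name count_words_boundaries is made up for this example.

```python
import re


def count_words_boundaries(text: str) -> int:
    """Approximate the BadgeGen count: number of word boundaries divided by 2.

    Assumption: Python's Unicode-aware \\b stands in for PCRE's \\b, and a
    bare \\b is counted instead of the macro's ".\\b" pattern.
    """
    # Each run of word characters has one boundary at its start and one at
    # its end, so the number of boundaries is always even.
    boundaries = len(re.findall(r"\b", text))
    return boundaries // 2


if __name__ == "__main__":
    samples = ["X X X X X X X", "--------------", "fri3nd", "1.0", "100",
               "5/5", "foo-bar", "friend(name)", "måste"]
    for sample in samples:
        print(sample, "=", count_words_boundaries(sample))
```

On these nine samples the sketch should give the same counts as the GSAK run listed above, but that does not prove it matches BadgeGen on every log text.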

But one example of why the BadgeGen code is inappropriate is the old log edit timestamp that Groundspeak has now stopped using. The line below is counted as 16 words in BadgeGen. My opinion is that this should not be included in the log length:
This entry was edited by Target. on Sunday, 22 June 2014 at 15:37:49 UTC.

Another is that a time like the one I write for an FTF, e.g. (15:30), is two words, and a caching nick with special characters in it will be two words.

Soon-to-be-obsolete HTML and BB code formatting is also counted as words. I have seen logs with TFTC with each letter in a different colour, and the syntax is:
[color=blue]T[/color]
Each letter will be 4 words, and the differently coloured TFTC will be 16 words. I suspect most people will agree that that is not appropriate. Any text formatting will result in a minimum of 2 words.


My guess at a better implementation is to require that there be a space or a newline between words. Doing that by removing all characters that are not word characters, spaces or newlines will probably give a more reasonable result.
With that rule, every entry counted as 2 in the list starting with "X X X X X X X" will be counted as 1,
the edit line will be 14, since the time 15:37:49 becomes one word, and the coloured TFTC will be one word.
If BB and HTML tags are allowed to contain spaces, they should probably be removed from the calculation.
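
The rule proposed above can be sketched in a few lines of Python. This is just my reading of the proposal, not what PGC or BadgeGen actually does: the function name count_words_whitespace is hypothetical, and \s keeps tabs and other whitespace as well as spaces and newlines.

```python
import re


def count_words_whitespace(text: str) -> int:
    """Count words that are separated by whitespace only.

    Assumption: every character that is neither a word character (\\w) nor
    whitespace (\\s) is deleted first, so "15:37:49" collapses to "153749".
    """
    # Delete punctuation and other non-word characters without inserting
    # a space, so the pieces on either side are joined into one word.
    cleaned = re.sub(r"[^\w\s]", "", text)
    # str.split() with no argument splits on any run of whitespace and
    # ignores leading and trailing whitespace.
    return len(cleaned.split())


if __name__ == "__main__":
    edit_line = ("This entry was edited by Target. on Sunday, "
                 "22 June 2014 at 15:37:49 UTC.")
    print(count_words_whitespace(edit_line))
    print(count_words_whitespace("foo-bar"))
```

Note that a tag written without spaces is merged into the neighbouring text rather than dropped, which is why a coloured TFTC ends up as a single word and not zero words.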

I tried removing all :,*()[] from my logs and calculating the average log length, and got a reduction of 2 words in the average, but the runtime increased from 2.7 to 4.7 s:
runtime   #words   avg log length
4.734 s   704067   130.455   (characters removed)
2.741 s   715140   132.507   (original)
I tested BadgeGen with the downloadable finds GPX from PGC and ran BadgeGen on it. I have not edited or found a cache since it was generated, nor for a while before, so it will be the same data as in the stats.
The result is:
BadgeGen 133.471
PGC 128
BadgeGen with []:(), removed 131.891
BadgeGen with edit line removed 132.514
BadgeGen with []:(), and edit line removed 131.054
BadgeGen with []:(), numbers and edit line removed 126.739

I have probably missed some obvious thing that PGC removes in addition to my []:(), set.
But I would not say that the PGC result is unreasonable when my fast method reduced the word count by 2.45 words per log and PGC has removed 5.47 words per log.

1 Answer

+2 votes

We are not planning to publish how we calculate words, since there are so many users who abuse the system just to get their word count up. Write real logs and they will be counted fairly. It will also save us the development time needed to counter these things.

We do not know how BadgeGen calculates them; in fact I believe it's GSAK that does it and not BadgeGen.

EDIT: I can see from Target's post that it is BadgeGen that calculates the words.

by magma1447 (Admin) (243k points)
Agreed with the sentiment here.  Perhaps it's time to exclude this statistic altogether from PGC's implementation of BadgeGen badges.  The original intention of this metric was to discourage unduly terse logs.  On the other hand, this metric has at worst led to abuse, and has also led to much more verbose logs that favour quantity over quality.
...