Log similarity

From Project-GC
Latest revision as of 14:08, 21 November 2020

Log similarity is part of the Some numbers module of the Finds tab in the Profile Stats. It indicates how much a user's logs resemble each other: a high score means the logs contain a lot of repeated content, while a low score means they are more diverse.

How does it work?

The score describes the content of the logs, not the number of logs that match. A value of 57% does not mean that 57% of the user's logs are identical; it means that each log is, on average, 57% similar in content to the log before it.

The code is not aware of language or words; it is based on characters only. If the user writes "ab" in one log and "cd" in the next, those two logs have 0% similarity. Logging "ab", "cd" and "cd" in a row gives a log similarity of 50%: 0% for "ab" versus "cd" and 100% for "cd" versus "cd", so (0+100)/2=50. Posting another log of "ab" after that results in 33%, since (0+100+0)/3≈33. Because the code is not based on words, it does not matter that some words are very common in the user's language; every language has characters that are more common than others. As an example, the words "Rabbit" and "Tea" have a similarity of 22%.
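The arithmetic above can be reproduced with a short sketch. It is written in Python rather than PHP, so it is a port of the documented behaviour of similar_text (find the longest common substring, then repeat on the pieces to its left and right), not the site's actual code. The helper names similar_text, similarity_percent and log_similarity are made up for this illustration, and averaging over consecutive pairs is an assumption taken from the description above.

```python
def similar_text(a: str, b: str) -> int:
    """Count matching characters the way PHP's similar_text is documented to.

    Note: PHP compares bytes, this port compares Python characters, so the
    results can differ for non-ASCII text.
    """
    if not a or not b:
        return 0
    best_len = best_a = best_b = 0
    # Find the first longest common substring of a and b.
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > best_len:
                best_len, best_a, best_b = k, i, j
    if best_len == 0:
        return 0
    # Recurse on the text left and right of the common substring.
    left = similar_text(a[:best_a], b[:best_b])
    right = similar_text(a[best_a + best_len:], b[best_b + best_len:])
    return best_len + left + right


def similarity_percent(a: str, b: str) -> float:
    """Similarity of two texts in percent: matches * 2 / combined length."""
    total = len(a) + len(b)
    if total == 0:
        return 0.0
    return similar_text(a, b) * 2 * 100 / total


def log_similarity(logs: list[str]) -> float:
    """Average similarity of each log to the log before it (assumed scheme)."""
    if len(logs) < 2:
        return 0.0  # assumption: no pairs means nothing to compare
    pairs = zip(logs, logs[1:])
    scores = [similarity_percent(prev, cur) for prev, cur in pairs]
    return sum(scores) / len(scores)


print(log_similarity(["ab", "cd", "cd"]))
print(log_similarity(["ab", "cd", "cd", "ab"]))
print(similarity_percent("Rabbit", "Tea"))
```

The three printed values correspond to the 50%, 33% and 22% figures in the text. In the last case the only shared character is the lowercase "a", since the comparison is case-sensitive.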

What is considered low log similarity?

If a user's number is lower than most others', that user is doing a better job of writing varied logs than the majority. That being said, anything below 50% is fairly low. It is definitely possible to achieve a high word count and a low log similarity at the same time. In theory one could reach 0%, but that would require aiming for it; it is not likely to happen by chance.

Math and programming

The exact math can be learned by studying the documentation and source code. The algorithm relies on PHP's open source function similar_text (https://www.php.net/manual/en/function.similar-text.php).

In an ideal world the Levenshtein algorithm (https://www.php.net/manual/en/function.levenshtein.php) would be used, and at first it actually was. However, it is much slower and scales worse, so with long logs it is simply too slow. The similar_text function scales more linearly.
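For comparison, here is a minimal Python sketch of a Levenshtein-based similarity. The page does not say how the edit distance was turned into a percentage when Levenshtein was in use, so dividing by the length of the longer text is purely an assumption, as are the function names.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter text in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Percentage similarity; normalising by the longer text is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return (1 - levenshtein(a, b) / longest) * 100


print(levenshtein_similarity("ab", "cd"))
print(levenshtein_similarity("cd", "cd"))
```

With this normalisation the "ab"/"cd" and "cd"/"cd" pairs from the earlier example still come out at 0% and 100%, but other pairs would generally score differently than with similar_text.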