The GolosSQL database has been updated with a new set of features, including language detection.
1. New database schema
A new Comments table has been added.
This table is populated with data that were previously stored in TxComments.
A new record is created in TxComments each time a post or comment is created or edited, and the correlated data have to be updated each time a transaction related to the post is generated. As a result, the field contents in TxComments were redundant and hard to manage.
The new design makes data retrieval much simpler.
2. Language detection
Language detection has been added to the Database Injector Process.
Each time a post or comment is created, the body content is analyzed to determine which language(s) it contains.
The result is stored in JSON format in the body_language field of the Comments table.
As a post or comment can contain multiple languages, the result is an array.
Each object contains the following values:
- language code
- confidence score
- isReliable - true/false
If the language cannot be determined (e.g. a post containing only pictures), the array is left empty.
A post with several languages will have its body_language field set to something like:
[{"language":"ru","isReliable":true,"confidence":8.22},
{"language":"en","isReliable":false,"confidence":5.12}]
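As a rough illustration of how a client might read this field, the Python sketch below parses the sample value above into small records. The names LanguageGuess and parse_body_language are hypothetical and not part of GolosSQL. The sketch also assumes the field reaches the client as a JSON string, and it treats a missing or empty value the same way as an empty array.

```python
import json
from typing import List, NamedTuple, Optional


class LanguageGuess(NamedTuple):
    """One entry of the body_language array (hypothetical helper type)."""
    language: str
    is_reliable: bool
    confidence: float


def parse_body_language(raw: Optional[str]) -> List[LanguageGuess]:
    """Turn the body_language JSON text into a list of LanguageGuess records."""
    if not raw:
        # None or empty string: handled like an empty array (assumption).
        return []
    return [
        LanguageGuess(
            language=item["language"],
            is_reliable=bool(item["isReliable"]),
            confidence=float(item["confidence"]),
        )
        for item in json.loads(raw)
    ]


# Sample value taken from the post.
sample = (
    '[{"language":"ru","isReliable":true,"confidence":8.22},'
    '{"language":"en","isReliable":false,"confidence":5.12}]'
)
guesses = parse_body_language(sample)
for guess in guesses:
    print(guess.language, guess.is_reliable, guess.confidence)
```

An empty array, as stored for posts whose language cannot be determined, simply yields an empty list.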
The confidence value is related to how much text the post contains: the more text analyzed, the better the language analysis and the higher the confidence value. Confidence is not a ratio and can be higher than 100.
If the post contains words in different languages, isReliable will be set to true to identify the most probable language, even if its confidence value is lower.
If there is only one language and isReliable is set to false, this indicates that the confidence is too low.
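The rules above suggest one way to pick a single language for a post. The Python sketch below is an illustration only: primary_language is a hypothetical helper, not something GolosSQL provides. It trusts the isReliable flag first. If several entries happen to be flagged, a case the rules above do not cover, it falls back to the highest confidence among them, which is an assumption.

```python
import json
from typing import Optional


def primary_language(raw: Optional[str]) -> Optional[str]:
    """Return the most probable language code, or None if none can be trusted."""
    entries = json.loads(raw) if raw else []
    # Keep only the entries the detector flagged as reliable.
    reliable = [entry for entry in entries if entry.get("isReliable")]
    if not reliable:
        # Empty array, or every entry has isReliable set to false
        # (confidence too low).
        return None
    # Assumption: with several reliable entries, prefer the highest confidence.
    return max(reliable, key=lambda entry: entry["confidence"])["language"]


mixed = (
    '[{"language":"en","isReliable":true,"confidence":3.0},'
    '{"language":"fr","isReliable":false,"confidence":9.5}]'
)
print(primary_language(mixed))
```

Note that the flagged language is chosen even though the other entry has a higher confidence value, which matches the behaviour described for posts mixing several languages.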
Be aware that the language detector works using probabilities and is sometimes inaccurate with very short texts. The same happens when the different languages used in a post have similar words.
The language detector can also be tricked when the content of a post contains lots of "technical noise" such as pictures, source code, edit tags, etc.
I hope this will help non-Russian communities to better identify posts in their respective languages.
Support
If you need help or have any comments or requests, please use the GolosSQL channel on chat.golos.io.
Thanks for reading!
Vote for me as a delegate
You can also vote directly from the Golos platform here. To do so, do the following at the bottom of the page. Please do it! Every vote counts. Thank you!
If you liked this post, don't forget to vote, follow me, or share it.