Create Scribe-Data Wikidata based translation

Wikidata is a free, multilingual knowledge base that provides structured data on words and phrases across many languages. It offers definitions, translations, and usage examples via its SPARQL endpoint and API. As an open-source platform, Wikidata allows global contributions, ensuring up-to-date content.

For contextual understanding and multilingual support, Wikidata-based translation enables systems to provide translations alongside the original text, serving as an educational tool. By using Wikidata's connected data, this approach provides more accurate and culturally aware translations, focusing on meaning and cultural context rather than just word-for-word translations.

To structure its data, Wikidata uses unique identifiers:

  • QIDs (Q-numbers): Unique IDs for items representing concepts, people, places, or things.

  • LIDs (Lexeme IDs): Identifiers for lexemes (words or phrases) capturing linguistic properties.

  • PIDs (Property IDs): Define relationships between items, such as attributes or characteristics.

These identifiers standardize data connections across Wikidata, making them essential for Scribe-Data integration. For more details, refer to WIKIDATAGUIDE.md.

To further enhance its capabilities, the CLI will support downloading Wikidata Lexeme dumps, allowing Scribe-Data to process lexical data efficiently, improving multilingual support and enriching data integration.

Process: the lexeme dump is validated by the job downloader (latest or a specific date), then parsed for translations, data types, and forms before being used in Scribe-Data.

Job Downloader

The CLI will support downloading the latest .json.bz2 dump when the user passes the arguments below:

# Download the latest .json.bz2 dump.
Scribe-data download --wikidata-dump

# Retrieve a specific dump by date (YYYYMMDD).
Scribe-data download --wikidata-dump YYYYMMDD

# Save the dump to a specified directory.
Scribe-data download --wikidata-dump --output-dir DIRECTORY_PATH

For user simplicity, if the user gives a date for which no dump exists, the CLI suggests the closest available older dump. It also checks whether the dated lexeme folder actually contains a .json.bz2 file; in this case 20241030 is included and 20241122 is not.

The CLI attempts to download a Wikidata lexeme dump but encounters a 404 error; it suggests the closest available dump (2024-10-30) and begins downloading it.
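
A minimal sketch of that fallback logic, assuming the dated dump folders are available locally (or have been mirrored from the dump server); the function names and folder layout are illustrative, not the actual Scribe-Data implementation:

from datetime import datetime
from pathlib import Path

def has_lexeme_file(dump_dir: Path) -> bool:
    """Check whether a dated dump folder actually contains a .json.bz2 file."""
    return any(dump_dir.glob("*.json.bz2"))

def closest_available_dump(requested: str, dump_root: Path) -> str | None:
    """Return the nearest older dump date whose folder holds a .json.bz2 file."""
    target = datetime.strptime(requested, "%Y%m%d")
    candidates = []
    for folder in dump_root.iterdir():
        try:
            folder_date = datetime.strptime(folder.name, "%Y%m%d")
        except ValueError:
            continue  # skip folders that are not dated dump directories
        if folder_date <= target and has_lexeme_file(folder):
            candidates.append(folder.name)
    return max(candidates, default=None)  # e.g. suggests 20241030, skips 20241122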

Parsing Dump

The lexeme dump organizes lexical data into specific forms and their grammatical contexts, along with corresponding translations or glosses for multilingual understanding.

Line Structure

Attributes:

  • id: Unique identifier for the lexeme (e.g., L4).

  • lemmas: Object with language codes as keys (e.g., "en") and their corresponding lemma (base word) as values. "lemmas":{"en":{"language":"en","value":"windsurf"}}

  • lexicalCategory: Identifier for the part of speech as a QID (e.g., "lexicalCategory":"Q24905").

  • language: Identifier for the language of the lexeme (e.g., "language":"Q1860").

Forms

  • Key: "forms"

  • Attributes for each form:

    • id: Unique identifier for the form (e.g., L4-F1).

    • representations: Object containing language codes as keys (e.g., "en") and corresponding text values. "representations":{"en":{"language":"en","value":"windsurfed"}}

    • grammaticalFeatures: Array of IDs representing grammatical features (e.g., tense, mood, gender). "grammaticalFeatures":["Q1230649"] (past participle in English)

Translations (Senses)

  • Key: "senses"

  • Attributes for each sense:

    • id: Unique identifier for the sense (e.g., L4-S1).

    • glosses: Object with language codes as keys and their definitions or translations as values.

    • claims: Metadata linking the sense to other entities or senses (e.g., related items, references).
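
Putting these pieces together, a simplified lexeme line combining the attributes above might look like the following (the gloss text is an illustrative placeholder):

{
  "id": "L4",
  "language": "Q1860",
  "lexicalCategory": "Q24905",
  "lemmas": {"en": {"language": "en", "value": "windsurf"}},
  "forms": [
    {
      "id": "L4-F1",
      "representations": {"en": {"language": "en", "value": "windsurfed"}},
      "grammaticalFeatures": ["Q1230649"]
    }
  ],
  "senses": [
    {
      "id": "L4-S1",
      "glosses": {"en": {"language": "en", "value": "to engage in the sport of windsurfing"}},
      "claims": {}
    }
  ]
}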

Method

We use the libbzip2 compression library to read the compressed lexeme dump file, which is approximately 380 MB and contains around 1,344,702 JSON entries. To process the file efficiently, we use orjson and handle the data in batches (default batch size: 50,000 lines).
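
A minimal sketch of that batched reading, assuming the dump stores one JSON entity per line inside a top-level array wrapper (the function name and details are illustrative, not the exact Scribe-Data internals):

import bz2
import orjson

BATCH_SIZE = 50_000  # default batch size mentioned above

def iter_lexeme_batches(dump_path: str, batch_size: int = BATCH_SIZE):
    """Stream the .json.bz2 lexeme dump and yield lists of parsed lexeme entries."""
    batch = []
    with bz2.open(dump_path, "rt", encoding="utf-8") as dump_file:
        for line in dump_file:
            line = line.strip().rstrip(",")  # one JSON object per line, comma-terminated
            if not line or line in ("[", "]"):
                continue  # skip the array brackets wrapping the dump
            batch.append(orjson.loads(line))
            if len(batch) >= batch_size:
                yield batch
                batch = []
    if batch:
        yield batch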

The CLI checks for a local dump if the user specifies it. While this will not provide the most up-to-date data, it allows users to parse the dump locally without an internet connection, providing offline functionality.

Translations

# Fetch translations for Bengali from the specified dump path and export the data as a JSON file.
Scribe-data get -l bengali -dt translations -wdp dump_path -od exported_json

This adds the functionality to get translations for Bengali.

The CLI parses the Bengali lexeme dump, processing 1.3M+ entries, and exports translations to exported_json/bengali/lexeme_translations.json.

We retrieve the key "senses" along with its unique identifier id, and filter the data based on the user-specified language and data type.
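
A rough sketch of that filtering for a single lexeme entry (the function name and return shape are illustrative):

def extract_translations(lexeme: dict, target_lang: str = "bn") -> dict:
    """Map a lexeme's lemma to its glosses in the requested language."""
    lemma = next(iter(lexeme.get("lemmas", {}).values()), {}).get("value")
    glosses = {}
    for sense in lexeme.get("senses", []):
        gloss = sense.get("glosses", {}).get(target_lang)
        if gloss:
            glosses[target_lang] = gloss["value"]
    return {lexeme["id"]: {lemma: glosses}} if lemma and glosses else {}

Aggregated over the whole dump, the exported file looks like this: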

{
...
  "L33425": {
    "সোমবার": {
      "bn": "সপ্তাহের একটি দিন"
    }
  },
  "L40022": {
    "জল": {
      "bn": "এক ধরণের তরল পদার্থ"
    }
  }
...
}

Forms

The get command will parse the dump and fetch all the forms. Each lexeme contains the key "forms", which holds multiple entries with "representations" and associated "grammaticalFeatures".
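
A rough sketch of collecting one lexeme's forms before the QIDs are resolved (names are illustrative):

def extract_forms(lexeme: dict, lang_code: str = "en") -> dict:
    """Map each form's surface text to its list of grammatical-feature QIDs."""
    forms = {}
    for form in lexeme.get("forms", []):
        representation = form.get("representations", {}).get(lang_code)
        if representation:
            forms[representation["value"]] = form.get("grammaticalFeatures", [])
    return {lexeme["id"]: forms}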


# Parse all English noun forms from the given dump path.
Scribe-data get -lang English -dt nouns -wdp PATH_TO_DUMP

These will be used to generate the parsed output, which will look like this:

{
...
  "L298": {
    "etymology": "Q110786",
    "etymologies": "Q146786"
  },
  "L301": {
    "sport's": ["Q146233", "Q110786"],
    ...
  }
...
}

As expected, the output is not yet in a readable format; we get QIDs for the forms. To resolve them, Scribe-Data has lexeme_form_metadata.json, where each category contains numeric keys with corresponding linguistic labels and Wikidata QID identifiers.

  • First, filter out and remove unexpected forms.

  • Order the forms: the JSON contains linguistic metadata categorized into nine linguistic dimensions in sequential order (e.g., forms in 01_case come first, then 02_gender, and so on), and keeping them sorted maintains the formatting process.

  • Convert the feature labels to camelCase to match Wikidata Query Service output (see the sketch after this list).
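
A hedged sketch of those steps; the exact layout of lexeme_form_metadata.json is assumed here (ordered dimensions holding label/QID pairs), so treat the field names as illustrative:

import json

# Assumed (illustrative) shape of lexeme_form_metadata.json:
# {"01_case": {"1": {"label": "genitive", "qid": "Q146233"}, ...},
#  "02_gender": {...}, ...}
# Nine dimensions in total, ordered by their numeric prefix.
with open("lexeme_form_metadata.json", encoding="utf-8") as f:
    metadata = json.load(f)

# Build QID -> (dimension, label) so features can be filtered, ordered, and named.
qid_index = {}
for dimension, entries in sorted(metadata.items()):
    for entry in entries.values():
        qid_index[entry["qid"]] = (dimension, entry["label"])

def feature_key(qids: list[str]) -> str | None:
    """Turn a list of feature QIDs into a camelCase key such as 'genitiveSingular'."""
    known = sorted(qid_index[q] for q in qids if q in qid_index)
    if len(known) != len(qids):
        return None  # drop forms with unexpected or unmapped features
    labels = [label for _, label in known]
    return labels[0].lower() + "".join(label.capitalize() for label in labels[1:])

With such a mapping, ["Q146233", "Q110786"] resolves to genitiveSingular, which is how the raw L301 entry above becomes the readable output below.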

Successfully exported forms for English nouns to scribe_data_json_export/english/lexeme_nouns.json.

The final output will look like this:

{
...
  "L298": {
    "singular": "etymology",
    "plural": "etymologies"
  },
  "L301": {
    "genitiveSingular": "sport's",
    "nominativePlural": "sports",
    "genitivePlural": "sports'",
    "nominativeSingular": "sport"
  },
  "L513": {
    "singular": "table",
    "plural": "tables'"
  }
...
}

Total

# Total count for English nouns from the specified dump path.
Scribe-data total -lang english -dt nouns -wdp

This allows parsing of a downloaded dump to return the total number of lexemes and translations; for example, English nouns contain 30,838 lexemes and 24,591 translations.

This terminal output provides a summary of English lexemes from a Wikidata dump using scribe-data, displaying the total lexemes and translations for various data types.
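
A rough sketch of how such totals could be tallied from the parsed entries; the noun QID Q1084 and counting glossed senses as translations are assumptions for illustration:

def tally_totals(entries: list[dict], category_qid: str = "Q1084") -> tuple[int, int]:
    """Count lexemes of one lexical category (Q1084 = noun) and their glossed senses."""
    lexeme_count = translation_count = 0
    for lexeme in entries:
        if lexeme.get("lexicalCategory") != category_qid:
            continue
        lexeme_count += 1
        translation_count += sum(1 for sense in lexeme.get("senses", []) if sense.get("glosses"))
    return lexeme_count, translation_count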

Steps in the method

  • Local Dump Check:

    • The CLI checks for a local dump if the user indicates they want to use it.

    • If found, it runs the queries against the local dump.

  • Online Query Fallback:

    • If the user doesn't pass a dump or prefers not to download one, the CLI runs the queries online (e.g., all English queries, or fetching all translations for Bengali).

  • User Guidance:

    • The CLI informs the user that using a dump would be more efficient and suggests using the -wdp argument to download a dump.

In the end, the Scribe-Data Wikidata-based translation system provides a streamlined solution for processing multilingual lexical data. Through its user-friendly CLI, it enables efficient downloading and parsing of Wikidata Lexeme dumps, supporting comprehensive translation capabilities across languages like English and Bengali.