How to start Scribe
Welcome to Scribe-Data, an open-source initiative designed to systematically extract, structure, and manage multilingual linguistic data from Wikidata and Wikipedia. It supports the development of language tools such as intelligent keyboards, translation systems, and grammar assistants. Below is a structured overview of its purpose and workflow:
Project Overview
Scribe-Data is a command-line interface (CLI) tool that automates the retrieval and organization of language-specific data (e.g., verb conjugations, noun forms, emoji metadata) from open knowledge bases. Its core focus is on leveraging Wikidata’s lexeme dump—a structured repository of lexical entries—to generate datasets for downstream applications like Scribe-iOS.
Scribe-Data can obtain data from Wikidata in two ways: through the Wikidata Query Service (WDQS) or from a Wikidata lexeme dump.
The key differences between WDQS and lexeme dumps are:
| Feature | Wikidata Query Service (WDQS) | Wikidata Lexeme Dump |
| --- | --- | --- |
| Access Method | SPARQL queries through an interactive web interface | Static file downloads |
| Data Freshness | Real-time access to current data | Snapshot of data at dump time |
| Query Flexibility | Complex queries with filters and conditions | Limited to downloaded content |
| Language Support | Results available in any language | Limited to dump contents |
| Data Scope | Full Wikidata knowledge base | Lexeme-specific data only |
| Processing | Server-side, on the WDQS servers | Local, performed by Scribe-Data |
| Storage Requirements | No storage needed | Large file downloads required |
While Lexeme dumps are useful for offline processing or bulk data analysis, WDQS provides a more flexible and efficient solution for most querying needs, especially when working with dynamic or complex queries requiring real-time data access.
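To make the WDQS side of this comparison concrete, here is a minimal Python sketch of how a lexeme query could be sent to the public SPARQL endpoint. It is not how Scribe-Data itself issues queries: the helper names, the LIMIT value, and the user-agent string are illustrative assumptions. Q1860 (English) and Q24905 (verb) are Wikidata item identifiers.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# English (Q1860) lexemes whose lexical category is verb (Q24905).
ENGLISH_VERBS_QUERY = """
SELECT ?lexeme ?lemma WHERE {
  ?lexeme dct:language wd:Q1860 ;
          wikibase:lexicalCategory wd:Q24905 ;
          wikibase:lemma ?lemma .
}
LIMIT 10
"""


def build_wdqs_url(query: str) -> str:
    """Return a GET URL asking WDQS for JSON results (hypothetical helper)."""
    params = urllib.parse.urlencode({"query": query.strip(), "format": "json"})
    return f"{WDQS_ENDPOINT}?{params}"


def fetch_lemmas(query: str) -> list[str]:
    """Run the query over the network and return the lemma strings."""
    request = urllib.request.Request(
        build_wdqs_url(query),
        # Illustrative user agent; replace with one that identifies your tool.
        headers={"User-Agent": "scribe-data-example/0.1 (illustrative)"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        payload = json.load(response)
    return [row["lemma"]["value"] for row in payload["results"]["bindings"]]

# Example call (requires network access):
# lemmas = fetch_lemmas(ENGLISH_VERBS_QUERY)
```

Because the filtering happens on the server, the client only needs to build a request and read a small JSON response, which is the trade-off the table describes.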
The autosuggestion process uses popular words from Wikipedia and their common successors as a baseline until NLP methods are applied. Autosuggestions are generated in gen_autosuggestions.ipynb. Emojis come from Unicode CLDR via the scribe-data get -lang LANGUAGE -dt emoji-keywords command.
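The "popular words and their common successors" baseline can be pictured as a simple bigram count. The sketch below is an illustrative assumption about that idea, not the contents of gen_autosuggestions.ipynb; the function name, the naive tokenizer, and the parameters are hypothetical.

```python
import re
from collections import Counter, defaultdict


def build_autosuggestions(text: str, top_words: int = 3, per_word: int = 2) -> dict[str, list[str]]:
    """Map each of the most frequent words to its most common successors."""
    # Naive tokenizer (assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())

    word_counts = Counter(tokens)
    successors = defaultdict(Counter)
    # Count how often each word is directly followed by each other word.
    for current, following in zip(tokens, tokens[1:]):
        successors[current][following] += 1

    return {
        word: [nxt for nxt, _ in successors[word].most_common(per_word)]
        for word, _ in word_counts.most_common(top_words)
    }
```

For example, build_autosuggestions("the cat sat on the mat and the cat ran", top_words=1) returns {"the": ["cat", "mat"]}, since "the" is the most frequent word and is followed by "cat" twice and "mat" once.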
What is Scribe-Data?
Scribe-Data is a command-line interface (CLI) tool that simplifies extracting, formatting, and managing multilingual data (e.g., verbs, nouns, emojis) from open knowledge bases. Developers use this data to build apps like Scribe-iOS, which offers features like verb conjugation and translation.
Key Features
CLI Commands:
list: Show available languages (e.g., scribe-data list --language).
get: Fetch data (e.g., scribe-data get -lang English -dt verbs).
total: Check data counts (e.g., scribe-data total -lang German).
convert: Transform data into CSV/TSV/JSON formats.
Interactive Mode:
Run scribe-data get -i for a guided interface to select languages, data types, and output formats.
Data Sources:
Wikidata (lexemes, grammar rules).
Wikipedia (popular words for autosuggestions).
Unicode CLDR (emoji keywords).
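As a rough illustration of what a JSON to CSV/TSV conversion involves, the sketch below flattens a small mapping of lemmas to forms into delimited text using Python's standard library. The input shape, the column names, and the function name are assumptions made for this example; they may not match Scribe-Data's actual output files or the implementation behind its convert command.

```python
import csv
import io
import json


def json_to_delimited(json_text: str, delimiter: str = ",") -> str:
    """Convert {"lemma": {"form_name": "form", ...}, ...} into CSV/TSV text."""
    data = json.loads(json_text)

    # Collect every form name so that all rows share the same columns.
    columns = []
    for forms in data.values():
        for name in forms:
            if name not in columns:
                columns.append(name)

    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(["lemma", *columns])
    for lemma, forms in data.items():
        # Missing forms become empty cells.
        writer.writerow([lemma, *(forms.get(name, "") for name in columns)])
    return buffer.getvalue()
```

Passing delimiter="\t" produces TSV instead of CSV; the rest of the logic is unchanged.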
Quick Example: Fetching English Verbs
# Retrieve English verbs and save to a directory
scribe-data get --language English --data-type verbs --output-dir ./my_data
Adding the --wikidata-dump-path (-wdp) flag makes the command read from a Wikidata lexeme dump.
For more detail, see the Scribe-Data CLI Usage guide and the documentation at scribe-data.readthedocs.
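For intuition about what processing a lexeme dump locally involves, here is a hedged Python sketch that streams a dump and keeps the lemmas for one language and lexical category. It assumes the usual Wikidata JSON dump layout (a JSON array with one entity per line), bz2 compression, and the lexeme fields language, lexicalCategory, and lemmas. The function names are hypothetical, and this is not Scribe-Data's own parser.

```python
import bz2
import json
from typing import Iterable, Iterator


def iter_lemmas(lines: Iterable[str], language_qid: str, category_qid: str) -> Iterator[str]:
    """Yield lemma strings for lexemes matching a language and lexical category."""
    for line in lines:
        # Each entity line ends with a comma because the dump is one big array.
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue  # skip the array brackets and blank lines
        entity = json.loads(line)
        if entity.get("language") != language_qid:
            continue
        if entity.get("lexicalCategory") != category_qid:
            continue
        for lemma in entity.get("lemmas", {}).values():
            yield lemma["value"]


def iter_lemmas_from_dump(path: str, language_qid: str, category_qid: str) -> Iterator[str]:
    """Stream a bz2-compressed dump from disk without loading it into memory."""
    with bz2.open(path, mode="rt", encoding="utf-8") as dump:
        yield from iter_lemmas(dump, language_qid, category_qid)
```

Reading line by line keeps memory use flat even though the dump file itself is large, which is why the storage cost falls on disk space rather than RAM.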
Why Contribute?
Contributing to Scribe-Data offers meaningful impact and growth opportunities:
Professional Growth
Gain expertise in SPARQL queries, data processing, and multilingual computing
Build a public portfolio of meaningful open-source contributions
Collaborate with developers globally on language technology challenges
Impact on Language Accessibility
Help underserved language communities access better digital tools
Support language preservation through structured documentation
Enable innovative applications in education and communication
Technical Learning
Master practical skills in Python, CLI development, and data processing
Understand complex linguistic data structures and relationships
Learn best practices in open-source development and documentation
How to Contribute?
Fix Data Issues: Improve Wikidata entries (e.g., missing verb forms) instead of editing Scribe’s files directly. Known data issues are listed in the Scribe-Data open issues.
Expand Languages: Add support for underrepresented languages by writing missing SPARQL queries or updating Wikidata.
Build Features: Work on CLI enhancements (e.g., a terminal UI for interactive mode), add missing tests, or propose a new feature that helps the Scribe community.
Get Started Today!
Clone the repo and follow the Scribe-Data installation instructions.
Join the Community: Chat with the team on Matrix and tackle tasks labeled good first issue.
Scribe-Data is more than code—it’s about making language tools accessible to everyone. Ready to dive in? 🚀