This repository contains Python tools for analyzing writing directionality by comparing character frequency and entropy in word-initial and word-final positions. Based on the paper "The 'Handedness' of Language: Directional Symmetry Breaking of Sign Usage in Words" by Md Izhar Ashraf and Sitabhra Sinha, the tools automatically detect whether a language is written Left-to-Right (LTR) or Right-to-Left (RTL).
- direction.py: Main Python script performing directionality analysis
- The_handedness_of_language_Directional.pdf: Original research paper detailing theory and empirical analysis
- results.csv: Comprehensive analysis results across multiple languages
The analysis is rooted in Ashraf and Sinha's discovery of universal asymmetry in character distributions across languages:
- Initial characters show a more balanced distribution (greater flexibility), which means higher entropy and a lower Gini coefficient
- Final characters show a more restrictive distribution (fewer common endings), which means lower entropy and a higher Gini coefficient
This directional asymmetry, measured using statistical metrics, provides insights into writing direction, with implications for the decipherment of unknown scripts.
- Utilizes the europarl_raw and udhr corpora from NLTK
- Extracts text samples:
  - European languages (Europarl): 100,000 characters
  - RTL languages (UDHR): Full text available
- Initial Characters: Extracts first character of each word
- Final Characters: Extracts last character of each word
- Analyzes distribution patterns using statistical measures
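The extraction step above can be sketched as follows. This is a minimal illustration, not the code in direction.py: the function name `extract_initial_final`, the whitespace tokenization, and the choice to ignore non-alphabetic characters are all assumptions made for the example.

```python
from collections import Counter


def extract_initial_final(text):
    """Count the first and last letter of every whitespace-separated word.

    Hypothetical helper: tokens are lowercased and non-alphabetic
    characters (digits, punctuation) are ignored, which is an assumption
    rather than the repository's documented behaviour.
    """
    initial, final = Counter(), Counter()
    for token in text.split():
        letters = [ch for ch in token.lower() if ch.isalpha()]
        if not letters:
            continue  # skip tokens with no letters at all, e.g. "123"
        initial[letters[0]] += 1
        final[letters[-1]] += 1
    return initial, final


initial_counts, final_counts = extract_initial_final("The cat sat.")
print(initial_counts)
print(final_counts)
```

The characters are taken in stored (logical) order, so the result depends only on the sequence of characters in the text, not on how a renderer displays them.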
For each text sample, calculates:
- Entropy: Measures randomness/predictability of character distributions
- Gini Coefficient: Measures inequality in character frequency distributions
- Combined Score: Weighted combination indicating directionality
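The two underlying measures can be sketched in plain Python. This is an illustrative version under stated assumptions: entropy is computed in bits (base 2), and the Gini coefficient uses one common rank-based formula over the sorted counts. The repository lists SciPy for its entropy calculations, so its own functions may differ in log base or normalization; the names `shannon_entropy` and `gini_coefficient` are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy in bits of a frequency table (assumed log base 2)."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def gini_coefficient(counts):
    """Gini coefficient of the counts: 0 means perfectly even frequencies.

    Uses the rank-based formula G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    over counts sorted in ascending order (one common formulation).
    """
    values = sorted(counts.values())
    n, total = len(values), sum(values)
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


even = Counter("aabb")    # two characters, equally frequent
skewed = Counter("abbb")  # one character dominates
print(shannon_entropy(even), gini_coefficient(even))
print(shannon_entropy(skewed), gini_coefficient(skewed))
```

A more even distribution gives higher entropy and a lower Gini coefficient, so the two measures move in opposite directions.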
The improved scoring system combines:
- Entropy Difference = Initial Entropy - Final Entropy
- Scaled Gini Difference = (Initial Gini - Final Gini) × 5.0
- Combined Score = Entropy Difference - Scaled Gini Difference
Score interpretation:
- Positive → Left-to-Right writing system
- Negative → Right-to-Left writing system
- Magnitude indicates strength of directional signal
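A short worked example shows how the sign of the score arises. The input values below are hypothetical measurements chosen for easy arithmetic, not results from results.csv, and the helper names are illustrative.

```python
def combined_score(initial_entropy, final_entropy,
                   initial_gini, final_gini, gini_scale=5.0):
    """Combine the entropy and Gini differences as described above."""
    entropy_difference = initial_entropy - final_entropy
    scaled_gini_difference = (initial_gini - final_gini) * gini_scale
    return entropy_difference - scaled_gini_difference


def likely_direction(score):
    """Positive scores read as Left-to-Right, all others as Right-to-Left."""
    return "Left-to-Right" if score > 0 else "Right-to-Left"


# Hypothetical LTR-like sample: initial characters are more evenly spread
# (higher entropy, lower Gini) than final characters.
ltr_score = combined_score(4.0, 3.5, 0.5, 0.75)
# The same measurements with initial and final positions swapped.
rtl_score = combined_score(3.5, 4.0, 0.75, 0.5)

print(ltr_score, likely_direction(ltr_score))
print(rtl_score, likely_direction(rtl_score))
```

Because the Gini difference is subtracted, a lower initial Gini and a higher initial entropy both push the score in the positive direction, so the two terms reinforce each other.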
All ten European languages listed below were correctly identified as Left-to-Right:
- Strongest signals:
  - Finnish (1.456)
  - Italian (1.3576)
  - Portuguese (1.3555)
- Moderate signals:
  - Dutch (1.1662)
  - Spanish (1.068)
  - French (1.0639)
  - German (0.9778)
  - Danish (0.932)
  - Swedish (0.9255)
- Weakest signal:
  - Greek (0.5617)
Both RTL languages tested were correctly identified as Right-to-Left:
- Arabic: Clearer RTL signal (-0.502)
- Hebrew: Weak RTL signal, close to zero (-0.0664)
- Reversed text samples show opposite directionality
- Score magnitudes preserved in reversed text
- Pattern consistent across all languages tested
- European languages show stronger directional signals than Semitic languages
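The reversed-text check can be reproduced with a self-contained sketch. Reversing a string character by character turns every word-initial character into a word-final one and vice versa, so the score changes sign while keeping its magnitude. The helpers below are simplified stand-ins for the repository's functions (whitespace tokenization, base-2 entropy, and a rank-based Gini formula are assumptions), and the sample sentence is a placeholder.

```python
import math
from collections import Counter


def edge_counts(text):
    """Count first and last characters of whitespace-separated words."""
    initial, final = Counter(), Counter()
    for word in text.split():
        initial[word[0]] += 1
        final[word[-1]] += 1
    return initial, final


def entropy_bits(counts):
    total = sum(counts.values())
    # Sorting keeps the summation order independent of insertion order.
    return -sum((c / total) * math.log2(c / total)
                for c in sorted(counts.values()))


def gini(counts):
    values = sorted(counts.values())
    n, total = len(values), sum(values)
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def score(text, gini_scale=5.0):
    initial, final = edge_counts(text)
    entropy_diff = entropy_bits(initial) - entropy_bits(final)
    gini_diff = (gini(initial) - gini(final)) * gini_scale
    return entropy_diff - gini_diff


sample = "the quick brown fox jumps over the lazy dog and then rests"
forward = score(sample)
backward = score(sample[::-1])  # character-level reversal of the whole text
print(forward, backward)
```

A sample this short says nothing reliable about the language itself; the point is only that the forward and reversed scores mirror each other.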
- Writing Direction Determination
  - Primary challenge in unknown script analysis
  - Solution: Statistical analysis of character distributions
- Pattern Recognition
  - Aids in identifying word boundaries
  - Helps recognize linguistic patterns
- Computational Applications
  - Automated preliminary analysis
  - Integration with larger decipherment frameworks
- May need adaptation for vertical scripts
- Challenges with mixed/flexible writing directions
- Requires sufficient sample size
def analyze_directionality(text):
    """Analyze text directionality using entropy and Gini coefficient."""
    # Extract character frequencies
    initial_chars, final_chars = extract_character_frequencies(text)
    # Calculate statistical measures
    initial_gini = calculate_gini_coefficient(initial_chars)
    final_gini = calculate_gini_coefficient(final_chars)
    initial_entropy = calculate_entropy_value(initial_chars)
    final_entropy = calculate_entropy_value(final_chars)
    # Calculate differences with Gini scaling
    gini_difference = (initial_gini - final_gini) * 5.0
    entropy_difference = initial_entropy - final_entropy
    combined_score = entropy_difference - gini_difference
    return {
        "Initial Gini": initial_gini,
        "Final Gini": final_gini,
        "Initial Entropy": initial_entropy,
        "Final Entropy": final_entropy,
        "Gini Difference": gini_difference,
        "Entropy Difference": entropy_difference,
        "Combined Score": combined_score,
        "Likely Direction": "Left-to-Right" if combined_score > 0 else "Right-to-Left"
    }
- Automatic language detection
- Statistical analysis of character distributions
- Directionality prediction
- Comprehensive CSV output
- Validation through reversed text analysis
Required packages:
- NLTK (corpora access)
- NumPy (statistical calculations)
- SciPy (entropy calculations)
pip install nltk numpy scipy
from direction import analyze_directionality
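# Analyze a text sample (placeholder text; longer samples give more reliable scores)
text_sample = "The quick brown fox jumps over the lazy dog."
results = analyze_directionality(text_sample)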
# Results include:
# - Initial/Final Gini coefficients
# - Initial/Final Entropy values
# - Gini Difference (scaled)
# - Entropy Difference
# - Combined Score
# - Likely Direction
- Technical Constraints:
  - Requires sufficient text sample size
  - Performance varies with text genre/formality
  - Some languages show weaker directional signals
  - Sample size differences between corpora affect comparability
- Methodological Limitations:
  - Not designed for vertical writing systems
  - May struggle with mixed-direction scripts
  - Requires clean, well-formatted text input
- Data Processing:
  - Normalize for corpus size differences
  - Optimize Gini coefficient scaling
  - Add confidence scores for predictions
- Feature Additions:
  - Support for vertical writing systems
  - Additional statistical measures
  - Interactive visualization tools
  - Extended language coverage
- Script Analysis:
  - Support for additional writing systems
  - Historical script analysis tools
  - Comparative analysis features
- Tool Development:
  - GUI for analysis visualization
  - Batch processing capabilities
  - Integration with other linguistic tools
To contribute:
- Fork the repository
- Create a feature branch
- Submit a pull request
Areas for contribution:
- Additional corpora support
- Vertical writing system analysis
- Visualization tools
- Language coverage expansion
This project is open-source under the MIT License.
Created by John Winstead