This is a python script that helps in translating XLSForm documents from English to a target language of your choice using Google's translation API. It is especially useful for researchers or data scientists who deal with XLSForm documents in different languages.
The script requires the following:
- A Jupyter notebook (You can also use Collab).
- Python 3.6 or higher.
- Pandas Python Library.
- Google Cloud Translate API.
- Google Cloud SDK for setting up credentials.
- Fork this repository.
- Install the required Python libraries. For this, you can use the command:
pip install pandas google-cloud-translate openpyxl
in your environment or!pip install pandas google-cloud-translate openpyxl
in your Jupyter notebook.
To use this script, you need to set up Google Cloud Translate in your environment:
- Set up a Google Cloud Project: Follow the guide here.
- Enable the Google Cloud Translate API for your project. You can get a free API key here
- Create a service account and download the JSON credentials file.
- Upload your Google translate API authentication JSON file in the directory and rename it to
authentication.json
. - Enter your desired output language code (eg.
fr
for French) in the next cell. A list of codes can be found here: https://developers.google.com/admin-sdk/directory/v1/languages
Note: You can only translate into one language at once. If you want to translate multiple languages you can either repeat the process or modify the code to add a loop in Cell 5.
Make sure that your XLSForm
- has a column with english wording that is called label::English;all of these sheets:
survey
,choices
,settings
. - is called
thefile.xlsx
and uploaded in the directory.
After that, Execute all cells.
To use this script:
-
Load your XLSForm document into a pandas DataFrame. The script currently assumes you have three DataFrames: 'dfsurvey', 'dfchoices', and 'dfsettings'. Each of these correspond to a worksheet in your XLSForm document.
-
Call the
translate_dataframe()
function for each DataFrame, passing the DataFrame and the name of the column that contains the text to translate. Currently, the script assumes this is 'label::English', but this may vary depending on the structure of your XLSForm document. -
The script will translate the text in chunks, add the translations to a new column in the DataFrame, and replace any 'nan' strings with NaN values (missing data indicators in pandas).
-
The script then writes the translated DataFrames back to an Excel file using the 'openpyxl' engine. You can specify the name of this file with the 'name' variable.
The script currently assumes that the text to translate is in English, but this can be modified according to your needs.
The script also uses '' as a separator when joining and splitting the chunks of text. Ensure that your original text does not contain '', or this could cause issues with the translation.
This application only translates text in existing columns of your DataFrames. If you want to add new columns with translated text, you need to modify the translate_dataframe()
function accordingly.
The script also does not handle API request errors. Depending on the size of your text and the limitations of your Google Cloud account, you may need to handle these.
Feel free to contribute to this project by opening issues or submitting pull requests. Any contributions you make are greatly appreciated.