Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
COVID-19 Data for South Africa created, maintained and hosted by the DSFSI research group at the University of Pretoria
Disclaimer: We have worked to keep the data as accurate as possible. We collate the COVID-19 reporting data from NICD and South Africa DoH, and we only update that data once there is an official report or statement. For the other data, we work to keep the data as accurate as possible. If you find errors, let us know.
See original GitHub repo for detailed information https://github.com/dsfsi/covid19za
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Coronavirus COVID-19 (2019-nCoV) Data Repository for South Africa created, maintained and hosted by Data Science for Social Impact research group, led by Dr. Vukosi Marivate, at the University of Pretoria.
Disclaimer: The maintainers have worked to keep the data as accurate as possible. The COVID 19 reporting data has been collated from NICD and DoH and is only updated once there is an official report or statement.
If you use this repo for any research/development/innovation, please contact the maintainers of the data.
Please note that these are the daily reports as released by the National Department of Health or the NICD. New cases are counted from new positive test reports as they are released, so there may be a significant lag between when a patient was tested and when the case is reported. For example, in epidemiological Week 1 of 2021 (3-9 January), approximately 33k new cases were reported in the daily announcements. However, the NICD Testing Summary Report for Week 3 of 2021 (which also covers the two previous weeks) shows 43635 positive tests for Week 1 of 2021. The difference is due to the reporting lag: some of the 33k cases in the daily announcements came from tests done in prior weeks, while many people tested between 3 and 9 January had their cases reported only from the 10th onwards. Analyses that depend on when cases occurred need to take this lag into account.
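The size of the gap in the Week 1 example can be worked out directly. This is a minimal sketch that uses only the two figures quoted in the note; the 33k figure is approximate, so the results are indicative only, and the variable names are illustrative.

```python
# Figures quoted in the note for epidemiological Week 1 of 2021 (3-9 January).
daily_announced = 33_000        # approximate ("33k") cases from the daily announcements
positive_by_test_week = 43_635  # positive tests per the NICD Testing Summary Report

# Difference between the two ways of counting the same calendar week.
gap = positive_by_test_week - daily_announced

# Ratio of announced cases to positive tests for the same calendar week.
ratio = daily_announced / positive_by_test_week

print(f"gap: {gap} cases")
print(f"announced / tested positive: {ratio:.0%}")
```

Because some announced cases belong to earlier weeks, the ratio is only a rough indicator of how much of the week's testing had been reported at the time.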
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
About
Department of Sports Arts and Culture Multilingual Terminology Lists
Author(s)
Author(s) - Original DSAC Multilingual Terminology Lists
Author(s) - Original OERTB + Data Science for Social Impact Team (To be updated)
LICENSE for Data
The files on https://github.com/dsfsi/za-marito/ are licensed under CC-BY-SA-4.0. Any use should acknowledge the original Department of Sports Arts and Culture Multilingual Terminology Lists + Open Database authors list.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PuoBERTa + PuoBERTaJW300: Setswana Language Models
A RoBERTa-based language model specially designed for Setswana, using the new PuoData dataset (PuoBERTa) and PuoData + JW300 TSN (PuoBERTaJW300).
Cite
@inproceedings{marivate2023puoberta,
title = {PuoBERTa: Training and evaluation of a curated language model for Setswana},
author = {Vukosi Marivate and Moseli Mots'Oehli and Valencia Wagner and Richard Lastrucci and Isheanesu Dzingirai},
year = {2023},
booktitle = {Artificial Intelligence Research. SACAIR 2023. Communications in Computer and Information Science},
url = {https://link.springer.com/chapter/10.1007/978-3-031-49002-6_17},
keywords = {NLP},
preprint_url = {https://arxiv.org/abs/2310.09141},
dataset_url = {https://github.com/dsfsi/PuoBERTa},
software_url = {https://huggingface.co/dsfsi/PuoBERTa}
}
Model Details
Model Description
This is a masked language model trained on Setswana corpora, making it a valuable tool for a range of downstream applications from translation to content creation. It's powered by the PuoData dataset to ensure accuracy and cultural relevance.
Developed by: Vukosi Marivate (@vukosi), Moseli Mots'Oehli (@MoseliMotsoehli), Valencia Wagner, Richard Lastrucci and Isheanesu Dzingirai
Model type: RoBERTa Model
Language(s) (NLP): Setswana
License: CC BY 4.0
Usage
Use this model for filling in masks, or fine-tune it for downstream tasks. Here's a simple example that loads the model and tokenizer as a starting point for masked prediction:

    from transformers import RobertaTokenizer, RobertaModel
    # Load model and tokenizer
    model = RobertaModel.from_pretrained('dsfsi/PuoBERTa')
    tokenizer = RobertaTokenizer.from_pretrained('dsfsi/PuoBERTa')

Downstream Use
Downstream Performance
MasakhaPOS
Performance of models on the MasakhaPOS downstream task.
Model | Test Performance |
---|---|
Multilingual Models | |
AfroLM | 83.8 |
AfriBERTa | 82.5 |
AfroXLMR-base | 82.7 |
AfroXLMR-large | 83.0 |
Monolingual Models | |
NCHLT TSN RoBERTa | 82.3 |
PuoBERTa | 83.4 |
PuoBERTa+JW300 | 84.1 |
MasakhaNER
Performance of models on the MasakhaNER downstream task.
Model | Test Performance (F1 score) |
---|---|
Multilingual Models | |
AfriBERTa | 83.2 |
AfroXLMR-base | 87.7 |
AfroXLMR-large | 89.4 |
Monolingual Models | |
NCHLT TSN RoBERTa | 74.2 |
PuoBERTa | 78.2 |
PuoBERTa+JW300 | 80.2 |
Dataset
We used the PuoData dataset, a rich source of Setswana text, ensuring that our model is well-trained and culturally attuned.
Contributing
Your contributions are welcome! Feel free to improve the model.
Model Card Authors
Vukosi Marivate
Model Card Contact
For more details, reach out or check our website. Email: vukosi.marivate@cs.up.ac.za
Enjoy exploring Setswana through AI!
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The purpose of this repository is to collate data on the ongoing coronavirus pandemic in Africa. Our goal is to record detailed information on each reported case in every African country. We want to build a line list – a table summarizing information about people who are infected, dead, or recovered. The table for each African country would include demographic, location, and symptom (where available) information for each reported case. The data will be obtained from official sources (e.g., WHO, departments of health, CDC etc.) and unofficial sources (e.g., news). Such a dataset has many uses, including studying the spread of COVID-19 across Africa and assessing similarities and differences to what’s being observed in other regions of the world.
See the repo here https://github.com/dsfsi/covid19africa
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GitHub: https://github.com/dsfsi/gov-za-multilingual
Zenodo:
The dataset contains cabinet statements from the South African government. Data was scraped from the government's website: https://www.gov.za/cabinet-statements
The datasets contain government cabinet statements in 11 languages, namely:
Language | Code | Language | Code |
---|---|---|---|
English | (eng) | Sepedi | (nso) |
Afrikaans | (afr) | Setswana | (tsn) |
isiNdebele | (nbl) | Siswati | (ssw) |
isiXhosa | (xho) | Tshivenda | (ven) |
isiZulu | (zul) | Xitsonga | (tso) |
Sesotho | (sot) | | |
The dataset contains the full data in a JSON file (/data/govza-cabinet-statements.json), as well as CSVs split by language, e.g. “govza-cabinet-statements-en.csv” for English. The dataset does not contain special characters like Unicode or ASCII.
Please see the data-statement.md for full dataset information. (TODO)
src_lang | trg_lang | num_aligned_pairs |
---|---|---|
afr | eng | 14549 |
afr | nbl | 6621 |
afr | nso | 15388 |
afr | sot | 8834 |
afr | ssw | 15610 |
afr | tsn | 12605 |
afr | tso | 14936 |
afr | ven | 5776 |
afr | xho | 16065 |
afr | zul | 14998 |
nbl | eng | 3616 |
nbl | nso | 6342 |
nbl | sot | 16163 |
nbl | ssw | 4655 |
nbl | tsn | 3369 |
nbl | tso | 4465 |
nbl | ven | 18984 |
nbl | xho | 5213 |
nbl | zul | 3868 |
nso | eng | 15257 |
nso | ssw | 18697 |
nso | tsn | 16179 |
nso | tso | 17617 |
nso | ven | 6367 |
sot | eng | 5212 |
sot | nso | 8077 |
sot | ssw | 5811 |
sot | tsn | 5450 |
sot | tso | 6586 |
sot | ven | 14098 |
ssw | eng | 15721 |
ssw | tso | 17880 |
ssw | ven | 4588 |
tsn | eng | 14544 |
tsn | ssw | 16386 |
tsn | tso | 16681 |
tsn | ven | 3267 |
tso | eng | 16068 |
ven | eng | 3670 |
ven | tso | 4578 |
xho | eng | 16537 |
xho | nso | 18110 |
xho | sot | 7489 |
xho | ssw | 18387 |
xho | tsn | 16571 |
xho | tso | 17954 |
xho | ven | 4559 |
xho | zul | 18145 |
zul | eng | 16149 |
zul | nso | 17630 |
zul | sot | 5975 |
zul | ssw | 18563 |
zul | tsn | 16482 |
zul | tso | 17789 |
zul | ven | 3606 |
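Each language pair in the table is listed in one direction only, so a lookup has to ignore the order of the two codes. The following is a minimal sketch, assuming a handful of rows copied from the table above; the helper names `parse_rows` and `aligned_pairs` are hypothetical and not part of the repository.

```python
# A few rows copied from the table above (src_lang | trg_lang | num_aligned_pairs).
ROWS = """
afr | eng | 14549
nbl | eng | 3616
xho | zul | 18145
zul | ven | 3606
"""

def parse_rows(text):
    """Parse 'src | trg | count' lines into a dict keyed by an unordered language pair."""
    pairs = {}
    for line in text.strip().splitlines():
        src, trg, count = (cell.strip() for cell in line.split("|"))
        pairs[frozenset((src, trg))] = int(count)
    return pairs

def aligned_pairs(pairs, lang_a, lang_b):
    """Return the count for a language pair in either order, or None if it is not listed."""
    return pairs.get(frozenset((lang_a, lang_b)))

pairs = parse_rows(ROWS)
print(aligned_pairs(pairs, "eng", "afr"))  # same row as afr | eng
print(aligned_pairs(pairs, "zul", "xho"))  # same row as xho | zul
```

Keying on a `frozenset` is one simple way to make the lookup direction-independent; it relies on each pair appearing only once, which holds for the rows shown.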
@inproceedings{lastrucci-etal-2023-preparing,
title = "Preparing the Vuk{'}uzenzele and {ZA}-gov-multilingual {S}outh {A}frican multilingual corpora",
author = "Richard Lastrucci and Isheanesu Dzingirai and Jenalea Rajab and Andani Madodonga and Matimba Shingange and Daniel Njini and Vukosi Marivate",
booktitle = "Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.rail-1.3",
pages = "18--25"
}
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
WordSim and Simlex Data for South African Languages
Embedding Evaluation Data for South African Languages
Dataset Information
The datasets (Simlex and WordSim) contain pairs of Setswana and Sepedi words that have been assigned similarity ratings by humans to measure semantic relatedness. The word pairs were manually translated from English to Setswana and Sepedi. The evaluation task measures the degree of correlation between the scores produced by a model and the human ratings; the model's score for a word pair is the cosine similarity of the corresponding word vectors.
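The evaluation described here can be sketched in a few lines. The example below is a minimal illustration, not the authors' evaluation script: it assumes Spearman rank correlation as the correlation measure (the description only says "correlation"), and the embeddings, word labels and ratings are hypothetical toy values rather than entries from the dataset.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical toy embeddings and human ratings (NOT taken from the dataset).
embeddings = {
    "w1": [1.0, 0.0],
    "w2": [0.9, 0.1],
    "w3": [0.0, 1.0],
    "w4": [0.5, 0.5],
}
rated_pairs = [("w1", "w2", 9.0), ("w1", "w4", 6.0), ("w1", "w3", 1.5)]

model_scores = [cosine_similarity(embeddings[a], embeddings[b]) for a, b, _ in rated_pairs]
human_scores = [rating for _, _, rating in rated_pairs]
print(spearman(model_scores, human_scores))
```

A correlation close to 1 means the model orders the word pairs the same way the human raters do. Rank correlation is used in this sketch because it compares orderings rather than raw score scales, which differ between cosine similarity and human ratings.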
Online Repository link
Authors
See also the list of contributors who participated in this project.
Citing the dataset
To appear in conference proceedings
@article{Makgatho_Marivate_Sefara_Wagner_2022, title={Training Cross-Lingual embeddings for Setswana and Sepedi},
volume={3},
url={https://upjournals.up.ac.za/index.php/dhasa/article/view/3822},
DOI={10.55492/dhasa.v3i03.3822},
number={03},
journal={Journal of the Digital Humanities Association of Southern Africa },
author={Makgatho, Mack and Marivate, Vukosi and Sefara, Tshephisho and Wagner, Valencia},
year={2022},
month={Feb.}}