License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
This dataset includes a list of over 37 million Wikipedia articles in 55 languages with quality scores by WikiRank (https://wikirank.net). Quality scores of articles are based on Wikipedia dumps from July 2018.
License: All files included in this dataset are released under CC0: https://creativecommons.org/publicdomain/zero/1.0/
Format:
• page_id -- the identifier of the Wikipedia article (int), e.g. 4519301
• revision_id -- the Wikipedia revision of the article (int), e.g. 24284811
• page_name -- the title of the Wikipedia article (utf-8), e.g. General relativity
• wikirank_quality -- quality score on a 0-100 scale
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
These datasets include lists of over 43 million Wikipedia articles in 55 languages with quality scores by WikiRank (https://wikirank.net). Additionally, the datasets contain the quality measures (metrics) that directly affect these scores. Quality measures were extracted from Wikipedia dumps from April 2022.
License: All files included in these datasets are released under CC BY 4.0: https://creativecommons.org/licenses/by/4.0/
Format:
• page_id -- the identifier of the Wikipedia article (int), e.g. 840191
• page_name -- the title of the Wikipedia article (utf-8), e.g. Sagittarius A*
• wikirank_quality -- quality score for the Wikipedia article on a 0-100 scale (as of April 1, 2022); a synthetic measure calculated from the metrics below (also included in the datasets)
• norm_len -- normalized "page length"
• norm_refs -- normalized "number of references"
• norm_img -- normalized "number of images"
• norm_sec -- normalized "number of sections"
• norm_reflen -- normalized "references per length ratio"
• norm_authors -- normalized "number of authors" (excluding bots and anonymous users)
• flawtemps -- flaw templates
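A minimal loading sketch for one of these files, assuming a tab-separated layout and a hypothetical filename (wikirank_2022_en.tsv); adjust the separator and path to the actual distribution.

```python
import pandas as pd

# Hypothetical filename and separator; the actual files may differ.
df = pd.read_csv("wikirank_2022_en.tsv", sep="\t")

# Highest-quality articles according to the synthetic WikiRank score.
top = df.sort_values("wikirank_quality", ascending=False)
print(top[["page_id", "page_name", "wikirank_quality"]].head(10))
```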
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
The "Wikipedia SQLite Portable DB" is a compact and efficient database derived from the Kensho Derived Wikimedia Dataset (KDWD). This dataset provides a condensed subset of raw Wikimedia data in a format optimized for natural language processing (NLP) research and applications.
I am not affiliated or partnered with Kensho in any way; I just really like this dataset because it gives my agents something easy to query.
Key Features:
- Contains over 5 million rows of data from English Wikipedia and Wikidata
- Stored in a portable SQLite database format for easy integration and querying
- Includes a link-annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base
- Ideal for NLP tasks, machine learning, data analysis, and research projects
The database consists of four main tables: pages, items, link_annotated_text, and properties (the tables referenced by the query examples below).
This dataset is derived from the Kensho Derived Wikimedia Dataset (KDWD), which is built from the English Wikipedia snapshot from December 1, 2019, and the Wikidata snapshot from December 2, 2019. The KDWD is a condensed subset of the raw Wikimedia data in a form that is helpful for NLP work, and it is released under the CC BY-SA 3.0 license.
Credits: The "Wikipedia SQLite Portable DB" is derived from the Kensho Derived Wikimedia Dataset (KDWD), created by the Kensho R&D group. The KDWD is based on data from Wikipedia and Wikidata, which are crowd-sourced projects supported by the Wikimedia Foundation. We would like to acknowledge and thank the Kensho R&D group for their efforts in creating the KDWD and making it available for research and development purposes.
By providing this portable SQLite database, we aim to make Wikipedia data more accessible and easier to use for researchers, data scientists, and developers working on NLP tasks, machine learning projects, and other data-driven applications. We hope that this dataset will contribute to the advancement of NLP research and the development of innovative applications utilizing Wikipedia data.
https://www.kaggle.com/datasets/kenshoresearch/kensho-derived-wikimedia-data/data
Tags: encyclopedia, wikipedia, sqlite, database, reference, knowledge-base, articles, information-retrieval, natural-language-processing, nlp, text-data, large-dataset, multi-table, data-science, machine-learning, research, data-analysis, data-mining, content-analysis, information-extraction, text-mining, text-classification, topic-modeling, language-modeling, question-answering, fact-checking, entity-recognition, named-entity-recognition, link-prediction, graph-analysis, network-analysis, knowledge-graph, ontology, semantic-web, structured-data, unstructured-data, data-integration, data-processing, data-cleaning, data-wrangling, data-visualization, exploratory-data-analysis, eda, corpus, document-collection, open-source, crowdsourced, collaborative, online-encyclopedia, web-data, hyperlinks, categories, page-views, page-links, embeddings
Usage with LIKE queries:

```python
import asyncio

import aiosqlite


class KenshoDatasetQuery:
    def __init__(self, db_file):
        self.db_file = db_file

    async def __aenter__(self):
        self.conn = await aiosqlite.connect(self.db_file)
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        await self.conn.close()

    async def search_pages_by_title(self, title):
        # Join pages with their Wikidata item and link-annotated text.
        query = """
            SELECT pages.page_id, pages.item_id, pages.title, pages.views,
                   items.labels AS item_labels, items.description AS item_description,
                   link_annotated_text.sections
            FROM pages
            JOIN items ON pages.item_id = items.id
            JOIN link_annotated_text ON pages.page_id = link_annotated_text.page_id
            WHERE pages.title LIKE ?
        """
        async with self.conn.execute(query, (f"%{title}%",)) as cursor:
            return await cursor.fetchall()

    async def search_items_by_label_or_description(self, keyword):
        query = """
            SELECT id, labels, description
            FROM items
            WHERE labels LIKE ? OR description LIKE ?
        """
        async with self.conn.execute(query, (f"%{keyword}%", f"%{keyword}%")) as cursor:
            return await cursor.fetchall()

    async def search_items_by_label(self, label):
        query = """
            SELECT id, labels, description
            FROM items
            WHERE labels LIKE ?
        """
        async with self.conn.execute(query, (f"%{label}%",)) as cursor:
            return await cursor.fetchall()

    # async def search_properties_by_label_or_desc...
```
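A short usage sketch for the class above; the database filename (kensho_wikipedia.db) is a hypothetical placeholder for wherever the SQLite file is stored.

```python
import asyncio


async def main():
    # Hypothetical path to the SQLite file; point this at the downloaded database.
    async with KenshoDatasetQuery("kensho_wikipedia.db") as db:
        rows = await db.search_pages_by_title("General relativity")
        for page_id, item_id, title, views, labels, description, sections in rows[:5]:
            print(page_id, title, views)


asyncio.run(main())
```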
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
This dataset includes a list of over 39 million Wikipedia articles in 55 languages with quality scores by WikiRank (https://wikirank.net). Quality scores of articles are based on Wikipedia dumps from May 2019; popularity and authors' interest are based on activity in April 2019.
License: All files included in this dataset are released under CC0: https://creativecommons.org/publicdomain/zero/1.0/
Format:
• page_id -- the identifier of the Wikipedia article (int), e.g. 4519301
• page_name -- the title of the Wikipedia article (utf-8), e.g. General relativity
• wikirank_quality -- quality score for the Wikipedia article on a 0-100 scale (as of May 1, 2019)
• popularity -- median of the daily number of page views of the Wikipedia article during April 2019
• authors_interest -- number of authors of the Wikipedia article during April 2019
This repository contains the scripts required to implement the Wikidata-based geocoding pipeline described in the accompanying paper.
- geocode.sh: Shell script for setting up and executing Stanford CoreNLP with the required language models and entitylink annotator. Automates preprocessing, named entity recognition (NER), and wikification across a directory of plain-text (.txt) files. Configured for both local execution and high-performance computing (HPC) environments.
- geocode.py: Python script that processes the list of extracted location entities (entities.txt) and retrieves latitude/longitude coordinates from Wikidata using Pywikibot. Handles redirects, missing pages, and missing coordinate values, returning standardized placeholder codes where necessary. Outputs results as a CSV file with columns for place name, latitude, longitude, and source file.
- geocode.sbatch: Optional SLURM submission script for running run_corenlp.sh on HPC clusters. Includes configurable resource requests for scalable processing of large corpora.
- README.md: Detailed README file including a line-by-line explanation of the geocode.sh file.
Together, these files provide a reproducible workflow for geocoding textual corpora via wikification, suitable for projects ranging from small-scale literary analysis to large-scale archival datasets.
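For illustration, a minimal sketch of the kind of Wikidata lookup geocode.py performs with Pywikibot; this is not the repository's actual code, and the resolution path via English Wikipedia titles and the None placeholders are assumptions.

```python
import pywikibot

site = pywikibot.Site("en", "wikipedia")


def lookup_coordinates(place_name):
    """Return (lat, lon) for a place name, or (None, None) when nothing is found."""
    page = pywikibot.Page(site, place_name)
    if page.isRedirectPage():                 # follow redirects
        page = page.getRedirectTarget()
    if not page.exists():                     # missing page
        return None, None
    item = pywikibot.ItemPage.fromPage(page)  # Wikipedia page -> Wikidata item
    item.get()
    if "P625" not in item.claims:             # P625 = coordinate location
        return None, None
    coord = item.claims["P625"][0].getTarget()
    return coord.lat, coord.lon


print(lookup_coordinates("Timbuktu"))
```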
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset accompanies a research paper that introduces a novel system designed to support the Wikipedia community in combating vandalism on the platform. The dataset has been prepared to enhance the accuracy and efficiency of Wikipedia patrolling in multiple languages.
The release of this comprehensive dataset aims to encourage further research and development in vandalism detection techniques, fostering a safer and more inclusive environment for the Wikipedia community. Researchers and practitioners can utilize this dataset to train and validate their models for vandalism detection and contribute to improving online platforms' content moderation strategies.
Dataset Details:
Related paper citation:
@inproceedings{10.1145/3580305.3599823,
author = {Trokhymovych, Mykola and Aslam, Muniza and Chou, Ai-Jou and Baeza-Yates, Ricardo and Saez-Trumper, Diego},
title = {Fair Multilingual Vandalism Detection System for Wikipedia},
year = {2023},
isbn = {9798400701030},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3580305.3599823},
doi = {10.1145/3580305.3599823},
abstract = {This paper presents a novel design of the system aimed at supporting the Wikipedia community in addressing vandalism on the platform. To achieve this, we collected a massive dataset of 47 languages, and applied advanced filtering and feature engineering techniques, including multilingual masked language modeling to build the training dataset from human-generated data. The performance of the system was evaluated through comparison with the one used in production in Wikipedia, known as ORES. Our research results in a significant increase in the number of languages covered, making Wikipedia patrolling more efficient to a wider range of communities. Furthermore, our model outperforms ORES, ensuring that the results provided are not only more accurate but also less biased against certain groups of contributors.},
booktitle = {Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
pages = {4981–4990},
numpages = {10},
location = {Long Beach, CA, USA},
series = {KDD '23}
}
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
This dataset is a cleaned up and annotated version of another dataset previously shared: https://figshare.com/articles/dataset/Wikidata_Constraints_Violations_-_July_2017/7712720
This dataset contains corrections for Wikidata constraint violations extracted from the July 1st, 2018 Wikidata full-history dump. It has been created as part of a work named Neural Knowledge Base Repairs by Thomas Pellissier Tanon and Fabian Suchanek. An example of code making use of this dataset is available on GitHub: https://github.com/Tpt/bass-materials/blob/master/corrections_learning.ipynb
The following constraints are considered:
* conflicts with: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Conflicts_with
* distinct values: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Unique_value
* inverse and symmetric: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Inverse and https://www.wikidata.org/wiki/Help:Property_constraints_portal/Symmetric
* item requires statement: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Item
* one of: https://www.wikidata.org/wiki/Help:Property_constraints_portal/One_of
* single value: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Single_value
* type: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Type
* value requires statement: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Target_required_claim
* value type: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Value_type
The constraints.tsv file contains the list of most of the Wikidata constraints considered in this dataset (beware, there could be some discrepancies for type, valueType, itemRequiresClaim and valueRequiresClaim constraints). It is a tab-separated file with the following columns:
1. constraint id: the URI of the Wikidata statement describing the constraint
2. property id: the URI of the property that is constrained
3. type id: the URI of the constraint type (type, value type, ...); it is a Wikidata item
4. 15 columns for the possible attributes of the constraint. If an attribute has multiple values, they are in the same cell, separated by a space. The columns are:
* regex: https://www.wikidata.org/wiki/Property:P1793
* exceptions: https://www.wikidata.org/wiki/Property:P2303
* group by: https://www.wikidata.org/wiki/Property:P2304
* items: https://www.wikidata.org/wiki/Property:P2305
* property: https://www.wikidata.org/wiki/Property:P2306
* namespace: https://www.wikidata.org/wiki/Property:P2307
* class: https://www.wikidata.org/wiki/Property:P2308
* relation: https://www.wikidata.org/wiki/Property:P2309
* minimal date: https://www.wikidata.org/wiki/Property:P2310
* maximum date: https://www.wikidata.org/wiki/Property:P2311
* maximum value: https://www.wikidata.org/wiki/Property:P2312
* minimal value: https://www.wikidata.org/wiki/Property:P2313
* status: https://www.wikidata.org/wiki/Property:P2316
* separator: https://www.wikidata.org/wiki/Property:P4155
* scope: https://www.wikidata.org/wiki/Property:P5314
The other files provide, for each constraint type, the list of all corrections extracted from the edit history. The format is one line per correction with the following tab-separated values:
1. constraint id
2. revision that fixed the constraint violation
3. first violation triple subject
4. first violation triple predicate
5. first violation triple object
6. second violation triple subject (blank if no second violation triple)
7. second violation triple predicate (blank if no second violation triple)
8. second violation triple object (blank if no second violation triple)
9. separator (not useful)
10. subject of the first triple in the correction
11. predicate of the first triple in the correction
12. object of the first triple in the correction
13. whether the first triple in the correction is an addition or a deletion
14. subject of the second triple in the correction (might not exist)
15. predicate of the second triple in the correction (might not exist)
16. object of the second triple in the correction (might not exist)
17. whether the second triple in the correction is an addition or a deletion (might not exist)
18. description of the subject of the first violation triple, encoded in JSON
19. description of the object of the first violation triple, encoded in JSON (might be empty for literals)
20. description of the term of the second triple that has not already been described by the two previous descriptions (might be empty for literals or if there is no second triple)
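A minimal parsing sketch for one of the per-constraint-type correction files, based on the column list above; the filename (type.tsv) and the assumption of plain tab-separated fields with no quoting are hypothetical.

```python
import csv
import json

# Hypothetical filename for one of the per-constraint-type correction files.
with open("type.tsv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t"):
        constraint_id, revision = row[0], row[1]
        first_violation = tuple(row[2:5])    # subject, predicate, object of the violating triple
        second_violation = tuple(row[5:8])   # blank strings if there is no second violation triple
        first_correction = tuple(row[9:13])  # subject, predicate, object, addition/deletion flag
        # Column 18 (index 17) holds a JSON description of the subject,
        # assuming all 20 columns are present in the row.
        subject_description = json.loads(row[17]) if len(row) > 17 and row[17] else {}
        print(constraint_id, first_violation, first_correction)
        break
```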
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Niger export data: Discover the potential in agriculture, mining, and energy. Unveil the prime partners and the key to economic growth.
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
This dataset contains corrections for Wikidata constraint violations extracted from the July 1st, 2018 Wikidata full-history dump.
The following constraints are considered:
* conflicts with: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Conflicts_with
* distinct values: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Unique_value
* inverse and symmetric: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Inverse and https://www.wikidata.org/wiki/Help:Property_constraints_portal/Symmetric
* item requires statement: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Item
* one of: https://www.wikidata.org/wiki/Help:Property_constraints_portal/One_of
* single value: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Single_value
* type: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Type
* value requires statement: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Target_required_claim
* value type: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Value_type
The constraints.tsv file contains the list of most of the Wikidata constraints considered in this dataset (beware, there could be some discrepancies for type, valueType, itemRequiresClaim and valueRequiresClaim constraints). It is a tab-separated file with the following columns:
* constraint id: the URI of the Wikidata statement describing the constraint
* property id: the URI of the property that is constrained
* type id: the URI of the constraint type (type, value type, ...); it is a Wikidata item
* 15 columns for the possible attributes of the constraint. If an attribute has multiple values, they are in the same cell, separated by a space. The columns are:
** regex: https://www.wikidata.org/wiki/Property:P1793
** exceptions: https://www.wikidata.org/wiki/Property:P2303
** group by: https://www.wikidata.org/wiki/Property:P2304
** items: https://www.wikidata.org/wiki/Property:P2305
** property: https://www.wikidata.org/wiki/Property:P2306
** namespace: https://www.wikidata.org/wiki/Property:P2307
** class: https://www.wikidata.org/wiki/Property:P2308
** relation: https://www.wikidata.org/wiki/Property:P2309
** minimal date: https://www.wikidata.org/wiki/Property:P2310
** maximum date: https://www.wikidata.org/wiki/Property:P2311
** maximum value: https://www.wikidata.org/wiki/Property:P2312
** minimal value: https://www.wikidata.org/wiki/Property:P2313
** status: https://www.wikidata.org/wiki/Property:P2316
** separator: https://www.wikidata.org/wiki/Property:P4155
** scope: https://www.wikidata.org/wiki/Property:P5314
The other files provide, for each constraint type, the list of all corrections extracted from the edit history. The format is one line per correction with the following tab-separated values:
* URI of the statement describing the constraint in Wikidata
* URI of the revision that solved the constraint violation
* subject, predicate and object of the triple that was violating the constraint (separated by a tab)
* the string "->"
* subject, predicate and object of the triple(s) of the correction, each followed by "http://wikiba.se/history/ontology#deletion" if the triple has been removed or "http://wikiba.se/history/ontology#addition" if the triple has been added. Each component of these values is separated by a tab.
More detailed explanations are provided in a forthcoming paper.
This data set depicts the non-surveyed boundaries of active (recorded or interim) and closed federal mining claims within the State of Alaska. Each mining claim is represented as an individual region, identified by the casefile serial number, which can be linked to background data via the ALIS (Alaska Land Information System). Mining claim boundaries were identified in location notices from the original casefiles. They were plotted on maps based on rough sketches, claimant maps or physical descriptions.
License: not specified (https://academictorrents.com/nolicensespecified)
A BitTorrent file to download data with the title '[Coursera ] Text Mining and Analytics'
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This table shows relevant data for the economic sector 'mining and quarrying', e.g. the number of persons employed, costs and revenues, turnover and other financial results. The figures can be divided by a number of branches according to Statistics Netherlands' Standard Industrial Classification of all Economic Activities 1993 (SIC'93).
The survey questionnaire was changed slightly in 2007. Up to and including 2006, wage subsidies were counted as (other) business returns; from 2007 onwards these subsidies are deducted from business costs. Because of these changes, results for 2007 are not fully comparable with results for 2006. The effect of these changes on business returns and business costs is small for most of the branches.
Data available from 2006 - 2008.
Status of the figures: All data are definite.
Changes as of 1 September 2011:
This table has been stopped. Two important points in the processing of the data on 2009 have changed:
- a new version of the Standard Industrial Classification of all Economic Activities has been implemented (SIC 2008);
- the statistical unit has been changed.
Due to these changes, the figures are no longer comparable to those of the previous years. Therefore a new table has been started from 2009 onwards (see also heading 3).
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
The Bible (or Biblia, in Greek) is a collection of sacred texts or scriptures that Jews and Christians consider to be a product of divine inspiration and a record of the relationship between God and humans (Wikipedia). For data-mining purposes, the scriptures lend themselves to NLP, classification, sentiment analysis and other topics at the intersection of data science and theology.
Here you will find the following bible versions in sql, sqlite, xml, csv, and json format:
American Standard-ASV1901 (ASV)
Bible in Basic English (BBE)
Darby English Bible (DARBY)
King James Version (KJV)
Webster's Bible (WBT)
World English Bible (WEB)
Young's Literal Translation (YLT)
Each verse is accessed by a unique key, the combination of the BOOK+CHAPTER+VERSE id.
Example:
Genesis 1:1 (Genesis chapter 1, verse 1) = 01001001 (01 001 001)
Exodus 2:3 (Exodus chapter 2, verse 3) = 02002003 (02 002 003)
The verse-id system is used for faster, simplified queries.
For instance: 01001001 - 02001005 would capture all verses between Genesis 1:1 through Exodus 1:5.
Written simply:
SELECT * FROM bible.t_asv WHERE id BETWEEN 01001001 AND 02001005
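As an illustration of the id scheme, a small helper that builds the 8-digit verse key from book, chapter and verse numbers (zero-padded to 2+3+3 digits); this is just a sketch of the convention described above.

```python
def verse_id(book: int, chapter: int, verse: int) -> str:
    """Build the BOOK+CHAPTER+VERSE key, e.g. Genesis 1:1 -> '01001001'."""
    return f"{book:02d}{chapter:03d}{verse:03d}"


print(verse_id(1, 1, 1))   # 01001001 (Genesis 1:1)
print(verse_id(2, 2, 3))   # 02002003 (Exodus 2:3)
```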
Coordinating Tables
There is also a number-to-book key (key_english table), a cross-reference list (cross_reference table), and a bible key containing meta information about the included translations (bible_version_key table). See the SQL table layout below. These tables work together, providing a great basis for a bible-reading and cross-referencing app. In addition, each book is marked with a particular genre, mapping to the number-to-genre key (key_genre_english table), and common abbreviations for each book can be looked up in the abbreviations list (key_abbreviations_english table). While it's expected that your programs would use the verse-id system, book #, chapter #, and verse # columns have been included in the bible version tables.
A Valuable Cross-Reference Table
A very special and valuable addition to these databases is the extensive cross-reference table. It was created from the project at http://www.openbible.info/labs/cross-references/ (see the .txt version included from the http://www.openbible.info website). It is extremely useful in bible study for discovering related scriptures. For any given verse, you simply query vid (verse id), and a list of rows will be returned. Each of those rows has a rank (r) for relevance, a start verse (sv), and an end verse (ev) if there is one.
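A minimal lookup sketch using Python's built-in sqlite3 module and the column names given above (vid, r, sv, ev); the database filename and the exact table schema are assumptions.

```python
import sqlite3

# Hypothetical filename for the SQLite build of the database.
conn = sqlite3.connect("bible.db")

# Cross-references for John 3:16 (book 43, chapter 3, verse 16), best-ranked first.
rows = conn.execute(
    "SELECT r, sv, ev FROM cross_reference WHERE vid = ? ORDER BY r DESC",
    (43003016,),
).fetchall()
for rank, start_verse, end_verse in rows:
    print(rank, start_verse, end_verse)
conn.close()
```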
Basic Web Interaction
The web folder contains two php files. Edit the first few lines of index.php to match your server's settings, then place these in a folder on your webserver. The references search box accepts multiple comma-separated values (e.g. John 3:16, Rom 3:23, 1 Jn 1:9, Romans 10:9-10). You can also directly link to a verse by altering the URI: http://localhost/index.php?b=John 3:16, Rom 3:23, 1 Jn 1:9, Romans 10:9-10
In the CSV folder, you will find (in the same list order as the other formats):
[Table layout screenshots: bible_version_key, key_abbreviations_english, key_english, key_genre_english, t_version]
On behalf of the original contributors (GitHub).
WordNet as an additional semantic resource for NLP
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Burkina Faso export data: Discover how this West African nation thrives on agriculture and mining exports, targeting growth and employment.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Softcite software mention extraction from the CORD-19 publications
This dataset is the result of the extraction of software mentions from the set of publications of the CORD-19 corpus (https://allenai.org/data/cord-19) by the Softcite software recognizer (SciBERT-CRF fine-tuned model), see https://github.com/ourresearch/software-mentions.
The CORD-19 version used for this dataset is the one dated 2021-07-26, using the metadata.csv file only. We re-harvested the PDFs with https://github.com/kermitt2/article-dataset-builder in order to also extract the coordinates of software mentions in the PDFs and to take advantage of the latest version of GROBID for better full-text extraction from PDF. We also harvested 61,230 more full texts than the standard CORD-19 distribution.
Note that this is the third version of this dataset (version 0.3.0). The previous Softcite software mention extractions from CORD-19 were based on the 2020-09-11 and 2021-03-22 versions. The new version covers a larger set of documents and uses an improved version of the extraction tools.
Data format
The extraction consists of 3 JSON files:
annotations.jsonl contains the individual software annotations including software name and possible attached attributes (publisher, URL and version). Each annotation is associated with coordinates expressed as bounding boxes in the original PDF. See Coordinates of structures in the original PDF for more details on the coordinate format.
The context of citation is the sentence where the software name and its attributes are extracted. It is added to the JSON structure (field context), as well as the identifier of the document where the annotation belongs (field document, pointing to entries available in documents.json) and a list of bibliographical references attached to the software name (field references, pointing to entries available in references.json, with the used reference marker string). See https://github.com/ourresearch/software-mentions for more details on the extracted attributes.
If the software name was successfully disambiguated against Wikidata ("entity linking"), it appears in the field wikidataId as a Wikidata entity identifier and in the field wikipediaExternalRef as a Wikipedia PageID from the English Wikipedia. Entity linking is performed with entity-fishing.
documents.jsonl contains the metadata of all the CORD-19 documents containing at least one software annotation. The metadata are given as a CrossRef JSON structure. The abstract should be included in the metadata most of the time, as well as some complements extracted by GROBID directly from the PDF. In addition, the size of the pages and the unique file path to the PDF are included to allow annotations directly on the PDF (see Coordinates of structures in the original PDF for more details on the PDF annotation display mechanism).
references.jsonl contains the parsed reference entries associated with software mentions. These references are given in the field tei, encoded in the XML TEI format of the GROBID extraction. The extracted raw references have been matched against CrossRef with biblio-glutton to get a DOI and more complete metadata.
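A minimal reading sketch for the extraction, using only fields named above (document, wikidataId); the exact JSON layout of each record is otherwise an assumption.

```python
import json
from collections import Counter

mentions_per_document = Counter()
linked = 0

# annotations.jsonl: one JSON record per extracted software mention.
with open("annotations.jsonl", encoding="utf-8") as f:
    for line in f:
        annotation = json.loads(line)
        mentions_per_document[str(annotation.get("document"))] += 1
        if annotation.get("wikidataId"):
            linked += 1  # mention disambiguated against Wikidata

print("documents with mentions:", len(mentions_per_document))
print("mentions linked to Wikidata:", linked)
```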
Statistics
CORD-19 version: 2021-07-26
- total Open Access full texts: 296,686
- with at least one software mention: 115,073
- total software name annotations: 652,518
- with linked Wikidata ID: 231,599
- associated field
- publisher: 107,421
- version: 188,724
- URL: 59,366
- references: 230,145
- associated bibliographical references: 92,573
- references with matched DOI: 49,350
- references with matched PMID: 32,895
- references with matched PMC ID: 18,741
License and acknowledgements
This dataset is licensed under a Creative Commons Attribution 4.0 International License.
We thank the Alfred P. Sloan Foundation and the Gordon and Betty Moore Foundation for supporting this work.
This dataset contains mining claim cases with the case disposition (status) of closed from the US Bureau of Land Management's (BLM) Mineral and Land Record System (MLRS). The BLM only requires that mining claims be identified down to the affected quarter section(s); as such, that is what the MLRS research map and public reports will most commonly reflect. Claim boundaries, as staked and monumented, are found in the accepted Notice/Certificate of Location as part of the official case file, managed by the BLM State Office having jurisdiction over the claim. The geometries are created in multiple ways but are primarily derived from Legal Land Descriptions (LLD) for the case and geocoded (mapped) using the Public Land Survey System (PLSS) derived from the most accurate survey data available through the BLM Cadastral Survey workforce. Geospatial representations might be missing for some cases that cannot be geocoded using the MLRS algorithm. Each case is given a data quality score based on how well it mapped. These scores can be lumped into seven groups to provide a simplified way to understand them.
Group 1 - Direct PLSS Match: scores "0", "1", "2", "3" should all have a match to the PLSS data. There are slight differences, but the primary expectation is that these match the PLSS.
Group 2 - Calculated PLSS Match: scores "4", "4.1", "5", "6", "7" and "8" were generated through a process of creating the geometry that is not a direct capture from the PLSS. They represent a best guess based on the underlying PLSS.
Group 3 - Mapped to Section: scores "8.1", "8.2", "8.3", "9" and "10" are mapped to the Section for various reasons (refer to log information in the data quality field).
Group 4 - Combination of mapped and unmapped areas: a score of "15" represents a case that has some portions that would map and others that do not.
Group 5 - No NLSDB Geometry, Only Attributes: scores "11", "12", "20", "21" and "22" do not have a match to the PLSS, no geometry is in the NLSDB, and only attributes exist in the data.
Group 6 - Mapped to County: scores of "25" map to the County.
Group 7 - Improved Geometry: scores of "100" are cases that have had their geometry edited by BLM staff using ArcGIS Pro or the MLRS bulk upload tool.
I always wanted to access a data set related to the world's population (country-wise). But I could not find a properly documented data set, so I just created one manually.
I knew I wanted to create a dataset, but I did not know how, so I started to search the internet for the content (population of countries). Obviously, Wikipedia was my first stop, but the results were not acceptable, and it listed only around 190 countries. I surfed the internet for quite some time until I stumbled upon a great website you have probably heard of: Worldometer. This was exactly the website I was looking for. It had more details than Wikipedia, and it also had more rows, i.e. more countries with their population.
Once I found the data, my next task was to get it into a usable form. Of course, I could not get the raw form of the data, and I did not mail them regarding it. Instead, I learned a new skill that is very important for a data scientist; I had read somewhere that to obtain data from websites you need this technique. Any guesses? Keep reading and you will find out in the next paragraph.
You are right, it's web scraping. I learned it so that I could convert the data into CSV format. I will give you the scraper code that I wrote; I also found a way to directly convert the pandas data frame to a CSV (comma-separated values) file and store it on my computer. Just go through my code and you will know what I'm talking about.
Below is the code that I used to scrape the data from the website:
[Screenshot of the scraping code]
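The original code is only available as a screenshot; below is an illustrative sketch of how such a scrape can be done with pandas. This is a reconstruction, not the author's code, and the Worldometer URL and table index are assumptions.

```python
import pandas as pd

# Worldometer's population-by-country page (assumed URL).
URL = "https://www.worldometers.info/world-population/population-by-country/"

# read_html parses every <table> on the page into a DataFrame;
# the first table is assumed to be the population table.
tables = pd.read_html(URL)
population = tables[0]

# Save the scraped table as a CSV file on disk.
population.to_csv("world_population.csv", index=False)
print(population.head())
```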
I couldn't have got the data without Worldometer, so special thanks to the website; it is because of them that I was able to get the data.
As far as I know, I don't have any questions to ask. You can let me know how you find ways to use the data, and let me know via a kernel if you find something interesting.
THIS RESOURCE IS NO LONGER IN SERVICE, documented August 23, 2016. The LitMiner software is a literature data-mining tool that facilitates the identification of major gene-regulation key players related to a user-defined field of interest in PubMed abstracts. The prediction of gene-regulatory relationships is based on co-occurrence analysis of key terms within the abstracts. LitMiner predicts relationships between key terms from the biomedical domain in four categories (genes, chemical compounds, diseases and tissues). The usefulness of the LitMiner system has been demonstrated recently in a study that reconstructed disease-related regulatory networks by promoter modeling, initiated by a LitMiner-generated primary gene list. To overcome the limitations and to verify and improve the data, we developed WikiGene, a Wiki-based curation tool that allows revision of the data by expert users over the internet. It is based on the annotation of key terms in article abstracts followed by statistical co-citation analysis of annotated key terms in order to predict relationships. Key terms belonging to four different categories are used for the annotation process:
- Genes: names of genes and gene products. Gene name recognition is based on Ensembl; synonyms and aliases are resolved.
- Chemical Compounds: names of chemical compounds and their respective aliases.
- Diseases and Phenotypes: names of diseases and phenotypes.
- Tissues and Organs: names of tissues and organs.
LitMiner uses a database of disease and phenotype terms for literature annotation. Currently, there are 2225 diseases or phenotypes, 801 tissues and organs, and 10477 compounds in the database.
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
It is not always easy to find databases from real-world manufacturing plants, especially mining plants. So, I would like to share this database with the community, which comes from one of the most important parts of a mining process: a flotation plant!
PLEASE HELP ME GET MORE DATASETS LIKE THIS FILLING A 30s SURVEY:
The main goal is to use this data to predict how much impurity is in the ore concentrate. As this impurity is measured every hour, if we can predict how much silica (impurity) is in the ore concentrate, we can help the engineers, giving them early information to take actions (empowering!). Hence, they will be able to take corrective actions in advance (reduce impurity, if it is the case) and also help the environment (reducing the amount of ore that goes to tailings as you reduce silica in the ore concentrate).
The first column shows the date and time range (from March 2017 until September 2017). Some columns were sampled every 20 seconds, others on an hourly basis.
The second and third columns are quality measures of the iron ore pulp right before it is fed into the flotation plant. Columns 4 to 8 are the most important variables that impact ore quality at the end of the process. From column 9 until column 22, we can see process data (level and air flow inside the flotation columns), which also impact ore quality. The last two columns are the final iron ore pulp quality measurements from the lab. The target is to predict the last column, which is the % of silica in the iron ore concentrate.
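A minimal loading sketch, assuming the data ships as a single CSV with the date column first and the "% Silica Concentrate" target as the last column; the filename and the comma decimal separator are assumptions.

```python
import pandas as pd

# Hypothetical filename; some distributions of this dataset use a comma as the
# decimal separator, hence decimal="," here.
df = pd.read_csv("flotation_plant.csv", decimal=",", parse_dates=[0])

target = df.columns[-1]       # the last column: % silica in the concentrate
features = df.columns[1:-2]   # exclude the date column and the two lab measurement columns

print(df[target].describe())
```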
I have been working on this dataset for at least six months and would like to see if the community can help to answer the following questions:
Is it possible to predict % Silica Concentrate every minute?
How many steps (hours) ahead can we predict % Silica in Concentrate? This would help engineers to act in a predictive and optimized way, mitigating the % of iron that could have gone to tailings.
Is it possible to predict % Silica in Concentrate without using the % Iron Concentrate column (as they are highly correlated)?
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset contains 832 biographies of literature writers retrieved from the Spanish Wikipedia: 416 biographies of women writers extracted from the category "Escritoras de España" (https://es.wikipedia.org/wiki/Categoría:Escritoras_de_España) and 416 biographies of male writers extracted from the category "Escritores de España del siglo XX" (https://es.wikipedia.org/wiki/Categoría:Escritores_de_España_del_siglo_XX).