License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
This dataset includes a list of over 37 million Wikipedia articles in 55 languages with quality scores by WikiRank (https://wikirank.net). Quality scores of articles are based on Wikipedia dumps from July 2018.
License: All files included in this dataset are released under CC0: https://creativecommons.org/publicdomain/zero/1.0/
Format:
• page_id -- the identifier of the Wikipedia article (int), e.g. 4519301
• revision_id -- the Wikipedia revision of the article (int), e.g. 24284811
• page_name -- the title of the Wikipedia article (utf-8), e.g. General relativity
• wikirank_quality -- quality score on a 0-100 scale
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
These datasets include lists of over 43 million Wikipedia articles in 55 languages with quality scores by WikiRank (https://wikirank.net). Additionally, the datasets contain the quality measures (metrics) that directly affect these scores. Quality measures were extracted from Wikipedia dumps from April 2022.
License: All files included in these datasets are released under CC BY 4.0: https://creativecommons.org/licenses/by/4.0/
Format:
• page_id -- the identifier of the Wikipedia article (int), e.g. 840191
• page_name -- the title of the Wikipedia article (utf-8), e.g. Sagittarius A*
• wikirank_quality -- quality score for the Wikipedia article on a 0-100 scale (as of April 1, 2022); a synthetic measure calculated from the metrics below (also included in the datasets)
• norm_len -- normalized "page length"
• norm_refs -- normalized "number of references"
• norm_img -- normalized "number of images"
• norm_sec -- normalized "number of sections"
• norm_reflen -- normalized "references per length ratio"
• norm_authors -- normalized "number of authors" (excluding bots and anonymous users)
• flawtemps -- flaw templates
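A minimal loading sketch for one of these files, assuming a tab-separated layout and a hypothetical filename (wikirank_2022_en.tsv); adjust the separator and path to the actual distribution.

```python
import pandas as pd

# Hypothetical filename and separator; the actual files may differ.
df = pd.read_csv("wikirank_2022_en.tsv", sep="\t")

# Highest-quality articles according to the synthetic WikiRank score.
top = df.sort_values("wikirank_quality", ascending=False)
print(top[["page_id", "page_name", "wikirank_quality"]].head(10))
```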
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
The "Wikipedia SQLite Portable DB" is a compact and efficient database derived from the Kensho Derived Wikimedia Dataset (KDWD). This dataset provides a condensed subset of raw Wikimedia data in a format optimized for natural language processing (NLP) research and applications.
I am not affiliated or partnered with Kensho in any way; I just really like this dataset because it gives my agents something easy to query.
Key Features:
- Contains over 5 million rows of data from English Wikipedia and Wikidata
- Stored in a portable SQLite database format for easy integration and querying
- Includes a link-annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base
- Ideal for NLP tasks, machine learning, data analysis, and research projects
The database consists of four main tables: pages, items, link_annotated_text, and properties (the tables referenced by the query examples below).
This dataset is derived from the Kensho Derived Wikimedia Dataset (KDWD), which is built from the English Wikipedia snapshot from December 1, 2019, and the Wikidata snapshot from December 2, 2019. The KDWD is a condensed subset of the raw Wikimedia data in a form that is helpful for NLP work, and it is released under the CC BY-SA 3.0 license.
Credits: The "Wikipedia SQLite Portable DB" is derived from the Kensho Derived Wikimedia Dataset (KDWD), created by the Kensho R&D group. The KDWD is based on data from Wikipedia and Wikidata, which are crowd-sourced projects supported by the Wikimedia Foundation. We would like to acknowledge and thank the Kensho R&D group for their efforts in creating the KDWD and making it available for research and development purposes.
By providing this portable SQLite database, we aim to make Wikipedia data more accessible and easier to use for researchers, data scientists, and developers working on NLP tasks, machine learning projects, and other data-driven applications. We hope that this dataset will contribute to the advancement of NLP research and the development of innovative applications utilizing Wikipedia data.
https://www.kaggle.com/datasets/kenshoresearch/kensho-derived-wikimedia-data/data
Tags: encyclopedia, wikipedia, sqlite, database, reference, knowledge-base, articles, information-retrieval, natural-language-processing, nlp, text-data, large-dataset, multi-table, data-science, machine-learning, research, data-analysis, data-mining, content-analysis, information-extraction, text-mining, text-classification, topic-modeling, language-modeling, question-answering, fact-checking, entity-recognition, named-entity-recognition, link-prediction, graph-analysis, network-analysis, knowledge-graph, ontology, semantic-web, structured-data, unstructured-data, data-integration, data-processing, data-cleaning, data-wrangling, data-visualization, exploratory-data-analysis, eda, corpus, document-collection, open-source, crowdsourced, collaborative, online-encyclopedia, web-data, hyperlinks, categories, page-views, page-links, embeddings
Usage with LIKE queries:

```python
import asyncio

import aiosqlite


class KenshoDatasetQuery:
    def __init__(self, db_file):
        self.db_file = db_file

    async def __aenter__(self):
        self.conn = await aiosqlite.connect(self.db_file)
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        await self.conn.close()

    async def search_pages_by_title(self, title):
        # Join pages with their Wikidata item and link-annotated text.
        query = """
            SELECT pages.page_id, pages.item_id, pages.title, pages.views,
                   items.labels AS item_labels, items.description AS item_description,
                   link_annotated_text.sections
            FROM pages
            JOIN items ON pages.item_id = items.id
            JOIN link_annotated_text ON pages.page_id = link_annotated_text.page_id
            WHERE pages.title LIKE ?
        """
        async with self.conn.execute(query, (f"%{title}%",)) as cursor:
            return await cursor.fetchall()

    async def search_items_by_label_or_description(self, keyword):
        query = """
            SELECT id, labels, description
            FROM items
            WHERE labels LIKE ? OR description LIKE ?
        """
        async with self.conn.execute(query, (f"%{keyword}%", f"%{keyword}%")) as cursor:
            return await cursor.fetchall()

    async def search_items_by_label(self, label):
        query = """
            SELECT id, labels, description
            FROM items
            WHERE labels LIKE ?
        """
        async with self.conn.execute(query, (f"%{label}%",)) as cursor:
            return await cursor.fetchall()

    # async def search_properties_by_label_or_desc...
```
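A short usage sketch for the class above; the database filename (kensho_wikipedia.db) is a hypothetical placeholder for wherever the SQLite file is stored.

```python
import asyncio


async def main():
    # Hypothetical path to the SQLite file; point this at the downloaded database.
    async with KenshoDatasetQuery("kensho_wikipedia.db") as db:
        rows = await db.search_pages_by_title("General relativity")
        for page_id, item_id, title, views, labels, description, sections in rows[:5]:
            print(page_id, title, views)


asyncio.run(main())
```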
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
This dataset includes a list of over 39 million Wikipedia articles in 55 languages with quality scores by WikiRank (https://wikirank.net). Quality scores of articles are based on Wikipedia dumps from May 2019; popularity and authors' interest are based on activity in April 2019.
License: All files included in this dataset are released under CC0: https://creativecommons.org/publicdomain/zero/1.0/
Format:
• page_id -- the identifier of the Wikipedia article (int), e.g. 4519301
• page_name -- the title of the Wikipedia article (utf-8), e.g. General relativity
• wikirank_quality -- quality score for the Wikipedia article on a 0-100 scale (as of May 1, 2019)
• popularity -- median of the daily number of page views of the Wikipedia article during April 2019
• authors_interest -- number of authors of the Wikipedia article during April 2019
This repository contains the scripts required to implement the Wikidata-based geocoding pipeline described in the accompanying paper.
- geocode.sh: Shell script for setting up and executing Stanford CoreNLP with the required language models and entitylink annotator. Automates preprocessing, named entity recognition (NER), and wikification across a directory of plain-text (.txt) files. Configured for both local execution and high-performance computing (HPC) environments.
- geocode.py: Python script that processes the list of extracted location entities (entities.txt) and retrieves latitude/longitude coordinates from Wikidata using Pywikibot. Handles redirects, missing pages, and missing coordinate values, returning standardized placeholder codes where necessary. Outputs results as a CSV file with columns for place name, latitude, longitude, and source file.
- geocode.sbatch: Optional SLURM submission script for running run_corenlp.sh on HPC clusters. Includes configurable resource requests for scalable processing of large corpora.
- README.md: Detailed README file including a line-by-line explanation of the geocode.sh file.
Together, these files provide a reproducible workflow for geocoding textual corpora via wikification, suitable for projects ranging from small-scale literary analysis to large-scale archival datasets.
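For illustration, a minimal sketch of the kind of Wikidata lookup geocode.py performs with Pywikibot; this is not the repository's actual code, and the resolution path via English Wikipedia titles and the None placeholders are assumptions.

```python
import pywikibot

site = pywikibot.Site("en", "wikipedia")


def lookup_coordinates(place_name):
    """Return (lat, lon) for a place name, or (None, None) when nothing is found."""
    page = pywikibot.Page(site, place_name)
    if page.isRedirectPage():                 # follow redirects
        page = page.getRedirectTarget()
    if not page.exists():                     # missing page
        return None, None
    item = pywikibot.ItemPage.fromPage(page)  # Wikipedia page -> Wikidata item
    item.get()
    if "P625" not in item.claims:             # P625 = coordinate location
        return None, None
    coord = item.claims["P625"][0].getTarget()
    return coord.lat, coord.lon


print(lookup_coordinates("Timbuktu"))
```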
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset accompanies a research paper that introduces a novel system designed to support the Wikipedia community in combating vandalism on the platform. The dataset has been prepared to enhance the accuracy and efficiency of Wikipedia patrolling in multiple languages.
The release of this comprehensive dataset aims to encourage further research and development in vandalism detection techniques, fostering a safer and more inclusive environment for the Wikipedia community. Researchers and practitioners can utilize this dataset to train and validate their models for vandalism detection and contribute to improving online platforms' content moderation strategies.
Dataset Details:
Related paper citation:
@inproceedings{10.1145/3580305.3599823,
author = {Trokhymovych, Mykola and Aslam, Muniza and Chou, Ai-Jou and Baeza-Yates, Ricardo and Saez-Trumper, Diego},
title = {Fair Multilingual Vandalism Detection System for Wikipedia},
year = {2023},
isbn = {9798400701030},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3580305.3599823},
doi = {10.1145/3580305.3599823},
abstract = {This paper presents a novel design of the system aimed at supporting the Wikipedia community in addressing vandalism on the platform. To achieve this, we collected a massive dataset of 47 languages, and applied advanced filtering and feature engineering techniques, including multilingual masked language modeling to build the training dataset from human-generated data. The performance of the system was evaluated through comparison with the one used in production in Wikipedia, known as ORES. Our research results in a significant increase in the number of languages covered, making Wikipedia patrolling more efficient to a wider range of communities. Furthermore, our model outperforms ORES, ensuring that the results provided are not only more accurate but also less biased against certain groups of contributors.},
booktitle = {Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
pages = {4981–4990},
numpages = {10},
location = {Long Beach, CA, USA},
series = {KDD '23}
}
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
This dataset is a cleaned up and annotated version of another dataset previously shared: https://figshare.com/articles/dataset/Wikidata_Constraints_Violations_-_July_2017/7712720
This dataset contains corrections for Wikidata constraint violations extracted from the July 1st, 2018 Wikidata full-history dump. It has been created as part of a work named Neural Knowledge Base Repairs by Thomas Pellissier Tanon and Fabian Suchanek. An example of code making use of this dataset is available on GitHub: https://github.com/Tpt/bass-materials/blob/master/corrections_learning.ipynb
The following constraints are considered:
* conflicts with: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Conflicts_with
* distinct values: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Unique_value
* inverse and symmetric: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Inverse and https://www.wikidata.org/wiki/Help:Property_constraints_portal/Symmetric
* item requires statement: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Item
* one of: https://www.wikidata.org/wiki/Help:Property_constraints_portal/One_of
* single value: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Single_value
* type: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Type
* value requires statement: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Target_required_claim
* value type: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Value_type
The constraints.tsv file contains the list of most of the Wikidata constraints considered in this dataset (beware, there could be some discrepancies for type, valueType, itemRequiresClaim and valueRequiresClaim constraints). It is a tab-separated file with the following columns:
1. constraint id: the URI of the Wikidata statement describing the constraint
2. property id: the URI of the property that is constrained
3. type id: the URI of the constraint type (type, value type, ...); it is a Wikidata item
4. 15 columns for the possible attributes of the constraint. If an attribute has multiple values, they are in the same cell, separated by a space. The columns are:
* regex: https://www.wikidata.org/wiki/Property:P1793
* exceptions: https://www.wikidata.org/wiki/Property:P2303
* group by: https://www.wikidata.org/wiki/Property:P2304
* items: https://www.wikidata.org/wiki/Property:P2305
* property: https://www.wikidata.org/wiki/Property:P2306
* namespace: https://www.wikidata.org/wiki/Property:P2307
* class: https://www.wikidata.org/wiki/Property:P2308
* relation: https://www.wikidata.org/wiki/Property:P2309
* minimal date: https://www.wikidata.org/wiki/Property:P2310
* maximum date: https://www.wikidata.org/wiki/Property:P2311
* maximum value: https://www.wikidata.org/wiki/Property:P2312
* minimal value: https://www.wikidata.org/wiki/Property:P2313
* status: https://www.wikidata.org/wiki/Property:P2316
* separator: https://www.wikidata.org/wiki/Property:P4155
* scope: https://www.wikidata.org/wiki/Property:P5314
The other files provide, for each constraint type, the list of all corrections extracted from the edit history. The format is one line per correction with the following tab-separated values:
1. constraint id
2. revision that fixed the constraint violation
3. first violation triple subject
4. first violation triple predicate
5. first violation triple object
6. second violation triple subject (blank if no second violation triple)
7. second violation triple predicate (blank if no second violation triple)
8. second violation triple object (blank if no second violation triple)
9. separator (not useful)
10. subject of the first triple in the correction
11. predicate of the first triple in the correction
12. object of the first triple in the correction
13. whether the first triple in the correction is an addition or a deletion
14. subject of the second triple in the correction (might not exist)
15. predicate of the second triple in the correction (might not exist)
16. object of the second triple in the correction (might not exist)
17. whether the second triple in the correction is an addition or a deletion (might not exist)
18. description of the subject of the first violation triple, encoded in JSON
19. description of the object of the first violation triple, encoded in JSON (might be empty for literals)
20. description of the term of the second triple that has not already been described by the two previous descriptions (might be empty for literals or if there is no second triple)
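A minimal parsing sketch for one of the per-constraint-type correction files, based on the column list above; the filename (type.tsv) and the assumption of plain tab-separated fields with no quoting are hypothetical.

```python
import csv
import json

# Hypothetical filename for one of the per-constraint-type correction files.
with open("type.tsv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t"):
        constraint_id, revision = row[0], row[1]
        first_violation = tuple(row[2:5])    # subject, predicate, object of the violating triple
        second_violation = tuple(row[5:8])   # blank strings if there is no second violation triple
        first_correction = tuple(row[9:13])  # subject, predicate, object, addition/deletion flag
        # Column 18 (index 17) holds a JSON description of the subject,
        # assuming all 20 columns are present in the row.
        subject_description = json.loads(row[17]) if len(row) > 17 and row[17] else {}
        print(constraint_id, first_violation, first_correction)
        break
```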
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Niger export data: Discover the potential in agriculture, mining, and energy. Unveil the prime partners and the key to economic growth.
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
This dataset contains corrections for Wikidata constraint violations extracted from the July 1st, 2018 Wikidata full-history dump.
The following constraints are considered:
* conflicts with: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Conflicts_with
* distinct values: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Unique_value
* inverse and symmetric: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Inverse and https://www.wikidata.org/wiki/Help:Property_constraints_portal/Symmetric
* item requires statement: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Item
* one of: https://www.wikidata.org/wiki/Help:Property_constraints_portal/One_of
* single value: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Single_value
* type: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Type
* value requires statement: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Target_required_claim
* value type: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Value_type
The constraints.tsv file contains the list of most of the Wikidata constraints considered in this dataset (beware, there could be some discrepancies for type, valueType, itemRequiresClaim and valueRequiresClaim constraints). It is a tab-separated file with the following columns:
* constraint id: the URI of the Wikidata statement describing the constraint
* property id: the URI of the property that is constrained
* type id: the URI of the constraint type (type, value type, ...); it is a Wikidata item
* 15 columns for the possible attributes of the constraint. If an attribute has multiple values, they are in the same cell, separated by a space. The columns are:
** regex: https://www.wikidata.org/wiki/Property:P1793
** exceptions: https://www.wikidata.org/wiki/Property:P2303
** group by: https://www.wikidata.org/wiki/Property:P2304
** items: https://www.wikidata.org/wiki/Property:P2305
** property: https://www.wikidata.org/wiki/Property:P2306
** namespace: https://www.wikidata.org/wiki/Property:P2307
** class: https://www.wikidata.org/wiki/Property:P2308
** relation: https://www.wikidata.org/wiki/Property:P2309
** minimal date: https://www.wikidata.org/wiki/Property:P2310
** maximum date: https://www.wikidata.org/wiki/Property:P2311
** maximum value: https://www.wikidata.org/wiki/Property:P2312
** minimal value: https://www.wikidata.org/wiki/Property:P2313
** status: https://www.wikidata.org/wiki/Property:P2316
** separator: https://www.wikidata.org/wiki/Property:P4155
** scope: https://www.wikidata.org/wiki/Property:P5314
The other files provide, for each constraint type, the list of all corrections extracted from the edit history. The format is one line per correction with the following tab-separated values:
* URI of the statement describing the constraint in Wikidata
* URI of the revision that solved the constraint violation
* subject, predicate and object of the triple that was violating the constraint (separated by a tab)
* the string "->"
* subject, predicate and object of the triple(s) of the correction, each followed by "http://wikiba.se/history/ontology#deletion" if the triple has been removed or "http://wikiba.se/history/ontology#addition" if the triple has been added. Each component of these values is separated by a tab.
More detailed explanations are provided in a forthcoming paper.
This data set depicts the non-surveyed boundaries of active (recorded or interim) and closed federal mining claims within the State of Alaska. Each mining claim is represented as an individual region, identified by the casefile serial number, which can be linked to background data via the ALIS (Alaska Land Information System). Mining claim boundaries were identified in location notices from the original casefiles. They were plotted on maps based on rough sketches, claimant maps or physical descriptions.
License: not specified (https://academictorrents.com/nolicensespecified)
A BitTorrent file to download data with the title '[Coursera ] Text Mining and Analytics'
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This table shows relevant data for the economic sector 'mining and quarrying', e.g. the number of persons employed, costs and revenues, turnover and other financial results. The figures can be divided by a number of branches according to Statistics Netherlands' Standard Industrial Classification of all Economic Activities 1993 (SIC'93).
The survey questionnaire was changed slightly in 2007. Up to and including 2006, wage subsidies were counted as (other) business returns; from 2007 onwards these subsidies are deducted from business costs. Because of these changes, results for 2007 are not fully comparable with results for 2006. The effect of these changes on business returns and business costs is small for most of the branches.
Data available from 2006 - 2008.
Status of the figures: All data are definite.
Changes as of 1 September 2011:
This table has been stopped. Two important points in the processing of the data on 2009 have changed:
- a new version of the Standard Industrial Classification of all Economic Activities has been implemented (SIC 2008);
- the statistical unit has been changed.
Due to these changes, the figures are no longer comparable to those of the previous years. Therefore a new table has been started from 2009 onwards (see also heading 3).
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
The Bible (or Biblia, in Greek) is a collection of sacred texts or scriptures that Jews and Christians consider to be a product of divine inspiration and a record of the relationship between God and humans (Wikipedia). For data-mining purposes, the scriptures lend themselves to NLP, classification, sentiment analysis and other topics at the intersection of data science and theology.
Here you will find the following bible versions in sql, sqlite, xml, csv, and json format:
American Standard-ASV1901 (ASV)
Bible in Basic English (BBE)
Darby English Bible (DARBY)
King James Version (KJV)
Webster's Bible (WBT)
World English Bible (WEB)
Young's Literal Translation (YLT)
Each verse is accessed by a unique key, the combination of the BOOK+CHAPTER+VERSE id.
Example:
Genesis 1:1 (Genesis chapter 1, verse 1) = 01001001 (01 001 001)
Exodus 2:3 (Exodus chapter 2, verse 3) = 02002003 (02 002 003)
The verse-id system is used for faster, simplified queries.
For instance: 01001001 - 02001005 would capture all verses between Genesis 1:1 through Exodus 1:5.
Written simply:
SELECT * FROM bible.t_asv WHERE id BETWEEN 01001001 AND 02001005
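As an illustration of the id scheme, a small helper that builds the 8-digit verse key from book, chapter and verse numbers (zero-padded to 2+3+3 digits); this is just a sketch of the convention described above.

```python
def verse_id(book: int, chapter: int, verse: int) -> str:
    """Build the BOOK+CHAPTER+VERSE key, e.g. Genesis 1:1 -> '01001001'."""
    return f"{book:02d}{chapter:03d}{verse:03d}"


print(verse_id(1, 1, 1))   # 01001001 (Genesis 1:1)
print(verse_id(2, 2, 3))   # 02002003 (Exodus 2:3)
```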
Coordinating Tables
There is also a number-to-book key (key_english table), a cross-reference list (cross_reference table), and a bible key containing meta information about the included translations (bible_version_key table). See the SQL table layout below. These tables work together, providing a great basis for a bible-reading and cross-referencing app. In addition, each book is marked with a particular genre, mapping to the number-to-genre key (key_genre_english table), and common abbreviations for each book can be looked up in the abbreviations list (key_abbreviations_english table). While it's expected that your programs would use the verse-id system, book #, chapter #, and verse # columns have been included in the bible version tables.
A Valuable Cross-Reference Table
A very special and valuable addition to these databases is the extensive cross-reference table. It was created from the project at http://www.openbible.info/labs/cross-references/ (see the .txt version included from the http://www.openbible.info website). It is extremely useful in bible study for discovering related scriptures. For any given verse, you simply query vid (verse id), and a list of rows will be returned. Each of those rows has a rank (r) for relevance, a start verse (sv), and an end verse (ev) if there is one.
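A minimal lookup sketch using Python's built-in sqlite3 module and the column names given above (vid, r, sv, ev); the database filename and the exact table schema are assumptions.

```python
import sqlite3

# Hypothetical filename for the SQLite build of the database.
conn = sqlite3.connect("bible.db")

# Cross-references for John 3:16 (book 43, chapter 3, verse 16), best-ranked first.
rows = conn.execute(
    "SELECT r, sv, ev FROM cross_reference WHERE vid = ? ORDER BY r DESC",
    (43003016,),
).fetchall()
for rank, start_verse, end_verse in rows:
    print(rank, start_verse, end_verse)
conn.close()
```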
Basic Web Interaction
The web folder contains two php files. Edit the first few lines of index.php to match your server's settings, then place these in a folder on your webserver. The references search box accepts multiple comma-separated values (e.g. John 3:16, Rom 3:23, 1 Jn 1:9, Romans 10:9-10). You can also directly link to a verse by altering the URI: http://localhost/index.php?b=John 3:16, Rom 3:23, 1 Jn 1:9, Romans 10:9-10
In the CSV folder, you will find (in the same list order as the other formats):
[Table layout screenshots: bible_version_key, key_abbreviations_english, key_english, key_genre_english, t_version]
On behalf of the original contributors (GitHub).
WordNet as an additional semantic resource for NLP
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Burkina Faso export data: Discover how this West African nation thrives on agriculture and mining exports, targeting growth and employment.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Softcite software mention extraction from the CORD-19 publications
This dataset is the result of the extraction of software mentions from the set of publications of the CORD-19 corpus (https://allenai.org/data/cord-19) by the Softcite software recognizer (SciBERT-CRF fine-tuned model), see https://github.com/ourresearch/software-mentions.
The CORD-19 version used for this dataset is the one dated 2021-07-26, using the metadata.csv file only. We re-harvested the PDFs with https://github.com/kermitt2/article-dataset-builder in order to also extract the coordinates of software mentions in the PDFs and to take advantage of the latest version of GROBID for better full-text extraction from PDF. We also harvested 61,230 more full texts than the standard CORD-19 distribution.
Note that this is the third version of this dataset (version 0.3.0). The previous Softcite software mention extractions from CORD-19 were based on the 2020-09-11 and 2021-03-22 versions. The new version covers a larger set of documents and uses an improved version of the extraction tools.
Data format
The extraction consists of 3 JSON files:
annotations.jsonl contains the individual software annotations including software name and possible attached attributes (publisher, URL and version). Each annotation is associated with coordinates expressed as bounding boxes in the original PDF. See Coordinates of structures in the original PDF for more details on the coordinate format.
The context of citation is the sentence where the software name and its attributes are extracted. It is added to the JSON structure (field context), as well as the identifier of the document where the annotation belongs (field document, pointing to entries available in documents.json) and a list of bibliographical references attached to the software name (field references, pointing to entries available in references.json, with the used reference marker string). See https://github.com/ourresearch/software-mentions for more details on the extracted attributes.
If the software name was successfully disambiguated against Wikidata ("entity linking"), it appears in the field wikidataId as a Wikidata entity identifier and in the field wikipediaExternalRef as a Wikipedia PageID from the English Wikipedia. Entity linking is performed with entity-fishing.
documents.jsonl contains the metadata of all the CORD-19 documents containing at least one software annotation. The metadata are given as a CrossRef JSON structure. The abstract should be included in the metadata most of the time, as well as some complements extracted by GROBID directly from the PDF. In addition, the size of the pages and the unique file path to the PDF are included to allow annotations directly on the PDF (see Coordinates of structures in the original PDF for more details on the PDF annotation display mechanism).
references.jsonl contains the parsed reference entries associated with software mentions. These references are given in the field tei, encoded in the XML TEI format of the GROBID extraction. The extracted raw references have been matched against CrossRef with biblio-glutton to get a DOI and more complete metadata.
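A minimal reading sketch for the extraction, using only fields named above (document, wikidataId); the exact JSON layout of each record is otherwise an assumption.

```python
import json
from collections import Counter

mentions_per_document = Counter()
linked = 0

# annotations.jsonl: one JSON record per extracted software mention.
with open("annotations.jsonl", encoding="utf-8") as f:
    for line in f:
        annotation = json.loads(line)
        mentions_per_document[str(annotation.get("document"))] += 1
        if annotation.get("wikidataId"):
            linked += 1  # mention disambiguated against Wikidata

print("documents with mentions:", len(mentions_per_document))
print("mentions linked to Wikidata:", linked)
```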
Statistics
CORD-19 version: 2021-07-26
- total Open Access full texts: 296,686
- with at least one software mention: 115,073
- total software name annotations: 652,518
- with linked Wikidata ID: 231,599
- associated field
- publisher: 107,421
- version: 188,724
- URL: 59,366
- references: 230,145
- associated bibliographical references: 92,573
- references with matched DOI: 49,350
- references with matched PMID: 32,895
- references with matched PMC ID: 18,741
License and acknowledgements
This dataset is licensed under a Creative Commons Attribution 4.0 International License.
We thank the Alfred P. Sloan Foundation and the Gordon and Betty Moore Foundation for supporting this work.
This dataset contains mining claim cases with the case disposition (status) of closed from the US Bureau of Land Management's (BLM) Mineral and Land Record System (MLRS). The BLM only requires that mining claims be identified down to the affected quarter section(s); as such, that is what the MLRS research map and public reports will most commonly reflect. Claim boundaries, as staked and monumented, are found in the accepted Notice/Certificate of Location as part of the official case file, managed by the BLM State Office having jurisdiction over the claim. The geometries are created in multiple ways but are primarily derived from Legal Land Descriptions (LLD) for the case and geocoded (mapped) using the Public Land Survey System (PLSS) derived from the most accurate survey data available through the BLM Cadastral Survey workforce. Geospatial representations might be missing for some cases that cannot be geocoded using the MLRS algorithm. Each case is given a data quality score based on how well it mapped. These scores can be lumped into seven groups to provide a simplified way to understand them.
Group 1 - Direct PLSS Match: scores "0", "1", "2", "3" should all have a match to the PLSS data. There are slight differences, but the primary expectation is that these match the PLSS.
Group 2 - Calculated PLSS Match: scores "4", "4.1", "5", "6", "7" and "8" were generated through a process of creating the geometry that is not a direct capture from the PLSS. They represent a best guess based on the underlying PLSS.
Group 3 - Mapped to Section: scores "8.1", "8.2", "8.3", "9" and "10" are mapped to the Section for various reasons (refer to log information in the data quality field).
Group 4 - Combination of mapped and unmapped areas: a score of "15" represents a case that has some portions that would map and others that do not.
Group 5 - No NLSDB Geometry, Only Attributes: scores "11", "12", "20", "21" and "22" do not have a match to the PLSS, no geometry is in the NLSDB, and only attributes exist in the data.
Group 6 - Mapped to County: scores of "25" map to the County.
Group 7 - Improved Geometry: scores of "100" are cases that have had their geometry edited by BLM staff using ArcGIS Pro or the MLRS bulk upload tool.
I always wanted to access a data set related to the world's population (country-wise). But I could not find a properly documented data set, so I just created one manually.
I knew I wanted to create a dataset, but I did not know how, so I started to search the internet for the content (population of countries). Obviously, Wikipedia was my first stop, but the results were not acceptable, and it listed only around 190 countries. I surfed the internet for quite some time until I stumbled upon a great website you have probably heard of: Worldometer. This was exactly the website I was looking for. It had more details than Wikipedia, and it also had more rows, i.e. more countries with their population.
Once I found the data, my next task was to get it into a usable form. Of course, I could not get the raw form of the data, and I did not mail them regarding it. Instead, I learned a new skill that is very important for a data scientist; I had read somewhere that to obtain data from websites you need this technique. Any guesses? Keep reading and you will find out in the next paragraph.
You are right, it's web scraping. I learned it so that I could convert the data into CSV format. I will give you the scraper code that I wrote; I also found a way to directly convert the pandas data frame to a CSV (comma-separated values) file and store it on my computer. Just go through my code and you will know what I'm talking about.
Below is the code that I used to scrape the data from the website:
[Screenshot of the scraping code]
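The original code is only available as a screenshot; below is an illustrative sketch of how such a scrape can be done with pandas. This is a reconstruction, not the author's code, and the Worldometer URL and table index are assumptions.

```python
import pandas as pd

# Worldometer's population-by-country page (assumed URL).
URL = "https://www.worldometers.info/world-population/population-by-country/"

# read_html parses every <table> on the page into a DataFrame;
# the first table is assumed to be the population table.
tables = pd.read_html(URL)
population = tables[0]

# Save the scraped table as a CSV file on disk.
population.to_csv("world_population.csv", index=False)
print(population.head())
```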
I couldn't have got the data without Worldometer, so special thanks to the website; it is because of them that I was able to get the data.
As far as I know, I don't have any questions to ask. You can let me know how you find ways to use the data, and let me know via a kernel if you find something interesting.
THIS RESOURCE IS NO LONGER IN SERVICE, documented August 23, 2016. The LitMiner software is a literature data-mining tool that facilitates the identification of major gene-regulation key players related to a user-defined field of interest in PubMed abstracts. The prediction of gene-regulatory relationships is based on co-occurrence analysis of key terms within the abstracts. LitMiner predicts relationships between key terms from the biomedical domain in four categories (genes, chemical compounds, diseases and tissues). The usefulness of the LitMiner system has been demonstrated recently in a study that reconstructed disease-related regulatory networks by promoter modeling, initiated by a LitMiner-generated primary gene list. To overcome the limitations and to verify and improve the data, we developed WikiGene, a Wiki-based curation tool that allows revision of the data by expert users over the internet. It is based on the annotation of key terms in article abstracts followed by statistical co-citation analysis of annotated key terms in order to predict relationships. Key terms belonging to four different categories are used for the annotation process:
- Genes: names of genes and gene products. Gene name recognition is based on Ensembl; synonyms and aliases are resolved.
- Chemical Compounds: names of chemical compounds and their respective aliases.
- Diseases and Phenotypes: names of diseases and phenotypes.
- Tissues and Organs: names of tissues and organs.
LitMiner uses a database of disease and phenotype terms for literature annotation. Currently, there are 2225 diseases or phenotypes, 801 tissues and organs, and 10477 compounds in the database.
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
It is not always easy to find databases from real-world manufacturing plants, especially mining plants. So, I would like to share this database with the community, which comes from one of the most important parts of a mining process: a flotation plant!
PLEASE HELP ME GET MORE DATASETS LIKE THIS FILLING A 30s SURVEY:
The main goal is to use this data to predict how much impurity is in the ore concentrate. As this impurity is measured every hour, if we can predict how much silica (impurity) is in the ore concentrate, we can help the engineers, giving them early information to take actions (empowering!). Hence, they will be able to take corrective actions in advance (reduce impurity, if it is the case) and also help the environment (reducing the amount of ore that goes to tailings as you reduce silica in the ore concentrate).
The first column shows the date and time range (from March 2017 until September 2017). Some columns were sampled every 20 seconds, others on an hourly basis.
The second and third columns are quality measures of the iron ore pulp right before it is fed into the flotation plant. Columns 4 to 8 are the most important variables that impact ore quality at the end of the process. From column 9 until column 22, we can see process data (level and air flow inside the flotation columns), which also impact ore quality. The last two columns are the final iron ore pulp quality measurements from the lab. The target is to predict the last column, which is the % of silica in the iron ore concentrate.
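A minimal loading sketch, assuming the data ships as a single CSV with the date column first and the "% Silica Concentrate" target as the last column; the filename and the comma decimal separator are assumptions.

```python
import pandas as pd

# Hypothetical filename; some distributions of this dataset use a comma as the
# decimal separator, hence decimal="," here.
df = pd.read_csv("flotation_plant.csv", decimal=",", parse_dates=[0])

target = df.columns[-1]       # the last column: % silica in the concentrate
features = df.columns[1:-2]   # exclude the date column and the two lab measurement columns

print(df[target].describe())
```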
I have been working on this dataset for at least six months and would like to see if the community can help to answer the following questions:
Is it possible to predict % Silica Concentrate every minute?
How many steps (hours) ahead can we predict % Silica in Concentrate? This would help engineers to act in a predictive and optimized way, mitigating the % of iron that could have gone to tailings.
Is it possible to predict % Silica in Concentrate without using the % Iron Concentrate column (as they are highly correlated)?
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset contains 832 biographies of literature writers retrieved from the Spanish Wikipedia: 416 biographies of women writers extracted from the category "Escritoras de España" (https://es.wikipedia.org/wiki/Categoría:Escritoras_de_España) and 416 biographies of male writers extracted from the category "Escritores de España del siglo XX" (https://es.wikipedia.org/wiki/Categoría:Escritores_de_España_del_siglo_XX).