100+ datasets found
  1. Data Mining Project 1 Sapfile

    • kaggle.com
    zip
    Updated Jan 31, 2019
    Cite
    Prutchakorn (2019). Data Mining Project 1 Sapfile [Dataset]. https://www.kaggle.com/prutchakorn/data-mining-project-1-sapfile
    Explore at:
    zip (2244 bytes)
    Dataset updated
    Jan 31, 2019
    Authors
    Prutchakorn
    Description

    Dataset

    This dataset was created by Prutchakorn

    Contents

  2. ghtorrent-projects Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, txt
    Updated Jul 17, 2021
    Cite
    Marios Papachristou; Marios Papachristou (2021). ghtorrent-projects Dataset [Dataset]. http://doi.org/10.5281/zenodo.5111043
    Explore at:
    txt, bin
    Dataset updated
    Jul 17, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marios Papachristou; Marios Papachristou
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A hypergraph dataset mined from the GHTorrent project is presented. The dataset contains two files:

    1. project_members.txt: Contains GitHub projects with at least 2 contributors and the corresponding contributors (as a hyperedge). The format of the data is:

    2. num_followers.txt: Contains all GitHub users and their number of followers.

    The artifact also contains the SQL queries used to obtain the data from GHTorrent (schema).
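A sketch of how such a hyperedge file might be consumed follows. The exact line format is not reproduced above, so the whitespace-separated layout assumed here is hypothetical; the toy project and user names are invented:

```python
def parse_hyperedges(lines):
    """Read project-contributor hyperedges.

    Assumed (hypothetical) layout: one project per line,
    "<project_id> <user_1> <user_2> ...".
    """
    hypergraph = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 3:  # the dataset only keeps projects with >= 2 contributors
            hypergraph[parts[0]] = set(parts[1:])
    return hypergraph

sample = ["p1 alice bob", "p2 carol dave erin", "p3 frank"]
hypergraph = parse_hyperedges(sample)  # "p3" is dropped: only 1 contributor
```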

  3. Africa - PowerMining Projects Database

    • data.subak.org
    • cloud.csiss.gmu.edu
    • +3more
    csv
    Updated Feb 16, 2023
    Cite
    World Bank Group (2023). Africa - PowerMining Projects Database [Dataset]. https://data.subak.org/pl/dataset/africa-powermining-projects-database-2014
    Explore at:
    csv
    Dataset updated
    Feb 16, 2023
    Dataset provided by
    World Bank (http://worldbank.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Africa
    Description

    "The Africa Power–Mining Database 2014 shows ongoing and forthcoming mining projects in Africa categorized by the type of mineral, ore grade, and size of the project. The database draws on basic mining data from Infomine surveys, the United States Geological Survey, annual reports, technical reports, feasibility studies, investor presentations, sustainability reports on property-owner websites or filed in public domains, and mining websites (Mining Weekly, Mining Journal, Mbendi, Mining-technology, and Miningmx). Comprising 455 projects in 28 SSA countries with each project’s ore reserve value assessed at more than $250 million, the database collates publicly available and proprietary information. It also provides a panoramic view of projects operating in 2000–12 and anticipated demand in 2020. The analysis is presented over three timeframes: pre-2000, 2001–12, and 2020 (each containing the projects from the previous period except for those closing during that previous period)."

  4. Data from: Community-Scale Attic Retrofit and Home Energy Upgrade Data...

    • datasets.ai
    • data.openei.org
    • +3more
    33, 53, 55, 8
    Updated Sep 11, 2024
    + more versions
    Cite
    Department of Energy (2024). Community-Scale Attic Retrofit and Home Energy Upgrade Data Mining - Hot Dry Climate [Dataset]. https://datasets.ai/datasets/community-scale-attic-retrofit-and-home-energy-upgrade-data-mining-hot-dry-climate
    Explore at:
    55, 8, 33, 53
    Dataset updated
    Sep 11, 2024
    Dataset provided by
    United States Department of Energy (http://energy.gov/)
    Authors
    Department of Energy
    Description

    Retrofitting is an essential element of any comprehensive strategy for improving residential energy efficiency. The residential retrofit market is still developing, and program managers must develop innovative strategies to increase uptake and promote economies of scale. Residential retrofitting remains a challenging proposition to sell to homeowners, because awareness levels are low and financial incentives are lacking.

    The U.S. Department of Energy's Building America research team, Alliance for Residential Building Innovation (ARBI), implemented a project to increase residential retrofits in Davis, California. The project used a neighborhood-focused strategy for implementation and a low-cost retrofit program that focused on upgraded attic insulation and duct sealing. ARBI worked with a community partner, the not-for-profit Cool Davis Initiative, as well as selected area contractors to implement a strategy that sought to capitalize on the strong local expertise of partners and the unique aspects of the Davis, California, community. Working with community partners also allowed ARBI to collect and analyze data about effective messaging tactics for community-based retrofit programs.

    ARBI expected this project, called Retrofit Your Attic, to achieve higher uptake than other retrofit projects, because it emphasized a low-cost, one-measure retrofit program. However, this was not the case. The program used a strategy that focused on attics-including air sealing, duct sealing, and attic insulation-as a low-cost entry for homeowners to complete home retrofits. The price was kept below $4,000 after incentives; both contractors in the program offered the same price. The program completed only five retrofits. Interestingly, none of those homeowners used the one-measure strategy. All five homeowners were concerned about cost, comfort, and energy savings and included additional measures in their retrofits. The low-cost, one-measure strategy did not increase the uptake among homeowners, even in a well-educated, affluent community such as Davis.

    This project has two primary components. One is to complete attic retrofits on a community scale in the hot-dry climate of Davis, CA. Sufficient data will be collected on these projects to include them in the BAFDR. Additionally, ARBI is working with contractors to obtain building and utility data from a large set of retrofit projects in CA (hot-dry). These projects are to be uploaded into the BAFDR.

  5. Educational Attainment in North Carolina Public Schools: Use of statistical...

    • data.mendeley.com
    Updated Nov 14, 2018
    Cite
    Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
    Explore at:
    Dataset updated
    Nov 14, 2018
    Authors
    Scott Herford
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    North Carolina
    Description

    The purpose of data mining analysis is always to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset; before doing any work on the data, it has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, the performance did not improve much after using clustering prior to classification. The reason may be that the features we selected to perform clustering on are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.

    From the dimensionality reduction perspective: clustering is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension loses a lot of information, since clustering techniques are based on a metric of 'distance', and at high dimensions Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.

    From the creating-new-features perspective: clustering analysis creates labels based on the patterns of the data, which brings uncertainty into the data. When using clustering prior to classification, the decision on the number of clusters will strongly affect the performance of the clustering, and in turn the performance of the classification. If the subset of features we apply clustering techniques to is well suited for it, it might increase the overall classification performance. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better.

    We did not lock in the clustering outputs using a random_state, in order to see whether they were stable. Our assumption was that if the results vary highly from run to run (which they definitely did), the data may simply not cluster well with the selected methods. The ramification we saw was that our results are not much better than random when applying clustering in the data preprocessing.

    Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to revise the models from time to time as things change.
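The cluster-label-as-feature idea discussed above can be sketched with a toy k-means. The naive implementation and the two-blob data below are illustrative assumptions, not the project's actual code:

```python
import math

def kmeans(points, k, iters=20):
    """Tiny k-means used to turn raw features into a cluster-id feature.
    Naive initialization (first k points) is enough for a sketch."""
    centroids = [p for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest centroid by Euclidean distance
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for i, p in enumerate(points) if labels[i] == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(xs) for xs in zip(*members))
    return labels

# two well-separated toy blobs; the cluster id becomes an engineered feature
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels = kmeans(points, k=2)
augmented = [p + (lab,) for p, lab in zip(points, labels)]
```

On data this clean the cluster id separates the blobs perfectly; as the text notes, on high-dimensional real data the same label can carry almost no information.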

  6. Knowledge Graph: tyrolean mining documents 15th and 16th century

    • zenodo.org
    bin
    Updated Sep 26, 2024
    Cite
    Gerald Hiebel; Gerald Hiebel; Elisabeth Gruber-Tokić; Elisabeth Gruber-Tokić; Milena Peralta Friedburg; Milena Peralta Friedburg; Brigit Danthine; Brigit Danthine (2024). Knowledge Graph: tyrolean mining documents 15th and 16th century [Dataset]. http://doi.org/10.5281/zenodo.6276586
    Explore at:
    bin
    Dataset updated
    Sep 26, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gerald Hiebel; Gerald Hiebel; Elisabeth Gruber-Tokić; Elisabeth Gruber-Tokić; Milena Peralta Friedburg; Milena Peralta Friedburg; Brigit Danthine; Brigit Danthine
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains a Knowledge Graph (.nq file) of two historical mining documents: “Verleihbuch der Rattenberger Bergrichter” (Hs. 37, 1460–1463) and “Schwazer Berglehenbuch” (Hs. 1587, approx. 1515), stored by the Tyrolean Regional Archive, Innsbruck (Austria). Users of the KG may explore the montanistic network and the relations between people, claims and mines in late medieval Tyrol. The core regions concern the districts of Schwaz and Kufstein (Tyrol, Austria).

    The ontology used to represent the claims is CIDOC CRM, an ISO-certified ontology for cultural heritage documentation. Supported by the Karma tool, the KG is generated as RDF (Resource Description Framework). The generated RDF data is imported into a triplestore, in this case GraphDB, and then displayed visually. This puts the data from the early mining texts into a semantically structured context and makes the mutual relationships between people, places and mines visible.
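An .nq (N-Quads) file like this one is a list of subject-predicate-object-graph statements. As a minimal sketch of that shape (example IRIs below are invented; P89_falls_within is a real CIDOC CRM property; for real use a library such as rdflib is the safer choice):

```python
def parse_nquads_line(line):
    """Minimal N-Quads reader for plain-IRI statements of the form
    "<s> <p> <o> [<g>] ." -- no literals; real parsers handle far more."""
    parts = line.strip().rstrip(" .").split()
    subject, predicate, obj = parts[0], parts[1], parts[2]
    graph = parts[3] if len(parts) > 3 else None
    return subject, predicate, obj, graph

quad = parse_nquads_line(
    "<http://example.org/claim/42> "
    "<http://www.cidoc-crm.org/cidoc-crm/P89_falls_within> "
    "<http://example.org/place/schwaz> "
    "<http://example.org/graph/hs37> ."
)
```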

    Both documents and the Knowledge Graph were processed and generated by the research team of the project “Text Mining Medieval Mining Texts”. The research project (2019-2022) was carried out at the University of Innsbruck and funded by the go!digital Next Generation programme of the Austrian Academy of Sciences.

    Citable transcripts of the historical documents are available online:
    Hs. 37 DOI: 10.5281/zenodo.6274562
    Hs. 1587 DOI: 10.5281/zenodo.6274928

  7. Data from: Twitter Big Data as a Resource for Exoskeleton Research: A...

    • ieee-dataport.org
    Updated Oct 22, 2022
    + more versions
    Cite
    Nirmalya Thakur (2022). Twitter Big Data as a Resource for Exoskeleton Research: A Large-Scale Dataset of about 140,000 Tweets and 100 Research Questions [Dataset]. http://doi.org/10.21227/r5mv-ax79
    Explore at:
    Dataset updated
    Oct 22, 2022
    Dataset provided by
    IEEE Dataport
    Authors
    Nirmalya Thakur
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please cite the following paper when using this dataset: N. Thakur, "Twitter Big Data as a Resource for Exoskeleton Research: A Large-Scale Dataset of about 140,000 Tweets from 2017–2022 and 100 Research Questions", Journal of Analytics, Volume 1, Issue 2, 2022, pp. 72-97, DOI: https://doi.org/10.3390/analytics1020007

    Abstract: Exoskeleton technology has been rapidly advancing in the recent past due to its multitude of applications and diverse use cases in assisted living, military, healthcare, firefighting, and Industry 4.0. The exoskeleton market is projected to grow to multiple times its current value within the next two years. Therefore, it is crucial to study the degree and trends of user interest, views, opinions, perspectives, attitudes, acceptance, feedback, engagement, buying behavior, and satisfaction towards exoskeletons, for which the availability of Big Data of conversations about exoskeletons is necessary. The Internet of Everything style of today’s living, characterized by people spending more time on the internet than ever before, with a specific focus on social media platforms, holds the potential for the development of such a dataset by mining relevant social media conversations. Twitter, one such social media platform, is highly popular amongst all age groups, and the topics found in its conversation paradigms include emerging technologies such as exoskeletons.

    To address this research challenge, this work makes two scientific contributions to this field. First, it presents an open-access dataset of about 140,000 Tweets about exoskeletons that were posted in a 5-year period from 21 May 2017 to 21 May 2022. Second, based on a comprehensive review of the recent works in the fields of Big Data, Natural Language Processing, Information Retrieval, Data Mining, Pattern Recognition, and Artificial Intelligence that may be applied to relevant Twitter data for advancing research, innovation, and discovery in the field of exoskeleton research, a total of 100 Research Questions are presented for researchers to study, analyze, evaluate, ideate, and investigate based on this dataset.

  8. Titanic Datamining project Yousef

    • kaggle.com
    zip
    Updated Dec 19, 2023
    Cite
    Dr THABIT FURSAN (2023). Titanic Datamining project Yousef [Dataset]. https://www.kaggle.com/datasets/drthabitfursan/titanic-datamining-project-yousefib/data
    Explore at:
    zip (22544 bytes)
    Dataset updated
    Dec 19, 2023
    Authors
    Dr THABIT FURSAN
    Description

    Dataset

    This dataset was created by Dr THABIT FURSAN

    Contents

  9. Tokenized Forms of Jane Austen Novels with Positional Information

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Sep 24, 2024
    Cite
    Duckworth, Tyler J (2024). Tokenized Forms of Jane Austen Novels with Positional Information [Dataset]. https://search.dataone.org/view/sha256%3Ad5fa1267f6f5030c07d81a0d2a2e4deaad316b4d9aabfb88d937ba48e93a4ce5
    Explore at:
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Duckworth, Tyler J
    Description

    This dataset contains tokenized forms of four Jane Austen novels sourced from Project Gutenberg (Emma, Persuasion, Pride and Prejudice, and Sense and Sensibility) that are broken down by chapter (and volume where appropriate). Each file also includes positional data for each row, which will be used for further analysis. This was created to hold the data for the final project for COSC426: Introduction to Data Mining, a class at the University of Tennessee.
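A per-token row with positional data, as described above, might be produced along these lines (the row schema and punctuation handling here are illustrative assumptions, not the dataset's documented format):

```python
def tokenize_with_positions(text, novel, chapter):
    """One row per token, carrying novel, chapter, and word position."""
    rows = []
    for pos, raw in enumerate(text.split()):
        token = raw.strip(".,;!?\"'()").lower()  # crude punctuation strip
        if token:
            rows.append({"novel": novel, "chapter": chapter,
                         "position": pos, "token": token})
    return rows

rows = tokenize_with_positions(
    "It is a truth universally acknowledged,", "Pride and Prejudice", 1)
```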

  10. Mine Project Approval Boundary

    • researchdata.edu.au
    • data.nsw.gov.au
    Updated Jul 24, 2024
    Cite
    data.nsw.gov.au (2024). Mine Project Approval Boundary [Dataset]. https://researchdata.edu.au/mine-project-approval-boundary/3362577
    Explore at:
    Dataset updated
    Jul 24, 2024
    Dataset provided by
    data.nsw.gov.au
    Area covered
    Description

    The Project Approval Boundary spatial data set provides information on the location of the project approvals granted for each mine in NSW by an approval authority (either the NSW Department of Planning or local council). This information may not align with the mine authorisation (i.e. mine title etc.) granted under the Mining Act 1992. This information is created and submitted by each large mine operator to fulfill the Final Landuse and Rehabilitation Plan data submission requirements under Schedule 8A of the Mining Regulation 2016.

    The collection of this spatial data is administered by the Resources Regulator in NSW, which reviews the data submitted for assessment purposes. In some cases, information provided may contain inaccuracies that require adjustment following the assessment process by the Regulator. The Regulator will request data resubmission if issues are identified.

    Further information on the reporting requirements associated with mine rehabilitation can be found at https://www.resourcesregulator.nsw.gov.au/rehabilitation/mine-rehabilitation.

    Find more information about the data at https://www.seed.nsw.gov.au/project-approvals-boundary-layer

    Any data-related questions should be directed to nswresourcesregulator@service-now.com

  11. Data from: DATA MINING THE GALAXY ZOO MERGERS

    • data.nasa.gov
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • +2more
    application/rdfxml +5
    Updated Jun 26, 2018
    Cite
    (2018). DATA MINING THE GALAXY ZOO MERGERS [Dataset]. https://data.nasa.gov/dataset/DATA-MINING-THE-GALAXY-ZOO-MERGERS/cs4h-8wda
    Explore at:
    xml, application/rdfxml, application/rssxml, tsv, json, csv
    Dataset updated
    Jun 26, 2018
    License

    U.S. Government Works (https://www.usa.gov/government-works)
    License information was derived automatically

    Description

    DATA MINING THE GALAXY ZOO MERGERS

    STEVEN BAEHR*, ARUN VEDACHALAM*, KIRK BORNE*, AND DANIEL SPONSELLER*

    Abstract. Collisions between pairs of galaxies usually end in the coalescence (merger) of the two galaxies. Collisions and mergers are rare phenomena, yet they may signal the ultimate fate of most galaxies, including our own Milky Way. With the onset of massive collection of astronomical data, a computerized and automated method will be necessary for identifying those colliding galaxies worthy of more detailed study. This project researches methods to accomplish that goal. Astronomical data from the Sloan Digital Sky Survey (SDSS) and human-provided classifications on merger status from the Galaxy Zoo project are combined and processed with machine learning algorithms. The goal is to determine indicators of merger status based solely on discovering those automated pipeline-generated attributes in the astronomical database that correlate most strongly with the patterns identified through visual inspection by the Galaxy Zoo volunteers. In the end, we aim to provide a new and improved automated procedure for classification of collisions and mergers in future petascale astronomical sky surveys. Both information gain analysis (via the C4.5 decision tree algorithm) and cluster analysis (via the Davies-Bouldin Index) are explored as techniques for finding the strongest correlations between human-identified patterns and existing database attributes. Galaxy attributes measured in the SDSS green waveband images are found to represent the most influential of the attributes for correct classification of collisions and mergers. Only a nominal information gain is noted in this research; however, there is a clear indication of which attributes contribute, so that a direction for further study is apparent.
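The information-gain criterion that C4.5 uses to rank split attributes can be sketched directly; the "green-band brightness" attribute and merger labels below are invented for illustration, not values from the SDSS/Galaxy Zoo data:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction from splitting the rows on one attribute."""
    total = entropy(labels)
    n = len(rows)
    split = {}
    for row, lab in zip(rows, labels):
        split.setdefault(row[attr], []).append(lab)
    return total - sum(len(v) / n * entropy(v) for v in split.values())

# toy attribute: green-band brightness bucket vs. merger label
rows = [{"g": "bright"}, {"g": "bright"}, {"g": "faint"}, {"g": "faint"}]
labels = ["merger", "merger", "non-merger", "non-merger"]
gain = information_gain(rows, labels, "g")  # perfect split -> 1 bit of gain
```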

  12. SNL Metals & Mining Dataset | S&P Global Marketplace

    • marketplace.spglobal.com
    Updated May 14, 2020
    Cite
    S&P Global (2020). SNL Metals & Mining Dataset | S&P Global Marketplace [Dataset]. https://www.marketplace.spglobal.com/en/datasets/snl-metals-mining-(19)
    Explore at:
    Dataset updated
    May 14, 2020
    Dataset authored and provided by
    S&P Global (https://www.spglobal.com/)
    Description

    A comprehensive source of asset and company-level data for the mining sector worldwide, as well as research and news content.

  13. Meta-study water and mining conflicts

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 17, 2023
    Cite
    Meta-study water and mining conflicts [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_5151474
    Explore at:
    Dataset updated
    Feb 17, 2023
    Dataset provided by
    Schoderer, Mirja
    Ott, Marlen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset comprises the raw data and R script for the following published article: Schoderer, M., & Ott, M. (2022). Contested water- and miningscapes: Explaining the high intensity of water and mining conflicts in a meta-study. World Development, 154, 105888. The article seeks to better understand the dynamics of mining and water conflicts, specifically under which (combinations of) conditions environmental defenders step outside the legal framework in their contestation of mining projects, according to existing case-study-based research. More information on the methodology is available in the paper.

    The file "Water and mining conflicts full dataset" includes the qualitative information extracted from published articles, the scoring scheme, and the normalized scores used in the R analysis. The R script "QCA_Preventive water and mining conflicts" describes the fuzzy-set, two-step Qualitative Comparative Analysis conducted to understand under which conditions environmental defenders choose non-legal means in conflicts that occur in the planning or licensing stage of a mining project; the CSV file "Normalized scores_preventive" is the raw data used in that script. The R script "QCA_Reactive water and mining conflicts" describes the corresponding analysis for conflicts that occur when the mining project is already in operation; the CSV file "Normalized scores_reactive" is the raw data used in that script.

  14. Fig A Hierarchical cluster dendrogram for the pairwise Pearson correlation...

    • plos.figshare.com
    docx
    Updated Jun 2, 2023
    Cite
    David G. Covell (2023). Fig A Hierarchical cluster dendrogram for the pairwise Pearson correlation coefficients of the CGP data; Fig B Heatmap of GI50 correlations for the CGP data; Fig C Glmnet EN regression output for PD-0325901; Fig D Heatmap for CGP drugs that yielded a converged EN model with 10 or more genes; Fig E: Heatmap for minimal EN model of bortezomib; Table A. [Dataset]. http://doi.org/10.1371/journal.pone.0127433.s001
    Explore at:
    docx
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    David G. Covell
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Pvclust result for GI50 measures in the CGP tumor cells; Table B. Cluster results for EN genes; Table C. GSEA results for minimal EN models of PD_0325901; Table D. GSEA results for the minimal EN genes of dasatinib; Table E. Summary of counts for B-H adjusted significant MUTs; Table F. Clade members for row-clades A-J of main text Fig 8.; Table G. Best scoring GSEA results for global minimal EN genes.; Table H. Listing of -log(FDR q-values) for GSEA results using genes with significant CN changes; Table I. Listing of tstat values for gene MUT and CN changes for COSMIC drugs (DOCX)

  15. Data from: Mining texts to efficiently generate global data on political...

    • dataverse.harvard.edu
    Updated Jul 8, 2015
    Cite
    Shahryar Minhas; Jay Ulfelder; Michael D. Ward (2015). Mining texts to efficiently generate global data on political regime types [Dataset]. http://doi.org/10.7910/DVN/8MC1LO
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 8, 2015
    Dataset provided by
    Harvard Dataverse
    Authors
    Shahryar Minhas; Jay Ulfelder; Michael D. Ward
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    We describe the design and results of an experiment in using text-mining and machine-learning techniques to generate annual measures of national political regime types. Valid and reliable measures of countries’ forms of national government are essential to cross-national and dynamic analysis of many phenomena of great interest to political scientists, including civil war, interstate war, democratization, and coups d’état. Unfortunately, traditional measures of regime type are very expensive to produce, and observations for ambiguous cases are often sharply contested. In this project, we train a series of support vector machine (SVM) classifiers to infer regime type from textual data sources. To train the classifiers, we used vectorized textual reports from Freedom House and the State Department as features for a training set of prelabeled regime type data. To validate our SVM classifiers, we compare their predictions in an out-of-sample context, and the performance results across a variety of metrics (accuracy, precision, recall) are very high. The results of this project highlight the ability of these techniques to contribute to producing real-time data sources for use in political science that can also be routinely updated at much lower cost than human-coded data. To this end, we set up a text-processing pipeline that pulls updated textual data from selected sources, conducts feature extraction, and applies supervised machine learning methods to produce measures of regime type. This pipeline, written in Python, can be pulled from the Github repository associated with this project and easily extended as more data becomes available.
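The feature-extraction step described above, vectorizing textual reports before training classifiers, can be sketched with a tiny TF-IDF implementation. The two toy "reports" and this minimal implementation are illustrative assumptions, not the project's actual pipeline:

```python
import math
from collections import Counter

def tfidf(docs):
    """One sparse TF-IDF vector (dict) per document."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(w for toks in tokenized for w in set(toks))  # document frequency
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({w: (tf[w] / len(toks)) * math.log(n / df[w])
                        for w in tf})
    return vectors

reports = [
    "free elections independent press",
    "elections suspended press censored",
]
vecs = tfidf(reports)
# "elections" appears in both reports, so its idf = log(2/2) = 0
```

Vectors like these would then feed a supervised classifier (the paper uses SVMs) trained on pre-labeled regime types.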

  16. Number of association rules generated using the Apriori rule mining approach...

    • figshare.com
    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Disha Tandon; Mohammed Monzoorul Haque; Sharmila S. Mande (2023). Number of association rules generated using the Apriori rule mining approach with various datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0154493.t001
    Explore at:
    xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Disha Tandon; Mohammed Monzoorul Haque; Sharmila S. Mande
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summarised information pertaining to (a) the number of samples, (b) the number of generated association rules (total as well as rules that involve 3 or more genera), (c) the unique number of microbial genera involved in the identified association rules, (d) execution time, and (e) the number of rules generated using an alternative rule mining strategy (detailed in discussion section of the manuscript).
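As a sketch of the Apriori idea behind those rule counts, here is a toy miner restricted to 2-item rules (A -> B); the genus "transactions" are made up, and real analyses use the full multi-level Apriori algorithm:

```python
from itertools import combinations
from collections import Counter

def apriori_pairs(transactions, min_support=0.5, min_confidence=0.8):
    """Frequent-pair association rules (antecedent, consequent, confidence)."""
    n = len(transactions)
    item_counts = Counter(i for t in transactions for i in set(t))
    # Apriori pruning: only items that are frequent on their own
    frequent = {i for i, c in item_counts.items() if c / n >= min_support}
    pair_counts = Counter(
        pair for t in transactions
        for pair in combinations(sorted(set(t) & frequent), 2))
    rules = []
    for (a, b), c in pair_counts.items():
        if c / n >= min_support:
            for ante, cons in ((a, b), (b, a)):
                conf = c / item_counts[ante]
                if conf >= min_confidence:
                    rules.append((ante, cons, conf))
    return rules

transactions = [
    {"bacteroides", "prevotella"},
    {"bacteroides", "prevotella", "faecalibacterium"},
    {"bacteroides", "faecalibacterium"},
    {"prevotella"},
]
rules = apriori_pairs(transactions)
```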

  17. PubChem Data Mining of OXPHOS inhibitors: scripts, data, and models

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bin +2
    Updated May 9, 2024
    Cite
    Spencer Ericksen; Spencer Ericksen (2024). PubChem Data Mining of OXPHOS inhibitors: scripts, data, and models [Dataset]. http://doi.org/10.5281/zenodo.11003006
    Explore at:
    txt, application/gzip, bin, csv
    Dataset updated
    May 9, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Spencer Ericksen; Spencer Ericksen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    README doc, source, and data files from PubChem data mining project to identify OXPHOS inhibitory chemotypes.

  18. Data from: Resource Projects

    • resourceprojects.org
    • karissamonneymd.com
    zip
    Updated Jan 9, 2015
    Cite
    Natural Resource Governance Institute (2015). Resource Projects [Dataset]. https://resourceprojects.org
    Explore at:
    zip
    Dataset updated
    Jan 9, 2015
    Dataset provided by
    Natural Resource Governance Institute (https://resourcegovernance.org/)
    Time period covered
    Jan 1, 2014 - Present
    Area covered
    Worldwide
    Description

    Explore payments made by companies for extracting oil, gas and mining resources around the world.

  19. OSMRE Abandoned Mine Land Award-Winning Projects

    • catalog.data.gov
    • s.cnmilf.com
    Updated Dec 12, 2023
    Cite
    Office of Surface Mining, Reclamation and Enforcement (2023). OSMRE Abandoned Mine Land Award-Winning Projects [Dataset]. https://catalog.data.gov/dataset/osmre-abandoned-mine-land-award-winning-projects
    Explore at:
    Dataset updated
    Dec 12, 2023
    Dataset provided by
    Office of Surface Mining Reclamation and Enforcement (http://www.osmre.gov/)
    Description

    This layer displays the counties where award-winning AML projects are located along with information about the projects. Since project locations are not always known exactly, or may be near private residences, exact locations are not provided. This layer was created using information from OSMRE's website and joining it to US county boundaries. Agencies responsible for the reclamation projects are listed when known; the quality of the data improves after 1998. When available, hyperlinks to project descriptions and videos are included as well, generally for projects within the last 10 years.

  20. LScD (Leicester Scientific Dictionary)

    • figshare.le.ac.uk
    docx
    Updated Apr 15, 2020
    + more versions
    Cite
    Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
    Explore at:
    Available download formats: docx
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    LScD (Leicester Scientific Dictionary), April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    [Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus), Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we do not repeat the explanation here. After pre-processing, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are the same as those described for LScD Version 2 below.

    * Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
    ** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

    [Version 2] Getting Started
    This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). The dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from LSC, and instructions for using the code, are available in [2]. The code can also be used for lists of texts from other sources; amendments to the code may be required.

    LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.

    LScD is an ordered list of words from the texts of abstracts in LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing each word, in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:
    1. Unique words in abstracts
    2. Number of documents containing each word
    3. Number of appearances of a word in the entire corpus

    Processing the LSC
    Step 1. Downloading the LSC Online: Use of the LSC is subject to acceptance of a request for the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.
    Step 2. Importing the Corpus to R: The full R code for processing the corpus can be found on GitHub [2]. All following steps can be applied to an arbitrary list of texts from any source, with changes of parameters. The structure of the corpus, such as file format and the names (and positions) of fields, should be taken into account when applying the code. The organisation of the CSV files of LSC is described in the README file for LSC [1].
    Step 3. Extracting Abstracts and Saving Metadata: Metadata, which include all fields in a document except the abstract, are separated from the abstracts and saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
    Step 4. Text Pre-processing Steps on the Collection of Abstracts:
    1. Removing punctuation and special characters: all non-alphanumeric characters are substituted by a space. The character "-" is not substituted in this step, because words like "z-score", "non-payment" and "pre-processing" must be kept so as not to lose their actual meaning. Uniting prefixes with words is performed in later pre-processing steps.
    2. Lowercasing the text data: lowercasing is performed to avoid treating words like "Corpus", "corpus" and "CORPUS" differently. The entire collection of texts is converted to lowercase.
    3. Uniting prefixes of words: words containing prefixes joined with the character "-" are united into a single word. The list of prefixes united for this research is given in the file "list_of_prefixes.csv". Most prefixes are extracted from [4]; the commonly used prefixes 'e', 'extra', 'per', 'self' and 'ultra' were also added.
    4. Substitution of words: some words joined with "-" in the abstracts require an additional substitution step to avoid losing their meaning before the character "-" is removed. Examples are "z-test", "well-known" and "chi-square", which are substituted by "ztest", "wellknown" and "chisquare". Such words were identified by sampling abstracts from LSC. The full list of such words and the substitutions applied are presented in the file "list_of_substitution.csv".
    5. Removing the character "-": all remaining "-" characters are replaced by a space.
    6. Removing numbers: all digits not included in a word are replaced by a space. Words that contain both digits and letters are kept, because alphanumeric tokens such as chemical formulas may be important for the analysis; examples are "co2", "h2o" and "21st".
    7. Stemming: stemming converts inflected words to their word stem. This unites several forms of words with similar meaning into one form, and also saves memory and time [5]. All words in the LScD are stemmed to their word stem.
    8. Stop-word removal: stop words are words that are extremely common but provide little value in a language; common English examples are 'I', 'the' and 'a'. The 'tm' package in R is used to remove stop words [6]; the package lists 174 English stop words.
    Step 5. Writing the LScD into CSV Format: there are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file "LScD.csv".

    The Organisation of the LScD
    The total number of words in the file "LScD.csv" is 974,238. Each field is described below:
    Word: unique words from the corpus, in lowercase and stemmed form. The field is sorted by the number of documents containing each word, in descending order.
    Number of Documents Containing the Word: a binary count is used: if a word exists in an abstract, it counts as 1, even if the word appears more than once in that document. The total number of documents containing the word is the sum of these 1s over the entire corpus.
    Number of Appearances in Corpus: how many times a word occurs in the corpus when the corpus is considered as one large document.

    Instructions for R Code
    LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format:
    Metadata File: all fields in a document excluding abstracts (List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection).
    File of Abstracts: all abstracts after the pre-processing steps defined in Step 4.
    DTM: the Document Term Matrix constructed from the LSC [6]; each entry of the matrix is the number of times the word occurs in the corresponding document.
    LScD: an ordered list of words from LSC as defined in the previous section.

    To use the code:
    1. Download the folder 'LSC', 'list_of_prefixes.csv' and 'list_of_substitution.csv'.
    2. Open the LScD_Creation.R script.
    3. Change the parameters in the script: replace with the full path of the directory with source files and the full path of the directory for output files.
    4. Run the full code.

    References
    [1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
    [2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
    [3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
    [5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved Porter's stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
    [6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
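The LScD pipeline itself is written in R [2]. Purely as an illustrative sketch, the pre-processing and counting logic described above can be approximated in Python. Note that the prefix list, stop-word list, and suffix-stripping "stemmer" below are toy stand-ins invented for this example; they are not the project's actual "list_of_prefixes.csv", the tm package's 174 stop words, or its stemming algorithm, and the word-substitution step is omitted.

```python
import re
from collections import Counter

# Toy stand-ins for the real prefix and stop-word lists (hypothetical).
PREFIXES = {"pre", "non", "self", "ultra", "extra"}
STOP_WORDS = {"i", "the", "a", "an", "of", "and", "in"}

def crude_stem(word: str) -> str:
    """Toy stemmer: strips a few common English suffixes (not Porter)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(abstract: str) -> list[str]:
    # 1. Replace non-alphanumeric characters (except "-") with spaces.
    text = re.sub(r"[^\w\s-]", " ", abstract)
    # 2. Lowercase the whole text.
    text = text.lower()
    # 3. Unite listed prefixes joined with "-", e.g. "pre-processing" -> "preprocessing".
    text = re.sub(
        r"\b(" + "|".join(PREFIXES) + r")-(\w+)",
        lambda m: m.group(1) + m.group(2),
        text,
    )
    # 5. Replace remaining "-" characters with spaces.
    text = text.replace("-", " ")
    # 6. Drop standalone numbers but keep alphanumeric tokens like "co2".
    tokens = [t for t in text.split() if not t.isdigit()]
    # 7-8. Remove stop words, then stem (order swapped vs. the R pipeline for brevity).
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

def build_dictionary(abstracts: list[str]) -> list[tuple[str, int, int]]:
    """Return (word, n_documents_containing, n_appearances_in_corpus),
    sorted by document count descending, mirroring the LScD.csv fields."""
    doc_freq, corpus_freq = Counter(), Counter()
    for abstract in abstracts:
        words = preprocess(abstract)
        corpus_freq.update(words)
        doc_freq.update(set(words))  # binary per-document count
    return sorted(
        [(w, doc_freq[w], corpus_freq[w]) for w in doc_freq],
        key=lambda row: -row[1],
    )

abstracts = [
    "Pre-processing of 21 corpus texts and the z-score.",
    "Corpus pre-processing with CO2 measurements.",
]
top = {w: (d, c) for w, d, c in build_dictionary(abstracts)}
print(top["preprocess"])  # -> (2, 2): in both documents, twice in the corpus
```

The `set(words)` call is what implements the binary "Number of Documents Containing the Word" count: repeated occurrences within one abstract still contribute only 1 to the document count, while `corpus_freq` counts every occurrence.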
