100+ datasets found
  1. Data Mining Project 1 Sapfile

    • kaggle.com
    zip
    Updated Jan 31, 2019
    Cite
    Prutchakorn (2019). Data Mining Project 1 Sapfile [Dataset]. https://www.kaggle.com/prutchakorn/data-mining-project-1-sapfile
    Explore at:
    zip (2244 bytes)
    Dataset updated
    Jan 31, 2019
    Authors
    Prutchakorn
    Description

    Dataset

    This dataset was created by Prutchakorn

    Contents

  2. ghtorrent-projects Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, txt
    Updated Jul 17, 2021
    Cite
    Marios Papachristou; Marios Papachristou (2021). ghtorrent-projects Dataset [Dataset]. http://doi.org/10.5281/zenodo.5111043
    Explore at:
    txt, bin
    Dataset updated
    Jul 17, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marios Papachristou; Marios Papachristou
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A hypergraph dataset mined from the GHTorrent project is presented. The dataset contains two files:

    1. project_members.txt: Contains GitHub projects with at least 2 contributors and the corresponding contributors (as a hyperedge). The format of the data is:

    2. num_followers.txt: Contains all GitHub users and their number of followers.

    The artifact also contains the SQL queries used to obtain the data from GHTorrent (schema).
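A sketch of how such a hyperedge file might be consumed follows. The exact line format is not reproduced above, so the whitespace-separated layout assumed here is hypothetical; the toy project and user names are invented:

```python
def parse_hyperedges(lines):
    """Read project-contributor hyperedges.

    Assumed (hypothetical) layout: one project per line,
    "<project_id> <user_1> <user_2> ...".
    """
    hypergraph = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 3:  # the dataset only keeps projects with >= 2 contributors
            hypergraph[parts[0]] = set(parts[1:])
    return hypergraph

sample = ["p1 alice bob", "p2 carol dave erin", "p3 frank"]
hypergraph = parse_hyperedges(sample)  # "p3" is dropped: only 1 contributor
```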

  3. Africa - PowerMining Projects Database

    • data.subak.org
    • cloud.csiss.gmu.edu
    • +3more
    csv
    Updated Feb 16, 2023
    Cite
    World Bank Group (2023). Africa - PowerMining Projects Database [Dataset]. https://data.subak.org/pl/dataset/africa-powermining-projects-database-2014
    Explore at:
    csv
    Dataset updated
    Feb 16, 2023
    Dataset provided by
    World Bank (http://worldbank.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Africa
    Description

    "The Africa Power–Mining Database 2014 shows ongoing and forthcoming mining projects in Africa categorized by the type of mineral, ore grade, and size of the project. The database draws on basic mining data from Infomine surveys, the United States Geological Survey, annual reports, technical reports, feasibility studies, investor presentations, sustainability reports on property-owner websites or filed in public domains, and mining websites (Mining Weekly, Mining Journal, Mbendi, Mining-technology, and Miningmx). Comprising 455 projects in 28 SSA countries with each project’s ore reserve value assessed at more than $250 million, the database collates publicly available and proprietary information. It also provides a panoramic view of projects operating in 2000–12 and anticipated demand in 2020. The analysis is presented over three timeframes: pre-2000, 2001–12, and 2020 (each containing the projects from the previous period except for those closing during that previous period)."

  4. Data from: Community-Scale Attic Retrofit and Home Energy Upgrade Data...

    • datasets.ai
    • data.openei.org
    • +3more
    33, 53, 55, 8
    Updated Sep 11, 2024
    + more versions
    Cite
    Department of Energy (2024). Community-Scale Attic Retrofit and Home Energy Upgrade Data Mining - Hot Dry Climate [Dataset]. https://datasets.ai/datasets/community-scale-attic-retrofit-and-home-energy-upgrade-data-mining-hot-dry-climate
    Explore at:
    55, 8, 33, 53
    Dataset updated
    Sep 11, 2024
    Dataset provided by
    United States Department of Energy (http://energy.gov/)
    Authors
    Department of Energy
    Description

    Retrofitting is an essential element of any comprehensive strategy for improving residential energy efficiency. The residential retrofit market is still developing, and program managers must develop innovative strategies to increase uptake and promote economies of scale. Residential retrofitting remains a challenging proposition to sell to homeowners, because awareness levels are low and financial incentives are lacking.

    The U.S. Department of Energy's Building America research team, Alliance for Residential Building Innovation (ARBI), implemented a project to increase residential retrofits in Davis, California. The project used a neighborhood-focused strategy for implementation and a low-cost retrofit program that focused on upgraded attic insulation and duct sealing. ARBI worked with a community partner, the not-for-profit Cool Davis Initiative, as well as selected area contractors to implement a strategy that sought to capitalize on the strong local expertise of partners and the unique aspects of the Davis, California, community. Working with community partners also allowed ARBI to collect and analyze data about effective messaging tactics for community-based retrofit programs.

    ARBI expected this project, called Retrofit Your Attic, to achieve higher uptake than other retrofit projects, because it emphasized a low-cost, one-measure retrofit program. However, this was not the case. The program used a strategy that focused on attics-including air sealing, duct sealing, and attic insulation-as a low-cost entry for homeowners to complete home retrofits. The price was kept below $4,000 after incentives; both contractors in the program offered the same price. The program completed only five retrofits. Interestingly, none of those homeowners used the one-measure strategy. All five homeowners were concerned about cost, comfort, and energy savings and included additional measures in their retrofits. The low-cost, one-measure strategy did not increase the uptake among homeowners, even in a well-educated, affluent community such as Davis.

    This project has two primary components. One is to complete attic retrofits on a community scale in the hot-dry climate of Davis, CA. Sufficient data will be collected on these projects to include them in the BAFDR. Additionally, ARBI is working with contractors to obtain building and utility data from a large set of retrofit projects in CA (hot-dry). These projects are to be uploaded into the BAFDR.

  5. Educational Attainment in North Carolina Public Schools: Use of statistical...

    • data.mendeley.com
    Updated Nov 14, 2018
    Cite
    Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
    Explore at:
    Dataset updated
    Nov 14, 2018
    Authors
    Scott Herford
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    North Carolina
    Description

    The purpose of data mining analysis is always to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset; before doing any work on the data, it has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, the performance did not improve much after using clustering prior to classification. The reason may be that the features we selected to perform clustering on are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.

    From the dimensionality reduction perspective: clustering is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension loses a lot of information, since clustering techniques are based on a metric of 'distance', and at high dimensions Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.

    From the creating-new-features perspective: clustering analysis creates labels based on the patterns of the data, which brings uncertainty into the data. When using clustering prior to classification, the decision on the number of clusters will strongly affect the performance of the clustering, and in turn the performance of the classification. If the subset of features we apply clustering techniques to is well suited for it, it might increase the overall classification performance. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better.

    We did not lock in the clustering outputs using a random_state, in order to see whether they were stable. Our assumption was that if the results vary highly from run to run (which they definitely did), the data may simply not cluster well with the selected methods. The ramification we saw was that our results are not much better than random when applying clustering in the data preprocessing.

    Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to revise the models from time to time as things change.
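The cluster-label-as-feature idea discussed above can be sketched with a toy k-means. The naive implementation and the two-blob data below are illustrative assumptions, not the project's actual code:

```python
import math

def kmeans(points, k, iters=20):
    """Tiny k-means used to turn raw features into a cluster-id feature.
    Naive initialization (first k points) is enough for a sketch."""
    centroids = [p for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest centroid by Euclidean distance
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for i, p in enumerate(points) if labels[i] == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(xs) for xs in zip(*members))
    return labels

# two well-separated toy blobs; the cluster id becomes an engineered feature
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels = kmeans(points, k=2)
augmented = [p + (lab,) for p, lab in zip(points, labels)]
```

On data this clean the cluster id separates the blobs perfectly; as the text notes, on high-dimensional real data the same label can carry almost no information.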

  6. Knowledge Graph: tyrolean mining documents 15th and 16th century

    • zenodo.org
    bin
    Updated Sep 26, 2024
    Cite
    Gerald Hiebel; Gerald Hiebel; Elisabeth Gruber-Tokić; Elisabeth Gruber-Tokić; Milena Peralta Friedburg; Milena Peralta Friedburg; Brigit Danthine; Brigit Danthine (2024). Knowledge Graph: tyrolean mining documents 15th and 16th century [Dataset]. http://doi.org/10.5281/zenodo.6276586
    Explore at:
    bin
    Dataset updated
    Sep 26, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gerald Hiebel; Gerald Hiebel; Elisabeth Gruber-Tokić; Elisabeth Gruber-Tokić; Milena Peralta Friedburg; Milena Peralta Friedburg; Brigit Danthine; Brigit Danthine
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains a Knowledge Graph (.nq file) of two historical mining documents: “Verleihbuch der Rattenberger Bergrichter” (Hs. 37, 1460–1463) and “Schwazer Berglehenbuch” (Hs. 1587, approx. 1515), stored by the Tyrolean Regional Archive, Innsbruck (Austria). Users of the KG may explore the montanistic network and the relations between people, claims and mines in late medieval Tyrol. The core regions concern the districts of Schwaz and Kufstein (Tyrol, Austria).

    The ontology used to represent the claims is CIDOC CRM, an ISO-certified ontology for cultural heritage documentation. Supported by the Karma tool, the KG is generated as RDF (Resource Description Framework). The generated RDF data is imported into a triplestore, in this case GraphDB, and then displayed visually. This puts the data from the early mining texts into a semantically structured context and makes the mutual relationships between people, places and mines visible.
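An .nq (N-Quads) file like this one is a list of subject-predicate-object-graph statements. As a minimal sketch of that shape (example IRIs below are invented; P89_falls_within is a real CIDOC CRM property; for real use a library such as rdflib is the safer choice):

```python
def parse_nquads_line(line):
    """Minimal N-Quads reader for plain-IRI statements of the form
    "<s> <p> <o> [<g>] ." -- no literals; real parsers handle far more."""
    parts = line.strip().rstrip(" .").split()
    subject, predicate, obj = parts[0], parts[1], parts[2]
    graph = parts[3] if len(parts) > 3 else None
    return subject, predicate, obj, graph

quad = parse_nquads_line(
    "<http://example.org/claim/42> "
    "<http://www.cidoc-crm.org/cidoc-crm/P89_falls_within> "
    "<http://example.org/place/schwaz> "
    "<http://example.org/graph/hs37> ."
)
```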

    Both documents and the Knowledge Graph were processed and generated by the research team of the project “Text Mining Medieval Mining Texts”. The research project (2019-2022) was carried out at the University of Innsbruck and funded by the go!digital Next Generation programme of the Austrian Academy of Sciences.

    Citable transcripts of the historical documents are available online:
    Hs. 37 DOI: 10.5281/zenodo.6274562
    Hs. 1587 DOI: 10.5281/zenodo.6274928

  7. Data from: Twitter Big Data as a Resource for Exoskeleton Research: A...

    • ieee-dataport.org
    Updated Oct 22, 2022
    + more versions
    Cite
    Nirmalya Thakur (2022). Twitter Big Data as a Resource for Exoskeleton Research: A Large-Scale Dataset of about 140,000 Tweets and 100 Research Questions [Dataset]. http://doi.org/10.21227/r5mv-ax79
    Explore at:
    Dataset updated
    Oct 22, 2022
    Dataset provided by
    IEEE Dataport
    Authors
    Nirmalya Thakur
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please cite the following paper when using this dataset: N. Thakur, "Twitter Big Data as a Resource for Exoskeleton Research: A Large-Scale Dataset of about 140,000 Tweets from 2017–2022 and 100 Research Questions", Journal of Analytics, Volume 1, Issue 2, 2022, pp. 72-97, DOI: https://doi.org/10.3390/analytics1020007

    Abstract: Exoskeleton technology has been rapidly advancing in the recent past due to its multitude of applications and diverse use cases in assisted living, military, healthcare, firefighting, and Industry 4.0. The exoskeleton market is projected to grow to multiple times its current value within the next two years. Therefore, it is crucial to study the degree and trends of user interest, views, opinions, perspectives, attitudes, acceptance, feedback, engagement, buying behavior, and satisfaction towards exoskeletons, for which the availability of Big Data of conversations about exoskeletons is necessary. The Internet of Everything style of today’s living, characterized by people spending more time on the internet than ever before, with a specific focus on social media platforms, holds the potential for the development of such a dataset by mining relevant social media conversations. Twitter, one such social media platform, is highly popular amongst all age groups, and the topics found in its conversation paradigms include emerging technologies such as exoskeletons.

    To address this research challenge, this work makes two scientific contributions to this field. First, it presents an open-access dataset of about 140,000 Tweets about exoskeletons that were posted in a 5-year period from 21 May 2017 to 21 May 2022. Second, based on a comprehensive review of the recent works in the fields of Big Data, Natural Language Processing, Information Retrieval, Data Mining, Pattern Recognition, and Artificial Intelligence that may be applied to relevant Twitter data for advancing research, innovation, and discovery in the field of exoskeleton research, a total of 100 Research Questions are presented for researchers to study, analyze, evaluate, ideate, and investigate based on this dataset.

  8. Titanic Datamining project Yousef

    • kaggle.com
    zip
    Updated Dec 19, 2023
    Cite
    Dr THABIT FURSAN (2023). Titanic Datamining project Yousef [Dataset]. https://www.kaggle.com/datasets/drthabitfursan/titanic-datamining-project-yousefib/data
    Explore at:
    zip (22544 bytes)
    Dataset updated
    Dec 19, 2023
    Authors
    Dr THABIT FURSAN
    Description

    Dataset

    This dataset was created by Dr THABIT FURSAN

    Contents

  9. Tokenized Forms of Jane Austen Novels with Positional Information

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Sep 24, 2024
    Cite
    Duckworth, Tyler J (2024). Tokenized Forms of Jane Austen Novels with Positional Information [Dataset]. https://search.dataone.org/view/sha256%3Ad5fa1267f6f5030c07d81a0d2a2e4deaad316b4d9aabfb88d937ba48e93a4ce5
    Explore at:
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Duckworth, Tyler J
    Description

    This dataset contains tokenized forms of four Jane Austen novels sourced from Project Gutenberg (Emma, Persuasion, Pride and Prejudice, and Sense and Sensibility) that are broken down by chapter (and volume where appropriate). Each file also includes positional data for each row, which will be used for further analysis. This was created to hold the data for the final project for COSC426: Introduction to Data Mining, a class at the University of Tennessee.
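A per-token row with positional data, as described above, might be produced along these lines (the row schema and punctuation handling here are illustrative assumptions, not the dataset's documented format):

```python
def tokenize_with_positions(text, novel, chapter):
    """One row per token, carrying novel, chapter, and word position."""
    rows = []
    for pos, raw in enumerate(text.split()):
        token = raw.strip(".,;!?\"'()").lower()  # crude punctuation strip
        if token:
            rows.append({"novel": novel, "chapter": chapter,
                         "position": pos, "token": token})
    return rows

rows = tokenize_with_positions(
    "It is a truth universally acknowledged,", "Pride and Prejudice", 1)
```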

  10. Mine Project Approval Boundary

    • researchdata.edu.au
    • data.nsw.gov.au
    Updated Jul 24, 2024
    Cite
    data.nsw.gov.au (2024). Mine Project Approval Boundary [Dataset]. https://researchdata.edu.au/mine-project-approval-boundary/3362577
    Explore at:
    Dataset updated
    Jul 24, 2024
    Dataset provided by
    data.nsw.gov.au
    Area covered
    Description

    The Project Approval Boundary spatial data set provides information on the location of the project approvals granted for each mine in NSW by an approval authority (either the NSW Department of Planning or local council). This information may not align with the mine authorisation (i.e. mine title etc.) granted under the Mining Act 1992. This information is created and submitted by each large mine operator to fulfill the Final Landuse and Rehabilitation Plan data submission requirements under Schedule 8A of the Mining Regulation 2016.

    The collection of this spatial data is administered by the Resources Regulator in NSW, which reviews the data submitted for assessment purposes. In some cases, information provided may contain inaccuracies that require adjustment following the assessment process by the Regulator. The Regulator will request data resubmission if issues are identified.

    Further information on the reporting requirements associated with mine rehabilitation can be found at https://www.resourcesregulator.nsw.gov.au/rehabilitation/mine-rehabilitation.

    Find more information about the data at https://www.seed.nsw.gov.au/project-approvals-boundary-layer

    Any data-related questions should be directed to nswresourcesregulator@service-now.com

  11. Data from: DATA MINING THE GALAXY ZOO MERGERS

    • data.nasa.gov
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • +2more
    application/rdfxml +5
    Updated Jun 26, 2018
    Cite
    (2018). DATA MINING THE GALAXY ZOO MERGERS [Dataset]. https://data.nasa.gov/dataset/DATA-MINING-THE-GALAXY-ZOO-MERGERS/cs4h-8wda
    Explore at:
    xml, application/rdfxml, application/rssxml, tsv, json, csv
    Dataset updated
    Jun 26, 2018
    License

    U.S. Government Works (https://www.usa.gov/government-works)
    License information was derived automatically

    Description

    DATA MINING THE GALAXY ZOO MERGERS

    STEVEN BAEHR*, ARUN VEDACHALAM*, KIRK BORNE*, AND DANIEL SPONSELLER*

    Abstract. Collisions between pairs of galaxies usually end in the coalescence (merger) of the two galaxies. Collisions and mergers are rare phenomena, yet they may signal the ultimate fate of most galaxies, including our own Milky Way. With the onset of massive collection of astronomical data, a computerized and automated method will be necessary for identifying those colliding galaxies worthy of more detailed study. This project researches methods to accomplish that goal. Astronomical data from the Sloan Digital Sky Survey (SDSS) and human-provided classifications on merger status from the Galaxy Zoo project are combined and processed with machine learning algorithms. The goal is to determine indicators of merger status based solely on discovering those automated pipeline-generated attributes in the astronomical database that correlate most strongly with the patterns identified through visual inspection by the Galaxy Zoo volunteers. In the end, we aim to provide a new and improved automated procedure for classification of collisions and mergers in future petascale astronomical sky surveys. Both information gain analysis (via the C4.5 decision tree algorithm) and cluster analysis (via the Davies-Bouldin Index) are explored as techniques for finding the strongest correlations between human-identified patterns and existing database attributes. Galaxy attributes measured in the SDSS green waveband images are found to represent the most influential of the attributes for correct classification of collisions and mergers. Only a nominal information gain is noted in this research; however, there is a clear indication of which attributes contribute, so that a direction for further study is apparent.
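The information-gain criterion that C4.5 uses to rank split attributes can be sketched directly; the "green-band brightness" attribute and merger labels below are invented for illustration, not values from the SDSS/Galaxy Zoo data:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction from splitting the rows on one attribute."""
    total = entropy(labels)
    n = len(rows)
    split = {}
    for row, lab in zip(rows, labels):
        split.setdefault(row[attr], []).append(lab)
    return total - sum(len(v) / n * entropy(v) for v in split.values())

# toy attribute: green-band brightness bucket vs. merger label
rows = [{"g": "bright"}, {"g": "bright"}, {"g": "faint"}, {"g": "faint"}]
labels = ["merger", "merger", "non-merger", "non-merger"]
gain = information_gain(rows, labels, "g")  # perfect split -> 1 bit of gain
```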

  12. SNL Metals & Mining Dataset | S&P Global Marketplace

    • marketplace.spglobal.com
    Updated May 14, 2020
    Cite
    S&P Global (2020). SNL Metals & Mining Dataset | S&P Global Marketplace [Dataset]. https://www.marketplace.spglobal.com/en/datasets/snl-metals-mining-(19)
    Explore at:
    Dataset updated
    May 14, 2020
    Dataset authored and provided by
    S&P Global (https://www.spglobal.com/)
    Description

    A comprehensive source of asset and company-level data for the mining sector worldwide, as well as research and news content.

  13. Meta-study water and mining conflicts

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 17, 2023
    Cite
    Meta-study water and mining conflicts [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_5151474
    Explore at:
    Dataset updated
    Feb 17, 2023
    Dataset provided by
    Schoderer, Mirja
    Ott, Marlen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset comprises the raw data and R script for the following published article: Schoderer, M., & Ott, M. (2022). Contested water- and miningscapes: Explaining the high intensity of water and mining conflicts in a meta-study. World Development, 154, 105888. The article seeks to better understand the dynamics of mining and water conflicts, specifically under which (combinations of) conditions environmental defenders step outside the legal framework in their contestation of mining projects, according to existing case-study-based research. More information on the methodology is available in the paper.

    The file "Water and mining conflicts full dataset" includes the qualitative information extracted from published articles, the scoring scheme, and the normalized scores used in the R analysis. The R script "QCA_Preventive water and mining conflicts" describes the fuzzy-set, two-step Qualitative Comparative Analysis conducted to understand under which conditions environmental defenders choose non-legal means in conflicts that occur in the planning or licensing stage of a mining project; the CSV file "Normalized scores_preventive" is the raw data used in that script. The R script "QCA_Reactive water and mining conflicts" describes the corresponding analysis for conflicts that occur when the mining project is already in operation; the CSV file "Normalized scores_reactive" is the raw data used in that script.

  14. Fig A Hierarchical cluster dendrogram for the pairwise Pearson correlation...

    • plos.figshare.com
    docx
    Updated Jun 2, 2023
    Cite
    David G. Covell (2023). Fig A Hierarchical cluster dendrogram for the pairwise Pearson correlation coefficients of the CGP data; Fig B Heatmap of GI50 correlations for the CGP data; Fig C Glmnet EN regression output for PD-0325901; Fig D Heatmap for CGP drugs that yielded a converged EN model with 10 or more genes; Fig E: Heatmap for minimal EN model of bortezomib; Table A. [Dataset]. http://doi.org/10.1371/journal.pone.0127433.s001
    Explore at:
    docx
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    David G. Covell
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Pvclust result for GI50 measures in the CGP tumor cells; Table B. Cluster results for EN genes; Table C. GSEA results for minimal EN models of PD_0325901; Table D. GSEA results for the minimal EN genes of dasatinib; Table E. Summary of counts for B-H adjusted significant MUTs; Table F. Clade members for row-clades A-J of main text Fig 8.; Table G. Best scoring GSEA results for global minimal EN genes.; Table H. Listing of -log(FDR q-values) for GSEA results using genes with significant CN changes; Table I. Listing of tstat values for gene MUT and CN changes for COSMIC drugs (DOCX)

  15. Data from: Mining texts to efficiently generate global data on political...

    • dataverse.harvard.edu
    Updated Jul 8, 2015
    Cite
    Shahryar Minhas; Jay Ulfelder; Michael D. Ward (2015). Mining texts to efficiently generate global data on political regime types [Dataset]. http://doi.org/10.7910/DVN/8MC1LO
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 8, 2015
    Dataset provided by
    Harvard Dataverse
    Authors
    Shahryar Minhas; Jay Ulfelder; Michael D. Ward
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    We describe the design and results of an experiment in using text-mining and machine-learning techniques to generate annual measures of national political regime types. Valid and reliable measures of countries’ forms of national government are essential to cross-national and dynamic analysis of many phenomena of great interest to political scientists, including civil war, interstate war, democratization, and coups d’état. Unfortunately, traditional measures of regime type are very expensive to produce, and observations for ambiguous cases are often sharply contested. In this project, we train a series of support vector machine (SVM) classifiers to infer regime type from textual data sources. To train the classifiers, we used vectorized textual reports from Freedom House and the State Department as features for a training set of prelabeled regime type data. To validate our SVM classifiers, we compare their predictions in an out-of-sample context, and the performance results across a variety of metrics (accuracy, precision, recall) are very high. The results of this project highlight the ability of these techniques to contribute to producing real-time data sources for use in political science that can also be routinely updated at much lower cost than human-coded data. To this end, we set up a text-processing pipeline that pulls updated textual data from selected sources, conducts feature extraction, and applies supervised machine learning methods to produce measures of regime type. This pipeline, written in Python, can be pulled from the Github repository associated with this project and easily extended as more data becomes available.
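The feature-extraction step described above, vectorizing textual reports before training classifiers, can be sketched with a tiny TF-IDF implementation. The two toy "reports" and this minimal implementation are illustrative assumptions, not the project's actual pipeline:

```python
import math
from collections import Counter

def tfidf(docs):
    """One sparse TF-IDF vector (dict) per document."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(w for toks in tokenized for w in set(toks))  # document frequency
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({w: (tf[w] / len(toks)) * math.log(n / df[w])
                        for w in tf})
    return vectors

reports = [
    "free elections independent press",
    "elections suspended press censored",
]
vecs = tfidf(reports)
# "elections" appears in both reports, so its idf = log(2/2) = 0
```

Vectors like these would then feed a supervised classifier (the paper uses SVMs) trained on pre-labeled regime types.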

  16. Number of association rules generated using the Apriori rule mining approach...

    • figshare.com
    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Disha Tandon; Mohammed Monzoorul Haque; Sharmila S. Mande (2023). Number of association rules generated using the Apriori rule mining approach with various datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0154493.t001
    Explore at:
    xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Disha Tandon; Mohammed Monzoorul Haque; Sharmila S. Mande
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summarised information pertaining to (a) the number of samples, (b) the number of generated association rules (total as well as rules that involve 3 or more genera), (c) the unique number of microbial genera involved in the identified association rules, (d) execution time, and (e) the number of rules generated using an alternative rule mining strategy (detailed in discussion section of the manuscript).
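As a sketch of the Apriori idea behind those rule counts, here is a toy miner restricted to 2-item rules (A -> B); the genus "transactions" are made up, and real analyses use the full multi-level Apriori algorithm:

```python
from itertools import combinations
from collections import Counter

def apriori_pairs(transactions, min_support=0.5, min_confidence=0.8):
    """Frequent-pair association rules (antecedent, consequent, confidence)."""
    n = len(transactions)
    item_counts = Counter(i for t in transactions for i in set(t))
    # Apriori pruning: only items that are frequent on their own
    frequent = {i for i, c in item_counts.items() if c / n >= min_support}
    pair_counts = Counter(
        pair for t in transactions
        for pair in combinations(sorted(set(t) & frequent), 2))
    rules = []
    for (a, b), c in pair_counts.items():
        if c / n >= min_support:
            for ante, cons in ((a, b), (b, a)):
                conf = c / item_counts[ante]
                if conf >= min_confidence:
                    rules.append((ante, cons, conf))
    return rules

transactions = [
    {"bacteroides", "prevotella"},
    {"bacteroides", "prevotella", "faecalibacterium"},
    {"bacteroides", "faecalibacterium"},
    {"prevotella"},
]
rules = apriori_pairs(transactions)
```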

  17. PubChem Data Mining of OXPHOS inhibitors: scripts, data, and models

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bin +2
    Updated May 9, 2024
    Cite
    Spencer Ericksen; Spencer Ericksen (2024). PubChem Data Mining of OXPHOS inhibitors: scripts, data, and models [Dataset]. http://doi.org/10.5281/zenodo.11003006
    Explore at:
    txt, application/gzip, bin, csv
    Dataset updated
    May 9, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Spencer Ericksen; Spencer Ericksen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    README doc, source, and data files from PubChem data mining project to identify OXPHOS inhibitory chemotypes.

  18. Data from: Resource Projects

    • resourceprojects.org
    • karissamonneymd.com
    zip
    Updated Jan 9, 2015
    Cite
    Natural Resource Governance Institute (2015). Resource Projects [Dataset]. https://resourceprojects.org
    Explore at:
    zip
    Dataset updated
    Jan 9, 2015
    Dataset provided by
    Natural Resource Governance Institute (https://resourcegovernance.org/)
    Time period covered
    Jan 1, 2014 - Present
    Area covered
    Worldwide
    Description

    Explore payments made by companies for extracting oil, gas and mining resources around the world.

  19. OSMRE Abandoned Mine Land Award-Winning Projects

    • catalog.data.gov
    • s.cnmilf.com
    Updated Dec 12, 2023
    Cite
    Office of Surface Mining, Reclamation and Enforcement (2023). OSMRE Abandoned Mine Land Award-Winning Projects [Dataset]. https://catalog.data.gov/dataset/osmre-abandoned-mine-land-award-winning-projects
    Explore at:
    Dataset updated
    Dec 12, 2023
    Dataset provided by
    Office of Surface Mining Reclamation and Enforcement (http://www.osmre.gov/)
    Description

    This layer displays the counties where award-winning AML projects are located along with information about the projects. Since project locations are not always known exactly, or may be near private residences, exact locations are not provided. This layer was created using information from OSMRE's website and joining it to US county boundaries. Agencies responsible for the reclamation projects are listed when known; the quality of the data improves after 1998. When available, hyperlinks to project descriptions and videos are included as well, generally for projects within the last 10 years.

  20. LScD (Leicester Scientific Dictionary)

    • figshare.le.ac.uk
    docx
    Updated Apr 15, 2020
    + more versions
    Cite
    Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
    Explore at:
    Available download formats: docx
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    LScD (Leicester Scientific Dictionary), April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    [Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus), Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we do not repeat the explanation here. After pre-processing, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are the same as those described for LScD Version 2 below.

    * Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
    ** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

    [Version 2] Getting Started
    This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). The dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from LSC, and instructions for using the code, are available in [2]. The code can also be used for lists of texts from other sources; amendments to the code may be required.

    LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.

    LScD is an ordered list of words from the texts of abstracts in LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing each word, in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:
    1. Unique words in abstracts
    2. Number of documents containing each word
    3. Number of appearances of a word in the entire corpus

    Processing the LSC
    Step 1. Downloading the LSC Online: Use of the LSC is subject to acceptance of a request for the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.
    Step 2. Importing the Corpus to R: The full R code for processing the corpus can be found on GitHub [2]. All following steps can be applied to an arbitrary list of texts from any source, with changes of parameters. The structure of the corpus, such as file format and the names (and positions) of fields, should be taken into account when applying the code. The organisation of the CSV files of LSC is described in the README file for LSC [1].
    Step 3. Extracting Abstracts and Saving Metadata: Metadata, which include all fields in a document except the abstract, are separated from the abstracts and saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
    Step 4. Text Pre-processing Steps on the Collection of Abstracts:
    1. Removing punctuation and special characters: all non-alphanumeric characters are substituted by a space. The character "-" is not substituted in this step, because words like "z-score", "non-payment" and "pre-processing" must be kept so as not to lose their actual meaning. Uniting prefixes with words is performed in later pre-processing steps.
    2. Lowercasing the text data: lowercasing is performed to avoid treating words like "Corpus", "corpus" and "CORPUS" differently. The entire collection of texts is converted to lowercase.
    3. Uniting prefixes of words: words containing prefixes joined with the character "-" are united into a single word. The list of prefixes united for this research is given in the file "list_of_prefixes.csv". Most prefixes are extracted from [4]; the commonly used prefixes 'e', 'extra', 'per', 'self' and 'ultra' were also added.
    4. Substitution of words: some words joined with "-" in the abstracts require an additional substitution step to avoid losing their meaning before the character "-" is removed. Examples are "z-test", "well-known" and "chi-square", which are substituted by "ztest", "wellknown" and "chisquare". Such words were identified by sampling abstracts from LSC. The full list of such words and the substitutions applied are presented in the file "list_of_substitution.csv".
    5. Removing the character "-": all remaining "-" characters are replaced by a space.
    6. Removing numbers: all digits not included in a word are replaced by a space. Words that contain both digits and letters are kept, because alphanumeric tokens such as chemical formulas may be important for the analysis; examples are "co2", "h2o" and "21st".
    7. Stemming: stemming converts inflected words to their word stem. This unites several forms of words with similar meaning into one form, and also saves memory and time [5]. All words in the LScD are stemmed to their word stem.
    8. Stop-word removal: stop words are words that are extremely common but provide little value in a language; common English examples are 'I', 'the' and 'a'. The 'tm' package in R is used to remove stop words [6]; the package lists 174 English stop words.
    Step 5. Writing the LScD into CSV Format: there are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file "LScD.csv".

    The Organisation of the LScD
    The total number of words in the file "LScD.csv" is 974,238. Each field is described below:
    Word: unique words from the corpus, in lowercase and stemmed form. The field is sorted by the number of documents containing each word, in descending order.
    Number of Documents Containing the Word: a binary count is used: if a word exists in an abstract, it counts as 1, even if the word appears more than once in that document. The total number of documents containing the word is the sum of these 1s over the entire corpus.
    Number of Appearances in Corpus: how many times a word occurs in the corpus when the corpus is considered as one large document.

    Instructions for R Code
    LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format:
    Metadata File: all fields in a document excluding abstracts (List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection).
    File of Abstracts: all abstracts after the pre-processing steps defined in Step 4.
    DTM: the Document Term Matrix constructed from the LSC [6]; each entry of the matrix is the number of times the word occurs in the corresponding document.
    LScD: an ordered list of words from LSC as defined in the previous section.

    To use the code:
    1. Download the folder 'LSC', 'list_of_prefixes.csv' and 'list_of_substitution.csv'.
    2. Open the LScD_Creation.R script.
    3. Change the parameters in the script: replace with the full path of the directory with source files and the full path of the directory for output files.
    4. Run the full code.

    References
    [1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
    [2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
    [3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
    [5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved Porter's stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
    [6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
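The LScD pipeline itself is written in R [2]. Purely as an illustrative sketch, the pre-processing and counting logic described above can be approximated in Python. Note that the prefix list, stop-word list, and suffix-stripping "stemmer" below are toy stand-ins invented for this example; they are not the project's actual "list_of_prefixes.csv", the tm package's 174 stop words, or its stemming algorithm, and the word-substitution step is omitted.

```python
import re
from collections import Counter

# Toy stand-ins for the real prefix and stop-word lists (hypothetical).
PREFIXES = {"pre", "non", "self", "ultra", "extra"}
STOP_WORDS = {"i", "the", "a", "an", "of", "and", "in"}

def crude_stem(word: str) -> str:
    """Toy stemmer: strips a few common English suffixes (not Porter)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(abstract: str) -> list[str]:
    # 1. Replace non-alphanumeric characters (except "-") with spaces.
    text = re.sub(r"[^\w\s-]", " ", abstract)
    # 2. Lowercase the whole text.
    text = text.lower()
    # 3. Unite listed prefixes joined with "-", e.g. "pre-processing" -> "preprocessing".
    text = re.sub(
        r"\b(" + "|".join(PREFIXES) + r")-(\w+)",
        lambda m: m.group(1) + m.group(2),
        text,
    )
    # 5. Replace remaining "-" characters with spaces.
    text = text.replace("-", " ")
    # 6. Drop standalone numbers but keep alphanumeric tokens like "co2".
    tokens = [t for t in text.split() if not t.isdigit()]
    # 7-8. Remove stop words, then stem (order swapped vs. the R pipeline for brevity).
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

def build_dictionary(abstracts: list[str]) -> list[tuple[str, int, int]]:
    """Return (word, n_documents_containing, n_appearances_in_corpus),
    sorted by document count descending, mirroring the LScD.csv fields."""
    doc_freq, corpus_freq = Counter(), Counter()
    for abstract in abstracts:
        words = preprocess(abstract)
        corpus_freq.update(words)
        doc_freq.update(set(words))  # binary per-document count
    return sorted(
        [(w, doc_freq[w], corpus_freq[w]) for w in doc_freq],
        key=lambda row: -row[1],
    )

abstracts = [
    "Pre-processing of 21 corpus texts and the z-score.",
    "Corpus pre-processing with CO2 measurements.",
]
top = {w: (d, c) for w, d, c in build_dictionary(abstracts)}
print(top["preprocess"])  # -> (2, 2): in both documents, twice in the corpus
```

The `set(words)` call is what implements the binary "Number of Documents Containing the Word" count: repeated occurrences within one abstract still contribute only 1 to the document count, while `corpus_freq` counts every occurrence.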
