License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Data supporting the Master thesis "Monitoring von Open Data Praktiken - Herausforderungen beim Auffinden von Datenpublikationen am Beispiel der Publikationen von Forschenden der TU Dresden" (Monitoring open data practices - challenges in finding data publications using the example of publications by researchers at TU Dresden) - Katharina Zinke, Institut für Bibliotheks- und Informationswissenschaften, Humboldt-Universität Berlin, 2023
This ZIP file contains the data the thesis is based on, interim exports of the results, and the R script with all pre-processing, data merging, and analyses carried out. The documentation of the additional, explorative analysis is also available. The actual PDFs and extracted text files of the scientific papers used are not included, as the papers are published open access.
The folder structure is shown below with the file names and a brief description of the contents of each file. For details on the analysis approach, please refer to the master's thesis (publication to follow soon).
## Data sources
Folder 01_SourceData/
- PLOS-Dataset_v2_Mar23.csv (PLOS-OSI dataset)
- ScopusSearch_ExportResults.csv (export of Scopus search results from Scopus)
- ScopusSearch_ExportResults.ris (export of Scopus search results from Scopus)
- Zotero_Export_ScopusSearch.csv (export of the file names and DOIs of the Scopus search results from Zotero)
## Automatic classification
Folder 02_AutomaticClassification/
- (NOT INCLUDED) PDFs folder (Folder for PDFs of all publications identified by the Scopus search, named AuthorLastName_Year_PublicationTitle_Title)
- (NOT INCLUDED) PDFs_to_text folder (Folder for all texts extracted from the PDFs by ODDPub, named AuthorLastName_Year_PublicationTitle_Title)
- PLOS_ScopusSearch_matched.csv (merge of the Scopus search results with the PLOS_OSI dataset for the files contained in both)
- oddpub_results_wDOIs.csv (results file of the ODDPub classification)
- PLOS_ODDPub.csv (merge of the results file of the ODDPub classification with the PLOS-OSI dataset for the publications contained in both)
## Manual coding
Folder 03_ManualCheck/
- CodeSheet_ManualCheck.txt (Code sheet with descriptions of the variables for manual coding)
- ManualCheck_2023-06-08.csv (Manual coding results file)
- PLOS_ODDPub_Manual.csv (Merge of the results file of the ODDPub and PLOS-OSI classification with the results file of the manual coding)
## Explorative analysis for the discoverability of open data
Folder 04_FurtherAnalyses/
- Proof_of_of_Concept_Open_Data_Monitoring.pdf (Description of the explorative analysis of the discoverability of open data publications using the example of a researcher; in German)
## R-Script
- Analyses_MA_OpenDataMonitoring.R (R script for preparing, merging, and analyzing the data and for running the ODDPub algorithm)
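The DOI-based merges listed above (e.g., PLOS_ODDPub.csv) are produced by the included R script; purely as an illustration, a minimal pandas sketch of that kind of merge could look like this (paths follow the listing above, and the "doi" column name is an assumption):

```python
# Illustration only: the actual merging is done in the included R script
# (Analyses_MA_OpenDataMonitoring.R). The "doi" column name is an assumption.
import pandas as pd

oddpub = pd.read_csv("02_AutomaticClassification/oddpub_results_wDOIs.csv")
plos = pd.read_csv("01_SourceData/PLOS-Dataset_v2_Mar23.csv")

# Keep only publications present in both sources, joined on their DOI.
merged = oddpub.merge(plos, on="doi", how="inner")
merged.to_csv("02_AutomaticClassification/PLOS_ODDPub.csv", index=False)
```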
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
This dataset includes all experimental data used for the PhD thesis of Cong Liu, entitled "Software Data Analytics: Architectural Model Discovery and Design Pattern Detection". These data were generated by instrumenting both synthetic and real-life software systems, and are formatted according to the IEEE XES format. See http://www.xes-standard.org/ and https://www.win.tue.nl/ieeetfpm/lib/exe/fetch.php?media=shared:downloads:2017-06-22-xes-software-event-v5-2.pdf for further explanation.
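As a minimal sketch (assuming the pm4py library is installed; the file name is a placeholder), such an XES log can be loaded and inspected as follows:

```python
# Minimal sketch: loading an IEEE XES event log with pm4py (assumed installed).
# Depending on the pm4py version, read_xes returns a pandas DataFrame or an EventLog.
import pm4py

log = pm4py.read_xes("software_events.xes")  # placeholder file name
dfg, start_activities, end_activities = pm4py.discover_dfg(log)  # directly-follows graph
print(len(dfg), "directly-follows relations between recorded events")
```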
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
Code for getting data, mining text, and estimating a VAR model.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Categorization of doctoral theses.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
State-trace data and source code accompanying Chapters 3, 4, and 5 and Appendix A of the dissertation "Automated abstraction of discrete-event simulation models using state-trace data".
This thesis lays the groundwork for enabling scalable data mining in massively parallel dataflow systems, using large datasets. Such datasets have become ubiquitous. We illustrate common fallacies with respect to scalable data mining: it is in no way sufficient to naively implement textbook algorithms on parallel systems; bottlenecks on all layers of the stack prevent the scalability of such naive implementations. We argue that scalability in data mining is a multi-leveled problem and must therefore be approached through the interplay of algorithms, systems, and applications. We therefore discuss a selection of scalability problems on these different levels.
We investigate algorithm-specific scalability aspects of collaborative filtering algorithms for computing recommendations, a popular data mining use case with many industry deployments. We show how to efficiently execute the two most common approaches, namely neighborhood methods and latent factor models, on MapReduce, and describe a specialized architecture for scaling collaborative filtering to extremely large datasets which we implemented at Twitter.
We then turn to system-specific scalability aspects, where we improve system performance during the distributed execution of a special class of iterative algorithms by drastically reducing the overhead required for guaranteeing fault tolerance. To this end, we propose a novel optimistic approach to fault tolerance which exploits the robust convergence properties of a large class of fixpoint algorithms and does not incur measurable overhead in failure-free cases.
Finally, we present work on an application-specific scalability aspect of scalable data mining. A common problem when deploying machine learning applications in real-world scenarios is that the prediction quality of ML models heavily depends on hyperparameters that have to be chosen in advance. We propose an algorithmic framework for an important subproblem occurring during hyperparameter search at scale: efficiently generating samples from block-partitioned matrices in a shared-nothing environment. For every selected problem, we show how to execute the resulting computation automatically in a parallel and scalable manner, and evaluate our proposed solution on large datasets with billions of data points.
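As a purely illustrative, single-machine sketch of the item-based neighborhood approach mentioned above (a toy numpy example, not the MapReduce implementation described in the thesis):

```python
# Toy item-based neighborhood recommendation on a small rating matrix.
# Illustrative only; the thesis executes such computations at scale on MapReduce.
import numpy as np

# Rows = users, columns = items; 0 means "not rated".
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)

# Cosine similarity between item columns.
norms = np.linalg.norm(R, axis=0)
norms[norms == 0] = 1.0            # avoid division by zero for unrated items
S = (R.T @ R) / np.outer(norms, norms)
np.fill_diagonal(S, 0.0)           # an item should not recommend itself

# Score items for each user by similarity-weighted ratings, mask seen items.
scores = R @ S
scores[R > 0] = -np.inf
print(scores.argmax(axis=1))       # best unseen item per user
```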
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Performance of the algorithm.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains information about the process execution of electronic invoicing. The electronic invoicing process includes activities such as invoice scanning, invoice approval, and liquidation. The dataset records the event name, event type, time of the event's execution, and the participant associated with the event. The data is formatted in the MXML format so that it can be used for process mining analysis with tools such as ProM.
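As a minimal sketch (the file name is hypothetical and the element names follow the standard MXML schema), such a log can be read with Python's standard library; in practice a process mining tool such as ProM would be used instead:

```python
# Minimal sketch for reading an MXML process log with the standard library.
# Adjust element lookups if the log declares an XML namespace.
import xml.etree.ElementTree as ET

tree = ET.parse("electronic_invoicing.mxml")     # hypothetical file name
for instance in tree.getroot().iter("ProcessInstance"):
    case_id = instance.get("id")
    for entry in instance.iter("AuditTrailEntry"):
        activity = entry.findtext("WorkflowModelElement")
        event_type = entry.findtext("EventType")   # e.g. start / complete
        timestamp = entry.findtext("Timestamp")
        originator = entry.findtext("Originator")  # the participant
        print(case_id, activity, event_type, timestamp, originator)
```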
License: https://borealisdata.ca/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7939/DVN/10950
Mine-level copper data (1953-1984) used in Young, D. (1992), "Cost Specification and Firm Behaviour in a Hotelling Model of Resource Extraction," Canadian Journal of Economics XXV, 41-59. The spreadsheet has five tabs (including data and explanatory materials).
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Previous works comparative table.
License: https://doi.org/10.4121/resource:terms_of_use
Label Ranking datasets used in the PhD thesis "Pattern Mining for Label Ranking"
This file is in Excel (xls) format and contains regression model data for input and output parameters (constants) that can be used to solve real-world vehicle routing problems with realistic, non-standard constraints. All data are real and were obtained experimentally by applying a VRP algorithm in the production environment of one of the biggest distribution companies in Bosnia and Herzegovina.
License: Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
During my senior year at Shandong University, my tutor gave me the research direction for my thesis: Bitcoin transaction data analysis. I therefore crawled all Bitcoin transaction data from January 2009 to February 2018 and performed statistical and quantitative analyses. I hope this data will be of some help; data mining is interesting and useful, not only as a skill but also in everyday life.
I crawled these data from https://www.blockchain.com/explorer. Each file contains many blocks, and the range of blocks is reflected in the file name; for example, the file 0-68732.csv covers the genesis block (block 0) through block 68732. Blocks without inputs are not included in the file. There are five columns: Height (the block height), Input (the input addresses of the block), Output (the output addresses of the block), Sum (the Bitcoin transaction amount corresponding to the Output), and Time (the generation time of the block). A block contains many transactions.
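As a minimal sketch (assuming pandas is installed and the CSV files have a header row), one of the block files can be loaded and summarized as follows:

```python
# Minimal sketch: loading one block file with pandas; column names follow the
# description above and a header row is assumed.
import pandas as pd

df = pd.read_csv("0-68732.csv")                  # example file covering blocks 0-68732
print(df.columns.tolist())                       # expected: Height, Input, Output, Sum, Time
print(df["Height"].nunique(), "blocks in this file")
print(df.groupby("Height")["Sum"].sum().head())  # total transferred amount per block
```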
This page contains part three of the data; the other parts can be found at https://www.kaggle.com/shiheyingzhe/datasets
The future is shaped and influenced by decisions made today. These decisions need to be made on solid ground, and diverse information sources should be considered in the decision process. For exploring different futures, foresight offers a wide range of methods for gaining insights. The starting point of this thesis is the observation that recent foresight methods mainly use patent and publication data or rely on expert opinion, while few other data sources are used. In times of big data, many other options exist; for example, social media and websites are currently not a major part of these deliberations. While the volume of data from heterogeneous sources grows considerably, foresight and its methods rarely benefit from such available data. One attempt to access and systematically examine this data is text mining, which processes textual data in a largely automated manner. Therefore, this thesis addresses the contribution of text mining and further textual data sources to foresight and its methods. After clarifying the potential of combining text mining and foresight, four concrete examples are outlined. As the results show, existing foresight methods are improved, as exemplified by roadmapping and scenario development. By exploiting new data sources (e.g., Twitter and web mining), new options evolve for analyzing data. Thus, more actors and views are integrated, and more emphasis is laid on analyzing social changes. In summary, using text mining enhances the detection and examination of emerging topics and technologies by extending the knowledge base of foresight. Hence, new foresight applications can be designed. In particular, text mining is promising for explorative approaches that require a solid base for reflecting on possible futures.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Mapping vectors to words.
License: https://doi.org/10.4121/resource:terms_of_use
The Beyoglu Preservation Area Building Features Database. A large and quite comprehensive GIS database was constructed in order to implement the data mining analysis, based mainly on the traditional thematic maps of the Master Plan for the Beyoğlu Preservation Area. This database consists of 45 spatial and non-spatial features attributed to the 11,984 buildings located in the Beyoğlu Preservation Area and it is one of the original outputs of the PhD Thesis entitled "A Knowledge Discovery Approach to Urban Analysis: The Beyoglu Preservation Area as a data mine".
License: Public Domain Mark 1.0, https://creativecommons.org/publicdomain/mark/1.0/
As research communities expand, the number of scientific articles continues to grow rapidly, with no signs of slowing. This information overload drives the need for automated tools to identify relevant materials and extract key ideas. Information extraction (IE) focuses on converting unstructured scientific text into structured knowledge (e.g., ontologies, taxonomies, and knowledge graphs), enabling intelligent systems to excel in tasks like document organization, scientific literature retrieval and recommendation, claim verification, and even novel idea or hypothesis generation. To pinpoint its scope, this thesis focuses on taxonomic structures to represent knowledge in the scientific domain.
To construct a taxonomy from scientific corpora, traditional methods often rely on pipeline frameworks. These frameworks typically follow a sequence: first, extracting scientific concepts or entities from the corpus; second, identifying hierarchical relationships between the concepts; and finally, organizing these relationships into a cohesive taxonomy. However, such methods encounter several challenges: (1) the quality of the corpus or annotation data, (2) error propagation within the pipeline framework, and (3) limited generalization and transferability to other specific domains. The development of large language models (LLMs) offers promising advancements, as these models have demonstrated remarkable abilities to internalize knowledge and respond effectively to a wide range of inquiries. Unlike traditional pipeline-based approaches, generative methods harness LLMs to achieve (1) better utilization of their internalized knowledge, (2) direct text-to-knowledge conversion, and (3) flexible, schema-free adaptability.
This thesis explores innovative methods for integrating text generation technologies to improve IE in the scientific domain, with a focus on taxonomy construction. The approach begins with generating entity names and evolves to create or enrich taxonomies directly via text generation. I will explore combining neighborhood structural context, descriptive textual information, and LLMs' internal knowledge to improve output quality. Finally, this thesis will outline future research directions.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Key indicators.
These geospatial files are the essential components of the Geologic Map of the Stibnite Mining Area in Valley County, Idaho, which was published by the Idaho Geological Survey in 2022. Three main file types are in this dataset: geographic, geologic, and mining. Geographic files are map extent, lidar base, topographic contours, labels for contours, waterways, and roads. Geologic files are geologic map units, faults, structural lines (axial traces), structural points (such as bedding strike and dip locations), cross-section lines, and drill core sample locations. Lastly, mining files are disturbed-ground features, including open pit polygons or outlines, and general mining features such as the location of an adit. File formats are shape, layer, or raster. Of the 14 shapefiles, 7 have layer files that provide pre-set symbolization for use in ESRI ArcMap, matching the Geologic Map of the Stibnite Mining Area in Valley County, Idaho. The lidar data have two similar, but distinct, raster format types (ESRI GRID and TIFF) intended to increase end-user accessibility. This dataset is a compilation of both legacy data (from Smitherman's 1985 master's thesis published in 1988, Midas Gold Corporation employees, the Geologic Map of the Stibnite Quadrangle (Stewart and others, 2016), and Reed S. Lewis of the Idaho Geological Survey) and new data from 2013, 2015, and 2016 field work by Niki E. Wintzer.
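A minimal sketch for inspecting the vector and raster components, assuming geopandas and rasterio are installed; the file names below are hypothetical:

```python
# Minimal sketch for inspecting the shapefiles and the TIFF lidar raster.
# The ESRI layer (.lyr) symbolization files are only usable in ArcMap/ArcGIS.
import geopandas as gpd
import rasterio

faults = gpd.read_file("faults.shp")              # one of the 14 shapefiles
print(faults.crs, len(faults), faults.geometry.geom_type.unique())

with rasterio.open("lidar_base.tif") as src:      # the TIFF variant of the lidar base
    print(src.crs, src.res, src.read(1).shape)
```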
License: Database Contents License (DbCL) v1.0, http://opendatacommons.org/licenses/dbcl/1.0/
This dataset comes from a personal project that began with my MSc thesis in Data Mining at Buenos Aires University, where I detected slums and informal settlements in the La Matanza district (Buenos Aires). The algorithm developed there reduces the territory that needs to be analyzed to 15% of the total.
After successfully finishing the thesis, I created a map of slums for the whole of Argentina. The map and thesis content are available at fedebayle.github.io/potencialesvya.
As far as I know, this is the first research of its kind in Argentina, which I think could help my country contribute to UN Millennium Development Goal 7, Target 11, "Improving the lives of 100 million slum dwellers".
This dataset contains georeferenced images of urban slums and informal settlements for two districts in Argentina: Buenos Aires and Córdoba (~15 million inhabitants).
The image of Córdoba was taken on 2017-06-09 and the images of Buenos Aires on 2017-05-04.
Each image comes from the Sentinel-2 sensor, is 32x32 px, and has 4 bands (bands 2, 3, 4, 8A; 10-meter resolution). Images whose file name is prefixed with "vya_" contain slums (positive class). Sentinel-2 is an Earth observation mission developed by ESA as part of the Copernicus Programme to perform terrestrial observations in support of services such as forest monitoring, land cover change detection, and natural disaster management.
Images are in .tif format.
Image names consist of:
(vya_)[tile id]_[raster row start]_[raster row end]_[raster column start]_[raster column end].tif
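A minimal sketch for parsing these file names and loading an image, assuming rasterio is installed; the example file name below is made up:

```python
# Minimal sketch: parse the naming convention above and load one image.
import re
import rasterio

PATTERN = re.compile(
    r"^(?P<label>vya_)?(?P<tile>.+)_(?P<row_start>\d+)_(?P<row_end>\d+)"
    r"_(?P<col_start>\d+)_(?P<col_end>\d+)\.tif$"
)

name = "vya_21HUB_128_160_256_288.tif"   # hypothetical example file
m = PATTERN.match(name)
is_slum = m.group("label") is not None   # "vya_" prefix marks the positive class

with rasterio.open(name) as src:
    bands = src.read()                   # expected shape: (4, 32, 32) per the description
print(is_slum, m.group("tile"), bands.shape)
```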
This is a highly imbalanced classification problem.
I would not have been able to create this dataset if the Sentinel program did not exist. Thanks to the European Space Agency!
The cost of conducting a survey of informal settlements and slums is high and requires copious logistical resources. In Argentina, these surveys have been conducted only every 10 years, as part of the census.
Algorithms developed with this data could be used in different countries and help to fight poverty around the world.