The Human Know-How Dataset describes 211,696 human activities from many different domains. These activities are decomposed into 2,609,236 entities (each with an English textual label). These entities represent over two million actions and half a million pre-requisites. Actions are interconnected both according to their dependencies (temporal/logical orders between actions) and decompositions (decomposition of complex actions into simpler ones). This dataset has been integrated with DBpedia (259,568 links).

For more information see:
- The project website: http://homepages.inf.ed.ac.uk/s1054760/prohow/index.htm
- The data is also available on datahub: https://datahub.io/dataset/human-activities-and-instructions

* Quickstart: if you want to experiment with the highest-quality data before downloading all the datasets, download the file '9of11_knowhow_wikihow', and optionally the files 'Process - Inputs', 'Process - Outputs', 'Process - Step Links' and 'wikiHow categories hierarchy'.
* Data representation is based on the PROHOW vocabulary: http://w3id.org/prohow# Data extracted from existing web resources is linked to the original resources using the Open Annotation specification.
* Data Model: an example of how the data is represented within the datasets is available in the attached Data Model PDF file. The attached example represents a simple set of instructions, but instructions in the dataset can have more complex structures. For example, instructions could have multiple methods, steps could have further sub-steps, and complex requirements could be decomposed into sub-requirements.

Statistics:
* 211,696: number of instructions. From wikiHow: 167,232 (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow). From Snapguide: 44,464 (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide).
* 2,609,236: number of RDF nodes within the instructions. From wikiHow: 1,871,468 (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow). From Snapguide: 737,768 (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide).
* 255,101: number of process inputs linked to 8,453 distinct DBpedia concepts (dataset Process - Inputs)
* 4,467: number of process outputs linked to 3,439 distinct DBpedia concepts (dataset Process - Outputs)
* 376,795: number of step links between 114,166 different sets of instructions (dataset Process - Step Links)
https://creativecommons.org/publicdomain/zero/1.0/
By [source]
The WikiTableQuestions dataset poses complex questions about the contents of semi-structured Wikipedia tables. Beyond merely testing a model's knowledge retrieval capabilities, these questions require an understanding of both the natural language used and the structure of the table itself in order to provide a correct answer. This makes the dataset an excellent testing ground for AI models that aim to replicate or exceed human-level intelligence
In order to use the WikiTableQuestions dataset, you will need to first understand the structure of the dataset. The dataset is comprised of two types of files: questions and answers. The questions are in natural language, and are designed to test a model's ability to understand the table structure, understand the natural language question, and reason about the answer. The answers are in a list format, and provide additional information about each table that can be used to answer the questions.
To start working with the WikiTableQuestions dataset, you will need to download both the questions and answers files. Once you have downloaded both files, you can begin working with the dataset by loading it into a pandas dataframe. From there, you can begin exploring the data and developing your own models for answering the questions.
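As a minimal sketch (the file names below are placeholders; the exact layout of the question and answer files in this upload may differ), the numbered CSV files listed further below can be loaded with pandas:

```python
import pandas as pd

# Load one of the semi-structured Wikipedia tables shipped with the dataset.
table = pd.read_csv("0.csv")  # e.g. the file listed below as "File: 0.csv"
print(table.head())

# The question file is assumed here to be a CSV holding the question text and the
# identifier of the table it refers to; adjust the path and columns to the actual files.
questions = pd.read_csv("questions.csv")  # placeholder name
print(questions.head())
```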
Happy Kaggling!
The WikiTableQuestions dataset can be used to train a model to answer complex questions about semi-structured Wikipedia tables.
The WikiTableQuestions dataset can be used to train a model to understand the structure of semi-structured Wikipedia tables.
The WikiTableQuestions dataset can be used to train a model to understand the natural language questions and reason about the answers
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: 0.csv
File: 1.csv
File: 10.csv
File: 11.csv
File: 12.csv
File: 14.csv
File: 15.csv
File: 17.csv
File: 18.csv
If you use this dataset in your research, please credit the original authors.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To increase the accessibility and diversity of easy-reading material in Slovenian and to create a prototype system that automatically simplifies Slovenian texts, we prepared a dataset for the Slovenian language that contains aligned simple and complex sentences, which can be used for further development of text simplification models for Slovenian. The dataset is a .json file that usually contains one complex sentence ("kompleksni") and one simplified sentence ("enostavni") per row. However, if a complex sentence carries a lot of information, we translated it into more than one simplified sentence. Conversely, several complex sentences can be translated into a single simplified sentence when information spread across multiple complex sentences is summarised into one simplified one.
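A minimal sketch of reading such a file in Python, assuming (the exact layout is not specified here) that the .json file holds a list of objects keyed by "kompleksni" and "enostavni"; the file name is a placeholder:

```python
import json

# Placeholder file name; replace with the actual dataset file.
with open("aligned_sentences.json", encoding="utf-8") as f:
    rows = json.load(f)  # assumed: a list of {"kompleksni": ..., "enostavni": ...} entries

for row in rows[:3]:
    print("complex:   ", row["kompleksni"])   # complex sentence
    print("simplified:", row["enostavni"])    # simplified sentence
```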
Scientific and related management challenges in the water domain require synthesis of data from multiple domains. Many data analysis tasks are difficult because datasets are large and complex; standard formats for data types are not always agreed upon nor mapped to an efficient structure for analysis; water scientists may lack training in methods needed to efficiently tackle large and complex datasets; and available tools can make it difficult to share, collaborate around, and reproduce scientific work. Overcoming these barriers to accessing, organizing, and preparing datasets for analyses will be an enabler for transforming scientific inquiries. Building on the HydroShare repository’s established cyberinfrastructure, we have advanced two packages for the Python language that make data loading, organization, and curation for analysis easier, reducing time spent in choosing appropriate data structures and writing code to ingest data. These packages enable automated retrieval of data from HydroShare and the USGS’s National Water Information System (NWIS), loading of data into performant structures keyed to specific scientific data types and that integrate with existing visualization, analysis, and data science capabilities available in Python, and then writing analysis results back to HydroShare for sharing and eventual publication. These capabilities reduce the technical burden for scientists associated with creating a computational environment for executing analyses by installing and maintaining the packages within CUAHSI’s HydroShare-linked JupyterHub server. HydroShare users can leverage these tools to build, share, and publish more reproducible scientific workflows. The HydroShare Python Client and USGS NWIS Data Retrieval packages can be installed within a Python environment on any computer running Microsoft Windows, Apple MacOS, or Linux from the Python Package Index using the pip utility. They can also be used online via the CUAHSI JupyterHub server (https://jupyterhub.cuahsi.org/) or other Python notebook environments like Google Colaboratory (https://colab.research.google.com/). Source code, documentation, and examples for the software are freely available in GitHub at https://github.com/hydroshare/hsclient/ and https://github.com/USGS-python/dataretrieval.
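As an illustrative sketch (not code from this presentation), both packages can be installed from the Python Package Index and NWIS data pulled into a pandas DataFrame along these lines; the gage number and dates are placeholders:

```python
# pip install hsclient dataretrieval
import dataretrieval.nwis as nwis

# Retrieve daily-value records for a placeholder USGS gage and time period.
df = nwis.get_record(sites="03339000", service="dv",
                     start="2020-01-01", end="2020-12-31")
print(df.head())

# The HydroShare Python Client (from hsclient import HydroShare) can then be used
# to write analysis results back to a HydroShare resource; see the GitHub links above.
```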
This presentation was delivered as part of the Hawai'i Data Science Institute's regular seminar series: https://datascience.hawaii.edu/event/data-science-and-analytics-for-water/
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This is the line version of the Student Messy Handwritten Dataset (SMHD) (Nisa, Hiqmat; Thom, James; Ciesielski, Vic; Tennakoon, Ruwan (2023). Student Messy Handwritten Dataset (SMHD). RMIT University. Dataset. https://doi.org/10.25439/rmt.24312715.v1).
Within the central repository, there are subfolders for each document converted into lines. All images are in .png format. In the main folder there are three .txt files.
1) SMHD.txt contains all the line-level transcriptions in the form:
image name, threshold value, label
0001-000,178 Bombay Phenotype :-
2) SMHD-Cross-outsandInsertions.txt contains all the line images from the dataset that have crossed-out and inserted text.
3) Class_Notes_SMHD.txt contains more complex cases with cross-outs, insertions and overwriting. This can be used as a test set. The images in this file are not included in SMHD.txt.
In the transcription files, any crossed-out content is denoted by the '#' symbol, facilitating the easy identification of files with or without such modifications.
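A small parsing sketch, assuming (the exact delimiter convention is not stated above) that the image name and threshold are comma-separated and the label follows the threshold after the first space, as in the example line shown:

```python
def parse_smhd_line(line: str):
    """Parse one SMHD.txt entry of the assumed form 'image name,threshold label'."""
    image_name, rest = line.rstrip("\n").split(",", 1)
    threshold, label = rest.split(" ", 1)
    crossed_out = "#" in label  # '#' marks crossed-out content in the transcriptions
    return image_name, int(threshold), label, crossed_out

# Example entry from the description above:
print(parse_smhd_line("0001-000,178 Bombay Phenotype :-"))
# -> ('0001-000', 178, 'Bombay Phenotype :-', False)
```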
We have incorporated contributions from more than 500 students to construct the dataset. Handwritten examination papers are primary sources in academic institutes to assess student learning. In our experience as academics, we have found that student examination papers tend to be messy with all kinds of insertions and corrections and would thus be a great source of documents for investigating HTR in the wild. Unfortunately, student examination papers are not available due to ethical considerations. So, we created an exam-like situation to collect handwritten samples from students. The corpus of the collected data is academic-based. Usually, in academia, handwritten papers have lines in them. For this purpose, we drew lines using light colors on white paper. The height of a line is 1.5 pt and the space between two lines is 40 pt. The filled handwritten documents were scanned at a resolution of 300 dpi at a grey-level resolution of 8 bits.
In the second exercise, we asked participants to write an essay on a topic from a given list, or on any topic of their choice. We called it the essay-based dataset. This dataset was collected from 250 high school students, who were given 30 minutes to think about the topic and write.
In the third exercise, we selected participants from different subjects and asked them to write on a topic from their current study. We called it the subject-based dataset. For this study, we used undergraduate students from different subjects, including 33 students from Mathematics, 71 from Biological Sciences, 24 from Environmental Sciences, 17 from Physics, and more than 84 from English studies.
Finally, for the class-notes dataset, we collected class notes from almost 31 students on the same topic. We asked students to take notes of every possible sentence the speaker delivered during the lecture. After the lesson, which lasted roughly 10 minutes, we asked students to recheck their notes and compare them with their classmates'. We did not impose any time restrictions for rechecking. We observed more cross-outs and corrections in the class notes compared to the summary-based and academic-based collections.
In all four exercises, we did not impose any rules on the writers regarding, for example, spacing or the use of a particular pen. We asked them to cross out text that seemed inappropriate. Although writers usually made corrections during a second read, we also gave an extra 5 minutes for corrections.
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset contains the text of a remarkable collection of short stories known as the TinyStories Corpus. With over 2,000 annotated stories, it covers an array of diverse styles and genres from multiple sources. The corpus is enriched by annotations across each narrative, making it a valuable resource for narrative text classification. The text field in each row contains the full story, which can be used to identify plots, characters and other features associated with story-telling techniques. Through this collection of stories, users gain extensive insight into a wide range of narratives that can be used to build powerful machine learning models for narrative text classification.
In this dataset, each row contains a short story along with its associated labels for narrative text classification tasks. The data consists of the following:
- text: the story text itself (string)
- validation.csv: contains a set of short stories for validation (dataframe)
- train.csv: contains the text of short stories used for narrative text classification (dataframe)
The data contained in both files can be used for various types of machine learning tasks related to narrative text classification. These include but are not limited to experiments such as determining story genres, predicting user reactions, sentiment analysis etc.
To start working with this dataset, download both the validation and train CSV files from the Kaggle datasets page and save them to your computer or local environment. Once downloaded, you may need to preprocess both files by cleaning up wrongly formatted values or duplicate entries, as these can noticeably affect the accuracy of your results.
Next, load the two files into pandas dataframes so they can be manipulated and analyzed with common Natural Language Processing (NLP) tools. This only takes a few lines of code using pandas functions such as read_csv() and concat(), after which the data can be used in a Jupyter notebook or with machine learning frameworks such as scikit-learn for more complex tasks, as sketched below.
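For instance, a minimal loading sketch (assuming train.csv and validation.csv sit in the working directory) looks like this:

```python
import pandas as pd

train = pd.read_csv("train.csv")            # column: text
validation = pd.read_csv("validation.csv")  # column: text

# Combine the two splits and inspect a few stories.
stories = pd.concat([train, validation], ignore_index=True)
print(stories["text"].head())
print("number of stories:", len(stories))
```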
With everything loaded correctly, you can start exploring potential connections between different narratives or character traits using supervised machine learning models such as a Naive Bayes classifier, and make discoveries and predictions from this richly annotated TinyStories corpus.
- Creating a text classification algorithm to automatically categorize short stories by genre.
- Developing an AI-based summarization tool to quickly summarize the main points in a story.
- Developing an AI-based story generator that can generate new stories based on existing ones in the dataset
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv

| Column name | Description |
|:------------|:--------------------------------|
| text        | The text of the story. (String) |
File: train.csv | Column name | Description | |:--------------|:----------------------------...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
While conversing with chatbots, humans typically tend to ask many questions, a significant portion of which can be answered by referring to large-scale knowledge graphs (KG). While Question Answering (QA) and dialog systems have been studied independently, there is a need to study them closely to evaluate such real-world scenarios faced by bots involving both these tasks. Towards this end, we introduce the task of Complex Sequential QA which combines the two tasks of (i) answering factual questions through complex inferencing over a realistic-sized KG of millions of entities, and (ii) learning to converse through a series of coherently linked QA pairs. Through a labor intensive semi-automatic process, involving in-house and crowdsourced workers, we created a dataset containing around 200K dialogs with a total of 1.6M turns. Further, unlike existing large scale QA datasets which contain simple questions that can be answered from a single tuple, the questions in our dialogs require a larger subgraph of the KG. Specifically, our dataset has questions which require logical, quantitative, and comparative reasoning as well as their combinations. This calls for models which can: (i) parse complex natural language questions, (ii) use conversation context to resolve coreferences and ellipsis in utterances, (iii) ask for clarifications for ambiguous queries, and finally (iv) retrieve relevant subgraphs of the KG to answer such questions. However, our experiments with a combination of state of the art dialog and QA models show that they clearly do not achieve the above objectives and are inadequate for dealing with such complex real world settings. We believe that this new dataset coupled with the limitations of existing models as reported in this paper should encourage further research in Complex Sequential QA.
Please visit https://amritasaha1812.github.io/CSQA/ for more details.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the rapid development of deep learning techniques, the generation and counterfeiting of multimedia material are becoming increasingly straightforward to perform. At the same time, sharing fake content on the web has become so simple that malicious users can create unpleasant situations with minimal effort. Also, forged media are getting more and more complex, with manipulated videos (e.g., deepfakes where both the visual and audio contents can be counterfeited) that are taking the scene over still images. The multimedia forensic community has addressed the possible threats that this situation could imply by developing detectors that verify the authenticity of multimedia objects. However, the vast majority of these tools only analyze one modality at a time. This was not a problem as long as still images were considered the most widely edited media, but now, since manipulated videos are becoming customary, performing monomodal analyses could be reductive. Nonetheless, there is a lack in the literature regarding multimodal detectors (systems that consider both audio and video components). This is due to the difficulty of developing them but also to the scarcity of datasets containing forged multimodal data to train and test the designed algorithms.
In this paper we focus on the generation of an audio-visual deepfake dataset. First, we present a general pipeline for synthesizing speech deepfake content from a given real or fake video, facilitating the creation of counterfeit multimodal material. The proposed method uses Text-to-Speech (TTS) and Dynamic Time Warping (DTW) techniques to achieve realistic speech tracks. Then, we use the pipeline to generate and release TIMIT-TTS, a synthetic speech dataset containing the most cutting-edge methods in the TTS field. This can be used as a standalone audio dataset, or combined with DeepfakeTIMIT and VidTIMIT video datasets to perform multimodal research. Finally, we present numerous experiments to benchmark the proposed dataset in both monomodal (i.e., audio) and multimodal (i.e., audio and video) conditions. This highlights the need for multimodal forensic detectors and more multimodal deepfake data.
For the initial version of TIMIT-TTS v1.0, see:
Arxiv: https://arxiv.org/abs/2209.08000
TIMIT-TTS Database v1.0: https://zenodo.org/record/6560159
As hydrocarbon production from hydraulic fracturing and other methods produces large volumes of water, innovative methods must be explored for treatment and reuse of these waters. However, understanding the general water chemistry of these fluids is essential to providing the best treatment options optimized for each producing area. Machine learning algorithms can often be applied to datasets to solve complex problems. In this study, we used the U.S. Geological Survey’s National Produced Waters Geochemical Database (USGS PWGD) in an exploratory exercise to determine if systematic variations exist between produced waters and geologic environment that could be used to accurately classify a water sample to a given geologic province. Two datasets were used, one with fewer attributes (n = 7) but more samples (n = 58,541) named PWGD7, and another with more attributes (n = 9) but fewer samples (n = 33,271) named PWGD9. The attributes of interest were specific gravity, pH, HCO3, Na, Mg, Ca, Cl, SO4, and total dissolved solids. The two datasets, PWGD7 and PWGD9, contained samples from 20 and 19 geologic provinces, respectively. Outliers across all attributes for each province were removed at a 99% confidence interval. Both datasets were divided into a training and test set using an 80/20 split and a 90/10 split, respectively. Random forest, Naïve Bayes, and k-Nearest Neighbors algorithms were applied to the two different training datasets and used to predict on three different testing datasets. Overall model accuracies across the two datasets and three applied models ranged from 23.5% to 73.5%. A random forest algorithm (split rule = extratrees, mtry = 5) performed best on both datasets, producing an accuracy of 67.1% for a training set based on the PWGD7 dataset, and 73.5% for a training set based on the PWGD9 dataset. Overall, the three algorithms predicted more accurately on the PWGD7 dataset than the PWGD9 dataset, suggesting that a larger sample size and/or fewer attributes lead to a more successful predictive algorithm. Individual balanced accuracies for each producing province ranged from 50.6% (Anadarko) to 100% (Raton) for PWGD7, and from 44.5% (Gulf Coast) to 99.8% (Sedgwick) for PWGD9. Results from testing the model on recently published data outside of the USGS PWGD suggest that some provinces may be lacking information about their true geochemical diversity while others included in this dataset are well described. Expanding on this effort could lead to predictive tools that provide ranges of contaminants or other chemicals of concern within each province to design future treatment facilities to reclaim wastewater. We anticipate that this classification model will be improved over time as more diverse data are added to the USGS PWGD.
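A rough illustration of this kind of classification workflow with scikit-learn (this is not the authors' code; the file and column names are placeholders for the attributes listed above):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder path and column names based on the attributes described above.
attributes = ["specific_gravity", "pH", "HCO3", "Na", "Mg", "Ca", "Cl", "SO4", "TDS"]
df = pd.read_csv("pwgd_samples.csv")
X, y = df[attributes], df["geologic_province"]

# 80/20 split and a random forest classifier, analogous in spirit to the study.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```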
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset contains valuable records of conversations between humans and AI-driven chatbots in real-world scenarios. It offers an opportunity to explore the nuances and intricacies of conversations between humans and machines, opening the door to interesting research directions in machine learning, artificial intelligence, natural language processing (NLP), and beyond. With this data, researchers can examine how well machines are able to simulate real conversational behavior such as nonverbal exchanges, intonations, humorous insights or even sarcasm. The data also provides an avenue for comparative studies between human behavior and AI capabilities in carrying out meaningful dialogues with humans. This knowledge base is valuable for those who aim to create AI systems that more closely imitate human speech patterns.
How to Use this Dataset
This dataset contains conversations between humans and AI-driven chatbots in real-world scenarios. With this dataset, you will be able to use the data to build an AI system that can respond intelligently in natural language conversations. For example, you can build a system with the ability to further engage users by replying with meaningful responses as the conversation progresses.
In order to get started, first familiarize yourself with the columns included in this dataset: 'chat' and 'system'. The column 'chat' contains conversations between humans and chatbot systems while the column 'system' contains responses from AI-driven chatbots.
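A minimal sketch for loading the two columns with pandas, assuming train.csv is in the working directory:

```python
import pandas as pd

df = pd.read_csv("train.csv")  # assumed local path to the training split

# Iterate over a few (human utterance, chatbot response) pairs.
for chat, system in zip(df["chat"].head(3), df["system"].head(3)):
    print("human:  ", chat)
    print("chatbot:", system)
```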
Once you understand what is included in the data set, it's time for you to start building your AI system! Depending on how complex or advanced your goal is, there are several different approaches that could be used when working with this data set such as supervised learning models like seq2seq network or unsupervised methods like autoencoders etc. To get more detailed information regarding those methods refer to external materials available online.
After training your model, test its performance. Feed sample text to the model through a web form or a command-line interface, and compare its responses with the expected chatbot responses stored in the 'system' column of the training data (see above). A correctly trained model should produce outputs that closely resemble instances from the training dataset. If higher accuracy is needed, tweak the model's parameters until the desired results are achieved.
- AI-driven natural language generation: Using this dataset, developers can train AI systems to automatically generate natural conversations between humans and machines.
- Automatic response selection: The data in the dataset could be used to train AI algorithms which select the most appropriate response in any given conversation.
- Evaluating human-machine interaction: Researchers can use this data to identify areas of improvement in conversational interactions between humans and machines, as well as evaluate various techniques for creating effective dialogue systems
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:------------|:---------------------------------------------------------|
| chat        | Contains dialogues uttered by the human. (String)         |
| system      | Contains responses from the AI-driven chatbot. (String)   |
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Do states copy or reinvent language from complex policies as they diffuse, and does this depend on legislative resources? We argue that states will more frequently reinvent more complex policies, but that states with high-resource legislatures will reinvent more than their low-resource counterparts for more complex policies. We test the theory using the bill texts from 18 policies that diffused across the 50 states from 1983-2014, measuring reinvention and complexity using text analysis tools. In line with expectations, we find that complex policies are reinvented more than simple policies and that high-resource legislatures reinvent bills more than low-resource legislatures on average. However, we also find that low-resource legislatures reinvent complex policies at about the same rate as high-resource legislatures. The results indicate that even legislatures with limited resources work to adapt complex policies during the diffusion process.
The dataset, acquired from WISDM Lab, consists of data collected from 36 different users performing six types of human activities (ascending and descending stairs, sitting, walking, jogging, and standing) for specific periods of time.
These data were acquired from accelerometers, which are capable of detecting the orientation of the device by measuring acceleration along three different dimensions. They were collected at a sample rate of 20 Hz (1 sample every 50 milliseconds), which is equivalent to 20 samples per second.
These time-series data can be used to perform various techniques, such as human activity recognition.
activity: the activity that the user was carrying out. It could be one of the six activities listed above: walking, jogging, ascending stairs, descending stairs, sitting, or standing.
timestamp: generally the phone's uptime in nanoseconds.
x-axis: The acceleration in the x direction as measured by the android phone's accelerometer.
Floating-point values between -20 and 20. A value of 10 = 1g = 9.81 m/s^2, and 0 = no acceleration.
The acceleration recorded includes gravitational acceleration toward the center of the Earth, so that when the phone is at rest on a flat surface the vertical axis registers approximately ±10.
y-axis: same as x-axis, but along y axis.
z-axis: same as x-axis, but along z axis.
Remember to upvote if you found the dataset useful :).
The data can be used to perform human activity prediction. I strongly suggest you take a look at this article if you want a reference for performing this task, keeping in mind that the given dataset has already been cleaned. In addition, you can try other feature engineering and selection techniques, and use more complex models for prediction.
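As an illustrative sketch (the column names are assumed from the field descriptions above and the path is a placeholder), the 20 Hz signal can be segmented into fixed-length windows and summarized with simple per-axis features before training a classifier:

```python
import pandas as pd

df = pd.read_csv("wisdm_cleaned.csv")  # placeholder path; assumed columns include
                                       # 'activity', 'x-axis', 'y-axis', 'z-axis'
WINDOW = 200  # 10 seconds at 20 Hz

rows = []
for start in range(0, len(df) - WINDOW, WINDOW):
    window = df.iloc[start:start + WINDOW]
    if window["activity"].nunique() != 1:
        continue  # skip windows spanning more than one activity
    row = {"activity": window["activity"].iloc[0]}
    for axis in ("x-axis", "y-axis", "z-axis"):
        row[f"{axis}_mean"] = window[axis].mean()
        row[f"{axis}_std"] = window[axis].std()
    rows.append(row)

features = pd.DataFrame(rows)
print(features.head())
```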
Data were fetched from the WISDM dataset website and cleaned by deleting missing values, replacing inconsistent strings, and converting the dataset to CSV.
Jeffrey W. Lockhart, Tony Pulickal, and Gary M. Weiss (2012). "Applications of Mobile Activity Recognition," Proceedings of the ACM UbiComp International Workshop on Situation, Activity, and Goal Awareness, Pittsburgh, PA.
Gary M. Weiss and Jeffrey W. Lockhart (2012). "The Impact of Personalization on Smartphone-Based Activity Recognition," Proceedings of the AAAI-12 Workshop on Activity Context Representation: Techniques and Languages, Toronto, CA.
Jennifer R. Kwapisz, Gary M. Weiss and Samuel A. Moore (2010). "Activity Recognition using Cell Phone Accelerometers," Proceedings of the Fourth International Workshop on Knowledge Discovery from Sensor Data (at KDD-10), Washington DC.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: Prognostic scores are important tools in oncology to facilitate clinical decision-making based on patient characteristics. To date, classic survival analysis using Cox proportional hazards regression has been employed in the development of these prognostic scores. With the advance of analytical models, this study aimed to determine if more complex machine-learning algorithms could outperform classical survival analysis methods.

Methods: In this benchmarking study, two datasets were used to develop and compare different prognostic models for overall survival in pan-cancer populations: a nationwide EHR-derived de-identified database for training and in-sample testing and the OAK (phase III clinical trial) dataset for out-of-sample testing. A real-world database comprised 136K first-line treated cancer patients across multiple cancer types and was split into a 90% training and 10% testing dataset, respectively. The OAK dataset comprised 1,187 patients diagnosed with non-small cell lung cancer. To assess the effect of the covariate number on prognostic performance, we formed three feature sets with 27, 44 and 88 covariates. In terms of methods, we benchmarked ROPRO, a prognostic score based on the Cox model, against eight complex machine-learning models: regularized Cox, Random Survival Forests (RSF), Gradient Boosting (GB), DeepSurv (DS), Autoencoder (AE) and Super Learner (SL). The C-index was used as the performance metric to compare different models.

Results: For in-sample testing on the real-world database the resulting C-index [95% CI] values for RSF 0.720 [0.716, 0.725], GB 0.722 [0.718, 0.727], DS 0.721 [0.717, 0.726] and lastly, SL 0.723 [0.718, 0.728] showed significantly better performance as compared to ROPRO 0.701 [0.696, 0.706]. Similar results were derived across all feature sets. However, for the out-of-sample validation on OAK, the stronger performance of the more complex models was not apparent anymore. Consistently, the increase in the number of prognostic covariates did not lead to an increase in model performance.

Discussion: The stronger performance of the more complex models did not generalize when applied to an out-of-sample dataset. We hypothesize that future research may benefit by adding multimodal data to exploit advantages of more complex models.
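For reference, the C-index used here as the performance metric can be computed, for example, with the lifelines package; the numbers below are made up for illustration and are not the study's code or data:

```python
from lifelines.utils import concordance_index

# Placeholder data: observed durations, model-predicted survival times,
# and event indicators (1 = event observed, 0 = censored).
durations = [5.0, 10.0, 2.0, 8.0, 12.0]
predicted = [6.0, 9.0, 3.0, 7.0, 11.0]
events    = [1, 0, 1, 1, 0]

print("C-index:", concordance_index(durations, predicted, events))
```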
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Agricultural Fields 2D and 3D Models Dataset
Introduction
This dataset was created to address the lack of comprehensive datasets in the literature that provide necessary information to evaluate and validate path planning approaches on both 2D and 3D surfaces of agricultural fields. It comprises 30 manually-selected agricultural fields located in France, chosen to cover a diverse range of shapes and sizes (from 1.83 to 13.21 hectares). The dataset includes simple shapes that do not require field decomposition and more complex shapes that necessitate field decomposition, ensuring a broad representation of real-world scenarios.
This dataset was initially produced to validate our Complete Coverage Path Planning approach, and we are pleased to make these data available for future research. In sharing this dataset, we kindly ask that users cite this dataset in any publications or presentations that make use of the data. This will help acknowledge our contribution and encourage further collaboration and research in this area.
Background
Agricultural field shapes result from a complex interplay of historical, geographic, and topographic factors, as well as cultural and economic practices. Fields in countries with a more recent history of land ownership and partitioning may have simpler shapes, while those with more complex histories may have irregular shapes. Geography and topography also influence field shapes, with fields in flat, open areas having simpler shapes than those in mountainous or hilly regions. This dataset focuses on French fields due to the variety of field shapes and the availability of high-precision elevation data from the French government.
Dataset Content
For each of the 30 agricultural fields, this dataset provides the following information in separate files:
Aerial image (PNG)
2D polygon (XML)
2D triangulated surface (PLY) with a grid resolution of 0.25 m
Elevation grid (PLY) with a grid resolution of 5 m
3D triangulated surface (PLY) with a grid resolution of 0.25 m
Set of 2D line segments representing access segments (XML)
Set of dividing lines for fields 20-30 to decompose them into sub-polygons in different ways
The result obtained by our "Advanced 3D Hybrid Path Planning with Multiple Objectives for complete coverage of agricultural field by wheeled robots" approach, which includes:
Way-points in a CSV file
An illustration of the result projected on the field surface
Note: All coordinates are represented in Cartesian coordinates with centimeter precision.
The table below provides links to the field data in the Géoportail platform and coordinates (longitude and latitude) of a point inside each field for all 30 fields. These links and coordinates can be used to access the data and for visualization purposes.
Field Link Lon / Lat
1 bit.ly/3FYtuKu 7.435° / 48.7732°
2 bit.ly/3WGAyRI 7.474° / 48.7825°
3 bit.ly/3zX1vqJ 2.9205° / 49.8115°
4 bit.ly/3DJL0PI 1.6713° / 47.9864°
5 bit.ly/3htb8H3 3.3216° / 50.6623°
6 bit.ly/3WGTfER 7.4311° / 48.8245°
7 bit.ly/3DP8vqG 2.4845° / 50.3106°
8 bit.ly/3NLmQJf 7.5924° / 48.831°
9 bit.ly/3EeTvUo 7.4641° / 48.8146°
10 bit.ly/3UOyTrv 1.3491° / 48.012°
11 bit.ly/3zW7v30 3.4701° / 46.652°
12 bit.ly/3UMC6I3 7.5742° / 48.8071°
13 bit.ly/3TjkOkA 3.578° / 46.7016°
14 bit.ly/3UAmdo0 7.4269° / 48.8194°
15 bit.ly/3GpjdXZ 3.5611° / 46.6875°
16 bit.ly/3tcGhRN 2.5127° / 48.2645°
17 bit.ly/3zW26sE 2.6443° / 48.2546°
18 bit.ly/3Trsqlq 7.9196° / 48.9513°
19 bit.ly/3DWHgKJ 2.1269° / 46.8124°
20 bit.ly/3NN8pnT 1.5874° / 47.1346°
21 bit.ly/3DShkA3 0.6254° / 49.191°
22 bit.ly/3zZK1dg 2.7067° / 50.3336°
23 bit.ly/3TmwcMC 7.4416° / 48.7223°
24 bit.ly/3E3l8OK 3.1021° / 48.2449°
25 bit.ly/3E0Raeq 1.6183° / 49.9655°
26 bit.ly/3tvN0Xg 3.5476° / 50.1441°
27 bit.ly/3A0tZ2D 3.6644° / 48.0046°
28 bit.ly/3fTlQGl 1.7086° / 47.2054°
29 bit.ly/3hBeLL2 1.6893° / 47.1421°
30 bit.ly/3Edm2cN 3.1018° / 48.5853°
Hybrid_CCPP_Result Subdirectory
Hybrid_CCPP_Result subdirectory contains the results of our path planning algorithm for complete coverage of agricultural fields by wheeled robots. The provided files include way-points in CSV format and an illustration of the result projected on the field surface.
Approach Parameters
The results were obtained under the following considerations: the driving direction step size ($\ell_s$), the spacing of the access segment discretization ($\ell_a$), and the spacing of the working trajectory discretization used for slope computation ($\ell_{slp}$) were set to $3°$, $0.5$ m, and $0.5$ m, respectively. The values of the other parameters are listed in the table below:
Parameter Description Value
$w$ working width 3m
$\gamma_{on}$ minimum turning radius - implement on 10m
$\gamma_{off}$ minimum turning radius - implement off 2.8m
$V_{on}$ average speed - implement on 4.5m/s
$V_{gap}$ average speed - implement transition 1.5m/s
$V_{off}$ average speed - implement off 3.5m/s
$\ell_t$ transition trajectory length 1.5m
$\ell_o$ robot-implement offset 1.0m
$\Delta_{mwd}$ minimum working distance threshold 3m
$p$ number of inner trajectories 2
$g$ number of gap-covering trajectories 1
$W_{cov}$ weight of $S_{cov}$ 0.30
$W_{ovl}$ weight of $S_{ovl}$ 0.15
$W_{nwd}$ weight of $S_{nwd}$ 0.10
$W_{otm}$ weight of $S_{otm}$ 0.10
$W_{slp}$ weight of $S_{slp}$ 0.35
$W_{s0}$ weight of $\ell_{s0}$ 0.00
$W_{s1}$ weight of $\ell_{s1}$ 0.10
$W_{s2}$ weight of $\ell_{s2}$ 0.15
$W_{s3}$ weight of $\ell_{s3}$ 0.20
$W_{s4}$ weight of $\ell_{s4}$ 0.25
$W_{s5}$ weight of $\ell_{s5}$ 0.30
For an in-depth understanding of these parameters, we kindly invite you to consult our published article:
Pour Arab, D., Spisser, M. & Essert, C. (2024) 3D hybrid path planning for optimized coverage of agricultural fields: a novel approach for wheeled robots. Journal of Field Robotics, 1–19. https://doi.org/10.1002/rob.22422
Way-point Structure
A way-point is represented by the following format:
Point X, Point Y, Point Z, Heading, Type, Move
where Heading is in radians, and Type and Move are according to the following structures:
enum WayPointType { WORKING = 1, TURN_OFF = 2, TURN_ON = 3, TRANSITION_OFF_TO_ON = 4, TRANSITION_ON_TO_OFF = 5 };
enum RobotMove { FORWARD = 1, REVERSE = -1 };
WayPointType
WORKING: The robot's implement must be on while driving through this point.
TURN_OFF: The robot is performing a turn while its implement is off and elevated from the ground.
TURN_ON: The robot is performing a turn while its implement is on.
TRANSITION_OFF_TO_ON: The robot is traveling a straight transition trajectory for turning on its implement.
TRANSITION_ON_TO_OFF: The robot is traveling a straight transition trajectory for turning off its implement.
RobotMove
FORWARD: The robot is moving forward.
REVERSE: The robot is moving in reverse.
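A small Python reading sketch (not shipped with the dataset; the file name is a placeholder) that maps the integer Type and Move codes to the enumerations above:

```python
import csv
from enum import IntEnum

class WayPointType(IntEnum):
    WORKING = 1
    TURN_OFF = 2
    TURN_ON = 3
    TRANSITION_OFF_TO_ON = 4
    TRANSITION_ON_TO_OFF = 5

class RobotMove(IntEnum):
    FORWARD = 1
    REVERSE = -1

def read_waypoints(path):
    """Read rows of the form 'Point X, Point Y, Point Z, Heading, Type, Move'."""
    waypoints = []
    with open(path, newline="") as f:
        for x, y, z, heading, wtype, move in csv.reader(f):
            waypoints.append((float(x), float(y), float(z), float(heading),
                              WayPointType(int(wtype)), RobotMove(int(move))))
    return waypoints

# Example (placeholder file name):
# waypoints = read_waypoints("field_01_waypoints.csv")
```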
Files
The CSV file contains the way-points generated by our proposed approach.
The SVG file provides an illustration of the result projected on the field surface.
Usage
This dataset is intended for researchers and developers working on path planning algorithms for agricultural applications. Users can leverage the data to evaluate and validate their path planning approaches in various scenarios, from simple to complex field shapes, and on both 2D and 3D surfaces.
Please ensure that you cite this dataset appropriately in any publications or presentations that make use of the data.
Batteries represent complex systems whose internal state variables are either inaccessible to sensors or hard to measure under operational conditions. This work exemplifies how more detailed model information and more sophisticated prediction techniques can improve both the accuracy as well as the residual uncertainty of the prediction in Prognostics and Health Management. The more dramatic performance improvement between various prediction techniques is in their ability to learn complex non-linear degradation behavior from the training data and discard any external noise disturbances. An algorithm that manages these sources of uncertainty well can yield higher confidence in predictions, expressed by narrower uncertainty bounds. We observed that the particle filter approach results in RUL distributions which have better precision (narrower pdfs) by several σs (if approximated as Gaussian) as compared to the other regression methods. However, PF requires a more complex implementation and computational overhead than the other methods. This illustrates the basic tradeoff between modeling and algorithm development versus prediction accuracy and precision. For situations like battery health management where the rate of capacity degradation is rather slow, one can rely on simple regression methods that tend to perform well as more data are accumulated and still predict far enough in advance to avoid any catastrophic failures. Techniques like GPR or even the baseline approach can offer a suitable platform in these situations by managing the uncertainty fairly well with much simpler implementations. Other data sets may allow much smaller prediction horizons and hence require precise techniques like particle filters. In this study, we conclude that there are several methods one could employ for battery health management applications. Based on end user requirements and available resources, a choice can be made between simple or more elegant techniques. The particle filter based approach emerges as the winner when accuracy and precision are considered more important than other requirements.
Our advanced data extraction tool is designed to empower businesses, researchers, and developers by providing an efficient and reliable way to collect and organize information from any online source. Whether you're gathering market insights, monitoring competitors, tracking trends, or building data-driven applications, our platform offers a perfect solution for automating the extraction and processing of structured data from websites. With seamless integration of AI, our tool takes the process a step further, enabling smarter, more refined data extraction that adapts to your needs over time.
In a digital world where information is continuously updated, timely access to data is critical. Our tool allows you to set up automated data extraction schedules, ensuring that you always have access to the most current information. Whether you're tracking stock prices, monitoring social media trends, or gathering product information, you can configure extraction schedules to suit your needs. Our AI-powered system also allows the tool to learn and optimize based on the data it collects, improving efficiency and accuracy with repeated use. From frequent updates by the minute to less frequent daily, weekly, or monthly collections, our platform handles it all seamlessly.
Our tool doesn’t just gather data—it organizes it. The extracted information is automatically structured into easily usable formats like CSV, JSON, or XML, making it ready for immediate use in applications, databases, or reports. We offer flexibility in the output format to ensure smooth integration with your existing tools and workflows. With AI-enhanced data parsing, the system recognizes and categorizes information more effectively, providing higher quality data for analysis, visualization, or importing into third-party systems.
Whether you’re collecting data from a handful of pages or millions, our system is built to scale. We can handle both small and large-scale extraction tasks with high reliability and performance. Our infrastructure ensures fast, efficient processing, even for the most demanding tasks. With parallel extraction capabilities, you can gather data from multiple sources simultaneously, reducing the time it takes to compile large datasets. AI-powered optimization further improves performance, making the extraction process faster and more adaptive to fluctuating data volumes.
Our tool doesn’t stop at extraction. We provide options for enriching the data by cross-referencing it with other sources or applying custom rules to transform raw information into more meaningful insights. This leads to a more insightful and actionable dataset, giving you a competitive edge through superior data-driven decision-making.
Modern websites often use dynamic content generated by JavaScript, which can be challenging to extract. Our tool, enhanced with AI, is designed to handle even the most complex web architectures, including dynamic loading, infinite scrolling, and paginated content.
Finally, our platform provides detailed logs of all extraction activities, giving you full visibility into the process. With built-in analytics, AI-powered insights can help you monitor progress, and identify issues.
In today’s fast-paced digital world, access to accurate, real-time data is critical for success. Our AI-integrated data extraction tool offers a reliable, flexible, and scalable solution to help you gather and organize the information you need with minimal effort. Whether you’re looking to gain a competitive edge, conduct in-depth research, or build sophisticated applications, our platform is designed to meet your needs and exceed expectations.
https://qdr.syr.edu/policies/qdr-standard-access-conditions
This is an Annotation for Transparent Inquiry (ATI) data project. The annotated article can be viewed on the Publisher's Website.

Data Generation

The research project engages a story about perceptions of fairness in criminal justice decisions. The specific focus involves a debate between ProPublica, a news organization, and Northpointe, the owner of a popular risk tool called COMPAS. ProPublica wrote that COMPAS was racist against blacks, while Northpointe posted online a reply rejecting such a finding. These two documents were the obvious foci of the qualitative analysis because of the further media attention they attracted, the confusion their competing conclusions caused readers, and the power both companies wield in public circles. There were no barriers to retrieval as both documents have been publicly available on their corporate websites. This public access was one of the motivators for choosing them as it meant that they were also easily attainable by the general public, thus extending the documents’ reach and impact. Additional materials from ProPublica relating to the main debate were also freely downloadable from its website and a third party, open source platform. Access to secondary source materials comprising additional writings from Northpointe representatives that could assist in understanding Northpointe’s main document, though, was more limited. Because of a claim of trade secrets on its tool and the underlying algorithm, it was more difficult to reach Northpointe’s other reports. Nonetheless, largely because its clients are governmental bodies with transparency and accountability obligations, some Northpointe-associated reports were retrievable from third parties who had obtained them, largely through Freedom of Information Act queries. Together, the primary and (retrievable) secondary sources allowed for a triangulation of themes, arguments, and conclusions. The quantitative component uses a dataset of over 7,000 individuals with information that was collected and compiled by ProPublica and made available to the public on github. ProPublica’s gathering the data directly from criminal justice officials via Freedom of Information Act requests rendered the dataset in the public domain, and thus no confidentiality issues are present. The dataset was loaded into SPSS v. 25 for data analysis.

Data Analysis

The qualitative enquiry used critical discourse analysis, which investigates ways in which parties in their communications attempt to create, legitimate, rationalize, and control mutual understandings of important issues. Each of the two main discourse documents was parsed on its own merit. Yet the project was also intertextual in studying how the discourses correspond with each other and to other relevant writings by the same authors.
Several more specific types of discursive strategies were of interest in attracting further critical examination:
- Testing claims and rationalizations that appear to serve the speaker's self-interest
- Examining conclusions and determining whether sufficient evidence supported them
- Revealing contradictions and/or inconsistencies within the same text and intertextually
- Assessing strategies underlying justifications and rationalizations used to promote a party's assertions and arguments
- Noticing strategic deployment of lexical phrasings, syntax, and rhetoric
- Judging sincerity of voice and the objective consideration of alternative perspectives

Of equal importance in a critical discourse analysis is consideration of what is not addressed, that is, to uncover facts and/or topics missing from the communication. For this project, this included parsing issues that were either briefly mentioned and then neglected, asserted yet the significance left unstated, or not suggested at all. This task required understanding common practices in the algorithmic data science literature. The paper could have been completed with just the critical discourse analysis. However, because one of the salient findings from it highlighted that the discourses overlooked numerous definitions of algorithmic fairness, the call to fill this gap seemed obvious. Then, the availability of the same dataset used by the parties in conflict made this opportunity more appealing. Calculating additional algorithmic equity equations would not thereby be troubled by irregularities because of diverse sample sets. New variables were created as relevant to calculate algorithmic fairness equations. In addition to using various SPSS Analyze functions (e.g., regression, crosstabs, means), online statistical calculators were useful to compute z-test comparisons of proportions and t-test comparisons of means.

Logic of Annotation

Annotations were employed to fulfil a variety of functions, including supplementing the main text with context, observations, counter-points, analysis, and source attributions. These fall under a few categories. Space considerations. Critical discourse analysis offers a rich method...
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
PLEASE NOTE: This record has been retired. It has been superseded by: https://environment.data.gov.uk/dataset/b5aaa28d-6eb9-460e-8d6f-43caa71fbe0e
This dataset is not suitable for identifying whether an individual property will flood. GIS layer showing the extent of flooding from surface water that could result from a flood with a 3.3% chance of happening in any given year. This dataset is one output of our Risk of Flooding from Surface Water (RoFSW) mapping, previously known as the updated Flood Map for Surface Water (uFMfSW). It is one of a group of datasets previously available as the uFMfSW Complex Package. Further information on using these datasets can be found at the Resource Locator link below. Information Warnings: Risk of Flooding from Surface Water is not to be used at property level. If the Content is displayed in map form to others, we recommend it should not be used with basemapping more detailed than 1:10,000, as the data is open to misinterpretation if used at a more detailed scale. Because of the way they have been produced and the fact that they are indicative, the maps are not appropriate to act as the sole evidence for any specific planning or regulatory decision or assessment of risk in relation to flooding at any scale without further supporting studies or evidence. No lineage recorded. Click here to go straight to the DSP Metadata Page for this Dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Data Extraction from Complex Documents: The model could be used to segment and extract data from complex documents such as financial statements, invoices or reports. Its ability to identify lines and headers could help in parsing data accurately.
Improvement of Accessibility Features: The model could be deployed in applications for visually impaired people, helping them understand text-based data represented in tables by recognizing and vocally relaying the content of each cell organized by lines and headers.
Automating Data Conversion: The model could be used for automating conversion of printed tables into digital format. It can help in scanning books, research papers or old documents and convert tables in them into easily editable and searchable digital format.
Intelligent Data Analysis Tools: It could be incorporated into a Data Analysis Software to pull out specific table data from a large number of documents, thus making the data analysis process more efficient.
Aid in Educational Settings: The model can be used in educational tools to recognize and interpret table data for online learning systems, making studying more interactive and efficient, especially in subjects where tables are commonly used such as Statistics, Economics, and Sciences.
https://spdx.org/licenses/etalab-2.0.html
This dataset contains the usage data of a single electric car collected as part of the EVE study (Enquête des Vehicles Electrique) run by the Observatoire du Transition Energétique Grenoble (OTE-UGA). This dataset includes the following variables for a single Renault ZOE 2014 Q90:
- Speed, distance covered, and other drivetrain variables;
- State of charge, state of health, and other battery characteristics; as well as
- External temperature variables.
The Renault ZOE 2014 Q90 has a battery capacity of 22 kWh and a maximum speed of 135 km/h. More information on the specifications can be found here. If you find this dataset useful or have any questions, please feel free to comment on the discussion dedicated to this dataset on the OTE forum. The electric car is used exclusively for personal use, including occasional commuting to work but mostly personal errands and trips. The dataset was collected using the CanZE app and a generic car-lighter dongle. The dataset spans three years, from October 2020 to October 2023. A simple Python notebook that visualises the datasets can be found here. More complex use-cases for the datasets can be found in the following links:
- Comparison of the carbon footprint of driving across countries: link
- Feedback indicators of electric car charging behaviours: link
There is also more information on the collection process and other potential uses in the data paper here. Please don't hesitate to contact the authors if you have any further questions about the dataset.