100+ datasets found
  1. The Human Know-How Dataset

    • find.data.gov.scot
    • dtechtive.com
    pdf, zip
    Updated Apr 29, 2016
    Cite
    (2016). The Human Know-How Dataset [Dataset]. http://doi.org/10.7488/ds/1394
    Explore at:
    zip(19.78 MB), zip(0.2837 MB), zip(19.67 MB), zip(69.8 MB), zip(9.433 MB), zip(62.92 MB), zip(20.43 MB), zip(43.28 MB), zip(92.88 MB), zip(13.06 MB), zip(14.86 MB), zip(5.372 MB), zip(0.0298 MB), pdf(0.0582 MB), zip(5.769 MB), zip(90.08 MB)
    Available download formats
    Dataset updated
    Apr 29, 2016
    Description

    The Human Know-How Dataset describes 211,696 human activities from many different domains. These activities are decomposed into 2,609,236 entities (each with an English textual label). These entities represent over two million actions and half a million pre-requisites. Actions are interconnected both according to their dependencies (temporal/logical orders between actions) and decompositions (decomposition of complex actions into simpler ones). This dataset has been integrated with DBpedia (259,568 links). For more information see:

    • The project website: http://homepages.inf.ed.ac.uk/s1054760/prohow/index.htm
    • The data is also available on datahub: https://datahub.io/dataset/human-activities-and-instructions

    • Quickstart: if you want to experiment with the most high-quality data before downloading all the datasets, download the file '9of11_knowhow_wikihow', and optionally the files 'Process - Inputs', 'Process - Outputs', 'Process - Step Links' and 'wikiHow categories hierarchy'.
    • Data representation is based on the PROHOW vocabulary: http://w3id.org/prohow#. Data extracted from existing web resources is linked to the original resources using the Open Annotation specification.
    • Data Model: an example of how the data is represented within the datasets is available in the attached Data Model PDF file. The attached example represents a simple set of instructions, but instructions in the dataset can have more complex structures. For example, instructions could have multiple methods, steps could have further sub-steps, and complex requirements could be decomposed into sub-requirements.

    Statistics:

    • 211,696: number of instructions. From wikiHow: 167,232 (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow). From Snapguide: 44,464 (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide).
    • 2,609,236: number of RDF nodes within the instructions. From wikiHow: 1,871,468 (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow). From Snapguide: 737,768 (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide).
    • 255,101: number of process inputs linked to 8,453 distinct DBpedia concepts (dataset Process - Inputs)
    • 4,467: number of process outputs linked to 3,439 distinct DBpedia concepts (dataset Process - Outputs)
    • 376,795: number of step links between 114,166 different sets of instructions (dataset Process - Step Links)
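    A minimal sketch of loading one of the RDF dumps with rdflib, assuming the zip extracts to an N-Triples file (the file name and serialization below are assumptions; check the actual download):

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDFS

# PROHOW vocabulary namespace, per the description above.
PROHOW = Namespace("http://w3id.org/prohow#")

g = Graph()
# Hypothetical file name/serialization: whatever the zip extracts to.
g.parse("9of11_knowhow_wikihow.nt", format="nt")

# Count labelled entities and peek at a few of them.
labelled = list(g.triples((None, RDFS.label, None)))
print(f"{len(labelled)} labelled entities")
for s, _, o in labelled[:5]:
    print(s, "->", o)
```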

  2. Reading habit Dataset

    • kaggle.com
    Updated Sep 3, 2020
    Cite
    Overfitted (2020). Reading habit Dataset [Dataset]. https://www.kaggle.com/vipulgote4/reading-habit-dataset/tasks
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Available download formats
    Dataset updated
    Sep 3, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Overfitted
    Description

    Context

    This dataset contains information on how many books or audiobooks people read, along with their age, income, education, etc.

    Acknowledgements

    dataset collected from this site: https://www.pewresearch.org/internet/

  3. US Broadband Usage Across Counties

    • kaggle.com
    Updated Jan 6, 2023
    Cite
    The Devastator (2023). US Broadband Usage Across Counties [Dataset]. https://www.kaggle.com/datasets/thedevastator/us-broadband-usage-across-counties-and-zip-codes
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Available download formats
    Dataset updated
    Jan 6, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    Area covered
    United States
    Description

    US Broadband Usage Across Counties

    Utilizing Microsoft's Data to Estimate Access

    By Amber Thomas [source]

    About this dataset

    This dataset provides an estimation of broadband usage in the United States, focusing on how many people have access to broadband and how many are actually using it at broadband speeds. Through data collected by Microsoft from our services, including package size and total time of download, we can estimate the throughput speed of devices connecting to the internet across zip codes and counties.

    According to Federal Communications Commission (FCC) estimates, 14.5 million people don't have access to any kind of broadband connection. This dataset aims to address the contrast between estimated availability and actual use by providing more accurate usage numbers, downscaled to the county and zip code levels. Who gets counted as having access is vastly important: it determines who gets included in public funding opportunities dedicated to closing the digital divide. The implications can be huge: millions around the country could remain invisible if these numbers aren't accurately reported or used properly in decision-making processes.

    This dataset aggregates information for locations with fewer than 20 devices for increased accuracy when estimating broadband usage in the United States, allowing others to use it to develop solutions that improve internet access, or to accurately label problem areas where no real or reliable connectivity exists among communities large and small throughout the US mainland. Please review the license terms before using these data, so that you adhere to the stipulations of Microsoft's Open Use of Data Agreement v1.0 in both professional and educational endeavors.


    How to use the dataset

    How to Use the US Broadband Usage Dataset

    This dataset provides broadband usage estimates in the United States by county and zip code. It is ideally suited for research into how broadband connects households, towns and cities. Understanding this information is vital for closing existing disparities in access to high-speed internet, and for devising strategies for making sure all Americans can stay connected in a digital world.

    The dataset contains six columns:

    • County – The name of the county for which usage statistics are provided.
    • Zip Code (5-Digit) – The 5-digit zip code from which usage data was collected, within that county or metropolitan/micropolitan area or division within a state, as reported by the US Census Bureau in 2018[2].
    • Population (Households) – Estimated number of households, defined according to [3], based on data from the US Census Bureau American Community Survey 5-Year Estimates[4].
    • Average Throughput (Mbps) – Average download speed in Mbps, derived from data collected anonymously from devices connected through Microsoft services such as Windows Update, Office 365, and Xbox Live Core Services[5].
    • Percent Fast (> 25 Mbps) – Percentage of machines with throughput greater than 25 Mbps, calculated using [6].
    • Percent Slow (< 3 Mbps) – Percentage of machines with throughput less than 3 Mbps, calculated using [7].
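    As a starting point, a minimal pandas sketch using the column names listed above (verify them against the actual header of broadband_data_2020October.csv before running):

```python
import pandas as pd

df = pd.read_csv("broadband_data_2020October.csv")

# Counties whose average measured download speed falls below the
# FCC's 25 Mbps broadband benchmark.
slow = df[df["Average Throughput (Mbps)"] < 25]
print(slow[["County", "Average Throughput (Mbps)"]].head())
```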

    Research Ideas

    • Targeting marketing campaigns based on broadband use. Companies can use the geographic and demographic data in this dataset to create targeted advertising campaigns that are tailored to individuals living in areas where broadband access is scarce or lacking.
    • Creating an educational platform for those without reliable access to broadband internet. By leveraging existing technologies such as satellite internet, media streaming services like Netflix, and platforms such as Khan Academy or EdX, those with limited access could gain access to new educational options from home.
    • Establishing public-private partnerships between local governments and telecom providers, which need better data about gaps in service coverage and usage levels in order to make decisions about investing in new infrastructure buildouts for better rural connectivity options.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    See the dataset description for more information.

    Columns

    File: broadband_data_2020October.csv


  4. LinkedIn Dataset - US People Profiles

    • kaggle.com
    Updated May 16, 2023
    Cite
    Joseph from Proxycurl (2023). LinkedIn Dataset - US People Profiles [Dataset]. https://www.kaggle.com/datasets/proxycurl/10000-us-people-profiles/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Available download formats
    Dataset updated
    May 16, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Joseph from Proxycurl
    Description

    Full profiles of 10,000 people in the US - download here, data schema here - with more than 40 data points including Full Name, Education, Location, Work Experience History, and many more!

    There are an additional 258+ million US people profiles available; visit the LinkDB product page here.

    Our LinkDB database is an exhaustive database of publicly accessible LinkedIn people and company profiles. It contains close to 500 million people and company profiles globally.

  5. male-selfie-image-dataset

    • huggingface.co
    Updated May 2, 2024
    Cite
    Training Data (2024). male-selfie-image-dataset [Dataset]. https://huggingface.co/datasets/TrainingDataPro/male-selfie-image-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Available download formats
    Dataset updated
    May 2, 2024
    Authors
    Training Data
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Face Recognition, Face Detection, Male Photo Dataset 👨

      If you are interested in biometric data - visit our website to learn more and buy the dataset :)
    

    110,000+ photos of 74,000+ men from 141 countries. The dataset includes photos of people's faces. All people presented in the dataset are men. The dataset contains a variety of images capturing individuals from diverse backgrounds and age groups. Our dataset will diversify your data by adding more photos of men of… See the full description on the dataset page: https://huggingface.co/datasets/TrainingDataPro/male-selfie-image-dataset.

  6. Image Dataset of Accessibility Barriers

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated Mar 25, 2022
    + more versions
    Cite
    Jakob Stolberg; Jakob Stolberg (2022). Image Dataset of Accessibility Barriers [Dataset]. http://doi.org/10.5281/zenodo.6382090
    Explore at:
    zip
    Available download formats
    Dataset updated
    Mar 25, 2022
    Dataset provided by
    Zenodo
    Authors
    Jakob Stolberg; Jakob Stolberg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Data
    The dataset consists of 5538 images of public spaces, annotated with steps, stairs, ramps, and grab bars for stairs and ramps. It contains 3564 annotations of steps, 1492 of stairs, 143 of ramps, and 922 of grab bars.

    Each step annotation is attributed with an estimate of the height of the step, as falling into one of three categories: less than 3cm, 3cm to 7cm or more than 7cm. Additionally it is attributed with a 'type', with the possibilities 'doorstep', 'curb' or 'other'.

    Stair annotations are attributed with the number of steps in the stair.

    Ramps are attributed with an estimate of their width, also falling into three categories: less than 50cm, 50cm to 100cm and more than 100cm.

    In order to preserve all additional attributes of the labels, the data is published in the CVAT XML format for images.
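    A minimal sketch for reading such an export with Python's standard library, assuming the usual CVAT-for-images layout of <image> elements containing <box> elements (verify element and attribute names against the actual file):

```python
import xml.etree.ElementTree as ET

tree = ET.parse("annotations.xml")  # hypothetical file name
for image in tree.getroot().iter("image"):
    for box in image.iter("box"):
        # Per-box attributes such as the step height category.
        attrs = {a.get("name"): a.text for a in box.iter("attribute")}
        print(image.get("name"), box.get("label"),
              box.get("xtl"), box.get("ytl"),
              box.get("xbr"), box.get("ybr"), attrs)
```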

    Annotating Process
    The labelling has been done using bounding boxes around the objects. This format is compatible with many popular object detection models, e.g. the YOLO object model. A bounding box is placed so that it contains exactly the visible part of the respective object. This implies that only objects that are visible in the photo are annotated. In particular, a photo of a stair or step from above, where the object cannot be seen, has not been annotated, even when a human viewer could infer that there is a stair or a step from other features in the photo.

    Steps
    A step is annotated when there is a vertical increment that functions as a passage between two surface areas intended for human or vehicle traffic. This means that we have not included:

    • Increments that are too high to reasonably be considered a passage.
    • Increments that do not lead to a surface intended for human or vehicle traffic, e.g. a 'step' in front of a wall or a curb in front of a bush.

    In particular, the bounding box of a step object contains exactly the incremental part of the step, and does not extend into the top or bottom horizontal surface any more than necessary to entirely enclose the incremental part. This has been chosen for consistency reasons, as including parts of the horizontal surfaces would imply a non-trivial choice of how much to include, which we deemed would most likely lead to more inconsistent annotations.

    The heights of the steps are estimated by the annotators and are therefore not guaranteed to be accurate.

    The type of a step typically falls into the category 'doorstep' or 'curb'. Steps that are in a doorway, entrance or the like are attributed as doorsteps. We also include in this category steps that immediately lead to a doorway within a proximity of 1-2m. Steps between different types of pathways, e.g. between streets and sidewalks, are annotated as curbs. Any other type of step is annotated with 'other'. Many of the 'other' steps are, for example, steps to terraces.

    Stairs
    The stair label is used whenever two or more steps directly follow each other in a consistent pattern. All vertical increments are enclosed in the bounding box, as well as the intermediate surfaces of the steps. However, the top and bottom surfaces are not included more than necessary, for the same reason as for steps, as described in the previous section.

    The annotator counts the number of steps and attributes this count to the stair object label.

    Ramps
    Ramps have been annotated when a sloped passageway has been placed or built to connect two surface areas intended for human or vehicle traffic. This implies the same considerations as with steps. Likewise, only the sloped part of a ramp is annotated, not the bottom or top surface area.

    For each ramp, the annotator makes an assessment of the width of the ramp in three categories: less than 50cm, 50cm to 100cm and more than 100cm. This parameter is visually hard to assess, and sometimes impossible due to the view of the ramp.

    Grab Bars
    Grab bars are annotated for hand rails and similar fixtures that are in direct connection to a stair or a ramp. While horizontal grab bars could also have been included, this was omitted due to the implied ambiguities with fences and similar objects. As the grab bar was originally intended as attribute information for stairs and ramps, we chose to keep this focus. The bounding box encloses the part of the grab bar that functions as a hand rail for the stair or ramp.

    Usage
    As is often the case when annotating data, much information depends on the subjective assessment of the annotator. As each data point in this dataset has been annotated by only one person, caution should be taken when the data is applied.

    Generally speaking, the mindset and usage guiding the annotations has been wheelchair accessibility. While we have strived to annotate at an object level, hopefully making the data more widely applicable than this, we state this explicitly as it may have swayed non-trivial annotation choices.

    The attribute data, such as step height or ramp width, are highly subjective estimates. We still provide these data to give a post-hoc method for choosing which annotations to use. E.g. for some purposes, one may be interested in detecting only steps that are indeed more than 3cm high. The attribute data make it possible to sort away the steps of less than 3cm, so a machine learning algorithm can be trained on a more appropriate dataset for that use case (see the sketch below). We stress, however, that one cannot expect to train accurate machine learning algorithms to infer the attribute data, as these are not accurate data in the first place.
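    Continuing the parsing sketch above, such a filter might look as follows; the attribute name and category strings are assumptions based on this description, not verified against the file:

```python
# Keep a box unless it is a step annotated as lower than 3cm.
TALL_ENOUGH = {"3cm to 7cm", "more than 7cm"}  # assumed category labels

def keep(box) -> bool:
    if box.get("label") != "step":
        return True
    attrs = {a.get("name"): a.text for a in box.iter("attribute")}
    return attrs.get("height") in TALL_ENOUGH
```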

    We hope this dataset will be a useful building block in the endeavours for automating barrier detection and documentation.

  7. Multi-Camera Action Dataset (MCAD)

    • zenodo.org
    application/gzip +2
    Updated Jan 24, 2020
    Cite
    Wenhui Li; Yongkang Wong; An-An Liu; Yang Li; Yu-Ting Su; Mohan Kankanhalli; Wenhui Li; Yongkang Wong; An-An Liu; Yang Li; Yu-Ting Su; Mohan Kankanhalli (2020). Multi-Camera Action Dataset (MCAD) [Dataset]. http://doi.org/10.5281/zenodo.884592
    Explore at:
    application/gzip, json, txt
    Available download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Wenhui Li; Yongkang Wong; An-An Liu; Yang Li; Yu-Ting Su; Mohan Kankanhalli; Wenhui Li; Yongkang Wong; An-An Liu; Yang Li; Yu-Ting Su; Mohan Kankanhalli
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Action recognition has received increasing attention from the computer vision and machine learning communities in recent decades. The recognition task has evolved from single-view recordings in controlled laboratory environments to unconstrained environments (i.e., surveillance environments or user-generated videos). Furthermore, recent work has focused on other aspects of the action recognition problem, such as cross-view classification, cross-domain learning, multi-modality learning, and action localization. Despite the large variety of studies, we observed limited work exploring the open-set and open-view classification problems, which are genuine inherent properties of the action recognition problem. In other words, a well-designed algorithm should robustly identify an unfamiliar action as "unknown" and achieve similar performance across sensors with similar fields of view. The Multi-Camera Action Dataset (MCAD) is designed to evaluate the open-view classification problem under a surveillance environment.

    In our multi-camera action dataset, unlike common action datasets, we use a total of five cameras, which can be divided into two types (Static and PTZ), to record actions. In particular, there are three Static cameras (Cam04, Cam05 and Cam06) with a fish-eye effect and two Pan-Tilt-Zoom (PTZ) cameras (PTZ04 and PTZ06). The Static cameras have a resolution of 1280×960 pixels, while the PTZ cameras have a resolution of 704×576 pixels and a smaller field of view than the Static cameras. Moreover, we do not control the illumination environment. We even set two contrasting conditions (daytime and nighttime), which makes our dataset more challenging than many controlled datasets with strongly controlled illumination. The distribution of the cameras is shown in the picture on the right.

    We identified 18 single-person daily actions, with and without objects, which are inherited from the KTH, IXMAS, and TRECVID datasets, among others. The list and definitions of the actions are shown in the table. These actions can be divided into four types: micro actions without an object (action IDs 01, 02, 05) and with an object (action IDs 10, 11, 12, 13); intense actions without an object (action IDs 03, 04, 06, 07, 08, 09) and with an object (action IDs 14, 15, 16, 17, 18). We recruited a total of 20 human subjects. Each candidate repeated each action 8 times (4 times during the day and 4 times in the evening) under one camera. In the recording process, we used the five cameras to record each action sample separately. During the recording stage we only told candidates the action name; they could then perform the action freely in their own way, as long as they performed it within the field of view of the current camera. This makes our dataset much closer to reality. As a result, there is high intra-class variation among different action samples, as shown in the picture of action samples.

    URL: http://mmas.comp.nus.edu.sg/MCAD/MCAD.html

    Resources:

    • IDXXXX.mp4.tar.gz contains video data for each individual
    • boundingbox.tar.gz contains person bounding box for all videos
    • protocol.json contains the evaluation protocol
    • img_list.txt contains the download URLs for the images version of the video data
    • idt_list.txt contains the download URLs for the improved Dense Trajectory features
    • stip_list.txt contains the download URLs for the STIP features
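    A minimal sketch of fetching the listed resources, assuming protocol.json is plain JSON and the *_list.txt files contain one URL per line (the structure of the protocol file is not documented here; inspect it first):

```python
import json
import urllib.request

with open("protocol.json") as f:
    protocol = json.load(f)
print(list(protocol))  # inspect the top-level structure first

with open("img_list.txt") as f:
    for url in map(str.strip, f):
        if url:
            urllib.request.urlretrieve(url, url.rsplit("/", 1)[-1])
```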

    How to Cite:

    Please cite the following paper if you use the MCAD dataset in your work (papers, articles, reports, books, software, etc):

    • Wenhui Li, Yongkang Wong, An-An Liu, Yang Li, Yu-Ting Su, Mohan Kankanhalli
      Multi-Camera Action Dataset for Cross-Camera Action Recognition Benchmarking
      IEEE Winter Conference on Applications of Computer Vision (WACV), 2017.
      http://doi.org/10.1109/WACV.2017.28
  8. ORBIT: A real-world few-shot dataset for teachable object recognition collected from people who are blind or low vision

    • city.figshare.com
    bin
    Updated May 31, 2023
    Cite
    Daniela Massiceti; Lida Theodorou; Luisa Zintgraf; Matthew Tobias Harris; Simone Stumpf; Cecily Morrison; Edward Cutrell; Katja Hofmann (2023). ORBIT: A real-world few-shot dataset for teachable object recognition collected from people who are blind or low vision [Dataset]. http://doi.org/10.25383/city.14294597.v3
    Explore at:
    bin
    Available download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    City, University of London
    Authors
    Daniela Massiceti; Lida Theodorou; Luisa Zintgraf; Matthew Tobias Harris; Simone Stumpf; Cecily Morrison; Edward Cutrell; Katja Hofmann
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Object recognition predominately still relies on many high-quality training examples per object category. In contrast, learning new objects from only a few examples could enable many impactful applications from robotics to user personalization. Most few-shot learning research, however, has been driven by benchmark datasets that lack the high variation that these applications will face when deployed in the real world. To close this gap, we present the ORBIT dataset, grounded in a real-world application of teachable object recognizers for people who are blind/low vision. We provide a full, unfiltered dataset of 4,733 videos of 588 objects recorded by 97 people who are blind/low-vision on their mobile phones, and a benchmark dataset of 3,822 videos of 486 objects collected by 77 collectors. The code for loading the dataset, computing all benchmark metrics, and running the baseline models is available at https://github.com/microsoft/ORBIT-Dataset

    This version comprises several zip files:

    • train, validation, test: benchmark dataset, organised by collector, with raw videos split into static individual frames in jpg format at 30FPS
    • other: data not in the benchmark set, organised by collector, with raw videos split into static individual frames in jpg format at 30FPS (please note that the train, validation, test, and other files together make up the unfiltered dataset)
    • *_224: as for the benchmark, but with static individual frames scaled down to 224 pixels
    • *_unfiltered_videos: full unfiltered dataset, organised by collector, in mp4 format
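    A minimal sketch for walking the extracted benchmark frames, assuming the archives unpack to train/<collector>/.../<frame>.jpg as the description suggests (the exact nesting may differ):

```python
from collections import Counter
from pathlib import Path

root = Path("train")
# Count frames per collector; the first path component under the
# split directory is assumed to be the collector ID.
frames_per_collector = Counter(
    frame.relative_to(root).parts[0] for frame in root.rglob("*.jpg")
)
for collector, n in frames_per_collector.most_common(5):
    print(collector, n)
```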

  9. Effect of suicide rates on life expectancy dataset

    • zenodo.org
    csv
    Updated Apr 16, 2021
    Cite
    Filip Zoubek; Filip Zoubek (2021). Effect of suicide rates on life expectancy dataset [Dataset]. http://doi.org/10.5281/zenodo.4694270
    Explore at:
    csv
    Available download formats
    Dataset updated
    Apr 16, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Filip Zoubek; Filip Zoubek
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    Effect of suicide rates on life expectancy dataset

    Abstract
    In 2015, approximately 55 million people died worldwide, of which 8 million committed suicide. In the USA, suicide is one of the main causes of death; this experiment therefore deals with the question of how much suicide rates affect average life expectancy statistics.
    The experiment takes two datasets, one with the number of suicides and one with life expectancy, and combines them into a single dataset. Subsequently, I try to find patterns and correlations among the variables and perform a statistical test using simple regression to confirm my assumptions.

    Data

    The experiment uses two datasets - WHO Suicide Statistics[1] and WHO Life Expectancy[2] - which were first appropriately preprocessed. The final merged dataset has 13 variables, with country and year used as the index:

    • Country, Year
    • Suicides number
    • Life expectancy
    • Adult Mortality – probability of dying between 15 and 60 years, per 1000 population
    • Infant deaths – number of infant deaths per 1000 population
    • Alcohol – recorded per-capita (15+) alcohol consumption
    • Under-five deaths – number of under-five deaths per 1000 population
    • HIV/AIDS – deaths per 1000 live births from HIV/AIDS
    • GDP – Gross Domestic Product per capita
    • Population
    • Income composition of resources – Human Development Index in terms of income composition of resources
    • Schooling – number of years of schooling
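    A minimal sketch of the merge-and-regress workflow described above; the file and column names are assumptions based on the two Kaggle sources cited below, so rename them to match the actual data:

```python
import pandas as pd
import statsmodels.formula.api as smf

suicides = pd.read_csv("who_suicide_statistics.csv")  # country, year, suicides_no, ...
life = pd.read_csv("life_expectancy_who.csv")         # country, year, life_expectancy, ...

# Merge on the shared index variables, then fit a simple regression.
df = suicides.merge(life, on=["country", "year"])
model = smf.ols("life_expectancy ~ suicides_no", data=df).fit()
print(model.summary())
```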

    LICENSE

    The experiment uses two datasets - WHO Suicide Statistics and WHO Life Expectancy - which were collected from the WHO and United Nations websites. Therefore, all datasets are under the Attribution-NonCommercial-ShareAlike 3.0 IGO license (https://creativecommons.org/licenses/by-nc-sa/3.0/igo/).

    [1] https://www.kaggle.com/szamil/who-suicide-statistics

    [2] https://www.kaggle.com/kumarajarshi/life-expectancy-who

  10. Downsampled Open Images V4 Dataset

    • academictorrents.com
    bittorrent
    Updated Dec 19, 2018
    Cite
    None (2018). Downsampled Open Images V4 Dataset [Dataset]. https://academictorrents.com/details/9208d33aceb2ca3eb2beb70a192600c9c41efba1
    Explore at:
    bittorrent(85220313799)
    Available download formats
    Dataset updated
    Dec 19, 2018
    Authors
    None
    License

    No license specified: https://academictorrents.com/nolicensespecified

    Description

    This is the downsampled version of the Open Images V4 Dataset. The Open Images V4 dataset contains 15.4M bounding boxes for 600 categories on 1.9M images and 30.1M human-verified image-level labels for 19,794 categories. The dataset is available at this link. The total size of the full dataset is 18TB. There is also a smaller version which contains rescaled images with at most 1024 pixels on the longest side. However, the total size of the rescaled dataset is still large (513GB for training, 12GB for validation and 36GB for testing). I provide a much smaller version of the Open Images Dataset V4, inspired by the Downsampled ImageNet datasets by @PatrykChrabaszcz. These downsampled datasets are much smaller in size, so everyone can download them with ease (59GB for training with the 512px version and 16GB for training with the 256px version). Experiments on these downsampled datasets are also much faster than on the original.
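    The torrent already ships rescaled images, but for illustration, a downsampling pass of this kind might look as follows (a sketch; directory names are placeholders):

```python
from pathlib import Path
from PIL import Image

src_dir, dst_dir = Path("images"), Path("images_256px")
dst_dir.mkdir(exist_ok=True)

for src in src_dir.glob("*.jpg"):
    im = Image.open(src)
    # Rescale so the longest side is at most 256 pixels.
    scale = 256 / max(im.size)
    if scale < 1:
        im = im.resize((round(im.width * scale), round(im.height * scale)),
                       Image.LANCZOS)
    im.save(dst_dir / src.name)
```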

  11. Geonames - All Cities with a population > 1000

    • public.opendatasoft.com
    • data.smartidf.services
    • +2more
    csv, excel, geojson +1
    Updated Mar 10, 2024
    + more versions
    Cite
    (2024). Geonames - All Cities with a population > 1000 [Dataset]. https://public.opendatasoft.com/explore/dataset/geonames-all-cities-with-a-population-1000/
    Explore at:
    csv, json, geojson, excel
    Available download formats
    Dataset updated
    Mar 10, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    All cities with a population > 1000, or seats of administrative divisions (ca. 80,000).

    Sources and Contributions
    Sources: GeoNames aggregates over a hundred different data sources.
    Ambassadors: GeoNames ambassadors help in many countries.
    Wiki: A wiki allows viewing the data and quickly fixing errors and adding missing places.
    Donations and Sponsoring: Costs for running GeoNames are covered by donations and sponsoring.

    Enrichment: add country name.
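    A minimal sketch for loading the CSV export with pandas; Opendatasoft exports are typically semicolon-separated, and the file and column names below are assumptions (inspect df.columns first):

```python
import pandas as pd

df = pd.read_csv("geonames-all-cities-with-a-population-1000.csv", sep=";")

# Ten most populous cities in the extract.
print(df.sort_values("population", ascending=False)
        [["name", "country_code", "population"]].head(10))
```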

  12. Empathy dataset

    • zenodo.org
    bin, csv, html
    Updated Dec 18, 2024
    Cite
    Zenodo (2024). Empathy dataset [Dataset]. http://doi.org/10.5281/zenodo.7683907
    Explore at:
    bin, html, csv
    Available download formats
    Dataset updated
    Dec 18, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The database for this study (Briganti et al. 2018; the same as for the Braun study analysis) was composed of 1973 French-speaking students in several universities or schools for higher education in the following fields: engineering (31%), medicine (18%), nursing school (16%), economic sciences (15%), physiotherapy (4%), psychology (11%), law school (4%) and dietetics (1%). The subjects were 17 to 25 years old (M = 19.6 years, SD = 1.6 years); 57% were female and 43% were male. Even though the full dataset was composed of 1973 participants, only 1270 answered the full questionnaire: missing data are handled using pairwise complete observations in estimating a Gaussian Graphical Model, meaning that all available information from every subject is used.

    The feature set is composed of 28 items meant to assess the four following components: fantasy, perspective taking, empathic concern and personal distress. In the questionnaire, the items are mixed; reversed items (items 3, 4, 7, 12, 13, 14, 15, 18, 19) are present. Items are scored from 0 to 4, where “0” means “Doesn’t describe me very well” and “4” means “Describes me very well”; reverse-scoring is calculated afterwards. The questionnaires were anonymized. The reanalysis of the database in this retrospective study was approved by the ethical committee of the Erasmus Hospital.
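    A minimal sketch of the reverse-scoring step described above: items are scored 0 to 4, so a reversed item maps x to 4 - x (the item column names are assumptions about the file layout):

```python
import pandas as pd

REVERSED = [3, 4, 7, 12, 13, 14, 15, 18, 19]  # reversed items, per the text

def reverse_score(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for i in REVERSED:
        col = f"item{i}"  # hypothetical column name
        out[col] = 4 - out[col]
    return out
```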

    Size: A dataset of size 1973*28

    Number of features: 28

    Ground truth: No

    Type of Graph: Mixed graph

    The following gives the description of the variables:

    Item | Label | Domain | Item meaning from Davis 1980
    1 | FS | Green | I daydream and fantasize, with some regularity, about things that might happen to me.
    2 | EC | Purple | I often have tender, concerned feelings for people less fortunate than me.
    3 | PT_R | Yellow | I sometimes find it difficult to see things from the “other guy’s” point of view.
    4 | EC_R | Purple | Sometimes I don’t feel very sorry for other people when they are having problems.
    5 | FS | Green | I really get involved with the feelings of the characters in a novel.
    6 | PD | Red | In emergency situations, I feel apprehensive and ill-at-ease.
    7 | FS_R | Green | I am usually objective when I watch a movie or play, and I don’t often get completely caught up in it. (Reversed)
    8 | PT | Yellow | I try to look at everybody’s side of a disagreement before I make a decision.
    9 | EC | Purple | When I see someone being taken advantage of, I feel kind of protective towards them.
    10 | PD | Red | I sometimes feel helpless when I am in the middle of a very emotional situation.
    11 | PT | Yellow | I sometimes try to understand my friends better by imagining how things look from their perspective.
    12 | FS_R | Green | Becoming extremely involved in a good book or movie is somewhat rare for me. (Reversed)
    13 | PD_R | Red | When I see someone get hurt, I tend to remain calm. (Reversed)
    14 | EC_R | Purple | Other people’s misfortunes do not usually disturb me a great deal. (Reversed)
    15 | PT_R | Yellow | If I’m sure I’m right about something, I don’t waste much time listening to other people’s arguments. (Reversed)
    16 | FS | Green | After seeing a play or movie, I have felt as though I were one of the characters.
    17 | PD | Red | Being in a tense emotional situation scares me.
    18 | EC_R | Purple | When I see someone being treated unfairly, I sometimes don’t feel very much pity for them. (Reversed)
    19 | PD_R | Red | I am usually pretty effective in dealing with emergencies. (Reversed)
    20 | FS | Green | I am often quite touched by things that I see happen.
    21 | PT | Yellow | I believe that there are two sides to every question and try to look at them both.
    22 | EC | Purple | I would describe myself as a pretty soft-hearted person.
    23 | FS | Green | When I watch a good movie, I can very easily put myself in the place of a leading character.
    24 | PD | Red | I tend to lose control during emergencies.
    25 | PT | Yellow | When I’m upset at someone, I usually try to “put myself in his shoes” for a while.
    26 | FS | Green | When I am reading an interesting story or novel, I imagine how I would feel if the events in the story were happening to me.
    27 | PD | Red | When I see someone who badly needs help in an emergency, I go to pieces.
    28 | PT | Yellow | Before criticizing somebody, I try to imagine how I would feel if I were in their place.

    More information about the dataset is contained in empathy_description.html file.

  13. USA Name Data

    • kaggle.com
    zip
    Updated Feb 12, 2019
    Cite
    Data.gov (2019). USA Name Data [Dataset]. https://www.kaggle.com/datasets/datagov/usa-names
    Explore at:
    zip(0 bytes)
    Available download formats
    Dataset updated
    Feb 12, 2019
    Dataset provided by
    Data.gov (https://data.gov/)
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    United States
    Description

    Context

    Cultural diversity in the U.S. has led to great variations in names and naming traditions and names have been used to express creativity, personality, cultural identity, and values. Source: https://en.wikipedia.org/wiki/Naming_in_the_United_States

    Content

    This public dataset was created by the Social Security Administration and contains all names from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in this data. For others who did apply, records may not show the place of birth, and again their names are not included in the data.

    All data are from a 100% sample of records on Social Security card applications as of the end of February 2015. To safeguard privacy, the Social Security Administration restricts names to those with at least 5 occurrences.

    Fork this kernel to get started with this dataset.

    Acknowledgements

    https://bigquery.cloud.google.com/dataset/bigquery-public-data:usa_names

    https://cloud.google.com/bigquery/public-data/usa-names

    Dataset Source: Data.gov. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

    Banner Photo by @dcp from Unsplash.

    Inspiration

    What are the most common names?

    What are the most common female names?

    Are there more female or male names?

    Female names by a wide margin?
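    A minimal sketch answering the first question against the BigQuery public table referenced above (the usa_1910_current table and its name/number columns follow the documented public schema, but verify before relying on them):

```python
from google.cloud import bigquery

client = bigquery.Client()  # requires Google Cloud credentials
query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_current`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.name, row.total)
```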

  14. Data from: stereoset

    • huggingface.co
    • opendatalab.com
    Updated Dec 8, 2021
    Cite
    McGill NLP Group (2021). stereoset [Dataset]. https://huggingface.co/datasets/McGill-NLP/stereoset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Available download formats
    Dataset updated
    Dec 8, 2021
    Dataset authored and provided by
    McGill NLP Group
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for StereoSet

      Dataset Summary
    

    StereoSet is a dataset that measures stereotype bias in language models. StereoSet consists of 17,000 sentences that measure model preferences across gender, race, religion, and profession.

      Supported Tasks and Leaderboards
    

    multiple-choice question answering

      Languages
    

    English (en)

      Dataset Structure
    
    
    
    
    
      Data Instances
    

    intersentence

    {'bias_type': 'race', 'context': 'Many people live in… See the full description on the dataset page: https://huggingface.co/datasets/McGill-NLP/stereoset.
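    A minimal loading sketch via the Hugging Face hub; the "intersentence" config and "validation" split follow the dataset card, but verify them before use:

```python
from datasets import load_dataset

ds = load_dataset("McGill-NLP/stereoset", "intersentence")
example = ds["validation"][0]
print(example["context"])
```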

  15. Traffic Crashes - People

    • catalog.data.gov
    • data.cityofchicago.org
    Updated Aug 11, 2025
    + more versions
    Cite
    data.cityofchicago.org (2025). Traffic Crashes - People [Dataset]. https://catalog.data.gov/dataset/traffic-crashes-people
    Explore at:
    Dataset updated
    Aug 11, 2025
    Dataset provided by
    data.cityofchicago.org
    Description

    This data contains information about people involved in a crash, and whether any injuries were sustained. This dataset should be used in combination with the traffic Crash and Vehicle datasets. Each record corresponds to an occupant in a vehicle listed in the Crash dataset. Some people involved in a crash may not have been occupants of a motor vehicle, but may have been pedestrians, bicyclists, or using another non-motor-vehicle mode of transportation. Injuries are reported by the responding police officer. Fatalities that occur after the initial reports are typically updated in these records up to 30 days after the date of the crash. Person data can be linked with the Crash and Vehicle datasets using the “CRASH_RECORD_ID” field. A vehicle can have multiple occupants, hence a one-to-many relationship between the Vehicle and Person datasets. However, a pedestrian is a “unit” by itself, with a one-to-one relationship between the Vehicle and Person tables. The Chicago Police Department reports crashes on IL Traffic Crash Reporting form SR1050. The crash data published on the Chicago data portal mostly follows the data elements in the SR1050 form. The current version of the SR1050 instructions manual, with detailed information on each data element, is available here. Change 11/21/2023: We have removed the RD_NO (Chicago Police Department report number) for privacy reasons.
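    A minimal sketch of that linkage with pandas; the file names are placeholders for portal exports of the People and Crashes datasets:

```python
import pandas as pd

people = pd.read_csv("Traffic_Crashes_-_People.csv")
crashes = pd.read_csv("Traffic_Crashes_-_Crashes.csv")

# One crash maps to many people, so merging on CRASH_RECORD_ID
# attaches crash-level fields to every occupant/pedestrian record.
merged = people.merge(crashes, on="CRASH_RECORD_ID", how="left")
print(len(people), len(merged))  # row count preserved on the many side
```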

  16. Data from: Facial Expression Image Dataset for Computer Vision Algorithms

    • salford.figshare.com
    Updated Apr 29, 2025
    Cite
    Ali Alameer; Odunmolorun Osonuga (2025). Facial Expression Image Dataset for Computer Vision Algorithms [Dataset]. http://doi.org/10.17866/rd.salford.21220835.v2
    Explore at:
    Dataset updated
    Apr 29, 2025
    Dataset provided by
    University of Salford
    Authors
    Ali Alameer; Odunmolorun Osonuga
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset for this project consists of photos of individual human emotion expressions. The photos were taken with both a digital camera and a mobile phone camera, from different angles, postures, backgrounds, light exposures, and distances. This task might look and sound very easy, but there were some challenges encountered along the process, which are reviewed below:

    1) People constraint. One of the major challenges faced during this project was getting people to participate in the image-capturing process, as school was on vacation, and other individuals around the environment were not willing to let their images be captured, for personal and security reasons, even after the notion behind the project (mainly academic research purposes) was explained. Due to this challenge, we resorted to capturing the images of the researcher and just a few other willing individuals.

    2) Time constraint. As with all deep learning projects, the more data available, the more accurate and less error-prone the results will be. At the initial stage of the project, it was agreed to have 10 emotional-expression photos each from at least 50 persons, with the option to increase the number of photos for more accurate results; but due to the time constraints of this project, it was later agreed to capture just the researcher and a few other people who were willing and available. These photos were taken for just two types of human emotion expression, “happy” and “sad” faces, also due to time constraints. To expand this work further (as future work and recommendations), photos of other facial expressions such as anger, contempt, disgust, fright, and surprise can be included if time permits.

    3) The approved facial emotion captures. It was agreed to capture as many angles and postures as possible for just two facial emotions, with at least 10 emotional-expression images per individual; but due to time and people constraints, only a few persons were captured with as many postures as possible, as stated below:

    • Happy faces: 65 images
    • Sad faces: 62 images

    There are many other types of facial emotions, and again, to expand our project in the future, we can include all the other types if time permits and people are readily available.

    4) Further expansion. This project can be improved with many more capabilities; again, due to the limited time given to this project, these improvements can be implemented later as future work. In simple words, this project is to detect/predict real-time human emotion, which involves creating a model that can detect the percentage confidence of any happy or sad facial image. The higher the percentage confidence, the more accurate the facial image fed into the model.

    5) Other questions. Can the model be reproduced? The supposed response to this question should be YES, if and only if the model is fed with the proper data (images), such as images of other types of emotional expression.

  17. Dataset: Deenz Dark Triad Scale – Poland

    • sodha.be
    tsv
    Updated Feb 20, 2025
    + more versions
    Cite
    Deen Mohd Dar; Deen Mohd Dar (2025). Dataset: Deenz Dark Triad Scale – Poland [Dataset]. http://doi.org/10.34934/DVN/4WYRN9
    Explore at:
    tsv(6069)
    Available download formats
    Dataset updated
    Feb 20, 2025
    Dataset provided by
    Social Sciences and Digital Humanities Archive – SODHA
    Authors
    Deen Mohd Dar; Deen Mohd Dar
    License

    Custom dataset license: https://www.sodha.be/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.34934/DVN/4WYRN9

    Area covered
    Poland
    Description

    This dataset comes from a study conducted in Poland with 44 participants. The goal of the study was to measure personality traits known as the Dark Triad. The Dark Triad consists of three key traits that influence how people think and behave towards others: Machiavellianism, Narcissism, and Psychopathy. Machiavellianism refers to a person's tendency to manipulate others and be strategic in their actions. People with high Machiavellianism scores often believe that deception is necessary to achieve their goals. Narcissism is related to self-importance and the need for admiration. Individuals with high narcissism scores may see themselves as special and expect others to recognize their greatness. Psychopathy is linked to impulsive behavior and a lack of empathy. People with high psychopathy scores tend to be less concerned about the feelings of others and may take risks without worrying about consequences.

    Each participant in the dataset answered 30 questions, divided into three sections, with 10 questions per trait. The answers were recorded using a Likert scale from 1 to 5, where:

    • 1 means "Strongly Disagree"
    • 2 means "Disagree"
    • 3 means "Neutral"
    • 4 means "Agree"
    • 5 means "Strongly Agree"

    This scale helps measure how much a person agrees with statements related to each of the three traits. The dataset also includes basic demographic information. Each participant has a unique ID (such as P001, P002, etc.) to keep their identity anonymous. The dataset records their age, which ranges from 18 to 60 years old, and their gender, which is categorized as "Male," "Female," or "Other."

    The responses in the dataset are realistic, with small variations to reflect natural differences in personality. On average, participants scored around 3.2 for Machiavellianism, meaning most people showed a moderate tendency to be strategic or manipulative. The average Narcissism score was 3.5, indicating that some participants valued themselves highly and sought admiration. The average Psychopathy score was 2.8, showing that most participants did not strongly exhibit impulsive or reckless behaviors.

    This dataset can be useful for many purposes. Researchers can use it to analyze personality traits and see how they compare across different groups. The data can also be used for cross-cultural comparisons, allowing researchers to study how personality traits in Poland differ from those in other countries. Additionally, psychologists can use this data to understand how Dark Triad traits influence behavior in everyday life. The dataset is saved in a CSV format, which makes it easy to open in programs like Excel, SPSS, or Python for further analysis. Because the data is structured and anonymized, it can be used safely for research without revealing personal information.

    In summary, this dataset provides valuable insights into personality traits among people in Poland. It allows researchers to explore how Machiavellianism, Narcissism, and Psychopathy vary among individuals. By studying these traits, psychologists can better understand human behavior and how it affects relationships, decision-making, and personal success.
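    A minimal sketch of computing per-trait averages from such a CSV in Python; the file name and the per-item column layout are assumptions about the data:

```python
import pandas as pd

df = pd.read_csv("deenz_dark_triad_poland.csv")  # hypothetical file name

for trait in ("mach", "narc", "psyc"):  # hypothetical column prefixes
    cols = [f"{trait}{i}" for i in range(1, 11)]
    print(trait, round(df[cols].mean(axis=1).mean(), 2))
```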

  18. Overtone Journalistic Content Bot/Human Indicator Dataset

    • datarade.ai
    Updated Jan 23, 2023
    Cite
    Overtone (2023). Overtone Journalistic Content Bot/Human Indicator Dataset [Dataset]. https://datarade.ai/data-products/overtone-journalistic-content-bot-human-indicator-dataset-overtone
    Explore at:
    Dataset updated
    Jan 23, 2023
    Dataset authored and provided by
    Overtone
    Area covered
    Finland, Virgin Islands (U.S.), Russian Federation, Australia, Panama, Aruba, Belarus, Brazil, Belize, Falkland Islands (Malvinas)
    Description

    We indicate how likely a piece of content is to be computer-generated or human-written. Content: any text in English or Spanish, from a single sentence to articles thousands of words in length.

    Data uniqueness: we use custom-built and trained NLP algorithms to assess human-effort metrics that are inherent in text content. We focus on what's in the text, not metadata such as publication or engagement. Our AI algorithms are co-created by NLP and journalism experts. Our datasets have all been human-reviewed and labeled.

    Dataset: CSV containing URL and/or body text, with attributed scoring as an integer and model confidence as a percentage. We ignore metadata such as author, publication, date, word count, shares and so on, to provide a clean and maximally unbiased assessment of how much human effort has been invested in content. Our data is provided in CSV/RSS/JSON format. One row = one scored article.

    Integrity indicators provided as integers on a 1–5 scale. We also have custom models with 35 categories that can be added on request.
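    A minimal sketch of consuming such a CSV; the header names are assumptions about the delivery format described above:

```python
import csv

with open("overtone_scores.csv", newline="") as f:
    for row in csv.DictReader(f):
        score = int(row["score"])               # integrity score, 1-5
        confidence = float(row["confidence"])   # model confidence, percent
        if score >= 4 and confidence >= 80:
            print(row["url"])
```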

    Data sourcing: public websites, crawlers, scrapers, and other partnerships where available. We can generally assess content behind paywalls as well as without paywalls. We source from ~4,000 news outlets; examples include Bloomberg, CNN, and the BBC. Countries: all English-speaking markets worldwide. Includes English-language content from non-English-majority regions, such as Germany, Scandinavia, and Japan. Also available in Spanish on request.

    Use-cases: assessing the implicit integrity and reliability of an article. There is correlation between integrity and human value: we have shown that articles scoring highly according to our scales show increased, sustained, ongoing end-user engagement. Clients also use this to assess journalistic output, publication relevance and to create datasets of 'quality' journalism.

    Overtone provides a range of qualitative metrics for journalistic, newsworthy and long-form content. We find, highlight and synthesise content that shows added human effort and, by extension, added human value.

  19. diffusiondb

    • huggingface.co
    Updated Aug 20, 2022
    Cite
    Polo Club of Data Science (2022). diffusiondb [Dataset]. https://huggingface.co/datasets/poloclub/diffusiondb
    Explore at:
    Dataset updated
    Aug 20, 2022
    Dataset authored and provided by
    Polo Club of Data Science
    License

    CC0 1.0: https://choosealicense.com/licenses/cc0-1.0/

    Description

    DiffusionDB is the first large-scale text-to-image prompt dataset. It contains 2 million images generated by Stable Diffusion using prompts and hyperparameters specified by real users. The unprecedented scale and diversity of this human-actuated dataset provide exciting research opportunities in understanding the interplay between prompts and generative models, detecting deepfakes, and designing human-AI interaction tools to help users more easily use these models.
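    A minimal loading sketch via the Hugging Face hub; the "2m_random_1k" subset name follows the dataset card, but verify it (and whether your datasets version requires trust_remote_code) before use:

```python
from datasets import load_dataset

# Pull a small random slice rather than the full 2M images.
ds = load_dataset("poloclub/diffusiondb", "2m_random_1k")
sample = ds["train"][0]
print(sample["prompt"])
```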

  20. Face Dataset Of People That Don't Exist

    • kaggle.com
    Updated Sep 8, 2023
    Cite
    BwandoWando (2023). Face Dataset Of People That Don't Exist [Dataset]. http://doi.org/10.34740/kaggle/dsv/6433550
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Available download formats
    Dataset updated
    Sep 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    BwandoWando
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    All the images of faces here are generated using https://thispersondoesnotexist.com/


    Copyrighting of AI Generated images

    Under US copyright law, these images are technically not subject to copyright protection. Only "original works of authorship" are considered. "To qualify as a work of 'authorship' a work must be created by a human being," according to a US Copyright Office report [PDF].

    https://www.theregister.com/2022/08/14/ai_digital_artwork_copyright/

    Tagging

    I manually tagged all images as best I could and separated them into the two classes below:

    • Female- 3860 images
    • Male- 3013 images

    Some may pass as either female or male, but I will leave the reviewing to you. I included toddlers and babies under Male/Female.

    How it works

    Each of the faces is totally fake, created using an algorithm called Generative Adversarial Networks (GANs).

    A generative adversarial network (GAN) is a class of machine learning frameworks designed by Ian Goodfellow and his colleagues in June 2014. Two neural networks contest with each other in a game (in the form of a zero-sum game, where one agent's gain is another agent's loss).

    Given a training set, this technique learns to generate new data with the same statistics as the training set. For example, a GAN trained on photographs can generate new photographs that look at least superficially authentic to human observers, having many realistic characteristics. Though originally proposed as a form of generative model for unsupervised learning, GANs have also proved useful for semi-supervised learning, fully supervised learning, and reinforcement learning.

    Github implementation of website

    How I gathered the images

    Just a simple Jupyter notebook that looped, invoked the website https://thispersondoesnotexist.com/, and saved all the images locally.
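    A minimal sketch of such a collection loop (names and pacing are illustrative; the site returns a fresh face on each request):

```python
import time
import requests

for i in range(100):
    r = requests.get("https://thispersondoesnotexist.com/", timeout=30)
    r.raise_for_status()
    with open(f"face_{i:04d}.jpg", "wb") as f:
        f.write(r.content)
    time.sleep(1)  # be polite to the site
```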
