https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
IMDb Datasets Subsets of IMDb data are available for access to customers for personal and non-commercial use. You can hold local copies of this data, and it is subject to our terms and conditions. Please refer to the Non-Commercial Licensing and copyright/license and verify compliance.
Data Location
The dataset files can be accessed and downloaded from https://datasets.imdbws.com/. The data is refreshed daily.
IMDb Dataset Details
Each dataset is in a tab-separated-values (TSV) formatted file in the UTF-8 character set. The first line in each file contains headers that describe what is in each column. A ‘\N’ is used to denote that a particular field is missing or null for that title/name.
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
By Arto (From Huggingface) [source]
The train.csv file contains a list of image filenames, captions, and the actual images used for training the image captioning models. Similarly, the test.csv file includes a separate set of image filenames, captions, and images specifically designated for testing the accuracy and performance of the trained models.
Furthermore, the valid.csv file contains a unique collection of image filenames with their respective captions and images that serve as an independent validation set to evaluate the models' capabilities accurately.
Each entry in these CSV files includes both a filename string that indicates the name or identifier of an image file stored in another location or directory. Additionally,** each entry also provides a list (or multiple rows) o**f strings representing written descriptions or captions describing each respective image given its filename.
Considering these details about this dataset's structure, it can be immensely valuable to researchers, developers, and enthusiasts working on developing innovative computer vision algorithms such as automatic text generation based on visual content analysis. Whether it's training machine learning models to automatically generate relevant captions based on new unseen images or evaluating existing systems' performance against diverse criteria.
Stay updated with cutting-edge research trends by leveraging this comprehensive dataset containing not only captio**ns but also corresponding imag**es across different sets specifically designed to cater to varied purposes within computer vision tasks. »
Overview of the Dataset
The dataset consists of three primary files:
train.csv
,test.csv
, andvalid.csv
. These files contain information about image filenames and their respective captions. Each file includes multiple captions for each image to support diverse training techniques.Understanding the Files
- train.csv: This file contains filenames (
filename
column) and their corresponding captions (captions
column) for training your image captioning model.- test.csv: The test set is included in this file, which contains a similar structure as that of
train.csv
. The purpose of this file is to evaluate your trained models on unseen data.- valid.csv: This validation set provides images with their respective filenames (
filename
) and captions (captions
). It allows you to fine-tune your models based on performance during evaluation.Getting Started
To begin utilizing this dataset effectively, follow these steps:
- Extract the zip file containing all relevant data files onto your local machine or cloud environment.
- Familiarize yourself with each CSV file's structure:
train.csv
,test.csv
, andvalid.csv
. Understand how information like filename(s) (filename
) corresponds with its respective caption(s) (captions
).- Depending on your specific use case or research goals, determine which portion(s) of the dataset you wish to work with (e.g., only train or train+validation).
- Load the dataset into your preferred programming environment or machine learning framework, ensuring you have the necessary dependencies installed.
- Preprocess the dataset as needed, such as resizing images to a specific dimension or encoding captions for model training purposes.
- Split the data into training, validation, and test sets according to your experimental design requirements.
- Use appropriate algorithms and techniques to train your image captioning models on the provided data.
Enhancing Model Performance
To optimize model performance using this dataset, consider these tips:
- Explore different architectures and pre-trained models specifically designed for image captioning tasks.
- Experiment with various natural language
- Image Captioning: This dataset can be used to train and evaluate image captioning models. The captions can be used as target labels for training, and the images can be paired with the captions to generate descriptive captions for test images.
- Image Retrieval: The dataset can be used for image retrieval tasks where given a query caption, the model needs to retrieve the images that best match the description. This can be useful in applications such as content-based image search.
- Natural Language Processing: The dataset can also be used for natural language processing tasks such as text generation or machine translation. The captions in this dataset are descriptive ...
https://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy
Discover the Walmart Products Free Dataset, featuring 2,000 records in CSV format. This dataset includes detailed information about various Walmart products, such as names, prices, categories, and descriptions.
It’s perfect for data analysis, e-commerce research, and machine learning projects. Download now and kickstart your insights with accurate, real-world data.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Please note: this archive requires support for dangling symlinks, which excludes the Windows operating system.
To use this dataset, you will need to download the MS COCO 2017 detection images and expand them to a folder called coco17 in the train_val_combined directory. The download can be found here: https://cocodataset.org/#download You will also need to download the AI2D image description dataset and expand them to a folder called ai2d in the train_val_combined directory. The download can be found here: https://prior.allenai.org/projects/diagram-understanding
License Notes for Train and Val: Since the images in this dataset come from different sources, they are bound by different licenses.
Images for bar charts, x-y plots, maps, pie charts, tables, and technical drawings were downloaded directly from wikimedia commons. License and authorship information is stored independently for each image in these categories in the wikimedia_commons_licenses.csv file. Each row (note: some rows are multi-line) is formatted so:
Images in the slides category were taken from presentations which were downloaded from Wikimedia Commons. The names of the presentations on Wikimedia Commons omits the trailing underscore, number, and file extension, and ends with .pdf instead. The source materials' licenses are shown in source_slices_licenses.csv.
Wikimedia commons photos' information page can be found at "https://commons.wikimedia.org/wiki/File:
License Notes for Testing: The testing images have been uploaded to SlideWiki by SlideWiki users. The image authorship and copyright information is available in authors.csv.
Further information can be found for each image using the SlideWiki file service. Documentation is available at https://fileservice.slidewiki.org/documentation#/ and in particular: metadata is available at "https://fileservice.slidewiki.org/metadata/
This is the SlideImages dataset, which has been assembled for the SlideImages paper. If you find the dataset useful, please cite our paper: https://doi.org/10.1007/978-3-030-45442-5_36
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset used in this study experiment was from the preliminary competition dataset of the 2018 Guangdong Industrial Intelligent Manufacturing Big Data Intelligent Algorithm Competition organized by Tianchi Feiyue Cloud (https://tianchi.aliyun.com/competition/entrance/231682/introduction). We have selected the dataset, removing images that do not meet the requirements of our experiment. All datasets have been classified for training and testing. The image pixels are all 2560×1960. Before training, all defects need to be labeled using labelimg and saved as json files. Then, all json files are converted to txt files. Finally, the organized defect dataset is detected and classified.Description of the data and file structureThis is a project based on the YOLOv8 enhanced algorithm for aluminum defect classification and detection tasks.All code has been tested on Windows computers with Anaconda and CUDA-enabled GPUs. The following instructions allow users to run the code in this repository based on a Windows+CUDA GPU system already in use.Files and variablesFile: defeat_dataset.zipDescription:SetupPlease follow the steps below to set up the project:Download Project RepositoryDownload the project repository defeat_dataset.zip from the following location.Unzip and navigate to the project folder; it should contain a subfolder: quexian_datasetDownload data1.Download data .defeat_dataset.zip2.Unzip the downloaded data and move the 'defeat_dataset' folder into the project's main folder.3. Make sure that your defeat_dataset folder now contains a subfolder: quexian_dataset.4. Within the folder you should find various subfolders such as addquexian-13, quexian_dataset, new_dataset-13, etc.softwareSet up the Python environment1.Download and install the Anaconda.2.Once Anaconda is installed, activate the Anaconda Prompt. For Windows, click Start, search for Anaconda Prompt, and open it.3.Create a new conda environment with Python 3.8. You can name it whatever you like; for example. Enter the following command: conda create -n yolov8 python=3.84.Activate the created environment. If the name is , enter: conda activate yolov8Download and install the Visual Studio Code.Install PyTorch based on your system:For Windows/Linux users with a CUDA GPU: bash conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forgeInstall some necessary libraries:Install scikit-learn with the command: conda install anaconda scikit-learn=0.24.1Install astropy with: conda install astropy=4.2.1Install pandas using: conda install anaconda pandas=1.2.4Install Matplotlib with: conda install conda-forge matplotlib=3.5.3Install scipy by entering: conda install scipy=1.10.1RepeatabilityFor PyTorch, it's a well-known fact:There is no guarantee of fully reproducible results between PyTorch versions, individual commits, or different platforms. In addition, results may not be reproducible between CPU and GPU executions, even if the same seed is used.All results in the Analysis Notebook that involve only model evaluation are fully reproducible. However, when it comes to updating the model on the GPU, the results of model training on different machines vary.Access informationOther publicly accessible locations of the data:https://tianchi.aliyun.com/dataset/public/Data was derived from the following sources:https://tianchi.aliyun.com/dataset/140666Data availability statementThe ten datasets used in this study come from Guangdong Industrial Wisdom Big Data Innovation Competition - Intelligent Algorithm Competition Rematch. and the dataset download link is https://tianchi.aliyun.com/competition/entrance/231682/information?lang=en-us. Officially, there are 4,356 images, including single blemish images, multiple blemish images and no blemish images. The official website provides 4,356 images, including single defect images, multiple defect images and no defect images. We have selected only single defect images and multiple defect images, which are 3,233 images in total. The ten defects are non-conductive, effacement, miss bottom corner, orange, peel, varicolored, jet, lacquer bubble, jump into a pit, divulge the bottom and blotch. Each image contains one or more defects, and the resolution of the defect images are all 2560×1920.By investigating the literature, we found that most of the experiments were done with 10 types of defects, so we chose three more types of defects that are more different from these ten types and more in number, which are suitable for the experiments. The three newly added datasets come from the preliminary dataset of Guangdong Industrial Wisdom Big Data Intelligent Algorithm Competition. The dataset can be downloaded from https://tianchi.aliyun.com/dataset/140666. There are 3,000 images in total, among which 109, 73 and 43 images are for the defects of bruise, camouflage and coating cracking respectively. Finally, the 10 types of defects in the rematch and the 3 types of defects selected in the preliminary round are fused into a new dataset, which is examined in this dataset.In the processing of the dataset, we tried different division ratios, such as 8:2, 7:3, 7:2:1, etc. After testing, we found that the experimental results did not differ much for different division ratios. Therefore, we divide the dataset according to the ratio of 7:2:1, the training set accounts for 70%, the validation set accounts for 20%, and the testing set accounts for 10%. At the same time, the random number seed is set to 0 to ensure that the results obtained are consistent every time the model is trained.Finally, the mean Average Precision (mAP) metric obtained from the experiment was tested on the dataset a total of three times. Each time the results differed very little, but for the accuracy of the experimental results, we took the average value derived from the highest and lowest results. The highest was 71.5% and the lowest was 71.1%, resulting in an average detection accuracy of 71.3% for the final experiment.All data and images utilized in this research are from publicly available sources, and the original creators have given their consent for these materials to be published in open-access formats.The settings for other parameters are as follows. epochs: 200,patience: 50,batch: 16,imgsz: 640,pretrained: true,optimizer: SGD,close_mosaic: 10,iou: 0.7,momentum: 0.937,weight_decay: 0.0005,box: 7.5,cls: 0.5,dfl: 1.5,pose: 12.0,kobj: 1.0,save_dir: runs/trainThe defeat_dataset.(ZIP)is mentioned in the Supporting information section of our manuscript. The underlying data are held at Figshare. DOI: 10.6084/m9.figshare.27922929.The results_images.zipin the system contains the experimental results graphs.The images_1.zipand images_2.zipin the system contain all the images needed to generate the manuscript.tex manuscript.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Javanese Script Dataset is a dataset for object detection tasks - it contains Javanese Script Words Dataset annotations for 3,900 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
This data provides results from field analyses, from the California Environmental Data Exchange Network (CEDEN). The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.
Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data.
Users who want to manually download more specific subsets of the data can also use the CEDEN Query Tool, which provides access to the same data presented here, but allows for interactive data filtering.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the population of Gratis by gender, including both male and female populations. This dataset can be utilized to understand the population distribution of Gratis across both sexes and to determine which sex constitutes the majority.
Key observations
There is a slight majority of female population, with 50.0% of total population being female. Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Scope of gender :
Please note that American Community Survey asks a question about the respondents current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data for biological sex, not gender. Respondents are supposed to respond with the answer as either of Male or Female. Our research and this dataset mirrors the data reported as Male and Female for gender distribution analysis. No further analysis is done on the data reported from the Census Bureau.
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on the estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presening these estimates in your research.
Custom data
If you do need custom data for any of your research project, report or presentation, you can contact our research staff at research@neilsberg.com for a feasibility of a custom tabulation on a fee-for-service basis.
Neilsberg Research Team curates, analyze and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights is made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Gratis Population by Race & Ethnicity. You can refer the same here
This dataset contains the metadata of the datasets published in 77 Dataverse installations, information about each installation's metadata blocks, and the list of standard licenses that dataset depositors can apply to the datasets they publish in the 36 installations running more recent versions of the Dataverse software. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data. How the metadata was downloaded The dataset metadata and metadata block JSON files were downloaded from each installation on October 2 and October 3, 2022 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another named "apikey" listing my accounts' API tokens. The Python script expects and uses the API tokens in this CSV file to get metadata and other information from installations that require API tokens. How the files are organized ├── csv_files_with_metadata_from_most_known_dataverse_installations │ ├── author(citation).csv │ ├── basic.csv │ ├── contributor(citation).csv │ ├── ... │ └── topic_classification(citation).csv ├── dataverse_json_metadata_from_each_known_dataverse_installation │ ├── Abacus_2022.10.02_17.11.19.zip │ ├── dataset_pids_Abacus_2022.10.02_17.11.19.csv │ ├── Dataverse_JSON_metadata_2022.10.02_17.11.19 │ ├── hdl_11272.1_AB2_0AQZNT_v1.0.json │ ├── ... │ ├── metadatablocks_v5.6 │ ├── astrophysics_v5.6.json │ ├── biomedical_v5.6.json │ ├── citation_v5.6.json │ ├── ... │ ├── socialscience_v5.6.json │ ├── ACSS_Dataverse_2022.10.02_17.26.19.zip │ ├── ADA_Dataverse_2022.10.02_17.26.57.zip │ ├── Arca_Dados_2022.10.02_17.44.35.zip │ ├── ... │ └── World_Agroforestry_-_Research_Data_Repository_2022.10.02_22.59.36.zip └── dataset_pids_from_most_known_dataverse_installations.csv └── licenses_used_by_dataverse_installations.csv └── metadatablocks_from_most_known_dataverse_installations.csv This dataset contains two directories and three CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 18 CSV files that contain the values from common metadata fields of all 77 Dataverse installations. For example, author(citation)_2022.10.02-2022.10.03.csv contains the "Author" metadata for all published, non-deaccessioned, versions of all datasets in the 77 installations, where there's a row for each author name, affiliation, identifier type and identifier. The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 77 zipped files, one for each of the 77 Dataverse installations whose dataset metadata I was able to download using Dataverse APIs. Each zip file contains a CSV file and two sub-directories: The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation as well as a column to indicate whether or not the Python script was able to download the Dataverse JSON metadata for each dataset. For Dataverse installations using Dataverse software versions whose Search APIs include each dataset's owning Dataverse collection name and alias, the CSV files also include which Dataverse collection (within the installation) that dataset was published in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I saved them so that they can be used when extracting metadata from the Dataverse JSON files. The dataset_pids_from_most_known_dataverse_installations.csv file contains the dataset PIDs of all published datasets in the 77 Dataverse installations, with a column to indicate if the Python script was able to download the dataset's metadata. It's a union of all of the "dataset_pids_..." files in each of the 77 zip files. The licenses_used_by_dataverse_installations.csv file contains information about the licenses that a number of the installations let depositors choose when creating datasets. When I collected this data, 36 installations were running versions of the Dataverse software that allow depositors to choose a license or data use agreement from a dropdown menu in the dataset deposit form. For more information, see https://guides.dataverse.org/en/5.11.1/user/dataset-management.html#choosing-a-license. The metadatablocks_from_most_known_dataverse_installations.csv file contains the metadata block names, field names and child field names (if the field is a compound field) of the 77 Dataverse installations' metadata blocks. The metadatablocks_from_most_known_dataverse_installations.csv file is useful for comparing each installation's dataset metadata model (the metadata fields and the metadata blocks that each installation uses). The CSV file was created using a Python script at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_csv_file_with_metadata_block_fields_of_all_installations.py, which takes as inputs the directories and files created by the get_dataset_metadata_of_all_installations.py script. Known errors The metadata of two datasets from one of the known installations could not be downloaded because the datasets' pages and metadata could not be accessed with the Dataverse APIs. About metadata blocks Read about the Dataverse software's metadata blocks system at http://guides.dataverse.org/en/latest/admin/metadatacustomization.html
This dataset contains hourly capacity factors for each renewable resource class and region (in this case, county). Technologies like large-scale utility PV (UPV), onshore (land-based) wind, offshore wind, and concentrating solar power (CSP) are included. Hourly profiles are provided for 15 weather years covering 2007-2013 and 2016-2023 for all technologies except for CSP, which is only provided for 2007-2013 due to the lack of data covering the latter years. These data are used as inputs to the ReEDS-2.0 model (see the "ReEDS 2.0 GitHub Repository" resource link below), developed by NREL. The weather profiles apply to any capacity that exists or is built in each region and class. This helps calculate the generation that can be provided using these resources. These data are compatible with ReEDs Version 2025.1 and newer. To run county-level with older versions (2024.0-2024.3), use the data posted with the "Link to Data for Older Versions of ReEDS" resource below. Open, reference, and limited are 3 scenarios based on land-use allowance, derived from the Renewable Energy Potential (reV) model developed by NREL, which helps generate supply curves for renewable technologies and assess the maximum potential of renewable resources in a designated area. Each zipped file in this dataset corresponds to a technology and contains the respective land-use scenario files required to run that technology in ReEDS. To use this dataset, download and place the extracted files in the locally cloned ReEDS repository inside one of the folders (inputs/variability/multi_year). After completing this copy, upon running the ReEDS model at the county-level spatial resolution for respective analysis purposes, the program will detect the presence of these files and will not fail. The data provided here correspond to the 2024 supply curves from the reV model; for details see the "Renewable Energy Technical Potential and Supply Curves for the Contiguous United States - 2024 Edition" resource below. For additional details on the profiles see the README below.
https://darus.uni-stuttgart.de/api/datasets/:persistentId/versions/2.0/customlicense?persistentId=doi:10.18419/DARUS-4143https://darus.uni-stuttgart.de/api/datasets/:persistentId/versions/2.0/customlicense?persistentId=doi:10.18419/DARUS-4143
This dataset features both data and code related to the research article titled "Rayleigh Invariance Enables Estimation of Effective CO2 Fluxes Resulting from Convective Dissolution in Water-Filled Fractures." It includes raw data packaged in tarball format, including Python scripts used to derive the results presented in the publication. High-resolution raw data for contour plots is available upon request. 1 Download the Dataset: Download the dataset file using Access Dataset. Ensure you have sufficient disk space available for storing and processing the dataset. 2 Extract the Dataset: Once the dataset file is downloaded, extract its contents. The dataset is compressed in a tar.xz format. Use appropriate tools to extract it. For example, in Linux, you can use the following command: tar -xf Publication_CCS.tar.xz tar -xf Publication_Karst.tar.xz tar -xf Validation_Sim.tar.xz This will create a directory containing the dataset files. 3 Install Required Python Packages: Before running any code, ensure you have the necessary Python (version 3.10 tested) packages installed. The required packages and their versions are listed in the requirements.txt file. You can install the required packages using pip: pip install -r requirements.txt 4 Run the Post Processing Script: After extracting the dataset and installing the required Python packages, you can run the provided post processing script. The post processing script (post_process.py) is designed to replicate all the plots from a publication based on the dataset. Execute the script using Python: python3 post_process.py This script will generate the plots and output them to the specified directory. 5 Explore and Analyze: Once the script has completed running, you can explore the generated plots to gain insights from the dataset. Feel free to modify the script or use the dataset in your own analysis and experiments. High-resolution data, such as the vtu's for contour plots is available upon request; please feel free to reach out if needed. 6 Small Grid Study: There is a tarball for the data that was generated to study the grid used in the related publication. tar -xf Publication_CCS.tar.xz If you unpack the tarball and have the requirements from above installed, you can use the python script to generate the plots. 7 Citation: If you use this dataset in your research or publication, please cite the original source appropriately to give credit to the authors and contributors.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the United States population over the last 20 plus years. It lists the population for each year, along with the year on year change in population, as well as the change in percentage terms for each year. The dataset can be utilized to understand the population change of United States across the last two decades. For example, using this dataset, we can identify if the population is declining or increasing. If there is a change, when the population peaked, or if it is still growing and has not reached its peak. We can also compare the trend with the overall trend of United States population over the same period of time.
Key observations
In 2022, the population of United States was 333,287,557, a 0.38% increase year-by-year from 2021. Previously, in 2021, United States population was 332,031,554, an increase of 0.16% compared to a population of 331,511,512 in 2020. Over the last 20 plus years, between 2000 and 2022, population of United States increased by 51,125,146. In this period, the peak population was 333,287,557 in the year 2022. The numbers suggest that the population has not reached its peak yet and is showing a trend of further growth. Source: U.S. Census Bureau Population Estimates Program (PEP).
When available, the data consists of estimates from the U.S. Census Bureau Population Estimates Program (PEP).
Data Coverage:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on the estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presening these estimates in your research.
Custom data
If you do need custom data for any of your research project, report or presentation, you can contact our research staff at research@neilsberg.com for a feasibility of a custom tabulation on a fee-for-service basis.
Neilsberg Research Team curates, analyze and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights is made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for United States Population by Year. You can refer the same here
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The aim of this paper is to investigate the re-use of research data deposited in digital data archive in the social sciences. The study examines the quantity, type, and purpose of data downloads by analyzing enriched user log data collected from Swiss data archive. The findings show that quantitative datasets are downloaded increasingly from the digital archive and that downloads focus heavily on a small share of the datasets. The most frequently downloaded datasets are survey datasets collected by research organizations offering possibilities for longitudinal studies. Users typically download only one dataset, but a group of heavy downloaders form a remarkable share of all downloads. The main user group downloading data from the archive are students who use the data in their studies. Furthermore, datasets downloaded for research purposes often, but not always, serve to be used in scholarly publications. Enriched log data from data archives offer an interesting macro level perspective on the use and users of the services and help understanding the increasing role of repositories in the social sciences. The study provides insights into the potential of collecting and using log data for studying and evaluating data archive use.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the accompanying dataset to the following paper https://www.nature.com/articles/s41597-023-01975-w
Caravan is an open community dataset of meteorological forcing data, catchment attributes, and discharge daat for catchments around the world. Additionally, Caravan provides code to derive meteorological forcing data and catchment attributes from the same data sources in the cloud, making it easy for anyone to extend Caravan to new catchments. The vision of Caravan is to provide the foundation for a truly global open source community resource that will grow over time.
If you use Caravan in your research, it would be appreciated to not only cite Caravan itself, but also the source datasets, to pay respect to the amount of work that was put into the creation of these datasets and that made Caravan possible in the first place.
All current development and additional community extensions can be found at https://github.com/kratzert/Caravan
Channel Log:
23 May 2022: Version 0.2 - Resolved a bug when renaming the LamaH gauge ids from the LamaH ids to the official gauge ids provided as "govnr" in the LamaH dataset attribute files.
24 May 2022: Version 0.3 - Fixed gaps in forcing data in some "camels" (US) basins.
15 June 2022: Version 0.4 - Fixed replacing negative CAMELS US values with NaN (-999 in CAMELS indicates missing observation).
1 December 2022: Version 0.4 - Added 4298 basins in the US, Canada and Mexico (part of HYSETS), now totalling to 6830 basins. Fixed a bug in the computation of catchment attributes that are defined as pour point properties, where sometimes the wrong HydroATLAS polygon was picked. Restructured the attribute files and added some more meta data (station name and country).
16 January 2023: Version 1.0 - Version of the official paper release. No changes in the data but added a static copy of the accompanying code of the paper. For the most up to date version, please check https://github.com/kratzert/Caravan
10 May 2023: Version 1.1 - No data change, just update data description.
17 May 2023: Version 1.2 - Updated a handful of attribute values that were affected by a bug in their derivation. See https://github.com/kratzert/Caravan/issues/22 for details.
16 April 2024: Version 1.4 - Added 9130 gauges from the original source dataset that were initially not included because of the area thresholds (i.e. basins smaller than 100sqkm or larger than 2000sqkm). Also extended the forcing period for all gauges (including the original ones) to 1950-2023. Added two different download options that include timeseries data only as either csv files (Caravan-csv.tar.xz) or netcdf files (Caravan-nc.tar.xz). Including the large basins also required an update in the earth engine code
16 Jan 2025: Version 1.5 - Added FAO Penman-Monteith PET (potential_evaporation_sum_FAO_PENMAN_MONTEITH) and renamed the ERA5-LAND potential_evaporation band to potential_evaporation_sum_ERA5_LAND. Also added all PET-related climated indices derived with the Penman-Monteith PET band (suffix "_FAO_PM") and renamed the old PET-related indices accordingly (suffix "_ERA5_LAND").
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Person Tracking is a dataset for object detection tasks - it contains Person annotations for 473 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Home Objects is a dataset for object detection tasks - it contains Objects annotations for 4,467 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Red Tomatoes is a dataset for object detection tasks - it contains Tomatoes annotations for 308 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset contains version 3.0 (March 2025 release) of the Global Fishing Watch apparent fishing effort dataset. Data is available for 2012-2024 and based on positions of >190,000 unique automatic identification system (AIS) devices on fishing vessels, of which up to ~96,000 are active in a given year. Fishing vessels are identified via a machine learning model, vessel registry databases, and manual review by GFW and regional experts. Vessel time is measured in hours, calculated by assigning to each AIS position the amount of time elapsed since the previous AIS position of the vessel. The time is counted as apparent fishing hours if the GFW fishing detection model - a neural network machine learning model - determines the vessel is engaged in fishing behavior during that AIS position.
Data are spatially binned into grid cells that measure 0.01 or 0.1 degrees on a side; the coordinates defining each cell are provided in decimal degrees (WGS84) and correspond to the lower-left corner. Data are available in the following formats:
The fishing effort dataset is accompanied by a table of vessel information (e.g. gear type, flag state, dimensions).
Fishing effort and vessel presence data are available as .csv files in daily formats. Files for each year are stored in separate .zip files. A README.txt and schema.json file is provided for each dataset version and contains the table schema and additional information. There is also a README-known-issues-v3.txt file outlining some of the known issues with the version 3 release.
Files are names according to the following convention:
Daily file format:
[fleet/mmsi]-daily-csvs-[100/10]-v3-[year].zip
[fleet/mmsi]-daily-csvs-[100/10]-v3-[date].csv
Monthly file format:
fleet-monthly-csvs-10-v3-[year].zip
fleet-monthly-csvs-10-v3-[date].csv
Fishing vessel format: fishing-vessels-v3.csv
README file format: README-[fleet/mmsi/fishing-vessels/known-issues]-v3.txt
File identifiers:
[fleet/mmsi]: Data by fleet (flag and geartype) or by MMSI
[100/10]: 100th or 10th degree resolution
[year]: Year of data included in .zip file
[date]: Date of data included in .csv files. For monthly data, [date]corresponds to the first date of the month
Examples: fleet-daily-csvs-100-v3-2020.zip; mmsi-daily-csvs-10-v3-2020-01-10.csv; fishing-vessels-v3.csv; README-fleet-v3.txt; fleet-monthly-csvs-10-v3-2024.zip; fleet-monthly-csvs-10-v3-2024-08-01.csv
For an overview of how GFW turns raw AIS positions into estimates of fishing hours, see this page.
The models used to produce this dataset were developed as part of this publication: D.A. Kroodsma, J. Mayorga, T. Hochberg, N.A. Miller, K. Boerder, F. Ferretti, A. Wilson, B. Bergman, T.D. White, B.A. Block, P. Woods, B. Sullivan, C. Costello, and B. Worm. "Tracking the global footprint of fisheries." Science 361.6378 (2018). Model details are available in the Supplementary Materials.
The README-known-issues-v3.txt file describing this dataset's specific caveats can be downloaded from this page. We highly recommend that users read this file in full.
The README-mmsi-v3.txt file, the README-fleet-v3.txt file, and the README-fishing-vessels-v3.txt files are downloadable from this page and contain the data description for (respectively) the fishing hours by MMSI dataset, the fishing hours by fleet dataset, and the vessel information file. These readmes contain key explanations about the gear types and flag states assigned to vessels in the dataset.
File name structure for the datafiles are available below on this page and file schema can be downloaded from this page.
A FAQ describing the updates in this version and the differences between this dataset and the data available from the GFW Map and APIs is available here.
The apparent fishing hours dataset is intended to allow users to analyze patterns of fishing across the world’s oceans at temporal scales as fine as daily and at spatial scales as fine as 0.1 or 0.01 degree cells. Fishing hours can be separated out by gear type, vessel flag and other characteristics of vessels such as tonnage.
Potential applications for this dataset are broad. We offer suggested use cases to illustrate its utility. The dataset can be integrated as a static layer in multi-layered analyses, allowing researchers to investigate relationships between fishing effort and other variables, including biodiversity, tracking, and environmental data, as defined by their research objectives.
A few example questions that these data could be used to answer:
What flag states have fishing activity in my area of interest?
Do hotspots of longline fishing overlap with known migration routes of sea turtles?
How does fishing time by trawlers change by month in my area of interest? Which seasons see the most trawling hours and which see the least?
This global dataset estimates apparent fishing hours effort. The dataset is based on publicly available information and statistical classifications which may not fully capture the nuances of local fishing practices. While we manually review the dataset at a global scale and in a select set of smaller test regions to check for issues, given the scale of the dataset we are unable to manually review every fleet in every region. We recognize the potential for inaccuracies and encourage users to approach regional analyses with caution, utilizing their own regional expertise to validate findings. We welcome your feedback on any regional analysis at research@globalfishingwatch.org to enhance the dataset's accuracy.
Caveats relating to known sources of inaccuracy as well as interpretation pitfalls to avoid are described in the README-known-issues-v3.txt file available for download from this page. We highly recommend that users read this file in full. The issues described include:
Data from 2024 should be considered provisional, as vessel classifications may change as more data from 2025 becomes available.
MMSI is used in this dataset as the vessel identifier. While MMSI is intended to serve as the unique AIS identifier for an individual vessel, this does not always hold in practice.
The Maritime Identification Digits (MID), the first 3 digits of MMSI, are the only source of information on vessel flag state when the vessel does not appear on a registry. The MID may be entered incorrectly, obscuring information about an MMSI’s flag state.
AIS reception is not consistent across all areas and changes over time.
Query using SQL in the Global Fishing Watch public BigQuery dataset: global-fishing-watch.fishing_effort_v3
Download the entire dataset from the Global Fishing Watch Data Download Portal (https://globalfishingwatch.org/data-download/datasets/public-fishing-effort)
Attribution-NonCommercial 3.0 (CC BY-NC 3.0)https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
This repository contains the data related to the paper ** "Granulometry transformer: image-based granulometry of concrete aggregate for an automated concrete production control" ** where a deep learning based method is proposed for the image based determination of concrete aggregate grading curves (cf. video).
More specifically, the data set consists of images showing concrete aggregate particles and reference data of the particle size distribution (grading curves) associated to each image. It is distinguished between the CoarseAggregateData and the FineAggregateData.
The coarse data consists of aggregate samples with different particles sizes ranging from 0.1 mm to 32 mm. The grading curves are designed by linearly interpolation between a very fine and a very coarse distribution for three variants with maximum grain sizes of 8 mm, 16 mm, and 32 mm, respectively. For each variant, we designed eleven grading curves, resulting in a total number 33, which are shown in the figure below. For each sample, we acquired 50 images with a GSD of 0.125 mm, resulting in a data set of 1650 images in total. Example images for a subset of the grading curves of this data set are shown in the following figure.
https://data.uni-hannover.de/dataset/ecb0bf04-84c8-45b1-8a43-044f3f80d92c/resource/8cb30616-5b24-4028-9c1d-ea250ac8ac84/download/examplecoarse.png" alt="Example images and grading curves of the coarse data set" title=" ">
Similar to the previous data set, the fine data set contains grading curves for the fine
fraction of concrete aggregate of 0 to 2 mm with a GSD of 28.5 $\mu$m.
We defined two base distributions of different shapes for the upper and lower bound, respectively, resulting in two interpolated grading curve sets (Set A and Set B). In total, 1700
images of 34 different particle size distributions were acquired. Example images of the data set and the corresponding grading curves are shown in the figure below.
https://data.uni-hannover.de/dataset/ecb0bf04-84c8-45b1-8a43-044f3f80d92c/resource/c56f4298-9663-457f-aaa7-0ba113fec4c9/download/examplefine.png" alt="Example images and grading curves of the finedata set" title=" ">
If you make use of the proposed data, please cite.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Roboflow Can Data is a dataset for object detection tasks - it contains Can annotations for 3,154 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
IMDb Datasets Subsets of IMDb data are available for access to customers for personal and non-commercial use. You can hold local copies of this data, and it is subject to our terms and conditions. Please refer to the Non-Commercial Licensing and copyright/license and verify compliance.
Data Location
The dataset files can be accessed and downloaded from https://datasets.imdbws.com/. The data is refreshed daily.
IMDb Dataset Details
Each dataset is in a tab-separated-values (TSV) formatted file in the UTF-8 character set. The first line in each file contains headers that describe what is in each column. A ‘\N’ is used to denote that a particular field is missing or null for that title/name.