Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”
A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open-access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org
Please cite this when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to a crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.
The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes 8 text files containing the scripts for web scraping. They were written in R 3.6.3 for Windows.
The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.
The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found, using an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with the suggested films on the IMDb search page. Matching was done on directors, production year (+/- one year), and title, using a fuzzy matching approach with two methods: cosine similarity, which matches titles with a high degree of similarity, and the OSA (optimal string alignment) algorithm, which matches titles that may contain typos or minor variations.
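The original matching is implemented in R; purely as an illustration of the rule described above, the following Python sketch combines a director overlap check, a production-year tolerance of one year, and a character-bigram cosine similarity on titles (the 0.85 threshold and all helper and field names are assumptions made for this example, not values taken from the scripts):

from collections import Counter
from math import sqrt

def bigrams(text):
    # Character-bigram profile of a lower-cased title
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine_title_similarity(a, b):
    # Cosine similarity between the bigram profiles of two titles
    va, vb = bigrams(a), bigrams(b)
    dot = sum(va[g] * vb[g] for g in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def is_plausible_match(core_film, imdb_film, title_threshold=0.85):
    # Hypothetical rule: shared director, year within +/- 1, and similar title
    same_director = bool(set(core_film["directors"]) & set(imdb_film["directors"]))
    close_year = abs(core_film["year"] - imdb_film["year"]) <= 1
    similar_title = cosine_title_similarity(core_film["title"], imdb_film["title"]) >= title_threshold
    return same_director and close_year and similar_title

core = {"title": "The Wandering Soap Opera", "year": 2017, "directors": ["Raúl Ruiz"]}
candidate = {"title": "The Wandering Soap Opera", "year": 2017, "directors": ["Raúl Ruiz", "Valeria Sarmiento"]}
print(is_plausible_match(core, candidate))  # True under these assumptions

The OSA (restricted Damerau-Levenshtein) distance used as the second measure is available, for instance, in the R stringdist package and in several Python string-distance libraries.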
The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was assigned to one of five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible duplicates in the dataset and flags them for a manual check.
The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.
The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. It does so for the first 100 films only, as a check that everything works. Scraping the entire dataset took a few hours, so a test run on a subsample of 100 films is advisable.
The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.
The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script “r_check_logs” is used for troubleshooting and for tracking the progress of all of the R scripts used. It reports the number of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.
The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definitions of variables (such as location, festival name, and festival categories), units of measurement, data sources, coding, and information on missing data.
The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset is in wide format, i.e. all information for each festival is listed in one row.
| Field Name | Description |
|---|---|
| Dataset Name | The official title of the dataset as listed in the inventory. |
| Brief Description of Data | A short summary explaining the contents and purpose of the dataset. |
| Data Source | The origin or system from which the data is collected or generated. |
| Home Department | The primary department responsible for the dataset. |
| Home Department Division | The specific division within the department that manages the dataset. |
| Data Steward (Business) Name | The name of the person responsible for the dataset’s accuracy and relevance. |
| Data Custodian (Technical) Name | The technical contact responsible for maintaining and managing the dataset infrastructure. |
| Data Classification | The sensitivity level of the data (e.g., Public, Internal, Confidential). |
| Data Format | The file format(s) in which the dataset is available (e.g., CSV, JSON, Shapefile). |
| Frequency of Data Change | How often the dataset is updated (e.g., Daily, Weekly, Monthly, Annually). |
| Time Span | The overall time period the dataset covers. |
| Start Date | The beginning date of the data. |
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset was generated as part of a study aimed at profiling global scientific academies, which play a significant role in promoting scholarly communication and scientific progress. Below is a detailed description of the dataset.

Data Generation Procedures and Tools: The dataset was compiled using a combination of web scraping, manual verification, and data integration from multiple sources, including Wikipedia categories, membership lists of unions of scientific organizations, and web searches using specific query phrases (e.g., "country name + (academy OR society) AND site:.country code"). The records were enriched by cross-referencing data from the Wikidata API, the VIAF API, and the Research Organisation Registry (ROR). Additional manual curation ensured accuracy and consistency.

Temporal and Geographical Scopes: The dataset covers scientific academies from a wide temporal scope, ranging from the 15th century to the present. The geographical scope includes academies from all continents, with emphasis on both developed and developing countries. The dataset aims to capture the full spectrum of scientific academies across different periods of historical development.

Tabular Data Description: The dataset comprises a total of 301 academy records and 14,008 website navigation sections. Each row in the dataset represents a single scientific academy, while the columns describe attributes such as the academy’s name, founding date, location (city and country), website URL, email, and address.

Missing Data: Although the dataset offers comprehensive coverage, some entries may have missing or incomplete fields. For instance, the section field was not available for all records.

Data Errors and Error Ranges: The data has been verified through manual curation, reducing the likelihood of errors. However, the use of crowd-sourced data from platforms like Wikipedia introduces potential risks of outdated or incomplete information. Any errors are likely minor and confined to fields such as navigation menu classifications, which may not fully reflect the breadth of an academy's activities.

Data Files, Formats, and Sizes: The dataset is provided in CSV and JSON formats, ensuring compatibility with a wide range of software applications, including Microsoft Excel, Google Sheets, and programming languages such as Python (via libraries like pandas).

This dataset provides a valuable resource for further research into the organizational behaviors, geographic distribution, and historical significance of scientific academies across the globe. It can be used for large-scale analyses, including comparative studies across different regions or time periods. Any feedback on the data is welcome; please contact the maintainer of the dataset.

If you use the data, please cite the following paper: Xiaoli Chen and Xuezhao Wang. 2024. Profiling Global Scientific Academies. In The 2024 ACM/IEEE Joint Conference on Digital Libraries (JCDL ’24), December 16–20, 2024, Hong Kong, China. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3677389.3702582
https://borealisdata.ca/api/datasets/:persistentId/versions/4.0/customlicense?persistentId=doi:10.5683/SP3/QVOT0Y
UNI-CEN Standardized Census Data Tables contain Census data that have been reformatted into a common table format with standardized variable names and codes. The data are provided in two tabular formats for different use cases. "Long" tables are suitable for use in statistical environments, while "wide" tables are commonly used in GIS environments. The long tables are provided in Stata Binary (dta) format, which is readable by all statistics software. The wide tables are provided in comma-separated values (csv) and dBase 3 (dbf) formats with codebooks. The wide tables are easily joined to the UNI-CEN Digital Boundary Files. For the csv files, a .csvt file is provided to ensure that column data formats are correctly formatted when importing into QGIS. A schema.ini file does the same when importing into ArcGIS environments. As the DBF file format supports a maximum of 250 columns, tables with a larger number of variables are divided into multiple DBF files. For more information about file sources, the methods used to create them, and how to use them, consult the documentation at https://borealisdata.ca/dataverse/unicen_docs. For more information about the project, visit https://observatory.uwo.ca/unicen.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Japanese Chain of Thought prompt-response dataset, a meticulously curated collection containing 3000 comprehensive prompt and response pairs. This dataset is an invaluable resource for training Language Models (LMs) to generate well-reasoned answers and minimize inaccuracies. Its primary utility lies in enhancing LLMs' reasoning skills for solving arithmetic, common sense, symbolic reasoning, and complex problems.
This COT dataset comprises a diverse set of instructions and questions paired with corresponding answers and rationales in the Japanese language. These prompts and completions cover a broad range of topics and questions, including mathematical concepts, common sense reasoning, complex problem-solving, scientific inquiries, puzzles, and more.
Each prompt is meticulously accompanied by a response and rationale, providing essential information and insights to enhance the language model training process. These prompts, completions, and rationales were manually curated by native Japanese people, drawing references from various sources, including open-source datasets, news articles, websites, and other reliable references.
Our chain-of-thought prompt-completion dataset includes various prompt types, such as instructional prompts, continuations, and in-context learning (zero-shot, few-shot) prompts. Additionally, the dataset contains prompts and completions enriched with various forms of rich text, such as lists, tables, code snippets, JSON, and more, with proper markdown format.
To ensure a wide-ranging dataset, we have included prompts from a plethora of topics related to mathematics, common sense reasoning, and symbolic reasoning. These topics encompass arithmetic, percentages, ratios, geometry, analogies, spatial reasoning, temporal reasoning, logic puzzles, patterns, and sequences, among others.
These prompts vary in complexity, spanning easy, medium, and hard levels. Various question types are included, such as multiple-choice, direct queries, and true/false assessments.
To accommodate diverse learning experiences, our dataset incorporates different types of answers depending on the prompt and provides step-by-step rationales. The detailed rationale aids the language model in building a reasoning process for complex questions.
These responses encompass text strings, numerical values, and date and time formats, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.
This fully labeled Japanese Chain of Thought Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt complexity, prompt category, domain, response, rationale, response type, and rich text presence.
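As a rough illustration of what a single record might look like when loaded in Python, the sketch below constructs and reads back one entry; the key names are assumptions inferred from the field list above, not the dataset's documented schema:

import json

# Hypothetical record; field names inferred from the description above
sample_record = {
    "id": "cot-ja-000001",
    "prompt": "りんごを3個、みかんを5個買いました。果物は全部で何個ですか。",
    "prompt_type": "instructional",
    "prompt_complexity": "easy",
    "prompt_category": "arithmetic",
    "domain": "mathematics",
    "response": "8個",
    "rationale": "りんご3個とみかん5個を合わせると 3 + 5 = 8 個になります。",
    "response_type": "numerical",
    "rich_text_presence": False,
}

with open("cot_sample.json", "w", encoding="utf-8") as f:
    json.dump([sample_record], f, ensure_ascii=False, indent=2)

with open("cot_sample.json", encoding="utf-8") as f:
    records = json.load(f)
print(records[0]["rationale"])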
Quality and Accuracy
Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses and rationales are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.
The Japanese version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.
Continuous Updates and Customization
The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom chain of thought prompt completion data tailored to specific needs, providing flexibility and customization options.
License
The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Japanese Chain of Thought Prompt Completion Dataset to enhance the rationale and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.
The data comes courtesy of the American Kennel Club.
breed_traits - trait information on each dog breed and scores for each trait (wide format)
trait_description - long descriptions of each trait and values corresponding to Trait_Score
breed_rank_all - popularity of dog breeds by AKC registration statistics from 2013-2020
breed_traits_long.csv

| variable | class | description |
|---|---|---|
| Breed | character | Dog Breed |
| Trait | character | Name of trait/characteristic |
| Trait_Score | character | Placement on scale of 1-5 for the trait, with the exception of a description for coat type and length |
breed_traits.csv

| variable | class | description |
|---|---|---|
| Breed | character | Dog Breed |
| Affectionate With Family | character | Placement on scale of 1-5 for the breed's tendency to be "Affectionate With Family" (Trait_Score) |
| Good With Young Children | character | Placement on scale of 1-5 for the breed's tendency to be "Good With Young Children" (Trait_Score) |
| Good With Other Dogs | character | Placement on scale of 1-5 for the breed's tendency to be "Good With Other Dogs" (Trait_Score) |
| Shedding Level | character | Placement on scale of 1-5 for the breed's "Shedding Level" (Trait_Score) |
| Coat Grooming Frequency | character | Placement on scale of 1-5 for the breed's "Coat Grooming Frequency" (Trait_Score) |
| Drooling Level | character | Placement on scale of 1-5 for the breed's "Drooling Level" (Trait_Score) |
| Coat Type | character | Description of the breed's coat type (Trait_Score) |
| Coat Length | character | Description of the breed's coat length (Trait_Score) |
| Openness To Strangers | character | Placement on scale of 1-5 for the breed's tendency to be open to strangers (Trait_Score) |
| Playfulness Level | character | Placement on scale of 1-5 for the breed's tendency to be playful (Trait_Score) |
| Watchdog/Protective Nature | character | Placement on scale of 1-5 for the breed's "Watchdog/Protective Nature" (Trait_Score) |
| Adaptability Level | character | Placement on scale of 1-5 for the breed's tendency to be adaptable (Trait_Score) |
| Trainability Level | character | Placement on scale of 1-5 for the breed's tendency to be trainable (Trait_Score) |
| Energy Level | character | Placement on scale of 1-5 for the breed's "Energy Level" (Trait_Score) |
| Barking Level | character | Placement on scale of 1-5 for the breed's "Barking Level" (Trait_Score) |
| Mental Stimulation Needs | character | Placement on scale of 1-5 for the breed's "Mental Stimulation Needs" (Trait_Score) |
trait_description.csv

| variable | class | description |
|---|---|---|
| Trait | character | Name of the trait/characteristic |
| Trait_1 | character | Value corresponding to Trait when Trait_Score = 1 |
| Trait_5 | character | Value corresponding to Trait when Trait_Score = 5 |
| Description | character | Long description of trait |
breed_rank_all.csv

| variable | class | description |
|---|---|---|
| Breed | character | Dog Breed |
| 2013 Rank | character | Popularity of breed based on AKC registration statistics in 2013 |
| 2014 Rank | character | Popularity of breed based on AKC registration statistics in 2014 |
| 2015 Rank | character | Popularity of breed based on AKC registration statistics in 2015 |
| 2016 Rank | character | Popularity of breed based on AKC registration statistics in 2016 |
| 2017 Rank | character | Popularity of breed based on AKC registration statistics in 2017 |
| 2018 Rank | character | Popularity of breed based on AKC registration statistics in 2018 |
| 2019 Rank | character | Popularity of breed based on AKC registration statistics in 2019 |
| 2020 Rank | character | Popularity of breed based on AKC registration statistics in 2020 |
| links | character | Link to the dog breed's AKC webpage |
| Image | character | Link to image of dog breed |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a CSV file containing responses from an online survey titled “AI in Business – EU Contextual Regulations and Future Workforce Impacts.” The survey ran from late 2023 through 2025, gathering over 300 responses from participants (primarily in Romania and other EU regions). Each row in the CSV represents one respondent’s answers. Questions span multiple-choice and Likert scale responses, divided into three sections: (A) demographics and background, (B) perceptions of AI in business and relevant EU regulations, and (C) motivations and experiences with AI adoption in business. The questionnaire was provided in both English and Romanian to maximize clarity and participation in a bilingual context[1]. Table 1 below summarizes the structure of the dataset, listing each field (column) and its contents or question topic.
Table 1: Survey Dataset Structure
| Field | Description |
|---|---|
| Timestamp | Date and time when the response was submitted. |
| Email address | Email address of the respondent. |
| Age group | Respondent's age group (e.g., 18–24, 25–34, etc.). |
| Gender | Gender of the respondent (Female, Male, Prefer not to say). |
| Highest level of education | Highest educational attainment of the respondent (High school or less; Bachelor’s/Master’s; Doctorate or above; Vocational; Prefer not to say). |
| Current job role | Current job role or position of the respondent (Leadership/management; Professional/technical; Administrative/support; Sales/marketing; Student; Retired; Unemployed). |
| Sector | Industry sector in which the respondent currently works (e.g., Technology, Healthcare, Finance, Education, Manufacturing, Public, Retail, Other). |
| Years of experience | Number of years the respondent has been working in their current field. |
| AI knowledge | Self-rated knowledge and understanding of AI technology (No knowledge; Beginner; Intermediate; Advanced; Expert). |
| Region | Region (in Romania or EU) where the respondent currently resides. |
| AI familiarity | Familiarity with the concept of AI in a business context (1 = Not very familiar, 10 = Very familiar). |
| AI improves operations | Belief that AI can significantly improve business operations (1 = Strongly disagree, 10 = Strongly agree). |
| EU AI regulations too restrictive | Opinion on whether current EU regulations on AI in business are too restrictive (1 = Strongly disagree, 10 = Strongly agree). |
| AI impact on job creation | Perceived impact of AI on job creation in the EU (1 = Not important, 10 = Significantly positive). |
| AI job losses | Expectation that AI will lead to job losses in the EU (1 = Definitely will not, 10 = Definitely will). |
| EU ethical AI leadership | Belief that the EU is well-positioned to lead globally in ethical AI practices (1 = Strongly disagree, 10 = Strongly agree). |
| Data privacy regs effective | Perception of how effectively EU regulations address data privacy concerns in AI (1 = Very ineffectively, 10 = Very effectively). |
| AI in decision-making stance | Stance on AI’s role in business decision-making processes (1 = Fully oppose, 10 = Fully support). |
| Invest in AI education | Support for investing more in AI education and workforce training (1 = Definitely no, 10 = Definitely yes). |
| AI creates new jobs | Belief that AI can lead to new types of jobs not existing today (1 = Strongly do not believe, 10 = Strongly believe). |
| Transparency of AI processes | Perception of how transparent AI-driven processes are in businesses (1 = Completely opaque, 10 = Completely transparent). |
| Confidence in AI data handling | Confidence in AI’s ability to ethically handle sensitive data (1 = Not confident at all, 10 = Very confident). |
| AI widens SME gap | Belief that AI technologies will widen the gap between large corporations and SMEs (1 = Strongly disagree, 10 = Strongly agree). |
| Future AI outlook (EU) | Outlook on the future of AI in EU business over the next decade (1 = Extremely pessimistic, 5 = Extremely optimistic). |
| AI integration cost perception | Perceived cost of integrating AI into operations for SMEs (1 = Very affordable, 10 = Extremely costly). |
| Primary motivation for AI adoption | Primary motivation for adopting AI in the business (e.g., Improving efficiency; Customer experience; Competitive advantage; Cost reduction). |
| AI helps navigate uncertainties | Belief that AI can help the business navigate market uncertainties more effectively (1 = Strongly disagree, 10 = Strongly agree). |
| Awareness of EU AI regulations | Self-assessed understanding and awareness of EU AI regulations and their impact (1 = Not aware, 10 = Fully aware). |
| Challenges in data access for AI | Whether the business has faced challenges in accessing or analyzing data for AI applications (Frequently; Occasionally; Rarely; Never; Not applicable). |
| AI critical for SME future | Perceived criticality of AI in shaping future strategies and business models for SMEs (Extremely critical; Important; Somewhat important; Unimportant; Unsure). |
| Plan to invest in AI (1-2 yrs) | Plan to invest in AI technologies or tools in the next 1–2 years (Yes, definitely; Considering; Unsure; Unlikely; No plans). |
| Support needed for AI adoption | Type of support or resources perceived to most help SMEs adopt AI effectively (Financial grants; Training resources; Government policies; Technical assistance). |
| AI impact on job roles | Perceived impact of AI on employee roles in the business (Create new roles; Augment existing roles; Replace certain roles; No significant impact; Unsure). |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The unified database of ozonesounding profiles was obtained through the merging of three existing ozonesounding datasets, provided by the Southern Hemisphere Additional OZonesondes (SHADOZ), the Network for the Detection of Atmospheric Composition Change (NDACC), and the World Ozone and Ultraviolet Radiation Data Centre (WOUDC).
Only a selected set of variables of interest, both data and metadata, was considered when building the unified dataset, due to the heterogeneous formats and varying levels of detail provided by each network, even when referring to measurements shared across different initiatives. These variables are listed in the following table.
| Standard name | Description | Unit |
|---|---|---|
| idstation | The name of the station. | N.A. |
| location_latitude | Latitude of the station. | deg |
| location_longitude | Longitude of the station. | deg |
| location_height | Height is defined as the altitude, elevation, or height of the defined platform + instrument above sea level. | m |
| date_of_observation | Date when the ozonesonde was launched (in format yyyy-mm-dd hh:mm:ss with time zone). | N.A. |
| time | Elapsed flight time since release. | s |
| pressure | Atmospheric pressure of each level in Pascals. | Pa |
| geop_alt | Geopotential height in meters. | m |
| temperature | Air temperature in degrees Kelvin. | K |
| relative_humidity | Relative humidity (dimensionless fraction). | 1 |
| wind_speed | Wind speed in meters per second. | m/s |
| wind_direction | Wind direction in degrees. | deg |
| latitude | Observation latitude (during the flight). | deg |
| longitude | Observation longitude (during the flight). | deg |
| altitude | Height of sensor above local ground or sea surface. Positive values for above surface (e.g., sondes), negative for below (e.g., XBT). For visual observations, the height of the visual observing platform. | m (a.s.l.) |
| sample_temperature | Temperature where the sample is measured, in degrees Kelvin. | K |
| o3_partial_pressure | The level partial pressure of ozone in Pascals. | Pa |
| ozone_concentration | The level mixing ratio of ozone in ppmv. | ppmv |
| ozone_partial_pressure_total_uncertainty | Total uncertainty in the calculation of the ozone partial pressure as a composite of the individual uncertainty contributions. Uncertainties due to systematic bias are assumed to be random and follow a normal distribution. The uncertainty calculation also accounts for the increased uncertainty incurred by homogenizing the data record. | Pa |
| network | Source network of the profile. | N.A. |
| type | Station classification flag. | N.A. |
| filter_check | Profile quality control flag. | N.A. |
The dataset is organized into two main tables:
To improve accessibility and performance, both tables are further subdivided into year-specific subtables, allowing for more efficient querying and data management across temporal ranges.
Among the metadata variables included in the unified_header table, type and filter_check play a key role in characterizing the quality and coverage of the ozonesounding profiles. The type variable classifies each station based on the continuity of its time series: stations are grouped into Long Coverage (G), Medium Coverage (Y), or Short Coverage (R), depending on whether they provide at least one profile per month for at least 95% of the months in their time series, spanning:
The filter_check variable is a quality control flag ranging from 0 to 3, summarizing the results of three structural checks applied to each profile: completeness of monthly coverage (at least three ascents per month), vertical coverage (reaching at least 10 hPa), and vertical resolution (minimum one data point every 100 meters). A higher filter_check value indicates better compliance with these criteria and, consequently, higher data reliability.
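As an illustration only (the actual procedure used to build the dataset is not reproduced here), the three structural checks could be combined into such a flag roughly as follows; the function and argument names are hypothetical:

def filter_check(ascents_in_month, min_pressure_hpa, max_vertical_gap_m):
    # Returns an integer 0-3: one point per structural check passed
    score = 0
    if ascents_in_month >= 3:        # completeness of monthly coverage (at least three ascents per month)
        score += 1
    if min_pressure_hpa <= 10:       # vertical coverage (profile reaches at least 10 hPa)
        score += 1
    if max_vertical_gap_m <= 100:    # vertical resolution (at least one data point every 100 m)
        score += 1
    return score

print(filter_check(ascents_in_month=4, min_pressure_hpa=8.5, max_vertical_gap_m=75))  # 3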
Furthermore, an algorithm was implemented able to merge the different datasets by handling their different
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
cigChannel (V1.0) is a dataset created by the Computational Interpretation Group (CIG) for deep-learning-based paleochannel interpretation in 3D seismic volumes. Guangyu Wang, Xinming Wu and Wen Zhang are the main contributors to the dataset.
cigChannel (V1.0) contains 1,600 synthetic 3D seismic volumes with labels of meandering channels, tributary channel networks and submarine canyons. Seismic impedance and sedimentary facies (only for submarine canyons) volumes corresponding to the seismic volumes are also included in this dataset. Components of this dataset are listed below:
| Subset name | Sample amount & size | Contents | Features |
|---|---|---|---|
| Meandering channel | 400, 256x256x256 | | |
| Tributary channel network (formerly distributary channel) | 400, 256x256x256 | | |
| Submarine canyon (formerly submarine channel) | 400, 256x256x256 | | |
| Assorted channel | 400, 256x256x256 | | |
Further details about this dataset are available in our paper published in Earth System Science Data:
Wang, G., Wu, X., and Zhang, W.: cigChannel: a large-scale 3D seismic dataset with labeled paleochannels for advancing deep learning in seismic interpretation, Earth Syst. Sci. Data, 17, 3447–3471, https://doi.org/10.5194/essd-17-3447-2025, 2025.
Due to the size limitation on uploaded files, we have to publish the dataset in separate versions. This version includes the assorted channel subset, which contains the following zip files:
Samples in this subset feature different geologic structures:
Portal to the expansion package: https://doi.org/10.5281/zenodo.15500696
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We collected all available global soil carbon (C) and heterotrophic respiration (RH) maps derived from data-driven estimates, sourcing them from public repositories and supplementary materials of previous studies (Table 1). All spatial datasets were converted to NetCDF format for consistency and ease of use.
Because the maps had varying spatial resolutions (ranging from 0.0083° to 0.5°), we harmonized all datasets to a common resolution of 0.5° (approximately 50 km at the equator). We then merged the processed maps by computing the mean, maximum, and minimum values at each grid cell, resulting in harmonized global maps of soil C (for the top 0–30 cm and 0–100 cm depths) and RH at 0.5° resolution.
Grid cells with fewer than three soil C estimates or fewer than four RH estimates were assigned NA values. Land and water grid cells were automatically distinguished by combining multiple datasets containing soil C and RH information over land.
Soil carbon turnover time (years), denoted as τ, was calculated under the assumption of a quasi-equilibrium state using the formula:
τ = CS / RH
where CS is soil carbon stock and RH is the heterotrophic respiration rate. The uncertainty range of τ was estimated for each grid cell using:
τmax = CS+ / RH−
τmin = CS− / RH+
where CS+ and CS− are the maximum and minimum soil C values, and RH+ and RH− are the maximum and minimum RH values, respectively.
To calculate the temperature sensitivity of decomposition (Q10)—the factor by which decomposition rates increase with a 10 °C rise in temperature—we followed the method described in Koven et al. (2017). The uncertainty of Q10 (maximum and minimum values) was derived using τmax and τmin, respectively.
All files are provided in NetCDF format. The SOC file includes the following variables:
· longitude, latitude
· soc: mean soil C stock (kg C m⁻²)
· soc_median: median soil C (kg C m⁻²)
· soc_n: number of estimates per grid cell
· soc_max, soc_min: maximum and minimum soil C (kg C m⁻²)
· soc_max_id, soc_min_id: study IDs corresponding to the maximum and minimum values
· soc_range: range of soil C values
· soc_sd: standard deviation of soil C (kg C m⁻²)
· soc_cv: coefficient of variation (%)
The RH file includes:
· longitude, latitude
· rh: mean RH (g C m⁻² yr⁻¹)
· rh_median, rh_n, rh_max, rh_min: as above
· rh_max_id, rh_min_id: study IDs for max/min
· rh_range, rh_sd, rh_cv: analogous variables for RH
The mean, maximum, and minimum values of soil C turnover time are provided as separate files. The Q10 files contain estimates derived from the mean values of soil C and RH, along with associated uncertainty values.
The harmonized dataset files available in the repository are as follows:
· harmonized-RH-hdg.nc: global soil heterotrophic respiration map
· harmonized-SOC100-hdg.nc: global soil C map for 0–100 cm
· harmonized-SOC30-hdg.nc: global soil C map for 0–30 cm
· Q10.nc: global Q10 map
· Turnover-time_max.nc: global soil C turnover time estimated using maximum soil C and minimum RH
· Turnover-time_min.nc: global soil C turnover time estimated using minimum soil C and maximum RH
· Turnover-time_mean.nc: global soil C turnover time estimated using mean soil C and RH
· Turnover-time30_mean.nc: global soil C turnover time estimated using the soil C map for 0-30 cm
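As a sketch of how the turnover-time files relate to the harmonized soil C and RH maps, the mean value and its range could be recomputed roughly as follows (assuming xarray is installed and that the variable names match the listing above; soil C is in kg C m⁻² while RH is in g C m⁻² yr⁻¹, hence the factor of 1,000):

import xarray as xr

soc = xr.open_dataset("harmonized-SOC100-hdg.nc")  # soil C, 0-100 cm
rh = xr.open_dataset("harmonized-RH-hdg.nc")       # heterotrophic respiration

# tau = CS / RH, converting kg C m-2 to g C m-2 so the result is in years
tau_mean = (soc["soc"] * 1000.0) / rh["rh"]

# Uncertainty range: tau_max = CS+ / RH-, tau_min = CS- / RH+
tau_max = (soc["soc_max"] * 1000.0) / rh["rh_min"]
tau_min = (soc["soc_min"] * 1000.0) / rh["rh_max"]

print(float(tau_mean.mean()))  # grid-cell average turnover time in years (unweighted)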
Version history
Version 1.1: Median values were added. Bug fix for SOC30 (n>2 was inactive in the former version)
More details are provided in: Hashimoto, S., Ito, A. & Nishina, K. (in revision) Harmonized global soil carbon and respiration datasets with derived turnover time and temperature sensitivity. Scientific Data.
Reference
Koven, C. D., Hugelius, G., Lawrence, D. M. & Wieder, W. R. Higher climatological temperature sensitivity of soil carbon in cold than warm climates. Nat. Clim. Change 7, 817–822 (2017).
Table 1: List of soil carbon and heterotrophic respiration datasets used in this study.

| Dataset | Repository/References (Dataset name) | Depth | ID in NetCDF file*** |
|---|---|---|---|
| Global soil C | Global Soil Data Task 2000 (IGBP-DIS) [1] | 0–100 | 3, - |
| | Shangguan et al. 2014 (GSDE) [2,3] | 0–100, 0–30* | 1, 1 |
| | Batjes 2016 (WISE30sec) [4,5] | 0–100, 0–30 | 6, 7 |
| | Sanderman et al. 2017 (Soil-Carbon-Debt) [6,7] | 0–100, 0–30 | 5, 5 |
| | SoilGrids team and Hengl et al. 2017 (SoilGrids) [8,9] | 0–30** | -, 6 |
| | Hengl and Wheeler 2018 (LandGIS) [10] | 0–100, 0–30 | 4, 4 |
| | FAO 2022 (GSOC) [11] | 0–30 | -, 2 |
| | FAO 2023 (HWSD2) [12] | 0–100, 0–30 | 2, 3 |
| Circumpolar soil C | Hugelius et al. 2013 (NCSCD) [13–15] | 0–100, 0–30 | 7, 8 |
| Global RH | Hashimoto et al. 2015 [16,17] | - | 1 |
| | Warner et al. 2019 (Bond-Lamberty equation based) [18,19] | - | 2 |
| | Warner et al. 2019 (Subke equation based) [18,19] | - | 3 |
| | Tang et al. 2020 [20,21] | - | 4 |
| | Lu et al. 2021 [22,23] | - | 5 |
| | Stell et al. 2021 [24,25] | - | |
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains raw and labeled YouTube comments from the Jubilee video titled "Doctor Mike vs 20 Anti-Vaxxers | Surrounded" up to June 2, 2025. The comments reflect a wide range of opinions, sentiments, and arguments related to vaccination, medical science, and public discourse. This makes the dataset particularly valuable for Natural Language Processing (NLP) tasks in real-world, socially charged contexts.
This dataset is ideal for a wide range of NLP tasks:
| Column Name | Description |
|---|---|
| text | Raw comment text |
This dataset offers a real-world sample of social media discourse on a controversial and medically relevant topic. It includes both supportive and oppositional viewpoints and can help train robust, bias-aware NLP models.
Because the video includes professional input from Doctor Mike and diverse opinions from 20 participants with strong anti-vaccine views, the comment section becomes a rich playground for studying digital rhetoric, misinformation, and science communication.
Please use this dataset responsibly. Public comments may include misinformation or strong personal views. Consider including disclaimers or filters when using this data for deployment or educational use. Always be mindful of bias, representation, and the propagation of harmful narratives.
All comments are sourced from the publicly available YouTube comment section of Jubilee’s video. We are not affiliated with Jubilee or Doctor Mike.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of simulated and observed salinity/temperature data which were used in the manuscript "A method for assessment of the general circulation model quality using k-means clustering algorithm" submitted to Geoscientific Model Development. The model simulation dataset is from long-term 3D circulation model simulation (Maljutenko and Raudsepp 2014, 2019). The observations are from the "Baltic Sea - Eutrophication and Acidity aggregated datasets 1902/2017 v2018" SMHI (2018).
The files are in simple comma-separated table format without headers. The Dout-t_z_lat_lon_Smod_Sobs_Tmod_Tobs.csv file contains columns with the following variables [units]: time [Matlab datenum units], vertical coordinate [m], latitude [°N], longitude [°E], model salinity [g/kg], observed salinity [g/kg], model temperature [°C], observed temperature [°C].
The Dout-t_z_lat_lon_dS_dT_K1_K2_K3_K4_K5_K6_K7_K8_K9.csv file contains columns with the following variables [units]: the first 4 columns are the same as in the previous file, followed by salinity error [g/kg] and temperature error [°C]; the remaining columns (K1–K9) are integers indicating the cluster to which each error pair is assigned.
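Because the CSV files have no header row and time is stored as Matlab datenum, they can also be read outside Matlab. A minimal pandas sketch for the first file is shown below; the column names are chosen here for readability (the files themselves carry none), and the offset of 719529 days converts Matlab datenum to the Unix epoch:

import pandas as pd

cols = ["time_datenum", "depth_m", "lat_degN", "lon_degE",
        "S_mod_gkg", "S_obs_gkg", "T_mod_degC", "T_obs_degC"]
df = pd.read_csv("Dout-t_z_lat_lon_Smod_Sobs_Tmod_Tobs.csv", header=None, names=cols)

# Matlab datenum counts days from year 0; 1970-01-01 corresponds to datenum 719529
df["time"] = pd.to_datetime(df["time_datenum"] - 719529, unit="D", origin="unix")

print(df[["time", "depth_m", "S_mod_gkg", "S_obs_gkg"]].head())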
do_clust_valid_DataFig.m is a Matlab script which reads the two csv files (and optionally mask file Model_mask.mat), performs the clustering analysis and creates plots which are used in Manuscript. The script is organized into %% blocks which can be executed separately (default: ctrl+enter).
k-means function is used from the Matlab Statistics and Machine Learning Toolbox.
Additional software used in the do_clust_valid_DataFig.m:
Author's auxiliary formatting scripts script/
datetick_cst.m
do_fitfig.m
do_skipticks.m
do_skipticks_y.m
Colormaps are generated using cbrewer.m (Charles, 2021). Moving average smoothing is performed using nanmoving_average.m (Aguilera, 2021).
References:
Aguilera, C. A. V., 2021. moving_average v3.1 (Mar 2008) (https://www.mathworks.com/matlabcentral/fileexchange/12276-moving_average-v3-1-mar-2008), MATLAB Central File Exchange. Retrieved March 2, 2021.
Charles, 2021. cbrewer : colorbrewer schemes for Matlab (https://www.mathworks.com/matlabcentral/fileexchange/34087-cbrewer-colorbrewer-schemes-for-matlab), MATLAB Central File Exchange. Retrieved March 2, 2021.
Maljutenko, I., Raudsepp, U., 2019. Long-term mean, interannual and seasonal circulation in the Gulf of Finland—the wide salt wedge estuary or gulf type ROFI. Journal of Marine Systems, 195, pp.1-19. doi:10.1016/j.jmarsys.2019.03.004
Maljutenko, I., Raudsepp, U., 2014. Validation of GETM model simulated long-term salinity fields in the pathway of saltwater transport in response to the Major Baltic Inflows in the Baltic Sea. Measuring and Modeling of Multi-Scale Interactions in the Marine Environment - IEEE/OES Baltic International Symposium 2014, BALTIC 2014, 6887830. doi:10.1109/BALTIC.2014.6887830
SMHI 2018, Swedish Meteorological and Hydrological Institute (SMHI) (2018). Baltic Sea - Eutrophication and Acidity aggregated datasets 1902/2017 v2018. Aggregated datasets were generated in the framework of EMODnet Chemistry III, under the support of DG MARE Call for Tender EASME/EMFF/2016/006 - lot4. doi:10.6092/595D233C-3F8C-4497-8BD2-52725CEFF96B
https://creativecommons.org/publicdomain/zero/1.0/
The Pakistan Super League (PSL) is a premier Twenty20 cricket league in Asia, established in 2015 by the Pakistan Cricket Board (PCB). It features six franchise teams representing major cities in Pakistan, competing in a round-robin format followed by playoffs and a grand final. Known for its high-octane action, world-class players, and massive fan following, PSL has become one of the most exciting T20 leagues globally.
This dataset captures the complete history of PSL matches from its inception through March 2025, making it a valuable resource for cricket analysts, machine learning practitioners, sports journalists, and fans who want to dive deep into player and team performance trends.
The dataset includes ball-level data and match-level summaries, making it ideal for both high-level analytics and granular delivery-by-delivery insights.
| Column Name | Description |
|---|---|
| id | Unique identifier for each delivery |
| match_id | Unique identifier for each match |
| date | Date of the match |
| season | PSL season during which the match was played |
| venue | Stadium where the match was held |
| inning | Inning number (1 or 2) |
| batting_team | Team currently batting |
| bowling_team | Team currently bowling |
| over | Over number in the innings (0 to 19) |
| ball | Ball number within the over (1 to 6) |
| batter | Name of the batsman facing the delivery |
| bowler | Name of the bowler delivering the ball |
| non_striker | Name of the non-striking batsman |
| batsman_runs | Runs scored by the batter on that delivery |
| extra_runs | Runs awarded as extras (wide, no-ball, etc.) |
| total_runs | Total runs scored on the delivery (batsman + extras) |
| extras_type | Type of extra run (e.g., wide, no-ball, bye) |
| is_wicket | 1 if a wicket fell on the delivery; 0 otherwise |
| player_dismissed | Name of the player dismissed on the delivery (if any) |
| dismissal_kind | Mode of dismissal (e.g., caught, bowled, run out) |
| fielder | Name of the fielder involved in the dismissal (if applicable) |
| winner | Team that won the match |
| win_by | Margin of victory (e.g., "wickets 6", "runs 25") |
| match_type | Stage of the match (e.g., league, eliminator, qualifier, final) |
| player_of_match | Best-performing player of the match |
| umpire_1 | Name of the first on-field umpire |
| umpire_2 | Name of the second on-field umpire |
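With the columns above, simple aggregations are straightforward in pandas; the file name in this sketch is a placeholder for the ball-by-ball CSV in the dataset:

import pandas as pd

psl = pd.read_csv("psl_deliveries.csv")  # hypothetical file name

# Innings totals per match
innings_totals = (psl.groupby(["match_id", "batting_team"])["total_runs"]
                     .sum()
                     .reset_index(name="innings_total"))

# Leading run scorers across all seasons
top_batters = (psl.groupby("batter")["batsman_runs"]
                  .sum()
                  .sort_values(ascending=False)
                  .head(10))
print(top_batters)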
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
ATLAS (Applied Traffic Light Annotation Set) is a new, publicly available dataset designed to improve traffic light detection for autonomous driving. Existing open-source datasets often omit certain traffic light states and lack camera configurations for near and far distances. To address this, ATLAS features over 33,000 images collected from three synchronized cameras—wide, medium, and tele—with varied fields of view in the German city of Karlsruhe. This setup captures traffic lights at diverse distances and angles, including difficult overhead views. Each of the dataset’s 72,998 bounding boxes is meticulously labeled for 25 unique pictogram-state classes, covering rare but critical states (e.g., red-yellow) and pictograms (straight-right, straight-left). Additional annotations include challenging conditions such as heavy rain. All data is anonymized using state-of-the-art tools. ATLAS provides a comprehensive, high-quality resource for robust traffic light detection, overcoming limitations of existing datasets.
| Camera | FOV [°] | Resolution | Images |
|---|---|---|---|
| Front-Medium | 61 × 39 | 1920 × 1200 | 25,158 |
| Front-Tele | 31 × 20 | 1920 × 1200 | 5,109 |
| Front-Wide | 106 × 92 | 2592 × 2048 | 2,777 |
We provide the dataset in the following format:
├── ATLAS
│   ├── train
│   │   ├── front_medium
│   │   │   ├── images
│   │   │   │   └── front_medium_1722622455-950002160.jpg
│   │   │   └── labels
│   │   │       └── front_medium_1722622455-950002160.txt
│   │   ├── front_tele
│   │   └── front_wide
│   ├── test
│   ├── ATLAS_classes.yaml
│   ├── LICENSE
│   └── README.md
Each line in an annotation file describes one bounding box using five fields:
class_id x_center y_center width height
ATLAS_classes.yaml contains human-readable names corresponding to each numeric label.
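A minimal parsing sketch is given below; it assumes the usual interpretation of this layout (whitespace-separated values, coordinates normalized to the image size) and that ATLAS_classes.yaml maps numeric ids to names, both of which are assumptions rather than statements from the dataset documentation:

import yaml  # PyYAML

with open("ATLAS_classes.yaml") as f:
    class_names = yaml.safe_load(f)  # assumed: mapping from numeric label to class name

boxes = []
with open("front_medium_1722622455-950002160.txt") as f:
    for line in f:
        class_id, x_center, y_center, width, height = line.split()
        boxes.append({
            "class": class_names[int(class_id)],
            "x_center": float(x_center),   # assumed normalized to image width
            "y_center": float(y_center),   # assumed normalized to image height
            "width": float(width),
            "height": float(height),
        })

print(boxes[:3])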
The ATLAS Dataset by FZI Research Center for Information Technology is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Therefore, the Dataset is only allowed to be used for non-commercial purposes, such as teaching and research. The Licensor thus grants the End User the right to use the dataset for its own internal and non-commercial use and the purpose of scientific research only. There may be inaccuracies, although the Licensor tried and will try its best to rectify any inaccuracy once found. We invite all users to report remarks via mail at polley@fzi.de
If the dataset is used in media, a link to the Licensor’s website is to be included. In case the End User uses the dataset within research papers, the following publication should be quoted:
Polley et al.: The ATLAS of Traffic Lights: A Reliable Perception Framework for Autonomous Driving (under review)
The dataset consists of diverse PDF files covering a wide range of topics. These files include reports, articles, manuals, and more, spanning various fields such as science, technology, history, literature, and business. With its broad content, the dataset offers versatility for testing and various purposes, making it valuable for researchers, developers, educators, and enthusiasts alike.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Robotic manipulation remains a core challenge in robotics, particularly for contact-rich tasks such as industrial assembly and disassembly. Existing datasets have significantly advanced learning in manipulation but are primarily focused on simpler tasks like object rearrangement, falling short of capturing the complexity and physical dynamics involved in assembly and disassembly. To bridge this gap, we present REASSEMBLE (Robotic assEmbly disASSEMBLy datasEt), a new dataset designed specifically for contact-rich manipulation tasks. Built around the NIST Assembly Task Board 1 benchmark, REASSEMBLE includes four actions (pick, insert, remove, and place) involving 17 objects. The dataset contains 4,551 demonstrations, of which 4,035 were successful, spanning a total of 781 minutes. Our dataset features multi-modal sensor data including event cameras, force-torque sensors, microphones, and multi-view RGB cameras. This diverse dataset supports research in areas such as learning contact-rich manipulation, task condition identification, action segmentation, and more. We believe REASSEMBLE will be a valuable resource for advancing robotic manipulation in complex, real-world scenarios.
Each demonstration starts by randomizing the board and object poses, after which an operator teleoperates the robot to assemble and disassemble the board while narrating their actions and marking task segment boundaries with key presses. The narrated descriptions are transcribed using Whisper [1], and the board and camera poses are measured at the beginning using a motion capture system, though continuous tracking is avoided due to interference with the event camera. Sensory data is recorded with rosbag and later post-processed into HDF5 files without downsampling or synchronization, preserving raw data and timestamps for future flexibility. To reduce memory usage, video and audio are stored as encoded MP4 and MP3 files, respectively. Transcription errors are corrected automatically or manually, and a custom visualization tool is used to validate the synchronization and correctness of all data and annotations. Missing or incorrect entries are identified and corrected, ensuring the dataset’s completeness. Low-level Skill annotations were added manually after data collection, and all labels were carefully reviewed to ensure accuracy.
The dataset consists of several HDF5 (.h5) and JSON (.json) files, organized into two directories. The poses directory contains the JSON files, which store the poses of the cameras and the board in the world coordinate frame. The data directory contains the HDF5 files, which store the sensory readings and annotations collected as part of the REASSEMBLE dataset. Each JSON file can be matched with its corresponding HDF5 file based on their filenames, which include the timestamp when the data was recorded. For example, 2025-01-09-13-59-54_poses.json corresponds to 2025-01-09-13-59-54.h5.
The structure of the JSON files is as follows:
{"Hama1": [
[x ,y, z],
[qx, qy, qz, qw]
],
"Hama2": [
[x ,y, z],
[qx, qy, qz, qw]
],
"DAVIS346": [
[x ,y, z],
[qx, qy, qz, qw]
],
"NIST_Board1": [
[x ,y, z],
[qx, qy, qz, qw]
]
}
[x, y, z] represent the position of the object, and [qx, qy, qz, qw] represent its orientation as a quaternion.
The HDF5 (.h5) format organizes data into two main types of structures: datasets, which hold the actual data, and groups, which act like folders that can contain datasets or other groups. In the diagram below, groups are shown as folder icons, and datasets as file icons. The main group of the file directly contains the video, audio, and event data. To save memory, video and audio are stored as encoded byte strings, while event data is stored as arrays. The robot’s proprioceptive information is kept in the robot_state group as arrays. Because different sensors record data at different rates, the arrays vary in length (signified by the N_xxx variable in the data shapes). To align the sensory data, each sensor’s timestamps are stored separately in the timestamps group. Information about action segments is stored in the segments_info group. Each segment is saved as a subgroup, named according to its order in the demonstration, and includes a start timestamp, end timestamp, a success indicator, and a natural language description of the action. Within each segment, low-level skills are organized under a low_level subgroup, following the same structure as the high-level annotations.
[Diagram of the HDF5 file structure: groups are shown as folder icons, datasets as file icons.]
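A minimal loading sketch (assuming h5py is installed; file, group, and pose names follow the description above, while the dataset names inside each segment are assumptions):

import json
import h5py

# Camera and board poses in the world frame for one demonstration
with open("poses/2025-01-09-13-59-54_poses.json") as f:
    poses = json.load(f)
board_position, board_orientation = poses["NIST_Board1"]

# Matching sensory data and annotations
with h5py.File("data/2025-01-09-13-59-54.h5", "r") as f:
    print(list(f.keys()))                 # top-level groups and datasets
    print(list(f["robot_state"].keys()))  # proprioceptive arrays
    for name, segment in f["segments_info"].items():
        # "description" and "success" are assumed names for the per-segment entries
        print(name, segment["description"][()], bool(segment["success"][()]))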
The splits folder contains two text files which list the h5 files used for the training and validation splits.
The project website contains more details about the REASSEMBLE dataset. The code for loading and visualizing the data is available in our GitHub repository.
📄 Project website: https://tuwien-asl.github.io/REASSEMBLE_page/
💻 Code: https://github.com/TUWIEN-ASL/REASSEMBLE
| Recording | Issue |
|---|---|
| 2025-01-10-15-28-50.h5 | hand cam missing at beginning |
| 2025-01-10-16-17-40.h5 | missing hand cam |
| 2025-01-10-17-10-38.h5 | hand cam missing at beginning |
| 2025-01-10-17-54-09.h5 | no empty action at |
This dataset is a collection of TFRecord files to train the models for the UW-Madison GI Tract Image Segmentation competition, specifically on TPU. Each TFRecord file contains the following data:
| Name | Type | Description |
|---|---|---|
| id | bytes | sample ID taken from the 'id' column in 'train.csv', utf-8 encoded. |
| case number | int64 | case number taken from 'id' at caseNNN |
| day number | int64 | day number taken from 'id' at dayNN |
| slice number | int64 | slice number taken from 'id' at slice_NNNN |
| image | bytes | numpy format image bytes read from the associated file |
| mask | bytes | PNG format mask bytes generated from the 'segmentation' column in 'train.csv' |
| fold | int64 | fold number that this sample belongs to |
| height | int64 | slice height taken from the file name |
| width | int64 | slice width taken from the file name |
| space height | float32 | pixel spacing height taken from the file name |
| space width | float32 | pixel spacing width taken from the file name |
| large bowel dice coef | float32 | how well the model predicted for large bowel |
| small bowel dice coef | float32 | how well the model predicted for small bowel |
| stomach dice coef | float32 | how well the model predicted for stomach |
| slice count | int64 | number of slices for case/day |
A sample feature description for parsing the records is as follows.
import tensorflow as tf

# Feature specification describing one serialized example in the TFRecord files.
TFREC_FORMAT = {
    'id': tf.io.FixedLenFeature([], tf.string),
    'case_no': tf.io.FixedLenFeature([], tf.int64),
    'day_no': tf.io.FixedLenFeature([], tf.int64),
    'slice_no': tf.io.FixedLenFeature([], tf.int64),
    'image': tf.io.FixedLenFeature([], tf.string),
    'mask': tf.io.FixedLenFeature([], tf.string),
    'fold': tf.io.FixedLenFeature([], tf.int64),
    'height': tf.io.FixedLenFeature([], tf.int64),
    'width': tf.io.FixedLenFeature([], tf.int64),
    'space_h': tf.io.FixedLenFeature([], tf.float32),
    'space_w': tf.io.FixedLenFeature([], tf.float32),
    'large_bowel_dice_coef': tf.io.FixedLenFeature([], tf.float32),
    'small_bowel_dice_coef': tf.io.FixedLenFeature([], tf.float32),
    'stomach_dice_coef': tf.io.FixedLenFeature([], tf.float32),
    'slice_count': tf.io.FixedLenFeature([], tf.int64),
}
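A minimal parsing sketch using the TFREC_FORMAT spec above (assumptions: the file path is hypothetical, and the 'image' bytes were produced in numpy .npy format as the table suggests, so they are decoded outside the graph):

import io
import numpy as np
import tensorflow as tf

def parse_example(serialized):
    """Decode one serialized example into a dict of tensors."""
    ex = tf.io.parse_single_example(serialized, TFREC_FORMAT)
    # The mask is stored as PNG bytes and can be decoded directly.
    ex['mask'] = tf.io.decode_png(ex['mask'])
    return ex

# Hypothetical file list; replace with the actual TFRecord paths.
ds = tf.data.TFRecordDataset(['train-00.tfrec']).map(parse_example)

for ex in ds.take(1):
    # Image bytes are in numpy (.npy) format, so decode them eagerly with np.load.
    image = np.load(io.BytesIO(ex['image'].numpy()))
    print(ex['id'].numpy().decode('utf-8'), image.shape, ex['mask'].shape)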
Here is the notebook used to make this dataset. Here is the notebook to train a model by using this dataset.
License: https://creativecommons.org/publicdomain/zero/1.0/
The world's population has undergone remarkable growth, exceeding 7.5 billion by mid-2019 and continuing to surge beyond previous estimates. Notably, China and India stand as the two most populous countries, with China's population potentially facing a decline while India's trajectory hints at surpassing it by 2030. This significant demographic shift is just one facet of a global landscape where countries like the United States, Indonesia, Brazil, Nigeria, and others, each with populations surpassing 100 million, play pivotal roles.
The steady decrease in growth rates, though, is reshaping projections. While the world's population is expected to exceed 8 billion by 2030, growth will notably decelerate compared to previous decades. Specific countries like India, Nigeria, and several African nations will notably contribute to this growth, potentially doubling their populations before rates plateau.
This dataset provides comprehensive historical population data for countries and territories globally, offering insights into various parameters such as area size, continent, population growth rates, rankings, and world population percentages. Spanning from 1970 to 2023, it includes population figures for different years, enabling a detailed examination of demographic trends and changes over time.
Structured with meticulous detail, this dataset offers a wide array of information in a format conducive to analysis and exploration. Featuring parameters like population by year, country rankings, geographical details, and growth rates, it serves as a valuable resource for researchers, policymakers, and analysts. Additionally, the inclusion of growth rates and world population percentages provides a nuanced understanding of how countries contribute to global demographic shifts.
This dataset is invaluable for those interested in understanding historical population trends, predicting future demographic patterns, and conducting in-depth analyses to inform policies across various sectors such as economics, urban planning, public health, and more.
This dataset (world_population_data.csv), covering 1970 through 2023, includes the following columns:
| Column Name | Description |
|---|---|
| Rank | Rank by Population |
| CCA3 | 3 Digit Country/Territories Code |
| Country | Name of the Country |
| Continent | Name of the Continent |
| 2023 Population | Population of the Country in the year 2023 |
| 2022 Population | Population of the Country in the year 2022 |
| 2020 Population | Population of the Country in the year 2020 |
| 2015 Population | Population of the Country in the year 2015 |
| 2010 Population | Population of the Country in the year 2010 |
| 2000 Population | Population of the Country in the year 2000 |
| 1990 Population | Population of the Country in the year 1990 |
| 1980 Population | Population of the Country in the year 1980 |
| 1970 Population | Population of the Country in the year 1970 |
| Area (km²) | Area size of the Country/Territories in square kilometers |
| Density (km²) | Population Density per square kilometer |
| Growth Rate | Population Growth Rate by Country |
| World Population Percentage | The population percentage by each Country |
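As a brief illustration (a sketch assuming pandas is installed and that the year columns are stored as numbers), the file can be loaded and summarized as follows:

import pandas as pd

df = pd.read_csv("world_population_data.csv")

# Ten most populous countries in 2023.
top10 = df.nlargest(10, "2023 Population")[["Country", "2023 Population", "Growth Rate"]]
print(top10)

# Average growth rate per continent, highest first.
print(df.groupby("Continent")["Growth Rate"].mean().sort_values(ascending=False))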
The primary dataset was retrieved from the World Population Review. I sincerely thank the team for providing the core data used in this dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises joint Sentinel-2 and Landsat-8/9 Satellite Image Time Series (SITS) over the year 2022. It is composed of 64 Areas Of Interest (AOIs) for the training set and 41 AOIs for the testing set, each covering an area of 9.9x9.9 km² (990x990 pixels for Sentinel-2 and 330x330 pixels for Landsat), spread over Europe as well as a few spots in West Africa and northern South America. All dates with more than 25% of clear pixels over the AOI are included in the dataset. This yields a total of 2 984 Sentinel-2 images and 1 581 Landsat-8/9 images in the training set. Additional statistics on the number of dates per AOI for each sensor are presented in the following table:
|  | Sentinel-2 |  |  |  | Landsat |  |  |  |
|---|---|---|---|---|---|---|---|---|
|  | Total | Average | Min | Max | Total | Average | Min | Max |
| Train split | 2984 | 44 | 11 | 94 | 1581 | 26 | 0 | 61 |
| Test split | 1609 | 39 | 6 | 98 | 736 | 18 | 1 | 56 |
Important Notice: A multi-year, worldwide complement to this dataset called LS2S2MYWW is available here: https://doi.org/10.6096/1029 (it is too large to be hosted on Zenodo).
For each sensor, Top-of-Canopy surface reflectance from level 2 products is used. The spectral bands included in the dataset are presented in the following table. Note that the Landsat sensor has neither Red Edge bands nor a wide Near Infra-Red band, while conversely the Sentinel-2 sensor does not retrieve Land Surface Temperature (LST).
| Sentinel-2 band | Resolution (m) | Landsat band | Resolution (m) | Description |
|---|---|---|---|---|
|  |  | B1 | 30 | Deep blue |
| B02 | 10 | B2 | 30 | Blue |
| B03 | 10 | B3 | 30 | Green |
| B04 | 10 | B4 | 30 | Red |
| B05, B06, B07 | 20 |  |  | Red Edge |
| B8a | 20 | B5 | 30 | Near Infra-Red |
| B08 | 10 |  |  | Wide Near Infra-Red |
| B11, B12 | 20 | B6, B7 | 30 | Short Wavelength Infra-Red |
|  |  | B10 | 100 | Land Surface Temperature |
In addition to the spectral bands, the corresponding quality masks have been used to derive a validity mask for each date of each sensor. This dataset has been gathered through the OpenEO API, in the frame of the following work:
Julien Michel, Jordi Inglada. Temporal Attention Multi-Resolution Fusion of Satellite Image Time-Series, applied to Landsat-8 and Sentinel-2: all bands, any time, at best spatial resolution. 2025. ⟨hal-05101526⟩
The source code associated with the paper, including the download script that created the dataset, is available here: https://github.com/Evoland-Land-Monitoring-Evolution/tamrfsits
Two main zip files are provided: ls2s2_train.zip contains the training split, and ls2s2_test.zip contains the test split. Both zip files contain one internal zip file per AOI, organized as follows.
Note that we provide test_31TCJ_12.zip as a sample for previewing the content of the dataset before downloading the train or test split.
The dataset comprises one zip file per AOI. The naming pattern for the zip file is {test/train}_{mgrs_tile}_{subtile}.zip. The {test/train} field indicates whether the file is part of the training or testing set. The {mgrs_tile} field corresponds to the MGRS tile from which the AOI has been sampled. The {subtile} field indicates which sub-tile of the MGRS tile has been sampled. Sub-tiles correspond to the 1024x1024 internal JPEG2000 tiles of the Sentinel-2 product. Their numbering follows the lexicographical order (columns then rows).
Each zip file contains the following layout:
{train/test}/{mgrs_tile}_{subtile}/
    {mgrs_tile}_{subtile}.json
    {mgrs_tile}_{subtile}_sentinel2_synopsis.png
    {mgrs_tile}_{subtile}_landsat_synopsis.png
    sentinel2/
        index.csv
        2022mmdd/
            sentinel2_mask_2022mmdd.tif
            sentinel2_bands_2022mmdd.tif
        ...
    landsat/
        index.csv
        index_pan.csv
        2022mmdd/
            landsat_mask_2022mmdd.tif
            landsat_bands_2022mmdd.tif
            landsat_pan_mask_2022mmdd.tif
            landsat_pan_2022mmdd.tif
        ...
Here is a description of the different files:
<td style="height:| File name | Description |
{mgrs_tile}_{subtile}.json | A json file describing the AOI. |
{mgrs_tile}_{subtile}_sentinel2_synopsis.png | A synopsis PNG file allowing to see all Sentinel-2 images and mask of the AOI at a glance. |
{mgrs_tile}_{subtile}_landsat_synopsis.png | A synopsis PNG file allowing to see all Landsat images and mask of the AOI at a glance. |
index.csv | The csv file indexing the Sentinel-2 or Landsat data for the AOI. |
index_pan.csv | The csv file indexing the Landsat panchromatic data for the AOI. |
sentinel2_mask_2022mmdd.tif | 990x990 pixels GeoTIFF file containing the validity mask for the current date (0 for valid and 1 for invalid). |
sentinel2_bands_2022mmdd.tif | 990x990 pixels GeoTIFF file containing the Sentinel-2 spectral bands are in surface reflectance * 10 000. Band order is B2, B3, B4, B5, B6, B7, B8, B8A, B11, B12. 20m bands are up-sampled to 10m resolution by means of bicubic interpolation. No data pixels have -10 000 value. |
landsat_mask_2022mmdd.tif | 330x330 pixels GeoTIFF file containing the validity mask for the current date (0 for valid and 1 for invalid). Spatial resolution is 30m. |
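As a minimal reading sketch (assuming rasterio and numpy are available; the AOI and date below are hypothetical examples following the naming scheme above), the bands can be masked and converted back to reflectance like this:

import numpy as np
import rasterio

# Hypothetical AOI (the preview sample 31TCJ_12) and date.
bands_path = "test/31TCJ_12/sentinel2/20220117/sentinel2_bands_20220117.tif"
mask_path = "test/31TCJ_12/sentinel2/20220117/sentinel2_mask_20220117.tif"

with rasterio.open(bands_path) as src:
    bands = src.read().astype(np.float32)   # shape (10, 990, 990), one band per channel
with rasterio.open(mask_path) as src:
    invalid = src.read(1) == 1               # 1 marks invalid pixels

bands[bands == -10000] = np.nan              # drop no-data pixels
reflectance = bands / 10000.0                # back to [0, 1] surface reflectance
reflectance[:, invalid] = np.nan             # drop cloudy/invalid pixels

print(np.nanmean(reflectance, axis=(1, 2)))  # mean reflectance per band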
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”
A peer-reviewed data paper for this dataset is in review to be published in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org
Please cite this when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists a data scheme image file, a codebook and two dataset tables in csv format.
The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.
The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes 8 text files containing the scripts for web scraping. They were written in R 3.6.3 for Windows.
The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.
The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found, using an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with the suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach based on two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
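For readers unfamiliar with these two string metrics, the following short Python sketch illustrates them (the dataset's actual matching is implemented in the R scripts described here; the trigram size, function names, and example titles are illustrative assumptions):

from collections import Counter
import math

def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment (restricted Damerau-Levenshtein) distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def cosine_similarity(a: str, b: str, n: int = 3) -> float:
    """Cosine similarity between character n-gram count vectors, in [0, 1]."""
    ga = Counter(a[i:i + n] for i in range(len(a) - n + 1))
    gb = Counter(b[i:i + n] for i in range(len(b) - n + 1))
    dot = sum(ga[g] * gb[g] for g in ga)
    norm = math.sqrt(sum(v * v for v in ga.values())) * math.sqrt(sum(v * v for v in gb.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity("the seventh seal", "the 7th seal"))  # similarity of two title variants
print(osa_distance("parasite", "parastie"))                   # small distance for a transposition typo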
The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.
The script “r_4_scraping_functions” defines functions for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.
The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does so for the first 100 films only, to check that everything works. Scraping the entire dataset took a few hours, so a test run with a subsample of 100 films is advisable.
The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.
The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script “r_check_logs” is used for troubleshooting and for tracking the progress of all of the R scripts. It gives information on the number of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.
The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location, festival name, and festival categories, as well as units of measurement, data sources, coding, and missing data.
The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset appears in wide format, i.e. all information for each festival is listed in one row. This